Re: [ippm] Benjamin Kaduk's No Objection on draft-ietf-ippm-capacity-metric-method-06: (with COMMENT)

"MORTON, ALFRED C (AL)" <acm@research.att.com> Sat, 27 February 2021 20:19 UTC

From: "MORTON, ALFRED C (AL)" <acm@research.att.com>
To: Benjamin Kaduk <kaduk@mit.edu>
CC: The IESG <iesg@ietf.org>, "draft-ietf-ippm-capacity-metric-method@ietf.org" <draft-ietf-ippm-capacity-metric-method@ietf.org>, "ippm-chairs@ietf.org" <ippm-chairs@ietf.org>, "ippm@ietf.org" <ippm@ietf.org>, Ian Swett <ianswett@google.com>, "tpauly@apple.com" <tpauly@apple.com>
Date: Sat, 27 Feb 2021 20:19:43 +0000
Message-ID: <4D7F4AD313D3FC43A053B309F97543CF01476A103F@njmtexg5.research.att.com>
References: <161419645471.18083.16706266293896961774@ietfa.amsl.com> <4D7F4AD313D3FC43A053B309F97543CF01476A0549@njmtexg5.research.att.com> <20210225220325.GX21@kduck.mit.edu>
In-Reply-To: <20210225220325.GX21@kduck.mit.edu>
Archived-At: <https://mailarchive.ietf.org/arch/msg/ippm/EPj0KRsh84xOiwajgPA_J3dO0xU>

Hi Ben,

Thanks for your reply.  I deleted some early agreements below, and
added very few more replies.

Please see [acm] with no indents...

> -----Original Message-----
> From: Benjamin Kaduk [mailto:kaduk@mit.edu]
> Sent: Thursday, February 25, 2021 5:03 PM
> To: MORTON, ALFRED C (AL) <acm@research.att.com>
> Cc: The IESG <iesg@ietf.org>;
> draft-ietf-ippm-capacity-metric-method@ietf.org; ippm-chairs@ietf.org;
> ippm@ietf.org; Ian Swett <ianswett@google.com>; tpauly@apple.com
> Subject: Re: Benjamin Kaduk's No Objection on
> draft-ietf-ippm-capacity-metric-method-06: (with COMMENT)
> 
> Hi Al,
> 
> Also inline...
> 
> On Thu, Feb 25, 2021 at 03:08:29AM +0000, MORTON, ALFRED C (AL) wrote:
> > Hi Ben,
> >
> > Thanks for your detailed review; it will be a better draft when we're
> > done, as always!
> >
> > All the changes identified below are implemented in my working version.
> >
> > Please see replies below, [acm]
> > Al
> >
> > > -----Original Message-----
> > > From: Benjamin Kaduk via Datatracker [mailto:noreply@ietf.org]
> > > Sent: Wednesday, February 24, 2021 2:54 PM
> > > To: The IESG <iesg@ietf.org>
> > > Cc: draft-ietf-ippm-capacity-metric-method@ietf.org;
> > > ippm-chairs@ietf.org; ippm@ietf.org; Ian Swett <ianswett@google.com>;
> > > tpauly@apple.com; tpauly@apple.com
> > > Subject: Benjamin Kaduk's No Objection on
> > > draft-ietf-ippm-capacity-metric-method-06: (with COMMENT)
> > >
> > > Benjamin Kaduk has entered the following ballot position for
> > > draft-ietf-ippm-capacity-metric-method-06: No Objection
> > >
> > ...
> > > ----------------------------------------------------------------------
> > > COMMENT:
> > > ----------------------------------------------------------------------
... <we agreed on changes in earlier sections>

> > > Section 8.1
> > >
> > >    At the beginning of a test, the sender begins sending at rate R1
> > >    and the receiver starts a feedback timer at interval F (while
> > >    awaiting
> > >
> > > It's a little hard to search for, but I didn't find any previous
> > > mention of 'F' or it being defined as a parameter or term.  Should
> > > it be a listed parameter somewhere?
> > [acm]
> > We define a lot of variables in this section, with limited scope of use.
> > F is one of them, but I found that we had already defined F in Section 4!
> > So... F becomes FT, and like R1 and ss and cc, gets no special treatment
> > beyond a definition in the text (IF you're ok with that).
> 
> That should be fine.
[acm] 
Since this came up again with Magnus, I made FT a parameter in Section 4.


> 
> > >
> > >    If the feedback indicates that sequence number anomalies were
> > >    detected OR the delay range was above the upper threshold, the
> > >    offered load rate is decreased.  Also, if congestion is now
> > >    confirmed by the current feedback message being processed, then
> > >    the offered load rate is decreased by more than one rate (e.g.,
> > >    Rx-30).  [...]
> > >
> > > Does "congestion is now confirmed" mean that "congestion confirmed" is
> > > like a one-way latch and this transition only occurs at most once over
> > > the course of a test?  Or could the Rx-30 happen multiple times?
> > > (The pseudocode indicates the former.)
> > [acm]
> > Yes, we are trying to describe the pseudocode, and "congestion
> > confirmed" latches when the slowAdjCount equals the slowAdjThresh
> > (and after that, slowAdjCount continues upward, the
> > slowAdjCount < slowAdjThresh condition fails and slowAdjCount is
> > never reset to zero).
> >
> > So, I think we want to say:
> > OLD
> > Also, if congestion is now confirmed by the current feedback message...
> > NEW
> > Also, if congestion is now confirmed for the first time by the
> > current feedback message...
> 
> +1
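
A minimal sketch of the latch behavior described above, assuming a
simplified loop that tracks only slowAdjCount. The function name, the
boolean feedback list, and the threshold value of 2 are hypothetical;
only slowAdjCount and slowAdjThresh come from the pseudocode under
discussion, and the hSpeedThresh condition is left out.

```python
def count_large_decreases(feedback, slow_adj_thresh=2):
    """Count how many times the large (multi-row) decrease would fire.

    feedback: iterable of booleans, True when a feedback message reports
    sequence errors or excessive delay (hypothetical simplification).
    """
    slow_adj_count = 0
    large_decreases = 0
    for impaired in feedback:
        if impaired:
            slow_adj_count += 1
            # Congestion is "confirmed" only on the message where the
            # counter first equals the threshold.
            if slow_adj_count == slow_adj_thresh:
                large_decreases += 1
        elif slow_adj_count < slow_adj_thresh:
            # The counter is reset only while still below the threshold;
            # once the threshold is reached, this branch never runs again.
            slow_adj_count = 0
    return large_decreases
```

Because the counter is no longer reset once it reaches the threshold,
the equality test can succeed at most once per test, which matches the
"for the first time" wording proposed above.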
> 
> > >
> > >    If the feedback indicates that there were no sequence number
> > >    anomalies AND the delay range was above the lower threshold, but
> > >    below the upper threshold, the offered load rate is not changed.
> > >
> > > The way this is written suggests that there will always be a lower and
> > > an upper threshold for delay, but the rest of the document so far didn't
> > > give me that impression.  E.g., we talk about PM only as "at least one
> > > fundamental metric and target performance threshold MUST be supplied",
> > > and to me having both upper and lower thresholds would be two
> > > thresholds, not one.
> > [acm]
> > That's true, we tried not to force our current/best algorithm into the
> > metric definition. We require some measurement to use for feedback
> > in rate adjustment, otherwise you just have iPerf or other fixed rate
> > tools that can only blast packets.
> >
> > You can build an "ok" feedback system with just one metric and one threshold,
> > but there are drawbacks and it may take longer duration tests to measure
> > the true maximum capacity due to a technical limitation.
> > So, we gave a really good algorithm, in 8.1, in pseudocode, and even
> > in running code.
> 
> That all makes sense to me.  I'm not sure if there's a good way to shoehorn
> some of that insight into the text of the document itself, but it also
> doesn't seem like something that's critical to do.
[acm] 
Thanks.

> 
> >
> > >
> > > Section 8.2
> > >
> > >    Here, as with any Active Capacity test, the test duration must be
> > >    kept short. 10 second tests for each direction of transmission are
> > >    common today.  The default measurement interval specified here is
> > >    I = 10 seconds.  In combination with a fast search method and
> > >    user-network coordination, the concerns raised in RFC 6815
> > >    [RFC6815] are alleviated.  [...]
> > >
> > > I skimmed RFC 6815 and had a bit of a hard time making the connection
> > > for why combining a 10-second interval, fast search method, and
> > > user-network coordination alleviate the concerns of RFC 6815.  There
> > > doesn't seem to be much in 6815 itself about how testing in production
> > > can be done safely,
> > [acm]
> > That's certainly true, but we did say:
> >
> >    The world will not spin off axis while waiting for appropriate and
> >    standardized methods to emerge from the consensus process.
> >
> > > so my current working assumption is that the
> > > conclusion presented here reflects the results of "new work" being
> > > recorded for the first time (in the RFC series) in this document.
> > [acm]
> >
> > When you put it that way, yes. Although it is a different metric from
> > RFC2544 Throughput, the load adjustment search algorithm alone helps
> > to make this method safer to use than any fixed-rate UDP packet blaster,
> > or even a binary search-controlled measurement because of near real-time
> > feedback.
> >
> > The other reasons why this work is different are that
> > RFC 2544 Throughput measurements intend to overload the isolated test
> > environment for extended periods of time:
> >
> > 24. Trial duration (from https://tools.ietf.org/html/rfc2544#section-24)
> >
> >    The aim of these tests is to determine the rate continuously
> >    supportable by the DUT.  The actual duration of the test trials must
> >    be a compromise between this aim and the duration of the benchmarking
> >    test suite.  The duration of the test portion of each trial SHOULD be
> >    at least 60 seconds. ...
> >
> > Many automated RFC2544 test devices start a test at the highest load,
> > and search their way down to the zero-loss Throughput, subjecting the
> > device under test to potentially extreme overload multiple times before
> > reaching the test outcome.
> >
> > > If that assumption is correct, I'd suggest spending some more words to
> > > support the conclusion, e.g., making analogies to other "normal"
> > > traffic patterns and how the benchmarking setup is not qualitatively
> > > different from them.
> > [acm]
> >
> > OK, I put some more background together and made the case stronger:
> > the memo we wrote here is exactly what the RFC6815 authors were asking
> > for.
> >
> > The Max IP Capacity metric and the method for assessing it are very
> > different from the classic RFC2544 Throughput metric and methods: they
> > use near-real-time load adjustments that are sensitive to loss and
> > delay, similar to other congestion control algorithms used on the
> > Internet every day, along with limited duration. On the other hand,
> > RFC2544 Throughput measurements can produce sustained overload
> > conditions for extended periods of time. Individual trials in a test
> > governed by a binary search can last 60 seconds for each step, and the
> > final confirmation trial may be even longer. This is very different
> > from "normal" traffic levels, but overload conditions are not a concern
> > in the isolated test environment. The concerns raised in RFC6815 were
> > that RFC2544 methods would be let loose on production networks, and
> > instead the authors challenged the standards community to develop
> > metrics and methods like those described in this memo.
> 
> Thanks; that is what I was asking for.
[acm] 
Great!
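
As a rough worked example of the duration contrast drawn above, the
sketch below estimates how long a binary search with 60-second trials
could run. The function name and the figure of 1000 candidate load
levels are hypothetical, chosen only for illustration; the 60-second
trial length is the RFC 2544 Section 24 recommendation quoted earlier,
and the optional final confirmation trial is not counted.

```python
def binary_search_duration(num_rates, trial_seconds=60):
    """Return (trials, total_seconds) for a worst-case binary search.

    num_rates: number of candidate load levels (hypothetical input).
    A binary search over n candidates needs about ceil(log2(n)) trials,
    computed here with integer arithmetic as (n - 1).bit_length().
    """
    trials = max(1, (num_rates - 1).bit_length())
    return trials, trials * trial_seconds


trials, seconds = binary_search_duration(1000)
print(trials, seconds)
```

Under these assumptions the search takes 10 trials and 600 seconds,
much of it potentially at overload, compared with the default
measurement interval of I = 10 seconds discussed above.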

> I think this is related to Magnus's Discuss point, though, and I cannot
> speak for whether it will make him happy as well...
[acm] 
Of course. It's part of the topic: testing production networks safely.
We've been running versions of this method on the Internet for years,
as the references and IPPM's literature clearly show.

> 
> >
> > >
> > > Section 8.3
> > >
> > >    As testing continues, implementers should expect some evolution in
> > >    the methods.  The ITU-T has published a Supplement (60) to the
> > >    Y-series of Recommendations, "Interpreting ITU-T Y.1540 maximum IP-
> > >    layer capacity measurements", [Y.Sup60], which is the result of
> > >    continued testing with the metric and method described here.
> > >
> > > I pulled up the [Y.Sup60] reference, and it does not seem to reference
> > > this draft by name.  On what basis do we conclude that it "is the result
> > > of continued testing with the metric and method described here"?
> > > Skimming/searching, I do see many similar formulae and methods
> > > presented, but how do we conclude they are precisely the same?
> > [acm]
> > I'll soften that a bit. The Max IP-Layer Capacity metric is
> > the same, but it is likely that a few details in the method have diverged
> > over time -- much of the ITU-T testing and spec development came first.
> >
> > NEW
> > ... [Y.Sup60], which is the result of continued testing with the metric,
> > and those results have improved the method described here.
> 
> Thanks.  My primary concern here was that it seemed like we were making
> statements about what some other SDO did, and typically it's good to have
> sign-off from the other SDO before doing that.  The NEW option doesn't seem
> to have that issue, so it should be good.
[acm] 
Nice, thanks.

> 
> >
> >
> > >
> > > Section 10
> > >
> > > Should we say something about making sure that I is reasonably
> > > bounded?  IIRC we say so elsewhere in the text but not exactly here.
> > [acm]
> > I added a direct reference to I at the end of item 6.:
> >
> > ... Testing with the Service Provider's measurement hosts SHOULD be
> > limited in frequency and/or overall volume of test traffic (for example,
> > the range of I duration values SHOULD be limited).
> > >
> > >    2.  A REQUIRED user client-initiated setup handshake between
> > >        cooperating hosts and allows firewalls to control inbound
> > >        unsolicited UDP which either go to a control port [expected and
> > >        w/authentication] or to ephemeral ports that are only created
> > >        as needed.  [...]
> > >
> > > nit: the grammar is odd in the first part of this sentence; the part
> > > before the "and" doesn't seem like it can join up with anything after
> > > the "and".  Is the intent something like "It is REQUIRED to have a user
> > > client-initiated setup handshake between cooperating hosts that allows
> > > firewalls to [...]"?
> > [acm]
> > Thanks, good re-wording, it's in.
> >
> > >
> > >    3.  Integrity protection for feedback messages conveying measurements
> > >        is RECOMMENDED.
> > >
> > > (In some sense you want authentication as well as integrity protection.)
> > [acm]
> > Yes. The running code has optional authentication now.
> >
> > NEW
> > 3. Client-server authentication and integrity protection for feedback
> >    messages conveying measurements is RECOMMENDED.
> >
> > >
> > >    5.  Senders MUST be rate-limited.  This can be accomplished using the
> > >        pre-built table defining all the offered load rates that will be
> > >        supported (Section 8.1).  The recommended load-control search
> > >        algorithm results in "ramp up" from the lowest rate in the table.
> > >
> > > nit: since (effectively) each implementation will have their own
> > > pre-built table, I think it should be "using a pre-built table".
> > [acm]
> > OK, "a" it is.
> >
> >
> > >
> > > Appendix 13
> > >
> > > If we start at Rx (row) 1, is it going to cause problems when we drop
> > > down to Rx = 0 in the loss/congestion cases?
> > [acm]
> > It would, but the current table includes a Row zero, which is where we
> > cold-start. I guess it would be less confusing to say:
> >
> > Rx = 0  # The current sending rate (equivalent to a row of the table)
> 
> Yes, I think so.
> 
> >
> > >
> > > The mechanism in the pseudocode to stop taking large increments in
> > > sending rate above the "hSpeedThresh" does not seem to be described in
> > > the prose in §8.1.  (That said, it seems like a good idea, given the
> > > likely table composition.)
[acm] 

We fixed this in a separate exchange, after you discovered my mis-read.
Thanks!


> > [acm]
> > It's getting late here now, but I think it's in the first two If
> > statements:
> >
> > if ( seqErr == 0 && delay < lowThresh ) {                          # no loss or delay problems, and
> >         if ( Rx < hSpeedThresh && slowAdjCount < slowAdjThresh ) { # Rate < hSpeedThresh && etc.
> >                         Rx += highSpeedDelta;                      # can still use large increments
> >                         slowAdjCount = 0;
> >         } else {                                                   # otherwise (Rate >= hSpeedThresh)
> >                         if ( Rx < maxLoadRates - 1 )               # after checking headroom,
> >                                         Rx++;                      # can only increase by one
> 
> I followed up on this out-of-band -- I agree that the pseudocode is good,
> and was wondering that the prose in Section 8.1 was divergent from the
> pseudocode.  Your proposal for new prose text looks good:
> 
> % However, if a rate threshold between high and very high sending
> % rates (such as 1Gbps) is exceeded, the offered load rate is only
> % increased by one (Rx+1) above the rate threshold in any congestion
> % state.
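
The increase path in the excerpt and the prose above can be sketched as
follows. This covers only the branch taken when feedback reports no
sequence errors and low delay; the decrease branch is not shown in this
thread and is left out. The function name and all constant values are
assumptions made for illustration, not values from the draft.

```python
# Illustrative constants; the values are assumptions, not from the draft.
MAX_LOAD_RATES = 2000   # number of rows in the sending-rate table
H_SPEED_THRESH = 1000   # row separating high from very high rates
HIGH_SPEED_DELTA = 10   # rows added per large increment
SLOW_ADJ_THRESH = 3     # impaired-feedback count that confirms congestion


def increase_step(rx, slow_adj_count):
    """Return (rx, slow_adj_count) after feedback with no loss, low delay."""
    if rx < H_SPEED_THRESH and slow_adj_count < SLOW_ADJ_THRESH:
        # Below the rate threshold and congestion not yet confirmed:
        # large increments are still allowed, and the counter is reset.
        return rx + HIGH_SPEED_DELTA, 0
    if rx < MAX_LOAD_RATES - 1:
        # At or above the rate threshold, or congestion confirmed:
        # increase by one row only, after checking headroom.
        return rx + 1, slow_adj_count
    # Already at the last row of the table: no change.
    return rx, slow_adj_count
```

With these example values, increase_step(0, 0) moves ten rows at once,
while increase_step(1000, 0) moves a single row, which is the behavior
the new prose describes for rates above the threshold.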
> 
> >
> > >
> > > (Also, indenting one tab for the outer conditionals and two more for the
> > > inner ones looks a bit unusual.)
> > [acm]
> > I like unusual :-)
> >
> > >
> > > Section 14
> > >
> > > It's not entirely clear to me why RFC 2330 is classified as normative
> > > but RFC 7312 is informative, just based on the locations where they are
> > > referenced.
> > [acm]
> > It's more than that. RFC 7312 describes some unusual access conditions
> > that might be encountered and is only cited at the end of the Intro,
> > once. Certainly the measured evidence of bimodal (turbo-mode) access
> > behavior is in the category of RFC 7312 "messy stuff", but we still
> > manage that pretty well and it's not specifically mentioned in 7312.
> >
> > OTOH, RFC 2330 is the IPPM Framework, with an exception to be
> > Informative status, yet Normative in our memos, and much is owed to
> > 2330, starting with the singleton, sample, statistic metric
> > development, unmentioned stuff about clocks and accuracy, etc.
> >
> 
> Understood.
> 
> Thanks for all the updates and explanations!
> 
> -Ben
[acm] 
You're welcome. Thanks for a very productive exchange!

Al (for the co-authors)
