Re: [alto] Benjamin Kaduk's Discuss on draft-ietf-alto-performance-metrics-21: (with DISCUSS and COMMENT)

Hi Ben,

Sorry for the late reply. We have uploaded a new version to address your
wonderful comments, and the latest version is at:
https://www.ietf.org/staging/draft-ietf-alto-performance-metrics-23.html

Summary of main changes:
- For your comment (1), we have recomputed the Content-Length. Could you
please take a look?
- For your comment (2), we have removed the newly added bw-utilized metric.
It was added to match IS-IS, OSPF, and BGP-LS as closely as possible, but it
is too late a stage to do so.
- The change of delay-ow and delay-rt semantics from milliseconds to
microseconds is, exactly as you said, for consistency with the TE
metrics.
- Section 3.3.3 comment: it is excellent. We have changed the text to "A
non-normative reference definition of end-to-end one-way delay variation is
<xref target="RFC3393" />. Note that <xref target="RFC3393" /> allows the
specification of a generic selection function F to unambiguously define the
two packets selected to compute delay variations. This document defines the
specific case that F selects as the "first" packet the one with the
smallest one-way delay." The staging version has a small issue, which we
have fixed in the version that we are using right now.
- Section 3.4.4: we have changed the text to "For estimation by aggregation
of routing protocol link metrics, the default aggregation of the average of
loss rate is the sum of the link loss rates. But this default aggregation is
valid only if two conditions are met: (1) it is valid only when link loss
rates are low, and (2) it assumes that each link's loss events are
uncorrelated with every other link's loss events. When loss rates at the
links are high but independent, the general formula for aggregating loss
assuming each link is independent is to compute end-to-end loss as one
minus the product of the success rate for each link. Aggregation when
losses at links are correlated can be more complex and the ALTO server
should be cognizant of correlated loss rates."
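
To make the two aggregation rules in the Section 3.4.4 text concrete, here
is a minimal Python sketch. The function names and the sample loss rates are
illustrative assumptions, not part of the draft; both functions assume
independent link losses.

```python
from math import prod


def path_loss_exact(link_loss_rates):
    """End-to-end loss as one minus the product of per-link success rates.

    Assumes each link's loss events are independent of every other link's.
    """
    return 1.0 - prod(1.0 - p for p in link_loss_rates)


def path_loss_sum(link_loss_rates):
    """Low-loss approximation: the sum of the per-link loss rates."""
    return sum(link_loss_rates)


# Hypothetical per-link loss rates for a three-link path.
low_loss_links = [0.001, 0.002, 0.0005]
high_loss_links = [0.2, 0.3, 0.1]

for name, links in (("low", low_loss_links), ("high", high_loss_links)):
    print(name, path_loss_exact(links), path_loss_sum(links))
```

With the low loss rates the two results are nearly identical, while with the
high loss rates the sum (0.6) clearly overestimates the product-based value
(about 0.496). Neither function handles correlated losses.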

The two nits are fixed as well.
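
Regarding the Content-Length recomputation in item (1), a quick way to
sanity-check a value without curl is to count the octets of the request body
directly. The sketch below is only an illustration under assumptions: the
helper names and the tiny sample body are made up, and it only shows that
converting LF line endings to CRLF adds one octet per line, which is why the
choice of line endings matters when recomputing the values.

```python
def content_length(body: bytes) -> int:
    """Content-Length is the number of octets in the message body."""
    return len(body)


def to_crlf(body: bytes) -> bytes:
    """Convert bare LF line endings to CRLF, leaving existing CRLF intact."""
    return body.replace(b"\r\n", b"\n").replace(b"\n", b"\r\n")


# Hypothetical three-line body with Unix (LF, 0x0a) line endings.
sample = b'{\n  "a" : 1\n}\n'

lf_len = content_length(sample)
crlf_len = content_length(to_crlf(sample))
print(lf_len, crlf_len, crlf_len - lf_len)
```

For a real check, the bytes would be read from the example file opened in
binary mode, so that no newline translation happens on the way in.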

We cannot thank you enough for the wonderful reviews and suggestions!

Richard

On Fri, Jan 28, 2022 at 2:38 PM Benjamin Kaduk <kaduk@mit.edu> wrote:

> Hi Richard,
>
> On Tue, Dec 21, 2021 at 08:13:02PM -0500, Y. Richard Yang wrote:
> > Hi Ben,
> >
> > Thank you so much for the wonderful, fast, thorough reviews! I understand
> > that Qin has already sent a reply and please see below.
>
> Yes, Qin's reply was quite helpful -- thanks, Qin!
>
> > On Mon, Dec 20, 2021 at 7:58 PM Benjamin Kaduk via Datatracker <
> > noreply@ietf.org> wrote:
> >
> > > Benjamin Kaduk has entered the following ballot position for
> > > draft-ietf-alto-performance-metrics-21: Discuss
> > >
> > > When responding, please keep the subject line intact and reply to all
> > > email addresses included in the To and CC lines. (Feel free to cut this
> > > introductory paragraph, however.)
> > >
> > >
> > > Please refer to https://www.ietf.org/blog/handling-iesg-ballot-positions/
> > > for more information about how to handle DISCUSS and COMMENT positions.
> > >
> > >
> > > The document, along with other ballot positions, can be found here:
> > > https://datatracker.ietf.org/doc/draft-ietf-alto-performance-metrics/
> > >
> > >
> > >
> > > ----------------------------------------------------------------------
> > > DISCUSS:
> > > ----------------------------------------------------------------------
> > >
> > > Thank you for addressing my previous discuss points with the -21 (and my
> > > apologies for the spurious one!); I'm glad to see that they were indeed
> > > easy to address.
> > >
> > > However, I have looked over the changes from -20 to -21 and seem to have
> > > found a couple more issues that should be addressed:
> > >
> > > (1) I can't replicate the Content-Length values in the examples (I only
> > > looked at Examples 1 and 2).  Can you please share the methodology used
> > > to generate the values?  My testing involved copy/paste from the
> > > htmlized version of the draft to a file, manually editing that file to
> > > remove the leading three spaces that come from the formatting of the
> > > draft, and using Unix wc(1) on the resulting file.  It looks like the
> > > numbers reported in the -21 are computed as the overall number of
> > > characters in the file *minus* the number of lines in the file, but I
> > > think it should be the number of characters *plus* the number of lines,
> > > to accommodate the HTTP CRLF line endings.  (My local temporary files
> > > contain standard Unix LF (0x0a) line endings, verified by hexdump(1).)
> > >
> > >
> > This is very helpful and we are impressed! Below is the method that we are
> > finally using:
> > copy .json file to a text file, e.g., example1-req.json
> > {
> >   "cost-type" : {"cost-mode"   : "numerical",
> >                  "cost-metric" : "delay-ow"},
> >   "endpoints" : {
> >     "srcs" : [ "ipv4:192.0.2.2" ],
> >     "dsts" : [
> >       "ipv4:192.0.2.89",
> >       "ipv4:198.51.100.34"
> >     ]
> >   }
> > }
> >
> > Issue a curl request to example.com:
> > curl -v --http1.1 -X POST -H 'Content-Type: application/json' --crlf
> > --data-binary @$i example.com 2>&1 -o /dev/null | grep -Fi '>
> > Content-Length'
> >
> > Note that it sets HTTP/1.1 and asks curl to fix the CRLF issue (I am using
> > a Mac and hence have 0x0a). Note that $i is the input file. For example, for
> > the updated example1-req.json, we get 225. Is it consistent with your method?
>
> For what it's worth, my reading of the curl manual is that the "--crlf" is
> not needed here (and indeed in my local testing on FreeBSD the same
> content-length is used regardless of whether or not --crlf is used).
>
> I believe that this curl-based procedure will produce the correct results.
> (My original statement about adding the number of lines seems to be
> incorrect; sorry to have caused confusion with that.)
>
> Thanks for investigating and finding a good solution.
>
> >
> > > (2) We seem to be inconsistent about what the "cur" statistical operator
> > > for the "bw-utilized" metric indicates -- in §4.4.3 it is "the current
> > > instantaneous sample", but in §4.4.4 it is somehow repurposed as "The
> > > current ("cur") utilized bandwidth of a path is the maximum of the
> > > available bandwidth of all links on the path."
> > >
> >
> > Good comment. I see where the potential confusion can be. How about the
> > following change to make clear the definition of utilized bandwidth for a
> > path?
> > "The base semantics of the metric is the Unidirectional Utilized Bandwidth
> > metric defined in [RFC8571,RFC8570,RFC7471], but instead of specifying the
> > utilized bandwidth for a link, it is the utilized bandwidth of the path
> > from the source to the destination, where the utilized bandwidth of the
> > path from the source to the destination is defined as the maximum utilized
> > bandwidth among all links from the source to the destination."
> >
> > Overall, I agree that we should not add bw-utilized to be fully
> > consistent, and we will just revert bw-utilized.
>
> I don't want to dwell on this topic too much since it sounds like we're
> going to just remove bw-utilized, but to briefly answer your question: the
> proposed text looks like a good clear definition of the semantics of the
> metric.  I think that the §4.4.4 text would need to be adjusted to talk
> about "utilized bandwidth" rather than "available bandwidth" of all links
> on the path, in order to match it.
>
> >
> >
> > >
> > > ----------------------------------------------------------------------
> > > COMMENT:
> > > ----------------------------------------------------------------------
> > >
> > > I cannot currently provide a concise explanation of the nature of my
> > > unease with the "bw-utilized" metric specification that is new in this
> > > revision (so as to elevate it to a Discuss-level concern), but I
> > > strongly urge the authors and WG to consider my comments on Section 4.4.3.
> > >
> > > The new text in Section 1 explaining the origins of the metrics (e.g.,
> > > from TE performance metrics) and why some other TE metrics are not
> > > defined is nicely done.  I trust the responsible AD and WG chairs to
> > > ensure that it, and the other places where we have added new exposition,
> > > has gotten the appropriate level of review from the WG membership.
> > >
> > >
> > It is wonderful that you caught it immediately. The reasoning was to
> > achieve perfect consistency, but that is not a good idea. So we will just
> > drop this new metric.
> >
> >
> >
> > > Section 3.1.2, 3.2.2
> > >
> > > I see that the delay-ow and delay-rt semantics have been changed from
> > > milliseconds to microseconds going from -20 to -21.  Either
> > > representation seems fine, but it may be risky to make such a change so
> > > late in the publication process, especially if there are already
> > > implementations in place.  I also don't see any AD ballot comments that
> > > seem to motivate the change, so I'm a bit curious how it arose -- is it
> > > for consistency with the corresponding TE link metrics?
> > >
> > >
> > Exactly. It was motivated by the authors' search for perfect consistency:
> > we noted that we used milliseconds in our drafts, whereas OSPF/ISIS/BGP-LS
> > use microseconds. So we made the quick change.
>
> Thanks for confirming.  Hopefully we are in touch with all implementations
> and can get them adjusted quickly.
>
> > > Section 3.3.3
> > >
> > >    Intended Semantics: To specify temporal and spatial aggregated delay
> > >    variation (also called delay jitter)) with respect to the minimum
> > >    delay observed on the stream over the one-way delay from the
> > >    specified source and destination, where the one-way delay is defined
> > >    in Section 3.1.  A non-normative reference definition of end-to-end
> > >    one-way delay variation is [RFC3393].  [...]
> > >
> > > I note that RFC 3393 explicitly says that as part of the metric, several
> > > parameters must be specified, most notably the selection function F that
> > > unambiguously defines the two packets selected for the metric.  While
> > > it's allowed for F to select as the "first" packet the one with the
> > > smallest one-way delay, which maps up to the "with respect to the
> > > minimum delay observed on the stream" here, it seems to me that it's
> > > fairly important to call out that we are not allowing the full
> > > flexibility of the RFC 3393 metric.  Assuming, of course, that we
> > > specifically have that as the intent, versus allowing the full
> > > generality of RFC 3393.  If there have been some research results since
> > > RFC 3393 was published that indicate that it's preferred to use the
> > > minimum delay for this purpose, that might be worth listing as a
> > > reference in addition to RFC 3393.
> > >
> > >
> > Excellent comment! Here is the new text:
> > "A non-normative reference definition of end-to-end one-way delay
> > variation is <xref target="RFC3393" />, which allows general delay
> > variations by specifying a selection function F (Section 2.2 of
> > [RFC3393]). See <xref target="RFC5481" /> for additional discussions on
> > RFC3393 related packet delay variations. This document focuses on the
> > specific case of delay variation with respect to the minimum delay
> > observed in the packet stream, as commonly exposed by routing link
> > metrics. If an ALTO server provides a delay variation metric that is not
> > based on the minimum delay, the server can provide the precise definition
> > in the "cost-context" field, for example, by specifying a general IPPM
> > PDV metric in the "parameters" field. The server SHOULD be cognizant that
> > the "cost-context" field is optional, and hence the client may not
> > interpret the semantics properly."
>
> That looks great; thank you!
>
> > I looked at the IPPM registry
> > (https://www.iana.org/assignments/performance-metrics/performance-metrics.xhtml).
> > The only PDV metric defined so far is entry 3, and it is with respect to
> > minimum delay.
> >
> >
> > > Section 3.4.4
> > >
> > > The estimation of end-to-end loss rate as the sum of per-link loss rates
> > > is (1) only valid in the low-loss limit, and (2) assumes that each
> > > link's loss events are uncorrelated with every other link's loss events.
> > > The current text does mention (2) in the form of "should be cognizant of
> > > correlated loss rates", but I don't think it touches on (1) at all.
> > > (The general formula for aggregating loss assuming each link is
> > > independent is to compute end-to-end loss as one minus the product of
> > > the success rate for each link.)
> > >
> > >
> > Excellent comments. How about this updated text:
> > "For estimation by aggregation of routing protocol link metrics, the
> > default aggregation function to compute the loss rate of a path is to
> > compute it as one minus the success rate of the path, where the success
> > rate of the path is the product of the success rates of the links on the
> > path, and the success rate of a link is one minus the loss rate of the
> > link. In low loss-rate settings, the loss rate of a path can be
> > approximated as the sum of link loss rates. This aggregation function
> > assumes independent link losses, and the ALTO server should be cognizant
> > of correlated link loss rates."
>
> That should work, thanks.
>
> >
> > > Section 4.4.3
> > >
> > > It seems like there may be some subtlety in the interpretation of the
> > > "bw-utilized" metric, which leads me to wonder if more caution is
> > > advised prior to adding new metrics at this stage in the document
> > > lifecycle.  In particular, it seems like it would be natural to attempt
> > > to compare the "bw-utilized" value against the "bw-maxres" value and
> > > "bw-residual" value, but it seems to me that the inferences that can be
> > > made by such comparisons will depend on the topology in question.
> > >
> > >
> > > Routers and link capacities between them:
> > >
> > >        1Gbps            10Gbps            1Gbps
> > >    +-----------------+=================+--------------+
> > >    A                 B                 C              D
> > >
> > > If there is a flow using 6Gbps from B to C, that would show up when
> > > querying "bw-utilized" between A and B, but that 6Gbps is obviously more
> > > than both the maximum reservable and residual bandwidth end-to-end from
> > > A to D; likewise, the 4Gbps of residual bandwidth on the B-to-C link is
> > > also more than the achievable bandwidth end-to-end from A to D.  So it
> > > seems like the utilized bandwidth is potentially from totally unrelated
> > > flows on paths that only have a minimal set of links in common with the
> > > path being queried.  How do we expect someone to use the reported
> > > "bw-utilized" values?
> > >
> > > To put it differently, I don't think that the specification of "the
> > > maximum utilized bandwidth among all links from the source to the
> > > destination" will actually provide the desired "utilized bandwidth of
> > > the path from the source to the destination", since the procedure as
> > > stated can report a bandwidth that corresponds to a different path.
> > >
> > >
> > Excellent comment! We will just use the previous version w/o bw-utilized
> > and will engage you in a separate thread, so that we will not block the
> > progress of the current document.  Make sense?
>
> Yes, that sounds good.  This will also let us exercise the IANA registry
> procedures quickly and get experience with them :)
>
> [nits trimmed]
>
> Thanks again,
>
> Ben
>

-- 
 =====================================
| Y. Richard Yang <yry@cs.yale.edu>   |
| Professor of Computer Science       |
| http://www.cs.yale.edu/~yry/        |
 =====================================