Re: [alto] Benjamin Kaduk's Discuss on draft-ietf-alto-performance-metrics-21: (with DISCUSS and COMMENT)

Hi Ben,

Thank you so much for the wonderful, fast, thorough reviews! I understand
that Qin has already sent a reply; please see my responses below.

On Mon, Dec 20, 2021 at 7:58 PM Benjamin Kaduk via Datatracker <
noreply@ietf.org> wrote:

> Benjamin Kaduk has entered the following ballot position for
> draft-ietf-alto-performance-metrics-21: Discuss
>
> When responding, please keep the subject line intact and reply to all
> email addresses included in the To and CC lines. (Feel free to cut this
> introductory paragraph, however.)
>
>
> Please refer to https://www.ietf.org/blog/handling-iesg-ballot-positions/
> for more information about how to handle DISCUSS and COMMENT positions.
>
>
> The document, along with other ballot positions, can be found here:
> https://datatracker.ietf.org/doc/draft-ietf-alto-performance-metrics/
>
>
>
> ----------------------------------------------------------------------
> DISCUSS:
> ----------------------------------------------------------------------
>
> Thank you for addressing my previous discuss points with the -21 (and my
> apologies for the spurious one!); I'm glad to see that they were indeed
> easy to address.
>
> However, I have looked over the changes from -20 to -21 and seem to have
> found a couple more issues that should be addressed:
>
> (1) I can't replicate the Content-Length values in the examples (I only
> looked at Examples 1 and 2).  Can you please share the methodology used
> to generate the values?  My testing involved copy/paste from the
> htmlized version of the draft to a file, manually editing that file to
> remove the leading three spaces that come from the formatting of the
> draft, and using Unix wc(1) on the resulting file.  It looks like the
> numbers reported in the -21 are computed as the overall number of
> characters in the file *minus* the number of lines in the file, but I
> think it should be the number of characters *plus* the number of lines,
> to accommodate the HTTP CRLF line endings.  (My local temporary files
> contain standard Unix LF (0x0a) line endings, verified by hexdump(1).)
>
>
This is very helpful and we are impressed! Below is the method that we
ended up using:
Copy the JSON body to a text file, e.g., example1-req.json:
{
  "cost-type" : {"cost-mode"   : "numerical",
                 "cost-metric" : "delay-ow"},
  "endpoints" : {
    "srcs" : [ "ipv4:192.0.2.2" ],
    "dsts" : [
      "ipv4:192.0.2.89",
      "ipv4:198.51.100.34"
    ]
  }
}

Issue a curl request to example.com:
curl -v --http1.1 -X POST -H 'Content-Type: application/json' --crlf \
  --data-binary @$i example.com 2>&1 -o /dev/null | \
  grep -Fi '> Content-Length'

Note that it forces HTTP/1.1 and uses --crlf to ask curl to convert LF to
CRLF (I am using a Mac and hence have 0x0a line endings). Here $i is the
input file. For example, for the updated example1-req.json, we get 225. Is
it consistent with your method?
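
A minimal sketch of the byte-count arithmetic discussed above, assuming the
body is stored with LF line endings and is sent with CRLF line endings. The
function name content_length is hypothetical and is not part of the draft
or of curl; it only illustrates "number of characters plus number of
lines".

```python
def content_length(body: str, crlf: bool = True) -> int:
    """Return the number of bytes the body occupies on the wire.

    Assumption: the body is UTF-8 encoded. With crlf=True every line
    ending is counted as CRLF (two bytes); with crlf=False as LF (one).
    """
    # Normalize first so that an input already using CRLF is not doubled.
    normalized = body.replace("\r\n", "\n")
    if crlf:
        normalized = normalized.replace("\n", "\r\n")
    return len(normalized.encode("utf-8"))


# Toy body with two lines, each terminated by LF.
sample = '{\n}\n'
lf_bytes = content_length(sample, crlf=False)    # what wc -c reports
crlf_bytes = content_length(sample, crlf=True)   # LF count plus one per line
print(lf_bytes, crlf_bytes)
```

Whether the final line of the file ends with a newline changes the result
by one or two bytes, so it is worth checking that as well when comparing
numbers.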

> (2) We seem to be inconsistent about what the "cur" statistical operator
> for the "bw-utilized" metric indicates -- in §4.4.3 it is "the current
> instantaneous sample", but in §4.4.4 it is somehow repurposed as "The
> current ("cur") utilized bandwidth of a path is the maximum of the
> available bandwidth of all links on the path."
>

Good comment. I see where the potential confusion can be. How about the
following change to make clear the definition of utilized bandwidth for a
path?
"The base semantics of the metric is the Unidirectional Utilized Bandwidth
metric defined in [RFC8571,RFC8570,RFC7471], but instead of specifying the
utilized bandwidth for a link, it is the utilized bandwidth of the path
from the source to the destination, where the utilized bandwidth of the
path from the source to the destination is defined as the maximum utilized
bandwidth among all links from the source to the destination."

Overall, I agree that we should not add bw-utilized just to be fully
consistent, and we will simply revert the bw-utilized addition.

>
> ----------------------------------------------------------------------
> COMMENT:
> ----------------------------------------------------------------------
>
> I cannot currently provide a concise explanation of the nature of my
> unease with the "bw-utilized" metric specification that is new in this
> revision (so as to elevate it to a Discuss-level concern), but I
> strongly urge the authors and WG to consider my comments on Section 4.4.3.
>
> The new text in Section 1 explaining the origins of the metrics (e.g.,
> from TE performance metrics) and why some other TE metrics are not
> defined is nicely done.  I trust the responsible AD and WG chairs to
> ensure that it, and the other places where we have added new exposition,
> has gotten the appropriate level of review from the WG membership.
>
>
It is wonderful that you caught it immediately. The reasoning was to
achieve perfect consistency, but it is not a good idea. So we will just
drop this new metric.

> Section 3.1.2, 3.2.2
>
> I see that the delay-ow and delay-rt semantics have been changed from
> milliseconds to microseconds going from -20 to -21.  Either
> representation seems fine, but it may be risky to make such a change so
> late in the publication process, especially if there are already
> implementations in place.  I also don't see any AD ballot comments that
> seem to motivate the change, so I'm a bit curious how it arose -- is it
> for consistency with the corresponding TE link metrics?
>
>
Exactly. It was motivated by the authors' search for perfect consistency:
we noted that we used milliseconds in our drafts while OSPF/ISIS/BGP-LS
use microseconds, so we made the quick change.

> Section 3.3.3
>
>    Intended Semantics: To specify temporal and spatial aggregated delay
>    variation (also called delay jitter)) with respect to the minimum
>    delay observed on the stream over the one-way delay from the
>    specified source and destination, where the one-way delay is defined
>    in Section 3.1.  A non-normative reference definition of end-to-end
>    one-way delay variation is [RFC3393].  [...]
>
> I note that RFC 3393 explicitly says that as part of the metric, several
> parameters must be specified, most notably the selection function F that
> unambiguously defines the two packets selected for the metric.  While
> it's allowed for F to select as the "first" packet the one with the
> smallest one-way delay, which maps up to the "with respect to the
> minimum delay observed on the stream" here, it seems to me that it's
> fairly important to call out that we are not allowing the full
> flexibility of the RFC 3393 metric.  Assuming, of course, that we
> specifically have that as the intent, versus allowing the full
> generality of RFC 3393.  If there has been some research results since
> RFC 3393 was published that indicate that it's preferred to use the
> minimum delay for this purpose, that might be worth listing as a
> reference in addition to RFC 3393.
>
>
Excellent comment! Here is the new text:
"A non-normative reference definition of end-to-end one-way delay
variation is <xref target="RFC3393" />, which allows general delay
variations by specifying a selection function F (Section 2.2 of
[RFC3393]). See <xref target="RFC5481" /> for additional discussions on
RFC3393 related packet delay variations. This document focuses on the
specific case of delay variation with respect to the minimum delay
observed in the packet stream, as commonly exposed by routing link
metrics. If an ALTO server provides a delay variation metric that is not
based on the minimum delay, the server can provide the precise definition
in the "cost-context" field, for example, by specifying a general IPPM
PDV metric in the "parameters" field. The server SHOULD be cognizant that
the "cost-context" field is optional, and hence the client may not
interpret the semantics properly."

I looked at the IPPM registry
(https://www.iana.org/assignments/performance-metrics/performance-metrics.xhtml).
The only PDV metric defined so far is entry 3, and it is with respect to
minimum delay.
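
As a rough illustration of delay variation with respect to the minimum
delay, here is a minimal sketch. The function name and the sample values
are hypothetical, and the delays are assumed to be one-way delays of the
packets of a single stream, all in the same unit.

```python
def delay_variation_from_min(delays):
    """Per-packet delay variation relative to the minimum observed delay.

    Assumption: `delays` holds one-way delays for one packet stream,
    all expressed in the same unit (e.g., microseconds).
    """
    if not delays:
        raise ValueError("at least one delay sample is required")
    d_min = min(delays)
    # The packet with the smallest delay plays the role of the reference
    # packet, so its own variation is zero and all others are >= 0.
    return [d - d_min for d in delays]


samples_us = [1200, 1500, 1250]  # made-up one-way delays in microseconds
print(delay_variation_from_min(samples_us))
```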

> Section 3.4.4
>
> The estimation of end-to-end loss rate as the sum of per-link loss rates
> is (1) only valid in the low-loss limit, and (2) assumes that each
> link's loss events are uncorrelated with every other link's loss events.
> The current text does mention (2) in the form of "should be cognizant of
> correlated loss rates", but I don't think it touches on (1) at all.
> (The general formula for aggregating loss assuming each link is
> independent is to compute end-to-end loss as one minus the product of
> the success rate for each link.)
>
>
Excellent comments. How about this updated text:
"For estimation by aggregation of routing protocol link metrics, the
default aggregation function to compute the loss rate of a path is to
compute it as one minus the success rate of the path, where the success
rate of the path is the product of the success rates of the links on the
path, and the success rate of a link is one minus the loss rate of the
link. In low loss-rate settings, the loss rate of a path can be
approximated as the sum of link loss rates. This aggregation function
assumes independent link losses, and the ALTO server should be cognizant
of correlated link loss rates."
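
The aggregation described in the proposed text can be sketched as follows.
The function names and the sample loss rates are hypothetical; the sketch
assumes independent link losses, with each loss rate given as a fraction
between 0 and 1.

```python
import math


def path_loss_rate(link_loss_rates):
    """Path loss rate: one minus the product of per-link success rates."""
    success = math.prod(1.0 - p for p in link_loss_rates)
    return 1.0 - success


def path_loss_rate_approx(link_loss_rates):
    """Low-loss approximation: the sum of the per-link loss rates."""
    return sum(link_loss_rates)


low = [0.01, 0.02]   # made-up low-loss links: the two results are close
high = [0.5, 0.5]    # made-up lossy links: the sum overestimates badly
for rates in (low, high):
    print(rates, path_loss_rate(rates), path_loss_rate_approx(rates))
```

The approximation drops the cross terms of the product, which is why it is
only reasonable when every link loss rate is small.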

> Section 4.4.3
>
> It seems like there may some subtlety in the interpretation of the
> "bw-utilized" metric, which leads me to wonder if more caution is
> advised prior to adding new metrics at this stage in the document
> lifecycle.  In particular, it seems like it would be natural to attempt
> to compare the "bw-utilized" value against the "bw-maxres" value and
> "bw-residual" value, but it seems to me that the inferences that can be
> made by such comparisons will depend on the topology in question.
>
>
> Routers and link capacities between them:
>
>        1Gbps            10Gbps            1Gbps
>    +-----------------+=================+--------------+
>    A                 B                 C              D
>
> If there is a flow using 6GBps from B to C, that would show up when
> querying "bw-utilized" between A and B, but that 6Gbps is obviously more
> than both the maximum reservable and residual bandwidth end-to-end from
> A to D; likewise, the 4GBps of residual bandwidth on the B-to-C link is
> also more than the achievable bandwidth end-to-end from A to D.  So it
> seems like the utilized bandwidth is potentially from totally unrelated
> flows on paths that only have a minimal set of links in common with the
> path being queried.  How do we expect someone to use the reported
> "bw-utilized" values?
>
> To put it differently, I don't think that the specification of "the
> maximum utilized bandwidth among all links from the source to the
> destination" will actually provide the desired "utilized bandwidth of
> the path from the source to the destination", since the procedure as
> stated can report a bandwidth that corresponds to a different path.
>
>
Excellent comment! We will just use the previous version without
bw-utilized and will engage you in a separate thread, so that we will not
block the progress of the current document. Does that make sense?
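
For the record, the mismatch in the A-B-C-D example above can be reproduced
with a small sketch. The link names and function names are hypothetical,
link capacity is used as a stand-in for the end-to-end bottleneck, and the
utilization of the A-B and C-D links is assumed to be zero since the
example does not state it.

```python
# Link capacities from the example topology, in Gbps.
capacity = {"A-B": 1.0, "B-C": 10.0, "C-D": 1.0}

# Utilized bandwidth per link, in Gbps. The 6 Gbps flow on B-C comes from
# the example; the 0.0 entries are assumptions made for illustration.
utilized = {"A-B": 0.0, "B-C": 6.0, "C-D": 0.0}


def path_bw_utilized_max(links):
    """Path value under the 'maximum over all links' definition."""
    return max(utilized[link] for link in links)


def path_bottleneck_capacity(links):
    """Smallest link capacity on the path (the end-to-end bottleneck)."""
    return min(capacity[link] for link in links)


path_a_to_d = ["A-B", "B-C", "C-D"]
reported = path_bw_utilized_max(path_a_to_d)
bottleneck = path_bottleneck_capacity(path_a_to_d)
# The reported utilization exceeds what the path could ever carry.
print(reported, bottleneck, reported > bottleneck)
```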

> NITS
>
> Section 1
>
> s/"Semantics Base On" column/"Semantics Based On" column/ (in the prose,
> first paragraph after the table).
>
> Section 4.3
>
> The section heading has a typo: s/Availlble/Available/
>
>
Fixed.

Thank you so much!
Richard
