Re: [alto] Lars Eggert's Discuss on draft-ietf-alto-performance-metrics-19: (with DISCUSS and COMMENT)

Hi Lars,

Thank you for your feedback. We have revised the ALTO performance metrics
document, and the latest version is posted at:
https://datatracker.ietf.org/doc/html/draft-ietf-alto-performance-metrics-22

Could you please take a look and let us know whether we have addressed your
comments? At a high level:
- We tried to clarify where the metrics come from. The related text is on
pages 5 and 6.
- One part that we would particularly like you to check is the TCP
throughput (tput) metric. The section is at:
https://datatracker.ietf.org/doc/html/draft-ietf-alto-performance-metrics-22#section-4.1

We greatly appreciate the wonderful reviews!
Richard

On Tue, Nov 30, 2021 at 9:27 PM Y. Richard Yang <yry@cs.yale.edu> wrote:

> Hi Lars,
>
> Just posted an updated version, and a diff from the previous version is
> available at:
> https://www.ietf.org/rfcdiff?url2=draft-ietf-alto-performance-metrics-20
>
> Thanks!
> Richard
>
> On Tue, Nov 30, 2021 at 8:49 PM Y. Richard Yang <yry@cs.yale.edu> wrote:
>
>> Hi Lars,
>>
>> Thanks for the review! Please see below.
>>
>> On Mon, Nov 29, 2021 at 8:10 AM Lars Eggert via Datatracker <
>> noreply@ietf.org> wrote:
>>
>>> Lars Eggert has entered the following ballot position for
>>> draft-ietf-alto-performance-metrics-19: Discuss
>>>
>>> Please refer to
>>> https://www.ietf.org/blog/handling-iesg-ballot-positions/
>>> for more information about how to handle DISCUSS and COMMENT positions.
>>>
>>> The document, along with other ballot positions, can be found here:
>>> https://datatracker.ietf.org/doc/draft-ietf-alto-performance-metrics/
>>>
>>>
>>>
>>> ----------------------------------------------------------------------
>>> DISCUSS:
>>> ----------------------------------------------------------------------
>>>
>>> This document needs to become much more formal about how it defines the
>>> metrics it wishes to use with ALTO. This could be done either by
>>> identifying and normatively referencing existing metrics the IETF has
>>> defined,
>>> or by defining them here. When normatively referencing existing IETF
>>> metrics, it
>>> would need to explain why their use with ALTO makes sense.
>>>
>>> At the moment, the document informatively points to a somewhat arbitrary
>>> collection of prior IETF metrics (most of which are from IPPM, residual
>>> bandwidth from IS-IS TE, but then reservable bandwidth from OSPF TE?).
>>
>>
>> To give some background, the WG derived the list of metrics from RFC 8571
>> (BGP - Link State (BGP-LS) Advertisement of IGP Traffic Engineering
>> Performance Metric Extensions), focusing on network->application. The
>> list added Hop Count (exists in original ALTO RFC 7285), Round-trip (to
>> avoid two queries, and many apps use RTT), and TCP Throughput, and
>> removed Unidirectional Available Bandwidth and Unidirectional Utilized
>> Bandwidth, to reduce the number of bandwidth metrics.
>>
>>
>>> But it
>>> only refers to them as "examples",
>>
>>
>> I searched for the word "example" and do not see where the document says
>> that they are examples. It says: "Since different applications may use
>> different cost metrics, the ALTO base protocol introduces an ALTO Cost
>> Metric Registry (Section 14.2 of [RFC7285]), as a systematic mechanism to
>> allow different metrics to be specified. For example, a delay-sensitive
>> application may want to use latency-related metrics, and a
>> bandwidth-sensitive application may want to use bandwidth-related
>> metrics."
>>
>> Does this paragraph give the impression that the metrics are only
>> examples? If so, do you suggest removing the "For example" phrase to
>> avoid that impression?
>>
>> The document does have the sentence "The "Origin Example" column of Table
>> 1 gives an example RFC that has defined each metric." Here the word
>> "example" means one existing work.
>>
>>
>>> without actually defining how exactly they
>>> are to be used with ALTO, or - if not those - which actual metrics are
>>> supposed
>>> to be used.
>>>
>>
>> The document has "... the ALTO base protocol introduces an ALTO Cost
>> Metric Registry (Section 14.2 of [RFC7285]), as a systematic mechanism to
>> allow different metrics to be specified." and "When an ALTO server
>> supports a cost metric defined in this document, it should announce this
>> metric in its information resource directory (IRD) as defined in Section
>> 9.2 of [RFC7285]." Does this say enough about how exactly they should be
>> used? The function of this document is to satisfy the registry; the usage
>> itself is as specified in the base protocol (RFC7285). If you have a
>> specific suggestion, we would be glad to have it.
>>
>>
>>> Defining a mechanism for exposing metric information to clients isn't
>>> really
>>> useful unless the content of that information is much more clearly
>>> specified.
>>>
>> I agree with this statement that information should be specified as
>> clearly as possible, but at the same time, we need abstraction to reduce
>> the complexity. One guiding principle in the design is that ALTO
>> information provides reasonable guidance, not mathematical precision.
>>
>>
>>> Section 4.1.3. , paragraph 2, discuss:
>>> >    Intended Semantics: To give the throughput of a TCP congestion-
>>> >    control conforming flow from the specified source to the specified
>>> >    destination; see [RFC3649, Section 5.1 of RFC8312] on how TCP
>>> >    throughput is estimated.  The spatial aggregation level is specified
>>> >    in the query context (e.g., PID to PID, or endpoint to endpoint).
>>>
>>> A TCP bandwidth estimate can only meaningfully be derived for bulk TCP
>>> transfers
>>
>>
>> Yes. It is intended for bulk transfer.
>>
>>
>>> under a set of pretty strict and simplistic assumptions, making this
>>> metric meaningless at best and misleading at worst,
>>
>>
>> I would say that the TCP throughput formula has, in general, turned out
>> to be quite useful.
>>
>>
>>
>>> given that the source of
>>> this information doesn't know what workload, congestion controller and
>>> network
>>> conditions the user of this information will use or see.
>>>
>>
>> The network (the source) is in a pretty good position to estimate the
>> potential TCP throughput. In a high multiplexing setting (small fish in a
>> big pond), the network can have access to the estimated loss rate, RTT,
>> and typical packet size needed to evaluate the TCP throughput formula. In
>> a low multiplexing setting (big fish in a small pond), the network can
>> know the set of flows and estimate the bandwidth share. See the citation
>> of the Prophet work in the document and the G2 work in SIGMETRICS'21 and
>> SIGCOMM'21. The congestion controller info is part of the metric (the
>> link points to standard TCP/Reno). I made some minor edits to clarify.
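For illustration only, the high multiplexing case can be sketched with the
widely used Mathis et al. steady-state approximation,
rate = (MSS / RTT) * sqrt(3/2) / sqrt(p). Choosing this particular formula is
an assumption made for the sketch, not necessarily the exact formula the
draft cites; the function name and the sample inputs are hypothetical.

```python
import math


def mathis_throughput_bps(mss_bytes: float, rtt_s: float, loss_rate: float) -> float:
    """Approximate steady-state throughput of a Reno-style bulk TCP flow.

    Assumed model (Mathis et al.): rate = (MSS / RTT) * sqrt(3/2) / sqrt(p),
    where p is the packet loss rate. Returns bits per second.
    """
    if mss_bytes <= 0 or rtt_s <= 0:
        raise ValueError("MSS and RTT must be positive")
    if not 0 < loss_rate <= 1:
        raise ValueError("loss rate must be in (0, 1]")
    return (mss_bytes * 8 / rtt_s) * math.sqrt(1.5 / loss_rate)


# Hypothetical inputs: 1460-byte segments, 50 ms RTT, 0.01% loss.
estimate = mathis_throughput_bps(1460, 0.050, 0.0001)
print(f"estimated throughput: {estimate / 1e6:.1f} Mbit/s")
```

The sketch makes the dependency visible: the estimate scales inversely with
RTT and with the square root of the loss rate, which is why it only applies
when loss is an input rather than an outcome of the flow itself.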
>>
>>
>>> Also, RFC3649 is an Experimental RFC (from 2003!) and RFC8312 is an
>>> Informational RFC. Since this document normatively refers to them, it
>>> needs to
>>> cite them, and this will cause DOWNREFs for PS document. I would argue
>>> that
>>> at least RFC3649 is certainly not an appropriate DOWNREF.
>>>
>>>
>> Good suggestion! I added the reference to 3649 from the second paragraph
>> of Sec. 5.1 of RFC8312 (you are a co-author). It reads "The average
>> window sizes of Standard TCP and HSTCP are from [RFC3649].  The
>> average window size of CUBIC is calculated using Eq. 6 and the CUBIC
>> TCP-friendly region for three different values of C." Our plan, which
>> Martin already suggested (it is my fault for not having updated the draft
>> yet), is to remove 3649 and use RFC8312bis. Does that make sense?
>>
>>
>>> Why define this metric at all? The material you point to is the usual
>>> model-based throughput calculation based on RTT and loss rates; a client
>>> that
>>> intended to predict TCP performance could simply query ALTO for this and
>>> perform
>>> their own computation, which will likely be more accurate, since the
>>> client will
>>> hopefully know which congestion controller they will use for the given
>>> workload,
>>> and what the characteristics of that workload are.
>>>
>>
>> The throughput formula is for a very limited setting, i.e., the small
>> fish setting. What we found useful is the low multiplexing setting, where
>> the loss rate is the output, not the input, of the convergence process.
>> It has good use cases. Please see the Prophet paper; a more recent
>> example is the set of use cases, such as accelerating time-bound
>> constrained flows, in Sec. 3 of
>> https://www.reservoir.com/wp-content/uploads/2021/08/G2_QTBS_TR_2021.pdf
>> The paper uses max-min fairness, whereas the Internet uses other notions
>> of fairness.
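As a rough sketch of the low multiplexing case, where the network knows the
set of flows and estimates each flow's bandwidth share, the following
computes a max-min fair allocation by progressive filling. It is a textbook
algorithm used here as an assumption, not the method of the Prophet or G2
work; the link names, flow names, and capacities are hypothetical.

```python
def max_min_fair_rates(capacity, routes):
    """Max-min fair rates by progressive filling.

    capacity: dict mapping link name -> capacity (any consistent unit).
    routes:   dict mapping flow name -> iterable of links the flow crosses.
    Returns a dict mapping flow name -> allocated rate.
    """
    paths = {flow: set(links) for flow, links in routes.items()}
    if any(not links for links in paths.values()):
        raise ValueError("every flow must cross at least one link")

    rates = {flow: 0.0 for flow in paths}
    remaining = {link: float(cap) for link, cap in capacity.items()}
    active = set(paths)

    while active:
        # Equal share each link could still give to its unfrozen flows.
        share = {}
        for link, left in remaining.items():
            users = sum(1 for flow in active if link in paths[flow])
            if users:
                share[link] = left / users
        increment = min(share.values())

        # Raise every unfrozen flow by the tightest share.
        for flow in active:
            rates[flow] += increment
            for link in paths[flow]:
                remaining[link] -= increment

        # Freeze flows that cross a link which is now saturated.
        saturated = {
            link for link in share
            if remaining[link] <= 1e-9 * max(capacity[link], 1.0)
        }
        active = {flow for flow in active if not (paths[flow] & saturated)}

    return rates


# Hypothetical topology: two links, three flows.
allocation = max_min_fair_rates(
    {"A": 10.0, "B": 4.0},
    {"f1": ["A"], "f2": ["A", "B"], "f3": ["B"]},
)
print(allocation)
```

In this setting the loss rate never appears as an input: the shares follow
from the topology and the set of competing flows alone.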
>>
>>
>>
>>> ----------------------------------------------------------------------
>>> COMMENT:
>>> ----------------------------------------------------------------------
>>>
>>> Section 1. , paragraph 6, comment:
>>> >    The purpose of this document is to ensure proper usage of the
>>> >    performance metrics defined in Table 1; it does not claim novelty of
>>> >    the metrics.  The "Origin Example" column of Table 1 gives an
>>> example
>>> >    RFC that has defined each metric.
>>>
>>> I don't understand what the purpose of the "origin example" column is.
>>> Most of
>>> these point to IPPM metrics, which have a pretty clear and
>>> narrowly-defined area
>>> of applicability. Since ALTO isn't performing IPPM-style network
>>> testing, it's
>>> not clear why IPPM metrics are referenced here?
>>>
>>
>> The metrics that this document uses were defined previously in multiple
>> IETF documents. The intention of the sentence is to credit that earlier
>> work.
>>
>>
>>> Section 2.2. , paragraph 23, comment:
>>> >    If a cost metric string does not have the optional statical operator
>>> >    string, the statistical operator SHOULD be interpreted as the
>>> default
>>> >    statical operator in the definition of the base metric.  If the
>>>
>>> What is a "statical" operator; I am not familiar with the term and it
>>> doesn't
>>> seem to appear in other RFCs? (Also occurs elsewhere in this document.)
>>>
>> Apologies for the typo: statical operator -> statistical operator. It is
>> fixed in an internal version that we have not yet uploaded.
>>
>>
>>
>>
>>> Section 3.1.4. , paragraph 4, comment:
>>> >    link statistics.  Another example of a source to estimate the delay
>>> >    is the IPPM framework [RFC2330].  It is RECOMMENDED that the
>>>
>>> IPPM defines measurement metrics. How would they be a source for
>>> estimates?
>>>
>>>
>> The intention was to refer to the measurement methodology in Section 6.2
>> of RFC 2330, but I can see the potential confusion now. How about we
>> change the wording to "Another example of a source to estimate the delay
>> is through active measurements, for example, considering the IETF IPPM
>> framework [RFC2330]."?
>>
>>
>>> Section 3.3. , paragraph 1, comment:
>>> > 3.3.  Cost Metric: Delay Variation (delay-variation)
>>>
>>> Is this supposed to apply to the one-way or bidirectional delay?
>>
>>
>> This is the current specification: "
>> 3.3.3.  Intended Semantics and Use
>>
>>    Intended Semantics: To specify spatial and temporal aggregated delay
>>    variation (also called delay jitter)) with respect to the minimum
>>    delay observed on the stream over the one-way delay from the
>>    specified source and destination.  The spatial aggregation level is
>>    specified in the query context (e.g., PID to PID, or endpoint to
>>    endpoint)."
>>
>> So it is one way.
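A small sketch of the quoted semantics, assuming a list of one-way delay
samples from one stream: each variation is the sample's delay minus the
minimum delay observed on that stream. The sample values and the choice of
the median as the summary statistic are hypothetical.

```python
from statistics import median


def delay_variations(one_way_delays_ms):
    """Per-sample delay variation relative to the stream's minimum delay."""
    if not one_way_delays_ms:
        raise ValueError("need at least one delay sample")
    floor = min(one_way_delays_ms)
    return [delay - floor for delay in one_way_delays_ms]


# Hypothetical one-way delay samples (milliseconds), source to destination.
samples = [20.0, 22.5, 21.0, 35.0, 20.5]
variations = delay_variations(samples)
summary = median(variations)
print(variations, summary)
```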
>>
>>> Also, delay
>>> variation is not independent from path utilization (c.f. bufferbloat),
>>> so why is
>>> it being reported independently?
>>>
>>
>> I am not sure I understand the suggestion. We see jitter reported
>> independently (e.g., https://cpr.att.com/pdf/se/0001-0003.pdf), in the
>> sense of a single metric, without being specified as conditional
>> values/probabilities.
>>
>>
>>>
>>> Section 3.5. , paragraph 1, comment:
>>> > 3.5.  Cost Metric: Loss Rate (lossrate)
>>>
>>> What is this metric supposed to capture? Loss is generally not
>>> independent from
>>> network utilization (apart from random corruption loss). So it should be
>>> zero
>>> for unloaded networks, and depends on utilization otherwise. Also, is
>>> this
>>> unidirectional or bidirectional loss (wording below is unclear)?
>>>
>>
>> It is meaningful in high multiplexing settings. There can also be a
>> load-independent loss rate when there are wireless links (although I can
>> see that you may count interference as load as well).
>>
>> It is intended to be unidirectional: "3.5.3.  Intended Semantics and Use
>>
>>    Intended Semantics: To specify spatial and temporal aggregated packet
>>    loss rate from the specified source and the specified destination.
>>    The spatial aggregation level is specified in the query context
>>    (e.g., PID to PID, or endpoint to endpoint)."
>>
>> How about the following change:
>> "To specify spatial and temporal aggregated packet
>>    loss rate from the specified source and the specified destination."
>> =>
>> "To specify spatial and temporal aggregated packet
>>    loss rate, in one way, from the specified source and the specified
>>    destination."
>>
>>
>>> Using lowercase "not" together with an uppercase RFC2119 keyword is not
>>> acceptable usage. Found: "MUST not"
>>>
>>>
>> Got it. We have fixed the case:
>> "The total length of the cost metric string MUST not exceed 32"
>> =>
>> "The total length of the cost metric string MUST NOT exceed 32"
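To make the two cost metric string rules discussed in this thread concrete
(the 32-character limit and the fallback to a default statistical operator),
here is a minimal parsing sketch. It assumes the base metric and the
optional operator are joined by a colon, and it takes the default operator
as a caller-supplied value; the function name and the "mean" placeholder are
hypothetical, not taken from the draft.

```python
from typing import Tuple

MAX_COST_METRIC_LEN = 32  # "MUST NOT exceed 32"


def parse_cost_metric(metric: str, default_operator: str = "mean") -> Tuple[str, str]:
    """Split a cost metric string into (base metric, statistical operator).

    Assumes the form "<base>" or "<base>:<operator>". When the operator is
    absent, the caller-supplied default for that base metric is returned.
    """
    if len(metric) > MAX_COST_METRIC_LEN:
        raise ValueError("cost metric string exceeds 32 characters")
    base, separator, operator = metric.partition(":")
    if not base:
        raise ValueError("missing base metric identifier")
    if separator and not operator:
        raise ValueError("empty statistical operator after ':'")
    return base, (operator if separator else default_operator)


print(parse_cost_metric("lossrate:median"))
print(parse_cost_metric("delay-variation"))
```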
>>
>>
>>> The document has 6 authors, which exceeds the recommended author limit. I
>>> assume the sponsoring AD has agreed that this is appropriate?
>>>
>>> No reference entries found for: [RFC3649] and [RFC8312].
>>>
>>>
>> Thanks for pointing it out. The entries went missing after an update,
>> which Martin also pointed out. This is fixed in the next version, which
>> we will upload soon.
>>
>>
>>> Found terminology that should be reviewed for inclusivity; see
>>> https://www.rfc-editor.org/part2/#inclusive_language for background and
>>> more
>>> guidance:
>>>
>>>  * Term "man"; alternatives might be "individual", "people", "person".
>>>
>> Hah. You mean change
>> "man-in-the-middle (MITM) attacks"
>> =>
>> "person-in-the-middle attacks".
>>
>> I looked and indeed see PITM (
>> https://en.wikipedia.org/wiki/Man-in-the-middle_attack).
>>
>> Interesting and fixed. Thanks!
>>
>> The edits below are great, and all are fixed. Thanks again!
>>
>> Richard
>>
>>
>>
>>> -------------------------------------------------------------------------------
>>> All comments below are about very minor potential issues that you may
>>> choose to
>>> address in some way - or ignore - as you see fit. Some were flagged by
>>> automated tools (via https://github.com/larseggert/ietf-reviewtool), so
>>> there
>>> will likely be some false positives. There is no need to let me know
>>> what you
>>> did with these suggestions.
>>>
>>> "Abstract", paragraph 2, nit:
>>> -    types of cost metric.  Since the ALTO base protocol (RFC 7285)
>>> +    types of cost metrics.  Since the ALTO base protocol (RFC 7285)
>>> +                        +
>>>
>>> Section 1. , paragraph 2, nit:
>>> > ] on registering ALTO cost metrics. Hence it specifies the identifier,
>>> the in
>>> >                                     ^^^^^
>>> A comma may be missing after the conjunctive/linking adverb "Hence".
>>>
>>> Section 2.2. , paragraph 2, nit:
>>> > of the observations. median: the mid point (i.e., p50) of the
>>> observations.
>>> >                                  ^^^^^^^^^
>>> This word is normally spelled with a hyphen.
>>>
>>> "IPPM ", paragraph 2, nit:
>>> >  Also, delay variation is not independent from path utilization (c.f.
>>> buffer
>>> >                               ^^^^^^^^^^^^^^^^
>>> The usual collocation for "independent" is "of", not "from". Did you mean
>>> "independent of"?
>>>
>>> Section 3.3.3. , paragraph 7, nit:
>>> > apture? Loss is generally not independent from network utilization
>>> (apart fr
>>> >                               ^^^^^^^^^^^^^^^^
>>> The usual collocation for "independent" is "of", not "from". Did you mean
>>> "independent of"?
>>>
>>> Section 3.4.3. , paragraph 6, nit:
>>> > imation" method. See Section 3.1.4 on on related discussions such as
>>> summing
>>> >                                    ^^^^^
>>> Possible typo: you repeated a word.
>>>
>>> Section 3.5.4. , paragraph 3, nit:
>>> >  [RFC8312]), it helps to specify as much details as possible on the
>>> the cong
>>> >                                     ^^^^
>>> Use "many" with countable plural nouns like "details".
>>>
>>> Section 3.5.4. , paragraph 3, nit:
>>> > ify as much details as possible on the the congestion control
>>> algorithm used
>>> >                                    ^^^^^^^
>>> Two determiners in a row. Choose either "the" or "the".
>>>
>>> These URLs in the document can probably be converted to HTTPS:
>>>  *
>>> http://www.iana.org/assignments/alto-protocol/alto-protocol.xhtml#cost-metrics
>>>
>>>
>>>
>>> _______________________________________________
>>> alto mailing list
>>> alto@ietf.org
>>> https://www.ietf.org/mailman/listinfo/alto
>>
>>

-- 
 =====================================
| Y. Richard Yang <yry@cs.yale.edu>   |
| Professor of Computer Science       |
| http://www.cs.yale.edu/~yry/        |
 =====================================