Re: [alto] Lars Eggert's Discuss on draft-ietf-alto-performance-metrics-19: (with DISCUSS and COMMENT)

Hi Lars,

Just posted an updated version, and a diff from the previous version is
available at:
https://www.ietf.org/rfcdiff?url2=draft-ietf-alto-performance-metrics-20

Thanks!
Richard

On Tue, Nov 30, 2021 at 8:49 PM Y. Richard Yang <yry@cs.yale.edu> wrote:

> Hi Lars,
>
> Thanks for the review! Please see below.
>
> On Mon, Nov 29, 2021 at 8:10 AM Lars Eggert via Datatracker <
> noreply@ietf.org> wrote:
>
>> Lars Eggert has entered the following ballot position for
>> draft-ietf-alto-performance-metrics-19: Discuss
>>
>> Please refer to https://www.ietf.org/blog/handling-iesg-ballot-positions/
>> for more information about how to handle DISCUSS and COMMENT positions.
>>
>> The document, along with other ballot positions, can be found here:
>> https://datatracker.ietf.org/doc/draft-ietf-alto-performance-metrics/
>>
>>
>>
>> ----------------------------------------------------------------------
>> DISCUSS:
>> ----------------------------------------------------------------------
>>
>> This document needs to become much more formal about how it defines the
>> metrics it wishes to use with ALTO. This could be done either by
>> identifying and normatively referencing existing metrics the IETF has
>> defined,
>> or by defining them here. When normatively referencing existing IETF
>> metrics, it
>> would need to explain why their use with ALTO makes sense.
>>
>> At the moment, the document informatively points to a somewhat arbitrary
>> collection of prior IETF metrics (most of which are from IPPM, residual
>> bandwidth from IS-IS TE, but then reservable bandwidth from OSPF TE?).
>
>
> To give some background, the WG derived the list of metrics from RFC 8571
> (BGP - Link State (BGP-LS) Advertisement of IGP Traffic Engineering
> Performance
> Metric Extensions), focusing on network->application. The list added Hop
> Count
> (exists in original ALTO RFC 7285), Round-trip (to avoid two queries, and
> many apps
> use RTT), and TCP Throughput, and removed Unidirectional Available
> Bandwidth
> and Unidirectional Utilized Bandwidth, to reduce the number of bandwidth
> metrics.
>
>
>> But it
>> only refers to them as "examples",
>
>
> I searched for the word "example" and do not see where the document says
> that they
> are examples. It says that "Since different applications may use different
> cost metrics,
> the ALTO base protocol introduces an ALTO Cost Metric Registry (Section
> 14.2 of
> [RFC7285]), as a systematic mechanism to allow different metrics to be
> specified. For
> example, a delay-sensitive application may want to use latency-related
> metrics, and
> a bandwidth-sensitive application may want to use bandwidth-related
> metrics."
>
> Does this paragraph give the impression that the metrics are only
> examples? If so, do you suggest removing the "For example" phrase to
> avoid that impression?
>
> The document does have the sentence "The "Origin Example" column of Table
> 1 gives an example RFC that has defined each metric." Here the word
> "example" means one existing work.
>
>
>> without actually defining how exactly they
>> are to be used with ALTO, or - if not those - which actual metrics are
>> supposed
>> to be used.
>>
>
> The document has "... the ALTO base protocol introduces an ALTO Cost
> Metric Registry
> (Section 14.2 of [RFC7285]), as a systematic mechanism to allow different
> metrics
>  to be specified. " and "When an ALTO server supports a cost metric
> defined in this document,
> it should announce this metric in its information resource directory (IRD)
> as defined in
>  Section 9.2 of [RFC7285]." Does this say enough about how exactly they
> should be used?
> The function of this document is to satisfy the registry; the use itself
> is covered by the base protocol (RFC7285). If you have a specific
> suggestion, it would be good to have it.
>
>
>> Defining a mechanism for exposing metric information to clients isn't
>> really
>> useful unless the content of that information is much more clearly
>> specified.
>>
> I agree that information should be specified as clearly as
> possible, but at the same time, we need abstraction to reduce
> complexity.
> One guiding principle in the design is that ALTO information provides
> reasonable guidance, not mathematical precision.
>
>
>> Section 4.1.3. , paragraph 2, discuss:
>> >    Intended Semantics: To give the throughput of a TCP congestion-
>> >    control conforming flow from the specified source to the specified
>> >    destination; see [RFC3649, Section 5.1 of RFC8312] on how TCP
>> >    throughput is estimated.  The spatial aggregation level is specified
>> >    in the query context (e.g., PID to PID, or endpoint to endpoint).
>>
>> A TCP bandwidth estimate can only meaningfully be derived for bulk TCP
>> transfers
>
>
> Yes. It is intended for bulk transfer.
>
>
>> under a set of pretty strict and simplistic assumptions, making this
>> metric meaningless at best and misleading at worst,
>
>
> I would say that the TCP throughput formula has, in general, turned out
> to be quite useful.
>
>
>
>> given that the source of
>> this information doesn't know what workload, congestion controller and
>> network
>> conditions the user of this information will use or see.
>>
>
> The network (the source) is in a pretty good position to estimate the
> potential TCP throughput. In a high-multiplexing setting (small fish in a
> big pond), the network can have access to the estimated loss rate, RTT,
> and typical packet size needed to evaluate the TCP throughput formula. In
> a low-multiplexing setting (big fish in a small pond), the network can
> know the set of flows and estimate the bandwidth share. See the citation
> of the Prophet work in the document and the G2 work in SIGMETRICS'21 and
> SIGCOMM'21. The congestion controller info is part of the metric (the link
> points to standard TCP/Reno). I made some minor edits to clarify.
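
As a rough illustration of the high-multiplexing case above, the sketch below evaluates a Reno-style throughput model from an estimated loss rate, RTT, and packet size. The function name, the constant sqrt(3/2) (the commonly cited simplified Reno model), and the sample numbers are assumptions for illustration only; they are not taken from the draft, and other references use slightly different constants.

```python
import math


def reno_throughput_bps(mss_bytes: float, rtt_s: float, loss_rate: float) -> float:
    """Estimate bulk-transfer throughput of a Reno-like flow, in bits/s.

    Hypothetical helper: uses the simplified model
        average window (segments) ~= sqrt(3 / (2 * p))
    which only makes sense when the flow is a "small fish", i.e. its own
    traffic does not noticeably change the loss rate p it observes.
    """
    if mss_bytes <= 0 or rtt_s <= 0:
        raise ValueError("mss_bytes and rtt_s must be positive")
    if not 0 < loss_rate <= 1:
        raise ValueError("loss_rate must be in (0, 1]")
    avg_window_segments = math.sqrt(1.5 / loss_rate)
    # One average window of data is delivered per round-trip time.
    return avg_window_segments * mss_bytes * 8 / rtt_s


# Example inputs (made up): 1460-byte segments, 100 ms RTT, 1% loss.
estimate = reno_throughput_bps(mss_bytes=1460, rtt_s=0.1, loss_rate=0.01)
print(f"estimated throughput: {estimate / 1e6:.2f} Mbit/s")
```

The model is monotone in both inputs: a higher loss rate or a longer RTT lowers the estimate, which is the qualitative guidance an application would take from such a metric.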
>
>
>> Also, RFC3649 is an Experimental RFC (from 2003!) and RFC8312 is an
>> Informational RFC. Since this document normatively refers to them, it
>> needs to
>> cite them, and this will cause DOWNREFs for PS document. I would argue
>> that
>> at least RFC3649 is certainly not an appropriate DOWNREF.
>>
>>
> Good suggestion! I added the reference to 3649 from the second paragraph of
> Sec. 5.1 of RFC8312 (you are a co-author). It reads "The average
> window sizes of Standard TCP and HSTCP are from [RFC3649].  The
> average window size of CUBIC is calculated using Eq. 6 and the CUBIC
> TCP-friendly region for three different values of C." Our plan, which
> Martin already suggested (it is my fault that I have not made the update
> yet), is to remove 3649 and use RFC8312bis.
> Does that make sense?
>
>
>> Why define this metric at all? The material you point to is the usual
>> model-based throughput calculation based on RTT and loss rates; a client
>> that
>> intended to predict TCP performance could simply query ALTO for this and
>> perform
>> their own computation, which will likely be more accurate, since the
>> client will
>> hopefully know which congestion controller they will use for the given
>> workload,
>> and what the characteristics of that workload are.
>>
>
> The throughput formula is for a very limited setting, i.e., the small
> fish setting. What we found useful is the low-multiplexing setting, where
> the loss rate is the output, not the input, of the convergence process.
> It has good use cases. Please see the Prophet paper; a more recent
> example is the set of use cases, such as accelerating time-bound
> constrained flows, in Sec. 3 of
> https://www.reservoir.com/wp-content/uploads/2021/08/G2_QTBS_TR_2021.pdf
> That paper uses max-min fairness, whereas the Internet uses other
> fairness criteria.
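
To make the low-multiplexing case concrete, here is a minimal sketch of how a bandwidth share could be estimated when the set of flows on a bottleneck is known. It assumes a single bottleneck link and max-min fairness (the criterion the paper above uses, not necessarily what the Internet delivers); the function name, flow labels, and numbers are hypothetical.

```python
def max_min_shares(capacity, demands):
    """Max-min fair allocation on one bottleneck link (water-filling).

    capacity: link capacity (any consistent unit, e.g. Mbit/s).
    demands:  dict mapping flow id -> maximum rate the flow could use.
    Returns a dict mapping flow id -> allocated rate.
    """
    shares = {}
    remaining = float(capacity)
    pending = dict(demands)
    while pending:
        fair = remaining / len(pending)
        # Flows that need no more than the current fair share are fully served.
        satisfied = [f for f, d in pending.items() if d <= fair]
        if not satisfied:
            # Everyone left is bottlenecked here: split what remains equally.
            for f in pending:
                shares[f] = fair
            break
        for f in satisfied:
            shares[f] = pending.pop(f)
            remaining -= shares[f]
    return shares


# Made-up example: a 10 Mbit/s link shared by three flows.
print(max_min_shares(10, {"a": 2, "b": 4, "c": 10}))
```

In this setting the share is computed directly from the flow set and the capacity, so no loss rate is needed as an input.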
>
>
>
>> ----------------------------------------------------------------------
>> COMMENT:
>> ----------------------------------------------------------------------
>>
>> Section 1. , paragraph 6, comment:
>> >    The purpose of this document is to ensure proper usage of the
>> >    performance metrics defined in Table 1; it does not claim novelty of
>> >    the metrics.  The "Origin Example" column of Table 1 gives an example
>> >    RFC that has defined each metric.
>>
>> I don't understand what the purpose of the "origin example" column is.
>> Most of
>> these point to IPPM metrics, which have a pretty clear and
>> narrowly-defined area
>> of applicability. Since ALTO isn't performing IPPM-style network testing,
>> it's
>> not clear why IPPM metrics are referenced here?
>>
>
> The metrics that this document uses were defined earlier in multiple
> IETF documents.
> The intention of the sentence is to give credit to that earlier work.
>
>
>> Section 2.2. , paragraph 23, comment:
>> >    If a cost metric string does not have the optional statical operator
>> >    string, the statistical operator SHOULD be interpreted as the default
>> >    statical operator in the definition of the base metric.  If the
>>
>> What is a "statical" operator; I am not familiar with the term and it
>> doesn't
>> seem to appear in other RFCs? (Also occurs elsewhere in this document.)
>>
> Apologies for the typo: statical operator -> statistical operator. These
> are fixed in an internal version that we had not uploaded.
>
>
>
>
>> Section 3.1.4. , paragraph 4, comment:
>> >    link statistics.  Another example of a source to estimate the delay
>> >    is the IPPM framework [RFC2330].  It is RECOMMENDED that the
>>
>> IPPM defines measurement metrics. How would they be a source for
>> estimates?
>>
>>
> The intention was to refer to the measurement methodology in 6.2 of RFC
> 2330, but
> I can see the potential confusion now. How about we change the wording to
> "Another example of a source to estimate the delay is through active
> measurements,
> for example, considering the IETF IPPM framework [RFC2330]."
>
>
>> Section 3.3. , paragraph 1, comment:
>> > 3.3.  Cost Metric: Delay Variation (delay-variation)
>>
>> Is this supposed to apply to the one-way or bidirectional delay?
>
>
> This is the current specification: "
> 3.3.3.  Intended Semantics and Use
>
>    Intended Semantics: To specify spatial and temporal aggregated delay
>    variation (also called delay jitter)) with respect to the minimum
>    delay observed on the stream over the one-way delay from the
>    specified source and destination.  The spatial aggregation level is
>    specified in the query context (e.g., PID to PID, or endpoint to
>    endpoint)."
>
> So it is one way.
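
A minimal sketch of the semantics quoted above: each one-way delay sample is compared against the minimum delay observed on the stream, and the differences are then aggregated. The function names, the choice of the mean as the aggregate, and the sample values are assumptions for illustration; the draft lets the statistical operator vary.

```python
from statistics import mean


def delay_variations(one_way_delays_ms):
    """Per-sample delay variation relative to the minimum observed delay."""
    if not one_way_delays_ms:
        raise ValueError("at least one delay sample is required")
    baseline = min(one_way_delays_ms)
    return [d - baseline for d in one_way_delays_ms]


def mean_delay_variation(one_way_delays_ms):
    """Temporal aggregation of the variations using the mean (an assumed choice)."""
    return mean(delay_variations(one_way_delays_ms))


# Made-up one-way delay samples, in milliseconds, from source to destination.
samples = [10.0, 12.0, 11.0, 17.0]
print(delay_variations(samples))
print(mean_delay_variation(samples))
```

Only samples in the source-to-destination direction enter the computation, which is what makes the metric one-way.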
>
>> Also, delay
>> variation is not independent from path utilization (c.f. bufferbloat), so
>> why is
>> it being reported independently?
>>
>
> I am not sure I understand the suggestion. We see jitter
> (e.g., https://cpr.att.com/pdf/se/0001-0003.pdf) reported independently
> (in the sense of a single metric, without conditional
> values/probabilities).
>
>
>>
>> Section 3.5. , paragraph 1, comment:
>> > 3.5.  Cost Metric: Loss Rate (lossrate)
>>
>> What is this metric supposed to capture? Loss is generally not
>> independent from
>> network utilization (apart from random corruption loss). So it should be
>> zero
>> for unloaded networks, and depends on utilization otherwise. Also, is this
>> unidirectional or bidirectional loss (wording below is unclear)?
>>
>
> It is meaningful in high-multiplexing settings. There can also be a
> load-independent loss rate when there are wireless links (although I can
> see that you may consider interference to be load as well).
>
> It is intended to be unidirectional: "3.5.3.  Intended Semantics and Use
>
>    Intended Semantics: To specify spatial and temporal aggregated packet
>    loss rate from the specified source and the specified destination.
>    The spatial aggregation level is specified in the query context
>    (e.g., PID to PID, or endpoint to endpoint)."
>
> How about the following change:
> "To specify spatial and temporal aggregated packet
>    loss rate from the specified source and the specified destination."
> =>
> "To specify spatial and temporal aggregated packet
>    loss rate, in one way, from the specified source and the specified
> destination."
>
>
>> Using lowercase "not" together with an uppercase RFC2119 keyword is not
>> acceptable usage. Found: "MUST not"
>>
>>
> Got it. We have fixed the case:
> "The total length of the cost metric string MUST not exceed 32"
> =>
> "The total length of the cost metric string MUST NOT exceed 32"
>
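
The two rules discussed in this thread (the 32-character limit and falling back to the base metric's default statistical operator) can be sketched as a small parser. The colon separator between base metric and operator, the function name, and the sample defaults table are assumptions for illustration; check the draft for the exact syntax.

```python
MAX_COST_METRIC_LEN = 32  # limit quoted in the thread above


def parse_cost_metric(metric, default_operators):
    """Split a cost metric string into (base metric, statistical operator).

    Assumes the hypothetical form "<base>[:<operator>]". When the operator
    is absent, the default for that base metric is looked up in
    default_operators (a dict supplied by the caller).
    """
    if len(metric) > MAX_COST_METRIC_LEN:
        raise ValueError("cost metric string exceeds 32 characters")
    base, sep, operator = metric.partition(":")
    if not base:
        raise ValueError("missing base metric")
    if sep and not operator:
        raise ValueError("empty statistical operator")
    if not sep:
        if base not in default_operators:
            raise ValueError(f"no default operator known for {base!r}")
        operator = default_operators[base]
    return base, operator


# Made-up defaults table for the example.
defaults = {"delay-ow": "mean"}
print(parse_cost_metric("delay-ow:p99", defaults))
print(parse_cost_metric("delay-ow", defaults))
```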
>
>> The document has 6 authors, which exceeds the recommended author limit. I
>> assume the sponsoring AD has agreed that this is appropriate?
>>
>> No reference entries found for: [RFC3649] and [RFC8312].
>>
>>
> Thanks for pointing it out. The entries went missing after an update, as
> Martin also pointed out. This is fixed in the next version, which we will
> upload soon.
>
>
>> Found terminology that should be reviewed for inclusivity; see
>> https://www.rfc-editor.org/part2/#inclusive_language for background and
>> more
>> guidance:
>>
>>  * Term "man"; alternatives might be "individual", "people", "person".
>>
> Hah. You mean change
> "man-in-the-middle (MITM) attacks"
> =>
> "person-in-the-middle attacks".
>
> I looked and indeed see PITM (
> https://en.wikipedia.org/wiki/Man-in-the-middle_attack).
>
> Interesting and fixed. Thanks!
>
> The edits below are great and fixed. Thanks again!
>
> Richard
>
>
>
>> -------------------------------------------------------------------------------
>> All comments below are about very minor potential issues that you may
>> choose to
>> address in some way - or ignore - as you see fit. Some were flagged by
>> automated tools (via https://github.com/larseggert/ietf-reviewtool), so
>> there
>> will likely be some false positives. There is no need to let me know what
>> you
>> did with these suggestions.
>>
>> "Abstract", paragraph 2, nit:
>> -    types of cost metric.  Since the ALTO base protocol (RFC 7285)
>> +    types of cost metrics.  Since the ALTO base protocol (RFC 7285)
>> +                        +
>>
>> Section 1. , paragraph 2, nit:
>> > ] on registering ALTO cost metrics. Hence it specifies the identifier,
>> the in
>> >                                     ^^^^^
>> A comma may be missing after the conjunctive/linking adverb "Hence".
>>
>> Section 2.2. , paragraph 2, nit:
>> > of the observations. median: the mid point (i.e., p50) of the
>> observations.
>> >                                  ^^^^^^^^^
>> This word is normally spelled with a hyphen.
>>
>> "IPPM ", paragraph 2, nit:
>> >  Also, delay variation is not independent from path utilization (c.f.
>> buffer
>> >                               ^^^^^^^^^^^^^^^^
>> The usual collocation for "independent" is "of", not "from". Did you mean
>> "independent of"?
>>
>> Section 3.3.3. , paragraph 7, nit:
>> > apture? Loss is generally not independent from network utilization
>> (apart fr
>> >                               ^^^^^^^^^^^^^^^^
>> The usual collocation for "independent" is "of", not "from". Did you mean
>> "independent of"?
>>
>> Section 3.4.3. , paragraph 6, nit:
>> > imation" method. See Section 3.1.4 on on related discussions such as
>> summing
>> >                                    ^^^^^
>> Possible typo: you repeated a word.
>>
>> Section 3.5.4. , paragraph 3, nit:
>> >  [RFC8312]), it helps to specify as much details as possible on the the
>> cong
>> >                                     ^^^^
>> Use "many" with countable plural nouns like "details".
>>
>> Section 3.5.4. , paragraph 3, nit:
>> > ify as much details as possible on the the congestion control algorithm
>> used
>> >                                    ^^^^^^^
>> Two determiners in a row. Choose either "the" or "the".
>>
>> These URLs in the document can probably be converted to HTTPS:
>>  *
>> http://www.iana.org/assignments/alto-protocol/alto-protocol.xhtml#cost-metrics
>>
>>
>>
>> _______________________________________________
>> alto mailing list
>> alto@ietf.org
>> https://www.ietf.org/mailman/listinfo/alto
>
>