Re: [alto] Lars Eggert's Discuss on draft-ietf-alto-performance-metrics-19: (with DISCUSS and COMMENT)

Hi Lars,

Thanks for the review! Please see below.

On Mon, Nov 29, 2021 at 8:10 AM Lars Eggert via Datatracker
<noreply@ietf.org> wrote:

> Lars Eggert has entered the following ballot position for
> draft-ietf-alto-performance-metrics-19: Discuss
>
> Please refer to https://www.ietf.org/blog/handling-iesg-ballot-positions/
> for more information about how to handle DISCUSS and COMMENT positions.
>
> The document, along with other ballot positions, can be found here:
> https://datatracker.ietf.org/doc/draft-ietf-alto-performance-metrics/
>
>
>
> ----------------------------------------------------------------------
> DISCUSS:
> ----------------------------------------------------------------------
>
> This document needs to become much more formal about how it defines the
> metrics it wishes to use with ALTO. This could either be done either by
> identifying and normatively referencing existing metrics the IETF has
> defined, or by defining them here. When normatively referencing existing
> IETF metrics, it would need to explain why their use with ALTO makes sense.
>
> At the moment, the document informatively points to a somewhat arbitrary
> collection of prior IETF metrics (most of which are from IPPM, residual
> bandwidth from IS-IS TE, but then reservable bandwidth from OSPF TE?).

To give some background, the WG derived the list of metrics from RFC 8571
(BGP - Link State (BGP-LS) Advertisement of IGP Traffic Engineering
Performance Metric Extensions), focusing on the network->application
direction. To that list we added Hop Count (which already exists in the
original ALTO RFC 7285), Round-trip Delay (to avoid two queries, and because
many apps use RTT), and TCP Throughput. We removed Unidirectional Available
Bandwidth and Unidirectional Utilized Bandwidth to reduce the number of
bandwidth metrics.

> But it
> only refers to them as "examples",

I searched for the word "example" and do not see where the document says
that the metrics are examples. It says: "Since different applications may
use different cost metrics, the ALTO base protocol introduces an ALTO Cost
Metric Registry (Section 14.2 of [RFC7285]), as a systematic mechanism to
allow different metrics to be specified. For example, a delay-sensitive
application may want to use latency-related metrics, and a
bandwidth-sensitive application may want to use bandwidth-related metrics."

Does this paragraph give the impression that the metrics are only examples?
If so, would you suggest removing the "For example" phrase to avoid that
impression?

The document does have the sentence "The "Origin Example" column of Table 1
gives an example RFC that has defined each metric." Here the word "example"
means one piece of existing work.

> without actually defining how exactly they
> are to be used with ALTO, or - if not those - which actual metrics are
> supposed to be used.
>

The document has "... the ALTO base protocol introduces an ALTO Cost Metric
Registry (Section 14.2 of [RFC7285]), as a systematic mechanism to allow
different metrics to be specified." and "When an ALTO server supports a cost
metric defined in this document, it should announce this metric in its
information resource directory (IRD) as defined in Section 9.2 of
[RFC7285]." Does this say enough about how exactly the metrics should be
used? The function of this document is to fulfill the registry's
requirements; the usage itself is specified by the base protocol (RFC 7285).
If you have a specific suggestion, it would be good to have.

> Defining a mechanism for exposing metric information to clients isn't
> really useful unless the content of that information is much more clearly
> specified.
>

I agree that the information should be specified as clearly as possible,
but at the same time we need abstraction to reduce complexity. One guiding
principle in the design is that ALTO information provides reasonable
guidance, not mathematical precision.

> Section 4.1.3. , paragraph 2, discuss:
> >    Intended Semantics: To give the throughput of a TCP congestion-
> >    control conforming flow from the specified source to the specified
> >    destination; see [RFC3649, Section 5.1 of RFC8312] on how TCP
> >    throughput is estimated.  The spatial aggregation level is specified
> >    in the query context (e.g., PID to PID, or endpoint to endpoint).
>
> A TCP bandwidth estimate can only be meaningfully be derived for bulk TCP
> transfers

Yes. It is intended for bulk transfer.

> under a set of pretty strict and simplistic assumptions, making this
> metric a meaningless at best and misleading at worst,

I would say that the TCP throughput formula has, in general, turned out to
be quite useful.

> given that the source of
> this information doesn't know what workload, congestion controller and
> network conditions the user of this information will use or see.
>

The network (the source) is in a pretty good position to estimate the
potential TCP throughput. In a high-multiplexing setting (small fish in a
big pond), the network can have access to the estimated loss rate, RTT, and
typical packet size needed to evaluate the TCP throughput formula. In a
low-multiplexing setting (big fish in a small pond), the network can know
the set of flows and estimate the bandwidth share. See the citation of the
Prophet work in the document and the G2 work in SIGMETRICS'21 and
SIGCOMM'21. The congestion controller is part of the metric definition (the
link points to standard TCP/Reno). I made some minor edits to clarify this.
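
For the high-multiplexing case, the computation can be sketched as follows.
This is a minimal sketch that assumes the well-known simplified Reno-style
model, in which the average congestion window is about sqrt(3/(2p)) packets
for loss rate p. The function name, parameter names, and units are
assumptions for illustration and are not taken from the draft.

```python
import math


def estimate_tcp_throughput_bps(mss_bytes, rtt_s, loss_rate):
    """Estimate bulk-transfer TCP throughput in bits per second.

    Assumes the simplified steady-state Reno-style model:
        throughput ~= (MSS / RTT) * sqrt(3 / (2 * p))
    """
    if mss_bytes <= 0 or rtt_s <= 0:
        raise ValueError("mss_bytes and rtt_s must be positive")
    if not 0 < loss_rate <= 1:
        raise ValueError("loss_rate must be in (0, 1]")
    avg_window_pkts = math.sqrt(1.5 / loss_rate)  # average window in packets
    return avg_window_pkts * mss_bytes * 8 / rtt_s
```

Under this model the estimate scales inversely with RTT and with the square
root of the loss rate, so halving the RTT doubles it and quadrupling the
loss rate halves it.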

> Also, RFC3649 is an Experimental RFC (from 2003!) and RFC8312 is an
> Informational RFC. Since this document normatively refers to them, it
> needs to cite them, and this will cause DOWNREFs for PS document. I would
> argue that at least RFC3649 is certainly not an appropriate DOWNREF.
>
>

Good suggestion! I added the reference to RFC 3649 based on the second
paragraph of Sec. 5.1 of RFC8312 (of which you are a co-author). It reads
"The average window sizes of Standard TCP and HSTCP are from [RFC3649].  The
average window size of CUBIC is calculated using Eq. 6 and the CUBIC
TCP-friendly region for three different values of C." Our plan, which Martin
had already suggested but which I had not yet applied (my fault), is to
remove RFC 3649 and use RFC8312bis. Does that make sense?

> Why define this metric at all? The material you point to is the usual
> model-based throughput calculation based on RTT and loss rates; a client
> that intended to predict TCP performance could simply query ALTO for this
> and perform their own computation, which will likely be more accurate,
> since the client will hopefully know which congestion controller they will
> use for the given workload, and what the characteristics of that workload
> are.
>

The throughput formula applies only to a very limited setting, i.e., the
small-fish (high-multiplexing) setting. What we found useful is the
low-multiplexing setting, where the loss rate is the output, not the input,
of the convergence process. It has good use cases. Please see the Prophet
paper; a more recent example is the set of use cases, such as accelerating
time-bound constrained flows, in Sec. 3 of
https://www.reservoir.com/wp-content/uploads/2021/08/G2_QTBS_TR_2021.pdf
That paper uses max-min fairness, whereas the Internet uses other fairness
criteria.
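
For the low-multiplexing case, one way the network could estimate a
bandwidth share once it knows the set of flows is a max-min fair allocation.
The sketch below does this for a single shared link. The function name, the
single-link simplification, and the per-flow demand inputs are assumptions
for illustration and are not taken from the Prophet or G2 work.

```python
def max_min_share(capacity, demands):
    """Max-min fair allocation of one link's capacity among flows.

    Flows whose demand is below the current fair share are fully
    satisfied; the capacity they leave unused is split equally among
    the remaining flows.
    """
    alloc = [0.0] * len(demands)
    remaining = float(capacity)
    # Visit flows from smallest to largest demand.
    active = sorted(range(len(demands)), key=lambda i: demands[i])
    while active:
        fair = remaining / len(active)
        smallest = active[0]
        if demands[smallest] <= fair:
            # This flow is satisfied; release the rest of its share.
            alloc[smallest] = float(demands[smallest])
            remaining -= demands[smallest]
            active.pop(0)
        else:
            # Every remaining flow is bottlenecked at the fair share.
            for i in active:
                alloc[i] = fair
            break
    return alloc
```

For example, with a capacity of 10 and demands of 2, 8, and 8, the small
flow gets its full 2 and the two large flows split the remaining 8 equally.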

> ----------------------------------------------------------------------
> COMMENT:
> ----------------------------------------------------------------------
>
> Section 1. , paragraph 6, comment:
> >    The purpose of this document is to ensure proper usage of the
> >    performance metrics defined in Table 1; it does not claim novelty of
> >    the metrics.  The "Origin Example" column of Table 1 gives an example
> >    RFC that has defined each metric.
>
> I don't understand what the purpose of the "origin example" column is.
> Most of these point to IPPM metrics, which have a pretty clear and
> narrowly-defined area of applicability. Since ALTO isn't performing
> IPPM-style network testing, it's not clear why IPPM metrics are referenced
> here?
>

The metrics that this document uses have already been defined in multiple
earlier IETF documents. The intention of the sentence is to give credit to
that earlier work.

> Section 2.2. , paragraph 23, comment:
> >    If a cost metric string does not have the optional statical operator
> >    string, the statistical operator SHOULD be interpreted as the default
> >    statical operator in the definition of the base metric.  If the
>
> What is a "statical" operator; I am not familiar with the term and it
> doesn't seem to appear in other RFCs? (Also occurs elsewhere in this
> document.)
>

Apologies for the typo: "statical operator" should read "statistical
operator". It is fixed in an internal version that we have not yet uploaded.
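
To make the default-operator rule concrete, here is a minimal parsing
sketch. It also enforces the 32-character limit that the document places on
the cost metric string. The colon separator between the base metric and the
statistical operator and the per-metric defaults in the table are
assumptions for illustration; the actual syntax and defaults are whatever
the document defines for each base metric.

```python
MAX_COST_METRIC_LEN = 32

# Hypothetical defaults, for illustration only.
DEFAULT_OPERATOR = {"delay-variation": "mean", "lossrate": "mean"}


def parse_cost_metric(metric, defaults=DEFAULT_OPERATOR):
    """Split a cost metric string into (base metric, statistical operator).

    Assumes the form "<base>" or "<base>:<operator>"; the ":" separator
    is an assumption of this sketch.
    """
    if len(metric) > MAX_COST_METRIC_LEN:
        raise ValueError("cost metric string exceeds 32 characters")
    base, sep, operator = metric.partition(":")
    if not base or (sep and not operator):
        raise ValueError("malformed cost metric string: %r" % metric)
    if not sep:
        # No explicit operator: fall back to the base metric's default.
        if base not in defaults:
            raise ValueError("no default operator known for %r" % base)
        operator = defaults[base]
    return base, operator
```

With this table, "lossrate:median" keeps its explicit operator, while a bare
"delay-variation" resolves to the default listed for that base metric.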

> Section 3.1.4. , paragraph 4, comment:
> >    link statistics.  Another example of a source to estimate the delay
> >    is the IPPM framework [RFC2330].  It is RECOMMENDED that the
>
> IPPM defines measurement metrics. How would they be a source for estimates?
>
>

The intention was to refer to the measurement methodology in Section 6.2 of
RFC 2330, but I can see the potential confusion now. How about we change the
wording to "Another example of a source to estimate the delay is through
active measurements, for example, considering the IETF IPPM framework
[RFC2330]."?

> Section 3.3. , paragraph 1, comment:
> > 3.3.  Cost Metric: Delay Variation (delay-variation)
>
> Is this supposed to apply to the one-way or bidirectional delay?

This is the current specification: "
3.3.3.  Intended Semantics and Use

   Intended Semantics: To specify spatial and temporal aggregated delay
   variation (also called delay jitter)) with respect to the minimum
   delay observed on the stream over the one-way delay from the
   specified source and destination.  The spatial aggregation level is
   specified in the query context (e.g., PID to PID, or endpoint to
   endpoint)."

So it is one way.
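
As a small sketch of these semantics, the delay variation of each sample can
be taken relative to the minimum one-way delay observed on the stream and
then aggregated. The function names, the millisecond unit, and the choice of
the median as the aggregate are assumptions for illustration.

```python
import statistics


def delay_variation(one_way_delays_ms):
    """Per-sample variation relative to the minimum observed one-way delay."""
    if not one_way_delays_ms:
        raise ValueError("need at least one delay sample")
    base = min(one_way_delays_ms)
    return [d - base for d in one_way_delays_ms]


def aggregate_delay_variation(one_way_delays_ms, aggregate=statistics.median):
    """Reduce the per-sample variations to one value (median by default)."""
    return aggregate(delay_variation(one_way_delays_ms))
```

For one-way delays of 10, 12, 15, and 11 ms, the per-sample variations are
0, 2, 5, and 1 ms.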

> Also, delay
> variation is not independent from path utilization (c.f. bufferbloat), so
> why is it being reported independently?
>

I am not sure I understand the suggestion. We see jitter reported
independently in practice (e.g., https://cpr.att.com/pdf/se/0001-0003.pdf),
in the sense that it is given as a single metric rather than as conditional
values/probabilities.

>
> Section 3.5. , paragraph 1, comment:
> > 3.5.  Cost Metric: Loss Rate (lossrate)
>
> What is this metric supposed to capture? Loss is generally not independent
> from network utilization (apart from random corruption loss). So it should
> be zero for unloaded networks, and depends on utilization otherwise. Also,
> is this unidirectional or bidirectional loss (wording below is unclear)?
>

It is meaningful in high-multiplexing settings. There can also be a
load-independent loss rate when there are wireless links (although I can see
that you may count interference as load as well).

It is intended to be unidirectional: "3.5.3.  Intended Semantics and Use

   Intended Semantics: To specify spatial and temporal aggregated packet
   loss rate from the specified source and the specified destination.
   The spatial aggregation level is specified in the query context
   (e.g., PID to PID, or endpoint to endpoint)."

How about the following change:
"To specify spatial and temporal aggregated packet
   loss rate from the specified source and the specified destination."
=>
"To specify spatial and temporal aggregated packet
   loss rate, in one way, from the specified source and the specified
   destination."

> Using lowercase "not" together with an uppercase RFC2119 keyword is not
> acceptable usage. Found: "MUST not"
>
>
Got it. We have fixed the case:
"The total length of the cost metric string MUST not exceed 32"
=>
"The total length of the cost metric string MUST NOT exceed 32"

> The document has 6 authors, which exceeds the recommended author limit. I
> assume the sponsoring AD has agreed that this is appropriate?
>
> No reference entries found for: [RFC3649] and [RFC8312].
>
>

Thanks for pointing it out. The entries went missing after an update, which
Martin also pointed out. This is fixed in the next version, which we will
upload soon.

> Found terminology that should be reviewed for inclusivity; see
> https://www.rfc-editor.org/part2/#inclusive_language for background and
> more guidance:
>
>  * Term "man"; alternatives might be "individual", "people", "person".
>

Hah. You mean change
"man-in-the-middle (MITM) attacks"
=>
"person-in-the-middle attacks".

I looked and indeed see PITM
(https://en.wikipedia.org/wiki/Man-in-the-middle_attack).

Interesting, and fixed. Thanks!

The edits below are great and have all been applied. Thanks again!

Richard

-------------------------------------------------------------------------------
> All comments below are about very minor potential issues that you may
> choose to address in some way - or ignore - as you see fit. Some were
> flagged by automated tools (via
> https://github.com/larseggert/ietf-reviewtool), so there will likely be
> some false positives. There is no need to let me know what you did with
> these suggestions.
>
> "Abstract", paragraph 2, nit:
> -    types of cost metric.  Since the ALTO base protocol (RFC 7285)
> +    types of cost metrics.  Since the ALTO base protocol (RFC 7285)
> +                        +
>
> Section 1. , paragraph 2, nit:
> > ] on registering ALTO cost metrics. Hence it specifies the identifier,
> the in
> >                                     ^^^^^
> A comma may be missing after the conjunctive/linking adverb "Hence".
>
> Section 2.2. , paragraph 2, nit:
> > of the observations. median: the mid point (i.e., p50) of the
> observations.
> >                                  ^^^^^^^^^
> This word is normally spelled with a hyphen.
>
> "IPPM ", paragraph 2, nit:
> >  Also, delay variation is not independent from path utilization (c.f.
> buffer
> >                               ^^^^^^^^^^^^^^^^
> The usual collocation for "independent" is "of", not "from". Did you mean
> "independent of"?
>
> Section 3.3.3. , paragraph 7, nit:
> > apture? Loss is generally not independent from network utilization
> (apart fr
> >                               ^^^^^^^^^^^^^^^^
> The usual collocation for "independent" is "of", not "from". Did you mean
> "independent of"?
>
> Section 3.4.3. , paragraph 6, nit:
> > imation" method. See Section 3.1.4 on on related discussions such as
> summing
> >                                    ^^^^^
> Possible typo: you repeated a word.
>
> Section 3.5.4. , paragraph 3, nit:
> >  [RFC8312]), it helps to specify as much details as possible on the the
> cong
> >                                     ^^^^
> Use "many" with countable plural nouns like "details".
>
> Section 3.5.4. , paragraph 3, nit:
> > ify as much details as possible on the the congestion control algorithm
> used
> >                                    ^^^^^^^
> Two determiners in a row. Choose either "the" or "the".
>
> These URLs in the document can probably be converted to HTTPS:
>  *
> http://www.iana.org/assignments/alto-protocol/alto-protocol.xhtml#cost-metrics
>
>
>

-- 
 =====================================
| Y. Richard Yang <yry@cs.yale.edu>   |
| Professor of Computer Science       |
| http://www.cs.yale.edu/~yry/        |
 =====================================