Re: [alto] Benjamin Kaduk's Discuss on draft-ietf-alto-performance-metrics-20: (with DISCUSS and COMMENT)

Qin Wu <bill.wu@huawei.com> Thu, 02 December 2021 09:04 UTC

From: Qin Wu <bill.wu@huawei.com>
To: Benjamin Kaduk <kaduk@mit.edu>, The IESG <iesg@ietf.org>
CC: "draft-ietf-alto-performance-metrics@ietf.org" <draft-ietf-alto-performance-metrics@ietf.org>, "alto-chairs@ietf.org" <alto-chairs@ietf.org>, "alto@ietf.org" <alto@ietf.org>, "ietf@j-f-s.de" <ietf@j-f-s.de>, "Y. Richard Yang" <yang.r.yang@gmail.com>
Archived-At: <https://mailarchive.ietf.org/arch/msg/alto/1W1-fS2YMpincFN40iTJ-guT8YY>

Thanks, Ben, for the detailed and valuable review. See replies and clarifications below.

-----Original Message-----
>From: Benjamin Kaduk via Datatracker [mailto:noreply@ietf.org]
>Sent: December 2, 2021 13:05
>To: The IESG <iesg@ietf.org>
>Cc: draft-ietf-alto-performance-metrics@ietf.org; alto-chairs@ietf.org; alto@ietf.org; ietf@j-f-s.de
>Subject: Benjamin Kaduk's Discuss on draft-ietf-alto-performance-metrics-20: (with DISCUSS and COMMENT)

>Benjamin Kaduk has entered the following ballot position for
>draft-ietf-alto-performance-metrics-20: Discuss

>When responding, please keep the subject line intact and reply to all email addresses included in the To and CC lines. (Feel free to cut this introductory paragraph, however.)


>Please refer to https://www.ietf.org/blog/handling-iesg-ballot-positions/
>for more information about how to handle DISCUSS and COMMENT positions.


>The document, along with other ballot positions, can be found here:
>https://datatracker.ietf.org/doc/draft-ietf-alto-performance-metrics/



>----------------------------------------------------------------------
>DISCUSS:
>----------------------------------------------------------------------

>These should all be trivial to resolve -- just some minor internal inconsistencies that need to be fixed before publication.

>The discussion of percentile statistical operator in §2.2 is internally inconsistent -- if the percentile number must be an integer, then p99.9 is not valid.
[Qin Wu] Yes, the percentile is a number following the letter 'p', but in some cases, when higher precision is needed, this number is followed by an optional decimal part.
The decimal part starts with the '.' separator; perhaps the separator caused the confusion. See the definition in Section 2.2 for details:
"
   percentile, with letter 'p' followed by a number:
      gives the percentile specified by the number following the letter
      'p'.  The number MUST be a non-negative JSON integer in the range
      [0, 100] (i.e., greater than or equal to 0 and less than or equal
      to 100), followed by an optional decimal part, if a higher
      precision is needed.  The decimal part should start with the '.'
      separator (U+002E), and followed by a sequence of one or more
      ASCII numbers between '0' and '9'.
"
Let us know if you think the separator should be changed or whether you can live with the current form.
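For illustration, the syntax quoted above can be checked with a few lines of Python. This is only a sketch: the function name `parse_percentile` is hypothetical, and rejecting values above 100 (such as p100.5) and leading zeros (such as p050) are assumptions about how the range, the JSON-integer wording, and the decimal part interact, not something the quoted text states.

```python
import re

# 'p', then a JSON-style integer (no leading zeros), then an optional
# '.' (U+002E) followed by one or more ASCII digits.
_PERCENTILE = re.compile(r"p(0|[1-9]\d{0,2})(\.\d+)?")


def parse_percentile(operator: str) -> float:
    """Return the percentile encoded in a string such as 'p99.9'."""
    if _PERCENTILE.fullmatch(operator) is None:
        raise ValueError(f"not a percentile operator: {operator!r}")
    value = float(operator[1:])
    # Assumption: the [0, 100] range applies to the whole number,
    # including the decimal part.
    if value > 100:
        raise ValueError(f"percentile out of range: {operator!r}")
    return value


print(parse_percentile("p99.9"))
print(parse_percentile("p50"))
```

Under this reading, p99.9 is accepted because the integer part (99) is in range and the decimal part is optional.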

>Also, the listing of "cost-source" values introduced by this document (in §5.1) does not include "nominal", but we do also introduce "nominal".
[Qin Wu] I agree this is an inconsistency; it will be fixed in the next version. Thanks.
>Similarly, in §3.1.3 we refer to the "-<percentile>" component of a cost metric string, that has been generalized to an arbitrary statistical operator.
[Qin Wu] No, it is not an arbitrary statistical operator. We did add a statement saying:
"
   Since the identifier
   does not include the -<percentile> component, the values will
   represent median values.
"
The median value is defined in Section 2.2 as the mid-point of the observations:
"
   median:
      the mid-point (i.e., p50) of the observations.
"
>----------------------------------------------------------------------
>COMMENT:
>----------------------------------------------------------------------

>All things considered, this is a pretty well-written document that was easy to read.  That helped a lot as I reviewed it, especially so on a week with a pretty full agenda for the IESG telechat.

>Section 2.2

>Should we say anything about how to handle a situation where a base metric identifier is so long that the statistical operator string cannot be appended while remaining under the 32-character limit?
[Qin Wu] I think the base metric identifier should not be chosen arbitrarily. Using the full name of the base metric is not recommended; a short name or abbreviation should probably be used if the cost metric string would otherwise be too long.
But I am not sure we should set a rule for this. Maybe the rule "The total length of the cost metric string MUST NOT exceed 32" defined in RFC 7285 is sufficient?
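As a sketch of what the 32-character rule means in practice, the helper below joins a base metric identifier and a statistical operator and rejects the result if it is too long. The helper name `build_cost_metric` and the example identifiers are illustrative assumptions; the "-" join follows the "-<percentile>" component mentioned earlier in this thread.

```python
MAX_COST_METRIC_LENGTH = 32  # limit quoted from RFC 7285 in the reply above


def build_cost_metric(base: str, operator: str = "") -> str:
    """Join a base metric and an optional statistical operator with '-'."""
    metric = f"{base}-{operator}" if operator else base
    if len(metric) > MAX_COST_METRIC_LENGTH:
        raise ValueError(
            f"cost metric {metric!r} is {len(metric)} characters; "
            f"the limit is {MAX_COST_METRIC_LENGTH}"
        )
    return metric


print(build_cost_metric("delay-ow", "p99.9"))
```

A short base identifier leaves ample room for the operator, which is the motivation for preferring abbreviations.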
>   min:
>      the minimal value of the observations.
>   max:
>      the maximal value of the observations.
>   [...]

>Should we say anything about what sampling period of observations is in scope for these operators?
[Qin Wu] I think the sampling period of the observations is tied to the Method of Measurement or Calculation. Based on earlier discussion and agreement in the group, we believe this depends on the measurement methodology or the metric definition, and specifying it is in some cases not necessary or feasible. The RFC defining each metric gives more details; see also the clarification in Section 2.

>Section 3.x.4

>If we're going to be recommending that implementations link to external human-readable resources (e.g., for the SLA details of estimation methodology), does the guidance from BCP 18 in indicating the language
>of text come into play?

>It's also a bit surprising that we specify the new fields in the "parameters" of a metric just in passing in the prose, without a more prominent indication that we're defining a new field.
[Qin Wu] See the CostContext definition in Section 2.1; "parameters" is included in the CostContext object.
>Section 3.1.4

>   "nominal": Typically network one-way delay does not have a nominal
>   value.

>Does that mean that they MUST NOT be generated, or that they should be ignored if received, or something else?  (Similarly for the other sections where we say the same thing.)
[Qin Wu] Yes, that is my understanding. We can add a statement to make this behavior clear.

>   This description can be either free text for possible presentation to
>   the user, or a formal specification; see [IANA-IPPM] for the
>   specification on fields which should be included.  [...]

>Is the IANA registry really the best reference for what fields to include?  Typically we would only refer to the registry when we care about the current state of registered values, but the need here seems to effectively be the column headings of the registry, which could be obtained from the RFC defining the registry.
[Qin Wu] This IANA registry provides the Metric Name and Metric URI; following the URI gives more details of the measurement methodology. That is why the [IANA-IPPM] reference was selected. Maybe we can make this clearer in the text.

>Section 3.3.3

>   Intended Semantics: To specify spatial and temporal aggregated delay
>   variation (also called delay jitter)) with respect to the minimum
>   delay observed on the stream over the one-way delay from the
>   specified source and destination.  The spatial aggregation level is
>   specified in the query context (e.g., PID to PID, or endpoint to
>   endpoint).

>I do appreciate the note about how this is not the normal statistics variation that follows this paragraph, but I also don't think this is a particularly clear or precise specification for how to produce the number that is
>to be reported.  It also doesn't seem to fully align with the prior art in the IETF, e.g., RFC 3393.  It seems like it would be highly preferable to pick an existing RFC and refer to its specification for computing a
>delay variation value.  (To be clear, such a reference would then be a normative reference.)
[Qin Wu] Agreed. We are not introducing a new metric; we just expose the existing metric defined in RFC 3393. I also agree to make RFC 3393 a normative reference and will see how to fix this.
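To make the quoted "with respect to the minimum delay observed on the stream" wording concrete, here is a minimal sketch of one literal reading of it. It is not the RFC 3393 procedure, and the function name `delay_variation` is hypothetical; it only shows the per-packet values over which a statistic (for example a percentile) could then be computed.

```python
def delay_variation(delays):
    """Return each one-way delay minus the minimum delay of the stream."""
    if not delays:
        raise ValueError("at least one delay observation is required")
    baseline = min(delays)
    return [d - baseline for d in delays]


# One-way delays in milliseconds for packets of a single stream.
print(delay_variation([12, 10, 15]))
```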
>Section 3.4.3

>   Intended Semantics: To specify the number of hops in the path from
>   the specified source to the specified destination.  The hop count is
>   a basic measurement of distance in a network and can be exposed as
>   the number of router hops computed from the routing protocols
>   originating this information.  [...]

>It seems like this could get a little messy if there are multiple routing protocols in use (e.g., both normal IP routing and an overlay network, as for service function chaining or other overlay schemes).
>I don't have any suggestions for disambiguating things, though, and if the usage is consistent within a given ALTO Server it may not have much impact on the clients.
[Qin Wu] Hop count is implicitly mentioned in RFC 7285; this document specifies the metric explicitly.
I am thinking that which protocol is used can be indicated in the link (a field named "link") that provides a URI to a description of the "estimation" method.
>Section 3.4.4

>   "sla": Typically hop count does not have an SLA value.

>As for "nominal", earlier, is there any guidance to give on not generating it or what to do if it is received?
> (Also appears later, I suppose.)
[Qin Wu] Will see how to provide guidance on this, thanks.
>Section 4.1.4

>   "estimation": The exact estimation method is out of the scope of this
>   document.  See [Prophet] for a method to estimate TCP throughput.  It
>   is RECOMMENDED that the "parameters" field of an "estimation" TCP
>   throughput metric provides two fields: (1) a congestion-control
>   algorithm name (a field named "congestion-control-alg"); and (2) a
>   link (a field named "link")to a description of the "estimation"
>   method.  Note that as TCP congestion control algorithms evolve (e.g.,
>   TCP Cubic Congestion Control [I-D.ietf-tcpm-rfc8312bis]), it helps to
>   specify as many details as possible on the congestion control
>   algorithm used.  This description can be either free text for
>   possible presentation to the user, or a formal specification.  [...]

>Do these specifics go into the "congestion-control-alg" name, or in the linked content?
[Qin Wu] My understanding is the latter. Both fields are carried in the single "parameters" field, which can be seen as a JSON object, since "parameters" is the plural of "parameter".
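A sketch of what such a cost type could look like when serialized, assuming the "parameters" object sits inside "cost-context" next to "cost-source". The metric name "tput", the algorithm name "cubic", and the example.com URL are placeholders for illustration only.

```python
import json

cost_type = {
    "cost-mode": "numerical",
    "cost-metric": "tput",  # placeholder metric name
    "cost-context": {
        "cost-source": "estimation",
        # One JSON object carrying both recommended fields.
        "parameters": {
            "congestion-control-alg": "cubic",  # placeholder value
            "link": "https://example.com/tput-estimation-method",  # placeholder
        },
    },
}

print(json.dumps(cost_type, indent=2))
```

With this layout, details about the congestion control algorithm beyond its name would live in the document behind "link".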
>Section 5.3

>   To address the backward-compatibility issue, if a "cost-metric" is
>   "routingcost" and the metric contains a "cost-context" field, then it
>   MUST be "estimation"; if it is not, the client SHOULD reject the
>   information as invalid.

>This seems like a sub-optimal route to backwards compatibility, as it would (apparently) permanently lock the "routingcost" metric to only the "estimation" source with no way to negotiate more flexibility.  Unless we define a new "routingcost2" metric that differs only in the lack of this restriction, of course.
[Qin Wu] We should probably define a default value for cost-context. I think the default should be "estimation", since legacy clients only support estimation.
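The rule quoted from Section 5.3, combined with the default proposed in the reply, could be sketched on the client side as follows. The function names are hypothetical, and treating a missing "cost-context" as "estimation" reflects the proposal in this thread rather than settled draft text.

```python
DEFAULT_COST_SOURCE = "estimation"  # proposed default, not settled text


def effective_cost_source(cost_type: dict) -> str:
    """Return the declared cost source, or the default if none is given."""
    context = cost_type.get("cost-context")
    if context is None:
        return DEFAULT_COST_SOURCE
    return context["cost-source"]


def check_routingcost(cost_type: dict) -> str:
    """Reject "routingcost" with any cost source other than "estimation"."""
    source = effective_cost_source(cost_type)
    if cost_type.get("cost-metric") == "routingcost" and source != "estimation":
        raise ValueError(f"routingcost with cost-source {source!r} is invalid")
    return source


legacy = {"cost-mode": "numerical", "cost-metric": "routingcost"}
print(check_routingcost(legacy))
```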
>Section 5.4.1

>   the ALTO server may provide the client with two pieces of additional
>   information: (1) when the metrics are last computed, and (2) when the
>   metrics will be updated (i.e., the validity period of the exposed
>   metric values).  The ALTO server can expose these two pieces of
>   information by using the HTTP response headers Last-Modified and
>   Expires.

>While this seems like it would work okay in the usual case, it seems a bit fragile, in that it may fail in boundary cases, such as when a server is just starting up.  I would lean towards recommending use of explicit data items to convey this sort of information (and also the overall measurement interval over which statistics are computed, which may not always go back to "the start of time").
[Qin Wu] Okay.
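For reference, this is roughly how a client would read the two header fields today. It is a sketch with a hypothetical function name and made-up header values; it also shows why the approach depends entirely on the server setting both headers, which is the fragility raised above.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def metric_validity(headers, now=None):
    """Return (computed_at, expires_at, still_valid) from HTTP headers."""
    computed_at = parsedate_to_datetime(headers["Last-Modified"])
    expires_at = parsedate_to_datetime(headers["Expires"])
    now = now or datetime.now(timezone.utc)
    return computed_at, expires_at, now < expires_at


# Made-up header values for illustration.
example_headers = {
    "Last-Modified": "Thu, 02 Dec 2021 09:04:18 GMT",
    "Expires": "Thu, 02 Dec 2021 10:04:18 GMT",
}
print(metric_validity(example_headers))
```

A missing header raises `KeyError` here; explicit data items in the response body would avoid relying on HTTP caching metadata for this purpose.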
>Section 5.4.2

>   often be link level.  For example, routing protocols often measure
>   and report only per link loss, not end-to-end loss; similarly,
>   routing protocols report link level available bandwidth, not end-to-
>   end available bandwidth.  The ALTO server then needs to aggregate
>   these data to provide an abstract and unified view that can be more
>   useful to applications.  The server should consider that different
>   metrics may use different aggregation computation.  For example, the
>   end-to-end latency of a path is the sum of the latency of the links
>   on the path; the end-to-end available bandwidth of a path is the
>   minimum of the available bandwidth of the links on the path.

>Some caution seems in order relating to aggregation of loss measurements, as loss is not always uncorrelated across links in the path.
[Qin Wu] Agree, but here we just provide examples.
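The two aggregation examples in the quoted text, plus the loss case Ben raises, can be sketched as below. The function names are hypothetical. The loss formula assumes independent per-link loss expressed as fractions in [0, 1]; when losses are correlated across links, that assumption does not hold and the result is only an approximation.

```python
import math


def path_latency(link_latencies):
    """End-to-end latency: the sum of the link latencies."""
    return sum(link_latencies)


def path_available_bandwidth(link_bandwidths):
    """End-to-end available bandwidth: the minimum over the links."""
    return min(link_bandwidths)


def path_loss_if_independent(link_loss_rates):
    """End-to-end loss, valid only if loss is independent across links."""
    return 1.0 - math.prod(1.0 - p for p in link_loss_rates)


print(path_latency([5, 10, 3]))
print(path_available_bandwidth([100, 40, 1000]))
print(path_loss_if_independent([0.1, 0.2]))
```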
>Section 6

>I thought that the outcome of the art-art review thread was that we would add some mention of ordinal cost mode here as a means to mitigate the risk of exposing sensitive numerical metric values, but I don't see such
>text.

>In light of the guidance in Section 7 for new cost source types to document their security considerations, should we document the security considerations for the "sla" type here?  The overall theme would be similar 
>to what RFC 7285 already describes, but we could mention that knowledge specifically of provider SLA targets allow for attackers to target the SLA, causing problems for the provider other than the typical DoS attack
>class.  (I'm not coming up with anything new to say about "nominal" or "estimation".)

>I would also consider mentioning that the "origin" references in table 1 might have useful things to say about the individual metrics that we use.

>Giving an attacker the ability to receive the instantaneous loss rate on a path could be useful in helping the attacker gauge the efficacy of an ongoing attack targeting that path.  The RFCs from the DOTS WG (e.g.,
>8783 and 9132) may have some useful text on this topic that could be used as a model.

[Qin Wu] Good suggestions; we will integrate these in the next version. Thanks.
>Section 9.1

>It's not really clear to me that [IANA-IPPM] needs to be classified as normative (or whatever it is replaced by, in light of my earlier comment in §3.1.4).

>RFC 2330 is cited only once, in a "for example" clause; this would typically cause it to be classified as only an informative reference.

>The mention of RFC 8895 is conditional on it being implemented, so that could probably also be downgraded to an informative reference as well.
[Qin Wu] Okay, good suggestion.
>Section 9.2

>Some kind of URL for [Prometheus] would be very helpful.
> [Prophet], too, though at least that has the ACM/IEEE Transactions venue to anchor the reference.
[Qin Wu] Okay.
>I'm not entirely sure why RFC 2818 is classified as normative but RFC
>8446 only as informative, since they are part of the same (quoted) requirement clause.
[Qin Wu] Tend to agree, will see how to fix.
>NITS

>Section 2.1

>   To make it possible to specify the source and the aforementioned
>   parameters, this document introduces an optional "cost-context" field
>   to the "cost-type" field defined by the ALTO base protocol
>   (Section 10.7 of [RFC7285]) as the following:

>I think s/"cost-type" field/CostType object/ would be slightly more accurate.
[Qin Wu] Agree.
>   The "estimation" category indicates that the metric value is computed
>   through an estimation process.  An ALTO server may compute
>   "estimation" values by retrieving and/or aggregating information from
>   routing protocols (e.g., [RFC8571]) and traffic measurement
>   management tools (e.g., TWAMP [RFC5357]), with corresponding
>   operational issues.  [...]

>I'm not sure if "with corresponding operational issues" conveys the intended phrasing -- to me, it seems to say "do [the previous things], but expect that there will sometimes be operational issues that make the data unavailable or inaccurate".
[Qin Wu] Yes, we will see how to rephrase this.
>Section 2.2

>   stddev:
>      the standard deviation of the observations.

>   stdvar:
>      the standard variance of the observations.

>Pedantically, we could say if these are sample or population standard deviation/variance (a difference of one in the denominator), but it seems very unlikely to matter for these purposes.
[Qin Wu] I am thinking this could be indicated in the link (a field named "link") that provides a URI to a description of the "estimation" method.
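The sample versus population distinction Ben mentions is the difference between dividing by n - 1 and by n. A small sketch with made-up observations, using Python's standard `statistics` module:

```python
import statistics

observations = [10.0, 12.0, 14.0, 16.0]  # made-up delay samples

# Population statistics divide by n.
population_variance = statistics.pvariance(observations)
population_stddev = statistics.pstdev(observations)

# Sample statistics divide by n - 1.
sample_variance = statistics.variance(observations)
sample_stddev = statistics.stdev(observations)

print(population_variance, sample_variance)
print(population_stddev, sample_stddev)
```

For large observation sets the two converge, which is why the choice rarely matters here; a description behind "link" could still state which one a server uses.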
>Section 3

>   dropped before reaching the destination (pkt.dropped).  The semantics
>   of the performance metrics defined in this section are that they are
>   statistics (percentiles) computed from these measures; for example,

>I suggest "e.g., percentiles" since stddev/variance are not percentiles but are statistics.
[Qin Wu] Reasonable.
>   the x-percentile of the one-way delay is the x-percentile of the set
>   of delays {pkt.delay} for the packets in the stream.

>This phrasing presupposes that there is a definite stream under consideration, but I don't think that much confusion is likely and am not sure that there's a need to change anything.
[Qin Wu] If you have a better suggestion, please let us know.
>Section 3.1.3

>I'd perhaps make a note about the wrapping of the Accept: header field line in the example (and all the other similarly affected examples).
[Qin Wu] Okay, thanks.
>Section 3.2.2

>I suggest reusing the phrasing from §3.1.2 that mentions floating-point values, for consistency.
[Qin Wu] Okay.
>Section 3.5.2

>   The metric value type is a single 'JSONNumber' type value conforming
>   to the number specification of [RFC8259] Section 6.  The number MUST
>   be non-negative.  The value represents the percentage of packet
>   losses.

>I'd probably mention floating-point here as well.
[Qin Wu] Okay.
>Section 4.3.3

>   Intended Semantics: To specify spatial and temporal maximum
>   reservable bandwidth from the specified source to the specified
>   destination.  The value corresponds to the maximum bandwidth that can
>   be reserved (motivated from [RFC3630] Section 2.5.7).  The spatial

>It's a little interesting to see an OSPF reference for max reservable bandwidth when we used an IS-IS one for current residual bandwidth, but it's hard to see much harm causing from the mixture of references (who is going to follow the references anyway?).

[Qin Wu] Correct, that is our expectation.

>Section 7

>   IANA has created and now maintains the "ALTO Cost Metric Registry",
>   listed in Section 14.2, Table 3 of [RFC7285].  This registry is
>   located at <https://www.iana.org/assignments/alto-protocol/alto-
>   protocol.xhtml#cost-metrics>.  This document requests to add the
>   following entries to "ALTO Cost Metric Registry".

>The live registry has a "reference" column, so I'd add ", with this document as the reference", here.
[Qin Wu] Okay.
>   Registered ALTO address type identifiers MUST conform to the
>   syntactical requirements specified in Section 2.1.  Identifiers are
>   to be recorded and displayed as strings.

>s/address type/cost source type/

[Qin Wu] Thanks.