[alto] Benjamin Kaduk's Discuss on draft-ietf-alto-performance-metrics-20: (with DISCUSS and COMMENT)

Benjamin Kaduk via Datatracker <noreply@ietf.org> Thu, 02 December 2021 05:04 UTC

From: Benjamin Kaduk via Datatracker <noreply@ietf.org>
To: The IESG <iesg@ietf.org>
Cc: draft-ietf-alto-performance-metrics@ietf.org, alto-chairs@ietf.org, alto@ietf.org, ietf@j-f-s.de, ietf@j-f-s.de
Reply-To: Benjamin Kaduk <kaduk@mit.edu>
Message-ID: <163842148143.28155.14877960253642777896@ietfa.amsl.com>
Date: Wed, 01 Dec 2021 21:04:42 -0800
Archived-At: <https://mailarchive.ietf.org/arch/msg/alto/wCmX114BwqodotBkGpd7kKF0ld8>

Benjamin Kaduk has entered the following ballot position for
draft-ietf-alto-performance-metrics-20: Discuss

When responding, please keep the subject line intact and reply to all
email addresses included in the To and CC lines. (Feel free to cut this
introductory paragraph, however.)


Please refer to https://www.ietf.org/blog/handling-iesg-ballot-positions/
for more information about how to handle DISCUSS and COMMENT positions.


The document, along with other ballot positions, can be found here:
https://datatracker.ietf.org/doc/draft-ietf-alto-performance-metrics/



----------------------------------------------------------------------
DISCUSS:
----------------------------------------------------------------------

These should all be trivial to resolve -- just some minor internal
inconsistencies that need to be fixed before publication.

The discussion of the percentile statistical operator in §2.2 is
internally inconsistent -- if the percentile number must be an integer,
then p99.9 is not valid.
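
To make the inconsistency concrete, here is a minimal Python sketch.  It
assumes the operator is spelled "p" followed by a number; the two
patterns and the function name are my own illustration, not text from
the draft.

```python
import re

# Hypothetical patterns for illustration; neither is quoted from the draft.
INTEGER_ONLY = re.compile(r"p(\d+)")
WITH_FRACTION = re.compile(r"p(\d+(?:\.\d+)?)")


def is_valid_percentile(op: str, allow_fraction: bool) -> bool:
    """Return True if `op` parses as a percentile operator in [0, 100]."""
    pattern = WITH_FRACTION if allow_fraction else INTEGER_ONLY
    match = pattern.fullmatch(op)
    return match is not None and 0 <= float(match.group(1)) <= 100


print(is_valid_percentile("p99", allow_fraction=False))    # True
print(is_valid_percentile("p99.9", allow_fraction=False))  # False
print(is_valid_percentile("p99.9", allow_fraction=True))   # True
```

Either rule is workable; the text just needs to pick one and use it
consistently.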

Also, the listing of "cost-source" values introduced by this document
(in §5.1) does not include "nominal", but we do also introduce "nominal".

Similarly, §3.1.3 still refers to the "-<percentile>" component of a
cost metric string, but that component has been generalized to an
arbitrary statistical operator.


----------------------------------------------------------------------
COMMENT:
----------------------------------------------------------------------

All things considered, this is a pretty well-written document that was
easy to read.  That helped a lot as I reviewed it, especially so on a
week with a pretty full agenda for the IESG telechat.

Section 2.2

Should we say anything about how to handle a situation where a base
metric identifier is so long that the statistical operator string cannot
be appended while remaining under the 32-character limit?
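
As a sketch of the kind of check I have in mind (Python; the function
name is hypothetical, the "-" separator follows the "-<percentile>"
form used in the examples, and the choice to raise an error is only one
possible answer to the question):

```python
MAX_METRIC_LENGTH = 32  # the character limit referred to above


def full_metric_name(base: str, operator: str) -> str:
    """Join a base metric and an operator with '-', rejecting long results."""
    name = f"{base}-{operator}"
    if len(name) > MAX_METRIC_LENGTH:
        raise ValueError(f"{name!r} is {len(name)} characters, over the limit")
    return name


# "example-metric" is a made-up base identifier, not one from the draft.
print(full_metric_name("example-metric", "p99"))
```

A 30-character base identifier leaves room for only a one-character
operator, so even "max" would not fit.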

   min:
      the minimal value of the observations.
   max:
      the maximal value of the observations.
   [...]

Should we say anything about what sampling period of observations is in
scope for these operators?

Section 3.x.4

If we're going to be recommending that implementations link to external
human-readable resources (e.g., for the SLA details or estimation
methodology), does the guidance from BCP 18 on indicating the language
of the text come into play?

It's also a bit surprising that we specify the new fields in the
"parameters" of a metric just in passing in the prose, without a more
prominent indication that we're defining a new field.

Section 3.1.4

   "nominal": Typically network one-way delay does not have a nominal
   value.

Does that mean that they MUST NOT be generated, or that they should be
ignored if received, or something else?  (Similarly for the other
sections where we say the same thing.)

   This description can be either free text for possible presentation to
   the user, or a formal specification; see [IANA-IPPM] for the
   specification on fields which should be included.  [...]

Is the IANA registry really the best reference for what fields to
include?  Typically we would only refer to the registry when we care
about the current state of registered values, but the need here seems to
effectively be the column headings of the registry, which could be
obtained from the RFC defining the registry.

Section 3.3.3

   Intended Semantics: To specify spatial and temporal aggregated delay
   variation (also called delay jitter)) with respect to the minimum
   delay observed on the stream over the one-way delay from the
   specified source and destination.  The spatial aggregation level is
   specified in the query context (e.g., PID to PID, or endpoint to
   endpoint).

I do appreciate the note about how this is not the normal statistics
variation that follows this paragraph, but I also don't think this is a
particularly clear or precise specification for how to produce the
number that is to be reported.  It also doesn't seem to fully align with
the prior art in the IETF, e.g., RFC 3393.  It seems like it would be
highly preferable to pick an existing RFC and refer to its
specification for computing a delay variation value.  (To be clear, such
a reference would then be a normative reference.)

Section 3.4.3

   Intended Semantics: To specify the number of hops in the path from
   the specified source to the specified destination.  The hop count is
   a basic measurement of distance in a network and can be exposed as
   the number of router hops computed from the routing protocols
   originating this information.  [...]

It seems like this could get a little messy if there are multiple
routing protocols in use (e.g., both normal IP routing and an overlay
network, as for service function chaining or other overlay schemes).
I don't have any suggestions for disambiguating things, though, and if
the usage is consistent within a given ALTO Server it may not have much
impact on the clients.

Section 3.4.4

   "sla": Typically hop count does not have an SLA value.

As for "nominal", earlier, is there any guidance to give on not
generating it or what to do if it is received?
(Also appears later, I suppose.)

Section 4.1.4

   "estimation": The exact estimation method is out of the scope of this
   document.  See [Prophet] for a method to estimate TCP throughput.  It
   is RECOMMENDED that the "parameters" field of an "estimation" TCP
   throughput metric provides two fields: (1) a congestion-control
   algorithm name (a field named "congestion-control-alg"); and (2) a
   link (a field named "link")to a description of the "estimation"
   method.  Note that as TCP congestion control algorithms evolve (e.g.,
   TCP Cubic Congestion Control [I-D.ietf-tcpm-rfc8312bis]), it helps to
   specify as many details as possible on the congestion control
   algorithm used.  This description can be either free text for
   possible presentation to the user, or a formal specification.  [...]

Do these specifics go into the "congestion-control-alg" name, or in the
linked content?

Section 5.3

   To address the backward-compatibility issue, if a "cost-metric" is
   "routingcost" and the metric contains a "cost-context" field, then it
   MUST be "estimation"; if it is not, the client SHOULD reject the
   information as invalid.

This seems like a sub-optimal route to backwards compatibility, as it
would (apparently) permanently lock the "routingcost" metric to only the
"estimation" source with no way to negotiate more flexibility.  Unless
we define a new "routingcost2" metric that differs only in the lack of
this restriction, of course.
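
For concreteness, my reading of the quoted rule as a client-side check,
sketched in Python.  The dictionary layout (a "cost-source" member
inside "cost-context") is my assumption about the JSON shape, and the
function name is hypothetical.

```python
def accept_cost_type(cost_type: dict) -> bool:
    """Apply the quoted rule; cost types it does not cover are accepted."""
    if cost_type.get("cost-metric") != "routingcost":
        return True  # the rule only talks about "routingcost"
    context = cost_type.get("cost-context")
    if context is None:
        return True  # legacy form without a "cost-context" field
    # Assumed layout: the source is carried in context["cost-source"].
    return context.get("cost-source") == "estimation"


print(accept_cost_type({"cost-metric": "routingcost"}))
print(accept_cost_type({"cost-metric": "routingcost",
                        "cost-context": {"cost-source": "sla"}}))
```

The second call is rejected, and nothing in the exchange lets a server
and client agree to lift that restriction later.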

Section 5.4.1

   the ALTO server may provide the client with two pieces of additional
   information: (1) when the metrics are last computed, and (2) when the
   metrics will be updated (i.e., the validity period of the exposed
   metric values).  The ALTO server can expose these two pieces of
   information by using the HTTP response headers Last-Modified and
   Expires.

While this seems like it would work okay in the usual case, it seems a
bit fragile, in that it may fail in boundary cases, such as when a
server is just starting up.  I would lean towards recommending use of
explicit data items to convey this sort of information (and also the
overall measurement interval over which statistics are computed, which
may not always go back to "the start of time").
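
A rough Python sketch of what a client would have to do with the two
headers, to show where the inference breaks down.  The function name is
hypothetical, the header lookup assumes exact-case keys, and the dates
are made up.

```python
from datetime import timedelta
from email.utils import parsedate_to_datetime
from typing import Mapping, Optional


def computed_to_update_span(headers: Mapping[str, str]) -> Optional[timedelta]:
    """Return Expires minus Last-Modified, or None if either is absent."""
    last_modified = headers.get("Last-Modified")
    expires = headers.get("Expires")
    if last_modified is None or expires is None:
        return None  # boundary case: the client has nothing to infer from
    return parsedate_to_datetime(expires) - parsedate_to_datetime(last_modified)


example = {
    "Last-Modified": "Wed, 01 Dec 2021 21:00:00 GMT",  # made-up values
    "Expires": "Wed, 01 Dec 2021 22:00:00 GMT",
}
print(computed_to_update_span(example))
print(computed_to_update_span({}))
```

Note that neither header says anything about the measurement interval
behind the statistics, which is the other item I would like to see
conveyed explicitly.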

Section 5.4.2

   often be link level.  For example, routing protocols often measure
   and report only per link loss, not end-to-end loss; similarly,
   routing protocols report link level available bandwidth, not end-to-
   end available bandwidth.  The ALTO server then needs to aggregate
   these data to provide an abstract and unified view that can be more
   useful to applications.  The server should consider that different
   metrics may use different aggregation computation.  For example, the
   end-to-end latency of a path is the sum of the latency of the links
   on the path; the end-to-end available bandwidth of a path is the
   minimum of the available bandwidth of the links on the path.

Some caution seems in order relating to aggregation of loss
measurements, as loss is not always uncorrelated across links in the
path.
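
To illustrate, a Python sketch of the three aggregations.  The sum and
minimum come from the quoted text; the loss formula is my own addition
and holds only under the assumption that loss on each link is
independent of loss on the others.  Loss rates here are fractions in
[0, 1], and all function names are hypothetical.

```python
from math import prod
from typing import Sequence


def path_latency(link_latencies: Sequence[float]) -> float:
    """End-to-end latency: the sum over the links on the path."""
    return sum(link_latencies)


def path_available_bandwidth(link_bandwidths: Sequence[float]) -> float:
    """End-to-end available bandwidth: the minimum over the links."""
    return min(link_bandwidths)


def path_loss_if_independent(link_loss_rates: Sequence[float]) -> float:
    """End-to-end loss, ASSUMING per-link losses are independent."""
    return 1.0 - prod(1.0 - p for p in link_loss_rates)


print(path_latency([1.0, 2.0, 3.0]))
print(path_available_bandwidth([10.0, 5.0, 8.0]))
print(path_loss_if_independent([0.1, 0.1]))
```

When losses on different links are correlated, the last function no
longer describes the path, which is why some cautionary text seems
warranted.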

Section 6

I thought that the outcome of the art-art review thread was that we
would add some mention of ordinal cost mode here as a means to mitigate
the risk of exposing sensitive numerical metric values, but I don't see
such text.

In light of the guidance in Section 7 for new cost source types to
document their security considerations, should we document the security
considerations for the "sla" type here?  The overall theme would be
similar to what RFC 7285 already describes, but we could mention that
knowledge specifically of provider SLA targets allows attackers to
target the SLA, causing problems for the provider other than the typical
DoS attack class.  (I'm not coming up with anything new to say about
"nominal" or "estimation".)

I would also consider mentioning that the "origin" references in table 1
might have useful things to say about the individual metrics that we
use.

Giving an attacker the ability to receive the instantaneous loss rate on
a path could be useful in helping the attacker gauge the efficacy of an
ongoing attack targeting that path.  The RFCs from the DOTS WG (e.g.,
8783 and 9132) may have some useful text on this topic that could be
used as a model.

Section 9.1

It's not really clear to me that [IANA-IPPM] needs to be classified as
normative (or whatever it is replaced by, in light of my earlier comment
in §3.1.4).

RFC 2330 is cited only once, in a "for example" clause; this would
typically cause it to be classified as only an informative reference.

The mention of RFC 8895 is conditional on it being implemented, so that
could probably also be downgraded to an informative reference.

Section 9.2

Some kind of URL for [Prometheus] would be very helpful.
[Prophet], too, though at least that has the ACM/IEEE Transactions venue
to anchor the reference.

I'm not entirely sure why RFC 2818 is classified as normative but RFC
8446 only as informative, since they are part of the same (quoted)
requirement clause.

NITS

Section 2.1

   To make it possible to specify the source and the aforementioned
   parameters, this document introduces an optional "cost-context" field
   to the "cost-type" field defined by the ALTO base protocol
   (Section 10.7 of [RFC7285]) as the following:

I think s/"cost-type" field/CostType object/ would be slightly more
accurate.

   The "estimation" category indicates that the metric value is computed
   through an estimation process.  An ALTO server may compute
   "estimation" values by retrieving and/or aggregating information from
   routing protocols (e.g., [RFC8571]) and traffic measurement
   management tools (e.g., TWAMP [RFC5357]), with corresponding
   operational issues.  [...]

I'm not sure if "with corresponding operational issues" conveys the
intended phrasing -- to me, it seems to say "do [the previous things],
but expect that there will sometimes be operational issues that make the
data unavailable or inaccurate".

Section 2.2

   stddev:
      the standard deviation of the observations.


   stdvar:
      the standard variance of the observations.

Pedantically, we could say whether these are sample or population
standard deviation/variance (a difference of one in the denominator),
but it seems very unlikely to matter for these purposes.
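
For reference, the two conventions side by side, using Python's
standard statistics module on made-up observations:

```python
import statistics

observations = [10.0, 12.0, 14.0, 16.0]  # made-up sample data

# Population forms divide by n.
population_variance = statistics.pvariance(observations)
population_stddev = statistics.pstdev(observations)

# Sample forms divide by n - 1.
sample_variance = statistics.variance(observations)
sample_stddev = statistics.stdev(observations)

print(population_variance, sample_variance)
print(population_stddev, sample_stddev)
```

With only four observations the gap is visible; with the number of
observations a server is likely to have, it shrinks to almost nothing.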

Section 3

   dropped before reaching the destination (pkt.dropped).  The semantics
   of the performance metrics defined in this section are that they are
   statistics (percentiles) computed from these measures; for example,

I suggest "e.g., percentiles" since stddev/variance are not percentiles
but are statistics.

   the x-percentile of the one-way delay is the x-percentile of the set
   of delays {pkt.delay} for the packets in the stream.

This phrasing presupposes that there is a definite stream under
consideration, but I don't think that much confusion is likely and am
not sure that there's a need to change anything.

Section 3.1.3

I'd perhaps make a note about the wrapping of the Accept: header field
line in the example (and all the other similarly affected examples).

Section 3.2.2

I suggest reusing the phrasing from §3.1.2 that mentions floating-point
values, for consistency.

Section 3.5.2

   The metric value type is a single 'JSONNumber' type value conforming
   to the number specification of [RFC8259] Section 6.  The number MUST
   be non-negative.  The value represents the percentage of packet
   losses.

I'd probably mention floating-point here as well.

Section 4.3.3

   Intended Semantics: To specify spatial and temporal maximum
   reservable bandwidth from the specified source to the specified
   destination.  The value corresponds to the maximum bandwidth that can
   be reserved (motivated from [RFC3630] Section 2.5.7).  The spatial

It's a little interesting to see an OSPF reference for max reservable
bandwidth when we used an IS-IS one for current residual bandwidth, but
it's hard to see much harm coming from the mixture of references (who
is going to follow the references anyway?).

Section 7

   IANA has created and now maintains the "ALTO Cost Metric Registry",
   listed in Section 14.2, Table 3 of [RFC7285].  This registry is
   located at <https://www.iana.org/assignments/alto-protocol/alto-
   protocol.xhtml#cost-metrics>.  This document requests to add the
   following entries to "ALTO Cost Metric Registry".

The live registry has a "reference" column, so I'd add ", with this
document as the reference", here.

   Registered ALTO address type identifiers MUST conform to the
   syntactical requirements specified in Section 2.1.  Identifiers are
   to be recorded and displayed as strings.

s/address type/cost source type/