Re: [ippm] Review of draft-ietf-ippm-responsiveness-03

Christoph Paasch <cpaasch@apple.com> Fri, 16 February 2024 23:59 UTC

From: Christoph Paasch <cpaasch@apple.com>
Date: Fri, 16 Feb 2024 16:58:14 -0700
Cc: "ippm@ietf.org" <ippm@ietf.org>
To: Greg White <g.white=40cablelabs.com@dmarc.ietf.org>
Archived-At: <https://mailarchive.ietf.org/arch/msg/ippm/uLFYyztlZ2sf2ilYZjyv9kM8oQk>
Subject: Re: [ippm] Review of draft-ietf-ippm-responsiveness-03

Hello Greg,

thanks a lot for your detailed review! Please see inline!

> On Jan 23, 2024, at 10:45 AM, Greg White <g.white=40cablelabs.com@dmarc.ietf.org> wrote:
> Technical Comments:
>  
>  
> The responsiveness test probe measurements include more than just the network responsiveness, they also include effects from the web server implementation.  Yet, the Abstract and Introduction sections use videoconferencing as the (only) example application.  Since videoconferencing apps don’t use the same connection methodology as HTTPS, I think this is misleading.  This methodology should either be more clearly labeled/described as an interactive web responsiveness measurement (and videoconferencing not be used as the example application), or the methodology should be revised to more accurately reflect the responsiveness that impacts real-time communication applications (e.g. by trying to eliminate the HTTP server components of delay in the Responsiveness calculation, rather than explicitly including it).

Yes, we cite videoconferencing in the abstract as an example of how bad responsiveness affects the user experience. We chose it because it is the most relatable use-case, one everyone has experienced, and its effects are the most visible (and even audible).
Many more use-cases benefit from good responsiveness, but I don't think we should list all of them in the abstract. And the abstract is the only place where we use videoconferencing as the primary example. Or do you see other places where we place too much emphasis on videoconferencing?

We use HTTPS because it is the most widely used protocol on the Internet for the vast majority of use-cases. I believe we explain this in Section 3.

>  
> §4.3.1 suggests using a Trimmed Mean to aggregate the samples of each metric. I have a couple of issues with this.
> The first issue is with the “trimmed” part. The responsiveness test uses traffic that is modeled after real application traffic, and it’s intended to give the user an indication of the real behavior of the network.  Real applications don't get to ignore the times when the network latency is bad, they have to deal with it.  Throwing away the top 5% of the measurements seems really arbitrary.  To just pretend that those events didn't happen and don't include them in the measurement I think is wrong.  In the experimental sciences, “outliers” that represent faulty data or are the result of experimental error are often excluded from data sets, but the experimenter needs to be very careful to identify whether the outliers are in fact faulty, or are they actually extreme observations of the real variability that exists in the experiment.  In the context of the responsiveness test, I think the likelihood is much greater that many of these data points are in fact real (and important) observations of the network latency variation.  Is there a specific concern about the potential for faulty data that would affect 5% of the samples?

Empirically, we have seen that this trimming is needed for the stability of the responsiveness measurement across subsequent runs. If we do not "throw away" outliers, the final RPM value has huge variance. Responsiveness aims to be a user-accessible metric, so providing some amount of stability across different runs on the same network is important.

In the next revision of the draft we suggest that a verbose mode should expose the raw values that were collected (https://github.com/network-quality/draft-ietf-ippm-responsiveness/blob/7b69834c887f7829b999123fa222ed7cbd1f9e7e/draft-ietf-ippm-responsiveness.md?plain=1#L719)

>  
> The second issue is with the “mean” part. In the context of user-observable events, like page load time, it may be that mean values are useful metrics, but in the context of measuring network RTT (which is one of the dominant factors measured by the responsiveness probes), averages are often a poor way to summarize a set of measurements [1],[2].  Quite a number of applications for which “responsiveness” is important (including voice applications and many multiplayer games) utilize a jitter buffer at the receiver to convert the variable packet latency of the network into a fixed (or slowly varying) latency and residual loss.  These jitter buffers often result in a fairly low level of residual loss (e.g. 1%), which implies that the fixed latency that they provide to the application matches a high percentile (e.g. P99) of the packet latency of the network.  If the description of the responsiveness test is revised to be more clearly about web responsiveness rather than (e.g.) videoconferencing, perhaps this argument is less strong, but even then I think that the mean value deemphasizes the variability in latency that affects user experience.  
>  
> Instead of TM95, I would suggest using P99 if the goal is to reflect a responsiveness metric for real-time interactive applications (RTC/CloudGaming/XR). 

We tried percentiles in the past. Unfortunately, the variance across subsequent runs is way too high for the metric to be useful to an end-user.
The key element here is that we do not take the TM95 over all measurements from the start of the test (which would include latency samples from before the link was saturated), but only over the last MAD (Moving Average Distance, 4 by default) intervals, during which we have reached stability.
In general, any time we try to express a larger set of numbers as a single metric we lose precision, and there is no perfect answer. Whichever aggregate we choose (P99, TM95, average, ...) will hide some information. The only complete picture is a CDF plot, and that is possible thanks to the recommendation to expose the raw values in a verbose mode (see above).
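
For illustration (this is not text for the draft), here is a rough sketch of how the candidate aggregations behave on the same set of raw samples, assuming the raw values are RTTs in milliseconds; the helper names are placeholders:

    # Illustrative only, not from the draft: how the candidate aggregations compare
    # on the same set of raw probe samples (values in milliseconds).

    import random

    def trimmed_mean(samples, trim_fraction=0.05):
        # Mean after discarding the largest trim_fraction of the samples (TM95 by default).
        ordered = sorted(samples)
        keep = len(ordered) - int(len(ordered) * trim_fraction)
        kept = ordered[:keep]
        return sum(kept) / len(kept)

    def percentile(samples, p):
        # Simple nearest-rank percentile, e.g. p=0.99 for P99.
        ordered = sorted(samples)
        return ordered[min(len(ordered) - 1, round(p * (len(ordered) - 1)))]

    random.seed(1)
    # 400 simulated RTT samples: mostly around 50 ms plus a handful of latency spikes.
    rtt_ms = [random.gauss(50, 5) for _ in range(390)] + \
             [random.uniform(200, 400) for _ in range(10)]

    print("mean:", sum(rtt_ms) / len(rtt_ms))   # pulled up by the spikes
    print("TM95:", trimmed_mean(rtt_ms))        # discards the top 5% of samples
    print("P99 :", percentile(rtt_ms, 0.99))    # dominated by the spikes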

>  
> [1] https://www.bitag.org/documents/BITAG_latency_explained.pdf
> [2] https://www.iab.org/wp-content/IAB-uploads/2021/09/single-delay-metric-1.pdf
>  
> I understand that P99 metrics can be more variable than mean metrics, and there might be an interest in reporting a more inherently stable metric, but metric stability seems to me to be less important than metric validity.

For an end-user-accessible metric, we believe that stability is very important for credibility. Our goal is not for responsiveness to be a primary debugging tool for network engineers.

> What is the reason that the goodput phase starts with only 1 connection and only adds 1 connection each interval?  It seems that there is some underlying assumption that using the minimum # of connections to achieve goodput saturation has some value.  I’m not sure that it does. Why not start with 8 connections and add 8 more each interval until goodput saturation is reached?  I would think that you’d reach saturation much more quickly, and this is more important than you may realize.  The majority of existing tools that only aim to maximize goodput (i.e. speedtests) start several (4,6,8 or 16) simultaneous flows immediately, web browsers also (nearly) simultaneously open dozens of TCP connections. What is the downside?

We did observe bursts of significant packet loss due to TCP slow-start when opening too many connections per interval. That being said, I can totally understand the reason for initially starting with more than one connection. E.g., starting with 4 and adding 2 flows at each interval seems reasonable to me. I think these can be two parameters of the test, set by the tool.

I would suggest 2 additional parameters:

INP : Initial number of concurrent transport connections at the start of the test (default: 1)
INC : Number of transport connections to add to the pool of load-generating connections at each interval (default: 1)


Sounds good?
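
For illustration (again, not draft text), a rough sketch of the ramp-up using these two knobs; the stability check and the connection-opening call are placeholders:

    # Sketch of the ramp-up using the two proposed parameters. INP and INC are the
    # names suggested above; test_is_stable() and open_load_connection() are
    # placeholders, not part of the draft.

    import time

    INP = 1    # initial number of concurrent load-generating connections
    INC = 1    # connections added to the pool at each interval
    ID = 1.0   # interval duration in seconds

    def run_load_ramp(open_load_connection, test_is_stable, max_intervals=20):
        # Open INP connections up front, then add INC more at every interval
        # until the tool's stability criteria are met or the interval budget runs out.
        connections = [open_load_connection() for _ in range(INP)]
        for _ in range(max_intervals):
            time.sleep(ID)
            if test_is_stable():            # placeholder for the draft's stability logic
                break
            connections.extend(open_load_connection() for _ in range(INC))
        return connections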

> §4.4 indicates that responsiveness probes don’t begin until the load-generating process declares goodput saturation.  Why wait?  Is there a downside to starting the responsiveness probing at the beginning of the test?  It seems like the current method could in some (possibly many) cases take MAD more intervals than would be necessary to collect a measurement.

You are right - this is incorrectly explained and comes from a past iteration of the algorithm.

We continuously probe for responsiveness. The reason is not to collect additional measurements, but to avoid a sudden change in load: if we only start sending 100 responsiveness probes per second once goodput saturation is reached, those probes land on an already saturated link and create considerable additional congestion and stress on the system. We have experienced this. Probing from the start avoids it.

I will fix the wording in the algorithm.

> §4.4.1. I don’t understand the definition of “low” confidence.  This seems to only be possible with invalid parameter choices.  If MAD=4 and ID=1s, then as long as the tool allows at least 4 seconds for each “stage”, it would be impossible to have the “Low” confidence result, and if the tool allows less than 4 seconds, then it would be impossible not to.

It may take so long to reach capacity saturation that the tool's time limit is reached first. I can make this more explicit in the text.

> The purpose of the Responsiveness Test is to raise awareness of the deployment (or lack thereof) of new technologies to improve network responsiveness and to encourage deployment of these technologies.  As such, the use of classic congestion control is invalid.  The draft needs to mandate the use of L4S congestion control for all Responsiveness Test traffic and make it clear why this is necessary.  §6  3rd paragraph reads:
> “As clients and servers become deployed that use L4S congestion
>    control (e.g., TCP Prague with ECT(1) packet marking), for their
>    normal traffic when it is available, and fall back to traditional
>    loss-based congestion controls (e.g., Reno or CUBIC) otherwise, the
>    same strategy SHOULD be used for Responsiveness Test traffic.  This
>    is RECOMMENDED so that the synthetic traffic generated by the
>    Responsiveness Test mimics real-world traffic for that server.”
> Suggest:
> “As clients, servers and networks become deployed that support L4S
> congestion control (e.g., TCP Prague with ECT(1) packet marking
> and L4S ECN congestion marking in the network), it will be
>    imperative that the Responsiveness Test measures performance using
>    L4S as well.  Both clients and servers MUST support
>    and use L4S congestion control (e.g. TCP Prague [I.D-prague]) and
>    Accurate ECN ([RFCxxxx]) for all Responsiveness Test traffic. This
>    is required so that the synthetic traffic generated by the
>    Responsiveness Test correctly measures the responsiveness capability
>   of the network. In networks that don’t support L4S, L4S congestion
>    control algorithms are required (by [RFC9331]) to automatically behave
>    like classic congestion control (e.g. CUBIC/Reno), so the use of an L4S
>    congestion control algorithm for the Responsiveness Test is appropriate
>    for all networks.”

Please take a look at the upcoming version of the draft at https://github.com/network-quality/draft-ietf-ippm-responsiveness/blob/main/draft-ietf-ippm-responsiveness.md.

We removed this entire part on congestion control. The responsiveness test does not only aim to raise awareness of technologies in the network; it aims to holistically measure responsiveness the way the end-user will experience it when connecting to a particular service.

This means the end-user system's congestion control should be left at its default configuration (same for the server side). We do not encourage or recommend any specific technology.

> §7.1 it isn’t clear to me the meaning of the test_endpoint URL. How exactly does the server set it, and what does the client do with this information?

We try to explain this with:

If the test server provider can pin all of the requests for a test run to a specific host in the service (for a particular run), they can specify that host name in the "test_endpoint" field.

I can try to clarify this a bit more.

>  
>  
> Editorial Comments:
>  
> The description in §4.1 presumes no AQM (e.g. it uses phrases like: “… the bottleneck … has its buffer filled up entirely ….”).  Since AQM is getting to be relatively widely deployed, it would be good to make this intro material a little more aware of the current reality.

Tried to clarify.

>  
> §5 should be more clear that selection of the measurement server is important.  In addition to the server characteristics having an influence on the result and perhaps not reflecting the characteristics of a “real” application server of interest to the user, it is only the path to that server that is being tested, so the network characteristics might not reflect the network path that is of interest to the user.   A discussion of this in section 5 will then provide motivation for the discussion in Section 7 on locating test servers in various locations, and a back reference to section 5 could be added there.

I added a paragraph in the “Server side influence” section, talking about server-selection.

>  
>  
>  
>  
> Nits:
>  
> §2
> “Finally, it is crucial to recognize that significant queueing only
>    happens on entry to the lowest-capacity (or "bottleneck") hop on a
>    network path.“
> suggest: 
> “Finally, it is crucial to recognize that significant queueing in the network only
>    happens on entry to the lowest-capacity (or "bottleneck") hop on a
>    network path. “
> rationale:  the immediately preceding paragraph discusses that queuing doesn’t just happen in the network.


Thanks!

>  
>  
> §2
> “In a heterogeneous
>    network like the Internet it is inevitable that there must
>    necessarily be some hop along the path with the lowest capacity for
>    that path.”
> Either say “it is inevitable that” or “there must necessarily be”, not both.

Done!

>  
>  
> §2
> “Arguably, this heterogeneity of
>    the Internet is one of its greatest strengths.  Allowing individual
>    technologies to evolve and improve at their own pace, without
>    requiring the entire Internet to change in lock-step, has enabled
>    enormous improvements over the years in technologies like DSL, cable
>    modems, Ethernet, and Wi-Fi, each advancing independently as new
>    developments became ready.  As a result of this flexibility we have
>    moved incrementally, one step at a time, from 56kb/s dial-up modems
>    in the 1990s to Gb/s home Internet service and Gb/s wireless
>    connectivity today.”
> Is this relevant to the Responsiveness definition?  Suggest deleting it for brevity.

Agreed.

>  
>  
> §4.1.1 is a bit circuitous. 
> A) suggest moving the first 2 sentences of the 5th paragraph to the start of the section:
>  
>    “The purpose of the Responsiveness Test is not to productively move
>    data across the network, the way a normal application does.  The
>    purpose of the Responsiveness Test is to, as quickly as possible,
>    simulate a representative traffic load as if real applications were
>    doing sustained data transfers and measure the resulting round-trip
>    time occurring under those realistic conditions.”
>  
>                 B) Suggest deleting the second paragraph.
>                 C) Suggest changing the beginning of the (former) 3rd sentence in the (former) 5th paragraph to: “For all of these reasons, using multiple ….”

Makes sense! Done!

>  
> §4.1.2.   s/Debuggability of/Debugging/
Thanks!

> §4.1.3 again presumes no AQM, and it has some strange language.
> “By definition, once goodput is
>    maximized, buffers will start filling up, creating the "standing
>    queue" that is characteristic of bufferbloat.”
> This is not a true statement in all cases, and it certainly isn’t the definition of maximum goodput that buffers start filling up. Suggestion:
> “When using TCP with traditional loss-based congestion control, in 
> bottlenecks that don’t support Active Queue Management, once goodput is
>    maximized, buffers will start filling up, creating the "standing
>    queue" that is characteristic of bufferbloat.”

I went with the following:

By definition, once goodput is maximized, if the transport protocol emits more traffic into the network than is needed, the additional traffic will either get dropped or be buffered, thus creating a "standing queue" that is characteristic of bufferbloat.


>  
> §4.1.3 first paragraph,
> “At this moment the
>    test starts measuring the responsiveness until it, too, reaches
>    saturation.”
> When reading this for the first time, it was somewhat ambiguous what “it” refers to in this sentence (the test or the responsiveness). Suggest
> “At this moment the
>    test starts measuring the responsiveness until that metric reaches
>    saturation.”

Thanks!

>  
> §4.1.3 second paragraph: “The algorithm notes that…”. “Notes” seems like a strange word here.  The algorithm isn’t taking notes. Maybe “presumes” or “is based on the presumption”.

Done!

>  
> §4.1.3 second paragraph:
> “If new connections leave the throughput the same …”
> Suggest:
> “If new connections don’t result in an increase in throughput …”

Done!

>  
> §4.1.3 second paragraph:
> “At this point, adding more
>    connections will allow to achieve full buffer occupancy.”
> In most cases, once the link is saturated with TCP traffic, the buffer will fill.  It doesn’t take additional connections in order to achieve full buffer occupancy.  The algorithm seems to be designed with this in mind, since it does not add more connections once goodput saturation is reached. Suggest deleting this sentence.

No, we do keep adding connections even if goodput saturation has been reached. And that is needed to work around receive-window limitations.

>  
> §4.1.3 second paragraph:
> “Responsiveness will gradually decrease from now on, until the buffers
>    are entirely full and reach stability of the responsiveness as well.”
> Suggest:
> “Responsiveness will decrease from now on, until the buffers
>    are entirely full and stability is reached.”

Thanks!

>  
> §4.3  “http_l” appears twice and “http_s” appears 3 times, yet these appear to refer to the same value.

That was from previous revisions. I changed all of them to http_l.

> §4.3 the arithmetic appears to support the interpretation that the 100 probes per second are divided equally between foreign-probe types and self-probe types, but I don’t see this explicitly stated in the draft. Suggest stating this as a requirement. Also, I suspect that the two probe types should be alternated so that both components of the responsiveness calculation stabilize at the approximately the same time.   Should the two probe types be spread out relative to each other (e.g. send self-probe, wait 10ms, send foreign probe, wait 10ms, repeat), or should they be sent back-to-back in order to both be measuring (nearly) the same network condition (e.g. send self-probe, send foreign-probe, wait 20ms, send foreign-probe, send self-probe, wait 20ms, repeat)? Suggest making these aspects explicit.

I will make it explicit that the probes should be spread out and interleaved:

Given this information, we recommend that a test client does not send more than `MPS` (Maximum responsiveness Probes per Second, default 100) probes per `ID`. The same number of probes should be sent on the load-generating connections as on separate connections. The probes should be spread out equally over the duration of the interval, with the two types of probes interleaved with each other. The test client should uniformly and randomly select from the active load-generating connections on which to send self probes.
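
For illustration, a rough sketch of such a schedule; MPS and ID are the parameters named above, and the send functions are placeholders rather than anything from the draft:

    # Sketch of spreading MPS probes evenly over one interval of duration ID,
    # alternating self probes (on a randomly picked load-generating connection)
    # and foreign probes (each on its own separate connection). The send_*
    # callables are placeholders.

    import random
    import time

    MPS = 100   # maximum responsiveness probes per interval
    ID = 1.0    # interval duration in seconds

    def probe_one_interval(load_connections, send_self_probe, send_foreign_probe):
        gap = ID / MPS                          # even spacing across the interval
        for i in range(MPS):
            if i % 2 == 0:
                # self probe on a uniformly, randomly chosen load-generating connection
                send_self_probe(random.choice(load_connections))
            else:
                send_foreign_probe()            # foreign probe on a separate connection
            time.sleep(gap)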


> §4.3.1 Responsiveness formula is unnecessarily confusing. While it isn’t mathematically ambiguous, using the constructs 1/B*C and 1/B*(C+D) is more confusing for readers than the constructs C/B and (C+D)/B. Referring to the formula as a weighted mean isn’t very helpful for two reasons.  First it is the denominator that is a mean calculation, not the whole expression. Second, the denominator would be more helpfully described as a mean of two values, one of which is itself a mean of three values.  This could be improved by calculating foreign_response_time = (TM(tcp_f) + TM(tls_f) + TM(http_f))/3 and then using that value in the Responsiveness formula, or alternatively, explain the weighting factors 1/6 and 1/2 in the text. 

Good point. I will use your suggestion of calculating foreign_response_time.
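
For illustration, a small sketch of the calculation restructured that way, assuming (as in the draft) that the trimmed means are in milliseconds and the result is expressed in round-trips per minute; the variable names are placeholders:

    # Sketch of the responsiveness (RPM) calculation restructured as suggested:
    # aggregate the three foreign-probe trimmed means first, then average the
    # result with the self-probe trimmed mean. Values are in milliseconds, and
    # 60000 ms per minute converts the result to round-trips per minute.

    def responsiveness_rpm(tm_tcp_f, tm_tls_f, tm_http_f, tm_http_l):
        foreign_response_time = (tm_tcp_f + tm_tls_f + tm_http_f) / 3.0
        mean_response_time = (foreign_response_time + tm_http_l) / 2.0
        return 60000.0 / mean_response_time

    # Equivalent to the draft's 1/6 and 1/2 weighting factors:
    # 1/6*(30+40+50) + 1/2*60 = 50 ms  ->  60000/50 = 1200 RPM
    print(responsiveness_rpm(tm_tcp_f=30, tm_tls_f=40, tm_http_f=50, tm_http_l=60))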

> §4.4
> “calculated in the most-recent MAD intervals” 
> Does this include the current interval?  Later text indicates that it does not.  Suggest:
> “calculated in the immediately preceding MAD intervals”

Done!

>  
> §3 refers to the test taking 20 seconds, while §4.4 and §4.4.1 refer to one half of the test taking 20 seconds, and the full test taking 40 seconds.  Make these consistent.

We removed all references to an exact time.

> §4.4
> “It is left to the implementation what to do when stability is not reached within that time-frame.”
> The very next subsection (4.4.1) seems to give some guidance here.  Why not recommend that mechanism?

True. Will do.

>  
> §4.4.1 The definitions use the hard coded value “4”.  Shouldn’t this instead be “MAD”?

Fixed that.

> §5.1.2, first paragraph, 4th sentence reads:
> “Many networks deploy transparent TCP Proxies,
> firewalls doing deep packet-inspection, HTTP "accelerators",...”
> Suggest completing this sentence rather than ending with an ellipsis.

Done

>  
> §5.1.2 second paragraph.  Please provide a reference or definition of Smart Queue Management.  Does this term include AQM? If not, suggest adding AQM explicitly.

Should have been “Active Queue Management”. Fixing it.

> §6 title is “Responsiveness Test Server API” yet it contains requirements on both the server and the client, and those requirements go beyond the API.  Maybe: “Responsiveness Test Server and Client Protocol”? 

I disagree - this section is about what the server needs to provide to allow a client to run a responsiveness test. So, I think the title is correct.

>  
>  
> §6 list item 4.
> “… which contains configuration …”
> Suggest
> “… which contains JSON configuration …”
> To link this to the text in the next paragraph.

Done.

>  
> §7 the text in the first paragraph, beginning with “However” and through the end of the numbered list seems like it would fit better in §5.2.

Hmmm, this part is all about the location of the test server (home network vs. ISP vs. …), so I think it fits well here.

>  
> §7.1 Can you provide an explicit example URL (maybe it is: https://nq.example.com/.well-known/nq)

Sure.

>  
> §7.1 In the JSON example the listed urls and the test_endpoint url are in completely different domains.  Is this an expected scenario?

It is definitely a possible scenario. Often the unicast hostnames are in completely different domains from the hostname of the service. For example: curl https://mensura.cdn-apple.com/.well-known/nq
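
For illustration, a small sketch of what a client could do with such a configuration; the URL is the example from your §7.1 comment, and I only rely on the "urls" and "test_endpoint" fields discussed here:

    # Sketch: fetch the well-known configuration and read the test_endpoint field.
    # The URL is the example from the review; the response shape is illustrative,
    # and only the "urls" and "test_endpoint" fields are taken from this discussion.

    import json
    import urllib.request

    WELL_KNOWN = "https://nq.example.com/.well-known/nq"

    with urllib.request.urlopen(WELL_KNOWN) as resp:
        config = json.load(resp)                 # served as application/json

    # "urls" may live in one domain while "test_endpoint" names a host in a
    # completely different domain; both are possible scenarios.
    print(config.get("urls"))
    print(config.get("test_endpoint"))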

>  
> §7.1 mentions 3 times that the content-type is application/json.  Maybe once is enough?

Better be sure ;-)

But yes, I fixed it.


Thanks a lot for all your comments, and sorry it took a while to go over all of them. I have sent out a pull request with the changes at https://github.com/network-quality/draft-ietf-ippm-responsiveness/pull/100. If you want, you can take a look and comment.

Christoph