[ippm] Review of draft-ietf-ippm-responsiveness-03

Greg White <g.white@CableLabs.com> Tue, 23 January 2024 18:46 UTC

From: Greg White <g.white@CableLabs.com>
To: "ippm@ietf.org" <ippm@ietf.org>
Date: Tue, 23 Jan 2024 18:45:23 +0000
Archived-At: <https://mailarchive.ietf.org/arch/msg/ippm/DoLV9MrMvAxv-LSWnwhydKRneZE>

All,

I did a review of the current responsiveness draft and have a list of technical comments, editorial comments and nits.  Within each section, the comments are listed in the order that the relevant material appears in the draft.

Best Regards,
Greg



Technical Comments:



  1.  The responsiveness test probe measurements include more than just the network responsiveness; they also include effects from the web server implementation.  Yet, the Abstract and Introduction sections use videoconferencing as the (only) example application.  Since videoconferencing apps don’t use the same connection methodology as HTTPS, I think this is misleading.  This methodology should either be more clearly labeled/described as an interactive web responsiveness measurement (and videoconferencing not be used as the example application), or the methodology should be revised to more accurately reflect the responsiveness that impacts real-time communication applications (e.g. by trying to eliminate the HTTP server components of delay in the Responsiveness calculation, rather than explicitly including them).



  2.  §4.3.1 suggests using a Trimmed Mean to aggregate the samples of each metric. I have a couple of issues with this.
The first issue is with the “trimmed” part. The responsiveness test uses traffic that is modeled after real application traffic, and it’s intended to give the user an indication of the real behavior of the network.  Real applications don't get to ignore the times when the network latency is bad; they have to deal with it.  Throwing away the top 5% of the measurements seems really arbitrary.  To just pretend that those events didn't happen and not include them in the measurement is, I think, wrong.  In the experimental sciences, “outliers” that represent faulty data or are the result of experimental error are often excluded from data sets, but the experimenter needs to be very careful to identify whether the outliers are in fact faulty, or are actually extreme observations of the real variability that exists in the experiment.  In the context of the responsiveness test, I think the likelihood is much greater that many of these data points are in fact real (and important) observations of the network latency variation.  Is there a specific concern about the potential for faulty data that would affect 5% of the samples?


The second issue is with the “mean” part. In the context of user-observable events, like page load time, it may be that mean values are useful metrics, but in the context of measuring network RTT (which is one of the dominant factors measured by the responsiveness probes), averages are often a poor way to summarize a set of measurements [1],[2].  Quite a number of applications for which “responsiveness” is important (including voice applications and many multiplayer games) utilize a jitter buffer at the receiver to convert the variable packet latency of the network into a fixed (or slowly varying) latency and residual loss.  These jitter buffers often result in a fairly low level of residual loss (e.g. 1%), which implies that the fixed latency that they provide to the application matches a high percentile (e.g. P99) of the packet latency of the network.  If the description of the responsiveness test is revised to be more clearly about web responsiveness rather than (e.g.) videoconferencing, perhaps this argument is less strong, but even then I think that the mean value deemphasizes the variability in latency that affects user experience.



Instead of TM95, I would suggest using P99 if the goal is to reflect a responsiveness metric for real-time interactive applications (RTC/CloudGaming/XR).

[1] https://www.bitag.org/documents/BITAG_latency_explained.pdf
[2] https://www.iab.org/wp-content/IAB-uploads/2021/09/single-delay-metric-1.pdf

I understand that P99 metrics can be more variable than mean metrics, and there might be an interest in reporting a more inherently stable metric, but metric stability seems to me to be less important than metric validity.
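To make the difference concrete, here is a small sketch comparing the two aggregation choices. The RTT samples and the nearest-rank percentile definition are illustrative assumptions on my part, not taken from the draft; the trimmed mean discards only the top 5%, as described above.

```python
import math

def trimmed_mean(samples, trim_pct):
    """Mean after discarding the top trim_pct percent of samples."""
    s = sorted(samples)
    keep = len(s) - int(len(s) * trim_pct / 100)
    return sum(s[:keep]) / keep

def percentile(samples, p):
    """Nearest-rank percentile (one of several common definitions)."""
    s = sorted(samples)
    return s[max(0, math.ceil(p / 100 * len(s)) - 1)]

# Hypothetical RTT samples (ms): a steady 30 ms baseline plus a 5%
# tail of 250 ms latency spikes of the kind a jitter buffer must absorb.
rtts = [30] * 95 + [250] * 5

tm95 = trimmed_mean(rtts, 5)   # discards exactly the five spikes
p99 = percentile(rtts, 99)     # lands on a spike
print(tm95, p99)
```

In this (contrived) case TM95 reports the 30 ms baseline and hides the spikes entirely, while P99 (250 ms) tracks the delay a jitter buffer tuned for ~1% residual loss would actually impose on the application.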


  3.  What is the reason that the goodput phase starts with only 1 connection and only adds 1 connection each interval?  There seems to be some underlying assumption that using the minimum number of connections to achieve goodput saturation has some value.  I’m not sure that it does. Why not start with 8 connections and add 8 more each interval until goodput saturation is reached?  I would think that you’d reach saturation much more quickly, and this is more important than you may realize.  The majority of existing tools that only aim to maximize goodput (i.e. speedtests) start several (4, 6, 8 or 16) simultaneous flows immediately, and web browsers also open dozens of TCP connections (nearly) simultaneously. What is the downside?
  4.  §4.4 indicates that responsiveness probes don’t begin until the load-generating process declares goodput saturation.  Why wait?  Is there a downside to starting the responsiveness probing at the beginning of the test?  It seems like the current method could in some (possibly many) cases take MAD more intervals than would be necessary to collect a measurement.
  5.  §4.4.1. I don’t understand the definition of “low” confidence.  This seems to only be possible with invalid parameter choices.  If MAD=4 and ID=1s, then as long as the tool allows at least 4 seconds for each “stage”, it would be impossible to have the “Low” confidence result, and if the tool allows less than 4 seconds, then it would be impossible not to.
  6.  The purpose of the Responsiveness Test is to raise awareness of the deployment (or lack thereof) of new technologies to improve network responsiveness and to encourage deployment of these technologies.  As such, the use of classic congestion control is invalid.  The draft needs to mandate the use of L4S congestion control for all Responsiveness Test traffic and make it clear why this is necessary.  §6, 3rd paragraph, reads:
“As clients and servers become deployed that use L4S congestion
   control (e.g., TCP Prague with ECT(1) packet marking), for their
   normal traffic when it is available, and fall back to traditional
   loss-based congestion controls (e.g., Reno or CUBIC) otherwise, the
   same strategy SHOULD be used for Responsiveness Test traffic.  This
   is RECOMMENDED so that the synthetic traffic generated by the
   Responsiveness Test mimics real-world traffic for that server.”
Suggest:
“As clients, servers and networks become deployed that support L4S
   congestion control (e.g., TCP Prague with ECT(1) packet marking and
   L4S ECN congestion marking in the network), it will be imperative
   that the Responsiveness Test measures performance using L4S as
   well.  Both clients and servers MUST support and use L4S congestion
   control (e.g. TCP Prague [I.D-prague]) and Accurate ECN ([RFCxxxx])
   for all Responsiveness Test traffic.  This is required so that the
   synthetic traffic generated by the Responsiveness Test correctly
   measures the responsiveness capability of the network.  In networks
   that don’t support L4S, L4S congestion control algorithms are
   required (by [RFC9331]) to automatically behave like classic
   congestion control (e.g. CUBIC/Reno), so the use of an L4S
   congestion control algorithm for the Responsiveness Test is
   appropriate for all networks.”

  7.  §7.1 The meaning of the test_endpoint URL isn’t clear to me. How exactly does the server set it, and what does the client do with this information?





Editorial Comments:


  1.  The description in §4.1 presumes no AQM (e.g. it uses phrases like: “… the bottleneck … has its buffer filled up entirely ….”).  Since AQM is getting to be relatively widely deployed, it would be good to make this intro material a little more aware of the current reality.


  2.  §5 should be clearer that selection of the measurement server is important.  In addition to the server characteristics having an influence on the result (and perhaps not reflecting the characteristics of a “real” application server of interest to the user), it is only the path to that server that is being tested, so the network characteristics might not reflect the network path that is of interest to the user.   A discussion of this in Section 5 would then provide motivation for the discussion in Section 7 on locating test servers in various locations, and a back reference to Section 5 could be added there.





Nits:


  1.  §2

“Finally, it is crucial to recognize that significant queueing only
   happens on entry to the lowest-capacity (or "bottleneck") hop on a
   network path.“
suggest:

“Finally, it is crucial to recognize that significant queueing in the
   network only happens on entry to the lowest-capacity (or
   "bottleneck") hop on a network path.”

rationale:  the immediately preceding paragraph discusses that queuing doesn’t just happen in the network.



  2.  §2

“In a heterogeneous network like the Internet it is inevitable that
   there must necessarily be some hop along the path with the lowest
   capacity for that path.”

Either say “it is inevitable that” or “there must necessarily be”, not both.



  3.  §2

“Arguably, this heterogeneity of the Internet is one of its greatest
   strengths.  Allowing individual technologies to evolve and improve
   at their own pace, without requiring the entire Internet to change
   in lock-step, has enabled enormous improvements over the years in
   technologies like DSL, cable modems, Ethernet, and Wi-Fi, each
   advancing independently as new developments became ready.  As a
   result of this flexibility we have moved incrementally, one step at
   a time, from 56kb/s dial-up modems in the 1990s to Gb/s home
   Internet service and Gb/s wireless connectivity today.”

Is this relevant to the Responsiveness definition?  Suggest deleting it for brevity.




  4.  §4.1.1 is a bit circuitous.
A) Suggest moving the first 2 sentences of the 5th paragraph to the start of the section:

“The purpose of the Responsiveness Test is not to productively move
   data across the network, the way a normal application does.  The
   purpose of the Responsiveness Test is to, as quickly as possible,
   simulate a representative traffic load as if real applications were
   doing sustained data transfers and measure the resulting round-trip
   time occurring under those realistic conditions.”

B) Suggest deleting the second paragraph.
C) Suggest changing the beginning of the (former) 3rd sentence in the (former) 5th paragraph to: “For all of these reasons, using multiple ….”


  5.  §4.1.2.   s/Debuggability of/Debugging/
  6.  §4.1.3 again presumes no AQM, and it has some strange language.

“By definition, once goodput is maximized, buffers will start
   filling up, creating the "standing queue" that is characteristic
   of bufferbloat.”

This is not a true statement in all cases, and it certainly isn’t the definition of maximum goodput that buffers start filling up. Suggestion:

“When using TCP with traditional loss-based congestion control, in
   bottlenecks that don’t support Active Queue Management, once
   goodput is maximized, buffers will start filling up, creating the
   "standing queue" that is characteristic of bufferbloat.”


  7.  §4.1.3 first paragraph,

“At this moment the test starts measuring the responsiveness until
   it, too, reaches saturation.”

When reading this for the first time, it was somewhat ambiguous what “it” refers to in this sentence (the test or the responsiveness). Suggest:

“At this moment the test starts measuring the responsiveness until
   that metric reaches saturation.”


  8.  §4.1.3 second paragraph: “The algorithm notes that…”. “Notes” seems like a strange word here.  The algorithm isn’t taking notes. Maybe “presumes” or “is based on the presumption”.


  9.  §4.1.3 second paragraph:

“If new connections leave the throughput the same …”
Suggest:

“If new connections don’t result in an increase in throughput …”


  10.  §4.1.3 second paragraph:

“At this point, adding more connections will allow to achieve full
   buffer occupancy.”

In most cases, once the link is saturated with TCP traffic, the buffer will fill.  It doesn’t take additional connections in order to achieve full buffer occupancy.  The algorithm seems to be designed with this in mind, since it does not add more connections once goodput saturation is reached. Suggest deleting this sentence.


  11.  §4.1.3 second paragraph:

“Responsiveness will gradually decrease from now on, until the
   buffers are entirely full and reach stability of the
   responsiveness as well.”

Suggest:

“Responsiveness will decrease from now on, until the buffers are
   entirely full and stability is reached.”


  12.  §4.3  “http_l” appears twice and “http_s” appears 3 times, yet these appear to refer to the same value.
  13.  §4.3 the arithmetic appears to support the interpretation that the 100 probes per second are divided equally between foreign-probe types and self-probe types, but I don’t see this explicitly stated in the draft. Suggest stating this as a requirement. Also, I suspect that the two probe types should be alternated so that both components of the responsiveness calculation stabilize at approximately the same time.   Should the two probe types be spread out relative to each other (e.g. send self-probe, wait 10ms, send foreign probe, wait 10ms, repeat), or should they be sent back-to-back in order to both be measuring (nearly) the same network condition (e.g. send self-probe, send foreign-probe, wait 20ms, send foreign-probe, send self-probe, wait 20ms, repeat)? Suggest making these aspects explicit.
  14.  §4.3.1 Responsiveness formula is unnecessarily confusing. While it isn’t mathematically ambiguous, using the constructs 1/B*C and 1/B*(C+D) is more confusing for readers than the constructs C/B and (C+D)/B. Referring to the formula as a weighted mean isn’t very helpful for two reasons.  First, it is the denominator that is a mean calculation, not the whole expression. Second, the denominator would be more helpfully described as a mean of two values, one of which is itself a mean of three values.  This could be improved by calculating foreign_response_time = (TM(tcp_f) + TM(tls_f) + TM(http_f))/3 and then using that value in the Responsiveness formula, or alternatively, by explaining the weighting factors 1/6 and 1/2 in the text.
  15.  §4.4

“calculated in the most-recent MAD intervals”
Does this include the current interval?  Later text indicates that it does not.  Suggest:

“calculated in the immediately preceding MAD intervals”
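On the §4.3.1 formula comment above: the restructuring being suggested is arithmetically identical to the weighted mean. A short sketch with made-up trimmed-mean values (the 1/6 and 1/2 weights are the ones discussed above, and the variable names merely mirror the draft's tcp_f/tls_f/http_f/http_s labels):

```python
# Hypothetical trimmed-mean probe measurements in milliseconds.
tm_tcp_f, tm_tls_f, tm_http_f = 50.0, 80.0, 120.0   # foreign probes
tm_http_s = 60.0                                    # self probe

# Weighted mean as the draft structures it: 1/6 per foreign
# component, 1/2 for the self probe.
weighted_mean = (1 / 6) * (tm_tcp_f + tm_tls_f + tm_http_f) + (1 / 2) * tm_http_s

# Restructured form suggested above: average the three foreign
# components first, then average that with the self-probe value.
foreign_response_time = (tm_tcp_f + tm_tls_f + tm_http_f) / 3
restructured_mean = (foreign_response_time + tm_http_s) / 2

assert abs(weighted_mean - restructured_mean) < 1e-9

# Responsiveness expressed in round-trips per minute (60000 ms/minute).
rpm = 60000 / weighted_mean
print(rpm)
```

Since the two forms give the same number, the choice is purely about readability for implementers.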


  16.  §3 refers to the test taking 20 seconds, while §4.4 and §4.4.1 refer to one half of the test taking 20 seconds, and the full test taking 40 seconds.  Make these consistent.


  17.  §4.4

“It is left to the implementation what to do when stability is not reached within that time-frame.”
The very next subsection (4.4.1) seems to give some guidance here.  Why not recommend that mechanism?


  18.  §4.4.1 The definitions use the hard-coded value “4”.  Shouldn’t this instead be “MAD”?
  19.  §5.1.2, first paragraph, 4th sentence reads:

“Many networks deploy transparent TCP Proxies, firewalls doing deep
   packet-inspection, HTTP "accelerators",...”

Suggest completing this sentence rather than ending with an ellipsis.


  20.  §5.1.2 second paragraph.  Please provide a reference or definition of Smart Queue Management.  Does this term include AQM? If not, suggest adding AQM explicitly.
  21.  §6 title is “Responsiveness Test Server API” yet it contains requirements on both the server and the client, and those requirements go beyond the API.  Maybe: “Responsiveness Test Server and Client Protocol”?


  22.  §6 list item 4.

“… which contains configuration …”

Suggest:

“… which contains JSON configuration …”

to link this to the text in the next paragraph.


  23.  §7 the text in the first paragraph, beginning with “However” and through the end of the numbered list seems like it would fit better in §5.2.


  24.  §7.1 Can you provide an explicit example URL? (Maybe it is: https://nq.example.com/.well-known/nq)



  25.  §7.1 In the JSON example, the listed URLs and the test_endpoint URL are in completely different domains.  Is this an expected scenario?



  26.  §7.1 mentions 3 times that the content-type is application/json.  Maybe once is enough?