[ippm] review: draft-ietf-ippm-responsiveness-03

Sebastian Moeller <moeller0@gmx.de> Sun, 10 December 2023 15:51 UTC

From: Sebastian Moeller <moeller0@gmx.de>
To: IETF IPPM WG <ippm@ietf.org>
Date: Sun, 10 Dec 2023 16:50:59 +0100

Dear IPPM list,


I have read through the current version of the draft, draft-ietf-ippm-responsiveness-03, and here are my comments, notes and questions.
With the exception of the request to replace the single-byte small object with some form of timestamps, these are minor points and requests for clarification. All in all a fine draft that is informative, on point, and easy to read.


Please find my comments/notes/questions with an appropriate prefix below:
(I will make a few redactions denoted by [...] by simply cutting out parts
I have nothing to comment upon)


[...]
                Responsiveness under Working Conditions
                   draft-ietf-ippm-responsiveness-03

Abstract

   For many years, a lack of responsiveness, variously called lag,
   latency, or bufferbloat, has been recognized as an unfortunate, but
   common, symptom in today's networks.  Even after a decade of work on
   standardizing technical solutions, it remains a common problem for
   the end users.

NOTE: As much as I dislike that usage, people do indeed call it bufferbloat, even though bufferbloat is the cause while the observable symptom is decreased responsiveness (or increased latency under load). I also think that we do have quite decent ways to ameliorate the issue, yet it still remains a common problem. Well crafted paragraph ;)


   Everyone "knows" that it is "normal" for a video conference to have
   problems when somebody else at home is watching a 4K movie or
   uploading photos from their phone.  However, there is no technical
   reason for this to be the case.  In fact, various queue management
   solutions have solved the problem.

   Our networks remain unresponsive, not from a lack of technical
   solutions, but rather a lack of awareness of the problem and
   deployment of its solutions.  We believe that creating a tool that
   measures the problem and matches people's everyday experience will
   create the necessary awareness, and result in a demand for solutions.

   This document specifies the "Responsiveness Test" for measuring
   responsiveness.  It uses common protocols and mechanisms to measure
   user experience specifically when the network is under working
   conditions.  The measurement is expressed as "Round-trips Per Minute"
   (RPM) and should be included with throughput (up and down) and idle
   latency as critical indicators of network quality.

NOTE: Again, well crafted abstract that should be accessible to end-users (even though few end-users will end up reading this directly)...

[...]

1.  Introduction

   For many years, a lack of responsiveness, variously called lag,
   latency, or bufferbloat, has been recognized as an unfortunate, but
   common, symptom in today's networks [Bufferbloat].  Solutions like
   fq_codel [RFC8290], PIE [RFC8033] or L4S [RFC9330] have been
   standardized and are to some extent widely implemented.
   Nevertheless, people still suffer from bufferbloat.

NOTE: I think this should mention cake as well:
T. Høiland-Jørgensen, D. Täht and J. Morton, "Piece of CAKE: A Comprehensive Queue Management Solution for Home Gateways," 2018 IEEE International Symposium on Local and Metropolitan Area Networks (LANMAN), Washington, DC, USA, 2018, pp. 37-42, doi: 10.1109/LANMAN.2018.8475045.
I note this is not an IETF standard but a pretty well written paper laying out part of the problem as well as one approach to a solution.
Put differently, cake is as close to a single-piece solution to the issue as we have come, aimed squarely at end-user/leaf networks.

[...]

1.1.  Terminology

   A word about the term "bufferbloat" -- the undesirable latency that
   comes from a router or other network equipment buffering too much
   data.  This document uses the term as a general description of bad
   latency, using more precise wording where warranted.
   "Latency" is a poor measure of responsiveness, because it can be hard
   for the general public to understand.  The units are unfamiliar
   ("what is a millisecond?") and counterintuitive ("100 msec -- that
   sounds good -- it's only a tenth of a second!").

NOTE: I still disagree that people will not be able to understand that more/longer can be worse; after all, most people who have ever done their taxes, or had to wait for transportation, understand that. But I also accept that "larger is better" is more intuitive and, especially, a much better message to use when trying to convince people; a bit of positivity goes a loooong way. So I agree with this section.


2.  Design Constraints

   There are many challenges to defining measurements of the Internet:
   the dynamic nature of the Internet, the diverse nature of the
   traffic, the large number of devices that affect traffic, the
   difficulty of attaining appropriate measurement conditions, diurnal
   traffic patterns, and changing routes.

   In order to minimize the effects of these challenges, it's best to
   keep the test duration relatively short.

NOTE: Again, I mildly disagree. It is good for the test to typically be relatively short, but I would recommend making the duration end-user configurable and occasionally proposing a longer measurement time (in the range of 0.5 to 2 minutes; especially with WiFi we observed cyclic behaviour with such periods that transiently decreases responsiveness noticeably and that can/will also affect, say, video conferences over WiFi). I defer to the authors on whether the tool should prompt the user to opt in to longer tests or whether these should simply be inserted at random (but the duration should be communicated to the end-user).


   TCP and UDP traffic, or traffic on ports 80 and 443, may take
   significantly different paths over the network between source and
   destination and be subject to entirely different Quality of Service
   (QoS) treatment.  A good test will use standard transport-layer
   traffic -- typical for people's use of the network -- that is subject
   to the transport layer's congestion control algorithms that might
   reduce the traffic's rate and thus its buffering in the network.

NOTE: Should this not also mention ICMP? After all, ping, traceroute, mtr, and pingplotter all default to using ICMP for delay measurements and hence are often used and known by end-users. What is far less on people's radar is that ICMP traffic is often rate-limited and deprioritized (or, worse, up-prioritized).

   Traditionally, one thinks of bufferbloat happening in the network,
   i.e., on routers and switches of the Internet.  However, the
   networking stacks of the clients and servers can have huge buffers.
   Data sitting in TCP sockets or waiting for the application to send or
   read causes artificial latency, and affects user experience the same
   way as in-network bufferbloat.

NOTE: One more reason to mention ICMP and explain why not using it is the right approach here!

   Finally, it is crucial to recognize that significant queueing only
   happens on entry to the lowest-capacity (or "bottleneck") hop on a
   network path.  For any flow of data between two endpoints there is
   always one hop along the path where the capacity available to that
   flow at that hop is the lowest among all the hops of that flow's path
   at that moment in time.  It is important to understand that the
   existence of a lowest-capacity hop on a network path and a buffer to
   smooth bursts of data is not itself a problem.  In a heterogeneous
   network like the Internet it is inevitable that there must
   necessarily be some hop along the path with the lowest capacity for
   that path.  If that hop were to be improved, then some other hop

NOTE: For IETFers that is fine; for end-users maybe add "improved by increasing its capacity"?

[...]

   In order to discover the depth of the buffer at the bottleneck hop,
   the proposed Responsiveness Test mimics normal network operations and
   data transfers, with the goal of filling the bottleneck buffer to
   capacity, and then measures the resulting end-to-end latency under
   these so-called working conditions.  A well-managed bottleneck queue
   keeps its occupancy under control, resulting in consistently low
   round-trip times and consistently good responsiveness.  A poorly
   managed bottleneck queue will not.

NOTE: Yes! Spelling out the alternatives explicitly is a great idea here.


3.  Goals

   The algorithm described here defines the Responsiveness Test that
   serves as a means of quantifying user experience of latency in their
   network connection.  Therefore:

   1.  Because today's Internet traffic primarily uses HTTP/2 over TLS,
       the test's algorithm should use that protocol.

       As a side note: other types of traffic are gaining in popularity
       (HTTP/3) and/or are already being used widely (RTP).  Traffic
       prioritization and QoS rules on the Internet may subject traffic
       to completely different paths: these could also be measured
       separately.

   2.  Because the Internet is marked by the deployment of countless
       middleboxes like transparent TCP proxies or traffic
       prioritization for certain types of traffic, the Responsiveness
       Test algorithm must take into account their effect on TCP-
       handshake [RFC0793], TLS-handshake, and request/response.

NOTE: I would motivate that simply by the fact that a lot (most?) of traffic uses TLS, so that is the traffic type that needs to be tested for a realistic measurement?


   3.  Because the goal of the test is to educate end users, the results
       should be expressed in an intuitive, nontechnical form and not
       commit the user to spend a significant amount of their time (we
       target 20 seconds).

NOTE: As above, I think that a) tests longer than 20 seconds should be requestable by the user and b) the RPM application should occasionally propose to run for a longer duration (with the user having to opt in to the longer test). I was initially hesitant to believe reliable full saturation can be achieved within that time budget, but measurement data (on my own link) convinced me otherwise; I still think running longer tests will be useful, as we observed other latency/capacity modulations on periods well above 20 seconds (e.g. some WiFi probing every 30 seconds and/or every minute).

[...]

4.1.1.  Single-flow vs multi-flow

   A single TCP connection may not be sufficient to reach the capacity
   and full buffer occupancy of a path quickly.  Using a 4MB receive
   window, over a network with a 32 ms round-trip time, a single TCP
   connection can achieve up to 1Gb/s throughput.  Additionally, deep
   buffers along the path between the two endpoints may be significantly
   larger than 4MB.  TCP allows larger receive window sizes, up to 1GB.
   However, most transport stacks aggressively limit the size of the
   receive window to avoid consuming too much memory.

   Thus, the only way to achieve full capacity and full buffer occupancy
   on those networks is by creating multiple connections, allowing to
   actively fill the bottleneck's buffer to achieve maximum working
   conditions.

NOTE: Also, for longer paths it is not guaranteed that a single flow can saturate the link (as now the buffering of all intermediate nodes comes into play as well). Since this test aims at all endpoints, including geostationary satellite users, a single-flow test will not do...
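
As a back-of-the-envelope illustration of the receive-window limit (a sketch with my own assumed numbers, not from the draft), a single connection's throughput is capped at roughly rwnd/RTT:

    # Receive-window-limited throughput ceiling of a single TCP connection.
    # The window size and RTT values below are assumptions for illustration.
    def rwnd_limited_mbps(rwnd_bytes, rtt_seconds):
        return rwnd_bytes * 8 / rtt_seconds / 1e6

    print(rwnd_limited_mbps(4 * 2**20, 0.032))  # ~1049 Mb/s, the draft's 4MB/32ms example
    print(rwnd_limited_mbps(4 * 2**20, 0.600))  # ~56 Mb/s on a ~600 ms geostationary path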


   Even if a single TCP connection would be able to fill the
   bottleneck's buffer, it may take some time for a single TCP
   connection to ramp up to full speed.  One of the goals of the
   Responsiveness Test is to help the user quickly measure their
   network.  As a result, the test must load the network, take its
   measurements, and then finish as fast as possible.

   Finally, traditional loss-based TCP congestion control algorithms
   react aggressively to packet loss by reducing the congestion window.
   This reaction (intended by the protocol design) decreases the
   queueing within the network, making it harder to determine the depth
   of the bottleneck queue reliably.

   The purpose of the Responsiveness Test is not to productively move
   data across the network, the way a normal application does.  The
   purpose of the Responsiveness Test is to, as quickly as possible,
   simulate a representative traffic load as if real applications were
   doing sustained data transfers and measure the resulting round-trip
   time occurring under those realistic conditions.  Because of this,
   using multiple simultaneous parallel connections allows the
   Responsiveness Test to complete its task more quickly, in a way that
   overall is less disruptive and less wasteful of network capacity than
   a test using a single TCP connection that would take longer to bring
   the bottleneck hop to a stable saturated state.

   One of the configuration parameters for the test is an upper bound on
   the number of parallel load-generating connections.  We recommend a
   default value for this parameter of 16.

QUESTION: Is there a theoretical derivation of that number or is this simply an empirically learned number of what tends to be enough in the existing internet? (I am all for recommending an upper limit, to avoid easy resource exhaustion).



4.1.2.  Parallel vs Sequential Uplink and Downlink

   Poor responsiveness can be caused by queues in either (or both) the
   upstream and the downstream direction.  Furthermore, both paths may
   differ significantly due to access link conditions (e.g., 5G
   downstream and LTE upstream) or routing changes within the ISPs.  To
   measure responsiveness under working conditions, the algorithm must
   explore both directions.

   One approach could be to measure responsiveness in the uplink and
   downlink in parallel.  It would allow for a shorter test run-time.

   However, a number of caveats come with measuring in parallel:

   *  Half-duplex links may not permit simultaneous uplink and downlink
      traffic.  This restriction means the test might not reach the
      path's capacity in both directions at once and thus not expose all
      the potential sources of low responsiveness.

   *  Debuggability of the results becomes harder: During parallel
      measurement it is impossible to differentiate whether the observed
      latency happens in the uplink or the downlink direction.

   Thus, we recommend testing uplink and downlink sequentially.
   Parallel testing is considered a future extension.

NOTE: As I have written before, I do not think this section actually should remain. Parallel testing in my experience is quite helpful to show issues, and is not really all that unrealistic: many modern DOCSIS links are provisioned at a 20:1 ratio, e.g. 1000/50 Mbps, and traditional TCP Reno with delayed ACKs causes ~1/40 of the forward traffic as reverse ACK traffic, leaving only 50% of upload capacity available for other traffic... it is not that hard to saturate 25 Mbps of capacity with normal internet usage (e.g. data synchronization traffic to cloud storage).
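
To make the arithmetic behind that claim explicit (a rough sketch; the 20:1 provisioning and the ~1/40 ACK ratio are the assumptions stated above):

    # Reverse ACK traffic on an asymmetric link, assuming delayed ACKs (~1/40 of forward traffic)
    down_mbps = 1000.0
    up_mbps = 50.0
    ack_mbps = down_mbps / 40                # ~25 Mb/s of ACK traffic when the downlink is saturated
    remaining_up_mbps = up_mbps - ack_mbps   # ~25 Mb/s, i.e. only ~50% of the uplink left over
    print(ack_mbps, remaining_up_mbps)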


4.1.3.  Achieving Full Buffer Utilization

   The Responsiveness Test gradually increases the number of TCP
   connections (known as load-generating connections) and measures
   "goodput" (the sum of actual data transferred across all connections
   in a unit of time) continuously.  By definition, once goodput is

QUESTION: "in a unit of time" is that idiomatic english? I would have expected "in units of time" instead, but I am not a native speaker....

   maximized, buffers will start filling up, creating the "standing
   queue" that is characteristic of bufferbloat.  At this moment the
   test starts measuring the responsiveness until it, too, reaches
   saturation.  At this point we are creating the worst-case scenario
   within the limits of the realistic traffic pattern.

   The algorithm notes that throughput increases rapidly until TCP
   connections complete their TCP slow-start phase.  At that point,
   throughput eventually stalls, often due to receive window
   limitations, particularly in cases of high network bandwidth, high
   network round-trip time, low receive window size, or a combination of
   all three.  The only means to further increase throughput is by
   adding more TCP connections to the pool of load-generating
   connections.  If new connections leave the throughput the same, full
   link utilization has been reached.  At this point, adding more
   connections will allow to achieve full buffer occupancy.

QUESTION: Above we argue that we measure goodput across all connections, and here we talk about throughput; should we not also call this goodput here, to stick to our own definition?


   Responsiveness will gradually decrease from now on, until the buffers
   are entirely full and reach stability of the responsiveness as well.

4.2.  Test parameters

   A number of parameters can be used to configure the test methodology.
   The following list contains the names of those parameters and their
   default values.  The detailed description of the methodology that
   follows will explain how these parameters are being used.  Experience
   has shown that the default values for these parameters allow for a
   low runtime for the test and produce accurate results in a wide range
   of environments.

     +======+==============================================+=========+
     | Name | Explanation                                  | Default |
     |      |                                              | Value   |
     +======+==============================================+=========+
     | MAD  | Moving Average Distance (number of intervals | 4       |
     |      | to take into account for the moving average) |         |
     +------+----------------------------------------------+---------+
     | ID   | Interval duration at which the algorithm     | 1       |
     |      | reevaluates stability                        | second  |
     +------+----------------------------------------------+---------+
     | TMP  | Trimmed Mean Percentage to be trimmed        | 95%     |
     +------+----------------------------------------------+---------+
     | SDT  | Standard Deviation Tolerance for stability   | 5%      |
     |      | detection                                    |         |
     +------+----------------------------------------------+---------+
     | MNP  | Maximum number of parallel transport-layer   | 16      |
     |      | connections                                  |         |
     +------+----------------------------------------------+---------+
     | MPS  | Maximum responsiveness probes per second     | 100     |
     +------+----------------------------------------------+---------+
     | PTC  | Percentage of Total Capacity the probes are  | 5%      |
     |      | allowed to consume                           |         |
     +------+----------------------------------------------+---------+

                                  Table 1

4.3.  Measuring Responsiveness

   Measuring responsiveness while achieving working conditions is an
   iterative process.  Moreover, it requires a sufficiently large sample
   of measurements to have confidence in the results.

   The measurement of the responsiveness happens by sending probe-
   requests.  There are two types of probe requests:

   1.  An HTTP GET request on a connection separate from the load-
       generating connections ("foreign probes").  This probe type
       mimics the time it takes for a web browser to connect to a new
       web server and request the first element of a web page (e.g.,
       "index.html"), or the startup time for a video streaming client
       to launch and begin fetching media.

   2.  An HTTP GET request multiplexed on the load-generating
       connections ("self probes").  This probe type mimics the time it
       takes for a video streaming client to skip ahead to a different
       chapter in the same video stream, or for a navigation mapping
       application to react and fetch new map tiles when the user
       scrolls the map to view a different area.  In a well functioning
       system, fetching new data over an existing connection should take
       less time than creating a brand new TLS connection from scratch
       to do the same thing.

   Foreign probes will provide three (3) sets of data-points: First, the
   duration of the TCP-handshake (noted hereafter as tcp_f).  Second,
   the TLS round-trip-time (noted tls_f).  For this, it is important to
   note that different TLS versions have a different number of round-
   trips.  Thus, the TLS establishment time needs to be normalized to
   the number of round-trips the TLS handshake takes until the
   connection is ready to transmit data.  And third, the HTTP elapsed
   time between issuing the GET request for a 1-byte object and
   receiving the entire response (noted http_f).

QUESTION: I think instead of getting a 1-byte object we should size this object to allow for one or two timestamps and optionally return both the time the server received the request as well as the time the server sent out the response, akin to ICMP timestamps (type 13/14) or NTP time packets (both also reflect the timestamp from the sender, but here that would not be applicable; the point is that both separate out the server-side receive and transmit timestamps).
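
To sketch what I have in mind (purely illustrative; the 16-byte layout and nanosecond encoding are just one possible choice, not something the draft specifies):

    # Hypothetical small-object body: two 64-bit timestamps in network byte order,
    # the time the server received the request and the time it sent the response.
    import struct

    def encode_small_object(t_received_ns, t_sent_ns):
        return struct.pack("!QQ", t_received_ns, t_sent_ns)   # 16 bytes total

    def decode_small_object(body):
        return struct.unpack("!QQ", body)                      # (t_received_ns, t_sent_ns)

    # Even without client/server clock sync, (t_sent_ns - t_received_ns) exposes the
    # server-side processing time, which helps separate network delay from server delay.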


   Self probes will provide a single data-point that measures the
   duration of time between when the HTTP GET request for the 1-byte
   object is issued on the load-generating connection and when the full
   HTTP response has been received (noted http_s).

   tcp_f, tls_f, http_f and http_s are all measured in milliseconds.

   The more probes that are sent, the more data available for
   calculation.  In order to generate as much data as possible, the
   Responsiveness Test specifies that a client issue these probes
   regularly.  There is, however, a risk that on low-capacity networks
   the responsiveness probes themselves will consume a significant
   amount of the capacity.  Because the test mandates first saturating
   capacity before starting to probe for responsiveness, the test will
   have an accurate estimate of how much of the capacity the
   responsiveness probes will consume and never send more probes than
   the network can handle.

NOTE: Elegant!


   Limiting the data used by probes can be done by providing an estimate
   of the number of bytes exchanged for a each of the probe types.

NIT: superfluous "a" between for and each?


   Taking TCP and TLS overheads into account, we can estimate the amount
   of data exchanged for a foreign probe to be around 5000 bytes.  For
   self probes we can expect an overhead of no more than 1000 bytes.

NOTE: Given this amount it seems pretty insubstantial whether the queried object is 1 byte in size, 8 bytes (ICMP-style 32-bit timestamps) or even 16 bytes (NTP-style 64-bit timestamps), so I would respectfully ask that we add timestamps to the spec (at least as an option)!


   Given this information, we recommend that a test client does not send
   more than MPS (Maximum responsiveness Probes per Second - default to
   100) probes per ID.  The probes should be spread out equally over the
   duration of the interval.  The test client should uniformly and
   randomly select from the active load-generating connections on which
   to send self probes.

   According to the default parameter values, the probes will consume
   300 KB (or 2400Kb) of data per second, meaning a total capacity
   utilization of 2400 Kbps for the probing.
   On high-speed networks, these default parameter values will provide a
   significant amount of samples, while at the same time minimizing the
   probing overhead.  However, on severely capacity-constrained networks
   the probing traffic could consume a significant portion of the
   available capacity.  The Responsiveness Test must adjust its probing
   frequency in such a way that the probing traffic does not consume
   more than PTC (Percentage of Total Capacity - default to 5%) of the
   available capacity.
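
NOTE: For what it's worth, the 2400 Kbps figure appears to assume an even split between foreign and self probes; roughly (my assumption, not spelled out in the draft):

    # Probe overhead at the default MPS of 100 probes per second,
    # assuming half foreign (~5000 bytes) and half self (~1000 bytes) probes.
    mps = 100
    bytes_per_second = (mps // 2) * 5000 + (mps // 2) * 1000   # 300,000 B/s = 300 KB/s
    print(bytes_per_second * 8 / 1000)                          # 2400 Kbps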

QUESTION: Should we require a fixed minimal number of latency samples or is the idea to judge this mainly by stability of the returned latency estimate?


4.3.1.  Aggregating the Measurements

   The algorithm produces sets of 4 times for each probe, namely: tcp_f,
   tls_f, http_f, http_l (from the previous section).  The
   responsiveness of the network connection being tested evolves over
   time as buffers gradually reach saturation.  Once the buffers are
   saturated, responsiveness will stabilize.  Thus, the final
   calculation of network responsiveness considers the last MAD (Moving
   Average Distance - default to 4) intervals worth of completed
   responsiveness probes.

NOTE: for a network limited by MPS that means 400 latency samples, which is pretty impressive! For a PTC-limited network, however, that could be considerably less. Could we require a verbose mode that reports the number of latency samples taken into account and that also reports some measure of variance?


   Over that period of time, a large number of samples will have been
   collected.  These may be affected by noise in the measurements, and
   outliers.  Thus, to aggregate these we suggest using a trimmed mean

NOTE: We do not know whether a sample is an outlier or reflects true network delay... this is where a dual timestamp from the server might help to disambiguate.


   at the TMP (Trimmed Mean Percentage - default to 95%) percentile,
   thus providing the following numbers: TM(tcp_f), TM(tls_f),
   TM(http_f), TM(http_l).

QUESTION: The trimmed mean traditionally is applied symmetrically, so a 5% trimmed mean would leave out the lowest 2.5% of samples and the highest 2.5% of samples (estimating in case 2.5% does not fall on a whole sample count). My understanding so far is that the proposal is to use a single-sided trimmed mean instead that will only remove the highest 5% of the data and include everything below. Personally I prefer the symmetric two-sided trimmed mean (not removing the lowest samples will bias the resulting mean towards lower numbers), but I also assume the difference is not going to be all that great between these two options. I do think, however, this draft should be explicit about what should be calculated so that different implementations will return similar numbers.
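
To make the two readings concrete (a sketch of both variants; which one is intended is exactly my question):

    import statistics

    def one_sided_trimmed_mean(samples, keep_fraction=0.95):
        # Keep the lowest 95% of samples, drop only the highest 5%.
        s = sorted(samples)
        return statistics.mean(s[: max(1, int(len(s) * keep_fraction))])

    def symmetric_trimmed_mean(samples, trim_fraction=0.05):
        # Drop the lowest 2.5% and the highest 2.5% of samples.
        s = sorted(samples)
        k = int(len(s) * trim_fraction / 2)
        return statistics.mean(s[k: len(s) - k] if k else s)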


   The responsiveness is then calculated as the weighted mean:

   Responsiveness = 60000 /
   (1/6*(TM(tcp_f) + TM(tls_f) + TM(http_f)) + 1/2*TM(http_s))

   This responsiveness value presents round-trips per minute (RPM).

NOTE: I also once proposed to calculate responsiveness numbers for each sample tuple and perform the trimmed mean on those aggregate numbers, but I believe the differences will be small and the results essentially similar... (here, however, the draft text is as explicit as I would like the trimmed-mean section to be).
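
For illustration, plugging invented trimmed-mean values (in milliseconds) into the formula above:

    # All four values below are made up purely to show the arithmetic.
    tm_tcp_f, tm_tls_f, tm_http_f, tm_http_s = 30.0, 35.0, 40.0, 50.0
    rpm = 60000 / ((tm_tcp_f + tm_tls_f + tm_http_f) / 6 + tm_http_s / 2)
    print(round(rpm))   # ~1412 RPM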


4.4.  Final Algorithm

   Considering the previous two sections, where we explained the meaning
   of working conditions and the definition of responsiveness, we can
   now describe the design of final algorithm.  In order to measure the

NIT: "the design of the final algorithm"?

   worst-case latency, we need to transmit traffic at the full capacity
   of the path as well as ensure the buffers are filled to the maximum.
   We can achieve this by continuously adding HTTP sessions to the pool
   of connections in an ID (Interval duration - default to 1 second)
   interval.  This technique ensures that we quickly reach full capacity
   full buffer occupancy.  First, the algorithm reaches stability for
   the goodput.  Once goodput stability has been achieved,
   responsiveness probes will be transmitted until responsiveness
   stability is reached.

   We consider both goodput and responsiveness to be stable when the
   standard deviation of the moving averages of the responsiveness
   calculated in the most-recent MAD intervals is within SDT (Standard
   Deviation Tolerance - default to 5%) of the moving average calculated
   in the most-recent ID.

QUESTION: Now I am confused. Here is how I interpret this (please correct me where I am wrong):
we calculate responsiveness for each ID, then calculate the standard deviation over the most recent MAD IDs, and if that number is <= mean(responsiveness) * (SDT/100) we report that average responsiveness as the final number?
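
In code form, this is the check I believe is being described (my interpretation, not normative):

    import statistics

    def is_stable(per_interval_values, sdt_percent=5.0, mad=4):
        # per_interval_values holds one aggregate (goodput or responsiveness) per ID interval.
        window = per_interval_values[-mad:]
        if len(window) < mad:
            return False
        return statistics.stdev(window) <= statistics.mean(window) * sdt_percent / 100.0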



   The following algorithm reaches working conditions of a network by
   using HTTP/2 upload (POST) or download (GET) requests of infinitely
   large files.  The algorithm is the same for upload and download and
   uses the same term "load-generating connection" for each.  The
   actions of the algorithm take place at regular intervals.  For the
   current draft the interval is defined as one second.

   Where

   *  i: The index of the current interval.  The variable i is
      initialized to 0 when the algorithm begins and increases by one
      for each interval.

   *  moving average aggregate goodput at interval p: The number of
      total bytes of data transferred within interval p and the MAD - 1
      immediately preceding intervals, divided by MAD times ID.

   the steps of the algorithm are:

   *  Create a load-generating connection.

   *  At each interval:

      -  Create an additional load-generating connection.

      -  If goodput has not saturated:

         o  Compute the moving average aggregate goodput at interval i
            as current_average.

         o  If the standard deviation of the past MAD average goodput
            values is less than SDT of the current_average, declare
            goodput saturation and move on to probe responsiveness.

QUESTION: Does "move on to probe responsiveness" differ from the next part just below?



      -  If goodput saturation has been declared:

         o  Compute the responsiveness at interval i as
            current_responsiveness.

         o  If the standard deviation of the past MAD responsiveness
            values is less than SDT of the current_responsiveness,

QUESTION: I am confused again: standard deviation can only be calculated over multiple samples, but current_responsiveness is a singular value... or is the idea that the check passes if current_responsiveness is in the range of the average responsiveness +/- the responsiveness standard deviation (calculated over MAD samples)?


            declare responsiveness saturation and report
            current_responsiveness as the final test result.

   In Section 3, it is mentioned that one of the goals is that the test
   finishes within 20 seconds.  It is left to the implementation what to
   do when stability is not reached within that time-frame.  For
   example, an implementation might gather a provisional responsiveness
   measurement or let the test run for longer.

NOTE: If we specify that a user should be able to optionally request a longer measurement time, the test output could instruct the user which command invocation to use to achieve that?


   Finally, if at any point one of these connections terminates with an
   error, the test should be aborted.

4.4.1.  Confidence of test-results

   As described above, a tool running the algorithm typically defines a
   time-limit for the execution of each of the stages.  For example, if
   the tool allocates a total run-time of 40 seconds, and it executes a
   full downlink followed by a uplink test, it may allocate 10 seconds
   to each of the saturation-stages (downlink capacity saturation,
   downlink responsiveness saturation, uplink capacity saturation,
   uplink responsiveness saturation).

   As the different stages may or may not reach stability, we can define
   a "confidence score" for the different metrics (capacity and
   responsiveness) the methodology was able to measure.

   We define "Low" confidence in the result if the algorithm was not
   even able to execute 4 iterations of the specific stage.  Meaning,

QUESTION: Instead of hardcoding 4 how about referencing MAD here?


   the moving average is not taking the full window into account.

   We define "Medium" confidence if the algorithm was able to execute at
   least 4 iterations, but did not reach stability based on standard
   deviation tolerance.

QUESTION: Instead of hardcoding 4 how about referencing MAD here?


   We define "High" confidence if the algorithm was able to fully reach
   stability based on the defined standard deviation tolerance.

   It must be noted that depending on the chosen standard deviation
   tolerance or other parameters of the methodology and the network-
   environment it may be that a measurement never converges to a stable
   point.  This is expected and part of the dynamic nature of networking
   and the accompanying measurement inaccuracies.  Which is why the
   importance of imposing a time-limit is so crucial, together with an
   accurate depiction of the "confidence" the methodology was able to
   generate.

QUESTION: Should we propose a recommended verbiage to convey the confidence to the user, to make different implementations easier to compare?

5.  Interpreting responsiveness results

   The described methodology uses a high-level approach to measure
   responsiveness.  By executing the test with regular HTTP requests a
   number of elements come into play that will influence the result.
   Contrary to more traditional measurement methods the responsiveness
   metric is not only influenced by the properties of the network but
   can significantly be influenced by the properties of the client and
   the server implementations.  This section describes how the different
   elements influence responsiveness and how a user may differentiate
   them when debugging a network.

NOTE: Maybe add that this is fully justified, as these factors also directly influence the perceived responsiveness?

5.1.  Elements influencing responsiveness

   Due to the HTTP-centric approach of the measurement methodology a
   number of factors come into play that influence the results.  Namely,
   the client-side networking stack (from the top of the HTTP-layer all
   the way down to the physical layer), the network (including potential
   transparent HTTP "accelerators"), and the server-side networking
   stack.  The following outlines how each of these contributes to the
   responsiveness.

5.1.1.  Client side influence

   As the driver of the measurement, the client-side networking stack
   can have a large influence on the result.  The biggest influence of
   the client comes when measuring the responsiveness in the uplink
   direction.  Load-generation will cause queue-buildup in the transport
   layer as well as the HTTP layer.  Additionally, if the network's
   bottleneck is on the first hop, queue-buildup will happen at the
   layers below the transport stack (e.g., NIC firmware).

   Each of these queue build-ups may cause latency and thus low
   responsiveness.  A well designed networking stack would ensure that
   queue-buildup in the TCP layer is kept at a bare minimum with
   solutions like TCP_NOTSENT_LOWAT [RFC9293].  At the HTTP/2 layer it
   is important that the load-generating data is not interfering with
   the latency-measuring probes.  For example, the different streams
   should not be stacked one after the other but rather be allowed to be
   multiplexed for optimal latency.  The queue-buildup at these layers
   would only influence latency on the probes that are sent on the load-
   generating connections.

   Below the transport layer many places have a potential queue build-
   up.  It is important to keep these queues at reasonable sizes or that
   they implement techniques like FQ-Codel.  Depending on the techniques
   used at these layers, the queue build-up can influence latency on
   probes sent on load-generating connections as well as separate
   connections.  If flow-queuing is used at these layers, the impact on
   separate connections will be negligible.

5.1.2.  Network influence

   The network obviously is a large driver for the responsiveness
   result.  Propagation delay from the client to the server as well as
   queuing in the bottleneck node will cause latency.  Beyond these
   traditional sources of latency, other factors may influence the
   results as well.  Many networks deploy transparent TCP Proxies,
   firewalls doing deep packet-inspection, HTTP "accelerators",... As
   the methodology relies on the use of HTTP/2, the responsiveness
   metric will be influenced by such devices as well.

   The network will influence both kinds of latency probes that the
   responsiveness tests sends out.  Depending on the network's use of
   Smart Queue Management and whether this includes flow-queuing or not,
   the latency probes on the load-generating connections may be
   influenced differently than the probes on the separate connections.

NOTE: In verbose mode a responsiveness client probably should return separate numbers for these two probe types?


5.1.3.  Server side influence

   Finally, the server-side introduces the same kind of influence on the
   responsiveness as the client-side, with the difference that the
   responsiveness will be impacted during the downlink load generation.

5.2.  Root-causing Responsiveness

   Once a responsiveness result has been generated one might be tempted
   to try to localize the source of a potential low responsiveness.  The
   responsiveness measurement is however aimed at providing a quick,
   top-level view of the responsiveness under working conditions the way
   end-users experience it.  Localizing the source of low responsiveness
   involves however a set of different tools and methodologies.

   Nevertheless, the Responsiveness Test allows to gain some insight
   into what the source of the latency is.  The previous section
   described the elements that influence the responsiveness.  From there
   it became apparent that the latency measured on the load-generating
   connections and the latency measured on separate connections may be
   different due to the different elements.

   For example, if the latency measured on separate connections is much
   less than the latency measured on the load-generating connections, it
   is possible to narrow down the source of the additional latency on
   the load-generating connections.  As long as the other elements of
   the network don't do flow-queueing, the additional latency must come
   from the queue build-up at the HTTP and TCP layer.  This is because
   all other bottlenecks in the network that may cause a queue build-up
   will be affecting the load-generating connections as well as the
   separate latency probing connections in the same way.

NOTE: This however requires that the output of the measurement tool exposes these two as separate numbers, no?


6.  Responsiveness Test Server API

   The responsiveness measurement is built upon a foundation of standard
   protocols: IP, TCP, TLS, HTTP/2.  On top of this foundation, a
   minimal amount of new "protocol" is defined, merely specifying the
   URLs that used for GET and PUT in the process of executing the test.

   Both the client and the server MUST support HTTP/2 over TLS.  The
   client MUST be able to send a GET request and a POST.  The server
   MUST be able to respond to both of these HTTP commands.  The server
   MUST have the ability to provide content upon a GET request.  The
   server MUST use a packet scheduling algorithm that minimizes internal
   queueing to avoid affecting the client's measurement.

QUESTION: While I fully agree that is what a server should do, I would prefer to get numbers over not getting numbers... so maybe make this a SHOULD? Also, if the server were to return the time the request was received and the time the response was sent out, maybe we could factor out the local processing?


   As clients and servers become deployed that use L4S congestion
   control (e.g., TCP Prague with ECT(1) packet marking), for their
   normal traffic when it is available, and fall back to traditional
   loss-based congestion controls (e.g., Reno or CUBIC) otherwise, the
   same strategy SHOULD be used for Responsiveness Test traffic.  This
   is RECOMMENDED so that the synthetic traffic generated by the
   Responsiveness Test mimics real-world traffic for that server.

NOTE: This clearly needs to be under end-user control, that is, there should be a way for the user to request/enforce either option (I am fine with the default being what the local OS defaults to). L4S is an experimental RFC and is expected to have odd failure modes, so an L4S-aware measurement tool should keep in mind that L4S itself needs to be measured/tested. At the very least the results need to report whether L4S was used or not.



   Delay-based congestion-control algorithms (e.g., Vegas, FAST, BBR)
   SHOULD NOT be used for Responsiveness Test traffic because they take
   much longer to discover the depth of the bottleneck buffers.  Delay-
   based congestion-control algorithms seek to mitigate the effects of
   bufferbloat, by detecting and responding to early signs of increasing
   round-trip delay, and reducing the amount of data they have in flight
   before the bottleneck buffer fills up and overflows.  In a world
   where bufferbloat is common, this is a pragmatic mitigation to allow
   software to work better in that environment.  However, that approach
   does not fix the underlying problem of bufferbloat; it merely avoids
   it in some cases, and allows the problem in the network to persist.
   For a diagnostic tool made to identify symptoms of bufferbloat in the
   network so that they can be fixed, using a transport protocol
   explicitly designed to mask those symptoms would be a poor choice,
   and would require the test to run for much longer to deliver the same
   results.

NOTE: From an end-user perspective it seems less relevant whether the problem is fixed in the network or in the end points (and L4S requires end-point modification as well). BUT I would mention that the choice of TCP congestion control is not under the control of the end-user (especially for downloads), so testing with e.g. pure BBR is not going to give a realistic estimate of the path characteristics, and even a single non-delay-based flow can noticeably decrease responsiveness for the whole link.

   The server MUST respond to 4 URLs:

   1.  A "small" URL/response: The server must respond with a status
       code of 200 and 1 byte in the body.  The actual message content
       is irrelevant.  The server SHOULD specify the content-type as
       application/octet-stream.  The server SHOULD minimize the size,
       in bytes, of the response fields that are encoded and sent on the
       wire.

NOTE: I would like something like 8 or 16 bytes here to allow either ICMP- or NTP-style timestamps for the request-received and response-sent times, as these would help solve a number of challenges... (Maybe requiring clock sync, but even without sync a consistent time will still be useful!)


   2.  A "large" URL/response: The server must respond with a status
       code of 200 and a body size of at least 8GB.  The server SHOULD
       specify the content-type as application/octet-stream.  The body
       can be bigger, and may need to grow as network speeds increases
       over time.  The actual message content is irrelevant.  The client
       will probably never completely download the object, but will
       instead close the connection after reaching working condition and
       making its measurements.

NOTE: Some ISPs in the past used compression on access links, which will not give realistic results for highly compressible data chunks, but this is likely solved for TLS-encrypted data. HOWEVER, if we allow measurements without TLS this issue becomes potentially relevant again...


   3.  An "upload" URL/response: The server must handle a POST request
       with an arbitrary body size.  The server should discard the
       payload.  The actual POST message content is irrelevant.  The
       client will probably never completely upload the object, but will
       instead close the connection after reaching working condition and
       making its measurements.

NOTE: Some ISPs in the past used compression on access links, which will not give realistic results for highly compressible data chunks, but this is likely solved for TLS-encrypted data. HOWEVER, if we allow measurements without TLS this issue becomes potentially relevant again...


   4.  A .well-known URL [RFC8615] which contains configuration
       information for the client to run the test (See Section 7,
       below.)

   The client begins the responsiveness measurement by querying for the
   JSON [RFC8259] configuration.  This supplies the URLs for creating
   the load-generating connections in the upstream and downstream
   direction as well as the small object for the latency measurements.

7.  Responsiveness Test Server Discovery

   It makes sense for a service provider (either an application service
   provider like a video conferencing service or a network access
   provider like an ISP) to host Responsiveness Test Server instances on
   their network so customers can determine what to expect about the
   quality of their connection to the service offered by that provider.
   However, when a user performs a Responsiveness Test and determines
   that they are suffering from poor responsiveness during the
   connection to that service, the logical next questions might be,

   1.  "What's causing my poor performance?"

   2.  "Is it poor buffer management by my ISP?"

   3.  "Is it poor buffer management in my home Wi-Fi Access point?"

   4.  "Something to do with the service provider?"

   5.  "Something else entirely?"

   To help an end user answer these questions, it will be useful for
   test clients to be able to easily discover Responsiveness Test Server
   instances running in various places in the network (e.g., their home
   router, their Wi-Fi access point, their ISP's head-end equipment,
   etc).

NOTE: Not all home networking equipment is ready to actually act as an HTTP traffic source and sink next to its main job as firewall/router/access point/... so when implementing responsiveness tests on those devices, these will need to monitor their own resource usage, e.g. CPU cycles, to make sure the test is not limited by that endpoint's capabilities. I wish it were different, but current home router designs often have little in reserve...


   Consider this example scenario: A user has a cable modem service
   offering 100 Mb/s download speed, connected via gigabit Ethernet to
   one or more Wi-Fi access points in their home, which then offer
   service to Wi-Fi client devices at different rates depending on
   distance, interference from other traffic, etc.  By having the cable
   modem itself host a Responsiveness Test Server instance, the user can
   then run a test between the cable modem and their computer or
   smartphone, to help isolate whether bufferbloat they are experiencing
   is occurring in equipment inside the home (like their Wi-Fi access
   points) or somewhere outside the home.

NOTE: I fully agree with the desirability here, but I am not sure whether current end-user networking gear is prepared for something like this.



7.1.  Well-Known Uniform Resource Identifier (URI) For Test Server
      Discovery

   Any organization that wishes to host their own instance of a
   Responsiveness Test Server can advertise that capability by hosting
   at the network quality well-known URI a resource whose content type
   is application/json and contains a valid JSON object meeting the
   following criteria:

   {
     "version": 1,
     "urls": {
       "large_download_url":"https://nq.example.com/api/v1/large",
       "small_download_url":"https://nq.example.com/api/v1/small",
       "upload_url":        "https://nq.example.com/api/v1/upload"
     }
     "test_endpoint": "hostname123.provider.com"
   }

NOTE: This might allow optionally introducing a
       "timestamp_download_url":"https://nq.example.com/api/v1/timestamps",
stanza, so prepared servers could alternatively offer a small timestamp file as an alternative to the normal small URL?
(even though I would prefer that we request the timestamps unconditionally as the content of the small URL)


   The server SHALL specify the content-type of the resource at the
   well-known URI as application/json.

   The content of the "version" field SHALL be "1".  Integer values
   greater than "1" are reserved for future versions of this protocol.
   The content of the "large_download_url", "small_download_url", and
   "upload_url" SHALL all be validly formatted "http" or "https" URLs.
   See above for the semantics of the fields.  All of the fields in the
   sample configuration are required except "test_endpoint".  If the
   test server provider can pin all of the requests for a test run to a
   specific host in the service (for a particular run), they can specify
   that host name in the "test_endpoint" field.

   For purposes of registration of the well-known URI [RFC8615], the
   application name is "nq".  The media type of the resource at the
   well-known URI is "application/json" and the format of the resource
   is as specified above.  The URI scheme is "https".  No additional
   path components, query strings or fragments are valid for this well-
   known URI.

QUESTION: I have a hunch where "nq" might be coming from; could we actually mention that in the draft somehow?


[...]