Re: [ippm] Review of draft-ietf-ippm-responsiveness-03

Christoph Paasch <cpaasch@apple.com> Mon, 26 February 2024 18:12 UTC

From: Christoph Paasch <cpaasch@apple.com>
Date: Mon, 26 Feb 2024 10:11:54 -0800
Cc: Greg White <g.white=40CableLabs.com@dmarc.ietf.org>, "ippm@ietf.org" <ippm@ietf.org>
To: can desem <cdesem@gmail.com>
Archived-At: <https://mailarchive.ietf.org/arch/msg/ippm/4enJCQptYu6U-Qs2GBZz2MvhfQI>
Subject: Re: [ippm] Review of draft-ietf-ippm-responsiveness-03

Hello,

> On Feb 8, 2024, at 4:59 AM, can desem <cdesem@gmail.com> wrote:
> The work in ippm-qoo, for example, is one that relies heavily on measuring the delay response of the network (building on the Broadband Forum's quality attenuation metric and framework, TR-542). In particular, the way delay is measured there (using UDP is quite acceptable and may be easier, as well as TCP) is not as rigid as what is proposed here, which specifies it should be HTTP based; this may really imply that we are measuring web responsiveness, as mentioned by Greg White, rather than network/IP responsiveness. If we did use UDP, are we really going to see a significant difference in network delay measurements in practice?

Yes, our goal is explicitly not to measure network/IP responsiveness but rather the responsiveness the way the user will experience it. We try to be very clear about that focus in the abstract & introduction. Do you think there is room for improvement there?

> When measuring delay under load, it is proposed to use multiple TCP connections, up to 16. Would it not be easier to use UDP, given for example that the new BBF speed test TR-471 (also called UDPST, or draft-ietf-ippm-capacity-protocol-04) is purely based on UDP? This is easily capable of saturating the link, probably faster than the way it is proposed here using TCP. TCP is clearly capable of saturating the link as proposed, but would we really fail to get valid measurements if we used the new speed measurement tool, which is based on UDP? TR-471/UDPST does have the capability to measure network delays and combine that with network loading as well, but it is all based on UDP. There is an open-source implementation of this tool as well, which could easily be used/modified to implement what is proposed here, but only if a UDP implementation of this measurement were considered acceptable.
> 
> It is mentioned as a side note that other types of traffic are gaining in popularity and that these could be measured separately, but the overall emphasis of the measurement is on HTTP. This is probably too restrictive and not in line with the other work mentioned, which may be OK if we are emphasising HTTP responsiveness rather than IP/network responsiveness in general. It would be interesting to see if there is a significant difference between UDP and TCP measurements in real life.

There definitely is a huge difference between TCP and UDP, purely because TCP itself queues data in its stack on the transmit side. That queue adds additional delay to the requests. There are techniques to reduce that queue (notably the TCP_NOTSENT_LOWAT socket option).
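
For illustration only (this is not from the draft), a minimal sketch of how a sender might cap that in-stack queue; the hostname is a placeholder and the numeric fallback for the option is a platform-specific assumption:

    import socket

    # Use the constant if this Python build exposes it; otherwise fall back to
    # the Linux option number (25).  The option number differs on other platforms.
    TCP_NOTSENT_LOWAT = getattr(socket, "TCP_NOTSENT_LOWAT", 25)

    sock = socket.create_connection(("responsiveness.example.com", 443))

    # Keep at most ~16 KB of unsent data buffered in the kernel for this socket,
    # so application writes block (or the socket polls as not-writable) instead
    # of piling delay onto everything queued behind them.
    sock.setsockopt(socket.IPPROTO_TCP, TCP_NOTSENT_LOWAT, 16 * 1024)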


Christoph

> 
> Best regards,
> Can Desem
> 
> 
> On Wed, Jan 24, 2024 at 5:46 AM Greg White <g.white=40CableLabs.com@dmarc.ietf.org <mailto:40CableLabs.com@dmarc.ietf.org>> wrote:
>> All,
>> 
>>  
>> 
>> I did a review of the current responsiveness draft and have a list of technical comments, editorial comments and nits.  Within each section, the comments are listed in the order that the relevant material appears in the draft.
>> 
>>  
>> 
>> Best Regards,
>> 
>> Greg
>> 
>> 
>> 
>> Technical Comments:
>> 
>> 
>> 
>> The responsiveness test probe measurements include more than just the network responsiveness, they also include effects from the web server implementation.  Yet, the Abstract and Introduction sections use videoconferencing as the (only) example application.  Since videoconferencing apps don’t use the same connection methodology as HTTPS, I think this is misleading.  This methodology should either be more clearly labeled/described as an interactive web responsiveness measurement (and videoconferencing not be used as the example application), or the methodology should be revised to more accurately reflect the responsiveness that impacts real-time communication applications (e.g. by trying to eliminate the HTTP server components of delay in the Responsiveness calculation, rather than explicitly including it).
>>  
>> 
>> §4.3.1 suggests using a Trimmed Mean to aggregate the samples of each metric. I have a couple of issues with this.
>> The first issue is with the “trimmed” part. The responsiveness test uses traffic that is modeled after real application traffic, and it’s intended to give the user an indication of the real behavior of the network.  Real applications don't get to ignore the times when the network latency is bad; they have to deal with it.  Throwing away the top 5% of the measurements seems really arbitrary.  To just pretend that those events didn't happen and not include them in the measurement is, I think, wrong.  In the experimental sciences, “outliers” that represent faulty data or are the result of experimental error are often excluded from data sets, but the experimenter needs to be very careful to identify whether the outliers are in fact faulty, or whether they are actually extreme observations of the real variability that exists in the experiment.  In the context of the responsiveness test, I think the likelihood is much greater that many of these data points are in fact real (and important) observations of the network latency variation.  Is there a specific concern about the potential for faulty data that would affect 5% of the samples?
>>  
>> 
>> The second issue is with the “mean” part. In the context of user-observable events, like page load time, it may be that mean values are useful metrics, but in the context of measuring network RTT (which is one of the dominant factors measured by the responsiveness probes), averages are often a poor way to summarize a set of measurements [1],[2].  Quite a number of applications for which “responsiveness” is important (including voice applications and many multiplayer games) utilize a jitter buffer at the receiver to convert the variable packet latency of the network into a fixed (or slowly varying) latency and residual loss.  These jitter buffers often result in a fairly low level of residual loss (e.g. 1%), which implies that the fixed latency that they provide to the application matches a high percentile (e.g. P99) of the packet latency of the network.  If the description of the responsiveness test is revised to be more clearly about web responsiveness rather than (e.g.) videoconferencing, perhaps this argument is less strong, but even then I think that the mean value deemphasizes the variability in latency that affects user experience.  
>> 
>>  
>> 
>> Instead of TM95, I would suggest using P99 if the goal is to reflect a responsiveness metric for real-time interactive applications (RTC/CloudGaming/XR). 
>> 
>>  
>> 
>> [1] https://www.bitag.org/documents/BITAG_latency_explained.pdf
>> 
>> [2] https://www.iab.org/wp-content/IAB-uploads/2021/09/single-delay-metric-1.pdf
>> 
>>  
>> 
>> I understand that P99 metrics can be more variable than mean metrics, and there might be an interest in reporting a more inherently stable metric, but metric stability seems to me to be less important than metric validity.  
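>> 
>> To make the difference concrete, here is a rough sketch (purely illustrative, with made-up sample values standing in for the per-probe measurements) comparing a 95% trimmed mean against P99:
>> 
>>     import random
>> 
>>     # Illustrative only: 1000 fake RTT samples (ms) with a long tail.
>>     random.seed(1)
>>     samples = [20 + random.expovariate(1 / 15) for _ in range(1000)]
>> 
>>     def trimmed_mean(values, trim_percent=5):
>>         # Drop the highest trim_percent of samples, average the rest (TM95 drops 5%).
>>         s = sorted(values)
>>         keep = int(len(s) * (100 - trim_percent) / 100)
>>         return sum(s[:keep]) / keep
>> 
>>     def percentile(values, p):
>>         # Simple nearest-rank percentile.
>>         s = sorted(values)
>>         return s[min(len(s) - 1, int(len(s) * p / 100))]
>> 
>>     print("TM95:", round(trimmed_mean(samples), 1), "ms")
>>     print("P99: ", round(percentile(samples, 99), 1), "ms")
>> 
>> On a long-tailed distribution like this, P99 comes out substantially higher than TM95, which is exactly the kind of variability I am concerned gets hidden.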
>> 
>>  
>> 
>> What is the reason that the goodput phase starts with only 1 connection and only adds 1 connection each interval?  It seems that there is some underlying assumption that using the minimum # of connections to achieve goodput saturation has some value.  I’m not sure that it does. Why not start with 8 connections and add 8 more each interval until goodput saturation is reached?  I would think that you’d reach saturation much more quickly, and this is more important than you may realize.  The majority of existing tools that only aim to maximize goodput (i.e. speedtests) start several (4, 6, 8 or 16) simultaneous flows immediately; web browsers also (nearly) simultaneously open dozens of TCP connections. What is the downside?
>> §4.4 indicates that responsiveness probes don’t begin until the load-generating process declares goodput saturation.  Why wait?  Is there a downside to starting the responsiveness probing at the beginning of the test?  It seems like the current method could in some (possibly many) cases take MAD more intervals than would be necessary to collect a measurement.
>> §4.4.1. I don’t understand the definition of “low” confidence.  This seems to only be possible with invalid parameter choices.  If MAD=4 and ID=1s, then as long as the tool allows at least 4 seconds for each “stage”, it would be impossible to have the “Low” confidence result, and if the tool allows less than 4 seconds, then it would be impossible not to.
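>> 
>> For reference, my reading of the stability check behind these definitions is roughly the following (a sketch for discussion only, taking MAD = 4 and ID = 1 s as above and assuming a 5% standard-deviation tolerance, SDT; this is an interpretation, not the draft's normative text):
>> 
>>     import statistics
>> 
>>     def is_stable(moving_averages, mad=4, sdt=0.05):
>>         # Take the per-interval moving-average values from the immediately
>>         # preceding MAD intervals and call the series stable when their
>>         # standard deviation is within SDT of their mean.
>>         if len(moving_averages) < mad:
>>             return False  # fewer than MAD intervals so far
>>         window = moving_averages[-mad:]
>>         return statistics.pstdev(window) <= sdt * statistics.fmean(window)
>> 
>> With MAD = 4 one-second intervals, any stage that runs for at least 4 seconds always produces a full window, which is why the "Low" confidence case looks unreachable with valid parameter choices.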
>> The purpose of the Responsiveness Test is to raise awareness of the deployment (or lack thereof) of new technologies to improve network responsiveness and to encourage deployment of these technologies.  As such, the use of classic congestion control is invalid.  The draft needs to mandate the use of L4S congestion control for all Responsiveness Test traffic and make it clear why this is necessary.  §6 3rd paragraph reads:
>> 
>> “As clients and servers become deployed that use L4S congestion
>>    control (e.g., TCP Prague with ECT(1) packet marking), for their
>>    normal traffic when it is available, and fall back to traditional
>>    loss-based congestion controls (e.g., Reno or CUBIC) otherwise, the
>>    same strategy SHOULD be used for Responsiveness Test traffic.  This
>>    is RECOMMENDED so that the synthetic traffic generated by the
>>    Responsiveness Test mimics real-world traffic for that server.”
>> 
>> Suggest:
>> 
>> “As clients, servers and networks become deployed that support L4S
>>    congestion control (e.g., TCP Prague with ECT(1) packet marking
>>    and L4S ECN congestion marking in the network), it will be
>>    imperative that the Responsiveness Test measures performance using
>>    L4S as well.  Both clients and servers MUST support
>>    and use L4S congestion control (e.g. TCP Prague [I.D-prague]) and
>>    Accurate ECN ([RFCxxxx]) for all Responsiveness Test traffic.  This
>>    is required so that the synthetic traffic generated by the
>>    Responsiveness Test correctly measures the responsiveness capability
>>    of the network.  In networks that don’t support L4S, L4S congestion
>>    control algorithms are required (by [RFC9331]) to automatically behave
>>    like classic congestion control (e.g. CUBIC/Reno), so the use of an L4S
>>    congestion control algorithm for the Responsiveness Test is appropriate
>>    for all networks.”
>> 
>> §7.1 it isn’t clear to me the meaning of the test_endpoint URL. How exactly does the server set it, and what does the client do with this information?
>> 
>> 
>> Editorial Comments:
>> 
>>  
>> 
>> The description in §4.1 presumes no AQM (e.g. it uses phrases like: “… the bottleneck … has its buffer filled up entirely ….”).  Since AQM is getting to be relatively widely deployed, it would be good to make this intro material a little more aware of the current reality.
>>  
>> 
>> §5 should be more clear that selection of the measurement server is important.  In addition to the server characteristics having an influence on the result and perhaps not reflecting the characteristics of a “real” application server of interest to the user, it is only the path to that server that is being tested, so the network characteristics might not reflect the network path that is of interest to the user.   A discussion of this in section 5 will then provide motivation for the discussion in Section 7 on locating test servers in various locations, and a back reference to section 5 could be added there.
>> 
>> 
>> Nits:
>> 
>>  
>> 
>> §2
>> “Finally, it is crucial to recognize that significant queueing only
>>    happens on entry to the lowest-capacity (or "bottleneck") hop on a
>>    network path.”
>> 
>> suggest:
>> 
>> “Finally, it is crucial to recognize that significant queueing in the network only
>>    happens on entry to the lowest-capacity (or "bottleneck") hop on a
>>    network path. “
>> rationale:  the immediately preceding paragraph discusses that queuing doesn’t just happen in the network.
>> 
>>  
>> 
>>  
>> 
>> §2
>> “In a heterogeneous
>>    network like the Internet it is inevitable that there must
>>    necessarily be some hop along the path with the lowest capacity for
>>    that path.”
>> Either say “it is inevitable that” or “there must necessarily be”, not both.
>> 
>>  
>> 
>>  
>> 
>> §2
>> “Arguably, this heterogeneity of
>>    the Internet is one of its greatest strengths.  Allowing individual
>>    technologies to evolve and improve at their own pace, without
>>    requiring the entire Internet to change in lock-step, has enabled
>>    enormous improvements over the years in technologies like DSL, cable
>>    modems, Ethernet, and Wi-Fi, each advancing independently as new
>>    developments became ready.  As a result of this flexibility we have
>>    moved incrementally, one step at a time, from 56kb/s dial-up modems
>>    in the 1990s to Gb/s home Internet service and Gb/s wireless
>>    connectivity today.”
>> Is this relevant to the Responsiveness definition?  Suggest deleting it for brevity.
>> 
>>  
>> 
>>  
>> 
>> §4.1.1 is a bit circuitous.  
>> A) suggest moving the first 2 sentences of the 5th paragraph to the start of the section:
>>  
>> 
>>    “The purpose of the Responsiveness Test is not to productively move
>>    data across the network, the way a normal application does.  The
>>    purpose of the Responsiveness Test is to, as quickly as possible,
>>    simulate a representative traffic load as if real applications were
>>    doing sustained data transfers and measure the resulting round-trip
>>    time occurring under those realistic conditions.”
>>  
>> 
>> B) Suggest deleting the second paragraph.
>> 
>> C) Suggest changing the beginning of the (former) 3rd sentence in the (former) 5th paragraph to: “For all of these reasons, using multiple ….”
>> 
>>  
>> 
>> §4.1.2.   s/Debuggability of/Debugging/
>> §4.1.3 again presumes no AQM, and it has some strange language.
>> “By definition, once goodput is
>>    maximized, buffers will start filling up, creating the "standing
>>    queue" that is characteristic of bufferbloat.”
>> This is not a true statement in all cases, and it certainly isn’t the definition of maximum goodput that buffers start filling up. Suggestion:
>> 
>> “When using TCP with traditional loss-based congestion control, in 
>> bottlenecks that don’t support Active Queue Management, once goodput is
>>    maximized, buffers will start filling up, creating the "standing
>>    queue" that is characteristic of bufferbloat.”
>>  
>> 
>> §4.1.3 first paragraph,
>> “At this moment the
>>    test starts measuring the responsiveness until it, too, reaches
>>    saturation.”
>> When reading this for the first time, it was somewhat ambiguous what “it” refers to in this sentence (the test or the responsiveness). Suggest
>> 
>> “At this moment the
>>    test starts measuring the responsiveness until that metric reaches
>>    saturation.”
>>  
>> 
>> §4.1.3 second paragraph: “The algorithm notes that…”. “Notes” seems like a strange word here.  The algorithm isn’t taking notes. Maybe “presumes” or “is based on the presumption”.
>>  
>> 
>> §4.1.3 second paragraph:
>> “If new connections leave the throughput the same …”
>> Suggest:
>> 
>> “If new connections don’t result in an increase in throughput …”
>>  
>> 
>> §4.1.3 second paragraph:
>> “At this point, adding more
>>    connections will allow to achieve full buffer occupancy.”
>> In most cases, once the link is saturated with TCP traffic, the buffer will fill.  It doesn’t take additional connections in order to achieve full buffer occupancy.  The algorithm seems to be designed with this in mind, since it does not add more connections once goodput saturation is reached. Suggest deleting this sentence.
>> 
>>  
>> 
>> §4.1.3 second paragraph:
>> “Responsiveness will gradually decrease from now on, until the buffers
>>    are entirely full and reach stability of the responsiveness as well.”
>> Suggest:
>> 
>> “Responsiveness will decrease from now on, until the buffers
>>    are entirely full and stability is reached.”
>>  
>> 
>> §4.3  “http_l” appears twice and “http_s” appears 3 times, yet these appear to refer to the same value.
>> §4.3 the arithmetic appears to support the interpretation that the 100 probes per second are divided equally between foreign-probe types and self-probe types, but I don’t see this explicitly stated in the draft. Suggest stating this as a requirement. Also, I suspect that the two probe types should be alternated so that both components of the responsiveness calculation stabilize at approximately the same time.  Should the two probe types be spread out relative to each other (e.g. send self-probe, wait 10ms, send foreign probe, wait 10ms, repeat), or should they be sent back-to-back in order to both be measuring (nearly) the same network condition (e.g. send self-probe, send foreign-probe, wait 20ms, send foreign-probe, send self-probe, wait 20ms, repeat)? Suggest making these aspects explicit, as sketched below.
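>> 
>> As a purely illustrative sketch (not from the draft), the two layouts above for 100 probes per second, half self and half foreign, would be:
>> 
>>     # Offsets in milliseconds within one second, paired with the probe type.
>>     alternated = [(10 * i, "self" if i % 2 == 0 else "foreign") for i in range(100)]
>>     paired     = [(20 * i, kind) for i in range(50) for kind in ("self", "foreign")]
>> 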
>> §4.3.1 Responsiveness formula is unnecessarily confusing. While it isn’t mathematically ambiguous, using the constructs 1/B*C and 1/B*(C+D) is more confusing for readers than the constructs C/B and (C+D)/B. Referring to the formula as a weighted mean isn’t very helpful for two reasons.  First it is the denominator that is a mean calculation, not the whole expression. Second, the denominator would be more helpfully described as a mean of two values, one of which is itself a mean of three values.  This could be improved by calculating foreign_response_time = (TM(tcp_f) + TM(tls_f) + TM(http_f))/3 and then using that value in the Responsiveness formula, or alternatively, explain the weighting factors 1/6 and 1/2 in the text.
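>> 
>> As an illustration of the restructuring I have in mind (a sketch only; the 60000 numerator and the 1/6 and 1/2 weights are as I read the current formula, and the millisecond values are placeholders):
>> 
>>     def responsiveness_rpm(tm_tcp_f, tm_tls_f, tm_http_f, tm_http_s):
>>         # Same arithmetic as the current formula, but split so the weighting
>>         # is visible: the foreign component is itself a mean of three values,
>>         # and the denominator is the mean of the foreign and self components.
>>         foreign_response_time = (tm_tcp_f + tm_tls_f + tm_http_f) / 3
>>         mean_response_time = (foreign_response_time + tm_http_s) / 2
>>         return 60000 / mean_response_time  # round-trips per minute
>> 
>>     # Placeholder values (ms): 1/6*(30+40+50) + 1/2*60 = 50, so 60000/50 = 1200 RPM.
>>     print(responsiveness_rpm(30, 40, 50, 60))
>> 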
>> §4.4
>> “calculated in the most-recent MAD intervals” 
>> Does this include the current interval?  Later text indicates that it does not.  Suggest:
>> 
>> “calculated in the immediately preceding MAD intervals” 
>>  
>> 
>> §3 refers to the test taking 20 seconds, while §4.4 and §4.4.1 refer to one half of the test taking 20 seconds, and the full test taking 40 seconds.  Make these consistent.
>>  
>> 
>> §4.4
>> “It is left to the implementation what to do when stability is not reached within that time-frame.”
>> The very next subsection (4.4.1) seems to give some guidance here.  Why not recommend that mechanism?
>> 
>>  
>> 
>> §4.4.1 The definitions use the hard coded value “4”.  Shouldn’t this instead be “MAD”?
>> §5.1.2, first paragraph, 4th sentence reads:
>> “Many networks deploy transparent TCP Proxies,
>> firewalls doing deep packet-inspection, HTTP "accelerators",...”
>> Suggest completing this sentence rather than ending with an ellipsis.
>> 
>>  
>> 
>> §5.1.2 second paragraph.  Please provide a reference or definition of Smart Queue Management.  Does this term include AQM? If not, suggest adding AQM explicitly.
>> §6 title is “Responsiveness Test Server API” yet it contains requirements on both the server and the client, and those requirements go beyond the API.  Maybe: “Responsiveness Test Server and Client Protocol”? 
>>  
>> 
>> §6 list item 4.
>> “… which contains configuration …”
>> Suggest
>> 
>> “… which contains JSON configuration …”
>> To link this to the text in the next paragraph.
>> 
>>  
>> 
>> §7 the text in the first paragraph, beginning with “However” and through the end of the numbered list seems like it would fit better in §5.2.
>>  
>> 
>> §7.1 Can you provide an explicit example URL (maybe it is: https://nq.example.com/.well-known/nq)
>>  
>> 
>> §7.1 In the JSON example the listed urls and the test_endpoint url are in completely different domains.  Is this an expected scenario?
>>  
>> 
>> §7.1 mentions 3 times that the content-type is application/json.  Maybe once is enough?