Re: [tsvwg] draft-ietf-tsvwg-careful-resume-05

Gorry Fairhurst <gorry@erg.abdn.ac.uk> Mon, 11 December 2023 09:59 UTC

Message-ID: <04b3c580-b6df-409e-9c79-e699c53beb19@erg.abdn.ac.uk>
Date: Mon, 11 Dec 2023 09:58:59 +0000
To: Sebastian Moeller <moeller0=40gmx.de@dmarc.ietf.org>, tsvwg <tsvwg@ietf.org>
References: <811CAA14-E58B-41ED-A528-53B0AA3F8227@gmx.de>
From: Gorry Fairhurst <gorry@erg.abdn.ac.uk>
Organization: UNIVERSITY OF ABERDEEN
In-Reply-To: <811CAA14-E58B-41ED-A528-53B0AA3F8227@gmx.de>
Archived-At: <https://mailarchive.ietf.org/arch/msg/tsvwg/SnYNkVZKLUF4QrwfBYXwImZ_UAk>
Subject: Re: [tsvwg] draft-ietf-tsvwg-careful-resume-05

On 10/12/2023 12:49, Sebastian Moeller wrote:
> Hi list,
>
>
> here are my comments on https://datatracker.ietf.org/doc/html/draft-ietf-tsvwg-careful-resume-05.

Thanks for this detailed review. It was very helpful; please see the notes 
below.

Gorry

>
> "In another example, an application connects after a disruption had temporarily reduced the path capacity (e.g., after a link propagation impairment, or where a user on a train journey travels through different areas of connectivity). When the endpoint returns to use a path with the original characteristics, using a rate that is based on the previously observed CC parameters."
>
> NOTE: This seems rather optimistic. In my limited experience, cases "where a user on a train journey travels through different areas of connectivity" do not come as binary on-off-on transitions where the capacity persists between the two on-states, but with a rate reduction between on and off. In addition, to stick with the train example, if the off epoch is caused by, e.g., a tunnel through a mountain, it is not surprising if the RF environment on the other side of that mountain is different, and hence so is the capacity.
[GF] I took the decision to remove the specific example of the train, 
because, as you correctly note, this would need much more detail to 
present a clear example using a rail journey. I expect a more 
detailed/other example could be re-introduced in future if this is 
thought worthwhile.
> • Careful Resume (CR): The method specified in this document to select initial CC parameters, that seeka to more rapidly and safely increase the initial sending rate.
>
> NIT: "seek a" instead of "seeka"
[GF] Fixed in editor's text.
>
> • endpoint_token: An Endpoint Token identifyingh a path to a receiver;
>
> NIT: "identifying" instead of "identifyingh"
[GF] Fixed in editor's text.
>
> 3. The Phases of CC using Resume
>
> NOTE: This starts with a diagram omitting the observe phase and mentions "The Observe Phase is later performed by an established connection." and then immediately follows with "3.1. Observe Phase" this seems odd, maybe either include the observe phase in the diagram or start with one of the other phases...
> Something like "Normal (new observe phase)". (Side question, we have validating phase, why not observing phase?)
>
>
> • Observe Phase (small CWND): If the measured CWND is less than four times the Initial Window (IW) (i.e. CWND less than IW*4), a sender SHOULD NOT store and/or send CC parameters.
>
> QUESTION: I guess this is because below that number doing careful resume would end up taking more time than traditional slow-start? If so maybe add this information to the paragraph?
[GF] Added in editor's text. Indeed, there is a cost in complexity and 
potential collateral damage from small jumps to traffic when using CR, 
which would yield little benefit...
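
A minimal sketch of the small-CWND rule quoted above, assuming the CWND and IW are both counted in packets. The function name and the default IW of 10 packets are illustrative assumptions and are not taken from the draft:

```python
def should_save_cc_params(cwnd_packets: int, iw_packets: int = 10) -> bool:
    """Return True when the observed CWND is large enough to be worth saving.

    Follows the quoted rule: if CWND < IW * 4, a sender SHOULD NOT store
    and/or send CC parameters. The default IW of 10 packets is an assumption.
    """
    return cwnd_packets >= 4 * iw_packets


if __name__ == "__main__":
    for cwnd in (12, 39, 40, 200):
        print(cwnd, should_save_cc_params(cwnd))
```

With these assumed values, a connection that only ever reached a CWND of 39 packets would not save its parameters, while one that reached 40 packets would.
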
> • Observe Phase (sending CC Parameters): When sending the CC parameters to a receiver, these ought to be updated if there are significant changes in the saved CC parameters; The frequency of update SHOULD be less than one update for several RTTs of time.
>
> QUESTION: Conflicting requirements here (what if there are significant changes within several RTTs?). Also, "significant" and "several" need to be quantified; as is, this is quite hand-wavy.
[GF] Changed in editor's text.
> The sentence at the end of the paragraph "Implementation notes are provided in Section 4.1." does not contain the missing information. Nit-nit: "RTTs of time" seems too repetitive, maybe just "for several RTTs" or "for the duration of several RTTs"; after all, the second T in RTT already expands to time?
[GF] Reworded in editor's text.
>
> • Reconnaissance Phase (Lifetime of saved CC parameters): The CC parameters are temporal. If the lifetime of the observed CC parameters is exceeded Section 4.3.1, the CC parameters are no longer used and sender enters the Normal Phase.
>
> NOTE: Section 4.3.1 does not give even a recommendation of the lifetime, this is sub-optimal for any implementor.
[GF] The editors will next work on Safe Retreat (before revisiting 
lifetime).
>
> • Reconnaissance Phase (Confirming RTT): Since the CC information is directly impacted by the RTT, a significant change in the minimum RTT is a strong indication that the previously observed CC parameters are not valid for the current path. An RTT measurement is confirmed when current_rtt is greater than (saved_rtt / 2) and the current_rtt is less than or equal to (saved_rtt x 10).
>
> QUESTION: Why the non-symmetry here? Why not simply use (saved_rtt / 2) <= current_rtt <= (saved_rtt x 2)? The number 10 clearly is pure guesswork? Sure the 2 is probably not on better theoretical or empirical footing, but it is considerably more conservative. Also on a link with competent scheduling and AQM RTTs do not routinely inflate by a factor of 10... Yes, section 4.2.1 addresses this, but the first is a technicality (if you select jump_cwnd differently by that rationale so will this threshold, which is an argument that is decoupled from the true underlying path, but as long as we keep jump-cwnd and 1/2 this seems IMHO fully acceptable) and the factor 10 is clearly out of thin air, but apparently also safe in that this will result in a much lower actual sending rate. This brings up the question, at what current_RTT/saved_RTT factor does CR become slower than traditional slow-start? Maybe that could be used to set the upper threshold?

[GF] All valid questions, thanks - but the choices that were explored 
were not without a basis.

/2 has the origins in the choice CWND/2. - Less than RTT/2 would result 
in excessive rate.

x2 balances the /2, and would be OK, but then paced rate scales as 
saved_rtt/current_RTT, but the duration of Unvalidated Phase also 
impacts: this can result in fewer unvalidated packets when an ACK is 
received in ~1 current_RTT, a higher current_RTT results in a lower 
rate. That was intentional. At some point, a much higher RTT is 
indicative of a change, (the chosen value was RTT>10, based on other 
specs that say flows should be fair within an order of magnitude).

I accept this comment that this needs better explanation, and the editors 
will add an explanation to better separate the design constraints in 
the next revision.
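
The relationship between the RTT bounds and the resumed rate can be sketched as follows. This is an illustration only: it assumes the jump_cwnd is exactly half the saved_cwnd and that the jump is paced evenly over one current_rtt. The function names and units (packets, milliseconds) are assumptions, not taken from the draft.

```python
def rtt_confirmed(saved_rtt_ms: float, current_rtt_ms: float) -> bool:
    """Quoted check: saved_rtt / 2 < current_rtt <= saved_rtt * 10."""
    return saved_rtt_ms / 2 < current_rtt_ms <= saved_rtt_ms * 10


def resumed_rate(saved_cwnd: float, current_rtt_ms: float) -> float:
    """Paced rate (packets per ms) when jump_cwnd is sent over one current RTT."""
    jump_cwnd = saved_cwnd / 2  # assumed: the largest jump that is permitted
    return jump_cwnd / current_rtt_ms


if __name__ == "__main__":
    saved_cwnd, saved_rtt = 100.0, 100.0   # packets, ms (illustrative values)
    observed_rate = saved_cwnd / saved_rtt
    for current_rtt in (50.0, 51.0, 100.0, 200.0, 1000.0, 1001.0):
        ok = rtt_confirmed(saved_rtt, current_rtt)
        rate = resumed_rate(saved_cwnd, current_rtt)
        print(f"{current_rtt:7.1f} ms  confirmed={ok!s:5}  rate/observed={rate / observed_rate:.3f}")
```

Under these assumptions the resumed rate equals the previously observed rate exactly when current_rtt is saved_rtt / 2, which is why smaller RTT samples are rejected. Any larger confirmed RTT gives a proportionally lower rate.
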

> • {XXX-Editor note: Reconnaissance Phase (Is there a need for a minimum required number of RTT samples to confirm a path ???? }
>
> NOTE: Specifying at least a recommendation here IMHO makes sense given that the saved_rtt is to be collected over 5 minutes.... Especially given that the fewer samples are used the more likely the true minimal path RTT is overestimated, but the section above with its current recommendation of "current_rtt is less than or equal to (saved_rtt x 10)" is less sensitive to increased RTTs? So we might introduce a systemic bias by under-sampling here. I think some discussion of this should be added to section 4.2.
>
>
> • In some implementations, the decision to enter the Unvalidated Phase could require coordination with the management of buffers in the interface to the higher layers.
>
> NOTE: this seems pretty speculative, maybe keep this out of the draft until such an implementation arises?
[GF] That was our experience. If you've tried implementing and have 
more guidance, then we are happy to add more specific help.
>
> • Unvalidated Phase (Confirming the path): If a sender determines that the previous CC parameters are not valid (due to a detected change in the path) (e.g., the RTT has changed), Careful Resume enters the Safe Retreat Phase. (The sender cannot receive feedback for the jump_cwnd, because less than an RTT has passed before the Unvalidated Phase was entered. Therefore, any detected congestion must have resulted from packets sent before the Unvalidated Phase.)
>
> QUESTION: This should have already been established in "Reconnaissance Phase (Confirming RTT)" so we should never have entered this state? Implementing this still seems worthwhile, but at the same time a sign that "Reconnaissance Phase (Confirming RTT)" did not work as intended. This section lacks an enumeration for the conditions under which an RTT is considered to be changed. Even if these are the same as in "Reconnaissance Phase (Confirming RTT)" maybe add a reference to section 4.2.1 here?
[GF] Aha. A change could happen at any time, but I see a hint that 
mentioning "RTT" here is a poor thing to do and have removed that, 
because it's not the real point.
> • Validating Phase (Limiting CWND): On entry to the Validating Phase, the CWND is set to the flight size.
>
> NOTE: flight size seems not to be defined in Section 2.2
[GF] True, we did as requested.
> NOTE: The Validating Phase has no implementation notes, maybe mention that explicitly at the end of section 3.4?
>
>
> • Safe Retreat Phase (Increasing CWND): The CWND MAY be increased for each acknowledgment that acknowledges a previously unacknowledged packet that was sent in the Unvalidated Phase, since this indicates a packet has been successfully sent across the path.
>
> NOTE: Mmmh, so we reduce the cwnd to allow the excess volume to "drain", but at the same time we also increase the cwnd again? Wouldn't we expect that, given that we paced out our traffic, we will be above (an unknown) capacity for a full RTT? Why not refrain from increasing the cwnd for a full RTT? I am sure there are good reasons for doing what the draft proposes, I would just like to read that in the draft ;)
>
[GF] Also true, we noted that Safe Retreat was the last for us to 
implement and the text had problems. We have running code for a better 
approach after the Hackathon, and we will share that in the next rev, 
once we have more experience.
> 3.6. Normal Phase
> In the Normal Phase, the sender transitions to using the normal CC method (e.g., in congestion avoidance).
>
> NOTE: I might be missing things here, but the normal phase needs to be identical with the "next" observe phase? Or is the observe phase only happening when the current connection terminates and the CR data is updated?
>
[GF] A sender becomes "Normal", and then chooses to "Observe" ... that 
could be explained better.
> • Observe Phase: There are cases where the current CWND does not reflect the path capacity. At the end of slow start, the CWND can be significantly larger than needed to fully utilize the path (i.e., a CWND overshoot). It is inappropriate to use an overshoot in the CWND as a basis for estimating the capacity. In most cases, the CWND will converge to a stable value after several more RTTs. One mitigation could be to set the saved_cwnd based on the flight_size, or an averaged CWND.
>
> NOTE: Inappropriate is a rather soft way of phrasing that... I would argue a CR implementation MUST ensure (to the best of its ability) that it does not store/use an inflated and hence wrong CWND estimate. I guess the mitigation should be that the CR saved_cwnd is only updated/stored once a connection finishes slow-start and enters the congestion avoidance phase?
[GF] I'd favour not storing things that a sender does not wish to use.
> • Observe Phase (application-limited): When the sender is application-limited or in an RTT following a burst of transmission, a sender typically transmits much less data than allowed. Such observations ought to be discounted when estimating the saved_cwnd.
>
> QUESTION: Why? If that number ends up >= 4 times IW this will still result in a speed-up, and hence might be worthwhile?
[GF] Your comment is correct, text has been updated.
>
> NOTE: This sequence of section headings:
>
> 4.2. Confirming the Path in the Reconnaissance Phase
> 4.2.1. Confirming the Path
>
> It seems repetitive, either merge the two sections or come up with headings that are more specific?

[GF] True. My mistake. The latter should be about RTT not Path. In 
addressing other comments this section will be reworked.

> The method does not permit multiple concurrent reuse of the saved CC parameters. When multiple new concurrent connections are made to a server, each can have a valid endpoint_token, but the saved_cwnd can only be used for one new connection. This is designed to prevent a sender from performing multiple jumps in the cwnd, each individually based on the same saved_cwnd, and hence creating an excessive aggregate load at the bottleneck.
>
> NOTE: This leaves an obvious implementation on the table, in such situations equitably split the saved_cwnd between contenders.
>
[GF] I did not make this change, because I don't have any experience of 
what can go wrong if we allowed this. That makes me nervous to permit 
this, but I see the intuition and this could be revisited if there is 
real interest ... It will depend a lot on Safe Retreat for safety.
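
The single-use rule quoted above could be enforced with a small claim/release cache, sketched below. The class and method names are hypothetical and the draft does not prescribe any particular data structure. The sketch implements only the "one new connection per saved_cwnd" behaviour, not the suggested alternative of splitting the saved_cwnd.

```python
import threading
from typing import Dict, Optional, Set


class SavedCCCache:
    """Hypothetical store allowing only one concurrent use of each saved_cwnd."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._saved: Dict[str, int] = {}   # endpoint_token -> saved_cwnd
        self._in_use: Set[str] = set()     # tokens claimed by a resuming connection

    def save(self, endpoint_token: str, saved_cwnd: int) -> None:
        with self._lock:
            self._saved[endpoint_token] = saved_cwnd

    def claim(self, endpoint_token: str) -> Optional[int]:
        """Return saved_cwnd to the first claimant; None to concurrent claimants."""
        with self._lock:
            if endpoint_token not in self._saved or endpoint_token in self._in_use:
                return None
            self._in_use.add(endpoint_token)
            return self._saved[endpoint_token]

    def release(self, endpoint_token: str) -> None:
        """Called when the resuming connection has left the CR phases."""
        with self._lock:
            self._in_use.discard(endpoint_token)


if __name__ == "__main__":
    cache = SavedCCCache()
    cache.save("token-a", 80)
    print(cache.claim("token-a"))  # first connection may jump
    print(cache.claim("token-a"))  # concurrent connection falls back to normal start
```

A connection that receives None would simply start with the normal CC method, so the aggregate jump at the bottleneck stays bounded by a single jump_cwnd.
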
> Path characteristics can change over time for many reasons, resulting in the previously observed CC parameters becoming irrelevant. The sender therefore compares the saved_RTT with each of a series of measured RTT samples.
>
> NOTE: This section only tackles RTT, capacity is handled in 4.1, so starting with the plural "Path characteristics" seems too broad?
>
>
> If the current RTT sample is less than a half of the saved_RTT, this is regarded as too small, and is an indicator of a path change. (This factor of two arises, because the rate should not exceed the observed rate when the saved_cwnd was measured, because the jump_cwnd is calculated as half the measured saved_cwnd.)
>
> NOTE: As mentioned above this conflates two things: a) did the network path (and hence its expected equitably available capacity) change? b) is our probing CWND actually careful enough? I think we should first answer a) and then select a jump_cwnd for b) based on a)... that seems to be the flow of causality here. BUT I appreciate the frank statement of dependency here. (I also intuitively think that I can buy the 0.5*RTT threshold as useful enough, so in the end the recommended jump_cwnd might not need to change, just the direction of justification, that is, jump_cwnd follows from the lower RTT threshold, not vice versa).
[GF] Thought resolved in planned changes, see next rev.
>
> A current RTT larger than that at the time the saved_cwnd was measured results in a proportionally lower resumed rate, because the transmission using the CR method is paced based on the current RTT. An RTT sample more than ten times the saved_RTT is regarded as too large, such a high RTT is indicative of a path change. (The factor of ten accommodates both increases in latency from buffering on a path, and any variation between samples).
>
> NOTE: This is the only mention of variability... in isolation that makes little sense. HOWEVER I would like to propose that the CR block be extended by measures of variability for both RTT and capacity, as e.g. a massive change in variance likely also reflects a path change (either the true physical network path or the level of congestion along that path).
>
[GF] I agree that there will likely be variability in individual samples 
and also between different measurement times. This is something that 
ought to be measured. Unsure though that it would help to save the CC 
param, but we could indeed add more params if the WG thinks these will 
be useful.
> This section defines the safety requirements for using saved CC parameters to tentatively update the CWND. These safety requirements mitigate the risk of adding excessive congestion to an already congested path.
> 	• Unvalidated Phase (Jump): A connection must not directly use the previously saved_cwnd to directly initialize a new flow causing it to resume sending at the same rate. The jump_cwnd MUST be no more than half the previously saved_cwnd.
>
> 	NOTE: This clearly is accounting for the fact that we might start with an already congested path, and is not modeling the increasing uncertainty about the validity of the cached values in the stored CR information. (This uncertainty is what I proposed with an exponential decay).
>
>
> {XXX-Editor NOTE: A future revision of this document could specify a maximum time that CC Parameters can be cached - Ought this to me minutes, hours, days?}
>
> NOTE: I would like to encourage the authors to back anything well above the previously used duration of minutes by empirical data... just looking at https://www.de-cix.net/en/services/globepeer/statistics implies that there is a massive change in utilization/traffic over the course of a day, it stands to reason that this change in aggregate traffic is also reflected in local traffic and hence the "left-over" capacity on a link will change considerably over the course of a day. Based on that data I would argue that caching should stay well within the range of minutes or needs to take known traffic patterns into account when selecting the jump-cwnd (but that seems quite complicated). But these are just my rough ideas and I welcome better data or theory driven validity durations here.
>
[GF] I think this will be interesting. We'll also publish our data in 
time, following the final algo.
> {XXX-Editor note: This section to be completed XXX}
>
> NOTE: Indeed...
>
>
> • A simple conservative design sets CWND to IW and then resumes using normal slow-start. This does not require measuring the measured at congestion. The resulting pattern of CWND growth resembles that which would have occurred had the design not been used.
>
> NIT: "This does not require measuring the measured at congestion" this sentence is unclear maybe "This does not require measuring the actual amount of congestion"?
>
> • The volume of successfully transmitted packets sent using the Unvalidated Phase (e.g., by recording the sequence number of the first packet sent in the phase) is used as a measure of the maximum capacity, called the Pipe. The Pipe is not a safe measure of the currently available share of the capacity whenever there was also a significant overshoot at the bottleneck, as indicated by excessive loss. Therefore, any design that increases CWND based on received acknowledgments ought to avoid unduly taking capacity from sharing flows.
>
> NOTE: The "pipe" (this is IMHO not a helpful name) is indeed not all that useful, given that it will ignore e.g. concurrently starting flows in traditional slow-start... and other cross-traffic.
>
[GF] Safe Retreat is now work in progress. In short, there are two 
things: 1) a congestion event is detected; 2) a set of path measurement 
samples gives a pipe size from the Unvalidated packets. Both happened at a 
moment when the sender might have been overly aggressive, but give two 
pieces of information (see next rev).
> • For BBR CC, it is recommended to enter the "probe bandwidth" state.
>
> NOTE: This seems to be the first time BBR is named although it has been alluded to before, maybe name it even earlier?
>
>
> {XXX-Editor note: A future revision should discuss updating the saved parameters, whether used or not, after reaching normal operation for use the next time even if that update is to just refresh the expiration time.}
>
> NOTE: I would believe normal phase = observation phase as how else to meet the requirement for saved_cwnd:
> 	• saved_rtt: The preserved minimum RTT, e.g., corresponding to the minimum of a set RTT of measurements over the last 5 minutes of a connection.
>
> QUESTION: Does TCP typically keep an estimate of sRTT over a fixed interval? I naively assumed TCP would use an EWMA of some sort that is not directly identical to a moving minimum over the last X minutes?
[GF] You could ask that to the list?
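
To make the distinction in the question concrete, the sketch below contrasts a minimum taken over a sliding time window with the smoothed RTT update of RFC 6298 (SRTT = (1 - alpha) * SRTT + alpha * R, with alpha = 1/8). The class name, the 300-second default window, and the use of seconds and milliseconds are illustrative assumptions. No claim is made here about what any particular TCP stack stores.

```python
from collections import deque
from typing import Deque, Tuple


class WindowedMinRtt:
    """Minimum RTT over the last window_s seconds (monotonic-deque filter)."""

    def __init__(self, window_s: float = 300.0) -> None:
        self.window_s = window_s
        # (timestamp, rtt) pairs; RTT values strictly increase from left to right
        self._samples: Deque[Tuple[float, float]] = deque()

    def update(self, now_s: float, rtt_ms: float) -> float:
        # Expire samples that fell out of the window.
        while self._samples and self._samples[0][0] < now_s - self.window_s:
            self._samples.popleft()
        # Discard older samples that can never be the minimum again.
        while self._samples and self._samples[-1][1] >= rtt_ms:
            self._samples.pop()
        self._samples.append((now_s, rtt_ms))
        return self._samples[0][1]


def ewma_srtt(srtt_ms: float, sample_ms: float, alpha: float = 0.125) -> float:
    """One smoothed-RTT update step in the style of RFC 6298."""
    return (1 - alpha) * srtt_ms + alpha * sample_ms


if __name__ == "__main__":
    f = WindowedMinRtt()
    srtt = 40.0
    for t, rtt in [(0, 40.0), (10, 100.0), (20, 60.0), (301, 80.0)]:
        srtt = ewma_srtt(srtt, rtt)
        print(t, f.update(t, rtt), round(srtt, 2))
```

The windowed minimum ignores queueing spikes until the low sample expires, whereas the EWMA drifts towards every new sample, so the two estimates can differ noticeably on a path with variable queueing delay.
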
>
> • it could include other information such as the DSCP, ports, flow label, etc (recognising that this additional information might improve the path differentiation, but that this can reduce the re-usability of the token);
>
> NOTE: ports and flow labels are known to be used in e.g. ECMP/load balancing so these might be integral path information identifiers, and with load balancing the RTT of the different back-end devices might be quite similar, but the available capacity might not...
[GF] I'm recommending not diving into that topic. The key point of CR 
is that the Safe Retreat action is designed to limit the damage, rather 
than to accurately track the capacity...  but I'm not going to unpick 
that in this thread.
>
> NOTE: All in all I think the CR data needs to include information that allows a quick assessment of a path's volatility (IMHO some variance measures for cwnd/capacity and RTT/OWDs might be sufficient) so a sender can decide based on observed data whether to try to use CR in the first place. I guess we could punt this from storing in the CR data to having the sender delete/not store a CR for a connection not deemed sufficiently predictable.
[GF] Agree, that is what I'd personally choose.
> Regards
> 	Sebastian

Thanks again, and we'll update the text.