[tcpm] Re: PRR behaviour on detecting loss of a retransmission (WAS: I-D Action: draft-ietf-tcpm-prr-rfc6937bis-06.txt)

Markku Kojo <kojo@cs.helsinki.fi> Tue, 05 November 2024 00:56 UTC

Hi Neal, all,

See inline below tagged [MK3] (towards the end of the msg).

On Fri, 1 Nov 2024, Neal Cardwell wrote:

> On Tue, Oct 29, 2024 at 10:45 AM Markku Kojo <kojo=40cs.helsinki.fi@dmarc.ietf.org> wrote:
>       Hi Neal, all,
>       I just noted that I have missed a bunch of replies as I had been out of
>       email for about one month starting in mid July. Many thanks for the
>       replies and clarifications and my apologies for not having time to track
>       and search the mailing list until now.
>       I just now reply to this thread wrt lost retransmission handling. The
>       others are of less importance and possibly fine with the latest
>       adjustments. I'll pass through the rest of the threads ASAP within a few
>       coming days to see if there was anything important left.
>       The major point with the suggested way of reinitializing the PRR state
>       using the same steps as in the beginning of recovery is that it
>       often results in incorrect result if one follows the current CC RFCs on
>       how to do multiplicative decrease. Reason: both FlightSize and cwnd are
>       badly off during Fast Recovery and must not be used to compute new
>       ssthresh (and cwnd).
>       Pls see inline tagged [MK2].
>       On Tue, 16 Jul 2024, Neal Cardwell wrote:
>       >
>       >
>       > On Tue, Jun 25, 2024 at 10:25 AM Markku Kojo <kojo=40cs.helsinki.fi@dmarc.ietf.org> wrote:
>       >       Hi Neal, all,
>       >
>       >       I changed the subject line for discussing this specific topic of PRR
>       >       behaviour when loss of a retransmission is detected.
>       >
>       >       Please see below tagged [MK].
>       >
>       >       On Mon, 18 Mar 2024, Neal Cardwell wrote:
>       >
>       >       >       In addition, it seems that the algorithm in the latest version does not
>       >       >       address my WGLC comment on reducing send rate (ssthresh) again if
>       >       >       RACK-TLP detects loss of a retransmission. The sender must reduce
>       >       >       ssthresh again as loss of a rexmit occurs on another RTT. If it is not
>       >       >       done, the fast recovery keeps on sending at the same rate until the end
>       >       >       of recovery regardless of how many times a segment has to be
>       >       >       retransmitted. This sounds very bad behaviour to me in front of heavy
>       >       >       congestion that drops a lot of pkts (rexmits) and the PRR sender does not
>       >       >       react at all.
>       >       >
>       >       >
>       >       > I would argue that the question of whether a connection should reduce ssthresh when
>       >       > RACK-TLP detects the loss of a retransmission, while important, is outside the scope
>       >       > of PRR. PRR is taking loss detection and congestion control decisions as externally
>       >       > provided inputs into PRR. When to mark a packet as lost is a loss detection question,
>       >       > and whether to reduce ssthresh upon a particular packet loss is a congestion control
>       >       > decision. PRR is focused on taking the ssthresh output from congestion control, and
>       >       > loss detection decisions from the loss detection algorithm, and deciding how to evolve
>       >       > the cwnd to try to smoothly and safely converge the volume of in-flight data toward
>       >       > the given ssthresh.
>       >
>       >       [MK] Yes, agreed that loss detection (including detecting the loss of
>       >       a retransmission) is outside of scope of PRR (i.e., detecting loss of
>       >       rexmit is currently RACK-TLP).
>       >
>       >       However, PRR is a congestion control algorithm defining the congestion
>       >       control behaviour of the sender during a fast recovery. Currently it
>       >       borrows only the multiplicative decrease factor from other congestion
>       >       control algos (that is, either from RFC 5681 or RFC 9438) but defines
>       >       everything else in controlling the send rate (= congestion ctrl) during
>       >       fast recovery.
>       >
>       >       PRR does not need to define the multiplicative decrease factor to be used
>       >       when a loss of rexmitted segment is detected. It may borrow it from
>       >       another doc like it currently does for entering loss recovery.
>       >       However, I don't quite see how some other document possibly could define
>       >       how the other PRR-specific variables are reinitialized,
>       >       e.g., RecoverFS. Maybe I am missing something but the algo seems
>       >       not to work correctly with a lowered ssthresh after detection of
>       >       lost rexmit unless RecoverFS (and prr_delivered and prr_out too?) is also
>       >       adjusted. Could you explain how the algo is supposed to work upon
>       >       detecting loss of a rexmit with multiplicative decrease factor of 0.5,
>       >       for example.
>       >
>       >       Thanks,
>       >
>       >       /Markku
>       >
>       >
>       > Hi Markku,
>       >
>       > In regards to the way PRR is intended to work after a data sender detects a lost retransmit, I have
>       > chatted with Matt about this, and I think Matt and I have a similar perspective on this. Let me
>       > offer my thoughts:
>       >
>       > + I agree it doesn't make sense for other IETF documents to try to define how PRR-specific variables
>       > are reinitialized. To do so would invite a combinatorial explosion of standards, as every congestion
>       > control algorithm doc would need to specify how PRR-enabled and non-PRR implementations should work.
>       > And then those documents would need to be updated every time the PRR algorithm specification
>       > changes.
>       [MK2] Yes, agreed.
>       > + Likewise, I would argue that it doesn't make sense for the PRR document to attempt to define (a)
>       > when a congestion control algorithm decides to slow down (reduce ssthresh and/or cwnd),
>       [MK2] Sure, the PRR document does not need to define (a). Lowering ssthresh
>       and cwnd on the loss of a retransmission is already a MUST in RFC 5681. The
>       long-term tradition in CC RFCs has been to repeat crucial MUSTs with a
>       normative reference. So, I think this document should also explicitly
>       repeat that on detecting loss of a retransmission, the TCP sender MUST
>       lower ssthresh once per RTT (and give enough details on how to do it
>       correctly with PRR to avoid pitfalls with flightsize and cwnd, but not
>       saying how much to lower).
>       > or (b) what
>       > the exact ssthresh value is as a result of the slow-down decision. For the PRR document to attempt
>       > to document (a) and (b) would likewise result in a combinatorial explosion of text, and dependencies
>       > between documents, and document updates.
>       [MK2] Agreed, no need to define the exact ssthresh value, but needs to
>       advise how to compute it correctly (see below).
>       > Instead, the model we are advocating with PRR is a separation of concerns:
>       >
>       > + a congestion control algorithm decides:
>       >   (a) when a congestion control algorithm decides to slow down (reduce ssthresh and/or cwnd)
>       >   (b) what the exact ssthresh value is as a result of the slow-down decision
>       >
>       > + PRR decides: given (a) and (b) decisions made by a separate congestion control algorithm, how to
>       > set the cwnd using each ACK
>       >
>       > How should that work, in practice, after a data sender detects a lost retransmit?
>       >
>       > Each time the data sender detects a lost retransmit, the congestion control algorithm should decide
>       > whether or not to slow down. I would think that, ideally, a well-designed congestion control
>       > algorithm should slow down multiplicatively once per round trip, for every round trip in which there
>       > is any loss detected (whether it is a lost retransmit or a lost original transmission). Any time
>       > the congestion control algorithm decides to slow down (for whatever reason), it would initiate a new
>       > PRR episode, and invoke the PRR initialization code. That would take care of initializing RecoverFS,
>       > prr_delivered, and prr_out, to ensure that the PRR behavior over the next round trip and beyond works
>       > correctly.
>       [MK2] To ensure that multiplicative decrease is implemented correctly,
>       this document should give exact advice on how to compute the new
>       ssthresh value. If one follows current RFCs (e.g., RFC 5681 or RFC 9438)
>       and computes the new value of ssthresh (and cwnd) either using flightsize
>       (ssthresh = 0.5 * FlightSize or ssthresh = 0.7 * FlightSize) or cwnd
>       (ssthresh = 0.5 * cwnd or ssthresh = 0.7 * cwnd) the result is often not
>       correct (or is more or less random).
>       This is because flightsize becomes inflated during fast recovery as the
>       TCP sender sends new data during the recovery (before any
>       cumulative/partial ACKs arrive). Similarly, with PRR cwnd reaches the
>       target (= correctly reduced) value only at the end of recovery, meaning
>       that cwnd is too big during the recovery.
>       A typical, simple scenario with flightsize, for example:
>       Amount of outstanding data is 100 segments and a loss is detected ->
>       ssthresh = 50 or 70 (assume CC algo is Reno or Cubic) and recovery
>       starts. Assume the fast rexmitted (1st lost) segment is dropped.
>       During the first RTT the TCP sender injects 50 or 70 new data segments ->
>       FlightSize = 150 or 170. Soon after the first RTT, the TCP sender detects
>       the loss of the rexmitted segment and computes: ssthresh = 0.5 * 150 = 75
>       (Reno) or ssthresh = 0.7 * 170 = 119 (Cubic). This results in a three times
>       higher ssthresh value than expected with Reno (= 25) and a ~2.4 times
>       higher value than expected with Cubic (= 49).
>       A simple scenario with cwnd, for example:
>       Amount of outstanding data is 100 segments and a loss is detected ->
>       ssthresh = 50 or 70 (assume CC algo is Reno or Cubic) and recovery
>       starts. Assume there is a significant number of losses in the current
>       window of data and the fast rexmitted (= 1st lost) segment is
>       dropped. In addition, there may be significant ACK loss.
>       That is, a typical case with very heavy congestion, where it would be
>       crucial to reduce ssthresh (and cwnd) correctly.
>       During the first RTT only little data gets delivered (i.e., hardly any
>       SACKed data and few additional lost segments are detected), keeping
>       cwnd ~ flightsize (= cwnd before entering recovery). When the lost rexmit
>       is detected after one RTT, the TCP sender computes new ssthresh =
>       0.5 * cwnd or 0.7 * cwnd and the result is only a minimal reduction from
>       the ssthresh used during the first RTT of recovery, instead of lowering twice
>       with the same multiplicative decrease factor.
>       I hope this clarifies the issue and the need to define that ssthresh MUST
>       NOT be reinitialized using flightsize or cwnd.
>       Maybe on detecting loss of a rexmit, ssthresh could be reinitialized
>         ssthresh = multiplicative_decrease_factor * ssthresh?
>       In addition, when rereading the PRR algo I also found an additional
>       problem with the PRR algo that uses pipe (RFC 6675 pipe algorithm)
>       together with RACK-TLP loss detection. The algo uses pipe as the
>       (quite accurate) estimate of outstanding data. However, the definition of
>       pipe depends on loss detection in RFC 6675, which defines a lost segment
>       differently from RACK-TLP. The pipe algo depends on the RFC 6675 IsLost()
>       function that requires three SACKed segments above the lost segment to
>       declare the segment lost, while this is often not the case with RACK-TLP.
>       Shouldn't the "pipe" estimate in the algo be based on the loss detection
>       algorithm in use? Otherwise, pipe may be badly off in a number of
>       scenarios.
>       The same holds for non-SACK fast recovery, that is NewReno. I think the
>       document/algo should clarify that, without SACK, the SACKed segments for
>       pipe calculation are estimated in the similar way as they are estimated
>       for DeliveredData (i.e., one SACKd segment = one duplicate ACK).
>       Hope this is helpful.
>       Best regards,
>       /Markku
>       > I would propose that we make that more clear in the PRR document.
>       >
>       > In section 6, "Algorithm", I would propose we change the existing text:
>       >
>       >   At the beginning of recovery, initialize the PRR state.
>       > to:
>       >
>       >    At the beginning of a congestion control response episode initiated
>       >    by the congestion control algorithm, a TCP data sender using PRR
>       >    MUST initialize the PRR state. The timing of the start of a
>       >    congestion control response episode is entirely up to the
>       >    congestion control algorithm, and (for example) could correspond to
>       >    the start of a fast recovery episode, or a once-per-round-trip
>       >    reduction when lost retransmits or lost original transmissions are
>       >    detected after fast recovery is already in progress.
>       >
>       > How does that sound to everyone?
>       >
>       > neal
> Hi Markku,
> Markku, thanks once again for the detailed feedback!
> If I understand correctly, you are basically making four high-level points:
> (1) [when to reduce ssthresh on lost retransmits]
> > ...I thin[k] this document should also explicitly
> > repeat that on detecting loss of a retransmission, the TCP sender MUST
> > lower ssthresh once per RTT (and give enough details on how to do it
> > correctly with PRR to avoid pitfalls with flightsize and cwnd, but not
> > saying how much to lower).

[MK3] To clarify: I am not proposing that the PRR doc define "when to reduce
ssthresh on lost retransmits". That is already defined/required in RFC
5681, which is the congestion control draft standard. I only suggest that the
PRR doc follow the well-established tradition in CC RFCs, indicate that
it conforms to this requirement, and remind the reader of it by
adding a normative reference to RFC 5681. Otherwise, an implementor might
easily miss this crucial requirement.

> (2) [how to set ssthresh on lost retransmits]
> > The major point with the suggested way of reinitializing the PRR state
> > using the same steps as in the beginning of recovery is that it
> > often results in incorrect result if one follows the current CC RFCs on
> > how to do multiplicative decrease. Reason: both FlightSize and cwnd are
> > badly off during Fast Recovery and must not be used to compute new
> > ssthresh (and cwnd).
> > ...ssthresh MUST NOT be reinitialized using flightsize or cwnd.
> > ...Maybe on detecting loss of a rexmit, ssthresh could be reinitialized
> >
> >   ssthresh = multiplicative_decrease_factor * ssthresh?
> IMHO the (1)/(2) questions of exactly when to set ssthresh (sequence window? time window? lost retransmits?) and how to set
> ssthresh (based on FlightSize? cwnd?) is tricky, and opens up a whole can of worms.

[MK3] It seems that I was not able to put it clearly. No, I am not
proposing that the PRR doc advise whether ssthresh should be set by using
FlightSize or cwnd. Instead, I find it important for the PRR doc to give a
heads-up so that an implementor avoids incorrect ways of setting ssthresh.
That is, I am proposing that the PRR doc clearly say that when a new PRR
episode is initiated due to detection of a lost retransmission during
the fast recovery phase and a new value of ssthresh is set as part of the
initialization, neither FlightSize nor cwnd should be used to set the new
value of ssthresh. The reason is simple: both FlightSize and cwnd are
moving targets during fast recovery. If cwnd is used to set the new value
of ssthresh and cwnd has the value "cwnd_before" just before entering
fast recovery, the value of cwnd fluctuates in the range [1, cwnd_before-1]
during fast recovery. This means that the resulting ssthresh may be
undesirably low or unacceptably high (practically hardly decreased at
all). Similarly, if FlightSize is used to set the new value of ssthresh
and FlightSize has the value FS_before just before entering fast recovery,
the value of FlightSize fluctuates in the range [0, FS_before +
Multiplicative_Decrease_Factor * FS_before] during fast recovery. This
again means that the resulting ssthresh may be undesirably low or
unacceptably high (it may even be increased instead of decreased).

[MK3] I think it should be quite clear that neither cwnd nor FlightSize
can have a proper value during fast recovery to be used in (re)setting
ssthresh when a lost rexmit is detected. Instead, setting ssthresh should
be based on a variable that is set when the PRR episode was initiated and
that stays unmodified during the recovery. ssthresh has this
property, so I am wondering if it could be used instead? Maybe I am
missing something but I cannot see how ssthresh could be set to an
appropriate value if using cwnd or FlightSize. Could you explain how to
set it correctly?
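
A minimal sketch of the arithmetic behind this point, reusing the
100-segment scenario from the [MK2] text quoted above. The names
(lost_rexmit_scenario, beta) are hypothetical and not taken from any RFC or
draft. It assumes the sender injects beta * FS_before new segments during
the first RTT of recovery and that nothing has been cumulatively ACKed when
the lost rexmit is detected.

```python
from fractions import Fraction


def lost_rexmit_scenario(fs_before, beta):
    """Compare two ways of re-deriving ssthresh after a lost rexmit.

    fs_before: FlightSize (in segments) just before entering fast recovery.
    beta: multiplicative decrease factor, e.g. Fraction(1, 2) or Fraction(7, 10).
    Fractions keep the arithmetic exact (no floating-point rounding).
    """
    # First reduction, taken when fast recovery is entered.
    first_ssthresh = beta * fs_before

    # Assumption: roughly first_ssthresh new segments were sent during the
    # first RTT and none of the original window has been ACKed yet.
    inflated_flightsize = fs_before + first_ssthresh

    return {
        "first_ssthresh": first_ssthresh,
        # Re-applying the RFC-style formula to the inflated FlightSize.
        "from_flightsize": beta * inflated_flightsize,
        # Applying the factor to the value fixed at the start of the episode.
        "from_ssthresh": beta * first_ssthresh,
    }


if __name__ == "__main__":
    for name, beta in (("Reno", Fraction(1, 2)), ("CUBIC", Fraction(7, 10))):
        print(name, lost_rexmit_scenario(100, beta))
```

With 100 segments, the Reno factor yields 75 from the inflated FlightSize
versus 25 from the previous ssthresh, and the CUBIC factor yields 119 versus
49, matching the numbers in the quoted scenario.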

> Many of the tricky issues are nicely described in detail in this nice IETF 120 deck from Matt Mathis:
>   https://datatracker.ietf.org/meeting/120/materials/slides-120-ccwg-cc-response-while-application-limited-00

[MK3] This nice presentation by Matt is mostly about whether to use
something other than FlightSize (i.e., cwnd) when setting ssthresh (and cwnd)
at the time the recovery phase is first entered. This is a quite well-known
problem and was debated in depth when CUBIC [RFC 9438] was finalized. But
here with PRR the question is not whether to use cwnd or FlightSize but
that one must not use either of them.

> The issue is so thorny that, AFAIK, all the major TCP implementations (Linux, FreeBSD, Windows, MacOS/iOS) diverge from the
> Reno/RFC5681 and CUBIC/RFC9438 standards, and do not set ssthresh based on FlightSize, but rather based on cwnd.
> IMHO this is a much deeper and wider problem than what can be reasonably tackled in a PRR draft.
> So for (1) and (2) my sense is still that it is best to leave the question of when/how to set ssthresh as something to be
> documented outside the PRR draft.

[MK3] Quite the contrary, I think. With PRR, cwnd is modified constantly during
fast recovery, while with RFC 6675 cwnd is constant during fast recovery
and could well be used to set ssthresh (and cwnd) if a loss of a rexmit is
detected. So, the problem of using cwnd is specific to PRR and should be
tackled in the PRR doc, I think. If tcpm does not address this problem now,
it will create a new problem similar to the one that was created earlier and
that Matt's preso is about.

> So I will repeat my proposal: In section 6, "Algorithm", I would propose we change the existing text:
>   At the beginning of recovery, initialize the PRR state.
> to:
>    At the beginning of a congestion control response episode initiated
>    by the congestion control algorithm, a TCP data sender using PRR
>    MUST initialize the PRR state. The timing of the start of a
>    congestion control response episode is entirely up to the
>    congestion control algorithm, and (for example) could correspond to
>    the start of a fast recovery episode, or a once-per-round-trip
>    reduction when lost retransmits or lost original transmissions are
>    detected after fast recovery is already in progress.
> We will discuss this in the PRR slides for TCPM next week, and hopefully the TCPM chairs can moderate a discussion about
> whether to specify how/when to set ssthresh in the PRRbis draft. 
> (3) ["pipe" and current loss detection algorithm]
> > Shouldn't the "pipe" estimate in the algo be based on the loss detection
> > algorithm in use?
> (4) ["pipe" and current loss detection algorithm for non-SACK connections]
> > The same holds for non-SACK fast recovery, that is NewReno. I think the
> > document/algo should clarify that, without SACK, the SACKed segments for
> > pipe calculation are estimated in the similar way as they are estimated
> > for DeliveredData (i.e., one SACKd segment = one duplicate ACK).
> I agree that your points (3) and (4) make sense. To address your points (3) and (4), where the PRRbis draft currently uses
> "pipe" (as defined in RFC 6675) it should probably instead use a name referring to the estimated volume of data in-flight
> in the network as estimated by the *current* loss detection algorithm (which is often not RFC 6675, e.g. in Linux TCP).
> I would propose that the PRRbis draft use "inflight" since (a) this is the term used for this concept in Linux TCP, BBR
> (CACM article and Internet draft), and the Prague draft, and (b) there should not be confusion because "inflight" is not
> used anywhere currently in PRR (RFC6937 or PRRbis), RFC6675, Reno/RFC5681, or CUBIC/RFC9438.
> In Sec 5, "Definitions", I would propose to add:
> inflight: The transport connection's sender-side estimate of the number of bytes still outstanding in the network.
> Connections with SACK and using RFC 6675 loss detection MAY use "pipe" as specified in RFC 6675. For connections using
> RACK-TLP loss detection [RFC8985] or other loss detection algorithms, they MUST calculate inflight by starting with SND.NXT
> -  SND.UNA, subtracting out bytes SACKed in the SACK scoreboard, subtracting out bytes marked lost in the SACK scoreboard,
> and adding bytes in the scoreboard that have been retransmitted since they were last marked lost.
> Then I would propose to replace occurrences of "pipe" in the current draft-ietf-tcpm-prr-rfc6937bis-12 with "inflight".

[MK3] Agreed. This seems to me an appropriate way to address it. It would
be good to also add how inflight is calculated with non-SACK connections.
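
A rough sketch of the proposed "inflight" computation, under stated
assumptions: the Segment type and all function names are hypothetical, the
scoreboard is modelled as a plain list, sequence-number wraparound is
ignored, and a segment is assumed never to be both SACKed and marked lost.
The non-SACK variant only illustrates the "one duplicate ACK = one SACKed
segment" estimate suggested earlier in this thread; it is not text from the
draft.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Segment:
    length: int                      # payload bytes
    sacked: bool = False             # covered by a SACK block
    lost: bool = False               # marked lost by the loss detection in use
    rexmit_since_lost: bool = False  # retransmitted since it was last marked lost


def inflight_from_counts(snd_una, snd_nxt, sacked_bytes, lost_bytes, rexmit_bytes):
    """SND.NXT - SND.UNA, minus SACKed bytes, minus bytes marked lost,
    plus bytes retransmitted since they were last marked lost."""
    return (snd_nxt - snd_una) - sacked_bytes - lost_bytes + rexmit_bytes


def inflight_sack(snd_una, snd_nxt, scoreboard: Iterable[Segment]):
    """inflight for a SACK connection, derived from the scoreboard."""
    segs = list(scoreboard)
    sacked = sum(s.length for s in segs if s.sacked)
    lost = sum(s.length for s in segs if s.lost)
    rexmit = sum(s.length for s in segs if s.lost and s.rexmit_since_lost)
    return inflight_from_counts(snd_una, snd_nxt, sacked, lost, rexmit)


def inflight_nonsack(snd_una, snd_nxt, dupacks, smss, lost_bytes, rexmit_bytes):
    """Assumed non-SACK estimate: each duplicate ACK counts as one SACKed
    segment of SMSS bytes."""
    return inflight_from_counts(snd_una, snd_nxt, dupacks * smss,
                                lost_bytes, rexmit_bytes)
```

For example, with ten 1000-byte segments outstanding, three marked lost (one
of them already retransmitted) and three SACKed, inflight_sack() gives
10000 - 3000 - 3000 + 1000 = 5000 bytes.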



> We will also add a slide about this in the PRR presentation at TCPM on Tuesday, to gather feedback on this proposal.
> Markku, thanks again for the detailed feedback!
> best,
> neal