Re: [tcpm] I-D Action: draft-ietf-tcpm-prr-rfc6937bis-06.txt

Markku Kojo <kojo@cs.helsinki.fi> Wed, 27 March 2024 14:54 UTC

Date: Wed, 27 Mar 2024 16:54:11 +0200
From: Markku Kojo <kojo@cs.helsinki.fi>
To: Yoshifumi Nishida <nsd.ietf@gmail.com>
cc: Neal Cardwell <ncardwell@google.com>, Matt Mathis <ietf@mattmathis.net>, Matt Mathis <mattmathis@measurementlab.net>, tcpm@ietf.org

Hi Yoshi, Neal, all,

Please see below inline (tagged [MK]). And my apologies for a very long 
explanation. Hopefully I did not include too many mistakes this time ;)

In summary, it seems that we need neither to reset cwnd at the end of the
recovery nor necessarily to adjust RecoverFS. All the issues raised could
be resolved by simply correcting the definition of DeliveredData
(reverting to the original definition plus a small additional change) and
by moving the actions taken with the ACK that triggers loss recovery into
the unconditional steps executed together with the initialization of the
algo at the beginning. This would also be in line with how the other fast
recovery algorithms are described in the RFC series.

Hopefully I did not misunderstand any parts of the algo (either in RFC 
6937 or in the current -08 version of the draft).

On Thu, 21 Mar 2024, Yoshifumi Nishida wrote:

> Hi Neal,
> On Wed, Mar 20, 2024 at 1:32 PM Neal Cardwell <ncardwell@google.com> wrote:
> 
> 
> On Wed, Mar 20, 2024 at 3:29 PM Yoshifumi Nishida <nsd.ietf@gmail.com> wrote:
>       Hi Neal,
> 
> On Wed, Mar 20, 2024 at 6:55 AM Neal Cardwell <ncardwell@google.com> wrote:
>
>       On Wed, Mar 20, 2024 at 3:07 AM Yoshifumi Nishida <nsd.ietf@gmail.com> wrote:
> 
> On Mon, Mar 18, 2024 at 8:13 PM Neal Cardwell <ncardwell=40google.com@dmarc.ietf.org>
> wrote:
> 
> But your point still stands, and you raise a great point: simply initializing
> RecoverFS to "pipe" is not safe, because packets that were marked lost and removed
> from pipe may actually have been merely reordered. So if those packets are
> delivered later, they will increase the numerator of prr_delivered / RecoverFS
> without increasing the denominator, thus leading to a result above 1.0, and thus
> potentially leading to a target for Sent_so_far that is above ssthresh, causing the
> algorithm to erroneously exceed ssthresh.
> 
> 
> Hmm. I have a naive question here. If packets were merely reordered, isn't it ok for
> cwnd to be bigger than ssthresh?
> 
> 
> Yes, if packets were merely reordered and none were lost, then I agree it would be OK for cwnd
> to be bigger than ssthresh. And in fact I would argue that cwnd should be reverted back to its
> value before fast recovery. And this is in fact what Linux TCP would do, using loss recovery
> "undo" mechanisms based on TCP timestamps or DSACKs.
> 
> However, in the kind of scenario Markku described, there was not merely reordering, but also
> real packet loss: "1 packet is lost (P1), and 24 packets are delayed (packets P2..P25)". In the
> traditional loss-based Reno/CUBIC paradigm, any non-zero amount of real packet loss in fast
> recovery should result in the same multiplicative decrease in cwnd, regardless of the
> combination of reordering and loss. We could argue about whether that approach is the best
> approach (BBR, for example, takes a different approach), but that is a very different
> discussion. :-) For now AFAICT we are focused on PRR's faithful enactment of the congestion
> control algorithms decision to reduce cwnd toward ssthresh when there is any non-zero amount of
> real packet loss in fast recovery.
> 
> 
> Got it. But, I just would like to clarify whether we are discussing the inflation of sndcnt during
> the recovery process or cwnd after the exit of recovery.
> 
> 
> Good point. We are talking about inflation of sndcnt during the recovery process.

[MK] I think we are talking about both in practice, because inflation of
sndcnt during the recovery process would also result in exiting recovery
with too large a cwnd. In the examples that I gave, Sent_so_far was
calculated when the SACK for P100 had arrived (actually the numbers
were off by one):

For Reno:

Sent_so_far = CEIL(prr_delivered * ssthresh / RecoverFS)
             = CEIL(97 * 50 / 72)
             = 68

For CUBIC:
Sent_so_far = CEIL(prr_delivered * ssthresh / RecoverFS)
             = CEIL(97 * 70 / 72)
             = 95

Now, when the cumulative ACK triggered by rexmit of P1 arrives and
terminates fast recovery, the following is executed as per the *current
version* of the draft:

DeliveredData =  (bytes newly cumulatively acknowledged) = 100
DeliveredData += (bytes newly selectively acknowledged) = 100 + 0

prr_delivered += DeliveredData = 95 + 100 = 195
pipe = 94
if (pipe > ssthresh) => (94 > 70) => (true)
  sndcnt = CEIL(prr_delivered * ssthresh / RecoverFS) - prr_out
         = CEIL(195 * 70 / 72) - 95 = 190 - 95 = 95
cwnd = pipe + sndcnt = 94 + 95 = 189

So oops, when exiting fast recovery cwnd would be nearly doubled from
what it was before entering loss recovery. There seems to be an
additional problem because the definition of DeliveredData in the
current version of the draft is incorrect: the cumulatively acked bytes
that have already been SACKed are counted twice in DeliveredData. It
seems that RFC 6937 and rfc6937bis-01 both define DeliveredData (nearly)
correctly by including the change in snd.una in DeliveredData and
subtracting data that has already been SACKed. The definition of
DeliveredData obviously needs to be corrected. See also below the issue
with bytes above snd.una that get SACKed before the start of recovery.

With the original definition of DeliveredData cwnd would not be
inflated further, but fast recovery would still exit with too large a
cwnd (for CUBIC cwnd=95 instead of 70, and for Reno cwnd=68 instead of
50) if we use too small a RecoverFS (= pipe).

So, it seems that we agree that the problem of sending too many bytes
during the recovery process gets corrected if RecoverFS is initialized to
snd.nxt - snd.una. The next question is: should RecoverFS be initialized
to an even higher value in some scenarios? See below.
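
To make the difference concrete, here is a minimal sketch (Python, counts
in MSS units, hypothetical numbers) comparing the two definitions of
DeliveredData on a cumulative ACK that covers bytes already SACKed during
the recovery:

   def delivered_data_rfc6937(delta_snd_una, delta_sacked):
       # RFC 6937 / rfc6937bis-01: change_in(snd.una) + change_in(SACKd).
       # change_in(SACKd) is negative when the cumulative ACK covers bytes
       # SACKed earlier, so those bytes are not counted a second time.
       return delta_snd_una + delta_sacked

   def delivered_data_draft(newly_cum_acked, newly_sacked):
       # Current draft text as quoted above: bytes newly cumulatively
       # acknowledged plus bytes newly selectively acknowledged; bytes
       # SACKed earlier that are now below snd.una get counted twice.
       return newly_cum_acked + newly_sacked

   # Hypothetical final ACK: snd.una advances over 100 segments, 99 of
   # which were already SACKed (and already counted) during recovery.
   print(delivered_data_rfc6937(100, -99))   # -> 1
   print(delivered_data_draft(100, 0))       # -> 100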

> Because I'm personally not very keen to address miscalculation of lost packets due to reordering
> during the recovery process as it seems to be tricky. 
> 
> 
> It is tricky, but I think it is feasible to address. What do folks think about my suggestion from above in
> this thread:
> 
>   existing text:
>      pipe = (RFC 6675 pipe algorithm)
>      RecoverFS = pipe              // RFC 6675 pipe before recovery
> 
>   proposed new text:
>      RecoverFS = snd.nxt - snd.una + (bytes newly cumulatively acknowledged)
> 
> 
> Hmm. Sorry. I'm not very sure about the difference between snd.nxt - snd.una and snd.nxt - snd.una + (bytes newly
> cumulatively acknowledged)
> Could you elaborate a bit? I thought we don't have data which are cumulatively acked in case of reordering.

[MK] It seems there might be another case that Neal is thinking of, where
the sender may end up sending too many segments during the first RTT of
fast recovery. If I understood it correctly, this may occur, for example,
in a scenario with ACK loss for pkts preceding the first dropped data
pkt. Consider the following scenario where there are 100 pkts
outstanding:

  P1..P24, P25, P26, P27, P28..P100

Packets P1..P24 and P26..P100 are delivered successfully to the
receiver. P25 is lost. ACKs (and SACKs) for pkts P1..P24, P26 and P27 get
dropped. SACKs for P28..P100 are delivered successfully. When the SACK
for pkt P28 arrives, an RFC 6675 sender would declare P25 lost and
enter fast retransmit. Similarly, a RACK-TLP sender may declare P25 lost,
but this may happen upon the arrival of any of the SACKs for P28..P100.

Let's assume we were fully utilizing the congestion window, i.e.,
cwnd=100, and we enter loss recovery when the SACK of P28 arrives
(cumulative ACK#=25):

ssthresh = cwnd / 2 = 50  (Reno)
prr_delivered = prr_out = 0
Pipe = snd.nxt - snd.una - (lost + SACKed) = 76 - (1 + 3) = 72
RecoverFS = snd.nxt - snd.una = 101 - 25 = 76

DeliveredData = (bytes newly cumulatively acknowledged) = 24
DeliveredData += change_in(SACKd) = 24+3 = 27
prr_delivered += DeliveredData = 0+27 = 27

if (pipe > ssthresh) => if (72 > 50) => true
       // Proportional Rate Reduction
       sndcnt = CEIL(prr_delivered * ssthresh / RecoverFS) - prr_out
              = CEIL(27 * 50 / 76) - 0 = 18

cwnd = 72 + 18 = 90

so, we will send a burst of 18 pkts on entry to recovery and around 50
more pkts during the rest of the recovery, giving a total of around 68
pkts while only 50 were allowed. If we add the 24 cumulatively acked pkts
into RecoverFS as Neal suggests, we are about to send around 14+37=51
pkts, which is almost fine. However, the major shortcoming of this
approach is that we would still send a burst of 14 pkts at the beginning
of the recovery, while avoiding such a burst was one of the major goals
of PRR.
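
For reference, a minimal sketch of the arithmetic above (Python, packet
counts), showing the burst on entry for the two RecoverFS choices:

   import math

   def prr_sndcnt(prr_delivered, ssthresh, recover_fs, prr_out):
       # Proportional part of PRR, used while pipe > ssthresh.
       return math.ceil(prr_delivered * ssthresh / recover_fs) - prr_out

   # ACK that triggers recovery: 24 pkts newly cumulatively acked plus
   # 3 newly SACKed, ssthresh = 50 (Reno), nothing sent in recovery yet.
   print(prr_sndcnt(27, 50, 76, 0))    # RecoverFS = snd.nxt - snd.una -> 18
   print(prr_sndcnt(27, 50, 100, 0))   # RecoverFS also counts the 24  -> 14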

Alternatively, we could modify the algo such that the cumulatively acked
bytes carried by the ACK that triggers loss recovery are added neither to
DeliveredData nor to RecoverFS. Then we would send just one pkt (rexmit
of P25) on entering the recovery and around 49 pkts during the rest of
the recovery, i.e., 1+49=50 pkts during the recovery, which would be
exactly equal to the ssthresh we set. With this approach we could avoid
the burst at the beginning. In addition, we could have a consistent
solution also for the additional problem of including extra SACKed data
with the ACK that triggers the recovery. Let's look at the above scenario
again, with cwnd=100 and pkts P1..P100 in flight:

  P1..P24, P25, P26, P27, P28..P100

Packets P1..P24 are delivered to the receiver but the ACKs get dropped
(whether the ACKs are dropped or not is not relevant for this issue). P25
gets dropped. If the DupAcks of pkt P26 and pkt P27 are delivered, then
from the DupAck of P28 only the SACK info for P28 is counted in
DeliveredData; the SACK info for P26 and P27 is never counted in
DeliveredData because P26 and P27 are already SACKed when the DupAck of
P28 arrives. However, if the DupAcks of pkt P26 and pkt P27 get dropped
as in the previous example, the ACK of P28 includes new SACK info for
pkts P26, P27, and P28, and the bytes of P26 and P27 are also counted in
DeliveredData. (If also the DupAck of P28 gets dropped, the DupAck of P29
may include up to 3 MSS of additional SACK info to be counted (P26, P27,
and P28).) This alone will result in a miniburst at the beginning of the
recovery, or add to the burst size as in the previous example, where the
two additional SACKs (for P26 and P27) inflated prr_delivered by 2,
resulting in a slightly too large number of segments being sent during
the recovery (51).

As suggested above, this problem with additional SACKs could be solved by
allowing the DupAck that triggers the loss recovery to add only "itself"
to DeliveredData, and by letting PRR include the missing bytes for pkts
that were SACKed before the start of the recovery only at the end of the
recovery, when the cumulative ACK triggered by the rexmit of the first
missing pkt (P25) arrives and inherently covers those bytes.

In other words, the algo can be modified such that fast retransmit is
always handled separately at the beginning of the recovery, together with
the initialization of the PRR variables:

   ssthresh = CongCtrlAlg()      // Target flight size in recovery
//[MK]: the next three lines can be deleted as unnecessary
   prr_delivered = 0             // Total bytes delivered in recovery
   prr_out = 0                   // Total bytes sent in recovery
   pipe = (RFC 6675 pipe algorithm)

   RecoverFS = snd.nxt - snd.una // FlightSize right before recovery
                                 // [MK]:maybe add cumulatively ACKed
                                 //      bytes?

   Fast Retransmit the first missing segment
   prr_delivered  = (With SACK: bytes selectively acknowledged by the first
                     SACK block of the ACK triggering the loss recovery, OR
                     Without SACK: 1 MSS)
   prr_out  = (data fast retransmitted)

On each arriving ACK during the rest of the fast recovery, including the 
final cumulative ACK that signals the end of loss recovery:

   DeliveredData = change_in(snd.una)
   if (SACK is used) {
      DeliveredData += change_in(SACKd) //[MK]:(*) modify change_in(SACKd)
   ...


The above changes would imply deleting

  if (prr_out is 0 AND sndcnt is 0) {
       // Force a fast retransmit upon entering recovery
       sndcnt = MSS
  }

from the algo and would make it consistent with the description of the
other fast retransmit and fast recovery algorithms (RFC 5681, RFC 6582,
RFC 6675), which include fast retransmit together with the initialization
of the algo in the unconditional first steps of the algorithm.
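
A minimal sketch of the proposed unconditional entry steps (Python,
counts in whole segments; the function and parameter names are mine, not
from the draft):

   def enter_prr_recovery(snd_nxt, snd_una, triggering_sack_segs,
                          rexmit_segs):
       # Initialize PRR and account for the fast retransmit of the first
       # missing segment in the same unconditional step, so the
       # "force a fast retransmit" branch is no longer needed.
       recover_fs = snd_nxt - snd_una          # FlightSize before recovery
       prr_delivered = triggering_sack_segs    # only what the triggering
                                               # DupAck itself newly SACKs
       prr_out = rexmit_segs                   # the fast-retransmitted data
       return recover_fs, prr_delivered, prr_out

   # E.g., the earlier Reno scenario: snd.nxt = 101, snd.una = 25, the
   # triggering DupAck adds only "itself" (1 seg), P25 is retransmitted.
   print(enter_prr_recovery(101, 25, 1, 1))    # -> (76, 1, 1)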

(*)
In addition, one more smallish but important correction is needed. The
bytes that are SACKed before the recovery starts (i.e., typically the
famous first two DupAcks, or more bytes if the start of recovery is
postponed due to reordering) should be taken into account in
DeliveredData during the recovery, but with the current algo they are
never counted in DeliveredData (and prr_delivered).
Why? Because when the first cumulative ACK arrives, it advances snd.una
such that those bytes are covered, but change_in(SACKd) is negative and
it incorrectly subtracts also these bytes from DeliveredData (and
prr_delivered) even though they were never counted in. Usually this is
only 2 MSS, but in scenarios similar to the one that Neal earlier
introduced there might be many more data bytes that are not counted. This
change would also solve the problem of exiting PRR with too low a cwnd.
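
A small sketch (Python, packet counts; the names are mine) of the
bookkeeping rule suggested here:

   def delivered_on_cum_ack(advance, covered_sacked_during_recovery):
       # advance: pkts by which snd.una advanced on this cumulative ACK.
       # covered_sacked_during_recovery: of those, the pkts SACKed after
       # recovery started (already counted in prr_delivered). Pkts SACKed
       # *before* recovery started were never counted, so they must not
       # be subtracted back out here.
       return advance - covered_sacked_during_recovery

   # Neal's example below: snd.una advances over P1..P10; only P10 was
   # SACKed during recovery, P2..P9 were SACKed before it started.
   print(delivered_on_cum_ack(10, 1))   # -> 9, not 1
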
Let's look at Neal's earlier example again (see the comments tagged [MK]
for the suggested change to solve the issue):

CC = CUBIC
cwnd = 10
The reordering degree was estimated to be large, so the connection will 
wait for more than 3 packets to be SACKed before entering fast recovery.

--- Application writes 10*MSS.

TCP sends packets P1 .. P10.
pipe = 10 packets in flight (P1 .. P10)

--- P2..P9 SACKed  -> do nothing //

(Because the reordering degree was previously estimated to be large.)

--- P10 SACKed -> mark P1 as lost and enter fast recovery

PRR:
ssthresh = CongCtrlAlg() = 7 packets // CUBIC
prr_delivered = 0
prr_out = 0
RecoverFS = snd.nxt - snd.una = 10 packets (P1..P10)

DeliveredData = 1  (P10 was SACKed)

prr_delivered += DeliveredData   ==> prr_delivered = 1

pipe =  0  (all packets are SACKed or lost; P1 is lost, rest are SACKed)

safeACK = false (snd.una did not advance)

if (pipe > ssthresh) => if (0 > 7) => false
else
  // PRR-CRB by default
  sndcnt = MAX(prr_delivered - prr_out, DeliveredData)
         = MAX(1 - 0, 1)
         = 1

  sndcnt = MIN(ssthresh - pipe, sndcnt)
         = MIN(7 - 0, 1)
         = 1

cwnd = pipe + sndcnt
     = 0    + 1
     = 1

retransmit P1

prr_out += 1   ==> prr_out = 1

--- P1 retransmit plugs hole; receive cumulative ACK for P1..P10

DeliveredData = 1  (P1 was newly ACKed) //[MK]: should be = 10 - 1 = 9

//[MK]: Note that the SACKed bytes of P2..P9 were also newly acked
//      because those bytes had not been delivered *during* the
//      recovery by this point and were thereby not yet counted in
//      prr_delivered.
//      So, they should not be subtracted from DeliveredData
//      but included, as those bytes got delivered only when
//      snd.una advanced. Only P10 should be subtracted.

prr_delivered += DeliveredData   ==> prr_delivered = 2
//[MK]: should be = 1 + 9 = 10

pipe =  0  (all packets are cumulatively ACKed)

safeACK = (snd.una advances and no further loss indicated)
safeACK = true

if (pipe > ssthresh) => if (0 > 7) => false
else
  // PRR-CRB by default
  sndcnt = MAX(prr_delivered - prr_out, DeliveredData)
         = MAX(2 - 1, 1)  //[MK]  = MAX(10-1, 1)
         = 1              //[MK]  = 9
  if (safeACK) => true
    // PRR-SSRB when recovery is in good progress
    sndcnt += 1   ==> sndcnt = 2 //[MK] ==> sndcnt = 10

  sndcnt = MIN(ssthresh - pipe, sndcnt)
         = MIN(7 - 0, 2) //[MK] = MIN(7 - 0, 10)
         = 2             //[MK] = 7

cwnd = pipe + sndcnt
     = 0    + 2  //[MK] = 0 + 7
     = 2         //[MK] = 7

So we exit fast recovery with cwnd=2 even though ssthresh is 7.

[MK]: Or, we exit with cwnd=7 if we correctly count in DeliveredData,
during the recovery process, all data that was in flight when the
recovery started. All bytes in flight at the start of the recovery are
supposed to become acknowledged by the end of the recovery, so they
should be counted in prr_delivered during the recovery.
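
The final-ACK step recomputed with the corrected DeliveredData, as a
small Python sketch (packet counts), reproduces the [MK] numbers above:

   # State when the cumulative ACK for P1..P10 arrives (CUBIC, ssthresh=7).
   prr_delivered, prr_out, ssthresh, pipe = 1, 1, 7, 0

   delivered = 10 - 1             # snd.una advances over 10 pkts; only P10
                                  # was already counted during recovery
   prr_delivered += delivered     # -> 10

   sndcnt = max(prr_delivered - prr_out, delivered)   # PRR-CRB  -> 9
   sndcnt += 1                                        # PRR-SSRB -> 10
   sndcnt = min(ssthresh - pipe, sndcnt)              # -> 7
   cwnd = pipe + sndcnt
   print(cwnd)                                        # -> 7, not 2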

>             However, I think it won't be good if it's propagated to after the recovery.  But, don't we
>             reset cwnd to ssthresh at the end of recovery?

[MK]: It seems that just by correcting the definition of DeliveredData
there is no need to reset cwnd to ssthresh at the end of recovery, because
the algo would do it for us. But I am not opposed to resetting cwnd to
ssthresh at the end. In that case it might be better to specify it by
giving two alternatives, similar to what RFC 6582 does. Maybe:

   Set cwnd to either (1) min (ssthresh, cwnd) or (2) ssthresh.


[MK] Finally, one correction and one clarifying question.
The first one is related to the use of Limited Transmit and computing
ssthresh:
when Limited Transmit is in use, the segments transmitted as part of
Limited Transmit must not be counted in the FlightSize used for
determining the new value of ssthresh (and cwnd), as specified in RFC
5681 (Sec 3.2, step 2) and RFC 6675 (Sec 5, step 4.2). In other words,
ssthresh after the multiplicative decrease should have the same value
regardless of whether Limited Transmit is used or not, because Limited
Transmit follows the packet conservation principle and does not increase
the effective flight size. If one uses cwnd to compute the new value of
ssthresh, the same result follows because the bytes transmitted via
Limited Transmit must not increase cwnd.
This seems to be reflected incorrectly in the examples for RFC 6675 in
Figs 1 and 2. The cwnd should be reduced to 10 (=20/2), not 11, and
consequently in Fig 1 for RFC 6675 the first new segment is sent on the
arrival of Ack#13, not Ack#12, and in Fig 2 RFC 6675 would send 6R (not
7R) on the arrival of Ack#17. Also, the behavior of PRR in Fig 1 should
be modified such that PRR does not send a new data segment on the arrival
of Ack#19 (but would send a new segment on the arrival of Ack#20 and
Ack#21, just like 6675 would do).

The second one relates to counting DeliveredData without SACK. The algo
includes the following, which was not present in the original PRR:

   if (SACK is used) {
      ...
   else if (ACK is a DUPACK and prr_delivered < RecoverFS) {
    DeliveredData += MSS

What might be the reason for no longer counting DupAcks soon after the
first RTT? When RecoverFS is initialized to the flight size at the
beginning of loss recovery, it is expected that around a flight size
worth of bytes becomes delivered (acked) by the time the first cumulative
ACK arrives (flight size - # of dropped pkts + 1).
If we stop counting DupAcks soon after the first RTT, prr_delivered would
make little forward progress during the following RTTs of the recovery.
For example, suppose we have pkts P1..P100 outstanding, pkts P1..P10 get
dropped, and no other drops occur during the recovery; cwnd == RecoverFS
== 100, ssthresh == 50.
This means that NewReno would require 10 RTTs to recover from all losses
P1..P10 and is allowed to send up to 50 pkts during each RTT. At the
end of the first RTT of recovery, the cumulative ACK for P1 arrives and
we have received and accumulated 88 (Dup)Acks in prr_delivered and
transmitted 44 pkts (1 rexmit and 43 new data pkts, CC = Reno). At the
beginning of the second RTT, we have outstanding 43 new data pkts and the
rexmit of P2 (= 44 pkts). After receiving 12 DupAcks during the 2nd RTT,
prr_delivered reaches RecoverFS and we stop increasing prr_delivered:
prr_delivered == 100 and prr_out == 50. We still need 8 more RTTs to
recover all 10 dropped pkts but we are allowed to send only 50 pkts more
during the recovery because DeliveredData does not inflate anymore.
Wouldn't this unnecessarily limit the forward progress during the last 8
RTTs of the NewReno loss recovery (Reno CC would allow sending up to
50 pkts per RTT)?
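
A rough sketch of the concern (Python, packet counts, non-SACK case,
using the example numbers above):

   import math

   RecoverFS, ssthresh = 100, 50
   prr_delivered = 88        # (Dup)Acks credited by the end of the 1st RTT

   # 2nd and later RTTs: DupAcks keep arriving, but only 12 more are
   # credited before the "prr_delivered < RecoverFS" guard stops counting.
   for _ in range(500):                 # many more DupAcks over ~8 RTTs
       if prr_delivered < RecoverFS:
           prr_delivered += 1

   print(prr_delivered)                                    # -> 100
   print(math.ceil(prr_delivered * ssthresh / RecoverFS))  # -> 50 (frozen)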

Thanks,

/Markku

> Yes, Linux TCP does reset cwnd to ssthresh at the end of recovery. And we discussed putting that in the
> draft last year in the thread "draft-ietf-tcpm-prr-rfc6937bis-03: set cwnd to ssthresh exiting fast
> recovery?". But it looks to me from reviewing the draft that we never got around to inserting that in the
> draft. I will do that now, and post the proposal in that thread for discussion.
> 
> 
> Thanks! I believe that reflects the consensus of the WG.
> --
> Yoshi 
>