Re: [tcpm] I-D Action: draft-ietf-tcpm-prr-rfc6937bis-06.txt

Yoshifumi Nishida <nsd.ietf@gmail.com> Mon, 01 April 2024 07:41 UTC

From: Yoshifumi Nishida <nsd.ietf@gmail.com>
Date: Mon, 01 Apr 2024 00:41:38 -0700
Message-ID: <CAAK044TT93bBTkF_J8HgXxLNGnp+LLZvySf+4TOdsc=_yD8HKQ@mail.gmail.com>
To: Markku Kojo <kojo@cs.helsinki.fi>
Cc: Neal Cardwell <ncardwell@google.com>, Matt Mathis <ietf@mattmathis.net>, Matt Mathis <mattmathis@measurementlab.net>, tcpm@ietf.org
Archived-At: <https://mailarchive.ietf.org/arch/msg/tcpm/Gj6Pj9MlE_lohTzKayJOPiULgpg>
Subject: Re: [tcpm] I-D Action: draft-ietf-tcpm-prr-rfc6937bis-06.txt

Hi Markku,

Thanks for the detailed comments. I put my comments inline on the two
points.

On Wed, Mar 27, 2024 at 7:54 AM Markku Kojo <kojo@cs.helsinki.fi> wrote:

> Hi Yoshi, Neal, all,
>
> Please see below inline (tagged [MK]). And my apologies for a very long
> explanation. Hopefully I did not include too many mistakes this time ;)
>
> In summary, it seems that we do not need to reset cwnd at the end of
> recovery, nor necessarily adjust RecoverFS, because all of the issues raised
> could be resolved by simply correcting the definition of DeliveredData
> (reverting to the original definition plus a small additional change) and
> moving the actions to take with the ACK that triggers loss recovery into
> the unconditional steps taken together with the initialization of
> the algo at the beginning (this would also be in line with how the other
> fast recovery algorithms are described in the RFC series).
>
> Hopefully I did not misunderstand any parts of the algo (either in RFC
> 6937 or in the current -08 version of the draft).
>
> On Thu, 21 Mar 2024, Yoshifumi Nishida wrote:
>
> > Hi Neal,
> > On Wed, Mar 20, 2024 at 1:32 PM Neal Cardwell <ncardwell@google.com>
> wrote:
> >
> >
> > On Wed, Mar 20, 2024 at 3:29 PM Yoshifumi Nishida <nsd.ietf@gmail.com>
> wrote:
> >       Hi Neal,
> >
> > On Wed, Mar 20, 2024 at 6:55 AM Neal Cardwell <ncardwell@google.com>
> wrote:
> >
> >       On Wed, Mar 20, 2024 at 3:07 AM Yoshifumi Nishida
> > <nsd.ietf@gmail.com> wrote:
> >
> > On Mon, Mar 18, 2024 at 8:13 PM Neal Cardwell
> > <ncardwell=40google.com@dmarc.ietf.org> wrote:
> >
> > But your point still stands, and you raise a great point: simply
> initializing
> > RecoverFS to "pipe" is not safe, because packets that were marked lost
> and removed
> > from pipe may actually have been merely reordered. So if those packets
> are
> > delivered later, they will increase the numerator of prr_delivered /
> RecoverFS
> > without increasing the denominator, thus leading to a result above 1.0,
> and thus
> > potentially leading to a target for Sent_so_far that is above ssthresh,
> causing the
> > algorithm to erroneously exceed ssthresh.
> >
> >
> > Hmm. I have a naive question here. If packets were merely reordered,
> > isn't it ok for cwnd to be bigger than ssthresh?
> >
> >
> > Yes, if packets were merely reordered and none were lost, then I agree
> it would be OK for cwnd
> > to be bigger than ssthresh. And in fact I would argue that cwnd should
> be reverted back to its
> > value before fast recovery. And this is in fact what Linux TCP would do,
> using loss recovery
> > "undo" mechanisms based on TCP timestamps or DSACKs.
> >
> > However, in the kind of scenario Markku described, there was not merely
> reordering, but also
> > real packet loss: "1 packet is lost (P1), and 24 packets are delayed
> (packets P2..P25)". In the
> > traditional loss-based Reno/CUBIC paradigm, any non-zero amount of real
> packet loss in fast
> > recovery should result in the same multiplicative decrease in cwnd,
> regardless of the
> > combination of reordering and loss. We could argue about whether that
> approach is the best
> > approach (BBR, for example, takes a different approach), but that is a
> very different
> > discussion. :-) For now AFAICT we are focused on PRR's faithful
> enactment of the congestion
> > control algorithms decision to reduce cwnd toward ssthresh when there is
> any non-zero amount of
> > real packet loss in fast recovery.
> >
> >
> > Got it. But, I just would like to clarify whether we are discussing the
> inflation of sndcnt during
> > the recovery process or cwnd after the exit of recovery.
> >
> >
> > Good point. We are talking about inflation of sndcnt during the recovery
> process.
>
> [MK] I think we are talking about both in practice, because inflation of
> sndcnt during the recovery process would also result in exiting recovery
> with too big a cwnd. In the examples that I gave, Sent_so_far
> was calculated when the SACK for P100 had arrived (actually the numbers
> were off by one):
>
> For Reno:
>
> Sent_so_far = CEIL(prr_delivered * ssthresh / RecoverFS)
>              = CEIL(97 * 50 / 72)
>              = 68
>
> For CUBIC:
> Sent_so_far = CEIL(prr_delivered * ssthresh / RecoverFS)
>              = CEIL(97 * 70 / 72)
>              = 95
>
> Now, when the cumulative ACK triggered by rexmit of P1 arrives and
> terminates fast recovery, the following is executed as per the *current
> version* of the draft:
>
> DeliveredData =  (bytes newly cumulatively acknowledged) = 100
> DeliveredData += (bytes newly selectively acknowledged) = 100 + 0
>
> prr_delivered += DeliveredData = 95 + 100 = 195
> pipe = 94
> if (pipe > ssthresh) => (94 > 70) => (true)
>   sndcnt = CEIL(prr_delivered * ssthresh / RecoverFS) - prr_out
>          = CEIL(195*70/72) = 190 - 95 = 95
> cwnd = pipe + sndcnt = 94 + 95 = 189
>
> So oops, when exiting fast recovery cwnd would be nearly doubled from
> what it was before entering loss recovery. It seems that there
> is an additional problem because the definition of DeliveredData in the
> current version of the draft is incorrect; the cumulatively acked bytes
> that have already been SACKed are counted twice in DeliveredData. It
> seems that RFC 6937 and rfc6937bis-01 both define DeliveredData (nearly)
> correctly by including the change in snd.una in DeliveredData and
> subtracting data that has already been SACKed. The definition of
> DeliveredData obviously needs to be corrected. See also below the issue
> with bytes that are above snd.una but get SACKed before the start
> of recovery.
>
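
To make the double counting concrete, here is a toy two-ACK example in
Python (purely illustrative, not text from either document; units are
whole segments):

    # Two segments outstanding: A (at snd.una) and B.
    # ACK 1 SACKs B; ACK 2 cumulatively acknowledges A and B.

    # -08 reading: DeliveredData = newly cum. acked + newly SACKed
    draft08 = (0 + 1) + (2 + 0)   # ACK 1, then ACK 2
    # RFC 6937 reading: DeliveredData = change_in(snd.una) + change_in(SACKd)
    rfc6937 = (0 + 1) + (2 - 1)   # on ACK 2, SACKd shrinks by B

    print(draft08, rfc6937)       # -> 3 2, though only 2 segments were delivered
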
> With the original definition of DeliveredData cwnd would not be
> inflated further, but fast recovery would still exit with too big a cwnd
> (for CUBIC cwnd=95 instead of 70, and for Reno cwnd=68 instead of 50)
> if we use too small a RecoverFS (= pipe).
>
> So, it seems that we agree that the problem of sending too many bytes
> during the recovery process gets corrected if RecoverFS is initialized to
> snd.nxt - snd.una. The next question is: should RecoverFS be initialized
> to an even higher value in some scenarios? See below.
>
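
Just to double-check the effect of that change with the numbers from the
example above, here is a quick Python sketch (illustrative only; it assumes
snd.nxt - snd.una was 100 segments in that scenario):

    from math import ceil

    def sent_so_far(prr_delivered, ssthresh, recover_fs):
        # PRR's proportional bound on data sent so far in recovery
        return ceil(prr_delivered * ssthresh / recover_fs)

    for ssthresh in (50, 70):             # Reno, CUBIC
        for recover_fs in (72, 100):      # pipe vs. snd.nxt - snd.una
            print(ssthresh, recover_fs, sent_so_far(97, ssthresh, recover_fs))

With RecoverFS = 72 this reproduces the 68 (Reno) and 95 (CUBIC) above; with
RecoverFS = 100 it stays at 49 and 68, i.e. within ssthresh.
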
> > Because I'm personally not very keen to address miscalculation of lost
> > packets due to reordering during the recovery process, as it seems to be
> > tricky.
> >
> >
> > It is tricky, but I think it is feasible to address. What do folks think
> about my suggestion from above in
> > this thread:
> >
> >   existing text:
> >      pipe = (RFC 6675 pipe algorithm)
> >      RecoverFS = pipe              // RFC 6675 pipe before recovery
> >
> >   proposed new text:
> >      RecoverFS = snd.nxt - snd.una + (bytes newly cumulatively
> acknowledged)
> >
> >
> > Hmm. Sorry. I'm not very sure about the difference between snd.nxt -
> > snd.una and snd.nxt - snd.una + (bytes newly cumulatively acknowledged).
> > Could you elaborate a bit? I thought we don't have data that is
> > cumulatively acked in case of reordering.
>
> [MK] It seems there might be another case Neal is thinking of, where the
> sender may end up sending too many segments during the first RTT in fast
> recovery. If I understood it correctly, this may occur in a scenario
> with ACK loss for pkts preceding the first dropped data pkt, for
> example. Consider the following scenario where there are 100 pkts
> outstanding:
>
>   P1..P24, P25, P26, P27, P28..P100
>
> Packets P1..P24 and P26..P100 are delivered successfully to the
> receiver. P25 is lost. ACKs (and SACKs) for pkts P1..P24, P26 and P27 get
> dropped. SACKs for P28..P100 are delivered successfully. When the SACK
> for pkt P28 arrives, an RFC 6675 sender would declare P25 lost and
> enter fast retransmit. Similarly, a RACK-TLP sender may declare P25 lost,
> but this may happen with any of the SACKs for P28..P100 arriving.
>
> Let's assume we were fully utilizing the congestion window, i.e., cwnd=100,
> and we enter loss recovery when the SACK of P28 arrives (cumulative
> ACK#=25):
>
> ssthresh = cwnd / 2 = 50  (Reno)
> prr_delivered = prr_out = 0
> Pipe = snd.nxt - snd.una - (lost + SACKed) = 76 - (1 + 3) = 72
> RecoverFS = snd.nxt - snd.una = 101 - 25 = 76
>
> DeliveredData = (bytes newly cumulatively acknowledged) = 24
> DeliveredData += change_in(SACKd) = 24+3 = 27
> prr_delivered += DeliveredData = 0+27 = 27
>
> if (pipe > ssthresh) => if (72 > 50) => true
>        // Proportional Rate Reduction
>        sndcnt = CEIL(prr_delivered * ssthresh / RecoverFS) - prr_out
>               = CEIL(27 * 50 / 76) - 0 = 18
>
> cwnd = 72 + 18 = 90
>
> so, we will send a burst of 18 pkts on entry to recovery and during the
> rest of the recovery around 49 pkts, giving a total of 18+49=67 pkts
> while only 50 was allowed. If we add the 24 cumulatively acked pkts into
> RecoverFS like Neal suggests, we would send around 14+37=51 pkts,
> which is almost fine. However, the major shortcoming of this approach is
> that we'll still send a burst of 14 pkts at the beginning of the recovery,
> while avoiding such bursts was one of the major goals of PRR.
>
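
For the record, the entry-ACK numbers above can be replayed with a couple of
lines of Python (illustrative only; 100 corresponds to RecoverFS = 76 plus
the 24 cumulatively acked pkts):

    from math import ceil

    prr_delivered, ssthresh, prr_out = 27, 50, 0
    for recover_fs in (76, 100):
        sndcnt = ceil(prr_delivered * ssthresh / recover_fs) - prr_out
        print(recover_fs, sndcnt)   # -> 18 and 14
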
> Alternatively, we could modify the algo such that the cumulatively acked
> bytes in the ACK that triggers loss recovery are not added to
> DeliveredData nor to RecoverFS.


I have thought about similar things. I'm thinking that this might be a
minor point for now.
My personal thoughts on this are:
* The case you presented presumes huge ACK losses before entering recovery
and no ACK loss after that; this doesn't look like a very common case.
* At any point of recovery, the inflation of DeliveredData can happen due
to ACK losses or other reasons. I'm not very sure that creating special
handling only for the first ACK is effective.
* As long as there's no sudden increase in DeliveredData, I guess both
logics behave mostly the same. So, I think the question would be how much we
care about this kind of situation. It seems to me that this is a minor
case.


> Then we would send just one pkt (rexmit of
> P1) on entering the recovery and during the rest of recovery around 49
> pkts, i.e., 1+49=50 pkts during the recovery, which would be exactly equal
> to the ssthresh we set. With this approach we could avoid the burst at the
> beginning. In addition, we could have a consistent solution also for the
> additional problem of including extra SACKed data with the ACK that
> triggers the recovery. Let's look at the above scenario again, with cwnd=100
> and pkts P1..P100 in flight:
>
>   P1..P24, P25, P26, P27, P28..P100
>
> Packets P1..P24 are delivered to the receiver but the ACKs get dropped
> (whether the ACKs are dropped or not is not relevant for this issue). P25
> gets dropped. If the DupAcks of pkt P26 and pkt P27 are delivered, from the
> DupAck of P28 only the SACK info for P28 is counted in DeliveredData, but
> the SACK info for P26 and P27 is never counted in DeliveredData because
> P26 and P27 are already SACKed when the DupAck of P28 arrives. However,
> if the DupAcks of pkt P26 and pkt P27 get dropped as in the previous
> example, the ACK of P28 includes new SACK info for pkts P26, P27, and
> P28, and the bytes of P26 and P27 are also counted in DeliveredData. (If
> the DupAck of P28 also gets dropped, the DupAck of P29 may include up to 3
> MSS of additional SACK info to be counted: P26, P27, and P28.) This alone
> will result in a miniburst at the beginning of the recovery, or add to the
> burst size as in the previous example where the two additional SACKs (for
> P26 and P27) inflated prr_delivered by 2, resulting in a slightly too large
> number of segments sent during the recovery (51).
>
> As suggested above, this problem with additional SACKs would be solved
> by allowing the DupAck that triggers the loss recovery to add
> only "itself" to DeliveredData, and letting PRR include the missing bytes
> for pkts that were SACKed before the start of the recovery only at the
> end of the recovery, when the cumulative ACK for the first pkt (P1)
> arrives and inherently covers those bytes.
>
> In other words, the algo can be modified such that fast retransmit is
> always handled separately at the beginning of the recovery, together with
> the initialization of the PRR variables:
>
>    ssthresh = CongCtrlAlg()      // Target flight size in recovery
> //[MK]: the next three lines can be deleted as unnecessary
>    prr_delivered = 0             // Total bytes delivered in recovery
>    prr_out = 0                   // Total bytes sent in recovery
>    pipe = (RFC 6675 pipe algorithm)
>
>    RecoverFS = snd.nxt - snd.una // FlightSize right before recovery
>                                  // [MK]:maybe add cumulatively ACKed
>                                  //      bytes?
>
>    Fast Retransmit the first missing segment
>    prr_delivered  = (With SACK: bytes selectively acknowledged by the first
>                      SACK block of the ACK triggering the loss recovery, OR
>                      Without SACK: 1 MSS)
>    prr_out  = (data fast retransmitted)
>
> On each arriving ACK during the rest of the fast recovery, including the
> final cumulative ACK that signals the end of loss recovery:
>
>    DeliveredData = change_in(snd.una)
>    if (SACK is used) {
>       DeliveredData += change_in(SACKd) //[MK]:(*) modify change_in(SACKd)
>    ...
>
>
> The above changes would imply deleting
>
>   if (prr_out is 0 AND sndcnt is 0) {
>        // Force a fast retransmit upon entering recovery
>        sndcnt = MSS
>
> from the algo and would make it consistent with the description of the
> other fast retransmit & Fast recovery algorithms (RFC 5681, RFC 6582, RFC
> 6675) that include fast retransmit together with the initialization of
> the algo in the unconditional first steps of the algorithm.
>
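
To see the proposed entry handling in one place, here is my reading of it as
a small Python sketch (units are whole segments; the names and structure are
mine, not the draft's, and the fast retransmit itself is assumed to be sent
at the marked point):

    def enter_recovery(snd_nxt, snd_una, first_sack_block):
        # Unconditional PRR entry steps per the suggestion above (sketch only).
        # ssthresh itself still comes from CongCtrlAlg() as in the draft.
        recover_fs = snd_nxt - snd_una   # FlightSize right before recovery
        # <-- fast retransmit of the first missing segment goes out here,
        #     so no "force a fast retransmit" clause is needed per ACK
        prr_delivered = first_sack_block if first_sack_block else 1  # without SACK: 1 MSS
        prr_out = 1                      # the fast-retransmitted segment
        return recover_fs, prr_delivered, prr_out
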
> (*)
> In addition, one more small but important correction is needed. The
> bytes that are SACKed before the recovery starts (i.e., typically the
> famous first two DupAcks, or more bytes if the start of recovery is
> postponed due to reordering) should be taken into account in
> DeliveredData during the recovery, but with the current algo they
> are never counted in DeliveredData (and prr_delivered).
> Why? Because when the first cumulative ACK arrives, it advances snd.una
> such that those bytes are covered, but change_in(SACKd) is negative and
> it incorrectly subtracts these bytes as well from DeliveredData (and
> prr_delivered) even though they were never counted in the first place.
> Usually this is only 2 MSS, but in scenarios similar to the one that Neal
> introduced earlier there might be many more data bytes that are not
> counted. This change would also solve the problem of exiting PRR with too
> low a cwnd.
> Let's look at Neal's earlier example again (see comments with [MK] for
> the suggested change to solve the issue):
>
> CC = CUBIC
> cwnd = 10
> The reordering degree was estimated to be large, so the connection will
> wait for more than 3 packets to be SACKed before entering fast recovery.
>
> --- Application writes 10*MSS.
>
> TCP sends packets P1 .. P10.
> pipe = 10 packets in flight (P1 .. P10)
>
> --- P2..P9 SACKed  -> do nothing //
>
> (Because the reordering degree was previously estimated to be large.)
>
> --- P10 SACKed -> mark P1 as lost and enter fast recovery
>
> PRR:
> ssthresh = CongCtrlAlg() = 7 packets // CUBIC
> prr_delivered = 0
> prr_out = 0
> RecoverFS = snd.nxt - snd.una = 10 packets (P1..P10)
>
> DeliveredData = 1  (P10 was SACKed)
>
> prr_delivered += DeliveredData   ==> prr_delivered = 1
>
> pipe =  0  (all packets are SACKed or lost; P1 is lost, rest are SACKed)
>
> safeACK = false (snd.una did not advance)
>
> if (pipe > ssthresh) => if (0 > 7) => false
> else
>   // PRR-CRB by default
>   sndcnt = MAX(prr_delivered - prr_out, DeliveredData)
>          = MAX(1 - 0, 1)
>          = 1
>
>   sndcnt = MIN(ssthresh - pipe, sndcnt)
>          = MIN(7 - 0, 1)
>          = 1
>
> cwnd = pipe + sndcnt
>      = 0    + 1
>      = 1
>
> retransmit P1
>
> prr_out += 1   ==> prr_out = 1
>
> --- P1 retransmit plugs hole; receive cumulative ACK for P1..P10
>
> DeliveredData = 1  (P1 was newly ACKed) //[MK]: should be = 10 - 1 = 9
>
> //[MK]: Note that the SACKed bytes of P2..P9 were also newly acked
> //      because these bytes had not been delivered *during* the
> //      recovery by this point and thereby were not yet counted in
> //      prr_delivered.
> //      So, they should not be subtracted from DeliveredData
> //      but included, as those bytes got delivered only when
> //      snd.una advanced. Only P10 should be subtracted.
>
> prr_delivered += DeliveredData   ==> prr_delivered = 2
> //[MK]: should be = 1 + 9 = 10
>
> pipe =  0  (all packets are cumulatively ACKed)
>
> safeACK = (snd.una advances and no further loss indicated)
> safeACK = true
>
> if (pipe > ssthresh) => if (0 > 7) => false
> else
>   // PRR-CRB by default
>   sndcnt = MAX(prr_delivered - prr_out, DeliveredData)
>          = MAX(2 - 1, 1)  //[MK]  = MAX(10-1, 1)
>          = 1              //[MK]  = 9
>   if (safeACK) => true
>     // PRR-SSRB when recovery is in good progress
>     sndcnt += 1   ==> sndcnt = 2 //[MK] ==> sndcnt = 10
>
>   sndcnt = MIN(ssthresh - pipe, sndcnt)
>          = MIN(7 - 0, 2) //[MK] = MIN(7 - 0, 10)
>          = 2             //[MK] = 7
>
> cwnd = pipe + sndcnt
>      = 0    + 2  //[MK] = 0 + 7
>      = 2         //[MK] = 7
>
> So we exit fast recovery with cwnd=2 even though ssthresh is 7.
>
> [MK]: Or, we exit with cwnd=7 if we correctly count in DeliveredData,
> during the recovery process, all data that was in flight when the recovery
> started. All bytes in flight at the start of the recovery are supposed to
> become acknowledged by the end of the recovery, so they should be counted
> in prr_delivered during the recovery.
>
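
To replay just that final ACK, here is a small Python sketch contrasting
DeliveredData = 1 (the current change_in(SACKd) accounting) with
DeliveredData = 9 (the [MK] correction); everything else is taken from the
example above:

    ssthresh, pipe, prr_out = 7, 0, 1   # state when the cumulative ACK arrives

    for delivered_data in (1, 9):       # current accounting vs. [MK] correction
        prr_delivered = 1 + delivered_data
        sndcnt = max(prr_delivered - prr_out, delivered_data)   # PRR-CRB
        sndcnt += 1                                             # PRR-SSRB (safeACK)
        sndcnt = min(ssthresh - pipe, sndcnt)
        print(delivered_data, pipe + sndcnt)   # cwnd -> 2 and 7 respectively
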
> >             However, I think it won't be good if it's propagated to
> >             after the recovery. But don't we reset cwnd to ssthresh at
> >             the end of recovery?
>
> [MK]: It seems that just by correcting the definition of DeliveredData
> there is no need to reset cwnd to ssthresh at the end of recovery because
> the algo would do it for us. But I am not opposed to resetting cwnd to
> ssthresh at the end. In that case it might be better to specify it by
> giving two alternatives, similar to what RFC 6582 does. Maybe:
>
>    Set cwnd to either (1) min (ssthresh, cwnd) or (2) ssthresh.
>

I think we have discussed this in past discussions.
In case of (1), cwnd can become very low when there were big losses before
the recovery.
As many implementations don't take approach (1) and they have been used for a
long time, (2) became our consensus.
We can probably come up with scenarios where this approach will be
suboptimal, but I think this is a trade-off.

Thanks,
--
Yoshi