[tcpm] Re: I-D Action: draft-ietf-tcpm-prr-rfc6937bis-06.txt

Yoshifumi Nishida <nsd.ietf@gmail.com> Fri, 05 July 2024 08:39 UTC

From: Yoshifumi Nishida <nsd.ietf@gmail.com>
Date: Fri, 05 Jul 2024 01:39:31 -0700
Message-ID: <CAAK044SE46f6RaqcRgFOhPtOJcY77OZGEMCPEGoQMCy+2E51Ww@mail.gmail.com>
To: Markku Kojo <kojo@cs.helsinki.fi>
CC: Matt Mathis <ietf@mattmathis.net>, Matt Mathis <mattmathis@measurementlab.net>, tcpm@ietf.org
Subject: [tcpm] Re: I-D Action: draft-ietf-tcpm-prr-rfc6937bis-06.txt
List-Id: TCP Maintenance and Minor Extensions Working Group <tcpm.ietf.org>
Archived-At: <https://mailarchive.ietf.org/arch/msg/tcpm/eAGOXyPxLSPGOEXnw-zyhDamjdk>

Hi Markku,

Thanks for the comments. Please see my comments inline.

On Tue, Jun 25, 2024 at 7:21 AM Markku Kojo <kojo@cs.helsinki.fi> wrote:

> Hi Yoshi,
>
> please see inline tagged [MK2].
>
> On Mon, 1 Apr 2024, Yoshifumi Nishida wrote:
>
> > Hi Markku,
> >
> > Thanks for the detailed comments. I put my comments inline on the two
> > points.
> >
> > On Wed, Mar 27, 2024 at 7:54 AM Markku Kojo <kojo@cs.helsinki.fi> wrote:
> >       Hi Yoshi, Neal, all,
> >
> >       Please see below inline (tagged [MK]). And my apologies for a very
> >       long explanation. Hopefully I did not include too many mistakes this
> >       time ;)
> >
> >       In summary, it seems that we do not need to reset cwnd at the end of
> >       the recovery nor necessarily adjust RecoverFS, because all the issues
> >       raised could be resolved by simply correcting the definition of
> >       DeliveredData (reverting to the original definition plus a small
> >       additional change) and moving the actions taken with the ACK that
> >       triggers loss recovery into the unconditional steps performed
> >       together with the initialization of the algo in the beginning (this
> >       would also be in line with how the other fast recovery algorithms
> >       are described in the RFC series).
> >
> >       Hopefully I did not misunderstand any parts of the algo (either in
> RFC
> >       6937 or in the current -08 version of the draft).
> >
> >       On Thu, 21 Mar 2024, Yoshifumi Nishida wrote:
> >
> >       > Hi Neal,
> >       > On Wed, Mar 20, 2024 at 1:32 PM Neal Cardwell <
> ncardwell@google.com> wrote:
> >       >
> >       >
> >       > On Wed, Mar 20, 2024 at 3:29 PM Yoshifumi Nishida <
> nsd.ietf@gmail.com> wrote:
> >       >       Hi Neal,
> >       >
> >       > On Wed, Mar 20, 2024 at 6:55 AM Neal Cardwell <
> ncardwell@google.com> wrote:
> >       >
> >       >       On Wed, Mar 20, 2024 at 3:07 AM Yoshifumi Nishida <
> nsd.ietf@gmail.com> wrote:
> >       >
> >       > On Mon, Mar 18, 2024 at 8:13 PM Neal Cardwell <ncardwell=
> 40google.com@dmarc.ietf.org>
> >       > wrote:
> >       >
> >       > But your point still stands, and you raise a great point: simply
> initializing
> >       > RecoverFS to "pipe" is not safe, because packets that were
> marked lost and removed
> >       > from pipe may actually have been merely reordered. So if those
> packets are
> >       > delivered later, they will increase the numerator of
> prr_delivered / RecoverFS
> >       > without increasing the denominator, thus leading to a result
> above 1.0, and thus
> >       > potentially leading to a target for Sent_so_far that is above
> ssthresh, causing the
> >       > algorithm to erroneously exceed ssthresh.
> >       >
> >       >
> >       > Hmm. I have a naive question here. If packets were merely
> >       > reordered, isn't it OK for cwnd to be bigger than ssthresh?
> >       >
> >       >
> >       > Yes, if packets were merely reordered and none were lost, then I
> agree it would be OK for
> >       cwnd
> >       > to be bigger than ssthresh. And in fact I would argue that cwnd
> should be reverted back to
> >       its
> >       > value before fast recovery. And this is in fact what Linux TCP
> would do, using loss recovery
> >       > "undo" mechanisms based on TCP timestamps or DSACKs.
> >       >
> >       > However, in the kind of scenario Markku described, there was not
> merely reordering, but also
> >       > real packet loss: "1 packet is lost (P1), and 24 packets are
> delayed (packets P2..P25)". In
> >       the
> >       > traditional loss-based Reno/CUBIC paradigm, any non-zero amount
> of real packet loss in fast
> >       > recovery should result in the same multiplicative decrease in
> cwnd, regardless of the
> >       > combination of reordering and loss. We could argue about whether
> that approach is the best
> >       > approach (BBR, for example, takes a different approach), but
> that is a very different
> >       > discussion. :-) For now AFAICT we are focused on PRR's faithful
> enactment of the congestion
> >       > control algorithms decision to reduce cwnd toward ssthresh when
> there is any non-zero amount
> >       of
> >       > real packet loss in fast recovery.
> >       >
> >       >
> >       > Got it. But, I just would like to clarify whether we are
> discussing the inflation of
> >       sndcnt during
> >       > the recovery process or cwnd after the exit of recovery.
> >       >
> >       >
> >       > Good point. We are talking about inflation of sndcnt during the
> recovery process.
> >
> >       [MK] I think we are talking about both in practice because
> inflation of
> >       sndcnt during the recovery process would also result in exiting
> recovery
> >       with too big cwnd. In the examples that I gave the segments
> sent_so_far
> >       was calculated when the SACK for P100 had arrived (actually the
> numbers
> >       were off by one):
> >
> >       For Reno:
> >
> >       Sent_so_far = CEIL(prr_delivered * ssthresh / RecoverFS)
> >                    = CEIL(97 * 50 / 72)
> >                    = 68
> >
> >       For CUBIC:
> >       Sent_so_far = CEIL(prr_delivered * ssthresh / RecoverFS)
> >                    = CEIL(97 * 70 / 72)
> >                    = 95
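For reference, the Sent_so_far arithmetic above is easy to check with a few lines of Python (packet units; the helper name is just illustrative):

```python
import math

def sent_so_far(prr_delivered, ssthresh, recover_fs):
    # PRR's cumulative sending allowance so far in recovery:
    # CEIL(prr_delivered * ssthresh / RecoverFS)
    return math.ceil(prr_delivered * ssthresh / recover_fs)

print(sent_so_far(97, 50, 72))  # Reno:  68
print(sent_so_far(97, 70, 72))  # CUBIC: 95
```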
> >
> >       Now, when the cumulative ACK triggered by rexmit of P1 arrives and
> >       terminates fast recovery, the following is executed as per the
> *current
> >       version* of the draft:
> >
> >       DeliveredData =  (bytes newly cumulatively acknowledged) = 100
> >       DeliveredData += (bytes newly selectively acknowledged) = 100 + 0
> >
> >       prr_delivered += DeliveredData = 95 + 100 = 195
> >       pipe = 94
> >       if (pipe > ssthresh) => (94 > 70) => (true)
> >         sndcnt = CEIL(prr_delivered * ssthresh / RecoverFS) - prr_out
> >                = CEIL(195*70/72) - 95 = 190 - 95 = 95
> >       cwnd = pipe + sndcnt = 94 + 95 = 189
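This miscalculation is easy to reproduce; here is a minimal sketch in packet units that follows the draft's current (double-counting) definition of DeliveredData:

```python
import math

# State when the final cumulative ACK arrives, from the example above.
prr_delivered, prr_out = 95, 95
pipe, ssthresh, recover_fs = 94, 70, 72

# The current draft text counts all 100 cumulatively ACKed packets, even
# though most were already SACKed and counted earlier; nothing is subtracted.
delivered_data = 100 + 0
prr_delivered += delivered_data              # 195

assert pipe > ssthresh                       # 94 > 70: proportional branch
sndcnt = math.ceil(prr_delivered * ssthresh / recover_fs) - prr_out
cwnd = pipe + sndcnt
print(sndcnt, cwnd)                          # 95 189: cwnd nearly doubled
```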
> >
> >       So oops, when exiting fast recovery cwnd would be nearly doubled
> from
> >       what it was before entering loss recovery. It seems that there
> >       is an additional problem because the definition of DeliveredData
> in the
> >       current version of the draft is incorrect; the cumulatively acked
> bytes
> >       that have already been SACKed are counted twice in DeliveredData.
> It
> >       seems that RFC 6937 and rfc6937bis-01 both define DeliveredData
> (nearly)
> >       correctly by including the change in snd.una in DeliveredData and
> >       subtracting data that has already been SACKed. The definition of
> >       DeliveredData obviously needs to be corrected. See also below the
> >       issue with bytes SACKed that are above snd.una but get SACKed
> >       before the start of recovery.
> >
> >       With the original definition of DeliveredData cwnd would not be
> >       inflated further but fast recovery would still exit with too big
> cwnd
> >       (for CUBIC cwnd=95 instead of 70, and for Reno cwnd=68 instead of
> >       50), if we use too small a RecoverFS (= pipe).
> >
> >       So, it seems that we agree that the problem of sending too many
> bytes
> >       during the recovery process gets corrected if RecoverFS is
> initialized to
> >       snd.nxt - snd.una. The next question is: should RecoverFS be
> >       initialized to an even higher value in some scenarios? See below.
> >
> >       > Because I'm personally not very keen to address miscalculation
> of lost packets due to
> >       reordering
> >       > during the recovery process as it seems to be tricky.
> >       >
> >       >
> >       > It is tricky, but I think it is feasible to address. What do
> folks think about my suggestion
> >       from above in
> >       > this thread:
> >       >
> >       >   existing text:
> >       >      pipe = (RFC 6675 pipe algorithm)
> >       >      RecoverFS = pipe              // RFC 6675 pipe before
> recovery
> >       >
> >       >   proposed new text:
> >       >      RecoverFS = snd.nxt - snd.una + (bytes newly cumulatively
> acknowledged)
> >       >
> >       >
> >       > Hmm. Sorry. I'm not very sure about the difference between
> snd.nxt - snd.una and snd.nxt -
> >       snd.una + (bytes newly
> >       > cumulatively acknowledged)
> >       > Could you elaborate a bit? I thought we don't have data which
> are cumulatively acked in case
> >       of reordering.
> >
> >       [MK] It seems there might be another case that Neal is thinking of,
> >       where the
> >       sender may end up sending too many segments during the first RTT
> in fast
> >       recovery. If I understood it correctly this may occur in a scenario
> >       with ACK loss for pkts preceding the first dropped data pkt, for
> >       example. Consider the following scenario where there are 100 pkts
> >       outstanding
> >
> >         P1..P24, P25, P26, P27, P28..P100
> >
> >       Packets P1..P24 and P26..P100 are delivered successfully to the
> >       receiver. P25 is lost. ACKs (and SACKs) for pkts P1..P24, P26 and
> P27 get
> >       dropped. SACKs for P28..P100 are delivered successfully. When SACK
> >       for pkt P28 arrives, an RFC 6675 sender would declare P25 is lost,
> and
> >       enter fast retransmit. Similarly, a RACK-TLP sender may declare
> P25 lost,
> >       but this may happen with any of SACKs P28..P100 arriving.
> >
> >       Let's assume we were fully utilizing congestion window, i.e.,
> cwnd=100
> >       and we enter loss recovery when the SACK of P28 arrives (cumulative
> >       ACK#=25):
> >
> >       ssthresh = cwnd / 2 = 50  (Reno)
> >       prr_delivered = prr_out = 0
> >       Pipe = snd.nxt - snd.una - (lost + SACKed) = 76 - (1 + 3) = 72
> >       RecoverFS = snd.nxt - snd.una = 101 - 25 = 76
> >
> >       DeliveredData = (bytes newly cumulatively acknowledged) = 24
> >       DeliveredData += change_in(SACKd) = 24+3 = 27
> >       prr_delivered += DeliveredData = 0+27 = 27
> >
> >       if (pipe > ssthresh) => if (72 > 50) => true
> >              // Proportional Rate Reduction
> >              sndcnt = CEIL(prr_delivered * ssthresh / RecoverFS) - prr_out
> >                     = CEIL(27 * 50 / 76) - 0 = 18 - 0 = 18
> >
> >       cwnd = 72 + 18 = 90
> >
> >       so, we will send a burst of 18 pkts on entry to recovery and during
> >       the rest of the recovery around 50 pkts, giving a total of 18+50=68
> >       pkts while only 50 was allowed. If we add 24 cumulatively acked pkts
> into
> >       RecoverFS like Neal suggests, we are about to send around 14+37=51
> pkts
> >       which is almost fine. However, the major shortcoming of this
> approach is
> >       that we'll still send a burst of 14 pkts in the beginning of the
> recovery
> >       while avoiding such a burst was one of the major goals of PRR.
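Recomputing the entry step of this scenario in Python (packet units) confirms the burst; note that CEIL(27 * 50 / 76) works out to 18:

```python
import math

# Entry to recovery when the SACK of P28 arrives (packet units).
snd_nxt, snd_una = 101, 25
ssthresh = 50                                # Reno: cwnd / 2 with cwnd = 100
lost, sacked = 1, 3                          # P25 lost; P26..P28 SACKed
pipe = snd_nxt - snd_una - (lost + sacked)   # 72
recover_fs = snd_nxt - snd_una               # 76
prr_delivered = 24 + 3                       # newly cum-ACKed + newly SACKed
prr_out = 0

assert pipe > ssthresh                       # 72 > 50: proportional branch
sndcnt = math.ceil(prr_delivered * ssthresh / recover_fs) - prr_out
print(sndcnt, pipe + sndcnt)                 # 18 90: an 18-packet entry burst

# Neal's suggestion: also count the cumulatively ACKed bytes in RecoverFS.
sndcnt2 = math.ceil(prr_delivered * ssthresh / (recover_fs + 24)) - prr_out
print(sndcnt2)                               # 14: smaller, but still a burst
```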
> >
> >       Alternatively we could modify the algo such that the cumulatively
> acked
> >       bytes with the ACK that triggers loss recovery are not added to
> >       DeliveredData nor in RecoverFS.
> >
> >
> > I have thought about similar things. I'm thinking that this might be a
> minor point for now.
> > My personal thoughts on this are:
> > * The case you presented presumes huge ack losses before entering
> > recovery and no ack loss after that. This doesn't look like a very
> > common case.
>
> [MK2]: I am not sure if this is a minor point.
> First, it does not need a huge number of ack losses before entering
> recovery. The severity of miscalculation depends on the number of lost
> acks.
> Second, there is no requirement for no ack loss after that. There
> might or might not be further ack losses. The authors (and others) have
> reported that ack loss is a very common phenomenon, so I think we cannot
> consider it uncommon.
>
> > * At any point of recovery, the inflation of DeliveredData can happen
> > due to ack losses or other reasons, so I'm not very sure that creating
> > special handling only for the first ACK is effective.
>
> [MK2]: Sure, sending a burst at any point of recovery is common to all
> Fast Recovery algos not only PRR. But PRR justifies itself (for a good
> reason) by avoiding bursts in the beginning of the loss recovery. Such
> a burst is very bad because it is likely to result in loss of
> retransmitted data that would be always present with segment(s)
> transmitted in the beginning of the recovery. Particularly with PRR,
> sending a burst in the beginning is bad because PRR does not have a pause
> in sending segments in the beginning like NewReno or RFC 6675 would have
> (i.e., with PRR, there is no similar opportunity for the bottleneck queue
> to drain before retransmitted segments start to flow into the congested
> queue).
>

> > * As long as there's no sudden increase of DeliveredData, I guess both
> > logics behave mostly the same. So, I think a question would be how much
> > we care about this kind of situation. It seems to me that this looks
> > like a minor case.
>
> [MK2]: I don't think they are mostly the same. My proposal would avoid a
> burst in the beginning and also solve, in a simple way, the other
> miscalculation problems that ignore a number of Acks, as I have pointed
> out. It would also be consistent with the other Fast Retransmit & Fast
> Recovery algos in the RFC series that by definition always handle Fast
> Retransmit as a separate and unconditional step at the entry to loss
> recovery.
> The PRR algo as currently described is also inefficient, as it includes an
> unnecessary step to check whether to send out the Fast Retransmit, which
> can only happen in the beginning of the recovery. I doubt any implementor
> would like to include an unnecessary condition check that is executed
> repeatedly on the arrival of each Ack during the recovery, while the
> condition can be true only with the first Ack that triggers the loss
> recovery.
>

Hmm, it seems to me that you're saying we might not be able to avoid the
bursts in the middle of recovery, but there's a way to avoid a burst in the
beginning.
Could you elaborate on your proposal? I think I need to understand it a bit
more.

> >       Then we would send just one pkt (rexmit of P1) on entering the
> >       recovery and during the rest of recovery around 49 pkts, i.e.,
> >       1+49=50 pkts during the recovery, which would be exactly equal to
> >       the ssthresh we set. With this approach we could avoid the burst in
> >       the beginning. In addition we could have a consistent solution also
> >       for the additional problem of including extra SACKed data with the
> >       ACK that triggers the recovery. Let's look at the above scenario
> >       again, cwnd=100 and pkts P1..P100 in flight:
> >
> >         P1..P24, P25, P26, P27, P28..P100
> >
> >       Packets P1..P24 are delivered to the receiver but ACKs get dropped
> (whether
> >       ACKs are dropped or not is not relevant for this issue). P25 gets
> >       dropped. If the DupAcks of pkt P26 and pkt P27 are delivered, from
> the
> >       DupAck of P28 only the SACK info for P28 is counted in
> DeliveredData but
> >       the SACK info for P26 and P27 are never counted in DeliveredData
> because
> >       P26 and P27 are already SACKed when the DupAck of P28 arrives.
> However,
> >       if the DupAcks of pkt P26 and pkt P27 get dropped as in the
> previous
> >       example, the ACK of P28 includes new SACK info for pkts P26, P27,
> and
> >       P28 and the bytes of P26 and P27 are also counted in
> DeliveredData. (If
> >       also DupAck of P28 gets dropped, the DupAck of P29 may include up
> to 3
> >       MSS of additional SACK info to be counted (P26, P27, and P28).
> This alone
> >       will result in a miniburst in the beginning of the recovery or add
> to the
> >       burst size as in the previous example where the two additional
> SACKs (for
> >       P26 and P27) inflated prr_delivered by 2, resulting in a slightly
> too large
> >       number of segments sent during the recovery (51).
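The effect on DeliveredData can be illustrated with a tiny scoreboard model (my own hypothetical helper, packet units): only blocks not already on the scoreboard are credited to the arriving ACK.

```python
# Tiny model of change_in(SACKd): only blocks not already on the scoreboard
# are credited to the arriving (Dup)ACK.

def delivered_on_ack(scoreboard, reported):
    new = reported - scoreboard     # blocks the sender has not yet seen SACKed
    scoreboard |= new               # record them on the scoreboard
    return len(new)                 # contribution to DeliveredData

# DupAcks for P26 and P27 arrive, then the DupAck for P28:
sb = set()
assert delivered_on_ack(sb, {26}) == 1
assert delivered_on_ack(sb, {26, 27}) == 1
assert delivered_on_ack(sb, {26, 27, 28}) == 1   # only P28 is new

# DupAcks for P26 and P27 dropped: the DupAck for P28 reports all three at
# once, so 3 packets are credited to the very ACK that triggers recovery.
sb = set()
assert delivered_on_ack(sb, {26, 27, 28}) == 3
```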
> >
> >       As suggested above, this problem with additional SACKs would be
> solved
> >       such that the DupAck that triggers the loss recovery is allowed to
> add
> >       only "itself" into DeliveredData and let PRR include the
> missing bytes
> >       for pkts that were SACKed before the start of the recovery only at
> the
> >       end of the recovery when the cumulative ACK for the first pkt (P1)
> >       arrives and inherently covers those bytes.
> >
> >       In other words, the algo can be modified such that fast retransmit
> is
> >       always handled separately in the beginning of the recovery
> together with
> >       the initialization of the PRR variables:
> >
> >          ssthresh = CongCtrlAlg()      // Target flight size in recovery
> >       //[MK]: the next three lines can be deleted as unnecessary
> >          prr_delivered = 0             // Total bytes delivered in recovery
> >          prr_out = 0                   // Total bytes sent in recovery
> >          pipe = (RFC 6675 pipe algorithm)
> >
> >          RecoverFS = snd.nxt - snd.una // FlightSize right before recovery
> >                                        // [MK]: maybe add cumulatively
> >                                        //      ACKed bytes?
> >
> >          Fast Retransmit the first missing segment
> >          prr_delivered  = (With SACK: bytes selectively acknowledged by the
> >                            first SACK block of the ACK triggering the loss
> >                            recovery, OR
> >                            Without SACK: 1 MSS)
> >          prr_out  = (data fast retransmitted)
> >
> >       On each arriving ACK during the rest of the fast recovery,
> including the
> >       final cumulative ACK that signals the end of loss recovery:
> >
> >          DeliveredData = change_in(snd.una)
> >          if (SACK is used) {
> >             DeliveredData += change_in(SACKd) //[MK]: (*) modify change_in(SACKd)
> >          ...
> >
> >
> >       The above changes would imply deleting
> >
> >         if (prr_out is 0 AND sndcnt is 0) {
> >              // Force a fast retransmit upon entering recovery
> >              sndcnt = MSS
> >
> >       from the algo and would make it consistent with the description of
> the
> >       other fast retransmit & Fast recovery algorithms (RFC 5681, RFC
> 6582, RFC
> >       6675) that include fast retransmit together with the
> initialization of
> >       the algo in the unconditional first steps of the algorithm.
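To make the proposal concrete, here is a rough Python sketch of the restructured algorithm (my own hypothetical names, packet units, SACK case, safeACK/SSRB details omitted): fast retransmit happens unconditionally together with the initialization, so the per-ACK loop needs no forced-retransmit check.

```python
import math

class PrrState:
    """Sketch of PRR with fast retransmit folded into initialization."""

    def __init__(self, snd_nxt, snd_una, trigger_sacked, ssthresh):
        self.ssthresh = ssthresh              # target flight size in recovery
        self.recover_fs = snd_nxt - snd_una   # FlightSize right before recovery
        # The triggering ACK contributes only its own first SACK block:
        self.prr_delivered = trigger_sacked
        self.prr_out = 1                      # the fast-retransmitted segment

    def on_ack(self, delivered_data, pipe):
        # Per-ACK step for the rest of recovery (reduction-bound branch only).
        self.prr_delivered += delivered_data
        if pipe > self.ssthresh:
            sndcnt = math.ceil(self.prr_delivered * self.ssthresh
                               / self.recover_fs) - self.prr_out
        else:
            sndcnt = max(self.prr_delivered - self.prr_out, delivered_data)
            sndcnt = min(self.ssthresh - pipe, sndcnt)
        sndcnt = max(sndcnt, 0)
        self.prr_out += sndcnt
        return sndcnt
```

With this shape the `if (prr_out is 0 AND sndcnt is 0)` clause disappears, since prr_out is never zero once recovery has begun.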
> >
> >       (*)
> >       In addition, one more smallish but important correction is needed.
> The
> >       bytes that are SACKed before the recovery starts (i.e., typically
> the
> >       famous first two DupAcks or more bytes if the start of recovery is
> >       postponed due to reordering) should be taken into account in the
> >       DeliveredData during the recovery but with the current algo they
> >       are never counted in DeliveredData (and prr_delivered).
> >       Why? Because when the first cumulative ACK arrives, it advances
> snd.una
> >       such that those bytes are covered but change_in(SACKd) is negative
> and
> >       it incorrectly subtracts also these bytes from DeliveredData (and
> >       prr_delivered) even though they were never counted in. Usually
> this is
> >       only 2 MSS but in scenarios similar to one that Neal earlier
> introduced
> >       there might be many more data bytes that are not counted. This
> change
> >       would also solve the problem of exiting PRR with too low cwnd.
> >       Let's look at Neal's earlier example again (see comments with [MK]
> for
> >       suggested change to solve the issue):
> >
> >       CC = CUBIC
> >       cwnd = 10
> >       The reordering degree was estimated to be large, so the connection
> will
> >       wait for more than 3 packets to be SACKed before entering fast
> recovery.
> >
> >       --- Application writes 10*MSS.
> >
> >       TCP sends packets P1 .. P10.
> >       pipe = 10 packets in flight (P1 .. P10)
> >
> >       --- P2..P9 SACKed  -> do nothing //
> >
> >       (Because the reordering degree was previously estimated to be
> large.)
> >
> >       --- P10 SACKed -> mark P1 as lost and enter fast recovery
> >
> >       PRR:
> >       ssthresh = CongCtrlAlg() = 7 packets // CUBIC
> >       prr_delivered = 0
> >       prr_out = 0
> >       RecoverFS = snd.nxt - snd.una = 10 packets (P1..P10)
> >
> >       DeliveredData = 1  (P10 was SACKed)
> >
> >       prr_delivered += DeliveredData   ==> prr_delivered = 1
> >
> >       pipe =  0  (all packets are SACKed or lost; P1 is lost, rest are
> SACKed)
> >
> >       safeACK = false (snd.una did not advance)
> >
> >       if (pipe > ssthresh) => if (0 > 7) => false
> >       else
> >         // PRR-CRB by default
> >         sndcnt = MAX(prr_delivered - prr_out, DeliveredData)
> >                = MAX(1 - 0, 1)
> >                = 1
> >
> >         sndcnt = MIN(ssthresh - pipe, sndcnt)
> >                = MIN(7 - 0, 1)
> >                = 1
> >
> >       cwnd = pipe + sndcnt
> >            = 0    + 1
> >            = 1
> >
> >       retransmit P1
> >
> >       prr_out += 1   ==> prr_out = 1
> >
> >       --- P1 retransmit plugs hole; receive cumulative ACK for P1..P10
> >
> >       DeliveredData = 1  (P1 was newly ACKed) //[MK]: should be 10 - 1 = 9
> >
> >       //[MK]: Note that SACKed bytes of P2..P9 were also newly acked
> >       //      because the bytes have not been delivered *during* the
> >       //      recovery by this far and thereby not yet counted in
> >       //      prr_delivered.
> >       //      So, they should not be subtracted from DeliveredData
> >       //      but included as those bytes got delivered only when
> >       //      snd.una advanced. Only P10 should be subtracted.
> >
> >       prr_delivered += DeliveredData   ==> prr_delivered = 2
> >       //[MK]: should be = 1 + 9 = 10
> >
> >       pipe =  0  (all packets are cumulatively ACKed)
> >
> >       safeACK = (snd.una advances and no further loss indicated)
> >       safeACK = true
> >
> >       if (pipe > ssthresh) => if (0 > 7) => false
> >       else
> >         // PRR-CRB by default
> >         sndcnt = MAX(prr_delivered - prr_out, DeliveredData)
> >                = MAX(2 - 1, 1)  //[MK]  = MAX(10-1, 1)
> >                = 1              //[MK]  = 9
> >         if (safeACK) => true
> >           // PRR-SSRB when recovery is in good progress
> >           sndcnt += 1   ==> sndcnt = 2 //[MK] ==> sndcnt = 10
> >
> >         sndcnt = MIN(ssthresh - pipe, sndcnt)
> >                = MIN(7 - 0, 2) //[MK] = MIN(7 - 0, 10)
> >                = 2             //[MK] = 7
> >
> >       cwnd = pipe + sndcnt
> >            = 0    + 2  //[MK] = 0 + 7
> >            = 2         //[MK] = 7
> >
> >       So we exit fast recovery with cwnd=2 even though ssthresh is 7.
> >
> >       [MK]: Or, we exit with cwnd=7 if we correctly count in
> DeliveredData
> >       during the recovery process all data that is in flight when the
> recovery
> >       starts. All bytes in flight at the start of the recovery are
> supposed to
> >       become acknowledged by the end of the recovery, so they should be
> counted
> >       in prr_delivered during the recovery.
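Replaying the final ACK of this example with the corrected DeliveredData (packet units) reproduces the [MK] numbers above:

```python
# Final-ACK step of the example with the corrected DeliveredData (packet
# units). The cumulative ACK for P1..P10 newly covers P1 plus the P2..P9
# bytes SACKed before recovery; only P10 was already counted in recovery.
ssthresh = 7
prr_delivered, prr_out = 1, 1        # after the SACK of P10 and rexmit of P1
delivered_data = 10 - 1              # snd.una advanced by 10, minus P10
prr_delivered += delivered_data      # 10
pipe = 0                             # everything is cumulatively ACKed

assert not pipe > ssthresh           # reduction-bound / slow-start branch
sndcnt = max(prr_delivered - prr_out, delivered_data)   # 9
sndcnt += 1                          # safeACK: snd.una advanced, no new loss
sndcnt = min(ssthresh - pipe, sndcnt)                   # 7
cwnd = pipe + sndcnt
print(cwnd)                          # 7: recovery exits at ssthresh
```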
> >
> >       >             However, I think it won't be good if it's propagated
> to ater the recovery.  But,
> >       don't we
> >       >             reset cwnd to ssthresh at the end of recovery?
> >
> >       [MK]: It seems that just by correcting the definition of
> DeliveredData
> >       there is no need to reset cwnd to ssthresh at the end of recovery
> because
> >       the algo would do it for us. But I am not opposed to resetting cwnd
> >       to ssthresh at the end. In that case it might be better to specify
> >       it by giving two alternatives similar to what RFC 6582 does. Maybe:
> >
> >          Set cwnd to either (1) min (ssthresh, cwnd) or (2) ssthresh.
> >
> >
> > I think we have discussed this in the past discussions.
> > In case of (1), cwnd can become very low when there were big losses
> before the recovery.
> > As many implementations don't take that approach and they have been used
> > for a long time, (2) became our consensus.
>
> [MK2]: Yes, setting cwnd to ssthresh has been discussed, and as I recall
> there were two differing opinions between co-authors with no final
> resolution. My apologies if there was one and I missed it.
> But I am not disagreeing. First, with my proposed solution this simply
> becomes a non-issue and is not needed at all. Second, if we decide to
> propose setting cwnd at the exit of the recovery, I just think it would
> not be a good idea to require all implementations to do (2); rather, let
> the implementor have another safe alternative as well. In case of cwnd
> being small, (1) would result in slow start that avoids a burst and would
> quickly determine the available capacity. With (2), the draft should say
> (suggest/require) something about applying some sort of pacing to avoid a
> burst, I think (such a burst is not necessarily self-evident for
> everyone).
>

I believe our consensus has been (2) while we are aware there can be a
burst.
I can agree that a burst here might not be obvious for everyone.
So, my personal recommendation would be to put something about bursts here.
But, if some folks have other opinions, please share.

--
Yoshi