[tcpm] Re: I-D Action: draft-ietf-tcpm-prr-rfc6937bis-06.txt

Matt Mathis <matt.mathis@gmail.com> Thu, 18 July 2024 16:33 UTC

From: Matt Mathis <matt.mathis@gmail.com>
Date: Thu, 18 Jul 2024 11:32:49 -0500
Message-ID: <CA+s1KQEape0k1GJ+0ZRQT5BvsVd80N8F8kTJcKzGnaRods5VMg@mail.gmail.com>
To: Neal Cardwell <ncardwell@google.com>
CC: Markku Kojo <kojo@cs.helsinki.fi>, Matt Mathis <ietf@mattmathis.net>, Matt Mathis <mattmathis@measurementlab.net>, tcpm@ietf.org
Subject: [tcpm] Re: I-D Action: draft-ietf-tcpm-prr-rfc6937bis-06.txt
List-Id: TCP Maintenance and Minor Extensions Working Group <tcpm.ietf.org>
Archived-At: <https://mailarchive.ietf.org/arch/msg/tcpm/8hoiRO24rZwIwJ334rDv88XCqWc>

I completely agree with Neal; however, it was probably a mistake on my part
to bring up the wider problem of setting ssthresh.

We used "ssthresh = CongCtrlAlg()" explicitly to sidestep that rathole.

But yes, PRR is kind of lame if ssthresh is set from FlightSize.
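
As a minimal sketch of why that matters (the beta values, variable names,
and packet counts below are illustrative assumptions, not text from the
draft), compare a cwnd-based reduction with a FlightSize-based one for an
application-limited flow:

    /* Sketch: cwnd-based vs. FlightSize-based ssthresh for an
     * application-limited flow.  All numbers are assumptions. */
    #include <stdio.h>

    int main(void) {
        unsigned int cwnd = 100;        /* congestion window, in packets   */
        unsigned int flight_size = 10;  /* app-limited: only 10 in flight  */

        /* cwnd-based reduction, i.e. "ssthresh = CongCtrlAlg()" above     */
        unsigned int ssthresh_reno  = cwnd / 2;       /* Reno-like,  beta 0.5 */
        unsigned int ssthresh_cubic = cwnd * 7 / 10;  /* CUBIC-like, beta 0.7 */

        /* FlightSize-based reduction (RFC 5681 style)                     */
        unsigned int ssthresh_fs = flight_size / 2;

        printf("cwnd-based ssthresh:       Reno %u, CUBIC %u\n",
               ssthresh_reno, ssthresh_cubic);
        printf("FlightSize-based ssthresh: %u\n", ssthresh_fs);
        /* With the FlightSize-based target there is almost nothing for
         * PRR to proportion the reduction over.                           */
        return 0;
    }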

Thanks,
--MM--
Evil is defined by mortals who think they know "The Truth" and use force to
apply it to others.
-------------------------------------------
Matt Mathis  (Email is best)
Home & mobile: 412-654-7529 please leave a message if you must call.



On Thu, Jul 18, 2024 at 9:57 AM Neal Cardwell <ncardwell@google.com> wrote:

> I would add a note that setting cwnd to ssthresh is critical in the case
> where both these conditions hold:
>
> (a) an application-limited flight
>
> (b) the TCP implementation uses the sensible and widely-deployed (e.g., by
> Linux TCP) approach of setting ssthresh in recovery based on cwnd rather
> than FlightSize, e.g. setting ssthresh to cwnd*0.5 (Reno) or cwnd*0.7
> (CUBIC).
>
> In such cases, the application-limited behavior can make pipe tiny, and
> thus cwnd, set via cwnd = pipe + sndcnt, is also tiny; there would often
> never be enough DeliveredData to grow cwnd organically back up to ssthresh
> using the core PRR code path. Thus, to handle such cases, PRR needs an
> explicit cwnd = ssthresh step for cwnd to reliably reach ssthresh by the
> end of recovery.
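
A minimal sketch of that dynamic (the numbers and variable names are
assumptions chosen for illustration, not taken from the draft): with pipe
tiny and little DeliveredData arriving, cwnd = pipe + sndcnt stays far
below ssthresh, which is why the explicit final step is needed.

    /* Sketch: app-limited recovery where cwnd = pipe + sndcnt cannot
     * organically reach ssthresh.  All numbers are assumptions.         */
    #include <stdio.h>

    int main(void) {
        unsigned int prior_cwnd = 100;
        unsigned int ssthresh = prior_cwnd * 7 / 10; /* cwnd-based, CUBIC-like */
        unsigned int pipe = 3;              /* app-limited: little in flight   */
        unsigned int delivered = 2;         /* DeliveredData on this ACK       */
        unsigned int prr_delivered = 4, prr_out = 2;

        /* pipe <= ssthresh, so PRR runs its slow-start-like branch:      */
        unsigned int sndcnt = prr_delivered - prr_out;         /* PRR-CRB */
        if (sndcnt < delivered) sndcnt = delivered;
        sndcnt += 1;                        /* PRR-SSRB: snd.una advanced */
        if (sndcnt > ssthresh - pipe) sndcnt = ssthresh - pipe;

        printf("cwnd = pipe + sndcnt = %u (vs. ssthresh %u)\n",
               pipe + sndcnt, ssthresh);    /* prints 6 vs. 70            */
        return 0;
    }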
>
> Thanks,
> neal
>
>
> On Thu, Jul 18, 2024 at 10:18 AM Matt Mathis <matt.mathis@gmail.com>
> wrote:
>
>> In the presentation, I would go even stronger than the text in the
>> document:
>>
>> The original design did not explicitly set cwnd to ssthresh, because PRR
>> implicitly converges (I believe for all common conditions that might
>> confuse any of the estimators used in the algorithm).  The explicit
>> assignment was added to make PRR more robust in the event of unknown or
>> unanticipated events.  Under the rare condition where the assignment causes
>> a discontinuous change to cwnd, the stack will behave as it would have
>> behaved if it did not have PRR: either a (greatly reduced) line rate burst,
>> or stopping transmissions to reduce FlightSize.
>>
>> For non-paced stacks, both of these events are common anyhow; e.g.,
>> simple reordering causes both. None of the pathological corner cases can
>> cause important misbehaviors.
>>
>> There are other gaping holes in the standards that are much more serious:
>> not forbidding raising cwnd when it is not being used, and no
>> specification of recovery extending across multiple RTTs due to new
>> losses. Although this latter problem has become more prominent with
>> RACK's ability to detect lost retransmissions, it was first observed
>> while testing RFC 2018 (as long as the new losses were only of new data).
>>
>> Thanks,
>> --MM--
>> Evil is defined by mortals who think they know "The Truth" and use force
>> to apply it to others.
>> -------------------------------------------
>> Matt Mathis  (Email is best)
>> Home & mobile: 412-654-7529 please leave a message if you must call.
>>
>>
>>
>> On Wed, Jul 17, 2024 at 2:22 PM Neal Cardwell <ncardwell@google.com>
>> wrote:
>>
>>>
>>>
>>> On Fri, Jul 5, 2024 at 4:39 AM Yoshifumi Nishida <nsd.ietf@gmail.com>
>>> wrote:
>>>
>>>> Hi Markku,
>>>>
>>>> Thanks for the comments. Please see inlines.
>>>>
>>>> On Tue, Jun 25, 2024 at 7:21 AM Markku Kojo <kojo@cs.helsinki.fi>
>>>> wrote:
>>>>
>>>>> Hi Yoshi,
>>>>>
>>>>> please see inline tagged [MK2].
>>>>>
>>>>> On Mon, 1 Apr 2024, Yoshifumi Nishida wrote:
>>>>>
>>>>> > Hi Markku,
>>>>> >
>>>>> > Thanks for the detailed comments. I put my comments in lines on the
>>>>> two points.
>>>>> >
>>>>> > On Wed, Mar 27, 2024 at 7:54 AM Markku Kojo <kojo@cs.helsinki.fi>
>>>>> wrote:
>>>>> >       Hi Yoshi, Neal, all,
>>>>> >
>>>>> >       Please see below inline (tagged [MK]). And my apologies for a
>>>>> very long
>>>>> >       explanation. Hopefully I did not include too many mistakes
>>>>> this time ;)
>>>>> >
>>>>> >       In summary, it seems that we do not need to reset cwnd in the
>>>>> end of the
>>>>> >       recovery nor adjust RecoverFS necessarily because all issues
>>>>> raised could
>>>>> >       be resolved by simply correcting the definition of
>>>>> DeliveredData
>>>>> >       (reverting back the original definition + small additional
>>>>> change) and
>>>>> >       moving the actions to take with the ACK that triggers loss
>>>>> recovery to
>>>>> >       the unconditional steps to be taken together with the
>>>>> initialization of
>>>>> >       the algo in the beginning (this would also be in line with how
>>>>> the other
>>>>> >       fast recovery algorithms are described in the RFC series).
>>>>> >
>>>>> >       Hopefully I did not misunderstand any parts of the algo
>>>>> (either in RFC
>>>>> >       6937 or in the current -08 version of the draft).
>>>>> >
>>>>> >       On Thu, 21 Mar 2024, Yoshifumi Nishida wrote:
>>>>> >
>>>>> >       > Hi Neal,
>>>>> >       > On Wed, Mar 20, 2024 at 1:32 PM Neal Cardwell <
>>>>> ncardwell@google.com> wrote:
>>>>> >       >
>>>>> >       >
>>>>> >       > On Wed, Mar 20, 2024 at 3:29 PM Yoshifumi Nishida <
>>>>> nsd.ietf@gmail.com> wrote:
>>>>> >       >       Hi Neal,
>>>>> >       >
>>>>> >       > On Wed, Mar 20, 2024 at 6:55 AM Neal Cardwell <
>>>>> ncardwell@google.com> wrote:
>>>>> >       >
>>>>> >       >       On Wed, Mar 20, 2024 at 3:07 AM Yoshifumi Nishida <
>>>>> nsd.ietf@gmail.com> wrote:
>>>>> >       >
>>>>> >       > On Mon, Mar 18, 2024 at 8:13 PM Neal Cardwell <ncardwell=
>>>>> 40google.com@dmarc.ietf.org>
>>>>> >       > wrote:
>>>>> >       >
>>>>> >       > But your point still stands, and you raise a great point:
>>>>> simply initializing
>>>>> >       > RecoverFS to "pipe" is not safe, because packets that were
>>>>> marked lost and removed
>>>>> >       > from pipe may actually have been merely reordered. So if
>>>>> those packets are
>>>>> >       > delivered later, they will increase the numerator of
>>>>> prr_delivered / RecoverFS
>>>>> >       > without increasing the denominator, thus leading to a result
>>>>> above 1.0, and thus
>>>>> >       > potentially leading to a target for Sent_so_far that is
>>>>> above ssthresh, causing the
>>>>> >       > algorithm to erroneously exceed ssthresh.
>>>>> >       >
>>>>> >       >
>>>>> >       > Hmm. I have a naive question here. If packets were merely
>>>>> reordered, isn't it ok for
>>>>> >       > cwnd to be bigger than ssthresh?
>>>>> >       >
>>>>> >       >
>>>>> >       > Yes, if packets were merely reordered and none were lost,
>>>>> then I agree it would be OK for
>>>>> >       cwnd
>>>>> >       > to be bigger than ssthresh. And in fact I would argue that
>>>>> cwnd should be reverted back to
>>>>> >       its
>>>>> >       > value before fast recovery. And this is in fact what Linux
>>>>> TCP would do, using loss recovery
>>>>> >       > "undo" mechanisms based on TCP timestamps or DSACKs.
>>>>> >       >
>>>>> >       > However, in the kind of scenario Markku described, there was
>>>>> not merely reordering, but also
>>>>> >       > real packet loss: "1 packet is lost (P1), and 24 packets are
>>>>> delayed (packets P2..P25)". In
>>>>> >       the
>>>>> >       > traditional loss-based Reno/CUBIC paradigm, any non-zero
>>>>> amount of real packet loss in fast
>>>>> >       > recovery should result in the same multiplicative decrease
>>>>> in cwnd, regardless of the
>>>>> >       > combination of reordering and loss. We could argue about
>>>>> whether that approach is the best
>>>>> >       > approach (BBR, for example, takes a different approach), but
>>>>> that is a very different
>>>>> >       > discussion. :-) For now AFAICT we are focused on PRR's
>>>>> faithful enactment of the congestion
>>>>> >       > control algorithms decision to reduce cwnd toward ssthresh
>>>>> when there is any non-zero amount
>>>>> >       of
>>>>> >       > real packet loss in fast recovery.
>>>>> >       >
>>>>> >       >
>>>>> >       > Got it. But, I just would like to clarify whether we are
>>>>> discussing the inflation of
>>>>> >       sndcnt during
>>>>> >       > the recovery process or cwnd after the exit of recovery.
>>>>> >       >
>>>>> >       >
>>>>> >       > Good point. We are talking about inflation of sndcnt during
>>>>> the recovery process.
>>>>> >
>>>>> >       [MK] I think we are talking about both in practice because
>>>>> inflation of
>>>>> >       sndcnt during the recovery process would also result in
>>>>> exiting recovery
>>>>> >       with too big cwnd. In the examples that I gave the segments
>>>>> sent_so_far
>>>>> >       was calculated when the SACK for P100 had arrived (actually
>>>>> the numbers
>>>>> >       were off by one):
>>>>> >
>>>>> >       For Reno:
>>>>> >
>>>>> >       Sent_so_far = CEIL(prr_delivered * ssthresh / RecoverFS)
>>>>> >                    = CEIL(97 * 50 / 72)
>>>>> >                    = 68
>>>>> >
>>>>> >       For CUBIC:
>>>>> >       Sent_so_far = CEIL(prr_delivered * ssthresh / RecoverFS)
>>>>> >                    = CEIL(97 * 70 / 72)
>>>>> >                    = 95
>>>>> >
>>>>> >       Now, when the cumulative ACK triggered by rexmit of P1 arrives
>>>>> and
>>>>> >       terminates fast recovery, the following is executed as per the
>>>>> *current
>>>>> >       version* of the draft:
>>>>> >
>>>>> >       DeliveredData =  (bytes newly cumulatively acknowledged) = 100
>>>>> >       DeliveredData += (bytes newly selectively acknowledged) = 100
>>>>> + 0
>>>>> >
>>>>> >       prr_delivered += DeliveredData = 95 + 100 = 195
>>>>> >       pipe = 94
>>>>> >       if (pipe > ssthresh) => (94 > 70) => (true)
>>>>> >         sndcnt = CEIL(prr_delivered * ssthresh / RecoverFS) - prr_out
>>>>> >                = CEIL(195*70/72) = 190 - 95 = 95
>>>>> >       cwnd = pipe + sndcnt = 94 + 95 = 189
>>>>> >
>>>>> >       So oops, when exiting fast recovery cwnd would be nearly
>>>>> doubled from
>>>>> >       what it was before entering loss recovery. It seems that there
>>>>> >       is an additional problem because the definition of
>>>>> DeliveredData in the
>>>>> >       current version of the draft is incorrect; the cumulatively
>>>>> acked bytes
>>>>> >       that have already been SACKed are counted twice in
>>>>> DeliveredData. It
>>>>> >       seems that RFC 6937 and rfc6937bis-01 both define
>>>>> DeliveredData (nearly)
>>>>> >       correctly by including the change in snd.una in DeliveredData
>>>>> and
>>>>> >       subtracting data that has already been SACKed. The definition
>>>>> of
>>>>> >       DeliveredData obviously needs to be corrected. See also below
>>>>> the issue
>>>>> >       with bytes SACked that are above snd.una but get SACKed before
>>>>> the start
>>>>> >       of recovery.
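
A minimal sketch of the two accountings being contrasted here (the packet
counts are illustrative assumptions): a final cumulative ACK that advances
snd.una over N packets, S of which were already SACKed during recovery.

    /* Sketch: DeliveredData on the final cumulative ACK under the two
     * readings discussed above.  Counts are assumptions.                */
    #include <stdio.h>

    int main(void) {
        int newly_cum_acked = 100;  /* N: packets newly cumulatively ACKed */
        int already_sacked  = 99;   /* S: of those, already SACKed earlier */
        int newly_sacked    = 0;    /* no new SACK blocks on this ACK      */

        /* RFC 6937-style: change_in(snd.una) + change_in(SACKd); the
         * scoreboard shrinks by S, so change_in(SACKd) is negative.      */
        int delivered_rfc6937 = newly_cum_acked + (newly_sacked - already_sacked);

        /* -06 text as quoted above: newly cumulatively ACKed plus newly
         * SACKed, which counts the S already-SACKed packets twice.       */
        int delivered_draft06 = newly_cum_acked + newly_sacked;

        printf("RFC 6937 accounting: %d\n", delivered_rfc6937);  /* 1   */
        printf("-06 text accounting: %d\n", delivered_draft06);  /* 100 */
        return 0;
    }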
>>>>> >
>>>>> >       With the original definition of DeliveredData cwnd would not be
>>>>> >       inflated further but fast recovery would still exit with too
>>>>> big cwnd
>>>>> >       (For CUBIC cwnd=95, instead of 70, and for Reno cwnd=68,
>>>>> instead of 50),
>>>>> >       if we use too small RecoveryFS (=Pipe)
>>>>> >
>>>>> >       So, it seems that we agree that the problem of sending too
>>>>> many bytes
>>>>> >       during the recovery process gets corrected if RecoverFS is
>>>>> initialized to
>>>>> >       snd.nxt - snd.una. The next question is, should RecoverFS be
>>>>> initialized
>>>>> >       to an even higher value in some scenarios? See below.
>>>>> >
>>>>> >       > Because I'm personally not very keen to address
>>>>> miscalculation of lost packets due to
>>>>> >       reordering
>>>>> >       > during the recovery process as it seems to be tricky.
>>>>> >       >
>>>>> >       >
>>>>> >       > It is tricky, but I think it is feasible to address. What do
>>>>> folks think about my suggestion
>>>>> >       from above in
>>>>> >       > this thread:
>>>>> >       >
>>>>> >       >   existing text:
>>>>> >       >      pipe = (RFC 6675 pipe algorithm)
>>>>> >       >      RecoverFS = pipe              // RFC 6675 pipe before
>>>>> recovery
>>>>> >       >
>>>>> >       >   proposed new text:
>>>>> >       >      RecoverFS = snd.nxt - snd.una + (bytes newly
>>>>> cumulatively acknowledged)
>>>>> >       >
>>>>> >       >
>>>>> >       > Hmm. Sorry. I'm not very sure about the difference between
>>>>> snd.nxt - snd.una and snd.nxt -
>>>>> >       snd.una + (bytes newly
>>>>> >       > cumulatively acknowledged)
>>>>> >       > Could you elaborate a bit? I thought we don't have data
>>>>> which are cumulatively acked in case
>>>>> >       of reordering.
>>>>> >
>>>>> >       [MK] It seems there might be another case that Neal is
>>>>> thinking where the
>>>>> >       sender may end up sending too many segments during the first
>>>>> RTT in fast
>>>>> >       recovery. If I understood it correctly this may occur in a
>>>>> scenario
>>>>> >       with ACK loss for pkts preceding the first dropped data pkt,
>>>>> for
>>>>> >       example. Consider the following scenario where there are 100
>>>>> pkts
>>>>> >       outstanding
>>>>> >
>>>>> >         P1..P24, P25, P26, P27, P28..P100
>>>>> >
>>>>> >       Packets P1..P24 and P26..P100 are delivered successfully to the
>>>>> >       receiver. P25 is lost. ACKs (and SACKs) for pkts P1..P24, P26
>>>>> and P27 get
>>>>> >       dropped. SACKs for P28..P100 are delivered successfully. When
>>>>> SACK
>>>>> >       for pkt P28 arrives, an RFC 6675 sender would declare P25 is
>>>>> lost, and
>>>>> >       enter fast retransmit. Similarly, a RACK-TLP sender may
>>>>> declare P25 lost,
>>>>> >       but this may happen with any of SACKs P28..P100 arriving.
>>>>> >
>>>>> >       Let's assume we were fully utilizing the congestion window, i.e.,
>>>>> cwnd=100
>>>>> >       and we enter loss recovery when the SACK of P28 arrives
>>>>> (cumulative
>>>>> >       ACK#=25):
>>>>> >
>>>>> >       ssthresh = cwnd / 2 = 50  (Reno)
>>>>> >       prr_delivered = prr_out = 0
>>>>> >       Pipe = snd.nxt - snd.una - (lost + SACKed) = 76 - (1 + 3) = 72
>>>>> >       RecoverFS = snd.nxt - snd.una = 101 - 25 = 76
>>>>> >
>>>>> >       DeliveredData = (bytes newly cumulatively acknowledged) = 24
>>>>> >       DeliveredData += change_in(SACKd) = 24+3 = 27
>>>>> >       prr_delivered += DeliveredData = 0+27 = 27
>>>>> >
>>>>> >       if (pipe > ssthresh) => if (72 > 50) => true
>>>>> >              // Proportional Rate Reduction
>>>>> >              sndcnt = CEIL(prr_delivered * ssthresh / RecoverFS) -
>>>>> prr_out
>>>>> >                     = CEIL(27 * 50 / 76) = 19 - 0 = 19
>>>>> >
>>>>> >       cwnd = 72 + 19 = 91
>>>>> >
>>>>> >       so, we will send a burst of 19 pkts on entry to recovery and
>>>>> during the
>>>>> >       rest of the recovery around 49 pkts, giving a total of
>>>>> 19+49=68 pkts
>>>>> >       while only 50 was allowed. If we add 24 cumulatively acked
>>>>> pkts into
>>>>> >       RecoverFS like Neal suggests, we are about to send around
>>>>> 14+37=51 pkts
>>>>> >       which is almost fine. However, the major shortcoming of this
>>>>> approach is
>>>>> >       that we'll still send a burst of 14 pkts in the beginning of
>>>>> the recovery
>>>>> >       while avoiding such a burst was one of the major goals of PRR.
>>>>> >
>>>>> >       Alternatively we could modify the algo such that the
>>>>> cumulatively acked
>>>>> >       bytes with the ACK that triggers loss recovery are not added to
>>>>> >       DeliveredData nor in RecoverFS.
>>>>> >
>>>>> >
>>>>> > I have thought about similar things. I'm thinking that this might be
>>>>> a minor point for now.
>>>>> > My personal thoughts on this are:
>>>>> > * The case you presented presumes huge ack losses before entering
>>>>> recovery and no ack loss after that. This
>>>>> > doesn't look like a very common case.
>>>>>
>>>>> [MK2]: I am not sure if this is a minor point.
>>>>> First, it does not need a huge number of ack losses before entering
>>>>> recovery. The severity of miscalculation depends on the number of lost
>>>>> acks.
>>>>> Second, there is no requirement for no ack loss after that. There
>>>>> might or might not be further ack losses. The authors (and others)
>>>>> have
>>>>> reported that ack loss is a very common phenomenon, so I think we
>>>>> cannot
>>>>> consider it uncommon.
>>>>>
>>>>> > * At any point of recovery, the inflation of DeliveredData can
>>>>> happen due to ack losses or other reasons.
>>>>> > I'm not very sure creating a special handling only for the first ACK
>>>>> is effective.
>>>>>
>>>>> [MK2]: Sure, sending a burst at any point of recovery is common to all
>>>>> Fast Recovery algos not only PRR. But PRR justifies itself (for a good
>>>>> reason) by avoiding bursts in the beginning of the loss recovery. Such
>>>>> a burst is very bad because it is likely to result in loss of
>>>>> retransmitted data that would be always present with segment(s)
>>>>> transmitted in the beginning of the recovery. Particularly with PRR,
>>>>> sending a burst in the beginning is bad because PRR does not have a
>>>>> pause
>>>>> in sending segments in the beginning like NewReno or RFC 6675 would
>>>>> have
>>>>> (i.e., with PRR, there is no similar opportunity for the bottleneck
>>>>> queue
>>>>> to drain before retransmitted segments start to flow into the
>>>>> congested
>>>>> queue).
>>>>>
>>>>
>>>> > * As long as there's no sudden increase of DeliveredData, I guess
>>>>> both logics behave mostly the same. So, I
>>>>> > think a question would be how much we care about this kind of
>>>>> situations. It seems to me that this looks
>>>>> > minor case.
>>>>>
>>>>> [MK2]: I don't think they are mostly the same. My proposal would avoid
>>>>> a burst in the beginning and also solve, in a simple way, the other
>>>>> miscalculation problems that ignore a number of Acks, as I have pointed
>>>>> out. It would also be consistent with the other Fast Retransmit & Fast
>>>>> Recovery algos in the RFC series that by definition always handle Fast
>>>>> Retransmit as a separate and unconditional step at the entry to loss
>>>>> recovery.
>>>>> The PRR algo as currently described is also inefficient, as it includes
>>>>> an unnecessary step to check whether to send out the Fast Retransmit,
>>>>> which can only happen at the beginning of the recovery. I doubt any
>>>>> implementor would like to include an unnecessary condition check to be
>>>>> executed repeatedly on arrival of each Ack during the recovery while the
>>>>> condition can be true only with the first Ack that triggers the loss
>>>>> recovery.
>>>>>
>>>>
>>>> Hmm, it seems to me that you're saying we might not be able to avoid
>>>> the bursts in the middle of recovery, but there's a way to avoid a burst in
>>>> the beginning.
>>>> Could you elaborate your proposal? I think I need to understand it a
>>>> bit more.
>>>>
>>>> >       Then we would send just one pkt (rexmit of
>>>>> >       P1) on entering the recovery and during the rest of recovery
>>>>> around 49
>>>>> >       pkts, i.e., 1+49=50 pkts during the recovery, which would be
>>>>> exactly equal
>>>>> >       to ssthresh we set. With this approach we could avoid the
>>>>> burst in the
>>>>> >       beginning.  In addition we could  have a consistent solution
>>>>> also for the
>>>>> >       additional problem of including extra SACKed data with the ACK
>>>>> that
>>>>> >       triggers the recovery. Let's look at the above scenario again
>>>>> cwnd=100
>>>>> >       and pkts P1..P100 in flight:
>>>>> >
>>>>> >         P1..P24, P25, P26, P27, P28..P100
>>>>> >
>>>>> >       Packets P1..P24 are delivered to the receiver but ACKs get
>>>>> dropped (whether
>>>>> >       ACKs are dropped or not is not relevant for this issue). P25
>>>>> gets
>>>>> >       dropped. If the DupAcks of pkt P26 and pkt P27 are delivered,
>>>>> from the
>>>>> >       DupAck of P28 only the SACK info for P28 is counted in
>>>>> DeliveredData but
>>>>> >       the SACK info for P26 and P27 are never counted in
>>>>> DeliveredData because
>>>>> >       P26 and P27 are already SACKed when the DupAck of P28 arrives.
>>>>> However,
>>>>> >       if the DupAcks of pkt P26 and pkt P27 get dropped as in the
>>>>> previous
>>>>> >       example, the ACK of P28 includes new SACK info for pkts P26,
>>>>> P27, and
>>>>> >       P28 and the bytes of P26 and P27 are also counted in
>>>>> DeliveredData. (If
>>>>> >       also DupAck of P28 gets dropped, the DupAck of P29 may include
>>>>> up to 3
>>>>> >       MSS of additional SACK info to be counted (P26, P27, and P28).
>>>>> This alone
>>>>> >       will result in a miniburst in the beginning of the recovery or
>>>>> add to the
>>>>> >       burst size as in the previous example where the two additional
>>>>> SACKs (for
>>>>> >       P26 and P27) inflated prr_delivered by 2, resulting in a
>>>>> slightly too large
>>>>> >       number of segments sent during the recovery (51).
>>>>> >
>>>>> >       As suggested above, this problem with additional SACKs would
>>>>> be solved
>>>>> >       such that the DupAck that triggers the loss recovery is
>>>>> allowed to add
>>>>> >       only "itself" into DeliveredData and let PRR include the
>>>>> missing bytes
>>>>> >       for pkts that were SACKed before the start of the recovery
>>>>> only at the
>>>>> >       end of the recovery when the cumulative ACK for the first pkt
>>>>> (P1)
>>>>> >       arrives and inherently covers those bytes.
>>>>> >
>>>>> >       In other words, the algo can be modified such that fast
>>>>> retransmit is
>>>>> >       always handled separately in the beginning of the recovery
>>>>> together with
>>>>> >       the initialization of the PRR variables:
>>>>> >
>>>>> >          ssthresh = CongCtrlAlg()      // Target flight size in
>>>>> recovery
>>>>> >       //[MK]: the next three lines can be deleted as unnecessary
>>>>> >          prr_delivered = 0             // Total bytes delivered in
>>>>> recovery
>>>>> >          prr_out = 0                   // Total bytes sent in
>>>>> recovery
>>>>> >          pipe = (RFC 6675 pipe algorithm)
>>>>> >
>>>>> >          RecoverFS = snd.nxt - snd.una // FlightSize right before
>>>>> recovery
>>>>> >                                        // [MK]:maybe add
>>>>> cumulatively ACKed
>>>>> >                                        //      bytes?
>>>>> >
>>>>> >          Fast Retransmit the first missing segment
>>>>> >          prr_delivered  = (With SACK: bytes selectively acknowledged
>>>>> by the first
>>>>> >                            SACK block of the ACK triggering the loss
>>>>> recovery, OR
>>>>> >                            Without SACK: 1 MSS)
>>>>> >          prr_out  = (data fast retransmitted)
>>>>> >
>>>>> >       On each arriving ACK during the rest of the fast recovery,
>>>>> including the
>>>>> >       final cumulative ACK that signals the end of loss recovery:
>>>>> >
>>>>> >          DeliveredData = change_in(snd.una)
>>>>> >          if (SACK is used) {
>>>>> >             DeliveredData += change_in(SACKd) //[MK]:(*) modify
>>>>> change_in(SACKd)
>>>>> >          ...
>>>>> >
>>>>> >
>>>>> >       The above changes would imply deleting
>>>>> >
>>>>> >         if (prr_out is 0 AND sndcnt is 0) {
>>>>> >              // Force a fast retransmit upon entering recovery
>>>>> >              sndcnt = MSS
>>>>> >
>>>>> >       from the algo and would make it consistent with the
>>>>> description of the
>>>>> >       other fast retransmit & Fast recovery algorithms (RFC 5681,
>>>>> RFC 6582, RFC
>>>>> >       6675) that include fast retransmit together with the
>>>>> initialization of
>>>>> >       the algo in the unconditional first steps of the algorithm.
>>>>> >
>>>>> >       (*)
>>>>> >       In addition, one more smallish but important correction is
>>>>> needed. The
>>>>> >       bytes that are SACKed before the recovery starts (i.e.,
>>>>> typically the
>>>>> >       famous first two DupAcks or more bytes if the start of
>>>>> recovery is
>>>>> >       postponed due to reordering) should be taken into account in
>>>>> the
>>>>> >       DeliveredData during the recovery but with the current algo
>>>>> they
>>>>> >       are never counted in DeliveredData (and prr_delivered).
>>>>> >       Why? Because when the first cumulative ACK arrives, it
>>>>> advances snd.una
>>>>> >       such that those bytes are covered but change_in(SACKd) is
>>>>> negative and
>>>>> >       it incorrectly also subtracts these bytes from DeliveredData
>>>>> (and
>>>>> >       prr_delivered) even though they were never counted in. Usually
>>>>> this is
>>>>> >       only 2 MSS but in scenarios similar to one that Neal earlier
>>>>> introduced
>>>>> >       there might be many more data bytes that are not counted. This
>>>>> change
>>>>> >       would also solve the problem of exiting PRR with too low cwnd.
>>>>> >       Let's look at Neil's earlier example again (see comments with
>>>>> [MK] for
>>>>> >       suggested change to solve the issue):
>>>>> >
>>>>> >       CC = CUBIC
>>>>> >       cwnd = 10
>>>>> >       The reordering degree was estimated to be large, so the
>>>>> connection will
>>>>> >       wait for more than 3 packets to be SACKed before entering fast
>>>>> recovery.
>>>>> >
>>>>> >       --- Application writes 10*MSS.
>>>>> >
>>>>> >       TCP sends packets P1 .. P10.
>>>>> >       pipe = 10 packets in flight (P1 .. P10)
>>>>> >
>>>>> >       --- P2..P9 SACKed  -> do nothing //
>>>>> >
>>>>> >       (Because the reordering degree was previously estimated to be
>>>>> large.)
>>>>> >
>>>>> >       --- P10 SACKed -> mark P1 as lost and enter fast recovery
>>>>> >
>>>>> >       PRR:
>>>>> >       ssthresh = CongCtrlAlg() = 7 packets // CUBIC
>>>>> >       prr_delivered = 0
>>>>> >       prr_out = 0
>>>>> >       RecoverFS = snd.nxt - snd.una = 10 packets (P1..P10)
>>>>> >
>>>>> >       DeliveredData = 1  (P10 was SACKed)
>>>>> >
>>>>> >       prr_delivered += DeliveredData   ==> prr_delivered = 1
>>>>> >
>>>>> >       pipe =  0  (all packets are SACKed or lost; P1 is lost, rest
>>>>> are SACKed)
>>>>> >
>>>>> >       safeACK = false (snd.una did not advance)
>>>>> >
>>>>> >       if (pipe > ssthresh) => if (0 > 7) => false
>>>>> >       else
>>>>> >         // PRR-CRB by default
>>>>> >         sndcnt = MAX(prr_delivered - prr_out, DeliveredData)
>>>>> >                = MAX(1 - 0, 1)
>>>>> >                = 1
>>>>> >
>>>>> >         sndcnt = MIN(ssthresh - pipe, sndcnt)
>>>>> >                = MIN(7 - 0, 1)
>>>>> >                = 1
>>>>> >
>>>>> >       cwnd = pipe + sndcnt
>>>>> >            = 0    + 1
>>>>> >            = 1
>>>>> >
>>>>> >       retransmit P1
>>>>> >
>>>>> >       prr_out += 1   ==> prr_out = 1
>>>>> >
>>>>> >       --- P1 retransmit plugs hole; receive cumulative ACK for
>>>>> P1..P10
>>>>> >
>>>>> >       DeliveredData = 1  (P1 was newly ACKed) //[MK]: should be = 10
>>>>> - 1 = 9
>>>>> >
>>>>> >       //[MK]: Note that SACKed bytes of P2..P9 were also newly acked
>>>>> >       //      because the bytes have not been delivered *during* the
>>>>> >       //      recovery by this far and thereby not yet counted in
>>>>> >       //      prr_delivered.
>>>>> >       //      So, they should not be subtracted from DeliveredData
>>>>> >       //      but included as those bytes got delivered only when
>>>>> >       //      snd.una advanced. Only P10 should be subtracted.
>>>>> >
>>>>> >       prr_delivered += DeliveredData   ==> prr_delivered = 2
>>>>> >       //[MK]: should be = 1 + 9 = 10
>>>>> >
>>>>> >       pipe =  0  (all packets are cumulatively ACKed)
>>>>> >
>>>>> >       safeACK = (snd.una advances and no further loss indicated)
>>>>> >       safeACK = true
>>>>> >
>>>>> >       if (pipe > ssthresh) => if (0 > 7) => false
>>>>> >       else
>>>>> >         // PRR-CRB by default
>>>>> >         sndcnt = MAX(prr_delivered - prr_out, DeliveredData)
>>>>> >                = MAX(2 - 1, 1)  //[MK]  = MAX(10-1, 1)
>>>>> >                = 1              //[MK]  = 9
>>>>> >         if (safeACK) => true
>>>>> >           // PRR-SSRB when recovery is in good progress
>>>>> >           sndcnt += 1   ==> sndcnt = 2 //[MK] ==> sndcnt = 10
>>>>> >
>>>>> >         sndcnt = MIN(ssthresh - pipe, sndcnt)
>>>>> >                = MIN(7 - 0, 2) //[MK] = MIN(7 - 0, 10)
>>>>> >                = 2             //[MK] = 7
>>>>> >
>>>>> >       cwnd = pipe + sndcnt
>>>>> >            = 0    + 2  //[MK] = 0 + 7
>>>>> >            = 2         //[MK] = 7
>>>>> >
>>>>> >       So we exit fast recovery with cwnd=2 even though ssthresh is 7.
>>>>> >
>>>>> >       [MK]: Or, we exit with cwnd=7 if we correctly count in
>>>>> DeliveredData
>>>>> >       during the recovery process all data that is in flight when
>>>>> the recovery
>>>>> >       starts. All bytes in flight at the start of the recovery are
>>>>> supposed to
>>>>> >       become acknowledged by the end of the recovery, so they should
>>>>> be counted
>>>>> >       in prr_delivered during the recovery.
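
The two outcomes above can be checked with a small computation (a sketch
reproducing the arithmetic in this example; the helper name and structure
are assumptions): only the DeliveredData credited on the final ACK differs.

    /* Sketch: final-ACK step of the example above (ssthresh=7, pipe=0,
     * prr_out=1, prr_delivered=1 before this ACK), run with the two
     * DeliveredData accountings that are being compared.                */
    #include <stdio.h>

    static unsigned int exit_cwnd(unsigned int delivered_final_ack) {
        unsigned int ssthresh = 7, pipe = 0;
        unsigned int prr_out = 1;                  /* the P1 retransmission */
        unsigned int prr_delivered = 1 + delivered_final_ack;

        /* pipe <= ssthresh: PRR-CRB, then PRR-SSRB since snd.una advanced */
        unsigned int sndcnt = prr_delivered - prr_out;
        if (sndcnt < delivered_final_ack) sndcnt = delivered_final_ack;
        sndcnt += 1;
        if (sndcnt > ssthresh - pipe) sndcnt = ssthresh - pipe;
        return pipe + sndcnt;
    }

    int main(void) {
        /* DeliveredData = 1: only P1 newly ACKed, P2..P10 subtracted     */
        printf("exit cwnd with DeliveredData=1: %u\n", exit_cwnd(1)); /* 2 */
        /* DeliveredData = 9: also credit P2..P9, which were SACKed before
         * recovery started and so never entered prr_delivered            */
        printf("exit cwnd with DeliveredData=9: %u\n", exit_cwnd(9)); /* 7 */
        return 0;
    }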
>>>>> >
>>>>> >       >             However, I think it won't be good if it's
>>>>> propagated to after the recovery.  But,
>>>>> >       don't we
>>>>> >       >             reset cwnd to ssthresh at the end of recovery?
>>>>> >
>>>>> >       [MK]: It seems that just by correcting the definition of
>>>>> DeliveredData
>>>>> >       there is no need to reset cwnd to ssthresh at the end of
>>>>> recovery because
>>>>> >       the algo would do it for us. But I am not opposing to reset
>>>>> cwnd to
>>>>> >       ssthresh at the end. In that case it might be better to
>>>>> specify it by
>>>>> >       giving two alternatives similar to what RFC 6582 does. Maybe:
>>>>> >
>>>>> >          Set cwnd to either (1) min (ssthresh, cwnd) or (2) ssthresh.
>>>>> >
>>>>> >
>>>>> > I think we have discussed this in the past discussions.
>>>>> > In case of (1), cwnd can become very low when there were big losses
>>>>> before the recovery.
>>>>> > As many implementations don't take that approach and they have been used
>>>>> for a long time, (2) became our
>>>>> > consensus.
>>>>>
>>>>> [MK2]: Yes, setting cwnd to ssthresh has been discussed and, as I
>>>>> recall, there were two differing opinions between the co-authors with no
>>>>> final resolution. My apologies if there was one and I missed it.
>>>>> But I am not disagreeing. First, with my proposed solution this simply
>>>>> becomes a non-issue and is not needed at all. Second, if we decide to
>>>>> propose setting cwnd at the exit of the recovery, I just think it would
>>>>> not be a good idea to require all implementations to do (2) but rather
>>>>> to let the implementor have another safe alternative as well. In case of cwnd
>>>>> being small, (1) would result in slow start that avoids a burst and
>>>>> would
>>>>> quickly determine the available capacity. With (2), the draft should
>>>>> say
>>>>> (suggest/require) something about applying some sort of pacing to
>>>>> avoid a
>>>>> burst, I think (such a burst is not necessarily self-evident for
>>>>> everyone).
>>>>>
>>>>
>>>> I believe our consensus has been (2) while we are aware there can be a
>>>> burst.
>>>> I can agree that a burst here might not be obvious for everyone.
>>>> So, my personal recommendation would be to put something about bursts
>>>> here. But, if some folks have other opinions, please share.
>>>>
>>>> --
>>>> Yoshi
>>>>
>>>
>>> Thanks, Yoshifumi. I agree that this sounds reasonable. As noted earlier
>>> in the thread just now, I do see that RFC 6675 does mention some cases
>>> where it can create bursts. So I suppose it's reasonable to mention the
>>> potential for bursts in the cwnd = ssthresh step at the end of PRR. I have
>>> proposed to add the following text, after the "cwnd = ssthresh" step:
>>>
>>> "Note that this step that sets cwnd to ssthresh can potentially, in some
>>> scenarios, allow a burst of back-to-back segments into the network. As with
>>> common scenarios that could allow bursts, such as restarting from idle, it
>>> is RECOMMENDED that implementations use pacing to reduce the burstiness of
>>> traffic."
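
As a minimal sketch of the kind of pacing being recommended here (the RTT,
window values, and pacing policy below are assumptions for illustration
only, not text from the draft):

    /* Sketch: spreading the segments that the cwnd = ssthresh step would
     * otherwise release back-to-back across roughly one smoothed RTT.
     * All values are assumptions.                                       */
    #include <stdio.h>

    int main(void) {
        double srtt_ms = 40.0;          /* smoothed RTT                      */
        unsigned int ssthresh = 70;     /* window after the final step       */
        unsigned int pipe = 10;         /* data actually in flight           */
        unsigned int burst = ssthresh - pipe;  /* would go out back-to-back  */

        double gap_ms = srtt_ms / ssthresh;    /* one cwnd per srtt pacing   */
        printf("burst of %u segments, paced one every %.2f ms\n", burst, gap_ms);
        printf("burst drains in ~%.1f ms instead of at line rate\n",
               burst * gap_ms);
        return 0;
    }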
>>>
>>> Thanks!
>>>
>>> Best regards,
>>> neal
>>>
>>>