[tcpm] Re: I-D Action: draft-ietf-tcpm-prr-rfc6937bis-06.txt
Matt Mathis <matt.mathis@gmail.com> Thu, 18 July 2024 16:33 UTC
From: Matt Mathis <matt.mathis@gmail.com>
Date: Thu, 18 Jul 2024 11:32:49 -0500
Message-ID: <CA+s1KQEape0k1GJ+0ZRQT5BvsVd80N8F8kTJcKzGnaRods5VMg@mail.gmail.com>
To: Neal Cardwell <ncardwell@google.com>
CC: Markku Kojo <kojo@cs.helsinki.fi>, Matt Mathis <ietf@mattmathis.net>, Matt Mathis <mattmathis@measurementlab.net>, tcpm@ietf.org
Subject: [tcpm] Re: I-D Action: draft-ietf-tcpm-prr-rfc6937bis-06.txt
List-Id: TCP Maintenance and Minor Extensions Working Group <tcpm.ietf.org>
Archived-At: <https://mailarchive.ietf.org/arch/msg/tcpm/8hoiRO24rZwIwJ334rDv88XCqWc>
I completely agree with Neal; however, it was probably a mistake on my part to
bring up the wider problem of setting ssthresh. We used "ssthresh =
CongCtrlAlg()" explicitly to sidestep that rathole. But yes, PRR is kind of
lame if ssthresh is set from FlightSize.

Thanks,
--MM--
Evil is defined by mortals who think they know "The Truth" and use force to
apply it to others.
-------------------------------------------
Matt Mathis (Email is best)
Home & mobile: 412-654-7529 please leave a message if you must call.

On Thu, Jul 18, 2024 at 9:57 AM Neal Cardwell <ncardwell@google.com> wrote:

> I would add a note that setting cwnd to ssthresh is critical in the case
> where both these conditions hold:
>
> (a) an application-limited flight
>
> (b) the TCP implementation uses the sensible and widely-deployed (e.g., by
> Linux TCP) approach of setting ssthresh in recovery based on cwnd rather
> than FlightSize, e.g. setting ssthresh to cwnd*0.5 (Reno) or cwnd*0.7
> (CUBIC).
>
> In such cases, the application-limited behavior can make pipe tiny, and
> thus cwnd, set via cwnd = pipe + sndcnt, is also tiny, and there would often
> never be enough DeliveredData to grow cwnd organically back up to ssthresh
> using the core PRR code path. Thus, to handle such cases, PRR needs an
> explicit cwnd=ssthresh for cwnd to reliably reach ssthresh by the end of
> recovery.
>
> Thanks,
> neal
>
>
> On Thu, Jul 18, 2024 at 10:18 AM Matt Mathis <matt.mathis@gmail.com> wrote:
>
>> In the presentation, I would go even stronger than the text in the
>> document:
>>
>> The original design did not explicitly set cwnd to ssthresh, because PRR
>> implicitly converges (I believe for all common conditions that might
>> confuse any of the estimators used in the algorithm). The explicit
>> assignment was added to make PRR more robust in the event of unknown or
>> unanticipated events. Under the rare condition where the assignment causes
>> a discontinuous change to cwnd, the stack will behave as it would have
>> behaved if it did not have PRR: either a (greatly reduced) line-rate burst,
>> or stopping transmissions to reduce FlightSize.
>>
>> For non-paced stacks, both of these events are common anyhow, e.g.,
>> simple reordering causes both. None of the pathological corner cases can
>> cause important misbehaviors.
>>
>> There are other gaping holes in the standards that are much more serious:
>> not forbidding raising cwnd when it is not being used, and no
>> specifications around recovery extending across multiple RTTs due to new
>> losses. Although this latter problem has become more prominent with
>> RACK's ability to detect lost retransmissions, it was first observed
>> while testing RFC 2018 (as long as new losses were only new data).
>>
>> Thanks,
>> --MM--
>> Evil is defined by mortals who think they know "The Truth" and use force
>> to apply it to others.
>> -------------------------------------------
>> Matt Mathis (Email is best)
>> Home & mobile: 412-654-7529 please leave a message if you must call.
>>
>>
>> On Wed, Jul 17, 2024 at 2:22 PM Neal Cardwell <ncardwell@google.com> wrote:
>>
>>> On Fri, Jul 5, 2024 at 4:39 AM Yoshifumi Nishida <nsd.ietf@gmail.com> wrote:
>>>
>>>> Hi Markku,
>>>>
>>>> Thanks for the comments. Please see inlines.
>>>>
>>>> On Tue, Jun 25, 2024 at 7:21 AM Markku Kojo <kojo@cs.helsinki.fi> wrote:
>>>>
>>>>> Hi Yoshi,
>>>>>
>>>>> please see inline tagged [MK2].
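
To make Neal's application-limited point (near the top of this message)
concrete, a minimal sketch with hypothetical numbers (Python; not the draft's
pseudocode):

    # Why cwnd tracked as pipe + sndcnt can end recovery far below an
    # ssthresh that was derived from cwnd.  All numbers are made up.

    cwnd_before = 100                  # large cwnd from earlier activity
    flight_at_loss = 10                # app-limited: only 10 packets in flight
    ssthresh = int(0.7 * cwnd_before)  # cwnd-based reduction (CUBIC-style) -> 70

    # PRR releases data only in proportion to DeliveredData, and at most
    # the (small) flight can be delivered during recovery, so:
    pipe_at_exit = 0
    max_prr_cwnd = pipe_at_exit + flight_at_loss   # on the order of 10

    # Without the explicit final assignment, cwnd leaves recovery near 10;
    # the "cwnd = ssthresh" step restores it to 70.
    print(max_prr_cwnd, ssthresh)
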
>>>>> >>>>> On Mon, 1 Apr 2024, Yoshifumi Nishida wrote: >>>>> >>>>> > Hi Markku, >>>>> > >>>>> > Thanks for the detailed comments. I put my comments in lines on the >>>>> two points. >>>>> > >>>>> > On Wed, Mar 27, 2024 at 7:54 AM Markku Kojo <kojo@cs.helsinki.fi> >>>>> wrote: >>>>> > Hi Yoshi, Neal, all, >>>>> > >>>>> > Please see below inline (tagged [MK]). And my apologies for a >>>>> very long >>>>> > explanation. Hopefully I did not include too many mistakes >>>>> this time ;) >>>>> > >>>>> > In summary, it seems that we do not need to reset cwnd in the >>>>> end of the >>>>> > recovery nor adjust RecoverFS necessarily because all issues >>>>> raised could >>>>> > be resolved by simply correcting the definition of >>>>> DeliveredData >>>>> > (reverting back the original definition + small additional >>>>> change) and >>>>> > moving the actions to take with the ACK that triggers loss >>>>> recovery to >>>>> > the unconditional steps to be taken together with the >>>>> initialization of >>>>> > the algo in the beginning (this would also be in line with how >>>>> the other >>>>> > fast recovery algorithms are described in the RFC series). >>>>> > >>>>> > Hopefully I did not misunderstand any parts of the algo >>>>> (either in RFC >>>>> > 6937 or in the current -08 version of the draft). >>>>> > >>>>> > On Thu, 21 Mar 2024, Yoshifumi Nishida wrote: >>>>> > >>>>> > > Hi Neal, >>>>> > > On Wed, Mar 20, 2024 at 1:32 PM Neal Cardwell < >>>>> ncardwell@google.com> wrote: >>>>> > > >>>>> > > >>>>> > > On Wed, Mar 20, 2024 at 3:29 PM Yoshifumi Nishida < >>>>> nsd.ietf@gmail.com> wrote: >>>>> > > Hi Neal, >>>>> > > >>>>> > > On Wed, Mar 20, 2024 at 6:55 AM Neal Cardwell < >>>>> ncardwell@google.com> wrote: >>>>> > > >>>>> > > On Wed, Mar 20, 2024 at 3:07 AM Yoshifumi Nishida < >>>>> nsd.ietf@gmail.com> wrote: >>>>> > > >>>>> > > On Mon, Mar 18, 2024 at 8:13 PM Neal Cardwell <ncardwell= >>>>> 40google.com@dmarc.ietf.org> >>>>> > > wrote: >>>>> > > >>>>> > > But your point still stands, and you raise a great point: >>>>> simply initializing >>>>> > > RecoverFS to "pipe" is not safe, because packets that were >>>>> marked lost and removed >>>>> > > from pipe may actually have been merely reordered. So if >>>>> those packets are >>>>> > > delivered later, they will increase the numerator of >>>>> prr_delivered / RecoverFS >>>>> > > without increasing the denominator, thus leading to a result >>>>> above 1.0, and thus >>>>> > > potentially leading to a target for Sent_so_far that is >>>>> above ssthresh, causing the >>>>> > > algorithm to erroneously exceed ssthresh. >>>>> > > >>>>> > > >>>>> > > Hmm. I have a native question here. If packets were merely >>>>> reordered, isn't it ok for >>>>> > > cwnd to be bigger than ssthresh? >>>>> > > >>>>> > > >>>>> > > Yes, if packets were merely reordered and none were lost, >>>>> then I agree it would be OK for >>>>> > cwnd >>>>> > > to be bigger than ssthresh. And in fact I would argue that >>>>> cwnd should be reverted back to >>>>> > its >>>>> > > value before fast recovery. And this is in fact what Linux >>>>> TCP would do, using loss recovery >>>>> > > "undo" mechanisms based on TCP timestamps or DSACKs. >>>>> > > >>>>> > > However, in the kind of scenario Markku described, there was >>>>> not merely reordering, but also >>>>> > > real packet loss: "1 packet is lost (P1), and 24 packets are >>>>> delayed (packets P2..P25)". 
In >>>>> > the >>>>> > > traditional loss-based Reno/CUBIC paradigm, any non-zero >>>>> amount of real packet loss in fast >>>>> > > recovery should result in the same multiplicative decrease >>>>> in cwnd, regardless of the >>>>> > > combination of reordering and loss. We could argue about >>>>> whether that approach is the best >>>>> > > approach (BBR, for example, takes a different approach), but >>>>> that is a very different >>>>> > > discussion. :-) For now AFAICT we are focused on PRR's >>>>> faithful enactment of the congestion >>>>> > > control algorithms decision to reduce cwnd toward ssthresh >>>>> when there is any non-zero amount >>>>> > of >>>>> > > real packet loss in fast recovery. >>>>> > > >>>>> > > >>>>> > > Got it. But, I just would like to clarify whether we are >>>>> discussing the inflation of >>>>> > sndcnt during >>>>> > > the recovery process or cwnd after the exit of recovery. >>>>> > > >>>>> > > >>>>> > > Good point. We are talking about inflation of sndcnt during >>>>> the recovery process. >>>>> > >>>>> > [MK] I think we are talking about both in practice because >>>>> inflation of >>>>> > sndcnt during the recovery process would also result in >>>>> exiting recovery >>>>> > with too big cwnd. In the examples that I gave the segments >>>>> sent_so_far >>>>> > was calculated when the SACK for P100 had arrived (actually >>>>> the numbers >>>>> > were off by one): >>>>> > >>>>> > For Reno: >>>>> > >>>>> > Sent_so_far = CEIL(prr_delivered * ssthresh / RecoverFS) >>>>> > = CEIL(97 * 50 / 72) >>>>> > = 68 >>>>> > >>>>> > For CUBIC: >>>>> > Sent_so_far = CEIL(prr_delivered * ssthresh / RecoverFS) >>>>> > = CEIL(97 * 70 / 72) >>>>> > = 95 >>>>> > >>>>> > Now, when the cumulative ACK triggered by rexmit of P1 arrives >>>>> and >>>>> > terminates fast recovery, the following is executed as per the >>>>> *current >>>>> > version* of the draft: >>>>> > >>>>> > DeliveredData = (bytes newly cumulatively acknowledged) = 100 >>>>> > DeliveredData += (bytes newly selectively acknowledged) = 100 >>>>> + 0 >>>>> > >>>>> > prr_delivered += DeliveredData = 95 + 100 = 195 >>>>> > pipe = 94 >>>>> > if (pipe > ssthresh) => (94 > 70) => (true) >>>>> > sndcnt = CEIL(prr_delivered * ssthresh / RecoverFS) - prr_out >>>>> > = CEIL(195*70/72) = 190 - 95 = 95 >>>>> > cwnd = pipe + sndcnt = 94 + 95 = 189 >>>>> > >>>>> > So oops, when exiting fast recovery cwnd would be nearly >>>>> doubled from >>>>> > what it was before entering loss recovery. It seems that there >>>>> > is an additional problem because the definition of >>>>> DeliveredData in the >>>>> > current version of the draft is incorrect; the cumulatively >>>>> acked bytes >>>>> > that have already been SACKed are counted twice in >>>>> DeliveredData. It >>>>> > seems that RFC 6937 and rfc6937bis-01 both define >>>>> DeliveredData (nearly) >>>>> > correctly by including the change in snd.una in DeliveredData >>>>> and >>>>> > substracting data that has already been SACKed. The definition >>>>> of >>>>> > DeliveredData obviously needs to be corrected. See also below >>>>> the issue >>>>> > with bytes SACked that are above snd.una but get SACKed before >>>>> the start >>>>> > of recovery. 
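
The double-count Markku describes can be seen with a small sketch (Python;
numbers loosely follow the example above, where ~95 packets were SACKed during
recovery before the final cumulative ACK covered 100 packets):

    newly_cum_acked = 100   # packets newly covered by snd.una on the final ACK
    already_sacked = 95     # of those, previously counted when they were SACKed
    newly_sacked = 0        # no new SACK blocks on this ACK

    # Per the -06 text ("newly cumulatively acknowledged" plus "newly
    # selectively acknowledged"), the 95 previously SACKed packets are
    # counted a second time:
    delivered_draft06 = newly_cum_acked + newly_sacked          # 100

    # Per the RFC 6937 definition (change in snd.una plus the *change* in
    # SACKd, which goes negative as SACKed data is cumulatively acked),
    # each packet is counted only once:
    delta_snd_una = newly_cum_acked                              # +100
    delta_sackd = -already_sacked                                # -95
    delivered_rfc6937 = delta_snd_una + delta_sackd              # 5

    print(delivered_draft06, delivered_rfc6937)
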
>>>>> > >>>>> > With the original definition of DeliveredData cwnd would not be >>>>> > inflated further but fast recovery would still exit with too >>>>> big cwnd >>>>> > (For CUBIC cwnd=95, instead of 70, and for Reno cwnd=68, >>>>> instead of 50), >>>>> > if we use too small RecoveryFS (=Pipe) >>>>> > >>>>> > So, it seems that we agree that the problem of sending too >>>>> many bytes >>>>> > during the recovery process gets corrected if RecoverFS is >>>>> initialized to >>>>> > snd.nxt - snd.una. The next question is, should RecoverFS be >>>>> initialized >>>>> > to even higher value in some scenarios? See below. >>>>> > >>>>> > > Because I'm personally not very keen to address >>>>> miscalculation of lost packets due to >>>>> > reordering >>>>> > > during the recovery process as it seems to be tricky. >>>>> > > >>>>> > > >>>>> > > It is tricky, but I think it is feasible to address. What do >>>>> folks think about my suggestion >>>>> > from above in >>>>> > > this thread: >>>>> > > >>>>> > > existing text: >>>>> > > pipe = (RFC 6675 pipe algorithm) >>>>> > > RecoverFS = pipe // RFC 6675 pipe before >>>>> recovery >>>>> > > >>>>> > > proposed new text: >>>>> > > RecoverFS = snd.nxt - snd.una + (bytes newly >>>>> cumulatively acknowledged) >>>>> > > >>>>> > > >>>>> > > Hmm. Sorry. I'm not very sure about the difference between >>>>> snd.nxt - snd.una and snd.nxt - >>>>> > snd.una + (bytes newly >>>>> > > cumulatively acknowledged) >>>>> > > Could you elaborate a bit? I thought we don't have data >>>>> which are cumulatively acked in case >>>>> > of reordering. >>>>> > >>>>> > [MK] It seems there might be another case that Neil is >>>>> thinking where the >>>>> > sender may end up sending too many segments during the first >>>>> RTT in fast >>>>> > recovery. If I understood it correctly this may occur in a >>>>> scenario >>>>> > with ACK loss for pkts preceeding the first dropped data pkt, >>>>> for >>>>> > example. Consider the following scenario where there are 100 >>>>> pkts >>>>> > outstanding >>>>> > >>>>> > P1..P24, P25, P26, P27, P28..P100 >>>>> > >>>>> > Packets P1..P24 and P26..P100 are delivered succesfully to the >>>>> > receiver. P25 is lost. ACKs (and SACKs) for pkts P1..P24, P26 >>>>> and P27 get >>>>> > dropped. SACKs for P28..P100 are delivered successfully. When >>>>> SACK >>>>> > for pkt P28 arrives, an RFC 6675 sender would declare P25 is >>>>> lost, and >>>>> > enter fast retransmit. Similarly, a RACK-TLP sender may >>>>> declare P25 lost, >>>>> > but this may happen with any of SACKs P28..P100 arriving. 
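
The RecoverFS initialization variants under discussion, as a small sketch
(Python; the option labels and function name are just for this sketch), using
the packet counts of the scenario above:

    def recover_fs(snd_nxt, snd_una, pipe, newly_cum_acked, option):
        if option == "pipe":              # existing -06 text: RecoverFS = pipe
            return pipe
        if option == "flightsize":        # snd.nxt - snd.una at entry
            return snd_nxt - snd_una
        if option == "flightsize+cum":    # Neal's proposed text
            return snd_nxt - snd_una + newly_cum_acked
        raise ValueError(option)

    # Scenario above: 100 packets outstanding, cumulative ACK#=25 on the
    # ACK that starts recovery, 1 packet lost and 3 SACKed.
    snd_nxt, snd_una, pipe, cum = 101, 25, 72, 24
    for opt in ("pipe", "flightsize", "flightsize+cum"):
        print(opt, recover_fs(snd_nxt, snd_una, pipe, cum, opt))  # 72, 76, 100

A larger RecoverFS makes the proportional release more gradual, which is what
keeps the sender from overshooting ssthresh during recovery.
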
>>>>> > >>>>> > Let's assume we were fully utilizing congestion window, i.e., >>>>> cwnd=100 >>>>> > and we enter loss recovery when the SACK of P28 arrives >>>>> (cumulative >>>>> > ACK#=25): >>>>> > >>>>> > ssthresh = cwnd / 2 = 50 (Reno) >>>>> > prr_delivered = prr_out = 0 >>>>> > Pipe = snd.nxt - snd.una - (lost + SACKed) = 76 - (1 + 3) = 72 >>>>> > RecoverFS = snd.nxt - snd.una = 101 - 25 = 76 >>>>> > >>>>> > DeliveredData = (bytes newly cumulatively acknowledged) = 24 >>>>> > DeliveredData += change_in(SACKd) = 24+3 = 27 >>>>> > prr_delivered += DeliveredData = 0+27 = 27 >>>>> > >>>>> > if (pipe > ssthresh) => if (72 > 50) => true >>>>> > // Proportional Rate Reduction >>>>> > sndcnt = CEIL(prr_delivered * ssthresh / RecoverFS) - >>>>> prr_out >>>>> > = CEIL(27 * 50 / 76) = 19 - 0 = 19 >>>>> > >>>>> > cwnd = 72 + 19 = 91 >>>>> > >>>>> > so, we will send a burst of 19 pkts on entry to recovery and >>>>> during the >>>>> > rest of the recovery around 49 pkts, giving a total of >>>>> 19+49=68 pkts >>>>> > while only 50 was allowed. If we add 24 cumulatively acked >>>>> pkts into >>>>> > RecoverFS like Neil suggests, we are about to send around >>>>> 14+37=51 pkts >>>>> > which is almost fine. However, the major shortcoming of this >>>>> approach is >>>>> > that we'll still send a burst of 14 pkts in the beginning of >>>>> the recovery >>>>> > while avoiding such a burst was one of the major goals of PRR. >>>>> > >>>>> > Alternatively we could modify the algo such that the >>>>> cumulatively acked >>>>> > bytes with the ACK that triggers loss recovery are not added to >>>>> > DeliveredData nor in RecoverFS. >>>>> > >>>>> > >>>>> > I have thought about similar things. I'm thinking that this might be >>>>> a minor point for now. >>>>> > My personal thoughts on this are: >>>>> > * The case you presented presumes huge ack losses before entering >>>>> recovery and no ack loss after that. this >>>>> > doesn't look a very common case. >>>>> >>>>> [MK2]: I am not sure if this is a minor point. >>>>> First, it does not need a huge number of ack losses before entering >>>>> recovery. The severity of miscalculation depends on the number of lost >>>>> acks. >>>>> Second, there is no requirement for no ack loss after that. There >>>>> might or might not be further ack losses. The authors (and others) >>>>> have >>>>> reported that ack loss is a very common phenomenon, so I think we >>>>> cannot >>>>> consider it uncommon. >>>>> >>>>> > * At any point of recovery, the inflation of DeliveredData can >>>>> happen due to ack losses or other reasons >>>>> > I'm not very sure creating a special handling only for the first ACK >>>>> is effective. >>>>> >>>>> [MK2]: Sure, sending a burst at any point of recovery is common to all >>>>> Fast Recovery algos not only PRR. But PRR justifies itself (for a good >>>>> reason) by avoiding bursts in the beginning of the loss recovery. Such >>>>> a burst is very bad because it is likely to result in loss of >>>>> retransmitted data that would be always present with segment(s) >>>>> transmitted in the beginning of the recovery. Particularly with PRR, >>>>> sending a burst in the beginning is bad because PRR does not have a >>>>> pause >>>>> in sending segments in the beginning like NewReno or RFC 6675 would >>>>> have >>>>> (i.e., with PRR, there is no similar opportunity for the bottleneck >>>>> queue >>>>> to drain before retransmitted segments start to flow into the >>>>> congested >>>>> queue). 
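
A compact restatement of the entry-ACK arithmetic above, as a Python sketch
(illustrative only; prr_out is 0 on the ACK that starts recovery):

    import math

    def sndcnt_on_entry(prr_delivered, ssthresh, recover_fs, prr_out=0):
        # proportional branch (pipe > ssthresh) of PRR on the triggering ACK
        return math.ceil(prr_delivered * ssthresh / recover_fs) - prr_out

    ssthresh = 50   # Reno, cwnd was 100

    # The triggering ACK newly cum-acks 24 packets and newly SACKs 3:
    print(sndcnt_on_entry(27, ssthresh, 76))    # 18 here (the text above gets 19
                                                # with slightly different rounding)
    print(sndcnt_on_entry(27, ssthresh, 100))   # 14: Neal's larger RecoverFS
    print(sndcnt_on_entry(1, ssthresh, 76))     # ~1: with Markku's proposal only
                                                # the fast retransmit goes out on
                                                # entry to recovery
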
>>>>> >>>> >>>> > * As long as there's no sudden increase of DeliveredData, I guess >>>>> both logics behave mostly the same. So, I >>>>> > think a question would be how much we care about this kind of >>>>> situations. It seems to me that this looks >>>>> > minor case. >>>>> >>>>> [MK2]: I don't think they are mostly the same. My proposal would avoid >>>>> a >>>>> burst in the beginning and solve in a simple way also the other >>>>> miscalculation problems which ignore a number of Acks as I have >>>>> pointed >>>>> out. It would also be consistent with the other Fast Retransmit & Fast >>>>> Recovery algos in the RFC series that by definition always handle Fast >>>>> Retransmit as a separate and unconditional step at the entry to loss >>>>> recovery. >>>>> PRR algo as curently described is also inefficient as it includes an >>>>> unnecessary step to check whether to send out the Fast Retransmit >>>>> which >>>>> can only happen in the beginning of the recovery. I doubt hardly any >>>>> implementor would like to include an unnecessary condition check to be >>>>> executed repeatedely on arrival of each Ack during the recovery while >>>>> the >>>>> condition can be true only with the first Ack that triggers the loss >>>>> recovery? >>>>> >>>> >>>> Hmm, it seems to me that you're saying we might not be able to avoid >>>> the bursts in the middle of recovery, but there's a way to avoid a burst in >>>> the beginning. >>>> Could you elaborate your proposal? I think I need to understand it a >>>> bit more. >>>> >>>> > Then we would send just one pkt (rexmit of >>>>> > P1) on entering the recovery and during the rest of recovery >>>>> around 49 >>>>> > pkts, i.e., 1+49=50 pkts during the recovery, which would be >>>>> exactly equal >>>>> > to ssthresh we set. With this approach we could avoid the >>>>> burst in the >>>>> > beginning. In addition we could have a consistent solution >>>>> also for the >>>>> > additional problem of including extra SACKed data with the ACK >>>>> that >>>>> > triggers the recovery. Let's look at the above scenario again >>>>> cwnd=100 >>>>> > and pkts P1..P100 in flight: >>>>> > >>>>> > P1..P24, P25, P26, P27, P28..P100 >>>>> > >>>>> > Packet P1..P24 are delivered to the receiver but ACKs get >>>>> dropped (whether >>>>> > ACKs are dropped or not is not relevant for this issue). P25 >>>>> gets >>>>> > dropped. If the DupAcks of pkt P26 and pkt P27 are delivered, >>>>> from the >>>>> > DupAck of P28 only the SACK info for P28 is counted in >>>>> DeliveredData but >>>>> > the SACK info for P26 and P27 are never counted in >>>>> DeliveredData because >>>>> > P26 and P27 are already SACKed when the DupAck of P28 arrives. >>>>> However, >>>>> > if the DupAcks of pkt P26 and pkt P27 get dropped as in the >>>>> previous >>>>> > example, the ACK of P28 includes new SACK info for pkts P26, >>>>> P27, and >>>>> > P28 and the bytes of P26 and P27 are also counted in >>>>> DeliveredData. (If >>>>> > also DupAck of P28 gets dropped, the DupAck of P29 may include >>>>> up to 3 >>>>> > MSS of additional SACK info to be counted (P26, P27, and P28). >>>>> This alone >>>>> > will result in a miniburst in the beginning of the recovery or >>>>> add to the >>>>> > burst size as in the previous example where the two additinal >>>>> SACKs (for >>>>> > P26 and P27) inflated prr_delivered by 2, resulting in >>>>> slightly too large >>>>> > number of segments sent during the recovery (51). 
>>>>> > >>>>> > As suggested above, this problem with additional SACKs would >>>>> be solved >>>>> > such that the DupAck that triggers the loss recovery is >>>>> allowed to add >>>>> > only "itself" into DeliveredData and let PRR to include the >>>>> missing bytes >>>>> > for pkts that were SACKed before the start of the recovery >>>>> only at the >>>>> > end of the recovery when the cumulative ACK for the first pkt >>>>> (P1) >>>>> > arrives and inherently covers those bytes. >>>>> > >>>>> > In other words, the algo can be modified such that fast >>>>> retransmit is >>>>> > always handled separately in the beginning of the recovery >>>>> together with >>>>> > the initialization of the PRR variables: >>>>> > >>>>> > ssthresh = CongCtrlAlg() // Target flight size in >>>>> recovery >>>>> > //[MK]: the next three lines can be deleted as unnecessary >>>>> > prr_delivered = 0 // Total bytes delivered in >>>>> recovery >>>>> > prr_out = 0 // Total bytes sent in >>>>> recovery >>>>> > pipe = (RFC 6675 pipe algorithm) >>>>> > >>>>> > RecoverFS = snd.nxt - snd.una // FlightSize right before >>>>> recovery >>>>> > // [MK]:maybe add >>>>> cumulatively ACKed >>>>> > // bytes? >>>>> > >>>>> > Fast Retransmit the first missing segment >>>>> > prr_delivered = (With SACK: bytes selectively acknowledged >>>>> by the first >>>>> > SACK block of the ACK triggering the loss >>>>> recovery, OR >>>>> > Without SACK: 1 MSS) >>>>> > prr_out = (data fast retransmitted) >>>>> > >>>>> > On each arriving ACK during the rest of the fast recovery, >>>>> including the >>>>> > final cumulative ACK that signals the end of loss recovery: >>>>> > >>>>> > DeliveredData = change_in(snd.una) >>>>> > if (SACK is used) { >>>>> > DeliveredData += change_in(SACKd) //[MK]:(*) modify >>>>> change_in(SACKd) >>>>> > ... >>>>> > >>>>> > >>>>> > The above changes would imply deleting >>>>> > >>>>> > if (prr_out is 0 AND sndcnt is 0) { >>>>> > // Force a fast retransmit upon entering recovery >>>>> > sndcnt = MSS >>>>> > >>>>> > from the algo and would make it consistent with the >>>>> description of the >>>>> > other fast retransmit & Fast recovery algorithms (RFC 5681, >>>>> RFC 6582, RFC >>>>> > 6675) that include fast retransmit together with the >>>>> initialization of >>>>> > the algo in the unconditional first steps of the algorithm. >>>>> > >>>>> > (*) >>>>> > In addition, one more smallish but important correction is >>>>> needed. The >>>>> > bytes that are SACKed before the recovery starts (i.e., >>>>> typically the >>>>> > famous first two DupAcks or more bytes if the start of >>>>> recovery is >>>>> > postponed due to reordering) should be taken into account in >>>>> the >>>>> > DeliveredData during the recovery but with the current algo >>>>> they >>>>> > are never counted in DeliveredData (and prr_delivered). >>>>> > Why? Because when the first cumulative ACK arrives, it >>>>> advances snd.una >>>>> > such that those bytes are covered but change_in(SACKd) is >>>>> negative and >>>>> > it incorrectly substracts also these bytes from DeliveredData >>>>> (and >>>>> > prr_delivered) even though they were never counted in. Usually >>>>> this is >>>>> > only 2 MSS but in scenarios similar to one that Neil earlier >>>>> introduced >>>>> > there might be much more data bytes that are not counted. This >>>>> change >>>>> > would also solve the problem of exiting PRR with too low cwnd. 
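
A sketch of the restructured entry step proposed above (Python-style
pseudocode; the state/ack field and method names are hypothetical, not from
the draft):

    def enter_recovery(state, ack):
        state.ssthresh = state.cong_ctrl_alg()            # target flight size
        state.pipe = state.rfc6675_pipe()
        state.recover_fs = state.snd_nxt - state.snd_una  # FlightSize at entry

        state.fast_retransmit_first_hole()                # unconditional, as in
                                                          # RFC 5681 / RFC 6675
        if ack.has_sack:
            # only the bytes newly SACKed by this ACK's first SACK block
            state.prr_delivered = ack.first_sack_block_bytes
        else:
            state.prr_delivered = state.mss
        state.prr_out = state.bytes_fast_retransmitted

    def on_ack_during_recovery(state, ack):
        # every later ACK, including the one that ends recovery
        delivered = ack.delta_snd_una
        if ack.has_sack:
            delivered += ack.delta_sacked                 # see the (*) note above
        state.prr_delivered += delivered
        # ... proportional / reduction-bound steps as in the draft; the
        # forced "if (prr_out is 0 AND sndcnt is 0)" retransmit check is
        # no longer needed ...
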
>>>>> > Let's look at Neil's earlier example again (see comments with >>>>> [MK] for >>>>> > suggested change to solve the issue): >>>>> > >>>>> > CC = CUBIC >>>>> > cwnd = 10 >>>>> > The reordering degree was estimated to be large, so the >>>>> connection will >>>>> > wait for more than 3 packets to be SACKed before entering fast >>>>> recovery. >>>>> > >>>>> > --- Application writes 10*MSS. >>>>> > >>>>> > TCP sends packets P1 .. P10. >>>>> > pipe = 10 packets in flight (P1 .. P10) >>>>> > >>>>> > --- P2..P9 SACKed -> do nothing // >>>>> > >>>>> > (Because the reordering degree was previously estimated to be >>>>> large.) >>>>> > >>>>> > --- P10 SACKed -> mark P1 as lost and enter fast recovery >>>>> > >>>>> > PRR: >>>>> > ssthresh = CongCtrlAlg() = 7 packets // CUBIC >>>>> > prr_delivered = 0 >>>>> > prr_out = 0 >>>>> > RecoverFS = snd.nxt - snd.una = 10 packets (P1..P10) >>>>> > >>>>> > DeliveredData = 1 (P10 was SACKed) >>>>> > >>>>> > prr_delivered += DeliveredData ==> prr_delivered = 1 >>>>> > >>>>> > pipe = 0 (all packets are SACKed or lost; P1 is lost, rest >>>>> are SACKed) >>>>> > >>>>> > safeACK = false (snd.una did not advance) >>>>> > >>>>> > if (pipe > ssthresh) => if (0 > 7) => false >>>>> > else >>>>> > // PRR-CRB by default >>>>> > sndcnt = MAX(prr_delivered - prr_out, DeliveredData) >>>>> > = MAX(1 - 0, 1) >>>>> > = 1 >>>>> > >>>>> > sndcnt = MIN(ssthresh - pipe, sndcnt) >>>>> > = MIN(7 - 0, 1) >>>>> > = 1 >>>>> > >>>>> > cwnd = pipe + sndcnt >>>>> > = 0 + 1 >>>>> > = 1 >>>>> > >>>>> > retransmit P1 >>>>> > >>>>> > prr_out += 1 ==> prr_out = 1 >>>>> > >>>>> > --- P1 retransmit plugs hole; receive cumulative ACK for >>>>> P1..P10 >>>>> > >>>>> > DeliveredData = 1 (P1 was newly ACKed) //[MK]: should be = 10 >>>>> - 1 = 9 >>>>> > >>>>> > //[MK]: Note that SACKed bytes of P2..P9 were also newly acked >>>>> > // because the bytes have not been delivered *during* the >>>>> > // recovery by this far and thereby not yet counted in >>>>> > // prr_delivered. >>>>> > // So, they should not be substracted from DeliveredData >>>>> > // but included as those bytes got delivered only when >>>>> > // snd.una advanced. Only P10 should be substracted. >>>>> > >>>>> > prr_delivered += DeliveredData ==> prr_delivered = 2 >>>>> > //[MK]: should be = 1 + 9 = 10 >>>>> > >>>>> > pipe = 0 (all packets are cumuatively ACKed) >>>>> > >>>>> > safeACK = (snd.una advances and no further loss indicated) >>>>> > safeACK = true >>>>> > >>>>> > if (pipe > ssthresh) => if (0 > 7) => false >>>>> > else >>>>> > // PRR-CRB by default >>>>> > sndcnt = MAX(prr_delivered - prr_out, DeliveredData) >>>>> > = MAX(2 - 1, 1) //[MK] = MAX(10-1, 1) >>>>> > = 1 //[MK] = 9 >>>>> > if (safeACK) => true >>>>> > // PRR-SSRB when recovery is in good progress >>>>> > sndcnt += 1 ==> sndcnt = 2 //[MK] ==> sndcnt = 10 >>>>> > >>>>> > sndcnt = MIN(ssthresh - pipe, sndcnt) >>>>> > = MIN(7 - 0, 2) //[MK] = MIN(7 - 0, 10) >>>>> > = 2 //[MK] = 7 >>>>> > >>>>> > cwnd = pipe + sndcnt >>>>> > = 0 + 2 //[MK] = 0 + 7 >>>>> > = 2 //[MK] = 7 >>>>> > >>>>> > So we exit fast recovery with cwnd=2 even though ssthresh is 7. >>>>> > >>>>> > [MK]: Or, we exit with cwnd=7 if we correctly count in >>>>> DeliveredData >>>>> > during the recovery process all data that is in flight when >>>>> the recovery >>>>> > starts. All bytes in flight at the start of the recovery are >>>>> supposed to >>>>> > become acknowledged by the end of the recovery, so they should >>>>> be counted >>>>> > in prr_delivered during the recovery. 
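
The two accounting choices in the 10-packet example above, worked as a small
Python sketch (illustrative only):

    # P2..P9 SACKed before recovery, P10 SACKed on entry, P1 retransmitted,
    # then one cumulative ACK covers P1..P10.

    delta_snd_una = 10            # P1..P10 newly cumulatively acked
    sacked_before_recovery = 8    # P2..P9: SACKed before recovery started
    sacked_during_recovery = 1    # P10: counted in prr_delivered at entry

    # change_in(SACKd) as currently defined: the whole scoreboard (P2..P10)
    # drops below snd.una, so DeliveredData collapses to 1.
    delivered_current = delta_snd_una - (sacked_before_recovery +
                                         sacked_during_recovery)      # 1

    # Markku's correction: subtract only data already counted *during*
    # recovery (P10), since P2..P9 were never added to prr_delivered.
    delivered_corrected = delta_snd_una - sacked_during_recovery      # 9

    prr_out, ssthresh, pipe = 1, 7, 0
    for delivered in (delivered_current, delivered_corrected):
        prr_delivered = 1 + delivered                      # 1 counted on entry
        sndcnt = max(prr_delivered - prr_out, delivered)   # PRR-CRB
        sndcnt += 1                                        # PRR-SSRB: safe ACK
        sndcnt = min(ssthresh - pipe, sndcnt)
        print("cwnd at exit:", pipe + sndcnt)              # 2 vs 7
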
>>>>> > >>>>> > > However, I think it won't be good if it's >>>>> propagated to ater the recovery. But, >>>>> > don't we >>>>> > > reset cwnd to ssthresh at the end of recovery? >>>>> > >>>>> > [MK]: It seems that just by correcting the definition of >>>>> DeliveredData >>>>> > there is no need to reset cwnd to ssthresh at the end of >>>>> recovery because >>>>> > the algo would do it for us. But I am not opposing to reset >>>>> cwnd to >>>>> > ssthresh at the end. In that case it might be better to >>>>> specify it by >>>>> > giving two alternatives similar to what RFC 6582 does. Maybe: >>>>> > >>>>> > Set cwnd to either (1) min (ssthresh, cwnd) or (2) ssthresh. >>>>> > >>>>> > >>>>> > I think we have discussed this in the past discussions. >>>>> > In case of (1), cwnd can become very low when there were big losses >>>>> before the recovery. >>>>> > As many implementations don't take approach and they have been used >>>>> for a long time, (2) became our >>>>> > consensus. >>>>> >>>>> [MK2]: Yes, setting cwnd to ssthresh has been discussed and in my >>>>> recalling there was two differing opinions between co-authors with no >>>>> final resolution. My apologies if there was and I missed it. >>>>> But I am not disagreeing. First, with my proposed solution this simply >>>>> becomes a non-issue and is not needed at all. Second, if we decide to >>>>> propose setting cwnd at the exit of the recovery I just think it would >>>>> be >>>>> not a good idea to require all implementations to do (2) but let the >>>>> implementor to have another safe alternative as well. In case of cwnd >>>>> being small, (1) would result in slow start that avoids a burst and >>>>> would >>>>> quickly determine the available capacity. With (2), the draft should >>>>> say >>>>> (suggest/require) something about applying some sort of pacing to >>>>> avoid a >>>>> burst, I think (such a burst is not necessarily self-evident for >>>>> everyone). >>>>> >>>> >>>> I believe our consensus has been (2) while we are aware there can be a >>>> burst. >>>> I can agree that a burst here might not be obvious for everyone. >>>> So, my personal recommendation would be to put something about bursts >>>> here. But, if some folks have other opinions, please share. >>>> >>>> -- >>>> Yoshi >>>> >>> >>> Thanks, Yoshifumi. I agree that this sounds reasonable. As noted earlier >>> in the thread just now, I do see that RFC 6675 does mention some cases >>> where it can create bursts. So I suppose it's reasonable to mention the >>> potential for bursts in the cwnd = ssthresh step at the end of PRR. I have >>> proposed to add the following text, after the "cwnd = ssthresh" step: >>> >>> "Note that this step that sets cwnd to ssthresh can potentially, in some >>> scenarios, allow a burst of back-to-back segments into the network. As with >>> common scenarios that could allow bursts, such as restarting from idle, it >>> is RECOMMENDED that implementations use pacing to reduce the burstiness of >>> traffic." >>> >>> Thanks! >>> >>> Best regards, >>> neal >>> >>>
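
A loose sketch of the burst concern behind the proposed text (Python; all
helper names are hypothetical and implementation-specific):

    def exit_recovery(state):
        # Segments that become sendable at once when cwnd jumps to ssthresh.
        potential_burst = max(0, state.ssthresh - state.pipe)
        state.cwnd = state.ssthresh
        if potential_burst > 0 and state.pacing_supported:
            # Spread the released window over roughly one RTT instead of
            # emitting potential_burst back-to-back segments.
            state.set_pacing_rate(state.cwnd * state.mss / state.srtt)
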
- [tcpm] I-D Action: draft-ietf-tcpm-prr-rfc6937bis… internet-drafts
- Re: [tcpm] I-D Action: draft-ietf-tcpm-prr-rfc693… Neal Cardwell
- Re: [tcpm] I-D Action: draft-ietf-tcpm-prr-rfc693… Markku Kojo
- Re: [tcpm] I-D Action: draft-ietf-tcpm-prr-rfc693… Neal Cardwell
- Re: [tcpm] I-D Action: draft-ietf-tcpm-prr-rfc693… Yoshifumi Nishida
- Re: [tcpm] I-D Action: draft-ietf-tcpm-prr-rfc693… Neal Cardwell
- Re: [tcpm] I-D Action: draft-ietf-tcpm-prr-rfc693… Yoshifumi Nishida
- Re: [tcpm] I-D Action: draft-ietf-tcpm-prr-rfc693… Neal Cardwell
- Re: [tcpm] I-D Action: draft-ietf-tcpm-prr-rfc693… Yoshifumi Nishida
- Re: [tcpm] I-D Action: draft-ietf-tcpm-prr-rfc693… Neal Cardwell
- Re: [tcpm] I-D Action: draft-ietf-tcpm-prr-rfc693… Yoshifumi Nishida
- Re: [tcpm] I-D Action: draft-ietf-tcpm-prr-rfc693… Markku Kojo
- Re: [tcpm] I-D Action: draft-ietf-tcpm-prr-rfc693… Yoshifumi Nishida
- [tcpm] Re: I-D Action: draft-ietf-tcpm-prr-rfc693… Markku Kojo
- [tcpm] Re: I-D Action: draft-ietf-tcpm-prr-rfc693… Yoshifumi Nishida
- [tcpm] Re: PRR behaviour on detecting loss of a r… Neal Cardwell
- [tcpm] Re: I-D Action: draft-ietf-tcpm-prr-rfc693… Neal Cardwell
- [tcpm] Re: I-D Action: draft-ietf-tcpm-prr-rfc693… Neal Cardwell
- [tcpm] Re: I-D Action: draft-ietf-tcpm-prr-rfc693… Neal Cardwell
- [tcpm] Re: I-D Action: draft-ietf-tcpm-prr-rfc693… Yoshifumi Nishida
- [tcpm] Re: I-D Action: draft-ietf-tcpm-prr-rfc693… Matt Mathis
- [tcpm] Re: I-D Action: draft-ietf-tcpm-prr-rfc693… Neal Cardwell
- [tcpm] Re: I-D Action: draft-ietf-tcpm-prr-rfc693… Matt Mathis
- [tcpm] About growing cwnd when the sender is rate… Michael Welzl
- Re: [tcpm] I-D Action: draft-ietf-tcpm-prr-rfc693… Markku Kojo
- [tcpm] Re: PRR behaviour on detecting loss of a r… Markku Kojo
- [tcpm] Re: About growing cwnd when the sender is … Neal Cardwell
- [tcpm] Re: About growing cwnd when the sender is … Michael Welzl
- [tcpm] Re: About growing cwnd when the sender is … Neal Cardwell
- [tcpm] Re: About growing cwnd when the sender is … Michael Welzl
- [tcpm] Re: About growing cwnd when the sender is … Michael Welzl
- [tcpm] Re: About growing cwnd when the sender is … Christian Huitema
- [tcpm] Re: About growing cwnd when the sender is … Michael Welzl
- [tcpm] Re: I-D Action: draft-ietf-tcpm-prr-rfc693… Neal Cardwell
- [tcpm] Re: PRR behaviour on detecting loss of a r… Markku Kojo
- [tcpm] Re: PRR behaviour on detecting loss of a r… Neal Cardwell
- [tcpm] Re: PRR behaviour on detecting loss of a r… Neal Cardwell
- [tcpm] Re: PRR behaviour on detecting loss of a r… Markku Kojo
- [tcpm] Re: PRR behaviour on detecting loss of a r… Yoshifumi Nishida
- [tcpm] Re: PRR behaviour on detecting loss of a r… Markku Kojo
- [tcpm] Re: PRR behaviour on detecting loss of a r… Matt Mathis
- [tcpm] Re: PRR behaviour on detecting loss of a r… Neal Cardwell