[tcpm] Re: I-D Action: draft-ietf-tcpm-prr-rfc6937bis-06.txt
Neal Cardwell <ncardwell@google.com> Wed, 17 July 2024 16:51 UTC
From: Neal Cardwell <ncardwell@google.com>
Date: Wed, 17 Jul 2024 12:51:26 -0400
To: Markku Kojo <kojo@cs.helsinki.fi>
CC: Matt Mathis <ietf@mattmathis.net>, Matt Mathis <mattmathis@measurementlab.net>, tcpm@ietf.org
Archived-At: <https://mailarchive.ietf.org/arch/msg/tcpm/yBAi51ApPP3mPBoABgmCrFcPUlM>

On Wed, Mar 27, 2024 at 10:54 AM Markku Kojo <kojo@cs.helsinki.fi> wrote:

> Hi Yoshi, Neal, all,
>
> Please see below inline (tagged [MK]). And my apologies for a very long explanation. Hopefully I did not include too many mistakes this time ;)
>
> In summary, it seems that we do not need to reset cwnd at the end of the recovery nor necessarily adjust RecoverFS, because all the issues raised could be resolved by simply correcting the definition of DeliveredData (reverting to the original definition + a small additional change) and moving the actions to take with the ACK that triggers loss recovery to the unconditional steps taken together with the initialization of the algorithm at the beginning (this would also be in line with how the other fast recovery algorithms are described in the RFC series).
>
> Hopefully I did not misunderstand any parts of the algorithm (either in RFC 6937 or in the current -08 version of the draft).
>
> On Thu, 21 Mar 2024, Yoshifumi Nishida wrote:
>
> > Hi Neal,
> >
> > On Wed, Mar 20, 2024 at 1:32 PM Neal Cardwell <ncardwell@google.com> wrote:
> >
> > > On Wed, Mar 20, 2024 at 3:29 PM Yoshifumi Nishida <nsd.ietf@gmail.com> wrote:
> > >
> > > > Hi Neal,
> > > >
> > > > On Wed, Mar 20, 2024 at 6:55 AM Neal Cardwell <ncardwell@google.com> wrote:
> > > >
> > > > > On Wed, Mar 20, 2024 at 3:07 AM Yoshifumi Nishida <nsd.ietf@gmail.com> wrote:
> > > > >
> > > > > > On Mon, Mar 18, 2024 at 8:13 PM Neal Cardwell <ncardwell=40google.com@dmarc.ietf.org> wrote:
> > > > > >
> > > > > > > But your point still stands, and you raise a great point: simply initializing RecoverFS to "pipe" is not safe, because packets that were marked lost and removed from pipe may actually have been merely reordered. So if those packets are delivered later, they will increase the numerator of prr_delivered / RecoverFS without increasing the denominator, thus leading to a result above 1.0, and thus potentially leading to a target for Sent_so_far that is above ssthresh, causing the algorithm to erroneously exceed ssthresh.
> > > > > >
> > > > > > Hmm. I have a naive question here. If packets were merely reordered, isn't it ok for cwnd to be bigger than ssthresh?
> > > > >
> > > > > Yes, if packets were merely reordered and none were lost, then I agree it would be OK for cwnd to be bigger than ssthresh. And in fact I would argue that cwnd should be reverted back to its value before fast recovery. And this is in fact what Linux TCP would do, using loss recovery "undo" mechanisms based on TCP timestamps or DSACKs.
> > > > >
> > > > > However, in the kind of scenario Markku described, there was not merely reordering, but also real packet loss: "1 packet is lost (P1), and 24 packets are delayed (packets P2..P25)". In the traditional loss-based Reno/CUBIC paradigm, any non-zero amount of real packet loss in fast recovery should result in the same multiplicative decrease in cwnd, regardless of the combination of reordering and loss. We could argue about whether that approach is the best approach (BBR, for example, takes a different approach), but that is a very different discussion. :-) For now AFAICT we are focused on PRR's faithful enactment of the congestion control algorithm's decision to reduce cwnd toward ssthresh when there is any non-zero amount of real packet loss in fast recovery.
> > > >
> > > > Got it. But, I just would like to clarify whether we are discussing the inflation of sndcnt during the recovery process or cwnd after the exit of recovery.
> > > Good point. We are talking about inflation of sndcnt during the recovery process.
>
> [MK] I think we are talking about both in practice, because inflation of sndcnt during the recovery process would also result in exiting recovery with too big a cwnd. In the examples that I gave, Sent_so_far was calculated when the SACK for P100 had arrived (actually the numbers were off by one):
>
> For Reno:
>
>   Sent_so_far = CEIL(prr_delivered * ssthresh / RecoverFS)
>               = CEIL(97 * 50 / 72)
>               = 68
>
> For CUBIC:
>
>   Sent_so_far = CEIL(prr_delivered * ssthresh / RecoverFS)
>               = CEIL(97 * 70 / 72)
>               = 95
>
> Now, when the cumulative ACK triggered by the rexmit of P1 arrives and terminates fast recovery, the following is executed as per the *current version* of the draft:
>
>   DeliveredData  = (bytes newly cumulatively acknowledged) = 100
>   DeliveredData += (bytes newly selectively acknowledged)  = 100 + 0
>
>   prr_delivered += DeliveredData = 95 + 100 = 195
>   pipe = 94
>   if (pipe > ssthresh) => (94 > 70) => (true)
>     sndcnt = CEIL(prr_delivered * ssthresh / RecoverFS) - prr_out
>            = CEIL(195 * 70 / 72) - 95 = 190 - 95 = 95
>   cwnd = pipe + sndcnt = 94 + 95 = 189
>
> So oops, when exiting fast recovery cwnd would be nearly doubled from what it was before entering loss recovery. It seems that there is an additional problem because the definition of DeliveredData in the current version of the draft is incorrect; the cumulatively acked bytes that have already been SACKed are counted twice in DeliveredData. It seems that RFC 6937 and rfc6937bis-01 both define DeliveredData (nearly) correctly by including the change in snd.una in DeliveredData and subtracting data that has already been SACKed. The definition of DeliveredData obviously needs to be corrected. See also below the issue with bytes SACKed that are above snd.una but get SACKed before the start of recovery.

OK, fair points. In our internal copy of the draft I have proposed to edit the description of DeliveredData to closely match the text from RFC 6937 and rfc6937bis-01, which you feel "both define DeliveredData (nearly) correctly". The new draft description of DeliveredData is:

---
DeliveredData: The total number of bytes that the current ACK indicates have been delivered to the receiver. When there are no SACKed sequence ranges in the scoreboard before or after the ACK, DeliveredData is the change in snd.una. With SACK, DeliveredData can be computed precisely as the change in snd.una, plus the (signed) change in SACKed. In recovery without SACK, DeliveredData is estimated to be 1 SMSS on receiving a duplicate acknowledgement, and on a subsequent partial or full ACK DeliveredData is the change in snd.una, minus 1 SMSS for each preceding duplicate ACK. Note that without SACK, a poorly-behaved receiver that returns extraneous DUPACKs (as described in <xref target='Savage99' />) could attempt to artificially inflate DeliveredData. As a mitigation, if not using SACK then PRR disallows incrementing DeliveredData when the total bytes delivered in a PRR episode would exceed the estimated data outstanding upon entering recovery (RecoverFS).
---

I have proposed to remove the pseudocode definition of DeliveredData, since the pseudocode version in RFC 6937 / rfc6937bis-01 was incomplete (it did not cover the non-SACK case) and the pseudocode version in rfc6937bis-02 through rfc6937bis-08 was buggy, as you noted.
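[Editor's note: as a minimal, illustrative sketch of the prose definition quoted above, the following Python is an editorial illustration rather than the draft's pseudocode; the names prev_una, prev_sacked, and dupacks are hypothetical bookkeeping variables, and the non-SACK mitigation against extraneous DUPACKs mentioned in the prose is omitted for brevity.]

```python
def delivered_data(prev_una, snd_una, prev_sacked, sacked,
                   sack_enabled, dupacks, smss):
    """Bytes this ACK indicates were delivered, per the prose definition."""
    cum_acked = snd_una - prev_una          # change in snd.una
    if sack_enabled:
        # Change in snd.una plus the (signed) change in total SACKed bytes;
        # the SACKed delta goes negative when snd.una advances over ranges
        # that were previously SACKed, which avoids counting them twice.
        return cum_acked + (sacked - prev_sacked)
    if cum_acked == 0:
        return smss                          # duplicate ACK: estimate 1 SMSS
    # Partial or full ACK without SACK: subtract the 1 SMSS already credited
    # for each preceding duplicate ACK since snd.una last advanced.
    return cum_acked - dupacks * smss

# Example: an ACK that newly SACKs one 1460-byte segment without moving snd.una.
assert delivered_data(prev_una=1000, snd_una=1000,
                      prev_sacked=0, sacked=1460,
                      sack_enabled=True, dupacks=0, smss=1460) == 1460
```

When a later cumulative ACK advances snd.una over that same range, the SACKed delta goes negative, so the range is not counted a second time.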
> With the original definition of DeliveredData cwnd would not be inflated further, but fast recovery would still exit with too big a cwnd (for CUBIC cwnd=95, instead of 70, and for Reno cwnd=68, instead of 50), if we use too small a RecoverFS (= Pipe).
>
> So, it seems that we agree that the problem of sending too many bytes during the recovery process gets corrected if RecoverFS is initialized to snd.nxt - snd.una. The next question is, should RecoverFS be initialized to an even higher value in some scenarios? See below.
>
> > > > Because I'm personally not very keen to address miscalculation of lost packets due to reordering during the recovery process, as it seems to be tricky.
> > >
> > > It is tricky, but I think it is feasible to address. What do folks think about my suggestion from above in this thread:
> > >
> > > existing text:
> > >   pipe = (RFC 6675 pipe algorithm)
> > >   RecoverFS = pipe              // RFC 6675 pipe before recovery
> > >
> > > proposed new text:
> > >   RecoverFS = snd.nxt - snd.una + (bytes newly cumulatively acknowledged)
> >
> > Hmm. Sorry. I'm not very sure about the difference between snd.nxt - snd.una and snd.nxt - snd.una + (bytes newly cumulatively acknowledged). Could you elaborate a bit? I thought we don't have data which are cumulatively acked in case of reordering.
>
> [MK] It seems there might be another case that Neal is thinking of, where the sender may end up sending too many segments during the first RTT in fast recovery. If I understood it correctly, this may occur in a scenario with ACK loss for pkts preceding the first dropped data pkt, for example. Consider the following scenario where there are 100 pkts outstanding:
>
>   P1..P24, P25, P26, P27, P28..P100
>
> Packets P1..P24 and P26..P100 are delivered successfully to the receiver. P25 is lost. ACKs (and SACKs) for pkts P1..P24, P26 and P27 get dropped. SACKs for P28..P100 are delivered successfully. When the SACK for pkt P28 arrives, an RFC 6675 sender would declare P25 lost and enter fast retransmit. Similarly, a RACK-TLP sender may declare P25 lost, but this may happen with any of the SACKs P28..P100 arriving.
>
> Let's assume we were fully utilizing the congestion window, i.e., cwnd=100, and we enter loss recovery when the SACK of P28 arrives (cumulative ACK#=25):
>
>   ssthresh = cwnd / 2 = 50 (Reno)
>   prr_delivered = prr_out = 0
>   Pipe = snd.nxt - snd.una - (lost + SACKed) = 76 - (1 + 3) = 72
>   RecoverFS = snd.nxt - snd.una = 101 - 25 = 76
>
>   DeliveredData  = (bytes newly cumulatively acknowledged) = 24
>   DeliveredData += change_in(SACKd) = 24 + 3 = 27
>   prr_delivered += DeliveredData = 0 + 27 = 27
>
>   if (pipe > ssthresh) => if (72 > 50) => true
>     // Proportional Rate Reduction
>     sndcnt = CEIL(prr_delivered * ssthresh / RecoverFS) - prr_out
>            = CEIL(27 * 50 / 76) = 19 - 0 = 19
>
>   cwnd = 72 + 19 = 91
>
> So, we will send a burst of 19 pkts on entry to recovery and during the rest of the recovery around 49 pkts, giving a total of 19+49=68 pkts while only 50 were allowed. If we add the 24 cumulatively acked pkts into RecoverFS like Neal suggests, we are about to send around 14+37=51 pkts, which is almost fine.

Yes, that is exactly the kind of case I was thinking about when I was proposing initializing RecoverFS with:

  RecoverFS = snd.nxt - snd.una + (bytes newly cumulatively acknowledged)

> However, the major shortcoming of this approach is that we'll still send a burst of 14 pkts in the beginning of the recovery, while avoiding such a burst was one of the major goals of PRR.
> Alternatively, we could modify the algo such that the cumulatively acked bytes with the ACK that triggers loss recovery are added neither to DeliveredData nor to RecoverFS. Then we would send just one pkt (the rexmit of P1) on entering the recovery and during the rest of the recovery around 49 pkts, i.e., 1+49=50 pkts during the recovery, which would be exactly equal to the ssthresh we set. With this approach we could avoid the burst in the beginning. In addition, we could have a consistent solution also for the additional problem of including extra SACKed data with the ACK that triggers the recovery. Let's look at the above scenario again, with cwnd=100 and pkts P1..P100 in flight:
>
>   P1..P24, P25, P26, P27, P28..P100
>
> Packets P1..P24 are delivered to the receiver but the ACKs get dropped (whether ACKs are dropped or not is not relevant for this issue). P25 gets dropped. If the DupAcks of pkt P26 and pkt P27 are delivered, from the DupAck of P28 only the SACK info for P28 is counted in DeliveredData, but the SACK info for P26 and P27 is never counted in DeliveredData because P26 and P27 are already SACKed when the DupAck of P28 arrives. However, if the DupAcks of pkt P26 and pkt P27 get dropped as in the previous example, the ACK of P28 includes new SACK info for pkts P26, P27, and P28, and the bytes of P26 and P27 are also counted in DeliveredData. (If also the DupAck of P28 gets dropped, the DupAck of P29 may include up to 3 MSS of additional SACK info to be counted (P26, P27, and P28).) This alone will result in a mini-burst in the beginning of the recovery, or add to the burst size as in the previous example, where the two additional SACKs (for P26 and P27) inflated prr_delivered by 2, resulting in a slightly too large number of segments sent during the recovery (51).
>
> As suggested above, this problem with additional SACKs would be solved such that the DupAck that triggers the loss recovery is allowed to add only "itself" into DeliveredData, and PRR would include the missing bytes for pkts that were SACKed before the start of the recovery only at the end of the recovery, when the cumulative ACK for the first pkt (P1) arrives and inherently covers those bytes.
>
> In other words, the algo can be modified such that fast retransmit is always handled separately in the beginning of the recovery, together with the initialization of the PRR variables:
>
>   ssthresh = CongCtrlAlg()        // Target flight size in recovery
>   //[MK]: the next three lines can be deleted as unnecessary
>   prr_delivered = 0               // Total bytes delivered in recovery
>   prr_out = 0                     // Total bytes sent in recovery
>   pipe = (RFC 6675 pipe algorithm)
>
>   RecoverFS = snd.nxt - snd.una   // FlightSize right before recovery
>                                   // [MK]: maybe add cumulatively ACKed bytes?
>
>   Fast Retransmit the first missing segment
>   prr_delivered = (With SACK: bytes selectively acknowledged by the first
>                    SACK block of the ACK triggering the loss recovery, OR
>                    Without SACK: 1 MSS)
>   prr_out = (data fast retransmitted)
>
> On each arriving ACK during the rest of the fast recovery, including the final cumulative ACK that signals the end of loss recovery:
>
>   DeliveredData = change_in(snd.una)
>   if (SACK is used) {
>     DeliveredData += change_in(SACKd)   //[MK]:(*) modify change_in(SACKd)
>   ...
> The above changes would imply deleting
>
>   if (prr_out is 0 AND sndcnt is 0) {
>     // Force a fast retransmit upon entering recovery
>     sndcnt = MSS
>
> from the algo, and would make it consistent with the description of the other fast retransmit & fast recovery algorithms (RFC 5681, RFC 6582, RFC 6675), which include fast retransmit together with the initialization of the algo in the unconditional first steps of the algorithm.

I would suggest that we stick with the current PRR approach rather than your suggested alternative, for a few reasons:

+ Your proposed alternative is adding complexity to optimize an edge/corner case to reduce burstiness. However, these edge/corner cases where the improvement would be significant (cases where significant ACK information was lost just before entering recovery) are probably rare. So my sense is that the cost of adding this level of complexity to try to optimize these edge/corner cases is not worth the minor benefit.

+ Your proposed alternative seeks to reduce burstiness on the first ACK that triggers fast recovery in the case where ACKs were lost before entering recovery. However, TCP already has a significant degree of burstiness due to restart-from-idle scenarios, delayed ACKs ACKing every other packet, and the widely deployed aggregation/offload mechanisms such as TSO/GSO, LRO/GRO, and Linux TCP SACK compression. My sense is that the effect of attempting to reduce burstiness in the scenario you describe would be lost in the noise of the burstiness inherent in these other scenarios/mechanisms.

+ Your proposed alternative seeks to reduce burstiness on the first ACK that triggers fast recovery in the case where ACKs were lost before entering recovery. However, burstiness due to lost ACKs can happen at any point during recovery. So the complexity added in the proposed alternative is a mitigation only for a tiny fraction of the full class of burstiness problems that can occur due to lost ACKs. So again, this alternative does not seem worth it, IMHO.

+ Your proposed alternative results in a new algorithm which is not yet tested or deployed. If we make a change like this to the algorithm, then we would presumably want implementation and deployment experience before trying to make the updated algorithm a Proposed Standard. So that would push back the RFC process significantly for what, IMHO, is a minor anticipated potential gain in some edge cases.

> (*)
> In addition, one more smallish but important correction is needed. The bytes that are SACKed before the recovery starts (i.e., typically the famous first two DupAcks, or more bytes if the start of recovery is postponed due to reordering) should be taken into account in DeliveredData during the recovery, but with the current algo they are never counted in DeliveredData (and prr_delivered).
> Why? Because when the first cumulative ACK arrives, it advances snd.una such that those bytes are covered, but change_in(SACKd) is negative and it incorrectly subtracts these bytes too from DeliveredData (and prr_delivered) even though they were never counted in. Usually this is only 2 MSS, but in scenarios similar to the one that Neal introduced earlier there might be many more data bytes that are not counted. This change would also solve the problem of exiting PRR with too low a cwnd.
> Let's look at Neal's earlier example again (see the comments tagged [MK] for the suggested change to solve the issue):
>
>   CC = CUBIC
>   cwnd = 10
>   The reordering degree was estimated to be large, so the connection will
>   wait for more than 3 packets to be SACKed before entering fast recovery.
>
>   --- Application writes 10*MSS.
>
>   TCP sends packets P1 .. P10.
>   pipe = 10 packets in flight (P1 .. P10)
>
>   --- P2..P9 SACKed -> do nothing
>       (Because the reordering degree was previously estimated to be large.)
>
>   --- P10 SACKed -> mark P1 as lost and enter fast recovery
>
>   PRR:
>   ssthresh = CongCtrlAlg() = 7 packets   // CUBIC
>   prr_delivered = 0
>   prr_out = 0
>   RecoverFS = snd.nxt - snd.una = 10 packets (P1..P10)
>
>   DeliveredData = 1 (P10 was SACKed)
>
>   prr_delivered += DeliveredData ==> prr_delivered = 1
>
>   pipe = 0 (all packets are SACKed or lost; P1 is lost, rest are SACKed)
>
>   safeACK = false (snd.una did not advance)
>
>   if (pipe > ssthresh) => if (0 > 7) => false
>   else
>     // PRR-CRB by default
>     sndcnt = MAX(prr_delivered - prr_out, DeliveredData)
>            = MAX(1 - 0, 1)
>            = 1
>
>     sndcnt = MIN(ssthresh - pipe, sndcnt)
>            = MIN(7 - 0, 1)
>            = 1
>
>   cwnd = pipe + sndcnt
>        = 0 + 1
>        = 1
>
>   retransmit P1
>
>   prr_out += 1 ==> prr_out = 1
>
>   --- P1 retransmit plugs hole; receive cumulative ACK for P1..P10
>
>   DeliveredData = 1 (P1 was newly ACKed)   //[MK]: should be = 10 - 1 = 9
>
>   //[MK]: Note that the SACKed bytes of P2..P9 were also newly acked,
>   //      because those bytes have not been delivered *during* the
>   //      recovery by this point and thereby are not yet counted in
>   //      prr_delivered.
>   //      So, they should not be subtracted from DeliveredData
>   //      but included, as those bytes got delivered only when
>   //      snd.una advanced. Only P10 should be subtracted.
>
>   prr_delivered += DeliveredData ==> prr_delivered = 2
>                                      //[MK]: should be = 1 + 9 = 10
>
>   pipe = 0 (all packets are cumulatively ACKed)
>
>   safeACK = (snd.una advances and no further loss indicated)
>   safeACK = true
>
>   if (pipe > ssthresh) => if (0 > 7) => false
>   else
>     // PRR-CRB by default
>     sndcnt = MAX(prr_delivered - prr_out, DeliveredData)
>            = MAX(2 - 1, 1)              //[MK] = MAX(10-1, 1)
>            = 1                          //[MK] = 9
>     if (safeACK) => true
>       // PRR-SSRB when recovery is in good progress
>       sndcnt += 1 ==> sndcnt = 2        //[MK] ==> sndcnt = 10
>
>     sndcnt = MIN(ssthresh - pipe, sndcnt)
>            = MIN(7 - 0, 2)              //[MK] = MIN(7 - 0, 10)
>            = 2                          //[MK] = 7
>
>   cwnd = pipe + sndcnt
>        = 0 + 2                          //[MK] = 0 + 7
>        = 2                              //[MK] = 7
>
> So we exit fast recovery with cwnd=2 even though ssthresh is 7.
>
> [MK]: Or, we exit with cwnd=7 if we correctly count in DeliveredData during the recovery process all data that is in flight when the recovery starts. All bytes in flight at the start of the recovery are supposed to become acknowledged by the end of the recovery, so they should be counted in prr_delivered during the recovery.

Your proposal to count previously SACKed packets as newly delivered when SND.UNA advances past them happens to produce a reasonable cwnd in that particular example. However, AFAICT your proposal produces incorrect results when SND.UNA advances past some SACKed packets but does not complete recovery (SND.UNA < RecoveryPoint, to use the RFC 6675 terminology).
In such cases, counting previously SACKed packets as newly delivered when SND.UNA advances past them would "double-count" the sequence range and produce an inflated sense of the volume of data that had just been delivered, and would cause the PRR sender to send too much data at that moment, violating the intended packet conservation or slow-start invariants that PRR seeks to maintain.

> > > > However, I think it won't be good if it's propagated to after the recovery. But, don't we reset cwnd to ssthresh at the end of recovery?
>
> [MK]: It seems that just by correcting the definition of DeliveredData there is no need to reset cwnd to ssthresh at the end of recovery, because the algo would do it for us. But I am not opposed to resetting cwnd to ssthresh at the end. In that case it might be better to specify it by giving two alternatives, similar to what RFC 6582 does. Maybe:
>
>   Set cwnd to either (1) min(ssthresh, cwnd) or (2) ssthresh.

IMHO it is better to stick with the current planned approach, where we give implementers an unambiguous specification that tells them to implement option (2) (cwnd = ssthresh) at the end of a PRR episode. Rationale:

+ A single course of action is simpler and more clear than offering two choices.

+ From first principles we can expect option (1) (min(ssthresh, cwnd)) to have very poor performance if a connection is application-limited when entering recovery and the TCP implementation uses the more widely deployed approach of setting ssthresh to cwnd/2 rather than FlightSize/2.

+ Option (2) is what has been running in Linux TCP since 2011 (https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=a262f0cdf1f2916ea918dc329492abb5323d9a6c), so it has a lot of testing and is known to work well enough; thus I would suggest we center our "rough consensus" on this "running code".

> [MK] Finally, one correction and one clarifying question.
> The first one is related to the use of Limited Transmit and computing ssthresh: when Limited Transmit is in use, the segments transmitted as part of Limited Transmit must not be counted in the FlightSize used in determining the new value of ssthresh (and cwnd), as specified in RFC 5681 (Sec 3.2, step 2) and RFC 6675 (Sec 5, step 4.2). In other words, ssthresh after the multiplicative decrease should have the same value regardless of whether Limited Transmit is used or not, because Limited Transmit follows the packet conservation principle and does not increase the effective flight size. If one uses cwnd to compute the new value of ssthresh, the same result follows because the bytes transmitted via Limited Transmit must not increase cwnd.
> This seems to be reflected incorrectly in the examples for RFC 6675 in Figs 1 and 2. The cwnd should be reduced to 10 (= 20/2), not 11, and consequently in Fig 1 for RFC 6675 the first new segment is sent on the arrival of Ack#13, not Ack#12, and in Fig 2 RFC 6675 would send 6R (not 7R) on the arrival of Ack#17. Also, the behavior of PRR in Fig 1 should be modified such that PRR does not send a new data segment on the arrival of Ack#19 (but would send a new segment on the arrival of Ack#20 and Ack#21, just like 6675 would do).

Thanks for the corrections on the examples. I have incorporated all of your suggested changes, except for the suggestion that "PRR does not send a new data segment on arrival of Ack#19 (but would send a new segment on arrival of Ack#20 and Ack#21 just like 6675 would do)".
AFAICT PRR would indeed send a new data segment on the arrival of Ack#19, but not on Ack#20 (the ACK of the first limited transmit packet), and it would send new data on Ack#21 (the ACK of the second limited transmit) and Ack#22 (the ACK of the retransmit). I've updated the tables as follows (let's hope this doesn't get wrapped by various email software!):

###
RFC 6675

ack#    X  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22
cwnd:     20 20 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10
pipe:     19 19 18 18 17 16 15 14 13 12 11 10  9  9  9  9  9  9  9  9  9  9
sent:      N  N  R                             N  N  N  N  N  N  N  N  N  N

PRR

ack#    X  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22
cwnd:     20 20 19 18 18 17 17 16 16 15 15 14 14 13 13 12 12 11 11 10 10 10
pipe:     19 19 18 18 17 17 16 16 15 15 14 14 13 13 12 12 11 11 10 10  9  9
sent:      N  N  R     N     N     N     N     N     N     N     N     N  N
###

I have updated the PRR table using the awk script below, FWIW:

###
#!/usr/bin/awk -f
#
# Simulate the first PRR case in draft-ietf-tcpm-prr-rfc6937bis:
#
function ceil(x, y) { y = int(x); return (x > y ? y + 1 : y); }

BEGIN {
    flightsize = 20;                # FlightSize when entering recovery
    pipe       = 19;                # pipe after marking the lost packet
    ssthresh   = flightsize / 2;    # Reno-style reduction target
    RecoverFS  = flightsize;
    prr_delivered = 0;
    prr_out       = 0;
    print "pkt_num cwnd pipe sndcnt";
    for (pkt_num = 3; pkt_num <= 22; pkt_num++) {
        pipe -= 1;                  # each ACK newly SACKs one packet
        prr_delivered += 1;
        if (prr_delivered >= flightsize) {
            cwnd   = ssthresh;      # recovery ending: land on ssthresh
            sndcnt = cwnd - pipe;
        } else {
            sndcnt = ceil(prr_delivered * ssthresh / RecoverFS) - prr_out;
            cwnd   = pipe + sndcnt;
        }
        print pkt_num, cwnd, pipe, sndcnt;
        prr_out += sndcnt;
        pipe    += sndcnt;
    }
}
###

that outputs:

pkt_num cwnd pipe sndcnt
3 19 18 1
4 18 18 0
5 18 17 1
6 17 17 0
7 17 16 1
8 16 16 0
9 16 15 1
10 15 15 0
11 15 14 1
12 14 14 0
13 14 13 1
14 13 13 0
15 13 12 1
16 12 12 0
17 12 11 1
18 11 11 0
19 11 10 1
20 10 10 0
21 10 9 1
22 10 9 1
###

To get the correct behavior in that simulation script, AFAICT the RecoverFS initialization needs to be something like the following:

  RecoverFS = snd.nxt - snd.una
  // Bytes SACKed before entering recovery will not be
  // marked as delivered during recovery:
  RecoverFS -= (bytes SACKed in scoreboard) - (bytes newly SACKed)
  // Include the (rare) case of cumulatively ACKed bytes:
  RecoverFS += (bytes newly cumulatively acknowledged)

So I have updated our local copy with that proposed pseudocode.

> The second one relates to counting DeliveredData without SACK. The algo includes the following, which was not present in the original PRR:
>
>   if (SACK is used) {
>     ...
>   else if (ACK is a DUPACK and prr_delivered < RecoverFS) {
>     DeliveredData += MSS
>
> What might be the reason to stop counting DupAcks soon after the first RTT? When RecoverFS is initialized to the flight size at the beginning of loss recovery, it is expected that around a flight size worth of bytes becomes delivered (acked) by the time the first cumulative ACK arrives (flight size - # of dropped pkts + 1).
> If we stop counting DupAcks soon after the first RTT, prr_delivered would make little forward progress during the following RTTs of the recovery. For example, suppose we have pkts P1..P100 outstanding, pkts P1..P10 get dropped, and no other drops occur during the recovery. cwnd == RecoverFS == 100, ssthresh == 50.
> This means that NewReno would require 10 RTTs to recover from all the losses P1..P10 and it is allowed to send up to 50 pkts during each RTT. At the end of the first RTT of recovery, the cumulative ACK for P1 arrives and we have received and accumulated 88 (Dup)Acks in prr_delivered and transmitted 44 pkts (1 rexmit and 43 new data pkts, CC = Reno).
> At the beginning of the second RTT, we have outstanding 43 new data pkts and the rexmit of P2 (= 44 pkts). After receiving 12 DupAcks during the 2nd RTT, prr_delivered crosses RecoverFS and we stop increasing prr_delivered. prr_delivered == 100 and prr_out == 50. We still need 8 more RTTs to recover all 10 dropped pkts, but we are allowed to send only 50 pkts more during the recovery because DeliveredData does not inflate anymore. Wouldn't this unnecessarily limit the forward progress during the last 8 RTTs of the NewReno loss recovery (Reno CC would allow sending up to 50 pkts per RTT)?

Thanks for catching that. You are indeed correct that the DeliveredData pseudocode in draft-ietf-tcpm-prr-rfc6937bis-08 is not complete for the case of partial or full ACKs in non-SACK/NewReno recovery. I have removed the DeliveredData pseudocode and, as mentioned above, have edited the DeliveredData prose definition to match the semantics in RFC 6937, which IIUC from your remarks above you agree is good enough.

Markku, thank you again for all your careful and thoughtful comments!

Best regards,
neal

> Thanks,
>
> /Markku
>
> > > Yes, Linux TCP does reset cwnd to ssthresh at the end of recovery. And we discussed putting that in the draft last year in the thread "draft-ietf-tcpm-prr-rfc6937bis-03: set cwnd to ssthresh exiting fast recovery?". But it looks to me from reviewing the draft that we never got around to inserting that in the draft. I will do that now, and post the proposal in that thread for discussion.
> >
> > Thanks! I believe that reflects the consensus of the WG.
> >
> > --
> > Yoshi
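[Editor's note: as a cross-check of the tables and awk simulation in Neal's message above, the following is an editorial Python transcription of that script under the same assumptions (a FlightSize of 20 packets, ssthresh = FlightSize/2, RecoverFS = FlightSize, exactly one packet newly SACKed per ACK, no further losses, and everything PRR allows being sent immediately). It is not part of the draft; it simply reproduces the pkt_num/cwnd/pipe/sndcnt listing shown in the message.]

```python
import math

def simulate_prr(flightsize=20, first_ack=3, last_ack=22):
    """Replay the PRR table: one packet is newly SACKed on each ACK."""
    ssthresh = flightsize // 2      # congestion control's reduction target
    recover_fs = flightsize         # FlightSize at the start of recovery
    pipe = flightsize - 1           # the lost packet is already marked lost
    prr_delivered = prr_out = 0
    print("pkt_num cwnd pipe sndcnt")
    for pkt_num in range(first_ack, last_ack + 1):
        pipe -= 1                   # this ACK newly SACKs one packet
        prr_delivered += 1
        if prr_delivered >= flightsize:
            cwnd = ssthresh         # recovery is ending: land on ssthresh
            sndcnt = cwnd - pipe
        else:
            sndcnt = math.ceil(prr_delivered * ssthresh / recover_fs) - prr_out
            cwnd = pipe + sndcnt
        print(pkt_num, cwnd, pipe, sndcnt)
        prr_out += sndcnt
        pipe += sndcnt              # anything allowed is sent immediately

simulate_prr()
```

The final rows converge on cwnd = ssthresh = 10, matching the corrected PRR column in the tables above.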