Re: [tcpm] Linux doesn’t implement RFC3465

Neal Cardwell <> Thu, 29 July 2021 13:39 UTC

From: Neal Cardwell <>
Date: Thu, 29 Jul 2021 09:38:39 -0400
Message-ID: <>
To: Mark Allman <>
Cc: Vidhi Goel <>, Extensions <>

On Thu, Jul 29, 2021 at 8:41 AM Mark Allman <> wrote:

> I just inhaled notes from Vidhi, Yuchung and Neal.  And, I am trying
> to page the recesses of my brain back in ... :-)
> I have a few thoughts ...
>   - I sort of agree with Yuchung & Neal in spirit, if not
>     particulars.
>   - During the initial slow start phase I think it'd be fine to say
>     either:
>     (a) If you use some sort of burst mitigator/pacing, an
>         implementation can increase cwnd by the number of bytes
>         ACKed on each ACK (i.e., no L).
>     (b) If there is no burst mitigation then we have to figure out
>         if L is still useful for this purpose and whether we want to
>         retain it.  Seems like perhaps L=2 is sensible here.  L was
>         never meant to be some general burst mitigator.  However,
>         ABC clearly *can* aggravate bursting and so perhaps it makes
>         sense to have it also try to limit the impact of the
>         aggravation (in the absence of some general mechanism).

Even if recommending a static L value, IMHO L=2 is a bit conservative.
RFC 6928 and the experience with IW10 show that the vast majority of paths
on the public Internet handle bursts of 10 packets fine. That suggests it
is safe to allow an ACK for 5 packets in slow start to increase cwnd by 5
packets and release 10 packets, which implies an L of at least 5.
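To make the arithmetic concrete, here is a minimal sketch (illustrative names, not taken from any real stack) of per-ACK cwnd growth in slow start with an ABC-style cap of L * SMSS, in the spirit of RFC 3465:

```c
/* Hypothetical sketch of RFC 3465 Appropriate Byte Counting in slow
 * start: cwnd grows by the number of bytes newly ACKed, capped at
 * L * SMSS per ACK. SMSS of 1000 bytes is assumed for round numbers. */
#define SMSS 1000u

static unsigned int abc_slow_start(unsigned int cwnd,
                                   unsigned int bytes_acked,
                                   unsigned int limit_l)
{
    unsigned int cap = limit_l * SMSS;  /* per-ACK growth cap, L * SMSS */
    unsigned int inc = bytes_acked < cap ? bytes_acked : cap;
    return cwnd + inc;
}
```

With these numbers, an ACK covering 5 segments grows cwnd by only 2 segments under L=2, but by the full 5 segments under L=5, allowing the release of up to 10 segments.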

>   - During slow starts that follow RTOs there is a general problem
>     that just because the window slides by X bytes doesn't say
>     anything about the *network*, as that sliding can happen because
>     much of the data was likely queued for the application on the
>     receiver.  So, e.g., you can RTO and send one packet and get an
>     ACK back that slides the window 10 packets.  That doesn't mean
>     10 packets left.  It means one packet left the network and nine
>     packets are eligible to be sent to the application.  So, it is
>     not OK to set the cwnd to 1+10 = 11 packets in response to this
>     ACK.  Here L should exist and be 1.

AFAICT this argument only applies to non-SACK connections. For connections
with SACK (the vast majority of connections over the public Internet and in
datacenters), it is quite feasible to determine how many packets really
left the network (and Linux TCP does this; see below).

>   - It is possible that someone could work out a SACK-based scheme
>     to try to figure out how much of the ACK is data leaving the
>     network and how much of it was data sitting in the buffer at the
>     receiver.  And, if we could do that then we could say "L=1
>     unless you use a fancy algorithm".

The SACK-based scheme to try to figure out how much of the ACK is data
leaving the network has been implemented in Linux for at least 20 years or
so; see the 2002 paper "Congestion Control in Linux TCP" by Pasi Sarolahti
and Alexey Kuznetsov.
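As a rough illustration of the idea (field and function names here are made up, not the Linux ones): on each ACK, the segments that actually left the network are those newly covered by the cumulative ACK plus those newly marked in the SACK scoreboard; segments that were already SACKed left the network earlier, so a cumulative ACK that merely sweeps past them conveys no new departures.

```c
/* Hypothetical sketch of SACK-based delivery accounting: count only
 * segments newly acknowledged on this ACK, not the raw distance the
 * window slid. */
struct ack_sample {
    unsigned int newly_cum_acked;       /* segs newly below snd_una,
                                           not previously SACKed */
    unsigned int newly_sacked;          /* segs newly marked SACKed */
    unsigned int prev_sacked_now_acked; /* already SACKed, now swept by
                                           the cumulative ACK: these
                                           left the network earlier */
};

static unsigned int segs_left_network(const struct ack_sample *s)
{
    /* prev_sacked_now_acked is deliberately excluded. */
    return s->newly_cum_acked + s->newly_sacked;
}
```

In Mark's RTO example, the cumulative ACK slides the window by 10 segments, but 9 of them were SACKed before the RTO, so this accounting credits only 1 segment as having left the network.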

> My guess, however, is this
>     is likely a fool's errand for two reasons.  First, on an RTO the
>     scoreboard is supposed to be cleared and while SACK information
>     will slowly get built back up we lose a bunch of state that we
>     could perhaps use to differentiate between "left network" and
>     "left buffer".

Clearing the scoreboard on RTO is indeed a "SHOULD" from RFC 2018:

   After a retransmit timeout the data sender SHOULD turn off all of the
   SACKed bits, since the timeout might indicate that the data receiver
   has reneged.  The data sender MUST retransmit the segment at the left
   edge of the window after a retransmit timeout, whether or not the
   SACKed bit is on for that segment.

But our experience with Linux TCP is that clearing the scoreboard on RTO is
unnecessary, and causes unnecessary performance problems and redundant
retransmissions.  While it is true that "the timeout might indicate that
the data receiver has reneged", it is vastly more common for timeouts to be
due to simple packet loss. And the sender can specifically detect reneging
if it finds it times out and hits the specific scenario implied in the RFC
(the segment at the left edge of the window is SACKed); and if this
specific indication of reneging happens, *then* it makes sense for the
sender to clear the SACK scoreboard. This is the approach Linux TCP has
used for a decade or more, and it works very well in practice.
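A sketch of that RTO-time check, under assumed names (this is illustrative, not the actual Linux code): clear the SACK bits only when the segment at the left edge of the window is itself marked SACKed, which is the contradiction that actually signals reneging; otherwise keep the scoreboard so retransmissions can skip SACKed data.

```c
#include <stdbool.h>

enum seg_state { SEG_UNACKED, SEG_SACKED };

/* Hypothetical RTO handler fragment: segs[0] is the segment at the
 * left edge of the window (snd_una). Returns true if the scoreboard
 * was cleared due to a detected reneging indication. */
static bool rto_maybe_clear_scoreboard(enum seg_state *segs,
                                       unsigned int nsegs)
{
    if (nsegs == 0 || segs[0] != SEG_SACKED)
        return false;   /* no reneging indication: keep scoreboard */

    for (unsigned int i = 0; i < nsegs; i++)
        segs[i] = SEG_UNACKED;
    return true;
}
```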

>   Second, receivers are supposed to ACK out of
>     order segments and those that fill in holes immediately.  If
>     they do that, it means that every ACK received actually does
>     convey one segment leaving the network and therefore L=1 is
>     correct and even an accurate SACK-based algorithm wouldn't let
>     one increase cwnd more than L=1 would.  (One caveat here is I am
>     not sure how this thinking would play with the offload stuff.)

Yes, offload mechanisms are so pervasive in practice that this kind of
reasoning no longer holds in common cases. Hardware (LRO) or software (GRO)
mechanisms aggregate consecutive byte ranges at a layer that is too low to
know whether the byte ranges are out of order or fill a hole, so by the
time such ranges reach TCP they may be up to 64 KBytes in length. So
although TCP will try to ACK an out-of-order or hole-filling sequence range
"immediately" from the TCP layer's perspective, that single ACK may cover
up to 64 KBytes, or roughly 44 segments.
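The "roughly 44 segments" figure follows from simple arithmetic (assuming a standard Ethernet MSS of 1460 bytes and the 64 KB IP datagram limit bounding a GRO/LRO aggregate):

```c
/* Rough arithmetic behind the "~44 segments" figure: one aggregated
 * GRO/LRO range is bounded by the 65535-byte IP datagram limit, so a
 * single hole-filling ACK can cover floor(65535 / MSS) full-sized
 * segments. An MSS of 1460 assumes Ethernet without TCP options. */
static unsigned int max_segs_per_aggregate(unsigned int mss)
{
    return 65535u / mss;
}
```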

best regards,