Re: [tcpm] Linux doesn’t implement RFC3465

Neal Cardwell <ncardwell@google.com> Thu, 29 July 2021 00:41 UTC

From: Neal Cardwell <ncardwell@google.com>
Date: Wed, 28 Jul 2021 20:41:27 -0400
Message-ID: <CADVnQy=Ct0+TdE2A8vV6jKpD_wpmJ07N+_+esD7oJczLv_wi=Q@mail.gmail.com>
To: Yuchung Cheng <ycheng@google.com>
Cc: Vidhi Goel <vidhi_goel=40apple.com@dmarc.ietf.org>, Mark Allman <mallman@icir.org>, "tcpm@ietf.org Extensions" <tcpm@ietf.org>
Archived-At: <https://mailarchive.ietf.org/arch/msg/tcpm/9df9c7qZ0yJ0Aefo0FZ4UQxKvzo>

I would vote that the guidance should be something like:

  On an ACK, don't increase the cwnd by more than the amount of data ACKed.

This implies:

 o slow-start is limited to doubling each round trip

 o the Savage ACK division attacks are squashed
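A minimal sketch of that rule, for illustration only. The MSS value, the
function names, and the one-round slow-start model are assumptions made
for this example; this is not the Linux code.

```python
MSS = 1460  # assumed segment size in bytes, for illustration only


def grow_ack_counting(cwnd, bytes_acked):
    """Legacy ACK counting: bump cwnd by one MSS per ACK, whatever it covers."""
    return cwnd + MSS


def grow_byte_counting(cwnd, bytes_acked):
    """Suggested guidance: grow cwnd by no more than the data this ACK covers."""
    return cwnd + bytes_acked


def slow_start_round(cwnd, grow, ack_size):
    """Model one round trip: a full cwnd is delivered and ACKed in
    chunks of ack_size bytes; return the cwnd at the end of the round."""
    outstanding = cwnd
    new_cwnd = cwnd
    while outstanding > 0:
        acked = min(ack_size, outstanding)
        new_cwnd = grow(new_cwnd, acked)
        outstanding -= acked
    return new_cwnd


if __name__ == "__main__":
    start = 10 * MSS
    # Byte counting doubles per round trip whether ACKs arrive per
    # segment or as one large stretch ACK.
    print(slow_start_round(start, grow_byte_counting, MSS) // MSS)
    print(slow_start_round(start, grow_byte_counting, 64 * 1024) // MSS)
    # ACK division: one MSS acknowledged one byte at a time.
    print(slow_start_round(MSS, grow_ack_counting, 1))
    print(slow_start_round(MSS, grow_byte_counting, 1))
```

In this model, byte counting ends the round at exactly twice the starting
cwnd however the receiver slices or aggregates its ACKs, while ACK
counting lets a one-byte-at-a-time receiver inflate a one-MSS cwnd by
MSS*MSS bytes.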

I don't think limiting cwnd growth to some constant L less than cwnd is the
best road to go down, since:

(a) At high speeds (at least >10Gbps) TCP has needed to operate in
TSO/GSO/GRO/LRO aggregates of 64KBytes for a decade or more, and is now
moving on to larger aggregates to go beyond 100Gbps (see Eric Dumazet's
nice talk at the Linux Netdev conference a few weeks ago). Having a
constant L is not scalable or future-proof. I would imagine any constant L
lower than 64KBytes will be ignored, in practice, with good reason. A
constant L bigger than 64KBytes will be out of date soon, if it's not
already.

(b) If we're worried about bursts (which we should be), then we should
recommend pacing. And RFC 7661 already does that:
  https://datatracker.ietf.org/doc/html/rfc7661#section-4.4.2

  "A TCP sender in the non-validated phase SHOULD control the maximum
   burst size, e.g., using a rate-based pacing algorithm in which a
   sender paces out the cwnd over its estimate of the RTT, or some
   other method, to prevent many segments being transmitted
   contiguously at line-rate."

  So a sender that follows RFC 7661 already has a solution for bursts:
pacing.

(c) If an implementation ignores RFC 7661 and does not do pacing, then
limiting cwnd *growth* to L per ACK does not prevent bursts. A sender that
ignores RFC 7661 will be frequently restarting from idle to blast a full
cwnd at line rate. Those restart-from-idle cwnd bursts are already
super-common in some of the most common types of TCP traffic: web, RPC,
streaming video. Having a particular L limit for cwnd growth will not save
us from those cwnd-magnitude bursts.
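
To illustrate the pacing and restart-from-idle points, here is a rough
sketch under simplifying assumptions: line rate is idealized as
instantaneous, the paced sender spreads the cwnd evenly over one RTT as
in the RFC 7661 text quoted above, and all names and numbers are
hypothetical.

```python
from collections import Counter


def departure_times(cwnd_bytes, rtt_s, packet_bytes, paced):
    """Departure time of each packet when a sender restarts from idle
    with a full cwnd available to send."""
    n = cwnd_bytes // packet_bytes
    if not paced:
        # Unpaced: the whole cwnd leaves back to back (idealized as t=0).
        return [0.0] * n
    # Paced: spread the cwnd evenly over the RTT estimate.
    interval = rtt_s / n
    return [i * interval for i in range(n)]


def largest_burst(times):
    """Largest number of packets sharing a single departure instant."""
    if not times:
        return 0
    return max(Counter(times).values())


if __name__ == "__main__":
    cwnd = 100 * 1500  # assumed cwnd: 100 packets of 1500 bytes
    rtt = 0.1          # assumed RTT estimate: 100 ms
    print(largest_burst(departure_times(cwnd, rtt, 1500, paced=False)))
    print(largest_burst(departure_times(cwnd, rtt, 1500, paced=True)))
```

Note that no per-ACK growth limit L appears anywhere in this model: the
burst size after idle is set by the cwnd already accumulated and by
whether the sender paces, not by how quickly the cwnd was allowed to
grow.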

Just my two cents on that,
neal


On Wed, Jul 28, 2021 at 7:20 PM Yuchung Cheng <ycheng@google.com> wrote:

> Thank you Vidhi and Mark for supporting an ABC update.
>
> To give this a start, my recommended update is:
>
> 1) if the sender uses some form of pacing to send data packets, L is
> RECOMMENDED to be the effective window worth of packets. Pacing here refers
> to spreading packet transmissions at a rate based on the congestion
> window and round-trip time. Additionally, if SACK is supported, the same L
> applies in slow start after RTO.
>
> 2) otherwise L is RECOMMENDED to be 8
>
> Thoughts?
>
>
> On Wed, Jul 28, 2021 at 2:13 PM Vidhi Goel <vidhi_goel=40apple.com@dmarc.ietf.org> wrote:
>
>> Resurrecting the 3465 thread.
>>
>> In the TCPM meeting at IETF 111, we discussed the issue of L=2,
>> which is a MUST in RFC 3465. This is a very strict requirement, and stacks
>> like Linux already don't follow it.
>>
>> The basic principle in the Linux code is that the sender should use the
>> ACKs to learn about the capacity of the path (both in volumetric and rate
>> dimensions), and should not ignore that information. This allows the sender
>> to quickly grow and achieve high throughput, even in the presence of
>> stretch ACKs, which are pervasive, due to TSO/GSO, GRO/LRO, etc.
>>
>> Considering bursts is important, but that can be tackled as an orthogonal
>> issue. Bursts are avoided in the Linux TCP ecosystem by the combination of
>> TSO autosizing, pacing, TSQ, and fair queueing.
>>
>>
>> As Neal described, the congestion controller should use the information
>> in stretch ACKs to increase its congestion window so that it correctly
>> adjusts the *cwnd* based on available link capacity.
>> Burstiness is an orthogonal issue which can be solved by pacing.
>>
>> QUIC loss recovery (RFC 9002) also follows this approach.
>>
>> *Mark*,
>> As a lot of transport and congestion control drafts reference RFC 3465,
>> do you think we should update this RFC to reflect the current deployment?
>> This would also be useful for someone who is just starting with a new
>> implementation.
>>
>> Thanks,
>> Vidhi
>>
>> On Nov 27, 2019, at 5:14 AM, Mark Allman <mallman@icir.org> wrote:
>>
>>
>> +Mark Allman
>>
>>
>> Just to clear it up, I *was* at BBN long ago when the ABC document
>> was written.  It's a cool place to work.  I recommend it.  But, I
>> now hang out at ICSI.
>>
>> I believe that ABC was written to solve the problem with ACK
>> counting by counting the number of bytes acknowledged for
>> misbehaving receivers. Limiting the increase to 2*MSS was a good
>> solution to avoid bursts at the time.
>>
>>
>> The main motivation behind ABC was to counteract delayed ACKs.  The
>> common approach at the time was to just bump cwnd by one MSS every
>> time an ACK rolled in.  So, if an ACK covered two segments because
>> the receiver was delaying the ACKs then the growth rate during slow
>> start was 1.5x per RTT instead of the 2x that was really
>> envisioned.  During congestion avoidance the growth was 1 MSS every
>> 2 RTTs instead of the envisioned every RTT.
>>
>> A secondary motivation was to counteract these ACK division attacks
>> that Savage taught us about.  I.e., we could ACK an MSS-sized packet
>> one byte at a time and the sender would then increase the cwnd by
>> MSS*MSS bytes in the prevalent ACK counting scheme (i.e., cwnd would
>> get bumped by MSS bytes for every ACK).
>>
>> The limit has two roots ...
>>
>> (1) The limit is important in slow starts that follow an RTO.  As
>>    the RFC discusses, in this case we might retransmit a single
>>    packet and this will cause the receiver's window to slide a
>>    great deal.  Therefore, an ACK may indicate that a ton of data
>>    has left the network, but that isn't really the case.  So, we
>>    don't want to increase the cwnd based on all the new bytes
>>    ACKed.
>>
>>    I have since mostly decided that this use of L is crude.
>>    Probably there is a more elegant way to do it by using the
>>    scoreboard and the SACK information to get a better
>>    understanding of what left the network and when.  That said, in
>>    this case L is simple and probably about right most of the
>>    time.
>>
>> (2) I think there was some general conservatism about bursts and
>>    using L everywhere quelled some of the worry.  Here L=2 was used
>>    to exactly offset delayed ACKs.
>>
>> I agree that increasing the congestion window and controlling the
>> burst rate are orthogonal issues.
>>
>>
>> Yes.  In fact, we did subsequent research on mitigating bursts
>> because we never really thought of ABC as somehow the way to control
>> bursts (research papers available, but never went into RFCs).
>>
>> And, in a world that leverages stretch ACKs as a routine I think
>> Linux's approach of not using an L may well be correct.  Documenting
>> that and the reasoning behind it in modern networks seems useful to
>> me.
>>
>> allman