Re: [tcpm] Linux doesn’t implement RFC3465

Neal Cardwell <ncardwell@google.com> Fri, 30 July 2021 23:07 UTC

From: Neal Cardwell <ncardwell@google.com>
Date: Fri, 30 Jul 2021 19:07:19 -0400
Message-ID: <CADVnQymuE58SX=iD3-+5g-Zk_RbpB+d3Mz=hGUwKxhfp_6gAMA@mail.gmail.com>
To: Yuchung Cheng <ycheng@google.com>
Cc: Vidhi Goel <vidhi_goel=40apple.com@dmarc.ietf.org>, Mark Allman <mallman@icir.org>, Extensions <tcpm@ietf.org>
Archived-At: <https://mailarchive.ietf.org/arch/msg/tcpm/Uxox32_2ETCkSccGK3Pa-HGPSCg>

A github repo sounds good to me as well.

thanks,
neal

On Fri, Jul 30, 2021 at 7:05 PM Yuchung Cheng <ycheng@google.com> wrote:
>
> github repo sounds good to me.
>
> On Fri, Jul 30, 2021 at 1:26 PM Vidhi Goel <vidhi_goel=40apple.com@dmarc.ietf.org> wrote:
>>
>> Hi Mark, Yuchung, Neal,
>>
>> Given that we have some ideas and suggested text for improving RFC 3465, how do you think we should proceed?
>> Mark, do you want to start a GitHub repo (or something similar) with the changes suggested so far, so that others can review and/or contribute to the bis draft?
>>
>> Thanks,
>> Vidhi
>>
>> On Jul 30, 2021, at 11:04 AM, Yuchung Cheng <ycheng=40google.com@dmarc.ietf.org> wrote:
>>
>>
>>
>> On Thu, Jul 29, 2021 at 6:03 PM Vidhi Goel <vidhi_goel=40apple.com@dmarc.ietf.org> wrote:
>>>>>
>>>>> Well, perhaps.  L=2 was designed to exactly counteract delayed ACKs.
>>>>> So, it isn't exactly a new magic number.  We could wave our hands
>>>>> and say "5 seems OK" or "10 seems OK" or whatever.  And, I am sure
>>>>> we could come up with something that folks felt was fine.  However,
>>>>> my feeling is that if we want to worry about bursts then let's worry
>>>>> about bursts in some generic way.  And, if you have some way to deal
>>>>> with bursts then L isn't needed.  And, if you don't have a way to
>>>>> deal with bursts then a conservative L seems fine.  But, perhaps
>>>>> putting the effort into a generic mechanism instead of cooking yet
>>>>> another magic number we need to periodically refresh is probably a
>>>>> better way to spend effort.
>>>>
>>>>
>>>> Yes, I very much agree that "putting the effort into a generic mechanism instead of cooking yet another magic number we need to periodically refresh is probably a better way to spend effort."
>>>>
>>>>
>>>> I agree that defining such a number doesn’t fully solve the problem but it gives some recommendation for implementations that don’t do pacing. So, defining a somewhat less restrictive value for L (5 or 10) would be a last resort for implementations that don’t pace.
>>>
>>> How about specifying L=10, and also listing the rationale implementers should follow when choosing a higher or lower value? It's never one-size-fits-all.
>>>
>>>
>>> That sounds great. Something along the lines of:
>>>
>>>  “This document RECOMMENDS using mechanisms like pacing to control how many bytes are sent to the network at any point in time. If pacing cannot be implemented, an implementation MAY implicitly pace its traffic by applying a limit L to the increase in congestion window per ACK during slow start. In modern stacks, acknowledgments are aggregated for various reasons (CPU optimization, reducing network load, etc.). Hence it is common for a sender to receive an aggregated ACK that acknowledges more than 2 segments. For example, a stack that implements GRO could aggregate packets up to 64 Kbytes (~44 segments) before passing them to the TCP layer, which would result in a single ACK being generated by the TCP stack. Given that an initial window of 10 packets has been working fine in current deployments, the draft recommends setting L=10 during slow start. This means that with every ACK, the sender probes for new capacity by sending up to 10 packets in addition to the previously discovered capacity. Implementations MAY choose a lower limit if they believe an increase of 10 is too aggressive.”
>>>
>>> Does this sound like what we would like to say?
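
For illustration, the per-ACK limit in the proposed text can be sketched as below. This is a minimal sketch, not code from any stack discussed in this thread: it assumes the RFC 3465 slow-start rule of increasing cwnd by min(bytes acked, L*SMSS), and the names `slow_start_increase`, `SMSS`, and the 1460-byte segment size are hypothetical choices made for the example.

```python
SMSS = 1460  # assumed sender maximum segment size in bytes (illustrative)


def slow_start_increase(bytes_acked: int, limit_l: int, smss: int = SMSS) -> int:
    """Bytes added to cwnd for one ACK in slow start, capped at L * SMSS."""
    if bytes_acked < 0 or limit_l < 1:
        raise ValueError("bytes_acked must be >= 0 and limit_l must be >= 1")
    return min(bytes_acked, limit_l * smss)


if __name__ == "__main__":
    # One aggregated ACK covering 44 full-sized segments (about 64 Kbytes).
    stretch_ack_bytes = 44 * SMSS
    for limit_l in (2, 10):
        growth = slow_start_increase(stretch_ack_bytes, limit_l)
        print(f"L={limit_l}: cwnd grows by {growth} bytes")
```

With L=2 the stretch ACK grows cwnd by only two segments, while L=10 allows ten; an ACK covering fewer than L segments is never inflated beyond the bytes it actually acknowledges.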
>>
>> Thanks for taking a shot. I would add more description of pacing to encourage better implementations. How about:
>> "Pacing here refers to spreading packet transmissions over time, following a rate based on the congestion window and round-trip time." with a citation of https://datatracker.ietf.org/doc/html/rfc7661#section-4.4.2
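
As a rough illustration of that definition, the sketch below derives a pacing rate of cwnd per round-trip time and the resulting gap between segment departures. It is an assumption-laden example rather than any stack's pacing logic: real implementations typically scale the rate and handle timers differently, and the function names and the 1460-byte segment size are hypothetical.

```python
def pacing_interval(cwnd_bytes: int, rtt_s: float, segment_bytes: int = 1460) -> float:
    """Seconds between segment departures when sending one cwnd per RTT."""
    if cwnd_bytes <= 0 or rtt_s <= 0 or segment_bytes <= 0:
        raise ValueError("cwnd_bytes, rtt_s and segment_bytes must be positive")
    rate_bytes_per_s = cwnd_bytes / rtt_s
    return segment_bytes / rate_bytes_per_s


def departure_times(n_segments: int, cwnd_bytes: int, rtt_s: float,
                    segment_bytes: int = 1460) -> list:
    """Evenly spaced departure times (seconds from now) for n_segments."""
    gap = pacing_interval(cwnd_bytes, rtt_s, segment_bytes)
    return [i * gap for i in range(n_segments)]


if __name__ == "__main__":
    # A cwnd of 10 segments over a 100 ms RTT spaces segments about 10 ms apart.
    for t in departure_times(10, cwnd_bytes=10 * 1460, rtt_s=0.100):
        print(f"{t * 1000:.1f} ms")
```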
>>
>>
>>
>> I would also refer to the IW RFC 6928, in case the initial window is increased or updated in a few years.
>> Hmm, maybe we should also move RFC 6928 to the standards track :-)
>>
>>>
>>> -
>>> Vidhi
>>>
>>> On Jul 29, 2021, at 1:47 PM, Yuchung Cheng <ycheng=40google.com@dmarc.ietf.org> wrote:
>>>
>>>
>>>
>>> On Thu, Jul 29, 2021 at 1:19 PM Vidhi Goel <vidhi_goel=40apple.com@dmarc.ietf.org> wrote:
>>>>>
>>>>> Well, perhaps.  L=2 was designed to exactly counteract delayed ACKs.
>>>>> So, it isn't exactly a new magic number.  We could wave our hands
>>>>> and say "5 seems OK" or "10 seems OK" or whatever.  And, I am sure
>>>>> we could come up with something that folks felt was fine.  However,
>>>>> my feeling is that if we want to worry about bursts then let's worry
>>>>> about bursts in some generic way.  And, if you have some way to deal
>>>>> with bursts then L isn't needed.  And, if you don't have a way to
>>>>> deal with bursts then a conservative L seems fine.  But, perhaps
>>>>> putting the effort into a generic mechanism instead of cooking yet
>>>>> another magic number we need to periodically refresh is probably a
>>>>> better way to spend effort.
>>>>
>>>>
>>>> Yes, I very much agree that "putting the effort into a generic mechanism instead of cooking yet another magic number we need to periodically refresh is probably a better way to spend effort."
>>>>
>>>>
>>>> I agree that defining such a number doesn’t fully solve the problem but it gives some recommendation for implementations that don’t do pacing. So, defining a somewhat less restrictive value for L (5 or 10) would be a last resort for implementations that don’t pace.
>>>
>>> How about specifying L=10, and also listing the rationale implementers should follow when choosing a higher or lower value? It's never one-size-fits-all.
>>>
>>> Also I believe it's time to move ABC into the standards track, in the era of (bigger and bigger) stretch ACKs.
>>>
>>>>
>>>> Thanks,
>>>> Vidhi
>>>>
>>>>
>>>>
>>>> On Jul 29, 2021, at 8:19 AM, Neal Cardwell <ncardwell@google.com> wrote:
>>>>
>>>>
>>>>
>>>> On Thu, Jul 29, 2021 at 10:06 AM Mark Allman <mallman@icir.org> wrote:
>>>>>
>>>>>
>>>>> >>     (b) If there is no burst mitigation then we have to figure out
>>>>> >>         if L is still useful for this purpose and whether we want to
>>>>> >>         retain it.  Seems like perhaps L=2 is sensible here.  L was
>>>>> >>         never meant to be some general burst mitigator.  However,
>>>>> >>         ABC clearly *can* aggravate bursting and so perhaps it makes
>>>>> >>         sense to have it also try to limit the impact of the
>>>>> >>         aggravation (in the absence of some general mechanism).
>>>>> >
>>>>> > Even if recommending a static L value, IMHO L=2 is a bit
>>>>> > conservative.
>>>>>
>>>>> Well, perhaps.  L=2 was designed to exactly counteract delayed ACKs.
>>>>> So, it isn't exactly a new magic number.  We could wave our hands
>>>>> and say "5 seems OK" or "10 seems OK" or whatever.  And, I am sure
>>>>> we could come up with something that folks felt was fine.  However,
>>>>> my feeling is that if we want to worry about bursts then let's worry
>>>>> about bursts in some generic way.  And, if you have some way to deal
>>>>> with bursts then L isn't needed.  And, if you don't have a way to
>>>>> deal with bursts then a conservative L seems fine.  But, perhaps
>>>>> putting the effort into a generic mechanism instead of cooking yet
>>>>> another magic number we need to periodically refresh is probably a
>>>>> better way to spend effort.
>>>>
>>>>
>>>> Yes, I very much agree that "putting the effort into a generic mechanism instead of cooking yet another magic number we need to periodically refresh is probably a better way to spend effort."
>>>>>
>>>>>
>>>>> >>   - During slow starts that follow RTOs there is a general
>>>>> >>     problem that just because the window slides by X bytes
>>>>> >>     doesn't say anything about the *network*, as that sliding can
>>>>> >>     happen because much of the data was likely queued for the
>>>>> >>     application on the receiver.  So, e.g., you can RTO and send
>>>>> >>     one packet and get an ACK back that slides the window 10
>>>>> >>     packets.  That doesn't mean 10 packets left.  It means one
>>>>> >>     packet left the network and nine packets are eligible to be
>>>>> >>     sent to the application.  So, it is not OK to set the cwnd to
>>>>> >>     1+10 = 11 packets in response to this ACK.  Here L should
>>>>> >>     exist and be 1.
>>>>> >
>>>>> > AFAICT this argument only applies to non-SACK connections. For
>>>>> > connections with SACK (the vast majority of connections over the
>>>>> > public Internet and in datacenters), it is quite feasible to
>>>>> > determine how many packets really left the network (and Linux TCP
>>>>> > does this; see below).
>>>>>
>>>>> If you have an accurate way to figure out how many of the ACKed
>>>>> bytes left the network and how many were just buffered at the
>>>>> receiver then I see no problem with increasing based on byte count
>>>>> as you do in the initial slow start.
>>>>>
>>>>> (I don't remember what the paper you cite says, but my guess is it's
>>>>> often the case that L=1 is a reasonable substitute for something
>>>>> complicated here.  But, perhaps I am running the simulation in my
>>>>> head wrong ... it has been a while, admittedly!)
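
The post-RTO case debated above can be sketched as follows. This is a hypothetical illustration, not Linux's or any other stack's implementation: `delivered_bytes` stands for whatever estimate the sender has (for example from SACK accounting) of how many bytes this ACK shows to have left the network, and when no such estimate exists the sketch falls back to L=1.

```python
from typing import Optional

SMSS = 1460  # assumed segment size in bytes (illustrative)


def post_rto_increase(bytes_acked: int, delivered_bytes: Optional[int] = None,
                      smss: int = SMSS) -> int:
    """cwnd increase in bytes for one ACK during slow start after an RTO.

    delivered_bytes is the sender's estimate of bytes that left the network
    because of this ACK, or None when the sender cannot tell.
    """
    if delivered_bytes is not None:
        # Count only what actually left the network, never more than was ACKed.
        return min(bytes_acked, delivered_bytes)
    # No delivery estimate: be conservative and apply L=1.
    return min(bytes_acked, 1 * smss)


if __name__ == "__main__":
    # The scenario from the thread: one retransmission is ACKed and the window
    # slides by 10 packets, but only 1 packet just left the network.
    acked = 10 * SMSS
    print("naive byte counting:", acked)
    print("L=1 fallback:       ", post_rto_increase(acked))
    print("delivery estimate:  ", post_rto_increase(acked, delivered_bytes=1 * SMSS))
```

In this scenario both the L=1 fallback and the delivery-based count grow cwnd by one segment, whereas naive byte counting would add ten.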
>>>>>
>>>>> > Yes, offload mechanisms are so pervasive in practice,
>>>>>
>>>>> I am trying to build a mental model here.  How pervasive would you
>>>>> guess these are?  And, where in the network?  I have assumed that
>>>>> they are for sure pervasive in data centers and server farms, but
>>>>> not for the vast majority of Internet-connected devices.
>>>>
>>>>
>>>> My impression from looking at public Internet traces is that aggregation mechanisms that cause TCP ACKs for more than 2 segments are very common. I suspect that's because the majority of public Internet traffic these days has a bottleneck that is either wifi, cellular, or DOCSIS, and all of these have a shared medium with a large latency overhead for L2 MAC control of who gets to speak next. So a lot of batching happens, both in big batches of data that arrive at the client in the same L2 medium time slot, and big batches of ACKs that accumulate while the client waits (often several milliseconds, sometimes even tens of milliseconds) for its chance to send a big stretch ACK or batch of ACKs.
>>>>
>>>> This brings up a related point: even if there is some ABC-style per-ACK L limit on cwnd increases, the time structure of most public Internet ACK streams is massively bursty because of these aggregation mechanisms inherent in L2 behavior on most public Internet bottlenecks (wifi, cellular, DOCSIS). So even if there is a limit L that limits the per-ACK behavior to be smooth, if there is no pacing of data segments then the data transmit time structure will still be bursty because the ACK arrivals these days are very bursty.
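
A toy model can make this point concrete. The sketch below is purely illustrative and rests on assumptions not taken from the thread: times are integer microseconds, every ACK releases a fixed number of segments, and the function names are hypothetical. It compares the largest burst of departures with and without pacing when a batch of ACKs arrives at once.

```python
def unpaced_send_times(ack_times_us, segments_per_ack):
    """Each ACK immediately releases segments_per_ack segments."""
    return [t for t in sorted(ack_times_us) for _ in range(segments_per_ack)]


def paced_send_times(ack_times_us, segments_per_ack, gap_us):
    """Same releases, but departures are kept at least gap_us apart."""
    times = []
    next_free = 0
    for t in sorted(ack_times_us):
        for _ in range(segments_per_ack):
            depart = max(t, next_free)
            times.append(depart)
            next_free = depart + gap_us
    return times


def largest_burst(send_times_us, window_us):
    """Max number of departures inside any half-open window of window_us."""
    times = sorted(send_times_us)
    best = 0
    start = 0
    for end, t in enumerate(times):
        while t - times[start] >= window_us:
            start += 1
        best = max(best, end - start + 1)
    return best


if __name__ == "__main__":
    # Ten ACKs released together after one L2 slot, each freeing 2 segments.
    acks = [20_000] * 10
    print(largest_burst(unpaced_send_times(acks, 2), 1_000))
    print(largest_burst(paced_send_times(acks, 2, 1_000), 1_000))
```

Even with a small per-ACK allowance, the unpaced sender emits all 20 segments in the same instant, while the paced sender never has more than one departure per 1 ms window.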
>>>>
>>>> best regards,
>>>> neal
>>>>
>>>>
>>
>>