Re: [tcpm] Linux doesn’t implement RFC3465

Vidhi Goel <vidhi_goel@apple.com> Fri, 30 July 2021 20:26 UTC

From: Vidhi Goel <vidhi_goel@apple.com>
Date: Fri, 30 Jul 2021 13:25:51 -0700
Cc: Extensions <tcpm@ietf.org>
To: Yuchung Cheng <ycheng=40google.com@dmarc.ietf.org>, Mark Allman <mallman@icir.org>, Neal Cardwell <ncardwell@google.com>
Archived-At: <https://mailarchive.ietf.org/arch/msg/tcpm/0kiGljWwAE0kARGGV4SbJ6fL-m4>
Subject: Re: [tcpm] Linux doesn’t implement RFC3465

Hi Mark, Yuchung, Neal,

Given that we have some ideas and suggested text for improving RFC 3465, how do you think we should proceed?
Mark, do you want to start a GitHub repo (or similar) with the changes suggested so far, so that others can review and/or contribute to the bis draft?

Thanks,
Vidhi

> On Jul 30, 2021, at 11:04 AM, Yuchung Cheng <ycheng=40google.com@dmarc.ietf.org> wrote:
> 
> 
> 
> On Thu, Jul 29, 2021 at 6:03 PM Vidhi Goel <vidhi_goel=40apple.com@dmarc.ietf.org> wrote:
>>> Well, perhaps.  L=2 was designed to exactly counteract delayed ACKs.
>>> So, it isn't exactly a new magic number.  We could wave our hands
>>> and say "5 seems OK" or "10 seems OK" or whatever.  And, I am sure
>>> we could come up with something that folks felt was fine.  However,
>>> my feeling is that if we want to worry about bursts then let's worry
>>> about bursts in some generic way.  And, if you have some way to deal
>>> with bursts then L isn't needed.  And, if you don't have a way to
>>> deal with bursts then a conservative L seems fine.  But, perhaps
>>> putting the effort into a generic mechanism instead of cooking yet
>>> another magic number we need to periodically refresh is probably a
>>> better way to spend effort.
>>> 
>>> Yes, I very much agree that "putting the effort into a generic mechanism instead of cooking yet another magic number we need to periodically refresh is probably a better way to spend effort."
>> 
>> I agree that defining such a number doesn’t fully solve the problem but it gives some recommendation for implementations that don’t do pacing. So, defining a somewhat less restrictive value for L (5 or 10) would be a last resort for implementations that don’t pace.
>> How about using 10 as the number, and also listing the rationale implementers should follow when deciding on a higher or lower value? It's never one size fits all.
> 
> That sounds great. Something on the lines of,
> 
>  “This document RECOMMENDS using mechanisms like pacing to control how many bytes are sent into the network at any point in time. If it is not possible to implement pacing, an implementation MAY implicitly pace its traffic by applying a limit L to the increase in congestion window per ACK during slow start. In modern stacks, acknowledgments are aggregated for various reasons, such as CPU optimization and reducing network load. Hence it is common for a sender to receive an aggregated ACK that acknowledges more than 2 segments. For example, a stack that implements GRO could aggregate packets up to 64 Kbytes, or ~44 segments, before passing them on to the TCP layer, and this would result in a single ACK being generated by the TCP stack. Given that an initial window of 10 packets has been working fine in current deployments, the draft recommends setting L=10 during slow start. This would mean that with every ACK, we probe for new capacity by sending up to 10 packets in addition to the previously discovered capacity. Implementations MAY choose a lower limit if they believe an increase of 10 is too aggressive.”
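
To make the proposed rule concrete, here is a minimal sketch of the per-ACK slow-start increase with a limit L, in the style of RFC 3465's `cwnd += min(bytes_acked, L*SMSS)`. The SMSS of 1460 bytes, the function name, and the example window sizes are assumptions for illustration only; this is not taken from any particular stack.

```python
SMSS = 1460  # assumed sender maximum segment size in bytes (illustrative)


def slow_start_increase(cwnd, bytes_acked, limit_l=10, smss=SMSS):
    """Return the new cwnd (in bytes) after one ACK during slow start.

    ABC-style rule: grow by the newly acknowledged bytes,
    but by no more than L * SMSS for a single ACK.
    """
    return cwnd + min(bytes_acked, limit_l * smss)


# A 64 KB aggregate holds this many full-sized segments.
segments_in_aggregate = 65536 // SMSS

# One stretch ACK covering the whole aggregate, with a hypothetical
# cwnd of 100 segments already in place.
cwnd = 100 * SMSS
stretch_ack_bytes = segments_in_aggregate * SMSS
cwnd_l10 = slow_start_increase(cwnd, stretch_ack_bytes, limit_l=10)
cwnd_l2 = slow_start_increase(cwnd, stretch_ack_bytes, limit_l=2)
```

With L=10 the single stretch ACK grows the window by 10 segments; with L=2 it grows by only 2, even though 44 segments were acknowledged.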
> 
> Does this sound like what we would like to say?
> Thanks for taking a shot. I would put more description on Pacing to ensure better implementation. How about:
> "Pacing here refers to spreading packet transmissions over time at a rate based on the congestion window and round-trip time." with a citation of https://datatracker.ietf.org/doc/html/rfc7661#section-4.4.2
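
As a rough illustration of that definition, the sketch below spreads one congestion window evenly over one round-trip time. The function names and the plain cwnd/RTT rate (with no gain factor) are assumptions made for this example; real stacks may scale the rate differently.

```python
def pacing_gap(cwnd_bytes, rtt_s, packet_bytes):
    """Seconds to wait between packets so one cwnd is spread over one RTT."""
    rate_bytes_per_s = cwnd_bytes / rtt_s
    return packet_bytes / rate_bytes_per_s


def paced_send_times(n_packets, cwnd_bytes, rtt_s, packet_bytes):
    """Send times (seconds from now) for n_packets at the paced rate."""
    gap = pacing_gap(cwnd_bytes, rtt_s, packet_bytes)
    return [i * gap for i in range(n_packets)]


# Hypothetical numbers: cwnd of 10 segments of 1460 bytes, RTT of 100 ms.
times = paced_send_times(10, 10 * 1460, 0.1, 1460)
```

With these numbers the packets leave about 10 ms apart instead of back to back.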
> 
> 
> 
> I would also refer to IW RFC 6928 in case it gets increased / updated a few years later.
> Hmm, maybe we should also move RFC 6928 to the standards track :-)
> 
> 
> -
> Vidhi
> 
>> On Jul 29, 2021, at 1:47 PM, Yuchung Cheng <ycheng=40google.com@dmarc.ietf.org> wrote:
>> 
>> 
>> 
>> On Thu, Jul 29, 2021 at 1:19 PM Vidhi Goel <vidhi_goel=40apple.com@dmarc.ietf.org> wrote:
>>> Well, perhaps.  L=2 was designed to exactly counteract delayed ACKs.
>>> So, it isn't exactly a new magic number.  We could wave our hands
>>> and say "5 seems OK" or "10 seems OK" or whatever.  And, I am sure
>>> we could come up with something that folks felt was fine.  However,
>>> my feeling is that if we want to worry about bursts then let's worry
>>> about bursts in some generic way.  And, if you have some way to deal
>>> with bursts then L isn't needed.  And, if you don't have a way to
>>> deal with bursts then a conservative L seems fine.  But, perhaps
>>> putting the effort into a generic mechanism instead of cooking yet
>>> another magic number we need to periodically refresh is probably a
>>> better way to spend effort.
>>> 
>>> Yes, I very much agree that "putting the effort into a generic mechanism instead of cooking yet another magic number we need to periodically refresh is probably a better way to spend effort."
>> 
>> I agree that defining such a number doesn’t fully solve the problem but it gives some recommendation for implementations that don’t do pacing. So, defining a somewhat less restrictive value for L (5 or 10) would be a last resort for implementations that don’t pace.
>> How about using 10 as the number, and also listing the rationale implementers should follow when deciding on a higher or lower value? It's never one size fits all.
>> 
>> Also I believe it's time to move ABC into the standards track, in the era of (bigger and bigger) stretch ACKs.
>> 
>> 
>> Thanks,
>> Vidhi
>> 
>> 
>> 
>>> On Jul 29, 2021, at 8:19 AM, Neal Cardwell <ncardwell@google.com> wrote:
>>> 
>>> 
>>> 
>>> On Thu, Jul 29, 2021 at 10:06 AM Mark Allman <mallman@icir.org> wrote:
>>> 
>>> >>     (b) If there is no burst mitigation then we have to figure out
>>> >>         if L is still useful for this purpose and whether we want to
>>> >>         retain it.  Seems like perhaps L=2 is sensible here.  L was
>>> >>         never meant to be some general burst mitigator.  However,
>>> >>         ABC clearly *can* aggravate bursting and so perhaps it makes
>>> >>         sense to have it also try to limit the impact of the
>>> >>         aggravation (in the absence of some general mechanism).
>>> >
>>> > Even if recommending a static L value, IMHO L=2 is a bit
>>> > conservative.
>>> 
>>> Well, perhaps.  L=2 was designed to exactly counteract delayed ACKs.
>>> So, it isn't exactly a new magic number.  We could wave our hands
>>> and say "5 seems OK" or "10 seems OK" or whatever.  And, I am sure
>>> we could come up with something that folks felt was fine.  However,
>>> my feeling is that if we want to worry about bursts then let's worry
>>> about bursts in some generic way.  And, if you have some way to deal
>>> with bursts then L isn't needed.  And, if you don't have a way to
>>> deal with bursts then a conservative L seems fine.  But, perhaps
>>> putting the effort into a generic mechanism instead of cooking yet
>>> another magic number we need to periodically refresh is probably a
>>> better way to spend effort.
>>> 
>>> Yes, I very much agree that "putting the effort into a generic mechanism instead of cooking yet another magic number we need to periodically refresh is probably a better way to spend effort." 
>>> 
>>> >>   - During slow starts that follow RTOs there is a general
>>> >>     problem that just because the window slides by X bytes
>>> >>     doesn't say anything about the *network*, as that sliding can
>>> >>     happen because much of the data was likely queued for the
>>> >>     application on the receiver.  So, e.g., you can RTO and send
>>> >>     one packet and get an ACK back that slides the window 10
>>> >>     packets.  That doesn't mean 10 packets left.  It means one
>>> >>     packet left the network and nine packets are eligible to be
>>> >>     sent to the application.  So, it is not OK to set the cwnd to
>>> >>     1+10 = 11 packets in response to this ACK.  Here L should
>>> >>     exist and be 1.
>>> >
>>> > AFAICT this argument only applies to non-SACK connections. For
>>> > connections with SACK (the vast majority of connections over the
>>> > public Internet and in datacenters), it is quite feasible to
>>> > determine how many packets really left the network (and Linux TCP
>>> > does this; see below).
>>> 
>>> If you have an accurate way to figure out how many of the ACKed
>>> bytes left the network and how many were just buffered at the
>>> receiver then I see no problem with increasing based on byte count
>>> as you do in the initial slow start.
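
The post-RTO example in this exchange can be sketched as follows. This is only an illustration under stated assumptions: `previously_sacked` stands for bytes in the newly acknowledged range that the sender already knew (via SACK) had reached the receiver before the retransmission, and the function name and 1460-byte SMSS are hypothetical. It is not a description of how Linux or any other stack actually computes this.

```python
SMSS = 1460  # assumed segment size in bytes (illustrative)


def post_rto_increase(cwnd, ack_advance, previously_sacked=0,
                      have_sack=True, smss=SMSS):
    """Return the new cwnd (bytes) for one ACK in slow start after an RTO.

    ack_advance: bytes by which the cumulative ACK slid the window.
    previously_sacked: bytes in that range already known to be at the
        receiver, so they did not just leave the network.
    """
    if have_sack:
        # Count only bytes that really left the network with this ACK.
        newly_delivered = max(ack_advance - previously_sacked, 0)
        return cwnd + newly_delivered
    # Without SACK we cannot tell, so fall back to the conservative L=1.
    return cwnd + min(ack_advance, 1 * smss)


# The example from the thread: one retransmission after the RTO,
# then an ACK that slides the window by 10 packets, 9 of which
# were already buffered at the receiver.
naive = 1 * SMSS + 10 * SMSS
with_sack = post_rto_increase(1 * SMSS, 10 * SMSS, previously_sacked=9 * SMSS)
without_sack = post_rto_increase(1 * SMSS, 10 * SMSS, have_sack=False)
```

Both the SACK-based count and the L=1 fallback give a cwnd of 2 packets here, rather than the 11 packets the naive byte count would produce.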
>>> 
>>> (I don't remember what the paper you cite says, but my guess is it's
>>> often the case that L=1 is a reasonable substitute for something
>>> complicated here.  But, perhaps I am running the simulation in my
>>> head wrong ... it has been a while, admittedly!)
>>> 
>>> > Yes, offload mechanisms are so pervasive in practice,
>>> 
>>> I am trying to build a mental model here.  How pervasive would you
>>> guess these are?  And, where in the network?  I have assumed that
>>> they are for sure pervasive in data centers and server farms, but
>>> not for the vast majority of Internet-connected devices.
>>> 
>>> From my impression looking at public Internet traces, aggregation mechanisms that cause TCP ACKs for more than 2 segments are very common. I suspect that's because the majority of public Internet traffic these days has a bottleneck that is either wifi, cellular, or DOCSIS, and all of these have a shared medium with a large latency overhead for L2 MAC control of who gets to speak next. So a lot of batching happens, both in big batches of data that arrive at the client in the same L2 medium time slot, and big batches of ACKs that accumulate while the client waits (often several milliseconds, sometimes even tens of milliseconds) for its chance to send a big stretch ACK or batch of ACKs.
>>> 
>>> This brings up a related point: even if there is some ABC-style per-ACK L limit on cwnd increases, the time structure of most public Internet ACK streams is massively bursty because of these aggregation mechanisms inherent in L2 behavior on most public Internet bottlenecks (wifi, cellular, DOCSIS). So even if there is a limit L that limits the per-ACK behavior to be smooth, if there is no pacing of data segments then the data transmit time structure will still be bursty because the ACK arrivals these days are very bursty. 
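
A toy model of this point, assuming an unpaced sender that transmits immediately on every ACK: each ACK releases the acknowledged bytes plus the ABC growth of min(bytes_acked, L*SMSS). The function name, the 1 ms time buckets, and the ACK patterns are made up for illustration and are not measurements.

```python
from collections import defaultdict

SMSS = 1460  # assumed segment size in bytes (illustrative)


def unpaced_bursts(acks, limit_l=2, smss=SMSS):
    """Bytes an unpaced slow-start sender emits at each ACK arrival time.

    acks: list of (arrival_time_ms, bytes_acked) tuples.
    Returns a dict mapping arrival time to total bytes sent at that time.
    """
    sent = defaultdict(int)
    for arrival_ms, bytes_acked in acks:
        growth = min(bytes_acked, limit_l * smss)  # per-ACK limit L
        sent[arrival_ms] += bytes_acked + growth
    return dict(sent)


# Ten ACKs of two segments each, arriving 1 ms apart ...
smooth_acks = [(t, 2 * SMSS) for t in range(10)]
# ... versus the same ten ACKs delivered together in one L2 time slot.
batched_acks = [(0, 2 * SMSS) for _ in range(10)]

smooth_peak = max(unpaced_bursts(smooth_acks).values())
batched_peak = max(unpaced_bursts(batched_acks).values())
```

The per-ACK limit is identical in both cases, yet the batched arrival yields a 40-segment burst against 4 segments per millisecond for the smooth arrival.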
>>> 
>>> best regards,
>>> neal
>> 