Re: [tcpm] Linux doesn’t implement RFC3465

Vidhi Goel <vidhi_goel@apple.com> Fri, 30 July 2021 01:03 UTC

From: Vidhi Goel <vidhi_goel@apple.com>
Date: Thu, 29 Jul 2021 18:03:10 -0700
In-reply-to: <CAK6E8=fFWAE_NSr45i2mdh6NmYDusUFW3GYGtuo-FcL07sox9A@mail.gmail.com>
Cc: Neal Cardwell <ncardwell@google.com>, Mark Allman <mallman@icir.org>, Extensions <tcpm@ietf.org>
To: Yuchung Cheng <ycheng=40google.com@dmarc.ietf.org>
Archived-At: <https://mailarchive.ietf.org/arch/msg/tcpm/QjiJveSWqgQkEFWG5IjeDaYzGZY>
Subject: Re: [tcpm] Linux doesn’t implement RFC3465
List-Id: TCP Maintenance and Minor Extensions Working Group <tcpm.ietf.org>

>> Well, perhaps.  L=2 was designed to exactly counteract delayed ACKs.
>> So, it isn't exactly a new magic number.  We could wave our hands
>> and say "5 seems OK" or "10 seems OK" or whatever.  And, I am sure
>> we could come up with something that folks felt was fine.  However,
>> my feeling is that if we want to worry about bursts then let's worry
>> about bursts in some generic way.  And, if you have some way to deal
>> with bursts then L isn't needed.  And, if you don't have a way to
>> deal with bursts then a conservative L seems fine.  But, perhaps
>> putting the effort into a generic mechanism instead of cooking yet
>> another magic number we need to periodically refresh is probably a
>> better way to spend effort.
>> 
>> Yes, I very much agree that "putting the effort into a generic mechanism instead of cooking yet another magic number we need to periodically refresh is probably a better way to spend effort.” 
> 
> I agree that defining such a number doesn’t fully solve the problem but it gives some recommendation for implementations that don’t do pacing. So, defining a somewhat less restrictive value for L (5 or 10) would be a last resort for implementations that don’t pace.
> How about putting a number 10, and also put all the rationales to follow to decide a higher or lower value. It's never one-size for all.

That sounds great. Something along the lines of:

 “This document RECOMMENDS using mechanisms like pacing to control how many bytes are sent into the network at a given point in time. If it is not possible to implement pacing, an implementation MAY implicitly pace its traffic by applying a limit L to the congestion window increase per ACK during slow start. In modern stacks, acknowledgments are aggregated for various reasons, such as CPU optimization and reducing network load. Hence it is common for a sender to receive an aggregated ACK that acknowledges more than 2 segments. For example, a stack that implements GRO could aggregate up to 64 Kbytes, or ~44 segments, before passing the data to the TCP layer, which would result in a single ACK being generated by the TCP stack. Given that an initial window of 10 packets has been working fine in current deployments, this draft recommends setting L=10 during slow start. This means that with every ACK, the sender probes for new capacity by sending up to 10 packets beyond the previously discovered capacity. Implementations MAY choose a lower limit if they believe an increase of 10 is too aggressive.”

Does this sound like what we would like to say?
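For concreteness, here is a minimal sketch of what that rule would look like: an RFC 3465-style byte-counted slow-start increase, capped at L*SMSS per ACK. This is just an illustration of the proposed text, not draft or implementation language; the helper name and the SMSS value are my own for the example.

```python
SMSS = 1460  # illustrative sender maximum segment size, in bytes
L = 10       # proposed per-ACK increase limit, in segments

def abc_slow_start_increase(cwnd, bytes_acked, l_segments=L, smss=SMSS):
    """Byte-counted slow-start growth (RFC 3465 style), capped at L*SMSS per ACK.

    Hypothetical helper for illustration only: the cwnd increase tracks the
    number of bytes newly acknowledged, but a single stretch ACK can never
    grow cwnd by more than l_segments * smss bytes.
    """
    increase = min(bytes_acked, l_segments * smss)
    return cwnd + increase

# A GRO-style stretch ACK covering ~44 segments (~64 KB) is capped at 10 SMSS:
cwnd = 10 * SMSS
cwnd = abc_slow_start_increase(cwnd, 44 * SMSS)  # grows by 10 SMSS, not 44
```

With L=10, an ordinary delayed ACK (2 segments) still grows cwnd by the full 2 SMSS; only aggregated ACKs beyond 10 segments are clamped.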

-
Vidhi

> On Jul 29, 2021, at 1:47 PM, Yuchung Cheng <ycheng=40google.com@dmarc.ietf.org> wrote:
> 
> 
> 
> On Thu, Jul 29, 2021 at 1:19 PM Vidhi Goel <vidhi_goel=40apple.com@dmarc.ietf.org> wrote:
>> Well, perhaps.  L=2 was designed to exactly counteract delayed ACKs.
>> So, it isn't exactly a new magic number.  We could wave our hands
>> and say "5 seems OK" or "10 seems OK" or whatever.  And, I am sure
>> we could come up with something that folks felt was fine.  However,
>> my feeling is that if we want to worry about bursts then let's worry
>> about bursts in some generic way.  And, if you have some way to deal
>> with bursts then L isn't needed.  And, if you don't have a way to
>> deal with bursts then a conservative L seems fine.  But, perhaps
>> putting the effort into a generic mechanism instead of cooking yet
>> another magic number we need to periodically refresh is probably a
>> better way to spend effort.
>> 
>> Yes, I very much agree that "putting the effort into a generic mechanism instead of cooking yet another magic number we need to periodically refresh is probably a better way to spend effort.” 
> 
> I agree that defining such a number doesn’t fully solve the problem but it gives some recommendation for implementations that don’t do pacing. So, defining a somewhat less restrictive value for L (5 or 10) would be a last resort for implementations that don’t pace.
> How about putting a number 10, and also put all the rationales to follow to decide a higher or lower value. It's never one-size for all.
> 
> Also I believe it's time to move ABC into the standards track, in the era of (bigger and bigger) stretch ACKs.
> 
> 
> Thanks,
> Vidhi
> 
> 
> 
>> On Jul 29, 2021, at 8:19 AM, Neal Cardwell <ncardwell@google.com> wrote:
>> 
>> 
>> 
>> On Thu, Jul 29, 2021 at 10:06 AM Mark Allman <mallman@icir.org> wrote:
>> 
>> >>     (b) If there is no burst mitigation then we have to figure out
>> >>         if L is still useful for this purpose and whether we want to
>> >>         retain it.  Seems like perhaps L=2 is sensible here.  L was
>> >>         never meant to be some general burst mitigator.  However,
>> >>         ABC clearly *can* aggravate bursting and so perhaps it makes
>> >>         sense to have it also try to limit the impact of the
>> >>         aggravation (in the absence of some general mechanism).
>> >
>> > Even if recommending a static L value, IMHO L=2 is a bit
>> > conservative.
>> 
>> Well, perhaps.  L=2 was designed to exactly counteract delayed ACKs.
>> So, it isn't exactly a new magic number.  We could wave our hands
>> and say "5 seems OK" or "10 seems OK" or whatever.  And, I am sure
>> we could come up with something that folks felt was fine.  However,
>> my feeling is that if we want to worry about bursts then let's worry
>> about bursts in some generic way.  And, if you have some way to deal
>> with bursts then L isn't needed.  And, if you don't have a way to
>> deal with bursts then a conservative L seems fine.  But, perhaps
>> putting the effort into a generic mechanism instead of cooking yet
>> another magic number we need to periodically refresh is probably a
>> better way to spend effort.
>> 
>> Yes, I very much agree that "putting the effort into a generic mechanism instead of cooking yet another magic number we need to periodically refresh is probably a better way to spend effort." 
>> 
>> >>   - During slow starts that follow RTOs there is a general
>> >>     problem that just because the window slides by X bytes
>> >>     doesn't say anything about the *network*, as that sliding can
>> >>     happen because much of the data was likely queued for the
>> >>     application on the receiver.  So, e.g., you can RTO and send
>> >>     one packet and get an ACK back that slides the window 10
>> >>     packets.  That doesn't mean 10 packets left.  It means one
>> >>     packet left the network and nine packets are eligible to be
>> >>     sent to the application.  So, it is not OK to set the cwnd to
>> >>     1+10 = 11 packets in response to this ACK.  Here L should
>> >>     exist and be 1.
>> >
>> > AFAICT this argument only applies to non-SACK connections. For
>> > connections with SACK (the vast majority of connections over the
>> > public Internet and in datacenters), it is quite feasible to
>> > determine how many packets really left the network (and Linux TCP
>> > does this; see below).
>> 
>> If you have an accurate way to figure out how many of the ACKed
>> bytes left the network and how many were just buffered at the
>> receiver then I see no problem with increasing based on byte count
>> as you do in the initial slow start.
>> 
>> (I don't remember what the paper you cite says, but my guess is it's
>> often the case that L=1 is a reasonable substitute for something
>> complicated here.  But, perhaps I am running the simulation in my
>> head wrong ... it has been a while, admittedly!)
>> 
>> > Yes, offload mechanisms are so pervasive in practice,
>> 
>> I am trying to build a mental model here.  How pervasive would you
>> guess these are?  And, where in the network?  I have assumed that
>> they are for sure pervasive in data centers and server farms, but
>> not for the vast majority of Internet-connected devices.
>> 
>> From my impression looking at public Internet traces, aggregation mechanisms that cause TCP ACKs for more than 2 segments are very common. I suspect that's because the majority of public Internet traffic these days has a bottleneck that is either wifi, cellular, or DOCSIS, and all of these have a shared medium with a large latency overhead for L2 MAC control of who gets to speak next. So a lot of batching happens, both in big batches of data that arrive at the client in the same L2 medium time slot, and big batches of ACKs that accumulate while the client waits (often several milliseconds, sometimes even tens of milliseconds) for its chance to send a big stretch ACK or batch of ACKs.
>> 
>> This brings up a related point: even if there is some ABC-style per-ACK L limit on cwnd increases, the time structure of most public Internet ACK streams is massively bursty because of these aggregation mechanisms inherent in L2 behavior on most public Internet bottlenecks (wifi, cellular, DOCSIS). So even if there is a limit L that limits the per-ACK behavior to be smooth, if there is no pacing of data segments then the data transmit time structure will still be bursty because the ACK arrivals these days are very bursty. 
>> 
>> best regards,
>> neal
> 
> _______________________________________________
> tcpm mailing list
> tcpm@ietf.org
> https://www.ietf.org/mailman/listinfo/tcpm