Re: [tcpm] [EXTERNAL] Re: Seeking WG opinions on ACKing ACKs with good cause

Bob Briscoe <> Mon, 12 July 2021 20:47 UTC

To: Neal Cardwell <>
Cc:, Mirja Kuehlewind <>, Yuchung Cheng <>, Richard Scheffenegger <>, Christian Huitema <>, Ilpo Jarvinen <>, Ilpo Järvinen <>
From: Bob Briscoe <>
Date: Mon, 12 Jul 2021 21:47:28 +0100
List-Id: TCP Maintenance and Minor Extensions Working Group <>


On 12/07/2021 15:35, Neal Cardwell wrote:
> On Mon, Jul 12, 2021 at 8:30 AM Bob Briscoe <> wrote:
>     Neal,
>     On 11/07/2021 19:28, Neal Cardwell wrote:
>>     On Sun, Jul 11, 2021 at 11:55 AM Christian Huitema <> wrote:
>>         > On Jul 11, 2021, at 4:42 AM, Bob Briscoe <> wrote:
>>     ...
>>         > The implementation will have its own ACK ratio - that's out
>>         of scope of AccECN, except to set this max of 6 CEs, which is
>>         to mitigate wrap of the 3-bit counter of CE-marks (which in
>>         the worst case of 100% marking in the data direction could
>>         then induce 1 ACK per 6 packets). This shouldn't limit
>>         forward performance, because it only increases the reverse
>>         ACK rate if there is heavy congestion in the forward path,
>>         when the Data Sender should be reducing the forward rate anyway.
>>         1 ACK per 6 packets would cause performance issues on high
>>         speed links. Common setup is "4 to 8 ACK per RTT", which
>>         means intervals much larger than 6 packets.
>>     I share this concern about the AccECN requirement of one ACK per
>>     6 CE-marked data segments potentially causing performance issues
>>     with high-speed links. Today, with most high-speed last-hop link
>>     technologies (wifi, cellular, Ethernet) TCP receivers often
>>     receive (from lower layers of the hardware and software
>>     networking stack) large aggregates (often up to 64KBytes or
>>     around 44 packets), and generate a single ACK for that aggregate.
>>     AFAICT changing this scenario to require 1 packet per 6 CE-marked
>>     data segments means that during CE-marked scenarios receiving
>>     such an aggregate would require ceil(44/6) = 8 ACKs, increasing
>>     the number of ACKs by up to 8x.
>>     I have two main concerns in that scenario:
>>     (1) CPU load: This seems like it would impose much higher CPU
>>     load on the data receiver, at just the moment when the receiver
>>     may already be under stress due to receiving data near the
>>     maximum rate of the link.
>>     (2) Congestion and ACK loss: This requirement for generating up
>>     to 8 ACKs per aggregate seems like it is likely to produce a
>>     tight burst of 8 ACKs, which is going to increase congestion and
>>     increase the odds of losing at least one ACK, which is going to
>>     cause accuracy problems for "Accurate" ECN. :-)
>     [BB] I hadn't intended this wording to apply in response to a
>     large burst. I'd want to alter the wording to allow just one ACK
>     in this case.
> In the case where there is just one ACK for a big CE-marked burst of 
> >= 8 segments, does that mean the result is:
> (a) the ACE field loses accuracy, or
> (b) the TCP connection interpreting the ACE field should presume the 
> ACE field probably wrapped and estimate the number of CE-marked 
> segments using the number of SACKed/ACKed segments, rather than the 
> ACE field?

[BB] Answers:
a) Yes, (obviously) less accurate than with the Option.
b) Sort-of. Just presuming the field wrapped as many times as possible 
gives very poor accuracy. Ilpo and I have been developing heuristics to 
deal with cases where the number of SACKed/ACKed segments is >=8.
I'm planning to report the results in the meeting.
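For illustration, here is a minimal sketch of how a Data Sender might decode the 3-bit ACE counter: minimum-increment decoding, plus an optional conservative assumption that the field wrapped when an ACK covers 8 or more segments. The function and parameter names are hypothetical, and this is not the heuristic we will present; it just shows the arithmetic involved.

```python
def ace_delta(ace_old, ace_new, segs_acked, assume_wrap=False):
    """Estimate newly CE-marked segments from the 3-bit ACE counter.

    Illustrative sketch only. The minimum increment consistent with the
    two counter values is the modular difference. If this ACK newly
    covers >= 8 segments, the counter may have wrapped unseen; a
    congestion-conservative decoder can add whole wraps of 8, as long
    as the total stays within the number of segments acknowledged.
    """
    d = (ace_new - ace_old) % 8          # minimum consistent increment
    if assume_wrap and segs_acked >= 8:
        # add as many whole wraps (of 8) as still fit within segs_acked
        d += 8 * ((segs_acked - d) // 8)
    return d
```

The `assume_wrap` branch is where the accuracy question bites: always assuming the maximum number of wraps grossly over-counts CE marks under heavy ACK coalescing, which is why better heuristics are needed.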

>>     I understand that this proposed "ACK per 6 CE-marked data
>>     segments" rule is necessary to avoid issues with the 3-bit ACE
>>     field wrapping. So IMHO this is one of the good arguments against
>>     including the ACE field in the AccECN design.
>     [BB] I think the subtext here is a preference for the DCTCP style
>     of feedback vs ACE (when there is no AccECN TCP Option). I don't
>     think this particular issue is any different between the two.
> Yes, when the AccECN TCP Option is not present, I'd prefer DCTCP-style 
> feedback rather than the ACE field.
>     With DCTCP feedback, when a large data burst like this arrives, if
>     there are transitions to and from CE marking within the burst,
>     does Linux DCTCP generate an ACK for each transition, and for
>     every n repetitions of CE not ECT, like the RFC says it should? Or
>     does the implementation just pop out one ACK at the end?
> Yes, with Linux DCTCP (or BBRv2), when there is a big burst, there is 
> an ACK for each CE<->non-CE transition. They also handle the case 
> where there is a CE<->non-CE transition while a delayed ACK is 
> pending. This requires cooperation from a few different layers.
> (1) the NIC hardware LRO aggregation is supposed to not aggregate 
> across CE<->non-CE transitions. I believe the de facto spec on NIC 
> aggregation with respect to IP fields is the following, which mandates 
> that:
> <>
> (2) the Linux software GRO aggregation does not aggregate across 
> CE<->non-CE transitions; see inet_gro_receive() checking the  ToS 
> bytes: (iph->tos ^ iph2->tos)
> (3) the DCTCP and BBRv2 code use dctcp_ece_ack_update() to handle the 
> case where there is a CE<->non-CE transition while a delayed ACK is 
> pending

[BB] Understood. So doesn't my point stand that the existing 
hardware/software combination already often/typically emits numerous 
ACKs in response to arrival of a large data burst? Indeed, designing the 
ACE field as a counter was deliberately intended to reduce that.

Did you see the text we included in the current rev (-14 in Feb'21) that 
explains this:
It also explains how the new approach can be handled while hardware is 
not available. Essentially, assuming you start with a subset of users 
negotiating AccECN, those sessions would switch receive offload to 
software-only, until hardware became available. But I'd appreciate your 
thoughts on the whole section.

I've updated my local copy of the 'rules' in our proposed new rev of 
draft-15, to replace 'immediately send' and explain better what it 
means. Here it is:

   Data Receiver Safety Procedures

    The following rules define when a Data Receiver in AccECN mode emits
    an ACK:

    Change-Triggered ACKs:  An AccECN Data Receiver SHOULD emit an ACK
       whenever a data packet marked CE arrives after the previous packet
       was not CE.

       Even though this rule is stated as a "SHOULD", it is important for
       a transition to trigger an ACK if at all possible.  The only valid
       exception to this rule is given below these bullets.

       For the avoidance of doubt, this rule is deliberately worded to
       apply solely when _data_ packets arrive, but the comparison with
       the previous packet includes any packet, not just data packets.

    Increment-Triggered ACKs:  An AccECN Data Receiver MUST emit an ACK
       if 'n' CE marks have arrived since the previous ACK.  If there is
       new data to acknowledge, 'n' SHOULD be 2.  If there is no new data
       to acknowledge, 'n' SHOULD be 3 and MUST be no less than 3.  In
       either case, 'n' MUST be no greater than 6.

    The above rules for when to send an ACK are designed to be
    complemented by those in Section, which concern whether the
    AccECN TCP Option ought to be included on ACKs.

    If the arrivals of a number of data packets are all processed as one
    event, e.g. using large receive offload (LRO) or generic receive
    offload (GRO), both the above rules SHOULD be interpreted as
    requiring multiple ACKs to be emitted back-to-back (for each
    transition and for each repetition by 'n' CE marks).  If this is
    problematic for high performance, either rule can be interpreted as
    requiring just a single ACK at the end of the whole receive event.

    Even if a number of data packets do not arrive as one event, the
    'Change-Triggered ACKs' rule could sometimes cause the ACK rate to be
    problematic for high performance (although high performance protocols
    such as DCTCP already successfully use change-triggered ACKs).  The
    rationale for change-triggered ACKs is so that the Data Sender can
    rely on them to detect queue growth as soon as possible, particularly
    at the start of a flow.  The approach can lead to some additional
    ACKs but it feeds back the timing and the order in which ECN marks
    are received with minimal additional complexity.  If CE marks are
    infrequent, or there are multiple marks in a row, the additional load
    will be low.  Other marking patterns could increase the load
    significantly.  One possible compromise would be for the receiver to
    heuristically detect whether the sender is in slow-start, then to
    implement change-triggered ACKs while the sender is in slow-start,
    and offload otherwise.

    With ECN-capable pure ACKs [I-D.ietf-tcpm-generalized-ecn], the
    'Increment-Triggered ACKs' rule could cause ECN-marked pure ACKs to
    trigger further ACKs. [...snipped to avoid muddling this thread
                  with the "ACKing ACKs without good cause" thread]
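To make the interplay of the two rules concrete, here is a sketch of the per-packet decision they imply at the Data Receiver. This is illustrative pseudologic, not the draft's normative text; the function and parameter names are hypothetical.

```python
def should_ack(prev_pkt_ce, pkt_is_data, pkt_ce, ce_since_last_ack, new_data):
    """Illustrative sketch of the two ACK-emission rules quoted above.

    Change-Triggered: a CE-marked *data* packet arriving after a non-CE
    packet (of any kind) SHOULD trigger an ACK.
    Increment-Triggered: an ACK MUST be emitted once 'n' CE marks have
    arrived since the previous ACK; 'n' SHOULD be 2 when there is new
    data to acknowledge and 3 otherwise (and never more than 6).
    """
    if pkt_is_data and pkt_ce and not prev_pkt_ce:
        return True                       # change-triggered ACK
    n = 2 if new_data else 3              # the SHOULD values of 'n'
    return ce_since_last_ack >= n         # increment-triggered ACK
```

Under LRO/GRO, the draft text above allows either calling such logic per coalesced packet (emitting ACKs back-to-back) or collapsing to a single ACK per receive event.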

>     Ilpo & I have been doing experiments with high levels of ACK
>     coalescing causing the ACE field seen by the Data Sender to wrap
>     multiple times under high congestion. For example, 1 ACK per ~4ms
>     is one common scheme, which would result in about 1 ACK for 34
>     data packets at just 100Mb/s.
> Great. Yes, certainly that level of aggregation is important to test.
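[For reference, the arithmetic behind that "~34 data packets at 100Mb/s" figure, as a back-of-envelope sketch assuming 1500-byte packets; the values are illustrative only.]

```python
# Sanity check of "1 ACK per ~4ms => ~1 ACK per 34 packets at 100Mb/s",
# assuming 1500-byte packets (illustrative values, not measured data).
LINK_BPS = 100e6        # 100 Mb/s link rate
PKT_BYTES = 1500        # assumed packet size
ACK_INTERVAL_S = 0.004  # one ACK per ~4 ms

pkts_per_sec = LINK_BPS / (8 * PKT_BYTES)      # ~8333 packets/s
pkts_per_ack = pkts_per_sec * ACK_INTERVAL_S   # ~33-34 packets per ACK
```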
>>     The other main concerns I have with the ACE field are:
>>     o Complexity
>     [BB] I understand there could be a tradeoff between complexity and
>     accuracy. But I think the complexity of ACE and DCTCP feedback are
>     pretty much the same but, with even low levels of ACK coalescing,
>     the accuracy of ACE is superior. Nonetheless, I understand that
>     deploying something different to DCTCP when you've already got
>     DCTCP could involve deployment complexity.
> IMHO the ACE field is more complex, due to:
> o context-dependent interpretation of the header bits used by ACE

[BB] That would, I think, be the same for the DCTCP feedback scheme (it 
is only avoidable today in a DC, where you can hard-code the scheme 
rather than negotiate it).

> o logic to handle wraps in the ACE field and estimate what the real 
> increment to the ACE field was in the case of possible ACK loss

[BB] That's slightly true ;)

I would say it like this:
* With DCTCP feedback, a simple line of code can fill in the missing 
ACKs: it assumes the marking continued as per the last signal. This is 
very inaccurate under high levels of ACK coalescing and there is no 
obvious way to make it more accurate.
* With ACE feedback a simple line of code can also fill in the missing 
ACKs. However, in contrast to DCTCP, you can add code to make ACE more 
accurate if you are prepared to add the extra complexity (but you don't 
have to).
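The contrast between the two fill-in strategies could be sketched like this (hypothetical helper names; the wrap arithmetic assumes the 3-bit ACE field):

```python
def dctcp_fill(last_ce_flag, segs_missing):
    """DCTCP-style gap fill across lost/coalesced ACKs: assume marking
    continued as per the last echoed CE signal (all-or-nothing).
    Illustrative sketch only."""
    return segs_missing if last_ce_flag else 0

def ace_fill(ace_old, ace_new):
    """ACE-style gap fill: the modular difference of the 3-bit counter
    still lower-bounds the CE marks covered by the lost ACKs
    (minimum-increment decoding). Illustrative sketch only."""
    return (ace_new - ace_old) % 8
```

The point of the comparison: `dctcp_fill` has no knob to turn, whereas `ace_fill` is the simple baseline on top of which extra wrap-estimation logic can optionally be layered for better accuracy.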

> o interactions with drivers or hardware that already have assumptions 
> about the header bits used for ACE (e.g. that the CWR bit should be 
> cleared for the first N-1 segments in a TSO burst and only set on the 
> last segment of the TSO burst)


>>     o Redundancy with respect to the AccECN counter options
>     [BB] The redundancy is intended. For cases where the TCP Option
>     doesn't traverse the path, or where TCP option space is limited.
> I realize the redundancy is intentional, but the redundant code 
> does impose a maintenance cost.
>>     o Potential problems with middleboxes
>     [BB] Well, we've tested the 3 header bits over millions of paths
>     without problems. But yes, there are billions more to test.
>>     o Known problems with NICs and drivers (based on Ilpo's nice
>>     talk, "Accurate ECN Linux Implementation: Experiences and
>>     Challenges", at the April 2020 TCPM interim meeting)
>     [BB] I understood that talk as concluding there weren't really
>     problems. What specific problem do you have in mind?
> Mainly I'm concerned about:
> o interactions with drivers or hardware that already have assumptions 
> about the header bits used for ACE (e.g. that the CWR bit should be 
> cleared for the first N-1 segments in a TSO burst and only set on the 
> last segment of the TSO burst)
> It sounds like since the ACE field redefines the semantics of some 
> long-standing TCP header fields, there will be a long trial-and-error 
> period of finding the drivers and NICs that are incompatible with the 
> ACE field, and working around those.

[BB] Yes. You can't make an omelette without breaking eggs. Meaning, to 
get significantly lower delay over the public Internet, we can't expect 
it all just to fall into place with minimal effort. Given we have worked 
out a way to deploy pretty accurate ECN feedback without the option, it 
would be a shame to decide not to, just because there /might/ be 
problems. There always /might/ be problems, indeed there nearly always are.

If DCTCP feedback were used over the public Internet, it would have to 
be included in some form of negotiation framework, probably like AccECN, 
and that would probably also trip over some hardware incompatibilities. 
Why make half-hearted changes and still have potential deployment problems?

I really think we need to question the prevailing preference for 
inconsequential changes, when we end up with so much grief making /any/ 
change at all.

>>     I realize that if AccECN does not have the ACE field feature,
>>     then AccECN and TCP L4S will not be usable on paths with
>>     middleboxes that strip the AccECN counter options. But IMHO
>>     living without the ACE field is preferable. IMHO it's acceptable
>>     to say that L4S can only be used with (a) QUIC, or (b) TCP
>>     connections where no middleboxes are stripping AccECN options.
>     [BB] When you say 'living without the ACE field', alongside the
>     AccECN TCP Option, would you leave classic ECN feedback? Or put
>     DCTCP feedback in its place?
> When 'living without the ACE field', I mean either using the AccECN 
> TCP Option or DCTCP-style feedback.
>     I don't think there's any case for using classic ECN feedback
>     within AccECN. I see the competition as between ACE and DCTCP
>     feedback, each optionally with the AccECN TCP Option.
> Agreed. I agree there's no case for using classic ECN feedback within 
> AccECN.
>     If DCTCP feedback were a stop-gap until we could get good
>     traversal of the TCP Option, it might have some merit, but I think
>     DCTCP isn't going to work well with the level of ACK coalescing in
>     the Internet. However, this is very difficult to judge unless we
>     do large scale A-B experiments on the real Internet (and
>     potentially within DCs). I'll think further about enabling some
>     way to do that, possibly within AccECN's negotiation framework.
>     But today, I have to prioritize for the draft deadline.
> I agree it's unclear whether DCTCP-style feedback will work well over 
> the public Internet, due to ACK loss and the variety of aggregation 
> mechanisms in play. That is why in my previous email in this thread I 
> was mainly presuming that if the AccECN TCP Option is not forwarded by 
> the path then the flow might have to just disable AccECN and L4S.

[BB] If you're saying a third possibility would be to have no ECN 
feedback in the header, only via an option, that seems perverse, given 
the ECN header bits would then be idle. And, if you consumed the option 
space for SACK, timestamps, (even MPTCP) etc. you might have no ECN 
feedback at all.

The ability to have wider early deployment is important for the economic 
motivations of everyone involved, so not to be dismissed lightly.

I'll think whether we can find some way to try out both ACE and DCTCP 
feedback within a realistic protocol so that all the deployment niggles 
and performance can be tested and compared. But if you also agree that 
DCTCP probably won't work well over the Internet, it might be a 
non-starter anyway (but worth checking if not too onerous).


> cheers,
> neal

Bob Briscoe