Re: [tcpm] [EXTERNAL] Re: Seeking WG opinions on ACKing ACKs with good cause

Neal Cardwell <ncardwell@google.com> Mon, 12 July 2021 14:36 UTC

From: Neal Cardwell <ncardwell@google.com>
Date: Mon, 12 Jul 2021 10:35:38 -0400
To: Bob Briscoe <ietf@bobbriscoe.net>
Cc: tcpm@ietf.org, Mirja Kuehlewind <ietf@kuehlewind.net>, Yuchung Cheng <ycheng@google.com>, Richard Scheffenegger <rs.ietf@gmx.at>, Christian Huitema <huitema@huitema.net>, Ilpo Jarvinen <ilpo.jarvinen@cs.helsinki.fi>, =?UTF-8?Q?Ilpo_J=C3=A4rvinen?= <ilpo.jarvinen@helsinki.fi>
Archived-At: <https://mailarchive.ietf.org/arch/msg/tcpm/QU56dMsWsHJYzvjNDcKCnRGR-Ik>
Subject: Re: [tcpm] [EXTERNAL] Re: Seeking WG opinions on ACKing ACKs with good cause

On Mon, Jul 12, 2021 at 8:30 AM Bob Briscoe <ietf@bobbriscoe.net> wrote:

> Neal,
>
> On 11/07/2021 19:28, Neal Cardwell wrote:
>
>
>
> On Sun, Jul 11, 2021 at 11:55 AM Christian Huitema <huitema@huitema.net>
> wrote:
>
>>
>>
>> > On Jul 11, 2021, at 4:42 AM, Bob Briscoe <ietf@bobbriscoe.net> wrote:
>>
> ...
>
>> > The implementation will have its own ACK ratio - that's out of scope of
>> AccECN, except to set this max of 6 CEs, which is to mitigate wrap of the
>> 3-bit counter of CE-marks (which in the worst case of 100% marking in the
>> data direction could then induce 1 ACK per 6 packets). This shouldn't limit
>> forward performance, because it only increases the reverse ACK rate if
>> there is heavy congestion in the forward path, when the Data Sender should
>> be reducing the forward rate anyway.
>>
>> 1 ACK per 6 packets would cause performance issues on high speed links.
>> Common setup is "4 to 8 ACK per RTT", which means intervals much larger
>> than 6 packets.
>
>
> I share this concern about the AccECN requirement of one ACK per 6
> CE-marked data segments potentially causing performance issues with
> high-speed links. Today, with most high-speed last-hop link technologies
> (wifi, cellular, Ethernet) TCP receivers often receive (from lower layers
> of the hardware and software networking stack) large aggregates (often up
> to 64KBytes, or around 44 packets) and generate a single ACK for that
> aggregate. AFAICT, changing this scenario to require one ACK per 6
> CE-marked data segments means that receiving such an aggregate during
> CE-marked periods would require ceil(44/6) = 8 ACKs, increasing the number
> of ACKs by up to 8x.
>
> I have two main concerns in that scenario:
>
> (1) CPU load: This seems like it would impose much higher CPU load on the
> data receiver, at just the moment when the receiver may already be under
> stress due to receiving data near the maximum rate of the link.
>
> (2) Congestion and ACK loss: This requirement for generating up to 8 ACKs
> per aggregate seems like it is likely to produce a tight burst of 8 ACKs,
> which is going to increase congestion and increase the odds of losing at
> least one ACK, which is going to cause accuracy problems for "Accurate"
> ECN. :-)
>
>
> [BB] I hadn't intended this wording to apply in response to a large burst.
> I'd want to alter the wording to allow just one ACK in this case.
>

In the case where there is just one ACK for a big CE-marked burst of >= 8
segments, does that mean the result is:

(a) the ACE field loses accuracy, or

(b) the TCP connection interpreting the ACE field should presume the ACE
field probably wrapped and estimate the number of CE-marked segments using
the number of SACKed/ACKed segments, rather than the ACE field?
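To make option (b) concrete, here is a rough sketch (my illustration, not the normative AccECN algorithm or anything from the thread) of how a sender might estimate the CE increment from the 3-bit ACE field, falling back to the count of newly ACKed segments to guess at wraps:

```python
def estimate_ce_delta(ace_now, ace_prev, newly_acked_segs):
    """Estimate CE-marked segments covered by this ACK from the 3-bit ACE field.

    Illustrative heuristic only: ACE is a counter mod 8, so any multiple
    of 8 CE marks between ACKs is invisible in the field itself.
    """
    delta = (ace_now - ace_prev) % 8
    if newly_acked_segs < 8:
        # Too few segments for the counter to have wrapped past `delta`.
        return delta
    # Possible wrap: pessimistically assume the largest increment that is
    # consistent with the number of segments this ACK covers.
    wraps = (newly_acked_segs - delta) // 8
    return delta + 8 * wraps
```

Under this kind of heuristic, one ACK covering a 44-segment all-CE burst with an unchanged ACE value would be read as 40 CE marks rather than 44, i.e. option (a)'s loss of accuracy shows up even in the option (b) estimator.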



> I understand that this proposed "ACK per 6 CE-marked data segments" rule
> is necessary to avoid issues with the 3-bit ACE field wrapping. So IMHO
> this is one of the good arguments against including the ACE field in the
> AccECN design.
>
>
> [BB] I think the subtext here is a preference for the DCTCP style of
> feedback vs ACE (when there is no AccECN TCP Option). I don't think this
> particular issue is any different between the two.
>

Yes, when the AccECN TCP Option is not present, I'd prefer DCTCP-style
feedback rather than the ACE field.


> With DCTCP feedback, when a large data burst like this arrives, if there
> are transitions to and from CE marking within the burst, does Linux DCTCP
> generate an ACK for each transition, and for every n repetitions of CE not
> ECT, like the RFC says it should? Or does the implementation just pop out
> one ACK at the end?
>

Yes, with Linux DCTCP (or BBRv2), when there is a big burst, there is an
ACK for each CE<->non-CE transition. They also handle the case where there
is a CE<->non-CE transition while a delayed ACK is pending. This requires
cooperation from a few different layers.

(1) the NIC hardware LRO aggregation is supposed to not aggregate across
CE<->non-CE transitions. I believe the de facto spec on NIC aggregation
with respect to IP fields is the following, which mandates this behavior:

https://docs.microsoft.com/en-us/windows-hardware/drivers/network/updating-the-ip-headers-for-coalesced-segments

(2) the Linux software GRO aggregation does not aggregate across
CE<->non-CE transitions; see inet_gro_receive() comparing the ToS bytes:
(iph->tos ^ iph2->tos)

(3) the DCTCP and BBRv2 code use dctcp_ece_ack_update() to handle the case
where there is a CE<->non-CE transition while a delayed ACK is pending.


>
> Ilpo & I have been doing experiments with high levels of ACK coalescing
> causing the ACE field seen by the Data Sender to wrap multiple times under
> high congestion. For example, 1 ACK per ~4ms is one common scheme, which
> would result in about 1 ACK for 34 data packets at just 100Mb/s.
>

Great. Yes, certainly that level of aggregation is important to test.
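For reference, Bob's "about 34 data packets" figure follows from simple arithmetic, sketched below (my back-of-envelope check, assuming a 1460-byte MSS, which is not stated in the thread):

```python
def pkts_per_ack(link_bps, ack_interval_s, mss_bytes=1460):
    """Rough number of full-sized segments covered by one timer-driven ACK."""
    bytes_per_interval = link_bps * ack_interval_s / 8  # bits -> bytes
    return bytes_per_interval / mss_bytes

# One ACK per ~4 ms at 100 Mb/s covers roughly 34 full-sized segments,
# i.e. the 3-bit ACE field could wrap several times between ACKs.
```

At that coalescing level a single all-CE interval would wrap the mod-8 ACE counter about four times per ACK.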


>
>
> The other main concerns I have with the ACE field are:
>
> o Complexity
>
>
> [BB] I understand there could be a tradeoff between complexity and
> accuracy. But I think the complexity of ACE and DCTCP feedback are pretty
> much the same but, with even low levels of ACK coalescing, the accuracy of
> ACE is superior. Nonetheless, I understand that deploying something
> different to DCTCP when you've already got DCTCP could involve deployment
> complexity.
>

IMHO the ACE field is more complex, due to:

o context-dependent interpretation of the header bits used by ACE

o logic to handle wraps in the ACE field and estimate what the real
increment to the ACE field was in the case of possible ACK loss

o interactions with drivers or hardware that already have assumptions about
the header bits used for ACE (e.g. that the CWR bit should be cleared for
the first N-1 segments in a TSO burst and only set on the last segment of
the TSO burst)
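To illustrate the third point: because the ACE counter is packed into the AE, CWR, and ECE header bits, any offload path that still applies pre-AccECN per-bit semantics rewrites the counter. A sketch of the hazard (my illustration of the behavior the email describes; the exact per-segment CWR treatment varies by NIC and is assumed here, not taken from a datasheet):

```python
def ace_value(ae, cwr, ece):
    """Pack the 3-bit ACE counter (AE is the most significant bit)."""
    return (ae << 2) | (cwr << 1) | ece

def legacy_tso_split(ae, cwr, ece, n_segs):
    """Hypothetical legacy TSO behavior: clear CWR on the first N-1
    segments of a burst and set it only on the last, as the email
    describes. Returns the ACE value each emitted segment would carry."""
    return [ace_value(ae, cwr if i == n_segs - 1 else 0, ece)
            for i in range(n_segs)]
```

For example, a superpacket whose header encodes ACE = 6 (AE=1, CWR=1, ECE=0) would be emitted as segments carrying ACE values 4, 4, 6: the counter seen by the peer silently changes mid-burst even though no CE marks occurred.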



>
>
>
> o Redundancy with respect to the AccECN counter options
>
>
> [BB] The redundancy is intended. For cases where the TCP Option doesn't
> traverse the path, or where TCP option space is limited.
>

I realize the redundancy is intentional, but the redundant code does impose
a maintenance cost.


>
> o Potential problems with middleboxes
>
>
> [BB] Well, we've tested the 3 header bits over millions of paths without
> problems. But yes, there are billions more to test.
>
>
> o Known problems with NICs and drivers (based on Ilpo's nice talk, "Accurate
> ECN Linux Implementation: Experiences and Challenges", at the April 2020
> TCPM interim meeting)
>
>
> [BB] I understood that talk as concluding there weren't really problems.
> What specific problem do you have in mind?
>

Mainly I'm concerned about:

o interactions with drivers or hardware that already have assumptions about
the header bits used for ACE (e.g. that the CWR bit should be cleared for
the first N-1 segments in a TSO burst and only set on the last segment of
the TSO burst)

It sounds like, since the ACE field redefines the semantics of some
long-standing TCP header bits, there will be a long trial-and-error period
of finding the drivers and NICs that are incompatible with the ACE field
and working around them.


> I realize that if AccECN does not have the ACE field feature, then AccECN
> and TCP L4S will not be usable on paths with middleboxes that strip the AccECN
> counter options. But IMHO living without the ACE field is preferable. IMHO
> it's acceptable to say that L4S can only be used with (a) QUIC, or (b) TCP
> connections where no middleboxes are stripping AccECN options.
>
>
> [BB] When you say 'living without the ACE field', alongside the AccECN TCP
> Option, would you leave classic ECN feedback? Or put DCTCP feedback in its
> place?
>

When 'living without the ACE field', I mean either using the AccECN TCP
Option or DCTCP-style feedback.


> I don't think there's any case for using classic ECN feedback within
> AccECN. I see the competition as between ACE and DCTCP feedback, each
> optionally with the AccECN TCP Option.
>

Agreed. I agree there's no case for using classic ECN feedback within
AccECN.


> If DCTCP feedback were a stop-gap until we could get good traversal of the
> TCP Option, it might have some merit, but I think DCTCP isn't going to work
> well with the level of ACK coalescing in the Internet. However, this is
> very difficult to judge unless we do large scale A-B experiments on the
> real Internet (and potentially within DCs). I'll think further about
> enabling some way to do that, possibly within AccECN's negotiation
> framework. But today, I have to prioritize for the draft deadline.
>

I agree it's unclear whether DCTCP-style feedback will work well over the
public Internet, given ACK loss and the variety of aggregation mechanisms
in play. That is why, in my previous email in this thread, I was mainly
presuming that if the AccECN TCP Option is not forwarded by the path, the
flow might have to simply disable AccECN and L4S.

cheers,
neal