Re: [tsvwg] Status of ECN encapsulation drafts (i.e., stuck)

Bob Briscoe <ietf@bobbriscoe.net> Fri, 13 March 2020 13:03 UTC

To: Jonathan Morton <chromatix99@gmail.com>
Cc: "Black, David" <David.Black@dell.com>, "tsvwg@ietf.org" <tsvwg@ietf.org>
References: <CE03DB3D7B45C245BCA0D24327794936306F8925@MX307CL04.corp.emc.com> <2873ab79-19ad-0541-e3a4-d1d28dbc7ba0@bobbriscoe.net> <B6D58310-41E0-4172-B555-D28E7926A0B5@gmail.com>
From: Bob Briscoe <ietf@bobbriscoe.net>
Message-ID: <3ee6e427-9dc9-e885-21a9-df9e35d99006@bobbriscoe.net>
Date: Fri, 13 Mar 2020 13:03:15 +0000
In-Reply-To: <B6D58310-41E0-4172-B555-D28E7926A0B5@gmail.com>
Archived-At: <https://mailarchive.ietf.org/arch/msg/tsvwg/B2Gnb-jYPiZFYuA3YXNRhktKaT4>

Jonathan,

I have had to write a very long reply to explain why you are incorrect 
on this.
So, for the list, I'll summarize here at the top.

[@Jonathan, Please don't reply to the summary alone - not until you've 
read the explanation right through. If I've convinced you, please don't 
just go silent - please confirm that you were wrong, so that we can 
unstick this blockage quickly.]

Problem: RFC 3168 mandates a reassembly process for CE markings that is 
incorrect.
Solution: The proposed updated requirement corrects RFC3168 (for all 
these cases), but allows the original behaviour (avoids making existing 
implementations non-compliant).

This is orthogonal to the debates about scalable, classic, SCE, L4S, or 
whatever. RFC3168 is incorrect and the proposal is correct for all types 
of congestion controls and all meanings of ECN codepoints. [Your very 
last point about re-assembly with ECT1 does have a downside for SCE but 
not for L4S, so I have started a new thread for that.]

David Black and I are debating where a proposed correction to the 
reassembly requirement should go (these drafts?, a new draft?, etc). 
This email addresses your questioning of the correctness of the new 
requirement in the first place.


Specific problem: End-to-end, taking into account both fragmentation and 
reassembly, the "logical OR" reassembly process required in RFC3168 
roughly *doubles* *both* the number *and* the proportion of CE marked 
packets *and* the number and proportion of CE-marked bytes (relative to 
a parallel flow with slightly smaller packets that is not fragmented).

Specific solution: The proposed corrected approach *preserves all four* 
of the above properties: the number and proportion of CE marked packets 
and the number and proportion of CE-marked bytes, again taking into 
account both fragmentation and reassembly end-to-end. Actually it 
inherently slightly inflates all of them in proportion to the extra 
inner headers on the fragments, which is an even better property.

Also, a proposed example implementation is as simple as, or simpler than, 
what RFC 3168 needs (it requires one additional item of state per 
interface, rather than per packet).

See explanation inline...

On 10/03/2020 23:02, Jonathan Morton wrote:
>> On 10 Mar, 2020, at 8:47 pm, Bob Briscoe <ietf@bobbriscoe.net> wrote:
>>
>>    This specification does not rule out the logical OR approach of RFC3168.
>> .  So a tunnel egress MAY CE-mark a reassembled packet if any of
>>     the fragments are CE-marked (and none are Not-ECT).  However, this
>>     approach could result in reduced link utilization, or bias against
>>     flows that are fragmented relative to those that are not.
> As I have stated in the past on this subject, I disagree with this characterisation.  It misconstrues the meaning of CE and the definition of steady state involving it, both for conventional congestion control and for the high-fidelity version.  It also leads to a more complex reassembly implementation being "preferred" than is necessary, which may deter implementors.
The wording also allows the "logical OR" approach if you believe it is 
simpler. Nonetheless, the two example approaches given in this and the 
previous version of the draft are both as simple as, or simpler than, 
'logical OR'.
>
> For conventional congestion control, the steady state is defined in terms of the time interval (during which cwnd growth occurs) between RTTs containing at least one CE mark (and/or packet loss).

When the network introduces ECN marking it doesn't know which of its 
marks will be considered by the transport to be within the same RTT. 
Indeed, in general, the network doesn't even know which marks will be 
seen by the same L4 flow (FQ+AQM is an exception). All an AQM has to do 
is increase or decrease the marking probability in response to growth or 
shrinkage of the queue. Decreasing or increasing the gap between marks 
is equivalent.

Your understanding about "the time interval between RTTs containing at 
least one CE mark/loss" is correct from the point of view of a 
conventional congestion controller. But it's wrong to say that "the 
control loop is not properly closed..." unless "...each single CE mark 
is attached to the same packet as the marked fragment belonged to".

If reassembly of marked fragments results in fewer marked packets, as 
long as timeliness is preserved (see next), the reassembly process just 
becomes a constant multiplier within the control loop. The control loop 
is no less closed as a result.

The Pythonesque "every congestion event is sacred" mindset would only 
apply if the AQM somehow knew what rate it wanted each flow to adopt, 
then applied a pre-calculated non-adaptive spacing between markings to 
make that so. That's (obviously) not how the Internet's control loops 
work. The whole point is they continually adapt.

> It is necessary for a single CE mark to be delivered promptly, ie. attached to the same packet as the marked fragment belonged to, in order to minimise growth overshoot and keep the control loop properly closed.

Yup. Timeliness is important - that's a given. The previous version gave 
the following example for how to ensure timeliness.

    Even if only enough incoming CE-marked octets have arrived for part
    of the departing packet, the next departing packet SHOULD be
    immediately CE-marked.  This ensures that CE-markings are propagated
    immediately, rather than held back waiting for more incoming CE-
    marked octets.

I deleted a para about preserving octets (which included the above 
sentences), mainly because you didn't like the para. I intended to 
replace the above sentences with the sentence below. However, in one of 
the drafts it looks like I got distracted during this 2-phase commit. 
(Previous versions of the ecn-encap draft had similar sentences to those 
above, which I had already generalized to the one below):

    The mechanism for propagating congestion indications SHOULD ensure
    that any incoming congestion indication is propagated immediately,
    not held awaiting the possibility of further congestion indications
    to be sufficient to indicate congestion on an outgoing PDU.

I have now pasted this latter sentence into my local editor's copy of 
rfc6040update-shim, which was my intention, so it will appear in the 
next revision.

> The control loop is not sensitive to the "proportion of bytes marked", only that a mark was encountered at a particular time.

Both sides are true, not just one side: a conventional congestion 
response is dependent on a mark being encountered at a particular time, 
*which makes it* sensitive to the proportion of bytes marked.

If that were not true, AQMs would not work (because all they do is alter 
the marking probability). You must be aware that TCP's window depends on 
the marking probability.

I will try to explain below. But I don't know which part of the whole 
picture you're not grokking. So, ultimately, you need to work out for 
yourself how both these things can be true at once, rather than trying 
to argue that all the understanding in the congestion control community 
is wrong, just because you haven't yet reconciled it with the way you 
currently think. That's what reading papers is for.

> It is well known, moreover, that adding delay to a control loop destabilises it - but this is exactly what an attempt to maintain the proportion of marked bytes would do.

Here goes: an example.

Let's assume the tunnel fragments the packets of some IPv4 flows into 2 
fragments; one large (always size S to make the explanation simple), one 
small (always size s).  Sizes include inner headers.

Now, I'm going to assume the reassembly algo in the previous version of 
the draft that maintains a per-interface counter incremented by incoming 
marked octets and decremented by outgoing marked octets. It marks 
outgoing packets while (counter >= 0). That only uses per-interface 
state, whereas the "logical OR" algorithm uses per-packet state.

At reassembly, let's start from the point where a marked larger fragment 
arrives (counter rises to S), but its little sister fragment isn't 
marked. The algo marks this first reassembled packet (size S+s), and the 
counter goes negative (-s). The next time a marked fragment arrives it 
bumps up the counter, either to 0 if it's a small fragment or to (S-s) 
if it's a large one. Either way it marks the outgoing reassembled 
packet. So the counter drops to either (-s-S) or (-2s).

No delayed signals so far. But, over successive packets, the deficit 
will keep growing until it is larger than the size of an arriving marked 
fragment, which increases the counter but leaves it still negative. So 
the algo doesn't propagate that congestion signal. That looks like a 
delayed signal, doesn't it? But there's more to this than meets the 
eye....
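
For anyone who wants to trace the walk-through above, here is a minimal 
sketch of the per-interface counter in Python. It is one reading of the 
description above, not text from the draft. The class and method names are 
made up for illustration, and the `pending` flag (a marked fragment has 
arrived since the last departure) is an added assumption, so that a counter 
resting at zero does not mark packets when no marks are arriving.

```python
from dataclasses import dataclass


@dataclass
class EgressInterface:
    """Per-interface CE state at a reassembling tunnel egress (illustrative)."""

    counter: int = 0       # marked octets in, minus marked octets out
    pending: bool = False  # assumption: a CE fragment arrived since last departure

    def fragment_arrived(self, size: int, ce: bool) -> None:
        """Account for one incoming fragment of `size` octets."""
        if ce:
            self.counter += size
            self.pending = True

    def packet_departs(self, size: int) -> bool:
        """Return True if the reassembled packet of `size` octets is CE-marked."""
        mark = self.pending and self.counter >= 0
        self.pending = False
        if mark:
            self.counter -= size
        return mark


S, s = 1000, 500                      # hypothetical fragment sizes in octets
egress = EgressInterface()
egress.fragment_arrived(S, ce=True)   # marked large fragment
egress.fragment_arrived(s, ce=False)  # unmarked little sister
print(egress.packet_departs(S + s), egress.counter)
```

The only state is one counter (plus one flag in this sketch) per interface. 
Nothing needs to be remembered per packet being reassembled.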

For explanation, let's take 2 types of flows, one that the tunnel 
doesn't have to fragment and one it does. And let's start from a 
position where they are all originally sent with the same packet rate 
(pre-fragmentation).

The tunnel ingress doubles the packet rate of the flows it fragments. 
Therefore, during any congestion event, the AQM marks twice as many 
packets from the fragmented flows as the others {Note 1}. Reassembly 
restores the packet rate to that at the origin sender. And it removes 
some of the extra congestion markings that fragmentation added.

For example, if 100 packets of the unfragmented flow received 2 marks, 
100 packets of the fragmented flow would turn into 200 fragments, 4 
would be marked, then reassembly would turn them back into 100 packets 
and a little more than 2 would be marked (because of the extra size of 
the inner headers).

If reassembly used "logical OR" instead, of the 100 forwarded packets, 4 
would be marked (3.96% to be precise).
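
The arithmetic can be checked with a few lines of Python. This is only a 
sketch under stated assumptions: the AQM is taken to mark each fragment 
independently with the same probability p, the function names are made up, 
and the fragment sizes (1000 and 520 octets reassembling into 1500) are 
hypothetical, chosen to include one extra 20-octet inner header.

```python
def logical_or_fraction(p: float, fragments: int = 2) -> float:
    """Fraction of reassembled packets marked if any marked fragment marks it."""
    return 1.0 - (1.0 - p) ** fragments


def byte_preserving_fraction(p: float, fragment_sizes, reassembled_size: int) -> float:
    """Expected fraction marked when marked octets out track marked octets in."""
    return p * sum(fragment_sizes) / reassembled_size


p = 0.02  # 2 marks per 100 packets, as in the example above
print(f"logical OR:      {logical_or_fraction(p):.2%}")
# Hypothetical sizes: 1000 + 520 octets of fragments, 1500 octets reassembled.
print(f"byte-preserving: {byte_preserving_fraction(p, (1000, 520), 1500):.2%}")
```

Logical OR gives 1 - 0.98^2 = 3.96%, whereas the byte-preserving figure 
stays just over 2%, inflated only by the extra header octets.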

With marked-byte-preserving, TCP will reduce the average packet rate of 
the fragmented flows in proportion to the extra headers. With "logical 
OR", TCP will roughly halve the rate of the fragmented flows.

Any change in the aggregate marking probability at the AQM due to 
fragmentation is just a constant multiplier in all the control loops, 
which will all just adapt accordingly. When there's a pulse in load, 
removing some signals at reassembly doesn't delay the signals any more 
than adding signals during fragmentation "undelays" signals. The system 
was already at a different operating point to start with, due to the 
presence of fragmentation. So it doesn't take any longer to signal a 
pulse in load.

In summary, it's wrong to focus only on the removal of some congestion 
signals, without also taking into account the addition of more in the 
first place.

This is the flaw in the "every mark is sacred" mindset. It selectively 
focuses on preserving one piece of the bigger picture, without noticing 
that the whole picture is different.


{Note 1}: Fragmentation doesn't alter the aggregate bit-rate (other than 
the extra headers). So there are not that many more marked packets 
overall, because the AQM isn't that much more congested. The markings 
are just biased more onto the fragmented flows.

> For DCTCP-style congestion control, the steady state is defined in terms of the number of CE marks received per RTT.  Fragmentation does not change the RTT, only the number of packets passing a middlebox located on the tunnel path.  It would be reasonable to design an AQM expecting DCTCP traffic so that it produces exactly the correct number of CE marks per RTT at a particular queue depth.  But if the number of marks is effectively halved by a reassembly process that attempts to preserve the number of marked bytes, that queue will continue to grow past that designed ideal point.
Again, this is because you weren't looking at the whole picture. You 
hadn't noticed that fragmentation at the ingress doubled the packet rate 
of the flows needing fragmentation to start with.

> We can therefore conclude that DCTCP is also insensitive to the proportion of marked bytes, and this is not a property worth preserving; rather, the total number of marked packets should be preserved.
Preserving the proportion of marked bytes at reassembly is done to 
preserve BOTH the proportion of marked packets AND the proportion of 
marked bytes over the whole path.

>
> Finally, I observe that Codel performs marking on a time schedule, not on a proportion of packets passing through.  This is entirely consistent with the above observations about transport behaviour, and further confirms the need to preserve the number of marked packets, not the proportion of marked bytes or packets.
Oh dear, this is not correct either. CoDel uses a time schedule, but it 
doesn't use a particular time for a particular degree of congestion. It 
just increases or decreases the time until it's right. So it's no 
different from increasing and decreasing the marking proportion until it 
gets it right. It's adjusting both time and proportion at the same time.

In general, the choice of unit is only important for a target, not for a 
control variable (linearity is important for a control variable though).

>
> The existing language in RFC-3168 succeeds in preserving the number of CE marks applied to a flow.
But reassembly halves the number of packets (typically). So RFC 3168 
doubles the /proportion/ of CE marked packets.

Over the whole path, the combination of fragmentation AND RFC3168 
reassembly doubles both the number and the proportion of CE marks 
applied to a flow.

> Any deficiencies we should consider are in relation to handling the distinction between ECT(1) and ECT(0), as this is what newly becomes significant with both the L4S and SCE proposals.
That is also incorrect, but it requires a different thread.



Bob



>
>   - Jonathan Morton
>

-- 
________________________________________________________________
Bob Briscoe                               http://bobbriscoe.net/