Re: [tsvwg] ECN encapsulation draft - proposed resolution

Bob Briscoe <ietf@bobbriscoe.net> Tue, 15 June 2021 22:25 UTC

Return-Path: <ietf@bobbriscoe.net>
X-Original-To: tsvwg@ietfa.amsl.com
Delivered-To: tsvwg@ietfa.amsl.com
Received: from localhost (localhost [127.0.0.1]) by ietfa.amsl.com (Postfix) with ESMTP id 6859A3A4047 for <tsvwg@ietfa.amsl.com>; Tue, 15 Jun 2021 15:25:34 -0700 (PDT)
X-Virus-Scanned: amavisd-new at amsl.com
X-Spam-Flag: NO
X-Spam-Score: -1.432
X-Spam-Level:
X-Spam-Status: No, score=-1.432 tagged_above=-999 required=5 tests=[BAYES_00=-1.9, DKIM_SIGNED=0.1, DKIM_VALID=-0.1, DKIM_VALID_AU=-0.1, DKIM_VALID_EF=-0.1, NICE_REPLY_A=-0.001, RCVD_IN_MSPIKE_H3=0.001, RCVD_IN_MSPIKE_WL=0.001, SPF_HELO_NONE=0.001, SPF_SOFTFAIL=0.665, URIBL_BLOCKED=0.001] autolearn=no autolearn_force=no
Authentication-Results: ietfa.amsl.com (amavisd-new); dkim=pass (2048-bit key) header.d=bobbriscoe.net
Received: from mail.ietf.org ([4.31.198.44]) by localhost (ietfa.amsl.com [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id IlTAEQLQC8S1 for <tsvwg@ietfa.amsl.com>; Tue, 15 Jun 2021 15:25:30 -0700 (PDT)
Received: from mail-ssdrsserver2.hostinginterface.eu (mail-ssdrsserver2.hostinginterface.eu [185.185.85.90]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by ietfa.amsl.com (Postfix) with ESMTPS id C992B3A4043 for <tsvwg@ietf.org>; Tue, 15 Jun 2021 15:25:29 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed; d=bobbriscoe.net; s=default; h=Content-Transfer-Encoding:Content-Type: In-Reply-To:MIME-Version:Date:Message-ID:From:References:Cc:To:Subject:Sender :Reply-To:Content-ID:Content-Description:Resent-Date:Resent-From: Resent-Sender:Resent-To:Resent-Cc:Resent-Message-ID:List-Id:List-Help: List-Unsubscribe:List-Subscribe:List-Post:List-Owner:List-Archive; bh=uQPBLQB+TdXD3ajyXJyjabzk//H5NrexxJ66o7MFQJU=; b=reNtSPP06UwLE7In7HbDIdbsJ0 posMH+bzVtS+N9/ysJYmoZBsSVq6Zd78IS2Nli64FnKzDXo5EfW6XO+9loDryVq2MsFO2ZBN16LWx gX/kbtKk5QxyWXdjEsr9qepSdtB8ZF7HeHuBbd/MC2BVhQW/bJbEJeLfCKUmdbb7/3WjtPTTdOLjN Xy3cNfAjSVie91KjoJrYvlmBuh4WtprMwbZnTbZJybJ39t1C5sVC4sH/+wyiwQWErKZhsl0d3Ei5s 9n2yOS+nA/Xr/fDCmgtIk1sWH8qxOCzjrzAwrr9Ek1XVJEnWOjXXuq8aWLnLMvjNQTQxHSepZPPgP 4ynXsfSw==;
Received: from 67.153.238.178.in-addr.arpa ([178.238.153.67]:51228 helo=[192.168.1.11]) by ssdrsserver2.hostinginterface.eu with esmtpsa (TLS1.2) tls TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256 (Exim 4.94.2) (envelope-from <ietf@bobbriscoe.net>) id 1ltHUy-00019o-0l; Tue, 15 Jun 2021 23:25:26 +0100
To: Jonathan Morton <chromatix99@gmail.com>
Cc: "tsvwg@ietf.org" <tsvwg@ietf.org>, Markku Kojo <kojo=40cs.helsinki.fi@dmarc.ietf.org>, Donald Eastlake <d3e3e3@gmail.com>, John Kaippallimalil <kjohn@futurewei.com>
References: <MN2PR19MB40454BC50161943BC33AAAD783289@MN2PR19MB4045.namprd19.prod.outlook.com> <43e89761-d168-1eca-20ce-86aa574bd17a@bobbriscoe.net> <de8d355d-08b6-34fb-a6cc-56755c9a11ee@bobbriscoe.net> <MN2PR19MB4045DB9D2C45066AEB0762DB83259@MN2PR19MB4045.namprd19.prod.outlook.com> <alpine.DEB.2.21.2106021717300.4214@hp8x-60.cs.helsinki.fi> <BE497F82-5452-41A1-943F-7ABD0048C7F9@gmail.com>
From: Bob Briscoe <ietf@bobbriscoe.net>
Message-ID: <56c2887b-5e9e-c2b6-c760-81e2627400a2@bobbriscoe.net>
Date: Tue, 15 Jun 2021 23:25:24 +0100
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:78.0) Gecko/20100101 Thunderbird/78.8.1
MIME-Version: 1.0
In-Reply-To: <BE497F82-5452-41A1-943F-7ABD0048C7F9@gmail.com>
Content-Type: text/plain; charset="utf-8"; format="flowed"
Content-Transfer-Encoding: 8bit
Content-Language: en-GB
X-AntiAbuse: This header was added to track abuse, please include it with any abuse report
X-AntiAbuse: Primary Hostname - ssdrsserver2.hostinginterface.eu
X-AntiAbuse: Original Domain - ietf.org
X-AntiAbuse: Originator/Caller UID/GID - [47 12] / [47 12]
X-AntiAbuse: Sender Address Domain - bobbriscoe.net
X-Get-Message-Sender-Via: ssdrsserver2.hostinginterface.eu: authenticated_id: in@bobbriscoe.net
X-Authenticated-Sender: ssdrsserver2.hostinginterface.eu: in@bobbriscoe.net
X-Source:
X-Source-Args:
X-Source-Dir:
Archived-At: <https://mailarchive.ietf.org/arch/msg/tsvwg/f_nrkJ5OP5iaNV_gkWK5-R9KVrg>
Subject: Re: [tsvwg] ECN encapsulation draft - proposed resolution
X-BeenThere: tsvwg@ietf.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: Transport Area Working Group <tsvwg.ietf.org>
List-Unsubscribe: <https://www.ietf.org/mailman/options/tsvwg>, <mailto:tsvwg-request@ietf.org?subject=unsubscribe>
List-Archive: <https://mailarchive.ietf.org/arch/browse/tsvwg/>
List-Post: <mailto:tsvwg@ietf.org>
List-Help: <mailto:tsvwg-request@ietf.org?subject=help>
List-Subscribe: <https://www.ietf.org/mailman/listinfo/tsvwg>, <mailto:tsvwg-request@ietf.org?subject=subscribe>
X-List-Received-Date: Tue, 15 Jun 2021 22:25:35 -0000

Jonathan,

I waited before replying to your email below until I had the time to try 
to unpick and deeply understand the way you think about all this. I 
would appreciate it if you would do likewise, and certainly please do 
not snip anything I say (like you did last time I put in a similar 
amount of effort into explaining all this on your terms). Instead, 
please try to understand and address each point.
See [BB] inline.

On 04/06/2021 02:02, Jonathan Morton wrote:
>> On 3 Jun, 2021, at 2:41 pm, Markku Kojo <kojo=40cs.helsinki.fi@dmarc.ietf.org> wrote:
>>
>> I'm afraid there is an open/ongoing discussion on this where we have not reached concensus.
> I think I agree with your analysis that preserving the number of bytes marked is probably the wrong thing to do.  The implementation complexity argument is fairly compelling in itself, particularly noting that there isn't any known "running code" which would place some upper bound on what the complexity is.
>
> By contrast the RFC-3168 approach of "if any part of the packet is marked, all of it is marked" is very simple to implement, and this is proved by actual implementations in the wild.  This approach is also consistent with the behaviour when ECN is not in use, since if any fragment of a packet is lost, the packet can't be reassembled and is discarded in whole.
>
> I'd like to try and back this position up with some mathematical rigour from a different direction.
>
> One thing I noticed about Bob's position is that he's trying to preserve the number of bytes marked, but usually the transport is actually sensitive to the number of bytes, or possibly the interval of time, *between* marks.  These are two (or three) very different quantities, and it's important not to get them confused in the debate.
>
> In engineering, a very useful tool for checking whether an answer is correct is "dimensional analysis".  Simply put, work out the units that your inputs and outputs are most naturally expressed as, and check that the formulae linking them are consistent with respect to the units.  In this case, we can look at the units of the AQM's marking function, the units of the transport's congestion control steady state criterion, and whether each approach to resolving ECN marks at reassembly preserves the quality of the marking with respect to those units.
>
> For the purpose of this analysis, we are looking for the quantity that is most nearly preserved if we hold the AQM state constant or the transport in steady state, but vary packet size and/or link bandwidth arbitrarily.
>
> Starting with the AQMs:
>
> 1: Codel is a time-domain AQM; its marking function is explicitly a frequency, expressed for performance reasons as a time interval between marks.  For analysis, we can equivalently express this quantity in units of marks/sec or its reciprocal, seconds per mark.  If we change the bandwidth or the packet size while holding Codel's control law constant, the number of bytes in marked packets and/or the number of packets between marks will change, but the time between marks will stay constant.
>
> 2: A packet-domain marking function (as used in BLUE and many RED-family AQMs) maintains a probability that any given packet will be marked.  Amortised over a large interval, we can express this as a (fractional) number of marks per packet, or equivalently as packets between marks. If the packet rate increases, so will the number of marks per time interval; if the packet size decreases, so will the number of bytes between marks.
>
> 3: A byte-domain marking function, as Markku described for early RED work, is obtained by taking a nominally packet-mode probability function, but multiplying the probability by bytes/packet (and a constant scale factor).  The resulting unit is (fractional) marks per byte, or bytes between marks.

[BB] One doesn't see flow rate formulae for Reno (or DCTCP, Cubic, etc) 
that say, "the rate is x if over PIE; but the rate is y if over CoDel". 
The outcome solely depends on the transports involved, which converge on 
a set of rates that fill the capacity irrespective of which AQM is at 
the bottleneck. Therefore the unit that a particular AQM uses to mark 
packets does not alter the steady-state outcome. This is a critical 
point, so if you disagree, there's no need to move on to the next part 
about transports - please stop here and solely explain why you think 
this point is wrong.


The fact that CoDel inserts an amount of time between marks is a 
distraction. CoDel's control law decreases the time between marks until 
it balances with the transports using it. The fact that RED, BLUE, PIE 
etc use a packet marking probability is also a distraction. They also 
adapt, increasing the marking probability to converge on 'the same 
outcome' as CoDel would, if placed in the same location with the same 
traffic sources.

By 'the same outcome', I mean that if you replaced one AQM at the 
bottleneck with another, the outcome would be:
* the same number of marked packets per second (or equivalently the same 
time between marks)
* AND the same marking probability.

If it existed, byte-mode marking (the early RED option) would have the 
same property, but let's only talk about that if we need to, when we get 
to re-framing, where it does make a difference.

BTW, the time taken to reach the outcome (the dynamic performance) will 
be different for different AQMs, but you're talking here about stable 
outcomes, which is indeed one of the two important considerations (the 
other being timeliness).


> Now the transports:
>
> 1: DCTCP's steady state is two marks per RTT.  Since RTT has units of time, the steady-state criterion has units of marks/sec.  This is precisely the same as Codel's marking function.
>
> 2: Reno's steady state is BDP/2 RTTs per mark, where BDP has units of bytes (the two time terms in the formula cancel).  The steady-state criterion thus has units of byte-seconds per mark.  This does not exactly match any of the AQMs' control laws, but it seems likely that "seconds per mark" or "bytes between marks" would be a better match than "packets between marks".  Regardless, for constant performance with Reno, we would look for methods that preserve the byte-seconds per mark value.
>
> 3: CUBIC's steady state, when not in the Reno compatibility regime, is quoted in RFC-8312 as (BDP/MSS) = 1.054 * RTT^0.75 / p^0.75, where p is the (fractional) number of marks per segment, and the quantity BDP/MSS has units of segments.  But each segment, as seen by the transport, is MSS bytes long, and it is ignorant of any fragmentation that might happen at Layer 3 or 2 along the way - so if that occurs, "segments" is not the same as "packets" as seem by the AQM.  So we can convert p into a byte quantity by dividing it by MSS, yielding (BDP/MSS) = 1.054 * (RTT*MSS / p)^0.75. This reveals that for a constant cwnd, a constant byte-seconds per mark value is required, similar to Reno, and we can probably expect this to be true for other "TCP friendly" congestion controls as well.

[BB] One can substitute parameters with different units into 
steady-state flow rate formulae (as I show at the end, headed "Formula 
Transformations"). So if you happen to choose to express a formula in 
terms of a parameter with certain units, it doesn't prove anything about 
reassembling markings. It just says you have chosen to express the 
formulae in terms of a parameter with certain units.

Dimensional analysis is very useful to check one's work. But you're not 
doing dimensional analysis above (otherwise you'd be checking that the 
two sides of an equation have the same units). You're just expressing 
formulae in terms of a parameter with units that you are comfortable 
with, then convincing yourself that this proves that reassembly should 
use the units that you are comfortable with.

When you say '"segments" is not the same as "packets" as seen by the 
AQM', you are following the logic that the AQM examines the packets and 
knows what time to insert between marks. It doesn't. The AQM inserts a 
first guess at a time between marks, the transport responds, and if it 
doesn't respond enough, the queue doesn't shrink enough, so the AQM 
keeps on inserting less time before the next mark until the time between 
marks from the AQM causes the flow rate to balance at a steady queue 
around the target.

As explained earlier, if you observe a PIE or BLUE AQM from the outside, 
you might believe it is inserting a reducing time between marks. But 
internally, it is reducing a marking probability. Looking from the 
outside, the final probability it reaches can also be considered as the 
final time between marks that it reaches. So (again), the units the AQM 
adjusts internally are irrelevant to the steady state it reaches.


When I said that all these formulae can be expressed in terms of marking 
probability, p, rather than time between marks, t_m, that was not to 
prove that p is the correct unit to preserve on reassembly. It's just to 
prove that each transport's formula can be expressed in either unit, so 
the units chosen for a formula do not imply anything about which units 
should be preserved on reassembly. Your formulae are just convincing 
yourself that you can express formulae in terms of a parameter in units 
that you are comfortable with.

In order to convince anyone else, you're going to need to address the 
question of why your proposal (also in RFC3168) to preserve the time 
between marks leads to roughly x2 unfairness between flows that are just 
below and just above the fragmentation limit. Whereas preserving marking 
probability (or equivalently bytes marked), leads to no unfairness. 
That's a real-life phenomenon that doesn't depend on which units anyone 
chooses to reason about the world.

You can stop here, if you are convinced. If not, go to the end, to work 
through the transformations from formulae expressed in 'time between 
marks' into formulae expressed in 'marking probability'.

> So now we can turn to the methods of resolving congestion marks at reassembly.  I'll examine three candidate rules that seem relevant to the debate, though others probably exist:
>
> 1: "If any fragment of a packet is marked, the whole packet is marked."  This is the RFC-3168 rule, generalised to also cover the weird Layer 2 fragmentation scheme that sparked this debate.  It is clear that this rule preserves the number of bytes between marks and the time between marks (and therefore also the byte-seconds between marks), but not the number of packets between marks - except for the specific case where a fragment contains portions of more than one packet.  If these packets all belong to the same flow, then this is harmless to Reno and CUBIC since consecutive ECN marks are reported to the sender as a single event.  If multiple flows are involved, then this will stochastically amplify the congestion signal by having more than one flow react to a single mark - which seems tolerable in practice.
>
> 2: "Every mark is sacred, every mark is great."  The AQM produces one mark, and exactly one reassembled packet receives that mark; the reassembly process can make a reasonable choice as to which one.  This preserves bytes between marks, time between marks, and byte-seconds between marks *exactly*.
>
> 3: "The number of bytes marked should be preserved by reassembly."  This would tend to preserve the number of packets between marks better than the other two options, but that is not useful to any of the transports considered.  The distortion to the bytes between marks and time between marks metrics would be distorted proportionally to the difference between average fragmented and reassembled packet sizes, and the byte-seconds between marks measure would be distorted *quadratically* by that difference.
>
> In summary, the "preserve the number of marked bytes" rule is mathematically unreasonable, as well as being much more complicated to implement than either the existing RFC-3168 rule, or the "sacred mark" rule, which actually preserve the quantities that need to be preserved.

[BB] In March 2020, I posted an example of the unfairness problem if 
reassembly preserves the time between marks. The unfairness is between 
flows of reassembled packets and packets that have not been fragmented. 
And I showed that preserving marking probability (or equivalently bytes 
marked) solves the problem:
You can still go straight to the schematic that I referred to: 
https://bobbriscoe.net/projects/netsvc_i-f/consig/encap/ecn-reassembly.pdf
(and FYI, here's the original posting: 
https://mailarchive.ietf.org/arch/msg/tsvwg/Da0sagcLnvPzh6xKFdHFRUZ9w5o/ )

That is the problem with the RFC3168 approach, and with your approach 
which has the same goal. I have carefully worked through your long 
email, but it doesn't even mention the problem I've levelled against the 
scheme you prefer. That is the whole nub of this argument. Please, you 
cannot continue to keep asserting that you are right, without engaging 
with that problem. And you cannot keep just asserting that I don't know 
what I'm talking about and you do, unless you refute my arguments.


In summary, your long email had three parts:
     #1 AQMs
     #2 Transports
     #3 Schemes to reassemble marking.
Part #3 consists of assertions that the reassembly scheme you assert is 
right is the one that preserve the units that you assert are right. But 
you haven't given any arguments for why, and you haven't even tried to 
support your arguments using the first two parts.

I suspect you think your arguments in #1 & #2 are proofs. So, I have 
explained (and proved):
1) why the units of the parameter that an AQM manipulates are irrelevant.
2) that a flow formulae can be expressed in terms of a parameter with 
units either in (time between marks) or in (marking probability). So 
choosing to express a flow formula in terms of a parameter with one unit 
or the other is just your choice. It proves nothing about which is the 
correct unit to preserve during reassembly. (Supporting maths follows at 
the end)

>
>   - Jonathan Morton

[BB] Regards



Bob

==Errors in your analysis of transports==

Your analysis of DCTCP is OK, but your analyses of Reno and Cubic 
contain errors:

* Reno: Given you define BDP in bytes, your criterion for RTTs per mark 
should be (BDP/MSS)/2 , not BDP/2.
     and a picky point: the const should be approx 2/3 not 1/2.
* Cubic: Obviously, your two formulae compared below cannot both be true 
(by dimensional analysis, ironically):
     (BDP/MSS) = 1.054 * (RTT / p)^0.75
     (BDP/MSS) = 1.054 * (RTT*MSS / p)^0.75
     p is dimensionless (a fraction) so you cannot just divide it by MSS 
to "get a byte quantity", while not changing the other side of the equation.


==Formula Transformations==
Here, we transform rate formulae in terms of (time between marks) into 
formulae in terms of (marking probability).

For brevity I'll use the following notation, all at steady state, and 
with units shown in [square brackets]:
     R : RTT [s/round]
     S : MSS [B]
     x : flow rate [B/s] = S/R * cwnd
     t_m : time between marked packets [s/marked_pkt]

     r : packet rate [pkt/s] = x/S by definition
     BDP : bandwidth-delay product [B/round]
     p : marking probability [], meaning a dimensionless fraction

The time between marks is the reciprocal of the marks per time:
     t_m = 1/(p*r)
where p*r is the marks per time, because r is packets per time, and each 
packet is marked with probability p.
Expressing this in terms of flow rate rather than packet rate:
     t_m = S / (p*x) (1)

We can then substitute the above in your formulae for rounds per mark:

==DCTCP==
     mark per round = 2
     mark per time = 2 / R
Time per mark is the reciprocal,
     t_m = R/2
Substituting from eqn (1)
     S / (p*x) = R/2
     x = S/R * 2/p
which is the well-known DCTCP formula for flow rate, better known in 
terms of cwnd in units of [packet]:
     cwnd = 2/p.

==Reno ==
    rounds/mark = 2/3 * BDP/S            (corrected from your formula)
     t_m = R* (rounds/mark)
            = 2/3 R*BDP/S
            = 2/3 R*R*x/S
Substituting from eqn (1)
     S / (p*x) = 2/3 R*R*x/S
Rearranging and collecting terms:
     x^2 = 3/2 S^2/(R^2 * p)
     x = S/R sqrt(3/2p)
which is the well-known Reno formula for flow rate, better known in 
terms of cwnd in units of [packet]:
     cwnd = sqrt(3/2p).

==Cubic==
We can transform in the other direction from the Cubic formula that RFC 
8312 expressed in p, to a formula expressed in t_m:
     (BDP/S) = 1.054 * (R / p)^0.75
     x*R/S = 1.054 * (R * t_m * x / S)^0.75
Rearranging and collecting terms:
     x^0.25 = 1.054 * S^0.25 * t_m^0.75 / R^0.25
     x = 1.054^4 * S * t_m^3 / R

In summary, the following generalized formulae can be used to express 
flow rate either in terms of p (marking probability) or in terms of t_m 
(time between marks):
     x = k * S / (p^b * R^c)
     x = k * S * t_m^d / R^e
where:
             b    c    d      e
     ____________________________
     DCTCP   1    1  infty  infty
     Reno   1/2   1    1      2
     Cubic  3/4  1/4   3      1

(the infty values are because DCTCP flow rate, x, does not depend on t_m 
or R, given t_m = R/2, with no dependence on x)

-- 
________________________________________________________________
Bob Briscoe                               http://bobbriscoe.net/