Re: [tsvwg] ECN encapsulation draft - proposed resolution

Bob Briscoe <ietf@bobbriscoe.net> Tue, 22 June 2021 10:02 UTC

Return-Path: <ietf@bobbriscoe.net>
X-Original-To: tsvwg@ietfa.amsl.com
Delivered-To: tsvwg@ietfa.amsl.com
Received: from localhost (localhost [127.0.0.1]) by ietfa.amsl.com (Postfix) with ESMTP id 55B1B3A1EB9 for <tsvwg@ietfa.amsl.com>; Tue, 22 Jun 2021 03:02:47 -0700 (PDT)
X-Virus-Scanned: amavisd-new at amsl.com
X-Spam-Flag: NO
X-Spam-Score: -1.769
X-Spam-Level:
X-Spam-Status: No, score=-1.769 tagged_above=-999 required=5 tests=[BAYES_00=-1.9, DKIM_SIGNED=0.1, DKIM_VALID=-0.1, DKIM_VALID_AU=-0.1, DKIM_VALID_EF=-0.1, NICE_REPLY_A=-0.338, RCVD_IN_MSPIKE_H3=0.001, RCVD_IN_MSPIKE_WL=0.001, SPF_HELO_NONE=0.001, SPF_SOFTFAIL=0.665, URIBL_BLOCKED=0.001] autolearn=no autolearn_force=no
Authentication-Results: ietfa.amsl.com (amavisd-new); dkim=pass (2048-bit key) header.d=bobbriscoe.net
Received: from mail.ietf.org ([4.31.198.44]) by localhost (ietfa.amsl.com [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id auI5i10Zly8H for <tsvwg@ietfa.amsl.com>; Tue, 22 Jun 2021 03:02:42 -0700 (PDT)
Received: from mail-ssdrsserver2.hostinginterface.eu (mail-ssdrsserver2.hostinginterface.eu [185.185.85.90]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by ietfa.amsl.com (Postfix) with ESMTPS id 317853A1EB7 for <tsvwg@ietf.org>; Tue, 22 Jun 2021 03:02:41 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed; d=bobbriscoe.net; s=default; h=Content-Transfer-Encoding:Content-Type: In-Reply-To:MIME-Version:Date:Message-ID:References:Cc:To:From:Subject:Sender :Reply-To:Content-ID:Content-Description:Resent-Date:Resent-From: Resent-Sender:Resent-To:Resent-Cc:Resent-Message-ID:List-Id:List-Help: List-Unsubscribe:List-Subscribe:List-Post:List-Owner:List-Archive; bh=rb3gVlMGiT31v5dCUoxEJ/USCCJjiVwn2YEOwkMgIxg=; b=zt2zDsizu3NIdW5Tx8xmFpNN/A qNZqnfleF3ct0jZsCpKfcYlWP/Qx/6Vobz87z6ug3Lwa0XPkM9VtwPpW8kv9+AzhRDsu4v8ebC2Pm lOrZlox84QmTv8pKCGP/7KNe3duzFiGJnTKOnjr6oyQGglAb4Dux0lM7au32IB3gbLkxzHeJUwudL 0eFWPQ0gIfGI+Mdkijk1vqRVo4CXOlfCQAY67A7CegiOuzRRHH+ALAqc9AuSm408PQpylA7TSnfcT cA2tTinZ87Mfcf/xwLJFP6bvOvwEExFmpn07u5Yj8KcNM3vi7aAFzPMVUAA0qIugbcsrTKQLz8pOj pdNEjULg==;
Received: from 67.153.238.178.in-addr.arpa ([178.238.153.67]:51758 helo=[192.168.1.11]) by ssdrsserver2.hostinginterface.eu with esmtpsa (TLS1.2) tls TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256 (Exim 4.94.2) (envelope-from <ietf@bobbriscoe.net>) id 1lvdEr-0007Z0-UL; Tue, 22 Jun 2021 11:02:34 +0100
From: Bob Briscoe <ietf@bobbriscoe.net>
To: Jonathan Morton <chromatix99@gmail.com>
Cc: "tsvwg@ietf.org" <tsvwg@ietf.org>, Markku Kojo <kojo=40cs.helsinki.fi@dmarc.ietf.org>, Donald Eastlake <d3e3e3@gmail.com>, John Kaippallimalil <kjohn@futurewei.com>
References: <MN2PR19MB40454BC50161943BC33AAAD783289@MN2PR19MB4045.namprd19.prod.outlook.com> <43e89761-d168-1eca-20ce-86aa574bd17a@bobbriscoe.net> <de8d355d-08b6-34fb-a6cc-56755c9a11ee@bobbriscoe.net> <MN2PR19MB4045DB9D2C45066AEB0762DB83259@MN2PR19MB4045.namprd19.prod.outlook.com> <alpine.DEB.2.21.2106021717300.4214@hp8x-60.cs.helsinki.fi> <BE497F82-5452-41A1-943F-7ABD0048C7F9@gmail.com> <56c2887b-5e9e-c2b6-c760-81e2627400a2@bobbriscoe.net>
Message-ID: <3a66effa-9269-a9b0-48e8-d48bd46d70d1@bobbriscoe.net>
Date: Tue, 22 Jun 2021 11:02:32 +0100
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:78.0) Gecko/20100101 Thunderbird/78.8.1
MIME-Version: 1.0
In-Reply-To: <56c2887b-5e9e-c2b6-c760-81e2627400a2@bobbriscoe.net>
Content-Type: text/plain; charset="utf-8"; format="flowed"
Content-Transfer-Encoding: 8bit
Content-Language: en-GB
X-AntiAbuse: This header was added to track abuse, please include it with any abuse report
X-AntiAbuse: Primary Hostname - ssdrsserver2.hostinginterface.eu
X-AntiAbuse: Original Domain - ietf.org
X-AntiAbuse: Originator/Caller UID/GID - [47 12] / [47 12]
X-AntiAbuse: Sender Address Domain - bobbriscoe.net
X-Get-Message-Sender-Via: ssdrsserver2.hostinginterface.eu: authenticated_id: in@bobbriscoe.net
X-Authenticated-Sender: ssdrsserver2.hostinginterface.eu: in@bobbriscoe.net
X-Source:
X-Source-Args:
X-Source-Dir:
Archived-At: <https://mailarchive.ietf.org/arch/msg/tsvwg/FueYqXteWuMiEci376Hn28k2uX8>
Subject: Re: [tsvwg] ECN encapsulation draft - proposed resolution
X-BeenThere: tsvwg@ietf.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: Transport Area Working Group <tsvwg.ietf.org>
List-Unsubscribe: <https://www.ietf.org/mailman/options/tsvwg>, <mailto:tsvwg-request@ietf.org?subject=unsubscribe>
List-Archive: <https://mailarchive.ietf.org/arch/browse/tsvwg/>
List-Post: <mailto:tsvwg@ietf.org>
List-Help: <mailto:tsvwg-request@ietf.org?subject=help>
List-Subscribe: <https://www.ietf.org/mailman/listinfo/tsvwg>, <mailto:tsvwg-request@ietf.org?subject=subscribe>
X-List-Received-Date: Tue, 22 Jun 2021 10:02:48 -0000

Jonathan,

Just checking that you have seen my response below. I spent a couple of 
days trying to write it in the way that you seem to be thinking about 
this. So we'll need you to respond if we are going to move forward. I 
appreciate there's a lot here, so it needs some time to work through it all.


Bob


On 15/06/2021 23:25, Bob Briscoe wrote:
> Jonathan,
>
> I waited before replying to your email below until I had the time to 
> try to unpick and deeply understand the way you think about all this. 
> I would appreciate it if you would do likewise, and certainly please 
> do not snip anything I say (like you did last time I put in a similar 
> amount of effort into explaining all this on your terms). Instead, 
> please try to understand and address each point.
> See [BB] inline.
>
> On 04/06/2021 02:02, Jonathan Morton wrote:
>>> On 3 Jun, 2021, at 2:41 pm, Markku Kojo 
>>> <kojo=40cs.helsinki.fi@dmarc.ietf.org> wrote:
>>>
>>> I'm afraid there is an open/ongoing discussion on this where we have 
>>> not reached concensus.
>> I think I agree with your analysis that preserving the number of 
>> bytes marked is probably the wrong thing to do.  The implementation 
>> complexity argument is fairly compelling in itself, particularly 
>> noting that there isn't any known "running code" which would place 
>> some upper bound on what the complexity is.
>>
>> By contrast the RFC-3168 approach of "if any part of the packet is 
>> marked, all of it is marked" is very simple to implement, and this is 
>> proved by actual implementations in the wild.  This approach is also 
>> consistent with the behaviour when ECN is not in use, since if any 
>> fragment of a packet is lost, the packet can't be reassembled and is 
>> discarded in whole.
>>
>> I'd like to try and back this position up with some mathematical 
>> rigour from a different direction.
>>
>> One thing I noticed about Bob's position is that he's trying to 
>> preserve the number of bytes marked, but usually the transport is 
>> actually sensitive to the number of bytes, or possibly the interval 
>> of time, *between* marks.  These are two (or three) very different 
>> quantities, and it's important not to get them confused in the debate.
>>
>> In engineering, a very useful tool for checking whether an answer is 
>> correct is "dimensional analysis".  Simply put, work out the units 
>> that your inputs and outputs are most naturally expressed as, and 
>> check that the formulae linking them are consistent with respect to 
>> the units.  In this case, we can look at the units of the AQM's 
>> marking function, the units of the transport's congestion control 
>> steady state criterion, and whether each approach to resolving ECN 
>> marks at reassembly preserves the quality of the marking with respect 
>> to those units.
>>
>> For the purpose of this analysis, we are looking for the quantity 
>> that is most nearly preserved if we hold the AQM state constant or 
>> the transport in steady state, but vary packet size and/or link 
>> bandwidth arbitrarily.
>>
>> Starting with the AQMs:
>>
>> 1: Codel is a time-domain AQM; its marking function is explicitly a 
>> frequency, expressed for performance reasons as a time interval 
>> between marks.  For analysis, we can equivalently express this 
>> quantity in units of marks/sec or its reciprocal, seconds per mark.  
>> If we change the bandwidth or the packet size while holding Codel's 
>> control law constant, the number of bytes in marked packets and/or 
>> the number of packets between marks will change, but the time between 
>> marks will stay constant.
>>
>> 2: A packet-domain marking function (as used in BLUE and many 
>> RED-family AQMs) maintains a probability that any given packet will 
>> be marked.  Amortised over a large interval, we can express this as a 
>> (fractional) number of marks per packet, or equivalently as packets 
>> between marks. If the packet rate increases, so will the number of 
>> marks per time interval; if the packet size decreases, so will the 
>> number of bytes between marks.
>>
>> 3: A byte-domain marking function, as Markku described for early RED 
>> work, is obtained by taking a nominally packet-mode probability 
>> function, but multiplying the probability by bytes/packet (and a 
>> constant scale factor).  The resulting unit is (fractional) marks per 
>> byte, or bytes between marks.
>
> [BB] One doesn't see flow rate formulae for Reno (or DCTCP, Cubic, 
> etc) that say, "the rate is x if over PIE; but the rate is y if over 
> CoDel". The outcome solely depends on the transports involved, which 
> converge on a set of rates that fill the capacity irrespective of 
> which AQM is at the bottleneck. Therefore the unit that a particular 
> AQM uses to mark packets does not alter the steady-state outcome. This 
> is a critical point, so if you disagree, there's no need to move on to 
> the next part about transports - please stop here and solely explain 
> why you think this point is wrong.
>
>
> The fact that CoDel inserts an amount of time between marks is a 
> distraction. CoDel's control law decreases the time between marks 
> until it balances with the transports using it. The fact that RED, 
> BLUE, PIE etc use a packet marking probability is also a distraction. 
> They also adapt, increasing the marking probability to converge on 
> 'the same outcome' as CoDel would, if placed in the same location with 
> the same traffic sources.
>
> By 'the same outcome', I mean that if you replaced one AQM at the 
> bottleneck with another, the outcome would be:
> * the same number of marked packets per second (or equivalently the 
> same time between marks)
> * AND the same marking probability.
>
> If it existed, byte-mode marking (the early RED option) would have the 
> same property, but let's only talk about that if we need to, when we 
> get to re-framing, where it does make a difference.
>
> BTW, the time taken to reach the outcome (the dynamic performance) 
> will be different for different AQMs, but you're talking here about 
> stable outcomes, which is indeed one of the two important 
> considerations (the other being timeliness).
>
>
>> Now the transports:
>>
>> 1: DCTCP's steady state is two marks per RTT.  Since RTT has units of 
>> time, the steady-state criterion has units of marks/sec.  This is 
>> precisely the same as Codel's marking function.
>>
>> 2: Reno's steady state is BDP/2 RTTs per mark, where BDP has units of 
>> bytes (the two time terms in the formula cancel).  The steady-state 
>> criterion thus has units of byte-seconds per mark. This does not 
>> exactly match any of the AQMs' control laws, but it seems likely that 
>> "seconds per mark" or "bytes between marks" would be a better match 
>> than "packets between marks". Regardless, for constant performance 
>> with Reno, we would look for methods that preserve the byte-seconds 
>> per mark value.
>>
>> 3: CUBIC's steady state, when not in the Reno compatibility regime, 
>> is quoted in RFC-8312 as (BDP/MSS) = 1.054 * RTT^0.75 / p^0.75, where 
>> p is the (fractional) number of marks per segment, and the quantity 
>> BDP/MSS has units of segments.  But each segment, as seen by the 
>> transport, is MSS bytes long, and it is ignorant of any fragmentation 
>> that might happen at Layer 3 or 2 along the way - so if that occurs, 
>> "segments" is not the same as "packets" as seem by the AQM.  So we 
>> can convert p into a byte quantity by dividing it by MSS, yielding 
>> (BDP/MSS) = 1.054 * (RTT*MSS / p)^0.75. This reveals that for a 
>> constant cwnd, a constant byte-seconds per mark value is required, 
>> similar to Reno, and we can probably expect this to be true for other 
>> "TCP friendly" congestion controls as well.
>
> [BB] One can substitute parameters with different units into 
> steady-state flow rate formulae (as I show at the end, headed "Formula 
> Transformations"). So if you happen to choose to express a formula in 
> terms of a parameter with certain units, it doesn't prove anything 
> about reassembling markings. It just says you have chosen to express 
> the formulae in terms of a parameter with certain units.
>
> Dimensional analysis is very useful to check one's work. But you're 
> not doing dimensional analysis above (otherwise you'd be checking that 
> the two sides of an equation have the same units). You're just 
> expressing formulae in terms of a parameter with units that you are 
> comfortable with, then convincing yourself that this proves that 
> reassembly should use the units that you are comfortable with.
>
> When you say '"segments" is not the same as "packets" as seen by the 
> AQM', you are following the logic that the AQM examines the packets 
> and knows what time to insert between marks. It doesn't. The AQM 
> inserts a first guess at a time between marks, the transport responds, 
> and if it doesn't respond enough, the queue doesn't shrink enough, so 
> the AQM keeps on inserting less time before the next mark until the 
> time between marks from the AQM causes the flow rate to balance at a 
> steady queue around the target.
>
> As explained earlier, if you observe a PIE or BLUE AQM from the 
> outside, you might believe it is inserting a reducing time between 
> marks. But internally, it is reducing a marking probability. Looking 
> from the outside, the final probability it reaches can also be 
> considered as the final time between marks that it reaches. So 
> (again), the units the AQM adjusts internally are irrelevant to the 
> steady state it reaches.
>
>
> When I said that all these formulae can be expressed in terms of 
> marking probability, p, rather than time between marks, t_m, that was 
> not to prove that p is the correct unit to preserve on reassembly. 
> It's just to prove that each transport's formula can be expressed in 
> either unit, so the units chosen for a formula do not imply anything 
> about which units should be preserved on reassembly. Your formulae are 
> just convincing yourself that you can express formulae in terms of a 
> parameter in units that you are comfortable with.
>
> In order to convince anyone else, you're going to need to address the 
> question of why your proposal (also in RFC3168) to preserve the time 
> between marks leads to roughly x2 unfairness between flows that are 
> just below and just above the fragmentation limit. Whereas preserving 
> marking probability (or equivalently bytes marked), leads to no 
> unfairness. That's a real-life phenomenon that doesn't depend on which 
> units anyone chooses to reason about the world.
>
> You can stop here, if you are convinced. If not, go to the end, to 
> work through the transformations from formulae expressed in 'time 
> between marks' into formulae expressed in 'marking probability'.
>
>> So now we can turn to the methods of resolving congestion marks at 
>> reassembly.  I'll examine three candidate rules that seem relevant to 
>> the debate, though others probably exist:
>>
>> 1: "If any fragment of a packet is marked, the whole packet is 
>> marked."  This is the RFC-3168 rule, generalised to also cover the 
>> weird Layer 2 fragmentation scheme that sparked this debate.  It is 
>> clear that this rule preserves the number of bytes between marks and 
>> the time between marks (and therefore also the byte-seconds between 
>> marks), but not the number of packets between marks - except for the 
>> specific case where a fragment contains portions of more than one 
>> packet.  If these packets all belong to the same flow, then this is 
>> harmless to Reno and CUBIC since consecutive ECN marks are reported 
>> to the sender as a single event.  If multiple flows are involved, 
>> then this will stochastically amplify the congestion signal by having 
>> more than one flow react to a single mark - which seems tolerable in 
>> practice.
>>
>> 2: "Every mark is sacred, every mark is great."  The AQM produces one 
>> mark, and exactly one reassembled packet receives that mark; the 
>> reassembly process can make a reasonable choice as to which one.  
>> This preserves bytes between marks, time between marks, and 
>> byte-seconds between marks *exactly*.
>>
>> 3: "The number of bytes marked should be preserved by reassembly."  
>> This would tend to preserve the number of packets between marks 
>> better than the other two options, but that is not useful to any of 
>> the transports considered.  The distortion to the bytes between marks 
>> and time between marks metrics would be distorted proportionally to 
>> the difference between average fragmented and reassembled packet 
>> sizes, and the byte-seconds between marks measure would be distorted 
>> *quadratically* by that difference.
>>
>> In summary, the "preserve the number of marked bytes" rule is 
>> mathematically unreasonable, as well as being much more complicated 
>> to implement than either the existing RFC-3168 rule, or the "sacred 
>> mark" rule, which actually preserve the quantities that need to be 
>> preserved.
>
> [BB] In March 2020, I posted an example of the unfairness problem if 
> reassembly preserves the time between marks. The unfairness is between 
> flows of reassembled packets and packets that have not been 
> fragmented. And I showed that preserving marking probability (or 
> equivalently bytes marked) solves the problem:
> You can still go straight to the schematic that I referred to: 
> https://bobbriscoe.net/projects/netsvc_i-f/consig/encap/ecn-reassembly.pdf
> (and FYI, here's the original posting: 
> https://mailarchive.ietf.org/arch/msg/tsvwg/Da0sagcLnvPzh6xKFdHFRUZ9w5o/ 
> )
>
> That is the problem with the RFC3168 approach, and with your approach 
> which has the same goal. I have carefully worked through your long 
> email, but it doesn't even mention the problem I've levelled against 
> the scheme you prefer. That is the whole nub of this argument. Please, 
> you cannot continue to keep asserting that you are right, without 
> engaging with that problem. And you cannot keep just asserting that I 
> don't know what I'm talking about and you do, unless you refute my 
> arguments.
>
>
> In summary, your long email had three parts:
>     #1 AQMs
>     #2 Transports
>     #3 Schemes to reassemble marking.
> Part #3 consists of assertions that the reassembly scheme you assert 
> is right is the one that preserve the units that you assert are right. 
> But you haven't given any arguments for why, and you haven't even 
> tried to support your arguments using the first two parts.
>
> I suspect you think your arguments in #1 & #2 are proofs. So, I have 
> explained (and proved):
> 1) why the units of the parameter that an AQM manipulates are irrelevant.
> 2) that a flow formulae can be expressed in terms of a parameter with 
> units either in (time between marks) or in (marking probability). So 
> choosing to express a flow formula in terms of a parameter with one 
> unit or the other is just your choice. It proves nothing about which 
> is the correct unit to preserve during reassembly. (Supporting maths 
> follows at the end)
>
>>
>>   - Jonathan Morton
>
> [BB] Regards
>
>
>
> Bob
>
> ==Errors in your analysis of transports==
>
> Your analysis of DCTCP is OK, but your analyses of Reno and Cubic 
> contain errors:
>
> * Reno: Given you define BDP in bytes, your criterion for RTTs per 
> mark should be (BDP/MSS)/2 , not BDP/2.
>     and a picky point: the const should be approx 2/3 not 1/2.
> * Cubic: Obviously, your two formulae compared below cannot both be 
> true (by dimensional analysis, ironically):
>     (BDP/MSS) = 1.054 * (RTT / p)^0.75
>     (BDP/MSS) = 1.054 * (RTT*MSS / p)^0.75
>     p is dimensionless (a fraction) so you cannot just divide it by 
> MSS to "get a byte quantity", while not changing the other side of the 
> equation.
>
>
> ==Formula Transformations==
> Here, we transform rate formulae in terms of (time between marks) into 
> formulae in terms of (marking probability).
>
> For brevity I'll use the following notation, all at steady state, and 
> with units shown in [square brackets]:
>     R : RTT [s/round]
>     S : MSS [B]
>     x : flow rate [B/s] = S/R * cwnd
>     t_m : time between marked packets [s/marked_pkt]
>
>     r : packet rate [pkt/s] = x/S by definition
>     BDP : bandwidth-delay product [B/round]
>     p : marking probability [], meaning a dimensionless fraction
>
> The time between marks is the reciprocal of the marks per time:
>     t_m = 1/(p*r)
> where p*r is the marks per time, because r is packets per time, and 
> each packet is marked with probability p.
> Expressing this in terms of flow rate rather than packet rate:
>     t_m = S / (p*x) (1)
>
> We can then substitute the above in your formulae for rounds per mark:
>
> ==DCTCP==
>     mark per round = 2
>     mark per time = 2 / R
> Time per mark is the reciprocal,
>     t_m = R/2
> Substituting from eqn (1)
>     S / (p*x) = R/2
>     x = S/R * 2/p
> which is the well-known DCTCP formula for flow rate, better known in 
> terms of cwnd in units of [packet]:
>     cwnd = 2/p.
>
> ==Reno ==
>    rounds/mark = 2/3 * BDP/S            (corrected from your formula)
>     t_m = R* (rounds/mark)
>            = 2/3 R*BDP/S
>            = 2/3 R*R*x/S
> Substituting from eqn (1)
>     S / (p*x) = 2/3 R*R*x/S
> Rearranging and collecting terms:
>     x^2 = 3/2 S^2/(R^2 * p)
>     x = S/R sqrt(3/2p)
> which is the well-known Reno formula for flow rate, better known in 
> terms of cwnd in units of [packet]:
>     cwnd = sqrt(3/2p).
>
> ==Cubic==
> We can transform in the other direction from the Cubic formula that 
> RFC 8312 expressed in p, to a formula expressed in t_m:
>     (BDP/S) = 1.054 * (R / p)^0.75
>     x*R/S = 1.054 * (R * t_m * x / S)^0.75
> Rearranging and collecting terms:
>     x^0.25 = 1.054 * S^0.25 * t_m^0.75 / R^0.25
>     x = 1.054^4 * S * t_m^3 / R
>
> In summary, the following generalized formulae can be used to express 
> flow rate either in terms of p (marking probability) or in terms of 
> t_m (time between marks):
>     x = k * S / (p^b * R^c)
>     x = k * S * t_m^d / R^e
> where:
>             b    c    d      e
>     ____________________________
>     DCTCP   1    1  infty  infty
>     Reno   1/2   1    1      2
>     Cubic  3/4  1/4   3      1
>
> (the infty values are because DCTCP flow rate, x, does not depend on 
> t_m or R, given t_m = R/2, with no dependence on x)
>

-- 
________________________________________________________________
Bob Briscoe                               http://bobbriscoe.net/