Re: [tsvwg] draft-ietf-tsvwg-rfc6040update-shim:SuggestedFragmentation/Reassemblytext

Markku Kojo <kojo@cs.helsinki.fi> Mon, 29 March 2021 22:01 UTC

Date: Tue, 30 Mar 2021 01:00:43 +0300 (EEST)
From: Markku Kojo <kojo@cs.helsinki.fi>
To: Bob Briscoe <ietf@bobbriscoe.net>
cc: Markku Kojo <kojo=40cs.helsinki.fi@dmarc.ietf.org>, "tsvwg-chairs@ietf.org" <tsvwg-chairs@ietf.org>, Joe Touch <touch@strayalpha.com>, "tsvwg@ietf.org" <tsvwg@ietf.org>
In-Reply-To: <8ac0d6dd-1648-ee8d-d107-55ef7fe7695f@bobbriscoe.net>
Message-ID: <alpine.DEB.2.21.2103241738090.28263@hp8x-60.cs.helsinki.fi>
References: <CE03DB3D7B45C245BCA0D243277949363076629A@MX307CL04.corp.emc.com> <CE03DB3D7B45C245BCA0D24327794936307662EA@MX307CL04.corp.emc.com> <1920ABCD-6029-4E37-9A18-CC4FEBBFA486@gmail.com> <CE03DB3D7B45C245BCA0D2432779493630768173@MX307CL04.corp.emc.com> <6D176D4A-C0A7-41BA-807A-5478D28A0301@strayalpha.com> <CE03DB3D7B45C245BCA0D24327794936307688C5@MX307CL04.corp.emc.com> <alpine.DEB.2.21.1911171041020.5835@hp8x-60.cs.helsinki.fi> <9024d91a-bb08-fb45-84f8-ce89ba90648d@bobbriscoe.net> <alpine.DEB.2.21.2012141735030.5844@hp8x-60.cs.helsinki.fi> <1e038b64-8276-3515-ac45-e0fc84e1c413@bobbriscoe.net> <alpine.DEB.2.21.2103081540280.3820@hp8x-60.cs.helsinki.fi> <3c778eb9-56dc-3d58-0de4-c6373d1090ec@bobbriscoe.net> <alpine.DEB.2.21.2103181233160.3820@hp8x-60.cs.helsinki.fi> <8ac0d6dd-1648-ee8d-d107-55ef7fe7695f@bobbriscoe.net>
User-Agent: Alpine 2.21 (DEB 202 2017-01-01)
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="=_script-31931-1617055245-0001-2"
Content-ID: <alpine.DEB.2.21.2103292219000.28263@hp8x-60.cs.helsinki.fi>
Archived-At: <https://mailarchive.ietf.org/arch/msg/tsvwg/AtEI72QCFhOWOn9d6xNcrssTVzs>

Hi Bob, all,

My apologies for the delayed reply and for the length of this
description, but I wanted to make it clear this time. I also moved a part
of the discussion to another thread already created for
ecn-encap-guidelines.

See inline, tagged [MK].

On Sat, 20 Mar 2021, Bob Briscoe wrote:

> Markku, all,
> 
> On 18/03/2021 14:18, Markku Kojo wrote:
>       Hi Bob, all,
>
>       apologies for the additional delay on this.
>
>       Please see inline after Bob's suggestion, my 2 cents on this tricky issue tagged [MK].
>
>       On Mon, 8 Mar 2021, Bob Briscoe wrote:
>
>             Markku,
>
>             Thx. Unfortunately, this draft is coming up in the mtg this afternoon.
>             I take some of the blame -  only re-posting the draft a couple of hours ago, which
>             presumably reminded you that you were going to think on this.
>
>             inline tagged [BB]
>
>             On 08/03/2021 13:48, Markku Kojo wrote:
>                   Hi Bob,
>
>                   this issue and text on it seems acceptable to me.
>
>                   However, the other issue with the two contradictory SHOULD's - that I
>                   now notice I have never replied to - seems not ok, I think.
> 
>
>             [BB] The question is whether we have to solve this now.
>
>             I have a solution to resolve the contradiction. At the risk of prolonging the
>             progress of this draft, I'll say it now. But if we can't resolve this in the next
>             couple of days, I think we should go ahead with the two contradictory SHOULDs.
>
>             For the list, here's the two contradictory paras:
>
>                Congestion indications SHOULD be propagated on the basis that an
>                encapsulator or decapsulator SHOULD approximately preserve the
>                proportion of PDUs with congestion indications arriving and leaving.
>
>                The mechanism for propagating congestion indications SHOULD ensure
>                that any incoming congestion indication is propagated immediately,
>                not held awaiting the possibility of further congestion indications
>                to be sufficient to indicate congestion on an outgoing PDU.
>
>             Possible resolution of the contradiction: the "SHOULD approximately preserve the
>             proportion" is a rough long term average goal while "SHOULD ensure that incoming
>             congestion indication is propagated immediately" is a requirement for after there
>             has been some period (TBD) without any marking.
>
>             The big question is where would an implementer set that timescale? It needs to be a
>             "typical RTT in the deployment environment" or some such get-out clause. I guess
>             this is best left to the implementer.
> 
>
>       [MK]:
>
>       My concerns were and still are mainly for traffic under Standards Track congestion control, i.e.,
>       majority of the traffic foreseen for a long time. So, if not otherwise mentioned, the comments apply
>       for handling properly the traffic under Standards Track CC.
>
>       The subject line was for the shim draft, but the solution IMO should be the same for all cases where
>       "fragments" are decapsulated/reassembled
>       and my comments therefore mainly address fragmentation & reassembly. i.e., when small fragmented
>       packets are under AQM drop and later reassembled.
> 
> 
> [BB] It's not enough to make ecn-encap the same as shim. The reassembly logic in RFC3168 is only defined when
> packets are reassembled from /smaller/ fragments. When a L2 frame is /larger/ than an IP packet, or /overlaps/
> the boundary between IP packets, the reassembly logic in RFC3168 is undefined - it makes no sense.

[MK] Sure. I did not mean to say that the cases when a L2 frame is larger 
than an IP packet, or when the frame/packet boundaries do not match would 
not need to be addressed. I meant that the case I had considered mostly 
concerns reassembling/decapsulating IP packets from smaller 
fragments/frames. And, when inner IP packets are decapsulated from smaller 
L2 frames or from smaller IP fragments, the solution should be the same 
for all cases (modulo L2 overlapping IP pkt boundaries), if possible. 
And, since my conclusion was that the proposed two contradictory
requirements (the two contradictory SHOULDs) are not adequate to solve
the "from-smaller-to-larger" problem, it did not matter whether those
two paras would work for the "from-larger-to-smaller" or the "overlapping
IP packet/L2 frame boundary" problem, which I had not yet considered much
at all.

See my clarifications further below on why the two contradictory
requirements do not work properly (in particular) for the
"from-smaller-to-larger" problem.

> For instance, some link layers treat IP packets as a continuous byte stream, then break the stream into the
> largest possible frames, like so:
> 
> ----------------->+<---------------------------->+<------------------------------>+<----
>         Fr1       |                Fr2           |             Fr3                |    
> +-------------+-------------+-------------+-------------+-------------+-------------+---
> |   Pkt1      |    Pkt2     |    Pkt3     |   Pkt4      |    Pkt5     |   Pkt6      |
> +-------------+-------------+-------------+-------------+-------------+-------------+---
> 
> Then, say Fr2 was marked. On decap should Pkt2, Pkt3 & Pkt4 be marked, or just Pkt3 & Pkt4?

[MK] Ok, now that I think of this, I'd say that for Standards Track CC
it would be enough that one of the IP pkts gets marked. Marking more
than one makes no difference when the marked pkts (e.g., Pkts 2, 3,
and 4) belong to the same IP flow, because multiple marks are
considered as a single congestion signal.
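
To make the two candidate answers to Bob's question concrete, here is a
minimal sketch that treats packets and frames as spans over one byte
stream. The sizes are made up (1000-byte packets, frames of 1300, 2200
and 2400 bytes) and are chosen only so that the layout resembles Bob's
diagram; the function names are hypothetical and not taken from any
draft.

```python
def spans(sizes):
    """Return the [start, end) byte offsets of consecutive PDUs in one stream."""
    result, offset = [], 0
    for size in sizes:
        result.append((offset, offset + size))
        offset += size
    return result


def packets_for_frame(pkt_sizes, frame_sizes, frame_index):
    """Return two candidate sets of (1-based) packet numbers for a marked frame.

    overlapping: every packet that shares at least one byte with the frame
    starting:    only the packets whose first byte lies inside the frame
    """
    pkts = spans(pkt_sizes)
    f_start, f_end = spans(frame_sizes)[frame_index]
    overlapping = [i + 1 for i, (s, e) in enumerate(pkts)
                   if s < f_end and e > f_start]
    starting = [i + 1 for i, (s, e) in enumerate(pkts)
                if f_start <= s < f_end]
    return overlapping, starting


# Assumed sizes, chosen so that Fr2 covers the tail of Pkt2, all of Pkt3
# and the head of Pkt4, as in the diagram.
PKT_SIZES = [1000] * 6
FRAME_SIZES = [1300, 2200, 2400]

overlapping, starting = packets_for_frame(PKT_SIZES, FRAME_SIZES, 1)
print("overlapping Fr2:", overlapping)   # Pkt2, Pkt3, Pkt4
print("starting in Fr2:", starting)      # Pkt3, Pkt4
```

The sketch only enumerates the candidates; which of them (or just one
packet out of them) ought to carry the mark is exactly the open question.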

However, things get more subtle, if the Pkts 2, 3, 4 belong to different 
flows and even more subtle if these flows potentially do not share 
the same e2e path. Let's move further discussion on this question 
about "when packet/frame boundaries overlap" to the new thread (with 
Subject: ecn-encap-guidelines reframing section) Bob already created. 
I'll try to clarify my thoughts on this (additional) problem when 
packet/frame boundaries overlap by replying there shortly.

> Section 4.6 of ecn-encap (where the contradictory SHOULDs are) covers re-framing and definitely does not cover
> fragmentation/reassembly.
> (Fragmentation/reassembly is only covered by RFC3168 and by Section 5 of the shim draft.)
> 
> The scope of section 4.6 on reframing in ecn-encap never included fragmentation until I was asked to widen it
> for draft-13. But in subsequent conversation, it was agreed that fragmentation should be referred to RFC3168. So
> after draft-14 there was no longer any reason for a draft about L2 encapsulation to give any guidelines about
> fragmentation (which had never really been the intention).

[MK] Understood. But ecn-encap should also cover the case of
decapsulating larger IP packets from smaller L2 frames. The current text
tries to cover this case with the same text (two contradictory SHOULDs)
for all cases:
1) from smaller frames to larger IP packets,
2) from larger frames to smaller IP packets, and
3) when IP packet/L2 frame boundaries overlap and the IP packets and L2
frames are of the same size or of different sizes (either way).

IMHO it would be very hard, if not impossible, to end up with a correct
implementation from this very high-level text consisting of the two
contradictory SHOULDs. If we cannot devise a solution that works, how
could we assume all implementors would? (Some might, but most won't,
because based on these two short paras they would have no idea what
problem they are supposed to solve, and there is no guidance.)

Furthermore, it is very unclear to me where this draft (ecn-encap)
refers to RFC3168 for the fragmentation case (i.e., decapsulating larger
IP packets from smaller L2 frames).

>       The suggestion above may work with high fidelity CC traffic like L4S but unfortunately not with
>       standard CC traffic. I have doubts about correct behaviour with L4S traffic though.
> 
> 
> [BB] L4S is irrelevant to this conversation. It is experimental and so has to fit in with the standards track it
> finds in the Internet.

[MK] Agreed. I just mentioned L4S because the proposal you (earlier)
gave, i.e., the solution of using a single "deficit" counter, would
possibly work (with some restrictions) only if there is a continuous and
relatively high level of congestion where the AQM at the bottleneck
marks one or more pkts per RTT. With Standards Track congestion controlled
traffic this would require a high level of statistical
multiplexing at the bottleneck such that marks per flow are spaced by
multiple RTTs but one or a few flows get marked each RTT. In this case
the aggregate sawtooth is very shallow and would also allow using a very
low queuing/marking threshold at the bottleneck AQM. Such behaviour
resembles the behaviour that L4S targets, but L4S is able to achieve
such behaviour also with a smaller number of competing flows or even with
a single flow, whereas Standards Track CC cannot. That's why I mentioned L4S.

So, for Standards Track CC we need to have a solution that works
correctly also with a single flow or a small number of competing flows,
which is the typical traffic we see at the bottlenecks close to the
network edge (that is, where the bottleneck usually is).
And the proposed approach would not work properly for such a case (see
explanation below).

> The reframing section in ecn-encap-guidelines existed long before L4S was even thought of (indeed it was in the
> first ever ecn-encap draft in March 2011).
> 
>
>       There are two major problems:
>
>       1) The suggested approach assumes an AQM that uses probabilistic
>          dropping. Not all AQMs employ probabilistic dropping. If
>          the decapsulator/node doing reassembly does not know which
>          type of AQM marked the PDUs/fragments, it cannot make the right
>          decision to not apply the above approach of preserving
>          the proportion of PDUs with congestion indications arriving
>          and leaving in case of an AQM that does not employ probabilistic
>          dropping.
> 
> 
> [BB] It doesn't assume probabilistic at all. If you've got that impression from me saying in the example linked
> below "an AQM marks 2% of packets," that doesn't mean I'm saying the AQM is probabilistic. If a deterministic
> AQM like CoDel marks every 50th packet, that is still equivalent to 2% marking.

[MK] No, that is not the reason why I am saying it. I am saying it because 
the suggested approach of "SHOULD approximately preserve the proportion 
of marked bytes" is only required with AQMs that use probabilistic 
packet-mode marking (my apologies for being unclear earlier: I should 
have said "assumes probabilistic *packet-mode* marking").

So, now we got to one of the major points where our views of different 
AQMs and how they behave diverge:

In short, CoDel does not decide the spacing between marked packets by
counting packets; it uses time (= number of bytes delivered, if the
link bit-rate happens to be fixed, which is not always the case though).
If the bottleneck is bit-congestible, it delivers (and queues) the same
amount of bytes with small packets as it does with large packets in the
same period of time (the delivered payload bytes being the same modulo
the difference in header overhead).
This means that even if the suggested two-SHOULDs approach worked with
probabilistic packet-mode marking, it gives the wrong outcome with
deterministic AQMs like CoDel and with AQMs that employ probabilistic
byte-mode marking, because such AQMs already adjust (reduce) small packet
marks to the same level as larger pkt marks. If a
reassembly/decapsulation process reduces marks further with such AQMs, we
end up in exactly the situation we want to avoid, where fragmented
traffic gets a lower level of marking (let's not argue here whether
AQMs with probabilistic byte-mode drop exist or not).

Next, a longer description of packet dynamics at a bottleneck to
explain why the "two contradictory SHOULDs" approach would not work
correctly (not even for probabilistic packet-mode marking):

Here we need to look at the actual behaviour of AQMs (CoDel included) and
packet dynamics to understand it correctly. That is, we need to understand
for how long we should keep counting the bytes to get a correct outcome,
and that we cannot blindly look at the drop/mark probabilities to
figure out the resulting (TCP) performance. What we are interested in is
not a random marking probability but the effective drop/mark
probability for a flow, which is determined by the MSS, the e2e RTT, the
bit-rate of the bottleneck link, and the amount of queuing that an AQM
manages at the bottleneck.
My apologies also for including some basics below, but I am just trying to
make it as clear as possible.

Basically all AQMs set a queuing threshold (target) for the level of
queuing after which they are designed & configured to enter the main
operating range of the AQM and may start dropping/marking pkts (PDUs).
Let's assume the AQM measures the length of the queue in bytes (or its
dual, queuing delay). For simplicity, let's also assume that a sender at
the end point is in congestion avoidance.
When the sender increases its sending rate by inflating cwnd one MSS per
RTT, at some point it fully utilizes the bottleneck and then after some
number of RTTs the queuing level reaches the queue threshold (target) set
by the AQM. Only from then on does the AQM start dropping/marking packets.

If we use /probabilistic marking/, each packet is subject to marking
in the main operating range of the AQM, but the initial marking
probability is low such that it typically takes some number of RTTs
before the first packet is marked.
Let's make this concrete with an example scenario where the
bandwidth-delay product of the path is 10 full-sized packets, the queue
threshold is set to 4 full-sized (MSS-sized) packets and the initial drop
probability is around 1% (*). If we use /probabilistic packet-mode
marking/, then after reaching the q threshold it takes on average 100 pkts
until the first packet is marked. That is, after cwnd has reached 14 MSS
(pkts), it takes 6 RTTs to the first mark (14+15+16+17+18+19 = 99 pkts),
making cwnd 20 MSS upon the first mark. When the sender gets notified
about the mark, the Standards Track CC halves the cwnd to 10 MSS and
another sawtooth cycle starts.

Now, if the full-sized packets that the sender is sending get fragmented
into two equal-sized fragments (**) on the path before the AQM node and we
use /packet-mode marking/, the first packet (fragment) is marked
(roughly) after 3 RTTs, i.e., on average after 100 fragments when the
cwnd has increased to 17 MSS (and the second mark would be after 6 RTTs
if the first mark is removed at reassembly). So, with /packet-mode
marking/ we need to reduce the marks at reassembly to get an equitable
effective marking rate compared to the non-fragmented case, such that after
reassembly the first mark for an inner pkt is forwarded only after 6 RTTs
on average (i.e., when cwnd has reached 20 MSS). Otherwise the throughput
with fragmentation is lower than without fragmentation, because a mark
after 100 fragments would result in halving the cwnd earlier and starting
the next sawtooth cycle with a cwnd of 8 MSS (or 8.5 MSS). 8 full-sized
pkts (8 MSS) is less than the bandwidth-delay product of the example path,
meaning that we underutilize the link at the beginning of the cycle until
cwnd = 10 MSS.

So, if I understand it correctly, the above is the aspiration of
the two contradictory SHOULDs, right? Obviously it is needed with
/packet-mode marking/ only. With /probabilistic byte-mode marking/ and
with /deterministic CoDel-like marking/ at an AQM we would mark the first
fragment (roughly) at the same time (= at the same level of queuing) as we
would mark the first non-fragmented (full-sized) packet. And this would be
without manipulating marks at reassembly (for completeness, see further
below the behaviour with probabilistic byte-mode marking and with
deterministic CoDel-like marking).

Next, let's look at whether the suggested approach with the two
contradictory SHOULDs would work with /packet-mode marking/:

From the example above we see quite clearly why the second SHOULD that
requires an immediate mark "if there has been some period (~ typical RTT) 
without any marking" would result in incorrect operation because there 
were several RTTs without marking before the first fragment got marked. 
Hence, following the second SHOULD we would keep the mark for the first 
marked fragment at reassembly. In other words, the second SHOULD results 
in reducing marks only within one RTT which is wrong (and unnecessary 
with the Standards Track CCs that treat multiple marks as a single 
congestion signal). Obviously determining any longer timeout value that 
would work correctly is very hard because the number of RTTs over which 
we need to count the bytes varies with the e2e RTT and the parameters of 
the AQM that determine the level of queuing. And the reassembling node 
does not know any of these. 
Moreover, with probabilistic marking the actual marking doesn't occur 
always at the same point as the average used in the above example but 
varies around the average, altering the # of RTTs we need to keep 
counting bytes, making it even harder.

(*) the AQM marking probability of course increases as the queue builds 
but often quite slowly and the initial marking probability often is even 
lower. These details depend on the AQM and how its parameters have been 
configured, but using an avg 1% marking probability gives roughly the 
right outcome in this simplified example.

(**) pls ignore the byte-size difference due to more header bytes in
fragmented pkts.

For completeness let's look at the behaviour with /probabilistic byte-mode 
marking/ using the same example scenario as above:

If we use /byte-mode marking/ that halves the per-packet (fragment)
marking probability, on average the first marked fragment comes after 200
pkts (frags), i.e., roughly at the same level of queuing as with
non-fragmented packets (after 200 fragments and 6 RTTs), and we must not
reduce the marks at reassembly.
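
The averages used in the probabilistic examples above can be reproduced
with a small sketch. It is a deliberate simplification and not a model of
any real AQM: it assumes the first mark falls exactly on the average (the
100th or 200th unit after the threshold is crossed), that the sender emits
one cwnd of packets per RTT, and that cwnd grows by one MSS per RTT. The
name first_mark and its parameters are made up for this illustration.

```python
def first_mark(cwnd_at_threshold, units_per_packet, units_until_mark):
    """Return (full RTTs elapsed, cwnd in MSS) at the moment of the first mark.

    cwnd_at_threshold: cwnd (in MSS) when the AQM enters its operating range
    units_per_packet:  1 for unfragmented packets, 2 for two fragments each
    units_until_mark:  average number of units (packets or fragments) that
                       pass the AQM before the first one is marked
    """
    cwnd, sent, rtts = cwnd_at_threshold, 0, 0
    # Complete whole RTTs for as long as the marked unit is not in this RTT.
    while sent + cwnd * units_per_packet < units_until_mark:
        sent += cwnd * units_per_packet
        cwnd += 1          # congestion avoidance: +1 MSS per RTT
        rtts += 1
    return rtts, cwnd


# Threshold reached at cwnd = 10 (BDP) + 4 (queue threshold) = 14 MSS.
packet_mode_unfragmented = first_mark(14, 1, 100)   # 1% per packet
packet_mode_fragmented = first_mark(14, 2, 100)     # 1% per fragment
byte_mode_fragmented = first_mark(14, 2, 200)       # 0.5% per fragment

print(packet_mode_unfragmented)  # (6, 20)
print(packet_mode_fragmented)    # (3, 17)
print(byte_mode_fragmented)      # (6, 20)
```

Only the packet-mode case with fragments lands at a different point in the
sawtooth, which is the case where reducing marks at reassembly would be
needed.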

And the behaviour with CoDel which uses /deadline-based/ (or 
deterministic) marking:

Assume the bottleneck link bandwidth in the example scenario above is such
that we have configured the CoDel Target such that it is exceeded when
4 full-sized packets are queued and the Interval such that it happens to
equal the serialization delay of 100 full-sized pkts (I'm not saying
these values are preferred or realistic, but just to make it easier to
compare the cases in this example).

Now, with /CoDel and full-sized packets/ the AQM enters its main operating 
range  when the cwnd is 14 MSS (pkts) and marks the first packet after 
100 full-sized packets, that is, after 6 RTTs with cwnd of 20 MSS.

With /CoDel and fragmented packets/ the AQM enters its main operating
range also when the cwnd is 14 MSS and it marks the first fragment
(packet) roughly after 200 fragments, that is, after 6 RTTs when the cwnd
is 20 MSS. So, with CoDel there is /no/ need to reduce the marks at
reassembly, because the Interval (i.e., the serialization delay) of 100
full-sized packets is *roughly* the same as that of 200 (half-sized)
fragments (pls ignore again the difference due to more bytes in
fragmented pkt headers).
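
The time-versus-count point can be checked with a few lines of arithmetic.
This is not CoDel's control law, only the equivalence between a fixed time
interval and the number of units serialized in it on a fixed-rate link.
The MSS of 1500 bytes, the 12 Mbit/s link and the 100 ms Interval are
assumed values picked so that the Interval equals 100 full-sized
serialization times; header overhead is ignored, as in the example.

```python
def units_serialized(interval_us, link_bps, unit_bytes):
    """Number of whole units of unit_bytes a fixed-rate link sends in interval_us."""
    return (interval_us * link_bps) // (8 * unit_bytes * 1_000_000)


MSS_BYTES = 1500          # assumed full-sized packet
LINK_BPS = 12_000_000     # assumed link rate: 1 ms per 1500-byte packet
INTERVAL_US = 100_000     # assumed Interval: 100 ms

full_sized = units_serialized(INTERVAL_US, LINK_BPS, MSS_BYTES)
half_sized = units_serialized(INTERVAL_US, LINK_BPS, MSS_BYTES // 2)

print(full_sized, half_sized)   # 100 200
```

A marker that waits for a time interval therefore sees twice as many
fragments as full-sized packets before it marks, with no size-specific
logic in the AQM.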

I hope this helps explain why we cannot adjust marked packets at
reassembly/decapsulation:

1) it depends on the AQM in use, and the reassembling node cannot
know which kind of AQM is marking the packets, and

2) it is hard (impossible) to devise an algorithm that would work correctly
even if we knew that the AQM at the bottleneck is using probabilistic
packet-mode marking.

And to generalize the problem further, if we operate with Non-ECT traffic 
then the only place to adjust (reduce) the drops for small 
fragments/frames is at the AQM, because we cannot recover dropped 
fragments when reassembling/decapsulating.

> The average proportion of marks at a fully utilized bottleneck determines the average capacity share of standard
> congestion controls (recall the 1/sqrt(p) in the Reno equation). So, even if there were no probabilistic AQMs in
> the Internet, the two contradictory SHOULDs would still be necessary, because satisfying the second SHOULD alone
> (undelayed signal) would roughly double the long-term average proportion of marking of fragmented vs
> unfragmented packets (thus breaking the first SHOULD). I explained this in the posting to Jake here:
> https://mailarchive.ietf.org/arch/msg/tsvwg/Da0sagcLnvPzh6xKFdHFRUZ9w5o/

[MK] As shown above this is incorrect.

And, we need to be very careful about how/when we use the Reno equation.
Remember it has been derived for the case with infinite bandwidth and
random drop probability, and the formula then gives the *maximum*
throughput that can be achieved. You cannot use the AQM drop probability
as the p in the reduced formula, because then the p in the formula would
not be the effective marking (dropping) probability for the flow.
When the bottleneck link bandwidth is a limiting factor for throughput,
we must include RTT in the formula and derive p by figuring out the
maximum window (which in turn depends on the queue length at the
bottleneck). The effective drop (mark) probability decreases as the
queue length is increased and vice versa, resulting in a roughly
correct outcome (with some restrictions) because RTT increases
when p decreases.
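
As a rough illustration of deriving the effective p from the maximum
window, the sketch below uses the usual idealized sawtooth (one mark per
cycle, cwnd growing from W/2 to W by one MSS per RTT, no delayed ACKs), in
which about 3*W^2/8 packets are sent per cycle and W = sqrt(8/(3p)). That
model and the function names are assumptions added here, not something
taken from the drafts; the window values 20 and 17 MSS come from the
example scenario above.

```python
import math


def packets_per_cycle(w_max):
    """Approximate packets sent in one sawtooth cycle from w_max/2 to w_max."""
    return 3 * w_max ** 2 / 8


def effective_mark_probability(w_max):
    """One mark per cycle, so the effective p is 1 / (packets per cycle)."""
    return 1 / packets_per_cycle(w_max)


p_unfragmented = effective_mark_probability(20)  # first mark at cwnd 20 MSS
p_fragmented = effective_mark_probability(17)    # first mark at cwnd 17 MSS

# Sanity check: feeding the effective p back into W = sqrt(8 / (3 p))
# recovers the maximum window it was derived from.
w_recovered = math.sqrt(8 / (3 * p_unfragmented))

print(round(p_unfragmented, 4))  # about 0.0067, not the AQM's initial 1%
print(round(p_fragmented, 4))    # about 0.0092
print(round(w_recovered, 6))     # 20.0
```

In this idealized model the effective p follows from where in the sawtooth
the mark lands, and differs from the per-packet probability configured in
the AQM.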

> Nonetheless, the second SHOULD (undelayed signal) is useful when the AQM is only fully utilized intermittently.
> 
> Actually, you don't say what other type(s) of AQM you are thinking of. I imagine:
> * Spacing based (like CoDel or PDPC)

[MK] CoDel is one example. PDPC does not quite fit because it counts the
number of pkts for spacing, if I recall correctly. If PDPC counted
the bytes in the packets when measuring the spacing between marks and set
the spacing threshold also in bytes, then it would work pretty much like
CoDel (but use fixed spacing) and would not need any manipulation of
marks at reassembly.

> * Threshold based

[MK] If Threshold based means something like DCTCP then it would not be 
directly applicable with Standards Track CCs anyway.

Just speculating: even if a Threshold-based AQM could somehow be made
to work with Standards Track CCs (by changing Standards Track CC
MD and AI behaviour), it probably would not make much difference for
behaviour in congestion avoidance even though it would mark more small
fragments than full-sized packets, because fragments from the same
original packet would typically arrive back-to-back (or at least
within the same RTT) and get marked anyway. But this is without
really thinking it through, so I would not like to start arguing on it.

> Or did you mean something else?
> 
> Strictly, I also don't know what approach you are saying won't work, because no approach is described in
> ecn-encap any more. Just the two contradictory SHOULDs. And I deliberately didn't describe a way to resolve the
> two SHOULDs in detail in my email.
> 
> Reason: we need to take this one step at a time. So I've proposed a first step where we reach consensus that
> both these contradictory requirements are necessary. If I propose a compromise between the two contradictory
> requirements, people seem to delight in pointing out that it's not perfect for half the compromise (which is the
> definition of a compromise!).
> 
> So first things first, do you accept there's a trade-off here (even if there was only standard TCP traffic on
> the Internet) and the two contradictory SHOULDs capture it?

[MK] As is obvious from the above, I don't think that the two 
contradictory SHOULDs would capture it.

>       2) The majority of the congestion controlled traffic today is
>          non-ECT traffic, i.e., traffic under loss-based CC (let's
>          put delay-based CC aside for now to keep things simple enough),
>          and the above solution does not work for it.
>
>          The guiding principle of the original ECN design was to treat
>          ECN traffic equitably with the loss-based traffic, i.e., not
>          to give preference to ECN traffic. If the congestion
>          indications marked on small fragments are reduced when
>          reassembled then the ECN traffic is preferred over loss-based
>          CC traffic because it is impossible to reduce lost fragments
>          but each lost fragment results in loss of the entire packet
>          at reassembly and therefore triggers a congestion indication
>          without exception.
> 
> 
> [BB] That's true. Certainly, if there's an AQM in a tunnel, like the example in the email to Jake linked above,
> fragmented NECT packets running alongside non-fragmented will tend to experience twice the drop level.

[MK] This is only true with probabilistic packet-mode drop. With 
byte-mode drop as well as with deterministic CoDel-like drop the number 
of dropped/marked fragments for the F-flow is roughly the same as that of 
marked packets for the N-flow.

> This is
> the same problem as there used to be when IP was carried over ATM and each cell loss amplified into a whole
> packet loss. But that doesn't mean we have to make ECN reassembly exactly mimic the rubbishness of drop.
> 
> Whatever, I'm not sure what you're criticizing here 'cos the shim draft now essentially does mimic the
> rubbishness of drop, by deferring to RFC3168:
>     https://datatracker.ietf.org/doc/html/draft-ietf-tsvwg-rfc6040update-shim-13#section-5
> 
> While the ecn-encap draft doesn't currently define an approach, it just gives the two contradictory SHOULDs.
> 
> My email suggestion to draw a line between the two approaches at a 'typical RTT' was deliberately high-level. It
> would mimic the rubbishness of drop when the spacing of the markings seen by each flow was wider than the
> 'typical RTT' (i.e. intermittent congestion episodes) but it would ensure the proportions were the same during
> persistent congestion. Rationale: when the spacing is narrow (persistent congestion), the exact timing of each
> notification doesn't matter because it wasn't long since the last and won't be long before the next.
> 
> 
>
>       The suggested solution also would be very hard to configure because RTT is not known. The potential
>       RTT range in the Internet is from sub-millisecs to over 1 sec. In some specific environments the
>       typical RTT could be known but not for the general case and these docs are targeting Standards
>       Track, so the solution should apply for all environments.
> 
> 
> [BB] I didn't say 'the RTT' I said 'the typical RTT in the deployment environment'.
> 
> The typical RTT (the mode) in the environment of a particular link can be known. It can be measured (out of
> band), then configured. And it could be reviewed occasionally, although it's unlikely to change for many years
> at a time.

[MK] I disagree. You may measure what the median or mean (~ typical) is.
However, in many environments the peer is either near (local, CDN, etc) or
it might be on the other side of the Internet with an order of magnitude
longer RTT. RTTs to a local peer and CDN may range from sub-msecs to a
few tens of msecs. The measured typical RTT may possibly work for 50% or
80% (or whatever) of the traffic but be badly incorrect for the rest of the
traffic in the environment. And if it is a wireless environment, the RTTs
may vary heavily from time to time.

> Also, it's easy to criticize a proposed compromise between the contradictory SHOULDs for not being perfect
> (again, that's the definition of a compromise). But you say at the end that you don't have a solution to offer.
> So how are we going to move forward?

[MK] I think we need to revisit this fragmentation/reassembly & 
encapsulation/decapsulation problem and consider it carefully, but not in 
these documents. That is, we need another draft for it as was already 
envisaged if I understood it correctly.

For the ecn-encap I'd propose we drop the SHOULDs and prepare some text
that gives a "warning" that the implementer should look for additional
advice for the case where the size of the L2 frames and IP packets do not
match (and maybe also when the L2 frame/IP packet boundaries overlap).

> I've continued to respond to all your points in the rest of this long email, but I'd like to try to focus on the
> contradictory SHOULD requirements first.
>

[MK] I hope I was able to make my point more clear this time.

>       In addition, when there are multiple flows sharing a tunnel, the proposal would potentially
>       concentrate marks to a smaller number of some flows, i.e., ignore/move marks for/from some flows to
>       a smaller number of flows.
>       This would be extremely undesirable in high load cases, e.g., when several flows are in slow start.
>       In such a case it is important to mark packets for several flows in order to have fast and strong
>       enough CC reaction.
>       If the AQM has successfully spread the marks over a number of flows, but this gets suppressed to a
>       smaller number of flows the overall congestion control reaction at the bottleneck is inappropriately
>       diminished and postponed.

[MK] please ignore this comment above. It was a result of a brain f**t.

> [BB] When an AQM applies marks to an aggregate, it randomly hits particular flows. Compare two flows in
> slow-start alongside each other, one with slightly larger packets that get fragmented, the other not. The
> fragments will have roughly half the size and twice the packet rate. So on average an AQM will mark twice as
> many packets in the fragmented flow. Then if the marks are close enough together for reassembly to preserve the
> proportion of marks, it will halve the absolute number of marks, thus leaving about the same number of marks in
> the two flows.
> 
> Assuming my time-based proposal could be implemented, if only one or two AQM marks hit each flow, the reassembly
> process will be more likely to preserve all the marks, because the time between them will be longer. Then the
> fragmented flow will end up with more marks than the unfragmented, which I think is what you want.
> 
>
>       After considering this problem, my conclusion is that the only working approach I can come up with
>       can be achieved by not applying any congestion indication manipulation when reassembling
> 
> 
> [BB] What do you mean? I think you mean "taking the approach in RFC3168" (which is a congestion indication
> manipulation).
> That is the approach now adopted in the shim draft. But the ecn-encap draft covers more general re-framing (and
> fragmentation/reassembly is out of its scope).
>
>       and by taking different approach to packet drops with AQMs that employ probabilistic dropping.
> 
> 
> [BB] Eh? Reassembly doesn't know what the AQM was, or even whether there was an AQM. Anyway, I've argued that an
> AQM that sets the spacing between marks in an aggregate is equivalent to probabilistic. Unless you are thinking
> of some different AQM that I'm not aware of.
>
>       The correct approach for dropping had already been taken by the RED design with byte-mode queue
>       measurement and drop (note: with the error in calculating the drop probability corrected).
>
>       I'm very well aware of the strong case that RFC 7141 makes for packet-mode drop, and it makes this
>       problem even much trickier to solve.
> 
> 
> [BB] Reading ahead, I think you're saying that it's hard to solve the problem because you haven't really
> abandoned your preference for scaling down drop/marking of smaller packets ('byte-mode'). So you think that
> ought to be a good solution, but you know it's got its own problems. If you accept that independence from packet
> size (packet-mode) is preferable to byte-mode, it's easier to make everything consistent.
> 
> (BTW, it would be really hard if reassembly had to cater for some AQMs doing byte mode and others doing packet
> mode.)

[MK] Yep, and unfortunately for solving this problem, we are facing
such a situation now, although not quite because of some AQMs doing
probabilistic byte-mode dropping/marking.

>       However, if we concentrate only on the problem of dropping small fragments and reassembling them,
>       the byte-mode drop together with reassembly logic in RFC 3168 results in the correct outcome.
>
>       Why? By giving lower drop probability to small fragments, the byte-mode drop ensures that the
>       congestion signal (mark or drop) is given (approximately) at the same level of data queued (or: of
>       queuing delay), no matter whether the packets are small or large (i.e., on a bit-congested
>       bottleneck the operating range of AQM is entered at the same time no matter what size the packets in
>       the queue are). Therefore, it is ok to use fractional cwnd decrease as Standards Track TCP does.
>
>       When the drops/marks at the bottleneck are targeted at small fragments, it does not mean that the
>       (TCP) sender operates on small packets, but it sends MSS-sized segments and is basically unaware of
>       fragmentation. Therefore, the sender also increments its cwnd using large MSS-sized units and there
>       is no small packet bias in its performance (because byte-mode drop at the bottleneck does correct
>       job). Small packets (fragments) do not go faster.
>
>       The fact that the performance problem with small packets does not originate from one reason only but
>       from two main reasons: 1) how drops are handled at the bottleneck device and 2) how cwnd is
>       incremented at the sender endpoint.
> 
> 
> [BB] All this is true, but academic. Such AQMs are not at all common, and also deprecated for the many good
> reasons in RFC7141, including to prevent amplification of small packet flooding attacks (see my response to your
> point on this later). 
> 
> The RED /design/ included packet-mode and byte-mode. But, according to this survey:
>     https://tools.ietf.org/html/rfc7141#appendix-A
> byte mode was rarely if ever implemented in production equipment (those respondents who gave reasons mostly said
> it was due to complexity).

[MK] Not quite academic, I would say. The world seems to have changed 
since the survey (which was about RED only). The conclusions of the 
survey might be correct but it leaves a big question mark also with the 
RED implementations as less than 20% replied.

I wonder whether the reported reason, complexity, was really the
complexity of doing byte-mode drop or the complexity of doing
byte mode in general (i.e., if counting the queue in bytes was the major
issue)? One extra MUL & DIV operation does not sound like much added
complexity to me with the gear available today.

> DOCSIS PIE is the only widely deployed AQM I know of that implements marking dependent on packet size.
>     https://tools.ietf.org/html/rfc8034#section-4.6
> Nonetheless, to mitigate amplification of small packet flooding attacks, it sets a floor of 85% for the reduced
> drop probability for smaller packets, and anyway DOCSIS PIE does not support ECN.
> 
>
>       The research that RFC 7141 has used as basis when justifying the choice with the drop mode seems not
>       to take this fact properly into account. Instead, it tries to solve the small packet problem
>       entirely at the AQM node, which of course is wrong kind of reverse engineering that RFC 7141 states.
> 
> 
> [BB] Surely you mean the opposite - RFC 7141 does nothing about packet size at the AQM. It "tries to solve the

[MK] No, I didn't mean the opposite nor did I mean to say that RFC 7141 
does anything about packet size at the AQM.

I meant that the research RFC 7141 cites tried to solve the small packet
problem entirely at the AQM node, which is not advisable, and I agree
with RFC 7141 on that part.

The proposal to use byte-mode drop in RED didn't try to solve the problem 
of the end systems using small packets (small MSS) and thereby 
increasing cwnd in small units but only the problem that probabilistic 
packet-mode drop together with measuring the queue in bytes at the AQM 
node creates by penalising small packets/fragments at the AQM. It was not
devised to solve the problem of end systems sending small packets and 
thereby increasing the cwnd in small units. The latter problem should be 
solved in the end systems, but the AQM treating small and big packets 
differently should be solved where it can be solved, that is, in the AQM 
itself which is the only place where we can solve it correctly for all 
traffic. Trying to solve it in the "middle" (at reassembly) is 
architecturally the most complex approach, requiring quite unnecessarily 
third parties to be involved. This increases the risk of failure as it 
needs to be implemented in several places by several implementors (in end 
systems when reassembling, in tunnel egresses, in L2 decapsulators), not 
to mention upgrading if the solution needs to be modified.

> small packet problem" solely at the end system. The AQM is explicitly /not/ reverse engineering what it thinks
> end systems might do. See section 3.3. of RFC7141 entitled "Transport Independent Network".
>     https://tools.ietf.org/html/rfc7141#section-3.3
> In particular: "
>
>    When the network does not take packet size into account, it allows
>    transport protocols to choose whether or not to take packet size into
>    account.
> "

[MK] The above statement ignores the fact that transport and application 
protocols do not know whether IP fragmentation occurs. Nor do they know 
whether there is an AQM on the path, when they operate in a non-ECT mode.

If an application or transport uses small packets when sending, it adds
an additional problem relating to the cwnd increase rate, which should of
course be handled in the end systems (MSS in the Reno equation).

> Anyway, the "Appropriate Byte Counting" approach [RFC3465] is now formally recommended in standard TCP
> congestion control [RFC5681].

[MK] Right, but ABC has nothing to do with IP fragmentation, which is
invisible to TCP. TCP has no idea whether IP packets got fragmented on
the path. ABC helps solve the problem of Delayed Acks affecting the cwnd
increase rate as well as the problem of applications writing in small
units (possibly with Nagle off).

>       However, solving only the problem 1) that an AQM node using probabilistic dropping creates by
>       dropping small packets faster at the AQM node itself is not reverse engineering.
> 
> 
> [BB] Eh? An AQM node using probabilistic dropping doesn't drop smaller packets faster.

[MK] Probabilistic packet-mode drop does. See the example scenarios
above. By faster, I mean that small fragments are dropped at a lower
AQM node queue occupancy level than larger packets (= earlier ~ faster
after the queue target/threshold of the AQM has been reached).

> Imagine two senders, both sending at the same bit-rate, but one sending packets half the size of the other (and
> therefore at twice the packet rate). If an AQM drops 2% of packets randomly, it will drop twice as many packets
> from the flow with smaller packets. But it will therefore drop bits from both at the /same/ rate.
> 
> If the one sending smaller packets chooses to increase by one smaller packet per RTT, the AIMD process will
> certainly converge with it running at half the rate of the other. But the point RFC7141 makes is that the end
> system chose to do that. By dropping any size packet with the same probability, the network treats them both the

[MK] The problem with this thinking is that we are discussing
fragmented packets (or packets that get split into smaller L2 frames
at the encapsulator) in the first place. The end point did not decide that
its packets will be fragmented. It sends packets of the same size and
increases its cwnd in units of the same size (MSS) per RTT. So, the flow
with fragmented packets does not get half the rate of the other due to
sending smaller packets and increasing cwnd in smaller units. It gets a
lower rate (not half), because it needs to react to congestion (halving
the cwnd) more frequently than the other.
On the other hand, if the end point chooses to send smaller packets it 
suffers from both problems.
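
To put rough numbers on the point above, here is a minimal sketch (not
taken from any draft or implementation) of how often a whole IP packet is
hit when each of its fragments is dropped independently. The function
name, the independence assumption, the example values (p = 0.02, two
fragments) and the "byte-mode = probability scaled by fragment size"
simplification are all illustrative assumptions.

```python
def packet_loss_probability(p_fragment: float, fragments: int) -> float:
    """Probability that at least one of `fragments` pieces is dropped,
    assuming each piece is dropped independently with p_fragment.
    Losing any one fragment loses the whole reassembled packet."""
    if not 0.0 <= p_fragment <= 1.0:
        raise ValueError("p_fragment must be within [0, 1]")
    if fragments < 1:
        raise ValueError("fragments must be at least 1")
    return 1.0 - (1.0 - p_fragment) ** fragments


p = 0.02  # assumed AQM drop probability for a full-sized packet

# Same packet, never fragmented.
unfragmented = packet_loss_probability(p, 1)
# Split in two; packet-mode drop applies p to every fragment.
packet_mode = packet_loss_probability(p, 2)
# Split in two; byte-mode drop (assumed) halves p for half-sized fragments.
byte_mode = packet_loss_probability(p / 2, 2)

print(f"unfragmented:            {unfragmented:.4f}")
print(f"fragmented, packet-mode: {packet_mode:.4f}")
print(f"fragmented, byte-mode:   {byte_mode:.4f}")
```

Under these assumptions the fragmented flow sees a congestion event on
roughly 4% of its packets with packet-mode drop, against roughly 2% for
the unfragmented flow and for the fragmented flow with byte-mode drop.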

> same. It drops the same rate of bits from them both, if they both send at the same bit rate.

[MK] This AQM type of dropping affects p (and RTT) in the Reno formula,
not MSS. And the end system does not know whether small packets are
dropped at a lower queue occupancy level than larger packets, so it is
hard to solve the problem at the end system.

> So the place to fix this disparity is in the end-system. If you tried to fix it in the network, an AQM faced
> with two flows running at the same bit rate would have to drop less bits from the flow that happened to divide
> up its bit-rate into smaller packets. This creating a perverse incentive for everyone to use smaller packets.

[MK] Well, it is hard to see any incentive for a non-malicious end system
to use smaller packets to achieve higher throughput (or to get smaller
packets to go faster) even if an AQM at the bottleneck would use
probabilistic byte-mode drop, i.e., drop smaller packets with a lower pkt
loss probability (and lower bit-loss rate) but such that the packet
loss-rate (drops per time unit) is the same as with the larger packets.
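
As a rough illustration of the arithmetic behind this, the sketch below
compares drops per second and dropped bits per second for two flows at
the same bit rate, one using full-sized packets and one using half-sized
packets. The function names, the 1500-byte reference size, the 12 Mb/s
rate and p = 0.02 are assumptions for illustration only, and "byte-mode"
is modelled simply as scaling the drop probability by packet size.

```python
def drops_per_second(bit_rate_bps: float, packet_bytes: int, p_full: float,
                     byte_mode: bool, full_bytes: int = 1500) -> float:
    """Average number of packets dropped per second for one flow."""
    packets_per_second = bit_rate_bps / (8 * packet_bytes)
    # Byte-mode (assumed model): scale p in proportion to packet size.
    p = p_full * packet_bytes / full_bytes if byte_mode else p_full
    return packets_per_second * p


def dropped_bits_per_second(bit_rate_bps: float, packet_bytes: int,
                            p_full: float, byte_mode: bool,
                            full_bytes: int = 1500) -> float:
    """Average number of bits dropped per second for one flow."""
    drops = drops_per_second(bit_rate_bps, packet_bytes, p_full,
                             byte_mode, full_bytes)
    return drops * 8 * packet_bytes


RATE = 12_000_000  # assumed flow rate, bits per second
P = 0.02           # assumed drop probability for a full-sized packet

for mode_name, byte_mode in (("packet-mode", False), ("byte-mode", True)):
    for size in (1500, 750):
        d = drops_per_second(RATE, size, P, byte_mode)
        b = dropped_bits_per_second(RATE, size, P, byte_mode)
        print(f"{mode_name:11s} {size:4d} B: {d:5.1f} drops/s, {b:9.0f} bits/s")
```

With these numbers, packet-mode gives the small-packet flow twice as many
drops per second but the same dropped bits per second, while byte-mode
gives both flows the same drops per second and the small-packet flow half
the dropped bits per second.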

The bottleneck link delivers the same amount of bytes (bits) in a time
unit regardless of the packet size.
This means that when a small packet sender and a large packet sender
increase their send rate (cwnd) at equal pace (i.e., in the same size
units per RTT), the small packet sender incurs its first drop (roughly) at
the same time as the large packet sender. If both senders follow the same
congestion control algorithm, both senders will decrease the sending rate 
(cwnd) equally (and at the same time and at the same congestion level at 
the bottleneck) and will continue increasing at the same pace for the next 
cycle. This is the behaviour for a TCP sender that unknowingly gets its 
packets fragmented on the path before the bottleneck compared to the 
case when the packets do not get fragmented. Actually, in the 
fragmentation (small packet) case the throughput is slightly lower due to 
additional header overhead.

Then if an application deliberately modifies TCP MSS to be smaller in
order to force TCP to send smaller segments and thereby to make TCP go
faster, the result is the opposite because TCP will increase its cwnd
in smaller units (i.e. slower), which in turn results in lower throughput.
And of course a smaller MSS results in more header overhead as well.
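
The MSS dependence can be sketched with the well-known simplified
steady-state approximation for Reno-style congestion control (Mathis et
al.), rate ~ (MSS / RTT) * sqrt(3/2) / sqrt(p). This is only a rough
model (it ignores timeouts, delayed ACKs and header overhead), and the
function name and example values below are assumptions for illustration.

```python
import math


def reno_throughput_bps(mss_bytes: int, rtt_s: float, p: float) -> float:
    """Simplified Reno steady-state rate in bits per second:
    (MSS / RTT) * sqrt(3/2) / sqrt(p), for loss/mark probability p."""
    if rtt_s <= 0 or not 0.0 < p <= 1.0:
        raise ValueError("rtt_s must be > 0 and p within (0, 1]")
    return (mss_bytes * 8 / rtt_s) * math.sqrt(1.5 / p)


RTT = 0.05  # assumed round-trip time, seconds
P = 0.02    # assumed congestion-event probability per packet

full = reno_throughput_bps(1460, RTT, P)
half_mss = reno_throughput_bps(730, RTT, P)       # sender halves its MSS
double_p = reno_throughput_bps(1460, RTT, 2 * P)  # same MSS, twice the events

print(f"full MSS:    {full / 1e6:.2f} Mb/s")
print(f"half MSS:    {half_mss / 1e6:.2f} Mb/s (ratio {half_mss / full:.2f})")
print(f"doubled p:   {double_p / 1e6:.2f} Mb/s (ratio {double_p / full:.2f})")
```

In this model, halving the MSS at the same p halves the rate, whereas
keeping the MSS and doubling the frequency of congestion events lowers
the rate by a factor of about 0.71, i.e. lower but not half.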

If the sender is not using TCP but implements its own congestion control 
(or implements its own TCP congestion control), the situation does not 
change unless the sender modifies congestion control to decrease cwnd 
with a smaller factor upon drop and/or increase cwnd in larger units per 
RTT (***). And if it does so, it would always get higher throughput when 
using larger packets than when using smaller packets. So, unfortunately I 
fail to see the incentive to use smaller packets in non-malicious end 
systems.

(***) such a sender is not necessarily classified as non-malicious anymore

>       Solving the problem with small packet senders, which increment cwnd in smaller units, anywhere
>       except at the endpoint is of course not the way to go.
> 
> 
> [BB] Yes. We seem to be agreeing now.

[MK] Right, we agree for the case of incrementing cwnd in different size
units with different size packets. But it is a different problem from the
case of dropping/marking the same number of packets vs. the same amount of
bits.

>       I also do not fully agree with RFC 7141 that giving a lower drop probability to smaller packets
>       would notably amplify flooding attacks.
>       AQMs are not designed to protect against flooding attacks and they cannot. There are and need to be
>       other tools for that. Having a higher drop probability for smaller packets does not prevent small
>       packet attacker from (almost) fully utilizing the bottleneck link capacity (AQMs are designed to
>       drop excess packets). Sending unresponsive floods will anyway push away almost all competing
>       responsive traffic as the responsive traffic reduces its sending rate to less than one packet per
>       RTT.
> 
> 
> [BB] That's a reasonable argument. And I agree that AQMs aren't designed to protect against flooding attacks.
> But they shouldn't be designed to help them.
> 
> Also, your argument is that dropping smaller packets with lower probability doesn't amplify flooding attacks
> against responsive traffic as much as one might think. It still amplifies them. And it /does/ strongly amplify
> attacks against unresponsive flows (which includes semi-elastic flows and responsive flows once they have been
> squeezed down to their minimum rate or minimum window).

[MK] I am not a security expert but I think it is debatable whether using
the same drop/mark probability in an AQM for small packets as for larger
packets would help much at all compared to the case where small packets
are dropped with a lower probability proportional to the pkt size or
deterministically. Remember that the AQM drops packets only in its
main operating range. If the bottleneck is not fully utilized there will
be no drops at all so no difference. But attackers are likely to flood
faster. If the bottleneck is congested but not overloaded, byte-mode drop
would drop fewer small packets but the flooding attack sender is not
responsive to loss, so it would pretty soon overload the bottleneck. And
when it does so, it is relatively easy to stop preferring small packets
on overload, just as AQMs are advised to start dropping packets on
overload instead of marking. In other words, it requires an attacker to be
skilled enough to send at a rate that causes the bottleneck AQM to
become lightly congested but not overloaded in order to benefit from the
lower drop probability for smaller packets. I doubt they are that
skilled, but if they are then they are also skilled enough to turn on
ECT(0) or ECT(1) in their small packets and thereby avoid the drops also
with AQMs that do packet-mode drop/mark and support ECN.
So, the packet-mode drop would help only with non-ECN AQMs, but we
all(?) are advocating ECN in AQMs.

/Markku

>
>       So, unfortunately I'm not able to offer a quick and simple solution to this question at hand.
> 
> 
> 
> 
>
>       thanks,
>
>       /Markku
>
>             Bob
> 
>
>                   I'm now occupied for the next few hours, so I'll come back with more
>                   detailed reasoning after the tsvwg meeting today.
>
>                   Cheers,
>
>                   /Markku
>
>                   On Mon, 8 Mar 2021, Bob Briscoe wrote:
>
>                         Markku, chairs, all,
>
>                         Having reached agreement on the text last Dec, I then went
>                         and dropped the ball and
>                         forgot all about this draft,... and ecn-encap-guidelines.
>
>                         == ECN-ENCAP-GUIDELINES ==
>
>                         I shall upload a new rev shortly with the following single
>                         diff that I noticed
>                         during the meeting last Nov, and said I would do at the next
>                         rev:
>
>                          4.6.  Reframing and Congestion Markings
>
>                             The guidance in this section is worded in terms of
>                         framing
>                             boundaries, but it applies equally whether the protocol
>                         data units
>                         -   are frames, cells, packets or fragments.
>                         +   are frames, cells or packets.
>
>                         == RFC6040UPDATE-SHIM ==
>
>                         I shall upload a new rev shortly, with the following 3 paras
>                         at the end of S.5 on
>                         ECN fragmentation/reassembly:
>
>                         Para 1 is just moved up from the end but otherwise
>                         unchanged.
>                         Para 2 is unchanged.
>                         Para 3 is the text agreed on this list last Dec subject to
>                         further checking, with
>                         one exception: I removed the citation of RFC3168 after
>                         "equivalent".
>                             Reason: The citation of RFC6040 at the end is the
>                         relevant one, 'cos it
>                         introduced the mechanism for ECT(0) and ECT(1) to be either
>                         equivalent or two
>                         severity levels. In this respect it updated RFC3168. So it
>                         would not be appropriate
>                         to cite RFC3168, which only said the two were equivalent. If
>                         we cited RFC3168 here,
>                         it could be interpreted as if we're saying two RFC give
>                         conflicting definitions.
>
>                             Section 5.3 of [RFC3168] defines the process that a
>                         tunnel egress
>                             follows to reassemble sets of outer fragments
>                             [I-D.ietf-intarea-tunnels] into packets.
>
>                             During reassembly of outer fragments
>                         [I-D.ietf-intarea-tunnels], if
>                             the ECN fields of the outer headers being reassembled
>                         into a single
>                             packet consist of a mixture of Not-ECT and other ECN
>                         codepoints, the
>                             packet MUST be discarded.
>
>                         +   If there is mix of ECT(0) and ECT(1) fragments, then the
>                         reassembled
>                         +   packet MUST be set to either ECT(0) or ECT(1).  In this
>                         case,
>                         +   reassembly SHOULD take into account that the RFC series
>                         has so far
>                         +   ensured that ECT(0) and ECT(1) can either be considered
>                         equivalent,
>                         +   or they can provide 2 levels of congestion severity,
>                         where the
>                         +   ranking of severity from highest to lowest is CE,
>                         ECT(1), ECT(0)
>                         +   [RFC6040].
>
>                         If any of this isn't acceptable, I'll have to post another
>                         rev, but I think it's
>                         what was agreed.
>
>                         Cheers
> 
> 
>
>                         Bob
> 
>
>                         On 14/12/2020 15:44, Markku Kojo wrote:
>                               Hi Bob, all,
>
>                               apologies for the delay, now catching up again.
>
>                               yes, handling mix of ECT(0)/ECT(1) like in the new
>                         proposed text below
>                               seems reasonable choice (for now).
>
>                               I'll come back shortly with the issue in the other
>                         thread. It seems
>                               less clear, actually seems quite difficult to handle
>                         correctly for all
>                               foreseen cases.
>
>                               /Markku
>
>                               On Thu, 3 Dec 2020, Bob Briscoe wrote:
>
>                                     Markku, all,
>
>                                     I am also only now catching up with the list...
>
>                                     On 17/11/2019 08:46, Markku Kojo wrote:
>                                           Hi Dave, Joe, All,
>
>                                           Catching up ...
>
>                                           I agree with the modified new text as well
>                         as
>                                     treatment of an ECT(0)/ECT(1)
>                                           mix as "any".
> 
>
>                                     [BB] Thanks. For the list, the current text that
>                         Markku is
>                                     agreeing with is here:
>                                    
>                         https://tools.ietf.org/html/draft-ietf-tsvwg-rfc6040update-shim-11#section-5
>
>                                     Regarding reassembly of a mix of ECT(0)/ECT(1).
>                         I agree
>                                     with David that the current text
>                                     should handle this case that 3168 doesn't
>                         address.
>                                     And I agree with Joe that an interim way of
>                         handling it is
>                                     needed, not just punting until
>                                     later.
>
>                                     I see that all of Jonathan, David and you Markku
>                         are happy
>                                     with reassembling a mix of
>                                     ECT(0) and ECT(1) to result in either ECT(0) or
>                         ECT(1).
>                                     (for now). I think we can go one
>                                     better than that, still without precluding a
>                         more specific
>                                     RFC later. Here's proposed
>                                     text:
>
>                                     After the following para:
>
>                                        During reassembly of outer fragments
>                                     [I-D.ietf-intarea-tunnels], if
>                                        the ECN fields of the outer headers being
>                         reassembled
>                                     into a single
>                                        packet consist of a mixture of Not-ECT and
>                         other ECN
>                                     codepoints, the
>                                        packet MUST be discarded.
>
>                                     Add:
>
>                                           If there is mix of ECT(0) and ECT(1)
>                         fragments, then
>                                     the reassembled packet
>                                           MUST be set to either ECT(0) or ECT(1). In
>                         this case,
>                                     reassembly SHOULD take
>                                           into account that the RFC series has so
>                         far ensured
>                                     that ECT(0) and ECT(1)
>                                           can either be considered equivalent
>                         [RFC3168], or
>                                     they can provide 2 levels
>                                           of congestion severity, where the ranking
>                         of severity
>                                     from highest to lowest
>                                           is CE, ECT(1), ECT(0) [RFC6040].
> 
>
>                                     Rationale: This avoids constraining future RFCs,
>                         but at
>                                     least lays out all the
>                                     interoperability requirements we already have for
>                         handling
>                                     this mixture. Then if an
>                                     implementer wants to just default to choosing
>                         one, it hints
>                                     that they should choose
>                                     ECT(1).
> 
> 
>
>                                           I also want to repeat my comment that
>
>                                            draft-ietf-tsvwg-ecn-encap-guidelines-13
>
>                                           added similar new text that alters RFC
>                         3168, and it
>                                     should be modified
>                                           accordingly.
> 
>
>                                     [BB] I'll start another thread for this, rather
>                         than make
>                                     this thread too unwieldy.
> 
> 
>
>                                     Bob
> 
> 
>
>                                           Thanks,
>
>                                           /Markku
>
>                                           PS. I missed Bob's response to my comment
>                         at the
>                                     time, but will reply it
>                                           separately at some point.
> 
>
>                                           On Wed, 9 Oct 2019, David Black wrote:
>
>                                                 At this juncture, for an
>                         ECT(0)/ECT(1) mix
>                                     across a set of
>                                                 fragments being reassembled, I would
>                         suggest
>                                     using "any" (i.e.,
>                                                 either is ok) at this juncture to
>                         avoid
>                                     constraining what we may
>                                                 do in the future; in particular,
>                         this allows
>                                     use of the value in
>                                                 the first or last fragment, both of
>                         which are
>                                     likely to be
>                                                 convenient approaches for some
>                         implementations.
>
>                                                 Thanks, --David
>
>                                                       -----Original Message-----
>                                                       From: Joe Touch
>                         <touch@strayalpha.com>
>                                                       Sent: Wednesday, October 9,
>                         2019 10:29 AM
>                                                       To: Black, David
>                                                       Cc: Jonathan Morton;
>                         tsvwg@ietf.org
>                                                       Subject: Re: [tsvwg]
>                                                      
>                         draft-ietf-tsvwg-rfc6040update-shim:
>                                     Suggested
>                                                       Fragmentation/Reassembly text
> 
>
>                                                       [EXTERNAL EMAIL]
>
>                                                       Hi, all,
>
>                                                       I disagree with the suggestion below.
>
>                                                       Pushing this “under the rug” for an indeterminate later date
>                                                       only serves to undermine the importance of this issue.
>
>                                                       At a MINIMUM, there needs to be direct guidance in place
>                                                       until a “better” solution can be developed. For now, that
>                                                       would mean one of the following:
>                                                       - use the max of the frag code point values
>                                                       - use the min of the frag code point values
>                                                       - use “any” of the frag code point values
>                                                       - pick some other way (first, the one in the initial
>                                                         fragment, i.e., offset 0), etc.
>
>                                                       One of these needs to be *included at this time*.
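
The candidate rules in the list above can be compared side by side. What
follows is a minimal Python sketch, not text from any draft: the function
name `pick_ect` and the policy labels are hypothetical, and "max"/"min" are
assumed to refer to the numeric value of the two-bit ECN field as defined
in RFC 3168 (ECT(1) = 01, ECT(0) = 10). It only covers a fragment set that
contains nothing but ECT(0) and ECT(1).

```python
from enum import IntEnum


class ECN(IntEnum):
    """Two-bit ECN field values from RFC 3168."""
    NOT_ECT = 0b00
    ECT_1 = 0b01
    ECT_0 = 0b10
    CE = 0b11


def pick_ect(frags, policy="first"):
    """Choose one codepoint for the reassembled packet (hypothetical helper).

    frags:  ECN codepoints of the fragments, ordered by fragment offset.
    policy: "max", "min", "first" or "last" (labels assumed for this sketch).
    """
    if not frags:
        raise ValueError("empty fragment set")
    if any(c not in (ECN.ECT_0, ECN.ECT_1) for c in frags):
        raise ValueError("sketch only covers ECT(0)/ECT(1) mixes")
    if policy == "max":
        return max(frags)    # numerically largest field value
    if policy == "min":
        return min(frags)    # numerically smallest field value
    if policy == "first":
        return frags[0]      # fragment at offset 0
    if policy == "last":
        return frags[-1]     # fragment with the highest offset
    raise ValueError(f"unknown policy: {policy}")


mix = [ECN.ECT_0, ECN.ECT_1, ECN.ECT_1]
for name in ("max", "min", "first", "last"):
    print(name, pick_ect(mix, name).name)
```

An "any" rule permits every one of these outcomes, which is why both
first-fragment and last-fragment implementations would conform to it.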
>
>                                                       If a clean up doc needs to be issued, it can override
>                                                       individual “scattered” recommendations later.
>
>                                                       Joe
>
>                                                             On Oct 9, 2019, at 6:33 AM, Black, David
>                                                             <David.Black@dell.com> wrote:
>
>                                                                   The one case this doesn't really cover is what happens when a fragment
>                                                                   set has a mixture of ECT(0) and ECT(1) codepoints.  This probably isn't
>                                                                   very relevant to current ECN usage, but may become relevant with SCE, in
>                                                                   which middleboxes on the tunnel path may introduce such a mixture to
>                                                                   formerly "pure" packets.  From my perspective, a likely RFC-3168
>                                                                   compliant implementation of arbitrarily choosing one fragment's ECN
>                                                                   codepoint as authoritative (where it doesn't conflict with other rules)
>                                                                   is acceptable, but this doesn't currently seem to be mandatory.
>
>                                                                   With the above language, it should be sufficient to update RFC-3168 to
>                                                                   cover this case at an appropriate time, rather than scattering further
>                                                                   requirements in many documents.
> 
>
>                                                             I would concur that using a separate draft to cover that case at
>                                                             the appropriate time would be the better course of action.
>
>                                                             Thanks, --David
>
>                                                                   -----Original Message-----
>                                                                   From: Jonathan Morton <chromatix99@gmail.com>
>                                                                   Sent: Tuesday, October 8, 2019 6:55 PM
>                                                                   To: Black, David
>                                                                   Cc: tsvwg@ietf.org
>                                                                   Subject: Re: [tsvwg] draft-ietf-tsvwg-rfc6040update-shim:
>                                                                   Suggested Fragmentation/Reassembly text
> 
>
>                                                                   [EXTERNAL EMAIL]
>
>                                                                         On 8 Oct, 2019, at 10:51 pm, Black, David
>                                                                         <David.Black@dell.com> wrote:
>
>                                                                         **NEW**: Beyond those first two paragraphs, I suggest deleting
>                                                                         the rest of Section 5 of the rfc6040update-shim draft and
>                                                                         substituting the following paragraph:
>
>                                                                           As a tunnel egress reassembles sets of outer fragments
>                                                                           [I-D.ietf-intarea-tunnels] into packets, it MUST comply with
>                                                                           the reassembly requirements in Section 5.3 of RFC 3168 in
>                                                                           order to ensure that indications of congestion are not lost.
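
For reference, the reassembly rules of RFC 3168 Section 5.3 that the
proposed paragraph points to can be sketched as below. This is an
illustrative Python sketch under stated assumptions: `reassembled_ecn` is a
hypothetical name, returning `None` stands for dropping the packet, and the
final fallback (taking the first fragment's value when fragments differ but
none carries CE) is an arbitrary choice made for this sketch, since that
case is the one the thread leaves open.

```python
# Two-bit ECN field values from RFC 3168.
NOT_ECT, ECT_1, ECT_0, CE = 0b00, 0b01, 0b10, 0b11


def reassembled_ecn(frags):
    """Return the ECN field for the reassembled packet, or None to drop it.

    frags: ECN field values of the fragments, ordered by fragment offset.
    """
    if not frags:
        raise ValueError("empty fragment set")
    distinct = set(frags)
    # All fragments agree: reassembly must not change the codepoint.
    if len(distinct) == 1:
        return frags[0]
    if CE in distinct:
        # CE must not be set if any contributing fragment is Not-ECT,
        # so the only remaining permitted action is to drop the packet.
        if NOT_ECT in distinct:
            return None
        # Otherwise the congestion indication is carried forward.
        return CE
    # Mixed codepoints without CE: not settled by Section 5.3.
    # ASSUMPTION of this sketch only: use the first fragment's value.
    return frags[0]


print(reassembled_ecn([ECT_0, CE, ECT_0]))
print(reassembled_ecn([NOT_ECT, CE]))
```

The last branch is where an ECT(0)/ECT(1) mix ends up, so any of the
candidate rules discussed in this thread would replace only that line.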
>
>                                                                         It is certainly possible to continue from that text to
>                                                                         paraphrase part or all of Section 5.3 of RFC 3168, but I think
>                                                                         the above text crisply addresses the problem, and avoids
>                                                                         possibilities of subtle divergence.  I do like the “reassembles
>                                                                         sets of outer fragments” lead-in text (which I copied from the
>                                                                         current rfc6040update-shim draft) because that text makes it
>                                                                         clear that reassembly logically precedes decapsulation at the
>                                                                         tunnel egress.
>
>                                                                         Comments?
> 
>
>                                                                   Looks good to me.
>
>                                                                   The one case this doesn't really cover is what happens when a fragment
>                                                                   set has a mixture of ECT(0) and ECT(1) codepoints.  This probably isn't
>                                                                   very relevant to current ECN usage, but may become relevant with SCE, in
>                                                                   which middleboxes on the tunnel path may introduce such a mixture to
>                                                                   formerly "pure" packets.  From my perspective, a likely RFC-3168
>                                                                   compliant implementation of arbitrarily choosing one fragment's ECN
>                                                                   codepoint as authoritative (where it doesn't conflict with other rules)
>                                                                   is acceptable, but this doesn't currently seem to be mandatory.
>
>                                                                   With the above language, it should be sufficient to update RFC-3168 to
>                                                                   cover this case at an appropriate time, rather than scattering further
>                                                                   requirements in many documents.
>
>                                                                   - Jonathan Morton
> 
> 
> 
> 
>
>                                     -- 
>                                    
>                         ________________________________________________________________
>                                     Bob Briscoe                              
>                                     http://bobbriscoe.net/
>                                                    PRIVILEGED AND CONFIDENTIAL
> 
> 
>
> 
>