Re: [tsvwg] ecn-encap-guidelines reframing section

Markku Kojo <> Tue, 22 June 2021 14:15 UTC

Date: Tue, 22 Jun 2021 17:15:23 +0300
From: Markku Kojo <>
To: Bob Briscoe <>
cc: David Black <>, Jonathan Morton <>, Joe Touch <>, "" <>, "" <>
List-Id: Transport Area Working Group <>


My apologies, this was left unanswered. I will try to focus here on the case
where the packet/frame boundaries do not align. See inline.

On Wed, 24 Mar 2021, Bob Briscoe wrote:

> David, see [BB2] inline...
> On 24/03/2021 01:00, Black, David wrote:


>> Turning to the first "SHOULD," this paragraph from one of Markku's messages
>> frames the conundrum well (at least for me):
>> 	I'm very well aware of the strong case that RFC 7141 makes for
>> 	packet-mode drop, and it makes this problem even much trickier to solve.
>> 	However, if we concentrate only on the problem of dropping small fragments
>> 	and reassembling them, the byte-mode drop together with reassembly logic
>> 	in RFC 3168 results in the correct outcome.
>> Of course, a significant aspect of the situation here is that reframing is 
>> not just about small fragments.
>> It's ironic that you (Bob) as an author of RFC 7141 are advocating 
>> byte-mode in this context - that's not intended to imply 
>> self-contradiction, unsoundness of argument, etc., but rather to serve as 
>> an indication of the complexity and subtlety of this situation. 
>> Definitively resolving this situation now appears to require digging in 
>> well beyond this high-level byte-mode vs. packet-mode discussion, a journey 
>> that I'd really like to avoid in the hope of landing these drafts in our 
>> AD's lap in the near future.
> [BB2] The first SHOULD is not advocating byte-mode marking. It is designed to 
> preserve the packet-mode marking applied by an AQM. See next response.

[MK] I cannot speak for David, but I think he may mean that the first
SHOULD intends to achieve afterwards, at reassembly, the same outcome for
packet-mode drop as byte-mode drop already achieves at the AQM.
Byte-mode drop reduces the number of marked outer packets by reducing the
drop probability in proportion to the number of bytes in the smaller
fragments/frames. Your earlier proposal with the byte counter instead
intended to reduce the number of marked inner packets leaving by
counting marked bytes in arriving fragments/frames and preserving the
proportion of marked bytes arriving and leaving. The first SHOULD in
the recent ecn-encap draft tries to achieve the same, but with coarser
granularity, by preserving the proportion of marked PDUs.

So, we agree that this adjustment is necessary (for packet-mode drop)
when smaller fragments/L2 frames are reassembled/decapsulated to larger
IP packets, but we disagree on how it is best achieved and whether it is
needed for all AQMs.
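
To make the byte-counter idea concrete, here is a minimal sketch. It is an
illustration only, not the algorithm from any draft: the function name, the
credit rule (an outgoing packet is marked once at least one packet's worth of
marked bytes has arrived) and the example sizes are all assumptions.

```python
def propagate_by_bytes(frames, packet_sizes):
    """Return one CE flag per outgoing packet.

    frames: list of (size_in_bytes, marked) tuples in arrival order.
    packet_sizes: sizes of the IP packets rebuilt from the same byte stream.
    """
    frame_iter = iter(frames)
    arrived = 0   # bytes received so far
    needed = 0    # bytes required to complete the current packet
    credit = 0    # marked bytes received but not yet signalled
    flags = []
    for size in packet_sizes:
        needed += size
        while arrived < needed:
            frame = next(frame_iter, None)
            if frame is None:
                raise ValueError("frames carry fewer bytes than the packets")
            frame_size, marked = frame
            arrived += frame_size
            if marked:
                credit += frame_size
        # Hypothetical rule: spend one packet's worth of marked bytes per mark.
        if credit >= size:
            flags.append(True)
            credit -= size
        else:
            flags.append(False)
    return flags


# Two of four 750-byte fragments are marked (50% of the bytes).
fragments = [(750, True), (750, False), (750, True), (750, False)]
print(propagate_by_bytes(fragments, [1500, 1500]))  # [False, True]
```

One of the two reassembled packets ends up marked, so the proportion of
marked bytes is preserved. Note that the first mark is held back until a
later fragment arrives, which is exactly the tension with the second SHOULD
(propagate immediately) discussed in this thread.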

Then to the other cases in reframing. Let's use the example that Bob 
created earlier:

        Fr1       |                Fr2           |             Fr3             |   
|   Pkt1      |    Pkt2     |    Pkt3     |   Pkt4      |    Pkt5     |   Pkt6    |

First, the case where an L2 frame encapsulates more than one IP packet but
frame and IP packet boundaries do not overlap (e.g., assume the left
boundaries of Fr2 and Pkt2 as well as the right boundaries of Fr2 and Pkt4
are aligned).

For Standards Track CC it is enough to mark one of the IP pkts when Fr2 is
marked. On the other hand, when all IP pkts (Pkts 2, 3, and 4) belong to
the same flow, it makes no difference if all packets are marked, because
a Standards Track sender reacts at most once per RTT.
However, if the three IP pkts happen to belong to different flows,
marking all pkts results in multiple CC responses, which may not be
desirable as it unnecessarily results in a higher level of
oscillation in the bottleneck queue (and potentially in bottleneck
underutilization), especially if the three (or whatever number) flows are
the only flows currently sharing the bottleneck.

Furthermore, if one or two of the inner IP packets were already
CE-marked, one needs to decide whether to add more marks. Again, for
Standards Track congestion control no additional marks are necessary,
because the one or two flows that were already marked will react
anyway, and by adding more marks we just increase the level of
oscillation. Note also that if only one of the flows gets its IP packet
marked and only it reacts, this allows for achieving low delay with
Standards Track congestion control if there are enough flows sharing the
bottleneck. Only one flow at a time reacts, resulting in a lower sawtooth.
So, if we configure the AQM queue with a lower "target", we can get low
delay and full utilization. Why would we want to disable this possibility
in evolving AQMs and Standards Track congestion control?
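
The difference between the two policies for the aligned case can be sketched
as follows. This is only an illustration: the class and function names are
invented here, the three flows A, B and C are hypothetical, and picking the
first inner packet is an arbitrary placeholder for the selection question.

```python
from dataclasses import dataclass


@dataclass
class InnerPacket:
    flow: str
    ce: bool = False


def mark_all(packets):
    """Propagate the frame mark to every inner packet."""
    for pkt in packets:
        pkt.ce = True


def mark_at_most_one(packets):
    """Add a mark only if no inner packet already carries CE."""
    if packets and not any(pkt.ce for pkt in packets):
        packets[0].ce = True  # arbitrary choice; selection is an open question


def responding_flows(packets):
    """Flows that will see at least one CE mark."""
    return {pkt.flow for pkt in packets if pkt.ce}


def fr2():
    # Pkt2, Pkt3 and Pkt4 inside the marked frame, one per flow.
    return [InnerPacket("A"), InnerPacket("B"), InnerPacket("C")]


everything = fr2()
mark_all(everything)
single = fr2()
mark_at_most_one(single)
print(len(responding_flows(everything)), len(responding_flows(single)))  # 3 1
```

With one marked frame, marking all inner packets triggers three congestion
responses, while the one-mark policy triggers a single response and adds
nothing when an inner packet is already CE-marked.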

Next, consider the case where IP pkt and L2 frame boundaries are not
aligned but overlap. No matter whether the L2 frame is larger, i.e., it
is capable of encapsulating more than just one IP packet, or smaller
than the encapsulated IP packets, we may end up marking more IP packets
than L2 frames if all "involved" IP packets are marked. Here, again, for
Standards Track congestion control it would be enough that only one
of the involved IP packets gets marked. So, the main question is how to
select the IP packet to mark.
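
For the overlapping case, the set of "involved" IP packets can be computed
from byte ranges. The sketch below is illustrative only; the helper names and
the frame and packet sizes are assumptions chosen to resemble the Fr1..Fr3 /
Pkt1..Pkt6 picture above.

```python
from itertools import accumulate


def byte_ranges(sizes):
    """Turn a list of sizes into half-open (start, end) byte ranges."""
    ends = list(accumulate(sizes))
    starts = [0] + ends[:-1]
    return list(zip(starts, ends))


def involved_packets(frame_sizes, packet_sizes, frame_index):
    """Indices (0-based) of packets sharing at least one byte with the frame."""
    f_start, f_end = byte_ranges(frame_sizes)[frame_index]
    return [
        i
        for i, (p_start, p_end) in enumerate(byte_ranges(packet_sizes))
        if p_start < f_end and p_end > f_start
    ]


packets = [1000] * 6          # Pkt1..Pkt6, hypothetical sizes
frames = [1500, 2300, 2200]   # Fr1..Fr3, hypothetical sizes
print(involved_packets(frames, packets, 1))  # [1, 2, 3] -> Pkt2, Pkt3, Pkt4
```

A single mark on Fr2 touches three packets here, so marking every involved
packet turns one L2 mark into three IP marks. Which one of the three to
select is the open question.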

In addition, Sec 4.2 of ecn-encap currently says:

A lower layer (or subnet) congestion notification system:

"  1.  SHOULD NOT apply explicit congestion notifications to PDUs that
        are destined for legacy layer-4 transport implementations that
        will not understand ECN, and

It's quite clear that if one IP pkt is encapsulated alone in one L2 frame,
then the above is the only reasonable thing to do. However, if an L2 frame
encapsulates more than one IP packet and there is a mix of non-ECT IP
packets and ECT IP packets (i.e., several competing flows), it's not
quite clear that the L2 should not apply ECN to such an L2 frame.

For example, assume there is just one (tiny) non-ECT IP packet and
several ECT packets encapsulated in a large L2 frame that gets hit by an
AQM at L2. If the L2 frame is considered a Not-ECN-PDU according to
item 1 above even though the egress of the subnet is capable of
propagating congestion notifications, the L2 would drop the frame and all
encapsulated IP packets instead of marking the frame, forwarding the
encapsulated ECT IP packets and dropping just the non-ECT IP packet.
Operating according to item 1 in the draft would be simple
and straightforward to implement, but IMO it would be reasonable to avoid
unnecessary drops.

Then again the question to answer would be which of the packets to 
mark/drop? But, maybe I am missing some aspects?
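
One possible egress behaviour for a marked frame with mixed contents is
sketched below. It applies, per inner packet, the rule RFC 6040 uses for a CE
outer header (forward ECT packets as CE, drop Not-ECT packets). Applying that
rule to a multi-packet L2 frame is an assumption made here, as are all the
names, and it marks every ECT packet, which is only one answer to the
mark/drop selection question.

```python
from enum import Enum


class Ecn(Enum):
    NOT_ECT = "Not-ECT"
    ECT0 = "ECT(0)"
    ECT1 = "ECT(1)"
    CE = "CE"


def decap_marked_frame(inner):
    """Split the inner packets of a congestion-marked frame.

    Returns (forwarded, dropped_count), where forwarded holds the ECN
    fields of the packets sent on after decapsulation.
    """
    forwarded = []
    dropped = 0
    for field in inner:
        if field is Ecn.NOT_ECT:
            dropped += 1              # cannot carry a mark, so signal by drop
        else:
            forwarded.append(Ecn.CE)  # ECT or already CE: forward as CE
    return forwarded, dropped


frame = [Ecn.NOT_ECT, Ecn.ECT0, Ecn.ECT0, Ecn.ECT1]
forwarded, dropped = decap_marked_frame(frame)
print(len(forwarded), dropped)  # 3 1
```

Only the single non-ECT packet is lost, instead of all four packets when the
whole frame is treated as a Not-ECN-PDU and dropped.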

>> So, in what may be an attempt to "have my cake and eat it too" I'd like to 
>> suggest rewriting the first SHOULD in terms of an observation that does not 
>> directly opine on byte-mode vs. packet-mode and does not use RFC 2119 
>> keywords, e.g.:
>> OLD
>>       Congestion indications SHOULD be propagated on the basis that an
>>       encapsulator or decapsulator SHOULD approximately preserve the
>>       proportion of PDUs with congestion indications arriving and leaving.
>> NEW
>>       For environments in which protocol and/or application response to
>>       congestion is sensitive to the number of bytes in IP packets with
>>       congestion indications rather than the number of IP packets with
>>       congestion indications, encapsulators and decapsulators ought to
>>       approximately preserve the proportion of PDUs with congestion
>>       indications arriving and leaving.  See RFC 7141 [RFC7141] for further
>>       discussion.
>> Would something like that text work?
> [BB2] I'm afraid not, because this is not about some niche environment. Both 
> the SHOULDs in the current draft are intended to apply to all known AQMs and 
> all known congestion controls including standard TCP.
> Before we discuss the requirement, can we make sure we're all on the same 
> page regarding some basic facts about preserving markings when PDU boundaries 
> change:
>                    | marked    marked
>                    | PDUs      bytes
> -------------------+------------------
> preserving prop'n  |  ==        ==
> preserving number  |  !=        ==
> For those who prefer writing, this means that, when the boundaries between 
> PDUs change, preserving the proportion of marked PDUs, the proportion of 
> marked bytes, and the number of marked bytes all mean the same thing. But 
> preserving the number of marked PDUs is not the same as any of the others. 
> And note that preserving the timing is the same as preserving the number of 
> marked PDUs.
> Does everyone agree on these factual points, at least?
> For instance, consider CoDel counting 200 PDUs between marks on the outer 
> headers, then imagine that on decap the boundaries between the PDUs are 
> changed to create half as many PDUs...

[MK] Just note that CoDel does not count PDUs between marks but the time
between marks, i.e., the dual of bytes between marks (if the bottleneck link
delivers bytes at a fixed rate). That makes the marking interval
independent of PDU size.

> Then, if each single mark is preserved as a single marked PDU, it will result 
> in only 100 PDUs between marks. This is because the total number of PDUs has 
> changed, so you cannot preserve both the number of marked PDUs and the number 
> of unmarked PDUs.

[MK] Here you ignore the crucial timing between the marks. The time (and
delivered bytes) spanned by the 200 PDUs that CoDel sees is the same as
the time spanned by the 100 larger PDUs after decap. That is, the original
sender at the end-point sees a mark after every 100 PDUs it sent regardless
of what happens at the encapsulator/decapsulator, i.e., with CoDel it does
not matter whether or not the original PDUs (IP packets) were split
in two frames by the encapsulator at L2 (and the original
PDUs (IP packets) restored by the decapsulator).
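
The arithmetic behind this can be sketched as follows. The link rate, the
spacing between marks and the function name are hypothetical, and the spacing
is held constant for simplicity (real CoDel shortens its interval while it
stays in the dropping state).

```python
def pdus_between_marks(rate_bytes_per_s, interval_ms, pdu_size):
    """PDUs of a given size delivered between two marks on a fixed-rate link."""
    bytes_between_marks = rate_bytes_per_s * interval_ms // 1000
    return bytes_between_marks // pdu_size


RATE = 1_500_000   # bytes per second (12 Mbit/s), hypothetical
INTERVAL_MS = 100  # time between marks, assumed constant

print(pdus_between_marks(RATE, INTERVAL_MS, 750))   # 200 frames at the AQM
print(pdus_between_marks(RATE, INTERVAL_MS, 1500))  # 100 packets at the sender
```

The same 100 ms and the same 150000 bytes separate consecutive marks in both
views; only the PDU count differs, because the PDU size differs.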

> With all existing congestion controls that I know of:
> * the instantaneous behaviour and responsiveness depends on the timing of 
> individual marks (the second SHOULD).

[MK] Yes, it depends on correct timing. And correct timing is not "as soon 
as possible".

> * but the flow rate of long-running flows depends on the average proportion 
> of marked packets (the first SHOULD). {Note 1}

[MK] Yes, it depends on the average proportion of marked packets as seen
by the sender at the end-point (not at the bottleneck AQM), and the
interval between the marks must be more than one RTT (which means that
the average proportion is actually the interval (in time or bytes)
in between the marks). Otherwise, you cannot apply the Reno formula you
refer to in {Note 1}, because for Standards Track congestion control on a
congested bottleneck the term 'p' in the Reno formula mentioned in Note 1
is not the proportion of marked packets, but the proportion of
congestion signals, where a congestion signal can occur only once per RTT
(please see the original paper for the Reno formula).
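
The distinction between counting marks and counting congestion signals can be
sketched like this. The mark times, the RTT and the function names are
hypothetical; the formula is the one quoted in {Note 1}, read as
sqrt(3/(2p)).

```python
import math


def congestion_events(mark_times, rtt):
    """Count marks that start a new congestion signal (at most one per RTT)."""
    events = 0
    last_event = None
    for t in sorted(mark_times):
        if last_event is None or t - last_event >= rtt:
            events += 1
            last_event = t
    return events


def reno_cwnd_avg(p):
    """Average window in packets for congestion-signal proportion p."""
    return math.sqrt(3 / (2 * p))


marks = [0.00, 0.01, 0.02, 1.00]   # arrival times in seconds, hypothetical
print(len(marks), congestion_events(marks, rtt=0.1))  # 4 2
```

Four marks collapse into two congestion signals here, because three of them
fall within the same RTT. Feeding the raw mark proportion into the formula
would therefore use a 'p' twice as large and underestimate cwnd_avg by a
factor of sqrt(2).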

Please see also a separate email reply for this.

Best regards,


> It's up for debate how we solve this dilemma, but can people at least agree 
> (or not) that this dilemma exists.
> Bob
> {Note 1} For instance, the average proportion of marked packets is 'p' in the 
> well known Reno formula,
>     cwnd_avg = sqrt(3/(2p))
>> Thanks, --David
>> -----Original Message-----
>> From: Bob Briscoe <>
>> Sent: Tuesday, March 23, 2021 7:24 PM
>> To: Black, David; Jonathan Morton
>> Cc: Markku Kojo; Joe Touch; Markku Kojo;; 
>> Subject: ecn-encap-guidelines reframing section
>> David,
>> On 22/03/2021 22:00, Black, David wrote:
>>> ---------------------------------
>>> Moving onto the ecn-encap draft (Section 4.6), the text involved concerns
>>> how to propagate layer 2 frame congestion marks to IP packets which might
>>> be fragments.  As this text is not dealing with reassembly of IP fragments, it
>>> cannot be in conflict with the reassembly text in RFC 3168, which has nothing
>>> to say about layer 2 frame congestion marks:
>>>      Congestion indications SHOULD be propagated on the basis that an
>>>      encapsulator or decapsulator SHOULD approximately preserve the
>>>      proportion of PDUs with congestion indications arriving and leaving.
>>>      The mechanism for propagating congestion indications SHOULD ensure
>>>      that any incoming congestion indication is propagated immediately,
>>>      not held awaiting the possibility of further congestion indications
>>>      to be sufficient to indicate congestion on an outgoing PDU.
>>> Bob initially suggested the following:
>>>> Possible resolution of the contradiction: the "SHOULD approximately preserve
>>>> the proportion" is a rough long term average goal while "SHOULD ensure that
>>>> incoming congestion indication is propagated immediately" is a requirement
>>>> for after there has been some period (TBD) without any marking.
>>> I'm going to go one step further and suggest removing the first "SHOULD" - the
>>> whole notion of rate-based marking of IP packets reassembled from fragments
>>> is what got us into the tarpit for the rfc6040update-shim draft, and the first
>>> "SHOULD" appears to be headed into the same tarpit, only perhaps deeper
>>> as the frames involved may contain multiple packets and/or fragments and/or
>>> portions of packets and/or portions of fragments.  That's not exactly pretty ...
>> [BB] Er...hum...
>> You seem to have forgotten that you are talking about just dumping the
>> point that I believe was missing from RFC3168. We came to a long-fought
>> agreement that we would not decide on this before publishing these
>> drafts. But now you are proposing we decide on this before publishing
>> these drafts.
>>> For replacement, my initial sense matches Jonathan's, in particular that a layer 2
>>> congestion mark ought not to result in congestion marking multiple IP packets:
>> [BB] The whole problem I identified with only thinking in terms of the
>> second SHOULD is that you end up with either inflated or deflated
>> marking, depending respectively whether frames are smaller or larger
>> than packets. That is the whole point of the need for the two
>> contradictory requirements.
>>>> I would say that one mark applied at link layer should result in one mark applied
>>>> to one IP packet.  Exactly which one doesn't really matter, as long as it has some
>>>> tangible connection to the frame that was marked.  Word it that way, and we'll
>>>> be fine.  In particular, this method should work for *both* conventional and
>>>> high-fidelity sensitive traffic.
>>> That also has the useful simplification of not asking the implementation of this draft
>>> to roughly track a long term average in some fashion.
>> [BB] No tracking of a long-term average is needed in the implementation,
>> only in the /requirement/. One example implementation would be a single
>> counter per aggregate (for the first SHOULD) and a timeout for the
>> second SHOULD. The two override each other to create a compromise that
>> addresses each requirement in the traffic scenarios where it is most
>> applicable.
>> If you want me to give example pseudocode in this email, I would love
>> to. But I thought we agreed that we are not going to solve the dilemma
>> in this text, we are just going to state the requirements. Having worked
>> on this draft for so many years, and having developed what I believe is
>> a solution, I find that highly unsatisfactory. But we agreed to it.
>> Bob
>>> Thanks, --David
>>> -----Original Message-----
>>> From: Jonathan Morton <>
>>> Sent: Sunday, March 21, 2021 2:42 PM
>>> To: Bob Briscoe
>>> Cc: Markku Kojo; Joe Touch; Markku Kojo;; 
>>> Subject: Re: [tsvwg] draft-ietf-tsvwg-rfc6040update-shim:SuggestedFragmentation/Reassemblytext
>>>> On 20 Mar, 2021, at 8:27 pm, Bob Briscoe <> wrote:
>>>> It's not enough to make ecn-encap the same as shim. The reassembly logic 
>>>> in RFC3168 is only defined when packets are reassembled from /smaller/ 
>>>> fragments. When a L2 frame is /larger/ than an IP packet, or /overlaps/ 
>>>> the boundary between IP packets, the reassembly logic in RFC3168 is
>>>> undefined - it makes no sense.
>>>> For instance, some link layers treat IP packets as a continuous byte 
>>>> stream, then break the stream into the largest possible frames, like so:
>>>> ----------------->+<---------------------------->+<------------------------------>+<----
>>>>           Fr1       |                Fr2           |             Fr3                |
>>>> +-------------+-------------+-------------+-------------+-------------+-------------+---
>>>> |   Pkt1      |    Pkt2     |    Pkt3     |   Pkt4      |    Pkt5     |   Pkt6      |
>>>> +-------------+-------------+-------------+-------------+-------------+-------------+---
>>>> Then, say Fr2 was marked. On decap should Pkt2, Pkt3 & Pkt4 be marked, or 
>>>> just Pkt3 & Pkt4?
>>> I would say that one mark applied at link layer should result in one mark 
>>> applied to one IP packet.  Exactly which one doesn't really matter, as 
>>> long as it has some tangible connection to the frame that was marked. 
>>> Word it that way, and we'll be fine.  In particular, this method should 
>>> work for *both* conventional and high-fidelity sensitive traffic.
>>>    - Jonathan Morton
> -- 
> ________________________________________________________________
> Bob Briscoe