Re: [tsvwg] [Technical Errata Reported] RFC7141 (7237)

Bob Briscoe <ietf@bobbriscoe.net> Sat, 30 September 2023 09:22 UTC

Message-ID: <d946c450-3a9d-38c8-a2d1-a5fc54c287b4@bobbriscoe.net>
Date: Sat, 30 Sep 2023 10:22:19 +0100
To: Sebastian Moeller <moeller0@gmx.de>
Cc: Gorry Fairhurst <gorry@erg.abdn.ac.uk>, tsvwg@ietf.org, jukka.manner@aalto.fi, RFC Errata System <rfc-editor@rfc-editor.org>
References: <20221104094005.747A455F68@rfcpa.amsl.com> <4aef3037-fae5-68c9-661f-4ce89b1ce7e7@erg.abdn.ac.uk> <273A82C1-E675-4950-A7E0-E8C564B09834@gmx.de> <6672b32e-19b6-b295-1460-904481de2c83@erg.abdn.ac.uk> <1351054E-7647-40CA-B2FA-7A566DE09E24@gmx.de> <f02cfbb6-9a14-0c70-4986-358b9226033f@erg.abdn.ac.uk> <CC3F2650-2CC7-4EC9-B0BC-2200D482CDEC@gmx.de> <073c8aed-91f4-a11a-771e-9932032cedba@bobbriscoe.net> <25626F56-AA8E-4B7D-904E-F17E2B57642E@gmx.de>
From: Bob Briscoe <ietf@bobbriscoe.net>
In-Reply-To: <25626F56-AA8E-4B7D-904E-F17E2B57642E@gmx.de>
Archived-At: <https://mailarchive.ietf.org/arch/msg/tsvwg/bmVZIot8HGxFATanY4J3twOKULI>
Subject: Re: [tsvwg] [Technical Errata Reported] RFC7141 (7237)

Sebastian, pls see [BB2]

On 22/09/2023 15:41, Sebastian Moeller wrote:
> Dear Bob,
>
>> On Sep 21, 2023, at 18:28, Bob Briscoe<ietf@bobbriscoe.net>  wrote:
>>
>> Sebastian,
>>
>> I've just read your erratum, which is about §2.2 & §2.3:
>>      https://www.rfc-editor.org/errata/rfc7141
>> and I just read this thread, which is more about §2.4 (sorry I missed these at the time).
>> See [BB]
>>
>>
>>
>> On 04/11/2022 13:37, Sebastian Moeller wrote:
>>> Hi Gorry,
>>>
>>>
>>>
>>>> On Nov 4, 2022, at 14:03, Gorry Fairhurst<gorry@erg.abdn.ac.uk>
>>>>   wrote:
>>>>
>>>> On 04/11/2022 12:42, Sebastian Moeller wrote:
>>>>
>>>>> Hi Gorry,
>>>>>
>>>>>
>>>>>
>>>>>> On Nov 4, 2022, at 11:56, Gorry Fairhurst<gorry@erg.abdn.ac.uk>
>>>>>>   wrote:
>>>>>>
>>>>>> On 04/11/2022 10:43, Sebastian Moeller wrote:
>>>>>>
>>>>>>> Hi Gorry,
>>>>>>>
>>>>>>> See [SM] below.
>>>>>>>
>>>>>>> On 4 November 2022 11:20:56 CET, Gorry Fairhurst
>>>>>>> <gorry@erg.abdn.ac.uk>
>>>>>>>   wrote:
>>>>>>>
>>>>>>>> Commenting as an individual on the Errata filing:
>>>>>>>>
>>>>>>>> On 04/11/2022 09:40, RFC Errata System wrote:
>>>>>>>>
>>>>>>>>> The following errata report has been submitted for RFC7141,
>>>>>>>>> "Byte and Packet Congestion Notification".
>>>>>>>>>
>>>>>>>>> --------------------------------------
>>>>>>>>> You may review the report below and at:
>>>>>>>>>
>>>>>>>>> https://www.rfc-editor.org/errata/eid7237
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --------------------------------------
>>>>>>>>> Type: Technical
>>>>>>>>> Reported by: Sebastian Moeller
>>>>>>>>> <moeller0@gmx.de>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Section: 2
>>>>>>>>>
>>>>>>>>> Original Text
>>>>>>>>> -------------
>>>>>>>>> 2.2.  Recommendation on Encoding Congestion Notification
>>>>>>>>>
>>>>>>>>>      When encoding congestion notification (e.g., by drop, ECN, or PCN),
>>>>>>>>>      the probability that network equipment drops or marks a particular
>>>>>>>>>      packet to notify congestion SHOULD NOT depend on the size of the
>>>>>>>>>      packet in question.
>>>>>>>>> [...]
>>>>>>>>> 2.3.  Recommendation on Responding to Congestion
>>>>>>>>>
>>>>>>>>>      When a transport detects that a packet has been lost or congestion
>>>>>>>>>      marked, it SHOULD consider the strength of the congestion indication
>>>>>>>>>      as proportionate to the size in octets (bytes) of the missing or
>>>>>>>>>      marked packet.
>>>>>>>>>
>>>>>>>>>      In other words, when a packet indicates congestion (by being lost or
>>>>>>>>>      marked), it can be considered conceptually as if there is a
>>>>>>>>>      congestion indication on every octet of the packet, not just one
>>>>>>>>>      indication per packet.
>>>>>>>>>
>>>>>>>>>      To be clear, the above recommendation solely describes how a
>>>>>>>>>      transport should interpret the meaning of a congestion indication, as
>>>>>>>>>      a long term goal.  It makes no recommendation on whether a transport
>>>>>>>>>      should act differently based on this interpretation.  It merely aids
>>>>>>>>>      interoperability between transports, if they choose to make their
>>>>>>>>>      actions depend on the strength of congestion indications.
>>>>>>>>>
>>>>>>>>> Corrected Text
>>>>>>>>> --------------
>>>>>>>>> I am not sure the text is actually salvageable, as there appears to be a logical disconnect at the core of the recommendations.
>>>>>>>>>
>>>>>>>>> Notes
>>>>>>>>> -----
>>>>>>>>> The recommendations do not seem self-consistent:
>>>>>>>>> A) Section 2.2.  recommends that CE marking should be made independent of packet size, so *a CE-mark carries no information about packet size*.
>>>>>>>>>
>>>>>>>> I did not understand that it needed to.
>> [BB]
>> 1/ I've emphasized the words in your erratum that help me see where a first misunderstanding might be:
>>> A) Section 2.2. recommends that CE marking should be made independent of packet size, so a CE-mark carries no information about packet size.
>>> B) Section 2.3 then recommends to use the size of marked packets as direct indicators of congestion strength.
>> An analogy might help explain: When a tail-drop queue discards a packet, or when a flow-agnostic AQM (like PIE or RED) ECN-marks a packet, they do so independent of flow ID. Nonetheless, because there is a flow-ID in each packet, these arbitrary/random algorithms blindly pick the flows that should respond, even though these algorithms don't read the flow-ID (it can even be encrypted).
> 	[SM] Yes, in a sense a single/dual-queue AQM treats its traffic as a single aggregate, but it expects that aggregate to respond to congestion signaling. This is less important for drops, and considerably more important for congestion marks (to avoid easy abuse of the marking system). (Side note: my take on hiding flow IDs is that you are free to do so if you desire, but that decision is not guaranteed to be without side effects; after all, the IETF recommends not starving any flow/connection, and to be able to do that the network needs information about what constitutes a flow... but I digress.)

[BB2] I was just giving this as another example of a mechanism that 
selects some attribute of packets (in this case flow ID) by randomly 
selecting packets without looking at the attribute. I didn't intend or 
expect anyone to start pronouncing on the merit of the technology that I 
used as an example.

>
>> Returning to packet-size, most AQMs don't mark dependent on packet size.
> 	[SM] For a change I agree with you,

[BB2] Again, I didn't intend or expect anyone to agree or disagree with 
the mechanism in this example. The example was merely given to show that 
the text from your erratum (that I have again emphasized below) is 
incorrect:
     "Section 2.2.  recommends that CE marking should be made 
independent of packet size, so *a CE-mark carries no information about 
packet size*."

All I was proving was that:
A marked packet carries information about flow ID, packet size, DSCP, IP 
version, TTL, etc., irrespective of whether those attributes were used to 
select the packet for marking.

> packet size does not seem a meaningful determinant of the contribution of a flow to congestion, so not making a signaling decision contingent on it seems a better choice than taking it into account. To illustrate this, if we compare two flows each currently using up 80% of the bottleneck capacity (for a total of 160%, so we need to signal "slow down" as we are, or shortly will be, growing the bottleneck queue), one using a (hypothetical) MSS of 1500 and the other an MSS of 150 octets, both contribute equally to the impending (capacity) congestion and both should reduce their cwnd by an equal proportion...

[BB2] But 10 times more of the smaller packets will be flowing through 
this AQM, so its size-independent marking will select 10 times more of 
the smaller packets. That doesn't seem to matter in your toy example 
with 2 flows and heavy overload, but it does matter in a more realistic 
example, say...
* half the capacity is consumed by 100 capacity-seeking flows with 
MSS=1500 and
* the other half is consumed by another 100 capacity-seeking flows with 
MSS=150,
* and within each flow there are a large number of round trips between 
each mark.

Then, your example AQM is 10 times more likely to select packets from 
the flows with the smaller MSS. So the 100 flows with smaller MSS react 
more often and gradually consume a smaller and smaller share of the 
capacity.
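To put rough numbers on this, here is a minimal C sketch, assuming every packet of a flow is MSS-sized and the AQM marks each packet with one fixed probability p whatever its size. The struct, the function names and the per-flow rate FLOW_RATE are hypothetical, for illustration only; it compares one flow from each of the two equal-byte-rate groups above.

```c
#include <assert.h>

/* Assumed per-flow byte rate; any value works as long as both flows share it. */
#define FLOW_RATE 62500.0

/* Hypothetical description of one flow's traffic. */
typedef struct {
    double bytes_per_sec; /* the flow's byte rate */
    double pkt_size;      /* bytes per packet (all assumed MSS-sized) */
} flow_desc;

/* Expected marks per second when every packet, whatever its size,
 * is marked with the same probability p (size-independent AQM). */
static double marks_per_sec(flow_desc f, double p)
{
    assert(f.pkt_size > 0.0 && p >= 0.0 && p <= 1.0);
    return p * f.bytes_per_sec / f.pkt_size;
}

/* Expected marked bytes per second: the same marks, weighted by size. */
static double marked_bytes_per_sec(flow_desc f, double p)
{
    return marks_per_sec(f, p) * f.pkt_size;
}

/* Ratio of marks received by a small-packet flow vs a large-packet flow
 * carrying the same byte rate. */
static double mark_count_ratio(double small_pkt, double large_pkt, double p)
{
    flow_desc small_f = {FLOW_RATE, small_pkt};
    flow_desc large_f = {FLOW_RATE, large_pkt};
    return marks_per_sec(small_f, p) / marks_per_sec(large_f, p);
}

/* Ratio of marked bytes for the same two flows. */
static double marked_byte_ratio(double small_pkt, double large_pkt, double p)
{
    flow_desc small_f = {FLOW_RATE, small_pkt};
    flow_desc large_f = {FLOW_RATE, large_pkt};
    return marked_bytes_per_sec(small_f, p) / marked_bytes_per_sec(large_f, p);
}
```

With 150-byte vs 1500-byte packets, mark_count_ratio() comes out at 10 while marked_byte_ratio() comes out at 1: counting marks per packet over-signals the small-MSS flows tenfold, whereas counting marked bytes treats both groups alike.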

This is the sort of problem that the early designers of AQMs were faced 
with. Some decided to solve the problem by adding packet-size dependent 
marking to AQMs. That was the problem that motivated RFC7141 to be 
written (which I've admitted isn't described in the intro).

First RFC7141 showed that size-dependent marking had rarely, if ever, 
been enabled.

Then RFC7141 said there would be an interop mess if some AQMs tried to 
do packet-size dependent marking and others didn't, so it recommended 
only size-independent marking.

It wouldn't be so much of a problem if some CCAs did packet-size 
dependent response and others didn't, because most flows use PMTU-sized 
packets anyway, and if any used smaller packets it would be down to them 
if they still inflicted harm on *themselves* by responding as much to 
congestion as other flows did. That's why RFC7141 gives size-dependent 
response as a long-term goal, without requiring a CCA to act on it.

> I would argue that the relevant factor an AQM might base signaling severity on would be something like the most recent capacity share a flow used, or, if you must, its percentage of current contribution to the queue (best measured in service time or aggregate size). Given that current ECN/L4S offers only a single signaling bit (assuming we want to avoid drops), what an AQM can do is increase the marking frequency*.

[BB2] Applications can open multiple flows, and larger flows cause 
congestion over a longer time than smaller flows. But let's not get into 
a debate about the merits of per-flow mechanisms in the network, which 
is beyond the scope of this RFC and this erratum. This RFC dealt with 
non-FQ AQMs as they were, and as they are still.

> *) Mind you, I consider this to be a problem in its own right, especially for single/dual-queue AQMs, as unambiguous signaling of congestion magnitude has been shown in several papers to be beneficial.
>
>
>> But packets do have a size (and a size field). And larger packets do contribute more per packet to link congestion than smaller packets.
>
> 	[SM] Let me cite your own words here:
> "applying to just each frame in isolation, which would make absolutely no sense at all"

[BB2] The context you have removed this quote of my words from was about 
carrying a byte count over from one marked frame to the next frame, 
which is why I said applying it to each frame in isolation made no sense.

> This, IMHO, applies here as well: marking a packet only makes sense if the flow it belongs to reacts to that mark. And as a corollary, an AQM should ideally send marks preferentially to those flows contributing most to the congestion, but that does not depend on packet size, as I hope I showed in my example above. So yes, if we force ourselves to only compare two random packets out of a shared queue, the larger packet occupies more equivalent service time of that queue than the smaller, but that is as true as it is inconsequential.
> REQUEST to list members: If anybody can demonstrate that this is an incorrect interpretation on my side please do so.

[BB2] I hope I have shown that it is not inconsequential in realistic 
examples with large numbers of flows.

>
>> So, if we assume that the marking algorithm has not already taken packet size into account, the strength of a congestion signal for the purpose of congestion response can be taken to depend on the size of the packet it is applied to.
> 	[SM] Again, I disagree: you are now making a leap of faith from the comparison of two random packets (which I technically agree with) to deduce something about the two flows these two packets were picked from. And there I disagree that your assumption generally holds (just think of paths with different pMTU/pMSS).

[BB2] It's not a leap of faith. It's a way to design algorithms that use 
repeated operations over large numbers of packets by exploiting 
probability (like the AIMD algorithms). The outcomes from the repetitive 
algorithms in existing CCAs have been derived as response functions and 
validated. So once you understand how those repetitive algorithms lead 
to those outcomes, you can modify the repeated operations to achieve the 
effect you want.

For instance, the well-known response function for the steady-state 
window (W) of Reno, which is a good approximation as long as p is small, is:
     W = √(K/p),
where K is approx 3/2
and bit rate,
     x = s/R * W
        =  s/R * √(K/p),
where s is the SMSS, R is the RTT and p is the marking/dropping probability.

It is well-known and widely validated that Reno flows with smaller SMSS 
converge on a proportionately lower rate, as can be seen from the 
presence of s in the above formula. This is simply because the additive 
increase per RTT is 1 SMSS, so a smaller SMSS will increase more slowly.
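The same s-dependence falls out of the repeated operations themselves. Below is a minimal deterministic sketch of Reno's congestion-avoidance loop; it is a toy model, not any real stack, it assumes marks arrive exactly once per 1/p packets, it ignores slow start and timeouts, and the function names are hypothetical.

```c
#include <assert.h>

/* Toy, deterministic model of Reno congestion avoidance (hypothetical):
 * the window w, in SMSS-sized segments, grows by 1 segment per RTT and
 * halves in any RTT during which the packets sent since the last mark
 * reach 1/p, i.e. marks are assumed to arrive exactly once per 1/p
 * packets.  Slow start, timeouts and delayed ACKs are ignored. */
static double reno_avg_window(double p, int rtts)
{
    assert(p > 0.0 && p < 1.0 && rtts > 0);
    double w = 1.0;          /* current window, in segments */
    double since_mark = 0.0; /* packets sent since the last mark */
    double sum = 0.0;        /* running total of w, for the average */
    for (int i = 0; i < rtts; i++) {
        sum += w;            /* w packets go out this RTT */
        since_mark += w;
        if (since_mark >= 1.0 / p) {
            since_mark -= 1.0 / p; /* carry the remainder forward */
            w /= 2.0;              /* multiplicative decrease */
        } else {
            w += 1.0;              /* additive increase: 1 SMSS per RTT */
        }
    }
    return sum / rtts;
}

/* Bit rate x = s/R * W, in bytes per second (s in bytes, R in seconds). */
static double reno_rate(double smss, double rtt, double p)
{
    assert(smss > 0.0 && rtt > 0.0);
    return smss / rtt * reno_avg_window(p, 100000);
}
```

Nothing inside the loop depends on s, so the window in segments is identical for both flows. For p = 0.01 it lands in the neighbourhood of √(K/p) ≈ 12.2 segments, though this discrete toy does not reproduce the closed form exactly. At the same p, the flow with 150-byte segments therefore gets one tenth of the bit rate of the flow with 1500-byte segments.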

> Like if there was a flow scheduling bottleneck with an individual AQM for each "flow" then making the response contingent on packet size is arguably exactly the wrong thing.

[BB2] I don't know what led you to believe that, but it's wrong in two ways.

1/ Any packet-size dependence of a CCA will not determine flow rate 
anyway if there's an FQ scheduler at the bottleneck constraining each 
flow's rate (in FQ each CoDel AQM adjusts the time between marks 
independently for each flow so that each flow builds a similar queue in 
each equal-rate lane).

Packet-size dependent CCAs are only necessary for ensuring equal flow 
rates when flows all share a bottleneck queue, and therefore cannot have 
different marking or dropping probabilities. Over an FQ, that doesn't 
make them wrong; it just makes them unnecessary.

2/ If you're thinking that you'd like the CCA not to need the FQ-CoDel 
AQMs to converge on a different time between marks for flows with 
different SMSS, then you're already out of luck, 'cos existing 
Reno-Friendly CCAs already cause the CoDel AQMs to adjust to a longer 
time between marks for flows with smaller SMSS.

Specifically, for flows constrained to equal bit rates by the FQ 
scheduler, the ratio of the inter-mark times (T) that CoDel reaches for 
flows with small SMSS (index 1) and large SMSS (index 2) will be roughly 
as follows:
Reno:
     T1/T2 = s2/s1
Reno-SP:
     T1/T2 = s1/s2
where s is the SMSS, so s1 and s2 are respectively the sizes of the 
smaller and larger SMSS.
And Reno-SP (standing for small packets) is an invented algorithm that, 
whatever the SMSS, keeps the bit rate the same as it would have been 
if the SMSS were s2 (the larger SMSS that it considers as a reference), 
all else being unchanged.

So as not to break up the flow of this already long email, see {Note 1} 
at the end for the working that proves the above two equations.
Briefly, the factors that lead to these results are:
* During a certain time between marks applied by CoDel, there will be 
more small packets than large.
* FQ aims to equalize the bit rates of flows, but Reno aims to equalize 
their windows (in number of SMSS-sized segments, not in bits).
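As a rough numerical check of the two ratios, here is a sketch that inverts the response function above under simplifying mean-field assumptions (for illustration only, not the working in {Note 1}): the FQ scheduler pins every flow to the same bit rate x, each CoDel instance settles at whatever marking probability p makes that flow's CCA run at x, and marks are spread evenly over the x/s packets per second. Solving the response function for p is only valid while p is small. Function names are hypothetical.

```c
#include <assert.h>

/* Mean time between marks for a flow sending x bytes/s in packets of
 * s bytes, each marked with probability p: T = 1 / (p * x / s). */
static double inter_mark_time(double p, double x, double s)
{
    assert(p > 0.0 && x > 0.0 && s > 0.0);
    return 1.0 / (p * x / s);
}

/* Reno: x = s/R * sqrt(K/p), solved for p = K * s^2 / (x * R)^2. */
static double reno_T(double x, double s, double rtt, double k)
{
    double p = k * (s / (x * rtt)) * (s / (x * rtt));
    return inter_mark_time(p, x, s);
}

/* Reno-SP: bit rate as if the SMSS were s_ref, whatever the actual s:
 * x = s_ref/R * sqrt(K/p), so p = K * s_ref^2 / (x * R)^2. */
static double reno_sp_T(double x, double s, double s_ref, double rtt, double k)
{
    double p = k * (s_ref / (x * rtt)) * (s_ref / (x * rtt));
    return inter_mark_time(p, x, s);
}
```

With s1 = 150 and s2 = 1500 bytes, the Reno ratio T1/T2 comes out at 10 (= s2/s1) and the Reno-SP ratio at 0.1 (= s1/s2), whatever values of x, R and K are chosen.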

> Similarly, if a flow always sends pairs of packets, one large and one small (as seen in some games), by your logic the flow should respond differently depending on whether the small or the large packet was marked, which I consider not to be a defensible position.

[BB2] Why not?
That's the robust way to design an algorithm: a repetitive process that 
responds differently to markings on different-sized packets in order to 
achieve the desired outcome dependent on the relative prevalence of 
different sized packets (picked at random by the marking process). 
Algorithms like this are robust compared to an alternative that might 
maintain an average packet size, which would then require an arbitrary 
choice of the averaging period, which would then never be right in all 
scenarios, eg small packets might sometimes be evenly spread, or other 
times in bursts, etc etc.
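As an illustration only, here is one hypothetical way a CCA could weight its response by the size of the marked packet, in the spirit of §2.3 of RFC7141 (which does not prescribe any particular formula). Each mark triggers a multiplicative decrease scaled by marked_bytes/smss, so no running average of packet size, and hence no averaging window, is needed. The function name, the beta parameter and the cap at one full-sized mark are assumptions.

```c
#include <assert.h>

/* Hypothetical size-weighted multiplicative decrease.  A mark on a packet
 * of marked_bytes counts as marked_bytes/smss of a full-sized mark, so a
 * mark on an SMSS-sized packet gives the usual Reno cut (beta = 0.5) and a
 * mark on a smaller packet gives a proportionately smaller cut. */
static double cwnd_after_mark(double cwnd, double marked_bytes,
                              double smss, double beta)
{
    assert(cwnd > 0.0 && marked_bytes > 0.0 && smss > 0.0);
    assert(beta > 0.0 && beta < 1.0);
    double strength = marked_bytes / smss;  /* congestion "on every octet" */
    if (strength > 1.0)
        strength = 1.0;  /* assumption: never cut more than one full mark */
    return cwnd * (1.0 - beta * strength);
}
```

A mark on a 1500-byte packet halves a 100-segment window to 50, while a mark on a 150-byte packet only trims it to 95. So if a game flow alternates large and small packets, marks that happen to land on the small ones weigh correspondingly less.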

>
>> And that can be done, even if the people who originally developed the AQM algorithm thought that packet-size dependence ought to be done by the AQM (but it wasn't enabled).
> 	[SM] Again, I fully agree with you, packet size is not a robust and reliable predictor of "contribution to congestion" and hence neither AQMs nor end-points should pretend it is. We seem to agree on the former, but not on the latter.

[BB2] You are agreeing with not making AQM algorithms dependent on 
packet size. But you're not agreeing with me (nor with RFC7141) when you 
say 'packet size is not a robust and reliable predictor of "contribution 
to congestion"'. I admit that the size of a packet is only a robust and 
reliable indicator of its contribution to congestion if the congested 
resource is bit-congestible. However, that is the common case (but not 
universal - see rfc-editor.org/rfc/rfc7141.html#section-5.2 
<https://www.rfc-editor.org/rfc/rfc7141.html#section-5.2> ). So I'd only 
go as far as saying packet size is a generally useful indicator of 
"contribution to congestion".

So far, I've detected that you are hesitant to make the response of a 
CCA depend on the size of arbitrary packets picked by an AQM. But here 
you're saying something much more controversial, without saying 
*why* you think packet size is not a robust and reliable predictor of 
"contribution to congestion".

If you follow your position to its logical conclusion...
Would you *mandate* that, when a CCA (say Reno or CUBIC in Reno-friendly 
mode) has a smaller pMTU, it MUST end up with a lower bit-rate than if 
the pMTU is larger (like it does now, because it increases less, but 
decreases the same)? Surely that would be a perverse position to adopt.

RFC7141 recommends that CCAs respond 
dependent on packet size (for their own performance), but they can 
respond independent of packet-size (harming themselves) if they choose. 
I can't see why anyone would be against that, and you haven't given any 
reason, only an assertion.

>> Next step: If this does capture your misunderstanding on /this/ point, pls say; then either you or I can suggest an edit to avoid the misunderstanding.
> 	[SM] As I tried to explain above, I do not see a mis-understanding, I do however see a logically problematic extrapolation from random packets to flows/connections that I consider to require quite a lot of evidential data to back it up...

[BB2] I understand that, and I hope I have helped with my earlier 
explanations.

>
>> 2/ A second problem might be that RFC7141 doesn't actually outline the original packet-size problem that byte-mode RED was trying to solve. I detect you're not familiar with that history, and why should you be? That /is/ something that could be corrected with an erratum.
> 	[SM] That is one (not the only) argument from rfc7141 that I am happy to agree with: that marking should not depend on packet-size. However, unlike you, I clearly see that if packet size is not a measure of congestion severity, then it should also not be interpreted as such.
>
>>
>> 3/ Then this thread shows further misunderstanding regarding §2.4 on splitting and merging - see [BB] below...
>>
>>>>>>>> This RFC I think was intended to be independent of the transport.  I see the transport sender as responsible for determining the packetisation of the transport segments, and the (S)ACKs can often identify segments, hence the sender can determine the segments that have been acknowledged or times when ECN marking was seen.
>>>>>>>>
>>>>>>> [SM] This assumes that relevant segment size does not change along the path. Which generally is not true. Just think fragmentation, if the sender sends a packet that gets fragmented along the path and only a single fragment gets CE marked the sender will see this as the whole packet being marked. Or from the other side of the issue, if say a Linux router uses GRO/GSO and queues a larger meta packet and CE marks that, receiver and sender at best see a sequence of CE marked packets. So the recommendation would need to be changed to calculate the consecutive sequence of CE marked octets and take these as correlate for congestion strength. So no, the sender really has no reliable knowledge about the size of the data unit the marking node marked.
>>>>>>>
>>>>>>>
>>>>>> I suggest IETF transports treat all IP fragments as one unit of retransmission/congestion at the transport layer.
>>>>>>
>>>>> 	[SM2] But what if the re-segmentation does not happen at the receiver, but say a fragmenting and CE-marking path tries to act transparently. According to the rules both in RFC3168 and RFC7141 a re-segmented packet containing even a single CE-marked fragment is to be CE-marked (or dropped). So the AQM might have marked a 576 octet segment but all the endpoint sees is a marked ~1460 octet segment.
>>>>> 	This also illustrates how section 2.4 of RFC7141 proposes a method that does not achieve its aim of giving veridical "number of marked octets" information. It is simply impossible to do so in general (often it will work, but the endpoints cannot even know when it was correct and when not).
>>>>> Section 2.4 has more issues BTW, it tries to give recommendation how to deal with splitting and merging but fails to achieve its goals of giving a veridical account of the marked octets:
>>>>>
>>>>> Let's see what happens when applying the proposed counter method in regards to number of marked octets under the conditions this section addresses
>>>>> Here let's look at a toy problem with 20-byte headers and a total payload of 1200 octets that is split into or merged from 3 fragments/segments with 400 octets of payload each
>>>>>
>>>>> Merging multiple segments pre-marking:
>>>>> (20+400) + (20+400)-20 + (20+400)-20 -> 1220 total 1200 payload + CE
>>>>> -> AQM marks 1220 or 1200 octets
>>>>> (20+1200)+CE
>>>>> receiver sees 1200 octets with CE and ACKs these with ECE
>>>>> sender can assume 1200 octets were marked
>>>>> CORRECT
>>>>>
>>>>> Merging multiple segments post-marking:
>>>>> -> AQM marks segment 2 of 420 or 400 octets
>>>>> (20+400) + (20+400+CE) + (20+400)
>>>>> (20+400)+(20+400+CE)-20+(20+400)-20 = 1220 total 1200 payload + CE
>>>>> receiver sees 1200 octets with CE and ACKs these with ECE
>>>>> sender must assume 1200 octets were marked
>>>>> FALSE
>>>>>
>>>>> Fragmenting a segment pre-marking
>>>>> 1220 -> (20+400) + (20+400) + (20+400)
>>>>> -> AQM marks segment 2 of 420 or 400 octets
>>>>> (20+400) + (20+400+CE) + (20+400)
>>>>> Resegmentation happens before protocol sees marking
>>>>> (20+400) + (20+400+CE)-20 + (20+400)-20 -> 1220 total 1200 payload + CE
>>>>> receiver sees 1200 octets with CE and ACKs these with ECE
>>>>> sender must assume 1200 octets were marked
>>>>> FALSE
>>>>>
>>>>> Fragmenting a segment post-marking
>>>>> (20+1200)
>>>>> -> AQM marks 1220 or 1200 octets
>>>>> (20+1200)+CE
>>>>> fragmentation happens:
>>>>> (20+400+CE) + (20+400+CE) + (20+400+CE)
>>>>> Resegmentation happens before protocol sees marking
>>>>> (20+400+CE) + (20+400+CE)-20 + (20+400+CE)-20 -> 1220 total 1200 payload + CE
>>>>> receiver sees 1200 octets with CE and ACKs these with ECE
>>>>> sender must assume 1200 octets were marked
>>>>> CORRECT
>>>>>
>>>>> So in only two out of four conditions does the proposed method actually achieve its goal.
>>>>>
>> [BB] You seem to have interpreted §2.4 as applying to just each frame in isolation, which would make absolutely no sense at all.
> 	[SM] The example above indeed looks like this, but that is mainly to keep the number of conditions to look at low; it is possible that this results in too much simplification.
>
>
>
>> It applies to a stream of frames that are being split or merged. Indeed, the text already makes this clear, as follows:
>>
>>     even the smallest
>>     positive remainder in the conceptual counter should trigger the next
>>     outgoing packet to be marked (causing the counter to go negative)
>>
>>
>> I detect that you prefer to be fed every little detail, so here's some pseudocode that might help.
> 	[SM] I respectfully argue that as the author of a method, the onus is on you to describe it in sufficient detail. Whether you consider that spoon feeding or not is irrelevant.

[BB2] Did you think the text description in the draft meant something 
other than what the pseudocode describes? If so what was misleading or 
ambiguous?

>
>>      https://bobbriscoe.net/projects/netsvc_i-f/consig/encap/ecn-encap-reframing_goal2_pseudo.c
>> It's written in terms of frame decapsulation, not fragment reassembly, but you can see it works on the principle of ignoring the sizes of any headers that are not preserved during the process.
>>
>> Applying the pseudocode to your examples:
>> Ignoring the timeout block for a moment, you should be able to see that, in the two cases you've tagged 'FALSE', _diff will become negative (-800) after the reassembled packet in either of your 'FALSE' examples is marked. Then if more 400-byte-payload fragments arrived, it would take three marked ones to make _diff positive again, causing the packet reassembled from last to be marked before forwarding.
> 	[SM] Sure this is averaging out quite nicely,

[BB2] And would you say that it's not complex?

(BTW, I've fixed a bug in the wrap aspect.)

> but it absolutely requires temporal averaging, because at its core it is not correct frame by frame.

[BB2]  I'm not sure, but I think you're saying that, if one compares the 
proportions of marked frames and marked packets over a window of one 
frame, you can only get them to be closer to equal if you take moving 
averages of each measurement. {Note 2}

However, that shouldn't be confused with trying to improve the algorithm 
itself by using averaging within the algorithm. That would always make 
the algorithm worse. Addition of any form of averaging would require an 
arbitrary choice of averaging window, which would make the algorithm 
less correct for smaller measurement windows. Importantly, it would also 
make the algorithm less responsive to changes in signalling intensity.

Of course, the proportion of marked frames and marked packets certainly 
each fluctuate up and down around each other. However, that's not an 
inherent problem with the algorithm; rather it's inherent to the 
problem. I mean, inherently, marks can only be applied to whole packets 
or frames, and we have unaligned packet and frame boundaries. Put it 
another way, I don't think another algorithm can exist that is more 
correct, because this algorithm is as correct as it can be at every 
packet boundary.

>
>> This carrying over of the balance only continues as long as the time between incoming marks is less than CE_TIMEOUT. If not, you can see that the counters reset to become equal. The value of CE_TIMEOUT is just for illustration.
>>
>> The use of WRAP_GUARD is just a cheap emulation of modulo arithmetic that allows either counter to wrap to zero before the other and still work regardless.
> /* On completion of each reassembled packet */
> void update_ecn_out(packet) {
>      _diff = ce_oct_in > ce_oct_out;
>      if ( (_diff > 0) || (_diff < WRAP_GUARD) ) {
>          ce_mark(packet);    // Irrespective of whether it's already marked
>          ce_oct_out += size(packet);   // Size including packet header
>      }
> }
>
>
> Not sure WRAP_GUARD (#define WRAP_GUARD -(1<<63)) does that poor man's modulo thing you mention... given that _diff is either 1 or zero (_diff = ce_oct_in > ce_oct_out;, > is comparing left versus right), and the initial (_diff > 0) makes sure it is not zero... I would guess that somewhere you wanted to use the real difference between ce_oct_in and ce_oct_out.
>
> Now, I might well misread that code, after all I am no compiler, but that illustrates why actual code compared to pseudocode seems desirable...

[BB2] That was a pseudocode bug (typo). The '>' was meant to be a minus.
As I said earlier, pls re-download from the same URL, 'cos I had also 
noticed another problem with the wrap, even once the typo was corrected. 
The new pseudocode for comparison with wrapping is a bit opaque, I'm 
afraid. I corrected it with a different cheap modulo trick.
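For readers without the pseudocode to hand, here is a minimal self-contained C sketch of the counter logic with the '>' typo fixed, loosely following the quoted update_ecn_out() rather than reproducing the file at the URL above. Assumptions: only bytes preserved through reassembly are counted (payload here), the CE_TIMEOUT reset is omitted, and wrapping is handled with unsigned counters and a signed difference (serial-number arithmetic), which is not necessarily the trick used in the revised pseudocode. All names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical reframing state: running byte counts of CE-marked data
 * arriving (ce_oct_in) and of CE-marked data forwarded (ce_oct_out).
 * Unsigned, so either counter may wrap before the other. */
typedef struct {
    uint64_t ce_oct_in;
    uint64_t ce_oct_out;
} reframe_state;

/* For each arriving fragment: count its preserved bytes if CE-marked. */
static void fragment_in(reframe_state *st, uint64_t bytes, bool ce)
{
    if (ce)
        st->ce_oct_in += bytes;
}

/* On completion of each reassembled packet: mark it if even one marked
 * byte is still owed.  The signed view of the unsigned difference keeps
 * working across wrap (out-of-range conversion to int64_t is
 * implementation-defined before C23; two's complement is assumed). */
static bool packet_out(reframe_state *st, uint64_t bytes)
{
    assert(bytes > 0);
    int64_t diff = (int64_t)(st->ce_oct_in - st->ce_oct_out);
    if (diff > 0) {
        st->ce_oct_out += bytes;  /* may drive the balance negative */
        return true;
    }
    return false;
}

/* Demo: n packets, each reassembled from three 400-byte fragments of
 * which only the middle one was CE-marked; returns marked outputs. */
static int demo_middle_marked(int n)
{
    reframe_state st = {0, 0};
    int marked = 0;
    for (int i = 0; i < n; i++) {
        fragment_in(&st, 400, false);
        fragment_in(&st, 400, true);
        fragment_in(&st, 400, false);
        if (packet_out(&st, 1200))
            marked++;
    }
    return marked;
}

/* Demo: a fully marked packet, then an unmarked one.  hdr is the
 * per-fragment header size counted (0 = payload only).  Returns
 * whether the second, unmarked packet nonetheless leaves marked. */
static bool demo_leftover(uint64_t hdr)
{
    reframe_state st = {0, 0};
    for (int i = 0; i < 3; i++)
        fragment_in(&st, 400 + hdr, true);
    packet_out(&st, 1200 + hdr);           /* one header survives */
    for (int i = 0; i < 3; i++)
        fragment_in(&st, 400 + hdr, false);
    return packet_out(&st, 1200 + hdr);
}

/* Demo: counters straddling the 2^64 wrap still compare correctly. */
static bool demo_wrap(void)
{
    reframe_state st = {5, UINT64_MAX - 10}; /* in has wrapped, out not */
    bool first = packet_out(&st, 100);       /* balance +16: mark */
    bool second = packet_out(&st, 100);      /* balance -84: no mark */
    return first && !second;
}
```

With one marked 400-byte fragment in every 1200-byte packet, exactly one reassembled packet in three leaves marked, matching the one third of marked bytes that arrived. If the 20-byte per-fragment headers are counted on input, a 40-byte balance is left over and the next, unmarked packet is marked too, which is the leftover described further down this thread; counting only preserved bytes avoids it.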

>
>>>>> Now add the complication that the RFC fails to mention what it considers marked octets, just the payload or payload+headers.
>>>>>
>> [BB]  This RFC is not specifying a protocol, it's giving a statement of principle as a goal of future protocol design. Any protocol designer (or professional implementer) would be able to fill in details like which headers to include or ignore; or how to confine each instance of the algorithm to a stream of packets that had passed through the same AQM with the same type of ECN codepoint.
>>
>> Next steps:  Everyone else who discussed this section grokked the idea immediately.
> 	[SM] My take on this is that nobody even tried to implement this at all, and hence people politely declined to engage with that proposal at that level of scrutiny.
>
>
>> So I don't think more explanation is needed to understand §2.4. Certainly pseudocode would seem overkill, given splitting and merging is a fairly minor part of this RFC. But I'll leave that decision for whoever has to decide on errata - the AD I think?
>
>
>
>>
>>>>> This is important as the sum of payload + headers of X fragments is larger than the sum payload + header of the single packet re-constituted out of these fragments. So the de-fragmenting process arguably needs to only look at payload size, but RFC7141 section 2.4 does not make that explicit.
>>>>> If an implementation actually uses the full size instead of the payload size now the last condition also gets it wrong:
>>>>>
>>>>> Fragmenting a segment post-marking
>>>>> (20+1200)
>>>>> -> AQM marks 1220 or 1200 octets
>>>>> (20+1200)+CE
>>>>> fragmentation happens:
>>>>> (20+400+CE) + (20+400+CE) + (20+400+CE)
>>>>> Resegmentation happens before protocol sees marking
>>>>> (20+400+CE) + (20+400+CE) + (20+400+CE) -> 1220 total 1200 payload + CE
>>>>> but (20+400)*3 = 1260 marked octets
>>>>> receiver sees 1200 octets with CE and ACKs these with ECE
>>>>> sender must assume 1200 octets were marked
>>>>> CORRECT
>>>>> But now the left over 40 bytes in the marked-octet budget will result in CE feedback for the next (re-assembled) packet.
>>>>> FALSE
>>>>>
>>>>> For rfc3168 that will not matter much as ECE is sustained until CWR is received anyway, but L4S-style signaling has now acquired an erroneous CE mark.
>>>>>
>>>> Network fragmentation, be it in tunnels, extension headers or IPv4 fragments, is indeed fraught with all manner of issues. Nothing new - the IETF has long recommended the unit of loss/marking to be the same as the end-to-end PDU. PMTU is tricky, but does have benefits:-)
>>>>
>>> 	[SM3] Not a fan of fragmentation either, but I assume that fragmentation will stay a fact of life over the internet independent of my opinion. My point here is if the IETF proposes a method that aims to correctly account for CE-marked octets, that method should actually deliver on its premise. Failing in 2-3 out of 4 conditions the method is designed to handle is IMHO a sign that the proposed method is/was not in proper shape for becoming an official recommendation.
>>> This is fine in an informational RFC as documenting subjective opinion, but problematic in a standards or BCP type document if, as in this case, we feel that BCP methods (even incomplete or impossible-to-achieve ones) are binding precedents that need to be respected in later RFC drafts.
>>>
>> [BB] Instead of implying that everyone involved has been acting unprofessionally, or incompetently, it is good practice to word emails in such a way that allows for the possibility that you just haven't understood something.
> 	[SM] Yes, that is indeed good advice, thank you for that. However, it also ignores my argument that, especially in a BCP, the onus is on us to make sure our recommendations are water-tight, correct and useful.
>
>>>>>> GSO/GRO and variants would/could change the fragmentation, that is true and need to be considered.
>>>>>>
>>>>> 	[SM2] I am confused. How do GRO/GSO affect fragmentation? IMHO these two will cause larger aggregates that exist only locally (Linux will segment meta-packets in the sending process and will not send out, say, a large 64K TCP packet in fragments, but will re-segment the meta-packet into a neat sequence of complete self-sustained TCP packets). IMHO they primarily affect the unit size the AQM might CE-mark on, in a way that is not transparent to the end points. My point is that the unit size an AQM acts on is generally not precisely knowable by the end-points. At which point making the end-points pretend that congestion strength somehow correlates with the size of marked packets really stops making sense.
>>>>>
>>>>>
>>>> The segment delivered can be a different size to the unit of transmission. This is an implementation optimisation - if this is done without regard to the marking, then the results will be different and will likely not deliver what is expected - optimisations need to understand what they optimise.
>>>>
>>> 	[SM3] Yes, we seem to agree that it is impossible for endpoints to veridically measure the number of octets actually CE-marked, for a number of reasons. IMHO it follows directly from this observation that basing end-point decisions on the number of marked octets is not going to work in general, as that number is not robustly and reliably available at the end-point. That is, end-points do see a number, but that number is not guaranteed to match what happened at the marking entity, and hence this number cannot be a correlate of congestion strength. Interpreting that number nevertheless as an indicator of congestion strength therefore seems sub-optimal and not something to recommend unconditionally.
>> [BB] With a better understanding of preserving marking probability when re-framing, I hope you can now see that it would be possible for endpoints to respond based on packet size of congestion indications (even if any re-framing includes approximations).
>>
>>>>>>>>> B) Section 2.3 then recommends to use the size of marked packets as direct indicators of congestion strength.
>>>>>>>>>
>>>>>>>>> C) Section 2.3 then later clarifies that transports should interpret the size of CE-marked packets as correlate for congestion strength but are in no way required to take this interpretation into account when acting based on the congestion signal.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> This has several problems:
>>>>>>>>> 1) A) and B) are in direct contradiction to each other. If we ask marking nodes to ignore packet size while marking, but end nodes to take it into account we basically create random congestion strength "information" by the pure chance of a specific packet of a specific size "catching" a CE mark. At which point we might as well simply draw a random number at the end-point to interpret congestion strength (except that packet sizes are not distributed randomly).
>>>>>>>>>
>>>>>>>>> 2) Asking endpoints to interpret CE_marks in this way but not act on it, is hardly actionable advice for potential implementers. If we can not recommend a specific way, we should refrain from offering recommendations at all to keep things as simple as reasonably possible.
>>>>>>>>>
>>>>>>>> This doesn't appear to be textual errata, it seems more like the request is for more clarification or motivating an alternative?
>> [BB] I hope I have explained for Sebastian how A) & B) being different is a way of exploiting random selection, not a contradiction.
>> Re #2, the draft doesn't say "don't act on it"; it says it's not mandatory, which is very different (and necessary when protocols already exist that do not comply).
> 	[SM] I thank you for your attempt at explaining these contradictory recommendations. I am sad to report that you failed to do so in a satisfactory fashion. End points should respond to the best possible estimate of congestion along the path, and if that information is independent of packet size, then interpreting packet size when deciding on a response is simply not acting on the best possible estimate...

[BB2] I've now made a further attempt to explain why this is incorrect 
logic. Pls reconsider, because many of the tried and tested algorithms 
used in traffic control (CCAs, AQMs, policers, etc) use similar methods 
to those you seem to think are "incorrect".

> IMHO this looks like an attempt at increasing "fairness" at the bottleneck for flows of different packet sizes, but not a robust and reliable attempt at that. If we desire better bottleneck sharing behaviour, we already know how to accomplish that: make the bottleneck better managed and make better marking decisions (e.g. by taking a flow's contribution to congestion into account when deciding what and when to mark). Trying to solve this specific problem from the end-points via heuristics like scaling the response to the size of received marked packets is IMHO not a fruitful way forward.

[BB2] It is not an attempt at increasing rate 'fairness' per se, but it 
is an attempt to set down the 'rules of the road' that will ensure 
interoperability if anyone chooses to design something to increase 
'fairness' wrt different packet sizes.

As such, it is not appropriate to say (paraphrasing) "Well I think the 
whole class of algorithms that do not rely on per-flow queuing ought to 
be deprecated, and everyone ought to use FQ instead." This RFC covers 
the whole set of traffic control algorithms: CCAs, AQMs, etc. without 
assuming FQ but allowing for it. You have raised an erratum on this RFC. 
I have explained why the RFC says what it says, which has taken a huge 
amount of my own time.

You seem to have stopped disagreeing with the network aspects of the 
draft (AQM and splitting/merging). Now that we are left with the CCA 
response part, you seem to have decided that you would rather duck the 
question of whether any statement is actually wrong, and resorted 
instead to saying you wish the whole system didn't exist in the form 
that it does, and it all ought to be based on FQ.

Perhaps it would help you to be more constructive if you imagined an 
Internet where there was widespread per-flow rate policing (or FQ). Then 
consider how CCAs ought to deal with smaller packets without triggering 
any punishment from the policers.

_______________________________
[BB2] Notes

{Note 1} As promised, there follows the working for the FQ-CoDel example 
given earlier

Different long-running ECN flows use either all small (index 1) or all 
large (index 2) SMSS packets
Some flows of each size use Reno and some use Reno-SP, where Reno-SP is 
an invented algorithm defined earlier.
It doesn't matter how many flows of each - all that's important is that 
FQ keeps them all to the same bit rate

Terminology:
x: flow bit rate
s: SMSS of all packets within a flow
p: flow marking probability
R: RTT of flow (assumed all the same, so as not to distract from the 
focus on size-dependence)
T: inter-mark time converged on by CoDel AQM in steady-state.
K: the Reno constant, roughly 3/2
I've used Unicode encoding, so apologies if you're still living in an 
ASCII world.

_A) Reno_
     x1 = s1/R * √(K/p1)
     x2 = s2/R * √(K/p2)
     x1 = x2
=> s1/√p1 = s2/√p2
=> p1/p2 = (s1/s2)²                    (1)

A marking probability (assumed low) can be converted to a time interval 
between marks as follows:
     T1 = s1/(p1*x1)
     T2 = s2/(p2*x2)
=> T1/T2 = s1/s2 * p2/p1        (2)
Substituting from eqn (1)
     T1/T2 = s1/s2 * (s2/s1)²
                = s2/s1
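A quick numerical check of the Reno working, in Python. The packet sizes, RTT and marking probability below are arbitrary illustrative values, not taken from the example; any positive values should give the same ratios.

```python
import math

K = 1.5                       # Reno constant, roughly 3/2
R = 0.05                      # common RTT in seconds (illustrative)
s1, s2 = 500 * 8, 1500 * 8    # SMSS in bits of small and large packets (illustrative)
p2 = 0.01                     # marking probability of the large-packet flow (illustrative)


def reno_rate(s: float, p: float) -> float:
    """Steady-state Reno rate x = s/R * sqrt(K/p), in bit/s."""
    return s / R * math.sqrt(K / p)


p1 = p2 * (s1 / s2) ** 2                  # eqn (1): p1/p2 = (s1/s2)^2
x1, x2 = reno_rate(s1, p1), reno_rate(s2, p2)
T1, T2 = s1 / (p1 * x1), s2 / (p2 * x2)   # inter-mark times, T = s/(p*x)

print(f"x1/x2 = {x1 / x2:.3f}")           # equal rates, as FQ enforces
print(f"T1/T2 = {T1 / T2:.3f}, s2/s1 = {s2 / s1:.3f}")
```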

_B) Reno-SP_
     x1 = s2/R * √(K/p1)
     x2 = s2/R * √(K/p2)
     x1 = x2
=> p1 = p2                                (3)
Substituting into eqn (2), which still applies here
     T1/T2 = s1/s2
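The same check for Reno-SP, in Python. It follows the equations exactly as written here, i.e. it assumes that Reno-SP's rate depends on s2 whichever packet size the flow itself uses; the full definition of Reno-SP was given in an earlier message and is not reproduced, so treat this only as a sketch of the algebra. Parameter values are again illustrative.

```python
import math

K = 1.5                       # Reno constant, roughly 3/2
R = 0.05                      # common RTT in seconds (illustrative)
s1, s2 = 500 * 8, 1500 * 8    # SMSS in bits of small and large packets (illustrative)
p = 0.01                      # common marking probability (illustrative)


def reno_sp_rate(p_flow: float) -> float:
    """Reno-SP rate as written above: x = s2/R * sqrt(K/p), whatever the flow's own SMSS."""
    return s2 / R * math.sqrt(K / p_flow)


p1 = p2 = p                               # eqn (3): equal rates need p1 = p2
x1, x2 = reno_sp_rate(p1), reno_sp_rate(p2)
T1, T2 = s1 / (p1 * x1), s2 / (p2 * x2)   # eqn (2) still applies

print(f"x1/x2 = {x1 / x2:.3f}")
print(f"T1/T2 = {T1 / T2:.3f}, s1/s2 = {s1 / s2:.3f}")
```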

{Note 2}: I would add that, wherever the frame window is measured up to, 
the packet window has to be measured up to a point half a packet earlier 
in order to get the most accurate comparison. That's because the 
algorithm deliberately marks packets as early as possible (when the 
balance exceeds just 1 byte) to avoid any part of the congestion signal 
at the end of a congestion episode being delayed. That's a "good thing", 
so shifting the packet measurement earlier shouldn't be considered as 
"incorrect".
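A minimal Python sketch of the kind of balance-based re-marking Note 2 refers to. The algorithm itself was given earlier in the thread and is not reproduced here, so the class and method names are hypothetical and the details are assumptions: CE-marked incoming frames credit a balance of marked octets, and an outgoing packet is marked as soon as that balance reaches 1 octet, with its full size then deducted. The usage lines replay the reassembly example from earlier in the thread with headers counted.

```python
class ReframingMarker:
    """Hypothetical CE re-marker for a node that splits or merges packets.

    Keeps a balance of CE-marked octets received but not yet passed on.
    An outgoing packet is marked as soon as the balance reaches 1 octet and
    its full size is then deducted, so the balance can go negative; later
    marked octets pay that debt back.
    """

    def __init__(self) -> None:
        self.balance = 0  # marked octets in minus marked octets out

    def on_incoming(self, size: int, ce: bool) -> None:
        """Credit the octets of a CE-marked incoming frame."""
        if ce:
            self.balance += size

    def on_outgoing(self, size: int) -> bool:
        """Return True if the next outgoing packet of `size` octets should carry CE."""
        if self.balance >= 1:
            self.balance -= size
            return True
        return False


# Reassembly as in the fragmentation example, counting headers (20+400 per fragment).
m = ReframingMarker()
for _ in range(3):
    m.on_incoming(420, ce=True)
first = m.on_outgoing(1220)       # reassembled 20+1200 packet
for _ in range(3):
    m.on_incoming(420, ce=False)  # fragments of the next, unmarked packet
second = m.on_outgoing(1220)
print(first, second, m.balance)   # True True -1180
```

Because marking is triggered on a 1-octet balance rather than a full packet's worth, marked octets in and out can differ by up to about one packet at any moment, but they do not drift apart over time.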

Regards

Bob

>
>
> Regards
> 	Sebastian
[snip]

-- 
________________________________________________________________
Bob Briscoe                               http://bobbriscoe.net/
