Re: [PANRG] On the new text on ECN: draft-irtf-panrg-what-not-to-do-16

Hi panrg,

I think I agree with Bob’s main point, if I’ve understood it correctly.

A note worth highlighting for the innocent ECN bystanders here:
Most of the benefits from FQ-codel and CAKE come from running an AQM and making it into a throughput bottleneck, which (I think uncontroversially) seems to have given substantial latency benefits to most of the people who’ve deployed it.

The ECN marking itself (as opposed to the AQM that’s doing the marking, but could just as easily drop the packet instead of marking) usually provides only the marginal benefit of avoiding one retransmitted packet, usually on a TCP fast retransmit (plus a few benefits from corner cases where you wouldn’t have gotten 3 dup acks, or a lost packet otherwise would have hurt more than usual).
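
To make the mark-versus-drop distinction concrete, here is a minimal sketch of what an AQM does once it has already decided that a given packet should carry a congestion signal. The codepoint values are the ones defined in RFC 3168; the function name signal_congestion and its return format are hypothetical, made up for illustration, and not taken from the FQ-CoDel or CAKE code.

```python
# ECN field codepoints from RFC 3168 (two bits in the IP header).
NOT_ECT = 0b00  # transport is not ECN-capable
ECT_1 = 0b01    # ECN-capable transport, ECT(1)
ECT_0 = 0b10    # ECN-capable transport, ECT(0)
CE = 0b11       # congestion experienced


def signal_congestion(ecn_field):
    """Hypothetical helper: the AQM has already picked this packet.

    Returns (action, new_ecn_field). The queue-management logic that
    selected the packet is identical in both branches; only the form
    of the congestion signal differs.
    """
    if ecn_field in (ECT_0, ECT_1, CE):
        # ECN-capable flow: forward the packet with CE set, so the
        # receiver still gets the data and nothing is retransmitted.
        return ("mark", CE)
    # Not ECN-capable: dropping the packet is the only signal available.
    return ("drop", None)


for field in (NOT_ECT, ECT_0, ECT_1, CE):
    print(format(field, "02b"), signal_congestion(field))
```

In this sketch the only thing the ECN branch saves is the packet itself, which is the marginal benefit described above; the latency benefit comes from whatever logic chose the packet in the first place.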

There’s a fundamental problem that derives from treating loss as the right congestion signal to use, and then defining the CE marking to mean “respond the same as loss”.  In some sense the big problem underneath this came from landing on Reno instead of Vegas (along with policing aggressive senders) [1].
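
As a rough illustration of the two styles of congestion response, here is a textbook-style sketch of a once-per-RTT window update for each. The function names and the alpha/beta thresholds are illustrative assumptions, not taken from any particular stack, and real implementations have many more details (slow start, fast recovery, RTT filtering).

```python
def reno_update(cwnd, congestion_signal):
    """Reno-style AIMD, applied once per RTT.

    congestion_signal is True for a packet loss OR an RFC 3168 CE mark;
    the two are deliberately treated the same way.
    """
    if congestion_signal:
        return max(cwnd / 2.0, 1.0)  # multiplicative decrease
    return cwnd + 1.0                # additive increase


def vegas_update(cwnd, base_rtt, rtt, alpha=2.0, beta=4.0):
    """Vegas-style delay-based update, applied once per RTT.

    diff estimates how many of this flow's packets are sitting in the
    bottleneck queue. alpha and beta are illustrative thresholds, in
    packets.
    """
    expected = cwnd / base_rtt  # rate if there were no queueing
    actual = cwnd / rtt         # rate actually achieved
    diff = (expected - actual) * base_rtt
    if diff < alpha:
        return cwnd + 1.0
    if diff > beta:
        return max(cwnd - 1.0, 1.0)
    return cwnd


# Same starting window, queue has doubled the RTT, no loss or mark yet:
print(reno_update(20.0, False))         # keeps growing
print(vegas_update(20.0, 0.05, 0.10))   # backs off
```

With a standing queue and no loss or mark yet, the Reno-style sender keeps growing its window while the Vegas-style sender backs off, which is the kind of dynamic that leaves a delay-based sender with less than its share when the two compete at the same bottleneck.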

(It might even be right to say that almost all our later congestion-related problems derive from that early “Reno vs. Vegas” outcome, the fallout from which still hasn’t really been fixed yet.  This for instance is a key reason that large tail-drop buffers consistently cause poor latency, and I can state with pretty good confidence that it dramatically complicated the deployment of alternative congestion controllers.)

Anyway, this point about ECN performance is well-taken IMO, and it makes me think maybe there’s a good addition to the “test gently during initial deployment” advice that’s something like “characterize and quantify the achievable benefits for the bulk of the traffic that’s supposed to get benefits before nailing down semantics for scarce codepoints” [2].

Best regards,
Jake

PS:  I’m not vouching for Bob’s side comments about characterizing the reasoning behind the ISP and open-source communities’ decision processes; I imagine those a little differently but just as speculatively, so I won’t debate them here.  I’m just agreeing with the specific point about the realizable performance gains from 3168 ECN as compared to what’s achievable without using up the codepoints.

[1] I’ll just give one reference to probably my favorite explanation of the dynamics here, for those who haven’t read up on this:
“Analysis and improvement of fairness between TCP Reno and Vegas for deployment of TCP Vegas to the Internet”, Hasegawa et al.
https://ieeexplore.ieee.org/abstract/document/896302

[2] Worth noting is that on Sally Floyd’s ECN page, referenced from RFC 3168, the first question in the list of open ECN issues is about quantifying the benefits of ECN for TCP:
http://www.icir.org/floyd/ecn.html#issues
“Open issues for ECN:
- What are the quantitative benefits of ECN for TCP?”

From: Bob Briscoe <ietf@bobbriscoe.net>
Date: Thursday, March 11, 2021 at 4:41 PM
To: Spencer Dawkins at IETF <spencerdawkins.ietf@gmail.com>
Cc: "panrg@irtf.org" <panrg@irtf.org>
Subject: Re: [PANRG] On the new text on ECN: draft-irtf-panrg-what-not-to-do-16

Spencer,

The apocrypha in the draft needs to tell the whole story. The current message is that everything was done right, except we got blind-sided by a bug. No.

ECN involved a /three/-part deployment, client, server /and/ bottleneck AQM. The crashing home routers pushed back the client part, and servers until they finally realized they could enable ECN without making anyone more vulnerable to the home router crashes. But my email was about why the network part never got deployed (until the more recent FQ-CoDel and CAKE deployments).

Even once the bug was out of the way, the deployment pain for networks wasn't worth the small performance gain. In the language of the draft, ECN didn't "Outperform End-to-end Protocol Mechanisms", which had had plenty of time to mask losses in other ways before ECN arrived (all evidenced in my original posting).

The final deployment of the network part in FQ-CoDel and CAKE is more difficult to explain. I think that was down to the nature of the open source communities that built them. A network operator would have assessed whether ECN's performance gain was worth the cost of deployment (as in the story in my original posting). But for the open source community the ability to function efficiently was enough. It wasn't driven by the more traditional cost-benefit analysis approach of commercial operators.

In contrast, if you look at most machines at the broadband access bottleneck from the major vendors (Nokia, Ericsson, Siemens, etc), they didn't implement AQM, let alone ECN, 'cos the operators they were selling to tendered for QoS features, which they understood as something the network alone provides, not something it helps end-systems to provide for themselves (as in AQM & ECN).

Bob

On 11/03/2021 23:37, Spencer Dawkins at IETF wrote:
Thanks for continuing this discussion on the mailing list. If I might add this ...

On Thu, Mar 11, 2021 at 11:13 AM Bob Briscoe <ietf@bobbriscoe.net> wrote:
Gorry, and panrg,

I'd agree with Gorry, about one chance per generation.

My understanding (and we're talking about this to understand it better) is that there are two things in play with the early ECN deployment experiment.

Yes, the bug that crashed routers when they saw non-zero ECN bits was in a device which (at the time) wasn't likely to be updated or replaced for a number of years (see "equipment generation"), but the second point (and I think it was Kireeti who said this well) was that once you've updated or replaced all of the problematic equipment, the people who turned ECN off didn't immediately turn ECN back on (I remember the phrase "a bad taste in their mouths" mentioned multiple times).

So, as Kireeti and others observed, generations between updates are getting shorter, but it's not clear that the time people need to forgive and forget past experiences is getting shorter in the same way. I'm remembering the quote usually attributed to Samuel Johnson, "Second marriage is the triumph of hope over experience" - I think we're talking about the same thing as "second marriage".

Best,

Spencer

I'd also say that there are two main prongs to the ECN story. Not just
the "one chance" point.
The other is "Outperforming End-to-end Protocol Mechanisms"

When I was asked to write the business case for deploying ECN in BT, my
colleagues in the tech strategy team pushed me on this point of
outperforming e2e mechanisms. Jamal Salim's performance evaluation
[RFC2884] showed ECN gave no performance benefit, except for short
flows. I had to admit that e2e FEC for short flows would be more likely
to win out. And at the time, I'd just found Damon Wischik's paper about
how little the extra percentage of traffic volume would be, if every
sender just duplicated the first few packets of each flow [Wischik08].
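
As a back-of-envelope illustration of why that overhead is small, the
sketch below computes the extra traffic volume when every flow sends
its first k packets twice. The flow-size mix is made up purely for
illustration and is not taken from [Wischik08]; the function name
duplication_overhead is likewise hypothetical.

```python
def duplication_overhead(flow_sizes, k):
    """Fraction of extra packets sent if each flow duplicates its
    first k packets (flows shorter than k duplicate all of theirs)."""
    extra = sum(min(k, n) for n in flow_sizes)
    return extra / sum(flow_sizes)


# Hypothetical mix: 90 short flows of 10 packets each and 10 long
# flows of 1000 packets each. Most flows are short, but most of the
# volume is in the long flows.
flows = [10] * 90 + [1000] * 10

overhead = duplication_overhead(flows, k=3)
print(f"{overhead:.2%}")  # prints 2.75%
```

With this made-up mix, each short flow pays 30% extra, but the total
volume grows by under 3% because the long flows carry most of the
packets.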

So the business case flat-lined. End systems had long since worked out
tonnes of loss-hiding tricks. There was far too great a risk that, by
the time the 3-part deployment had got anywhere (sender, receiver and
network), the problem would have been solved e2e, and the high-risk
high-cost investment would all have been wasted.

This was actually the start of my journey to realize that the "ECN =
drop" rule was the problem, 'cos it disallowed the real benefit of ECN -
to cut queuing delay, by providing a finer-grained signal than loss. I
did some calculations to work out that the noise in the delay signal was
too great to get queue delay down as low as would be achievable with ECN
(which could use virtual queues once ISPs realized that bandwidth had
become plentiful enough). That was actually when I started working on
chirping (bringing in Mirja as a research fellow) and came to the
conclusion that we could only get a better delay signal out of the noise
by creating more noise with the chirps. That's when I realized the
chirps should only be used at start-up, and ECN was the only way to keep
queueing extremely low under load. Today you can see the limits of using
e2e delay to reduce queuing in BBR, although of course BBR is unlikely
to be the last word in e2e delay reduction.

I know it's unlikely that every ISP went through such a rigorous
exercise, but still none deployed it - probably intuitively reaching the
same conclusion. The main reason being the deployment pain wasn't worth
the small performance gain. Which is a TL;DR summary of all the above.

The issue with Linksys home routers crashing wasn't a biggie in that
assessment of ECN - there were few enough of that model around by then
that it could be worked round with pre-deployment validation testing
from the OS. This supports Gorry's "one chance per generation" point.

Reference
[Wischik08] Wischik, D. "Short Messages" Philosophical Transactions of
the Royal Society A, 2008, 366, 1941-1953

Bob

On 11/03/2021 16:17, Gorry Fairhurst wrote:
> Resend: On 11/03/2021 15:06, Gorry Fairhurst wrote:
>> Hi,
>>
>> Two observations on ECN in what may have been learned?
>>
>> * I think the "one chance" is per one /Generation/ of equipment for
>> hardware, possibly less time for software updates - but hard to
>> eliminate deployed critical bugs completely.
>>
>> * "Measure widely before, but importantly also test gently during
>> initial deployment", to avoid the pain when there are issues and
>> therefore to provide a chance to refine the design if there is a
>> problem, and reduce the pain of trying.
>>
>> I like the additional thought from Eric (in the RG meeting) on
>> mitigating user pain by using fallback techniques, so they do not
>> share your pain when something happens to go wrong.
>>
>> ————
>>
>> Is this a typo?: /Cannot be recovered at TCP layer/
>> - it seems like a partial sentence; does it need a /This.../?
>>
>> Gorry
>>
>> _______________________________________________
>> Panrg mailing list
>> Panrg@irtf.org
>> https://www.irtf.org/mailman/listinfo/panrg

--
________________________________________________________________
Bob Briscoe                               http://bobbriscoe.net/
