Re: [PANRG] On the new text on ECN: draft-irtf-panrg-what-not-to-do-16
Gorry Fairhurst <gorry@erg.abdn.ac.uk> Fri, 12 March 2021 09:06 UTC
To: "Holland, Jake" <jholland=40akamai.com@dmarc.ietf.org>, Bob Briscoe <ietf@bobbriscoe.net>, Spencer Dawkins at IETF <spencerdawkins.ietf@gmail.com>
Cc: "panrg@irtf.org" <panrg@irtf.org>
References: <d072512f-ed66-bb8b-7338-58ed720210a2@erg.abdn.ac.uk> <45a9076d-d573-b976-8465-3e731081169a@erg.abdn.ac.uk> <455c5465-7a2a-211e-a712-1c6b412c73b4@bobbriscoe.net> <CAKKJt-fomvCaK+8LW=UmC2VcAXBD2KKhA_he6iys8OvcLoOZBA@mail.gmail.com> <16ffb0c2-2330-6770-c0ce-53abc082037c@bobbriscoe.net> <D429889D-FA1F-4629-8EF1-A8C93E98CE10@akamai.com>
From: Gorry Fairhurst <gorry@erg.abdn.ac.uk>
Message-ID: <1a14a4b0-32d4-0944-7464-293ebecf4967@erg.abdn.ac.uk>
Date: Fri, 12 Mar 2021 09:06:18 +0000
Archived-At: <https://mailarchive.ietf.org/arch/msg/panrg/fqVn-a1jtG1zvlKzO9m0ByurWOo>
Subject: Re: [PANRG] On the new text on ECN: draft-irtf-panrg-what-not-to-do-16
This is helpful ... yesterday I was reflecting on the benefits of using AQM/RED (a different story) ... however, I'd agree with Jake: to me, successful deployment of features that require new configuration needs operators to perceive the achievable benefits for the bulk of the traffic.

Gorry

On 12/03/2021 07:24, Holland, Jake wrote:
> Hi panrg,
>
> I think I agree with Bob’s main point, if I’ve understood it correctly.
>
> A note worth highlighting for the innocent ECN bystanders here:
>
> Most of the benefits from FQ-CoDel and CAKE come from running an AQM and making it into a throughput bottleneck, which (I think uncontroversially) seems to have given substantial latency benefits to most of the people who’ve deployed it.
>
> The ECN marking itself (as opposed to the AQM that’s doing the marking, but could just as easily drop the packet instead of marking) usually provides only the marginal benefit of avoiding one retransmitted packet, usually on a TCP fast retransmit (plus a few benefits from corner cases where you wouldn’t have gotten 3 dup acks, or a lost packet otherwise would have hurt more than usual).
>
> There’s a fundamental problem that derives from treating loss as the right congestion signal to use, and then defining the CE marking to mean “respond the same as loss”. In some sense the big problem underneath this came from landing on Reno instead of Vegas (along with policing aggressive senders) [1].
>
> (It might even be right to say that almost all our later congestion-related problems derive from that early “Reno vs. Vegas” outcome, the fallout from which still hasn’t really been fixed yet. This for instance is a key reason that large tail-drop buffers consistently cause poor latency, and I can state with pretty good confidence that it dramatically complicated the deployment of alternative congestion controllers.)
>
> Anyway, this point about ECN performance is well-taken IMO, and it makes me think maybe there’s a good addition to the “test gently during initial deployment” advice that’s something like “characterize and quantify the achievable benefits for the bulk of the traffic that’s supposed to get benefits before nailing down semantics for scarce codepoints” [2].
>
> Best regards,
>
> Jake
>
> PS: I’m not vouching for Bob’s side comments about characterizing the reasoning behind the ISP and open-source communities’ decision processes; I imagine those a little differently but just as speculatively, so I won’t debate them here. I’m just agreeing with the specific point about the realizable performance gains from 3168 ECN as compared to what’s achievable without using up the codepoints.
>
> [1] I’ll just give one reference to probably my favorite explanation of the dynamics here, for those who haven’t read up on this:
>
> “Analysis and improvement of fairness between TCP Reno and Vegas for deployment of TCP Vegas to the Internet”, Hasegawa et al.
> https://ieeexplore.ieee.org/abstract/document/896302
>
> [2] Worth noting is that on Sally Floyd’s ECN page, referenced from RFC 3168, the first question in the list of open ECN issues is about quantifying the benefits of ECN for TCP:
>
> http://www.icir.org/floyd/ecn.html#issues
>
> “Open issues for ECN:
> - What are the quantitative benefits of ECN for *TCP*?”
>
> From: Bob Briscoe <ietf@bobbriscoe.net>
> Date: Thursday, March 11, 2021 at 4:41 PM
> To: Spencer Dawkins at IETF <spencerdawkins.ietf@gmail.com>
> Cc: "panrg@irtf.org" <panrg@irtf.org>
> Subject: Re: [PANRG] On the new text on ECN: draft-irtf-panrg-what-not-to-do-16
>
> Spencer,
>
> The apocrypha in the draft needs to tell the whole story. The current message is that everything was done right, except we got blind-sided by a bug. No.
>
> ECN involved a /three/-part deployment, client, server /and/ bottleneck AQM. The crashing home routers pushed back the client part, and servers until they finally realized they could enable ECN without making anyone more vulnerable to the home router crashes. But my email was about why the network part never got deployed (until the more recent FQ-CoDel and CAKE deployments).
>
> Even once the bug was out of the way, the deployment pain for networks wasn't worth the small performance gain. In the language of the draft, ECN didn't "Outperform End-to-end Protocol Mechanisms", which had had plenty of time to mask losses in other ways before ECN arrived (all evidenced in my original posting).
>
> The final deployment of the network part in FQ-CoDel and CAKE is more difficult to explain. I think that was down to the nature of the open source communities that built them. A network operator would have assessed whether ECN's performance gain was worth the cost of deployment (as in the story in my original posting). But for the open source community the ability to function efficiently was enough. It wasn't driven by the more traditional cost-benefit analysis approach of commercial operators.
>
> In contrast, if you look at most machines at the broadband access bottleneck from the major vendors (Nokia, Ericsson, Siemens, etc), they didn't implement AQM, let alone ECN, 'cos the operators they were selling to tendered for QoS features, which they understood as something the network alone provides, not something it helps end-systems to provide for themselves (as in AQM & ECN).
>
> Bob
>
> On 11/03/2021 23:37, Spencer Dawkins at IETF wrote:
> > Thanks for continuing this discussion on the mailing list. If I might add this ...
> >
> > On Thu, Mar 11, 2021 at 11:13 AM Bob Briscoe <ietf@bobbriscoe.net> wrote:
> > > Gorry, and panrg,
> > >
> > > I'd agree with Gorry, about one chance per generation.
> >
> > My understanding (and we're talking about this to understand it better) is that there are two things in play with the early ECN deployment experiment.
> >
> > Yes, routers crashing when they saw non-zero ECN bits was in a device which (at the time) wasn't likely to be updated or replaced for a number of years (see "equipment generation"), but the second point (and I think it was Kireeti who said this well) was that once you've updated or replaced all of the problematic equipment, the people who turned ECN off didn't immediately turn ECN back on (I remember the phrase "a bad taste in their mouths" mentioned multiple times).
> >
> > So, as Kireeti and others observed, generations between updates are getting shorter, but it's not clear that the ability of people to forgive and forget past experiences is getting shorter in the same way. I'm remembering the quote from Oscar Wilde, "Second marriage is the triumph of hope over experience" - I think we're talking about the same thing as "second marriage".
> >
> > Best,
> >
> > Spencer
> >
> > > I'd also say that there are two main prongs to the ECN story. Not just the "one chance" point. The other is "Outperforming End-to-end Protocol Mechanisms".
> > >
> > > When I was asked to write the business case for deploying ECN in BT, my colleagues in the tech strategy team pushed me on this point of outperforming e2e mechanisms. Jamal Salim's performance evaluation [RFC2884] showed ECN gave no performance benefit, except for short flows. I had to admit that e2e FEC for short flows would be more likely to win out. And at the time, I'd just found Damon Wischik's paper about how little the extra percentage of traffic volume would be, if every sender just duplicated the first few packets of each flow [Wischik08].
> > >
> > > So the business case flat-lined. End systems had long since worked out tonnes of loss-hiding tricks. There was far too great a risk that, by the time the 3-part deployment had got anywhere (sender, receiver and network), the problem would have been solved e2e, and the high-risk high-cost investment would all have been wasted.
> > >
> > > This was actually the start of my journey to realize that the "ECN = drop" rule was the problem, 'cos it disallowed the real benefit of ECN - to cut queuing delay, by providing a finer-grained signal than loss. I did some calculations to work out that the noise in the delay signal was too great to get queue delay down as low as would be achievable with ECN (which could use virtual queues once ISPs realized that bandwidth had become plentiful enough). That was actually when I started working on chirping (bringing in Mirja as a research fellow) and came to the conclusion that we could only get a better delay signal out of the noise by creating more noise with the chirps. That's when I realized the chirps should only be used at start-up, and ECN was the only way to keep queueing extremely low under load. Today you can see the limits of using e2e delay to reduce queuing in BBR, although of course BBR is unlikely to be the last word in e2e delay reduction.
> > >
> > > I know it's unlikely that every ISP went through such a rigorous exercise, but still none deployed it - probably intuitively reaching the same conclusion. The main reason being the deployment pain wasn't worth the small performance gain. Which is a TL;DR summary of all the above.
> > >
> > > The issue with Linksys home routers crashing wasn't a biggie in that assessment of ECN - there were few enough of that model around by then that it could be worked round with pre-deployment validation testing from the OS. This supports Gorry's "one chance per generation" point.
> > >
> > > Reference
> > > [Wischik08] Wischik, D. "Short Messages", Philosophical Transactions of the Royal Society A, 2008, 366, 1941-1953
> > >
> > > Bob
> > >
> > > On 11/03/2021 16:17, Gorry Fairhurst wrote:
> > > > Resend: On 11/03/2021 15:06, Gorry Fairhurst wrote:
> > > > > Hi,
> > > > >
> > > > > Two observations on ECN in what may have been learned?
> > > > >
> > > > > * I think the "one chance" is per one /Generation/ of equipment for hardware, possibly less time for software updates - but hard to eliminate deployed critical bugs completely.
> > > > >
> > > > > * "Measure widely before, but important also gently test during initial deployment", to avoid the pain when there are issues and therefore to provide a chance to refine the design if there is a problem, and reduce the pain of trying.
> > > > >
> > > > > I like the additional thought from Eric (in the RG meeting) on mitigating user pain by using fallback techniques, so they do not share your pain when something happens to go wrong.
> > > > >
> > > > > ----
> > > > >
> > > > > Is this a typo?: /Cannot be recovered at TCP layer/
> > > > > - it seems like a partial sentence, does need a /This.../
> > > > >
> > > > > Gorry
> > > > >
> > > > > _______________________________________________
> > > > > Panrg mailing list
> > > > > Panrg@irtf.org
> > > > > https://www.irtf.org/mailman/listinfo/panrg
> > >
> > > --
> > > ________________________________________________________________
> > > Bob Briscoe http://bobbriscoe.net/
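
For readers less familiar with the mark-versus-drop distinction that Jake and Bob refer to in the thread, here is a minimal Python sketch of the RFC 3168 behaviour at an AQM: when the AQM decides to signal congestion, an ECN-capable packet is marked CE and forwarded, while a Not-ECT packet is dropped. The names `ECN`, `aqm_signal` and the boolean `congested` are hypothetical, chosen only for illustration; `congested` stands in for whatever decision a real AQM (RED, CoDel, etc.) would make, which is not modelled here.

```python
from enum import IntEnum


class ECN(IntEnum):
    """The two-bit ECN field in the IP header, as defined in RFC 3168."""
    NOT_ECT = 0b00  # sender is not ECN-capable
    ECT_1 = 0b01    # ECN-capable transport, codepoint ECT(1)
    ECT_0 = 0b10    # ECN-capable transport, codepoint ECT(0)
    CE = 0b11       # congestion experienced


def aqm_signal(ecn: ECN, congested: bool) -> tuple[bool, ECN]:
    """Return (forwarded, outgoing_ecn) for one packet.

    `congested` is an assumed input: True when the AQM has decided
    that this packet should carry a congestion signal.
    """
    if not congested:
        return True, ecn        # no signal needed, forward unchanged
    if ecn == ECN.NOT_ECT:
        return False, ecn       # the only available signal is a drop
    return True, ECN.CE         # mark instead of dropping


# One packet of each kind arriving while the AQM wants to signal:
for codepoint in ECN:
    forwarded, out = aqm_signal(codepoint, congested=True)
    print(codepoint.name, "->", out.name if forwarded else "dropped")
```

Under the "ECN = drop" rule discussed in the thread, the sender reacts to the CE mark exactly as it would to the loss, so the visible gain in this sketch is only that the ECT packet still arrives and does not need retransmitting.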