Re: [PANRG] On the new text on ECN: draft-irtf-panrg-what-not-to-do-16

Gorry Fairhurst <gorry@erg.abdn.ac.uk> Fri, 12 March 2021 09:06 UTC

To: "Holland, Jake" <jholland=40akamai.com@dmarc.ietf.org>, Bob Briscoe <ietf@bobbriscoe.net>, Spencer Dawkins at IETF <spencerdawkins.ietf@gmail.com>
Cc: "panrg@irtf.org" <panrg@irtf.org>
References: <d072512f-ed66-bb8b-7338-58ed720210a2@erg.abdn.ac.uk> <45a9076d-d573-b976-8465-3e731081169a@erg.abdn.ac.uk> <455c5465-7a2a-211e-a712-1c6b412c73b4@bobbriscoe.net> <CAKKJt-fomvCaK+8LW=UmC2VcAXBD2KKhA_he6iys8OvcLoOZBA@mail.gmail.com> <16ffb0c2-2330-6770-c0ce-53abc082037c@bobbriscoe.net> <D429889D-FA1F-4629-8EF1-A8C93E98CE10@akamai.com>
From: Gorry Fairhurst <gorry@erg.abdn.ac.uk>
Message-ID: <1a14a4b0-32d4-0944-7464-293ebecf4967@erg.abdn.ac.uk>
Date: Fri, 12 Mar 2021 09:06:18 +0000
Archived-At: <https://mailarchive.ietf.org/arch/msg/panrg/fqVn-a1jtG1zvlKzO9m0ByurWOo>
Subject: Re: [PANRG] On the new text on ECN: draft-irtf-panrg-what-not-to-do-16

This is helpful ... yesterday I was reflecting on the benefits of using 
AQM/RED (a different story) ... however, I'd agree with Jake: to me 
successful deployment of features that require new configuration needs 
operators to perceive the achievable benefits for the bulk of the traffic.

Gorry

On 12/03/2021 07:24, Holland, Jake wrote:
>
> Hi panrg,
>
> I think I agree with Bob’s main point, if I’ve understood it correctly.
>
> A note worth highlighting for the innocent ECN bystanders here:
>
> Most of the benefits from FQ-codel and CAKE come from running an AQM 
> and making it into a throughput bottleneck, which (I think 
> uncontroversially) seems to have given substantial latency benefits to 
> most of the people who’ve deployed it.
>
> The ECN marking itself (as opposed to the AQM that’s doing the 
> marking, but could just as easily drop the packet instead of marking) 
> usually provides only the marginal benefit of avoiding one 
> retransmitted packet, usually on a TCP fast retransmit (plus a few 
> benefits from corner cases where you wouldn’t have gotten 3 dup acks, 
> or a lost packet otherwise would have hurt more than usual).
>
> There’s a fundamental problem that derives from treating loss as the 
> right congestion signal to use, and then defining the CE marking to 
> mean “respond the same as loss”.  In some sense the big problem 
> underneath this came from landing on Reno instead of Vegas (along with 
> policing aggressive senders) [1].
>
> (It might even be right to say that almost all our later 
> congestion-related problems derive from that early “Reno vs. Vegas” 
> outcome, the fallout from which still hasn’t really been fixed yet.  
> This for instance is a key reason that large tail-drop buffers 
> consistently cause poor latency, and I can state with pretty good 
> confidence that it dramatically complicated the deployment of 
> alternative congestion controllers.)
>
> Anyway, this point about ECN performance is well-taken IMO, and it 
> makes me think maybe there’s a good addition to the “test gently 
> during initial deployment” advice that’s something like “characterize 
> and quantify the achievable benefits for the bulk of the traffic 
> that’s supposed to get benefits before nailing down semantics for 
> scarce codepoints” [2].
>
> Best regards,
>
> Jake
>
> PS:  I’m not vouching for Bob’s side comments about characterizing the 
> reasoning behind the ISP and open-source communities’ decision 
> processes; I imagine those a little differently but just as 
> speculatively, so I won’t debate them here.  I’m just agreeing with 
> the specific point about the realizable performance gains from 3168 
> ECN as compared to what’s achievable without using up the codepoints.
>
> [1] I’ll just give one reference to probably my favorite explanation 
> of the dynamics here, for those who haven’t read up on this:
>
> “Analysis and improvement of fairness between TCP Reno and Vegas for 
> deployment of TCP Vegas to the Internet”, Hasegawa et al.
>
> https://ieeexplore.ieee.org/abstract/document/896302
>
> [2] Worth noting is that on Sally Floyd’s ECN page, referenced from 
> RFC 3168, the first question in the list of open ECN issues is about 
> quantifying the benefits of ECN for TCP:
>
> http://www.icir.org/floyd/ecn.html#issues
>
>
>       “Open issues for ECN:
>
>
>       -What are the quantitative benefits of ECN for *TCP*?”
>
> *From: *Bob Briscoe <ietf@bobbriscoe.net>
> *Date: *Thursday, March 11, 2021 at 4:41 PM
> *To: *Spencer Dawkins at IETF <spencerdawkins.ietf@gmail.com>
> *Cc: *"panrg@irtf.org" <panrg@irtf.org>
> *Subject: *Re: [PANRG] On the new text on ECN: 
> draft-irtf-panrg-what-not-to-do-16
>
> Spencer,
>
> The apocrypha in the draft needs to tell the whole story. The current 
> message is that everything was done right, except we got blind-sided 
> by a bug. No.
>
> ECN involved a /three/-part deployment, client, server /and/ 
> bottleneck AQM. The crashing home routers pushed back the client part, 
> and servers until they finally realized they could enable ECN without 
> making anyone more vulnerable to the home router crashes. But my email 
> was about why the network part never got deployed (until the more 
> recent FQ-CoDel and CAKE deployments).
>
> Even once the bug was out of the way, the deployment pain for networks 
> wasn't worth the small performance gain. In the language of the draft, 
> ECN didn't "Outperform End-to-end Protocol Mechanisms", which had had 
> plenty of time to mask losses in other ways before ECN arrived  (all 
> evidenced in my original posting).
>
> The final deployment of the network part in FQ-CoDel and CAKE is more 
> difficult to explain. I think that was down to the nature of the open 
> source communities that built them. A network operator would have 
> assessed whether ECN's performance gain was worth the cost of 
> deployment (as in the story in my original posting). But for the open 
> source community the ability to function efficiently was enough. It 
> wasn't driven by the more traditional cost-benefit analysis approach 
> of commercial operators.
>
> In contrast, if you look at most machines at the broadband access 
> bottleneck from the major vendors (Nokia, Ericsson, Siemens, etc), 
> they didn't implement AQM, let alone ECN, 'cos the operators they were 
> selling to tendered for QoS features, which they understood as 
> something the network alone provides, not something it helps 
> end-systems to provide for themselves (as in AQM & ECN).
>
>
> Bob
>
> On 11/03/2021 23:37, Spencer Dawkins at IETF wrote:
>
>     Thanks for continuing this discussion on the mailing list. If I
>     might add this ...
>
>     On Thu, Mar 11, 2021 at 11:13 AM Bob Briscoe <ietf@bobbriscoe.net
>     <mailto:ietf@bobbriscoe.net>> wrote:
>
>         Gorry, and panrg,
>
>         I'd agree with Gorry, about one chance per generation.
>
>     My understanding (and we're talking about this to understand it
>     better) is that there are two things in play with the early ECN
>     deployment experiment.
>
>     Yes, routers crashing when they saw non-zero ECN bits was in a
>     device which (at the time) wasn't likely to be updated or replaced
>     for a number of years (see "equipment generation"), but the second
>     point (and I think it was Kireeti who said this well) was that
>     once you've updated or replaced all of the problematic equipment,
>     the people who turned ECN off didn't immediately turn ECN back on
>     (I remember the phrase "a bad taste in their mouths" mentioned
>     multiple times).
>
>     So, as Kireeti and others observed, generations between updates
>     are getting shorter, but it's not clear that the ability of people
>     to forgive and forget past experiences is getting shorter in the
>     same way. I'm remembering the quote from Oscar Wilde, "Second
>     marriage is the triumph of hope over experience" - I think we're
>     talking about the same thing as "second marriage".
>
>     Best,
>
>     Spencer
>
>         I'd also say that there are two main prongs to the ECN story.
>         Not just
>         the "one chance" point.
>         The other is "Outperforming End-to-end Protocol Mechanisms"
>
>         When I was asked to write the business case for deploying ECN
>         in BT, my
>         colleagues in the tech strategy team pushed me on this point of
>         outperforming e2e mechanisms. Jamal Salim's performance
>         evaluation
>         [RFC2884] showed ECN gave no performance benefit, except for
>         short
>         flows. I had to admit that e2e FEC for short flows would be
>         more likely
>         to win out. And at the time, I'd just found Damon Wischik's
>         paper about
>         how little the extra percentage of traffic volume would be, if
>         every
>         sender just duplicated the first few packets of each flow
>         [Wischik08].
>
>         So the business case flat-lined. End systems had long since
>         worked out
>         tonnes of loss-hiding tricks. There was far too great a risk
>         that, by
>         the time the 3-part deployment had got anywhere (sender,
>         receiver and
>         network), the problem would have been solved e2e, and the
>         high-risk
>         high-cost investment would all have been wasted.
>
>         This was actually the start of my journey to realize that the
>         "ECN =
>         drop" rule was the problem, 'cos it disallowed the real
>         benefit of ECN -
>         to cut queuing delay, by providing a finer-grained signal than
>         loss. I
>         did some calculations to work out that the noise in the delay
>         signal was
>         too great to get queue delay down as low as would be
>         achievable with ECN
>         (which could use virtual queues once ISPs realized that
>         bandwidth had
>         become plentiful enough). That was actually when I started
>         working on
>         chirping (bringing in Mirja as a research fellow) and came to the
>         conclusion that we could only get a better delay signal out of
>         the noise
>         by creating more noise with the chirps. That's when I realized
>         the
>         chirps should only be used at start-up, and ECN was the only
>         way to keep
>         queueing extremely low under load. Today you can see the
>         limits of using
>         e2e delay to reduce queuing in BBR, although of course BBR is
>         unlikely
>         to be the last word in e2e delay reduction.
>
>         I know it's unlikely that every ISP went through such a rigorous
>         exercise, but still none deployed it - probably intuitively
>         reaching the
>         same conclusion. The main reason being the deployment pain
>         wasn't worth
>         the small performance gain. Which is a TL;DR summary of all
>         the above.
>
>         The issue with Linksys home routers crashing wasn't a biggie
>         in that
>         assessment of ECN - there were few enough of that model around
>         by then
>         that it could be worked round with pre-deployment validation
>         testing
>         from the OS. This supports Gorry's "one chance per generation"
>         point.
>
>         Reference
>         [Wischik08] Wischik, D. "Short Messages" Philosophical
>         Transactions of
>         the Royal Society A, 2008, 366, 1941-1953
>
>
>         Bob
>
>         On 11/03/2021 16:17, Gorry Fairhurst wrote:
>         > Resend: On 11/03/2021 15:06, Gorry Fairhurst wrote:
>         >> Hi,
>         >>
>         >> Two observations on ECN in what may have been learned?
>         >>
>         >> * I think the "one chance" is per one /Generation/ of
>         equipment for
>         >> hardware, possibly less time for software updates - but
>         hard to
>         >> eliminate deployed critical bugs completely.
>         >>
>         >> * "Measure widely before, but important also gently test
>         during
>         >> initial deployment", to avoid the pain when there are
>         issues and
>         >> therefore to provide a chance to refine the design if there
>         is a
>         >> problem, and reduce the pain of trying.
>         >>
>         >> I like the additional thought from Eric (in the RG meeting) on
>         >> mitigating user pain by using fallback techniques, so they
>         do not
>         >> share your pain when something happens to go wrong.
>         >>
>         >> ————
>         >>
>         >> Is this a typo?: /Cannot be recovered at TCP layer/
>         >> - it seems like a partial sentence, does need a /This.../
>         >>
>         >> Gorry
>         >>
>         >> _______________________________________________
>         >> Panrg mailing list
>         >> Panrg@irtf.org <mailto:Panrg@irtf.org>
>         >> https://www.irtf.org/mailman/listinfo/panrg
>         >
>         >
>
>         -- 
>         ________________________________________________________________
>         Bob Briscoe http://bobbriscoe.net/
>
>
>
>
>
>
>
> -- 
> ________________________________________________________________
> Bob Briscoe http://bobbriscoe.net/
>