Re: [PANRG] On the new text on ECN: draft-irtf-panrg-what-not-to-do-16

Bob Briscoe <ietf@bobbriscoe.net> Fri, 12 March 2021 00:41 UTC

Archived-At: <https://mailarchive.ietf.org/arch/msg/panrg/fSopz-997Zvex6XNCN0ekrYoTvE>

Spencer,

The apocrypha in the draft needs to tell the whole story. The current 
message is that everything was done right, except we got blind-sided by 
a bug. No.

ECN involved a /three/-part deployment: client, server /and/ bottleneck 
AQM. The crashing home routers pushed back the client part, and the 
server part too, until server operators finally realized they could 
enable ECN without making anyone more vulnerable to the home router 
crashes. But my email was about why 
the network part never got deployed (until the more recent FQ-CoDel and 
CAKE deployments).

Even once the bug was out of the way, the deployment pain for networks 
wasn't worth the small performance gain. In the language of the draft, 
ECN didn't "Outperform End-to-end Protocol Mechanisms", which had had 
plenty of time to mask losses in other ways before ECN arrived (all 
evidenced in my original posting).

The final deployment of the network part in FQ-CoDel and CAKE is more 
difficult to explain. I think that was down to the nature of the open 
source communities that built them. A network operator would have 
assessed whether ECN's performance gain was worth the cost of deployment 
(as in the story in my original posting). But for the open source 
community the ability to function efficiently was enough. It wasn't 
driven by the more traditional cost-benefit analysis approach of 
commercial operators.

In contrast, if you look at most machines at the broadband access 
bottleneck from the major vendors (Nokia, Ericsson, Siemens, etc.), they 
didn't implement AQM, let alone ECN, 'cos the operators they were 
selling to tendered for QoS features, which they understood as something 
the network alone provides, not something it helps end-systems to 
provide for themselves (as in AQM & ECN).


Bob

On 11/03/2021 23:37, Spencer Dawkins at IETF wrote:
> Thanks for continuing this discussion on the mailing list. If I might 
> add this ...
>
> On Thu, Mar 11, 2021 at 11:13 AM Bob Briscoe <ietf@bobbriscoe.net> wrote:
>
>     Gorry, and panrg,
>
>     I'd agree with Gorry, about one chance per generation.
>
>
> My understanding (and we're talking about this to understand it 
> better) is that there are two things in play with the early ECN 
> deployment experiment.
>
> Yes, routers crashing when they saw non-zero ECN bits was in a device 
> which (at the time) wasn't likely to be updated or replaced for a 
> number of years (see "equipment generation"), but the second point 
> (and I think it was Kireeti who said this well) was that once you've 
> updated or replaced all of the problematic equipment, the people who 
> turned ECN off didn't immediately turn ECN back on (I remember the 
> phrase "a bad taste in their mouths" mentioned multiple times).
>
> So, as Kireeti and others observed, generations between updates are 
> getting shorter, but it's not clear that the ability of people to 
> forgive and forget past experiences is getting shorter in the same 
> way. I'm remembering the quote from Oscar Wilde, "Second marriage is 
> the triumph of hope over experience" - I think we're talking about the 
> same thing as "second marriage".
>
> Best,
>
> Spencer
>
>     I'd also say that there are two main prongs to the ECN story. Not
>     just
>     the "one chance" point.
>     The other is "Outperforming End-to-end Protocol Mechanisms"
>
>     When I was asked to write the business case for deploying ECN in
>     BT, my
>     colleagues in the tech strategy team pushed me on this point of
>     outperforming e2e mechanisms. Jamal Salim's performance evaluation
>     [RFC2884] showed ECN gave no performance benefit, except for short
>     flows. I had to admit that e2e FEC for short flows would be more
>     likely
>     to win out. And at the time, I'd just found Damon Wischik's paper about
>     how little the extra percentage of traffic volume would be, if every
>     sender just duplicated the first few packets of each flow [Wischik08].
>
>     So the business case flat-lined. End systems had long since worked
>     out
>     tonnes of loss-hiding tricks. There was far too great a risk that, by
>     the time the 3-part deployment had got anywhere (sender, receiver and
>     network), the problem would have been solved e2e, and the high-risk
>     high-cost investment would all have been wasted.
>
>     This was actually the start of my journey to realize that the "ECN =
>     drop" rule was the problem, 'cos it disallowed the real benefit of
>     ECN -
>     to cut queuing delay, by providing a finer-grained signal than
>     loss. I
>     did some calculations to work out that the noise in the delay
>     signal was
>     too great to get queue delay down as low as would be achievable
>     with ECN
>     (which could use virtual queues once ISPs realized that bandwidth had
>     become plentiful enough). That was actually when I started working on
>     chirping (bringing in Mirja as a research fellow) and came to the
>     conclusion that we could only get a better delay signal out of the
>     noise
>     by creating more noise with the chirps. That's when I realized the
>     chirps should only be used at start-up, and ECN was the only way
>     to keep
>     queueing extremely low under load. Today you can see the limits of
>     using
>     e2e delay to reduce queuing in BBR, although of course BBR is
>     unlikely
>     to be the last word in e2e delay reduction.
>
>     I know it's unlikely that every ISP went through such a rigorous
>     exercise, but still none deployed it - probably intuitively
>     reaching the
>     same conclusion. The main reason being the deployment pain wasn't
>     worth
>     the small performance gain. Which is a TL;DR summary of all the above.
>
>     The issue with Linksys home routers crashing wasn't a biggie in that
>     assessment of ECN - there were few enough of that model around by
>     then
>     that it could be worked round with pre-deployment validation testing
>     from the OS. This supports Gorry's "one chance per generation" point.
>
>     Reference
>     [Wischik08] Wischik, D. "Short Messages" Philosophical
>     Transactions of
>     the Royal Society A, 2008, 366, 1941-1953
>
>
>     Bob
>
>     On 11/03/2021 16:17, Gorry Fairhurst wrote:
>     > Resend: On 11/03/2021 15:06, Gorry Fairhurst wrote:
>     >> Hi,
>     >>
>     >> Two observations on ECN in what may have been learned?
>     >>
>     >> * I think the "one chance" is per one /Generation/ of equipment
>     for
>     >> hardware, possibly less time for software updates - but hard to
>     >> eliminate deployed critical bugs completely.
>     >>
>     >> * "Measure widely before, but important also gently test during
>     >> initial deployment", to avoid the pain when there are issues and
>     >> therefore to provide a chance to refine the design if there is a
>     >> problem, and reduce the pain of trying.
>     >>
>     >> I like the additional thought from Eric (in the RG meeting) on
>     >> mitigating user pain by using fallback techniques, so they do not
>     >> share your pain when something happens to go wrong.
>     >>
>     >> ————
>     >>
>     >> Is this a typo?: /Cannot be recovered at TCP layer/
>     >> - it seems like a partial sentence, does need a /This.../
>     >>
>     >> Gorry
>     >>
>     >> _______________________________________________
>     >> Panrg mailing list
>     >> Panrg@irtf.org <mailto:Panrg@irtf.org>
>     >> https://www.irtf.org/mailman/listinfo/panrg
>     <https://www.irtf.org/mailman/listinfo/panrg>
>     >
>     >
>
>     -- 
>     ________________________________________________________________
>     Bob Briscoe http://bobbriscoe.net/ <http://bobbriscoe.net/>
>
>
>

-- 
________________________________________________________________
Bob Briscoe                               http://bobbriscoe.net/