Re: [PANRG] On the new text on ECN: draft-irtf-panrg-what-not-to-do-16

Spencer Dawkins at IETF <spencerdawkins.ietf@gmail.com> Fri, 12 March 2021 11:47 UTC

From: Spencer Dawkins at IETF <spencerdawkins.ietf@gmail.com>
Date: Fri, 12 Mar 2021 05:47:22 -0600
Message-ID: <CAKKJt-efQgmmbKHkginBj4dXAAWCR=zBmk6J-myaAUeLBmQBJQ@mail.gmail.com>
To: "panrg@irtf.org" <panrg@irtf.org>
Cc: Martin Duke <martin.h.duke@gmail.com>, Colin Perkins <csp@csperkins.org>
Archived-At: <https://mailarchive.ietf.org/arch/msg/panrg/Fm7oEvoxLl9Wq0lgR62UtTk7mkQ>
Subject: Re: [PANRG] On the new text on ECN: draft-irtf-panrg-what-not-to-do-16

So, having had a few minutes to sleep on this discussion (which would never
happen in a face-to-face IETF meeting, of course) ...

On Fri, Mar 12, 2021 at 3:06 AM Gorry Fairhurst <gorry@erg.abdn.ac.uk>
wrote:

> This is helpful ... yesterday I was reflecting on the benefits of using
> AQM/RED (a different story) ... however, I'd agree with Jake: to me
> successful deployment of features that require new configuration needs
> operators to perceive the achievable benefits for the bulk of the traffic.
>

ECN is (quite) a bit different than most of the other contributions to
draft-irtf-panrg-what-not-to-do, because its (non-)deployment history spans
a decade or two (as Gorry said during the meeting, ST2 didn't achieve
deployment, and the Internet moved on, but ECN is different).

I THINK I have the opportunity to propose new text that breaks ECN out into
a broader story - what happens when you don't achieve deployment, for what
should have been a good idea, and you keep trying to achieve deployment,
and stumble over more lessons. This is different from what what-not-to-do has
mostly been about, because ECN is still an active conversation in IETF
working groups.

On the other hand, I think (as Bob said) the story of ECN (non-)deployment
has had multiple chapters, so another option would be to say that
explicitly, and confine the scope of the new text on ECN to be only on the
first chapter, and focus on the new Lessons Learned, that we talked about
during the PANRG session, and in this thread.

I THINK I have a preference to add the caveat about the ECN history
continuing into the present day, because most of the Lessons Learned from
the continuing story are already listed in the document, just not pointing
to ECN as another case where they are displayed.

But I'd like to hear the thoughts of others about this.

Best,

Spencer


> Gorry
>
> On 12/03/2021 07:24, Holland, Jake wrote:
>
> Hi panrg,
>
>
>
> I think I agree with Bob’s main point, if I’ve understood it correctly.
>
>
>
> A note worth highlighting for the innocent ECN bystanders here:
>
> Most of the benefits from FQ-codel and CAKE come from running an AQM and
> making it into a throughput bottleneck, which (I think uncontroversially)
> seems to have given substantial latency benefits to most of the people
> who’ve deployed it.
>
>
>
> The ECN marking itself (as opposed to the AQM that’s doing the marking,
> but could just as easily drop the packet instead of marking) usually
> provides only the marginal benefit of avoiding one retransmitted packet,
> usually on a TCP fast retransmit (plus a few benefits from corner cases
> where you wouldn’t have gotten 3 dup acks, or a lost packet otherwise would
> have hurt more than usual).
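
To make the mark-versus-drop point concrete, here is a minimal sketch in Python. The codepoint values are the ones defined in RFC 3168; everything else (the `Packet` type, `aqm_signal`, `run`, and the one-signal-per-congested-packet model) is a hypothetical simplification for illustration, not anyone's actual AQM.

```python
from dataclasses import dataclass
from enum import Enum


class Ecn(Enum):
    """The two-bit ECN field codepoints from RFC 3168."""
    NOT_ECT = 0b00
    ECT_1 = 0b01
    ECT_0 = 0b10
    CE = 0b11


@dataclass
class Packet:
    seq: int
    ecn: Ecn


def aqm_signal(pkt, congested):
    """Hypothetical AQM decision: mark ECN-capable packets, drop the rest."""
    if not congested:
        return pkt, "forward"
    if pkt.ecn is not Ecn.NOT_ECT:
        # ECN-capable transport: signal congestion without losing the packet.
        return Packet(pkt.seq, Ecn.CE), "mark"
    # Not ECN-capable: the only available congestion signal is a drop.
    return None, "drop"


def run(ect, congested_seqs, n=10):
    """Send n packets; return (retransmits, cwnd_reductions).

    Assumes (as a simplification) that the sender reduces its window once
    per signal, and that every drop costs exactly one retransmission.
    """
    retransmits = 0
    cwnd_reductions = 0
    for seq in range(n):
        pkt = Packet(seq, Ecn.ECT_0 if ect else Ecn.NOT_ECT)
        _, action = aqm_signal(pkt, seq in congested_seqs)
        if action == "drop":
            retransmits += 1
            cwnd_reductions += 1
        elif action == "mark":
            cwnd_reductions += 1
    return retransmits, cwnd_reductions


if __name__ == "__main__":
    print("ECN-capable:", run(True, {4}))
    print("Not ECN-capable:", run(False, {4}))
```

Under these assumptions the sender backs off the same way in both cases; the only difference is the one retransmission that the mark avoids, which is the marginal benefit described above.
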
>
>
>
> There’s a fundamental problem that derives from treating loss as the right
> congestion signal to use, and then defining the CE marking to mean “respond
> the same as loss”.  In some sense the big problem underneath this came from
> landing on Reno instead of Vegas (along with policing aggressive senders)
> [1].
>
>
>
> (It might even be right to say that almost all our later
> congestion-related problems derive from that early “Reno vs. Vegas”
> outcome, the fallout from which still hasn’t really been fixed yet.  This
> for instance is a key reason that large tail-drop buffers consistently
> cause poor latency, and I can state with pretty good confidence that it
> dramatically complicated the deployment of alternative congestion
> controllers.)
>
>
>
> Anyway, this point about ECN performance is well-taken IMO, and it makes
> me think maybe there’s a good addition to the “test gently during initial
> deployment” advice that’s something like “characterize and quantify the
> achievable benefits for the bulk of the traffic that’s supposed to get
> benefits before nailing down semantics for scarce codepoints” [2].
>
>
>
> Best regards,
>
> Jake
>
>
>
> PS:  I’m not vouching for Bob’s side comments about characterizing the
> reasoning behind the ISP and open-source communities’ decision processes; I
> imagine those a little differently but just as speculatively, so I won’t
> debate them here.  I’m just agreeing with the specific point about the
> realizable performance gains from 3168 ECN as compared to what’s achievable
> without using up the codepoints.
>
>
>
>
>
> [1] I’ll just give one reference to probably my favorite explanation of
> the dynamics here, for those who haven’t read up on this:
>
> “Analysis and improvement of fairness between TCP Reno and Vegas for
> deployment of TCP Vegas to the Internet”, Hasegawa et al.
>
> https://ieeexplore.ieee.org/abstract/document/896302
>
>
>
> [2] Worth noting is that on Sally Floyd’s ECN page, referenced from RFC
> 3168, the first question in the list of open ECN issues is about
> quantifying the benefits of ECN for TCP:
>
> http://www.icir.org/floyd/ecn.html#issues
> “Open issues for ECN: What are the quantitative benefits of ECN for
> *TCP*?”
>
>
>
>
>
>
>
> From: Bob Briscoe <ietf@bobbriscoe.net>
> Date: Thursday, March 11, 2021 at 4:41 PM
> To: Spencer Dawkins at IETF <spencerdawkins.ietf@gmail.com>
> Cc: "panrg@irtf.org" <panrg@irtf.org>
> Subject: Re: [PANRG] On the new text on ECN:
> draft-irtf-panrg-what-not-to-do-16
>
>
>
> Spencer,
>
> The apocrypha in the draft needs to tell the whole story. The current
> message is that everything was done right, except we got blind-sided by a
> bug. No.
>
> ECN involved a /three/-part deployment, client, server /and/ bottleneck
> AQM. The crashing home routers pushed back the client part, and servers
> until they finally realized they could enable ECN without making anyone
> more vulnerable to the home router crashes. But my email was about why the
> network part never got deployed (until the more recent FQ-CoDel and CAKE
> deployments).
>
> Even once the bug was out of the way, the deployment pain for networks
> wasn't worth the small performance gain. In the language of the draft, ECN
> didn't "Outperform End-to-end Protocol Mechanisms", which had had plenty of
> time to mask losses in other ways before ECN arrived  (all evidenced in my
> original posting).
>
> The final deployment of the network part in FQ-CoDel and CAKE is more
> difficult to explain. I think that was down to the nature of the open
> source communities that built them. A network operator would have assessed
> whether ECN's performance gain was worth the cost of deployment (as in the
> story in my original posting). But for the open source community the
> ability to function efficiently was enough. It wasn't driven by the more
> traditional cost-benefit analysis approach of commercial operators.
>
> In contrast, if you look at most machines at the broadband access
> bottleneck from the major vendors (Nokia, Ericsson, Siemens, etc), they
> didn't implement AQM, let alone ECN, 'cos the operators they were selling
> to tendered for QoS features, which they understood as something the
> network alone provides, not something it helps end-systems to provide for
> themselves (as in AQM & ECN).
>
>
> Bob
>
> On 11/03/2021 23:37, Spencer Dawkins at IETF wrote:
>
> Thanks for continuing this discussion on the mailing list. If I might add
> this ...
>
>
>
> On Thu, Mar 11, 2021 at 11:13 AM Bob Briscoe <ietf@bobbriscoe.net> wrote:
>
> Gorry, and panrg,
>
> I'd agree with Gorry, about one chance per generation.
>
>
>
> My understanding (and we're talking about this to understand it better) is
> that there are two things in play with the early ECN deployment experiment.
>
>
>
> Yes, routers crashing when they saw non-zero ECN bits was in a device
> which (at the time) wasn't likely to be updated or replaced for a number of
> years (see "equipment generation"), but the second point (and I think it
> was Kireeti who said this well) was that once you've updated or replaced
> all of the problematic equipment, the people who turned ECN off didn't
> immediately turn ECN back on (I remember the phrase "a bad taste in their
> mouths" mentioned multiple times).
>
>
>
> So, as Kireeti and others observed, generations between updates are
> getting shorter, but it's not clear that the ability of people to forgive
> and forget past experiences is getting shorter in the same way. I'm
> remembering the quote from Oscar Wilde, "Second marriage is the triumph of
> hope over experience" - I think we're talking about the same thing as
> "second marriage".
>
>
>
> Best,
>
>
>
> Spencer
>
>
>
> I'd also say that there are two main prongs to the ECN story. Not just
> the "one chance" point.
> The other is "Outperforming End-to-end Protocol Mechanisms"
>
> When I was asked to write the business case for deploying ECN in BT, my
> colleagues in the tech strategy team pushed me on this point of
> outperforming e2e mechanisms. Jamal Hadi Salim's performance evaluation
> [RFC2884] showed ECN gave no performance benefit, except for short
> flows. I had to admit that e2e FEC for short flows would be more likely
> to win out. And at the time, I'd just found Damon Wischik's paper about
> how little the extra percentage of traffic volume would be, if every
> sender just duplicated the first few packets of each flow [Wischik08].
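
As a rough illustration of why duplicating the first few packets adds so little volume, here is a back-of-the-envelope sketch in Python. The function name and the flow-size mix are hypothetical assumptions made up for this example; they are not figures from [Wischik08].

```python
def duplication_overhead(flow_sizes, k):
    """Fraction of extra packets sent if the first k packets of every
    flow are duplicated once.

    flow_sizes: packets per flow (hypothetical input).
    k: number of leading packets duplicated in each flow.
    """
    total = sum(flow_sizes)
    if total <= 0:
        raise ValueError("need at least one non-empty flow")
    # A flow shorter than k packets can only duplicate what it has.
    extra = sum(min(k, n) for n in flow_sizes)
    return extra / total


if __name__ == "__main__":
    # Assumed mix: many short flows, a few long ones carrying most volume.
    flows = [2] * 90 + [1000] * 10
    print(f"extra volume: {duplication_overhead(flows, k=2):.2%}")
```

Because most of the traffic volume sits in a small number of long flows in this assumed mix, duplicating only the head of each flow costs a small fraction of the total, even though every short flow is sent twice in full.
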
>
> So the business case flat-lined. End systems had long since worked out
> tonnes of loss-hiding tricks. There was far too great a risk that, by
> the time the 3-part deployment had got anywhere (sender, receiver and
> network), the problem would have been solved e2e, and the high-risk
> high-cost investment would all have been wasted.
>
> This was actually the start of my journey to realize that the "ECN =
> drop" rule was the problem, 'cos it disallowed the real benefit of ECN -
> to cut queuing delay, by providing a finer-grained signal than loss. I
> did some calculations to work out that the noise in the delay signal was
> too great to get queue delay down as low as would be achievable with ECN
> (which could use virtual queues once ISPs realized that bandwidth had
> become plentiful enough). That was actually when I started working on
> chirping (bringing in Mirja as a research fellow) and came to the
> conclusion that we could only get a better delay signal out of the noise
> by creating more noise with the chirps. That's when I realized the
> chirps should only be used at start-up, and ECN was the only way to keep
> queueing extremely low under load. Today you can see the limits of using
> e2e delay to reduce queuing in BBR, although of course BBR is unlikely
> to be the last word in e2e delay reduction.
>
> I know it's unlikely that every ISP went through such a rigorous
> exercise, but still none deployed it - probably intuitively reaching the
> same conclusion. The main reason being the deployment pain wasn't worth
> the small performance gain. Which is a TL;DR summary of all the above.
>
> The issue with Linksys home routers crashing wasn't a biggie in that
> assessment of ECN - there were few enough of that model around by then
> that it could be worked round with pre-deployment validation testing
> from the OS. This supports Gorry's "one chance per generation" point.
>
> Reference
> [Wischik08] Wischik, D. "Short Messages" Philosophical Transactions of
> the Royal Society A, 2008, 366, 1941-1953
>
>
> Bob
>
> On 11/03/2021 16:17, Gorry Fairhurst wrote:
> > Resend: On 11/03/2021 15:06, Gorry Fairhurst wrote:
> >> Hi,
> >>
> >> Two observations on ECN in what may have been learned?
> >>
> >> * I think the "one chance" is per one /Generation/ of equipment for
> >> hardware, possibly less time for software updates - but hard to
> >> eliminate deployed critical bugs completely.
> >>
> >> * "Measure widely before, but importantly also gently test during
> >> initial deployment", to avoid the pain when there are issues and
> >> therefore to provide a chance to refine the design if there is a
> >> problem, and reduce the pain of trying.
> >>
> >> I like the additional thought from Eric (in the RG meeting) on
> >> mitigating user pain by using fallback techniques, so they do not
> >> share your pain when something happens to go wrong.
> >>
> >> ————
> >>
> >> Is this a typo?: /Cannot be recovered at TCP layer/
> >> - it seems like a partial sentence; does it need a /This.../
> >>
> >> Gorry
> >>
> >> _______________________________________________
> >> Panrg mailing list
> >> Panrg@irtf.org
> >> https://www.irtf.org/mailman/listinfo/panrg
>
> --
> ________________________________________________________________
> Bob Briscoe                               http://bobbriscoe.net/