Re: [PANRG] New ECN section for

Spencer Dawkins at IETF <spencerdawkins.ietf@gmail.com> Wed, 17 February 2021 20:23 UTC

References: <CAKKJt-fUSg8ro1YGm2bRWQXf862RPbEZjzmUHwb+RMcce9YFmw@mail.gmail.com>
In-Reply-To: <CAKKJt-fUSg8ro1YGm2bRWQXf862RPbEZjzmUHwb+RMcce9YFmw@mail.gmail.com>
From: Spencer Dawkins at IETF <spencerdawkins.ietf@gmail.com>
Date: Wed, 17 Feb 2021 14:22:58 -0600
Message-ID: <CAKKJt-fkfYdgY28gbeHBUDykpUTmmrNXtKWWbo5-+uMSWPFBGg@mail.gmail.com>
To: panrg@irtf.org
Cc: Martin Duke <martin.h.duke@gmail.com>, Colin Perkins <csp@csperkins.org>
Archived-At: <https://mailarchive.ietf.org/arch/msg/panrg/bxtU-3mSaQMvGPK3fq2VWrJ88E4>
Subject: Re: [PANRG] New ECN section for

So, Martin had one question about
https://github.com/panrg/draft-dawkins-panrg-what-not-to-do/pull/6, and
that was the very reasonable

"You get one chance to get it right" may be true, but it seems like a
shallow recommendation. What could the ECN proponents have done
differently? Would wider testing have uncovered this? More involvement of
router vendors in standards development and interop?


I think that this is a really important question, although the answer might
be better placed in the "what TO do" document we've talked about producing
after "what NOT to do" was finished. But just to get us thinking about that
...

Martin asked about "more testing", but ECN went through a fair amount of
testing - the original proposal in IETF was produced as Experimental,
before advancing to Proposed Standard, which was once pretty common for TSV
documents that touched congestion avoidance behavior, and Sally Floyd's ECN
page contains a lot of information about simulations and testing results.
It's not obvious that the dreaded router described in [vista-impl],

      Intermediate Gateway Device problem #1: one of the most popular
      versions from one of the most popular vendors.  When a data packet
      arrives with either ECT(0) or ECT(1) (indicating successful ECN
      capability negotiation) indicated, router crashed.  Cannot be
      recovered at TCP layer

even existed when ECN was being tested (we could chase more details to find
out, but that doesn't help us with future protocols).
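
For anyone who wants the bit layout in front of them, here is a minimal
Python sketch of where the ECN field sits in the IPv4 TOS octet / IPv6
Traffic Class octet, using the codepoints from RFC 3168. It is only an
illustration, not something taken from [vista-impl] or the draft, and the
function and constant names are made up for this example.

```python
from typing import Tuple

# ECN codepoints as defined in RFC 3168. The ECN field is the two
# low-order bits of the octet ("bits 6 and 7" in RFC bit numbering,
# where bit 0 is the most significant bit).
ECN_NAMES = {0b00: "Not-ECT", 0b01: "ECT(1)", 0b10: "ECT(0)", 0b11: "CE"}


def split_tos(tos: int) -> Tuple[int, str]:
    """Return (DSCP value, ECN codepoint name) for one TOS/Traffic Class octet."""
    if not 0 <= tos <= 0xFF:
        raise ValueError("TOS octet must fit in 8 bits")
    dscp = tos >> 2      # upper six bits
    ecn = tos & 0b11     # lower two bits
    return dscp, ECN_NAMES[ecn]


def low_bits_nonzero(tos: int) -> bool:
    """True when bits 6 and 7 are not both zero.

    A device that assumed these bits would always be zero sees something
    it did not expect whenever this returns True.
    """
    return (tos & 0b11) != 0
```

Every codepoint other than Not-ECT makes low_bits_nonzero() return True,
so any ECN-capable data packet exercises exactly the bits that the
problem router apparently assumed would stay zero.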

Martin also asked about "more involvement of router vendors in standards
development and interop". Again, we could chase details, but my experience
has been that

   - router vendors are some of the most consistent participants in
   standards development, and
   - the IETF is one of the most cross-area-review-oriented SDOs we know of.

Although (again, I don't know the details), my experience has been that
"intermediate device vendors" may not have much of a plan for upgrades
(check your home gateway device for a worked example). So maybe that would
have helped.

What I'm thinking of, in addition to Martin's questions, are these things.

   - ECN may be a FABULOUS example of what happens when protocol designers
   don't manage to "grease" unused fields in their protocols ("whatever we
   thought the TOS bits in those positions meant, they'd always be zero",
   until they weren't). I'm referencing "use it or lose it" as
   https://datatracker.ietf.org/doc/draft-thomson-quic-bit-grease/, since
   that seems most current.
   - It may be that we get more chances to get something right if our
   protocols are deployed on devices that know how to install updates without
   human intervention (Windows 10 yes, most IoT devices no).
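
To make the "grease" idea a bit more concrete, here is a minimal Python
sketch. Everything in it is an assumption for illustration: a hypothetical
one-octet header whose two low-order bits are reserved, a sender that fills
those bits with random values instead of zeros, and two made-up receivers,
one that ignores the reserved bits and one that wrongly insists they are
zero. It is not how any particular protocol or the referenced draft does it.

```python
import random

RESERVED_MASK = 0b11  # hypothetical: two low-order bits with no assigned meaning


def grease(octet: int, rng: random.Random) -> int:
    """Sender side: fill the reserved bits with random values instead of zeros."""
    return (octet & ~RESERVED_MASK & 0xFF) | rng.randrange(4)


def tolerant_parse(octet: int) -> int:
    """Receiver that ignores the reserved bits and returns the upper six bits."""
    return octet >> 2


def brittle_parse(octet: int) -> int:
    """Receiver that wrongly assumes the reserved bits are always zero."""
    if octet & RESERVED_MASK:
        raise ValueError("unexpected non-zero reserved bits")
    return octet >> 2


def count_brittle_failures(octets, seed=0):
    """Count how many greased octets the brittle receiver rejects."""
    rng = random.Random(seed)
    failures = 0
    for octet in octets:
        try:
            brittle_parse(grease(octet, rng))
        except ValueError:
            failures += 1
    return failures
```

The point of the sketch is that the tolerant receiver never notices the
greasing, while the brittle one fails early and often, which is when you
want to find out, rather than years later when someone finally assigns a
meaning to those bits.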

Other thoughts?

Best,

Spencer

On Sun, Feb 14, 2021 at 2:02 PM Spencer Dawkins at IETF <
spencerdawkins.ietf@gmail.com> wrote:

> Dear PANRG.
>
> I have https://github.com/panrg/draft-dawkins-panrg-what-not-to-do/pull/6
> in Github, to add this material, as we've talked about on the mailing list.
>
> I would love to have comments from anyone familiar with the history of
> ECN, or its current practice.
>
> If you don't Github (there may be a few), here's what's new:
>
> I added this material to Section 4,
>
> 4.13.  Ability to Recover From Missteps
>
>    If early implementers discover problems with a new feature, that
>    feature is likely to be disabled, and convincing implementers to re-
>    enable that feature can be very difficult, and can require years or
>    decades.  (See Section 6.9).
>
> I asserted that "One Chance" was invariant, in Table 1, as
>
>     +-----------------------------------------------------+-----------+
>     | One Chance to Achieve Deployment (Section 4.13)     | Invariant |
>     +-----------------------------------------------------+-----------+
>
> I added this material to Section 6.
>
> 6.9.  Explicit Congestion Notification (ECN)
>
>    The suggested references for Explicit Congestion Notification (ECN)
>    are:
>
>    *  Recommendations on Queue Management and Congestion Avoidance in
>       the Internet [RFC2309]
>
>    *  A Proposal to add Explicit Congestion Notification (ECN) to IP
>       [RFC2481]
>
>    *  The Addition of Explicit Congestion Notification (ECN) to IP
>       [RFC3168]
>
>    *  Implementation Report on Experiences with Various TCP RFCs
>       [vista-impl], slides 6 and 7
>
>    *  Implementation and Deployment of ECN [SallyFloyd]
>
>    In the early 1990s, the large majority of Internet traffic used TCP
>    as its transport protocol, but TCP had no way to detect path
>    congestion before the path was so congested that packets were being
>    dropped, and these congestion events could affect all senders using a
>    path, either by "lockout", where long-lived flows monopolized the
>    queues along a path, or by "full queues", where queues remain full,
>    or almost full, for a long period of time.
>
>    In response to this situation, "Active Queue Management" (AQM) was
>    deployed in the network.  A number of AQM disciplines have been
>    deployed, but one common approach was that routers dropped packets
>    when a threshold buffer length was reached, so that transport
>    protocols like TCP that were responsive to loss would detect this
>    loss and reduce their sending rates.  Random Early Detection (RED)
>    was one such proposal in the IETF.  As the name suggests, a router
>    using RED as its AQM discipline that detected time-averaged queue
>    lengths passing a threshold would choose incoming packets
>    probabilistically to be dropped [RFC2309].
>
>    Researchers suggested providing "explicit congestion notifications"
>    to senders when routers along the path detected that their queues
>    were building, so that some senders would "slow down" as if a loss
>    had occurred.  This would give the path queues time to drain while
>    the path still had sufficient buffer capacity to accommodate bursty
>    arrivals of packets from other senders.  This was proposed as an
>    Experiment in [RFC2481], and standardized in [RFC3168].
>
>    A key aspect of ECN was the use of IP header fields rather than IP
>    options to carry explicit congestion notifications, since the
>    proponents recognized that
>
>       Many routers process the "regular" headers in IP packets more
>       efficiently than they process the header information in IP
>       options.
>
> 6.9.1.  Reasons for Non-deployment
>
>    The proponents of ECN did so much right, anticipating many of the
>    Lessons Learned now recognized in Section 4.  They recognized the
>    need to support incremental deployment (Section 4.2).  They
>    considered the impact on router throughput (Section 4.8).  They even
>    considered trust issues between end nodes and the network, both for
>    non-compliant end nodes (Section 4.10) and non-compliant routers
>    (Section 4.9).
>
>    They were rewarded with ECN being implemented in major operating
>    systems, both for end nodes and for routers.  A number of
>    implementations are listed under "Implementation and Deployment of
>    ECN" at [SallyFloyd].
>
>    What they did not anticipate was routers that would crash when they
>    saw bits 6 and 7 in the IPv4 TOS octet [RFC0791]/IPv6 Traffic Class
>    field [RFC2460], which [RFC2481] redefined to be "currently unused",
>    being set to a non-zero value.
>
>    As described in [vista-impl],
>
>       Intermediate Gateway Device problem #1: one of the most popular
>       versions from one of the most popular vendors.  When a data packet
>       arrives with either ECT(0) or ECT(1) (indicating successful ECN
>       capability negotiation) indicated, router crashed.  Cannot be
>       recovered at TCP layer
>
>    This implementation, which would be run on a significant percentage
>    of Internet end nodes, was shipped with ECN disabled, as was true for
>    several of the other implementations listed under "Implementation and
>    Deployment of ECN" at [SallyFloyd].  Even if router vendors
>    subsequently fixed these implementations, ECN was still disabled on end
>    nodes, and given the tradeoff between the benefits of enabling ECN
>    (somewhat better behavior during congestion) and the risks of
>    enabling ECN (possibly crashing a router somewhere along the path),
>    ECN tended to stay disabled on implementations that supported ECN for
>    decades afterwards.
>
> 6.9.2.  Lessons Learned
>
>    Of the contributions included in Section 6, ECN may be unique in
>    providing these lessons:
>
>    *  Even if you do everything right, you may trip over implementation
>       bugs in devices you know nothing about, that will cause severe
>       problems that prevent successful deployment of your path aware
>       technology.
>
>    *  After implementations disable your path aware technology, it may
>       take years, or even decades, to convince implementers to re-enable
>       it by default.
>
>    These two lessons, taken together, could be summarized as "you get
>    one chance to get it right".
>
> Best,
>
> Spencer
>