Re: [tsvwg] path forward on L4S issue #16

"Rodney W. Grimes" <ietf@gndrsh.dnsmgr.net> Sat, 20 June 2020 03:14 UTC

From: "Rodney W. Grimes" <ietf@gndrsh.dnsmgr.net>
Message-Id: <202006200314.05K3E4hR097425@gndrsh.dnsmgr.net>
In-Reply-To: <CCF60E29-276F-45AA-8045-D14DFE44CDBE@akamai.com>
To: "Holland, Jake" <jholland=40akamai.com@dmarc.ietf.org>
Date: Fri, 19 Jun 2020 20:14:04 -0700 (PDT)
CC: "wes@mti-systems.com" <wes@mti-systems.com>, "tsvwg@ietf.org" <tsvwg@ietf.org>
Archived-At: <https://mailarchive.ietf.org/arch/msg/tsvwg/Q29HOiWX5lD9uZI3E0CE_6OsUCI>

Jake,
	Thank you for spending the time to collect this
detailed summary.

	I believe you left out the following (numbering continues from your last item):

 7.  Use a DSCP to separate the experiment, leaving ECT(1) and CE as
     currently specified in the L4S draft.

 8.  Use a DSCP to classify the traffic as L4S and leave ECT(1) unused,
     altering CE semantics.

 9.  Use a DSCP to classify the traffic, and use ECT(1) as a 1/p signal,
     leaving CE semantics in place.

10.  Drop L4S as overpromising and underdelivering, with complexity
     that almost certainly sets it up for a failed deployment.

Note that in all 4 of these solutions bleaching is unlikely to be
used if there are problems, and the experiment is rather trivial to
terminate.  These also keep ECT(1) available
for a future non-experiment version of L4S should the experiment work,
or for something else should it fail.  7 to 9 can even be started today
without an IETF consensus, and some real operational data created.
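
As a rough illustration of what 7 to 9 have in common, here is a minimal
sketch of a classifier that only looks for L4S inside a DSCP.  The DSCP
value, the queue names and the function names are all made up for
illustration (no codepoint has been assigned), and the per-option ECN
rules are one reading of the list above, not text from any draft.

```python
from enum import IntEnum


class ECN(IntEnum):
    """The four ECN codepoints from RFC 3168 (low two bits of the TOS byte)."""
    NOT_ECT = 0b00
    ECT1 = 0b01
    ECT0 = 0b10
    CE = 0b11


# Hypothetical DSCP for the experiment; no such value has been assigned.
EXPERIMENT_DSCP = 0x2A

# Assumed reading of options 7-9: which ECN codepoints count as L4S
# once the packet already carries the experiment DSCP.
ECN_IN_EXPERIMENT = {
    7: {ECN.ECT1, ECN.CE},            # ECT(1)/CE as in the L4S draft
    8: {ECN.ECT0, ECN.CE},            # ECT(1) left unused
    9: {ECN.ECT0, ECN.ECT1, ECN.CE},  # ECT(1) carries the 1/p signal
}


def split_tos(tos):
    """Split a TOS / Traffic Class byte into (dscp, ecn)."""
    return tos >> 2, ECN(tos & 0b11)


def select_queue(tos, option):
    """Return "l4s" or "classic" for a packet under option 7, 8 or 9."""
    dscp, ecn = split_tos(tos)
    if dscp != EXPERIMENT_DSCP:
        # Outside the experiment everything stays plain RFC 3168.
        return "classic"
    return "l4s" if ecn in ECN_IN_EXPERIMENT[option] else "classic"
```

The point of the sketch is the first branch: a packet without the DSCP is
never treated as L4S, whatever its ECN bits say, so ending the experiment
is a matter of no longer honoring one DSCP.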

On the side, IMHO, the work going into L4S would be better spent addressing:
a)  DSCP global traversal
b)  ack thinning being underspecified such that it creates protocol
    problems.  Specifically, it tosses out changes in reserved
    bits by thinning packets with different bit values.  This was
    identified years ago and left as a problem; it needs to be cleaned up.
c)  Revision of RFC 6040 and other tunnel-related drafts to clear up
    the issues there.  Again, identified years ago and left to be
    cleaned up.
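
To illustrate (b): a minimal sketch of an ACK thinner that only drops an
older ACK when the newer one for the same flow carries the same flag and
reserved bits, so a change in those bits always survives.  The Ack type,
its field names and thin_acks are hypothetical, not taken from any
existing implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Ack:
    flow: str        # hypothetical flow identifier (e.g. the 5-tuple)
    ack_no: int      # cumulative acknowledgment number
    flag_bits: int   # TCP flag and reserved bits (ECE, CWR, ...)


def thin_acks(queue):
    """Thin a queue of pure ACKs without hiding any change in flag bits.

    An older ACK is replaced by a newer one only if both belong to the
    same flow, the newer one acknowledges at least as much, and the
    flag/reserved bits are identical.
    """
    kept = []
    last_index = {}  # flow -> position in `kept` of its most recent ACK
    for ack in queue:
        i = last_index.get(ack.flow)
        if (i is not None
                and kept[i].flag_bits == ack.flag_bits
                and kept[i].ack_no <= ack.ack_no):
            kept[i] = ack  # the older ACK is redundant, overwrite it
        else:
            last_index[ack.flow] = len(kept)
            kept.append(ack)
    return kept
```

With this rule a run of ACKs that toggles a bit (say ECE on, then off)
still shows up as at least one ACK per value, at the cost of thinning a
little less.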

Regards,
Rod

> Hi Wes and tsvwg,
> 
> On 6/4/20, 1:54 PM, "Wesley Eddy" <wes@mti-systems.com> wrote:
> > I think we should discuss the path forward on L4S issue #16 and what 
> > people are working on, planning to do, or expecting to see in this regard.
> >
> > This is the issue on interaction with RFC 3168 ECN AQMs in the network.
> > 
> > I think this is one of the more important ones in many recent 
> > discussions, so would like to make sure we're agreeing on what it will 
> > take to complete or what success will look like.
> >
> > The classic bottleneck detection work is a key part of this.
> ...(snip)...
> 
> I think a few other ideas have also been floated, so I'd ask to include
> those proposals as invited parts of the discussion as well.
> 
> This thread seemed like it had a lot of branches and was kind of hard
> to follow in places, so I thought I'd compile a list of the proposed
> paths forward that I've seen.  I'm not sure I got everything (please
> respond if anybody knows another suggestion that was left out).
> 
> So in hopes it's useful, here's my list (in no particular order):
> 
> 1. a robust classic bottleneck detection mechanism
> 
> 2. Changing L4S to use a 2-signal approach, using ECT(1)->ECT(0) for
>    the 1/p signal and ECT(1|0)->CE as a 1/sqrt(p) signal.
> 
> 3. a flag day to deprecate ECT(1)->CE marking by classic queues (instead
>    treating ECT(1) as NECT if no non-3168 meaning is implemented).
> 
> 4. operational considerations to recommend changing ECT(1) to NECT at
>    ingress to networks that have marking classic queues deployed
> 
> 5. operational considerations to recommend policing strategies that can
>    solve the general case of non-compliant traffic that does not respond
>    with the expected backoff to AQM congestion signaling.
> 
> I'll make another new suggestion now:
> 
> 6. An experiment-linked public whitelist of participant-registered IP
>    ranges that have a L4S compatible dualq in their reachability path at
>    the likely bottleneck, which would be checked by endpoints before
>    negotiating L4S.
> 
> 
> To add some of my own color commentary on these:
> 
> 1. a robust classic bottleneck detection mechanism
> a. AFAICT this is still the method preferred by the L4S authors.
> b. There's been some discussion about how technically feasible this
>    goal is, and also what level of "robust" is necessary.
> c. I'll respectfully suggest that we as a WG should ideally have at least
>    one solid backup plan, in case consensus on this approach proves hard
>    to reach.
> 
> 2. Changing L4S to use a 2-signal approach, using ECT(1)->ECT(0) for
>    the 1/p signal and ECT(1|0)->CE as a 1/sqrt(p) signal.
> a. a few people seemed interested in considering this, but several
>    objections were raised, chiefly (IIRC):
>    i.   late out of order packets from chained loaded dualqs, and the
>         corresponding spurious retransmissions (with a side note that
>         this problem worsens as dualq deployment increases)
>    ii.  loss of the 1/p signal with RFC 6040 tunnel decapsulation, and
>         a corresponding limited scope initially, with a long slow path
>         (including standards actions) to get to ubiquitous deployment
>    iii. discards the long-term L4S goal of reclaiming ECT(0) for other
>         purposes
>    iv.  doesn't match the desired timeline for experimental deployment
>         by those who have engaged with the L4S work.  (I'm not sure any
>         of the proposed paths forward will satisfy this objection, but
>         IIRC this one was specifically raised in response to the 2-signal
>         proposal.)
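
For anyone who has not seen the two response functions side by side, the
difference between a 1/p and a 1/sqrt(p) signal can be sketched
numerically.  The constants below (about 1.22 for a Reno-style sender, 2
for a DCTCP-style one) are the commonly quoted steady-state
approximations, and the function names are just for illustration.

```python
import math


def classic_cwnd(p):
    """Approximate Reno-style window in segments: sqrt(3/2) / sqrt(p)."""
    return math.sqrt(1.5) / math.sqrt(p)


def scalable_cwnd(p):
    """Approximate DCTCP-style (scalable) window in segments: 2 / p."""
    return 2.0 / p


if __name__ == "__main__":
    for p in (0.001, 0.01, 0.1):
        print(f"p={p}: classic={classic_cwnd(p):.1f} "
              f"scalable={scalable_cwnd(p):.1f}")
```

At the same marking probability the scalable sender ends up with a much
larger window, which is why it matters whether a given CE mark came from
a classic queue or from an L4S one.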
> 
> 3. a flag day to deprecate ECT(1)->CE marking by classic queues (instead
>    treating ECT(1) as NECT if no non-3168 meaning is implemented).
> a. Interestingly, the scope of this approach seems to track on the same
>    questions as the "how important is a robust classic queue detection"
>    question in #1.b, because they both depend on the current and in-flight
>    deployment footprint for CE-marking classic queues.  If robustness is
>    not very important, then it's because classic queue deployment is low,
>    so a flag day would be a low-touch event.  And if a flag day would be
>    a major lift with a lot of work for operators, then good robustness
>    would likewise be very important if not doing a flag day.
> b. However, this presumably requires an update to RFC 3168, and I'm not
>    sure whether there's a well-established process for organizing an
>    event like this.  Obviously the outreach effort is much higher than
>    if e.g. option #1 or 2 turned out to be feasible to get working well
>    enough to satisfy rough consensus.
> 
> 4. operational considerations to recommend changing ECT(1) to NECT at
>    ingress to networks that have marking classic queues deployed.
> a. Note that these considerations are for networks NOT participating in the
>    L4S experiment, so they can't easily be folded into L4S-specific
>    operational considerations.
> b. This seems to me most appropriate as an add-on or a fallback position
>    during #3 (the flag day), where classic queues can't be reconfigured to
>    treat ECT(1) as NECT.  But I left it as a separate point because it was
>    proposed independently, and maybe there's an argument that only networks
>    that experience a problem would need to do this and could do it as a
>    post-hoc fix when problems are encountered, rather than pre-arranging
>    it with a flag day and the associated proactive outreach it would need.
>    (to be clear: I am currently against that position, but I acknowledge I
>    might be in the rough when consensus is checked)
> 
> 5. operational considerations to recommend policing strategies that can
>    solve the general case of non-compliant traffic that does not respond
>    with the expected backoff to AQM congestion signaling.
> a. Arguably this should be done regardless of L4S, because it seems like
>    an underdeveloped piece of the puzzle for the general problem of active
>    queue management in network devices.  However, solving this well and
>    getting solutions widely deployed or at least available (or somehow
>    deployed in conjunction with L4S endpoint enablement) could also
>    potentially address issue 16.
> b. There are many possible strategies here, so outlining some of the
>    known ones maybe seems worthwhile.  A few examples that spring to
>    mind:
>    i.   FQ
>    ii.  PPV (http://ppv.elte.hu/), as we saw in ICCRG 104 from Szilveszter
>         Nadas, seems to have a lot of promise here
>    iii. Likewise the work trying to solve a similar problem written up in
>         "Fair Resource Sharing for Stateless-Core Packet-Switched Networks
>         With Prioritization" by Michael Menth and Nikolas Zeitler
>         (https://ieeexplore.ieee.org/document/8419697)
>    iv.  There are also some good insights along these lines in the "Rationale"
>         section of the docsis queue protection scheme doc:
> https://tools.ietf.org/html/draft-briscoe-docsis-q-protection-00#section-5
>         It says the same approach could be used in scenarios beyond dualq,
>         and I think there's some applicability to codel or pie, with or
>         without ECN.
>    v.   Perhaps some generic guidelines that capture what many of these
>         have in common--in general a policing response could be based on
>         sampled monitoring (not necessarily integrated closely with the
>         queues) that maintains stats on the top current and recent senders,
>         and blacklists or downgrades their traffic for some time in response
>         to a large enough standing queue, or overflow of an AQM.  (On the
>         grounds that someone here is non-compliant since it exceeded
>         expected operational bounds, and thus at least the highest volume
>         recent senders have presumably failed to back off appropriately.)
>    A BCP that lists these (and maybe other options) and captures a more
>    generalized version of the advice from docsis-q-protection seems likely
>    helpful here.
> c. Also worth noting: the need for some kind of isolation for non-compliant
>    senders is not inherently an ECN-related (nor L4S-related) problem--there's
>    no forced reason that a sender necessarily has to respond to loss either...
> 
> 6. An experiment-linked public whitelist of participant-registered IP
>    ranges that have a L4S compatible dualq in their reachability path at
>    the likely bottleneck, which would be checked by endpoints before
>    negotiating L4S.
> a. to flesh the idea out a bit, I'm imagining this as a web API with a
>    known URL and a database attached, which is documented and maintained
>    as part of the L4S experiment, where experiment participants who have
>    deployed and enabled dualq-capable devices register the applicable
>    IP ranges, and L4S-capable endpoints query the web API before opening
>    connections, so that they avoid negotiating L4S support except when
>    either inside a network with a registered dualq, or when connecting to
>    a remote endpoint that's inside such a network, caching the answers
>    for ~10-30 minutes (or according to http headers in the web api or
>    something)
> b. This would let innocent bystander 3168 traffic operate unmolested at
>    access bottlenecks while gaining live operational experience with L4S.
> c. If the rapid ubiquitous rollout goes as planned according to the L4S
>    project intent, once the dualq devices are far more prevalent than classic
>    ECN queues and widely available on all the bandwidth-shaping access
>    technologies and the non-experiment participants are predominantly not
>    doing any classic marking, the whitelist could be gradually retired, or
>    turned into a blacklist of L4S on IPs that are known to have problems with
>    legacy systems that haven't been upgraded (which can be discovered by
>    gradually enabling L4S on non-participant paths, perhaps with A/B tests,
>    and following up with those networks to ask afterward whether it caused
>    problems)
> d. This approach is of course operationally awkward and not a good long-term
>    solution, and also comes with potential privacy concerns as the deployment
>    grows, but would allow for forward progress on L4S without fixing the
>    underlying incompatibility problem with the CE ambiguity, so that time and
>    an ongoing outreach effort could have a chance to resolve the classic ECN
>    deployment footprint questions.  This also allows for some amount of
>    targeted experimentation with the classic queue detection work.
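
The endpoint side of 6.a could look roughly like the sketch below.  It
is only an illustration under assumptions: the class and method names
are invented, the registry lookup is passed in as a plain function
rather than a real web API (no URL or response format has been defined),
and the 600 second default is just the low end of the "~10-30 minutes"
suggested above.

```python
import ipaddress
import time


class L4SAllowlist:
    """Cached view of registered IP ranges that sit behind a dualq."""

    def __init__(self, fetch_ranges, ttl_seconds=600, clock=time.monotonic):
        # fetch_ranges: hypothetical callable returning CIDR strings,
        # standing in for a query to the experiment's registry.
        self._fetch = fetch_ranges
        self._ttl = ttl_seconds
        self._clock = clock
        self._networks = []
        self._expires = None

    def _refresh_if_stale(self):
        now = self._clock()
        if self._expires is None or now >= self._expires:
            self._networks = [ipaddress.ip_network(r) for r in self._fetch()]
            self._expires = now + self._ttl

    def covers(self, addr):
        """True if addr falls inside any registered range."""
        self._refresh_if_stale()
        ip = ipaddress.ip_address(addr)
        return any(ip.version == net.version and ip in net
                   for net in self._networks)

    def may_negotiate_l4s(self, local_addr, remote_addr):
        """Negotiate L4S only if either end is in a registered network."""
        return self.covers(local_addr) or self.covers(remote_addr)
```

An endpoint would call may_negotiate_l4s() before opening a connection
and fall back to classic behavior when it returns False.  The sketch
ignores the privacy and availability questions raised in 6.d.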
> 
> 
> (Of these, it's maybe interesting to note that only 1, 2, and 6 do not
> seem to require a standards or BCP action, which IIRC was originally meant
> to be out of scope for the L4S work, outside of 8311.)
> 
> 
> As far as the original "working on, planning to do, or expecting to see"
> question:
> 
> I guess I'm expecting to see at some point the results from what the L4S
> team is doing on the detection mechanisms.  But I remain not very hopeful
> they will address all the concerns that have been raised, so I'm hoping
> it'll be coupled with a credible outreach effort that seems likely to reach
> all the networks that have deployed or are in-process of deploying shared
> classic queues, at the very least.
> 
> I'm not sure about "expecting", but I'd also be very much in favor of seeing
> some sort of approval-style poll conducted (maybe with a preference
> weighting or rank ordering or something) of what the WG members think of the
> technical viability of the different proposed approaches, so that people can
> have a better idea of what others think sound like promising directions.
> 
> Outside that, I'd personally love to see further discussion on these or
> other ideas, but maybe forking the thread into different threads for the
> different proposals, to better avoid the confusion I've been feeling trying
> to find the prior links to the preceding messages on this monster (I eventually
> gave up).
> 
> 
> I regret that I don't currently have much to offer on the "working on" or
> "planning to do" front, being rather busy right now with a few other
> challenging problems.
> 
> But I am supportive of efforts to improve internet latency.  Especially
> those aimed at increasing the deployment (and enablement) of the currently
> available 3168 AQMs and the increased default use of 3168 ECN by more
> endpoint stacks, since that would have a useful impact on application level
> latency in TCP connections right away.
> 
> When I see good opportunities and I'm able, I'll aim to make minor
> contributions to latency-related efforts or do things like minor testing
> support when possible.  So anyone engaging in that kind of thing, please do
> keep the wg (or at least me) posted on progress and any opportunities to
> provide useful low-effort support.  I can spend a day on this kind of thing
> every once in a while (tho sadly, not necessarily every time it might be
> useful), but I can't spend weeks at a time.  That's probably true for the
> near foreseeable future.
> 
> Best regards,
> Jake
> 
> 
> 
> 

-- 
Rod Grimes                                                 rgrimes@freebsd.org