Re: [tsvwg] path forward on L4S issue #16

Wesley Eddy <wes@mti-systems.com> Mon, 22 June 2020 15:39 UTC

To: "Holland, Jake" <jholland@akamai.com>, "tsvwg@ietf.org" <tsvwg@ietf.org>
References: <8a8947e1-f852-c489-c85a-be874039f132@mti-systems.com> <CCF60E29-276F-45AA-8045-D14DFE44CDBE@akamai.com>
From: Wesley Eddy <wes@mti-systems.com>
Message-ID: <fb4c3dcf-9487-7199-0b14-a21a3d83db0a@mti-systems.com>
Date: Mon, 22 Jun 2020 11:39:19 -0400
Archived-At: <https://mailarchive.ietf.org/arch/msg/tsvwg/MlYZOK_-QWXYVRTCq9M7_iMAGnw>

On 6/19/2020 9:33 PM, Holland, Jake wrote:
> To add some of my own color commentary on these:
>
> 1. a robust classic bottleneck detection mechanism
> a. AFAICT this is still the method preferred by the L4S authors.
> b. There's been some discussion about how technically feasible this
>     goal is, and also what level of "robust" is necessary.
> c. I'll respectfully suggest that we as a WG should ideally have at least
>     one solid backup plan, in case this approach proves hard to reach
>     consensus.

I think your comment (c) agrees with what several others have said too,
so it seems like a good idea.


> 2. Changing L4S to use a 2-signal approach, using ECT(1)->ECT(0) for
>     the 1/p signal and ECT(1|0)->CE as a 1/sqrt(p) signal.
> a. a few people seemed interested in considering this, but several
>     objections were raised, chiefly (IIRC):
>     i.   late out of order packets from chained loaded dualqs, and the
>          corresponding spurious retransmissions (with a side note that
>          this problem worsens as dualq deployment increases)
>     ii.  loss of the 1/p signal with RFC 6040 tunnel decapsulation, and
>          a corresponding limited scope initially, with a long slow path
>          (including standards actions) to get to ubiquitous deployment
>     iii. discards the long-term L4S goal of reclaiming ECT(0) for other
>          purposes
>     iv.  doesn't match the desired timeline for experimental deployment
>          by those who have engaged with the L4S work.  (I'm not sure any
>          of the proposed paths forward will satisfy this objection, but
>          IIRC this one was specifically raised in response to the 2-signal
>          proposal.)

I agree with your summary.
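
For illustration only, a minimal sketch of how a receiver might tell the
two signals in option 2 apart.  The codepoint values are the RFC 3168
ones; the function name and the returned labels are hypothetical and not
taken from any draft.

```python
# ECN codepoints from RFC 3168 (two-bit ECN field of the IP header).
NOT_ECT, ECT_1, ECT_0, CE = 0b00, 0b01, 0b10, 0b11


def classify_signal(sent: int, received: int) -> str:
    """Map a (sent, received) ECN codepoint pair to the signal it would
    carry under the 2-signal proposal. The labels are hypothetical."""
    if received == CE and sent in (ECT_0, ECT_1):
        return "1/sqrt(p)"  # ECT(1|0) -> CE
    if sent == ECT_1 and received == ECT_0:
        return "1/p"        # ECT(1) -> ECT(0)
    return "none"           # unchanged, or a transition not covered here
```

Note that objection ii above is visible here: if a tunnel decapsulator
hides the ECT(1)->ECT(0) change, only the CE branch is ever reached.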


> 3. a flag day to deprecate ECT(1)->CE marking by classic queues (instead
>     treating ECT(1) as NECT if no non-3168 meaning is implemented).
> a. Interestingly, the scope of this approach seems to track on the same
>     questions as the "how important is a robust classic queue detection"
>     question in #1.b, because they both depend on the current and in-flight
>     deployment footprint for CE-marking classic queues.  If robustness is
>     not very important, then it's because classic queue deployment is low,
>     so a flag day would be a low-touch event.  And if a flag day would be
>     a major lift with a lot of work for operators, then good robustness
>     would likewise be very important if not doing a flag day.
> b. However, this presumably requires an update to RFC 3168, and I'm not
>     sure whether there's a well-established process for organizing an
>     event like this.  Obviously the outreach effort is much higher than
>     if e.g. option #1 or 2 turned out to be feasible to get working well
>     enough to satisfy rough consensus.

I think this would also be a pretty long-term plan, and it's not clear
to me (at least) whether it's very useful.  It assumes L4S will be very
successful, and if that's the case, these classic queues will probably
get upgraded anyway at some point.


> 4. operational considerations to recommend changing ECT(1) to NECT at
>     ingress to networks that have marking classic queues deployed.
> a. Note that these considerations are for networks NOT participating in the
>     L4S experiment, so they can't easily be folded into L4S-specific
>     operational considerations.
> b. This seems to me most appropriate as an add-on of a fallback position
>     during the #3 (flag day), where classic queues can't be reconfigured to
>     treat ECT(1) as NECT.  But I left it as a separate point because it was
>     proposed independently, and maybe there's an argument that only networks
>     that experience a problem would need to do this and could do it as a
>     post-hoc fix when problems are encountered, rather than pre-arranging
>     it with a flag day and the associated proactive outreach it would need.
>     (to be clear: I am currently against that position, but I acknowledge I
>     might be in the rough when consensus is checked)

Useful perspective; thanks.
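
As a rough sketch of the re-marking in option 4, assuming the ingress
device can rewrite the IPv4 TOS / IPv6 Traffic Class byte: the ECN field
is the low two bits of that byte (RFC 3168), and ECT(1) is 01.  The
function name is hypothetical.

```python
# ECN codepoints from RFC 3168: low two bits of the IPv4 TOS byte or
# the IPv6 Traffic Class byte.
NOT_ECT, ECT_1, ECT_0, CE = 0b00, 0b01, 0b10, 0b11
ECN_MASK = 0b11


def bleach_ect1(tos: int) -> int:
    """Return the TOS/Traffic Class byte with ECT(1) rewritten to
    Not-ECT. All other codepoints, and the DSCP bits, are untouched."""
    if (tos & ECN_MASK) == ECT_1:
        return tos & ~ECN_MASK & 0xFF
    return tos
```

A real device would also have to update the IPv4 header checksum after
changing the byte; that step is not shown.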


> 5. operational considerations to recommend policing strategies that can
>     solve the general case of non-compliant traffic that does not respond
>     with the expected backoff to AQM congestion signaling.
> a. Arguably this should be done regardless of L4S, because it seems like
>     an underdeveloped piece of the puzzle for the general problem of active
>     queue management in network devices.  However, solving this well and
>     getting solutions widely deployed or at least available (or somehow
>     deployed in conjunction with L4S endpoint enablement) could also
>     potentially address issue 16.

I personally agree with this.  I think it's not specific to L4S at all 
though.  How bad flows are dealt with in general will also be applicable 
to a buggy or malicious L4S flow.  So I personally think the gate for 
L4S work in this regard should simply be WG confidence that the 
situation is not made substantially worse by L4S flows (whether they are 
legitimate or not).


> b. There are many possible strategies here, so outlining some of the
>     known ones maybe seems worthwhile.  A few examples that spring to
>     mind:
>     i.   FQ
>     ii.  PPV (http://ppv.elte.hu/), as we saw in ICCRG 104 from Szilveszter
>          Nadas, seems to have a lot of promise here
>     iii. Likewise the work trying to solve a similar problem written up in
>          "Fair Resource Sharing for Stateless-Core Packet-Switched Networks
>          With Prioritization" by Michael Menth and Nikolas Zeitler
>          (https://ieeexplore.ieee.org/document/8419697)
>     iv.  There's also some good insights along these lines in the "Rationale"
>          section of the docsis queue protection scheme doc:
> https://tools.ietf.org/html/draft-briscoe-docsis-q-protection-00#section-5
>          It says the same approach could be used in scenarios beyond dualq,
>          and I think there's some applicability to codel or pie, with or
>          without ECN.
>     v.   Perhaps some generic guidelines that captures what many of these
>          have in common--in general a policing response could be based on
>          sampled monitoring (not necessarily integrated closely with the
>          queues) that maintains stats on the top current and recent senders,
>          and blacklists or downgrades their traffic for some time in response
>          to a large enough standing queue, or overflow of an AQM.  (On the
>          grounds that someone here is non-compliant since it exceeded
>          expected operational bounds, and thus at least the highest volume
>          recent senders have presumably failed to back off appropriately.)
>     A BCP that lists these (and maybe other options) and captures a more
>     generalized version of the advice from docsis-q-protection seems likely
>     helpful here.
> c. Also worth noting: the need for some kind of isolation for non-compliant
>     senders is not inherently an ECN-related (nor L4S-related) problem--there's
>     no forced reason that a sender necessarily has to respond to loss either...

I'm in agreement with your remarks here, and think this might be a good
starting point for material in operational guidelines.  With regard to
the totally valid concern that operators not doing L4S won't be reading
operational guidelines drafts for L4S, I think we should try to rely on
things that are effective and useful anyway, without regard to L4S
deployment.  FQ is a good example: it already exists in some form in
many places, and will continue to exist without regard to L4S.  So, if it
helps to keep a buggy or maliciously-marked L4S flow in check, this is good.


> 6. An experiment-linked public whitelist of participant-registered IP
>     ranges that have a L4S compatible dualq in their reachability path at
>     the likely bottleneck, which would be checked by endpoints before
>     negotiating L4S.
> a. to flesh the idea out a bit, I'm imagining this as a web API with a
>     known URL and a database attached, which is documented and maintained
>     as part of the L4S experiment, where experiment participants who have
>     deployed and enabled dualq-capable devices register the applicable
>     IP ranges, and L4S-capable endpoints query the web API before opening
>     connections, so that they avoid negotiating L4S support except when
>     either inside a network with a registered dualq, or when connecting to
>     a remote endpoint that's inside such a network, caching the answers
>     for ~10-30 minutes (or according to http headers in the web api or
>     something)
> b. This would let innocent bystander 3168 traffic operate unmolested at
>     access bottlenecks while gaining live operational experience with L4S.
> c. If the rapid ubiquitous rollout goes as planned according to the L4S
>     project intent, once the dualq devices are far more prevalent than classic
>     ECN queues and widely available on all the bandwidth-shaping access
>     technologies and the non-experiment participants are predominantly not
>     doing any classic marking, the whitelist could be gradually retired, or
>     turned into a blacklist of L4S on IPs that are known to have problems with
>     legacy systems that haven't been upgraded (which can be discovered by
>     gradually enabling L4S on non-participant paths, perhaps with A/B tests,
>     and following up with those networks to ask afterward whether it caused
>     problems)
> d. This approach is of course operationally awkward and not a good long-term
>     solution, and also comes with potential privacy concerns as the deployment
>     grows, but would allow for forward progress on L4S without fixing the
>     underlying incompatibility problem with the CE ambiguity, so that time and
>     an ongoing outreach effort could have a chance to resolve the classic ECN
>     deployment footprint questions.  This also allows for some amount of
>     targeted experimentation with the classic queue detection work.

This seems like it's useful for coordinating experiments and working on 
some of the research questions too.

However, in the case of malicious flows wrongly saying they're L4S, I
don't think the whitelist does anything.  If some bug or issue were
found, the list may create a way to exploit it.  That said, I'm not sure
how much of an actual concern this should be.
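
For reference, the endpoint-side check described in 6.a could look
roughly like the sketch below.  No such registry or API exists today;
the class, its method names, and the injected fetch_ranges callable
(standing in for the web API query) are hypothetical.  The 600-second
TTL is just the low end of the ~10-30 minute caching suggested above.

```python
import ipaddress
import time


class DualQRegistryCache:
    """Cache of registered dualq IP ranges, refreshed after a TTL.
    fetch_ranges is a hypothetical callable returning CIDR strings."""

    def __init__(self, fetch_ranges, ttl_s=600.0, clock=time.monotonic):
        self._fetch = fetch_ranges
        self._ttl = ttl_s
        self._clock = clock
        self._networks = []
        self._expires = None

    def _refresh_if_stale(self):
        now = self._clock()
        if self._expires is None or now >= self._expires:
            self._networks = [ipaddress.ip_network(r) for r in self._fetch()]
            self._expires = now + self._ttl

    def covers(self, addr):
        """True if addr falls inside any registered range."""
        self._refresh_if_stale()
        ip = ipaddress.ip_address(addr)
        return any(ip in net for net in self._networks)

    def may_negotiate_l4s(self, local_addr, remote_addr):
        """Negotiate L4S only if either end is behind a registered dualq."""
        return self.covers(local_addr) or self.covers(remote_addr)
```

An endpoint would call may_negotiate_l4s() before opening a connection
and fall back to classic behavior when it returns False.  As noted
above, this does nothing about a sender that ignores the check.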