Re: [tsvwg] L4S Issue #16: Classic ECN Fall-Back: Tech Report updated ... and standby for more

Jonathan Morton <chromatix99@gmail.com> Fri, 24 April 2020 22:49 UTC

Content-Type: text/plain; charset="us-ascii"
Mime-Version: 1.0 (Mac OS X Mail 11.5 \(3445.9.1\))
From: Jonathan Morton <chromatix99@gmail.com>
In-Reply-To: <0315efd8-4941-bd74-c1c1-6782210ab618@bobbriscoe.net>
Date: Sat, 25 Apr 2020 01:49:04 +0300
Cc: tsvwg IETF list <tsvwg@ietf.org>, "tsvwg-chairs@ietf.org" <tsvwg-chairs@ietf.org>
Content-Transfer-Encoding: quoted-printable
Message-Id: <B88994D7-4201-497C-B8E9-F1E8C8274ABF@gmail.com>
References: <6c9bdc29-38d5-eccf-3255-7730d58ea15a@bobbriscoe.net> <DCE6B60C-20C7-4D6B-9AB1-6171D6194C74@gmx.de> <9CAE8E49-30CA-4043-A555-DF90D7EFA05A@gmail.com> <0315efd8-4941-bd74-c1c1-6782210ab618@bobbriscoe.net>
To: Bob Briscoe <ietf@bobbriscoe.net>
Archived-At: <https://mailarchive.ietf.org/arch/msg/tsvwg/vIyN3t_CSQaLQ-bdz2L0g9eb1RI>
Subject: Re: [tsvwg] L4S Issue #16: Classic ECN Fall-Back: Tech Report updated ... and standby for more
Precedence: list

> On 24 Apr, 2020, at 9:42 pm, Bob Briscoe <ietf@bobbriscoe.net> wrote:
> 
> 1a) Of course one can make a Classic AQM undetectable by altering its configuration so that it will evade the heuristics of this algorithm. You are welcome to consider this a problem with our algorithm if you want. This looks to me more like clutching at straws.

You may notice that the plot we illustrated this point with features RED - an old AQM that is notoriously difficult to configure correctly, but widely implemented in hardware.  An AQM that is difficult to configure correctly will, in practice, often be configured incorrectly - and that is what we initially did, entirely by accident.

These are actually the very first settings we tried with RED, *before* we stepped back and realised that a 150KB hard limit at 50Mbps corresponds to just 25ms, and that the default parameters then cause marking to begin at only about 2ms.  It was then that we added a second RED configuration with a 400KB limit, taking the latter from the example in the tc-red man page.  Incidentally, the man page provides no practical guidance on choosing the limit value correctly.
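For anyone who wants to check the arithmetic, here is a minimal sketch of the calculation.  It rests on assumptions that are not stated above: that "KB" means 1024 bytes (as tc's "kb" unit does), that 50Mbps means 50e6 bits per second, and that tc-red fills in its thresholds as max = limit/4 and min = max/3 when they are not given explicitly.  The function names are purely illustrative.

```python
KIB = 1024  # assumption: "KB" is 1024 bytes, as with tc's "kb" unit


def drain_ms(num_bytes, rate_bps):
    """Milliseconds needed to drain num_bytes at rate_bps bits per second."""
    return num_bytes * 8 / rate_bps * 1000


def red_default_thresholds(limit_bytes):
    """Assumed tc-red defaults: max = limit / 4, min = max / 3."""
    max_bytes = limit_bytes / 4
    min_bytes = max_bytes / 3
    return min_bytes, max_bytes


def red_marking_onset(limit_kib, rate_bps):
    """Express the hard limit and the RED thresholds as queueing delay."""
    limit_bytes = limit_kib * KIB
    min_bytes, max_bytes = red_default_thresholds(limit_bytes)
    return {
        "limit_ms": drain_ms(limit_bytes, rate_bps),
        "mark_start_ms": drain_ms(min_bytes, rate_bps),
        "mark_max_ms": drain_ms(max_bytes, rate_bps),
    }


if __name__ == "__main__":
    for limit_kib in (150, 400):
        result = red_marking_onset(limit_kib, 50e6)
        print(limit_kib, {k: round(v, 2) for k, v in result.items()})
```

Under those assumptions the 150KB limit works out to a little under 25ms of queue with marking starting at roughly 2ms, and the 400KB limit starts marking at roughly 5.5ms.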

What's interesting about this is that conventional flows (and SCE ones) were perfectly happy with the 150KB configuration.  A network engineer could easily set this up, run some measurements, and be quite satisfied with the results.  It's also easy to imagine a 100Mbps device with a 256KB buffer that was configured more or less similarly, simply because 256KB was the amount of memory available at the right speed and the right price at the time of manufacture.  So this is a configuration you should expect to encounter in the wild, at least occasionally.

The 400KB configuration begins marking at 5.5ms, which is similar to Codel's default target.  However, if we closely examine those plots, we can see that TCP Prague is basically on the edge of its detection parameters, and still exhibits some L4S-type responses.  The easiest to spot is the roughly 35 seconds it takes to switch into classic fallback mode on the 160ms path, well beyond the 20-second failure criterion mentioned in your paper.  For reference, 160ms corresponds to a Europe to West Coast USA path.

All of which calls into question, at least mildly, your initial assumption that Codel would be the "most difficult" AQM to distinguish from L4S.  Assumptions such as these often lead to spherical cows.

But returning to Codel, the default settings are tuned for "typical" Internet paths, whose median RTT is often quoted at 80ms.  Suppose someone had little interest in the world beyond their home state and their ISP's local CDNs, and decided to tune Codel to match that philosophy.  Cake (which uses a version of Codel internally) actually includes a preset keyword for that purpose; "regional" selects a 30ms interval and a 1.5ms target.  Only a minority of users will bother to change the default, but this alternative setting serves a legitimate use case.
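As a rough illustration of how the two settings relate, the sketch below derives the target from the interval using the usual Codel rule of thumb that target is 5% of interval.  That rule, and the 100ms/5ms figures for stock Codel, are assumptions brought in from RFC 8289 rather than anything stated above; the helper name is hypothetical.

```python
def codel_target_ms(interval_ms):
    """Target as 5% of interval (assumed Codel rule of thumb)."""
    return interval_ms / 20


# Intervals in milliseconds; "regional" is the Cake preset named above,
# "codel-default" is the assumed stock Codel interval from RFC 8289.
INTERVALS_MS = {
    "codel-default": 100,
    "regional": 30,
}

if __name__ == "__main__":
    for name, interval in INTERVALS_MS.items():
        print(f"{name}: interval {interval}ms, target {codel_target_ms(interval)}ms")
```

With a 30ms interval this gives the 1.5ms target quoted for "regional", which is even lower than the roughly 2ms at which the first RED configuration began marking.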

Here is how L4S reacts, predictably:

http://sce.dnsmgr.net/_archive/ect1-2020-04-25T001927/l4s-s2-twoflow/l4s-s2-twoflow-ns-cubic-vs-prague-cake_regional_-20ms_tcp_delivery_with_rtt.svg

Obviously, the behaviour is much better if we leave Cake's FQ features switched on, but that is not the point of the present discussion.

> 1b) You will see that we have not tried to fix the false positives yet, because they only occur in situations where a fairness problem does not arise from the detection failure. We do plan to address these to some degree, and the tech report I sent out describes how we intend to. If you do find genuine cases where these false positives actually affect the fairness goal, then we will raise the priority of this question.

But the false positives *would* result in a fairness problem, as mentioned in the text, if a second L4S flow were added which was not subject to the added jitter.  This second flow would squeeze out the one experiencing the false-positive detection, in exactly the same way as it would squeeze out any other conventional-behaving flow.

Granted, this is not a situation directly relevant to friendly coexistence on *existing* networks.  We included false-positive results mainly to point out that there is little if any margin for merely adjusting parameters of the heuristic to include the false-negative cases identified elsewhere.

> 1c) I couldn't find anything further in your pages to explain what you meant by "insensitivity to the delay variation occurring around a packet loss". Please elaborate. This might be related to #2 below, and therefore might have gone away now the bug is fixed.

We simply mean that TCP Prague apparently did not switch into classic fallback mode, despite the presence of a very strong delay variance signal, when there were also a lot of packet drops.  This was also visible in one of the PIE runs that *did* have ECN marking enabled, but had a hard limit low enough to also cause tail loss during slow-start.  This is of course worth re-testing with the bugfix.

> 2. The insensitivity to packet loss was a simple bug. Thank you for pointing out the symptom. There was no evil intent to ignore losses. Olivier has already issued a bug fix, I believe - and thank you again for your help on this one.

Nevertheless, I think it is very much worth pointing out that such a serious (if easily solved) bug made it past your test suite, however extensive you say it is.  I think our decision to go for a broad rather than deep approach to testing was the right one.

In the meeting, the discussion over ECT(1) will most likely centre on how each proposal copes with traversing a conventional network.  We think SCE is in an excellent position here, as RFC-3168 compliant behaviour is built into the design.  It is up to you to prove that L4S is just as good - but we have counterexamples, such as the one I linked above.

Ultimately I think what L4S will need to detect with its heuristic is not the latency-controlling behaviour of the AQM, but the presence of competing conventional flows.  It is precisely in the latter case that the potential for starvation exists.  The delay-variance heuristic is a rather poor substitute.

 - Jonathan Morton
