Re: [tsvwg] L4S status: #17 Interaction w/ FQ AQMs

Jonathan Morton <> Thu, 15 August 2019 15:56 UTC

From: Jonathan Morton <>
Date: Thu, 15 Aug 2019 18:55:56 +0300
Cc: Wesley Eddy <>, "" <>
To: Greg White <>

> [GREG] I don't believe that what I described was a fallacy.   To be clear, when an RFC-3168 sender responds to a CE mark with a multiplicative decrease, what it is multiplicatively decreasing is NOT its sending rate.  It is decreasing its cwnd.  When cwnd is greater than BDP, sending rate is no longer proportional to cwnd.  In an idealized case, the bottleneck link will trigger cwnd reductions such that the reduced cwnd is exactly equal to the BDP (and queuing delay is zero), which means that the cwnd reduction is triggered when cwnd = BDP/0.5 (for Reno) or cwnd = BDP/0.7 (for Cubic).  This preserves continuous 100% link utilization with the minimum queuing delay, and cwnd is always >= BDP.

In other words, you expect the peak queue delay to be equal to the baseline RTT.  This is also the best-practice sizing guideline for dumb FIFOs - which should give you pause.
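
To put numbers on that (a back-of-the-envelope sketch in Python; the 100ms baseline RTT is an illustrative figure of mine, not from your message):

```python
# Peak queue delay implied by the "ideal" RFC-3168 sawtooth described
# above: if the *reduced* cwnd exactly equals the BDP, the pre-reduction
# cwnd was BDP/beta (beta = multiplicative decrease factor), so
# (1/beta - 1) BDPs of data sit queued just before the mark -- and one
# BDP of queue corresponds to one baseline RTT of delay.

def peak_queue_delay(base_rtt, beta):
    """Queue delay at the top of the sawtooth, in seconds."""
    excess_bdps = 1.0 / beta - 1.0    # backlog in units of BDP
    return base_rtt * excess_bdps

base_rtt = 0.100                        # illustrative 100 ms baseline RTT
print(peak_queue_delay(base_rtt, 0.5))  # Reno: 0.1 s -- a full extra RTT
print(peak_queue_delay(base_rtt, 0.7))  # Cubic: ~0.043 s
```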

> The goal of AQMs has generally been to preserve (in as many scenarios as possible) near 100% link utilization, while minimizing queuing delay.   Is your AQM aiming at some other goal?

Actually, most AQMs in my experience aim to reduce latency as much as possible, while keeping link utilisation as high as is compatible with that goal.  Or, more precisely, to achieve some tunable balance between link utilisation and queue delay.  In particular, they aim to reduce latency relative to a dumb FIFO (see above).

Certainly that is the stated goal of Codel: to keep the *standing* (i.e. persistent) queue limited to 5ms by default, where the default settings explicitly assume an Internet-scale 100ms RTT, and to drop or mark only as many packets as necessary to achieve that.  The "measurement window" defined by the interval parameter exists only to filter out short-term bursts, which have many natural causes in the Internet, effectively defining "short term" as "within one assumed RTT".
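
For reference, the core of that control law can be sketched as follows (a simplified paraphrase of the published Codel pseudocode with its default constants, not the production qdisc):

```python
# Sketch of Codel's drop/mark scheduling. Once the sojourn time has
# exceeded TARGET continuously for one INTERVAL, Codel marks a packet,
# then schedules the next mark interval/sqrt(count) later -- so the
# signalling rate ramps up gently until the standing queue drains.

from math import sqrt

TARGET = 0.005    # 5 ms standing-queue target (default)
INTERVAL = 0.100  # 100 ms assumed Internet-scale RTT (default)

def next_mark_time(now, count):
    """When to mark next, once in the dropping/marking state."""
    return now + INTERVAL / sqrt(count)

# The first few inter-mark gaps, in milliseconds:
gaps = [INTERVAL / sqrt(n) * 1000 for n in range(1, 5)]
print([round(g, 1) for g in gaps])  # [100.0, 70.7, 57.7, 50.0]
```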

Codel actually does a better job of maintaining low queue latency than PIE does.  Please do look at the SCE slide deck and note the peak latency increments of roughly 5ms in each chart involving an RFC-3168 flow - almost exactly on Codel's target.  So if you think Codel is designed for high throughput rather than low latency, then you have seriously misunderstood it.

>> It seems as if your concern is that, when the RFC3168 flow cuts its cwnd to drain standing queue, that the L4S sender will shut out the RFC3168 sender on the access network link, and thus force the RFC3168 flow's FQ queue to drain completely.
>    No, the *eventual* convergence to fair shares in steady state is not in question.
> [GREG] Neither (in my view) is the rapid convergence to fair shares**.  But this apparently needs to be demonstrated.

>    FQ enforces it to the best of its ability (which, crucially, is determined by the traffic which makes it through the preceding network elements),
> [GREG]  FQ enforces it *precisely* as long as neither queue drains completely.

And, as I was trying to point out, it is possible for a large burst from one flow, collecting in the dumb FIFO, to starve the FQ of packets from the other flow, thereby inhibiting the FQ's ability to give that flow its fair share of throughput.  At the same time, both flows suffer the increased latency caused by the dumb FIFO's depth, because the FQ is also inhibited in this aspect of its flow-isolation task.

> [GREG]  Ok, here I think we're on the same page.  The CoDel response is too sluggish to bring latency under control as quickly as we would like.

Please stop making that incorrect statement.  Codel's response is very well tuned and very effective on RFC-3168 compliant flows.  The DCTCP response used by L4S is what is too sluggish, when provided with RFC-3168 compliant CE marks.  That is why DCTCP was effectively barred from use outside tightly controlled datacentre environments.
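
The difference in responsiveness is easy to see with the two control laws side by side (a simplified per-RTT sketch under my own assumptions: one CE mark per RTT out of roughly 100 packets, DCTCP's default gain g=1/16, slow-start and pacing ignored):

```python
# Contrast the RFC-3168 response (halve cwnd on a marked RTT) with the
# DCTCP-style response when the AQM supplies only *sparse*, RFC-3168
# style CE marks rather than the dense marking DCTCP expects.

def reno_response(cwnd):
    return cwnd * 0.5                  # RFC 3168: multiplicative decrease

def dctcp_response(cwnd, alpha, marked_fraction, g=1.0 / 16):
    alpha = (1 - g) * alpha + g * marked_fraction   # EWMA of mark density
    return cwnd * (1 - alpha / 2), alpha            # proportional cut

cwnd, alpha = 100.0, 0.0
for _ in range(5):                     # five RTTs, ~1 mark per 100 packets
    cwnd, alpha = dctcp_response(cwnd, alpha, marked_fraction=0.01)
print(round(cwnd, 1))                  # ~99.6 -- vs 3.1 after five Reno halvings
```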

>    Here is a simple experiment that should verify the existence and extent of the problem:


>    Correct behaviour would show a brief latency peak caused by the interaction of slow-start with the FIFO in the subject topology, or no peak at all for the control topology; you should see this for whichever RFC-3168 flow is chosen as the control.  Expected results with L4S in the subject topology, however, are a peak extending about 4 seconds before returning to baseline.
> [GREG] I would not expect to see the FIFO holding a queue in either case, (i.e. I would expect this phenomenon to only affect the latency of the L4S flow) as long as BW > 24 Mbps (see ** below) but we will see.

That's precisely why I want you to actually perform the experiment as described.  I think you will see something you weren't expecting.

> [GREG] ** In "Sebastian's Topology" the difference between the Dumb FIFO BW and the FQ_Codel BW needs to be greater than or equal to one MTU / RTT (for Reno or DCTCP/Prague) to prevent queuing from occurring in the Dumb FIFO for a single steady state flow.  I suppose it needs to be an even bigger difference to prevent FIFO queuing with Cubic.   What is the rationale for using a multiplicative factor (105%) rather than an additive one scaled based on expected CC dynamics?

Again, you are thinking primarily about the steady state, after the exit of slow-start and when the AQM has settled on the correct marking rate.  That's an important design point, but we must also consider the transient states which lead up to that point.

In particular, the last couple of RTTs of slow-start are critical and tend to involve bursts of a full BDP in excess of the correct final cwnd - and that's *if* it is handled optimally.  Slow start can be very nasty to handle.
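
To illustrate the overshoot (a toy sketch; real stacks exit slow-start on loss, marks, or pacing hints, but the doubling geometry is the same):

```python
# Slow-start doubles cwnd every RTT, so even with a perfectly timed
# exit check the final doubling can leave cwnd up to one full BDP
# above its correct value -- and that excess lands in the queue.

def slow_start_cwnds(initial_cwnd, bdp):
    cwnd = initial_cwnd
    while cwnd < bdp:       # exit condition checked once per RTT
        yield cwnd
        cwnd *= 2           # one doubling per RTT
    yield cwnd              # first cwnd at or above the BDP

print(list(slow_start_cwnds(10, 1000)))
# [10, 20, 40, 80, 160, 320, 640, 1280] -- overshoots a 1000-packet BDP
```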

Additionally, those four seconds in which Codel is desperately ramping up to find the correct marking rate for L4S are also important.  We saw 125ms peak delays here on a 10ms RTT.  Your question should be where that delay mostly appears - in the dumb FIFO or in the FQ-AQM?
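
A rough back-of-the-envelope estimate of my own (using Codel's default 100ms interval, and assuming - my choice of threshold - that an L4S flow at 10ms RTT needs an inter-mark gap of about 5ms, i.e. roughly two marks per RTT) is consistent with that four-second figure:

```python
# Estimate how long Codel's count-based ramp takes to reach a given
# marking density: the inter-mark gap is interval/sqrt(count), so we
# sum successive gaps until they shrink below the chosen threshold.

from math import sqrt

INTERVAL = 0.100  # Codel default, seconds

def ramp_time(gap_threshold):
    t, count = 0.0, 1
    while INTERVAL / sqrt(count) > gap_threshold:
        t += INTERVAL / sqrt(count)   # wait out this inter-mark gap
        count += 1
    return t

print(round(ramp_time(0.005), 1))  # ~3.9 s to reach a 5 ms mark spacing
```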

The greater the margin between the FIFO link rate and the FQ-AQM shaper, the faster the FIFO will drain into the FQ-AQM, and the more benefit the latter provides in these transient cases.  For RFC-3168 flows, we've found a 5% margin to be adequate, and that is what is deployed in at least one commercial device.  That is why I'm inviting you to directly compare RFC-3168 and L4S behaviour in this topology.
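
For a sense of scale (illustrative numbers of my own), the two kinds of margin compare like this:

```python
# Compare the additive margin (one MTU per RTT, as suggested above)
# with a 5% multiplicative shaper margin, across a few link rates.

MTU_BITS = 1500 * 8   # bits in a full-size Ethernet packet
RTT = 0.010           # 10 ms path RTT, as in the experiment

def additive_margin_bps(rtt=RTT):
    return MTU_BITS / rtt              # one MTU per RTT

def multiplicative_margin_bps(link_rate_bps):
    return 0.05 * link_rate_bps        # shape to 95% of the FIFO rate

for mbps in (10, 50, 100):
    rate = mbps * 1e6
    print(mbps, round(additive_margin_bps() / 1e6, 2),
          multiplicative_margin_bps(rate) / 1e6)
# additive margin stays ~1.2 Mbps at every rate; the 5% margin scales
# with the link, which is what lets the FIFO drain faster on fast links
```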

 - Jonathan Morton