Re: [tcpPrague] [aqm] L4S status update

Dave Täht <> Wed, 30 November 2016 03:46 UTC

Return-Path: <>
To: Jonathan Morton <>, Matt Mathis <>
References: <> <> <> <> <> <> <> <>
From: Dave Täht <>
Message-ID: <>
Date: Tue, 29 Nov 2016 19:40:31 -0800
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.12; rv:45.0) Gecko/20100101 Thunderbird/45.4.0
MIME-Version: 1.0
In-Reply-To: <>
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: 8bit
Archived-At: <>
Cc: tcpm IETF list <>, tsvwg IETF list <>, Bob Briscoe <>, TCP Prague List <>, "Bless, Roland (TM)" <>, AQM IETF list <>
Subject: Re: [tcpPrague] [aqm] L4S status update
List-Id: "To coordinate implementation and standardisation of TCP Prague across platforms. TCP Prague will be an evolution of DCTCP designed to live alongside other TCP variants and derivatives." <>

The advantages of FQ are probably more widespread than people think.

* Switches tend to multiplex between ports, thus mixing up traffic

* Most modern ethernet cards expose 8-64 queues - why? Because this
provides clear paths for multiple cpus to forward traffic - it's not
network related at all! 1gig cards typically have 8 queues; 10gig, 64.

While 8 queues induce a severe birthday problem, 64 is generally "just
fine", and in either case large amounts of mixed traffic respond
reasonably even to only a few queues.
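To make the birthday-problem point concrete, here is a back-of-the-envelope calculation (my own toy code, nothing shipped anywhere) of how often at least two of ten concurrent flows share a queue under a uniform hash:

```python
# Toy birthday-problem calculation: probability that at least two of
# n_flows hash into the same one of n_queues, assuming a uniform hash.
def p_collision(n_flows: int, n_queues: int) -> float:
    p_distinct = 1.0
    for i in range(n_flows):
        # chance the (i+1)th flow avoids all previously occupied queues
        p_distinct *= (n_queues - i) / n_queues
    return 1.0 - p_distinct

for q in (8, 64, 1024):
    print(f"{q:5d} queues: {p_collision(10, q):.3f}")
```

With ten flows, 8 queues collide essentially always, 64 queues roughly half the time, and 1024 queues only a few percent of the time.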

I don't see the trend to ever more hardware queues subsiding, as it's
a function of the number of cores, which keeps going up and up.

* Google'd hosts - and those of a few cloudy providers - long ago
switched to sch_fq which also interleaves (and paces) flows at an
appropriate burst level on a 1ms interval. So source flows are getting
fq'd on an ever more regular basis.
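For the curious, the pacing idea can be sketched in a few lines. This is a toy model of the concept only - not sch_fq's actual kernel code - and all names and parameters here are made up:

```python
# Toy sketch of pacing: a flow releases one small burst at a time, with
# each burst's departure time spaced out so the flow never exceeds its
# pacing rate, instead of dumping everything back-to-back on the wire.
def pacing_schedule(bytes_to_send: int, rate_bps: float,
                    burst_bytes: int = 1514 * 2):
    """Yield (send_time_seconds, burst_size_bytes) pairs for one flow."""
    t = 0.0
    while bytes_to_send > 0:
        burst = min(burst_bytes, bytes_to_send)
        yield t, burst
        bytes_to_send -= burst
        t += burst * 8 / rate_bps  # time to serialize this burst at rate

# ~10 full-size packets paced out at 10 Mbit/s
sched = list(pacing_schedule(15_140, 10_000_000))
```

The point is simply that a paced source hands the network pre-smoothed traffic, so downstream queues see far smaller bursts per flow.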

So when the assertion is made that fq is impossible on core routers, I
have to point to the fact that much of the traffic traversing them has
already had a great deal of mixing applied to it by devices downstream,
AND most core routers are aggregating more than one link via
multiplexing in the first place.

That's the FQ in the core and datacenter today. I view the side effects
of one cpu per hw queue as essentially accomplishing all the fq needed,
AND applying a single aqm across all those flows is well nigh
impossible, as the flows MUST be spread across cores to reach these rates.


I see no future in hardware designs that can scale past 25Gbit without
doing some form of parallelism across flows. This, again, is something
that already essentially happens within vlans, and with the multiple
forwarding lookup tables used there.

100Gbit devices are, at minimum, four 25Gbit queues wired together. Etc.

(Most of our problems nowadays, btw, are on rx, not tx - it's really
hard to do all the work required on the rx side!)

In terms of fq_codel -

For strong reasons (the birthday problem) we settled on 1024 queues as a
good general-purpose number that works well in practice on real data and
real loads, for speeds between 0 and 10Gbit, for homes and small
businesses. Initially.

We are now seeing people deploy fq_codel successfully on mid-range
systems with 64 hardware queues at 10Gbit - 64K fq_codel'd queues.

Whether or not the aqm part is that effective with that many queues on
real traffic is beyond me. I've had no complaints - as the alternative -
no aqm, no fq - is far worse.


The early experiments with fq_pie (not the bsd version; the sch_fq
derived version) showed something like a 12% (don't quote me!)
improvement in aqm performance by presenting a more equitable
distribution of test traffic (which is presently over-bursty, created by
bursty things like IWXX, tso/gso/gro offloads, and so on).

I was happy with that result, but cognizant that so much test traffic
looks *nothing like* real traffic. Real traffic does not consist of the
long sustained loads so beloved of experimenters and theorists.

The benchmarks I care about most are voip, videoconferencing, gaming,
dns, tls setup, page load time, and anything else that demands low
latency, of course.


Cake has pushed the limits of the possible, with the 8-way set
associative idea making for the perfect fq all the math is based on.
I rather would like more folk testing it - it's stable enough now, and
backported as far back as linux 3.14. I am still unsure of the various
mods to the aqm algo; the new per-host fq-within-fq feature (a common
feature request), the de-natting, and the optional shaper are major
advances, and the classification ideas need further evaluation.
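The set-associative idea is roughly this - here as a toy sketch with my own names and sizes, not cake's actual code: a flow's hash picks a *set* of 8 ways rather than a single bucket, and a colliding flow can claim any free way in its set instead of being lumped in with another flow:

```python
# Toy 8-way set-associative flow table. A flow hashes to a set; within
# the set it can occupy any of 8 ways, so two flows that share a hash
# only truly collide when all 8 ways of that set are already busy.
WAYS = 8
SETS = 128  # 128 sets * 8 ways = 1024 queues, as in fq_codel

table = [[None] * WAYS for _ in range(SETS)]

def find_queue(flow_id: int):
    """Return (set, way) for this flow: reuse its existing slot,
    claim a free way on a hash collision, or None if the set is full."""
    s = hash(flow_id) % SETS
    for w in range(WAYS):
        if table[s][w] == flow_id:   # flow already placed here
            return s, w
        if table[s][w] is None:      # free way: claim it
            table[s][w] = flow_id
            return s, w
    return None                      # true collision: all 8 ways busy
```

Only when all 8 ways of a set are occupied does a true collision occur, which is why, at these queue counts, it behaves like the perfect fq the math assumes.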

It seems eminently possible to add l4s ideas to it, as soon as someone
gets around to a usable ect(1) patch for tcp ecn negotiation.

When last I tried it, cake could push 17Gbit through a 40Gbit card,
where pfifo_fast did about 27 on a single core/hw queue benchmark. We
were bottlenecking on rx, unable to test harder than that.