Re: [tsvwg] FQ-CoDel response to unresponsive traffic (was: Related to "Non-L4S traffic abusing the L-queue" discussion during the interim)

Dave Taht <dave.taht@gmail.com> Sat, 26 February 2022 17:13 UTC

From: Dave Taht <dave.taht@gmail.com>
Date: Sat, 26 Feb 2022 12:13:26 -0500
Message-ID: <CAA93jw7BaL-=SOv_JicXPwD8_4Rs89NrUrxBdY8vO92KpthAow@mail.gmail.com>
To: Bob Briscoe <ietf@bobbriscoe.net>
Cc: "De Schepper, Koen (Nokia - BE/Antwerp)" <koen.de_schepper@nokia-bell-labs.com>, tsvwg IETF list <tsvwg@ietf.org>, codel@lists.bufferbloat.net
Archived-At: <https://mailarchive.ietf.org/arch/msg/tsvwg/aV7tsTFLiy33dO0DFCTs1DRlQ8I>

At one level you are interpreting an observed behavior as "tail drop",
which may well be possible somewhere in the stack,
but it's not clear whether you were running a post-2016 kernel, which
is when the drop_batch facility was added.

https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next.git/commit/?id=9d18562a227

This drops from the head, not the tail.

I was not satisfied with this solution btw, and in some later patch
added an increment to the codel count in drop_batch so as to pass "bad
things are happening elsewhere" back over to the main portion of the
algorithm. I'm still very unsatisfied with the concept of a fixed and
user-configurable drop_batch length, rather than something that
autotunes.
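
As a rough illustration of the behavior described above, here is a
minimal Python sketch of a batch head drop. It is a simplified model,
not the kernel code: the function name, the data layout (one deque of
packet sizes per flow) and the batch size of 64 are assumptions made
for the example. It picks the flow with the largest byte backlog,
drops packets from the head of that flow until either the batch limit
or about half the backlog is reached, and then adds the number of
drops to that flow's CoDel count.

```python
from collections import deque


def drop_batch_from_fattest(flows, counts, max_packets=64):
    """Head-drop a batch from the flow with the largest byte backlog.

    flows:  list of deques of packet sizes in bytes (left end = head)
    counts: per-flow CoDel count values, bumped by the number dropped
    Returns (index of the flow that was hit, packets dropped).
    """
    backlogs = [sum(q) for q in flows]
    idx = max(range(len(flows)), key=backlogs.__getitem__)
    if backlogs[idx] == 0:
        return idx, 0  # nothing queued anywhere

    threshold = backlogs[idx] // 2  # aim to shed about half the backlog
    dropped = 0
    dropped_bytes = 0
    while True:
        dropped_bytes += flows[idx].popleft()  # popleft() removes the head
        dropped += 1
        if dropped >= max_packets or dropped_bytes >= threshold:
            break

    # Pass "bad things are happening elsewhere" back to the control law.
    counts[idx] += dropped
    return idx, dropped
```

With one flow holding 10 packets and another holding 200 packets of
100 bytes each, a single call removes 64 packets from the head of the
second flow and leaves the first flow untouched.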

Elsewhere in the fq_codel_fast repo I experimented with eliminating
the queue search. Whether it is better to accept that small but
constant CPU overhead in order to optimize for what is perceived to be
(and may not be!) a rarely hit condition, or to accept the cost of the
search when it happens, remains to be seen.

So, setting aside your conclusion that this was tail drop, I am
happy that you have clearly identified (with a kernel version) and
described a test (yay!) that tickles a count caching problem, and
proposed some solutions here:

https://bobbriscoe.net/projects/latency/CoDel-delta-bug.pdf

cc-ing the codel list.
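
For codel list readers without the context: the quoted message below
argues that the control law ramps up head drop slowly. A minimal
sketch of that schedule follows, assuming the textbook CoDel rule
(the next drop is scheduled interval/sqrt(count) after the previous
one), a 100 ms interval, and a fresh dropping episode with no count
carried over. The function name is hypothetical and this is not the
qdisc code. Under those assumptions the cumulative number of drops
grows roughly with the square of the elapsed time, so the drop rate
grows only about linearly.

```python
import math


def codel_drop_times(duration_s, interval_s=0.100):
    """Scheduled drop times (seconds since entering the dropping state).

    Models only the control-law spacing: after the k-th drop, the next
    one is scheduled interval / sqrt(k) later. Assumes the queue stays
    above target for the whole duration (an unresponsive flow).
    """
    times = []
    count = 0
    t = 0.0  # the first drop happens on entering the dropping state
    while t <= duration_s:
        times.append(t)
        count += 1
        t += interval_s / math.sqrt(count)
    return times


if __name__ == "__main__":
    for seconds in (1.0, 2.0, 4.0):
        print(seconds, len(codel_drop_times(seconds)))
```

Comparing the drop counts for 1 s, 2 s and 4 s shows the quadratic
trend; how that compares with a given overload depends on the excess
packet rate of the unresponsive flow.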

On Sat, Feb 26, 2022 at 8:45 AM Bob Briscoe <ietf@bobbriscoe.net> wrote:
>
> Dave,
>
> I will keep reminding everyone that this shift of topic to FQ-CoDel is
> distracting from the task at hand:
>      "Is Jonathan going to confirm that his 'throughput bonus' and 'fast
> lane' accusations against DualQ are baseless because his experiment was
> broken?"
>
> Nonetheless, response on FQ-CoDel is below, tagged [BB]...
>
> On 25/02/2022 21:06, Dave Taht wrote:
> > while I do not want to spend much time nitpicking this document...
> >
> > "causing most of the time tail-drop" stood out. codel, fq_codel, cake
> > all do head drop, and always have.
>
> [BB] For the list, we're talking about Figure 5 here:
> https://l4steam.github.io/overload-results/
>
> I'm nearly certain that the cap at 600 ms is tail drop.
> Cause: The control law increases head drop so slowly that the flow-queue
> containing the unresponsive flow eventually fills the buffer allocated
> to the whole qdisc. Then I believe it moves into what Jonathan calls
> 'tallest sunflower' drop mode (tail drop focused on the longest flow-queue).
>
> To help prove this, here's an experiment Asad ran for me last Oct on
> FQ-CoDel with an unresponsive flow rate just greater than the link rate.
> https://bobbriscoe.net/projects/latency/CoDel-delta-bug.pdf#page=4
> We were testing very slight overload, so it would stay in head drop
> mode, without hitting the need for tail drop. The plot shows a similar
> series of humps in the queue, but without the cut-off due to tail drop.
> So it's fairly conclusive that Koen's Fig 5 is showing tail drop.
>
> I'll answer your question (on the SANE list) about why the humps repeat,
> but that's a trivial bug compared to the time CoDel takes in the first
> place.
> It's a design flaw, not a bug.
> The so-called 'control' law never even measures the queue it is meant to
> be controlling.
> Here's some history:
>
> * On 12-Nov-2013 I reported that to Kathie and Van as CoDel designers,
> cc the AQM list:
> https://mailarchive.ietf.org/arch/msg/aqm/l4H1QdRl8B-E5FWpJh4w50B_nQE/
> * No response by anyone for over 18 months, until...
> * 07-Jun-2015: Toke confirmed my analysis empirically (see it, via same
> thread above)
>     Toke's plot:
> https://kau.toke.dk/ietf/codel-drop-rate/codel-drop-rate.svg
> * On 30-Sep-2015 you (DaveT) said "cake uses a better curve for CoDel
> but we still need to do more testing in the lab"
>     As far as I understand it, that missed the point: CAKE's curve is
> still extremely slow, but somewhat faster than CoDel.
>     But, CAKE's control law still never measures the queue it is meant
> to be controlling.
> * 25-Feb-2022: You say you don't want to spend much time nitpicking
> Koen's experiment.
>      If not you, someone needs to grasp this nettle, given FQ-CoDel is
> the default qdisc in the Linux mainline.
>
>
>
> Bob
>
> --
> ________________________________________________________________
> Bob Briscoe                               http://bobbriscoe.net/
>


-- 
I tried to build a better future, a few times:
https://wayforward.archive.org/?site=https%3A%2F%2Fwww.icei.org

Dave Täht CEO, TekLibre, LLC