Re: [tsvwg] sce vs l4s comparison plots?

Dave Taht <dave@taht.net> Mon, 11 November 2019 00:18 UTC

From: Dave Taht <dave@taht.net>
To: Tom Henderson <tomh@tomh.org>
Cc: tsvwg IETF list <tsvwg@ietf.org>, tcpm@ietf.org
In-Reply-To: <4b67d594-e4fc-92d8-fcdc-8384fcb7286b@tomh.org> (Tom Henderson's message of "Sun, 10 Nov 2019 22:27:58 +0000 (UTC)")
References: <742142FB-6233-4048-931B-EE2DD9024454@gmx.de> <87mud4ejl9.fsf@taht.net> <4b67d594-e4fc-92d8-fcdc-8384fcb7286b@tomh.org>
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/24.5 (gnu/linux)
Date: Sun, 10 Nov 2019 16:17:59 -0800
Message-ID: <87a7931d1k.fsf@taht.net>
MIME-Version: 1.0
Content-Type: text/plain
Archived-At: <https://mailarchive.ietf.org/arch/msg/tsvwg/-UEeaXDbmQYi3BqxwuSBI3EIcO8>
Subject: Re: [tsvwg] sce vs l4s comparison plots?

Tom Henderson <tomh@tomh.org> writes:

> Dave,
> I missed replying earlier to a few of your points below (inline).
>
> On 11/9/19 3:05 PM, Dave Taht wrote:
>
>>
>> By default (when run without -x) flent captures very little metadata
>> about the system it is run on (IP addresses, a couple sysctls, and
>> qdiscs) but it's helpful to have. One example that would be in that
>> metadata, is that I'm unsure if the ns3 data is using an IW4 or IW10?
>>
>> It sounds like you are rate limiting with htb? (to what quantum?)
>>
>> Another example, in more "native" environments running at a simulated
>> line rate, BQL is quite important to have in the simulation
>> also. there's been a couple papers published on BQL's benefits vs a raw
>> txring, thus far, there's a good plot of what it does in fig 6 of:
>>
>> http://sci-hub.tw/10.1109/LANMAN.2019.8847054
>
> in the ns-3 simulations posted, BQL is enabled on all links.

Cool. If only the DSL and cable worlds had adopted this! It allows for
much smarter handling of packet delivery higher in the stack at the cost
of one interrupt's worth of standing queue. Without BQL we wouldn't be
scaling Linux past 10GigE today.
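
To make the "standing queue" point concrete, here is a toy model of a
byte-limited TX ring. It is only a sketch: the function name and the fixed
limit are made up for illustration, and the real BQL code adjusts its limit
dynamically, which this does not attempt.

```python
from collections import deque


def fill_ring(packet_sizes, ring_limit_bytes):
    """Toy model of a byte-limited TX ring (NOT the kernel's algorithm).

    Packets move from the qdisc into the ring only while the bytes already
    in the ring are below ring_limit_bytes (a hypothetical fixed limit).
    Returns (sizes_in_ring, sizes_still_in_qdisc).
    """
    qdisc = deque(packet_sizes)
    ring = []
    ring_bytes = 0
    while qdisc and ring_bytes < ring_limit_bytes:
        size = qdisc.popleft()
        ring.append(size)
        ring_bytes += size
    return ring, list(qdisc)


# 20 full-size frames arrive at once.
burst = [1514] * 20

# With a small byte limit most of the burst stays in the qdisc,
# where an AQM or FQ scheduler can still reorder, mark or drop it.
ring, held = fill_ring(burst, ring_limit_bytes=2 * 1514)
print(len(ring), "in ring,", len(held), "still in qdisc")

# With an effectively unlimited ring everything is already committed
# to the hardware queue before the qdisc gets a say.
ring, held = fill_ring(burst, ring_limit_bytes=10**9)
print(len(ring), "in ring,", len(held), "still in qdisc")
```

The smaller the limit, the more of the backlog sits where the qdisc can act
on it, which is the property being praised above.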

I keep hoping *switches* will start doing bql, also.

Note:

The AQL work (long in google wifi and a few other places) - "airtime
queue limits" - is finally entering linux mainline. This makes for a
similar savings to BQL at interrupt time on media with variable "line
rate" encodings, extremely useful for wifi and lte and possibly for
cable and ethernet over powerline devices.

Some details on how well it works are in the google document here:

http://flent-newark.bufferbloat.net/~d/Airtime%20based%20queue%20limit%20for%20FQ_CoDel%20in%20wireless%20interface.pdf

Toke's current patchset: https://patchwork.kernel.org/cover/11206223/

>
>>
>> Lastly...
>>
>> So far as I know ns(X) does not correctly simulate GSO/TSO even when
>> run in DCE mode, but I could be out of date on that. TBF (and cake)
>> do break apart superpackets, htb (+ anything, like fq_codel or dualq)
>> do not.
>
> Correct, we do not have an ns-3 model for GSO/TSO.  Is it needed (in
> the simulation) if BQL is enabled with small device queues?

I don't know. Are you seeing GRO/GSO/TSO superpackets in the path on
this simulation? Those offloads aren't enabled for a variety of pseudo
devices, particularly in older releases of Linux.

The 4.4 kernel (released 2016-01-10) you are leveraging predates some
major new network subsystem features, like better pacing, sch_etf, and
the switch to EDT scheduling (in linux 5.1). I'm sure there are dozens
more things that might matter, which is why we test. :) I imagine Neal
is more up on all the changes since 4.4 that might matter.

In the field:

We started running into problems with GRO in 2014, where GigE
line-rate bursts of IW10, 20, 30, 42 or more packets hit certain brands of
home router hardware and were then released as a single superpacket. The
resulting load spike (when stepping down to, say, 5mbit) was quite noticeable
when running concurrent voip applications. Codel also tended to be late
in recognizing stuff past the "burp"; it didn't matter how much FQ we had
in certain scenarios when dealing with superpackets.
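
As a back-of-the-envelope check on why that hurts voip, here is the
serialization time of such a burst once it has to squeeze through a slow
link. The 1514-byte frame size and the helper name are assumptions for
illustration; framing overhead on a real DSL or cable link would differ.

```python
def serialization_ms(packets, rate_bps, frame_bytes=1514):
    """Milliseconds needed to send `packets` back-to-back frames at rate_bps.

    frame_bytes=1514 assumes full-size Ethernet frames without preamble
    or inter-frame gap (an illustrative simplification).
    """
    return packets * frame_bytes * 8 / rate_bps * 1000


for n in (10, 20, 30, 42):
    slow = serialization_ms(n, 5_000_000)      # 5 Mbit/s bottleneck
    fast = serialization_ms(n, 1_000_000_000)  # GigE ingress
    print(f"{n:2d} packets: {fast:.3f} ms at 1 Gbit/s, {slow:.1f} ms at 5 Mbit/s")
```

Under these assumptions a 10-packet superpacket occupies the 5 Mbit/s link
for about 24 ms and a 42-packet one for about 102 ms, during which nothing
else (including a voip packet in another flow's queue) can be sent.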

Back in 2015, we had to put in commit
a5d28090405038ca1f40c13f38d6d4285456efee
to get GSO even remotely right.

Anyway...

After much gnashing of teeth and pulling of hair while trying to
selectively disable or split gso on various pieces of consumer hw,
in the end we made gso-splitting the default in cake, rather than
try to turn off GRO everywhere it was needed, or revise the htb shaper
so fq_codel could be used more effectively with GSO in place. (As I said,
tbf does splitting; htb, which is what most sqm systems use, does not.)

fq_codel_fast (which has sce support) also
splits GSO. I'd filed a bug on the L4S github site requesting that they
also implement GSO splitting and use memory, not packet limits, a while back.
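
For readers who haven't looked at what "splitting" means here, a minimal
sketch follows. It only models payload lengths; the function name is
hypothetical and this is not the code from cake or fq_codel_fast, which
operate on kernel skbs.

```python
def split_gso(payload_len, mss):
    """Split one aggregated payload into per-segment payload lengths.

    Toy model: returns a list of segment sizes, each at most `mss` bytes,
    that together add up to payload_len.
    """
    if mss <= 0:
        raise ValueError("mss must be positive")
    if payload_len < 0:
        raise ValueError("payload_len must be non-negative")
    full, rest = divmod(payload_len, mss)
    return [mss] * full + ([rest] if rest else [])


# One large superpacket becomes many MSS-sized packets, so the qdisc can
# interleave other flows between them and signal (mark/drop) per packet.
segments = split_gso(payload_len=65160, mss=1448)
print(len(segments), "segments, largest", max(segments), "bytes")
```

Once the superpacket is split, each segment is a separately schedulable
unit, which is what lets FQ and the AQM work at per-packet granularity
again at low rates.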

Given how fine-grained both SCE and L4S need to be in their signalling,
my assumption (needing testing!) is that GSO/GRO/TSO need to be
disabled to get an expected result. With GSO/GRO/TSO on, especially
at lower rates, the results will be "interesting", so in the field,
with the final delivered codebase(s), superpackets need to be handled
properly by the qdisc, whichever way it goes.

Last year GSO went always on in Linux kernels, which generally means
that locally sourced tcp packets are always 2 packets in size or larger.
The popular sch_fq scheduler has a 2 packet quantum also. This is
great, if you are targeting 40GigE....


commit 0a6b2a1dc2a2105f178255fe495eb914b09cb37a
Author: Eric Dumazet <edumazet@google.com>
Date:   Mon Feb 19 11:56:47 2018 -0800

    tcp: switch to GSO being always on
    
    Oleksandr Natalenko reported performance issues with BBR without FQ
    packet scheduler that were root caused to lack of SG and GSO/TSO on
    his configuration.
    
    In this mode, TCP internal pacing has to setup a high resolution timer
    for each MSS sent.
    
    We could implement in TCP a strategy similar to the one adopted
    in commit fefa569a9d4b ("net_sched: sch_fq: account for schedule/timers drifts")
    or decide to finally switch TCP stack to a GSO only mode.
    
    This has many benefits :
    
    1) Most TCP developments are done with TSO in mind.
    2) Less high-resolution timers needs to be armed for TCP-pacing
    3) GSO can benefit of xmit_more hint
    4) Receiver GRO is more effective (as if TSO was used for real on sender)
       -> Lower ACK traffic
    5) Write queues have less overhead (one skb holds about 64KB of payload)
    6) SACK coalescing just works.
    7) rtx rb-tree contains less packets, SACK is cheaper.
    
    This patch implements the minimum patch, but we can remove some legacy
    code as follow ups.
    
    Tested:
    
    On 40Gbit link, one netperf -t TCP_STREAM
    
    BBR+fq:
    sg on:  26 Gbits/sec
    sg off: 15.7 Gbits/sec   (was 2.3 Gbit before patch)
    
    BBR+pfifo_fast:
    sg on:  24.2 Gbits/sec
    sg off: 14.9 Gbits/sec  (was 0.66 Gbit before patch !!! )
    
    BBR+fq_codel:
    sg on:  24.4 Gbits/sec
    sg off: 15 Gbits/sec  (was 0.66 Gbit before patch !!! )
    
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Reported-by: Oleksandr Natalenko <oleksandr@natalenko.name>
    Signed-off-by: David S. Miller <davem@davemloft.net>


>
> - Tom