Re: [tsvwg] [Ecn-sane] Comments on L4S drafts

Bob Briscoe <ietf@bobbriscoe.net> Thu, 04 July 2019 13:45 UTC

To: "Holland, Jake" <jholland@akamai.com>
Cc: Luca Muscariello <muscariello@ieee.org>, "ecn-sane@lists.bufferbloat.net" <ecn-sane@lists.bufferbloat.net>, "tsvwg@ietf.org" <tsvwg@ietf.org>
Archived-At: <https://mailarchive.ietf.org/arch/msg/tsvwg/IRz04jBY93zb8aCAJgzrWz0f1tw>

Jake,

On 19/06/2019 05:24, Holland, Jake wrote:
> Hi Bob and Luca,
>
> Thank you both for this discussion, I think it helped crystallize a
> comment I hadn't figured out how to make yet, but was bothering me.
>
> I’m reading Luca’s question as asking about fixed-rate traffic that does
> something like a cutoff or downshift if loss gets bad enough for long
> enough, but is otherwise unresponsive.
>
> The dualq draft does discuss unresponsive traffic in 3 of the sub-
> sections in section 4, but there's a point that seems sort of swept
> aside without comment in the analysis to me.
>
> The referenced paper[1] from that section does examine the question
> of sharing a link with unresponsive traffic in some detail, but the
> analysis seems to bake in an assumption that there's a fixed amount
> of unresponsive traffic, when in fact for a lot of the real-life
> scenarios for unresponsive traffic (games, voice, and some of the
> video conferencing) there's some app-level backpressure, in that
> when the quality of experience goes low enough, the user (or a qoe
> trigger in the app) will often change the traffic demand at a higher
> layer than a congestion controller (by shutting off video, for
> instance).
>
> The reason I mention it is because it seems like unresponsive
> traffic has an incentive to mark L4S and get low latency.  It doesn't
> hurt, since it's a fixed rate and not bandwidth-seeking, so it's
> perfectly happy to massively underutilize the link. And until the
> link gets overloaded it will no longer suffer delay when using the
> low latency queue, whereas in the classic queue queuing delay provides
> a noticeable degradation in the presence of competing traffic.
It is very much intentional to allow unresponsive traffic in the L queue 
if it is not contributing to queuing.

You're right that the title of S.4.1.3 sounds like there's a presumption 
that all unresponsive ECN traffic is bad. Sorry, that was not the 
intention. Elsewhere the drafts do say that a reasonable amount of 
smoothly paced unresponsive traffic is OK alongside any responsive traffic.

(I've just posted an -09 rev, but I'll post a draft-10 that fixes that, 
hopefully before the Monday cut-off).

If you're talking about where unresponsive traffic is mentioned in 
4.1.1, I think that's OK, 'cos that's in the context of saturated 
congestion marking (when it's not OK to be unresponsive).



>
> I didn't see anywhere in the paper that tried to check the quality
> of experience for the UDP traffic as non-responsive traffic approached
> saturation, except by inference that loss in the classic queue will
> cause loss in the LL queue as well.
Yeah, in the context of Henrik's thesis (your [1]), "unresponsive" was 
used as a byword for "attack traffic". But that shouldn't be taken to 
mean unresponsive is considered evil for L4S in general.

Indeed, Low Latency DOCSIS started from the assumption of using a low 
latency queue for unresponsive traffic (games, VoIP, etc), then added 
responsive L4S traffic into the same queue later.

You may have seen the draft about assigning a DSCP for 
Non-Queue-Building (NQB) traffic for that purpose (as with L4S and 
unlike Diffserv, this codepoint solely describes the traffic's 
behaviour, not what it wants or needs).
     https://tools.ietf.org/html/draft-white-tsvwg-nqb-02
And there are references in ecn-l4s-id to other identifiers that could 
be used to get unresponsive traffic into the low latency queue (DOCSIS 
classifies EF and NQB as low latency by default).

We don't want ECN to be the only way to get into the L queue, 'cos we 
don't want to encourage mismarking as 'ECN' when a flow is not actually 
going to respond to ECN.

>
> But letting unresponsive flows get away with pushing out more classic
> traffic and removing the penalty that classic flows would give it seems
> like a risk that would result in more use of this kind of unresponsive
> traffic marking itself for the LL queue, since it just would get lower
> latency almost up until overload.
As explained to Luca, it's counter-intuitive, but responsive flows 
(either C or L) use the same share of capacity irrespective of which 
queue any unresponsive traffic is in. Think of it as the unresponsive 
traffic subtracting capacity from the aggregate (because both queues can 
use the whole aggregate), then the coupling sharing out what's left. The 
coupling makes it like a FIFO from a bandwidth perspective.
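
A rough numerical sketch of that subtraction argument, under a
deliberately idealised model. The function name, the equal per-flow
split of the leftover capacity and the example rates are all
assumptions made for illustration; none of them comes from the drafts
or from the tool.

```python
def responsive_share(capacity_mbps, unresp_l_mbps, unresp_c_mbps, n_responsive):
    """Idealised per-flow rate of the responsive flows (assumed model).

    Unresponsive traffic is subtracted from the aggregate capacity,
    whichever queue it sits in, and the coupling is assumed to split
    what is left roughly equally between the responsive flows.
    """
    leftover = capacity_mbps - (unresp_l_mbps + unresp_c_mbps)
    if leftover <= 0 or n_responsive == 0:
        return 0.0
    return leftover / n_responsive


# 100 Mb/s link, 4 responsive flows, 20 Mb/s of unresponsive traffic.
share_if_in_l = responsive_share(100.0, 20.0, 0.0, 4)  # unresponsive in L queue
share_if_in_c = responsive_share(100.0, 0.0, 20.0, 4)  # unresponsive in C queue
print(share_if_in_l, share_if_in_c)
```

In this model the per-flow share depends only on the total unresponsive
load, not on which queue carries it, which is the FIFO-like bandwidth
behaviour described above.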

You can try this with the tool you mentioned that you had downloaded. 
There's a slider to add unresponsive traffic to either queue.

So it's fine if unresponsive traffic doesn't cause any queuing itself. 
It can happily use the L queue. This was a very important design goal, 
but we write about it circumspectly in the IETF drafts, 'cos talk about 
allowing unresponsive traffic can trigger political correctness 
arguments. (Oops, am I writing on an IETF list?)

Nonetheless, when an unresponsive flow(s) is consuming some capacity, 
and a responsive flow(s) takes the total over the available capacity, 
then both are responsible in proportion to their contribution to the 
queue, 'cos the unresponsive flow didn't respond (it didn't even try to).

This is why it's OK to have a small unresponsive flow, but it becomes 
less and less OK to have a larger and larger unresponsive flow.

BTW, the proportion of blame for the queue is what the queuing score 
represents in the DOCSIS queue protection algo. It's quite simple but 
subtle. See your PS at the end. Right now I'm going to get on with 
writing about that in a proper doc, rather than in an email.


>
> Many of the apps that send unresponsive traffic would benefit from low
> latency and isolation from the classic traffic, so it seems a mistake
> to claim there's no benefit, and it furthermore seems like there's
> systematic pressures that would often push unresponsive apps into this
> domain.
There's no bandwidth benefit.
There's only latency benefit, and then the only benefits are:

  * the low latency behaviour of yourself and other flows behaving like you
  * and, critically, isolation from those flows not behaving well like you.

Neither gives an incentive to mismark - you get nothing if you don't 
behave. And there's a disincentive for 'Classic' TCP flows to mismark, 
'cos they badly underutilize without a queue.

(See also reply to Luca addressing accidents and malice, which lie 
outside control by incentives).

>
> If that line of reasoning holds up, the "rather specific" phrase in
> section 4.1.1 of the dualq draft might not turn out to be so specific
> after all, and could be seen as downplaying the risks.
Yup, as said, will fix the phrasing in 4.1.3. But I'm not going to touch 
4.1.1 without better understanding what the problem is there.

>
> Best regards,
> Jake
>
> [1] https://riteproject.files.wordpress.com/2018/07/thesis-henrste.pdf
>
> PS: This seems like a consequence of the lack of access control on
> setting ECT(1), and maybe the queue protection function would address
> it, so that's interesting to hear about.
Yeah, I'm trying to write about that next. But if you extract Appendix P 
from the DOCSIS 3.1 spec it's explained pretty well already and openly 
available.

However, I want it to be clear that Q Prot is not /necessary/ for L4S - 
and it's also got wider applicability, I think.

> But I thought the whole point of dualq over fq was that fq state couldn't
> scale properly in aggregating devices with enough expected flows sharing
> a queue?  If this protection feature turns out to be necessary, would that
> advantage be gone?  (Also: why would one want to turn this protection off
> if it's available?)
1/ The q-prot mechanism certainly has the disadvantage that it has to 
access L4 headers. But it is much more lightweight than FQ.

There's no queue state per flow. The flow-state is just a number that 
represents its own expiry time - a higher queuing score pushes out the 
expiry time further. If it has expired when the next packet of the flow 
arrives, it just starts from now, like a new flow, otherwise it adds to 
the existing expiry time. Long-running L4S flows don't hold on to 
flow-state between most packets - it usually expires reasonably early in 
the gap between the packets of a normal flow, then it can be recycled 
for packets from any other flows that arrive in between. So only 
misbehaving flows hold flow state persistently.
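
A minimal sketch of that expiry-time bookkeeping, assuming times are
kept as integer microseconds. The names update_expiry and lead_after,
and the example gap and increment values, are hypothetical and are not
taken from the DOCSIS spec; how the per-packet increment is derived is
deliberately left out here.

```python
def update_expiry(expiry, now, increment):
    """Return a flow's new expiry time after one packet arrives.

    expiry:    stored expiry time, or None if the flow has no state
    now:       arrival time of the packet
    increment: time this packet adds to the flow's state
    """
    if expiry is None or expiry <= now:
        # State has expired (or never existed): start from now, like a new flow.
        return now + increment
    # State is still live: add to the existing expiry time.
    return expiry + increment


def lead_after(packets, gap, increment):
    """Expiry minus arrival time of the next packet, after `packets` arrivals.

    Negative means the state expired in the gap and could be recycled;
    positive means the flow is holding state persistently.
    """
    expiry, now = None, 0
    for _ in range(packets):
        expiry = update_expiry(expiry, now, increment)
        now += gap
    return expiry - now


# Packets every 10 ms (10_000 us) with a small and a large increment.
print(lead_after(10, 10_000, 1_000))   # state expires between packets
print(lead_after(10, 10_000, 15_000))  # expiry keeps running further ahead
```

With the small increment the state always lapses before the next packet
arrives, so no memory is held between packets; with the large one the
expiry time moves further ahead on every packet.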

The subtle part is the queuing score. It uses the internal variable from 
the AQM that drives the ECN marking probability - call it p (between 0 
and 1 in floating point). And it takes the size of each arriving packet 
of a flow and scales by the value of p on arrival. This would accumulate 
a number which would rise at the so-called congestion-rate of the flow, 
i.e. the rate at which the flow is causing congestion (the rate at which 
it is sending bytes that are ECN marked or dropped).
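
As a sketch of that accumulation, assuming p is sampled once per packet
on arrival. The function name congestion_rate and the example traffic
are made up for illustration. The result is an expected value: the
bytes a flow would have marked or dropped on average, per second.

```python
def congestion_rate(packets, duration_s):
    """Bytes per second that a flow contributes to congestion.

    packets:    iterable of (size_bytes, p_on_arrival) pairs
    duration_s: length of the observation window in seconds
    """
    congestion_bytes = sum(size * p for size, p in packets)
    return congestion_bytes / duration_s


# Two flows see the same marking probability over the same second,
# but one sends ten times as many bytes as the other.
P = 0.1
big_flow = [(1500, P)] * 100
small_flow = [(1500, P)] * 10

big_rate = congestion_rate(big_flow, 1.0)
small_rate = congestion_rate(small_flow, 1.0)
print(big_rate, small_rate)
```

Because every packet is scaled by the same p, the two results stay in
the same ratio as the bytes sent, so a larger flow picks up a
proportionately larger share of the blame for the queue.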

However, rather than just doing that, the queuing score is also 
normalized into time units (to represent the expiry time of the flow 
state, as above). That's possible by just dividing by a constant that 
represents the acceptable congestion-rate per flow (rounded up to an 
integer power of 2 for efficiency). A nice property of the linear 
scaling of L4S is that this number is a constant for any link rate.
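
A sketch of that normalisation step, assuming the acceptable
congestion-rate is expressed in bytes per second. The 50,000 B/s target
is a made-up figure for illustration, not a value from the DOCSIS spec,
and the helper names are hypothetical.

```python
def round_up_pow2(n):
    """Smallest power of 2 that is >= n, for a positive integer n."""
    return 1 << (n - 1).bit_length()


# Hypothetical acceptable congestion-rate per flow, in bytes per second,
# rounded up to a power of 2 so the division can become a bit shift.
ACCEPTABLE_RATE = round_up_pow2(50_000)


def score_increment_s(size_bytes, p):
    """Seconds that one packet adds to its flow's expiry time.

    size_bytes * p is the packet's contribution in congestion bytes;
    dividing by a rate in bytes per second gives a time in seconds.
    """
    return size_bytes * p / ACCEPTABLE_RATE


print(ACCEPTABLE_RATE)
print(score_increment_s(1500, 0.5))
```

A flow whose congestion-rate sits exactly at the acceptable rate adds
one second of expiry time per second of real time, so its state neither
runs ahead nor lapses; anything sending faster than that pushes its
expiry further and further out.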

That's probably not understandable. Let me write it up properly - with 
some explanatory pictures and examples.


Bob


-- 
________________________________________________________________
Bob Briscoe                               http://bobbriscoe.net/