[tsvwg] taht REVIEW of draft-white-tsvwg-nqb-02
Dave Taht <dave@taht.net> Sat, 24 August 2019 14:50 UTC
From: Dave Taht <dave@taht.net>
To: tsvwg@ietf.org
Date: Sat, 24 Aug 2019 07:49:50 -0700
Archived-At: <https://mailarchive.ietf.org/arch/msg/tsvwg/hZGjm899t87YZl9JJUOWQq4KBsk>
Subject: [tsvwg] taht REVIEW of draft-white-tsvwg-nqb-02
This is my review of: https://tools.ietf.org/id/draft-white-tsvwg-nqb-02.html

I'm not sure if I should cc the IESG?

- It can be improved by making it much shorter. It would be easier on the reader if this document merely defined the bits up front, and then moved on to the 12 pages of explanatory text. e.g.:

     This document defines the NQB diffserv codepoint, '2A'. It MAY be
     used to express the intent of an application to send a
     non-queue-building flow.

  Or something like that. If a busy coder is then willing to wade through the next 12 pages, great.

- Despite having "made a contribution" to this document, can you please remove my name from the acknowledgments? To me, having made one "drive-by" pass at it 6(?) months ago and summoning the intestinal fortitude to tackle it again today doesn't imply I want the slightest thing to do with it. I don't think very much of the NQB idea and I don't think it's possible to change my mind.

  If it wasn't for the vague thought that perhaps 2A might be a better choice to declare an L4S or SCE flow - and had my name not got stuck in here, and were the fq_codel section not currently so egregious - I'd not be commenting. If you take my name off this thing you can say whatever you want. You don't even have to read my comments below.

Top level bits:

- IMHO: It is not safe to define a codepoint to ask for priority, and the assertion that "There is no incentive for an application to mismark its packets as NQB (or vice versa)." is not true in reality. An identifiable short queue for malicious applications to attack is bad.

- The "must employ queue protection" is an attempt to hand-wave away the problem. Does there exist an IETF definition for what "queue protection" actually is?

- The assertion that dual-queue / NQB is "FQ without the downsides" is BS.

- The mis-characterisation of fair queuing and fq_codel's properties in support of the NQB idea is really, really hard to swallow.
In particular, the attempt to newspeak the sparse flows concept set me off... And then I have a bunch of misc notes piled up at the end that I should probably put into a second message.

To quote that section in full here - and then comment on everything from sparse flows to why on earth ledbat is even mentioned here - fasten your seatbelts! This got long, and had I more time, I'd have made it shorter.

From Section 9:

   "Flow queueing (FQ) approaches (such as fq_codel [RFC8290]), on the
   other hand, achieve latency improvements by associating packets into
   "flow" queues and then prioritizing "sparse flows", i.e. packets that
   arrive to an empty flow queue.  Flow queueing does not attempt to
   differentiate between flows on the basis of value (importance or
   latency-sensitivity), it simply gives preference to sparse flows, and
   tries to guarantee that the non-sparse flows all get an equal share
   of the remaining channel capacity and are interleaved with one
   another.  As a result, FQ mechanisms could be considered more
   appropriate for unmanaged environments and general Internet traffic.

   Downsides to this approach include loss of low latency performance
   due to the possibility of hash collisions (where a sparse flow shares
   a queue with a bulk data flow), complexity in managing a large number
   of queues in certain implementations, and some undesirable effects of
   the Deficit Round Robin (DRR) scheduling.  The DRR scheduler enforces
   that each non-sparse flow gets an equal fraction of link bandwidth,
   which causes problems with VPNs and other tunnels, exhibits poor
   behavior with less-aggressive congestion control algorithms, e.g.
   LEDBAT [RFC6817], and could exhibit poor behavior with RTP Media
   Congestion Avoidance Techniques (RMCAT)
   [I-D.ietf-rmcat-cc-requirements].  In effect, the network element is
   making a decision as to what constitutes a flow, and then forcing all
   such flows to take equal bandwidth at every instant.
   The Dual-queue approach defined in this document achieves the main
   benefit of fq_codel: latency improvement without value judgements,
   without the downsides.  The distinction between NQB flows and QB
   flows is similar to the distinction made between "sparse flow queues"
   and "non-sparse flow queues" in fq_codel.  In fq_codel, a flow queue
   is considered sparse if it is drained completely by each packet
   transmission, and remains empty for at least one cycle of the round
   robin over the active flows (this is approximately equivalent to
   saying that it utilizes less than its fair share of capacity).  While
   this definition is convenient to implement in fq_codel, it isn't the
   only useful definition of sparse flows."

So here I go... Perhaps writing something more accurate here will help elsewhere.

   "Flow queueing (FQ) approaches (such as fq_codel [RFC 8290]), on the
   other hand, achieve latency improvements by associating packets into
   "flow" queues"

And thus achieve better statistical multiplexing between packets of all the flows in the system.

... aside ...

"Better statistical multiplexing" has got very little to do with the sparse flow optimization that happens to be in fq_codel. Let's take an example. We have 10 fat "QB" flows at 100Mbit. In a single queue system, that will impose 16ms of delay on a packet newly arriving from another flow - as it will even if you have only one fat QB flow. (This is not actually true, because in a BQL'd system there are typically a few packets underneath, living in the driver or hardware.)

In a std DRR FQ system, servicing these 10 flows would take 1.3ms, and the total queue delay observed by a new 11th flow would be 1.430ms. In a DRR++ system - one with the "sparse" optimization, like fq-pie, fq_codel, or cake - the queue delay for that 11th packet would be ~0.

At 100Mbit, in the case of 1500 byte packets in a FQ'd system, one packet from each flow takes 130us to service, which leads to the following table.
   # 100 Mbit with one sparse flow entering

   | Variant | QB Flows | Single Pie Queue | FQ    | FQ sparse |
   | Delay   |        1 | 16ms             | 130us | 0         |
   | Delay   |       10 | 16ms             | 1.3ms | 0         |
   | Delay   |      100 | 16ms             | 13ms  | 0         |

This chart is of course a gross simplification of what actually happens. Ignoring the possibility of collisions, the effect of bursts on the AQM, and other pesky details, what actually happens with 100 TCPs at a short RTT at 100Mbit is a worthwhile exercise in learning about all the carnage of the TCP fallback mechanisms, like RTO and cwnd-2 caps, that TCP has to offer. The interactions with big buffers, short buffers, all the different aqms, fq, drops, and ECN handling are edifying. (In other words, if you have 100 fat flows feeding into a 100Mbit short RTT pipe, and want to add one more... you're gonna have other problems.) But I digress. IF that sparse flow becomes queue building, std FQ statistical multiplexing techniques apply, which helps.

Tackling the text again... I think I ended up rewriting bits of this:

   Additionally, fq_codel has an optimization for "sparse flows", i.e.
   packets that arrive to an empty flow queue.  Flow queueing alone does
   not attempt to differentiate between flows on the basis of value
   (importance or latency-sensitivity), and guarantees that all flows
   get an equal share of the channel capacity by interleaving them with
   one another.  As a result, FQ mechanisms are considered more
   appropriate for unmanaged environments and general Internet traffic.
   Flow queuing is usually combined with an additional classifier, to
   create a 3-4 tier QoS system - Interactive, Best Effort, and
   Background - especially on low rate networks.  At higher rates any
   form of additional QoS is both harder to do and frequently
   unnecessary.

   "Downsides to this approach include loss of low latency performance
   due to the possibility of hash collisions (where a sparse flow shares
   a queue with a bulk data flow)"

As per the above...
This *vastly* overstates the concern of hash collisions: The odds of a collision with another flow under a direct-mapped hash is a function of the number of flows and the number of queues, but is considerably less. With set associativity this can be improved to basically nil for all practical purposes. Whereas the odds of a collision with a fat flow in a single queued system are *100%*.

If this document could actually start with the odds of colliding with a fat flow being 100% on a single queued system as part of the problem statement, we might get somewhere. In addition, fq_codel in general maintains shorter queues than a non-fq-aqm system, so the cost of a collision with a fat flow is lower also.

After my ranting above, can anyone here (still with me?) hazard a guess at what the cost of a collision is with 10 fat flows (in steady state)? (Kind of a test to see if I've conveyed intuition yet.) And what the actual odds of that collision are with 1024 queues?

   "complexity in managing a large number of queues in certain
   implementations,"

I've had my "complexity" rant elsewhere on these lists on several other occasions. Whether you have 1 queue or thousands in a fq'd system, the code is the same, as is the "complexity", for all intents and purposes:

   pickaqueue = myqueue[hash & 0x1];   // 2 queues
   pickaqueue = myqueue[hash & 0x3FF]; // 1024 queues

So I typically use the word complexity in a context like this: https://en.wikipedia.org/wiki/Computational_complexity and usually flavor it by throwing in a reference to https://en.wikipedia.org/wiki/Gustafson%27s_law because folk in the pie department do really like the low resource requirements of pie, which is indeed more amenable to a hardware implementation, as is the brick-wall dctcp red-derived ramp.

In a software router, routing lookups, firewall rules, interrupt response times, and other processing dominate the runtime.
At one point we'd counted a minimum of 42 function calls from ingress to entry - with a ton of potential callbacks - in linux. Perhaps, long term, a better way to put it for those hardware systems that cannot manage more than 2 queues would be to say: "inability to handle more than a few queues with on-chip resources in a few potential implementations". But it's not "complexity" as I understand it.

Neither pie nor fq_codel is O(1), either, and if you take a look at the assembly language generated by the ingress or egress code you might give in to despair as to how many extra ops are eaten by having to peer so deeply into a packet just to get/set the dscp/ecn value.

   "and some undesirable effects of the Deficit Round Robin (DRR)
   scheduling.  The DRR scheduler enforces that each flow gets an equal
   fraction of link bandwidth, which causes problems with VPNs and other
   tunnels, exhibits poor"

"Some" vpns, some tunnels. ipsec terminating on the bottleneck router works. Short queues make vpns work better, and they get 0 queueing latency when they are under their fair share, fq-wise.

   "behavior with less-aggressive congestion control algorithms, e.g.
   LEDBAT [RFC6817]"

Bringing in ledbat here is not relevant at all to marking or not - unless you want to be marking LEDBAT NQB? Is this whole section just copy-pasted from some other anti-fq_codel rant elsewhere?

LEDBAT fails to work as the designers would prefer on *any* fq or aqm system, period. Pie, dualpi, sfq. In fact, if you have a tail drop buffer of less than 100ms, the delay based component fails to work as specified too. [1] ( https://perso.telecom-paristech.fr/drossi/paper/rossi14comnet-b.pdf )

And I wouldn't call ledbat's behavior under aqm or fq "poor behavior" either - ledbat reverts to reno, and the difference between a system that considers 100+ms of induced delay and jitter to be "good" and one that aims for zero and gets it for most things, and 5-10ms for the rest, is like night and day.
The "poor protocol designers" at the time... thought inflicting 100ms delays was a good choice for a scavenging protocol. What is a "less-aggressive" congestion control algorithm composed of? LEDBAT++ has a 70ms delay target - still a broken idea - but at least has a slower ramp...

   "and could exhibit poor behavior with RTP Media Congestion Avoidance
   Techniques (RMCAT) [I-D.ietf-rmcat-cc-requirements]."

This is completely unsubstantiated. Videoconferencing with fq_codel works better than any alternative I've tried.

   "While this definition is convenient to implement in fq_codel, it
   isn't the only useful definition of sparse flows."

We should try to nail down terms better. I'm VERY unfond of y'all attempting to conflate what we've called sparse and the NQB concept, in this document and elsewhere. There is a key difference between "sparse" and NQB, in that "sparse" flows can be *anything using much less than its fair share*, where NQB requires truthful marking.

As one typical example, for a short web transaction that completes in under an IW10 (which is most of them!):

   syn/ack    - sparse
   ssl neg    - sparse (as it takes time to calculate)
   http req   - sparse (takes time to do the get)
   iw10 burst - not sparse, and essentially not congestion controlled
                either (sigh)
   fin        - (often part of that burst)

In the other direction of the flow, the syn, ssl neg, req, and acks are all "sparse". Whereas - unless you want to start flipping dscp bits on this five-tuple for each of these phases - none of this applies to the NQB concept.

... Anyway, in conclusion ...

Were it me, and I had control of the pen, and I believed in the NQB idea, I'd still try to write a more accurate and favorable description of fq_codel, and try to characterise the NQB idea as "poor man's prioritization".

* Security constraints

   "There is no incentive for an application to mismark its packets as
   NQB (or vice versa)."

This is not true in reality.
The "must employ queue protection" is an attempt to hand-wave away the very real security issues this proposal raises. Am I the only one that deals with malicious applications every day? A specialized, short queue for this sort of stuff - particularly if it's used for important packets (ra, hello, dns) - can be subject to low rate attacks and disabled more easily, so anyone that cares about those potential problems would sleep better in the BE queue.

... While we're on vague terms: Is there an IETF definition of what "queue protection" actually is? ...

At the moment - were L4S to become a "thing" - I'd actually lean towards making the '2A' codepoint mean "L4S" or SCE, and retiring the overload on ECT1. That retains the apparent desired overlap with EF, and allows for ECT1 and ECT0 to be used in a backward compatible way.

# The how to implement section

Were this to become a codepoint, we'd merely toss it into the interactive queue (in sqm and cake) by adding it to the lookup table, just as we did recently for rfc8622:

   static const u8 diffserv3[] = {
           0, 1, 0, 0, 2, 0, 0, 0,
              ^New LE codepoint
           1, 0, 0, 0, 0, 0, 0, 0,
           0, 0, 0, 0, 0, 0, 0, 0,
           0, 0, 0, 0, 0, 0, 0, 0,
           0, 0, 0, 0, 0, 0, 0, 0,
           0, 0, 0, 0, 2, 0, 2, 0,
           2, 0, 0, 0, 0, 0, 0, 0,
           2, 0, 0, 0, 0, 0, 0, 0,
   };

This form of implementation:

1) retains a full mapping of all dscp codes to a given set of QoS queues
2) is naturally tcp-friendly congestion controlled
3) does fate sharing with all other traffic in this queue, and retains the overall sparse and fq levels of interleave
4) is fully FQ'd, so it's naturally treated as NQB - and if it is lying about NQB it ends up being QB and treated properly
5) is one byte of code change, in exchange for having had to read 12 pages of text

https://github.com/dtaht/sch_cake/blob/master/sch_cake.c#L357

   tin = q->tin_index[diffserv3[dscp]];

(Note: in this case, queue 0 is best effort, 1 is background, and 2 is interactive.)
This cites an unpublished and not recently updated I-D about a 3g bearer channel as a supporting document...

If you are going to start arbitrarily throwing flows marked as EF or as NQB into the L queue, the dualpi code recently submitted to the linux kernel needs to change.

[1] "In this paper, we study the interaction between Active Queue Management (AQM) and Low-priority Congestion Control (LPCC). We consider a fairly large number of AQM techniques (i.e., RED, CHOKe, SFQ, DRR and CoDel) and LPCC protocols (i.e., LEDBAT, NICE, and TCP-LP), studying system performance (mainly expressed in terms of TCP% breakdown, average queue size E[Q], and link utilization η) with both simulative and experimental methodologies.

Summarizing our main findings, we observe that AQM resets the relative level of priority between best-effort TCP and LPCC. That is to say, the TCP share of the bottleneck capacity drops dramatically, becoming close to the LPCC share. Additionally, while reprioritization generally equalizes the priority of LPCC and TCP, we also find that some AQM settings may actually lead best-effort TCP to starvation - as these are however non-systematic, we believe the starvation issue to be of less practical relevance than reprioritization.

Reprioritization is a fairly general phenomenon, as it holds for any combination of AQM technique, LPCC protocol and network scenario. This is testified by our thorough sensitivity analysis, where we confirmed the phenomenon for varying network parameters such as buffer size Qmax, link capacity C, heterogeneous RTT delay, flows number and flow duty cycle for over 3,000 simulations. Reprioritization is a real world phenomenon, easily replicable on testbeds and the real Internet, which cannot be easily solved via AQM tuning. Hence, we advocate that explicit collaboration is needed from LPCC (e.g., openly advertise low priority) to assist AQM in taking decisions that maintain the desired level of priority."