[tsvwg] taht REVIEW of draft-white-tsvwg-nqb-02

Dave Taht <dave@taht.net> Sat, 24 August 2019 14:50 UTC

From: Dave Taht <dave@taht.net>
To: tsvwg@ietf.org
Date: Sat, 24 Aug 2019 07:49:50 -0700
Archived-At: <https://mailarchive.ietf.org/arch/msg/tsvwg/hZGjm899t87YZl9JJUOWQq4KBsk>

This is my review of:
https://tools.ietf.org/id/draft-white-tsvwg-nqb-02.html

I'm not sure if I should cc the IESG?

- It can be improved by making it much shorter.

It would be easier on the reader if this document merely defined the bits
up front and then moved on to the 12 pages of explanatory text, e.g.:

  This document defines the NQB diffserv codepoint, '2A'. It MAY be used
  to express the intent of an application to send a non-queue-building flow.

Or something like that. If a busy coder is then willing to wade
through the next 12 pages, great.

- Despite my having "made a contribution" to this document, can you
  please remove my name from the acknowledgments? To me, having made
  one "drive-by" pass at it 6(?) months ago and summoning the intestinal
  fortitude to tackle it again today doesn't mean I want the
  slightest thing to do with it. I don't think very much of the NQB
  idea, and I don't think it's possible to change my mind.

  If it weren't for the vague thought that perhaps 2A might be a better
  choice to declare an L4S or SCE flow - and had my name not gotten stuck
  in here, and were the fq_codel section not currently so egregious - I'd
  not be commenting.

  If you take my name off this thing you can say whatever you want.
  You don't even have to read my comments below.

Top level bits:

- IMHO: It is not safe to define a codepoint to ask for priority, and the
  assertion that "There is no incentive for an application to mismark
  its packets as NQB (or vice versa)." is not true in reality. An
  identifiable short queue is an attractive target for malicious
  applications to attack.

- The "must employ queue protection" is an attempt to hand-wave away
  the problem. Does there exist an IETF definition for what "queue
  protection" actually is?

- The assertion that dual-queue / NQB is "FQ without the downsides"
  is BS. 

- The mis-characterisation of fair queuing and fq_codel's properties
  in support of the NQB idea is really, really hard to swallow. In
  particular the attempt to newspeak the sparse flows concept set me off...

And then I have a bunch of misc notes piled up at the end that I should
probably put into a second message.

To quote that section in full here - and then comment on everything from
sparse flows to why on earth ledbat is even mentioned here - fasten your
seatbelts! This got long, and had I more time, I'd have made it shorter.

From Section 9:

"Flow queueing (FQ) approaches (such as fq_codel [RFC8290]), on the
other hand, achieve latency improvements by associating packets into
"flow" queues and then prioritizing "sparse flows", i.e. packets that
arrive to an empty flow queue. Flow queueing does not attempt to
differentiate between flows on the basis of value (importance or
latency-sensitivity), it simply gives preference to sparse flows, and
tries to guarantee that the non-sparse flows all get an equal share of
the remaining channel capacity and are interleaved with one
another. As a result, FQ mechanisms could be considered more
appropriate for unmanaged environments and general Internet traffic.

Downsides to this approach include loss of low latency performance due
to the possibility of hash collisions (where a sparse flow shares a
queue with a bulk data flow), complexity in managing a large number of
queues in certain implementations, and some undesirable effects of the
Deficit Round Robin (DRR) scheduling. The DRR scheduler enforces that
each non-sparse flow gets an equal fraction of link bandwidth, which
causes problems with VPNs and other tunnels, exhibits poor behavior
with less-aggressive congestion control algorithms, e.g. LEDBAT
[RFC6817], and could exhibit poor behavior with RTP Media Congestion
Avoidance Techniques (RMCAT) [I-D.ietf-rmcat-cc-requirements]. In
effect, the network element is making a decision as to what
constitutes a flow, and then forcing all such flows to take equal
bandwidth at every instant.

The Dual-queue approach defined in this document achieves the main
benefit of fq_codel: latency improvement without value judgements,
without the downsides.

The distinction between NQB flows and QB flows is similar to the
distinction made between "sparse flow queues" and "non-sparse flow
queues" in fq_codel. In fq_codel, a flow queue is considered sparse if
it is drained completely by each packet transmission, and remains
empty for at least one cycle of the round robin over the active flows
(this is approximately equivalent to saying that it utilizes less than
its fair share of capacity). While this definition is convenient to
implement in fq_codel, it isn't the only useful definition of sparse
flows."

So here I go...

Perhaps writing something more accurate here will help elsewhere.

    Flow queueing (FQ) approaches (such as fq_codel [RFC 8290]), on the
    other hand, achieve latency improvements by associating packets into
    "flow" queues 

And thus achieve better statistical multiplexing between packets of
all the flows in the system.

... aside ...

"Better statistical multiplexing" has got very little to do with the
sparse flow optimization that happens to be in fq_codel.

Let's take an example. We have 10 fat "QB" flows at 100Mbit. In a single
queue system that will impose 16ms of delay....

... whereas if you have only one fat QB flow, and another packet arrives
from another flow, that packet waits at most one packet's service time.

(Even that is not strictly true, because in a BQL'd system there are
typically a few packets underneath, living in the driver or hardware)

In a standard DRR FQ system, servicing these 10 flows would take 1.3ms,
and the total queue delay observed by a new 11th flow would be 1.43ms.

In a DRR++ system (like fq_pie, fq_codel, or cake) - one with the "sparse"
optimization - the queue delay for that 11th flow's packet would be ~0.
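
For anyone who hasn't stared at RFC 8290 recently, here is a toy model of
that DRR++ trick - hypothetical, heavily simplified C (fixed-size packets,
byte counting only, and queues that run dry just leave the schedule, which
the real code handles more carefully) - showing why a packet arriving to
an empty queue jumps the line:

/* drrpp_sketch.c - toy model of the DRR++ "sparse flow" optimization.
 * Not the real fq_codel code. The whole trick: a packet arriving to an
 * EMPTY queue puts that queue on new_flows, which is always serviced
 * ahead of old_flows, so a sparse flow's packet sees ~0 queueing delay.
 */
#include <stdio.h>

#define NQ  1024		/* number of flow queues */
#define PKT 1514		/* bytes; quantum == one packet here */

struct flow { int backlog, deficit, listed; };

static struct flow f[NQ];
static int newl[NQ], nnew;	/* new_flows: FIFO of queue indices */
static int oldl[NQ], nold;	/* old_flows */

static void enqueue(unsigned hash, int bytes)
{
	struct flow *q = &f[hash % NQ];

	q->backlog += bytes;
	if (!q->listed) {	/* queue was empty: sparse, jump the line */
		q->deficit = PKT;
		newl[nnew++] = hash % NQ;
		q->listed = 1;
	}
}

static int service(int *list, int *n)
{
	while (*n) {
		int i = list[0];
		struct flow *q = &f[i];

		if (q->deficit <= 0) {	/* quantum spent: to old_flows tail */
			q->deficit += PKT;
			for (int k = 1; k < *n; k++) list[k - 1] = list[k];
			(*n)--;
			oldl[nold++] = i;
			continue;
		}
		if (!q->backlog) {	/* ran dry: leave the schedule */
			for (int k = 1; k < *n; k++) list[k - 1] = list[k];
			(*n)--;
			q->listed = 0;
			continue;
		}
		q->backlog -= PKT;	/* "transmit" one packet */
		q->deficit -= PKT;
		return i;
	}
	return -1;
}

static int dequeue(void)	/* returns the flow queue served, -1 if idle */
{
	int i = service(newl, &nnew);
	return i >= 0 ? i : service(oldl, &nold);
}

int main(void)
{
	for (unsigned i = 0; i < 10; i++)	/* 10 fat flows, 20 pkts each */
		enqueue(i, 20 * PKT);
	while (nnew)				/* let them reach steady state */
		dequeue();
	enqueue(9999, PKT);			/* now one sparse packet lands */
	printf("served next: queue %d (the sparse flow hashed to queue %d)\n",
	       dequeue(), 9999 % NQ);
	return 0;
}

Run it: the sparse packet gets served ahead of all ten fat flows' standing
backlogs.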

At 100Mbit, in the case of 1500 byte packets in an FQ'd system, one
packet from each flow takes 130us to service, which leads to the
following table.

# 100 Mbit with one sparse flow entering 

| QB Flows | Single PIE Queue | FQ    | FQ sparse |
|----------+------------------+-------+-----------|
|        1 | 16ms             | 130us |        ~0 |
|       10 | 16ms             | 1.3ms |        ~0 |
|      100 | 16ms             | 13ms  |        ~0 |
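
The arithmetic behind that table, for anyone who wants to fiddle with the
numbers (a back-of-the-envelope sketch; 130us is the per-packet service
time from above, and 16ms stands in for a single PIE queue sitting at its
default-ish target):

#include <stdio.h>

int main(void)
{
	double serv = 130e-6;	/* 1500B + framing at 100Mbit, per above */

	for (int flows = 1; flows <= 100; flows *= 10)
		printf("%3d QB flows: PIE ~16ms, FQ ~%gms, FQ sparse ~0\n",
		       flows, flows * serv * 1e3);
	return 0;
}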

This chart is of course a gross simplification of what actually
happens. Ignoring the possibility of collisions, the effect of bursts
on the AQM, and other pesky details, what actually happens with 100
TCPs at a short RTT at 100Mbit is a worthwhile exercise in learning
about all the carnage of the fallback mechanisms, like RTO and the
minimum cwnd of 2, that TCP has to offer. The interactions with big
buffers, short buffers, all the different aqms, fq, drops, and ECN
handling are edifying.

(in other words, if you have 100 fat flows feeding into a 100Mbit short
RTT pipe, and want to add one more... you're gonna have other problems)

But I digress. If that sparse flow becomes queue-building, standard
FQ statistical multiplexing techniques apply, which helps.

Tackling the text again ... I think I ended up rewriting bits of this:
   
    Additionally, fq_codel has an optimization for "sparse flows",
    i.e. packets that arrive to an empty flow queue.

    Flow queueing alone does not attempt to differentiate between
    flows on the basis of value (importance or latency-sensitivity),
    and guarantees that all flows get an equal share of the
    channel capacity by interleaving them with one another.  As a
    result, FQ mechanisms are considered more appropriate for
    unmanaged environments and general Internet traffic. 

Flow queuing is usually combined with an additional classifier, to
create a 3-4 tier QoS system - e.g. interactive, best effort, and
background - especially on low rate networks.

At higher rates any form of additional QoS is both harder to do and
frequently unnecessary.
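
(In cake, for example, that whole 3-tier classifier is a one-liner,
assuming a reasonably recent cake and iproute2:

tc qdisc replace dev eth0 root cake bandwidth 100Mbit diffserv3

)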

    "Downsides to this approach include loss of low latency performance
    due to the possibility of hash collisions (where a sparse flow
    shares a queue with a bulk data flow)"

As per the above... 

This *vastly* overstates the concern about hash collisions: the odds of
a collision with another flow, with a direct-map hash, are a function
of the number of flows and the number of queues, but are considerably
less than 100%.

With set associativity this can be improved to basically nil for all
practical purposes.
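
What "set associativity" means here, sketched in C - hypothetical names,
a flow hashes to a SET of ways and takes its own or an idle one; the real
8-way version (plus flow aging) lives in sch_cake.c:

#define NQ   1024
#define WAYS 8

static unsigned tags[NQ];	/* which flow hash owns each queue */
static int      busy[NQ];	/* nonzero while the queue has backlog */

/* Two active flows only truly collide when all WAYS ways in their set
 * are occupied. (Sketch only; tag-0 ambiguity ignored.)
 */
static unsigned pick_queue(unsigned hash)
{
	unsigned set = ((hash % NQ) / WAYS) * WAYS;

	for (unsigned w = 0; w < WAYS; w++)
		if (tags[set + w] == hash)	/* already ours: reuse it */
			return set + w;
	for (unsigned w = 0; w < WAYS; w++)
		if (!busy[set + w]) {		/* idle way: claim it */
			tags[set + w] = hash;
			return set + w;
		}
	return set + (hash % WAYS);		/* all busy: real collision */
}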

Whereas the odds of a collision with a fat flow in a single queued
system are *100%*.

If this document could actually start with the odds of colliding with
a fat flow being 100% on a single queued system as a part of the problem
statement, we might get somewhere.

In addition, fq_codel in general maintains shorter queues than a
non-fq AQM system, so the cost of a collision with a fat flow is also
lower. After my ranting above, can anyone here (still with me?) hazard a
guess at what the cost of a collision is with 10 fat flows (in steady
state)? (kind of a test to see if I've conveyed intuition yet) And what
the actual odds of that collision are with 1024 queues?
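
My answers, for the record, as a compilable back-of-the-envelope (the 5ms
is just codel's default target, which is roughly what bounds the standing
queue a collided packet waits behind - an assumption, not a measurement):

#include <stdio.h>
#include <math.h>

int main(void)
{
	double queues = 1024, fat_flows = 10;

	/* odds a new flow's direct-map hash lands on one of the 10 busy
	 * queues (set associativity, above, drives this toward nil) */
	double p = 1.0 - pow(1.0 - 1.0 / queues, fat_flows);
	printf("collision odds: %.2f%%\n", p * 100.0);	/* ~0.97% */

	/* cost if you lose that lottery: codel holds the fat queue near
	 * its 5ms target, so ~5ms of added delay - not 16ms */
	printf("collision cost: ~5ms\n");
	return 0;
}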

    complexity in managing a large number  of queues in certain
    implementations, 

I've had my "complexity" rant elsewhere on these lists on several
other occasions. Whether you have 1 queue or thousands in an fq'd
system, the code is the same, as is the "complexity", for all intents
and purposes:

pickaqueue = myqueue[hash & 0x1];   // 2 queues
pickaqueue = myqueue[hash & 0x3FF]; // 1024 queues

So I typically use the word complexity in a context like
this:

https://en.wikipedia.org/wiki/Computational_complexity

And usually flavor it by throwing in a reference to 

https://en.wikipedia.org/wiki/Gustafson%27s_law

because folk in the pie department do really like the low resource
requirements of pie, which is indeed more amenable to a hardware
implementation, as is the brick-wall, RED-derived dctcp ramp. In a
software router - routing lookups, firewall rules, interrupt response
times, and other processing dominate the runtime. At one point we'd
counted a minimum of 42 function calls from ingress to egress - with a
ton of potential callbacks - in linux.

Perhaps long term a better way for those hardware systems that cannot
manage more than 2 queues would be to say:

    "inability to handle more than a few queues with on-chip resources
     in a few potential implementations".

But it's not "complexity" as I understand it. Neither pie nor fq_codel
is O(1), either, and if you take a look at the assembly language
generated by the ingress or egress code you might give in to despair
at how many extra ops are eaten by having to peer so deeply into
a packet just to get/set the dscp/ecn value.

    and some undesirable effects of the Deficit Round Robin (DRR)
    scheduling.  The DRR scheduler enforces that each flow gets an
    equal fraction of link bandwidth, which causes problems with VPNs
    and other tunnels, exhibits poor

"some" vpns, some tunnels. ipsec terminating on the bottleneck
router  works. Short queues make vpns work better, and they get 0
queueing  latency when they are under their fair share fq-wise.

      behavior with less-aggressive congestion control algorithms,
    e.g.   LEDBAT [RFC6817] 

Bringing in ledbat here is not relevant at all to marking or not -
unless you want to be marking LEDBAT NQB? Is this whole section
just copy-pasted from some other anti-fq_codel rant elsewhere?

LEDBAT fails to work as the designers would prefer on *any* fq or aqm
system, period. Pie, dualpi, sfq. In fact, if you have a tail-drop
buffer of less than 100ms, the delay-based component fails to work as
specified too [1]. (
https://perso.telecom-paristech.fr/drossi/paper/rossi14comnet-b.pdf )

And I wouldn't call ledbat's behavior under aqm or fq "poor behavior"
either - ledbat reverts to reno, and the difference between a system
that considers 100+ms of induced delay and jitter to be "good" and one
that aims for zero and gets it for most things, and 5-10ms for the rest,
is like night and day.

The "poor protocol designers" at the time... thought inflicting 100ms
delays was a good choice for a scavenging protocol.

What is a "less-aggressive" congestion control algorithm composed of?

LEDBAT++ has a 70ms delay target - still a broken idea - but at least
has a slower ramp....

    and could exhibit poor behavior with RTP Media
    Congestion Avoidance Techniques (RMCAT)
    [I-D.ietf-rmcat-cc-requirements].

This is completely unsubstantiated. Videoconferencing with fq_codel
works better than any alternative I've tried.

    "While this definition is convenient to implement in fq_codel, it
    isn't the only useful definition of sparse flows."

We should try to nail down terms better. I'm VERY unfond of y'all
attempting to conflate what we've called sparse and the NQB concept
in this document and elsewhere. There is a key difference between
"sparse" and NQB, in that "sparse" flows can be *anything using much
less than its fair share*, whereas NQB requires truthful marking.

As one typical example, take a short web transaction that completes in
under an IW10 (which is most of them!):

 syn/ack - sparse
 ssl neg - sparse (as it takes time to calculate)
 ssl http req - sparse (takes time to do the get)
 iw10 burst - not sparse and essentially not congestion controlled either (sigh)
 fin - (often part of that burst)

In the other direction of the flow, syn, ssl neg, req, acks, are all
"sparse".

Whereas - unless you want to start flipping dscp bits on this five
tuple for each of these phases - this distinction doesn't apply to the
NQB concept.

... Anyway in conclusion ...

Were it me, and had I control of the pen, and I believed in the NQB idea,
I'd still try to write a more accurate and favorable description of
fq_codel, and try to characterise the NQB idea as "poor man's
prioritization".


*    Security constraints

    "There is no incentive for an application to mismark its packets as
    NQB (or vice versa)."

This is not true in reality. The "must employ queue protection" is an
attempt to hand-wave away the very real security issues this proposal
raises. Am I the only one that deals with malicious applications every
day?

A specialized, short queue for this sort of stuff - particularly if
it's used for important packets (ra, hello, dns) - can be subject to low
rate attacks and disabled more easily, so anyone who cares about those
potential problems would sleep better in the BE queue.

...

While we're on vague terms:

Is there an IETF definition of what "queue protection" actually is?

...

At the moment - were L4S to become a "thing" - I'd actually lean
towards making the '2A' codepoint mean "L4S" or SCE, and retiring the
overload on ECT1. That retains the apparent desired overlay with EF,
and allows for ECT1 and ECT0 to be used in a backward compatible way.

# The how to implement section

Were this to become a codepoint, we'd merely toss it into the
interactive queue (in sqm and cake) by adding it to the lookup table,
just as we did recently for rfc8622:

static const u8 diffserv3[] = {
	0, 1, 0, 0, 2, 0, 0, 0,
           ^New LE codepoint
	1, 0, 0, 0, 0, 0, 0, 0,
	0, 0, 0, 0, 0, 0, 0, 0,
	0, 0, 0, 0, 0, 0, 0, 0,
	0, 0, 0, 0, 0, 0, 0, 0,
	0, 0, 0, 0, 2, 0, 2, 0,
	2, 0, 0, 0, 0, 0, 0, 0,
	2, 0, 0, 0, 0, 0, 0, 0,
};
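
Concretely, '2A' is decimal 42, so the hypothetical one-byte change
would land in the sixth row of that table (DSCPs 40-47):

	0, 0, 2, 0, 2, 0, 2, 0,
              ^NQB ('2A' = decimal 42) -> queue 2, interactive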

This form of implementation:

1) retains a full mapping of all dscp codes to a given set of QoS queues
2) is naturally tcp-friendly congestion controlled
3) does fate sharing with all other traffic in this queue, and retains
   the overall sparse and fq levels of interleave
4) is FQ'd fully, so it's naturally treated as NQB - and if it is lying
   about NQB it ends up being QB and treated properly
5) is one byte of code change in exchange for having had to read 12
   pages of text

https://github.com/dtaht/sch_cake/blob/master/sch_cake.c#L357

tin = q->tin_index[diffserv3[dscp]];

(note: in this case, queue 0 is best effort, 1 is background, and 2 is
interactive)

...

This cites an unpublished and not recently updated I-D about a 3g bearer
channel as a supporting document.

...

If you are going to start arbitrarily throwing flows marked as EF
or as NQB into the L queue, the dualpi code recently submitted to the
linux kernel needs to change.

[1] 

In this paper, we study the interaction between Active Queue
Management (AQM) and Low-priority Congestion Control (LPCC). We
consider a fairly large number of AQM techniques (i.e., RED, CHOKe,
SFQ, DRR and CoDel) and LPCC protocols (i.e., LEDBAT, NICE, and
TCP-LP), studying system performance (mainly expressed in terms of
TCP% breakdown, average queue size E[Q], and link utilization η) with
both simulative and experimental methodologies.  Summarizing our main
findings, we observe that AQM resets the relative level of priority
between best-effort TCP and LPCC. That is to say, the TCP share of the
bottleneck capacity drops dramatically, becoming close to the LPCC
share. Additionally, while reprioritization generally equalizes the
priority of LPCC and TCP, we also find that some AQM settings may
actually lead best-effort TCP to starvation – as these are however
non-systematic, we believe the starvation issue to be of less
practical relevance than reprioritization.

Reprioritization is a fairly general phenomenon, as it holds for any
combination of AQM technique, LPCC protocol and network scenario. This
is testified by our thorough sensitivity analysis, where we confirmed
the phenomenon for varying network parameters such as buffer size
Qmax, link capacity C, heterogeneous RTT delay, flows number and flow
duty cycle for over 3,000 simulations.  Reprioritization is a real
world phenomenon, easily replicable on testbeds and the real Internet,
which cannot be easily solved via AQM tuning. Hence, we advocate that
explicit collaboration is needed from LPCC (e.g., openly advertise low
priority) to assist AQM in taking decisions that maintain the desired
level of priority.