[aqm] Review of draft-ietf-aqm-eval-guidelines

Jim Gettys <jg@freedesktop.org> Mon, 13 April 2015 14:38 UTC


Sorry this review is later than I would have liked: I came down with a cold
after the IETF while this review was only half done. I hope this helps.

                                           - Jim


Required behavior, and running code
============================

My immediate, overall reaction is that this document prescribes a large set
of tests to be performed, using RFC 2119 MUST/SHOULD/MAY terminology.

However, I doubt that any AQM algorithm has been evaluated against all of
those tests, and without "running code" for such tests it is unrealistic to
expect that anyone will actually run them. Instead, this reads as an
"exercise for the AQM implementer" of large magnitude, possibly so large
that no one will actually perform even a fraction of them.

Will that mean that future AQM algorithms get stalled in working groups
because the tests are not done as prescribed? And how will we intercompare
tests between algorithms, with the test code not available?

And since there is little consequence, as far as I can see, to not
performing the tests, it seems unlikely that anyone will actually perform
them. If one cataloged all the tests requested by this document, the number
would be very large. I wonder whether a small number of more aggressive
tests might not be easier.

It seems to me that if, as a community, we really believe in doing such
evaluations, we need a common set of tools for doing them, so that we can
actually easily intercompare results.

We really need *running code* for tests to evaluate AQMs/packet scheduling
algorithms. I fear that this will be an academic exercise without such a
shared code base for testing.

My reaction to this draft is that without such a set of tests, this
document is an academic exercise that will have little impact. I'd much
prefer to see a (smaller) concrete set of tests devised (probably an
elaboration of the data we get from Toke's netperf-wrapper tool).


Packet scheduling
==============

This draft is essentially bereft of any discussion of packet
scheduling/flow isolation, even though, in many ways, it makes a bigger
difference than the AQM (the mark/drop algorithm only, in this document's
definition).

Section 12 purports to discuss packet scheduling, but it is 5 sentences (<
1/2 page) total out of a document of >20 pages (of actual discussion, not
boilerplate)!

Somehow I suspect it is worth a bit more discussion ;-).

Yet packet scheduling and AQM do interact at some level; I don't think
scheduling can/should be handled in a separate draft.

For example, one of the wins of flow isolation is that TCP's ack clocking
starts behaving normally again, and ack compression is much less
problematic. One of the surprises of fq_codel was finding that we were able
to use the full bandwidth of the link in both directions simultaneously
(increasing net aggregate bandwidth on a link).

A second example: once you have flow isolation (e.g. fq_codel), it is still
important to understand whether the AQM (mark/drop algorithm) is effective,
as bufferbloat can still kill single-application performance if the
mark/drop algorithm is ineffective. Knowing what will happen to that single
flow's behavior still matters.

A third example: ironically, TCP RTT unfairness gets worse when you control
the RTT with an effective AQM. Flow queuing helps with this issue.

Time variance
===========

The draft is silent on the topic of the time-variance behavior of AQMs.

Those of us facing the problems at the edge of the network are amazingly
aware of how rapidly available bandwidth may change, both because radio is
inherently a shared medium, and because of radio propagation.

Other media are also highly variable, if not as commonly as WiFi/cellular;
even switched Ethernet is very variable these days (VLANs have hidden the
actual bandwidth of a cable from view).
I talked to one VoIP provider who tracked their problems down to what VLANs
were doing behind their back.

Cable and other systems are also shared/variable media (e.g. PowerBoost on
Comcast, or even day/night diurnal variation).

*The* most exciting day for me while CoDel was being developed was when
Kathy Nichols shared with me some data from a simulation showing CoDel's
behavior when faced with a dramatic (instantaneous, in this case) drop in
bandwidth (10x), and watching it recover to a new operating point.

It seems to me that some sort of test(s) for what happens when bandwidth
changes (in particular, high to low) are in order to evaluate algorithms.

Unfortunately, the last I knew some simulators were not very good at
simulating this sort of behavior (IIRC, Kathy was using ns2, and she had to
kludge something to get this behavior).
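To make this concrete, here is a rough sketch of the kind of bandwidth-step
test I have in mind, run on real hardware rather than a simulator.
Everything in it is illustrative, not prescriptive: the interface name, the
rates, the hosts, and the use of netem/ping/netperf are placeholders, and it
assumes a separate Linux emulation box provides the bottleneck while the AQM
under test sits elsewhere on the path:

    # Sketch of a bandwidth-step test: run a latency probe and a bulk
    # transfer across a bottleneck whose rate is cut 10x mid-run, then
    # restored. "eth0", the rates, and the hosts are placeholders.
    import subprocess, time

    DEV = "eth0"                     # bottleneck egress on the emulation box
    HIGH, LOW = "100mbit", "10mbit"

    def set_rate(rate):
        # netem's built-in rate limiter; the AQM under test is NOT on this
        # qdisc, it lives on the router being evaluated.
        subprocess.run(["tc", "qdisc", "replace", "dev", DEV, "root",
                        "netem", "delay", "20ms", "rate", rate], check=True)

    set_rate(HIGH)
    probe = subprocess.Popen(["ping", "-i", "0.2", "server.example"])
    load = subprocess.Popen(["netperf", "-H", "server.example", "-l", "90"])

    time.sleep(30); set_rate(LOW)    # 10x drop: does the AQM find a new operating point?
    time.sleep(30); set_rate(HIGH)   # restore: does it recover, or oscillate?
    load.wait(); probe.terminate()

The interesting output is the ping time series across the two transitions,
not the throughput numbers.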

Most important (particularly once you have flow queuing) may be whether
the algorithm becomes unstable/oscillates in the face of such changes.
This is still an area I worry about for WiFi, where the RTT of many
paths is roughly comparable to the kind of variance in throughput seen.
Time will tell, I guess; but right now, we have no tools to explore this in
any sort of testing.


Specific comments
===============

Section 1. Introduction
------------------------------

Para 1.
It is a bit strange there is no reference to fq_codel at this date,
particularly since everyone involved in CoDel much prefers fq_codel and we
cannot anticipate anyone using CoDel in preference to fq_codel; it's in
Linux primarily to enable easy testing, rather than expecting anyone to
*actually* use it.

Flow queuing is a huge win, and fq_codel is shipping/deployed on millions
of devices at this date.  There is an ID in the working group defining
fq_codel.... Having a concrete thing to talk about that does packet
scheduling will help you when you get to redoing Section 12.

Para 3:
The first sentence is misleading and untrue, as far as I can tell. SLAs
have had little to do with it, at least at the consumer edge, where we hurt
the most today.

Bufferbloat has happened because 1) memory became really cheap, 2) people
worship at the altar of the religion that dropping a packet must be bad,
and 3) passing a single-stream TCP performance test has meant that the
minimum size of a drop-tail queue has been at least the BDP, computed from:

o the maximum bandwidth the device might ever run at
o 100ms, which is roughly a continental delay.

The net result is *minimum* buffering that is often 5-10 or more times sane
buffering.  Then people, thinking that more must be better, used any
leftover memory. Note that this is already much too much for many
applications (music, VoIP; 100ms is *not* an acceptable amount of latency
for a connection to impose)....

Compounding this has been accounting in packets, rather than bytes or
time.  So people built operating systems to saturate a high speed link with
tiny UDP packets for RPC, without realizing that the buffering was then
another factor of 20 higher for TCP...
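To put numbers on that (the numbers are illustrative, not from the draft): a
gigabit device sized by the rule above, and then accounted in packets:

    # Back-of-the-envelope for the buffer sizing rule described above.
    bandwidth_bps = 1_000_000_000     # "maximum bandwidth the device might ever run at"
    delay_s = 0.100                   # ~continental delay

    bdp_bytes = bandwidth_bps / 8 * delay_s
    print(f"BDP: {bdp_bytes / 1e6:.1f} MB of buffering")        # 12.5 MB

    # Accounting in packets makes it worse: a packet limit sized to hold
    # 100ms of tiny RPC/UDP packets holds far more when filled with
    # full-size TCP segments.
    small, mtu = 64, 1500
    pkt_limit = bdp_bytes / small
    print(f"Same packet limit filled with {mtu}B segments: "
          f"{pkt_limit * mtu / 1e6:.0f} MB ({mtu / small:.0f}x more)")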

Then these numbers (for Ethernet) were copied into the WiFi drivers without
thought. Broadband head ends and modems aren't much better.

Since trying to explain the history of how things went so badly wrong is
not helpful here, I'd just strike the first sentence starting "In order to
meet" and note that buffering has often been 10-100 or more times larger
than needed.

The Interplanetary record at the moment is held by GoGo Inflight, with 760
seconds of buffering measured (Mars is closer for part of its orbit).


Section 2. End to end metrics.
----------------------------------------

There is no discussion of any "Flow startup time".

For many/most "ants", the key time for good performance is the time to
*start* or *restart an idle* flow.

These are not only operations such as DNS lookups, TCP opens, SSL
handshakes, DHCP handshakes, etc.; for web pages, the first packet or so
contains the size information required to lay out a page, which is key to
interactive "speed". None of this should need any formal traffic
classification.

The biggest single win of flow queuing is minimizing this lost time;
fq_codel goes a step further with its "sparse flow" optimization, where
the first packet(s) of a new or idle flow are scheduled before flows that
have built a queue. So DNS lookups, TCP opens, the first data packet in an
image, SSL handshakes, all fly through the router.
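For the benefit of the draft authors, here is a deliberately minimal sketch
of that idea; it omits CoDel and the DRR deficit/quantum accounting of the
real algorithm, so see the fq_codel ID for the details:

    # Minimal sketch of the "sparse flow" idea: flows with no backlog are
    # serviced from a "new" list ahead of backlogged "old" flows.
    from collections import deque

    class Flow:
        def __init__(self):
            self.q = deque()      # per-flow packet queue
            self.listed = False   # already on new_flows or old_flows?

    new_flows, old_flows = deque(), deque()

    def enqueue(flow, pkt):
        flow.q.append(pkt)
        if not flow.listed:
            flow.listed = True
            new_flows.append(flow)        # first packet of a new/idle flow

    def dequeue():
        lst = new_flows or old_flows
        if not lst:
            return None
        flow = lst.popleft()
        pkt = flow.q.popleft()
        if flow.q:
            old_flows.append(flow)        # built a queue: no longer sparse
        else:
            flow.listed = False           # idle again; re-enters as sparse later
        return pkt                        # DNS, SYNs, handshakes go out first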

For web surfing, the flow completion time is relatively unimportant; what
the user cares about is how long until they can read the screen without it
reflowing.

Section 3.2 Buffer Size
------------------------------

The first sentence of this section is pretty useless.  What is the
bandwidth?  What is the delay?  Later tests talk about testing at different
RTTs...  And the whole point of AQMs is to get away from static buffer
sizes....

Instead, when static buffer sizes are set, there needs to be justification
and explanation given, so that other implementers/testers can understand
how/when/if it should be changed for deployment or other testing.

These default buffer sizes are mentioned in the fq_codel draft because home
routers have limited RAM, so having some upper limit on buffer sizes is a
practical implementation limit (and one we'd be happy to do away with
entirely; we just haven't gotten around to it as yet, and the draft
describes the current implementation, rather than mythology).

The buffer size constants on fq_codel are really on the "to fix someday"
list, rather than anything we otherwise care about in the algorithm.  With
more code complexity, we could remove both the # flows and buffer size
limits in fq_codel and make either/both fully dynamic.  So far, it hasn't
seemed worth the effort to make these automatic, particularly before we
understand more about the dynamic behavior in wireless networks, ***and
have better feedback/integration with the device driver layer so that we
might have insight into the information required to make these dynamic***.
Right now, too much information is hidden from the queue management layer
(at least in Linux).

Section 4.3 Unresponsive transport sender
---------------------------------------------------------

You might note that flow queuing systems help protect against (some) such
unresponsive senders.

Why are you specifying a New Reno flow?  One could argue, possibly
correctly, that Cubic is now the most common CC algorithm.  Some rationale
is in order.  And Cubic is more likely to drive badly bloated connections
to the point of insanity.
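As a trivial sanity check of what senders actually do today (Linux-specific,
purely illustrative):

    # What congestion control is this Linux sender actually using?
    # Most current distributions will print "cubic", not "reno".
    for path in ("/proc/sys/net/ipv4/tcp_congestion_control",
                 "/proc/sys/net/ipv4/tcp_available_congestion_control"):
        with open(path) as f:
            print(path, "->", f.read().strip())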

Section 5.1 Motivation
-----------------------------

Some mention of flow queuing is in order...

5.2 Required Tests
--------------------------

It seems to me that there is a third test: a short RTT (maybe 20 ms), which
is comparable to what you might find when a CDN is nearby.
Section 6. Burst absorption
-------------------------------------

There is a converse situation to burst absorption: momentary bandwidth
drops. These can occur in VLAN situations, but also in wireless systems,
where temporary drops in available bandwidth happen easily.

Section 6.2: Required tests
-------------------------------------

Unfortunately, HTTP/1.1 Web traffic in the current world is much worse than
the proposed tests: typically it will be N essentially simultaneous
connections, each with an IW10; and most objects are small and fit within
that IW10 window.  Bursts of hundreds of packets are commonplace.
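Rough arithmetic, with illustrative numbers, for what a single page load can
put in flight at once:

    # Aggregate burst from N essentially simultaneous IW10 connections.
    connections = 10      # sharded HTTP/1.1 connections for one page
    iw = 10               # IW10: initial window, in segments
    mss = 1460            # bytes per segment

    burst_pkts = connections * iw
    print(f"{burst_pkts} packets, ~{burst_pkts * mss / 1e3:.0f} KB "
          f"arriving essentially back to back")      # 100 packets, ~146 KB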

I cover some of this from Google data in my blog:

https://gettys.wordpress.com/2013/07/10/low-latency-requires-smart-queuing-traditional-aqm-is-not-enough/

So the web test is not specified precisely: is it 100 packets back to back,
or 10 IW10 connections?

Also, flow completion, while interesting, is *not* the only
interesting/useful metric; getting the first packet(s) of each connection
allows page layout to occur without reflows and is equally
interesting/useful.

If you look in the above blog entry, you'll see a bar chart of visiting a
web page N times (the web page completion time bar chart); one of the
wonderful fq_codel results was the lack of variance in the completion time,
since the wrong packets weren't getting badly delayed/dropped.

7.2.1 Definition of the congestion level
---------------------------------------------------

I am nervous about defining congestion in terms of loss rate; when I
measured loss rates in today's bloated edge network, the loss rates were
quite high (several percent) even with single flows (probably because Cubic
was being driven insane by the lack of losses on links with large amounts
of buffering).  So encouraging readers to believe that loss rate always
correlates with congestion level seems fraught with problems to me, as most
will jump to the incorrect conclusion that measuring one gives you the
other.

7.3 Parameter Sensitivity
---------------------------------

How would one show stability?  Is there a reference you can give to an
example of such stability analysis?

8.1 Traffic Mix
-------------------

The test is ill-defined...  What is "6 webs" for a test?  Why 5 TCP flows?
Are the flows bi-directional, or unidirectional?

8.2 Bi-directional traffic
------------------------------

It isn't clear that bi-directional traffic is actually required; the second
paragraph seems to contradict the third paragraph.

10. Operator control knobs...
---------------------------------------

Isn't this really a case of wanting to understand the degenerate behavior
of an algorithm, so that we understand the "worst case" behavior (and try
to see whether an algorithm "first, does no harm"), even in cases where said
algorithm should have been tuned (or autotuned itself)?

For example, take the current default # of hash buckets in fq_codel (which
someday we may make fully auto-tuning): applied in a situation with so many
flows that there are frequent hash collisions, the algorithm loses its
sparse flow optimization, but it is still going to apply CoDel to any
elephant flows and behave roughly as DRR would; it should not "misbehave"
and cause bad behavior, just not the more optimal behavior it would have if
the # of hash buckets had been set to something reasonable.

This sort of analysis is worth including in an evaluation....
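As a back-of-the-envelope example of the sort of analysis I mean (1024 is
the current Linux default number of buckets; the flow counts are
illustrative):

    # Chance that a new sparse flow (a DNS lookup, a SYN) hashes into a
    # bucket already occupied by one of E established flows, losing the
    # sparse-flow boost but still getting DRR + CoDel service.
    B = 1024
    for E in (10, 100, 1000):
        p = 1 - (1 - 1 / B) ** E
        print(f"{E:5d} competing flows: {p:.1%} chance of a collision")
    # ~1%, ~9%, ~62%: graceful degradation toward plain DRR, not misbehavior.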

12. Interaction with scheduling
----------------------------------------

Somehow this section is currently pretty useless, particularly since good
scheduling can be of greater benefit than the details of the mark/drop
algorithm.

I think we can/should do better here.

13.3 Comparing AQM schemes
------------------------------------------

I think there is a different issue left unmentioned at the moment, which is
the "applicability" of an algorithm to a given situation.

For example, setting the CoDel target in fq_codel below the media access
latency will hurt its behavior.  So twisting that knob is in fact necessary
under some circumstances, and a document should discuss when that may be
necessary.

Similarly, the target of 5ms is pretty arbitrary, and CoDel is in general
not very sensitive to what the target or interval are (except in this case
of setting an unachievable target). The values were chosen after a lot of
simulation by Kathy Nichols. Helping people understand these cases is
important.
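A sketch of the kind of guidance I mean; the "raise the target above the
media access latency" rule of thumb and the specific numbers here are mine,
not from the draft or the CoDel/fq_codel IDs:

    # If per-packet media access latency (e.g. on WiFi) exceeds CoDel's 5ms
    # default target, the target is unachievable and should be raised;
    # keep target a small fraction of interval.
    def suggest_codel_params(media_access_ms, rtt_ms=100.0):
        target = max(5.0, 1.5 * media_access_ms)      # rule of thumb
        interval = max(rtt_ms, 20.0 * target)         # target ~5% of interval
        return target, interval

    target, interval = suggest_codel_params(media_access_ms=12.0)
    print(f"tc qdisc replace dev wlan0 root fq_codel "
          f"target {target:.0f}ms interval {interval:.0f}ms")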

Section 13.4 Packet Sizes...
--------------------------------------

I think there is a simple rule that needs to be added to this section: an
AQM/scheduling system "should guarantee progress", or particular flows can
fail entirely under some circumstances. This "liveness" constraint is a
property that all algorithms should provide.