Re: [aqm] Is bufferbloat a real problem?

In message <2134947047.1078309.1424979858723.JavaMail.yahoo@mail.yahoo.com>
Daniel Havey writes:

>  
> I know that this question is a bit ridiculous in this community.  Of
> course bufferbloat is a real problem.  However, it would be nice to
> formally address the question and I think this community is the right
> place to do so.
>  
> Does anybody have a measurement study?  I have some stuff from the FCC
> Measuring Broadband in America studies, but, that doesn't address
> bufferbloat directly.
>  
> Let me repeat for clarity.  I don't need to be convinced.  I need
> evidence that I can use to convince my committee.
>  
> ...Daniel

Daniel,

You are convinced.  So am I, but we need to temper the message to avoid
going too far in the opposite direction - too little buffer.

If you are not interested in the details, just skip to the last
paragraph.

Bufferbloat should in principle only be a problem for interactive
realtime traffic.  Interactive means two way or multiway.  This is
SIP, Skype, audio and video conferencing, etc.  In practice it is also
bad for TCP flows with short RTT and max window set small.

One way realtime (such as streaming audio and video) should be
unaffected by all but huge bufferbloat.  That is, it should be.  For
example, youtube video is carried over TCP and is typically either way
ahead of the playback or choppy (inadequate or marginal aggregate
bandwidth and/or a big drop with TCP stall).  It would be nice if
marginal aggregate bandwidth was dealt with by switching to a lower
bandwidth encoding, but too often this is not the case.  That said,
some streaming formats do manage to get this wrong and end up delay
sensitive.

TCP needs buffering to function correctly.  Huge bufferbloat is bad
for TCP, particularly for small transfers that never get out of TCP
slow start and for short RTT flows.  For long RTT flows too little
buffer causes problems.

[ aside: For example, if TCP starts with 4 segments at 1K segment
size, it will take 4 RTT to hit 64KB window, the typical max window
without TCP large window (TCPLW) option.  During that time, 60KB will
be sent.  After that 64KB will be sent each RTT.  If geographic RTT
is 70 msec (approximate US continental RTT due to finite speed of
light in fiber and fiber distance), 60KB is sent in the first 280
msec and 64KB gets sent every 70 msec yielding about 7 Mb/s.  OTOH if
there is a server 2 msec RTT away (1 msec one way is 125mi = 200km),
then 60KB in first 8 msec and 256 Mb/s after that.  If there is 100
msec buffer at a bottleneck, then this low RTT TCP flow will be slowed
by a factor of 50.  OTOH, if bottlenecks have a lot less than 1 RTT of
buffer, then the long RTT TCP flows will get even further slowed. ]

One of the effects of some buffer, but not excessive, is that short
RTT flows, which are limited by the no-TCPLW max window, get slowed
down while longer RTT flows are less affected.  This makes transfer
rates among TCP flows more fair.  The same holds true if TCPLW gets
turned on in commodity gadgets and the commonly deployed max window
increases, but the numbers change.
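
To put rough numbers on the fairness point: for window-limited flows
the rate is about window / (RTT + queueing delay), so a shared
queueing delay shrinks the ratio between short and long RTT flows.
This is a sketch that assumes both flows share one bottleneck queue
and have the same max window.  The 2 and 70 msec RTTs come from the
aside above; the queue delays are arbitrary illustration values.

```python
def rate_ratio(short_rtt_msec, long_rtt_msec, queue_delay_msec):
    """Ratio of short-RTT to long-RTT flow rate when both are limited by
    the same max window and both see the same queueing delay."""
    return (long_rtt_msec + queue_delay_msec) / (short_rtt_msec + queue_delay_msec)


for queue_delay in (0, 5, 20, 100):
    ratio = rate_ratio(2, 70, queue_delay)
    print(f"{queue_delay:3d} msec of queue: short RTT flow is {ratio:.1f}x faster")
```

With no queue the short RTT flow is 35 times faster.  With 20 msec of
queue the gap is down to about 4 times.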

If the buffer grows a little and the deployed window sizes become the
limiting factor, then this is very light congestion with delay but
absolutely zero loss due to queue drops (not considering AQM for the
moment).

Some uses of TCP increase the window to work better over long RTT.  It
takes a bit longer to hit the max window but the rate once it has been
hit is greater.  Setting TCP window large on short RTT flows is
counterproductive since one or a small number of flows can cause a
bottleneck on a slow provider link (ie: 10-100 Mb/s range typical of
home use).  On a LAN RTT can be well under 1 msec on Ethernet and
highly variable on WiFi.  On WiFi larger window can contribute to some
real trouble.  So best that the default window be changed.  

[ Note that the work on automatic sizing of tcp_sndbuf and tcp_recvbuf
may create a tendency to saturate links as the window can go up to 2MB
with default parameters.  Has this hit consumer devices yet?  This
could be bad if this rolls out before widespread use of AQM. ]

When a small amount of loss occurs, such as one packet or much less
than the current window size, TCP cuts the current window size in half
and retransmits the lost packet from the window in flight (ignoring
selective acknowledgment extension aka SACK for the moment).

If the buffer is way too small, then a large amount of premature drop
occurs when the buffer limit is hit.  Lots of TCP flows slow down.
The long RTT flows slow down the most.  Some retransmission occurs
(which doesn't help congestion).  If there is a long period of drop
relative to a short RTT, then an entire window can be dropped and this
is terrible for TCP (slow start is initiated after delay based on an
estimate of RTT and RTT stdev, or 3 sec if RTT estimate is stale -
this is a TCP stall).  So with too little buffer some TCP flows get
hammered and stall.  TCP flows with long RTT tend to stall less but
are more sensitive to the frequency of drop events and can get
extremely slow due to successively cutting window in half and then
growing the window linearly rather than exponentially.
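
A rough sketch of why drop frequency matters so much for long RTT
flows.  It assumes textbook behavior only: one halving on loss, then
growth of one segment per RTT, no SACK and no further loss.  It is not
a model of any particular TCP stack, and the function name and the
64-segment window are made up for illustration.

```python
def recovery_time_msec(window_segments, rtt_msec):
    """Time to grow back to the old window after a single halving,
    adding one segment per RTT."""
    cwnd = window_segments // 2          # multiplicative decrease on loss
    rtts = 0
    while cwnd < window_segments:
        cwnd += 1                        # additive increase: one segment per RTT
        rtts += 1
    return rtts * rtt_msec


# Same 64-segment window, different RTTs.
print(recovery_time_msec(64, 2))     # prints: 64
print(recovery_time_msec(64, 70))    # prints: 2240
```

The 70 msec flow needs over 2 seconds without loss to get back to
where it was.  If drop events come more often than that, the window
keeps shrinking.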

With tiny buffers really bad things tend to happen.  The rate of
retransmission can rise and goodput (the amount of non-retransmit
traffic per unit time) can drop substantially.  Long RTT flows can become
hopelessly slow.  Stalls become more common.  In the worst case (which
has been observed in an ISP network during a tiny buffer experiment
about a decade ago, details in private email) TCP synchronization can
occur, and utilization and goodput drop dramatically.

A moderate amount of buffer is good for all TCP.  A large buffer is
good for long RTT TCP flows, particularly those that have increased
max window.  As mentioned before, any but a very small buffer is bad
for interactive real time applications.

Enter AQM.  A large buffer can be used but with a lower target delay
and some form of AQM to introduce a low rate of isolated drops as
needed to slow the senders.  Avoiding queue tail drop events where a
lot of drops occur over an interval lowers the amount of
retransmission and avoids stalls.  Long RTT flows tend to get
penalized the most.
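
As a toy illustration of "a low rate of isolated drops", here is a
sketch of a drop decision driven by queueing delay.  This is not
CoDel, PIE, RED or any other real AQM.  The linear ramp and all of the
parameter values (5 msec target, 50 msec ceiling, 2% max drop
probability) are assumptions picked only to show the idea.

```python
import random


def drop_probability(queue_delay_msec, target_msec=5.0, ceiling_msec=50.0,
                     max_p=0.02):
    """Toy curve: no drops at or below the target delay, then a linear
    ramp up to max_p at ceiling_msec.  All parameters are made up."""
    if queue_delay_msec <= target_msec:
        return 0.0
    if queue_delay_msec >= ceiling_msec:
        return max_p
    return max_p * (queue_delay_msec - target_msec) / (ceiling_msec - target_msec)


def should_drop(queue_delay_msec, rng=random):
    """Decide per packet, so drops are spread out rather than back to back."""
    return rng.random() < drop_probability(queue_delay_msec)


rng = random.Random(1)               # seeded so runs are repeatable
drops = sum(should_drop(27.5, rng) for _ in range(100_000))
print(drops / 100_000)               # close to 0.01, about 1 packet in 100
```

Compare this with tail drop, where nothing is dropped until the buffer
is full and then everything arriving is dropped until it drains.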

Fairness is not great with a single queue and AQM but this is much
better than a single queue with either small or large buffer and tail
drop.  Fairness is greatly improved with some form of FQ or SFQ.

Ideally with FQ each flow would get its own queue.  In practice this
is not the case but the situation is greatly improved.  A real time
flow, which is inherently rate limited, would see minimal delay and no
loss.  A short RTT flow would see a moderate increase in delay and a
low level of loss (ie: typically much less than 1%) enough to slow it
down enough to avoid congestion.  A long RTT flow would see a moderate
increase in delay and no loss if still running slower than the small
RTT flows.  This does wonders for fairness and provides the best
possible service for each service type.
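
A minimal sketch of the idea, assuming flows are hashed into a fixed
number of queues that are served round robin, one packet at a time.
The class name and queue count are made up.  Real SFQ implementations
also perturb the hash and account for packet sizes, which this leaves
out.

```python
from collections import deque


class ToySFQ:
    """Hash flows into a fixed number of queues and serve them round robin."""

    def __init__(self, num_queues=8):
        self.queues = [deque() for _ in range(num_queues)]
        self.next_queue = 0

    def enqueue(self, flow_id, packet):
        # Flows whose hashes collide modulo num_queues share a queue.
        self.queues[hash(flow_id) % len(self.queues)].append(packet)

    def dequeue(self):
        # Visit each queue at most once, starting after the last one served.
        for _ in range(len(self.queues)):
            queue = self.queues[self.next_queue]
            self.next_queue = (self.next_queue + 1) % len(self.queues)
            if queue:
                return queue.popleft()
        return None


sfq = ToySFQ()
for i in range(5):
    sfq.enqueue(1, f"bulk-{i}")      # a greedy flow queues five packets first
sfq.enqueue(2, "rt-0")               # then one real time packet arrives
print([sfq.dequeue() for _ in range(3)])   # ['bulk-0', 'rt-0', 'bulk-1']
```

The real time packet goes out second instead of sixth, which is the
"minimal delay" case described above.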

In practice, some FQ or SFQ queues have a mix of real time, low RTT
TCP, and high RTT TCP.  If any such queue is taking a smaller share
than other queues, delay is low and loss is low or zero.  If such a
queue is taking more than its share, then the situation is similar to
the single queue case.  Fewer flows end up in such a queue.  Cascaded
queues have been proposed and in some cases (no longer existing) have
been implemented.  In a cascaded SFQ scheme, the queues taking more
than their share are further subdivided.  Repeat the subdivision a few
times and you can end up with the large bandwidth contributors in
their own queue and getting a fair share of capacity.

So excuse the length of this but solving bufferbloat is *not* a silver
bullet.  Not understanding that point and just making buffers really
small could result in an even worse situation than we have now.

Curtis

ps - Some aspects of this may not reflect WG direction.  IMHO the
downsides of just making buffers smaller and/or setting low delay
targets may not be getting enough (or any) attention in the WG.  Maybe
discussion wouldn't hurt.