Re: [tcpm] TCP Tuning for HTTP - update

Willy Tarreau <w@1wt.eu> Wed, 17 August 2016 21:18 UTC

Date: Wed, 17 Aug 2016 23:13:17 +0200
From: Willy Tarreau <w@1wt.eu>
To: Joe Touch <touch@isi.edu>
Cc: Mark Nottingham <mnot@mnot.net>, tcpm@ietf.org, HTTP Working Group <ietf-http-wg@w3.org>, Patrick McManus <pmcmanus@mozilla.com>, Daniel Stenberg <daniel@haxx.se>

On Wed, Aug 17, 2016 at 11:31:33AM -0700, Joe Touch wrote:
> > It can be cited in new RFCs
> > to justify certain choices. 
> Hmm. Like the refs I gave could be cited in this doc to justify *its*
> choices? :-)

I think it would be nice for them to be cited, but to be clear on one
point, I had never heard of your papers before you advertised them
here in this thread, and yet I've been dealing with timewait issues
for 15 years like many people facing moderate to large web sites
nowadays. It just happens that you probably identified these issues
very early at low connection rates, but today anyone dealing with
more than 500-1000 connections per second on a server or worse on
a gateway quickly discovers that he has to make a choice.
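
As a rough illustration of where such a range can come from, here is a
minimal sketch of the arithmetic. The port ranges and the 60 second
TIME_WAIT duration are assumptions taken from common Linux defaults, they
are not stated in this thread, and max_rate() is a hypothetical helper.
It only covers one source address talking to one destination address and
port, where each closed connection keeps its source port busy for the
whole TIME_WAIT duration.

```python
def max_rate(first_port, last_port, time_wait_seconds):
    """Sustainable new connections per second towards one destination
    when every source port stays unusable for time_wait_seconds."""
    ports = last_port - first_port + 1
    return ports / time_wait_seconds

# Assumed values: Linux default ephemeral range and 60 s TIME_WAIT.
default_range = max_rate(32768, 60999, 60)
# Assumed values: every non-privileged port made available.
widened_range = max_rate(1024, 65535, 60)

print(f"default range: {default_range:.0f} conn/s")
print(f"widened range: {widened_range:.0f} conn/s")
```

With these assumed values the ceiling is a few hundred to about a
thousand connections per second, after which something has to give:
shorter timewaits, more addresses, or connection reuse.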

> >> Yes, and discussing those issues would be useful - but not in this
> >> document either.
> > Why ? Lots of admins don't understand why the time_wait timeout remains
> > at 240 seconds on Solaris with people saying "if you want to be conservative
> > don't touch it but if you want to be modern simply shrink it to 30 seconds
> > or so". People need to understand why the advice has changed over 3 decades.
> 
> The advice hasn't really changed - the advice was given in the 99 ref,
> which includes some cases where it can still be appropriate to decrease
> that timer.

Most people see it the other way around : they see no valid case to *increase*
it beyond a few seconds, because for them the default value should be extremely
low (eg: the firewall vendor who, several years ago, tried to insist on one
second). Yes that's really sad but that's reality. And you can tell them to
read RFC 6191, they won't care.

> >   - TCP timestamps: what they provide, what are the risks (some people in
> >     banking environments refuse to enable them so that they cannot be used
> >     as an oracle to help in timing attacks).
> That's already covered in the security considerations of RFC 7323. How
> is HTTP different, if at all, from any other app?

HTTP is special in that it is fairly common to have to deal with tens of
thousands of connections per second between one client and one server when
you are on the server side, because you place a number of gateways (also
called reverse-proxies) which combine all of the possible issues you can
think of in a single place. Timestamps are one way to improve fast connection
recycling between the client and the server without having to cheat on
timewaits. But since they consume 12 bytes per packet, it's often advised
to disable them in benchmarks to get the highest throughput...
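
To put a number on that cost, here is a small sketch. The 1500 byte MTU
and the option-less 20 byte IPv4 and TCP headers are assumptions for a
typical Ethernet path, and payload_with_timestamps() is a hypothetical
helper. Only the 12 bytes per packet come from the text above.

```python
def payload_with_timestamps(mtu=1500, ip_header=20, tcp_header=20,
                            ts_option=12):
    """Return (payload without timestamps, payload with timestamps,
    ratio between the two) for full-sized packets."""
    without_ts = mtu - ip_header - tcp_header
    with_ts = without_ts - ts_option
    return without_ts, with_ts, with_ts / without_ts

without_ts, with_ts, ratio = payload_with_timestamps()
print(f"payload per packet: {without_ts} -> {with_ts} bytes")
print(f"goodput lost on full-sized packets: {(1 - ratio) * 100:.2f}%")
```

Under these assumptions the loss stays below one percent, which matters
for a benchmark record but rarely for a production site.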

> >   - window scaling : how much is needed.
> Same issue here, same ref - how is HTTP different?

Same as above.

> >   - socket sizing : contrary to what you write, there's a lot of tuning
> >     on the web where people set the default buffer sizes to 16MB without
> >     understanding the impacts when dealing with many sockets
> There's a whole book that encompasses that and some related issues:
> http://www.onlamp.com/pub/a/onlamp/2005/11/17/tcp_tuning.html

Looks fine, could be added to the list of references.

> Some advice is also given in Sec 6.3.3 of this:
> J. Sterbenz, J. Touch, /High Speed Networking: A Systematic Approach to
> High-Bandwidth Low-Latency Communication/
> <http://www.wiley.com/WileyCDA/WileyTitle/productCd-0471330361.html>,
> Wiley, May 2001.
> 
> >   - SACK : why it's better. DSACK what it adds on top of SACK.
> That's in the SACK docs, which aren't cited. Again, how is HTTP
> different from any app?
> >   - ECN : is it needed ? does it really work ? where does it cause issues ?
> That's in the ECN docs, which aren't cited. Again, how is HTTP different
> from any app?
> >   - SYN cookies : benefits, risks
> That's in the RFC 4987, which at least IS cited. Again, how is HTTP
> different from any app?

So probably you're starting to see the benefit of having a single doc
to concentrate all this. You provided at least 3 different articles
to read and 2 or 3 different RFCs in addition to the original ones,
of course. A hosting provider whose web sites are down due to a lack
of tuning doesn't start reading many very long articles, even less
the most scientific ones. They need to find quick answers that they
can apply immediately (a matter of minutes). So they launch google, they
type "web site dead, time-wait overflow" and they get plenty of
responses on stackoverflow and serverfault, many from people having
done the same in the past and repeating the same mistakes over and over.

A document validated by several people and giving links for further
reading can help improve this situation.

> >   - TCP reuse/recycling : benefits, risks
> Not sure what you mean here. There are a lot of docs on the issues with
> persistent-HTTP vs per-connection HTTP.

Often on a gateway you cannot completely choose. You have a mix of both.

> >   - dealing with buffer bloat : tradeoffs between NIC-based acceleration
> >     and pacing
> Bufferbloat typically involves large *uplink* transfers and how they
> interact with other uplink connections. Neither TCP nor HTTP is involved
> in this really.

Maybe in the end Daniel is not the only one not to read all articles
published on the web :-)

     http://www.cringely.com/2012/03/25/linux-3-3-finally-a-little-good-news-for-bufferbloat/
     https://www.ietf.org/proceedings/86/slides/slides-86-iccrg-0.pdf
     https://lwn.net/Articles/564978/

In short, by sending 64kB segments to your NIC and counting on it to
cut them into small pieces for you and to send the resulting packets
very close together, you increase the risk of losses on the many network
devices which run with not-that-large buffers. It's easy to observe
even on some 10Gbps switches. When your switch mixes 10G and 40G, it's
horrible, really.
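
A back-of-the-envelope sketch of such a burst, under loudly stated
assumptions: a 1448 byte MSS, a simple fluid model where the burst
arrives at the ingress rate and drains at the egress rate, and no
accounting for headers or other traffic. tso_burst() is a hypothetical
helper, not something taken from the links above.

```python
import math

def tso_burst(segment_bytes=64 * 1024, mss=1448,
              in_gbps=40.0, out_gbps=10.0):
    """Return (back-to-back packets produced from one large segment,
    bytes a switch has to queue while the burst drains)."""
    packets = math.ceil(segment_bytes / mss)
    # Fluid model: the queue grows at (in - out) while the burst arrives.
    backlog = segment_bytes * max(0.0, 1.0 - out_gbps / in_gbps)
    return packets, backlog

packets, backlog = tso_burst()
print(f"{packets} packets back to back, about {backlog:.0f} bytes queued")
```

In this model a single 64kB segment going from a 40G port to a 10G port
needs roughly 48kB of buffer for one flow alone, so a few concurrent
senders are enough to overflow a shallow per-port buffer.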

> >   - what are orphans and why you should care about them in HTTP close mode
> Orphaned TCP connections or orphaned HTTP processes?

TCP connections. The server sends a response, performs a close() on the
socket, the data remain in the kernel buffers for the time it takes to
deliver these data to the client and to get them ACKed. From this point
the socket is called orphaned on the system because it doesn't belong to
any process anymore. But immediately after this, a process which runs with
a limited connection count can accept a new connection, which will in turn
be fed with large chunks of data and closed. And this runs over and over,
eating a huge amount of socket memory despite a small limit imposed on the
server's connection concurrency. At some point the kernel doesn't have
enough socket buffers anymore (especially after admins have set them to
16MB as instructed on $random tuning guide) and starts to kill some orphaned
connections to get some memory back. The not-so-nice effect that the admin
cannot detect is that the client gets truncated responses. Only a small part
of the tail is missing but in the logs, it's said that everything was sent.
But sent by the process means sent to the system only.
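
The last point can be observed with a minimal sketch like the one below.
It is only an illustration on the loopback interface, where delivery is
almost instantaneous, so it does not reproduce the memory pressure or the
truncation; the function name and the 32kB payload size are arbitrary
choices. It merely shows that sendall() and close() return successfully
on the server side before the client application has read a single byte.

```python
import socket

def close_then_read(payload_size=32 * 1024):
    """Server writes and closes first, client reads afterwards.
    Returns the number of bytes the client finally received."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))   # any free port on loopback
    listener.listen(1)

    client = socket.create_connection(listener.getsockname())
    client.settimeout(5)              # never hang if something goes wrong
    server_side, _ = listener.accept()

    # "Sent" here only means the kernel accepted the data.
    server_side.sendall(b"x" * payload_size)
    # After close() the process no longer owns the socket; the kernel
    # alone is responsible for whatever is still to be delivered.
    server_side.close()
    listener.close()

    received = 0
    while True:
        chunk = client.recv(65536)
        if not chunk:                 # peer's FIN reached us
            break
        received += len(chunk)
    client.close()
    return received

print(close_then_read(), "bytes read after the server had already closed")
```

A server log written right after close() would report a complete
response in every case, whether or not the kernel later manages to
deliver the tail.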

> >   - TCP fastopen : how does it work, what type of workload is improved,
> >     what are the risks (ie: do not enable socket creation without cookie
> >     by default just because you find it reduces your server load)
> 
> Another doc that exists.
> 
> >   - whether to choose a short or a large SYN backlog depending on your
> >     workload (ie: do you prefer to process everything even if the dequeuing
> >     is expensive or to drop early in order to recover fast).
> >
> > ... and probably many other that don't immediately come to my mind. None
> > of these ones was a real issue 20 years ago.
> 
> See above. Many were known around that time, but weren't documented in
> detail (it took a while for a proper ref to SYN cookies, and the book I
> wrote with Sterbenz came about because we'd seen wheels being
> rediscovered for 15 years).

People rediscover wheels because it's hard to find simple and accurate
information on the net. Basically you have the choice :
  - either uneducated blog posts saying "how I saved my web site using 2
    sysctls"
  - or academic papers which are only understandable by scientific people
    having enough time

At least the first ones have the merit of being easy to test, and since
they appear to work they are viral.

> >  All of them became issues for
> > many web server admins who just copy-paste random settings from various
> > blogs found on the net who just copy the same stupidities over and over
> > resulting in the same trouble being caused to each of their readers.
> 
> This doc is all over the place.
> 
> If you want a doc to advise web admins, do so.

That's *exactly* what Daniel started to do when you told him he shouldn't
do it.

> But most of the items
> above aren't under admin control; they're buried in app and OS
> implementations, and most have already evolved to do the right thing.

That's not true anymore. There was a time when Solaris had one of the
most configurable TCP stacks, you could do ndd /dev/tcp with a lot of
things. Nowadays all modern operating systems let you fiddle with a ton
of stuff. And even for the missing parts you can very easily find patches
all over the web because opensource has changed this considerably.

> I agree that a summary of a *focused set* of these might be useful *as a
> review* (note that review discussions usually include lots of refs).

The purpose as I understood it was precisely to gather knowledge from
people here operating such systems and adding some elements found from
various sources. I also expressed my interest in sharing such experience
which some will find valid and others not, which is perfect because it
means there's a choice to make depending on a use case.

> The key question is "what is the focus":
> 
>     - HTTP/TCP interactions
>     - server administration advice
>     - ?

Those two go hand-in-hand nowadays. You probably know that the difficulty
for HTTP implementors to find a properly tuned TCP stack is what makes them
consider UDP-based alternatives, so that they can totally control the
behaviour from user-space and provide quick updates for their protocol.
Want an example ?

    https://www.chromium.org/quic

If it was possible to always have a perfectly tuned TCP stack I'm pretty
sure we wouldn't even hear about such initiatives. And this is not something
new, I happened to discuss this subject with some people at IETF83 in
2012 already. Back then I thought "they have no idea what they're trying to
reinvent" and now I'm thinking "well, they have their reasons and they
might be right after all given all the resistance they may be facing on
the TCP side to get some default timers changed".

> IMO, RFCs should focus on issues specific to the protocols and their
> deployment - not general knowledge that exists in courses and textbooks.

Courses and textbooks are totally outdated when they're not simply wrong.
I was first taught that it was normal to fork a process after performing
an accept(), resulting in my very first proxy working this way. Then I've
been taught that it was mandatory to send an empty ACK after a SYN-ACK to
validate a connection (which is totally wrong otherwise it would not allow
this ACK to be lost). I've been taught that in order to accept an incoming
connection you had to have your socket in listen mode exclusively. This is
false as well, two clients can connect together during their connect()
phase, it even has some security implications that are often overlooked.
Like it or not, the HTTP protocol has brought TCP to an area it was not
initially intended for and I find it fantastic to see that this protocol
still scales so well. But we need to consider modern usages of this protocol
for the web, and not just academic research and e-mail.

> >> I.e., at the most this is a man page (specific to an OS). At the least,
> >> this isn't useful at all.
> > As you can see above, nothing I cited was OS-specific but only workload
> > specific. That's why I think that an informational RFC is much more suited
> > to this than an OS-specific man page. The OS man page may rely on the RFC
> > to propose various tuning profiles for different workloads however.
> 
> You have a good point that this is general info, but OS issues are not
> in the scope of the IETF and there are courses and books that already
> provide good advice on efficiently running web (and other) servers.

Well, you want to run a quick test ? Google "tcp tuning for http server".
Skip the first two responses which are Daniel's document, take the next one :

    https://gist.github.com/kgriffs/4027835

Read a bit... Not bad, could have been much worse. OK found : 5 seconds
time_wait timeout for conntrack. Dig a little bit more... suggests 2M
time_wait sockets. At 5s per socket, that's an expected 400k connections
per second limit. This with buffers that can be as large as 16MB on both
sides (one could ask why to put 16MB buffers on the Rx path for a web
server). So basically the guy probably thinks that there's a need to
fill up to 6.4 TB/s and despite this the max-orphans was not set according
to the time-wait value, so users of his doc will start to cause some data
truncation before they run into connection failures, thus only their visitors
will know there are problems.
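
The arithmetic behind these figures, as a short sketch. The three input
values are the ones quoted above from that gist; 16MB is taken as
16 * 10^6 bytes, which is an assumption that matches the 6.4 TB/s result,
and implied_limits() is a hypothetical helper.

```python
def implied_limits(time_wait_sockets=2_000_000, time_wait_seconds=5,
                   buffer_bytes=16 * 10**6):
    """Return (connections per second needed to fill the time_wait
    table, bytes per second if each one fills a whole buffer)."""
    conn_per_sec = time_wait_sockets // time_wait_seconds
    bytes_per_sec = conn_per_sec * buffer_bytes
    return conn_per_sec, bytes_per_sec

conn_per_sec, bytes_per_sec = implied_limits()
print(f"{conn_per_sec} conn/s, {bytes_per_sec / 10**12:.1f} TB/s")
```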

And this is the highest ranked doc on google after Daniel's. And overall
it's really not bad, much better than what we easily find everywhere.

The fact that Daniel's doc is found first gives good hope that this sad state
of affairs can come to an end. That's why I think we should encourage him to
continue and give him all the references he needs to have an indisputably
good reference doc.

Regards,
Willy