Re: [tcpm] TCP Tuning for HTTP - update

Joe Touch <touch@isi.edu> Wed, 17 August 2016 22:23 UTC

To: Willy Tarreau <w@1wt.eu>
References: <0CC24FC1-37E1-4125-9627-05726A9D9406@mnot.net> <7fa95741-ac58-3183-1b92-238bd4b4dae6@isi.edu> <5CD67877-19E3-4E79-BBF2-3E270343A378@mnot.net> <2197232f-10d7-28cb-fcc9-05bd495e3c22@isi.edu> <20160817064545.GD16017@1wt.eu> <7f7b129c-f156-d067-bef8-4a2213f461ac@isi.edu> <20160817180802.GA16773@1wt.eu> <4ab7c5b0-3722-1346-f481-a8d76de70034@isi.edu> <20160817211317.GA16929@1wt.eu>
From: Joe Touch <touch@isi.edu>
Message-ID: <c928d1ca-fc89-d0b0-4e1a-8a0bd960d2bb@isi.edu>
Date: Wed, 17 Aug 2016 15:23:10 -0700
In-Reply-To: <20160817211317.GA16929@1wt.eu>
Archived-At: <https://mailarchive.ietf.org/arch/msg/tcpm/umxlBli-EWCgYv9PrTl9vkG0P8I>
Cc: Mark Nottingham <mnot@mnot.net>, tcpm@ietf.org, HTTP Working Group <ietf-http-wg@w3.org>, Daniel Stenberg <daniel@haxx.se>, Patrick McManus <pmcmanus@mozilla.com>
Subject: Re: [tcpm] TCP Tuning for HTTP - update


On 8/17/2016 2:13 PM, Willy Tarreau wrote:
> On Wed, Aug 17, 2016 at 11:31:33AM -0700, Joe Touch wrote:
>>> It can be cited in new RFCs
>>> to justify certain choices. 
>> Hmm. Like the refs I gave could be cited in this doc to justify *its*
>> choices? :-)
> I think it would be nice that this is cited, but to be clear on one
> point, I've never heard about your papers before you advertised them
> here in this thread, 

A search engine on the terms "TCP HTTP interaction" would have popped
them up rather quickly.

> and yet I've been dealing with timewait issues
> for 15 years like many people facing moderate to large web sites
> nowadays. 

"timewait issues" and we're the 5th hit in Google.


> It just happens that you probably identified these issues
> very early at low connection rates, but today anyone dealing with
> more than 500-1000 connections per second on a server or worse on
> a gateway quickly discovers that he has to make a choice.

The issue was very common when the doc was written in '99; even at
that time there were two issues: running out of the number space and
running out of kernel memory.

The number space issue of running out of ports was the basis of the IETF
port names doc in 2006
(https://tools.ietf.org/html/draft-touch-tcp-portnames-00) that became
the current proposal for a TCP "service number option" in 2013 (which
has been discussed at various IETFs in TCPM since then).
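
As a rough illustration of the number-space limit: a closed connection's
four-tuple stays unusable for the TIME_WAIT period, so one client address
talking to one server address and port can sustain at most (usable source
ports / TIME_WAIT seconds) new connections per second. The sketch below
ignores the reuse that RFC 6191 permits. The pool of 28232 source ports is
an illustrative assumption, the 240 s and 30 s timers are the ones mentioned
in the quoted text, and the function name is hypothetical.

```python
def max_new_connections_per_second(usable_ports: int, time_wait_s: float) -> float:
    """Upper bound for one client address talking to one server address:port."""
    return usable_ports / time_wait_s


if __name__ == "__main__":
    PORTS = 28232  # assumed source-port pool, for illustration only
    for time_wait_s in (240, 30):
        rate = max_new_connections_per_second(PORTS, time_wait_s)
        print(f"TIME_WAIT {time_wait_s:>3} s -> about {rate:.0f} connections/s")
```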

>>>> Yes, and discussing those issues would be useful - but not in this
>>>> document either.
>>> Why ? Lots of admins don't understand why the time_wait timeout remains
>>> at 240 seconds on Solaris with people saying "if you want to be conservative
>>> don't touch it but if you want to be modern simply shrink it to 30 seconds
>>> or so". People need to understand why the advice has changed over 3 decades.
>> The advice hasn't really changed - the advice was given in the 99 ref,
>> which includes some cases where it can still be appropriate to decrease
>> that timer.
> Most people see it the other way around : they see no valid case to *increase*
> it beyond a few seconds, because for them the default value should be extremely
> low (e.g. this firewall vendor several years ago trying to insist on one second).
> Yes that's really sad but that's reality. And you can tell them to read RFC 6191;
> they won't care.

Most people's servers don't need to run fast enough to care (note that
nearly everyone runs some sort of web server on nearly every device,
whether for control or configuration). The only issue is high-volume
servers (the kind sysadmins deal with), and those people tend to already
know what the tradeoffs are and accept the risks.

>
>>>   - TCP timestamps: what they provide, what are the risks (some people in
>>>     banking environments refuse to enable them so that they cannot be used
>>>     as an oracle to help in timing attacks).
>> That's already covered in the security considerations of RFC 7323. How
>> is HTTP different, if at all, from any other app?
> HTTP is special in that it is fairly common to have to deal with tens of
> thousands of connections per second between one client and one server when
> you are on the server side, because you place a number of gateways (also
> called reverse-proxies) which combine all of the possible issues you can
> think of at a single place.

There are lots of services that have that many transactions - DNS
servers (even local ones), remote databases, etc.

The point is that HTTP doesn't make the problem different, so this isn't
an HTTP issue. It's a high rate server issue.


>  Timestamps are one way to improve fast connection
> recycling between the client and the server without having to cheat on
> timewaits. But since they consume 12 bytes per packet, it's often advised
> to disable them in benchmarks to get the highest throughput...
>
>>>   - window scaling : how much is needed.
>> Same issue here, same ref - how is HTTP different?
> Same as above.
>
>>>   - socket sizing : contrary to what you write, there's a lot of tuning
>>>     on the web where people set the default buffer sizes to 16MB without
>>>     understanding the impacts when dealing with many sockets
>> There's a whole book that encompasses that and some related issues:
>> http://www.onlamp.com/pub/a/onlamp/2005/11/17/tcp_tuning.html
> Looks fine, could be added to the list of references.
>
>> Some advice is also given in Sec 6.3.3 of this:
>> J. Sterbenz, J. Touch, /High Speed Networking: A Systematic Approach to
>> High-Bandwidth Low-Latency Communication/
>> <http://www.wiley.com/WileyCDA/WileyTitle/productCd-0471330361.html>,
>> Wiley, May 2001.
>>
>>>   - SACK : why it's better. DSACK what it adds on top of SACK.
>> That's in the SACK docs, which aren't cited. Again, how is HTTP
>> different from any app?
>>>   - ECN : is it needed ? does it really work ? where does it cause issues ?
>> That's in the ECN docs, which aren't cited. Again, how is HTTP different
>> from any app?
>>>   - SYN cookies : benefits, risks
>> That's in the RFC 4987, which at least IS cited. Again, how is HTTP
>> different from any app?
> So probably you're starting to see the benefit of having a single doc
> to concentrate all this.

The same reason it's useful to have this all in one place is the reason
we already do - there are books and courses on this.

>  You provided at least 3 different articles
> to read and 2 or 3 different RFCs in addition to the original ones,
> of course. A hosting provider whose web sites are down due to a lack
> of tuning doesn't start to read many very long articles and even less
> the most scientific ones, they need to find quick responses that they
> can apply immediately (matter of minutes). So they launch google, they
> type "web site dead, time-wait overflow" and they get plenty of
> responses on stackoverflow and serverfault, many from people having
> done the same in the past and repeating the same mistakes over and over.

These people don't read RFCs to fix problems. They take online courses
or read "how to" books - which do already exist in this space.

> A document validated by several people and giving links for further
> reading can help improve this situation.

Those are the books and courses I'm talking about already.

>
>>>   - TCP reuse/recycling : benefits, risks
>> Not sure what you mean here. There are a lot of docs on the issues with
>> persistent-HTTP vs per-connection HTTP.
> Often on a gateway you cannot completely choose. You have a mix of both.

Sure... but again, that's not in this doc either.

>
>>>   - dealing with buffer bloat : tradeoffs between NIC-based acceleration
>>>     and pacing
>> Bufferbloat typically involves large *uplink* transfers and how they
>> interact with other uplink connections. Neither TCP nor HTTP is involved
>> in this really.
> Maybe in the end Daniel is not the only one not to read all articles
> published on the web :-)
>
>      http://www.cringely.com/2012/03/25/linux-3-3-finally-a-little-good-news-for-bufferbloat/
>      https://www.ietf.org/proceedings/86/slides/slides-86-iccrg-0.pdf
>      https://lwn.net/Articles/564978/
>
> In short, by sending 64kB segments to your NIC and counting on it to
> cut them into small pieces for you and sending the resulting packets
> very close together, you increase the risk of losses on many network
> devices which run with not-that-large buffers.

Bufferbloat describes what happens when the buffers are too large, not
too small.

The problem you're describing is the interaction between burstiness and
tail-drop, which is addressed by ECN.
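
A back-of-the-envelope sketch of that burstiness point, using a simple fluid
model: a burst that arrives at one line rate and drains at a slower one
builds a queue of burst * (1 - out/in). The 64kB burst and the 10G/40G rates
come from the quoted text; the model (which ignores framing overhead and
competing traffic) and the function name are assumptions.

```python
def peak_queue_bytes(burst_bytes: float, in_rate_bps: float, out_rate_bps: float) -> float:
    """Peak backlog when a back-to-back burst arrives faster than it can drain."""
    if out_rate_bps >= in_rate_bps:
        return 0.0  # the egress port keeps up; no queue builds in this model
    return burst_bytes * (1.0 - out_rate_bps / in_rate_bps)


if __name__ == "__main__":
    # One 64 kB offloaded segment arriving on a 40G port and leaving on a 10G port.
    print(peak_queue_bytes(64 * 1024, 40e9, 10e9))  # 49152.0
```

With tail-drop, losses start once the sum of such concurrent bursts exceeds
the buffer available to the egress port.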

>  It's easy to observe
> even on some 10Gbps switches. When your switch mixes 10G and 40G, it's
> horrible, really.

RFCs aren't typically used as context in switch design...


>
>>>   - what are orphans and why you should care about them in HTTP close mode
>> Orphaned TCP connections or orphaned HTTP processes?
> TCP connections. The server sends a response, performs a close() on the
> socket, the data remain in the kernel buffers for the time it takes to
> deliver these data to the client and to get them ACKed. From this point
> the socket is called orphaned on the system because it doesn't belong to
> any process anymore. But immediately after this, a process which runs with
> a limited connection count can accept a new connection, which will in turn
> be fed with large chunks of data and closed. And this runs over and over,
> eating a huge amount of socket memory despite a small limit imposed on the
> server's connection concurrency. At some point the kernel doesn't have
> enough socket buffers anymore (especially after admins have set them to
> 16MB as instructed on $random tuning guide) and starts to kill some orphaned
> connections to get some memory back. The not-so-nice effect that the admin
> cannot detect is that the client gets truncated responses. Only a small part
> of the tail is missing but in the logs, it's said that everything was sent.
> But sent by the process means sent to the system only.

That's an implementation issue of the OS (or a bug, depending on whether
you consider TCP a reliable transport or not, IMO).
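
For the case described in the quoted text, here is a minimal sketch of what
an application can do on its own: a "lingering close", where the process
half-closes the socket and waits for the peer's close before releasing it,
so that in the successful case the socket is not orphaned with unsent data
still queued. It assumes a peer that closes after reading the end of the
response (as in HTTP close mode); the function name and timeout value are
hypothetical.

```python
import socket


def lingering_close(sock: socket.socket, timeout_s: float = 5.0) -> bool:
    """Half-close, then wait for the peer's EOF. Returns True if the peer closed."""
    try:
        sock.shutdown(socket.SHUT_WR)   # FIN goes out after any queued data
        sock.settimeout(timeout_s)      # applies to each recv() call below
        while True:
            if sock.recv(4096) == b"":  # EOF: the peer closed its side
                return True
            # anything else the peer still sends is discarded
    except OSError:                     # includes timeouts and resets
        return False                    # gave up: delivery was not confirmed
    finally:
        sock.close()
```

A False return is the signal the logs in the quoted scenario never get: the
process handed everything to the system, but the peer did not confirm the end
of the exchange.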

>
>>>   - TCP fastopen : how does it work, what type of workload is improved,
>>>     what are the risks (ie: do not enable socket creation without cookie
>>>     by default just because you find it reduces your server load)
>> Another doc that exists.
>>
>>>   - whether to choose a short or a large SYN backlog depending on your
>>>     workload (ie: do you prefer to process everything even if the dequeuing
>>>     is expensive or to drop early in order to recover fast).
>>>
>>> ... and probably many others that don't immediately come to my mind. None
>>> of these was a real issue 20 years ago.
>> See above. Many were known around that time, but weren't documented in
>> detail (it took a while for a proper ref to SYN cookies, and the book I
>> wrote with Sterbenz came about because we'd seen wheels being
>> rediscovered for 15 years).
> People rediscover wheels because it's hard to find simple and accurate
> information on the net.

Nobody looks to RFCs to solve that problem...

>  Basically you have the choice :
>   - either uneducated blog posts saying "how I saved my web site using 2
>     sysctls"
>   - or academic papers which are only understandable by scientific people
>     having enough time

... that's what net FAQs are for, as well as courses and books.

> At least the first ones have the merit of being easy to test, and since
> they appear to work they are viral.
>
>>>  All of them became issues for
>>> many web server admins who just copy-paste random settings from various
>>> blogs found on the net who just copy the same stupidities over and over
>>> resulting in the same trouble being caused to each of their readers.
>> This doc is all over the place.
>>
>> If you want a doc to advise web admins, do so.
> That's *exactly* what Daniel started to do when you told him he shouldn't
> do it.

I didn't say a doc to advise web admins wasn't useful. I said it wasn't
an RFC.

It's a web FAQ, a book, etc.


>
>> But most of the items
>> above aren't under admin control; they're buried in app and OS
>> implementations, and most have already evolved to do the right thing.
> That's not true anymore. There was a time when Solaris had one of the
> most configurable TCP stacks, you could do ndd /dev/tcp with a lot of
> things. Nowadays all modern operating systems let you fiddle with a ton
> of stuff. And even for the missing parts you can very easily find patches
> all over the web because opensource has changed this considerably.

There are no kernel configs to tell apps to open more than one
connection at a time. You don't need a kernel config to tell Firefox to
disable Nagle, use a reasonable socket size, etc - yes, there are OS
defaults, but most web servers and clients build in the correct
overrides already.
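
As a small illustration of such per-application overrides, the sketch below
disables Nagle and requests explicit buffer sizes on a single socket, with no
system-wide setting involved. The 256 kB figure and the function name are
illustrative assumptions, not recommendations from this thread, and the
kernel may round or cap the sizes it actually grants.

```python
import socket


def make_tuned_socket(buf_bytes: int = 256 * 1024) -> socket.socket:
    """Create a TCP socket with Nagle disabled and explicit buffer sizes requested."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)       # disable Nagle
    s.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, buf_bytes)  # send buffer request
    s.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, buf_bytes)  # receive buffer request
    return s


if __name__ == "__main__":
    s = make_tuned_socket()
    # Read the values back: what was granted can differ from what was asked for.
    print(s.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY) != 0)
    print(s.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF))
    s.close()
```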

>
>> I agree that a summary of a *focused set* of these might be useful *as a
>> review* (note that review discussions usually include lots of refs).
> The purpose as I understood it was precisely to gather knowledge from
> people here operating such systems and adding some elements found from
> various sources. I also expressed my interest in sharing such experience
> which some will find valid and others not, which is perfect because it
> means there's a choice to make depending on a use case.
>
>> The key question is "what is the focus":
>>
>>     - HTTP/TCP interactions
>>     - server administration advice
>>     - ?
> Those two go hand-in-hand nowadays. You probably know that the difficulty
> for HTTP implementors to find a properly tuned TCP stack is what makes them
> consider UDP-based alternatives,

They want something different for a variety of reasons - the same kind
of airtight logic by which TBL developed HTTP instead of using FTP (he
said that you'd only typically need one file from a location, so why
open 2 connections? now we're stuck trying to mux control and data
rather than having a proper solution that already existed at the time -
it took nearly a decade for HTTP servers to catch up to the performance
of FTP).

> so that they can totally control the
> behaviour from user-space and provide quick updates for their protocol.
> Want an example ?
>
>     https://www.chromium.org/quic

There are many thousands of monkeys typing everywhere - look at the
Linux source if you want even better examples.

> If it was possible to always have a perfectly tuned TCP stack I'm pretty
> sure we wouldn't even hear about such initiatives. And this is not something
> new, I happened to discuss this subject with some people at IETF83 in
> 2012 already. By then I thought "they have no idea what they're trying to
> reinvent" and now I'm thinking "well, they have their reasons and they
> might be right after all given all the resistance they may be facing on
> the TCP side to get some default timers changed".
>
>> IMO, RFCs should focus on issues specific to the protocols and their
>> deployment - not general knowledge that exists in courses and textbooks.
> Courses and textbooks are totally outdated when they're not simply wrong.

I'm now confused.

You don't want sysadmins to read books or take courses because they're
outdated and thus wrong, but you want to issue an immutable RFC (which
will likely be outdated by the time it's issued)?

> I was first taught that it was normal to fork a process after performing
> an accept(), resulting in my very first proxy working this way. Then I've
> been taught that it was mandatory to send an empty ACK after a SYN-ACK to
> validate a connection (which is totally wrong otherwise it would not allow
> this ACK to be lost). I've been taught that in order to accept an incoming
> connection you had to have your socket in listen mode exclusively. This is
> false as well, two clients can connect together during their connect()
> phase, it even has some security implications that are often overlooked.

I'm the first to admit there are bad courses, certainly.

> Like it or not, the HTTP protocol has brought TCP to an area it was not
> initially intended for

HTTP makes mistakes that people blame on TCP (like HOL blocking), and
TCP is based on assumptions that are no longer true (not just for HTTP,
but for many other app protocols, e.g., the issue of burst after idle is
based on the outdated assumption that most transfers are roughly symmetric).

> and I find it fantastic to see that this protocol
> still scales so well. 
> But we need to consider modern usages of this protocol
> for the web, and not just academic research and e-mail.

You might consider that TCPM and TSVWG don't exist for just "academic
research and e-mail". What do you think we've been doing for the past 40
years?

>
>>>> I.e., at the most this is a man page (specific to an OS). At the least,
>>>> this isn't useful at all.
>>> As you can see above, nothing I cited was OS-specific but only workload
>>> specific. That's why I think that an informational RFC is much more suited
>>> to this than an OS-specific man page. The OS man page may rely on the RFC
>>> to propose various tuning profiles for different workloads however.
>> You have a good point that this is general info, but OS issues are not
>> in the scope of the IETF and there are courses and books that already
>> provide good advice on efficiently running web (and other) servers.
> Well, you want to run a quick test ? Google "tcp tuning for http server".
> Skip the first two responses which are Daniel's document, take the next one :
>
>     https://gist.github.com/kgriffs/4027835

Like I said about Linux... ;-)

> Read a bit... Not bad, could have been much worse. OK found : 5 seconds
> time_wait timeout for conntrack. Dig a little bit more... suggests 2M
> time_wait sockets. At 5s per socket, that's an expected 400k connections
> per second limit. This with buffers that can be as large as 16MB on both
> sides (one could ask why to put 16MB buffers on the Rx path for a web
> server). So basically the guy probably thinks that there's a need to
> fill up to 6.4 TB/s and despite this the max-orphans was not set according
> to the time-wait value, so users of his doc will start to cause some data
> truncation before they run into connection failures, thus only their visitors
> will know there are problems.
>
> And this is the highest ranked doc on google after Daniel's. And overall
> it's really not bad, much better than what we easily find everywhere.
>
> The fact that Daniel's doc is found first gives good hope that this sad state
> of affairs can reach an end. That's why I think we should encourage him to
> continue and give him all the references he needs to have an indisputably
> good reference doc.
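
The time_wait arithmetic in the quoted text can be reproduced as follows.
The inputs (2M time_wait sockets, a 5 second timeout, 16MB buffers) are the
ones quoted; reading 16MB as 16 * 10^6 bytes is an assumption made to match
the quoted 6.4 TB/s, and the variable names are illustrative.

```python
MAX_TIME_WAIT_SOCKETS = 2_000_000   # table size suggested by the quoted guide
TIME_WAIT_SECONDS = 5               # conntrack time_wait timeout from the same guide
BUFFER_BYTES = 16 * 10**6           # "16MB", read here as decimal megabytes

# Each closed connection holds a time_wait slot for 5 s, so the table
# only fills up at this sustained rate of new connections:
conn_per_second = MAX_TIME_WAIT_SOCKETS // TIME_WAIT_SECONDS

# Filling one buffer of that size per connection at that rate:
bytes_per_second = conn_per_second * BUFFER_BYTES

print(conn_per_second)            # 400000
print(bytes_per_second / 1e12)    # 6.4 (TB/s)
```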

You might try some other terms in your searches. Like just "TCP tuning",
or "TCP HTTP interactions".

Joe