Re: Fwd: Re: [tcpm] FW: Call for Adoption: TCP Tuning for HTTP

Willy Tarreau <w@1wt.eu> Thu, 03 March 2016 21:16 UTC

Date: Thu, 03 Mar 2016 22:11:05 +0100
From: Willy Tarreau <w@1wt.eu>
To: Joe Touch <touch@isi.edu>
Cc: ietf-http-wg@w3.org
Message-ID: <20160303211105.GA23875@1wt.eu>
References: <56D74C23.5010705@isi.edu> <56D76A7E.7090507@isi.edu> <20160302232125.GA18275@1wt.eu> <56D77892.2000308@isi.edu> <20160303065545.GA18412@1wt.eu> <56D87BAC.4060204@isi.edu> <20160303184418.GA18774@1wt.eu> <56D88D58.5060406@isi.edu>
In-Reply-To: <56D88D58.5060406@isi.edu>
Subject: Re: Fwd: Re: [tcpm] FW: Call for Adoption: TCP Tuning for HTTP
Archived-At: <http://www.w3.org/mid/20160303211105.GA23875@1wt.eu>

On Thu, Mar 03, 2016 at 11:15:36AM -0800, Joe Touch wrote:
> > It not that black and white unfortunately, and in practice it's very
> > common to see proxies fail in field above 500 connections per second
> > because their TCP stack was not appropriately tuned, and with the
> > default 60s TIME_WAIT timeout of their OS, they exhaust the default
> > 28k source ports. The first things admins do in this case is to
> > enable tcp_tw_recycle (which basically causes timewaits to be killed
> > when needed), and this appears to solve the situation while it makes
> > it even worse.
> 
> Those proxies are acting as servers. We've known about the server TW for
> a long time.

No, I'm sorry if my explanation was not clear. I'm talking about the outgoing
connection from the proxy to the server, i.e. the most common case:

   internet  ----> reverse ----> server
   clients          proxy

The problem is not with the TW on the internet-facing side of the proxy
but with the server-facing side.
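
As a back-of-the-envelope illustration of the figures quoted above (28k
default source ports, 60s TIME_WAIT), here is the limit this puts on the
proxy's outgoing connection rate towards a single server address and port.
The numbers are the defaults mentioned in this thread, not universal ones:

    # Rough sketch: sustained outbound connection rate a client-side host
    # can reach towards one server ip:port before exhausting source ports.
    ephemeral_ports = 28_000    # e.g. the default Linux ip_local_port_range
    time_wait_secs = 60         # default TIME_WAIT duration on many stacks
    print(ephemeral_ports / time_wait_secs)   # ~466 connections per second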

> > Among the solutions, we can count on :
> >   - putting back idle connections into pools hoping that they will be
> >     reusable. Connection reuse rate still remains low on average.
> >   - keeping a high enough keep-alive idle timeout on the proxy and a
> >     smaller one on the server (when the proxy is a gateway installed
> >     on the server side) hoping for the server to close first
> 
> This just moves the problem to the server where it is much worse in
> aggregate.

No, they're cheap, and that's what *everyone* who has to deal with more
than 1000 connections per second has to do.

> >   - disabling lingering before closing when the HTTP state indicates
> >     the proxy has received all data
> 
> Linger has nothing to do with this issue.

I already explained this, I'm sorry.

> > These are just general principles and many derivatives may exist in
> > various contexts, but these ones are definitely important points that
> > HTTP implementors have to be aware of before falling into the same
> > traps as the ones having done so previously.
> 
> The important issue is largely the buildup of TW at the server, and this
> has been known for a long time, as have the mitigations.

And this comes at a very small cost. The main cost of the TW buildup on the
server is that netstat can take ages to list the sockets. Compared to a
completely stopped web site, that is pretty minor.

> >> 	2) consumes memory space (and potentially CPU resources)
> > 
> > This one is vey cheap. A typical TW connection is just a few tens of bytes.
> > 
> >> Neither is typically an issue for HCI-based clients.
> > 
> > I don't know what you call HCI here, I'm sorry.
> 
> Human-computer interface, i.e., client whose traffic is driven by a
> person clicking on things.

OK, thanks. So a browser. This is indeed the case where you don't care
for now (except when bogus scripts open connections in a loop, but that
causes other issues as well).

> >> Servers have much
> >> higher rate requirements for a given address when they act as a proxy
> >> and consume more memory overall because they interact with a much larger
> >> set of addresses.
> > 
> > Servers are not penalized at all with the connection rate since it only
> > limits the *outgoing* connection rate and not the incoming one.
> 
> When TW accumulates, new connections are rejected at the side that
> accumulates the TW. That can happen on either side.

No, no, no. Please re-read what I explained. This is RFC 1122. If this is
what you believe, I understand why you're against accumulating them on
the server. But if that were true for any internet-connected host, the web
would be pretty sad nowadays...

Let me restate it:
  - the client cannot bypass the TW state because it has no way to know
    whether or not the server has closed after seeing the last ACK or
    is still waiting for it.

  - the server, when it sees a SYN with an ISN larger than the end of the
    previous window, *knows* that the client has closed, otherwise the
    client would still be in LAST_ACK and could not send a SYN in that state.
    This is why servers recycle connections, and only in this case (see the
    sketch below).
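
To make the asymmetry concrete, here is a very simplified sketch of the
server-side check described above (hypothetical helper names; the real
logic lives in the kernel, for instance in Linux's
tcp_timewait_state_process(), and handles more cases):

    def seq_after(a, b):
        """True if 32-bit sequence number a is strictly after b (mod 2**32)."""
        return 0 < ((a - b) & 0xFFFFFFFF) < 0x80000000

    def syn_vs_time_wait(syn_isn, tw_rcv_nxt, syn_tsval=None, tw_ts_recent=None):
        # A SYN arrived for a 4-tuple that is still in TIME_WAIT on the server.
        if syn_tsval is not None and tw_ts_recent is not None and syn_tsval > tw_ts_recent:
            return "drop TIME_WAIT, accept SYN"   # newer timestamp: provably new
        if seq_after(syn_isn, tw_rcv_nxt):
            return "drop TIME_WAIT, accept SYN"   # ISN beyond the old window: provably new
        return "keep TIME_WAIT, ignore SYN"       # could be an old duplicate

The client side has no equivalent check available: sending a SYN tells it
nothing about whether the server is still in LAST_ACK, which is why it has
to sit out the full TIME_WAIT.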

> > There's
> > never any ambiguity when a SYN is received regarding the possibility that
> > the connection still exists on the other side, 
> 
> That's what the TW is intended to inhibit.
> 
> > which is why TW connections
> > are recycled when receiving a new SYN. 
> 
> That's exactly the opposite of what TW does.

No, I disagree, and RFC 1122 disagrees with you as well, just like any web
server you can connect to over the net.

> > Regarding the memory usage, it
> > remains very low compared to the memory used by the application itself.
> > My personal record was at 5.5 million timewaits on a server at 90000
> > connections per second. It was only 300 MB of RAM on a server having
> > something like 64 GB. And not everyone needs 90k conns/s but everyone
> > needs more than 500/s nowadays in any infrastructure.
> 
> Memory impact depends on the device.

Absolutely. But 56 bytes (IPv4) or 84 bytes (IPv6) are already much smaller
than the minimum 536 bytes that a device is supposed to accept when no MSS
is negotiated.
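
Using the sizes quoted above, the earlier 5.5 million timewaits figure
checks out roughly as follows (illustrative only; the real per-entry
footprint depends on the stack):

    tw_entries = 5_500_000     # the 5.5 million timewaits mentioned earlier
    bytes_per_tw_v4 = 56       # per-entry size quoted above for IPv4
    print(tw_entries * bytes_per_tw_v4 / 1e6)   # ~308 MB, i.e. roughly 300 MB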

> > Yes they do, that's the problem everyone running a load balancer faces!
> > The highest connection rate you can reach per server is around 1000 with
> > 64k ports! That started not being enough 15 years ago!
> 
> That rate is nearly never seen on clients (which is the context of this
> part of the thread); it's certainly seen on servers and proxies.

Yes, it is on proxies, and proxies are clients. That's the point.
You find it on database clients as well, by the way. Just search the
net for people reporting the inability to connect to MySQL after
accumulating timewaits on the client, simply because the protocol was
misdesigned: the client says "QUIT", the server responds "OK", then
the client closes. BAM. One TW blocking this port for 60 seconds.
After 64k connections the application cannot connect to the server
anymore.
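
This client-side buildup is easy to reproduce; a minimal sketch (assuming
a keep-alive HTTP server on 127.0.0.1:8080, a placeholder address, so that
the client is the side closing first):

    import socket

    for _ in range(1000):
        s = socket.create_connection(("127.0.0.1", 8080))  # placeholder server
        s.sendall(b"GET / HTTP/1.1\r\nHost: test\r\n\r\n")  # HTTP/1.1 keeps the connection open
        s.recv(4096)                                        # read (part of) the response
        s.close()    # the client closes first, so the TIME_WAIT stays on this side

    # Each iteration pins one ephemeral port towards that server for the
    # TIME_WAIT duration (60s on Linux); watch them accumulate with e.g.:
    #   ss -tan state time-wait | wc -l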

> >> The error is in treating the pool
> >> of source ports as global across all IP addresses, which TW does not
> >> require.
> > 
> > No, the problem is to keep a TW which blocks a precious resource which is
> > the source port that is only addressed on 16 bits!
> 
> The TW should be blocking based on the socket pair - i.e., source IP,
> dest IP, source port, dest port. Only one of these is fixed for HTTP
> (the dest port). At either the client or server, typically the other
> side's IP address also varies more than enough to avoid this issue - the
> primary exception being proxy interactions.

From the beginning we're talking about connection initiation issues caused
by TW accumulated on the client side, particularly on proxies, but not only.
Client here refers to the TCP client, since the document is about TCP tuning
and best protocol practices to scale better and to better resist attacks,
not the client in terms of the role of the device in the complete chain.

> > I've seen people
> > patch their kernels to lower the TIME_WAIT down to 2 seconds to address
> > such shortcomings! Quite frankly, this workaround *is* causing trouble!
> 
> Again, we've known that for nearly two decades.

Then why do you insist on enforcing this well-known problem when it's
absolutely not necessary?

> >>> and can be recycled when a new valid SYN
> >>> arrives.
> >>
> >> The purpose of TW is to inhibit new SYNs involving the same port. When a
> >> new SYN arrives on another port, that has no impact on existing TWs.
> > 
> > I'm always talking about the same port. On todays hardware and real world
> > workloads, source ports can be reused every second (60k conns/s). Only the
> > server with TIME_WAIT can tell whether or not an incoming SYN is a retransmit
> > or a new one. The client knows it's a new one but doesn't know if the server
> > is still in LAST_ACK or has really closed, and due to this uncertainty it
> > refrains from connecting.
> 
> If the server initiates the close, it never goes into LAST_ACK (the
> client would).

Absolutely agree. I'm talking about the case where the client closes first.

> A node can end up in LAST_ACK only if the final ACK of
> the closing three-way handshake is lost, which should not be the typical
> case.

It happens at the same rate as the other losses, and this state tends to last
longer than the other ones for a simple reason: some intermediary NAT devices
considerably reduce their timeout once they see the first FIN packet (some
at least take care of the second one), so that past this point the random
loss of the last ACK becomes more visible because retransmits will not
reach the other side. That's quite common and most often a non-problem.

> However, it shouldn't matter that it refrains from connecting at
> that point - the other side should be in TIME-WAIT anyway.

Definitely, that was the case I mentioned, where the client being
in TW must refrain from recycling since it doesn't know if the server
on the other side is in LAST_ACK.

> >>> A TIME_WAIT on the client is not recyclable. That's why
> >>> TIME_WAIT is a problem for the client and not for the server.
> >>
> >> See above; TW is *never* recyclable.
> > 
> > Yes it definitely is on the server side, which is the point. When you
> > receive a SYN whose ISN is higher than the end of the current window,
> > it's a new one by definition (as indicated in RFC1122).
> 
> RFC1122's statement on TIME-WAIT has nothing to do with the ISN the
> server receives; it has to do with the ISN the server assigns.

But that's what is implemented everywhere. Just send an in-window SYN to
any internet-facing host and you'll get an ACK in response. Send a SYN
above the window and you get a SYN-ACK. Most implementations even support
PAWS, and then a higher TCP timestamp helps sort out the new packets from
the old ones. That's what is needed, for example, when some firewalls break
end-to-end transparency by randomizing sequence numbers that were
already properly randomized.

> You don't
> "recycle" the TIME-WAIT; you effectively reopen the connection with the
> same port pair. But that then requires keeping more state in the TW.

Excuse me for the terminology, but that's exactly what I call "recycling",
since a TIME_WAIT socket gets destroyed and recreated in SYN-RECV state.

> >>> The problem is that in some cases it's suggested that the client
> >>> closes first and this causes such problems.
> >>
> >> That actually helps the server (see our 99 Infocom paper).
> > 
> > Sure since the server doesn't receive any more traffic from this client,
> > that definitely helps, but the point is to ensure traffic flows between
> > the two hosts, not that one of them refrains from connecting.
> 
> Neither side "refrains from connecting"; TW is intended to block certain
> connection attempts, not prevent them from being attempted.

Yes, and thus, as seen from the network, the client refrains from connecting,
since its local stack blocks its application's connection attempts.

> >>> The only workaround for
> >>> the client is to close with an RST by disabling lingering,
> >>
> >> That's not what SO_LINGER does. See:
> >> http://man7.org/linux/man-pages/man7/socket.7.html
> > 
> > But in practice it's used for this.
> 
> That practice is incorrect and should never be recommended.

But there's no other option when you're on the client side and have to close
the connection, if you don't want to run out of ports within the same second.

> > When you disable lingering before
> > closing, you purge any pending data which has the benefit that the data
> > you just received from the server that carried an ACK for data you don't
> > have anymore triggers a reset. 
> 
> Linger is intended to let the other side keep sending you data after you
> issue a close - that's entirely valid TCP behavior.

No, it lets your local TCP stack continue to send the data pending in
the window after the application closes. Some operating systems call
such connections "orphans". You can't receive data over a connection
after you've closed it; it immediately causes an RST to be emitted.
This is one of the issues with redirects on POST in HTTP: you have to
drain all the incoming data to be sure no RST is emitted, or you risk
the other side never getting your response. I think you were confusing
this with the FIN_WAIT states, where you have performed shutdown(WR)
and are still receiving data, but that's pretty standard TCP.
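
For reference, a minimal sketch of the workaround being debated here: an
abort-close from the client after draining the response, so no client-side
TIME_WAIT is left pinning the source port. It uses the standard SO_LINGER
option with a zero timeout; whether doing this at all is acceptable is
precisely the disagreement (server address and request are placeholders):

    import socket
    import struct

    s = socket.create_connection(("192.0.2.10", 80))        # placeholder server
    s.sendall(b"GET / HTTP/1.1\r\nHost: example\r\n\r\n")    # keep-alive request
    resp = s.recv(65536)   # in real code: parse and drain the full response
                           # first, otherwise the RST may destroy unread data
    # l_onoff=1, l_linger=0: close() now aborts with an RST instead of a FIN,
    # and the local stack does not keep a TIME_WAIT for this port pair.
    s.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER, struct.pack("ii", 1, 0))
    s.close()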

> I.e., when you use the option as intended, it won't accomplish what you
> want.
>
> > Yes it's absolutely ugly but you have no
> > other option when you are a client and are forced to close first due to
> > the protocol. Don't forget that we're discussing a document whose outcome
> > should be that protocols are designed in the future to avoid such horrible
> > workarounds.
> 
> The doc should provide correct advice, and certainly ought to not
> reinvent 17-year old wheels.

What 17-year-old wheels? The only one I know about consists of patching
kernels to force shorter timewaits in order not to block outgoing
connections when the rate approaches 1000/s. Until we have 32 bits for
the source port, these are the only two options. At some point one should
not wonder why more and more of the transport is migrating to userland :-/

> >> TCP has a significant error regarding RSTs; the side that throws a RST
> >> on an existing connection should really go into TW - for all the same
> >> reasons that TW exists in the first place, to protect new connections
> >> from old data still in the network.
> > 
> > There are many other issues regarding RST. When you send an RST through
> > a firewall, you'd better cross fingers for it not to be lost between the
> > firewall and the destination, otherwise chances are that you won't get
> > a second chance. 
> 
> Agreed - RSTs ought to be used only in "emergencies" to abort
> connections, not as a hack to jump around the protocol space.

The problem we have is that the protocol space has not grown in the
last 35 years, while the average bandwidth may have grown by a factor of
one to ten million. That is an amazing achievement for a design of that
era, but we still have to live with today's needs.

> > That's one of the reasons why I'd love to live in a
> > world where a client never has to close first.
> 
> The RST is bad in either direction, and the world you want to live in
> was what we had 17 years ago and prompted those other papers.

No, I'm living in 2016, where load balancers have to distribute all the
traffic they receive to the servers located behind them, and where
connection rates above 10000/s per server are quite common. Multiply
this by the 60s default time_wait and you need 600k source ports. I'm sorry,
but I cannot encode that on 16 bits, and the fix is implemented in every
operating system: have cheap time_wait states that are easily recycled
-pardon, closed then reopened- when receiving a valid new SYN.

> They ought to do better research (and so should this doc, IMO).

Joe, don't take this badly, but from the beginning I have found you very
aggressive and very negative regarding this document, and you constantly
say that everyone around is doing everything wrong. Please simply tell us
how to support 100k connections per second from a load balancer to a server,
each using a single IP address, with a single port on the server,
without sending RSTs from the client and without cheating on the client's
time_wait timeout. The only solution you leave us is to increase the
source port range to 23 bits. That's not compatible with the protocol we
have. All server implementations correctly deal with time_waits and quick
port rollover, and the net today works fine thanks to all the nice work
that has accumulated over the last 35 years. Why would anyone drop all of
this to go back to the antique limit of 1000 connections per second with
64k ports, or even 250 connections per second on certain operating systems?

Thanks,
Willy