Re: Fwd: Re: [tcpm] FW: Call for Adoption: TCP Tuning for HTTP

Willy Tarreau <w@1wt.eu> Thu, 03 March 2016 18:49 UTC

Return-Path: <ietf-http-wg-request+bounce-httpbisa-archive-bis2juki=lists.ie@listhub.w3.org>
X-Original-To: ietfarch-httpbisa-archive-bis2Juki@ietfa.amsl.com
Delivered-To: ietfarch-httpbisa-archive-bis2Juki@ietfa.amsl.com
Received: from localhost (ietfa.amsl.com [127.0.0.1]) by ietfa.amsl.com (Postfix) with ESMTP id 3596C1B31E3 for <ietfarch-httpbisa-archive-bis2Juki@ietfa.amsl.com>; Thu, 3 Mar 2016 10:49:55 -0800 (PST)
X-Virus-Scanned: amavisd-new at amsl.com
X-Spam-Flag: NO
X-Spam-Score: -6.908
X-Spam-Level:
X-Spam-Status: No, score=-6.908 tagged_above=-999 required=5 tests=[BAYES_00=-1.9, RCVD_IN_DNSWL_HI=-5, RP_MATCHES_RCVD=-0.006, SPF_HELO_PASS=-0.001, SPF_PASS=-0.001] autolearn=ham
Received: from mail.ietf.org ([4.31.198.44]) by localhost (ietfa.amsl.com [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id 2ULq-2YwoPRP for <ietfarch-httpbisa-archive-bis2Juki@ietfa.amsl.com>; Thu, 3 Mar 2016 10:49:51 -0800 (PST)
Received: from frink.w3.org (frink.w3.org [128.30.52.56]) (using TLSv1.2 with cipher DHE-RSA-AES128-SHA (128/128 bits)) (No client certificate requested) by ietfa.amsl.com (Postfix) with ESMTPS id 0BB061B3F0C for <httpbisa-archive-bis2Juki@lists.ietf.org>; Thu, 3 Mar 2016 10:49:50 -0800 (PST)
Received: from lists by frink.w3.org with local (Exim 4.80) (envelope-from <ietf-http-wg-request@listhub.w3.org>) id 1abYFB-0005Mw-CW for ietf-http-wg-dist@listhub.w3.org; Thu, 03 Mar 2016 18:45:01 +0000
Resent-Date: Thu, 03 Mar 2016 18:45:01 +0000
Resent-Message-Id: <E1abYFB-0005Mw-CW@frink.w3.org>
Received: from lisa.w3.org ([128.30.52.41]) by frink.w3.org with esmtps (TLS1.2:DHE_RSA_AES_128_CBC_SHA1:128) (Exim 4.80) (envelope-from <w@1wt.eu>) id 1abYF5-0005M5-TN for ietf-http-wg@listhub.w3.org; Thu, 03 Mar 2016 18:44:55 +0000
Received: from wtarreau.pck.nerim.net ([62.212.114.60] helo=1wt.eu) by lisa.w3.org with esmtp (Exim 4.80) (envelope-from <w@1wt.eu>) id 1abYF3-0007Bk-Mi for ietf-http-wg@w3.org; Thu, 03 Mar 2016 18:44:55 +0000
Received: (from willy@localhost) by pcw.home.local (8.14.3/8.14.3/Submit) id u23IiI9Q023863; Thu, 3 Mar 2016 19:44:18 +0100
Date: Thu, 03 Mar 2016 19:44:18 +0100
From: Willy Tarreau <w@1wt.eu>
To: Joe Touch <touch@isi.edu>
Cc: ietf-http-wg@w3.org
Message-ID: <20160303184418.GA18774@1wt.eu>
References: <56D74C23.5010705@isi.edu> <56D76A7E.7090507@isi.edu> <20160302232125.GA18275@1wt.eu> <56D77892.2000308@isi.edu> <20160303065545.GA18412@1wt.eu> <56D87BAC.4060204@isi.edu>
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Disposition: inline
In-Reply-To: <56D87BAC.4060204@isi.edu>
User-Agent: Mutt/1.4.2.3i
Received-SPF: pass client-ip=62.212.114.60; envelope-from=w@1wt.eu; helo=1wt.eu
X-W3C-Hub-Spam-Status: No, score=-7.0
X-W3C-Hub-Spam-Report: AWL=0.924, BAYES_00=-1.9, RCVD_IN_DNSWL_NONE=-0.0001, SPF_HELO_PASS=-0.001, SPF_PASS=-0.001, W3C_AA=-1, W3C_IRA=-1, W3C_IRR=-3, W3C_WL=-1
X-W3C-Scan-Sig: lisa.w3.org 1abYF3-0007Bk-Mi 561c9a7513a99672bdf51f3d2558bb4a
X-Original-To: ietf-http-wg@w3.org
Subject: Re: Fwd: Re: [tcpm] FW: Call for Adoption: TCP Tuning for HTTP
Archived-At: <http://www.w3.org/mid/20160303184418.GA18774@1wt.eu>
Resent-From: ietf-http-wg@w3.org
X-Mailing-List: <ietf-http-wg@w3.org> archive/latest/31166
X-Loop: ietf-http-wg@w3.org
Resent-Sender: ietf-http-wg-request@w3.org
Precedence: list
List-Id: <ietf-http-wg.w3.org>
List-Help: <http://www.w3.org/Mail/>
List-Post: <mailto:ietf-http-wg@w3.org>
List-Unsubscribe: <mailto:ietf-http-wg-request@w3.org?subject=unsubscribe>

On Thu, Mar 03, 2016 at 10:00:12AM -0800, Joe Touch wrote:
> > This point is important because it means some proxies often should
> > better wait for a passive close from a server than deciding to
> > close themselves.
> 
> Transparent proxies don't have that choice - they're governed by the
> semantics of the connection (whether EOF == close or not).
> 
> Non-transparent proxies shouldn't be opening one connection per
> transaction anyway; they ought to use one or more persistent connections
> and leave them open while they are interacting with the proxy. If they
> do this, there won't be an issue with who closes the connection because
> the close frequency should be very low.

It's not that black and white unfortunately, and in practice it's very
common to see proxies fail in the field above 500 connections per
second because their TCP stack was not appropriately tuned: with the
OS's default 60s TIME_WAIT timeout, they exhaust the default 28k
source ports. The first thing admins do in this case is enable
tcp_tw_recycle (which basically causes timewaits to be killed when
needed), and this appears to solve the problem while actually making
it even worse.

Among the solutions, we can count on:
  - putting idle connections back into pools in the hope that they
    will be reusable, though connection reuse rates remain low on
    average;
  - keeping a high enough keep-alive idle timeout on the proxy and a
    smaller one on the server (when the proxy is a gateway installed
    on the server side), hoping for the server to close first;
  - adding "Connection: close" to outgoing requests where appropriate,
    to ask the server to close after the response;
  - disabling lingering before closing when the HTTP state indicates
    the proxy has received all data;
  - doing whatever is imaginable to avoid closing first.

These are just general principles and many variants exist in various
contexts, but these are definitely important points that HTTP
implementors have to be aware of to avoid falling into the same traps
as their predecessors.

> >> In the bulk of HTTP connections, the server closes the connection,
> >> either to drop a persistent connection or to indicate "EOF" for a transfer.
> > 
> > Yes.
> > 
> >> Clients generally don't enter TIME-WAIT, so reducing the time they spend
> >> in a state they don't enter has no effect.
> > 
> > They can if they close first and that's exactly the problem we absolutely
> > want to avoid.
> 
> TW buildup has two effects:
> 
> 	1) limits the number connection rate to a given IP address

Exactly.

> 	2) consumes memory space (and potentially CPU resources)

This one is very cheap. A typical TW connection is just a few tens of bytes.

> Neither is typically an issue for HCI-based clients.

I don't know what you mean by HCI here, I'm sorry.

> Servers have much
> higher rate requirements for a given address when they act as a proxy
> and consume more memory overall because they interact with a much larger
> set of addresses.

Servers are not penalized at all by the connection rate, since TW only
limits the *outgoing* connection rate and not the incoming one. There
is never any ambiguity when a SYN is received regarding the
possibility that the connection still exists on the other side, which
is why TW connections are recycled when receiving a new SYN. Regarding
memory usage, it remains very low compared to the memory used by the
application itself. My personal record was 5.5 million timewaits on a
server at 90000 connections per second. That was only 300 MB of RAM on
a server with something like 64 GB. Not everyone needs 90k conns/s,
but nowadays everyone needs more than 500/s in any infrastructure.

> > There are certain cases where we had to put warnings in
> > rfc7230/7540, especially in relation with proxies. The typical case is
> > when a client closes a connection to a proxy (eg: a CONNECT tunnel) and
> > the proxy is supposed to in turn close the connection to the server. In
> > this case the proxy is the connection initiator, and it can very quickly
> > condemn all of its source ports by accumulating TIME_WAITs there. 
> 
> That speaks to a mismanagement of port resources. If they are allocated
> on a per-IP basis, they won't run out.

Yes they do, and that's the problem everyone running a load balancer
faces! With 64k ports and a 60s TIME_WAIT, the highest connection rate
you can reach per server is around 1000! That stopped being enough 15
years ago!

> The error is in treating the pool
> of source ports as global across all IP addresses, which TW does not
> require.

No, the problem is keeping a TW that blocks a precious resource: the
source port, which is addressed on only 16 bits!

> > I'm saying that by all means the
> > server must close first to keep the TIME_WAIT on its side and never
> > on the client side. A TIME_WAIT on a server is very cheap (a few tens
> > of bytes of memory at worst) 
> 
> It costs exactly the same on the client and the server when implemented
> correctly.

It costs the same, except that in one case it prevents a new
connection from being established while in the other case it does not.
I've seen people patch their kernels to lower the TIME_WAIT down to
2 seconds to address such shortcomings! Quite frankly, this workaround
*is* causing trouble!

> > and can be recycled when a new valid SYN
> > arrives.
> 
> The purpose of TW is to inhibit new SYNs involving the same port. When a
> new SYN arrives on another port, that has no impact on existing TWs.

I'm always talking about the same port. On today's hardware and
real-world workloads, source ports can be reused every second (60k
conns/s). Only the server holding the TIME_WAIT can tell whether an
incoming SYN is a retransmit or a new connection. The client knows
it's a new one but doesn't know whether the server is still in
LAST_ACK or has really closed, and due to this uncertainty it refrains
from connecting.

> > A TIME_WAIT on the client is not recyclable. That's why
> > TIME_WAIT is a problem for the client and not for the server.
> 
> See above; TW is *never* recyclable.

Yes, it definitely is on the server side, which is the point. When you
receive a SYN whose ISN is higher than the end of the current window,
it's a new connection by definition (as indicated in RFC 1122).

> > The problem is that in some cases it's suggested that the client
> > closes first and this causes such problems.
> 
> That actually helps the server (see our 99 Infocom paper).

Sure, since the server doesn't receive any more traffic from this
client, that definitely helps it. But the point is to ensure traffic
keeps flowing between the two hosts, not that one of them refrains
from connecting.

> > The only workaround for
> > the client is to close with an RST by disabling lingering,
> 
> That's not what SO_LINGER does. See:
> http://man7.org/linux/man-pages/man7/socket.7.html

But in practice it's used for this. When you disable lingering before
closing, you purge any pending data, with the effect that data just
received from the server, carrying an ACK for data you no longer hold,
triggers a reset. Yes, it's absolutely ugly, but you have no other
option when you are a client and are forced to close first by the
protocol. Don't forget that we're discussing a document whose outcome
should be that future protocols are designed to avoid such horrible
workarounds.
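As a sketch of that trick (the helper name is mine, but the abortive
close on a zero linger timeout is standard BSD-sockets behaviour,
including on Linux):

```c
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Force an RST instead of a normal FIN close: setting SO_LINGER with
 * l_onoff=1 and l_linger=0 makes close() discard unsent data and send
 * a reset, so this side of the connection never enters TIME_WAIT. */
static int close_with_rst(int fd)
{
    struct linger lg;
    memset(&lg, 0, sizeof(lg));
    lg.l_onoff  = 1;   /* enable lingering...                       */
    lg.l_linger = 0;   /* ...with a zero timeout => abortive close  */

    if (setsockopt(fd, SOL_SOCKET, SO_LINGER, &lg, sizeof(lg)) < 0)
        return -1;
    return close(fd);  /* sends RST instead of FIN */
}
```

This is precisely the "disable lingering before closing" workaround
from the list earlier in this message, with all the reliability
caveats discussed below.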

> > but that's
> > really ugly and unreliable : if the RST is lost while the server is
> > in LAST_ACK (and chances are that it will happen if the ACK was lost
> > already), the new connection will not open until this connection
> > expires.
> 
> TCP has a significant error regarding RSTs; the side that throws a RST
> on an existing connection should really go into TW - for all the same
> reasons that TW exists in the first place, to protect new connections
> from old data still in the network.

There are many other issues regarding RST. When you send an RST
through a firewall, you'd better cross your fingers that it is not
lost between the firewall and the destination, because chances are you
won't get a second chance. That's one of the reasons why I'd love to
live in a world where a client never has to close first.

> > Also, there are people who face this issue and work around them using
> > some OS-specific tunables which allow to blindly recycle some of these
> > connections and these people don't understand the impacts of doing so.
> 
> They really ought to read the literature. It's been out there so long it
> can probably apply for a driver's license by now.

When people see their production servers stall at 5% CPU because their
LBs or proxies can't open new connections while full of TIME_WAIT,
they ask their preferred search engine, which simply offers them
advice such as:
  - https://ihazem.wordpress.com/2012/02/07/reducing-time_wait-socket-connections-recyclereuse/
  - http://serverfault.com/questions/212093/how-to-reduce-number-of-sockets-in-time-wait
  - http://kaivanov.blogspot.fr/2010/09/linux-tcp-tuning.html
  - http://www.linuxbrigade.com/reduce-time_wait-socket-connections/
  - http://www.stolk.org/debian/timewait.html

Yes, they all involve the wrong and nasty workaround of allowing
outgoing TIME_WAIT connections to be recycled, which is the worst
possible thing to do (except the last one, which explains how to
modify the TW timeout in the kernel).

This is a *real* problem in the field, and it has been for a while,
because some protocols were designed for lower loads without imagining
that one day source ports would be reused this often. While we have to
deal with this as best we can, it's important to ensure the same
mistake is not made again in the future.

Regards,
Willy