Re: Fwd: Re: [tcpm] FW: Call for Adoption: TCP Tuning for HTTP

Joe Touch <touch@isi.edu> Thu, 03 March 2016 19:22 UTC

To: Willy Tarreau <w@1wt.eu>
References: <56D74C23.5010705@isi.edu> <56D76A7E.7090507@isi.edu> <20160302232125.GA18275@1wt.eu> <56D77892.2000308@isi.edu> <20160303065545.GA18412@1wt.eu> <56D87BAC.4060204@isi.edu> <20160303184418.GA18774@1wt.eu>
Cc: touch@isi.edu, ietf-http-wg@w3.org
From: Joe Touch <touch@isi.edu>
Message-ID: <56D88D58.5060406@isi.edu>
Date: Thu, 03 Mar 2016 11:15:36 -0800
In-Reply-To: <20160303184418.GA18774@1wt.eu>
Subject: Re: Fwd: Re: [tcpm] FW: Call for Adoption: TCP Tuning for HTTP
Archived-At: <http://www.w3.org/mid/56D88D58.5060406@isi.edu>


On 3/3/2016 10:44 AM, Willy Tarreau wrote:
> On Thu, Mar 03, 2016 at 10:00:12AM -0800, Joe Touch wrote:
>>> This point is important because it means some proxies are often
>>> better off waiting for a passive close from the server than deciding
>>> to close themselves.
>>
>> Transparent proxies don't have that choice - they're governed by the
>> semantics of the connection (whether EOF == close or not).
>>
>> Non-transparent proxies shouldn't be opening one connection per
>> transaction anyway; they ought to use one or more persistent connections
>> and leave them open while they are interacting with the proxy. If they
>> do this, there won't be an issue with who closes the connection because
>> the close frequency should be very low.
> 
> It's not that black and white unfortunately, and in practice it's very
> common to see proxies fail in the field above 500 connections per second
> because their TCP stack was not appropriately tuned: with the default
> 60s TIME_WAIT timeout of their OS, they exhaust the default 28k source
> ports. The first thing admins do in this case is enable tcp_tw_recycle
> (which basically causes timewaits to be killed when needed), and this
> appears to solve the situation while actually making it even worse.

Those proxies are acting as servers. We've known about the server TW for
a long time.
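
(As a sanity check on those numbers, assuming the OS-default ephemeral
range of roughly 28k ports and a 60 s TIME_WAIT:

    ~28,000 ports / 60 s TIME_WAIT  =  about 470 new connections/s
     65,535 ports / 60 s TIME_WAIT  =  about 1,090 new connections/s

per (source IP, destination IP, destination port) tuple - which is
exactly why things fall over around 500 connections per second.)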

> Among the solutions, we can count on:
>   - putting idle connections back into pools in the hope that they can
>     be reused. The connection reuse rate still remains low on average.
>   - keeping a high enough keep-alive idle timeout on the proxy and a
>     smaller one on the server (when the proxy is a gateway installed
>     on the server side), hoping for the server to close first

This just moves the problem to the server where it is much worse in
aggregate.

>   - adding "Connection: close" to outgoing requests where appropriate,
>     to ask the server to close after the response.

Same thing.

>   - disabling lingering before closing when the HTTP state indicates
>     the proxy has received all data

Linger has nothing to do with this issue.

>   - doing whatever is imaginable to avoid closing first
> 
> These are just general principles, and many derivatives may exist in
> various contexts, but they are definitely important points that HTTP
> implementors have to be aware of if they are not to fall into the same
> traps as those who came before them.

The important issue is largely the buildup of TW at the server, and this
has been known for a long time, as have the mitigations.

>>>> In the bulk of HTTP connections, the server closes the connection,
>>>> either to drop a persistent connection or to indicate "EOF" for a transfer.
>>>
>>> Yes.
>>>
>>>> Clients generally don't enter TIME-WAIT, so reducing the time they spend
>>>> in a state they don't enter has no effect.
>>>
>>> They can if they close first and that's exactly the problem we absolutely
>>> want to avoid.
>>
>> TW buildup has two effects:
>>
>>   1) limits the rate of new connections to a given IP address
> 
> Exactly.
> 
>> 	2) consumes memory space (and potentially CPU resources)
> 
> This one is very cheap. A typical TW connection is just a few tens of bytes.
> 
>> Neither is typically an issue for HCI-based clients.
> 
> I don't know what you call HCI here, I'm sorry.

Human-computer interface, i.e., client whose traffic is driven by a
person clicking on things.

>> Servers have much
>> higher rate requirements for a given address when they act as a proxy
>> and consume more memory overall because they interact with a much larger
>> set of addresses.
> 
> Servers are not penalized at all by the connection-rate limit, since it
> only affects the *outgoing* connection rate and not the incoming one.

When TW accumulates, new connections are rejected at the side that
accumulates the TW. That can happen on either side.
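
For what it's worth, the buildup is easy to watch on Linux. A minimal
sketch that counts sockets in TIME-WAIT (state 06 in /proc/net/tcp;
IPv6 would need the same pass over /proc/net/tcp6):

    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        FILE *f = fopen("/proc/net/tcp", "r");
        char line[512];
        int tw = 0;

        if (!f)
            return 1;
        fgets(line, sizeof(line), f);       /* skip the header row */
        while (fgets(line, sizeof(line), f)) {
            char state[8];
            /* fields: sl local_address rem_address st ... */
            if (sscanf(line, "%*s %*s %*s %7s", state) == 1 &&
                strcmp(state, "06") == 0)   /* 06 == TIME_WAIT */
                tw++;
        }
        fclose(f);
        printf("TIME-WAIT sockets: %d\n", tw);
        return 0;
    }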

> There's
> never any ambiguity, when a SYN is received, about whether the
> connection still exists on the other side, 

That's what the TW is intended to inhibit.

> which is why TW connections
> are recycled when receiving a new SYN. 

That's exactly the opposite of what TW does.

> Regarding the memory usage, it
> remains very low compared to the memory used by the application itself.
> My personal record was 5.5 million timewaits on a server handling 90000
> connections per second. That was only 300 MB of RAM on a server with
> something like 64 GB. Not everyone needs 90k conns/s, but everyone needs
> more than 500/s nowadays in any infrastructure.

Memory impact depends on the device.

>>> There are certain cases where we had to put warnings in
>>> rfc7230/7540, especially in relation with proxies. The typical case is
>>> when a client closes a connection to a proxy (eg: a CONNECT tunnel) and
>>> the proxy is supposed to in turn close the connection to the server. In
>>> this case the proxy is the connection initiator, and it can very quickly
>>> condemn all of its source ports by accumulating TIME_WAITs there. 
>>
>> That speaks to a mismanagement of port resources. If they are allocated
>> on a per-IP basis, they won't run out.
> 
> Yes they do, and that's the problem everyone running a load balancer
> faces! The highest connection rate you can reach per server is around
> 1000 with 64k ports! That stopped being enough 15 years ago!

That rate is nearly never seen on clients (which is the context of this
part of the thread); it's certainly seen on servers and proxies.

>> The error is in treating the pool
>> of source ports as global across all IP addresses, which TW does not
>> require.
> 
> No, the problem is keeping a TW that blocks a precious resource: the
> source port, which is only a 16-bit field!

The TW should be blocking based on the socket pair - i.e., source IP,
dest IP, source port, dest port. Only one of these is fixed for HTTP
(the dest port). At either the client or server, typically the other
side's IP address also varies more than enough to avoid this issue - the
primary exception being proxy interactions.
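
For illustration, here's a minimal sketch (Linux/BSD sockets, error
handling omitted, addresses are placeholders) of allocating outgoing
connections per source IP rather than from one global ephemeral pool,
which is what keeps the full 4-tuple space from running out:

    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <string.h>
    #include <sys/socket.h>

    /* Open a connection to dst_ip:dst_port from a specific local IP.
     * Binding to port 0 lets the stack pick an ephemeral port from that
     * local address's own pool, so a proxy with several local addresses
     * multiplies its usable (src IP, src port) space. */
    int connect_from(const char *src_ip, const char *dst_ip, int dst_port)
    {
        int fd = socket(AF_INET, SOCK_STREAM, 0);

        struct sockaddr_in src;
        memset(&src, 0, sizeof(src));
        src.sin_family = AF_INET;
        src.sin_port = 0;
        inet_pton(AF_INET, src_ip, &src.sin_addr);
        bind(fd, (struct sockaddr *)&src, sizeof(src));

        struct sockaddr_in dst;
        memset(&dst, 0, sizeof(dst));
        dst.sin_family = AF_INET;
        dst.sin_port = htons(dst_port);
        inet_pton(AF_INET, dst_ip, &dst.sin_addr);
        connect(fd, (struct sockaddr *)&dst, sizeof(dst));
        return fd;
    }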

>>> I'm saying that by all means the
>>> server must close first to keep the TIME_WAIT on its side and never
>>> on the client side. A TIME_WAIT on a server is very cheap (a few tens
>>> of bytes of memory at worst) 
>>
>> It costs exactly the same on the client and the server when implemented
>> correctly.
> 
> It costs the same except that in one case it prevents a connection from
> being established while in the other case it does not.

TW prevents a connection in either direction if implemented according to
spec.

> I've seen people
> patch their kernels to lower the TIME_WAIT down to 2 seconds to address
> such shortcomings! Quite frankly, this workaround *is* causing trouble!

Again, we've known that for nearly two decades.

>>> and can be recycled when a new valid SYN
>>> arrives.
>>
>> The purpose of TW is to inhibit new SYNs involving the same port. When a
>> new SYN arrives on another port, that has no impact on existing TWs.
> 
> I'm always talking about the same port. On today's hardware and real-world
> workloads, source ports can be reused every second (60k conns/s). Only the
> server with TIME_WAIT can tell whether or not an incoming SYN is a retransmit
> or a new one. The client knows it's a new one but doesn't know if the server
> is still in LAST_ACK or has really closed, and due to this uncertainty it
> refrains from connecting.

If the server initiates the close, it never goes into LAST_ACK (the
client would). A node gets stuck in LAST_ACK only if the final ACK of
the close exchange is lost, which should not be the typical case.
However, it shouldn't matter that the client refrains from connecting at
that point - the other side should be in TIME-WAIT anyway.
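
(For reference, the normal state progression when the server closes
first:

    server: ESTABLISHED -> FIN-WAIT-1 -> FIN-WAIT-2 -> TIME-WAIT -> CLOSED
    client: ESTABLISHED -> CLOSE-WAIT -> LAST-ACK -> CLOSED

The client only passes through LAST-ACK while waiting for the server's
final ACK, and stays there only if that ACK is lost.)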

>>> A TIME_WAIT on the client is not recyclable. That's why
>>> TIME_WAIT is a problem for the client and not for the server.
>>
>> See above; TW is *never* recyclable.
> 
> Yes it definitely is on the server side, which is the point. When you
> receive a SYN whose ISN is higher than the end of the current window,
> it's a new one by definition (as indicated in RFC1122).

RFC1122's statement on TIME-WAIT has nothing to do with the ISN the
server receives; it has to do with the ISN the server assigns. You don't
"recycle" the TIME-WAIT; you effectively reopen the connection with the
same port pair. But that then requires keeping more state in the TW.

>>> The problem is that in some cases it's suggested that the client
>>> closes first and this causes such problems.
>>
>> That actually helps the server (see our 99 Infocom paper).
> 
> Sure since the server doesn't receive any more traffic from this client,
> that definitely helps, but the point is to ensure traffic flows between
> the two hosts, not that one of them refrains from connecting.

Neither side "refrains from connecting"; TW is intended to block certain
connection attempts, not prevent them from being attempted.

>>> The only workaround for
>>> the client is to close with an RST by disabling lingering,
>>
>> That's not what SO_LINGER does. See:
>> http://man7.org/linux/man-pages/man7/socket.7.html
> 
> But in practice it's used for this.

That practice is incorrect and should never be recommended.

> When you disable lingering before
> closing, you purge any pending data, which has the benefit that data
> received from the server carrying an ACK for data you no longer have
> triggers a reset. 

Linger is intended to let the other side keep sending you data after you
issue a close - that's entirely valid TCP behavior.

I.e., when you use the option as intended, it won't accomplish what you
want.
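
For concreteness, a minimal sketch of the two settings being argued
about (Linux sockets, error handling omitted, helper names are just
illustrative):

    #include <sys/socket.h>
    #include <unistd.h>

    /* The "disable lingering" hack: close() discards unsent data and
     * emits an RST instead of a FIN, so this side never enters
     * TIME-WAIT. */
    static void close_with_rst(int fd)
    {
        struct linger lg = { .l_onoff = 1, .l_linger = 0 };
        setsockopt(fd, SOL_SOCKET, SO_LINGER, &lg, sizeof(lg));
        close(fd);
    }

    /* The documented use of the option: close() blocks for up to
     * l_linger seconds while queued data drains, and the normal FIN
     * exchange happens as usual (with TIME-WAIT on whichever side
     * closed first). */
    static void close_gracefully(int fd)
    {
        struct linger lg = { .l_onoff = 1, .l_linger = 10 };
        setsockopt(fd, SOL_SOCKET, SO_LINGER, &lg, sizeof(lg));
        close(fd);
    }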

> Yes, it's absolutely ugly, but you have no
> other option when you are a client and are forced to close first by the
> protocol. Don't forget that we're discussing a document whose outcome
> should be that future protocols are designed to avoid such horrible
> workarounds.

The doc should provide correct advice, and certainly ought not to
reinvent 17-year-old wheels.

>>> but that's
>>> really ugly and unreliable: if the RST is lost while the server is
>>> in LAST_ACK (and chances are it will be, if the ACK was already lost),
>>> the new connection will not open until this connection expires.
>>
>> TCP has a significant error regarding RSTs; the side that throws a RST
>> on an existing connection should really go into TW - for all the same
>> reasons that TW exists in the first place, to protect new connections
>> from old data still in the network.
> 
> There are many other issues regarding RST. When you send an RST through
> a firewall, you'd better cross your fingers that it isn't lost between
> the firewall and the destination; otherwise chances are that you won't
> get a second chance. 

Agreed - RSTs ought to be used only in "emergencies" to abort
connections, not as a hack to jump around the protocol space.

> That's one of the reasons why I'd love to live in a
> world where a client never has to close first.

The RST is bad in either direction, and the world you want to live in
is the one we had 17 years ago, the one that prompted those papers.

>>> Also, there are people who face this issue and work around it using
>>> some OS-specific tunables that blindly recycle some of these
>>> connections, without understanding the impact of doing so.
>>
>> They really ought to read the literature. It's been out there so long it
>> can probably apply for a driver's license by now.
> 
> When people see their production servers stall at 5% CPU because their LBs
> or proxies can't open new connections while full of TIME_WAIT, what they
> do is ask their preferred search engine, which simply offers them advice
> like this:
...

They ought to do better research (and so should this doc, IMO).

...
> This is a *real* problem in the field, and has been for a while, because
> some protocols were designed for lower loads without imagining that one
> day source ports would be reused this often. While we have to deal with
> this as best we can, it's important to ensure the same mistake is not
> made again in the future.

Agreed. Again, please refer to those two papers, among others.

Joe