[Tmrg] Proposal to increase TCP initial CWND
fred at cisco.com (Fred Baker) Mon, 19 July 2010 15:35 UTC
From: "fred at cisco.com"
Date: Mon, 19 Jul 2010 08:35:56 -0700
Subject: [Tmrg] Proposal to increase TCP initial CWND
In-Reply-To: <AANLkTinruuD89W7WLSiQDSruq2omiCr3gLWO6L5Jqsr-@mail.gmail.com>
References: <AANLkTil937lyUzRvUtdqd2qdl9RN7AZ-Mo_cT-dtmqXz@mail.gmail.com> <AANLkTimgBjTOZQDi8wHaum8KomAKo3TJOBUMeGXjEX1B@mail.gmail.com> <AANLkTinruuD89W7WLSiQDSruq2omiCr3gLWO6L5Jqsr-@mail.gmail.com>
Message-ID: <F973C758-C0F5-4FDD-B56F-7711868B8E70@cisco.com>
On Jul 18, 2010, at 2:58 AM, Lachlan Andrew wrote:

> Regarding whether it is a "single flow" or not, I'd argue that even if
> it is multiple TCP connections, they're all really part of the same
> "flow", since they're aiming to carry (different parts of) the same
> map.

Arg. We return to definitions of a flow...

I think the important thing here is the mathematics of it. If I look for an image at images.google.com, for example, I will download an html file followed by 20-or-so separate images. In tests I have run, those are often separate TCP sessions, all of them going through slow-start and possible later saw-toothing simultaneously. That basically means:

    client sends 20-or-so SYNs
    ...
    client receives 20-or-so SYN-ACKs
           sends 20-or-so ACKs
           sends 20-or-so HTTP GET requests
    ...
    client access interface starts receiving IW*20-or-so datagrams
           comprising those 20-or-so images

That behavior should be fairly apparent in the attached, which is a graphical representation of exactly this between Firefox at my home and images.google.com. I'll attach also a comment I made to Nandita last April; it references a number of other files which I don't have any more but could recreate if folks are interested.

Comparing that to the behavior of a single SCTP session with 20 streams, or the behavior of a single pipelined TCP session, I think the point should be apparent that these are mathematically not very similar, and that the value of IW is important.

Now, one will argue that getting a zillion thumbnail images is not much of an issue, as thumbnails are small. My concern is more essentially about discarding congestion control. If the average file moved is on the order of 15,000 bytes, IW=10, and typical message size is about 1500 bytes, my four function calculator says that we have in the typical case discarded congestion control entirely and instead are simply looking at the issues of retransmission. The average file is put entirely in flight during the first RTT.

In my email with Nandita and company, I asked about the possibilities of putting up a TCP Discard stream (representing competing traffic), letting it get established, and observing the effect on it when a flash crowd like this is imposed. I still think it would be an interesting test...
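Stated as code, a minimal sketch of that test might look like the following; the discard host, the URL list, and the timing constants are placeholders rather than anything from an actual run:

    # Sketch of the competing-flow test described above: keep one long-lived
    # bulk transfer to a discard service running, then fire a flash crowd of
    # parallel fetches at it and watch the bulk flow's per-second throughput.
    import socket
    import threading
    import time
    import urllib.request

    DISCARD_HOST = "192.0.2.1"   # placeholder: a host running discard (TCP port 9)
    URLS = ["http://example.net/img%d.jpg" % i for i in range(20)]  # placeholder

    def bulk_sender(stop, samples):
        # Long-lived TCP flow to the discard port, logging goodput once a second.
        s = socket.create_connection((DISCARD_HOST, 9))
        payload = b"x" * 1460
        sent, t0 = 0, time.time()
        while not stop.is_set():
            sent += s.send(payload)
            now = time.time()
            if now - t0 >= 1.0:
                samples.append(sent / (now - t0))
                sent, t0 = 0, now
        s.close()

    def fetch(url):
        urllib.request.urlopen(url).read()   # pull the whole object, as a browser would

    stop, samples = threading.Event(), []
    bulk = threading.Thread(target=bulk_sender, args=(stop, samples))
    bulk.start()
    time.sleep(10)                           # let the bulk flow reach steady state
    crowd = [threading.Thread(target=fetch, args=(u,)) for u in URLS]
    for c in crowd:
        c.start()                            # 20-or-so simultaneous slow-starts
    for c in crowd:
        c.join()
    time.sleep(10)                           # watch the bulk flow recover, or not
    stop.set()
    bulk.join()
    print([int(bps) for bps in samples])     # throughput before/during/after the crowd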
Begin forwarded message:

> From: Fred Baker <fred at cisco.com>
> Date: April 1, 2010 3:33:43 PM PDT
> To: Nandita Dukkipati <nanditad at google.com>
> Cc: draft-hkchu-tcpm-initcwnd at tools.ietf.org, Jerry Chu <hkchu at google.com>, Yuchung Cheng <ycheng at google.com>
> Subject: Re: cwnd=10
>
> On Mar 31, 2010, at 6:50 PM, Nandita Dukkipati wrote:
>
>> Fred, Hi.
>> Some responses inline...
>>
>>> Where I get concerned, as I mentioned last week, is when the receiver is outside the high bandwidth corridor that stretches from Australia and eastern Asia across anglophone North America and through western Europe. I have attached a map showing the corridor in very general terms, and making an error (oversimplifying) with respect to Mexico and the Caribbean. Geographically, this is a fairly narrow scope, as it leaves out the entire continents of South America and Africa and most of Asia (and btw Polynesia and Antarctica). It does represent the more developed countries of the world, however, and much of Google's markets. I'd be willing to bet that the vast majority of your tests have been to places in that corridor. Outside the corridor, I have some grave concerns.
>>
>> Great point! Knowing that this may in fact be the Achilles heel in standardization of cwnd=10, we deliberately chose test data centers that also serve traffic to the above regions, including Africa and South America. The data presented (link: paper) is sliced and diced by network properties such as bandwidth (BW), RTT, and BDP. Assuming that the primary cause for concern in developing countries is low BW connectivity, the paper specifically lists the observations for low BW subnets, including those with <=56Kbps. That said, we can also break the performance down by specific countries/continents, something that we are in the process of performing. We are also actively pursuing tests with clients in Africa to further validate our original numbers (that IW=10 doesn't hurt).
>
> In the paper and in the slides, what I observed you testing was the impact on your own sessions of changing the initial window. Did you do any testing of the impact on other sessions? For example, on a 56 KBPS link with its (Cisco low end default) 40 packet buffers, setting up half a dozen ten-packet bursts should show up in a loss rate to the sixth session. If there was a competing session, say an FTP that had been happily moving a file for several seconds and found itself dumped on with sixty 125-millisecond packets (7.5 seconds worth of traffic), I could imagine the competing TCP session being lost entirely or at least being badly hurt. Did you try setting up a TCP Discard session in parallel and testing the impact on other users?
>
> You might be aware of issues between BitTorrent and Comcast, currently under discussion at the FCC, around the impact of BitTorrent on other users. It could extend to Google.
>
>>> There are two things I have suggested in the past and would note for the record:
>>>
>>> 1) I think any given host would do well to maintain a certain amount of history somewhere; if cwnd/rtt is an approximation of "a reasonable connection's characteristics" available to a destination, then recording that ratio and referring to it when starting a TCP session would go a long way in the direction you are suggesting. If you share a LAN with a device, I could imagine setting cwnd=rwin and letting SACK and the fast retransmit heuristic pick up the pieces. More generally, if you were to record the ratio (as I believe Linux does) and initially set cwnd to some fraction of it times the RTT measured on the syn/syn-ack exchange, I imagine you would find cwnd low when it needs to be low and high when it is likely to work out. The secret sauce, so to speak, is in the fraction - is that half?
>>>
>>> 2) A major problem that TCP/SCTP have in the initial burst is that it is a burst. Ack-paced traffic tends to be spread out throughout the RTT, but the first few bursts tend to pile up, which can result in a momentary overload of switch/router/modem queues. Which makes me wonder: what would happen if we presumed some basic rate (perhaps derived from the saved cwnd and rtt or cwnd/rtt mentioned in [1]) and sent the initial burst *at*that*rate*? Picking numbers out of the air that happen to be simple to calculate with, if we found that someone typically had 1.2 MBPS to use and that their mss was 1500 bytes (12000 bits), that would suggest that we might try sending one segment every 10 ms until we got the first Ack, and follow our favorite congestion control algorithm after that. If someone has a long RTT, that results in us having a lot of segments (in separate packet trains) in flight pretty quickly, but if the RTT is short it starts with an initial amount more appropriate to the RTT.
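A rough sketch of how suggestions (1) and (2) might fit together, assuming a per-destination cache keyed by address, a guessed fraction of one half, and an abstract send hook (all placeholders rather than anything measured or implemented):

    # Suggestion (1): cache cwnd/rtt per destination; derive a starting cwnd
    # from the cached ratio and the RTT measured on the SYN/SYN-ACK exchange.
    # Suggestion (2): pace the initial window at the cached rate rather than
    # emitting it as one back-to-back burst.
    import time

    MSS = 1500                  # bytes per segment, as in the example above
    FRACTION = 0.5              # the "secret sauce" fraction; one half is a guess
    history = {}                # destination -> cached cwnd/rtt, in bytes per second

    def remember(dest, cwnd_bytes, rtt_s):
        # Record the ratio, not cwnd itself, when a connection closes.
        history[dest] = cwnd_bytes / rtt_s

    def initial_cwnd(dest, syn_rtt_s, default=3 * MSS):
        rate = history.get(dest)
        if rate is None:
            return default      # no history: fall back to a conventional IW
        return max(default, int(FRACTION * rate * syn_rtt_s))

    def paced_initial_send(dest, data, syn_rtt_s, send_segment):
        # Emit the initial window one segment at a time at the cached rate.
        rate = history.get(dest)
        window = initial_cwnd(dest, syn_rtt_s)
        interval = MSS / rate if rate else 0.0
        for off in range(0, min(window, len(data)), MSS):
            send_segment(data[off:off + MSS])
            time.sleep(interval)

    # With the numbers above: 1.2 Mbps is 150000 bytes/s, so the pacing
    # interval works out to 1500 / 150000 = 10 ms per segment.
    remember("example-peer", cwnd_bytes=15000, rtt_s=0.1)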
>>
>> We have tried the approach in [1] and the presentation slides to ICCRG/TCPM cover the pros and cons of the approach. Overall, there is a tradeoff between the cache size used and the hit rate; cache hit rate is low because of load balancers before the front-ends.
>
> Looking through the presentation and paper...
>
> "IW=10 saves up to four round trips" seems like an interesting statement. If IW=1, the second RTT carries 2 segments, the third 4, and the fourth 8. Converting to IW=10 means the first 1+2+4+3 segments go in the first RTT and the other five from the fourth RTT (plus however many more) in the second RTT. If IW=2, the sequence is 2, 4, 8, and conversion to IW=10 means that the first 2+4+4 segments go in the first RTT and the next six (plus whatever) in the second RTT. With CUBIC, it could rise more quickly. I see a reduction of two or three RTTs; where is there a reduction of four?
>
> More importantly, the statement on slide 5 is that this reduces the trend of browsers opening up multiple TCP sessions. I'm not sure I see that. Take a look if you would at
>
> ftp://ftpeng.cisco.com/fred/google/firefox-tcp.pdf
> ftp://ftpeng.cisco.com/fred/google/safari-tcp.pdf
> ftp://ftpeng.cisco.com/fred/google/google-images-firefox.zip
> ftp://ftpeng.cisco.com/fred/google/google-images-safari.zip
> ftp://ftpeng.cisco.com/fred/google/google-images.xlsx.zip
>
> The two pdfs are the graphics; the others are the supporting data. I spoke for an hour with Mike Belshe at the IETF, and he called out Google Images as an example of a web browser opening lots of sessions. OK, great example, let's play. So, I opened Safari, looked up "April Fools Pictures" on Google Images, cleared my cache, opened a tcpdump, hit "refresh", and turned off the tcpdump. Then I repeated the experiment with Firefox. What I see is signature HTTP behavior:
>
> - the browser downloaded the main page
> - the browser opened some number of TCP sessions to "get" objects
>       SYN, SYN-ACK, ACK, "GET"
> - the browser received each of the objects in question
>       Data, Ack, Data, Data, Ack, ...
>
> You can see that very clearly in the pdfs. I have to ask: what would the fact that Google has bumped IW to 10 have to do with how many sessions the browser opened? Google didn't even *try* to deliver the first of the requested objects until the last of the sessions was open.
>
> Had it used SCTP, those would be streams in a session. I think I can figure out (your slide 5) how one builds the congestion control for that.
>
> BTW, it looks like you have already increased the initial burst at google.
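For checking round-trip arithmetic like this, a throwaway model is handy. This one assumes pure slow start with per-RTT doubling and no delayed acks or losses, which is of course a simplification:

    # Count the RTTs idealized slow start needs to deliver a flow of
    # `segments` segments from a given initial window. The window doubles
    # every RTT and nothing is lost; CUBIC would grow differently.
    def rtts_to_deliver(segments, iw):
        rtts, window, sent = 0, iw, 0
        while sent < segments:
            sent += window
            window *= 2
            rtts += 1
        return rtts

    for size in (5, 10, 15, 25, 128):
        saved_vs_1 = rtts_to_deliver(size, 1) - rtts_to_deliver(size, 10)
        saved_vs_3 = rtts_to_deliver(size, 3) - rtts_to_deliver(size, 10)
        print(size, saved_vs_1, saved_vs_3)
    # Enumerating flow sizes this way shows exactly where the model does
    # and does not hand back a fourth round trip relative to a small IW.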
> Picking up one of those sessions at random, and noting that my RTT to Google is very long compared to my RTT to work (I have a split tunnel, so that means that the google data center is really far away):
>
> --- bw-in-f104.1e100.net ping statistics ---
> 10 packets transmitted, 10 packets received, 0.0% packet loss
> round-trip min/avg/max/stddev = 186.227/198.954/209.654/8.509 ms
>
> --- irp-view13.cisco.com ping statistics ---
> 10 packets transmitted, 10 packets received, 0.0% packet loss
> round-trip min/avg/max/stddev = 19.579/25.110/36.951/5.107 ms
>
> I see four bursts of activity in this session:
>
> #1: SYN
> 09:53:38.102027 IP stealth-10-32-244-219.cisco.com.61348 > bw-in-f104.1e100.net.http: Flags [S], seq 567668515, win 65535, options [mss 1460,nop,wscale 3,nop,nop,TS val 418428163 ecr 0,sackOK,eol], length 0
>
> #2: SYN-ACK, ACK, "GET"
> 09:53:38.303788 IP bw-in-f104.1e100.net.http > stealth-10-32-244-219.cisco.com.61348: Flags [S.], seq 1496444701, ack 567668516, win 5672, options [mss 1360,sackOK,TS val 765576747 ecr 418428163,nop,wscale 6], length 0
> 09:53:38.303856 IP stealth-10-32-244-219.cisco.com.61348 > bw-in-f104.1e100.net.http: Flags [.], ack 1, win 65535, options [nop,nop,TS val 418428166 ecr 765576747], length 0
> 09:53:38.303956 IP stealth-10-32-244-219.cisco.com.61348 > bw-in-f104.1e100.net.http: Flags [P.], seq 1:742, ack 1, win 65535, options [nop,nop,TS val 418428166 ecr 765576747], length 741
>
> #3: I get four data segments and ack them
> 09:53:38.507269 IP bw-in-f104.1e100.net.http > stealth-10-32-244-219.cisco.com.61348: Flags [.], ack 742, win 112, options [nop,nop,TS val 765576951 ecr 418428166], length 0
> 09:53:38.509821 IP bw-in-f104.1e100.net.http > stealth-10-32-244-219.cisco.com.61348: Flags [.], seq 1:1349, ack 742, win 112, options [nop,nop,TS val 765576952 ecr 418428166], length 1348
> 09:53:38.509824 IP bw-in-f104.1e100.net.http > stealth-10-32-244-219.cisco.com.61348: Flags [.], seq 1349:2697, ack 742, win 112, options [nop,nop,TS val 765576952 ecr 418428166], length 1348
> 09:53:38.509894 IP stealth-10-32-244-219.cisco.com.61348 > bw-in-f104.1e100.net.http: Flags [.], ack 2697, win 65209, options [nop,nop,TS val 418428168 ecr 765576952], length 0
> 09:53:38.510106 IP bw-in-f104.1e100.net.http > stealth-10-32-244-219.cisco.com.61348: Flags [.], seq 2697:4045, ack 742, win 112, options [nop,nop,TS val 765576952 ecr 418428166], length 1348
> 09:53:38.510204 IP stealth-10-32-244-219.cisco.com.61348 > bw-in-f104.1e100.net.http: Flags [.], ack 4045, win 65535, options [nop,nop,TS val 418428168 ecr 765576952], length 0
> 09:53:38.510470 IP bw-in-f104.1e100.net.http > stealth-10-32-244-219.cisco.com.61348: Flags [.], seq 4045:5393, ack 742, win 112, options [nop,nop,TS val 765576952 ecr 418428166], length 1348
> 09:53:38.510508 IP stealth-10-32-244-219.cisco.com.61348 > bw-in-f104.1e100.net.http: Flags [.], ack 5393, win 65378, options [nop,nop,TS val 418428168 ecr 765576952], length 0
>
> #4: I get a final segment and ack it
> 09:53:38.714801 IP bw-in-f104.1e100.net.http > stealth-10-32-244-219.cisco.com.61348: Flags [P.], seq 5393:5966, ack 742, win 112, options [nop,nop,TS val 765577158 ecr 418428168], length 573
> 09:53:38.714924 IP stealth-10-32-244-219.cisco.com.61348 > bw-in-f104.1e100.net.http: Flags [.], ack 5966, win 65474, options [nop,nop,TS val 418428170 ecr 765577158], length 0
>
> No final (FIN)
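For what it's worth, that burst grouping falls out mechanically if you cluster the packet timestamps wherever the gap approaches the ~200 ms path RTT. The times below are transcribed from the trace above; the 100 ms threshold is an arbitrary cut well under the RTT:

    # Group the packet arrival times from the trace into bursts: any gap
    # close to one RTT starts a new burst.
    times = [38.102027,                                   # SYN
             38.303788, 38.303856, 38.303956,             # SYN-ACK, ACK, GET
             38.507269, 38.509821, 38.509824, 38.509894,
             38.510106, 38.510204, 38.510470, 38.510508,  # data burst plus acks
             38.714801, 38.714924]                        # final segment plus ack

    bursts, current = [], [times[0]]
    for prev, t in zip(times, times[1:]):
        if t - prev > 0.1:          # gap of about one RTT: new burst
            bursts.append(current)
            current = []
        current.append(t)
    bursts.append(current)
    print(len(bursts))              # 4, matching the bursts called out above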
>
>> We understand the potential of pkt pacing as a way to mitigate bursts, but are also concerned with the complexity that pacing introduces to the OS implementations. There hasn't been any evidence so far of its necessity for a modest IW increase on the order of 10 segments, and so we'd like to leave pacing as the last resort (if need be).
>
> OK. What does it take to get you to look at SCTP?
>
>> -Nandita

-------------- next part --------------
A non-text attachment was scrubbed...
Name: firefox-tcp.pdf
Type: application/pdf
Size: 68800 bytes
Desc: not available
Url: http://mailman.ICSI.Berkeley.EDU/pipermail/tmrg-interest/attachments/20100719/a18e8e0a/attachment-0001.pdf