Re: [tcpm] [Tmrg] Proposal to increase TCP initial CWND

Jerry Chu <hkchu@google.com> Thu, 04 November 2010 02:29 UTC

In-Reply-To: <F973C758-C0F5-4FDD-B56F-7711868B8E70@cisco.com>
References: <AANLkTil937lyUzRvUtdqd2qdl9RN7AZ-Mo_cT-dtmqXz@mail.gmail.com> <AANLkTimgBjTOZQDi8wHaum8KomAKo3TJOBUMeGXjEX1B@mail.gmail.com> <AANLkTinruuD89W7WLSiQDSruq2omiCr3gLWO6L5Jqsr-@mail.gmail.com> <F973C758-C0F5-4FDD-B56F-7711868B8E70@cisco.com>
Date: Wed, 03 Nov 2010 19:29:44 -0700
Message-ID: <AANLkTin8sxmAJshoZ2PKnYViwyna+i1H+L7QE_H3MTf3@mail.gmail.com>
From: Jerry Chu <hkchu@google.com>
To: Fred Baker <fred@cisco.com>
Cc: Lachlan Andrew <lachlan.andrew@gmail.com>, tcpm@ietf.org, ywang15@ncsu.edu, tmrg <tmrg-interest@icsi.berkeley.edu>, Matt Mathis <mattmathis@google.com>
Subject: Re: [tcpm] [Tmrg] Proposal to increase TCP initial CWND

+tcpm@ietf.org, ywang15@ncsu.edu

[resuming an old but unfinished thread that was partly hampered by tmrg list
problems]

We've been working hard over the past couple of months to study a number of
questions, concerns, and issues people have brought up. We were glad that
Yaogong Wang, a PhD student of Prof. Injong Rhee at NCSU, continued to
collaborate with us after his internship ended in the summer and set up a
second testbed at Dr. Rhee's lab. The test results from his testbed are
nearly complete and can be seen at

http://research.csc.ncsu.edu/netsrv/?q=content/iw10

In the meantime we continue to run tests and compare results from both
sides. Unfortunately, tests on slow links have been especially
time-consuming, simply because they run very slowly; it takes a long time
to get quality data (and we still see large variations). But in light of
the upcoming IETF meeting we will try to wrap up and post some summary
remarks soon.

I do want to answer a simple question you brought up a while back (see
below): just how many RTTs might IW10 save compared to IW3?

Your argument below that at most 3 RTTs are saved neglects delayed ACKs,
which change the formula. I'm including a graph showing that a saving of
3 RTTs is the most common case, followed by 2 RTTs, then 1 and 4 RTTs.
(The same numbers can be found in section 5.2 of
draft-ietf-tcpm-initcwnd-00.txt.) These numbers assume a somewhat "pure"
delayed-ACK policy with a long delayed-ACK timer. In reality,
implementations tweak the delayed-ACK logic in different ways: Linux has a
"quick ack" mode at the beginning of a new connection, and the Windows
stack has its own delayed-ACK heuristics. So in real life I believe the
RTT savings will fall somewhere in between. (Ideally I'd like to measure
against real stacks to count the numbers, but I just don't have the time
to do that right now.)
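
To give a feel for where those numbers come from, here is a toy model, only
for illustration: it assumes one ACK per two segments, and it ignores the
delayed-ACK timer, quick-ack heuristics and losses, so it won't reproduce
the draft's table exactly, but it shows how the ACK policy changes the
count of slow-start rounds:

    import math

    def rounds_to_deliver(total_segments, iw,
                          acks_per_round=lambda cwnd: math.ceil(cwnd / 2)):
        # Count the RTT rounds slow start needs to put total_segments on
        # the wire.  Each round the sender transmits a full cwnd, the
        # receiver generates one ACK per two segments (delayed ACKs), and
        # each ACK grows cwnd by one segment.
        cwnd, sent, rounds = iw, 0, 0
        while sent < total_segments:
            sent += cwnd
            rounds += 1
            cwnd += acks_per_round(cwnd)
        return rounds

    for size in (10, 30, 47, 100):   # response sizes, in segments
        r3, r10 = rounds_to_deliver(size, 3), rounds_to_deliver(size, 10)
        print(f"{size:3d} segments: IW3 {r3} rounds, IW10 {r10}, saving {r3 - r10}")

With one ACK per segment instead (acks_per_round=lambda cwnd: cwnd), cwnd
doubles every round and the saving settles around 2 RTTs; the slower 1.5x
per-round growth under delayed ACKs is what pushes the saving toward 3 RTTs.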

Jerry

On Mon, Jul 19, 2010 at 8:35 AM, Fred Baker <fred@cisco.com> wrote:

>
> On Jul 18, 2010, at 2:58 AM, Lachlan Andrew wrote:
>
> > Regarding whether it is a "single flow" or not, I'd argue that even if
> > it is multiple TCP connections, they're all really part of the same
> > "flow", since they're aiming to carry (different parts of) the same
> > map.
>
> Arg. We return to definitions of a flow...
>
> I think the important thing here is the mathematics of it. If I look for an
> image at images.google.com, for example, I will download an HTML file
> followed by 20-or-so separate images. In tests I have run, those are often
> separate TCP sessions, all of them going through slow start and possibly
> later saw-toothing simultaneously. That basically means:
>
> client sends 20-or-so SYNs
>   ...
> client receives 20-or-so SYN-ACKS
>       sends 20-or-so Acks
>       sends 20-or-so HTTP GET requests
>   ...
> client access interface starts receiving IW*20-or-so datagrams comprising
> those 20-or-so images.
>
> That behavior should be fairly apparent in the attached, which is a
> graphical representation of exactly this between Firefox at my home and
> images.google.com. I'll also attach a comment I made to Nandita last
> April; it references a number of other files which I no longer have but
> could recreate if folks are interested.
>
> Comparing that to the behavior of a single SCTP session with 20 streams, or
> the behavior of a single pipelined TCP session, I think the point should be
> apparent that these are mathematically not very similar, and that the value
> of IW is important.
>
> Now, one will argue that getting a zillion thumbnail images is not much of
> an issue, as thumbnails are small. My concern is essentially about
> discarding congestion control. If the average file moved is on the order of
> 15,000 bytes, IW=10, and the typical message size is about 1500 bytes, my
> four-function calculator says that in the typical case we have discarded
> congestion control entirely and are simply looking at the issues of
> retransmission: 15,000 bytes is ten 1500-byte segments, exactly the initial
> window, so the average file is put entirely in flight during the first RTT.
>
> In my email with Nandita and company, I asked about the possibilities of
> putting up a TCP Discard stream (representing competing traffic), letting it
> get established, and observing the effect on it when a flash crowd like this
> is imposed. I still think it would be an interesting test...
>
>
> Begin forwarded message:
>
> > From: Fred Baker <fred@cisco.com>
> > Date: April 1, 2010 3:33:43 PM PDT
> > To: Nandita Dukkipati <nanditad@google.com>
> > Cc: draft-hkchu-tcpm-initcwnd@tools.ietf.org, Jerry Chu <
> hkchu@google.com>, Yuchung Cheng <ycheng@google.com>
> > Subject: Re: cwnd=10
> >
> >
> > On Mar 31, 2010, at 6:50 PM, Nandita Dukkipati wrote:
> >
> >> Fred, Hi.
> >> Some responses inline...
> >>
> >>> Where I get concerned, as I mentioned last week, is when the receiver
> is outside the high-bandwidth corridor that stretches from Australia and
> eastern Asia across Anglophone North America and through western Europe. I
> have attached a map showing the corridor in very general terms, and making
> an error (oversimplifying) with respect to Mexico and the Caribbean.
> Geographically, this is a fairly narrow scope, as it leaves out the entire
> continents of South America and Africa and most of Asia (and btw Polynesia
> and Antarctica). It does represent the more developed countries of the
> world, however, and much of Google's markets. I'd be willing to bet that the
> vast majority of your tests have been to places in that corridor. Outside
> the corridor, I have some grave concerns.
> >>
> >> Great point! Knowing that this may in fact be the Achilles' heel in the
> standardization of cwnd=10, we deliberately chose test data-centers that
> also serve traffic to the above regions, including Africa and South America.
> The data presented (link: paper) is sliced and diced by network properties
> such as bandwidth (BW), RTT, and BDP. Assuming that the primary cause for
> concern in developing countries is low-BW connectivity, the paper
> specifically lists the observations for low-BW subnets, including those with
> <=56Kbps. That said, we can also break the performance down by specific
> countries/continents, something we are in the process of doing. We are also
> actively pursuing tests with clients in Africa to further validate our
> original numbers (that IW=10 doesn't hurt).
> >
> > In the paper and in the slides, what I observed you testing was the
> impact on your own sessions of changing the initial window. Did you do any
> testing of the impact on other sessions? For example, on a 56 kbps link with
> its (Cisco low-end default) 40-packet buffers, setting up half a dozen
> ten-packet bursts should show up in the loss rate of the sixth session. If
> there was a competing session, say an FTP that had been happily moving a
> file for several seconds and found itself dumped on with sixty
> 125-millisecond packets (7.5 seconds' worth of traffic), I could imagine the
> competing TCP session being lost entirely or at least being badly hurt. Did
> you try setting up a TCP Discard session in parallel and testing the impact
> on other users?
> >
> > You might be aware of issues between BitTorrent and Comcast, currently
> under discussion at the FCC, around the impact of BitTorrent on other users.
> It could extend to Google.
> >
> >>> There are two things I have suggested in the past and would note for
> the record:
> >>>
> >>> 1) I think any given host would do well to maintain a certain amount of
> history somewhere; if cwnd/rtt is an approximation of "a reasonable
> connection's characteristics" available to a destination, then recording
> that ratio and referring to it when starting a TCP session would go a long
> way in the direction you are suggesting. If you share a LAN with a device, I
> could imagine setting cwnd=rwin and letting SACK and the fast-retransmit
> heuristic pick up the pieces. More generally, if you were to record the
> ratio (as I believe Linux does) and initially set cwnd to some fraction of
> it times the RTT measured on the syn/syn-ack exchange, I imagine you would
> find cwnd low when it needs to be low and high when it is likely to work
> out. The secret sauce, so to speak, is in the fraction - is that half?
> >>>
> >>> 2) A major problem that TCP/SCTP have in the initial burst is that it
> is a burst. Ack-paced traffic tends to be spread out throughout the RTT, but
> the first few bursts tend to pile up, which can result in a momentary
> overload of switch/router/modem queues. Which makes me wonder: what would
> happen if we presumed some basic rate (perhaps derived from the saved cwnd
> and rtt or cwnd/rtt mentioned in [1]) and sent the initial burst
> *at*that*rate*? Picking numbers out of the air that happen to be simple to
> calculate with, if we found that someone typically had 1.2 Mbps to use and
> that their MSS was 1500 bytes (12000 bits), that would suggest that we might
> try sending one segment every 10 ms until we got the first Ack, and then
> follow our favorite congestion control algorithm after that. If someone has
> a long RTT, that results in us having a lot of segments (in separate packet
> trains) in flight pretty quickly, but if the RTT is short it starts with an
> initial amount more appropriate to the RTT.
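
[As a quick check of the 10 ms figure above, using only the illustrative
1.2 Mbps rate and 1500-byte MSS picked there; a real sender would need
per-segment timer support in the stack to pace like this:]

    MSS_BYTES = 1500
    EST_RATE_BPS = 1.2e6      # presumed rate, e.g. cached cwnd/rtt from (1)
    IW_SEGMENTS = 10

    # One segment every mss/rate seconds instead of a back-to-back burst:
    interval = MSS_BYTES * 8 / EST_RATE_BPS        # 12000 / 1.2e6 = 0.01 s
    send_ms = [round(i * interval * 1000) for i in range(IW_SEGMENTS)]
    print(f"pacing interval {interval * 1000:.0f} ms, send at t = {send_ms} ms")
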
> >>
> >> We have tried the approach in [1] and the presentation slides to
> ICCRG/TCPM cover the pros and cons of the approach. Overall, there is a
> tradeoff between the cache size used and the hit rate; cache hit rate is low
> because of load balancers before the front-ends.
> >
> > Looking through the presentation and paper...
> >
> > "IW=10 saves up to four round trips" seems like an interesting statement.
> If IW=1, the second RTT carries 2, the third four, and the fourth 8. IW=10
> takes 1+2+4+3 segments in the first RTT and the other five from the fourth
> RTT (plus however many more) in the second RTT. If IW=2, the sequence is 2,
> 4, 8, and conversion to IW=10 means that the first 2+4+4 segments go in the
> first RTT and the next six (plus whatever) in the second RTT. With CUBIC, It
> could rise more quickly. I see a reduction of two or three RTTs; where is
> there a reduction of four?
> >
> > More importantly, the statement on slide 5 is that this reduces the trend
> of browsers opening up multiple TCP sessions. I'm not sure I see that. Take
> a look if you would at
> >    ftp://ftpeng.cisco.com/fred/google/firefox-tcp.pdf
> >    ftp://ftpeng.cisco.com/fred/google/safari-tcp.pdf
> >    ftp://ftpeng.cisco.com/fred/google/google-images-firefox.zip
> >    ftp://ftpeng.cisco.com/fred/google/google-images-safari.zip
> >    ftp://ftpeng.cisco.com/fred/google/google-images.xlsx.zip
> >
> > The two PDFs are the graphics; the others are the supporting data. I
> spoke for an hour with Mike Belshe at the IETF, and he called out Google
> Images as an example of a web browser opening lots of sessions. OK, great
> example, let's play. So, I opened Safari, looked up "April Fools Pictures"
> on Google Images, cleared my cache, started a tcpdump, hit "refresh", and
> turned off the tcpdump. Then I repeated the experiment with Firefox. What I
> see is signature HTTP behavior:
> >
> > - the browser downloaded the main page
> > - the browser opened some number of TCP sessions to "get" objects
> >        SYN, SYN-ACK, ACK, "GET"
> > - the browser received each of the objects in question
> >        Data, Ack, Data, Data, Ack, ...
> >
> > You can see that very clearly in the PDFs. I have to ask: what does the
> fact that Google has bumped IW to 10 have to do with how many sessions the
> browser opened? Google didn't even *try* to deliver the first of the
> requested objects until the last of the sessions was open.
> >
> > Had it used SCTP, those would be streams in a session. I think I can
> figure out (your slide 5) how one builds the congestion control for that.
> >
> >
> > BTW, it looks like you have already increased the initial burst at
> Google. Picking one of those sessions at random, and noting that my RTT to
> Google is very long compared to my RTT to work (I have a split tunnel, so
> that means that the Google data center is really far away):
> >
> > --- bw-in-f104.1e100.net ping statistics ---
> > 10 packets transmitted, 10 packets received, 0.0% packet loss
> > round-trip min/avg/max/stddev = 186.227/198.954/209.654/8.509 ms
> >
> > --- irp-view13.cisco.com ping statistics ---
> > 10 packets transmitted, 10 packets received, 0.0% packet loss
> > round-trip min/avg/max/stddev = 19.579/25.110/36.951/5.107 ms
> >
> > I see four bursts of activity in this session:
> >
> > #1: SYN
> > 09:53:38.102027 IP stealth-10-32-244-219.cisco.com.61348 >
> bw-in-f104.1e100.net.http: Flags [S], seq 567668515, win 65535, options [mss
> 1460,nop,wscale 3,nop,nop,TS val 418428163 ecr 0,sackOK,eol], length 0
> >
> > #2: SYN-ACK, ACK, "GET"
> > 09:53:38.303788 IP bw-in-f104.1e100.net.http >
> stealth-10-32-244-219.cisco.com.61348: Flags [S.], seq 1496444701, ack
> 567668516, win 5672, options [mss 1360,sackOK,TS val 765576747 ecr
> 418428163,nop,wscale 6], length 0
> > 09:53:38.303856 IP stealth-10-32-244-219.cisco.com.61348 >
> bw-in-f104.1e100.net.http: Flags [.], ack 1, win 65535, options [nop,nop,TS
> val 418428166 ecr 765576747], length 0
> > 09:53:38.303956 IP stealth-10-32-244-219.cisco.com.61348 >
> bw-in-f104.1e100.net.http: Flags [P.], seq 1:742, ack 1, win 65535, options
> [nop,nop,TS val 418428166 ecr 765576747], length 741
> >
> > #3: I get five data segments and ack them
> > 09:53:38.507269 IP bw-in-f104.1e100.net.http >
> stealth-10-32-244-219.cisco.com.61348: Flags [.], ack 742, win 112, options
> [nop,nop,TS val 765576951 ecr 418428166], length 0
> > 09:53:38.509821 IP bw-in-f104.1e100.net.http >
> stealth-10-32-244-219.cisco.com.61348: Flags [.], seq 1:1349, ack 742, win
> 112, options [nop,nop,TS val 765576952 ecr 418428166], length 1348
> > 09:53:38.509824 IP bw-in-f104.1e100.net.http >
> stealth-10-32-244-219.cisco.com.61348: Flags [.], seq 1349:2697, ack 742,
> win 112, options [nop,nop,TS val 765576952 ecr 418428166], length 1348
> > 09:53:38.509894 IP stealth-10-32-244-219.cisco.com.61348 >
> bw-in-f104.1e100.net.http: Flags [.], ack 2697, win 65209, options
> [nop,nop,TS val 418428168 ecr 765576952], length 0
> > 09:53:38.510106 IP bw-in-f104.1e100.net.http >
> stealth-10-32-244-219.cisco.com.61348: Flags [.], seq 2697:4045, ack 742,
> win 112, options [nop,nop,TS val 765576952 ecr 418428166], length 1348
> > 09:53:38.510204 IP stealth-10-32-244-219.cisco.com.61348 >
> bw-in-f104.1e100.net.http: Flags [.], ack 4045, win 65535, options
> [nop,nop,TS val 418428168 ecr 765576952], length 0
> > 09:53:38.510470 IP bw-in-f104.1e100.net.http >
> stealth-10-32-244-219.cisco.com.61348: Flags [.], seq 4045:5393, ack 742,
> win 112, options [nop,nop,TS val 765576952 ecr 418428166], length 1348
> > 09:53:38.510508 IP stealth-10-32-244-219.cisco.com.61348 >
> bw-in-f104.1e100.net.http: Flags [.], ack 5393, win 65378, options
> [nop,nop,TS val 418428168 ecr 765576952], length 0
> >
> > #4: I get a final segment and ack it
> > 09:53:38.714801 IP bw-in-f104.1e100.net.http >
> stealth-10-32-244-219.cisco.com.61348: Flags [P.], seq 5393:5966, ack 742,
> win 112, options [nop,nop,TS val 765577158 ecr 418428168], length 573
> > 09:53:38.714924 IP stealth-10-32-244-219.cisco.com.61348 >
> bw-in-f104.1e100.net.http: Flags [.], ack 5966, win 65474, options
> [nop,nop,TS val 418428170 ecr 765577158], length 0
> >
> > No final (FIN)
> >
> >> We understand the potential of packet pacing as a way to mitigate
> bursts, but are also concerned with the complexity that pacing introduces
> into OS implementations. There hasn't been any evidence so far of its
> necessity for a modest IW increase on the order of 10 segments, and so we'd
> like to leave pacing as a last resort (if need be).
> >
> > OK. What does it take to get you to look at SCTP?
> >
> >> -Nandita
>
>
>
>