Re: [tcpm] [Tmrg] Increasing the Initial Window - Notes

Jerry Chu <hkchu@google.com> Thu, 25 November 2010 01:34 UTC

In-Reply-To: <Pine.GSO.4.64.1011190001550.27897@sweet-brew-4.cisco.com>
Date: Wed, 24 Nov 2010 17:35:14 -0800
Message-ID: <AANLkTiknZkrKP=XoyK7pcDzf1ahjcsTJfrpfL-3ZtEd=@mail.gmail.com>
From: Jerry Chu <hkchu@google.com>
To: Andrew Yourtchenko <ayourtch@cisco.com>
Cc: "Anantha Ramaiah (ananth)" <ananth@cisco.com>, tcpm <tcpm@ietf.org>, David Borman <david.borman@windriver.com>, Joe Touch <touch@isi.edu>
Subject: Re: [tcpm] [Tmrg] Increasing the Initial Window - Notes

On Thu, Nov 18, 2010 at 4:47 PM, Andrew Yourtchenko <ayourtch@cisco.com> wrote:
>
>
> On Thu, 18 Nov 2010, Joe Touch wrote:
>
>> That's basically what I'm suggesting; it can easily be per subnet, per
>> interface, or per the entire machine as desired.
>>
>> The ultimate point is:
>>
>>        - put something in the end node that notices if/when
>>        it fails objectively, and fixes it
>>
>>        - that same thing can allow the IW to increase
>>        over time if there are no problems
>
> I think this is a very viable idea.
>
> However, I think there is a catch that would require some tweaking -
> there needs to be a signaling mechanism whereby the node on the "conservative"
> side of the connection can tell the node on the "eager" side of the
> connection not to be too aggressive, in case the data flows mostly
> from the "eager" to the "conservative" side.
>
> One might consider the following model for a path:
>
> [peer A] --- [access A] ----- [ backbone ] ---- [access B] --- [peer B]
>
> where "access A" and "access B" are the access networks for the peers,
> and the "backbone" is the "big Internet" in between.
>
> I am speculating based on experience (please correct me if I am wrong) that
> the congestion would typically be in either the [access A] or the [access B] cloud.
>
> So if peer B is a server, then storing the loss history of connections
> on a per-something basis would not be productive - because a connection from
> peer A may experience congestion at access A, whereas a connection from
> peer C would not traverse that access network - so it would see no
> congestion.

But cwnd in the long run would converge to the fair-share bottleneck
bandwidth regardless of whether the bottleneck is at the near end or the
far end. (Guess I don't quite understand your use case.)

>
> OTOH if the congestion is in [access B] for both cases then storing the data
> in peer B would indeed make sense.
>
> If the above logic is right, then during the 3whs the peers should
> communicate their memorized values and pick the minimum of their own
> value and the one communicated by the peer.

There is already a mechanism through which a receiving peer can influence
how much data the sending side can burst out: the advertised receive
window.
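
For instance, a receiver that wants to cap the sender's initial burst can
already do that today by shrinking its receive buffer before listen(), so
that the window it advertises during the handshake stays small. A rough,
untested sketch (how the buffer maps to the advertised window is of course
stack dependent):

/* Cap the window a server advertises by shrinking SO_RCVBUF before
 * listen().  The advertised receive window cannot exceed the buffer,
 * so the peer's first flight is bounded no matter what IW it chooses. */
#include <sys/socket.h>
#include <netinet/in.h>
#include <string.h>
#include <unistd.h>

int make_capped_listener(unsigned short port, int rcvbuf_bytes)
{
    struct sockaddr_in addr;
    int fd = socket(AF_INET, SOCK_STREAM, 0);

    if (fd < 0)
        return -1;

    /* Must happen before listen() so that the window (and window scale)
     * advertised in the SYN-ACK already reflect the smaller buffer. */
    if (setsockopt(fd, SOL_SOCKET, SO_RCVBUF,
                   &rcvbuf_bytes, sizeof(rcvbuf_bytes)) < 0)
        goto fail;

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(port);

    if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
        listen(fd, 128) < 0)
        goto fail;

    return fd;

fail:
    close(fd);
    return -1;
}

/* e.g. make_capped_listener(8080, 4 * 1460) to advertise roughly 4 * MSS */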

>
> Therefore both parties know by the end of the 3whs the IW value they are
> going to use - and can monitor the initial loss within the connection.
>
> If there happens to be a packet loss during the IW burst, both parties will
> be aware of it.
>
> The result - whether there was a loss or not - affects the party that had
> the smaller value to begin with: if there was no loss, that party
> can use a larger value for subsequent connections, *if* its previous
> LOSS_THRESH connections all happened without losses. If there was a loss,
> that party will use a smaller value for subsequent connections.
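>
> In rough C, the per-party rule would be something like this (a simplified,
> untested sketch rather than the attached program; the step sizes and the
> floor of 2 are arbitrary here):
>
> struct iw_state {
>     int iw;             /* remembered IW, in segments */
>     int lossfree_runs;  /* consecutive loss-free connections */
> };
>
> #define LOSS_THRESH 1000
>
> /* IW actually used for this connection, agreed upon during the 3whs. */
> static int negotiate_iw(int my_iw, int peer_iw)
> {
>     return my_iw < peer_iw ? my_iw : peer_iw;
> }
>
> /* Called once the fate of the IW burst is known.  Only the party whose
>  * (smaller) value was actually used adapts its remembered value. */
> static void update_iw(struct iw_state *s, int my_iw, int peer_iw,
>                       int burst_had_loss)
> {
>     if (my_iw > peer_iw)
>         return;
>     if (burst_had_loss) {
>         s->lossfree_runs = 0;
>         if (s->iw > 2)
>             s->iw--;
>     } else if (++s->lossfree_runs >= LOSS_THRESH) {
>         s->lossfree_runs = 0;
>         s->iw++;
>     }
> }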
>
> I've done a poor man's simulation program (attached) to experiment with this
> idea - it seems quite viable in terms of controlling the loss. It assumes a
> full mesh between 10 nodes, with each nodeX-nodeY backbone capacity
> potentially unique.
>
> What's more entertaining with this approach is that, with a sufficiently
> large value of LOSS_THRESH, it more or less converges *regardless* of the
> initial setting for IW.
>
> You can run this code in the following way:
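>
> (The arguments are: initial IW, adaptive flag (0/1), LOSS_THRESH,
> access capacity min/max, backbone capacity min/max.)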
>
> 1) Assume access and backbone links can uniformly support IW=10,
> nonadaptive:
>
> ./a.out 10 0 1000 10 11 10 11 | grep round
>
> ...
> Results from the round: loss: 0, headroom: 0, avg IW size: 10.000000
> Results from the round: loss: 0, headroom: 0, avg IW size: 10.000000
> Results from the round: loss: 0, headroom: 0, avg IW size: 10.000000
> ....
>
> 2) Assume access and backbone links can uniformly support only IW=5,
> nonadaptive:
>
> ./a.out 10 0 1000 5 6 5 6 | grep round
>
> Results from the round: loss: 500, headroom: 0, avg IW size: 10.000000
> Results from the round: loss: 500, headroom: 0, avg IW size: 10.000000
> Results from the round: loss: 500, headroom: 0, avg IW size: 10.000000
>
> 3) same as (1), adaptive with the LOSS_THRESH=1000 and IW of 4:
>
> ./a.out 4 1 1000 10 11 10 11 | grep round
>
> Results from the round: loss: 2, headroom: 0, avg IW size: 10.020000
> Results from the round: loss: 1, headroom: 0, avg IW size: 10.010000
> Results from the round: loss: 0, headroom: 0, avg IW size: 10.000000
> Results from the round: loss: 0, headroom: 0, avg IW size: 10.000000
>
> (loss fluctuates between 0 and 2)
>
> 4) same as (2), same adaptive parameters as (3):
>
> ./a.out 4 1 1000 5 6 5 6 | grep round
>
> Results from the round: loss: 0, headroom: 0, avg IW size: 5.000000
> Results from the round: loss: 1, headroom: 0, avg IW size: 5.010000
> Results from the round: loss: 0, headroom: 0, avg IW size: 5.000000
> Results from the round: loss: 0, headroom: 0, avg IW size: 5.000000
>
> On to fun stuff with nonuniform distributions:
>
> Assume access links can support 5..8, and backbone 8..15
>
> 5) non-adaptive with IW=10:
>
> ./a.out 10 0 1000 5 8 8 15 | grep round
>
> Results from the round: loss: 434, headroom: 0, avg IW size: 10.000000
> Results from the round: loss: 423, headroom: 0, avg IW size: 10.000000
> Results from the round: loss: 415, headroom: 0, avg IW size: 10.000000
>
> 6) non-adaptive with IW = 5:
>
> ./a.out 5 0 1000 5 8 8 15 | grep round
>
> Results from the round: loss: 0, headroom: 71, avg IW size: 5.000000
> Results from the round: loss: 0, headroom: 84, avg IW size: 5.000000
> Results from the round: loss: 0, headroom: 66, avg IW size: 5.000000
>
> 7) adaptive with IW=10:
>
> ./a.out 10 1 1000 5 8 8 15 | grep round
>
> Results from the round: loss: 0, headroom: 12, avg IW size: 5.660000
> Results from the round: loss: 0, headroom: 11, avg IW size: 5.670000
> Results from the round: loss: 2, headroom: 13, avg IW size: 5.610000
> Results from the round: loss: 0, headroom: 26, avg IW size: 5.470000
>
> 8) adaptive with IW=5:
>
> ./a.out 5 1 1000 5 8 8 15 | grep round
>
> Results from the round: loss: 0, headroom: 11, avg IW size: 5.550000
> Results from the round: loss: 0, headroom: 10, avg IW size: 5.670000
> Results from the round: loss: 1, headroom: 10, avg IW size: 5.750000
> Results from the round: loss: 0, headroom: 3, avg IW size: 5.810000
> Results from the round: loss: 2, headroom: 6, avg IW size: 5.800000
> Results from the round: loss: 2, headroom: 0, avg IW size: 5.670000
>
> 9) (extreme:) - adaptive with IW=100:
>
> ./a.out 100 1 1000 5 8 8 15 | grep round
>
> Results from the round: loss: 0, headroom: 8, avg IW size: 5.550000
> Results from the round: loss: 0, headroom: 3, avg IW size: 5.600000
> Results from the round: loss: 0, headroom: 5, avg IW size: 5.680000
>
>
> 10) "conservative futuristic" - adaptive with IW=5, and access cap between
> 10 and 20 and backbone cap between 15 and 35:
>
> ./a.out 5 1 1000 10 20 15 35 | grep round
>
> Results from the round: loss: 0, headroom: 54, avg IW size: 14.200000
> Results from the round: loss: 0, headroom: 58, avg IW size: 13.750000
> Results from the round: loss: 0, headroom: 31, avg IW size: 13.470000
> Results from the round: loss: 0, headroom: 53, avg IW size: 13.860000
> Results from the round: loss: 0, headroom: 32, avg IW size: 13.590000
> Results from the round: loss: 0, headroom: 41, avg IW size: 13.790000
> Results from the round: loss: 0, headroom: 62, avg IW size: 14.290000
>
> 11) same as (10), "badly converging" - the THRESH being 5:
>
> ./a.out 5 1 5 10 20 15 35 | grep round
>
> Results from the round: loss: 20, headroom: 13, avg IW size: 14.140000
> Results from the round: loss: 18, headroom: 9, avg IW size: 14.760000
> Results from the round: loss: 21, headroom: 4, avg IW size: 14.470000
> Results from the round: loss: 18, headroom: 6, avg IW size: 14.420000
> Results from the round: loss: 19, headroom: 11, avg IW size: 14.370000
> Results from the round: loss: 22, headroom: 11, avg IW size: 14.420000
>
> 12) same as (10) with the fixed IW of 10, non-adaptive:
>
> ./a.out 10 0 5 10 20 15 35 | grep round
>
> Results from the round: loss: 0, headroom: 398, avg IW size: 10.000000
> Results from the round: loss: 0, headroom: 435, avg IW size: 10.000000
> Results from the round: loss: 0, headroom: 418, avg IW size: 10.000000
> Results from the round: loss: 0, headroom: 422, avg IW size: 10.000000
>
> How to do the signaling during the 3whs is another story - it might be
> either a TCP option or some tricks with the TCP window size in the 3whs;
> I am not going into that here.

The additional signal, whether explicit through the precious TCP option
space or implicit through some window encoding (as you mentioned to me
privately), seems to add unnecessary complexity and hence violates the
KISS principle IMHO. Just using the advertised receive window described
above should achieve what you want, no?

Jerry

>
> Critiques are very welcome - as I said, it is a rather non-scientific
> simulation done in half an hour just to test the concept. You'll see some
> "non-working" logic commented out in the code; maybe there is a better
> approach.
>
> cheers,
> andrew
>
> (Also, the security side of this approach would need to be given some
> thought. There are some interesting attacks lurking in all of the
> "adaptive" logic.)
>
>>
>> I'll be glad to write this up if people need a more concrete proposal.
>>
>> Joe
>>
>> On 11/18/2010 1:05 PM, Anantha Ramaiah (ananth) wrote:
>>>
>>> Why can't you do something like this:
>>>
>>> - If any one of the TCP connections egressing out of that interface
>>> (this can be determined in the TCP layer) is in the TCP retransmit state
>>> (or has experienced congestion in the past xxx secs/mins), then go back
>>> to a lower IW for new TCP connections which are using the same output
>>> interface.
>>>
>>> - When you want to start a connection, use the connection history (Joe's
>>> RFC)
>>>
>>> Well, there may be some gotchas with this scheme, but you can become
>>> conservative when there is some concrete information.
>>>
>>> Thanks,
>>> -Anantha
>>>>
>>>> To be more clear, here's the case:
>>>>
>>>>        start 1000 connections in a row.
>>>>
>>>>        during the first connection, lose some packets and do
>>>>        normal TCP backoff
>>>>
>>>>        so what do the other 999 connections start with?
>>>>        ans: 10 packets
>>>>
>>>> The point is that subsequent connections don't do anything different. If
>>>> you have 1000 connections, you're sending a certain amount of data into
>>>> the network without reacting. We're tripling that.
>>>>
>>>> That can easily cause congestion. At which point the *existing*
>>>> connections will back off, but new connections keep causing problems.
>>>>
>>>> Joe