Re: [tcpm] initial RTO (was Re: Tuning TCP parameters for the 21st century)

Jerry Chu <> Tue, 28 July 2009 15:32 UTC



Thanks for your thoughtful comments. The good news is, I'm pretty much on
the same page as you. The problem has to be real and the solution must be
reasonable with acceptable downside if the latter is unavoidable. From our
applications folks the problem seems very real and the performance
difference is not negligible. So I'll focus more on how to rein in the downside.

On Mon, Jul 27, 2009 at 12:28 PM, Mark Allman <> wrote:
> Jerry-
> A few random thoughts here ...
> > I'll start with Lowering initRTO.
> >
> > RFC1122 contains the following paragraph:
> >
> > The following values SHOULD be used to initialize the
> > estimation parameters for a new connection:
> >
> > (a)  RTT = 0 seconds.
> >
> > (b)  RTO = 3 seconds.  (The smoothed variance is to be
> > initialized to the value that will result in this RTO).
> >
> > The "3secs SHOULD" is reaffirmed in RFC2988.
> >
> > From our own measurement of world wide RTT distribution to Google
> > servers we believe 3secs is too conservative, and would like to propose
> > that it be reduced to 1sec.
> I am not at all sure this is a good idea.  I have an easier time
> believing the others on your list are perhaps reasonable.  But, this one
> seems somewhat dubious to me.  A few things ...
>  - The fundamental problem here is that you have *no* information.
>    That is, we don't know how long the path is before we have done an
>    exchange.  When you start from scratch you have nothing to go on
>    except defaults.  So, it seems to me on those grounds alone
>    conservativeness is fine.  Because,

Agreed. The question is, how conservative does one need to be? Can we do
better for the majority without sacrificing the ones left behind? If
some cost is unavoidable, will the tradeoff be acceptable?

>  - If it was just an extra small packet or two that got sent out that
>    doesn't seem like a Big Deal.  But, once you retransmit the SYN you
>    no longer can take an RTT sample from the 3WHS per Karn's algorithm.
>    So, if in fact the initial RTO is too short then it isn't just going
>    to strobe out an extra packet, but what it means is that it's pretty
>    likely that the packets in your initial window---after clumsily
>    finishing the 3WHS---will likewise be retransmitted because the RTO
>    estimate is low and we did not get the opportunity in the 3WHS to
>    take an actual RTT sample to better seed the estimator.  This is RTO
>    Hell.

Understood. I've also had this as one of the bullet points on the cost side
in my slides. But my gut feeling was this problem could be mitigated somehow.
(Guess now I'll need to demonstrate how :-)
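
To make the "RTO Hell" scenario concrete, here is a minimal sketch of the
RFC 2988 estimator combined with Karn's algorithm. The class and method
names are illustrative only, and timer backoff on expiry is left out for
brevity. It shows that when the SYN has been retransmitted, the SYN-ACK
yields no RTT sample and the RTO stays at whatever initial value was chosen.

```python
from dataclasses import dataclass
from typing import Optional

ALPHA = 1 / 8   # SRTT gain from RFC 2988
BETA = 1 / 4    # RTTVAR gain from RFC 2988
K = 4           # variance multiplier from RFC 2988


@dataclass
class RtoEstimator:
    """Illustrative RFC 2988 style estimator (names are hypothetical)."""
    rto: float                      # current RTO in seconds, starts at initRTO
    srtt: Optional[float] = None    # smoothed RTT, unset until first sample
    rttvar: Optional[float] = None  # RTT variation, unset until first sample
    granularity: float = 0.0        # clock granularity G, assumed negligible
    min_rto: float = 1.0            # RFC 2988 rounds the RTO up to 1 second

    def on_ack(self, rtt_sample: float, was_retransmitted: bool) -> None:
        # Karn's algorithm: an ACK for a retransmitted segment is ambiguous,
        # so it must not be used as an RTT sample.
        if was_retransmitted:
            return
        if self.srtt is None:
            # First measurement seeds the estimator.
            self.srtt = rtt_sample
            self.rttvar = rtt_sample / 2
        else:
            self.rttvar = (1 - BETA) * self.rttvar + BETA * abs(self.srtt - rtt_sample)
            self.srtt = (1 - ALPHA) * self.srtt + ALPHA * rtt_sample
        self.rto = max(self.min_rto,
                       self.srtt + max(self.granularity, K * self.rttvar))


# Path RTT of 2 seconds, initRTO of 3 seconds: the SYN is not retransmitted,
# so the handshake seeds the estimator.
conservative = RtoEstimator(rto=3.0)
conservative.on_ack(2.0, was_retransmitted=False)

# Same path, initRTO of 1 second: the SYN is retransmitted before the
# SYN-ACK arrives, so no sample is taken and the RTO stays at 1 second.
aggressive = RtoEstimator(rto=1.0)
aggressive.on_ack(2.0, was_retransmitted=True)
```

In the second case the data phase starts with an RTO shorter than the path
RTT, which is exactly the condition under which the initial window gets
retransmitted needlessly.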

>  - At first blush timestamps might help here because if used then we
>    don't have to use Karn's algorithm.  But, again, since we are just
>    initiating a connection how do we know if the peer is going to use
>    timestamps?  If the initiator sends a timestamp option then there is
>    a chance that timestamps will be in use and therefore there is a
>    chance you'll avoid RTO Hell.  But, there is also a chance you
>    won't.

So suppose the active open side, with an initRTO of 1sec, has
retransmitted the SYN and eventually received a SYN-ACK, but the TS
option was denied so no good RTT sample can be taken. It will go on to
send the initial data using the initRTO.

If the SYN retransmission was unnecessary because the initRTO was too
aggressive, more SYN-ACKs will be triggered. Assuming these dup SYN-ACKs
are not lost, could they be used as a hint to the active open side that
the initRTO has been too short, so that the RTO timer can be reverted to
3secs immediately, in time to avoid further RTO hell?

One price to pay in the above scenario, compared to the default case
of 3secs, is that one could have gotten a good RTT sample of, say,
2secs, but due to the choice of a more aggressive initRTO of 1sec
one ends up with a fall-back of 3secs again for the data phase, so any
loss from there on will take more time (3 vs 2secs) to recover. But the
price seems acceptable, and is incurred only when there is immediate
further loss. At least the dreadful endless spurious retransmission,
which would have been unacceptable, is stopped.
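
A rough sketch of the dup SYN-ACK hint described above, from the active
open side. This is only an illustration of the idea, not a tested design:
the class, its fields, and the constants are made up for the example, and
it ignores the case where a second SYN-ACK is the server's own timer-driven
retransmission rather than a reply to our extra SYN.

```python
from dataclasses import dataclass

DEFAULT_INIT_RTO = 3.0     # the RFC 1122 / RFC 2988 default, in seconds
AGGRESSIVE_INIT_RTO = 1.0  # the proposed lower value, in seconds


@dataclass
class ActiveOpen:
    """Hypothetical client-side handshake state for the dup SYN-ACK hint."""
    rto: float = AGGRESSIVE_INIT_RTO
    syns_sent: int = 0
    synacks_seen: int = 0

    def send_syn(self) -> None:
        # Called for the original SYN and for each retransmission.
        self.syns_sent += 1

    def on_synack(self) -> None:
        self.synacks_seen += 1
        # If we retransmitted the SYN and then see more than one SYN-ACK,
        # the earlier SYN was probably not lost, only slow. Treat that as
        # a hint that the initRTO was too short and fall back to 3 seconds.
        if self.syns_sent > 1 and self.synacks_seen > 1:
            self.rto = max(self.rto, DEFAULT_INIT_RTO)


# Long path: the SYN is retransmitted after 1 second, then both SYN-ACKs
# arrive.
conn = ActiveOpen()
conn.send_syn()
conn.send_syn()      # spurious retransmission
conn.on_synack()     # first SYN-ACK: no evidence yet
rto_after_first = conn.rto
conn.on_synack()     # dup SYN-ACK: revert to the conservative default
rto_after_dup = conn.rto
```

The fallback only ever raises the RTO, so a connection that never sees a
dup SYN-ACK keeps the aggressive value.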

>    The 3WHS responder (sender of the SYN+ACK) will know if
>    timestamps will be in use and therefore could perhaps lower the
>    initial RTO (basically, this is the ECNSYN trick).  That doesn't
>    seem all that unreasonable to me.

Correct, but I'd really like to find a solution for the active open side
because there are other techniques the server side can employ (e.g.,
initialize the initRTO from the RTT history...) as a backup if this one
doesn't fly, but those techniques don't work well on the client side.

>  - Now, if you track information across connections as others have
>    noted then, sure.  It seems perfectly acceptable to take the view
>    that with high confidence you understand that 1sec (or whatever)
>    will be fine for an initial RTO over some path that you have
>    transmitted traffic across in the recent past and so then you can
>    use that.  In this case, you are picking an initial RTO for a
>    connection but not flying completely in the dark.

Correct. There is a certain amount of added complexity associated
with the cached RTT idea. As such, it would still be nice if this proposal
to reduce initRTO can fly, not just to cover the client side but also as a
simpler solution for those servers that can't afford to use the more
complex cached RTT approach.
>  - It seems that (per the discussion in today's meeting) a naive
>    lowering to 1sec is going to be problematic because we have
>    bandwidth-on-demand networks, deep queues in access devices are not
>    rare, etc.

Yes I heard it, loud and clear :-).

> In a subsequent email you note:
> > Correct so there is a fine line to walk. But if > 98% of all TCP
> > connections experience RTT << 1 sec, it just seems too conservative to
> > have a global initRTO == 3secs just to avoid spurious retransmission
> > in the < 2% category.
> I agree that it is a fine line.  But, I think your 98%-vs-2% is far too
> glib.  That is, we have to look at how bad we're making it for those 2%.
> If we degraded each of those 2% by "a smidge" then who cares.  But, if
> we really hose those connections (see second bullet above about RTO
> Hell) it doesn't seem like a good tradeoff.  It's useful to remember
> that TCP was designed to be general and not optimal.  Certainly we don't
> want to unduly penalize most of the traffic (/users) to dogmatically
> accommodate every last esoteric situation that might happen to crop up
> on the third Tuesday of the 6th month following the most recent blue
> moon.  But, we also can err on the other side, too.  I think simple
> percentages as you have given are pretty superficial and we'd need to go
> beyond that to really decide what line we wanted to walk.

Understood. See above.


> Just my two bits ...
> allman