Re: [tcpm] Review: draft-ietf-tcpm-early-rexmt-01

Mark Allman <> Wed, 23 September 2009 13:03 UTC

To: Joe Touch <>
From: Mark Allman <>
Organization: International Computer Science Institute (ICSI)


Many thanks for the comments.

> Regarding Jac88, it's useful to make sure you're referencing the right
> paper. There are several versions of this doc, at least two are often
> confused: the Sigcomm version (most often cited):
> 	This is 16 pages, and has only a very brief appendices section
> and an extended version that has a long appendix with footnotes (this is
> the one that discusses idle times, too):
> 	This one is 21 pages, and decidedly NOT a reprint (even with
> 	formatting changes) of the one published at Sigcomm 1988, 	
> 	despite the note on the web page.
> The second version has most of the details that ended up in TCP
> implementations, several of which have been dogging us for years (lack
> of slow-start restart after idle in the Web in particular). I don't know
> if this is the case for the issues you're citing it for, but it's
> certainly worth a check.

I think we're OK here.  I.e., as far as I know the general notion we are
citing is discussed in all versions.  And, we cite RFC5681 for the
actual spec.

> The example on page three considers a window of three segments (FWIW,
> it should probably read "a window of three segments' worth of data",
> since windows are in bytes not segments). I'm wondering if ACK
> compression (as required) affects the example. It's worth either
> fixing the example, or addressing the effect of ACK compression (even
> if to clarify that there is none) somewhere in the doc.

I think you're talking about delayed ACKs and not ACK compression
(ACKs getting squished together in the time domain), right?  The
assumption here is that stacks are following RFC 5681, which says they
should not delay ACKs if there is a hole in the sequence space.  I.e.,
they should immediately ACK each incoming segment.  I'll add a
quick note.
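
As an illustration of the receiver behavior assumed here, the following
is a minimal sketch, not taken from the draft.  It models a receiver
that delays ACKs for in-order data but ACKs immediately whenever there
is a hole in the sequence space.  The function name, the unit-sized
segment indices, and the "ACK every second segment" parameter are all
assumptions made for the example; the delayed-ACK timer is left out.

```python
def receiver_acks(arrivals, delack_every=2):
    """Return the cumulative ACK numbers a simplified receiver emits.

    arrivals: segment indices (0-based, one unit each) in arrival order.
    In-order arrivals with no hole are ACKed every `delack_every`
    segments; any arrival while a hole exists (or that creates or fills
    one) is ACKed immediately.  No delayed-ACK timer is modeled.
    """
    received = set()
    next_expected = 0
    pending = 0  # in-order segments received but not yet ACKed
    acks = []
    for seg in arrivals:
        hole_before = any(s > next_expected for s in received)
        in_order = (seg == next_expected)
        received.add(seg)
        while next_expected in received:
            next_expected += 1
        if in_order and not hole_before:
            pending += 1
            if pending < delack_every:
                continue  # delay this ACK
        acks.append(next_expected)
        pending = 0
    return acks
```

With a window of three segments and the first one lost, the arrivals
are `[1, 2]` and the sketch emits one ACK per segment, so delayed ACKs
do not reduce the number of duplicate ACKs the sender sees.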

> The data from BPS+98 implies that the bulk of RTOs can be avoided with
> early rexmit. Is that true? Or could there be other reasons for large
> numbers of RTOs that early rexmit won't help? If so, it'd be useful to
> caveat the impact of the proposed mod. 

BPS+98 notes the problem ER addresses and solves it using Limited
Transmit and also a scheme that sends dummy packets to induce three
duplicate ACKs.  Without crunching that data I don't know precisely how
ER would perform.  However, the problems they describe as the big issues
with RTOs are the problems that ER addresses.

(And, this question of how ER will work in the wild is the key reason
for going Experimental here.)

> Also, this paragraph makes an error I saw at the last TCPM meeting
> from the Google guy's talk - it equates median transfer size with
> median TCP connection duration. HTTP still has persistent connections,
> AFAIK, which mean that these aren't correlated. The conclusion that
> non-RTO recovery would be useful may be true for short transfers over
> persistent connections, not just short TCP connections (which is how I
> read "short TCP transfers", since a TCP transfer is over when the FIN
> sings ;-)

You're just reading this differently from what I meant.  I have

    Furthermore, [All00] shows that for one particular web server the
    median number of bytes carried by a connection is less than four
    segments, indicating that more than half of the connections will be
    forced to rely on the RTO timer to recover from any losses that

I.e., I meant "transfer size" as the amount of data carried by a
connection not the size of some subset of that data.  Hopefully the new
version is more clear.

> Section 2 again starts with math that appears to assume ACK per segment
> (maybe I'm not catching this - maybe it's that ACKs aren't compressed --
> or compressible -- when the data comes out of order, but that's worth
> noting if so. Sorry, I figure you know this better than I do, so I'll
> ask you before I dig into figuring it out. Let me know if you want me to
> dig as well...).

Per above, I don't think this is an issue.

> Section 2 talks about TCP in bytes and SCTP in messages, neither one in
> segments. It might be useful to put in enough context there, e.g., that
> SCTP includes message boundaries, but that they don't correspond to
> segment boundaries (right?).

Right.  I added a note.

> Doublecheck the term 'packet' throughout; I think you mean segments
> (i.e., TCP segments don't necessarily map to IP packets, e.g., given
> fragmentation).

Yep... just sloppy.  All fixed.

> Sec 2 talks about additional state at the sender for precision; this is
> the first time you mention a side-specific cost. It might be useful to
> hint earlier whether you are doing a send-side, receiver-side, or
> require mods on both sides to achieve a benefit. Seems like it's all
> send-side, but the benefit is receiver-side. Also worth noting that this
> then would allow widescale impact by deploying this at busy servers,
> avoiding per-client deployment for benefit.

OK... added a couple words to the intro.

> In 2.1, do you want to define this in terms a fixed value of 4*SMSS,
> or define it as a pointer (i.e., to the initial CWND, so if init CWND
> increases, so does this?) same for the part about packet-based (again,
> would that be segment-based?) not referring to 4, but the number of
> segments in the initial CWND (e.g., as "currently 4" -- PS, should
> that be 4, or shouldn't it be "initial_CWND/SMSS", i.e., a max of 4,
> but in most current cases it seems like this would still be 3).

No, I don't.  This doesn't have anything to do with the initial
congestion window size.  The "4"---which I thought was well motivated,
perhaps not---comes from fast retransmit's magic constant of "3".  I.e.,
if there are at least 4 segments outstanding and we lose one then we'll
have a shot at getting 3 dupacks.  If there are fewer segments
outstanding then we will have no chance at getting 3 dupacks.  So, this
has nothing to do with the initial window.
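
The arithmetic behind the "4" can be sketched as follows.  This is an
illustration only: it assumes exactly one lost segment, no new data sent
during the episode (so no Limited Transmit), an immediate ACK per
arriving segment, and no ACK loss or reordering.  The function names are
made up for the example.

```python
DUPACK_THRESHOLD = 3  # fast retransmit's constant


def max_dupacks(outstanding_segments, lost=1):
    """Upper bound on duplicate ACKs: one per segment that still arrives."""
    return max(outstanding_segments - lost, 0)


def fast_retransmit_possible(outstanding_segments):
    """True if losing one segment can still yield three duplicate ACKs."""
    return max_dupacks(outstanding_segments) >= DUPACK_THRESHOLD
```

With 4 segments outstanding the bound is 3 duplicate ACKs, which is just
enough; with 3 or fewer the threshold cannot be reached.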

> (2.b) should call it the 'advertised receive window' (or
> receiver-advertised window, whatever is more common) for clarity

Fixed.  (Also (3.b).)

> These rules (2.a, 2.b) seem odd in the context of saying this is a MAY
> for SCTP (above the list) and then have a different set of rules in the
> paragraph below (end of page 4). IMO, put in two different rulesets
> (ditto for section 3, FWIW):
> 	1. TCP without SACK or for connections not supporting SACK
> 	2. TCP with SACK and SCTP

Good catch.  I did not do major surgery, but added some words around the
SACK / non-SACK variants to make it clear that the non-SACK variant is
for TCP only.

> This would avoid the self-reference to Early Rexmit in the last
> paragraph, which is (AFAICT, again), not yet defined. Did you mean to
> define it above:
>     When the above two conditions hold and the connection does not
>     support SACK the duplicate ACK threshold used to trigger a
>     retransmission MUST be reduced to:
>                   ER_thresh = ceiling (ownd/SMSS) - 1
> I would add "we call this reduced ACK threshold enabling 'Early
> Retransmission', and when a retransmission occurs because of ER_thresh,
> we call that an Early Retransmission.", even if you split out the rules
> for non-SACK and SACK. This allows you to refer to it at the top of page
> 5 correctly, since it would now be defined.

OK, I didn't add those words exactly, but I did add a quick sentence
that concretely defines things (in both 2.1 & 2.2).

> Also, maybe I'm missing something, but I searched for ER_thresh all over
> the place. It isn't *used* anywhere. I.e., you define a variable but
> never use it. Seems like you need to use it where you say "the timer
> (ER_thresh) goes off and ..." somewhere specific. However, you say that
> you're lowering the fast rexmit threshold. So then wouldn't you be
> setting "FR_thresh", not "ER_thresh"? Even if so, it's useful to recap
> how the *_thresh value is used.

There is no "FR_thresh" sort of variable defined in RFC5681.  So, while
I understand sort of what you are saying I think this ...

    When the above two conditions hold and a TCP connection does not
    support SACK the duplicate ACK threshold used to trigger a
    retransmission MUST be reduced to:

is clear about how one goes about using the threshold.  (This is an
example ... there are other places with the same text.)
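
For readers following along, here is a minimal sketch of how the
threshold might be applied in the byte-based, non-SACK case.  It uses
the ER_thresh formula quoted above.  It takes condition (2.a) to be
"ownd is less than 4*SMSS" and collapses condition (2.b) into a single
boolean saying whether new data could be sent; both readings are
assumptions for the sake of the example, as are the function and
parameter names.

```python
import math

STANDARD_DUPACK_THRESHOLD = 3  # fast retransmit's constant


def dupack_threshold(ownd, smss, new_data_sendable):
    """Duplicate ACK threshold for a TCP connection without SACK.

    ownd: outstanding (sent but unacknowledged) bytes.
    smss: sender maximum segment size in bytes.
    new_data_sendable: hypothetical stand-in for condition (2.b); True
        when unsent data exists and the advertised receive window would
        allow sending it.
    """
    if ownd < 4 * smss and not new_data_sendable:
        # ER_thresh = ceiling(ownd/SMSS) - 1.  Note that this is 0 when
        # at most one segment's worth of data is outstanding; no
        # duplicate ACK can be generated in that case anyway.
        return math.ceil(ownd / smss) - 1
    return STANDARD_DUPACK_THRESHOLD
```

For example, with an SMSS of 1460 bytes and 3000 bytes outstanding, the
sketch returns 2; once ownd reaches 4*SMSS, or new data can be sent, it
returns the standard 3.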

> The examples on page 5 need to include a bit about Nagle; if Nagle is
> on, you would never have three outstanding 400-byte segments  ;-)

I didn't change anything here.  You can push back some more, but these
examples are in fact advertised as illustrations and I am not sure I
want to get into needless discussions about Nagle here.  You are right
that if Nagle is enabled we wouldn't have 10 400-byte segments
outstanding.  But, if Nagle is not enabled we certainly could.  I don't
think for the purposes of the examples this is an overly important

> When you talk about packet-based rexmit, did you want to say as well
> that "a TCP or SCTP that implements packet-based rexmit MUST NOT also
> implement byte-based rexmit", i.e., that packet-based rexmit supersedes
> byte-based rexmit? I could see the two MAYs being considered
> simultaneously, and that would be bad, no?

The intent was "here are two choices, pick one".  In the intro to
section 2 I added this:

    This document explicitly does not prefer one variant over the other,
    but leaves the choice to the implementer.

> When you say MUST NOT use ER, do you mean to use FR (fast) and LR
> (Limited), or is LR a superset of FR?

I meant LT and FR.  The words now say:

    When conditions (2.a) and (2.b) do not hold, the transport MUST NOT
    use Early Retransmit, but rather prefer the standard mechanisms,
    including Fast Retransmit and Limited Transmit.

(and, also they were changed in 2.2).

> Not sure about the explanation of the circular list for keeping track
> of segment boundaries. You say "fall within this region", but with
> SACK the region can be disconnected, so I'm not sure how to interpret
> "region".  Also, seems like you need to know the length of the
> segments too, no?  You can't assume they're all the same size...

Keeping the list of sequence numbers *does* keep track of the segment
sizes.  And, SACK does not come into play.  This is a list of
*transmitted* segments.  I added a couple words to note that this is the
sender's notion of the right side of the current window.
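
A rough sketch of such a list is below.  It is only an illustration:
the class and method names are invented, new data is assumed to be
sent contiguously, and sequence-number wraparound is ignored.  Four
boundaries describe the three most recently transmitted segments, and
the gaps between adjacent boundaries are the segment sizes.

```python
from collections import deque


class SegmentBoundaries:
    """Boundaries of the last three segments the sender transmitted."""

    def __init__(self):
        # Four sequence numbers delimit three segments; the deque
        # silently drops the oldest boundary when a fifth is added.
        self._bounds = deque(maxlen=4)

    def on_transmit(self, seq, length):
        """Record a newly transmitted segment [seq, seq + length)."""
        if not self._bounds:
            self._bounds.append(seq)
        self._bounds.append(seq + length)

    def outstanding(self, snd_una):
        """Segments outstanding given the cumulative ACK point snd_una.

        Returns None when snd_una lies to the left of the tracked
        region, meaning at least four segments are outstanding.
        """
        if not self._bounds:
            return 0
        if snd_una < self._bounds[0]:
            return None
        return sum(1 for b in self._bounds if b > snd_una)
```

Because only transmitted boundaries are stored, segments of different
sizes are handled without any extra bookkeeping.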

> Which brings me to a wrinkle - what happens if TCP resends data with
> different segment sizes, resulting in some segments being on different
> boundaries than those that may already be received (e.g., on multipath
> when a PMTU update comes in, and data is resent and resegmented
> differently). Your segment alg needs to be robust to this, or you need
> to explain why that doesn't matter.

It doesn't matter because once you retransmit all is moot.  Early
retransmit only *initiates* loss recovery.  What happens after that is
not ER's problem.

> Sec 3 (Discussion) should start with an overview of what you intend to
> discuss. Are these all issues? Do you want to say the benefits too?
> Impact on legacy receivers (if any)? Deployment motivation (who does it
> benefit, and who does the work?) Deployment asymmetry? etc. I'd then
> break the section down to these sorts of topics (e.g., benefits, impact
> of failure, deployment). (right now you jump in to the details of the
> preferred variant - which IMO belongs in the packet-based section, not
> here, then jump to the impact of failure without giving examples of
> benefit first)

Oh, I dunno ... I agree that this section is a collection of disparate
things.  I just added some subsections to perhaps make things flow

> The SACK preference discussion needs more lead-in. The first paragraph
> could easily be broken into three paragraphs with better context, and
> would make the argument more clear.

I didn't do anything here.  I re-read and it looks OK to me and nobody
else has tripped on it.  So, unless you want to say more about what
exactly you're looking for here I am going to think it's essentially

> Related work needs an intro. 

OK, I added a sentence.

> Also, doesn't the second one
> (Bal98) also result in a self-attack, i.e., it opens the CWND because
> an additional ACK will be received, even though no data is sent? I.e.,
> there's a side-effect to this that is probably worth
> avoiding... (separate question - has anyone noted that 0-window
> segment ACKs shouldn't open the window, or is that already the case?)

I don't think it is worth getting into such details in the related work
section of this document.  If someone wants to write the spec for the
scheme from [Bal98] then we can argue about the specifics.

> Other considerations: seems like you're making TCP send more segments
> into the net when data is being lost, vs. the existing mechanisms. If
> that's the case, and if loss is due to buffer overload, are you making
> things potentially worse? If not, please explain.

I don't understand this point.  Is it relative to ER or [Bal98]?

> I don't see why you have a normative reference to Eifel; you don't
> depend on it that way. Seems like that's an informational reference in
> this case - esp. since you refer to it in the appendix discussing
> research options, not the body.

Yep.  That has been moved previously.  There is *no* reliance on Eifel.
For more than one reason.

Again, many thanks for the detailed comments.  I have rolled the changes
enumerated above into my working version.  My collaborators will have to
put their oar in the water, but I expect we'll get a new version pushed
to the repository shortly.