Re: [tcpm] Ordering of SACK blocks, flushing of reassembly queue after inactivity

Andre Oppermann <> Thu, 21 February 2008 22:52 UTC

Return-Path: <>
Received: from localhost (localhost []) by (Postfix) with ESMTP id E88AA3A6CF5; Thu, 21 Feb 2008 14:52:19 -0800 (PST)
X-Virus-Scanned: amavisd-new at
X-Spam-Flag: NO
X-Spam-Score: -0.617
X-Spam-Status: No, score=-0.617 tagged_above=-999 required=5 tests=[AWL=-0.180, BAYES_00=-2.599, FH_RELAY_NODNS=1.451, HELO_MISMATCH_ORG=0.611, RDNS_NONE=0.1]
Received: from ([]) by localhost ( []) (amavisd-new, port 10024) with ESMTP id NWS+sELUbWnF; Thu, 21 Feb 2008 14:52:18 -0800 (PST)
Received: from (localhost []) by (Postfix) with ESMTP id 6DFA53A6B51; Thu, 21 Feb 2008 14:52:18 -0800 (PST)
Received: from localhost (localhost []) by (Postfix) with ESMTP id 5BC823A6882 for <>; Thu, 21 Feb 2008 14:52:17 -0800 (PST)
X-Virus-Scanned: amavisd-new at
Received: from ([]) by localhost ( []) (amavisd-new, port 10024) with ESMTP id Ojqge3aTckPL for <>; Thu, 21 Feb 2008 14:52:15 -0800 (PST)
Received: from ( []) by (Postfix) with ESMTP id 56AFC3A683C for <>; Thu, 21 Feb 2008 14:51:49 -0800 (PST)
Received: (qmail 35990 invoked from network); 21 Feb 2008 22:07:41 -0000
Received: from localhost (HELO []) ([]) (envelope-sender <>) by (qmail-ldap-1.03) with SMTP for <>; 21 Feb 2008 22:07:41 -0000
Message-ID: <>
Date: Thu, 21 Feb 2008 23:51:44 +0100
From: Andre Oppermann <>
User-Agent: Thunderbird (Windows/20071210)
MIME-Version: 1.0
To: Matt Mathis <>
References: <> <> <> <>
In-Reply-To: <>
Cc: TCP Maintenance and Minor Extensions WG <>
Subject: Re: [tcpm] Ordering of SACK blocks, flushing of reassembly queue after inactivity
X-Mailman-Version: 2.1.9
Precedence: list
List-Id: TCP Maintenance and Minor Extensions Working Group <>
List-Unsubscribe: <>, <>
List-Post: <>
List-Help: <>
List-Subscribe: <>, <>
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit

Matt Mathis wrote:
> If you are thinking at the scale of a dozen packet windows with just a 
> couple of losses, I don't disagree with you.  However, with 1000 packet 
> windows, you almost never see "just a couple of losses", and your 
> proposed simplification will be very exposed to lost ACKs.  Given that 
> the ACK rate doubles, your chances are not good.

Based on the excellent feedback and rationale I've implemented the
RFC2018 suggested ordering of the SACK blocks.  [The whole thing is
still in a separate development tree]


> BTW 1000 packet windows are not all that big: 1500 Byte MTU, 70 ms RTT 
> and only 90 Mb/s - well within reach for any university user in the US 
> or Europe
> and FIOS users in Korea or Japan.   The real challenge is getting the 
> scoreboard to work well with 10k or 100k packet windows....
> Thanks,
> --MM--
> -------------------------------------------
> Matt Mathis
> Work:412.268.3319    Home/Cell:412.654.7529
> -------------------------------------------
> Evil is defined by mortals who think they know
> "The Truth" and use force to apply it to others.
> On Thu, 24 Jan 2008, Andre Oppermann wrote:
>> Matt Mathis wrote:
>>>> The first is the ordering of SACK blocks.  The block based reassembly
>>>> queue lends itself to generate the SACK option from.  The blocks are
>>>> ordered ascending from the first segment closest to rcv_nxt.  Per RFC
>>>> 2018 Section 4 the first SACK block MUST be the one with the most
>>>> recent addition to it.  This is easily done by adding a pointer to
>>>> the block last modified.  However a bit later it says the following
>>>> SACK blocks SHOULD be also in order of their 'recentness' while also
>>>> specifying they may be listed in arbitrary order.  Now tracking this
>>>> information in the block based and sequence number ordered reassembly
>>>> queue it a bit more involved and I wonder if it really necessary and
>>>> useful.  In general the SACK blocks tend to be ordered sequentially
>>>> after the first SACK block in single random loss cases.  Only in
>>>> packet reordering cases, possibly mixed with loss, this may not be
>>>> the case.  On the other hand the most valuable information for the
>>>> sender are the blocks closest to rcv_nxt so it can fill the hole and
>>>> complete the sequence space.  While the RFC leaves me free to do as
>>>> I please I wonder if there are observations or rationales that clearly
>>>> show the the ordering by recentness to be better than sequential
>>>> ordering after the first SACK block?
>>> RFC 2018 is written the way it is written to maximize the robustness 
>>> in the presence of lost ACKs.  If the network never looses ACKs, then 
>>> just reporting the single SACK block with the newest data is 
>>> guaranteed to be sufficient for the sender to have perfect knowledge 
>>> about the state of the reassembly queue at the receiver.
>>> If occasional ACKs and one of several non-contiguous retransmissions 
>>> are lost, the sender might not learn that some of the retransmissions 
>>> were successful if the 2nd and 3rd SACK block do not reflect the 
>>> history of the first block.
>>> Although in most cases you want to know about blocks closest to 
>>> rcv_nxt, that is not always true.  In particular if the two oldest 
>>> retransmissions are lost and then some ACKs are also lost, the sender 
>>> will not know that some of the later holes have been successfully 
>>> filled.  If you are going to take a timeout anyhow, this isn't so 
>>> important, but we tried to avoid timeouts at all costs.
>> This pretty much where my question originates from.  The approach of
>> reporting the most recent block first and then closest and second
>> closest to rcv_nxt works just fine as long as the number of holes is
>> not large or excessive.  However even then I may argue that filling
>> the hole after rcv_nxt is the most important goal, above reporting
>> second and third recent blocks (if different).  Once the first hole
>> is filled the closest and second closest blocks to rcv_nxt automatically
>> shift to the right by one and get reported immediately.  When a large
>> number of holes appear the sender is very likely to face a collapsed
>> congestion window and will have to throttle down its rate of sending.
>> This makes it probable that guaranteed knowledge further out is not
>> relevant at this point in time.  By the time this becomes relevant it
>> get reported because it gets closer to rcv_nxt.
>> I'm not forcefully arguing this view but want to bring it up for further
>> elaboration and comparison with the other recommended approaches.
>>> If the 2nd and 3rd block reflect most recent history of the 1st 
>>> block, the sender will have perfect knowledge of the reassembly queue 
>>> as long as the network never looses 3 consecutive ACKs (and usually 
>>> tolerates many more losses).
>>> The "may use other algorithms" is there because we suspect that you 
>>> can do even better, for example by making the 2nd and 3rd blocks 
>>> periodically or randomly re-report all old SACK blocks.  As far a I 
>>> know this has never been pursued, and we certainly did not want to 
>>> specify or even suggest it in 2018.
>>> Everyone else implements it as written....
>> FreeBSD currently does so as well.  It's in my development branch I'm
>> working and experimenting with alternative approaches.  Using the block
>> building logic of the reassembly queue itself for SACK reporting is
>> quite tempting.  ;-)
>>> Oh, except nix "timeouts MUST clear the scoreboard"... If you 
>>> preserve the scoreboard on a timeout, you might be able to do 
>>> goodput=throughput (zero duplicate data at the receiver) under very 
>>> high loss rates, which is a good thing.
>> Thanks for your extensive explanations.
>> -- 
>> Andre

tcpm mailing list