Re: [tcpm] Ordering of SACK blocks, flushing of reassembly queue after inactivity

Andre Oppermann <> Wed, 23 January 2008 23:14 UTC

Message-ID: <>
Date: Thu, 24 Jan 2008 00:14:51 +0100
From: Andre Oppermann <>
User-Agent: Thunderbird (Windows/20071210)
MIME-Version: 1.0
Subject: Re: [tcpm] Ordering of SACK blocks, flushing of reassembly queue after inactivity
References: <> <>
In-Reply-To: <>
Content-Type: text/plain; charset="ISO-8859-1"; format="flowed"
Content-Transfer-Encoding: 7bit

Joshua Blanton wrote:
> I don't have a good answer to your first question (other than to
> mention that, if you always send the most-recently-modified SACK
> regions, you ensure that they're sent multiple times - which is the
> only quasi-reliability you can create on the ACK path), but I would
> like to address the second.
> Andre Oppermann wrote:
>> The second is how long to hold onto data in the reassembly queue.
>> The general theme here is resource exhaustion be it through malicious
>> activity or just end points that drop off the net.  I think we can
>> all agree that holding onto reassembly queue data until the session
>> times out (if ever) is not really useful considering the overall
>> resource constraints.  The question now is after what time to flush
>> the reassembly queue (and to send an appropriate ACK)?  A range of
>> options are available.  On the wide side we have a flush timeout
>> of something like 2 times MSL.  On the small side we can go down to
>> the current calculated retransmit timeout value as seen from our side.
>> Also of importance is from where the timeout is calculated.  From the
>> time the first segment arrived in the reassembly queue (resetting when
>> rcv_nxt is advanced), or from the arrival time of the most recent
>> segment.  For the moment and testing I've chosen the former at four
>> times retransmit timeout as something that probably marks the boundary
>> between spurious network losses or partitioning and longer-term
>> disconnect or malicious activity.  Is any empirical data available on
>> abandoned sessions with data in the reassembly queue?  What is your
>> opinion and rationale on this?
> Well, I actually disagree that holding onto reassembly queue data is
> a lost cause, even after long periods of inactivity - so perhaps we
> don't all agree :-).  Certainly you could tell *after* the fact that
> holding such data was a fool's errand, if the connection is
> terminated; until that point, there's no reason to necessarily
> assume that the lack of progress in the connection is permanent.  In
> general, I would expect an operating system to hold on to reassembly
> data forever, assuming that there's no memory resource concern
> that makes the buffers valuable...  To flush data simply because
> wall-clock time has elapsed doesn't make sense to me, since I've
> seen many traces where "long" time periods have elapsed and then
> connections suddenly resume.  If there's no global "we're running
> out of memory" trigger available for a given OS, a stack could set a
> timer to fire at some arbitrary time (4*RTO, for instance) and check
> for memory pressure - *if* it exists, go ahead and flush the data.

The memory pressure check is the critical part here.  Otherwise I fully
agree that we should hold onto the reassembly queue forever.
Unfortunately we can't afford to do that, as memory is still limited.

The problematic part I'm trying to address here is the very difficult
definition of memory pressure.  In modern kernels this isn't as simple
as it seems at first glance.  Of course we can and do detect when we
run out of physical memory in the kernel.  However, memory pressure
starts a lot earlier and may manifest itself as a couple of subsystems
having trouble obtaining enough memory from their zones.  Memory may
also be put to more productive uses than reassembly queues, for
example disk buffers.  In SMP and NUMA systems the various per-CPU
memory pools may have associated memory regions depleted to different
levels.  On top of that, modern kernels run in memory overcommit mode,
where not all potential memory requirements can be fulfilled at the
same time.  Otherwise we would have to lock down the full socket
buffer space for every connection we might have, which would be very
inefficient and uneconomical.

Whether limited kernel memory is more valuable in a reassembly queue
than in other data structures really depends on the goals and purpose
of a particular system and its application setting.  All this, and a
lot more, makes it really hard to go for a purist solution.

We as developers of a general-purpose operating system (in this case
FreeBSD) have to choose appropriate limits and defaults for a wide
range of operating conditions.  Special and niche applications may
require specific tuning and explicit settings.

We have to find a good balance in the allocation of memory among its
various consumers in the kernel.  Not to forget, we also have to
protect to a certain extent against malicious attacks that try to chew
us up.  For this we use things like the syncache and other methods.
TCP reassembly is no exception.

The hard part, and the reason I've come here to solicit input, is to
decide where to set the limits.  There are a number of (imaginary)
intersecting curves that represent the usefulness of memory tied up
in an inactive or non-responsive reassembly queue vs. other valuable
uses the kernel may have for that memory.  The question is where these
curves intersect and where to set the cutoff point in the reassembly
queue.  If, for example, 98% of all sessions get their reassembly act
together within 2xRTO, it may be worth the negative impact on the
other 2% to opportunistically use the memory for other purposes.  Even
if we flush the reassembly queue in those two percent of cases it is a
graceful failure, as the connections are not terminated and stay
alive, although not at the theoretically optimal point from a network
resource conservation point of view.

I guess this really gets into a discussion about the economics theory
side of things...  ;-)

> I don't have any data showing how much reassembly data is left
> hanging around when a session is abandoned, but I have looked at
> quite a few traces trying to find SACK renegs (which would be the
> result of your data flushing).  In general, I believe that a scheme
> such as you're proposing is not used; other than some traces that
> I've found that p0f identifies as being FreeBSD receivers, there
> doesn't appear to be a solid link between connection progress
> timeouts and reneging.  I don't know the FreeBSD stack well enough
> to say that, in its current implementations, it definitely flushes
> reassembly queue data based on a timer - but I suspected that it
> did, and your question reinforces my suspicion.  If I am correct,
> and FreeBSD currently (5.x and 6.x) times out reassembly data as
> you're proposing, I've seen traces where this actually impedes a
> connection's recovery - so I vote against such a scheme.

The reassembly queues in FreeBSD 5, 6 and 7 do not get flushed except
under severe system-wide memory pressure (a point at which things are
already falling apart left and right).  All the work and changes I'm
doing and discussing here are on a separate development and testing
branch and won't be in official FreeBSD until everything is sorted out
and tested.

> Again, I have no problem with stacks flushing out-of-order data in
> the face of a low-memory condition.  Beyond that, I'd have to see
> some pretty convincing data that "long pauses == connection that
> will terminate without finishing," which is what I feel you're
> proposing.

Whether the connection will terminate without finishing is only an
optimization aspect.  The tipping point is "long pause && holding
on to valuable memory <> other more valuable use for memory".  Now
define "long pause" and the value of memory, and plot the trajectory.


tcpm mailing list