Re: [tcpm] Ordering of SACK blocks, flushing of reassembly queue after inactivity
Andre Oppermann <andre@freebsd.org> Wed, 23 January 2008 23:14 UTC
Return-path: <tcpm-bounces@ietf.org>
Received: from [127.0.0.1] (helo=stiedprmman1.va.neustar.com) by megatron.ietf.org with esmtp (Exim 4.43) id 1JHoo0-0001Wo-9u; Wed, 23 Jan 2008 18:14:52 -0500
Received: from tcpm by megatron.ietf.org with local (Exim 4.43) id 1JHonz-0001Wh-2M for tcpm-confirm+ok@megatron.ietf.org; Wed, 23 Jan 2008 18:14:51 -0500
Received: from [10.91.34.44] (helo=ietf-mx.ietf.org) by megatron.ietf.org with esmtp (Exim 4.43) id 1JHony-0001WZ-7L for tcpm@ietf.org; Wed, 23 Jan 2008 18:14:50 -0500
Received: from c00l3r.networx.ch ([62.48.2.2]) by ietf-mx.ietf.org with esmtp (Exim 4.43) id 1JHonx-0002fE-4Q for tcpm@ietf.org; Wed, 23 Jan 2008 18:14:50 -0500
Received: (qmail 86102 invoked from network); 23 Jan 2008 22:36:09 -0000
Received: from localhost (HELO [127.0.0.1]) ([127.0.0.1]) (envelope-sender <andre@freebsd.org>) by c00l3r.networx.ch (qmail-ldap-1.03) with SMTP for <jblanton@irg.cs.ohiou.edu>; 23 Jan 2008 22:36:09 -0000
Message-ID: <4797CA6B.9080502@freebsd.org>
Date: Thu, 24 Jan 2008 00:14:51 +0100
From: Andre Oppermann <andre@freebsd.org>
User-Agent: Thunderbird 1.5.0.14 (Windows/20071210)
MIME-Version: 1.0
To: jblanton@irg.cs.ohiou.edu
Subject: Re: [tcpm] Ordering of SACK blocks, flushing of reassembly queue after inactivity
References: <47952032.3030809@freebsd.org> <20080122154207.GD4320@mauser.ipx.ath.cx>
In-Reply-To: <20080122154207.GD4320@mauser.ipx.ath.cx>
Content-Type: text/plain; charset="ISO-8859-1"; format="flowed"
Content-Transfer-Encoding: 7bit
X-Spam-Score: 0.0 (/)
X-Scan-Signature: 8fbbaa16f9fd29df280814cb95ae2290
Cc: tcpm@ietf.org
X-BeenThere: tcpm@ietf.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: TCP Maintenance and Minor Extensions Working Group <tcpm.ietf.org>
List-Unsubscribe: <https://www1.ietf.org/mailman/listinfo/tcpm>, <mailto:tcpm-request@ietf.org?subject=unsubscribe>
List-Post: <mailto:tcpm@ietf.org>
List-Help: <mailto:tcpm-request@ietf.org?subject=help>
List-Subscribe: <https://www1.ietf.org/mailman/listinfo/tcpm>, <mailto:tcpm-request@ietf.org?subject=subscribe>
Errors-To: tcpm-bounces@ietf.org
Joshua Blanton wrote:
> I don't have a good answer to your first question (other than to
> mention that, if you always send the most-recently-modified SACK
> regions, you ensure that they're sent multiple times - which is the
> only quasi-reliability you can create on the ACK path), but I would
> like to address the second.
> 
> Andre Oppermann wrote:
>> The second is how long to hold onto data in the reassembly queue.
>> The general theme here is resource exhaustion, be it through malicious
>> activity or just end points that drop off the net.  I think we can
>> all agree that holding onto reassembly queue data until the session
>> times out (if ever) is not really useful considering the overall
>> resource constraints.  The question now is after what time to flush
>> the reassembly queue (and to send an appropriate ACK)?  A range of
>> options is available.  On the wide side we have a flush timeout of
>> something like 2 times MSL.  On the small side we can go down to the
>> currently calculated retransmit timeout value as seen from our side.
>> Also of importance is from where the timeout is calculated: from the
>> time the first segment arrived in the reassembly queue (resetting when
>> rcv_nxt is advanced), or from the arrival time of the most recent
>> segment.  For the moment and for testing I've chosen the former, at
>> four times the retransmit timeout, as something that probably marks
>> the boundary between spurious network losses or partitioning and
>> longer-term disconnect or malicious activity.  Is any empirical data
>> available on abandoned sessions with data in the reassembly queue?
>> What is your opinion and rationale on this?
> 
> Well, I actually disagree that holding onto reassembly queue data is
> a lost cause, even after long periods of inactivity - so perhaps we
> don't all agree :-).  Certainly you could tell *after* the fact that
> holding such data was a fool's errand, if the connection is
> terminated; until that point, there's no reason to necessarily
> assume that the lack of progress in the connection is permanent.  In
> general, I would expect an operating system to hold on to reassembly
> data forever, assuming that there's no memory resource concern that
> makes the buffers valuable...  To flush data simply because
> wall-clock time has elapsed doesn't make sense to me, since I've
> seen many traces where "long" time periods have elapsed and then
> connections suddenly resume.  If there's no global "we're running
> out of memory" trigger available for a given OS, a stack could set a
> timer to fire at some arbitrary time (4*RTO, for instance) and check
> for memory pressure - *if* it exists, go ahead and flush the data.

The memory pressure check is the critical part here.  Otherwise I
fully agree that we should hold onto the reassembly queue forever.
Unfortunately we can't afford to do that, as memory is still limited.

The problematic part I'm trying to address here is the very difficult
definition of memory pressure.  In modern kernels this isn't as simple
as it seems at first glance.  Of course we can and do detect when we
run out of physical memory in the kernel.  However, memory pressure
starts a lot earlier and may manifest itself as a couple of subsystems
having trouble obtaining enough memory from their zones.  Memory may
also be put to more productive uses than reassembly queues, for
example disk buffers.  In SMP and NUMA systems the various per-CPU
memory pools may have associated memory regions which may be depleted
to different levels.  On top of that, modern kernels run in memory
overcommit mode, where not all potential memory requirements can be
fulfilled at the same time.  Otherwise we would have to lock down the
full socket buffer space for every connection we may have, which would
be very inefficient and uneconomical.
Whether limited kernel memory is more valuable in a reassembly queue
than in other data structures really depends on the goals and purpose
of a particular system and its application setting.  All this and a
lot more makes it really hard to go for a purist solution.  We as
developers of a general purpose operating system (in this case
FreeBSD) have to choose appropriate limits and defaults for a wide
range of operating conditions.  Special and niche applications may
require specific tuning and explicit settings.  We have to find a good
balance in the allocation of memory to the various usages in the
kernel.  Not to forget, we also have to protect to a certain extent
against malicious attacks that try to chew us up.  For this we use
things like syncaches and other methods.  TCP reassembly is no
exception.

The hard part, and the reason I've come here to solicit input, is
deciding where to set the limits.  There are a number of (imaginary)
intersecting curves that represent the usefulness of memory tied up in
an inactive or non-responsive reassembly queue vs. other valuable uses
the kernel may have for that memory.  The question is where these
curves intersect and where to set the cutoff point for the reassembly
queue.  If, for example, 98% of all sessions get their reassembly act
together within 2xRTO, it may be worth the negative impact on the
other 2% to opportunistically use the memory for other purposes.  Even
if we flush the reassembly queue in those two percent of cases it is a
graceful failure, as the connections are not terminated and stay
alive, although not at the theoretically optimal point from a network
resource conservation point of view.  I guess this really gets into a
discussion about the economics theory side of things...  ;-)

> I don't have any data showing how much reassembly data is left
> hanging around when a session is abandoned, but I have looked at
> quite a few traces trying to find SACK renegs (which would be the
> result of your data flushing).  In general, I believe that a scheme
> such as you're proposing is not used; other than some traces that
> I've found that p0f identifies as being FreeBSD receivers, there
> doesn't appear to be a solid link between connection progress
> timeouts and reneging.  I don't know the FreeBSD stack well enough
> to say that, in its current implementations, it definitely flushes
> reassembly queue data based on a timer - but I suspected that it
> did, and your question reinforces my suspicion.  If I am correct,
> and FreeBSD currently (5.x and 6.x) times out reassembly data as
> you're proposing, I've seen traces where this actually impedes a
> connection's recovery - so I vote against such a scheme.

The reassembly queues in FreeBSD 5, 6 and 7 do not get flushed unless
under severe system-wide memory pressure (a time when things are
already falling apart left and right).  All the work and changes I'm
doing and discussing here are on a separate development and testing
branch and won't be in official FreeBSD until everything is sorted out
and tested.

> Again, I have no problem with stacks flushing out-of-order data in
> the face of a low-memory condition.  Beyond that, I'd have to see
> some pretty convincing data that "long pauses == connection that
> will terminate without finishing," which is what I feel you're
> proposing.

Whether the connection will terminate without finishing is only an
optimization aspect.  The tipping point is "long pause && holding on
to valuable memory <> other more valuable use for memory".  Now define
"long pause" and the value of the memory, and plot the trajectory.

-- 
Andre

_______________________________________________
tcpm mailing list
tcpm@ietf.org
https://www1.ietf.org/mailman/listinfo/tcpm