[Tmrg] tmix-linux: burst of FIN packets causes packet loss at end of experiment

ritesh at cs.unc.edu (Ritesh Kumar) Fri, 17 October 2008 00:08 UTC

From: "ritesh at cs.unc.edu"
Date: Thu, 16 Oct 2008 20:08:05 -0400
Subject: [Tmrg] tmix-linux: burst of FIN packets causes packet loss at end of experiment
In-Reply-To: <48F6CEB8.2010604@room52.net>
References: <48F5291C.8020507@swin.edu.au> <aa7d2c6d0810141854v50c9d379wcbb54435e71e890f@mail.gmail.com> <48F6CEB8.2010604@room52.net>
Message-ID: <f47983b00810161708k7d93b412n4b787f2dd931fc12@mail.gmail.com>

On Thu, Oct 16, 2008 at 1:18 AM, Lawrence Stewart <lstewart at room52.net> wrote:

> Hi Lachlan and all,
>
> Lachlan Andrew wrote:
>
> [snip]
>
> >
> > There is a fundamental statistical need to ignore the first little part,
> > so that the number of connections can reach "steady state", but I don't
> > know any fundamental reason to ignore the last 1/3 of the experiment,
> > provided that the traffic generator ends flows cleanly.  Since our suite
> > is very time-constrained, we want to cut out any unnecessary waiting.
> >
> >
>

I think it's a good idea to cut down as much wait time as possible.
The reason we chose to ignore the last 1/3 of the experiment is that tmix
logs full results (response times, etc.) only for connections that complete.
Near the end of a run there are many connections that don't finish (and so
never get logged), even though they definitely affect the traffic dynamics.
Hence, if you look at the connection arrival/departure process reconstructed
from the tmix results (which you can build from the connection start times
and durations), you will notice a ramp up _and_ a ramp down in the number of
connections.
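
To make that concrete, here is a minimal sketch of the kind of reconstruction
I mean, assuming you have already pulled (start time, duration) pairs out of
the tmix results; the log parsing itself is left out since it depends on your
setup:

    # Sketch: number of simultaneously active connections over time, built
    # from per-connection (start_time, duration) pairs in seconds.
    def active_connections(records, bin_size=1.0):
        events = sorted([(s, +1) for s, d in records] +
                        [(s + d, -1) for s, d in records])
        if not events:
            return []
        samples, active, next_bin = [], 0, events[0][0]
        for t, delta in events:
            while t >= next_bin:          # emit one sample per bin boundary
                samples.append((next_bin, active))
                next_bin += bin_size
            active += delta
        return samples

Plotting the active count against time shows the ramp up at the start of the
experiment and the ramp down at the end.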

I understand that in many scenarios one may not need to follow this
recommendation. However, I would instead recommend not stopping tmix at a
fixed time, but only once all connections are done. I believe omitting the
-d <time> switch does that automatically.


>
> > I'm Cc'ing this to TMRG in case someone on the list knows of a strong
> > reason to ignore the end of an experiment (or knows a way around the
> > SNMP problem).
>
> I certainly couldn't say that ignoring the last 1/3 of an experiment is
> necessary in my experience. However, I have observed some behaviour in the
> FreeBSD TCP implementation which might be relevant to other TCPs as well
> and pertinent to this discussion.
>
> Increasing the tx socket buffer size at the sender can lead to a
> situation where, at the end of the connection, the userland process
> closes the socket but the kernel finds itself with a large buffer of
> data still needing to be sent. Going on memory here, I recall observing
> that sometimes (I haven't taken the time to narrow down when/why) some
> of the TCP variables apparently get messed up, e.g. cwnd can take on
> unexpected values while the buffer is being flushed. I can't be more
> specific than that right now, but I hope to sit down and nut it out at
> some point.
>
> Stepping back further, one might reasonably ask why you'd need to
> increase the tx socket buffer to a size where this problem is
> possible... I noticed by trial and error that, when trying to use a
> non-real-time OS to do traffic generation, vagaries in kernel scheduling
> sometimes meant that you could end up with an empty tx socket buffer for
> periods of time during transmission unless the buffer was sized
> substantially larger than the BDP of the path.
>
> I've worked around the issue by chopping the end off files after the
> time at which the traffic generation process (iperf in my case) closes
> the socket (normally a few seconds at most depending on the test
> parameters). Definitely not ideal, I know, but it works around the issue
> and I thought it was a story from the coal face worth sharing.
>

Wow... that's a really interesting scenario. However, I always thought that a
close() on a connection would block until all the data for the connection has
been sent. Maybe I am wrong... these overloaded syscalls are generally weird
in more than one way.
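
For the archives, my (possibly wrong) understanding is that a plain close()
returns immediately and leaves the kernel to flush whatever is still in the
send buffer in the background; it is SO_LINGER that makes close() actually
block until the data is out or a timeout expires. A minimal sketch of that
knob, in Python for brevity:

    import socket
    import struct

    def close_after_flush(sock, timeout_s=30):
        # By default close() returns right away and the kernel keeps trying
        # to deliver the remaining send buffer in the background.  With
        # SO_LINGER enabled and a non-zero timeout, close() blocks until the
        # queued data has been sent and acknowledged, or timeout_s expires.
        linger = struct.pack('ii', 1, timeout_s)  # l_onoff=1, l_linger=timeout_s
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER, linger)
        sock.close()

(Setting l_linger to 0 instead makes close() drop the queued data and send a
RST, which is the opposite of what a traffic generator wants at the end of a
run.)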

My experience with Linux's TCP implementation is the following:
1) I have never tried pushing more than 1 Gbps through a single connection on
my network, but I didn't need to change the tx socket buffer parameters:
Linux TCP has socket buffer autotuning, which automatically allocates more
memory to the TCP buffers as needed. When I did try to set these parameters
by hand, it turned off autotuning and gave me errors once I pushed a really
large number of simultaneous connections (~4000) through my testbed setup
(see the sketch after this list).
2) To make sure we don't run into scheduling vagaries, we look at the CPU
usage reported in the tmix logs and check that it does not stay high for any
significant length of time. We also run experiments at a lower traffic load
than the tcvecs we used for calibration. It is not uncommon to see different
CPU usage statistics for different origin traces at the same offered load, so
we run calibration experiments with all of our origin traces.
3) Linux TCP has a sysctl variable, net.ipv4.tcp_low_latency. It is disabled
by default, which enables "TCP prequeueing": incoming data from the network
interface is placed on a per-socket prequeue and processed in the receiving
process's context. I haven't tried or measured it, but maybe some scheduling
vagaries could be avoided by enabling low latency? I still prefer
over-provisioning our testbed with more CPUs to figuring out the detailed
inner workings of a complicated kernel like Linux :)
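
As a footnote to points 1) and 3), here is a small sketch of the two knobs
involved, to the best of my understanding of Linux behaviour (so treat it as
an illustration, not gospel): explicitly setting SO_SNDBUF pins the buffer
size and turns off send-buffer autotuning for that socket, while leaving it
alone lets the kernel grow the buffer within the net.ipv4.tcp_wmem limits.

    import socket

    def make_sender(pin_sndbuf_bytes=None):
        # Leaving pin_sndbuf_bytes as None keeps Linux's send-buffer
        # autotuning.  Passing a value calls setsockopt(SO_SNDBUF), which
        # fixes the buffer (the kernel doubles the requested value, capped
        # by net.core.wmem_max) and disables autotuning for this socket.
        # The effective size can be checked afterwards with
        # s.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF).
        s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        if pin_sndbuf_bytes is not None:
            s.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, pin_sndbuf_bytes)
        return s

    def tcp_low_latency_enabled():
        # The sysctl from point 3); 0 (the default) means prequeueing is used.
        with open('/proc/sys/net/ipv4/tcp_low_latency') as f:
            return int(f.read().strip()) != 0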

Warm regards,
Ritesh