Re: [tcpm] Detect Lost Retransmit with SACK

"Scheffenegger, Richard" <rs@netapp.com> Mon, 09 November 2009 18:06 UTC

Date: Mon, 09 Nov 2009 18:05:53 -0000
From: "Scheffenegger, Richard" <rs@netapp.com>
To: Alexander Zimmermann <zimmermann@nets.rwth-aachen.de>
Cc: tcpm@ietf.org
Subject: Re: [tcpm] Detect Lost Retransmit with SACK

Hi group,

I forgot to mention the actual test scenario I used to profile all these
TCP stacks.

Basically, I used a userland TCP "forging" tool, where each frame can be individually
crafted (content, timing, loss).

My test opens a TCP session (an HTTP GET request, for simplicity's sake) with SACK
negotiated and then counts the segments being received, behaving (mostly) like a
well-behaved TCP client. However, certain segments are dropped a fixed number of
times whenever they appear in the stream:

Segment #:    200  250  253  255  257  258  259  260  265  267
Drop count:     1    1    1    1    1    2    1    1    1    1
 
The grace period of 200 packets is to have a decently wide-open cwnd; the drop at
segment 200 also serves to check that the cwnd is larger than 50 segments when the
burst drop (250-267) occurs, and to "prime" the SACK scoreboard (preventing
the sender from fastpathing). The burst in this case is along the time axis and,
around segment number 258, along the sequence-space axis as well...

None of the TCP stacks I have investigated so far was able to recover without
an RTO (between 0.2 and 1 sec later). Windows 7 was particularly peculiar, as it
starts shifting the original segment boundaries after the 2nd or 3rd dropped
segment; it seems to retransmit 1/2, 1, 1, 1, 1/2 segments if a contiguous hole
> 1 segment is being announced by SACK... :) But my code still drops the frame
containing the 258th segment's sequence number again, leading even Win7 to an RTO...



And, on another front, I have checked a few systems in the field (our gear
typically runs in high-speed (1/10 Gbps) LANs). I found one example where
nearly 50% of the retransmissions were followed by an RTO, and even the
less loaded systems showed a quite high rate of RTOs (15-35%) after
retransmissions.

I assume at this point that only a minority of the RTOs are "legitimate", in
the sense that

*) the TCP session is not running with SACK, or
*) the client was forcefully removed from the network (loss of connectivity).

That probably leaves between 70 and 95% of the RTO events as "burst loss"
candidates, where keeping the DUPACK detection armed during fast retransmit
would help.


I will see to it that I get statistically more relevant data, and also
put this into context (i.e. total segments transmitted per week vs. total
retransmitted segments per week vs. retransmit timeout events per week).

(Actually, I got scared at first when I saw that high-load system reporting
that 50% of all retransmissions were followed by RTOs... :) )


Richard Scheffenegger
Field Escalation Engineer
NetApp Global Support 
NetApp
+43 1 3676811 3146 Office (2143 3146 - internal)
+43 676 654 3146 Mobile
www.netapp.com 
Franz-Klein-Gasse 5
1190 Wien 


-----Original Message-----
From: Scheffenegger, Richard 
Sent: Monday, 9 November 2009 18:27
To: Alexander Zimmermann
Cc: tcpm@ietf.org
Subject: Re: [tcpm] Detect Lost Retransmit with SACK


Hi Alexander, 

Thanks for the welcome :) 

I'll fork another thread for the LimitedTransmit/FastRecovery vs. ABC interaction...


I will try to sketch an example to demonstrate the problem I'm trying to address:


Let's assume the cwnd is already open to at least 7 segments before the segment with sequence number 10000 becomes the first one to be dropped by the network.

Also, let's assume that FastRetransmit runs from the left edge of the leftmost hole
(SND.UNA) upwards, and that per ACK only a single segment is sent. 
              

             Triggering    ACK      Left     Right    Left     Right
             Segment                Edge 1   Edge 1   Edge 2   Edge 2

              9000          9000
             10000  (lost)       *
             11000  (lost)
             12000  (lost)
             13000  (lost)
             14000  (lost)
             15000         10000    15000    16000
             16000         10000    15000    17000
             17000         10000    15000    18000

3 ACKs trigger fast retransmit

             10000  (lost again) 
             11000         10000    11000    12000    15000    18000
             12000         10000    11000    13000    15000    18000
             13000         10000    11000    14000    15000    18000
-> Here we again have 3 ACKs indicating another loss of one of the retransmitted
packets. The leftmost hole did not change, while the overall number of octets
still missing in holes decreased over 3 consecutive ACKs (4, 3, and 2 segments
not yet marked by SACK).
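This signal can be read directly off the sender's scoreboard. A small Python
sketch (assuming 1000-byte segments; hole_state() is an illustrative helper,
not taken from any stack) reproduces the 4/3/2 progression from the three
ACKs above:

    def hole_state(ack, sack_blocks):
        """Return (left edge of the lowest hole, un-SACKed octets between
        the cumulative ACK and the highest SACKed octet)."""
        blocks = sorted(sack_blocks)
        covered = sum(right - left for left, right in blocks)
        missing = blocks[-1][1] - ack - covered
        left_edge = ack if blocks[0][0] > ack else blocks[0][1]
        return left_edge, missing

    # The three ACKs that follow the lost retransmission of 10000:
    for blocks in ([(11000, 12000), (15000, 18000)],
                   [(11000, 13000), (15000, 18000)],
                   [(11000, 14000), (15000, 18000)]):
        print(hole_state(10000, blocks))
    # -> (10000, 4000), (10000, 3000), (10000, 2000): the left edge is
    #    stuck while the missing octets shrink, so the retransmission
    #    itself has most likely been lost.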

Current behaviour of the investigated TCP stacks:
             14000         10000    11000    18000
(normal transmit resumes)
             18000         10000    11000    19000
             19000         10000    11000    20000
             20000         10000    11000    21000
             21000         10000    11000    22000
             22000         10000    11000    23000
             ::            ::       ::       ::
Eventually, the RTO fires, retransmitting the lost segment; this happens one full
RTO later, followed by slow start...

             50000         10000    11000    50000
             ::
             ::
             10000         50000

However, this can be somewhere between 0.2 and 1.0 sec later with a "fresh" TCP
session (no prior connection properties known (cached) by the sender). Most likely,
the cwnd has filled up much sooner (as demonstrated, the problem seems to be
most prominent in high-speed LANs), so that for nearly as long, no data is actually
transmitted.

          

Proposed behaviour: 

             Triggering    ACK      Left     Right    Left     Right
             Segment                Edge 1   Edge 1   Edge 2   Edge 2

              9000          9000
             10000  (lost)       *
             11000  (lost)
             12000  (lost)
             13000  (lost)
             14000  (lost)
             15000         10000    15000    16000
             16000         10000    15000    17000
             17000         10000    15000    18000

3 ACKs trigger fast retransmit

             10000  (lost again) *
             11000         10000    11000    12000    15000    18000
             12000         10000    11000    13000    15000    18000
             13000         10000    11000    14000    15000    18000

Once the ACK + SACK options indicate that the leftmost hole is not shrinking
while the SACKed octets are increasing (to deal with senders which interleave
one retransmission segment and one new segment, or when multiple holes
exist which are being filled, or when network reordering occurs, or when some
more segments get lost again):

Reset the retransmit vector to the beginning of the hole list (SND.UNA), clear
the duplicate counter (just in case one segment gets lost yet again during
retransmission), and keep the DUPACK detection logic armed...

Also, this reaction should not occur before 1 RTT has elapsed - so the ACKs
subsequent to the three which indicated the "lost again" segment will, in the
typical case, ensure that no segments are retransmitted needlessly. ACK processing
has to occur before deciding which segment (retransmit / new) to send next. Holes
will then be marked as fully retransmitted before the 2nd retransmission round
advances to them.
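A rough Python sketch of this heuristic (the interface and every name here -
srtt, on_ack(), stall_count - are illustrative assumptions, not taken from any
existing stack):

    DUP_THRESH = 3

    class LostRetransmitDetector:
        """Re-arm fast retransmit when the leftmost hole stalls although
        SACK keeps covering new octets (sketch only)."""
        def __init__(self, srtt):
            self.srtt = srtt            # smoothed RTT estimate
            self.rexmit_time = None     # when the lowest hole was last resent
            self.stall_count = 0        # consecutive "stalled" ACKs seen
            self.prev_sacked = 0        # SACKed octets seen on the last ACK

        def on_retransmit(self, now):
            self.rexmit_time = now
            self.stall_count = 0

        def on_ack(self, now, snd_una, prev_una, sacked_bytes):
            """Call once per SACK-carrying ACK, after the scoreboard update.
            Returns True when the lowest hole should be retransmitted again."""
            if snd_una == prev_una and sacked_bytes > self.prev_sacked:
                self.stall_count += 1   # hole stuck, but data still arriving
            else:
                self.stall_count = 0    # hole moved, or nothing new SACKed
            self.prev_sacked = sacked_bytes

            waited_one_rtt = (self.rexmit_time is not None
                              and now - self.rexmit_time >= self.srtt)
            if self.stall_count >= DUP_THRESH and waited_one_rtt:
                # The caller rewinds HighRxt to SND.UNA, applies a second
                # congestion response, and keeps DUPACK detection armed.
                self.on_retransmit(now)
                return True
            return False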

             10000         14000    15000      18000
             14000         18000
(normal transmit resumes, but with cwnd shrunk by 2 congestion events)
             18000         19000
             19000         20000


And yes, I was unclear in my use of terminology; I should probably have
said "pipe" instead of "cwnd" below, as cwnd is not touched during LimitedTransmit /
FastRetransmit...


Richard Scheffenegger
Field Escalation Engineer
NetApp Global Support 
NetApp
+43 1 3676811 3146 Office (2143 3146 - internal)
+43 676 654 3146 Mobile
www.netapp.com 
Franz-Klein-Gasse 5
1190 Wien 


-----Original Message-----
From: Alexander Zimmermann [mailto:zimmermann@nets.rwth-aachen.de] 
Sent: Monday, 9 November 2009 16:26
To: Scheffenegger, Richard
Cc: tcpm@ietf.org Extensions WG
Subject: Detect Lost Retransmit with SACK

Hi Richard,

First of all, welcome to the list :-)
Since your question is not really related to the poll, I changed the title...

Comments inline.


On 09.11.2009, at 13:57, Scheffenegger, Richard wrote:

>
>
> Hi Alexander et al.,
>
> This will be the first post to this group, so excuse me if I act 
> inappropriately.
>
> I'm curious about one little tidbit which has been bugging me for the
> better part of the last two months, and which is closely related to
> TCP SACK operations (thus it might belong to this thread?)
>
>
> The implicit assumption for TCP fast recovery is that packet loss
> happens randomly (i.e. to different segments each time) with low
> correlation between the drop events. Also, a drop event is used as an
> implicit signal to indicate congestion. So far, so good.
>
> It seems to me that the focus of most development has been the
> Internet environment - where statistical assumptions like the ones
> mentioned above arguably hold true.
>
> However, certain high-speed LANs seem to exhibit characteristics
> which don't play well with these implicit assumptions (uncorrelated
> packet loss) - the smaller the network, the more deviation from a
> "well-seasoned" link (exhibiting some form of congestion) is likely
> to occur.
>
> Also, as has been noted in prior research, many Internet routers
> use the more "TCP-friendly" RED or WRED queue policies over the
> simplistic tail drop most often encountered in LANs (the default
> policy of L2 switches and L3 routers).
>
> In one extreme, I have found a (misbehaving?) TCP stack/host which
> sends out a burst of segments (4-6) at 10GbE wirespeed, which
> immediately causes queue buffer overload and tail drop in the
> first-hop L2 switch when two such high-performance hosts try to
> establish a high-speed communication. In other words, the hosts
> themselves seem to make sure that there is a high correlation
> between TCP (fast) recovery and further packet loss.
>
>
> But what puzzles me the most: even with SACK-enabled TCP stacks,
> virtually no implementation can detect, or act upon, the loss of a
> retransmitted segment during fast recovery. This despite the fact
> that the stipulations in RFC 3517 require the receiver to make the
> information needed to detect such an event implicitly available to
> the sender: the first SACK option has to reflect the last segment
> which triggered this SACK.
>
> Together with the scoreboard held at the sender, it should be rather  
> easy to find out if the left edge of the lowest hole (relative to  
> stream octets) closes.

What do you mean by "left edge of the lowest hole"? Do you mean SND.UNA?
If the ACK covers SND.UNA, then it is a cumulative ACK.

>
> If that left edge stays constant for "DupThresh" number of ACKs
> which reduce the overall number of octets in holes (any one hole
> might close due to retransmitted packets still being received), AND
> the sender retransmits beginning with the lowest hole first, this
> would be a clear indication of another segment-retransmit loss...

Sorry, I don't understand. If we have 20 segments in flight and one
segment gets lost, you will retransmit the oldest outstanding segment
after 3 DUPACKs.
Then, assuming no reordering and no further loss, you will get 17
DUPACKs (without Limited Transmit) before your hole is closed.

What do I miss here?

Can you give me an example?

>
> Even a less speedy detection logic would work for SACK-enabled  
> sessions: once the fast recovery is finished from the sender's point  
> of view, if the receiver still complains about missing segments  
> (indicated by having the SACK rightmost edge - in the first slot  
> SACK option - at a segment higher than when fast recovery started),  
> another round of fast recovery could be invoked, rather than waiting  
> for RTO.
>
> Of course, the first approach would be better for low-cwnd sessions
> with only very few segments in transit - and both could be combined
> with the proposed SACK recovery speed-ups... (reducing DupThresh for
> low-cwnd sessions / when little data is being sent).
>
>
>
> Congestion control should react to this event (it will now, but only
> one RTO later...), and the SACK retransmit vector (HighRxt) reset,
> using LimitedTransmit for sending out the retransmission segments -
> once cwnd + pipe allows; any retransmitted segments still in the
> network will close their respective SACK holes before the new
> HighRxt advances to them.
>
> And, RTO should be reduced (I guess to nearly zero, between SACK-
> enabled hosts).
>
>
> I have run numerous tests to check the behavior of different TCP
> stacks (FreeBSD 4.2 - 8.0; Windows XP, Vista, 7, 2003; Linux 2.6.16
> and others).
>
>
> All these stacks seem to exhibit this issue. What I don't know yet
> is the percentage of multi-loss segment events triggering RTO - but
> I assume that the majority of RTOs happen because of this.
>
> In LAN environments (i.e. 10 GbE over 1 km at 2 ms latency due to
> the L2 hops in between) featuring relatively few streams, the effect
> of any single RTO can be quite tremendous - taking considerable
> theoretical bandwidth away from the session (i.e. a 1 sec minimum
> RTO equals 1.2 GB; even with more recent RTO values around 0.2 - 0.4
> sec, each RTO is still a few hundred MB of "lost" capacity under
> optimal circumstances).
>
>
> Nevertheless, I can't imagine that I am the first one to bring up
> this issue (despite having failed to find any study of this
> effect). :)
>
>
> One more clarification, which came up after I looked at the FreeBSD  
> implementation of Limited Transmit; this might be a nit-pick, but  
> when RFC 3042 is active, shouldn't ABC also be used during  
> LimitedTransmit / FastRecovery?

Why? One reason for ABC is lying receivers (ACK division). So, the
worst case is slow start...

> (FreeBSD MAIN is increasing cwnd by 1 MSS for each new ACK, instead
> of by the amount of data covered by that ACK...)

What do you describe here? Slow-Start?
RFC 3042 says: "The congestion window (cwnd) MUST NOT be changed when  
these new segments are transmitted."

>
> Thanks a lot!
>
>
> Best regards,

Alex

>
>
>
> Richard Scheffenegger
> Field Escalation Engineer
> NetApp Global Support
> NetApp
> +43 1 3676811 3146 Office (2143 3146 - internal)
> +43 676 654 3146 Mobile
> www.netapp.com
> Franz-Klein-Gasse 5
> 1190 Wien
>
> *	To: "tcpm@ietf.org WG Extensions" <tcpm@ietf.org>
> *	Subject: [tcpm] Should draft-ietf-tcpm-sack-recovery-entry update
> RFC 3517 (SACK-TCP)
> *	From: Alexander Zimmermann <alexander.zimmermann@nets.rwth-aachen.de>
> *	Date: Wed, 21 Oct 2009 12:22:50 +0200
>
>
> Hi folks,
>
> based on the fact that the draft "draft-ietf-tcpm-sack-recovery-
> entry" is now adopted as a WG item and intended to be a "standards
> track" document, I would like to start a poll/discussion on whether
> the draft should update RFC 3517 or not. Moreover, should we produce
> a separate document or an update of RFC 3517?
>
> a) separate document, do not update RFC 3517
> b) separate document, update RFC 3517
> c) RFC3517bis, obsolete RFC 3517
>
> //
> // Dipl.-Inform. Alexander Zimmermann
> // Department of Computer Science, Informatik 4
> // RWTH Aachen University
> // Ahornstr. 55, 52056 Aachen, Germany
> // phone: (49-241) 80-21422, fax: (49-241) 80-22220
> // email: zimmermann at cs.rwth-aachen.de
> // web: http://www.umic-mesh.net
> //
>
>
> _______________________________________________
> tcpm mailing list
> tcpm@ietf.org
> https://www.ietf.org/mailman/listinfo/tcpm

//
// Dipl.-Inform. Alexander Zimmermann
// Department of Computer Science, Informatik 4
// RWTH Aachen University
// Ahornstr. 55, 52056 Aachen, Germany
// phone: (49-241) 80-21422, fax: (49-241) 80-22220
// email: zimmermann@cs.rwth-aachen.de
// web: http://www.umic-mesh.net
//

_______________________________________________
tcpm mailing list
tcpm@ietf.org
https://www.ietf.org/mailman/listinfo/tcpm