[tcpm] Detect Lost Retransmit with SACK

Alexander Zimmermann <alexander.zimmermann@nets.rwth-aachen.de> Mon, 09 November 2009 15:27 UTC

From: Alexander Zimmermann <alexander.zimmermann@nets.rwth-aachen.de>
Date: Mon, 09 Nov 2009 16:27:42 +0100
To: Richard Scheffenegger <rs@netapp.com>
Cc: "tcpm@ietf.org Extensions WG" <tcpm@ietf.org>
Subject: [tcpm] Detect Lost Retransmit with SACK

Hi Richard,

first of all, welcome to the list :-)
Since your question is not really related to the poll, I have changed  
the subject line...

Comments inline.


On 09.11.2009, at 13:57, Richard Scheffenegger wrote:

>
>
> Hi Alexander et al.,
>
> This is my first post to this group, so excuse me if I act  
> inappropriately.
>
> I'm curious about one little tidbit which has been bugging me for  
> the better part of the last two months, and which is closely related  
> to TCP SACK operations (so it might belong in this thread?)
>
>
> The implicit assumption behind TCP fast recovery is that packet loss  
> happens randomly (i.e. to different segments each time), with low  
> correlation between the drop events. Also, a drop event is used as an  
> implicit signal of congestion. So far, so good.
>
> It seems to me that the focus of most developments has been the  
> Internet environment, where statistical assumptions like those  
> mentioned above arguably hold true.
>
> However, certain high-speed LANs seem to exhibit characteristics  
> which don't play well with these implicit assumptions (uncorrelated  
> packet loss): the smaller the network, the more deviation from a  
> "well-seasoned" link (exhibiting some form of congestion) is likely  
> to occur.
>
> Also, as has been noted in prior research, many Internet routers use  
> more "TCP-friendly" RED or WRED queue policies rather than the  
> simplistic tail drop most often encountered in LANs (the default  
> policy of L2 switches and L3 routers).
>
> In one extreme case, I have found a (misbehaving?) TCP stack/host  
> which sends out a burst of segments (4-6) at 10GbE wire speed,  
> immediately causing queue buffer overflow and tail drop in the  
> first-hop L2 switch when two such high-performance hosts try to  
> establish a high-speed connection. In other words, the hosts  
> themselves seem to make sure that there is a high correlation  
> between TCP (fast) recovery and further packet loss.
>
>
> But what puzzles me the most: even with SACK-enabled TCP stacks,  
> virtually no implementation can detect, or act upon, the loss of a  
> retransmitted segment during fast recovery. This despite the fact  
> that RFC 3517 requires the receiver to make the information needed  
> to detect such an event implicitly available to the sender: the  
> first SACK option has to reflect the last segment that triggered  
> the SACK.
>
> Together with the scoreboard held at the sender, it should be rather  
> easy to find out whether the left edge of the lowest hole (in terms  
> of stream octets) closes.

What do you mean by "left edge of the lowest hole"? Do you mean  
SND.UNA? If an ACK covers SND.UNA, then it is a cumulative ACK.
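To make my reading concrete, here is a quick sketch (hypothetical and  
byte-based; snd_una, snd_nxt and the list of SACKed ranges are invented  
inputs, not code from any real stack) of how a scoreboard yields the  
holes. Note that the lowest hole always begins at SND.UNA:

```python
def holes(snd_una, snd_nxt, sacked):
    """Compute the unSACKed gaps between SND.UNA and SND.NXT.

    `sacked` is a list of (start, end) byte ranges reported via SACK.
    Returns the holes as (start, end) ranges, lowest first.
    """
    gaps = []
    edge = snd_una
    for start, end in sorted(sacked):
        if start > edge:
            gaps.append((edge, start))  # a hole below this SACKed range
        edge = max(edge, end)
    if edge < snd_nxt:
        gaps.append((edge, snd_nxt))    # trailing hole up to SND.NXT
    return gaps
```

With snd_una=1000, snd_nxt=5000 and one SACKed range (2000, 3000), the  
lowest hole is (1000, 2000), i.e. its left edge is exactly SND.UNA.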

>
> If that left edge stays constant for "DupThresh" number of ACKs that  
> reduce the overall number of octets in holes (any one hole might  
> still close due to retransmitted packets being received), AND the  
> sender retransmits beginning with the lowest hole first, this would  
> be a clear indication of another lost retransmitted segment...

Sorry, I don't understand. If we have 20 segments in flight and one  
segment gets lost, you will retransmit the oldest outstanding segment  
after 3 DUPACKs.
Then, assuming no reordering and no further loss, you will get a  
further 16 DUPACKs (without Limited Transmit) before your hole is  
closed.

What do I miss here?

Can you give me an example?
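To make my own counting explicit, a toy sketch (the function and its  
parameters are invented for illustration; it assumes MSS-sized  
segments, that the oldest of the in-flight segments is the one lost,  
no reordering, no further loss, and no Limited Transmit):

```python
def dupack_count(in_flight, dupthresh=3):
    """Count DUPACKs for one lost segment among `in_flight` segments.

    Every segment above the hole elicits one duplicate ACK; the first
    `dupthresh` of them trigger the fast retransmit, the rest arrive
    while the retransmission is still in flight.
    """
    total = in_flight - 1              # all segments except the lost one
    after_rexmit = total - dupthresh   # DUPACKs arriving after the retransmit
    return total, after_rexmit
```

With 20 segments in flight that gives 19 DUPACKs in total, 16 of them  
after the fast retransmit has already been sent.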

>
> Even a less speedy detection logic would work for SACK-enabled  
> sessions: once fast recovery is finished from the sender's point of  
> view, if the receiver still complains about missing segments  
> (indicated by the rightmost SACK edge, in the first SACK option  
> slot, being at a segment higher than where fast recovery started),  
> another round of fast recovery could be invoked, rather than waiting  
> for an RTO.
>
> Of course, the first approach would be better for low-cwnd sessions  
> with only very few segments in transit, and both could be combined  
> with the proposed SACK recovery speed-ups... (reducing DupThresh for  
> low-cwnd sessions / when little data is being sent).
>
>
>
> Congestion control should react to this event (it does today, but  
> only one RTO later...), and the SACK retransmit pointer (HighRxt)  
> should be reset, using Limited Transmit to send out the  
> retransmission segments once cwnd + pipe allows; any retransmitted  
> segments still in the network will close their respective SACK holes  
> before the new HighRxt advances to them.
>
> And the RTO should be reduced (I guess to nearly zero between SACK- 
> enabled hosts).
>
>
> I have run numerous tests to check the behavior of different TCP  
> stacks (FreeBSD 4.2 - 8.0; Windows XP, Vista, 7, 2003; Linux 2.6.16  
> and others).
>
>
> All these stacks seem to exhibit this issue. What I don't know yet  
> is the percentage of multi-loss segment events triggering an RTO,  
> but I assume that the majority of RTOs happen because of this.
>
> In LAN environments (i.e. 10GbE over 1 km at 2 ms latency due to  
> the L2 hops in between) featuring relatively few streams, the effect  
> of any single RTO can be quite tremendous, taking considerable  
> theoretical bandwidth away from the session (i.e. a 1 s minimum RTO  
> equals about 1.2 GB; even with more recent RTO values around 0.2 -  
> 0.4 s, each RTO still costs a few hundred MB of "lost" capacity  
> under optimal circumstances).
>
>
> Nevertheless, I can't imagine that I am the first one to bring up  
> this issue (despite having failed to find any study of this  
> effect). :)
>
>
> One more clarification, which came up after I looked at the FreeBSD  
> implementation of Limited Transmit; this might be a nit-pick, but  
> when RFC 3042 is active, shouldn't ABC also be used during Limited  
> Transmit / fast recovery?

Why? One reason for ABC is lying receivers (ACK division). So the  
worst case is slow start...
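To illustrate what ACK division does, a toy comparison (the function  
and the numbers are invented for illustration) of classic per-ACK  
counting versus Appropriate Byte Counting (RFC 3465), when a lying  
receiver splits the ACK for one 1000-byte segment into ten 100-byte  
ACKs:

```python
def grow_cwnd(acked_bytes_per_ack, mss=1000, abc=False):
    """Grow a slow-start cwnd over a sequence of ACKs.

    Classic counting adds 1 MSS per ACK, however little data the ACK
    covers; ABC adds the number of newly ACKed bytes, capped at 1 MSS.
    """
    cwnd = mss  # initial window of 1 MSS, for illustration
    for acked in acked_bytes_per_ack:
        if abc:
            cwnd += min(acked, mss)  # byte counting, capped per ACK
        else:
            cwnd += mss              # per-ACK counting
    return cwnd
```

Per-ACK counting lets the ten divided ACKs inflate cwnd as if ten full  
segments had been delivered; with ABC the divided ACKs gain nothing  
over one honest cumulative ACK.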

> (FreeBSD MAIN increases cwnd by 1 MSS for each new ACK, instead of  
> by the amount of data in that ACK...)

What are you describing here? Slow start?
RFC 3042 says: "The congestion window (cwnd) MUST NOT be changed when  
these new segments are transmitted."
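For illustration, a minimal sketch of that rule (the state dict and  
its field names are my own invention, not FreeBSD code): on the first  
two duplicate ACKs a previously unsent segment may go out, provided  
the receiver's window and the RFC 3042 flight-size bound allow it, but  
cwnd is left untouched.

```python
def on_dupack(s, dupthresh=3):
    """Process one duplicate ACK under RFC 3042 Limited Transmit."""
    s['dup_acks'] += 1
    if s['dup_acks'] < dupthresh:
        # Limited Transmit: send one new segment if (a) the receiver's
        # advertised window allows it and (b) outstanding data stays
        # within cwnd + 2*MSS.
        fits_rwnd = s['rcv_wnd'] >= s['flight_size'] + s['mss']
        fits_bound = s['flight_size'] + s['mss'] <= s['cwnd'] + 2 * s['mss']
        if fits_rwnd and fits_bound:
            s['flight_size'] += s['mss']
            s['sent_new'] += 1
        # Deliberately no change to s['cwnd'] here (RFC 3042).
    else:
        # DupThresh reached: enter fast retransmit / fast recovery.
        s['in_recovery'] = True
    return s
```

The key point is that nothing touches cwnd on the first two DUPACKs;  
any per-ACK cwnd increase there would be slow-start behavior, not  
Limited Transmit.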

>
> Thanks a lot!
>
>
> Best regards,

Alex

>
>
>
> Richard Scheffenegger
> Field Escalation Engineer
> NetApp Global Support
> NetApp
> +43 1 3676811 3146 Office (2143 3146 - internal)
> +43 676 654 3146 Mobile
> www.netapp.com
> Franz-Klein-Gasse 5
> 1190 Wien
>
> *	To: "tcpm@ietf.org Extensions WG" <tcpm@ietf.org>
> *	Subject: [tcpm] Should draft-ietf-tcpm-sack-recovery-entry update  
> RFC 3517 (SACK-TCP)
> *	From: Alexander Zimmermann <alexander.zimmermann@nets.rwth-aachen.de>
> *	Date: Wed, 21 Oct 2009 12:22:50 +0200
>
> _____
>
> Hi folks,
>
> based on the fact that the draft "draft-ietf-tcpm-sack-recovery- 
> entry" is adopted as WG item now and intended to be a "standards  
> track" document, I would like to start a poll/discussion whether the  
> draft should update RFC 3517 or not? Moreover, should we produce a  
> separate document or an update of RFC 3517?
>
> a) separate document, do not update RFC 3517
> b) separate document, update RFC 3517
> c) RFC3517bis, obsolete RFC 3517
>
> //
> // Dipl.-Inform. Alexander Zimmermann
> // Department of Computer Science, Informatik 4
> // RWTH Aachen University
> // Ahornstr. 55, 52056 Aachen, Germany
> // phone: (49-241) 80-21422, fax: (49-241) 80-22220
> // email: zimmermann at cs.rwth-aachen.de
> // web: http://www.umic-mesh.net
> //
>
>
> _______________________________________________
> tcpm mailing list
> tcpm@ietf.org
> https://www.ietf.org/mailman/listinfo/tcpm

//
// Dipl.-Inform. Alexander Zimmermann
// Department of Computer Science, Informatik 4
// RWTH Aachen University
// Ahornstr. 55, 52056 Aachen, Germany
// phone: (49-241) 80-21422, fax: (49-241) 80-22220
// email: zimmermann@cs.rwth-aachen.de
// web: http://www.umic-mesh.net
//