Re: [tcpm] TCP Error recovery and efficiency

"Scheffenegger, Richard" <rs@netapp.com> Fri, 20 July 2012 21:19 UTC

From: "Scheffenegger, Richard" <rs@netapp.com>
To: Anthony Sabatini <tsabatini@hotmail.com>, "mattmathis@google.com" <mattmathis@google.com>
Date: Fri, 20 Jul 2012 21:20:28 +0000
Cc: tcpm <tcpm@ietf.org>
Subject: Re: [tcpm] TCP Error recovery and efficiency

Hi Tony,

I think you need to clarify further the issues you see in the current SACK signaling; RFC 2018 SACK gives the most complete view of the receiver state at the time the ACK carrying the SACK was sent... I don't quite follow the argument around larger-window links...


Ad 1)

Your point here seems to be that the receiver is supposed to extrapolate its own (then future) state at the time when the sender (in your notion, the transmitter) receives an ACK. But to make that extrapolation, the receiver has to run a model of the path, which is much more complex in a packet-switched network than on point-to-point links (over "simple" media).


Also, in the real world of TCP/IP, one cannot know whether a packet was lost or merely delayed; this is another major differentiating factor from point-to-point links (HDLC, SDLC etc., where you seem to be coming from). Therefore, the receiver cannot deduce that packet 1 was lost once it has received packet 2 (with these packet numbers encoded in the tokens). It can merely state that packet 1 was either delayed (and may still be received) or lost - less information here than on a point-to-point link!

In accordance with the principles of TCP, though, the sender could deduce that if the 1st retransmission of packet 1 has not been acknowledged by the time packet 3 - sent out after the retransmission of packet 1 - is confirmed by the receiver, a 2nd retransmission is in order.

And this really is undefined space so far (Linux does some tricks that are not in the IETF specs, but are in their spirit).

Your token scheme (making each packet uniquely identifiable) would also go a long way toward enabling this.
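
To illustrate (this is only a rough Python sketch of the idea, not anything from your draft or from an existing stack; all names are made up): with unique per-transmission tokens, the sender can tell a lost retransmission apart from a still-outstanding one as soon as a later transmission is confirmed.

    from dataclasses import dataclass

    @dataclass
    class Sent:
        token: int          # unique id of this particular (re)transmission
        seq: int            # first sequence number of the segment it carried
        is_retx: bool       # was this already a retransmission?
        acked: bool = False

    class Sender:
        def __init__(self):
            self.next_token = 0
            self.in_flight = {}                     # token -> Sent

        def transmit(self, seq, is_retx=False):
            t, self.next_token = self.next_token, self.next_token + 1
            self.in_flight[t] = Sent(t, seq, is_retx)
            return t                                # token carried in the segment

        def on_token_acked(self, token):
            # The receiver echoed 'token'. Any retransmission that left earlier
            # and is still unconfirmed is presumed lost -> retransmit it again,
            # without waiting for an RTO.
            if token not in self.in_flight:
                return
            self.in_flight[token].acked = True
            for t, s in list(self.in_flight.items()):
                if s.is_retx and not s.acked and t < token:
                    del self.in_flight[t]
                    self.transmit(s.seq, is_retx=True)

With plain cumulative or SACK ACKs the sender cannot distinguish "the retransmission of packet 1 was lost" from "the original packet 1 is still in flight"; the per-transmission token removes that ambiguity.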


Ad 2)

Burst loss of ACKs: true; but burst loss of ACKs of that severity has other implications in TCP, e.g. the ACK clock will fail (or SACK without PRR-RB/Laminar TCP will send out a burst of new packets, very possibly making things worse).

As Matt mentioned in his reply, a receiver is free to dither the returned SACK blocks from the available SACK blocks. This could address that contingency to some extent. But I suspect it's not trivial to find an optimal dithering strategy - e.g. blocks could also be repeated with varying probability to indicate their "importance". As indicated earlier, you place a very high value on the oldest block, to facilitate forward progress on the receiver side as soon as possible (and to reset the RTO timer?). Thus the oldest block could be included in a certain fraction of ACKs, overriding the current LIFO ordering scheme...
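
As a toy example of what such a dithering strategy might look like (purely my own sketch, not specified anywhere): keep the mandatory most-recent block in the first slot, report the rest LIFO as usual, but with some probability put the oldest still-open block into the last slot.

    import random

    def pick_sack_blocks(holes_newest_first, max_blocks=3, p_oldest=0.33):
        # holes_newest_first: list of (start, end) SACK blocks, most recently
        # changed first; slot 0 must stay the most recent block (RFC 2018).
        chosen = list(holes_newest_first[:max_blocks])
        oldest = holes_newest_first[-1] if holes_newest_first else None
        if oldest is not None and oldest not in chosen and random.random() < p_oldest:
            chosen[-1] = oldest        # override the pure LIFO ordering
        return chosen

    # e.g. pick_sack_blocks([(900, 950), (700, 800), (500, 600), (100, 200)])
    # usually returns the three newest blocks, but roughly every third ACK
    # it reports (100, 200) in place of (500, 600).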

Ad 3)

I don't buy this one; you proposed sending the same ACKs over and over - clearly much less efficient than establishing reliability by sending the same SACK block in at least 3 consecutive ACKs (thus covering at least 2 consecutive ACK losses).
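
Back-of-the-envelope (my own arithmetic, assuming independent ACK loss - which a noise burst obviously violates, and that is your point 2): repeating a block in 3 consecutive ACKs already drives the chance of losing it entirely very low.

    # probability that a SACK block carried in k consecutive ACKs never arrives,
    # assuming each ACK is lost independently with probability p
    for p in (0.01, 0.05, 0.20):
        for k in (1, 3):
            print(f"p={p:.2f}, k={k}: {p ** k:.2e}")
    # e.g. p=0.05: 5.00e-02 for a single ACK vs 1.25e-04 when repeated 3 times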

Ad 4)
See the rescue retransmission in RFC3517bis and draft-dukkipati-tcpm-tcp-loss-probe; but having a fast re-ACK timer (~ inter-packet interval) on the receiver appears to be quicker and possibly more efficient (often, tx power < rx power on mobile devices, and re-sending a small ACK uses far less bandwidth than simply resending the last data packet after an ~RTO timer).
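
Roughly what I have in mind on the receiver side (again just an illustrative sketch, not from any spec or implementation; send_current_ack and have_holes are hypothetical hooks into the receiver):

    import threading

    class FastReAck:
        def __init__(self, send_current_ack, have_holes, inter_packet_interval=0.01):
            self.send_current_ack = send_current_ack   # re-emit the latest ACK+SACK
            self.have_holes = have_holes               # out-of-order data queued?
            self.interval = inter_packet_interval      # ~ observed inter-packet gap
            self.timer = None

        def on_segment_received(self):
            # every arrival re-arms the timer; it only fires once the link goes quiet
            self._arm()

        def _arm(self):
            if self.timer:
                self.timer.cancel()
            if self.have_holes():
                self.timer = threading.Timer(self.interval, self._fire)
                self.timer.daemon = True
                self.timer.start()

        def _fire(self):
            if self.have_holes():
                self.send_current_ack()    # a small packet, cheap compared to
                self._arm()                # the sender resending data after ~RTO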



Ad 5)

> 5) I admit the current draft does not properly express
> my views.  In the current RFC2018 we normally have
> three SACK blocks, in this version normally four. 
> What I was basically saying is that UNUSED SACK blocks
> should be filled with older, still not completed SACK
> blocks rather than transmit the same set of SACK blocks
> five, six or more times.  This will be clarified in the
> next revision.

SACK usually allows 4 blocks too - it's just often deployed/enabled together with Timestamps, which take away option space. A combined SACK+TS (SACK+token) option could yield 4 SACK blocks plus the two tokens / timestamps (2 + 2x4 + nx(4+4) bytes).
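
For reference, the standard option-space arithmetic (40 bytes of TCP option space, Timestamps 10 bytes plus 2 NOPs for alignment, SACK 2 bytes kind/length plus 8 bytes per block; nothing here is specific to the proposed combined option):

    OPTION_SPACE = 40

    def sack_bytes(n_blocks):
        return 2 + 8 * n_blocks

    print(sack_bytes(4) <= OPTION_SPACE)           # True:  plain SACK fits 4 blocks (34 bytes)
    print(10 + 2 + sack_bytes(3) <= OPTION_SPACE)  # True:  next to Timestamps only 3 fit (38 bytes)
    print(10 + 2 + sack_bytes(4) <= OPTION_SPACE)  # False: 4 blocks alongside TS would need 46 bytes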

In RFC 2018, there are no unused SACK blocks - whenever the receiver holds discontiguous data, it reports it in SACK blocks until all available blocks are used up... usually from most recently received to least recently received... And as Matt mentioned, a receiver is free to cycle periodically through all the discontiguous blocks except for the first SACK block (it's just not implemented today).

The resiliency achieved by the redundant sending of SACK blocks is viewed as sufficient. If you have data to the contrary, please share.




Richard Scheffenegger

From: tcpm-bounces@ietf.org [mailto:tcpm-bounces@ietf.org] On Behalf Of Anthony Sabatini
Sent: Friday, 20 July 2012 20:16
To: mattmathis@google.com
Cc: tcpm
Subject: Re: [tcpm] TCP Error recovery and efficiency

Matt -
 
First off, I would like to thank you for your excellent work on RFC2018; the purpose of this update is not to diminish your work but rather to extend and expand on it.  Without your lead I would never have had the impetus to extend my work from HDLC into the TCP domain.  Please understand that the first draft was a trial draft to toss around and get essential comments.  Also please understand that it was originally written as part of a thesis presentation on adapting token-based, state-driven recovery to TCP, not as an RFC per se.
 
1) A problem with RFC2018 is that it provides no real information about the receiver's reassembly queue at the time the transmitter receives the acknowledgement.  The protocol presupposes conditions often not present in the real world: that information transfer is near-instantaneous and that a rather low number of messages can be in flight, i.e. 7 SACK blocks total combined in both directions (i.e. messages for which SACK blocks have not yet been received).  RFC2018 DOES NOT send information about messages that were lost (nor can it, given the possibility of packet fragmentation or out-of-order delivery); it can only say what it has seen.
 
2) On an active link, SACK goes into the weeds whenever noise bursts affecting receipt of SACK blocks exceed approximately 100,000 bit times - a 10 ms noise burst at 10 Mbps, or 1 ms at 100 Mbps - causing a rollover of the active three-SACK-block window.  At that point SACK falls back to timer-based recovery.  The problem with timer-based recovery is that timers must always be set to the maximum possible delay, destroying efficiency.  The purpose of this change is to take recovery from timer-based to state-based so that it is maximally responsive.
 
3) SACK presupposes a relatively stable link with a relatively low random packet drop rate.  Neither is true anymore, and you need only look at RFC2883 to realize that SACK sends a lot of duplicate segments, destroying efficiency.  I am simply trying to reduce that number.
 
4) SACK has an implicit "last character problem": normally you would get multiple confirming SACK blocks as additional messages arrive, equal to the number of SACK blocks.  This is not true if the link goes idle, where the next-to-last segment gets only two acknowledgements and the last only one.  If the last transmitted segment is dropped and the link goes idle, or the last ACK packet from the receiver is lost, you are now in a condition where it takes a major link timeout to recover.  The changes in the revised draft reduce the possibility of that occurrence.
 
5) I admit the current draft does not properly express my views.  In the current RFC2018 we normally have three SACK blocks, in this version normally four.  What I was basically saying is that UNUSED SACK blocks should be filled with older, still not completed SACK blocks rather than transmit the same set of SACK blocks five, six or more times.  This will be clarified in the next revision.
 
6) Note that the argument has moved to whether we need to allow more than 16,000 messages to be in process during any particular second in the face of 10 Gbps links; a proposal is on the table to expand that to 4,000,000.  Also, there is a proposal on SACK block compression in process (RFC1072 was not wrong in concept, only in implementation) so that we can include more segments.
 
I do appreciate your comments and feedback.