Re: [tcpm] TCP Error recovery and efficiency

Matt -

First off, I would like to thank you for your excellent work on RFC2018. The purpose of this update is not to diminish your work but rather to extend and expand on it.  Without your lead I would never have had the impetus to extend my work from HDLC into the TCP domain.  Please understand that the first draft was a trial draft to toss around and gather essential comments.  Also please understand that it was originally written as part of a thesis presentation on adapting token-based, state-driven recovery to TCP, not as an RFC per se.

1) A problem with RFC2018 is that it provides no real information about the receiver's reassembly queue at the time the transmitter receives the acknowledgement.  The protocol presupposes conditions often not present in the real world: that the information transfer is near instantaneous and that only a rather low number of messages can be in flight, i.e. 7 SACK blocks total combined in both directions (i.e. messages for which SACK blocks have not yet been received).  RFC2018 DOES NOT send information about messages that were lost (nor can it, given the implied requirement to tolerate packet fragmentation or out-of-order delivery); it can only say what it has seen.

2) On an active link SACK goes into the weeds whenever noise bursts affecting receipt of SACK blocks exceed approximately 100,000 bit times, that is, a 10 ms noise burst at 10 Mbps or 1 ms at 100 Mbps, which causes a rollover of the active three-SACK-block window.  At that point SACK goes into timer-based recovery.  The problem with timer-based recovery is that timers must always be set to the maximum possible delay, destroying efficiency.  The purpose of this change is to take recovery from timer based to state based so that it is maximally responsive.

3) SACK presupposes a relatively stable link with a relatively low random packet drop rate.  Neither is true anymore, and you need only look at RFC2883 to realize that SACK sends a lot of duplicate segments, destroying efficiency.  I am simply trying to reduce that number.

4) SACK has an implicit "last character problem": normally you would get multiple confirming SACK blocks as additional messages arrive, equal to the number of SACK blocks.  This is not true if the link goes idle, where the next-to-last segment only has two acknowledgements and the last only has one.  If the last transmitted segment is dropped and the link goes idle, or the last ACK packet from the receiver is lost, you are now in a condition where it takes a major link timeout to recover.  The changes in the revised draft reduce the possibility of that occurrence.

5) I admit the current draft does not properly express my views.  In the current RFC2018 we normally have three SACK blocks, in this version normally four.  What I was basically saying is that UNUSED SACK blocks should be filled with older, still not completed SACK blocks rather than transmit the same set of SACK blocks five, six or more times.  This will be clarified in the next revision.

6) Note that the argument has moved to whether we need to allow more than 16,000 messages to be in process during any particular second in the face of 10 Gbps links; a proposal is on the table to expand that to 4,000,000.  There is also a proposal on SACK block compression in process (RFC1072 was not wrong in concept, only in implementation) so that we can include more segments.
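
As a quick check of the figures in point 2, the following sketch converts a burst length in bit times into a duration at a given link rate. The function name and constant are illustrative only and do not come from the draft.

```python
def burst_duration_ms(bit_times: int, link_rate_bps: float) -> float:
    """Return how long a burst of `bit_times` bits lasts, in milliseconds."""
    return bit_times * 1000.0 / link_rate_bps


# Approximate threshold quoted in point 2 above.
BURST_BIT_TIMES = 100_000

if __name__ == "__main__":
    for rate_bps in (10e6, 100e6):
        duration = burst_duration_ms(BURST_BIT_TIMES, rate_bps)
        print(f"{rate_bps / 1e6:.0f} Mbps: {duration:.1f} ms")
```

The same burst length in bits therefore shrinks in wall-clock time as link speed rises, which is why the threshold is stated in bit times rather than milliseconds.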

I do appreciate your comments and feedback.

Tony 

Anthony Sabatini
200 West 20th Street
Apt. 1216
New York, NY 10011

Phone: (212) 867-7179
Mobile: (917) 224-8388

Date: Thu, 19 Jul 2012 13:02:43 -0700
Subject: Re: [tcpm] TCP Error recovery and efficiency
From: mattmathis@google.com
To: tsabatini@hotmail.com
CC: tcpm@ietf.org

I don't understand what problem you are trying to solve: As written, 2018 provides the sender with perfect information about the state of the receiver's reassembly queue (e.g. which data is still missing) as long as most ACKs are delivered, mostly in order.   The minimal cases that cause the sender to send duplicate data are timeouts (but see the errata against 2018), or fairly complicated patterns of sustained losses on both sides of the link.

In fact, if there is no loss or reordering on the return path, a single SACK block is sufficient to give the sender perfect information.

With classic TCP timestamps (not for PAWS) it is even possible to disambiguate between holes that were filled by ooo data or retransmissions.  Due to the TS echo rules for PAWS, it actually does less well.  I am not the expert here, but Richard Scheffenegger has explored this problem extensively.   But the ambiguity is whether congestion control undo is permitted, not which data is still needed at the receiver.

If you look at 2018, (and 2883) the first SACK block is described as MUST be the chronologically newest.  The rest of the option is described with SHOULDs, because we knew that SACK could theoretically be improved by dithering the rest of the SACK option across all past SACK blocks.   Such a scheme would make SACK work well with unreasonable levels of loss on the ACK path.  (e.g. the 2nd SACK block could alternate between the two most recent prior 1st blocks, the 3rd across all earlier blocks).
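
One possible reading of the dithering idea above, as a rough sketch: the first block is always the newest, the second alternates between the two most recent prior first blocks, and the third cycles across everything older. The function name, the `history` list, and the use of an ACK counter to drive the rotation are assumptions made for illustration; neither 2018 nor 2883 specifies this.

```python
from typing import List, Tuple

Block = Tuple[int, int]  # (left edge, right edge) in sequence space


def dithered_sack_option(history: List[Block], ack_count: int) -> List[Block]:
    """Pick up to three SACK blocks for the next ACK.

    `history` holds past first blocks, newest first (hypothetical structure).
    `ack_count` is the number of ACKs sent so far and drives the rotation.
    """
    option: List[Block] = []
    if history:
        option.append(history[0])  # newest block always goes first
    recent = history[1:3]  # the two most recent prior first blocks
    if recent:
        option.append(recent[ack_count % len(recent)])
    older = history[3:]  # everything earlier than that
    if older:
        option.append(older[ack_count % len(older)])
    return option


if __name__ == "__main__":
    past = [(90, 100), (70, 80), (50, 60), (30, 40), (10, 20)]
    for n in range(2):
        print(n, dithered_sack_option(past, n))
```

With five remembered blocks, two consecutive ACKs between them carry every block at least once, so the loss of either ACK alone still leaves the sender with the newest block plus part of the older state.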

But now, would that really be a feature for TCP to work well on the forward path, while the (low bandwidth) return path is being mauled by SACK?

Thanks,

--MM--
The best way to predict the future is to create it.  - Alan Kay

On Sun, Jul 15, 2012 at 12:59 PM, Anthony Sabatini <tsabatini@hotmail.com> wrote:

All -

First off, a thank you to Richard Scheffenegger for his comments, most importantly his question of whether the number of ACKs being sent was excessive (since we are trying for efficiency, unnecessary ACKs diminish that efficiency).  When I first wrote this RFC six years ago for my thesis, my concern was with my extensions, not with reanalyzing the base protocol.  This turned out to be a mistake because, by moving the retransmission process farther away from using timers and changing it to a purely state basis, a number of statements/features of the base protocol no longer work the same.

As a statement of policy SACK should send no more ACKs than a current modern version of TCP does.  For the most part SACK only adds from 2 to 34 bytes in the options area of existing messages.  Even with this change we are only talking 6 to 38 bytes.  It is only in extreme situations where there are more than four gaps in the acknowledged data that SACK requires additional ACKs.  As to when these ACKs are sent, again normal TCP practice applies for the most part (as long as all SACK blocks occupy a single ACK).
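
The byte counts above follow from the option layout in RFC 2018: a 2-byte header (kind and length) plus 8 bytes per SACK block. The sketch below reproduces them. The 4-byte E-SACK overhead is an assumption inferred from the difference between "2 to 34" and "6 to 38"; the draft's actual encoding is not shown here.

```python
SACK_HEADER_BYTES = 2  # kind + length, per RFC 2018
SACK_BLOCK_BYTES = 8   # two 32-bit sequence numbers per block
ESACK_EXTRA_BYTES = 4  # assumption: inferred from the 6-to-38 figure above
TCP_OPTION_SPACE = 40  # maximum option bytes in a TCP header


def sack_option_bytes(blocks: int, extra: int = 0) -> int:
    """Size in bytes of a SACK option carrying `blocks` blocks."""
    return SACK_HEADER_BYTES + SACK_BLOCK_BYTES * blocks + extra


if __name__ == "__main__":
    for n in range(5):
        plain = sack_option_bytes(n)
        extended = sack_option_bytes(n, ESACK_EXTRA_BYTES)
        print(f"{n} blocks: {plain} bytes, with extension: {extended} bytes")
```

Four blocks with the assumed extension come to 38 bytes, which still fits in the 40 bytes of TCP option space, but leaves no room for other options such as timestamps.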

The recovery basis in Enhanced SACK (E-SACK), except for the "Link Broken" disaster timer, is entirely state driven on the transmit side.  Since timers, by their very nature, must be set to the worst-case value, this automated retransmission is much quicker and more accurate.  The combination of TCP ACKs, SACK block changes and Returned Transmit Token changes forms the triggers that guide the recovery and transmission processes.  In order to ensure the proper flow of those signals the receiver runs three classes of timer: the Returned Transmit Token delay timer, the Second ACK timer and the "I did not receive what I needed" keep-alive timer.  More about those in the commentary.

The principle from TCP that all state-changing information should be sent twice (the second ACK) is a good one and should be preserved in this protocol, since the loss of an ACK is as disastrous as the loss of a message.  We just need to be cognizant that there are now three state change elements, any of which triggers the ACK pair: the TCP ACK (acknowledging a change in the TCP transmit floor), a change to a SACK block, and a change to one of the tokens (note I include both Transmit and Returning here).  The question then becomes how we implement this.  The underlying reality is that on a link with even moderate activity a second intentional ACK is not needed, since the ACK triggered by the next arriving message would replace it.  The problem is that ACK messages are significantly shorter than full data messages, so if we sent the second ACK immediately after the first we would always send it, because nothing would have changed in between.  The thought is then to delay the second ACK until the link goes idle (1.3 times the length of time to receive a full packet on the fast side to 2.6 times the length of time to receive a full packet on the slow side).  Note that this delay has a direct effect on the Returned Transmit Token delay timer: that timer cannot be less than this delay plus the amount of time needed to transmit all of the ACKs that hold the complete list of SACK blocks, in order to ensure that state information is up to date before the token changes.  This may degrade link efficiency.
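
To give a feel for the size of the second-ACK delay, here is a minimal sketch that turns the 1.3x and 2.6x multipliers into times. The 1500-byte packet size, the link rates, and the function name are assumptions for illustration; the text above fixes only the multipliers.

```python
from typing import Tuple

FAST_FACTOR = 1.3  # lower multiplier from the text above
SLOW_FACTOR = 2.6  # upper multiplier from the text above


def second_ack_delay_bounds(packet_bytes: int, link_rate_bps: float) -> Tuple[float, float]:
    """Return (shortest, longest) second-ACK delay in seconds.

    The delay is expressed as a multiple of the time needed to receive
    one full-size packet at the given link rate.
    """
    packet_time = packet_bytes * 8 / link_rate_bps
    return FAST_FACTOR * packet_time, SLOW_FACTOR * packet_time


if __name__ == "__main__":
    # Assumed example: 1500-byte packets at 10 Mbps and 100 Mbps.
    for rate_bps in (10e6, 100e6):
        low, high = second_ack_delay_bounds(1500, rate_bps)
        print(f"{rate_bps / 1e6:.0f} Mbps: {low * 1e3:.3f} ms to {high * 1e3:.3f} ms")
```

Under these assumptions the delay is on the order of a few milliseconds at 10 Mbps, and the Returned Transmit Token delay timer would have to be at least that long plus the time to send the full ACK group.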

One note of clarification, if a state change triggers another pair of ACKs, this takes precedence over completing the remainder of the ACK group if multiple ACKs are required to transmit the SACK blocks.

The reason we trigger an ACK on either token changing is to get around the dreaded "last character hang" of very early TCP implementations.  If the transmitter sends an ACK carrying the updated Transmit Token from the last outgoing segment when the link goes idle, it ensures the receiver has the updated value of that Transmit Token, which will lead to the retransmission of that last missing segment.

We delay the return of the received Transmit Token for two reasons, first it allows packets that got out of order to be received and put back in order before being declared missing and secondly, because a group of SACK blocks might span multiple ACKs, to make sure that state information is up to date before we declare a segment as missing.

This leads to a discussion of very bad burst mode errors.  If the receiver has not received the segments it requires and has not sent out an ACK for RTT*1.3, then the receiver needs to send another ACK pair.  Note this applies to the transmitter also: if its Returned Transmit Token does not equal its Last Transmit Token after the same RTT*1.3, it is obligated to send an ACK pair to the receiver to prompt it to restart.  Obviously RTT*1.3 is a long time on transcontinental or satellite links, which is why 1/4 RTT for an ACK pair from the receiver as a keep-alive is under consideration, that being the maximum length of time to cover the burst error without degrading efficiency too badly.
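
The two keep-alive conditions above can be written as simple predicates. This is a sketch only: the function and parameter names are hypothetical, and tokens are modelled as plain integers.

```python
KEEPALIVE_FACTOR = 1.3  # multiplier on RTT from the text above


def receiver_needs_ack_pair(missing_segments: bool,
                            seconds_since_last_ack: float,
                            rtt: float) -> bool:
    """Receiver side: still missing data and silent for RTT * 1.3."""
    return missing_segments and seconds_since_last_ack >= KEEPALIVE_FACTOR * rtt


def transmitter_needs_ack_pair(returned_token: int,
                               last_token: int,
                               seconds_waiting: float,
                               rtt: float) -> bool:
    """Transmitter side: token not yet returned after RTT * 1.3."""
    return returned_token != last_token and seconds_waiting >= KEEPALIVE_FACTOR * rtt


if __name__ == "__main__":
    rtt = 0.100  # assumed 100 ms round trip
    print(receiver_needs_ack_pair(True, 0.20, rtt))
    print(transmitter_needs_ack_pair(41, 42, 0.05, rtt))
```

Switching the receiver keep-alive to the 1/4 RTT value under consideration would only change the constant used on the receiver side.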

In filling the outgoing ACKs with SACK blocks at the receiver, the list is maintained in the order oldest (smallest segment start) to the youngest.  The list is queried three times when building the ACK group, first for all blocks that just changed and thus have the most critical information (ACK NEEDED = 2), then a scan for those blocks which already had their first ACK (ACK NEEDED = 1) and then for the blocks which have already been transmitted twice (ACK NEEDED = 0).  The ACK NEEDED count is then decremented. 
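
The three-pass selection above might look roughly like this. It is a sketch under stated assumptions: the `SackBlock` class, the `max_blocks` cap, and the choice to decrement only the blocks actually placed in the group (never below zero) are illustrative and not taken from the draft.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SackBlock:
    start: int           # first sequence number covered
    end: int             # sequence number just past the block
    ack_needed: int = 2  # 2 = just changed, 1 = sent once, 0 = sent twice


def build_ack_group(blocks: List[SackBlock], max_blocks: int) -> List[SackBlock]:
    """Select blocks for the next ACK group, most urgent first."""
    ordered = sorted(blocks, key=lambda b: b.start)  # oldest to youngest
    selected: List[SackBlock] = []
    for priority in (2, 1, 0):  # the three passes over the list
        for block in ordered:
            if block.ack_needed == priority and len(selected) < max_blocks:
                selected.append(block)
    for block in selected:  # assumption: only transmitted blocks are decremented
        if block.ack_needed > 0:
            block.ack_needed -= 1
    return selected


if __name__ == "__main__":
    queue = [SackBlock(100, 200, 0), SackBlock(300, 400, 2),
             SackBlock(500, 600, 1), SackBlock(700, 800, 2)]
    print([b.start for b in build_ack_group(queue, 3)])
```

Within each pass the blocks keep their oldest-to-youngest order, so a freshly changed block always displaces one that has already been acknowledged twice.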

To make E-SACK work three lists must be maintained, one at the receiver and two at the transmitter.  The one in the receiver is identical in concept to the one in the original RFC with the addition of ACK NEEDED and the proviso that it be in oldest to youngest order.  At the transmitter an identical list is maintained built from the recovered SACK blocks.  When a changed Returned Transmit Token is received, the recovered SACK block list is used to construct a retransmit queue of all unacknowledged segments.  Then starting at the transmit queue entry plus one derived from the Returned Transmit Token, and proceeding forward until all transmitted segment extents have been processed, entries are removed or modified in the new retransmit queue to compensate for segments sent after the time of the Returned Transmit Token.  The adjusted new transmit queue replaces the existing queue and the transmission process is started again starting at the first entry on the queue, not proceeding with new transmission until the queue is completed.  Please note that incoming SACK blocks must be checked against the retransmit queue in order to remove segments that have been subsequently correctly received.
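
A simplified sketch of the retransmit queue rebuild described above. It assumes the Returned Transmit Token can be mapped to an index into a send-ordered log of segment extents, works at whole-segment granularity (so entries are removed, never partially modified), and ignores sequence number wraparound. All names are hypothetical.

```python
from typing import List, Tuple

Segment = Tuple[int, int]  # (start, end), end exclusive


def build_retransmit_queue(transmit_log: List[Segment],
                           token_index: int,
                           cumulative_ack: int,
                           sack_blocks: List[Segment]) -> List[Segment]:
    """Rebuild the retransmit queue when a changed token arrives."""

    def acknowledged(seg: Segment) -> bool:
        start, end = seg
        if end <= cumulative_ack:  # below the TCP transmit floor
            return True
        return any(left <= start and end <= right for left, right in sack_blocks)

    queue: List[Segment] = []
    # Everything sent up to the token that the receiver has not reported.
    for seg in transmit_log[: token_index + 1]:
        if not acknowledged(seg) and seg not in queue:
            queue.append(seg)
    # Compensate for segments sent after the token: the receiver had not
    # seen them yet, so do not schedule them for another retransmission.
    for seg in transmit_log[token_index + 1:]:
        if seg in queue:
            queue.remove(seg)
    return queue


if __name__ == "__main__":
    log = [(0, 100), (100, 200), (200, 300), (300, 400), (100, 200), (400, 500)]
    print(build_retransmit_queue(log, 3, 100, [(200, 300)]))
```

In the example, segment (100, 200) was already resent after the token, so only (300, 400) remains queued; later SACK blocks would still need to be checked against this queue as noted above.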

Nota Bene - This protocol is an update to RFC 2018.  It is not compatible with, nor do I believe in the philosophy of, RFC2883/RFC3708, as the whole purpose of this change is to eliminate duplicate segments and increase efficiency and reliability, whereas the changes outlined in those RFCs degrade efficiency (especially by generating extra SACK blocks) without substantially improving anything.  Also, as adopted in the RFC there is a major fallacy: it assumes that there can be only one SACK block with a status change, where in reality, due to processing considerations, there can be two or even more blocks with changes before an ACK can be sent.

Tony

