Re: [tcpm] Window update algorithm differences

Matt Mathis <mathis@psc.edu> Mon, 23 June 2008 18:56 UTC

Date: Mon, 23 Jun 2008 14:56:46 -0400
From: Matt Mathis <mathis@psc.edu>
To: Andre Oppermann <andre@freebsd.org>
In-Reply-To: <484FCA2C.2020600@freebsd.org>
Message-ID: <Pine.LNX.4.64.0806231429410.7040@tesla.psc.edu>
References: <484FCA2C.2020600@freebsd.org>
MIME-Version: 1.0
Cc: tcpm@ietf.org
Subject: Re: [tcpm] Window update algorithm differences

I don't recall all of the details (since this is a part of the spec that I
don't normally deal with), but I believe the problem has to do with deadlocks
caused by lost window updates during zero-window probing.

The problem is that zero-window probes and keepalives have to update SND.WND,
even though they are technically out of window (but by just one byte).

I believe the failure case is if both directions go into zero window probing, 
and then both ends read queued data (freeing window space) but neither end can 
send data in window to cause the other end to see its new window announcement.

I think the fix is to make the first window check a smidgen wider, do the
window update processing, and then exclude the edge case further down, such
that the net window is the same as in RFC 793, as updated by RFC 1323.  I
recall a comment someplace about "accepting window updates from out of order
segments", which is the key.
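
A minimal sketch of that ordering, in Python for brevity.  The details are
assumptions for illustration, not taken from any stack: the one-byte slack on
each window edge, the function names, and the dict-based state are all
hypothetical, and the freshness test on the window update is left out.

```python
MOD = 1 << 32
HALF = 1 << 31

def seq_diff(a, b):
    """Signed distance a - b in 32-bit sequence space."""
    return ((a - b + HALF) % MOD) - HALF

def in_widened_window(seg_seq, seg_len, rcv_nxt, rcv_wnd, slack=1):
    """First check: tolerate segments that miss the window by <= slack bytes."""
    start = seq_diff(seg_seq, rcv_nxt)
    end = start + seg_len
    return start >= -slack and end <= rcv_wnd + slack

def payload_in_window(seg_seq, seg_len, rcv_nxt, rcv_wnd):
    """Later check: count only the bytes inside the real receive window."""
    start = seq_diff(seg_seq, rcv_nxt)
    lo = max(start, 0)
    hi = min(start + seg_len, rcv_wnd)
    return max(hi - lo, 0)

def process_segment(state, seg):
    """Return the number of payload bytes accepted from this segment."""
    if not in_widened_window(seg["seq"], seg["len"],
                             state["rcv_nxt"], state["rcv_wnd"]):
        return 0
    # Window update runs before the strict trim, so a probe that is one
    # byte out of window still delivers the peer's new advertisement.
    # (A real stack would apply its "is this update newer" test here.)
    state["snd_wnd"] = seg["wnd"]
    return payload_in_window(seg["seq"], seg["len"],
                             state["rcv_nxt"], state["rcv_wnd"])
```

With rcv_wnd = 0, a one-byte probe at RCV.NXT passes the first check and
updates snd_wnd, but contributes zero accepted bytes, so the data window
itself is unchanged.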

I hope this helps,
--MM--
-------------------------------------------
Matt Mathis     http://staff.psc.edu/mathis
Work:412.268.3319    Home/Cell:412.654.7529
-------------------------------------------
Evil is defined by mortals who think they know
"The Truth" and use force to apply it to others.

On Wed, 11 Jun 2008, Andre Oppermann wrote:

> There is some considerable disagreement on the correctness of the original
> window update test in various operating systems.  Here is an overview of
> the current approaches used by the popular and open source TCP implementations:
>
> RFC793: section 3.9, page 72
>  SND.UNA < SEG.ACK =< SND.NXT, update window but not SND.WU.[SEQ|ACK]
>  (SND.WU.SEQ < SEG.SEQ or (SND.WU.SEQ = SEG.SEQ and SND.WU.ACK =< SEG.ACK))
>  update everything.
>
> Stevens Vol.2: section 29.7, page 981-983
> FreeBSD: src/sys/netinet/tcp_input.c, rev. 1.376
> OpenBSD: src/sys/netinet/tcp_input.c, rev. 1.215
> NetBSD: src/sys/netinet/tcp_input.c, rev. 1.287
>  SEG.SEQ > SND.WU.SEQ or (SEG.SEQ = SND.WU.SEQ and (SEG.ACK > SND.WU.ACK or
>  (SEG.ACK = SND.WU.ACK and SEG.WND > SND.WND))) update everything.
>
> OpenSolaris: src/uts/common/inet/tcp/tcp.c, @swnd_update, rev. 6707
>  SEG.ACK > SND.WU.ACK or SEG.SEQ > SND.WU.SEQ or
>  (SEG.SEQ = SND.WU.SEQ and SEG.WND > SND.WND) update everything.
>
> Linux: net/ipv4/tcp_input.c, @tcp_ack_update_window(), rel. 2.6.25
>  SEG.ACK > SND.UNA or SEG.SEQ > SND.WU.SEQ or
>  (SEG.SEQ = SND.WU.SEQ and SEG.WND > SND.WND) update everything.
>
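
For side-by-side comparison, the four "update everything" tests quoted above
can be transcribed as predicates.  This is a sketch in Python, not code from
any of the stacks named: the State/Seg types and function names are
hypothetical, only the second RFC 793 condition is modelled, and sequence
comparisons are assumed to be modulo 2^32.

```python
from typing import NamedTuple

class State(NamedTuple):
    snd_una: int
    snd_wnd: int
    wu_seq: int   # SND.WU.SEQ (SND.WL1)
    wu_ack: int   # SND.WU.ACK (SND.WL2)

class Seg(NamedTuple):
    seq: int
    ack: int
    wnd: int

def seq_gt(a, b):
    """True if a is after b in 32-bit sequence space."""
    d = (a - b) & 0xFFFFFFFF
    return 0 < d < 0x80000000

def rfc793(seg, st):
    return seq_gt(seg.seq, st.wu_seq) or (
        seg.seq == st.wu_seq and not seq_gt(st.wu_ack, seg.ack))

def bsd(seg, st):
    return seq_gt(seg.seq, st.wu_seq) or (
        seg.seq == st.wu_seq and (
            seq_gt(seg.ack, st.wu_ack) or
            (seg.ack == st.wu_ack and seg.wnd > st.snd_wnd)))

def solaris(seg, st):
    return (seq_gt(seg.ack, st.wu_ack) or seq_gt(seg.seq, st.wu_seq) or
            (seg.seq == st.wu_seq and seg.wnd > st.snd_wnd))

def linux(seg, st):
    return (seq_gt(seg.ack, st.snd_una) or seq_gt(seg.seq, st.wu_seq) or
            (seg.seq == st.wu_seq and seg.wnd > st.snd_wnd))

RULES = {"rfc793": rfc793, "bsd": bsd, "solaris": solaris, "linux": linux}

def verdicts(seg, st):
    """Map each rule name to whether it would accept this window update."""
    return {name: rule(seg, st) for name, rule in RULES.items()}
```

Two cases where the rules part ways: a segment with unchanged SEQ and ACK
but a smaller window is accepted only by the RFC 793 test, while a segment
with an older SEQ but a newer ACK is accepted only by the OpenSolaris and
Linux tests.
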
> The OpenSolaris code contains some comments about being better in case
> of bi-directional traffic and alleged problems with the RFC793 method.
> The Linux code contains some general comments about the incorrectness
> of the BSD method without further elaboration.
>
> The obvious question is: which one is correct, or better than the others?
>
> Let's have a look at the basic requirements of the send window update.
>
>  o Only a segment newer than those already seen should update the send
>    window, to prevent old and outdated information from being used.
>
>  o It all revolves around how to reliably detect newer updates.
>
> Let's have a look at what makes a segment new:
>
>  o When using timestamps, either the reflected TS is higher than the
>    last one we got (we're sending data), or the TS from the other end
>    is newer than what we currently reflect (we're receiving data or
>    a window update).
>    Problem: what to do when the round trip time is faster than the
>    timestamp resolution?  Fall back to the SEQ and ACK checks.
>
>    SEG.TSECR > TS_RECENT_AGE or SEG.TSVAL > TS_RECENT
>
>  o Data we sent has been ack'ed.
>    Problem: None really.  Doesn't trigger on old retransmits or
>    out-of-order segments.
>
>    SEG.ACK > SND.UNA  (and implicit SEG.ACK <= SND.NXT)
>
>  o We receive new data.
>    Problem: out-of-order into reassembly queue, retransmits of missing
>    segments, reordering of segments.  Retransmit contains newer value.
>
>    SEG.SEQ > RCV.NXT
>
>  o No data sent or received but window increases.
>    Problem: old delayed segment.  Only allow if window increases.
>
>    SEG.WND > SND.WND
>
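
The four signals listed above can be checked independently.  The sketch
below (Python, hypothetical names, dict-based state) transcribes each test
literally from the list and reports which ones fire for a segment.
Treating timestamps as 32-bit modular values is an assumption.

```python
def seq_gt(a, b):
    """True if a is after b in 32-bit modular space."""
    d = (a - b) & 0xFFFFFFFF
    return 0 < d < 0x80000000

def newness_signals(seg, st):
    """Return the names of the 'segment is new' tests that fire."""
    signals = []
    # Timestamps: echoed TS or peer TS is newer than what we hold.
    if st["ts_enabled"] and (seq_gt(seg["tsecr"], st["ts_recent_age"]) or
                             seq_gt(seg["tsval"], st["ts_recent"])):
        signals.append("timestamp")
    # Data we sent has been ack'ed: SND.UNA < SEG.ACK <= SND.NXT.
    if seq_gt(seg["ack"], st["snd_una"]) and not seq_gt(seg["ack"], st["snd_nxt"]):
        signals.append("ack")
    # New data received, as written in the list: SEG.SEQ > RCV.NXT.
    if seq_gt(seg["seq"], st["rcv_nxt"]):
        signals.append("seq")
    # No data either way, but the advertised window grew.
    if seg["wnd"] > st["snd_wnd"]:
        signals.append("window")
    return signals
```

An empty result means none of the four tests considers the segment newer
than the current state.
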
>
> Hence I propose the following updated acceptable window update check:
>
>  [1] (TS and (SEG.TSECR > TS_RECENT_AGE or SEG.TSVAL > TS_RECENT)) or
>  [2] SEG.ACK > SND.UNA or
>  [3] (SEG.SEQ > SND.WU.SEQ and SEG.ACK >= SND.UNA) or
>  [4] (SEG.SEQ = SND.WU.SEQ and SEG.ACK = SND.UNA and SEG.WND > SND.WND)
>
> 	SND.WND <- SEG.WND
>  [5]	SEG.SEQ > SND.WU.SEQ
> 		SND.WU.SEQ <- SEG.SEQ
>  [6]	(SND.WU.ACK <- SEG.ACK)
>
> [1] If either timestamp is newer than what we've already seen this
>     is a new segment and the window it contains is certain to be valid
>     without any further checks.
> [2] This is a reliable indicator of a genuine window update.  With the
>     arrival of new data that is ack'ed the window also has been updated.
> [3] A higher sequence number tells us new data was received but if
>     the ACK is lower than what we've already seen it must be a retransmit.
> [4] A pure window update if the sequence number is the same, the ACK is
>     not lower than what we've already seen and the advertised window is
>     larger than the one we had.
> [5] Only change the last update that gave us a window update if it is
>     higher than what we have.  This prevents retransmitted or reordered
>     segments without a new ACK from updating our window.  With timestamps
>     we can reliably differentiate retransmits from out-of-order segments.
> [6] Tracking the last ACK that updated the window has become unnecessary.
>     SND.WU.ACK is also known as SND.WL2.
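
A sketch of the proposed check [1]-[6] as executable logic, in Python.
Assumptions, all mine: [1] is read as "TS and (either timestamp test)",
comparisons are modulo 2^32, and the class and function names are
hypothetical.  Advancing SND.UNA and TS_RECENT belongs to ACK and timestamp
processing and is deliberately not modelled here.

```python
from dataclasses import dataclass

def seq_gt(a, b):
    """True if a is after b in 32-bit modular space."""
    d = (a - b) & 0xFFFFFFFF
    return 0 < d < 0x80000000

@dataclass
class SendState:
    snd_una: int
    snd_wnd: int
    wu_seq: int              # SND.WU.SEQ
    ts_enabled: bool = False
    ts_recent: int = 0
    ts_recent_age: int = 0

@dataclass(frozen=True)
class Segment:
    seq: int
    ack: int
    wnd: int
    tsval: int = 0
    tsecr: int = 0

def should_update(seg, st):
    if st.ts_enabled and (seq_gt(seg.tsecr, st.ts_recent_age) or
                          seq_gt(seg.tsval, st.ts_recent)):
        return True                                    # [1]
    if seq_gt(seg.ack, st.snd_una):
        return True                                    # [2]
    if seq_gt(seg.seq, st.wu_seq) and not seq_gt(st.snd_una, seg.ack):
        return True                                    # [3]
    return (seg.seq == st.wu_seq and seg.ack == st.snd_una
            and seg.wnd > st.snd_wnd)                  # [4]

def apply_update(seg, st):
    """Update SND.WND (and SND.WU.SEQ per [5]); return True if updated."""
    if not should_update(seg, st):
        return False
    st.snd_wnd = seg.wnd
    if seq_gt(seg.seq, st.wu_seq):                     # [5]
        st.wu_seq = seg.seq
    # [6]: SND.WU.ACK is no longer tracked.
    return True
```

Note that under [4] a pure window update can only grow the window; a shrink
with unchanged SEQ and ACK is rejected unless [1] fires.
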
>
> Cases:
>
>  o In unidirectional send we trigger always on [2] when our data is ack'ed.
>  o In unidirectional receive we trigger on [3] for in-order segments.  Out
>    of order segments do not update the window unless they advance SND.WU.SEQ.
>    Retransmits are not detected unless timestamps are enabled.  In that case
>    [1] triggers if the RTT is larger than the resolution of the timestamp
>    clock.  Otherwise window updates will resume when all missing segments
>    are retransmitted and new segments beyond SND.WU.SEQ arrive.
>  o In bidirectional traffic we trigger on [3] and [2].  If the transfer has
>    loss or is re-ordered in either or both directions we also trigger in all
>    important cases due to [2] when new data was ack'ed, and [3] new data with
>    an up to date ACK is received.  Above all the timestamp check allows all
>    new segments no matter what order they are in.
>
>
> Feedback and pointing out of mistakes are welcome.
>
> BTW: TAC's are gone for good!
>
> -- 
> Andre
>
> andre@FreeBSD.org
> _______________________________________________
> tcpm mailing list
> tcpm@ietf.org
> https://www.ietf.org/mailman/listinfo/tcpm
>