Re: [tcpm] draft-ietf-tcpm-1323bis

"Scheffenegger, Richard" <rs@netapp.com> Fri, 26 November 2010 01:26 UTC

From: "Scheffenegger, Richard" <rs@netapp.com>
To: David Borman <david.borman@windriver.com>, Braden@ISI.EDU, van@parc.com
Cc: Alfred Hönes <ah@TR-Sys.de>, tcpm@ietf.org, mallman@icir.org
Subject: Re: [tcpm] draft-ietf-tcpm-1323bis

 Bob, Dave, Van, Group,

Is this WG Item still alive?

I would be interested in your opinion on how to address some potential optimizations in the Timestamp algorithm and RTT Measurement when SACK interactions are taken into account.


More specifically, I finally found the original discussion that explains why RFC1323(bis) requires both Sender and Receiver to a) reflect only very specific timestamps, and b) make use of these only under very specific circumstances.

However, if the sender also takes available SACK information into account, some of these restrictions are lifted. That may in turn allow improved RTTM (especially during a window containing loss, which very likely shows higher latencies than in-sequence delivery), and also improved SACK recovery while strictly adhering to packet conservation.



Best regards,

Richard Scheffenegger

------------------------


From braden  Mon Aug 19 17:22:45 1991
Received: from braden.isi.edu by venera.isi.edu (5.61/5.61+local-3)
	id <AA18322>; Mon, 19 Aug 91 17:22:45 -0700
Date: Mon, 19 Aug 91 17:21:38 PDT
From: braden
Posted-Date: Mon, 19 Aug 91 17:21:38 PDT
Message-Id: <9108200021.AA11332@braden.isi.edu>
Received: by braden.isi.edu (4.0/4.0.3-4)
	id <AA11332>; Mon, 19 Aug 91 17:21:38 PDT
To: end2end-interest
Subject: RFC-1072

Folks,

Recently we (Van, Dave Borman, and I) have been discussing some details
of the timestamp echo mechanism of RFC-1072. 

There is considerable pressure to enter RFC-1072 and RFC-1185 into the
Internet standards track, but I recently noticed some ambiguities and
omissions.  Upon further discussion with Van, I discovered that his
implementation violates my understanding of what RFC-1072 meant.

I thought this topic might be of wider interest, so I am forwarding
the string of relevant messages.  We might talk about it in Stockholm.


Bob

____________________________________________________________________________

[[I have recently noticed that RFC-1072 did not explain how to prevent old
duplicate ACKs from falling into the current window after sequence number
wrap-around.  I discussed this with Van at the Gigabits meeting, and he
said he wants to send timestamps on ACKs as well as data segments.
Then I wrote:]]

From braden@ISI.EDU Fri Aug  9 14:48:19 1991
Date: Fri, 9 Aug 91 14:47:52 PDT
From: braden@ISI.EDU
To: dab@cray.com, van@helios.ee.lbl.gov
Subject: RFC-1072 & RFC-1185
Cc: braden@ISI.EDU, postel@ISI.EDU

Friends,

The issue of whether or not to include TCP Echo options on ACK segments
per RFC-1185 made me aware that there are other ambiguities in these
RFCs.  Since the IESG now wants to make them Internet standards, we
ought to remove ambiguities.

To this end, I have written down two descriptions of the algorithms.
The first is in pseudo-code; the second is as a delta on RFC-793
section 3.9 Event Processing.  I would appreciate your looking this
over carefully, to see whether it agrees with your understanding.

One new issue that came up: When TCP receives a data segment outside
the window, it sends an ACK for the current window in reply.  Question:
should this timestamp echo the timestamp from the (old duplicate)
data segment?  (The code as written here does send it).

Bob

[[text deleted as immaterial to this thread.  Dave Borman actually read
my pseudo-code and pointed out some bugs.]]
______________________________________________________________________________

[[Van called me and we discussed his implementation of RFC-1072 in
the DARTnet TCP, which did not seem to me to match RFC-1072.  Van
was trying to persuade me that it was logically equivalent, just
implemented differently]]
_______________________________________________________________________________
From van@ee.lbl.gov Mon Aug 12 15:57:18 1991
To: braden@ISI.EDU
Subject: Re: RFC-1072: timestamps and a bit flag 
In-Reply-To: Your message of Sun, 11 Aug 91 16:30:43 PDT.
Date: Mon, 12 Aug 91 15:57:31 PDT
From: Van Jacobson <van@ee.lbl.gov>

Bob,

I was ambiguous on the phone.  Depending on where in the path
you sit, there are different left edges of the window.  BSD
tracks both the 'receiver' left edge (relative to rcv_nxt) and
the 'left edge known to peer' (relative to the ack number in the
last packet you sent).  The latter view is used as part of the
rcvr silly-window avoidance to prevent window advertisements that
retract the window.  We have a receiver state variable, rcv_acked,
that tracks the last sequence number acked.  (E.g., when a
packet is sent rcv_acked is updated from seg_ack.  In earlier
versions of BSD an equivalent variable, rcv_adv, tracked the end
of the window.  In 4.4, I realized that the code got simpler if
you tracked the beginning.)  So, the timestamp test we do in
tcp_input is

	if (ti->ti_seq == tp->rcv_acked)
		/*
		 * this segment at left edge of window known to peer.
		 * record its timestamp for echo reply and
		 * incoming validity checking (per rfc1072 & 1185).
		 */
		tp->rcv_tstamp = tstamp;

(There are earlier validity checks, including testing the
incoming timestamp against tp->rcv_tstamp so we're sure at this
point that the incoming segment is legal & 'recent'.)  This
handles both delayed acks & retransmits.  I believe it does just
what 1072 wants.

 - Van

_____________________________________________________________________
From braden@ISI.EDU Mon Aug 12 16:45:30 1991
Date: Mon, 12 Aug 91 16:45:15 PDT
From: braden@ISI.EDU
To: van@ee.lbl.gov
Subject: Re: RFC-1072: timestamps and a bit flag
Cc: braden@ISI.EDU

I don't think your trick will work for case (B) on page 13 of RFC-1072:
"A hole in the sequence space".  It goes on about how valuable it is
to have RTT's for ACKs resulting from out-of-order segments.  If I
understand your message, none of the out-of-order segments will
update rcv_tstamp, so the resulting ACKs will all echo the same
timestamp value.  Is this really what you want??

A.1 ---> (sets tstamp); B is lost
C.3 --->
 <-- ACK A.1
D.4 --->
  <-- ACK A.1
E.5 --->
  <-- ACK A.1
B.6 --->
  <-- ACK E.6
  
Right?

   Bob

________________________________________________________________________
From braden@ISI.EDU Tue Aug 13 17:00:58 1991
Date: Tue, 13 Aug 91 17:00:23 PDT
From: braden@ISI.EDU
To: van@helios.ee.lbl.gov
Subject: For discussion...
Cc: braden@ISI.EDU, dab@cray.com

Here is what I think RFC-1072 and RFC-1185 imply should happen:

 Assume sequence of data segments A, B, C, ...
                                      
                              Receiver does RFC-1185 Timestamp
                              comparison against value:
                                         V
    <A,ECopt=1> -------------------->    0 (say)    
    <B,ECopt=2> -------------------->    1    
                 <---- <ACK(B),ECRopt=1>    
    <C,ECopt=3> ------> (lost)    
    <D,ECopt=4> -------------------->    1   
                <----- <ACK(B),ECRopt=4>
   (retransmit C)
    <C,ECopt=5> ------------------->     1   
                <------ <ACK(D),ECRopt=5>
               
Now, I believe your code would do the following with the same sequence:

    <A,ECopt=1> -------------------->    0 (say)    
    <B,ECopt=2> -------------------->    1   
                 <---- <ACK(B),ECRopt=1>     
    <C,ECopt=3> ----------> (lost)    
    <D,ECopt=4> -------------------->    1    
                <----- <ACK(B),ECRopt=1>
   (retransmit)
    <C,ECopt=5> ------------------->     1
                  <------ <ACK(D),ECRopt=5>

Dave, I would be interested to know how your implementation would
behave...

Bob
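
The difference between the two sequences above can be reproduced with a small receiver model.  The following is only a sketch under simplifying assumptions: segments are numbered 0, 1, 2, ... instead of carrying byte sequence numbers, timestamps are compared without wraparound handling, and every name other than rcv_nxt, rcv_acked and rcv_tstamp (struct rcvr, enum policy, echo_pending, bob_scenario) is made up for illustration.  The ECHO_PER_1072 branch encodes one reading of RFC-1072 that matches the first diagram; it is not taken from any implementation.

```c
#include <stdbool.h>
#include <stdint.h>

#define MAXSEG 16  /* arbitrary bound for this toy model */

enum policy { ECHO_PER_1072, ECHO_LEFT_EDGE };

struct rcvr {
    uint32_t rcv_nxt;           /* next in-sequence segment expected */
    uint32_t rcv_acked;         /* ack number carried by the last ACK sent */
    uint32_t rcv_tstamp;        /* timestamp that the next ACK will echo */
    bool     echo_pending;      /* an in-sequence timestamp awaits its ACK */
    bool     buffered[MAXSEG];  /* out-of-order segments held for reassembly */
};

/* Accept segment number seq carrying timestamp (ECopt) ts. */
static void seg_arrives(struct rcvr *r, enum policy p, uint32_t seq, uint32_t ts)
{
    bool in_seq = (seq == r->rcv_nxt);

    if (seq >= MAXSEG || seq < r->rcv_nxt)
        return;  /* outside the model, or an old duplicate */

    if (p == ECHO_LEFT_EDGE) {
        /* Record only at the left edge known to the peer. */
        if (seq == r->rcv_acked)
            r->rcv_tstamp = ts;
    } else if (!in_seq) {
        /* Hole in the sequence space: echo the newest arrival. */
        r->rcv_tstamp = ts;
    } else if (!r->echo_pending) {
        /* Delayed ACKs: keep the oldest timestamp not yet echoed. */
        r->rcv_tstamp = ts;
        r->echo_pending = true;
    }

    if (in_seq) {
        r->rcv_nxt++;
        while (r->rcv_nxt < MAXSEG && r->buffered[r->rcv_nxt])
            r->rcv_nxt++;  /* the hole is filled; skip buffered data */
    } else {
        r->buffered[seq] = true;
    }
}

/* Send an ACK; returns the ECRopt value it carries. */
static uint32_t send_ack(struct rcvr *r)
{
    r->rcv_acked = r->rcv_nxt;
    r->echo_pending = false;
    return r->rcv_tstamp;
}

/* Replay A, B, ACK, (C lost), D, ACK, retransmitted C, ACK. */
static void bob_scenario(enum policy p, uint32_t ecr[3])
{
    struct rcvr r = {0};

    seg_arrives(&r, p, 0, 1);  /* A */
    seg_arrives(&r, p, 1, 2);  /* B */
    ecr[0] = send_ack(&r);
    seg_arrives(&r, p, 3, 4);  /* D; C (segment 2, ECopt=3) was lost */
    ecr[1] = send_ack(&r);
    seg_arrives(&r, p, 2, 5);  /* C retransmitted */
    ecr[2] = send_ack(&r);
}
```

In this model the three ACKs echo 1, 4, 5 under ECHO_PER_1072 and 1, 1, 5 under ECHO_LEFT_EDGE, which are the ECRopt values shown in the two diagrams.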
                  
_________________________________________________________________________
From van@ee.lbl.gov Wed Aug 14 06:18:15 1991
To: braden@ISI.EDU
Cc: dab@cray.com
Subject: Re: For discussion... 
In-Reply-To: Your message of Tue, 13 Aug 91 17:00:23 PDT.
Date: Wed, 14 Aug 91 06:18:18 PDT
From: Van Jacobson <van@ee.lbl.gov>

Bob,

Ouch.  You're too good a protocol lawyer & you caught me.  Yes,
1072 suggests the first sequence and I would generate the
second.  Note that the results for all of 1185 and case (a) &
(c) of 1072 p.13 are the same for both schemes and what I do is
more conservative for case (b). (the sender will overestimate
the rtt by some amount that should be <= twice what the
algorithm in 1072 would estimate.)  So I'll still argue for my
scheme as an alternative, safe and slightly simpler way to
implement a 1072 receiver.

 - Van
________________________________________________________________________

From braden@ISI.EDU Wed Aug 14 09:44:21 1991
Date: Wed, 14 Aug 91 09:43:49 PDT
From: braden@ISI.EDU
To: van@ee.lbl.gov
Subject: Re: For discussion...
Cc: braden@ISI.EDU, dab@cray.com

van,

What may appear to be protocol lawyering is just simple-mindedness -- which,
like guilt, can sometimes be put to positive use.

Hmmmmm.  When we put together RFC-1072, we were trying to do the best
damn job we could in measuring the true RTT, as free from biases as
possible.... that was the whole rationale for the timestamp echo in
RFC-1072.  I was dealing with the control theorist Van Jacobson.  Now I
have a feeling I am dealing with the implementor Van Jacobson, and
hearing something different.  It seems that you are trying to reduce
the amount of state (one 32-bit timestamp instead of two, and one fewer
control bit) in the tcb.  Is the control theorist dormant?
Are you really sure this is moving in a constructive direction??

Bob

___________________________________________________________________________
From braden@ISI.EDU Thu Aug  8 12:59:17 1991
Date: Thu, 8 Aug 91 12:58:49 PDT
From: braden@ISI.EDU
To: van@helios.ee.lbl.gov
Subject: more Braden bother...
Cc: braden@ISI.EDU

van,

Well, I didn't get a LOT of joy out of the Dartnet netinet/ modules.
It seems puzzlingly partial... eg it does not seem to USE the returned Echo
reply values to compute the RTT (is there something subtle going on here
I don't understand?); and, it does not seem to implement the "newest segment
from the oldest sequence number" algorithm of RFC-1072. 

Hmmm.  My [delayed] bogosity meter is going off, concerning your idea
of the symmetry between data and ACKs.  OK, so we can send ECopts in
both data and ACK segments.  But echoing ACK segment timestamps in data
segments seems broken...   the timing depends upon how soon the
application decides to send a response, which would artificially
inflate the RTT measurement.  Perhaps I misunderstood your reasoning;
If your argument is just that it is simpler to code and/or faster to
execute to always stuff some ECRopt value into every packet, but ignore
ECRopt values received on data segments, then I at least understand...

Bob

________________________________________________________________________

From dab@berserkly.cray.com Thu Aug 15 08:32:46 1991
Date: Thu, 15 Aug 91 10:31:07 -0500
From: dab@berserkly.cray.com (David Borman)
To: braden@ISI.EDU, van@helios.ee.lbl.gov
Subject: Re: For discussion...


Well, using Bob's example, my code would do the following:

                              Receiver does RFC-1185 Timestamp
                              comparison against value:
                                         V
    <A,ECopt=1> -------------------->    0 (say)
    <B,ECopt=2> -------------------->    1
                 <---- <ACK(B),ECRopt=2>
    <C,ECopt=3> ------> (lost)
    <D,ECopt=4> -------------------->    2	(compare against 2, not 1!)
                <----- <ACK(B),ECRopt=4>	(*)
   (retransmit C)
    <C,ECopt=5> ------------------->     2
                <------ <ACK(D),ECRopt=5>
    <E,ECopt=6> ------------------->     5
                <------ <ACK(D),ECRopt=6>

Uff da.  In my code, the indicated ack (*) will have an ECRopt of 4,
not 1, but not because of what's written in 1072 or 1185...  As written,
it will respond with the ECRopt value of whatever was most recently
received in an ECopt.  My code more or less looks like:

	tp->echo_value		Most recently received ECopt
	tp->echo_timestamp	ECopt value from last packet received
				at the left edge of the window.
	ECHO_NEEDED		Flag to indicate that tp->echo_value
				needs to be sent back in ECRopt.
	ECHO_RCVD		Flag to indicate that ECopt was found
				while processing the options.

    On receive:

	Process TCP options:
		If echo option is received, its value is saved in
		tp->echo_value, and ECHO_NEEDED and ECHO_RCVD flags are set.

	If the packet is then determined to be at the left edge of the
		window, and the ECHO_RCVD bit is set, tp->echo_value is
		copied into tp->echo_timestamp.
	Otherwise, if the tp->echo_value is less than tp->echo_timestamp,
		the packet is tossed.

    On transmit:
	If ECHO_NEEDED is set, copy tp->echo_value into ECRopt, and
	clear ECHO_NEEDED.
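
As a rough illustration, the receive and transmit steps described above might look like the following in C.  This is a sketch of the behaviour as described in this message, flaw included, not the actual code: struct echo_state and the two function names are invented here, ECHO_RCVD becomes a local variable, and the plain '<' comparison ignores timestamp wraparound.

```c
#include <stdbool.h>
#include <stdint.h>

struct echo_state {
    uint32_t echo_value;      /* most recently received ECopt */
    uint32_t echo_timestamp;  /* ECopt from the last packet received
                                 at the left edge of the window */
    bool     echo_needed;     /* ECHO_NEEDED: echo_value must go out
                                 in an ECRopt */
};

/* Receive path.  Returns false when the packet is tossed. */
static bool echo_on_receive(struct echo_state *tp, bool has_ecopt,
                            uint32_t ecopt, bool at_left_edge)
{
    bool echo_rcvd = false;  /* ECHO_RCVD, valid for this packet only */

    if (has_ecopt) {
        tp->echo_value = ecopt;
        tp->echo_needed = true;
        echo_rcvd = true;
    }

    if (at_left_edge && echo_rcvd)
        tp->echo_timestamp = tp->echo_value;
    else if (tp->echo_value < tp->echo_timestamp)
        return false;  /* tossed, but echo_value is already overwritten */

    return true;
}

/* Transmit path.  Returns true and fills *ecropt when an echo is due. */
static bool echo_on_transmit(struct echo_state *tp, uint32_t *ecropt)
{
    if (!tp->echo_needed)
        return false;
    *ecropt = tp->echo_value;
    tp->echo_needed = false;
    return true;
}
```

Note that a stale packet is rejected only after its ECopt has replaced echo_value and set the flag, so the next segment sent echoes the stale value.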

My code is obviously wrong.  An old packet wandering in could muck
things up, and I clearly violate case A on page 13 of 1072.

So, let me toss out some cases for discussion:

1) Packets arrive in sequence, and every packet is acked.  This
   is not an issue, we all understand what should happen here.
	<A, ECopt=1> ------------------->   0
		<---- <ACK(A), ECRopt=1>
	<B, ECopt=2> ------------------->   1
		<---- <ACK(B), ECRopt=2>
	<C, ECopt=3> ------------------->   2
		<---- <ACK(C), ECRopt=3>
	<D, ECopt=4> ------------------->   3
		<---- <ACK(D), ECRopt=4>

2) Packets arrive in sequence, and some of the acks are delayed.
   What echo value do you use?  RFC 1072, pg. 13, Case (A) says
   you use the oldest echo value received.  I don't think that
   there is an issue here either.
	<A, ECopt=1> ------------------->   0
	<B, ECopt=2> ------------------->   1
	<C, ECopt=3> ------------------->   2
		<---- <ACK(C), ECRopt=1>
	<D, ECopt=4> ------------------->   3
	<E, ECopt=5> ------------------->   4
	<F, ECopt=6> ------------------->   5
		<---- <ACK(F), ECRopt=6>

3) Packets arrive out of order, and we are acking every packet.
	<A, ECopt=1> ------------------->   0
		<---- <ACK(A), ECRopt=1>
	<C, ECopt=3> ------------------->   1
		<---- <ACK(A), ECRopt=3>	3, 1, or no ECR at all?
	<B, ECopt=2> ------------------->   1
		<---- <ACK(C), ECRopt=2>
	<E, ECopt=5> ------------------->   2
		<---- <ACK(C), ECRopt=5>	5, 2, or no ECR at all?
	<D, ECopt=4> ------------------->   2
		<---- <ACK(D), ECRopt=4>
	<F, ECopt=6> ------------------->   4
		<---- <ACK(F), ECRopt=6>

4) Packets arrive out of order, and we are NOT acking every packet.
	<A, ECopt=1> ------------------->   0
	<C, ECopt=3> ------------------->   1
		<---- <ACK(A), ECRopt=1>
	<D, ECopt=4> ------------------->   1
		<---- <ACK(A), ECRopt=4>	4, 1, or no ECR at all?
	<B, ECopt=2> ------------------->   1
		<---- <ACK(C), ECRopt=2>
	<E, ECopt=5> ------------------->   2
		<---- <ACK(E), ECRopt=4>	5, 2, or no ECR at all?
	<F, ECopt=6> ------------------->   5
		<---- <ACK(F), ECRopt=6>

I would argue that when sending an ACK due to out-of-order (lost) packets,
you either 1) send the ECR with the pre-existing value that hasn't been sent
yet (delayed acks), 2) if no ECR was already queued up, send the ECR with
the value in the packet, or 3) don't send an ECR in the ack.  Sending an
ECR in the ack with the last left-window-edge value instead of the EC value
just received could really screw things up.

Imagine:
        <A, ECopt=1>   ------------------->   0
                <---- <ACK(A), ECRopt=1>
	    (connection is idle for a while...)
        <C, ECopt=103> ------------------->   1
                <---- <ACK(A), ECRopt=103>        103, 1, or no ECR at all?
        <B, ECopt=102> ------------------->   1
                <---- <ACK(C), ECRopt=102>

If the middle ack was 1, not 103, then when the ACK is received, the RTT
value that is computed is going to be drastically out of whack.  Either use
the value that was received in the out-of-sequence packet, or don't send
an ECR in the ack.
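
To put numbers on this concern: assume the timestamp clock counts ticks and that the ACK triggered by C reaches the sender at tick 104 (an assumed value; the example gives only the timestamps 1 and 103).  A sender that computed an RTT from that ACK would subtract the echoed value from its current clock, roughly as below.  The name rtt_sample is hypothetical.

```c
#include <stdint.h>

/* RTT sample in clock ticks.  Unsigned subtraction gives the right
 * answer even if the 32-bit clock wrapped between send and receive. */
static uint32_t rtt_sample(uint32_t now, uint32_t ecr)
{
    return now - ecr;
}
```

With ECRopt=103 the sample is rtt_sample(104, 103), i.e. 1 tick; with ECRopt=1 it is rtt_sample(104, 1), i.e. 103 ticks, most of which is idle time rather than network delay.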

Comments?  I'll be fixing my code, but I'll wait until we all agree on
what should really be happening.

			-Dave Borman, dab@cray.com

___________________________________________________________________________

From braden@ISI.EDU Fri Aug 16 14:42:15 1991
Date: Fri, 16 Aug 91 14:41:44 PDT
From: braden@ISI.EDU
To: dab@cray.com, van@helios.ee.lbl.gov
Subject: RFC-1072/1185 issues
Cc: braden@ISI.EDU

Dave,

Yes, your corrections to my pseudo-code are certainly right.  I will
incorporate them...

Van,

Are there any circumstances in which a TCP receiving data containing
ECopts would send back an ACK that DID NOT contain an ECRopt?  Dave
argued in his message yesterday that the answer is "yes", when the only
timestamp available to be echoed is known to be seriously out of date.
Here is the same argument:

   When a host receives an ECRopt, it has to believe it and update
   its RTT estimate (I don't see a simple alternative).  There are
   times when it SHOULD NOT believe it (an ECRopt in a data segment
   coming after some long idle period but carrying a timestamp echoed from
   an ACK.)    
                   ------> <ACK,ECopt=17> ------>
                                  (idle)
      (=>bad RTT!)  <---- <data,ECRopt=17> <-----
   
   The only way to heal this is for the ECopt received in
   an ACK segment to NOT be echoed in the next data segment that is
   sent.
                 ------> <ACK,ECopt=17> ------> (ignore ECopt in ACK)
                                  (idle)
                  <------ <data> <----------
   
Here is another argument:

   Consider a simplex data transfer.  Assuming the ACKs carry ECopts,
   if we echo the ACK timestamps in the data segments, then the data
   receiver must calculate an RTT that it will never use, every time it
   receives a data segment.  That wasted processing is much greater
   than the processing to include or exclude an ECRopt option.

Am I convincing you?

Bob

_________________________________________________________________________ 

From van@ee.lbl.gov Mon Aug 19 06:51:22 1991
To: braden@ISI.EDU
Cc: dab@cray.com
Subject: Re: RFC-1072/1185 issues 
In-Reply-To: Your message of Fri, 16 Aug 91 14:41:44 PDT.
Date: Mon, 19 Aug 91 06:51:36 PDT
From: Van Jacobson <van@ee.lbl.gov>

> When we put together RFC-1072, we were trying to do the best
> damn job we could in measuring the true RTT, as free from biases
> as possible.... that was the whole rationale for the timestamp
> echo in RFC-1072.

Wrong.  You've made the classic mistake of confusing a means
with an end.  A TCP sender is trying to compute a function L()
best described as "the earliest time at which I will know some
packet has (almost certainly) been lost in transit".  The
value of this function is crucial to the stability of the
network (if L() is optimistic, spurious retransmits will fill up
the network and it will congestion collapse).  From a mix of
theory and experience, we've developed a useful approximation
for L() based on the current time, an estimate of the first &
second order statistics of the rtt and some local state (e.g.,
Karn's algorithm on the recent retransmission history).
Note that the rtt estimate is valuable only in that we use it
to compute L().  We don't need a "true RTT", we need a
conservative L().
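
One common form of such an approximation keeps a smoothed rtt and a smoothed mean deviation and sets the timeout to the mean plus a multiple of the deviation.  The sketch below is a minimal version of that idea; the gains 1/8 and 1/4 and the multiplier 4 are the values later standardized in RFC 6298 and are assumptions here, since this message does not state them.  The names rtt_est, rtt_update and loss_timeout are hypothetical, and Karn's algorithm and clock granularity are left out.

```c
struct rtt_est {
    double srtt;    /* smoothed rtt (first-order statistic) */
    double rttvar;  /* smoothed mean deviation (second-order statistic) */
    int    seeded;  /* nonzero once the first sample has been taken */
};

/* Fold one rtt sample into the estimator. */
static void rtt_update(struct rtt_est *e, double sample)
{
    double err;

    if (!e->seeded) {
        e->srtt = sample;
        e->rttvar = sample / 2.0;
        e->seeded = 1;
        return;
    }
    err = sample - e->srtt;
    /* Deviation is updated first, using the old srtt. */
    e->rttvar += 0.25 * ((err < 0 ? -err : err) - e->rttvar);
    e->srtt += 0.125 * err;
}

/* Approximation of L(): a larger variance pushes the timeout out. */
static double loss_timeout(const struct rtt_est *e)
{
    return e->srtt + 4.0 * e->rttvar;
}
```

The point of the sketch is that the rtt samples matter only through loss_timeout(): a sample that raises the deviation makes the sender wait longer before declaring a loss.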

The rationale for the 1072 timestamp was that

 a) every TCP implementation I know of (at least 5 independent
    implementations) measures the rtt on *at most* one packet
    per window of data sent (usually the packets iss+1, iss+W+1,
    iss+2W+1, ...).  [This feature of the implementations is not
    intentional and not obvious -- one individual really believed
    his implementation timed almost every packet until I forced
    him to put in a trace buffer to look at which packets updated
    the rtt estimate.]

 b) Anyone who looks at how rtt varies over the life of a
    conversation notices that there is a great deal of structure.
    In particular, there is almost always a periodic variation
    with minima at integer multiples of the window size: (the
    reason for this is obvious if you consider how a window
    protocol loads the pipe.)

       |                                                          
       |        *********      *********      *****               
    R  |       *              *              *                    
    T  |      *              *              *                    
    T  |     *              *              *                      
       |    *              *              *                       
       |   *              *              *                        
       |  *              *              *                         
       |
       |
       ---+--------------+--------------+--------------+--------------
        iss+1         W+iss+1        2W+iss+1       3W+iss+1         
			    Sequence Number


The consequence of (a) combined with (b) is that sampling artifacts
(technically known as Nyquist aliasing) will result in a possibly
substantial underestimate of the rtt.  A similar effect also appears
in the 2nd order statistics (the variance is much lower for the
first packet of a window).  The magnitude of the total effect is
obviously a function of window size:  for small W it's often
negligible but for large W it can cause very poor estimates of L()
(i.e., spurious retransmits).  1072 was intended to correct this
problem by offering a simple way for implementations to time every
packet and, thus, use the variance in the rtt over a window to
help inflate L() to appropriately conservative values.
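
The size of the sampling artifact is easy to see with a toy model.  The sketch below assumes an rtt curve of roughly the shape plotted above (lowest at the first packet of each window, rising, then flat); BASE_RTT, STEP, rtt_model and mean_rtt are all invented for illustration and do not come from any measurement.

```c
/* Toy rtt model (assumed shape): lowest at the start of each window,
 * rising by STEP per packet, then flat for the second half. */
#define BASE_RTT 100u
#define STEP     10u

static unsigned rtt_model(unsigned seq, unsigned w)
{
    unsigned pos = seq % w;  /* position inside the current window */
    unsigned cap = w / 2;

    return BASE_RTT + STEP * (pos < cap ? pos : cap);
}

/* Mean rtt over n_windows windows of w packets, sampling every
 * stride-th packet.  stride == w times one packet per window,
 * stride == 1 times every packet.  w and stride must be nonzero. */
static unsigned mean_rtt(unsigned w, unsigned n_windows, unsigned stride)
{
    unsigned long sum = 0;
    unsigned count = 0;
    unsigned seq;

    for (seq = 0; seq < w * n_windows; seq += stride) {
        sum += rtt_model(seq, w);
        count++;
    }
    return (unsigned)(sum / count);
}
```

In this model, timing only the first packet of each window reports the minimum (BASE_RTT) every time, while timing every packet reports a higher mean and a nonzero spread.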

In 1072, we gave 3 scenarios where the timestamp would be useful
(cases a, b & c on p.13).  Subsequent experience showed we got
two out of three right but (b) is bogus:  When congestion occurs
(i.e., at a packet loss) the sender should get cautious ("let's
not make the problem worse").  This implies that we want L() to
increase.  Clearly, the timestamp available to the receiver that
will cause the biggest increase in L() at the sender is the
timestamp of the left edge of the window (the timestamp of the
most recent in-sequence packet).  This is what my implementation
sends.  I believe it is correct on both theoretical & practical
grounds.

Actually, on practical grounds, (b) on p.13 is almost certainly
irrelevant.  1072 wasn't intended to change the semantics of
rtt (i.e., when the sender computes rtt) [though you seem to
have missed or disagree with this point -- see the note later].
With the existing semantics, the sender can update rtt only
when snd.una moves.  Thus the echo reply resulting from the
out-of-sequence packets will be ignored (since they cannot
move snd.una) *except when* the ack for the packet at the
left edge of the window gets lost.  In this case, it is best
on both information theoretic & system stability grounds if
the 'duplicate' acks generated by out-of-sequence packets are
indeed duplicates.  I.e., if they have the same echo reply
value as the ack for the packet at the left edge of the window.


Moving on to Dave's scenarios:

> 3) Packets arrive out of order, and we are acking every packet.
>     <A, ECopt=1> ------------------->   0
>             <---- <ACK(A), ECRopt=1>
>     <C, ECopt=3> ------------------->   1
>             <---- <ACK(A), ECRopt=3>	3, 1, or no ECR at all?
>     <B, ECopt=2> ------------------->   1
>             <---- <ACK(C), ECRopt=2>

The ack that results from C must have ECRopt=1, not 3.  There
are two separate ways of arriving at this:

1) If we consider what happens if the ack resulting from A gets
   lost, an ECRopt=3 in the ack from C will result in an
   *underestimate* of the rtt for A.  This is the last thing you
   want to happen.  If the ack from A does arrive, the ack from
   C doesn't move snd.una so the rtt isn't updated & it doesn't
   matter what you put in the ECRopt.  Since the receiver doesn't
   know which acks will get dropped, the only safe choice is
   ECRopt=1 in the ack from C.

2) Ultimately the rtt will include the time it takes the receiving
   application to consume that data (for most implementations, this
   happens indirectly via delayed acks & closed window probes but
   everything works better if an implementation is structured so
   the application is in the receive loop.  I.e., if acks are
   (usually) generated when the receiving application consumes the
   data, not when the data arrives at the receiver).  Since data
   cannot be delivered out of sequence, the rtt for C is controlled
   by B (this is why the ACK(C) resulting from B's arrival has
   ECRopt=2).  Since B hasn't arrived yet, all the receiver knows
   is the bounds on its timestamp: EC(A) <= EC(B) <= EC(C).  The most
   conservative decision (in terms of the effect on L() at the sender)
   is for the receiver to use the lower bound, EC(A).  [I will claim,
   without proof, that it's a good idea to be conservative about
   retransmits in the face of packet re-ordering.]


> 4) Packets arrive out of order, and we are NOT acking every packet.
> 	<A, ECopt=1> ------------------->   0
> 	<C, ECopt=3> ------------------->   1
> 		<---- <ACK(A), ECRopt=1>
> 	<D, ECopt=4> ------------------->   1
> 		<---- <ACK(A), ECRopt=4>	4, 1, or no ECR at all?
> 	<B, ECopt=2> ------------------->   1
> 		<---- <ACK(C), ECRopt=2>

For the same reasons as above, the ack that results from D must have
ECRopt=1, not 4.

> Imagine:
>         <A, ECopt=1>   ------------------->   0
>                 <---- <ACK(A), ECRopt=1>
> 	    (connection is idle for a while...)
>         <C, ECopt=103> ------------------->   1
>                 <---- <ACK(A), ECRopt=103>        103, 1, or no ECR at all?
>         <B, ECopt=102> ------------------->   1
>                 <---- <ACK(C), ECRopt=102>
> 
> If the middle ack was 1, not 103, then when the ACK is received,
> the RTT value that is computed is going to be drastically out of whack. 

Remember, 1072 changed the information available to compute the
rtt.  It did *not* change *when* the rtt is computed (i.e., only
when new data is acked).  The *only* way the ECRopt in the ack
resulting from C can be used is when the ack from A is dropped
(and, of course, C must be sent before the sender times out &
retransmits A).  In this case, the sender should get ECRopt=1 in
the ack from C so it will compute a reasonable rtt for A.  This
will inflate L but that's perfectly legitimate -- packet loss in
the reverse path increases uncertainty at the sender so it
wants to be more conservative about retransmitting.

Going back to Bob's last note,

> When a host receives an ECRopt, it has to believe it and update
> its RTT estimate (I don't see a simple alternative).  There are
> times when it SHOULD NOT believe it (an ECRopt in a data segment
> coming after some long idle period but carrying a timestamp echoed
> from an ACK.)

The "believe it" part of your statement is correct but the "and
update the rtt estimate" doesn't follow.  Nowhere in rfc1072
did we specify the sender's rtt estimation algorithm.  It never
occurred to me that anyone would conceive of changing what
currently happens without echo/echo-reply --- the rtt estimate
is updated only on 'new' acks (ones that move snd.una).  Assuming
that people don't make gratuitous changes to the rtt estimate
semantics while implementing 1072, the scenario you're worried
about can't arise and every packet should contain an ECRopt.

> Consider a simplex data transfer.  Assuming the ACKs carry
> ECopts, if we echo the ACK timestamps in the data segments, then
> the data receiver must calculate an RTT that it will never use,
> every time it receives a data segment.  That wasted processing
> is much greater than the processing to include or exclude an
> ECRopt option.

Same mistake as above:  1072 did not suggest that the rtt
estimation semantics be changed.  I.e., the intent was that
an implementation use exactly the code it does now but the
step that looks like (in BSD):
	
	/*
	 * If transmit timer is running and timed sequence
	 * number was acked, update smoothed round trip time.
	 */
	if (tp->t_rtt && SEQ_GT(ti->ti_ack, tp->t_rtseq)) {
		tcp_xmit_timer(tp, tp->t_rtt);
		tp->t_rtt = 0;
	}

could be augmented to time every packet if echo was negotiated:
(in 4.4, "echo_reply" points to the ECROPT in an incoming packet
(if there is one) & EXTRACT_ECROPT is a machine dependent (e.g.,
alignment constraints) macro to marshall the timestamp in the option.)

	if (echo_reply)
		tcp_xmit_timer(tp, now - EXTRACT_ECROPT(echo_reply));
	else if (tp->t_rtt && SEQ_GT(ti->ti_ack, tp->t_rtseq)) {
		tcp_xmit_timer(tp, tp->t_rtt);
		tp->t_rtt = 0;
	}

But this rtt estimate update code is executed *only* when new
data is acked (snd_una moves forward) and no one proposed
changing that.  Thus there is no wasted effort updating rtt
for the receiver side of a simplex connection (since snd_una
will never move).  The only cost of the ECRopt when it's not
used is the cost of setting up the echo_reply pointer (if we
specify a 'suggested order' for the EC/ECR opts, this can
be done in two instructions).
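
A self-contained sketch of that gating, with the timer machinery stripped out, might look as follows.  struct sender, ack_arrives and rtt_samples are hypothetical names; SEQ_GT follows the usual BSD wraparound-safe comparison.  A real stack would also reject ACKs beyond what has been sent and would pass the sample to tcp_xmit_timer() rather than store it.

```c
#include <stdbool.h>
#include <stdint.h>

/* Wraparound-safe "a is after b" for 32-bit sequence numbers. */
#define SEQ_GT(a, b) ((int32_t)((a) - (b)) > 0)

struct sender {
    uint32_t snd_una;      /* oldest unacknowledged sequence number */
    uint32_t last_rtt;     /* most recent rtt sample, in clock ticks */
    uint32_t rtt_samples;  /* number of samples taken so far */
};

/* Process an incoming ACK.  Returns true if an rtt sample was taken. */
static bool ack_arrives(struct sender *s, uint32_t ack,
                        bool has_ecr, uint32_t ecr, uint32_t now)
{
    if (!SEQ_GT(ack, s->snd_una))
        return false;  /* no new data acked: ECRopt is not used */

    s->snd_una = ack;
    if (!has_ecr)
        return false;  /* echo not negotiated: no timestamp to use */

    s->last_rtt = now - ecr;
    s->rtt_samples++;
    return true;
}
```

A duplicate ACK, or any segment arriving at the receiving side of a simplex transfer, fails the SEQ_GT test and returns before the ECRopt is looked at.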

> Am I convincing you?

By now, you've probably guessed the answer to this one :).

 - Van
 

> -----Original Message-----
> From: Alfred Hönes [mailto:ah@TR-Sys.de] 
> Sent: Monday, 22 March 2010 20:16
> To: tcpm@ietf.org; mallman@icir.org
> Subject: Re: [tcpm] poll for adoption of draft-gont-tcpm-tcp-timestamps-03
> 
> Mark Allman wrote:
> > ...
> >
> >   - But, to me, the right thing to do here is to roll these changes
> >     into the work item this WG already has going: 1323bis.  [...]
> 
> Going?  Gone ??
> It's listed on <http://tools.IETF.ORG/wg/tcpm/>:
> 
>    Expired:
>    draft-ietf-tcpm-1323bis                -01    2009-03-04  Expired
> 
> [ And, btw,   -00  was published  2008-01-29,
>   so extrapolating linearly, I still hope for a -02 soon.
>   But the confidence level of statistics based on a sample of size 2
>   is rather small.  :-) ]
> 
> A significant part of my review comments from early in 2009 
> have been addressed in the -01 draft version, but other, 
> non-trivial, parts have been deferred to the next update.  I 
> cannot recall any discussion of this draft on the list for a 
> very long time.
> 
> 
> So it looks like the choice for the WG might be between a 
> short document that can be shipped by the end of this year 
> (or even much faster) and a -1323bis in 5 years or so.
> 
> (Or would you prefer doing both?)
> 
> IIRC, Fernando's timestamp draft has been phrased as a BCP 
> because feedback from the WG (MPLS IETF?) indicated it would 
> not be acceptable to the WG with normative language, on the 
> Standards Track.
> 
> I personally would not oppose Standards Track, but I'm not the WG.
> 
> 
> Kind regards,
>   Alfred.
> 
> -- 
> 
> +------------------------+--------------------------------------------+
> | TR-Sys Alfred Hoenes   |  Alfred Hoenes   Dipl.-Math., Dipl.-Phys.  |
> | Gerlinger Strasse 12   |  Phone: (+49)7156/9635-0, Fax: -18         |
> | D-71254  Ditzingen     |  E-Mail:  ah@TR-Sys.de                     |
> +------------------------+--------------------------------------------+
> 
> _______________________________________________
> tcpm mailing list
> tcpm@ietf.org
> https://www.ietf.org/mailman/listinfo/tcpm
>