Re: Last Call: TCP Slow Start, Congestion Avoidance, Fast Retransmit, and Fast Recovery Algorithms to Proposed Standard

Thomas Narten <narten@vnet.ibm.com> Wed, 20 March 1996 20:54 UTC

Received: from ietf.cnri.reston.va.us by IETF.CNRI.Reston.VA.US id aa25573; 20 Mar 96 15:54 EST
Received: from CNRI.Reston.VA.US by IETF.CNRI.Reston.VA.US id aa25569; 20 Mar 96 15:54 EST
Received: from ietf.cnri.reston.va.us by CNRI.Reston.VA.US id aa12968; 20 Mar 96 15:54 EST
Received: from ietf.cnri.reston.va.us by IETF.CNRI.Reston.VA.US id aa25560; 20 Mar 96 15:54 EST
Received: from vnet.ibm.com by IETF.CNRI.Reston.VA.US id aa25555; 20 Mar 96 15:54 EST
Received: from RTP by VNET.IBM.COM (IBM VM SMTP V2R3) with BSMTP id 6916; Wed, 20 Mar 96 15:54:09 EST
Received: by RTP (XAGENTA 4.0) id 4843; Wed, 20 Mar 1996 15:54:22 -0500
Received: from localhost.raleigh.ibm.com by cichlid.raleigh.ibm.com (AIX 3.2/UCB 5.64/4.03) id AA18885; Wed, 20 Mar 1996 15:54:19 -0500
Message-Id: <9603202054.AA18885@cichlid.raleigh.ibm.com>
To: Richard Stevens <rstevens@noao.edu>
Cc: The IESG <iesg@IETF.CNRI.Reston.VA.US>
Subject: Re: Last Call: TCP Slow Start, Congestion Avoidance, Fast Retransmit, and Fast Recovery Algorithms to Proposed Standard
In-Reply-To: Your message of "Tue, 19 Mar 1996 13:19:06 EST." <9603191319.aa18538@IETF.CNRI.Reston.VA.US>
Date: Wed, 20 Mar 1996 15:54:18 -0400
X-Orig-Sender: iesg-request@IETF.CNRI.Reston.VA.US
Sender: ietf-archive-request@IETF.CNRI.Reston.VA.US
From: Thomas Narten <narten@vnet.ibm.com>

Overall, it is great to see this finally made an RFC. I have some
wording suggestions that I think might improve the document.

General comment. As a Proposed Standard, this should be telling
implementors what to do. There are places where the text comes off
more as a suggestion/explanation, or "here is what some
implementations do". This document should make it very clear what
implementations should do. That is, when is fast retransmit invoked?
MUST slow-start be done on LANs?  Etc.

I'd suggest reorganizing a bit and moving all the general descriptions
to the front (including Fast Retransmit and Fast Recovery, plus the
history of what was implemented where), and then having a final
section that describes the combined algorithms in perhaps even more
detail than is already done. The combined algorithms should be very
clear about what MUST be done, and what parameters should used. Right
now, the current description doesn't seem to fully describe Fast
Retransmit/Recovery, for example.

The remaining comments are more specific.

>     The algorithm to avoid this is called slow start.  It operates by
>     observing that the rate at which new packets should be injected into
>     the network is the rate at which the acknowledgments are returned by
>     the other end.

To be more precise, new packets should be injected at the same rate as
packets are leaving the network at the other end of the
connection. Received ACKs provide an indication of packets exiting.

>     Slow start adds another window to the sender's TCP:  the congestion
>     window, called "cwnd".  When a new connection is established with a
>     host on another network, the congestion window is initialized to one
>     segment (i.e., the segment size announced by the other end, or
>     the

Question: aren't the 'cwnd' calculations typically implemented in
bytes rather than segments? I understand that describing everything in
segments makes it easier to explain (at a high level), but I'm not
sure we want to leave out those details. It would help the document
to be more consistent in the units it uses.

>     default, typically 536 or 512).  Each time an ACK is received, the
>     congestion window is increased by one segment. The sender can
>     transmit up to the minimum of the congestion window and the
>     advertised window.  The congestion window is flow control imposed by
>     the sender, while the advertised window is flow control imposed by
>     the receiver.  The former is based on the sender's assessment of
>     perceived network congestion; the latter is related to the amount of
>     available buffer space at the receiver for this connection.
>
>     The sender starts by transmitting one segment and waiting for its
>     ACK.  When that ACK is received, the congestion window is
>     incremented from one to two, and two segments can be sent.  When
>     each of those two segments is acknowledged, the congestion window is
>     increased to four.

This maybe getting too picky,  but the above sentence could be
interpreted as don't expand the window until both ACKs for two
segments have been received.

This provides an exponential growth, although it
>     is not exactly exponential because the receiver may delay its ACKs,
>     typically sending one ACK for every two segments that it receives.
>
>     At some point the capacity of the internet can be reached, and an
>     intermediate router will start discarding packets.  This tells the
>     sender that its congestion window has gotten too large.
>
>     Early implementations performed slow start only if the other end was
>     on a different network.  Current implementations always perform slow
>     start.

Be stronger? Say MUST always perform SS?

>
> 2.  Congestion Avoidance
>
>     Congestion can occur when data arrives on a big pipe (a fast LAN)
>     and gets sent out a smaller pipe (a slower WAN).  Congestion can
>     also occur when multiple input streams arrive at a router whose
>     output capacity is less than the sum of the inputs.  Congestion
>     avoidance is a way to deal with lost packets.  It is described in
>     [2].

"congestion avoidance" seems to have a number of definitions. I'm not
sure of which one VJ uses, but it has always seemed more correct to me
use "avoidance" in the context of preventing congestion that leads to
packet loss. Once loss has occurred, congestion is present, and
"congestion control" takes over.

Indeed, doesn't VJ use congestion avoidance to refer specfically to
the phase where the window increases linearly? I.e., the window is
being expanded slowly under the assumption that is already operating
at close to the optimal window size?

>
>     The assumption of the algorithm is that packet loss caused by damage
>     is very small (much less than 1%), therefore the loss of a packet
>     signals congestion somewhere in the network between the source and
>     destination.  There are two indications of packet loss:  a timeout
>     occurring and the receipt of duplicate ACKs.
>
>     Congestion avoidance and slow start are independent algorithms with
>     different objectives.  But when congestion occurs TCP must slow down
>     its transmission rate of packets into the network, and then invoke
>     slow start to get things going again.  In practice they are
>     implemented together.

Better define the two modes/objectives. SS is designed to increase the
window (relatively) quickly, back to a "known" "optimal" window size
(without sending a window-sized burst of back-to-back packets).  CA
increases the window much more slowly, with the goal of using more
bandwidth that may have recently become available.

>
>     Congestion avoidance and slow start require that two variables be
>     maintained for each connection: a congestion window, cwnd, and a
>     slow start threshold size, ssthresh.  The combined algorithm
>     operates as follows:
>
>     1.  Initialization for a given connection sets cwnd to one segment
>         and ssthresh to 65535 bytes.
>
>     2.  The TCP output routine never sends more than the minimum of cwnd
>         and the receiver's advertised window.

Comparing bytes and segments. Confusing? Would it be possible to use
consistent units everywhere?

>
>     3.  When congestion occurs (indicated by a timeout or the reception
>         of duplicate ACKs),

How many duplicate ACKs? Isn't this fast retransmit? Shouldn't it be
described independent of this description? It would be nice, for
instance, to describe a bit why fast retransmit skips SS. [Oh. I see
that comes later in the text.]

>         one-half of the current window size (the
>         minimum of cwnd and the receiver's advertised window, but at
>         least two segments) is saved in ssthresh.  Additionally, if the
>         congestion is indicated by a timeout, cwnd is set to one segment
>         (i.e., slow start).
>
>     4.  When new data is acknowledged by the other end, increase cwnd,
>         but the way it increases depends on whether TCP is performing
>         slow start or congestion avoidance.
>
>
>
> Stevens                                                         [Page 3]
> 
> INTERNET-DRAFT                                                March 1996
>
>
>         If cwnd is less than or equal to ssthresh, TCP is in slow start;
>         otherwise TCP is performing congestion avoidance.  Slow start
>         continues until TCP is halfway to where it was when congestion
>         occurred (since it recorded half of the window size that caused
>         the problem in step 2), and then congestion avoidance takes
                                ^^^

add: (i.e., cwnd reaches ssthress)

>         over.

>         Slow start has cwnd begin at one segment, and be incremented by
>         one segment every time an ACK is received.  As mentioned
>         earlier, this opens the window exponentially:  send one segment,
>         then two, then four, and so on. Congestion avoidance dictates
>         that cwnd be incremented by segsize*segsize/cwnd each time an
>         ACK is received, where segsize is the segment size and cwnd is
>         maintained in bytes.

I must be missing something really basic here wrt the
segsize*segsize/cwnd formula. Congestion avoidance increases cwnd by
one packet each RTT. If segsize and cwnd are in bytes, shouldn't the
increase be "cwnd += segsize/cwnd" ?

Also, is the above strictly true, even if the ACK covers multiple
segments? And when the ACK is a duplicate?

>This is a linear growth of cwnd, compared
>         to slow start's exponential growth.  The increase in cwnd should
>         be at most one segment each round-trip time (regardless how many
>         ACKs are received in that RTT), whereas slow start increments
>         cwnd by the number of ACKs received in a round-trip time.
>
>     Many implementations incorrectly add a small fraction of the segment
>     size (typically the segment size divided by 8) during congestion
>     avoidance.  This is wrong and should not be emulated in future
>     releases.
>
> 3.  Fast Retransmit

Move this above the previous section, since the previous section
references Fast Retransmit.

>
>     Modifications to the congestion avoidance algorithm were proposed in
>     1990 [3].  Before describing the change, realize that TCP may
>     generate an immediate acknowledgment (a duplicate ACK) when an out-
>     of-order segment is received (Section 4.2.2.21 of [1], with a note
>     that one reason for doing so was for the experimental fast-
>     retransmit algorithm).  This duplicate ACK should not be delayed.
>     The purpose of this duplicate ACK is to let the other end know that
>     a segment was received out of order, and to tell it what sequence
>     number is expected.
>
>     Since TCP does not know whether a duplicate ACK is caused by a lost
>     segment or just a reordering of segments, it waits for a small
>     number of duplicate ACKs to be received.  It is assumed that if
>     there is just a reordering of the segments, there will be only one
>     or two duplicate ACKs before the reordered segment is processed,
>     which will then generate a new ACK.  If three or more duplicate ACKs
>     are received in a row, it is a strong indication that a segment has
>     been lost.  TCP then performs a retransmission of what appears to be
>     the missing segment, without waiting for a retransmission timer to
>     expire.

Need to be more firm. At what point are implementations supposed to
retransmit? If it is 3, say so.

>
> 4.  Fast Recovery
>
>
>
> Stevens                                                         [Page 4]
> 
> INTERNET-DRAFT                                                March 1996
>
>
>     After fast retransmit sends what appears to be the missing segment,
>     congestion avoidance, but not slow start is performed.  This is the
>     fast recovery algorithm.  It is an improvement that allows high
>     throughput under moderate congestion, especially for large windows.
>
>     The reason for not performing slow start in this case is that the
>     receipt of the duplicate ACKs tells TCP more than just a packet has
>     been lost.  Since the receiver can only generate the duplicate ACK
>     when another segment is received, that segment has left the network
>     and is in the receiver's buffer.  That is, there is still data
>     flowing between the two ends, and TCP does not want to reduce the
>     flow abruptly by going into slow start.
>
>     The fast retransmit and fast recovery algorithms are usually
>     implemented together as follows.
>
>     1.  When the third duplicate ACK in a row is received, set ssthresh
>         to one-half the current congestion window, cwnd, but no less
>         than two segments.  Retransmit the missing segment.  Set cwnd to
>         ssthresh plus 3 times the segment size.  This inflates the
>         congestion window by the number of segments that have left the
>         network and which the other end has cached (3).
>
>     2.  Each time another duplicate ACK arrives, increment cwnd by the
>         segment size.  This inflates the congestion window for the
>         additional segment that has left the network.  Transmit a
>         packet, if allowed by the new value of cwnd.
>
>     3.  When the next ACK arrives that acknowledges new data, set cwnd
>         to ssthresh (the value set in step 1).  This ACK should be the
>         acknowledgment of the retransmission from step 1, one round-trip
>         time after the retransmission.  Additionally, this ACK should
>         acknowledge all the intermediate segments sent between the lost
>         packet and the receipt of the first duplicate ACK.  This step is
>         congestion avoidance, since TCP is down to one-half the rate it
>         was at when the packet was lost.
>
>     The fast retransmit algorithm first appeared in the 4.3BSD Tahoe
>     release, and it was followed by slow start.  The fast recovery
>     algorithm appeared in the 4.3BSD Reno release.
>
> 5.  Security Considerations
>
>     Security considerations are not discussed in this memo.
>
>
>
>
>
>
>
> Stevens                                                         [Page 5]
> 
> INTERNET-DRAFT                                                March 1996
>
>
> 6.  References
>
>     [1]  B. Braden, ed., "Requirements for Internet Hosts --
>          Communication Layers," RFC 1122, Oct. 1989.
>
>     [2]  V. Jacobson, "Congestion Avoidance and Control," Computer
>          Communication Review, vol. 18, no. 4, pp. 314-329, Aug. 1988.
>          ftp://ftp.ee.lbl.gov/papers/congavoid.ps.Z.
>
>     [3]  V. Jacobson, "Modified TCP Congestion Avoidance Algorithm,"
>          end2end-interest mailing list, April 30, 1990.
>          ftp://ftp.isi.edu/end2end/end2end-interest-1990.mail.
>
>     [4]  W. R. Stevens, "TCP/IP Illustrated, Volume 1: The Protocols",
>          Addison-Wesley, 1994.
>
>     [5]  G. R. Wright, W. R. Stevens, "TCP/IP Illustrated, Volume 2:
>          The Implementation", Addison-Wesley, 1995.
>
> Author's  Address:
>
>     W. Richard Stevens
>     1202 E. Paseo del Zorro
>     Tucson, AZ  85718
>
>     Phone: 520-297-9416
>
>     EMail: rstevens@noao.edu
>     Home Page: http://www.noao.edu/~rstevens
>
>
>     Expires: September 14, 1996
>
>
>
>
>
>
>
>
>










Stevens                                                         [Page 6]