Re: Last Call for TCP Congestion Control I-D
Vern Paxson <vern@ee.lbl.gov> Tue, 10 November 1998 18:28 UTC
X-Authentication-Warning: assateague-fi.lerc.nasa.gov: listserv set sender to owner-tcp-impl@lerc.nasa.gov using -f
Message-Id: <199811101828.KAA18295@daffy.ee.lbl.gov>
To: tcp-impl@lerc.nasa.gov
Subject: Re: Last Call for TCP Congestion Control I-D
In-reply-to: Your message of Tue, 10 Nov 1998 10:27:23 PST.
Date: Tue, 10 Nov 1998 10:28:03 -0800
From: Vern Paxson <vern@ee.lbl.gov>
Sender: owner-tcp-impl@lerc.nasa.gov
Precedence: bulk
Status: RO
Content-Length: 24192
Lines: 510
--- draft-ietf-tcpimpl-cong-control-00.txt Mon Aug 10 09:50:28 1998 +++ draft-ietf-tcpimpl-cong-control-01.txt Tue Nov 10 09:46:25 1998 @@ -1,11 +1,10 @@ - -TCP Implementation Working Group W. Stevens -INTERNET DRAFT Consultant -File: draft-ietf-tcpimpl-cong-control-00.txt M. Allman - NASA Lewis/Sterling Software - V. Paxson +TCP Implementation Working Group M. Allman +INTERNET DRAFT NASA Lewis/Sterling Software +File: draft-ietf-tcpimpl-cong-control-01.txt V. Paxson LBNL - August, 1998 + W. Stevens + Consultant + November, 1998 TCP Congestion Control @@ -55,9 +54,9 @@ implementation of these algorithms. -Expires: February, 1999 [Page 1] +Expires: June, 1999 [Page 1] -draft-ietf-tcpimpl-cong-control-00.txt August 1998 +draft-ietf-tcpimpl-cong-control-01.txt November 1998 This document is organized as follows. Section 2 provides various definitions which will be used throughout the paper. Section 3 @@ -106,18 +105,21 @@ The restart window is the size of the congestion window after a TCP restarts transmission after an idle period. + FLIGHT SIZE: + The amount of data the has been sent but not yet acknowledged. + 3 Congestion Control Algorithms This section defines the four congestion control algorithms: slow start, congestion avoidance, fast retransmit and fast recovery, - developed in [Jac88] and [Jac90]. In some situations it may be - beneficial for a TCP sender to be more conservative than the - algorithms allow, however a TCP MUST NOT be more aggressive than the -Expires: February, 1999 [Page 2] +Expires: June, 1999 [Page 2] -draft-ietf-tcpimpl-cong-control-00.txt August 1998 +draft-ietf-tcpimpl-cong-control-01.txt November 1998 + developed in [Jac88] and [Jac90]. In some situations it may be + beneficial for a TCP sender to be more conservative than the + algorithms allow, however a TCP MUST NOT be more aggressive than the following algorithms allow (that is, MUST NOT send data when the value of cwnd computed by the following algorithms would not allow the data to be sent). @@ -170,13 +172,13 @@ algorithm is used when cwnd > ssthresh. When cwnd and ssthresh are equal the sender may use either slow start or congestion avoidance. - During slow start, a TCP increments cwnd by at most MSS bytes for - each ACK received that acknowledges new data. Slow start ends when - -Expires: February, 1999 [Page 3] +Expires: June, 1999 [Page 3] -draft-ietf-tcpimpl-cong-control-00.txt August 1998 +draft-ietf-tcpimpl-cong-control-01.txt November 1998 + + During slow start, a TCP increments cwnd by at most MSS bytes for + each ACK received that acknowledges new data. Slow start ends when cwnd exceeds ssthresh (or, optionally, when it reaches it, as noted above); or when cwnd reaches rwnd; or when congestion is observed. @@ -188,7 +190,8 @@ cwnd += MSS*MSS/cwnd (2) - This provides an acceptable approximation to the underlying + This adjustment is executed on every incoming non-duplicate ACK. + Equation (2) provides an acceptable approximation to the underlying principle of increasing cwnd by 1 full-sized segment per RTT. (Note that for a connection in which the receiver acknowledges every data segment, (2) proves slightly more aggressive than 1 segment per RTT, @@ -223,19 +226,22 @@ timer, the value of ssthresh MUST be set to no more than the value given in equation 3: - ssthresh = max (min (cwnd, rwnd) / 2, 2*MSS) (3) + ssthresh = max (FlightSize / 2, 2*MSS) (3) - Implementation Note: an easy mistake to make is to forget the inner - min() operation and simply use cwnd, which in some implementations - may incidentally increase well beyond rwnd. + As discussed above, FlightSize is the amount of outstanding data in + the network. - Furthermore, upon a timeout cwnd MUST be set to no more than the - loss window, LW, which equals 1 full-sized segment (regardless of - -Expires: February, 1999 [Page 4] +Expires: June, 1999 [Page 4] -draft-ietf-tcpimpl-cong-control-00.txt August 1998 +draft-ietf-tcpimpl-cong-control-01.txt November 1998 + + Implementation Note: an easy mistake to make is to simply use cwnd, + rather than FlightSize, which in some implementations may + incidentally increase well beyond rwnd. + + Furthermore, upon a timeout cwnd MUST be set to no more than the + loss window, LW, which equals 1 full-sized segment (regardless of the value of IW). Therefore, after retransmitting the dropped segment the TCP sender uses the slow start algorithm to increase the window from 1 full-sized segment to the new value of ssthresh, at @@ -252,31 +258,43 @@ First, they can be caused by dropped segments. In this case, all segments after the dropped segment will trigger duplicate ACKs. Second, duplicate ACKs can be caused by the re-ordering of data - segments by the network (not a rare event along some network paths). - Finally, duplicate ACKs can be caused by replication of ACK or data - segments by the network. + segments by the network (not a rare event along some network paths + [Pax97]). Finally, duplicate ACKs can be caused by replication of + ACK or data segments by the network. In addition, a TCP receiver + SHOULD send an immediate ACK when the incoming segment fills in all + or part of a gap in the sequence space. This will generate more + timely information for a sender recovering from a loss through a + retransmission timeout, a fast retransmit, or an experimental loss + recovery algorithm, such as NewReno [FH98]. The TCP sender SHOULD use the "fast retransmit" algorithm to detect and repair loss, based on incoming duplicate ACKs. The fast - retransmit algorithm uses the arrival of 3 duplicate ACKs (i.e., 4 - identical ACKs) as an indication that a segment has been lost. - After receiving 3 duplicate ACKs, TCP performs a retransmission of - what appears to be the missing segment, without waiting for the - retransmission timer to expire. + retransmit algorithm uses the arrival of 3 duplicate ACKs (4 + identical ACKs without the arrival of any other intervening packets) + as an indication that a segment has been lost. After receiving 3 + duplicate ACKs, TCP performs a retransmission of what appears to be + the missing segment, without waiting for the retransmission timer to + expire. After the fast retransmit sends what appears to be the missing segment, the "fast recovery" algorithm governs the transmission of new data until a non-duplicate ACK arrives. The reason for not performing slow start is that the receipt of the duplicate ACKs not - only tells the TCP that a segment has been lost, but also that - segments are leaving the network. In other words, since the - receiver can only generate a duplicate ACK when a segment has - arrived, that segment has left the network and is in the receiver's - buffer, so we know it is no longer consuming network resources. - Furthermore, since the ACK "clock" [Jac88] is preserved, the TCP - sender can continue to transmit new segments (although transmission - must continue using a reduced cwnd). + only indicates that a segment has been lost, but also that segments + are most likely leaving the network (although a massive segment + duplication by the network can invalidate this conclusion). In + other words, since the receiver can only generate a duplicate ACK + when a segment has arrived, that segment has left the network and is + in the receiver's buffer, so we know it is no longer consuming + network resources. Furthermore, since the ACK "clock" [Jac88] is + preserved, the TCP sender can continue to transmit new segments + (although transmission must continue using a reduced cwnd). +Expires: June, 1999 [Page 5] + +draft-ietf-tcpimpl-cong-control-01.txt November 1998 + + The fast retransmit and fast recovery algorithms are usually implemented together as follows. @@ -290,11 +308,6 @@ 3. For each additional duplicate ACK received, increment cwnd by MSS. This artificially inflates the congestion window in order - -Expires: February, 1999 [Page 5] - -draft-ietf-tcpimpl-cong-control-00.txt August 1998 - to reflect the additional segment that has left the network. 4. Transmit a segment, if allowed by the new value of cwnd and the @@ -310,30 +323,13 @@ out-of-order delivery of data segments at the receiver). Additionally, this ACK should acknowledge all the intermediate segments sent between the lost segment and the receipt of the - first duplicate ACK, if none of these were lost. - - Implementing fast retransmit/fast recovery in this manner can lead - to a phenomenon which allows the fast retransmit algorithm to repair - multiple dropped segments from a single window of data [Flo94]. - However, in doing so, the size of cwnd is also reduced multiple - times, which reduces performance. The following steps MAY be taken - to reduce the impact of successive fast retransmits on performance. + third duplicate ACK, if none of these were lost. - A. After the third duplicate ACK is received follow step 1 above, - but also record the highest sequence number transmitted - (send_high). + Note: This algorithm is known to generally not recover very + efficiently from multiple losses in a single flight of packets. One + proposed set of modifications to it to address this problem can be + found in [FH98]. - B. Instead of reducing cwnd to ssthresh upon receipt of the first - non-duplicate ACK received (step 5), the sender should wait - until an ACK covering send_high is received. In addition, each - duplicate ACK received should continue to artificially inflate - cwnd by 1 MSS. - - C. A non-duplicate ACK that does not cover send_high, followed by 3 - duplicate ACKs should not reduce ssthresh or cwnd but SHOULD - trigger a retransmission. A non-duplicate ACK that does not - cover send_high SHOULD NOT cause any adjustment in cwnd. - 4 Additional Considerations 4.1 Re-starting Idle Connections @@ -349,14 +345,14 @@ [Jac88] recommends that a TCP use slow start to restart transmission after a relatively long idle period. Slow start serves to restart - -Expires: February, 1999 [Page 6] - -draft-ietf-tcpimpl-cong-control-00.txt August 1998 - the ACK clock, just as it does at the beginning of a transfer. This mechanism has been widely deployed in the following manner. When TCP has not received a segment for more than one retransmission + +Expires: June, 1999 [Page 6] + +draft-ietf-tcpimpl-cong-control-01.txt November 1998 + timeout, cwnd is reduced to the value of the restart window (RW) before transmission begins. @@ -379,43 +375,82 @@ beginning transmission if the TCP has not sent data in an interval exceeding the retransmission timeout. -4.2 Acknowledgment Mechanisms +4.2 Generating Acknowledgments The delayed ACK algorithm specified in [Bra89] SHOULD be used by a TCP receiver. When used, a TCP receiver MUST NOT excessively delay - acknowledgments. Specifically, an ACK MUST be generated for every - second full-sized segment. (This "MUST" is listed in [Bra89] in one - place as a SHOULD and another as a MUST; here we unambiguously state - it is a MUST.) Furthermore, an ACK SHOULD be generated for every - second segment regardless of size. Finally, an ACK MUST NOT be - delayed for more than 500 ms waiting on a second full-sized segment - to arrive. Out-of-order data segments SHOULD be acknowledged - immediately, in order to trigger the fast retransmit algorithm. + acknowledgments. Specifically, an ACK SHOULD be generated for at + least every second full-sized segment, and MUST be generated within + 500 ms of the arrival of the first unacknowledged packet. - A TCP receiver MUST NOT generate more than one ACK for every - incoming segment. + The requirement that an ACK "SHOULD" be generated for at least every + second full-sized segment is listed in [Bra89] in one place as a + SHOULD and another as a MUST. Here we unambiguously state it is a + SHOULD. We also emphasize that this is a "strong" SHOULD, meaning + that an implementor should indeed only deviate from this requirement + after careful consideration of the implications. See the discussion + of "Stretch ACK violation" in [PAD+98] and the references therein + for a discussion of the possible performance problems with + generating ACKs less frequently than every second full-sized + segment. - TCP implementations that implement the selective acknowledgment - (SACK) option [MMFR96] are able to determine which segments have not - arrived at the receiver. Consequently, they can retransmit the lost - segments more quickly than TCPs without SACKs. This allows a TCP - sender to repair multiple losses in roughly one RTT after detecting - loss [FF96,MM96a,MM96b]. While no specific SACK-based recovery - algorithm has yet been standardized, any SACK-based algorithm should - follow the general principles embodied by the above algorithms. - First, as soon as loss is detected, ssthresh should be decreased per - equation (3). Second, in the RTT following loss detection, the - number of segments sent should be no more than half the number - transmitted in the previous RTT (i.e., before loss occurred). - Third, after the recovery period is finished, cwnd should be set to + In some cases, the sender and receiver may not agree on what what + constitutes a full-sized segment. An implementation is deemed to + comply with this requirement if it sends at least one acknowledgment + every time it receives 2*MSS bytes of new data from the sender, + where MSS is the Maximum Segment Size specified by the receiver to + the sender (or the default value of 536 bytes, per [Bra89], if the + receiver does not specify an MSS option during connection + establishment). Finally, we repeat that an ACK MUST NOT be delayed + for more than 500 ms waiting on a second full-sized segment to + arrive. Out-of-order data segments SHOULD be acknowledged + immediately, in order to accelerate loss recovery. To trigger the + fast retransmit algorithm, the receiver SHOULD send an immediate + duplicate ACK when it receives a data segment above a gap in the -Expires: February, 1999 [Page 7] +Expires: June, 1999 [Page 7] -draft-ietf-tcpimpl-cong-control-00.txt August 1998 +draft-ietf-tcpimpl-cong-control-01.txt November 1998 - the reduced value of ssthresh. The SACK-based algorithms outlined - in [FF96,MM96a,MM96b] adhere to these guidelines. + sequence space. To provide feedback to senders recovering from + losses, the receiver SHOULD send an immediate ACK when it receives a + data segment that fills in all or part of a gap in the sequence + space. + A TCP receiver MUST NOT generate more than one ACK for every + incoming segment, other than to update the offered window as the + receiving application consumes new data. + +4.4 Loss Recovery Mechanisms + + A number of loss recovery algorithms that augment fast retransmit + and fast recovery have been suggested by TCP researchers. While + some of these algorithms are based on the TCP selective + acknowledgment (SACK) option [MMFR96], such as [FF96,MM96a,MM96b], + others do not require SACKs [Hoe96,FF96,FH98]. The non-SACK + algorithms use ``partial acknowledgments'' (ACKs which cover new + data, but not all the data outstanding when loss was detected) to + trigger retransmissions. While this document does not standardize + any of the specific algorithms that may improve fast retransmit/fast + recovery, these enhanced algorithms are implicitly allowed, as long + as they follow the general principles of the basic four algorithms + outlined above. + + Therefore, when the first loss in a window of data is detected, + ssthresh MUST be set to no more than the value given by equation + (3). Second, until all lost segments in the window of data in + question are repaired, the number of segments transmitted in each + RTT MUST be no more than half the number of outstanding segments + when the loss was detected. Finally, after all loss in the given + window of segments has been successfully retransmitted, cwnd MUST be + set to no more than ssthresh and congestion avoidance MUST be used + to further increase cwnd. Loss in two successive windows of data, + or the loss of a retransmission, should be taken as two indications + of congestion and, therefore, cwnd (and ssthresh) MUST be lowered + twice in this case. The algorithms outlined in + [Hoe96,FF96,MM96a,MM6b] follow the principles of the basic four + congestion control algorithms outlined in this document. + 5. Security Considerations This document requires a TCP to diminish its sending rate in the @@ -431,6 +466,11 @@ The Internet to a considerable degree relies on the correct implementation of these algorithms in order to preserve network stability and avoid congestion collapse. An attacker could cause + +Expires: June, 1999 [Page 8] + +draft-ietf-tcpimpl-cong-control-01.txt November 1998 + TCP endpoints to respond more aggressively in the face of congestion by forging excessive duplicate acknowledgments or excessive acknowledgments for new data. Conceivably, such an attack could @@ -448,16 +488,13 @@ (Addison-Wesley, 1995). This material is used with the permission of Addison-Wesley. - Sally Floyd devised the algorithm presented in section 3.3 for - avoiding multiple cwnd cutbacks in the presence of multiple packets - lost from the same flight. Craig Partridge and Joe Touch + Neal Cardwell, Sally Floyd, Craig Partridge and Joe Touch contributed a number of helpful suggestions. - + References [AFP98] M. Allman, S. Floyd, C. Partridge, Increasing TCP's Initial - Window Size, Internet-Draft draft-floyd-incr-init-win-03.txt. - May, 1998. (Work in progress). + Window Size, September 1998. RFC 2414. [Bra89] B. Braden, ed., "Requirements for Internet Hosts -- Communication Layers," RFC 1122, Oct. 1989. @@ -465,27 +502,34 @@ [Bra97] S. Bradner, "Key words for use in RFCs to Indicate Requirement Levels", BCP 14, RFC 2119, March 1997. - - + [FF96] K. Fall, S. Floyd. Simulation-based Comparisons of Tahoe, + Reno and SACK TCP. Computer Communication Review, July 1996. + ftp://ftp.ee.lbl.gov/papers/sacks.ps.Z. -Expires: February, 1999 [Page 8] - -draft-ietf-tcpimpl-cong-control-00.txt August 1998 + [FH98] S. Floyd, T. Henderson. The NewReno Modification to TCP's + Fast Recovery Algorithm. Internet-Draft + draft-ietf-tcpimpl-newreno-00.txt, November 1998. (Work in + progress). - [FF96] Kevin Fall and Sally Floyd. Simulation-based Comparisons of - Tahoe, Reno and SACK TCP. Computer Communication Review, July - 1996. ftp://ftp.ee.lbl.gov/papers/sacks.ps.Z. - [Flo94] S. Floyd, TCP and Successive Fast Retransmits. Technical report, October 1994. - ftp://ftp.ee.lbl.gov/papers/fastretrans.ps. - - [HTH98] Amy Hughes, Joe Touch, John Heidemann. Internet-Draft + ftp://ftp.ee.lbl.gov/papers/fastretrans.ps. + + [Hoe96] J. Hoe, Improving the Start-up Behavior of a Congestion + Control Scheme for TCP. In ACM SIGCOMM, August 1996. + + [HTH98] A. Hughes, J. Touch, J. Heidemann. Issues in TCP Slow-Start + Restart After Idle. Internet-Draft draft-ietf-tcpimpl-restart-00.txt, March 1998. (Work in progress). [Jac88] V. Jacobson, "Congestion Avoidance and Control," Computer Communication Review, vol. 18, no. 4, pp. 314-329, Aug. 1988. + +Expires: June, 1999 [Page 9] + +draft-ietf-tcpimpl-cong-control-01.txt November 1998 + ftp://ftp.ee.lbl.gov/papers/congavoid.ps.Z. [Jac90] V. Jacobson, "Modified TCP Congestion Avoidance Algorithm," @@ -504,11 +548,14 @@ [MMFR96] M. Mathis, J. Mahdavi, S. Floyd, A. Romanow, "TCP Selective Acknowledgement Options", RFC 2018, October 1996. - [PAD+98] V. Paxson, M. Allman, S. Dawson, J. Griner, I. Heavens, - K. Lahey, J. Semke, B. Volz. Internet-Draft - draft-ietf-tcpimpl-prob-04.txt, August 1998. (Work in + [PAD+98] V. Paxson, M. Allman, S. Dawson, W. Fenner, J. Griner, + I. Heavens, K. Lahey, J. Semke, B. Volz. Internet-Draft + draft-ietf-tcpimpl-prob-05.txt, October 1998. (Work in progress). + [Pax97] V. Paxson, "End-to-End Internet Packet Dynamics," + Proceedings of SIGCOMM '97, Cannes, France, Sep. 1997. + [Pos81] J. Postel, Transmission Control Protocol, September 1981. RFC 793. @@ -527,19 +574,23 @@ -Expires: February, 1999 [Page 9] + + + + + + + + + + + +Expires: June, 1999 [Page 10] -draft-ietf-tcpimpl-cong-control-00.txt August 1998 +draft-ietf-tcpimpl-cong-control-01.txt November 1998 Author's Address: - W. Richard Stevens - 1202 E. Paseo del Zorro - Tucson, AZ 85718 - 520-297-9416 - rstevens@kohala.com - http://www.kohala.com/~rstevens - Mark Allman NASA Lewis Research Center/Sterling Software 21000 Brookpark Rd. MS 54-2 @@ -556,6 +607,12 @@ 510-486-7504 vern@ee.lbl.gov + W. Richard Stevens + 1202 E. Paseo del Zorro + Tucson, AZ 85718 + 520-297-9416 + rstevens@kohala.com + http://www.kohala.com/~rstevens @@ -586,5 +643,6 @@ -Expires: February, 1999 [Page 10] + +Expires: June, 1999 [Page 11]
- Last Call for TCP Congestion Control I-D Vern Paxson
- Re: Last Call for TCP Congestion Control I-D Vern Paxson
- Re: Last Call for TCP Congestion Control I-D Kevin M. Lahey
- Re: Last Call for TCP Congestion Control I-D Vern Paxson
- Re: Last Call for TCP Congestion Control I-D Kevin M. Lahey
- Re: Last Call for TCP Congestion Control I-D Kevin M. Lahey
- Re: Last Call for TCP Congestion Control I-D Vern Paxson
- Re: Last Call for TCP Congestion Control I-D Rick Jones
- RE: Last Call for TCP Congestion Control I-D Steve Alexander