comments on RFC1323.bis
Vern Paxson <vern@ee.lbl.gov> Tue, 05 May 1998 00:02 UTC
Message-Id: <199805050002.RAA13212@daffy.ee.lbl.gov>
To: David Borman <dab@bsdi.com>
Cc: tcplw@bsdi.com
Subject: comments on RFC1323.bis
Date: Mon, 04 May 1998 17:02:05 -0700
From: Vern Paxson <vern@ee.lbl.gov>
Here are context diffs to the nroff source to fix some typos and phrasing,
and also to point out some (minor) issues that need to be addressed.  These
last are done by introducing comments in the source, except when the
comments are made inside a display.

Some other issues:

    * The document doesn't specify the relationship between the options.
      For example, if you use window scaling, then is it a MUST that you
      use timestamps too?  Or a SHOULD?  Or ... ?

    * I added some MUSTs and SHOULDs (and the obligatory RFC 2119 cite
      to go with them).  But I may have missed some places where these
      should be used.

    * A significant technical issue: the current RTTM discussion does not
      mention anything about altering the constants used for the
      exponentially-weighted moving average when updating the estimate
      of RTT.  Sally Floyd has pointed out that using the usual constants
      is incorrect when the RTT is updated more than once per window;
      their use will result in an RTT estimate that is much more
      sensitive to transient changes in RTT.  I think at a minimum the
      document needs to point out that there is an open issue here.

    * It needs a "security considerations" section.  I sketched some
      thoughts on what might go in one.

        - Vern

--- rfc1323.bis.ORIG Mon May 4 16:55:48 1998
+++ rfc1323.bis Mon May 4 16:54:46 1998
@@ -72,6 +72,9 @@
 There is no one-line answer to the question: "How fast can TCP go?".
 There are two separate kinds of issues, performance and reliability,
 and each depends upon different parameters. We discuss each in turn.
+.sp
+(This document uses terms such as MUST and SHOULD.
+See RFC 2119 for the exact interpretation of these terms.)
 .IN +0.3i
 .LT "1.1 TCP Performance" 0.3i
 .sp
@@ -127,8 +130,10 @@
 corresponding increase of the probability of more than one packet per
 window being dropped. This could have a devastating effect upon the
 throughput of TCP over an LFN. In addition, if a congestion control
-mechanism based upon some form of random dropping were introduced into
-gateways, randomly spaced packet drops would become common, possible
+mechanism based upon some form of random dropping (such as discussed
+in RFC2309)
+were introduced into
+gateways, randomly spaced packet drops would become common, possibly
 increasing the probability of dropping more than one packet per
 window.
 .sp
@@ -318,7 +323,7 @@
 However, some buggy TCP implementation might be crashed by the first
 appearance of an option on a non-SYN segment. Therefore, for each of
 the extensions defined below, TCP options will be sent on non-SYN
-segments only after an exchange of options on the the SYN segments has
+segments only after an exchange of options on the SYN segments has
 indicated that both sides understand the extension. Furthermore, an
 extension option will be sent in a <SYN,ACK> segment only if the
 corresponding option was received in the initial <SYN> segment.
@@ -333,6 +338,12 @@
 segment, adding 12 bytes to the 20-byte TCP header. We believe that
 the bandwidth saved by reducing unnecessary retransmissions will more
 than pay for the extra header bandwidth.
+.\" How does the Timestamps option help with reducing unnecessary
+.\" retransmissions? It only will if currently the RTO estimates
+.\" are too low. While some TCP implementations suffer from this
+.\" problem, most do not. In particular, using the coarse-grained
+.\" BSD RTO algorithm works quite conservatively. So this argument
+.\" is not right.
 .sp
 There is also an issue about the processing overhead for parsing the
 variable byte-aligned format of options, particularly with a
@@ -342,7 +353,7 @@
 and if it is verified then use a fast path. Hosts that use this
 canonical layout will effectively use the options as a set of
 fixed-format fields appended to the TCP header. However, to retain the
-philosophical and protocol framework of TCP options, a TCP must be
+philosophical and protocol framework of TCP options, a TCP MUST be
 prepared to parse an arbitrary options field, albeit with less
 efficiency.
 .sp
@@ -415,7 +426,7 @@
 with the SYN bit on and the ACK bit off). It may also be sent in a
 <SYN,ACK> segment, but only if a Window Scale option was received in
 the initial <SYN> segment. A Window Scale option in a segment without
-a SYN bit should be ignored.
+a SYN bit SHOULD be ignored.
 .sp
 The Window field in a SYN (i.e., a <SYN> or <SYN,ACK>) segment itself
 is never scaled.
@@ -577,7 +588,7 @@
 set in the TCP header; if it is valid, it echos a timestamp value that
 was sent by the remote TCP in the TSval field of a Timestamps option.
 .mc
-When TSecr is not valid, its value must be zero.
+When TSecr is not valid, its value MUST be zero.
 .mc
 The TSecr value will generally be from the most recent Timestamp option
 that was received; however, there are exceptions that are explained
@@ -608,6 +619,8 @@
 represent the corresponding cumulative acknowledgments. The two
 timestamp fields of the Timestamps option are shown symbolically as <TSval=
 x,TSecr=y>. Each TSecr field contains the value most recently
+.\" "most recently received" conflicts with 3.4, which spells out
+.\" exactly which value is kept
 received in a TSval field; these echoed values. labelled "TS.Recent", are
 shown in parentheses.
 .nf
@@ -625,7 +638,11 @@
 4. (130) <--- <ACK(B),TSval=130,TSecr=6> (6)
 . . . ( Pause for 60 timestamp clock ticks ) . . . .
-
+
+.\" This formatting is messed up. Epochs 4 and 5 appear
+.\" twice. In the first 4, TS.Recent is 130, but in the
+.\" subsequent 5, we have TSecr=120. Also, why in 5 is
+.\" TSval=1??
 5. (130) <C,TSval=1,TSecr=120> ---> (1)
@@ -636,6 +653,8 @@
 5. ... <--- <y,ACK(A),TSval=191,TSecr=5> (5)
+.\" What is the point of this second, less precisely specified
+.\" example??
 TCP A TCP B
@@ -685,8 +704,8 @@
 .LT (A) 0.5i
 Delayed ACKs.
 .sp
-Many TCP's acknowledge only every Kth segment out of a group of
-segments arriving within a short time interval; this policy is known
+RFC1122 requires TCP's to acknowledge every 2nd full-sized segment.
+The policy of acknowledging only every 2nd segment is known
 generally as "delayed ACKs". The data-sender TCP must measure the
 effective RTT, including the additional time due to delayed ACKs, or
 else it will retransmit unnecessarily. Thus, when delayed ACKs are in
@@ -704,7 +723,7 @@
 situation the sender should be conservative about retransmission.
 Furthermore, it is better to overestimate than underestimate the RTT.
 An ACK for an out-of-order segment should therefore contain the
-timestamp from the most recent segment that advanced the window.
+timestamp from the most recent (in-order) segment that advanced the window.
 .sp
 The same situation occurs if segments are re-ordered by the network.
 .sp
@@ -734,7 +753,8 @@
 SEG.TSval >= TSrecent and SEG.SEQ <= Last.ACK.sent
 .IN -0.3i
 then SEG.TSval is copied to TS.Recent; otherwise, it is
-ignored.
+ignored. Note that this test replaces Karn's algorithm [Karn87],
+required by 4.2.3.1 of RFC1122.
 .sp
 .LT (3) 0.5i
 When a TSopt is sent, its TSecr field is set to the current TS.Recent
@@ -743,7 +763,10 @@
 The following examples illustrate these rules. Here A, B, C...
 represent data segments occupying successive blocks of sequence
 numbers, and ACK(A),... represent the corresponding acknowledgment
-segments. Note that ACK(A) has the same sequence number as B. We show
+segments. Note that ACK(A) has the same sequence number as B, because
+the first sequence number of B is the one immediately following the
+last sequence number included in A.
+We show
 only one direction of timestamp echoing, for clarity.
 .IN +0.5i
 .LT o 0.5i
@@ -857,8 +880,8 @@
 connection will be discarded by the normal 3-way handshake and sequence
 number checks of TCP.
 .sp
-It is recommended that RST segments NOT carry timestamps, and that RST
-segments be acceptable regardless of their timestamp. Old duplicate
+RST segments SHOULD NOT carry timestamps, and RST
+segments SHOULD be accepted regardless of their timestamp. Old duplicate
 RST segments should be exceedingly unlikely, and their cleanup function
 should take precedence over timestamps.
 .IN +0.3i
@@ -869,7 +892,7 @@
 .IN +0.5i
 .LT R1) 0.5i
 If there is a Timestamps option in the arriving segment and SEG.TSval <
-TS.Recent and if TS.Recent is valid (see later discussion), then treat
+TS.Recent and if TS.Recent is valid (see later discussion in 4.2.3), then treat
 the arriving segment as not acceptable:
 .IN +0.5i
 Send an acknowledgement in reply as specified in RFC-793 page 69 and
@@ -939,12 +962,15 @@
 If B's retransmission was triggered by the "fast retransmit"
 algorithm, i.e., by duplicate ACKs, then the queued segments that
 caused these ACKs must have been received already.
+.\" .sp
+.\" Even if a segment were delayed past the RTO, the Fast Retransmit
+.\" mechanism [Jacobson90c] will cause the delayed
+.\" packets to be retransmitted at the same time as B.2, avoiding an extra
+.\" RTT and therefore causing a very small performance penalty.
+.\"
+.\" ^^^^ This isn't right: fast retransmission will only retransmit
+.\" one packet, not all of the delayed packets.
 .sp
-Even if a segment were delayed past the RTO, the Fast Retransmit
-mechanism [Jacobson90c] will cause the delayed
-packets to be retransmitted at the same time as B.2, avoiding an extra
-RTT and therefore causing a very small performance penalty.
-.sp
 We know of no case with a significant probability of occurrence in
 which timestamps will cause performance degradation by unnecessarily
 discarding segments.
@@ -973,7 +999,7 @@
 .sp
 To make this more quantitative, any clock faster than 1 tick/sec will
 reject old duplicate segments for link speeds of ~8 Gbps. A 1ms
-timestamp clock will work at link speeds up to 8 Tbps (8*10**12) bps!
+timestamp clock will work at link speeds up to 8 Tbps (8*10**12 bps)!
 .sp
 .LT (b) 0.5i
 The timestamp clock must not be "too fast".
@@ -1124,7 +1150,7 @@
 .LT "4.3. Duplicates from Earlier Incarnations of Connection" 0.3i
 .sp
 The PAWS mechanism protects against errors due to sequence number
-wrap-around on high-speed connection. Segments from an earlier
+wrap-around on high-speed connections. Segments from an earlier
 incarnation of the same connection are also a potential cause of old
 duplicate errors. In both cases, the TCP mechanisms to prevent such
 errors depend upon the enforcement of a maximum segment lifetime (MSL)
@@ -1179,8 +1205,19 @@
 .ne 2
 [Braden89] Braden, R., editor,
 "Requirements for Internet Hosts -- Communication Layers",
-RFC 1122, October, 1989
+RFC 1122, October, 1989.
+.sp
 .ne 2
+[Braden98] Braden, B., et al,
+"Recommendations on Queue Management and Congestion Avoidance in the Internet",
+RFC 2309, April, 1998.
+.sp
+.ne 2
+[Bradner97]
+S. Bradner, "Key words for use in RFCs to Indicate Requirement Levels",
+RFC 2119, March 1997.
+.sp
+.ne 2
 [Clark87] Clark, D., Lambert, M., and L. Zhang, "NETBLT: A Bulk
 Data Transfer Protocol", RFC 998, MIT, March 1987.
 .sp
@@ -1335,6 +1372,8 @@
 .sp
 So, the MSS value to be sent in an MSS option should be equal to the
 effective MTU minus the fixed IP and TCP headers.
+.\" But you don't know the effective MTU when sending an initial SYN
+.\" and doing PMTU discovery
 Since both IP and TCP options are ignored when calculating the
 value for the MSS option, if there are any IP or TCP options to be
 sent in a packet,
@@ -1361,6 +1400,9 @@
 fragmented, and packets sent with the constraints in the lower right of
 this grid will cause IP fragmentation, the only way to guarantee
 that this doesn't happen is for
+.\" It's not the "only way" - the other way is to confine
+.\" the behavior to the first column (MSS adjusted to include
+.\" options), since that's always conservative.
 the data sender to decrease the TCP data length by the size of the
 IP and TCP options.
 And since the sender will be adjusting the TCP data
@@ -1439,6 +1481,8 @@
 not the major contributor to this problem; the RTT is the limiting
 factor in how quickly connections can be opened and closed.
 Therefore, this problem will be no worse at high transfer speeds.
+.\" It is worse at high transfer speeds, because you can sustain
+.\" more connections per second.
 .sp
 .LT (b) 0.5i
 Allow old duplicate segments to expire.
@@ -1526,9 +1570,9 @@
 is disabled. The Karn algorithm disables all RTT
 measurements during retransmission, since it is
 ambiguous whether the ACK is
-is for the original packet, or the retransmitted packet.
+for the original packet, or the retransmitted packet.
 With Timestamps, that ambiguity is removed since the TSecr
-in the ACK will contain the TSval from which ever data
+in the ACK will contain the TSval from whichever data
 packet made it to the destination.
 .sp
 .LT (b) 0.5i
@@ -1552,7 +1596,7 @@
 to fill in the SEG.WND value, not SND.WND.
 .sp
 .LT (d) 0.5i
-New pseudo-code summary has been added in Appendix E.
+A new pseudo-code summary has been added in Appendix E.
 .sp
 .LT (e) 0.5i
 Appendix A has been expanded with information about
@@ -1584,7 +1628,7 @@
 Clock Values
     my.TSclock: Local source of 32-bit timestamp values
-    my.TSclock.rate: Period of my.TSclock (1 ms to 1 sec).
+    my.TSclock.rate: Tick granularity of my.TSclock (1 ms to 1 sec).
 Per-Connection State Variables
@@ -1649,7 +1693,7 @@
     (my.TSclock - SEG.TSecr)*my.TSclock.rate ) ;
 }
- if Segment contains WSopt) then {
+ if (Segment contains WSopt) then {
     Snd.wind.scale = SEG.WSopt;
     Snd.WS.OK = TRUE;
 }
@@ -1701,6 +1745,9 @@
 else
     Update_SRTT( /* for compatibility */
     (my.TSclock - Start.Time)/my.TSclock.rate);
+ ** Won't this update the RTT estimate on every
+ ** segment rather than once per window, requiring
+ ** new EWMA constants?
 }
 }
@@ -1972,7 +2019,15 @@
 .ne 3
 .LT "Security Considerations" 0.3i
 .sp
-Security issues are not discussed in this memo.
+"Security issues are not discussed in this memo" is no longer
+acceptable. A few considerations that come to mind: window scaling
+makes denial-of-service easier if one can find an endless TCP data source
+(such as chargen) since it can be made to send data at a higher rate
+than it otherwise could; if mandatory, timestamps could make TCP
+spoofing more difficult, because the spoofer has a harder time crafting
+the timestamp echoes for the spoofed side of the connection; accepting
+RSTs regardless of their timestamps doesn't make it any harder or
+easier to spoof RST packets.
 .sp
 .LT "Authors' Addresses" 0.3i
 .sp
@@ -1980,10 +2035,10 @@
 Van Jacobson
 University of California
 Lawrence Berkeley Laboratory
-Mail Stop 46A
+Mail Stop 50B/2239
 Berkeley, CA 94720
 .sp
-Phone: (415) 486-6411
+Phone: (510) 486-7519
 EMail: van@ee.lbl.gov
 .sp 2
 .ne 8
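
A rough sketch of the EWMA concern raised in the issues list (the RTT
estimate being updated more than once per window) may help make it
concrete. This is an illustration only, not code from the draft: the
gain of 1/8, the 16-segment window, the 100 ms and 200 ms RTT values,
and the names smooth() and scaled_gain() are all assumptions chosen for
the example, and rescaling the gain per sample is just one conceivable
adjustment, not something the draft or this note specifies.

```python
# Illustration (assumed numbers): how one RTT-long transient moves the
# smoothed RTT when it is sampled once per window versus once per segment.

GAIN = 1.0 / 8.0  # assumed "usual" EWMA gain for the smoothed RTT


def smooth(srtt, samples, gain=GAIN):
    """Fold each RTT sample into the estimate: srtt += gain * (sample - srtt)."""
    for sample in samples:
        srtt += gain * (sample - srtt)
    return srtt


def scaled_gain(samples_per_window, gain=GAIN):
    """Hypothetical per-sample gain chosen so that samples_per_window
    updates have the same combined weight as one update with `gain`."""
    return 1.0 - (1.0 - gain) ** (1.0 / samples_per_window)


BASE_MS = 100.0   # steady-state RTT (assumed)
SPIKE_MS = 200.0  # RTT seen during a transient lasting one window (assumed)
WINDOW = 16       # segments, hence samples, per window (assumed)

# One sample per window: the transient contributes a single update.
once_per_window = smooth(BASE_MS, [SPIKE_MS])

# One sample per segment, same gain: the transient contributes 16 updates.
per_segment = smooth(BASE_MS, [SPIKE_MS] * WINDOW)

# One sample per segment, gain shrunk to compensate.
per_segment_scaled = smooth(BASE_MS, [SPIKE_MS] * WINDOW, scaled_gain(WINDOW))

print(f"once per window:        {once_per_window:.1f} ms")
print(f"per segment, same gain: {per_segment:.1f} ms")
print(f"per segment, scaled:    {per_segment_scaled:.1f} ms")
```

With these assumed numbers, the once-per-window estimate moves from
100 ms to 112.5 ms, while sampling every segment with the same gain
pulls it to roughly 188 ms, which is the extra sensitivity to transients
described above. Shrinking the gain per sample brings the estimate back
to 112.5 ms.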