Draft RFC on Path-MTU Discovery
mogul (Jeffrey Mogul) Fri, 12 January 1990 23:40 UTC
Received: by acetes.pa.dec.com (5.54.5/4.7.34)
id AA26665; Fri, 12 Jan 90 15:40:58 PST
From: mogul (Jeffrey Mogul)
Message-Id: <9001122340.AA26665@acetes.pa.dec.com>
Date: 12 Jan 1990 1540-PST (Friday)
To: mtudwg
Cc:
Subject: Draft RFC on Path-MTU Discovery
Network Working Group                  K. McCloghrie, Hughes LAN Systems
Request for Comments: DRAFT                   R. Fox, Hughes LAN Systems
                                             J. Mogul, Digital Equipment
                                                         12 January 1990

                    WORKING DRAFT - Do not circulate

                      Path-MTU Discovery Protocol

1. Status of this Memo

This memo describes a protocol for discovering the maximum transmission unit of an internet path, using one new IP option and two new ICMP messages. This is proposed as an alternative to the procedures described in RFC-1063. This memo does not define an Internet standard. Distribution of this memo is unlimited.

2. Introduction

When one IP host has a large amount of data to send to another host, the data is transmitted as a series of IP datagrams. It is preferable, in the general case, that these datagrams be of the largest size which does not require fragmentation during transmission. This size is referred to as the Path-MTU (PMTU) of the path from the source to the destination, and is equal to the minimum of the MTUs of each hop in the path, where the MTU of a hop is the maximum size of an IP datagram on that hop.

A shortcoming of the current Internet protocol suite is the lack of a standard mechanism for a host to discover the PMTU of an arbitrary path.

The reasons to avoid fragmentation and the problems it incurs are well-documented in [1]. Some of these problems include:

- use of fragmentation can sometimes lead to "deterministic fragment loss", where something in the internetwork causes certain fragments to be lost with higher than usual probability. For example, a router with insufficient buffer capacity might always drop the 4th packet in a burst. Since fragments are not individually acknowledged, this leads to miserable performance, or total failure, since retransmissions of the original datagram suffer the same fate.
  (If the datagrams are not fragmented, then even when they are lost deterministically, a protocol such as TCP will make slow but steady progress.)

McCloghrie/Fox/Mogul                  DRAFT                     [page 1]
RFC DRAFT                  Path-MTU Discovery               January 1990

- sending datagrams of a size just larger than the PMTU causes each original datagram to be fragmented into one full-size and one tiny fragment. So half the datagrams are tiny fragments, which not only makes inefficient use of the bandwidth but also forces the gateways to forward twice as many datagrams as necessary.

- IP reassembly depends on having unique IP Identification values in each in-flight datagram. With the Ident field being only 16 bits wide, the need to guarantee that every datagram in flight from one host to another has a unique Ident imposes a restriction on the maximum datagram transmission rate. (See [1] for an example.)

- IP reassembly is inherently less efficient than transport layer "reassembly". (Again, see [1] for the arguments.)

3. Summary of the Protocol

The PMTU Discovery Protocol proposed in this memo is a hybrid of two different mechanisms. Both mechanisms are invoked by the sender of a datagram, using the IP PMTU-Query Option.

In the primary mechanism, gateways along the path use this option to compute the minimum MTU of any hop on the path; the last-hop gateway then uses an ICMP Path-MTU message to report the PMTU to the sending host. Provisions are made to detect if any of the gateways do not implement this option (in which case, the computed value is only an upper bound on the PMTU). This mechanism does not involve participation by the receiving host, and so can be used without upgrading all end hosts. The IP PMTU-Query option is transmitted periodically, not on all datagrams, so that the gateways will not be burdened by excessive option-processing.
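As a rough illustration of the quantities discussed so far, the sketch below computes the PMTU as the minimum of the hop MTUs and shows how a datagram only slightly larger than the PMTU splits into one large and one tiny fragment. The function names, the hop MTU values, and the assumption of a 20-byte IP header without options are illustrative and are not taken from the memo; the 8-byte alignment of fragment data follows ordinary IP fragmentation.

```python
IP_HEADER_LEN = 20  # bytes; assumes an IP header without options


def path_mtu(hop_mtus):
    """PMTU is the minimum of the MTUs of each hop in the path."""
    return min(hop_mtus)


def fragment_sizes(datagram_len, mtu):
    """Total lengths of the fragments produced when a datagram crosses a hop."""
    if datagram_len <= mtu:
        return [datagram_len]  # fits, so no fragmentation
    data_left = datagram_len - IP_HEADER_LEN
    # Every fragment except the last carries a multiple of 8 data bytes.
    max_data = (mtu - IP_HEADER_LEN) // 8 * 8
    sizes = []
    while data_left > 0:
        chunk = min(max_data, data_left)
        sizes.append(IP_HEADER_LEN + chunk)
        data_left -= chunk
    return sizes


hops = [1500, 576, 1500]  # hypothetical hop MTUs along one path
pmtu = path_mtu(hops)
print(pmtu)                             # 576
print(fragment_sizes(pmtu, pmtu))       # [576]
print(fragment_sizes(pmtu + 24, pmtu))  # [572, 48]
```

With these example numbers, a sender that overshoots the PMTU by 24 bytes doubles the number of datagrams the gateways must forward.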
The secondary mechanism is used as a backup, in case the IP PMTU-Query option is not supported by all the gateways, or in case the route changes to one with a lower PMTU before the source resends the PMTU-Query. It does require support from the receiving host. The receiver uses the reception of an IP PMTU-Query as an indication that the sender has invoked the protocol, and as a request to cache the PMTU of the path from the sender. This cached value provides the receiver with a way to detect changes in the PMTU, by comparing the size of incoming fragmented datagrams against the cached PMTU. When a change in the PMTU value is detected, the sender is notified through the use of an ICMP Unexpected Fragment Report message. Mechanisms are provided to limit how many ICMP Unexpected Fragment Report messages are sent, to avoid polluting the network.

4. Analysis of Previous Approaches

4.1. RFC-1063

RFC-1063 [2] describes one solution to the PMTU Discovery problem. It defines a new IP option that is examined and possibly updated by each gateway along a path in such a way that it contains the PMTU when it arrives at its destination. Another new IP option is used to convey the PMTU back to the sender, piggybacked on a datagram going in the other direction.

Discussion of the RFC-1063 proposal (e.g. in [4]) has highlighted a number of drawbacks:

- for correct operation of the scheme, all gateways along a path must be updated to support the new IP option, which, for some paths, may take a long time. If hosts start using the scheme before all the gateways have been updated, they may end up incurring more fragmentation than if not using it, because they may conclude that the PMTU of a given path is larger than it really is, for example: when the two gateways connected by the hop with the minimum MTU do not support the new option.
- this scheme does not work in cases where there is no return traffic on which to piggyback the discovered PMTU. For example, a host sending unsolicited datagram "trap" or "event" messages to a network monitoring center would be unable to discover the PMTU to the monitoring center.

- most gateways are optimized for forwarding datagrams which do not contain IP options. Thus, an overhead is imposed on the gateways' throughput if the MTU option is sent too frequently. However, sending the MTU option too infrequently can also cause additional overhead when a decrease in the PMTU occurs (e.g. due to an alternate routing around a link outage), since the sender will continue sending datagrams that need to be fragmented on the new path until it next uses the option and discovers the decrease.

4.2. Report Fragmentation

To counteract these drawbacks, alternative schemes have been suggested (e.g. in [4]) based on the receiver reporting the occurrence of fragmentation to the sender via a new ICMP message. Such schemes have been called "report-fragmentation". They too, however, have a number of drawbacks:

- to discover increases in the PMTU, report-fragmentation schemes require that larger datagrams are occasionally sent, to probe the path to determine whether they get fragmented. Since the PMTU of a path changes only infrequently, most of these "on-purpose" fragmentations are otherwise unnecessary.

- in situations where data is to be sent to a new destination, for which the sender has no (cached) PMTU information, the PMTU Discovery cannot take place until there is a large datagram to be sent. In reality, most applications initially exchange a number of small messages (e.g. connection establishment messages) prior to sending their first large datagram.
  In contrast, a query-based scheme (like RFC-1063) can send its probe piggybacked on a datagram of any size, and receive the reply containing the PMTU during the exchange of the initial small messages so that all the subsequent large datagrams can be optimally sized.

- a report-fragmentation scheme must specify which arriving fragments cause the sending of an ICMP message to report the fragmentation. Sending on all occurrences would be excessive since there can be a round-trip time's worth of datagrams in-flight concurrently, each of which would generate an ICMP message; on a high-bandwidth/long-delay path this could generate hundreds or more unnecessary ICMP messages. It would also be wasteful to send report messages to a source which would ignore them.

  Alternately, datagrams could be marked to request that if they arrive fragmented, an ICMP message is to be generated. The ideal way to do this would be to use a spare bit in the IP header (see [4]). If no spare bit is available, then the mark presumably needs to be carried via an IP option, but this now inherits some of the drawbacks of the RFC-1063 scheme mentioned above.

- there are existing applications which ignore the PMTU, and always send larger datagrams regardless of whether fragmentation will result. In order not to send unnecessary ICMP messages reporting the fragmentation of such datagrams, (until now) it has been necessary for an implementation to keep track of which applications are mindful of PMTU Discovery and which applications ignore it. In practice, this is probably too much state information to be kept in the IP layer. Thus, some amount of the implementation must be done in every transport/application layer which cares about avoiding fragmentation.
  This not only duplicates implementation effort and code, but probably requires separate ICMP messages to be sent for each transport/application which results in a significant increase in the overhead.

5. The Proposed Protocol

5.1. Description

The PMTU Discovery protocol proposed in this memo defines one new IP option and two new ICMP messages. The new IP option is the PMTU-Query option; the two new ICMP messages are the Path-MTU message and the Unexpected Fragment Report message. These are used as follows:

The IP PMTU-Query option can be carried on any datagram and asks the gateways through which it passes to update, if necessary, its minimum MTU value (like RFC-1063). However, the PMTU-Query option also contains other information which must be updated by each gateway supporting the option. This provides the means for the destination to determine whether all gateways on the path taken by that datagram do support the option, and therefore whether the minimum MTU value it contains is indeed the PMTU or just an upper bound.

The ICMP Path-MTU message is used to reply to a PMTU-Query option in order to carry the PMTU information back to the sender. Since this is sent as an ICMP message (in contrast to RFC-1063's replying IP option), it does not require there to be return traffic in the other direction. In addition, it can be sent immediately on receipt of the PMTU-Query option, rather than having to be queued in the receiver awaiting a datagram on which to piggyback it.

The ICMP Unexpected Fragment Report message is sent by a host receiving a fragment of an unexpected size to the sender of the fragmented datagram. These messages are sent because the unexpected size indicates that the PMTU has changed (i.e., no ICMP message is sent on receiving a fragment of the expected size).
After sending a limited number of ICMP Unexpected Fragment Report messages, the expected size is updated, so that no more will be sent before the next change in the PMTU.

The PMTU-Query option contains a minimum-MTU-so-far field and a next-hop address field. The minimum-MTU-so-far field is decreased by any gateway on the path if the MTU of the next/previous hop is less than the value which the field currently contains. The next-hop address field is set to the address of the next IP gateway to which it is forwarded. As each gateway processes the option, it expects to find its own address in the next-hop address field; if it does then it updates the next-hop field to the address to which it next forwards the datagram; otherwise, it sets the next-hop field to zero. Thus, if the option arrives at the destination with its next-hop field non-zero, then all gateways along that path have updated the option and the minimum-MTU-so-far value is indeed the PMTU; otherwise, it is just a best-guess.

The ICMP Path-MTU message is sent either by the "last-hop" gateway or by the destination host. The "last-hop" gateway is the gateway which in processing the datagram recognizes that it does not need to be forwarded via any more gateways, but can now be sent directly to the destination host. The last-hop gateway sends the ICMP Path-MTU if and only if the PMTU-Query's next-hop field contains the gateway's own address. The ICMP Path-MTU message is sent by the destination host only if it receives the option with its next-hop field set to zero and if the destination host is willing to send ICMP Unexpected Fragment Report messages. Notice that the MTU of the last hop (from the last-hop gateway to the destination host) is known by the last-hop gateway; thus if all gateways along the path support the option, the PMTU can be determined and communicated back to the source host even if the destination host does not support PMTU Discovery.
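The next-hop bookkeeping described above can be traced with a small simulation. The following is a sketch under stated assumptions, not the memo's specification: the `Gateway` class, the `trace_query` function, the example addresses, and the use of "0.0.0.0" to stand for a zeroed next-hop field are all hypothetical. It also assumes that a gateway which does not implement the option forwards it unchanged, and that the destination host supports the protocol and treats a next-hop value other than its own address as zero.

```python
from dataclasses import dataclass

ZERO = "0.0.0.0"  # stands in for a zeroed next-hop field (illustrative)


@dataclass
class Gateway:
    address: str
    supports_option: bool


def trace_query(link_mtus, gateways, dest_address):
    """Follow one PMTU-Query option from source to destination.

    link_mtus[i] is the MTU of the hop into gateways[i]; link_mtus[-1] is
    the MTU of the last hop, from the last gateway to the destination.
    Returns (min_mtu, code, reporter) for the resulting ICMP Path-MTU message.
    """
    assert gateways and len(link_mtus) == len(gateways) + 1
    # Source host: minimum-MTU-so-far starts at the outgoing interface MTU,
    # next-hop starts at the address of the first-hop gateway.
    min_mtu = link_mtus[0]
    next_hop = gateways[0].address
    for i, gw in enumerate(gateways):
        if not gw.supports_option:
            continue  # assumed: the option is forwarded untouched
        min_mtu = min(min_mtu, link_mtus[i], link_mtus[i + 1])
        if next_hop != gw.address:
            next_hop = ZERO  # some earlier gateway did not update the option
        elif i + 1 < len(gateways):
            next_hop = gateways[i + 1].address
        else:
            # Last-hop gateway found its own address: it reports the PMTU.
            return min_mtu, "Valid", "last-hop gateway"
    # Otherwise the destination host can only report an upper bound.
    min_mtu = min(min_mtu, link_mtus[-1])
    return min_mtu, "Upper-Bound", "destination host"


full = [Gateway("10.0.1.1", True), Gateway("10.0.2.1", True)]
print(trace_query([1500, 576, 1500], full, "10.0.3.9"))
# (576, 'Valid', 'last-hop gateway')

partial = [Gateway("10.0.1.1", True), Gateway("10.0.2.1", False),
           Gateway("10.0.3.1", False)]
print(trace_query([1500, 1500, 296, 1500], partial, "10.0.4.9"))
# (1500, 'Upper-Bound', 'destination host')
```

In the second example the 296-byte hop lies between two gateways that do not implement the option, so the reported 1500 is only an upper bound on the true PMTU, and the Code value tells the source so.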
The number of ICMP Unexpected Fragment Report messages that the destination host is allowed to send for each PMTU change is limited. The value of the limit is a configured parameter in the destination, but each PMTU-Query option carries an implicit refresh of that limit, which restores the remaining count of subsequent messages that can be sent to the full limit. The destination host caches both the current PMTU value and the remaining count of Unexpected Fragment Report messages it can still send. Whenever the destination host receives a fragment (actually, only fragment-0 of a datagram), it compares the size of this fragment to its cached value of the PMTU; if they are different and the remaining count has not yet decremented to zero, then it sends an Unexpected Fragment Report message and decrements the remaining count; otherwise, no ICMP Unexpected Fragment Report message is generated.

The destination host maintains its cached value of the PMTU both from the MTU-value field in a received PMTU-Query option, and from the size of arriving fragments (when the remaining count gets decremented to zero). Both the MTU-value in a PMTU-Query option with its next-hop field containing a valid address and the size of a fragment-0 are considered accurate indications of the current PMTU of the path, and overwrite the destination host's cached value. However, the MTU-value in a PMTU-Query option with its next-hop field set to zero is considered only an upper-bound on the PMTU of the path, and is never used to increase (only to decrease or to initialize) the cached PMTU value.

The source host caches the PMTU information it learns as part of its information about the route to each non-local-network destination, i.e. as an extension to its IP Route Cache. Whenever a Route Cache entry is created, the initial value for PMTU is set to 576 and a PMTU-Query option is sent with the first datagram.
Then, if no further information is obtained, the PMTU value will remain at 576 for the life of the cache entry.

When a PMTU-Query option is sent, it may fail to get an answer. One reason is that at least one gateway and the destination host do not support the option. Another reason is that the datagram carrying it, or the replying ICMP Path-MTU message, might get lost in the network. To cater for the latter case, the option needs to be retransmitted until an answering Path-MTU message is received or a retry count is exhausted.

In addition, the PMTU-Query option needs to be resent periodically to determine if the PMTU has changed, or if the status of support for PMTU queries by the gateways on the current route has changed. Maximum and minimum values for the period between resends are suggested elsewhere in this memo. The period remains at the minimum while there is enough support for the protocol on the current path such that at least one of the protocol's mechanisms is working. If there is insufficient support for either mechanism, then the query is still resent (in order to detect changes in the level of support) but the period is exponentially backed-off up to the maximum value.

5.2. Advantages

The advantages of the proposed protocol are:

- the protocol works even when only some of the gateways and hosts implement the procedures.

- its benefits increase incrementally as more of the gateways and hosts implement the procedures.

- it does not require there to be any return traffic, and replies to the PMTU-Query are transmitted immediately instead of having to be queued waiting for a datagram to be sent in the reverse direction.

- it does not involve sending any large datagrams which are expected to get fragmented just to test if the PMTU has increased.

- the number of additional datagrams and IP options due to this scheme is limited.
- this protocol accommodates the fact that existing applications do (and will likely continue to) ignore PMTU sizes in the datagrams they send. Such behaviour does not induce additional overhead with this scheme, since it is NOT the fragmentation of these jumbo datagrams which generates the ICMP messages; rather, it is the change in the size of the fragments which generates the messages. In fact, only a small specified number of ICMP Unexpected Fragment Report messages are sent after a change in the PMTU, regardless of whether or not the source transport/applications adjust the size of the datagrams they send.

- this protocol can be implemented entirely within the IP layer (without needing to keep per-protocol/port state information). This avoids the duplications of implementing in each transport and/or application layer, and of having each transport/application sending its own queries/responses. Interaction with the transport/application layer is only necessary in the source host, where the transport/application can ask IP for the PMTU to a particular destination (see GET_MAXSIZES in [3]), and be informed when sending a datagram larger than the current PMTU (e.g. because the PMTU just decreased).

6. IP Option and ICMP Message Formats

The formats of the new IP option and the new ICMP messages are given in the following sections.

6.1. IP PMTU-Query Option

 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|     Type      |    Length     |            Min-MTU            |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                           Next-Hop                            |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

Type
   <<to be assigned>>

Length
   8

Min-MTU
   The minimum of the MTU values of each hop through which the datagram containing this IP option has so far been transmitted.

Next-Hop
   Either the Internet address of the next IP entity to which this datagram is being forwarded, or zero. The field is set to zero when the option is processed by an IP entity if this field does not contain the IP address of that entity. Thus a zero value indicates that the Min-MTU field is not necessarily accurate.

(This option is not copied on fragmentation).

6.2. ICMP Path-MTU Message

 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|     Type      |     Code      |           Checksum            |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|           Reserved            |             PMTU              |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|      Internet Header + 64 bits of Original Datagram Data      |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

Type
   <<to be assigned>>

Code
   0 - Valid
   1 - Upper-Bound

Checksum
   The 16-bit one's complement of the one's complement sum of the ICMP message, starting with the Type field. For computing the checksum, the Checksum field is initialized to zero.

Reserved
   Sent as zero; ignored on reception.

PMTU
   The PMTU value. If Code = Valid, this is the known value; if Code = Upper-Bound, this is the best-guess value. The value is the PMTU of the path taken by the PMTU-Query to which this message is a response.
Internet Header + 64 bits of Original Datagram Data
   These are extracted from the datagram containing the IP PMTU-Query option to which this message is a response.

6.3. ICMP Unexpected Fragment Report Message

 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|     Type      |     Code      |           Checksum            |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                           Reserved                            |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|      Internet Header + 64 bits of Original Datagram Data      |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

Type
   <<to be assigned>>

Code
   0

Checksum
   The 16-bit one's complement of the one's complement sum of the ICMP message, starting with the Type field. For computing the checksum, the Checksum field is initialized to zero.

Reserved
   Sent as zero; ignored on reception.

Internet Header + 64 bits of Original Datagram Data
   These are extracted from the datagram for which fragment 0 has been received.

(This message is a member of the class of ICMP error messages.)

7. Use and Implementation of the PMTU Discovery Protocol

The PMTU Discovery protocol proposed in this memo requires enhancements to the procedures implemented by both gateways and hosts. These are explained in the following sections.

7.1. Gateways

Gateways should recognize and process the IP PMTU-Query option when it occurs in a datagram they are forwarding. The processing is as follows:

compare the three values: the Min-MTU field of the PMTU-Query option; the MTU of the interface on which it was received; and the MTU of the interface through which it will be forwarded. If necessary, update the Min-MTU field to contain the minimum of the three values.
compare the Internet Address in the Next-Hop field in the PMTU-Query option to the address of the interface on which it was received:

- if not equal, set the Next-Hop field in the option to zero.

- if equal, determine whether the datagram can now be sent directly to the destination host or whether it must be forwarded via another gateway:

  - if via another gateway, set the Next-Hop field to the address of the next gateway.

  - if directly to the destination host, then set the Next-Hop field to the destination host's address and generate an ICMP Path-MTU Code=Valid message with its PMTU field copied from the Min-MTU field in the PMTU-Query option, and send it to the source address of the datagram which contained the PMTU-Query option.

7.2. Source Hosts

A host which implements this scheme to determine the PMTU of its paths to other hosts needs to keep a cache of state information for each non-local-network destination. This state information is an extension of the state information which a host should already be keeping in its Route Cache (see section 3.3.1.3 in the Host Requirements RFC [3]). The additional state information to be kept for PMTU Discovery is as follows:

Src_PMTU - the PMTU of the path which has the local host as the source and this cache entry's host as the destination. If the PMTU has not yet been discovered then this value must be DEFAULT_NONLOCAL_PMTU; else, this value will equal the known or best-guess PMTU.

Resend_Interval - the interval at which to send PMTU-Query options. This is initially set to the value MIN_PERIOD, and increased exponentially by doubling up to a maximum value of MAX_PERIOD when no answering ICMP Path-MTU message is being received. It is reset to MIN_PERIOD when either an ICMP Path-MTU or an Unexpected Fragment Report message is received.
Retry_Count - the count of remaining retries of the IP PMTU-Query option while awaiting an answering ICMP Path-MTU message. It is set to zero after the answering Path-MTU message is received, to indicate that the PMTU-Query is no longer being retried.

Query_Flag - set to indicate a PMTU-Query option should be sent on the next datagram to this destination.

This scheme also requires a timer to be associated with each entry in the Route Cache. At different times during the protocol exchanges, this timer operates either as a retry-timer or as a resend-timer.

The procedures followed by a source host need to be updated with the following additions:

1) When IP creates an entry in its Route Cache, the PMTU Discovery fields should be initialized as:

      Src_PMTU = DEFAULT_NONLOCAL_PMTU,
      Resend_Interval = MIN_PERIOD,
      Retry_Count = QUERY_RETRIES,
      Query_Flag = set

2) If IP receives a request to send a datagram to a non-local-net destination for which the Route Cache entry has its Query_Flag set, then add an IP PMTU-Query option to the datagram prior to sending it. The PMTU-Query option should be initialized with:

      Min-MTU = MTU of interface on which the datagram is to be transmitted,
      Next-Hop = address of first-hop gateway.

   Then, clear the Query_Flag and start a retry timer for this Cache entry to go off after an interval of time larger than the round-trip time to the destination.

3) If an ICMP Path-MTU message is received, find the Cache entry for the appropriate host. Set Retry_Count to zero to indicate that a response has been received for the last PMTU-Query option sent, and set the Src_PMTU value from the PMTU field in the ICMP message. Also, set the value of Resend_Interval to MIN_PERIOD since there is at least some support for PMTU Discovery on the path to this destination. Lastly, (re-)start this Cache entry's timer to go off after MIN_PERIOD, when the next PMTU-Query needs to be sent.
4) When a Cache entry's timer goes off, examine the value of Retry_Count. If Retry_Count is zero, then this was a resend-timer and it is now time to send a new PMTU-Query, so set Retry_Count to QUERY_RETRIES and set the Query_Flag.

   Otherwise (Retry_Count is greater than zero), this was a retry-timer and no ICMP Path-MTU message has been received in response to the last PMTU-Query sent; so, decrement Retry_Count. Provided that the decrementing of Retry_Count does not make it become zero, the PMTU-Query needs to be retransmitted, so set the Query_Flag.

   If Retry_Count has been decremented to zero, then the retry limit is exhausted, so the Src_PMTU value must be reset to DEFAULT_NONLOCAL_PMTU; this reset of Src_PMTU is necessary because the lack of a response could indicate the route has changed and that now neither all the gateways nor the destination host support this protocol. Also, increase the value of Resend_Interval to twice its current value but not greater than MAX_PERIOD. Then, start a timer to go off after the interval given by Resend_Interval, when the next PMTU-Query needs to be sent.

5) If an ICMP Unexpected Fragment Report message is received from a host for which there is a Cache entry, and the size of the fragment-0 as reported in the message is not equal to the cached Src_PMTU, then update the cached Src_PMTU to the fragment-0 size and set the Cache entry's Query_Flag. Setting the Query_Flag here is intended to send another PMTU-Query option not only to determine if the PMTU change has also caused a change in how many gateways on the path support the option, but just as importantly, to restore the destination's count of ICMP Unexpected Fragment Report messages it can send.

7.3. Destination Hosts

To implement this scheme as a destination host, IP must maintain a cache of information with one entry per host from which it has received an IP PMTU-Query option.
The cache entries should be deleted if they have not been referenced for some DCACHE_TIMER period. Each cache entry maintains the following information:

Address - the Internet address of this entry's host.

UFR_Count - the remaining count of subsequent ICMP Unexpected Fragment Report messages that may be sent to this entry's host.

Dest_PMTU - set to the (known or best-guess) PMTU of the path which has the local host as the destination and this entry's host as the source.

The procedures followed by a destination host need to be updated with the following additions:

1) On receiving a datagram with a PMTU-Query option, the option's Min-MTU and Next-Hop fields should first be updated according to the local interface on which the datagram was received. The Min-MTU field should be set to the minimum of its own value and the MTU of the local interface; the Next-Hop field should be set to zero if its value is not the address of the local interface.

   Next, locate the cache entry for the source of the datagram; if none exists create one and initialize its Dest_PMTU value from the Min-MTU in the option.

   Next, if the Next-Hop field in the option has a non-zero value or if the Dest_PMTU value would be decreased, then set the Dest_PMTU value from the Min-MTU field in the option. (Note, do not increase Dest_PMTU if the Next-Hop field is zero.)

   Lastly, set UFR_Count to the value UFR_LIMIT, and if the Next-Hop field in the option has a zero value, generate an ICMP Path-MTU message Code=Upper-Bound with its PMTU field set to the value of Dest_PMTU.

2) When a datagram fragment is received and it is fragment-0, then locate the cache entry for the sending host. If there is no cache information for the sending host, then there is no PMTU Discovery processing to be done. This could be the case where the source does not implement this PMTU Discovery protocol.
   Next, compare the size of the fragment-0 with the Dest_PMTU value. If they are equal then there is nothing to be done since this is not a change in the PMTU, i.e., the receipt of this fragment provides no new PMTU information.

   However, if the size of fragment-0 is not equal to the value of Dest_PMTU, and UFR_Count is greater than zero, then decrement UFR_Count and send an ICMP Unexpected Fragment Report message. If UFR_Count was zero or is now zero, then set the value of Dest_PMTU to the size of the fragment. This is now the expected size of incoming datagram initial fragments.

8. Discussion

8.1. Timers

The PMTU Discovery protocol proposed in this memo uses a number of timers: a query-retry timer, a query-resend timer, and a destination cache timer. It is important that the intervals of these timers be set correctly.

For the retry timer, it is ideal to have the interval be just larger than the round-trip time (RTT) to the destination, but it is important that it not be smaller than the RTT. Typically, IP has no knowledge of the RTT to a particular destination. One possibility is to set the retry interval to a constant which is larger than the maximum RTT to any destination, e.g. 2*MSL, i.e. 4 minutes. Alternatively, the RTT could be initially set to a smaller value (e.g. 30 seconds) and doubled for every retry.

For the resend-query timer, the interval is set according to the degree of support for the protocol on the current path to a destination. In most cases, the interval determines for how long fragmentation will occur after a PMTU change. For example, if all gateways are supporting the protocol but not the destination host, then it is only these resends which will detect a PMTU change. So the interval needs to be small enough to limit the amount of fragmentation if the PMTU decreases.
If the destination host supports the protocol, then there is (at least some) reliance on ICMP Unexpected Fragment Report messages to detect PMTU changes. Here also, the interval needs to be set to limit the amount of fragmentation which would occur if all the Unexpected Fragment Report messages were to get lost (or the destination were rebooted).

Only if the destination host does not support the protocol, and at least one gateway on the path does not support it either, can the interval be made longer, since in this case the reason to resend the query is to occasionally test whether the set of gateways on the path has changed such that they all now support the protocol.

The interval of the destination cache timer determines how long unreferenced PMTU information stays in the destination host's cache before being deleted. Besides being needed to prevent the destination's cache from growing too large, this timer is also necessary in case the source host's IP address is re-assigned to another host which may not implement PMTU Discovery. Note, however, that the destination host's cache is only referenced by received PMTU-Query options and by received fragments. Since there may not be any fragments received over a period when the PMTU has not decreased, it is important that this destination cache timer be at least several times larger than the corresponding value of the resend-query timer (i.e. the value of the resend-query timer when the destination host supports the protocol).

8.2. Host Requirements

The Host Requirements RFC [3] requires transport/applications to call IP via the GET_MAXSIZES function when they wish to (re-)determine the PMTU. (Section 3.4 of [3] specifies that GET_MAXSIZES returns the PMTU value as MMS_S.) With this already required, this PMTU Discovery protocol places no new requirements on transport/applications.
However, it is suggested that IP return a warning indication on a send call, in the event that the packet-size is larger than the current PMTU, to indicate that the transport/application SHOULD call GET_MAXSIZES.

[Note that [3] provides no way for a transport/application to determine the MTU of a local interface; this may be an omission.]

8.3. Mixed Support by Transport/Applications

The benefits from implementing this PMTU Discovery protocol in a source host are likely to be incrementally achieved, as each of the transport and/or application protocols takes advantage by creating packets at the full PMTU size. During this incremental evolution, a source host may have some protocols dynamically adjusting to changing PMTU sizes and some which are still sending jumbo datagrams which ignore PMTU sizes.

In this situation, this proposal does not need to associate PMTU sizes with specific protocols (or connections). In particular, if all of a source host's protocols ignore PMTU Discovery, then that host should not be sending PMTU-Query options; otherwise, the Unexpected Fragment Report messages are sent not because some applications' (e.g. NFS) datagrams are getting fragmented, but because the PMTU has changed and the Unexpected Fragment Report messages pass that information back to the sender so that other protocols (which are not ignoring PMTU) can be informed.

8.4. Making Room for Options

One of the disadvantages of any usage of IP options (including usage of the PMTU-Query option) is that they increase the size of datagrams. It is the transport/application layers which decide on the size of a packet they give IP to send as a datagram. When IP has its own algorithms for deciding which datagrams need to have IP options added to them, there is the chance that the increased size of a datagram will be larger than the PMTU, and thus will result in fragmentation.
There are two ways to avoid this:

- have some additional communication between IP and the transport/applications whereby IP can request that the next send be of a smaller packet-size such that the addition of the option will not exceed the PMTU. This approach complicates the interface between IP and TCP and between IP and UDP-based applications, especially since IP doesn't know which of them will next send to a particular destination.

- decrease the PMTU which IP advertises to the transport/applications such that when IP adds an IP option to a packet of the decreased size the resulting IP datagram does not exceed the real PMTU. This approach does not complicate the interface to IP from above, but has the disadvantage that a small fraction of the bandwidth is wasted since those datagrams to which no IP option is added are not quite as large as the PMTU. (See the discussion of this approach in section 3.3.3 of the Host Requirements RFC [3].)

8.5. No Local Network Usage

Notice that, under the protocol proposed in this memo, a PMTU-Query option is never sent between two hosts on the same (sub-)network. Since no PMTU-Query option is ever sent, neither will any ICMP Path-MTU messages nor any Unexpected Fragment Report messages ever be sent between two hosts on the same (sub-)network.

8.6. Dual-MTU Networks

A dual-MTU network is a network containing multiple types of media, with each medium having its own MTU value. An example is an FDDI/Ethernet network in which a bridge interconnects an Ethernet (MTU = 1500) with an FDDI network (MTU = 4K). A host on the FDDI part of the network has an MTU of 4K when communicating with another host to which an all-FDDI path exists, but an MTU of only 1500 when communicating with an Ethernet host, or via an Ethernet to another FDDI host.
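The FDDI/Ethernet example above amounts to taking the smallest MTU among the media a frame crosses. The following Python sketch shows only that arithmetic. It assumes the host already knows which media lie on the bridged path, which this memo does not specify how to learn, and it reads "4K" as 4096 octets, which is also an assumption. The table and function names are hypothetical.

```python
# MTU per medium. 1500 comes from the text; 4096 is an assumed reading of "4K".
MEDIA_MTU = {"ethernet": 1500, "fddi": 4096}


def link_mtu(media_on_path):
    """Return the usable MTU across a bridged path: the smallest medium MTU."""
    if not media_on_path:
        raise ValueError("path must cross at least one medium")
    return min(MEDIA_MTU[medium] for medium in media_on_path)


if __name__ == "__main__":
    print(link_mtu(["fddi"]))                      # all-FDDI path
    print(link_mtu(["fddi", "ethernet"]))          # FDDI host to Ethernet host
    print(link_mtu(["fddi", "ethernet", "fddi"]))  # FDDI to FDDI via Ethernet
```

Only the all-FDDI path keeps the larger value; any path that touches the Ethernet is limited to 1500.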
The method by which the MTU can be determined for a particular destination is not yet specified, but the significance, from an IP host/gateway's perspective, is that the MTU of such a network can be different for different destinations, and will probably have to be determined dynamically. Note that the task of obtaining the MTU of such a data link is required in any event, for the router to determine if the datagram must be fragmented; processing of the PMTU-Query option requires no additional determination. Also note that the data link MTU must be determined by some other means; the protocol described in this RFC is not meant for that purpose.

9. Suggested Values

This section provides suggested values for the configured parameters required by protocol implementations.

MIN_PERIOD
Suggested Value: 15 minutes
This value is the interval between sending the IP PMTU-Query option if either all the gateways in the path or the destination host support the PMTU Discovery protocol.

MAX_PERIOD
Suggested Value: 2 hours
This value is the maximum interval between sending the IP PMTU-Query option if no ICMP Path-MTU message is being received in response.

DCACHE_TIMER
Suggested Value: 35 minutes
This value is the minimum time that a destination's cached PMTU information for a particular source should be kept without being referenced.

QUERY_RETRIES
Suggested Value: 3
This value is the number of times a PMTU-Query option is retried if no answering ICMP Path-MTU message is received in response, before the source concludes that both the destination host and at least one gateway on the current path do not support this PMTU Discovery protocol.

UFR_LIMIT
Suggested Value: 3
This value is the maximum number of ICMP Unexpected Fragment Report messages the receiver can send between receiving PMTU-Query options.
The only reason to set this greater than one is to protect against the possibility of an ICMP Unexpected Fragment Report getting lost. Provided that at least one ICMP Unexpected Fragment Report message arrives, the source will update its Src_PMTU value and send another PMTU-Query, which will refresh the destination's remaining count to its maximum value, ready for the next PMTU change.

DEFAULT_NONLOCAL_PMTU
Suggested Value: 576
This is the value to which the PMTU should be set if no answering ICMP Path-MTU messages have been received (recently).

10. Acknowledgements

This proposal is the output of the IETF MTU Discovery Working Group. It is a combination of the ideas of many people. At one time or another, Steve Deering, Chris Kent, Charles Lynn, and Jeff Mogul all suggested using an ICMP message to report the size of fragments. Noel Chiappa suggested the use of the next-hop field in order to know whether all gateways on the path support the protocol.

Others who have contributed to this proposal are: Art Berggreen (ACC), etc. etc.

References

[1] C. Kent and J. Mogul. Fragmentation Considered Harmful. Proc. ACM SIGCOMM '87 Workshop, August 1987.

[2] J. Mogul, C. Kent, C. Partridge and K. McCloghrie. IP MTU Discovery Options. RFC 1063, SRI Network Information Center, July 1988.

[3] R. Braden. Host Requirements - Communication Layer. RFC 1122, SRI Network Information Center, September 1989.

[4] S. Deering. IP and ICMP Extensions for MTU Discovery. Draft Memo, October 1989.