Draft RFC on Path-MTU Discovery
mogul (Jeffrey Mogul) Fri, 12 January 1990 23:40 UTC
Received: by acetes.pa.dec.com (5.54.5/4.7.34)
id AA26665; Fri, 12 Jan 90 15:40:58 PST
From: mogul (Jeffrey Mogul)
Message-Id: <9001122340.AA26665@acetes.pa.dec.com>
Date: 12 Jan 1990 1540-PST (Friday)
To: mtudwg
Cc:
Subject: Draft RFC on Path-MTU Discovery
Network Working Group K. McCloghrie, Hughes LAN Systems
Request for Comments: DRAFT R. Fox, Hughes LAN Systems
J. Mogul, Digital Equipment
12 January 1990
WORKING DRAFT - Do not circulate
Path-MTU Discovery Protocol
1. Status of this Memo
This memo describes a protocol for discovering the maximum
transmission unit of an internet path, using one new IP option
and two new ICMP messages. This is proposed as an alternative
to the procedures described in RFC-1063. This memo does not
define an Internet standard. Distribution of this memo is
unlimited.
2. Introduction
When one IP host has a large amount of data to send to another
host, the data is transmitted as a series of IP datagrams. It is
preferable, in the general case, that these datagrams be of the
largest size which does not require fragmentation during
transmission. This size is referred to as the Path-MTU (PMTU)
of the path from the source to the destination, and is equal to
the minimum of the MTUs of each hop in the path, where the MTU
of a hop is the maximum size of an IP datagram on that hop. A
shortcoming of the current Internet protocol suite is the lack
of a standard mechanism for a host to discover the PMTU of an
arbitrary path.
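The definition above can be illustrated with a minimal sketch. The
function name and the hop MTU values are hypothetical, chosen only
for illustration; they are not part of this memo.

```python
def path_mtu(hop_mtus):
    """Return the Path-MTU: the smallest MTU of any hop on the path."""
    if not hop_mtus:
        raise ValueError("a path needs at least one hop")
    return min(hop_mtus)


# Hypothetical three-hop path; the middle hop limits the whole path.
example_hops = [1500, 576, 4352]
example_pmtu = path_mtu(example_hops)
```

A datagram larger than example_pmtu would have to be fragmented
somewhere along this path, whichever hop it was sized for.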
The reasons to avoid fragmentation and the problems it incurs
are well-documented in [1]. Some of these problems include:
- use of fragmentation can sometimes lead to "deterministic
fragment loss", where something in the internetwork causes
certain fragments to be lost with higher than usual
probability. For example, a router with insufficient
buffer capacity might always drop the 4th packet in a
burst. Since fragments are not individually acknowledged,
this leads to miserable performance, or total failure,
since retransmissions of the original datagram suffer the
same fate. (If the datagrams are not fragmented, even when
McCloghrie/Fox/Mogul DRAFT [page 1]
RFC DRAFT Path-MTU Discovery January 1990
they are lost deterministically a protocol such as TCP will
make slow but steady progress.)
- sending datagrams of a size just larger than the PMTU
causes each original datagram to be fragmented into one
full-size and one tiny fragment. Half the datagrams are
then tiny fragments, which not only makes inefficient use
of the bandwidth but also forces the gateways to forward
twice as many datagrams as necessary.
- IP reassembly depends on having unique IP Identification
values in each in-flight datagram. With the Ident field
being only 16 bits wide, the need to guarantee that every
datagram in flight from one host to another has a unique
Ident imposes a restriction on the maximum datagram
transmission rate. (See [1] for an example.)
- IP reassembly is inherently less efficient than transport
layer "reassembly". (Again, see [1] for the arguments.)
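The "one full-size and one tiny fragment" case can be seen in a small
sketch of how a gateway would split a datagram. It assumes a 20-byte
IP header without options and relies on the IP rule that every
fragment except the last carries a multiple of 8 data bytes; the
function name and the sizes used are illustrative assumptions.

```python
IP_HEADER_LEN = 20  # assumption: IP header without options


def fragment_sizes(datagram_len, mtu, header_len=IP_HEADER_LEN):
    """Return the total length of each fragment produced for one datagram."""
    if datagram_len <= mtu:
        return [datagram_len]  # fits in one piece, no fragmentation
    # Every fragment except the last carries a multiple of 8 data bytes.
    per_fragment = (mtu - header_len) // 8 * 8
    if per_fragment <= 0:
        raise ValueError("MTU too small to carry any data")
    payload = datagram_len - header_len
    sizes = []
    while payload > 0:
        chunk = min(per_fragment, payload)
        sizes.append(header_len + chunk)
        payload -= chunk
    return sizes


# A datagram 20 bytes larger than a 1500-byte hop MTU.
just_too_big = fragment_sizes(1520, 1500)
```

Here just_too_big holds one 1500-byte fragment and one 40-byte
fragment, half of which is header.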
3. Summary of the Protocol
The PMTU Discovery Protocol proposed in this memo is a hybrid of
two different mechanisms. Both mechanisms are invoked by the
sender of a datagram, using the IP PMTU-Query Option. In the
primary mechanism, gateways along the path use this option to
compute the minimum MTU of any hop on the path; the last-hop
gateway then uses an ICMP Path-MTU message to report the PMTU to
the sending host. Provisions are made to detect if any of the
gateways do not implement this option (in which case, the
computed value is only an upper bound on the PMTU). This
mechanism does not involve participation by the receiving host,
and so can be used without upgrading all end hosts. The IP
PMTU-Query option is transmitted periodically, not on all
datagrams, so that the gateways will not be burdened by
excessive option-processing.
The secondary mechanism is used as a backup, in case the IP
PMTU-Query option is not supported by all the gateways, or in
case the route changes to one with a lower PMTU before the
source resends the PMTU-Query. It does require support from the
receiving host. The receiver uses the reception of an IP
PMTU-Query as an indication that the sender has invoked the protocol,
and as a request to cache the PMTU of the path from the sender.
This cached value provides the receiver with a way to detect
changes in the PMTU, by comparing the size of incoming
fragmented datagrams against the cached PMTU. When a change in
the PMTU value is detected, the sender is notified through the
use of an ICMP Unexpected Fragment Report message. Mechanisms
are provided to limit how many ICMP Unexpected Fragment Report
messages are sent, to avoid polluting the network.
4. Analysis of Previous Approaches
4.1. RFC-1063
RFC-1063 [2] describes one solution to the PMTU Discovery
problem. It defines a new IP option that is examined and
possibly updated by each gateway along a path in such a way that
it contains the PMTU when it arrives at its destination.
Another new IP option is used to convey the PMTU back to the
sender, piggybacked on a datagram going in the other direction.
Discussion of the RFC-1063 proposal (e.g. in [4]) has
highlighted a number of drawbacks:
- for correct operation of the scheme, all gateways along a
path must be updated to support the new IP option, which,
for some paths, may take a long time. If hosts start using
the scheme before all the gateways have been updated, they
may end up incurring more fragmentation than if not using
it, because they may conclude that the PMTU of a given path
is larger than it really is; for example, when the two
gateways connected by the hop with the minimum MTU do not
support the new option.
- this scheme does not work in cases where there is no return
traffic on which to piggyback the discovered PMTU. For
example, a host sending unsolicited datagram "trap" or
"event" messages to a network monitoring center would be
unable to discover the PMTU to the monitoring center.
- most gateways are optimized for forwarding datagrams which
do not contain IP options. Thus, an overhead is imposed on
the gateways' throughput if the MTU option is sent too
frequently. However, sending the MTU option too
infrequently can also cause additional overhead when a
decrease in the PMTU occurs (e.g. due to an alternate
routing around a link outage), since the sender will
continue sending datagrams that need to be fragmented on
the new path until it next uses the option and discovers
the decrease.
4.2. Report Fragmentation
To counteract these drawbacks, alternative schemes have been
suggested (e.g. in [4]) based on the receiver reporting the
occurrence of fragmentation to the sender via a new ICMP
message. Such schemes have been called "report-fragmentation".
They too, however, have a number of drawbacks:
- to discover increases in the PMTU, report-fragmentation
schemes require that larger datagrams are occasionally
sent, to probe the path to determine whether they get
fragmented. Since the PMTU of a path changes only
infrequently, most of these "on-purpose" fragmentations are
otherwise unnecessary.
- in situations where data is to be sent to a new
destination, for which the sender has no (cached) PMTU
information, the PMTU Discovery cannot take place until
there is a large datagram to be sent. In reality, most
applications initially exchange a number of small messages
(e.g. connection establishment messages) prior to sending
their first large datagram. In contrast, a query-based
scheme (like RFC-1063) can send its probe piggybacked on a
datagram of any size, and receive the reply containing the
PMTU during the exchange of the initial small messages so
that all the subsequent large datagrams can be optimally
sized.
- a report-fragmentation scheme must specify which arriving
fragments cause the sending of an ICMP message to report the
fragmentation. Sending on all occurrences would be
excessive since there can be a round-trip time's worth of
datagrams in-flight concurrently, each of which would
generate an ICMP message; on a high-bandwidth/long-delay
path this could generate hundreds or more unnecessary ICMP
messages. It would also be wasteful to send report
messages to a source which would ignore them. Alternatively,
datagrams could be marked to request that if they arrive
fragmented, an ICMP message is to be generated. The ideal
way to do this would be to use a spare bit in the IP
header (see [4]). If no spare bit is available, then the
mark presumably needs to be carried via an IP option, but
this now inherits some of the drawbacks of the RFC-1063
scheme mentioned above.
- there are existing applications which ignore the PMTU, and
always send larger datagrams regardless of whether
fragmentation will result. In order not to send
unnecessary ICMP messages reporting the fragmentation of
such datagrams, (until now) it has been necessary for an
implementation to keep track of which applications are
mindful of PMTU Discovery and which applications ignore it.
In practice, this is probably excessive state information
to be kept in the IP layer. Thus, some amount of
the implementation must be done in every transport/
application layer which cares about avoiding fragmentation.
This not only duplicates implementation effort and code,
but probably requires separate ICMP messages to be sent for
each transport/application which results in a significant
increase in the overhead.
5. The Proposed Protocol
5.1. Description
The PMTU Discovery protocol proposed in this memo defines one
new IP option and two new ICMP messages. The new IP option is
the PMTU-Query option; the two new ICMP messages are the Path-
MTU message and the Unexpected Fragment Report message. These
are used as follows:
The IP PMTU-Query option can be carried on any datagram and
asks the gateways through which it passes to update, if
necessary, its minimum MTU value (like RFC-1063).
However, the PMTU-Query option also contains other
information which must be updated by each gateway
supporting the option. This provides the means for the
destination to determine whether all gateways on the path
taken by that datagram do support the option, and therefore
whether the minimum MTU value it contains is indeed the
PMTU or just an upper bound.
The ICMP Path-MTU message is used to reply to a PMTU-Query
option in order to carry the PMTU information back to the
sender. Since this is sent as an ICMP message (in contrast
to RFC-1063's replying IP option), it does not require
there to be return traffic in the other direction. In
addition, it can be sent immediately on receipt of the
PMTU-Query option, rather than having to be queued in the
receiver awaiting a datagram on which to piggyback it.
The ICMP Unexpected Fragment Report message is sent by a
host receiving a fragment of an unexpected size to the
sender of the fragmented datagram. These messages are sent
because the unexpected size indicates that the PMTU has
changed (i.e., no ICMP message is sent on receiving a
fragment of the expected size). After sending a limited
number of ICMP Unexpected Fragment Report messages, the
expected size is updated, so that no more will be sent
before the next change in the PMTU.
The PMTU-Query option contains a minimum-MTU-so-far field
and a next-hop address field. The minimum-MTU-so-far field
is decreased by any gateway on the path if the MTU of the
next/previous hop is less than the value which the field
currently contains. The next-hop address field is set to
the address of the next IP gateway to which it is
forwarded. As each gateway processes the option, it
expects to find its own address in the next-hop address
field; if it does then it updates the next-hop field to the
address to which it next forwards the datagram; otherwise,
it sets the next-hop field to zero. Thus, if the option
arrives at the destination with its next-hop field non-
zero, then all gateways along that path have updated the
option and the minimum-MTU-so-far value is indeed the PMTU;
otherwise, it is just a best-guess.
The ICMP Path-MTU message is sent either by the "last-hop"
gateway or by the destination host. The "last-hop" gateway
is the gateway which in processing the datagram recognizes
that it does not need to be forwarded via any more
gateways, but can now be sent directly to the destination
host. The last-hop gateway sends the ICMP Path-MTU if and
only if the PMTU-Query's next-hop field contains the
gateway's own address. The ICMP Path-MTU message is sent
by the destination host only if it receives the option with
its next-hop field set to zero and if the destination host
is willing to send ICMP Unexpected Fragment Report
messages. Notice that the MTU of the last hop (from the
last-hop gateway to the destination host) is known by the
last-hop gateway; thus if all gateways along the path
support the option, the PMTU can be determined and
communicated back to the source host even if the
destination host does not support PMTU Discovery.
The number of ICMP Unexpected Fragment Report messages that
the destination host is allowed to send for each PMTU
change is limited. The value of the limit is a configured
parameter in the destination, but each PMTU-Query option
carries an implicit refresh of that limit, which restores
the remaining count of subsequent messages that can be sent
to the full limit. The destination host caches both the
current PMTU value and the remaining count of Unexpected
Fragment Report messages it can still send. Whenever the
destination host receives a fragment (actually, only
fragment-0 of a datagram), it compares the size of this
fragment to its cached value of the PMTU; if they are
different and the remaining count has not yet decremented
to zero, then it sends an Unexpected Fragment Report message
and decrements the remaining count; otherwise, no ICMP
Unexpected Fragment Report message is generated.
The destination host maintains its cached value of the PMTU
both from the MTU-value field in a received PMTU-Query
option, and from the size of arriving fragments (when the
remaining count gets decremented to zero). Both the MTU-
value in a PMTU-Query option with its next-hop field
containing a valid address and the size of a fragment-0 are
considered accurate indications of the current PMTU of the
path, and overwrite the destination host's cached value.
However, the MTU-value in a PMTU-Query option with its
next-hop field set to zero is considered only an upper-
bound on the PMTU of the path, and is never used to
increase (only to decrease or to initialize) the cached
PMTU value.
The source host caches the PMTU information it learns as
part of its information about the route to each non-local-
network destination, i.e. as an extension to its IP Route
Cache. Whenever a Route Cache entry is created, the initial
value for PMTU is set to 576 and a PMTU-Query option is
sent with the first datagram. Then, if no further
information is obtained, the PMTU value will remain at 576
for the life of the cache entry.
When a PMTU-Query option is sent, it may fail to get an
answer. One reason is that at least one gateway
and the destination host do not support the option.
Another reason is that the datagram carrying it, or the
replying ICMP Path-MTU message might get lost in the
network. To cater for the latter case, the option needs to
be retransmitted until an answering Path-MTU message is
received or a retry count is exhausted.
In addition, the PMTU-Query option needs to be resent
periodically to determine if the PMTU has changed, or if
the status of support for PMTU queries by the gateways on
the current route has changed. Maximum and minimum values
for the period between resends are suggested elsewhere in
this memo. The period remains at the minimum while there
is enough support for the protocol on the current path such
that at least one of the protocol's mechanisms is working.
If there is insufficient support for either mechanism, then
the query is still resent (in order to detect changes in
the level of support) but the period is exponentially
backed-off up to the maximum value.
5.2. Advantages
The advantages of the proposed protocol are:
- the protocol works even when only some of the gateways and
hosts implement the procedures.
- its benefits increase incrementally as more of the gateways
and hosts implement the procedures.
- it does not require there to be any return traffic, and
replies to the PMTU-Query are transmitted immediately
instead of having to be queued waiting for a datagram to be
sent in the reverse direction.
- it does not involve sending any large datagrams which are
expected to get fragmented just to test if the PMTU has
increased.
- the number of additional datagrams and IP options due to
this scheme is limited.
- this protocol accommodates the fact that existing
applications do (and will likely continue to) ignore PMTU
sizes in the datagrams they send. Such behaviour does not
induce additional overhead with this scheme, since it is
NOT the fragmentation of these jumbo datagrams which
generates the ICMP messages; rather, it is the change in
the size of the fragments which generates the messages. In
fact, only a small specified number of ICMP Unexpected
Fragment Report messages are sent after a change in the
PMTU, regardless of whether or not the source
transport/applications adjust the size of the datagrams
they send.
- this protocol can be implemented entirely within the IP
layer (without needing to keep per-protocol/port state
information). This avoids duplicating the implementation
in each transport and/or application layer, and having
each transport/application send its own
queries/responses. Interaction with the
transport/application layer is only necessary in the source
host, where the transport/application can ask IP for the
PMTU to a particular destination (see GET_MAXSIZES in [3]),
and be informed when sending a datagram larger than the
current PMTU (e.g. because the PMTU just decreased).
6. IP Option and ICMP Message Formats
The formats of the new IP option and the new ICMP messages are
given in the following sections.
6.1. IP PMTU-Query Option
 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|     Type      |    Length     |            Min-MTU            |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                           Next-Hop                            |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
Type <<to be assigned>>
Length 8
Min-MTU The minimum of the MTU values of each hop through
which the datagram containing this IP option has so
far been transmitted.
Next-Hop Either the Internet address of the next IP entity to
which this datagram is being forwarded, or zero. The
field is set to zero when the option is processed by
an IP entity if this field does not contain the IP
address of that entity. Thus a zero value indicates
that the Min-MTU field is not necessarily accurate.
(This option is not copied on fragmentation).
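As an illustration of the layout above, the 8-octet option can be
packed and parsed as follows. This is only a sketch: the option Type
has not been assigned, so PMTU_QUERY_TYPE is a hypothetical
placeholder (chosen with the copy flag clear, since the option is not
copied on fragmentation), and the function names are invented here.

```python
import socket
import struct

PMTU_QUERY_TYPE = 0x1E  # hypothetical placeholder; Type is unassigned
PMTU_QUERY_LEN = 8      # Type + Length + Min-MTU + Next-Hop, in octets
OPTION_FORMAT = "!BBH4s"  # network byte order: 1 + 1 + 2 + 4 octets


def pack_pmtu_query(min_mtu, next_hop):
    """Encode the option; next_hop is a dotted-quad string."""
    return struct.pack(OPTION_FORMAT, PMTU_QUERY_TYPE, PMTU_QUERY_LEN,
                       min_mtu, socket.inet_aton(next_hop))


def unpack_pmtu_query(data):
    """Decode the option and return (min_mtu, next_hop)."""
    opt_type, length, min_mtu, next_hop = struct.unpack(
        OPTION_FORMAT, data[:PMTU_QUERY_LEN])
    if opt_type != PMTU_QUERY_TYPE or length != PMTU_QUERY_LEN:
        raise ValueError("not a PMTU-Query option")
    return min_mtu, socket.inet_ntoa(next_hop)


raw_option = pack_pmtu_query(1500, "10.0.0.1")
```

A zero Next-Hop is represented here as "0.0.0.0".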
6.2. ICMP Path-MTU Message
 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|     Type      |     Code      |           Checksum            |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|           Reserved            |             PMTU              |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|      Internet Header + 64 bits of Original Datagram Data      |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
Type <<to be assigned>>
Code 0 - Valid
1 - Upper-Bound
Checksum The 16-bit one's complement of the one's complement
sum of the ICMP message, starting with the Type
field. For computing the checksum, the Checksum
field is initialized to zero.
Reserved Sent as zero; ignored on reception.
PMTU The PMTU value. If Code = Valid, this is the known
value; if Code = Upper-Bound, this is the best-guess
value. The value is the PMTU of the path taken by the
PMTU-Query to which this message is a response.
Internet Header + 64 bits of Original Datagram Data
These are extracted from the datagram containing
the IP PMTU-Query option to which this message is a
response.
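The following sketch builds a Path-MTU message and computes its
checksum as described above. It rests on several assumptions: the
ICMP Type is unassigned, so PATH_MTU_TYPE is a hypothetical
placeholder; the Reserved and PMTU fields are taken to be 16 bits
each; and the function names are invented for illustration.

```python
import struct

PATH_MTU_TYPE = 253  # hypothetical placeholder; Type is unassigned
CODE_VALID = 0
CODE_UPPER_BOUND = 1


def internet_checksum(data):
    """16-bit one's complement of the one's complement sum of data."""
    if len(data) % 2:
        data += b"\x00"  # pad to a whole number of 16-bit words
    total = sum(struct.unpack("!%dH" % (len(data) // 2), data))
    while total >> 16:  # fold the carries back into the low 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def build_path_mtu(code, pmtu, original):
    """original: the IP header plus first 64 data bits of the query datagram."""
    body = struct.pack("!HH", 0, pmtu) + original  # Reserved is sent as zero
    unsummed = struct.pack("!BBH", PATH_MTU_TYPE, code, 0) + body
    checksum = internet_checksum(unsummed)
    return struct.pack("!BBH", PATH_MTU_TYPE, code, checksum) + body


# Stand-in for a 28-octet header (with the option) plus 8 data octets.
message = build_path_mtu(CODE_VALID, 1006, bytes(36))
```

A receiver can validate the message by running internet_checksum
over the whole message, checksum included; a result of zero means
the message is intact.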
6.3. ICMP Unexpected Fragment Report Message
 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|     Type      |     Code      |           Checksum            |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                           Reserved                            |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|      Internet Header + 64 bits of Original Datagram Data      |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
Type <<to be assigned>>
Code 0
Checksum The 16-bit one's complement of the one's complement
sum of the ICMP message, starting with the Type
field. For computing the checksum, the Checksum
field is initialized to zero.
Reserved Sent as zero; ignored on reception.
Internet Header + 64 bits of Original Datagram Data
These are extracted from the datagram, for which
fragment 0 has been received.
(This message is a member of the class of ICMP error messages.)
7. Use and Implementation of the PMTU Discovery Protocol
The PMTU Discovery protocol proposed in this memo requires
enhancements to the procedures implemented by both gateways and
hosts. These are explained in the following sections.
7.1. Gateways
Gateways should recognize and process the IP PMTU-Query option
when it occurs in a datagram they are forwarding. The
processing is as follows:
compare the three values: the Min-MTU field of the PMTU-Query
option; the MTU of the interface on which it was received; and
the MTU of the interface through which it will be forwarded.
If necessary, update the Min-MTU field to contain the minimum
of the three values.
compare the Internet Address in the Next-Hop field in the
PMTU-Query option to the address of the interface on which it
was received:
- if not equal, set the Next-Hop field in the option to
zero.
- if equal, determine whether the datagram can now be sent
directly to the destination host or whether it must be
forwarded via another gateway:
- if via another gateway, set the Next-Hop field to the
address of the next gateway.
- if directly to the destination host, then set the Next-
Hop field to the destination host's address and generate
an ICMP Path-MTU Code=Valid message with its PMTU field
copied from the Min-MTU field in the PMTU-Query option,
and send it to the source address of the datagram which
contained the PMTU-Query option.
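The gateway processing above might be sketched as follows. This is
an illustration under stated assumptions, not a definitive
implementation: the PmtuQuery class, the gateway_process function and
its parameters are hypothetical names, addresses are dotted-quad
strings, and a zero Next-Hop is written "0.0.0.0".

```python
from dataclasses import dataclass

ZERO = "0.0.0.0"  # representation of a zero Next-Hop field


@dataclass
class PmtuQuery:
    min_mtu: int
    next_hop: str


def gateway_process(opt, in_addr, in_mtu, out_mtu, next_gateway, dest_addr):
    """Update opt in place as a forwarding gateway would.

    next_gateway is None when the destination is directly reachable.
    Returns the PMTU to report in an ICMP Path-MTU Code=Valid message,
    or None when no message is to be generated.
    """
    # Minimum of the option value and the MTUs of both interfaces.
    opt.min_mtu = min(opt.min_mtu, in_mtu, out_mtu)
    if opt.next_hop != in_addr:
        # Some earlier gateway did not process the option.
        opt.next_hop = ZERO
        return None
    if next_gateway is not None:
        opt.next_hop = next_gateway
        return None
    # Last-hop gateway: every gateway so far has updated the option.
    opt.next_hop = dest_addr
    return opt.min_mtu


query = PmtuQuery(min_mtu=1500, next_hop="10.0.0.1")
first = gateway_process(query, "10.0.0.1", 1500, 576, "10.0.1.1", "10.0.2.9")
second = gateway_process(query, "10.0.1.1", 576, 1500, None, "10.0.2.9")
```

In this two-gateway example the first call returns None and the
second, made by the last-hop gateway, returns the PMTU to report.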
7.2. Source Hosts
A host which implements this scheme to determine the PMTU of its
paths to other hosts needs to keep a cache of state information
for each non-local-network destination. This state information
is an extension of the state information which a host should
already be keeping in its Route Cache (see section 3.3.1.3 in
the Host Requirements RFC [3]). The additional state
information to be kept for PMTU Discovery is as follows:
Src_PMTU - the PMTU of the path which has the local host as
the source and this cache entry's host as the
destination. If the PMTU has not yet been discovered
then this value must be DEFAULT_NONLOCAL_PMTU; else,
this value will equal the known or best-guess PMTU.
Resend_Interval - the frequency at which to send PMTU-Query
options. This is initially set to the value
MIN_PERIOD, and increased exponentially by doubling up
to a maximum value of MAX_PERIOD when no answering
ICMP Path-MTU message is being received. It is reset
to MIN_PERIOD when either an ICMP Path-MTU or an
Unexpected Fragment Report message is received.
Retry_Count - the count of remaining retries of the IP
PMTU-Query option while awaiting an answering ICMP
Path-MTU message. It is set to zero after the
answering Path-MTU message is received, to indicate
that the PMTU-Query is no longer being retried.
Query_Flag - set to indicate a PMTU-Query option should be
sent on the next datagram to this destination.
This scheme also requires a timer to be associated with each
entry in the Route Cache. At different times during the
protocol exchanges, this timer operates either as a retry-timer
or as a resend-timer.
The procedures followed by a source host need to be updated with
the following additions:
1) When IP creates an entry in its Route Cache, the PMTU
Discovery fields should be initialized as:
Src_PMTU = DEFAULT_NONLOCAL_PMTU,
Resend_Interval = MIN_PERIOD,
Retry_Count = QUERY_RETRIES,
Query_Flag = set
2) if IP receives a request to send a datagram to a non-
local-net destination for which the Route Cache entry has
its Query_Flag set, then add an IP PMTU-Query option to the
datagram prior to sending it. The PMTU-Query option should
be initialized with:
Min-MTU = MTU of interface on which the datagram is
to be transmitted,
Next-Hop = address of first-hop gateway.
Then, clear the Query_Flag and start a retry timer for this
Cache entry to go off after an interval of time larger than
the round-trip time to the destination.
3) If an ICMP Path-MTU message is received, find the Cache
entry for the appropriate host. Set Retry_Count to zero to
indicate that a response has been received for the last
PMTU-Query option sent, and set the Src_PMTU value from the
PMTU field in the ICMP message. Also, set the value of
Resend_Interval to MIN_PERIOD since there is at least some
support for PMTU Discovery on the path to this destination.
Lastly, (re-)start this Cache entry's timer to go off after
MIN_PERIOD, when the next PMTU-Query needs to be sent.
4) When a Cache entry's timer goes off, examine the value of
Retry_Count. If Retry_Count is zero, then this was a
resend-timer and it is now time to send a new PMTU-Query,
so set Retry_Count to QUERY_RETRIES and set the Query_Flag.
Otherwise (Retry_Count is greater than zero), then this was
a retry-timer and no ICMP Path-MTU message has been
received in response to the last PMTU-Query sent; so,
decrement Retry_Count. Provided that decrementing
Retry_Count does not make it zero, the PMTU-Query
needs to be retransmitted, so set the Query_Flag.
If Retry_Count has been decremented to zero, then the retry
limit is exhausted, so the Src_PMTU value must be reset to
DEFAULT_NONLOCAL_PMTU; this reset of Src_PMTU is necessary
because the lack of a response could indicate the route has
changed and that now neither all the gateways nor the
destination host support this protocol. Also, increase the
value of Resend_Interval to twice its current value but not
greater than MAX_PERIOD. Then, start a timer to go off
after the interval given by Resend_Interval, when the next
PMTU-Query needs to be sent.
5) If an ICMP Unexpected Fragment Report message is received
from a host for which there is a Cache entry, and the size
of the fragment-0 as reported in the message is not equal
to the cached Src_PMTU, then update the cached Src_PMTU to
the fragment-0 size and set the Cache entry's Query_Flag.
Setting the Query_Flag here is intended to send another
PMTU-Query option not only to determine if the PMTU change
has also caused a change in how many gateways on the path
support the option, but just as importantly, to restore the
destination's count of ICMP Unexpected Fragment Report
messages it can send.
7.3. Destination Hosts
To implement this scheme as a destination host, IP must maintain
a cache of information with one entry per host from which it has
received an IP PMTU-Query option. The cache entries should be
deleted if they have not been referenced for some DCACHE_TIMER
period. Each cache entry maintains the following information:
Address - the Internet address of this entry's host.
UFR_Count - the remaining count of subsequent ICMP Unexpected
Fragment Report messages that may be sent to this
entry's host.
Dest_PMTU - set to the (known or best-guess) PMTU of the
path which has the local host as the destination and
this entry's host as the source.
The procedures followed by a destination host need to be updated
with the following additions:
1) On receiving a datagram with a PMTU-Query option, the
option's Min-MTU and Next-Hop fields should first be
updated according to the local interface on which the
datagram was received. The Min-MTU field should be set to
the minimum of its own value and the MTU of the local
interface; the Next-Hop field should be set to zero if its
value is not the address of the local interface.
Next, locate the cache entry for the source of the
datagram; if none exists create one and initialize its
Dest_PMTU value from the Min-MTU in the option.
Next, if the Next-Hop field in the option has a non-zero
value or if the Dest_PMTU value would be decreased, then
set the Dest_PMTU value from the Min-MTU field in the
option. (Note, do not increase Dest_PMTU if the Next-Hop
field is zero.)
Lastly, set UFR_Count to the value UFR_LIMIT, and if the
Next-Hop field in the option has a zero value, generate an
ICMP Path-MTU message Code=Upper-Bound with its PMTU
field set to the value of Dest_PMTU.
2) When a datagram fragment is received and it is fragment-0,
then locate the cache entry for the sending host. If there is
no cache information for the sending host, then there is no
PMTU Discovery processing to be done. This could be the
case where the source does not implement this PMTU
Discovery protocol.
Next, compare the size of the fragment-0 with the Dest_PMTU
value. If they are equal then there is nothing to be done
since this is not a change in the PMTU, i.e., the receipt
of this fragment provides no new PMTU information.
However, if the size of fragment-0 is not equal to the
value of Dest_PMTU, and UFR_Count is greater than zero,
then decrement UFR_Count and send an ICMP Unexpected
Fragment Report message.
If UFR_Count was zero or is now zero, then set the value of
Dest_PMTU to the size of the fragment. This is now the
expected size of incoming datagram initial fragments.
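The two destination-host procedures above might be sketched as
follows. The class and method names are hypothetical, the value of
UFR_LIMIT is an assumption, and expiry of cache entries after
DCACHE_TIMER is left out for brevity. Addresses are dotted-quad
strings and a zero Next-Hop is written "0.0.0.0".

```python
from dataclasses import dataclass

UFR_LIMIT = 3     # hypothetical value of the configured limit
ZERO = "0.0.0.0"  # representation of a zero Next-Hop field


@dataclass
class DestCacheEntry:
    dest_pmtu: int
    ufr_count: int = 0


class DestinationHost:
    def __init__(self, local_addr, local_mtu):
        self.local_addr = local_addr
        self.local_mtu = local_mtu
        self.cache = {}  # source address -> DestCacheEntry

    def on_pmtu_query(self, src, min_mtu, next_hop):
        """Procedure 1. Returns the PMTU to send in an ICMP Path-MTU
        Code=Upper-Bound message, or None if no message is generated."""
        min_mtu = min(min_mtu, self.local_mtu)
        if next_hop != self.local_addr:
            next_hop = ZERO
        entry = self.cache.get(src)
        if entry is None:
            entry = self.cache[src] = DestCacheEntry(dest_pmtu=min_mtu)
        # An upper bound may lower Dest_PMTU but never raise it.
        if next_hop != ZERO or min_mtu < entry.dest_pmtu:
            entry.dest_pmtu = min_mtu
        entry.ufr_count = UFR_LIMIT
        return entry.dest_pmtu if next_hop == ZERO else None

    def on_fragment0(self, src, size):
        """Procedure 2. Returns True if an ICMP Unexpected Fragment
        Report message should be sent to src."""
        entry = self.cache.get(src)
        if entry is None or size == entry.dest_pmtu:
            return False
        send = entry.ufr_count > 0
        if send:
            entry.ufr_count -= 1
        if entry.ufr_count == 0:
            entry.dest_pmtu = size  # new expected fragment-0 size
        return send
```

After a PMTU change, at most UFR_LIMIT reports are sent; the new
fragment size then becomes the expected size and further fragments
of that size are silent.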
8. Discussion
8.1. Timers
The PMTU Discovery protocol proposed in this memo uses a number
of timers: a query-retry timer, a query-resend timer, and a
destination cache timer. It is important that the intervals of
these timers be set correctly.
For the retry timer, it is ideal to have the interval be just
larger than the round-trip time (RTT) to the destination, but it
is important that it not be smaller than the RTT. Typically, IP
has no knowledge of the RTT to a particular destination. One
possibility is to set the retry interval to a constant which is
larger than the maximum RTT to any destination, e.g. 2*MSL, i.e.
4 minutes. Alternatively, the retry interval could be initially set
to a smaller value (e.g. 30 seconds) and doubled for every retry.
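The doubling alternative can be sketched as below. The 30-second
starting value comes from the text; capping the interval at the
4-minute constant is an assumption made for this illustration, as
are the function and parameter names.

```python
def retry_intervals(initial=30, retries=4, ceiling=240):
    """Return the retry-timer interval, in seconds, used for each retry."""
    intervals = []
    interval = initial
    for _ in range(retries):
        # Assumption: never wait longer than the constant 2*MSL value.
        intervals.append(min(interval, ceiling))
        interval *= 2
    return intervals


schedule = retry_intervals()
```

With the defaults, the schedule grows from 30 seconds to the
4-minute ceiling by the fourth retry.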
For the resend-query timer, the interval is set according to the
degree of support for the protocol on the current path to a
destination. In most cases, the interval determines for how
long fragmentation will occur after a PMTU change. For example,
if all gateways are supporting the protocol but not the
destination host, then it is only these resends which will
detect a PMTU change. So the interval needs to be small enough
to limit the amount of fragmentation if the PMTU decreases. If
the destination host supports the protocol, then there is (at
least some) reliance on ICMP Unexpected Fragment Report messages
to detect PMTU changes. Here also, the interval needs to be set
to limit the amount of fragmentation which would occur if all
the Unexpected Fragment Report messages were to get lost (or the
destination were rebooted). Only if the destination host does not
support the protocol, and at least one gateway on the path does not
either, can the interval be made longer, since in this case the
reason to resend the query is to occasionally test whether the set
of gateways on the path has changed such that they all now support
the protocol.
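The choice of resend interval described above might be sketched in
Python as follows. MIN_PERIOD and MAX_PERIOD take the suggested values
from section 9. The function name and the two boolean inputs are
hypothetical; how a source learns these facts is covered elsewhere in
this memo. Jumping straight to MAX_PERIOD is a simplification, since
section 9 describes it only as the maximum interval.

```python
MIN_PERIOD = 15 * 60      # seconds; suggested value from section 9
MAX_PERIOD = 2 * 60 * 60  # seconds; suggested value from section 9


def resend_interval(all_gateways_support: bool,
                    destination_supports: bool) -> int:
    """Pick the resend-query interval, in seconds, for one destination."""
    if all_gateways_support or destination_supports:
        # Resends (or lost reports) bound how long fragmentation lasts
        # after a PMTU decrease, so keep the interval short.
        return MIN_PERIOD
    # Neither condition holds: the query only probes for a change in
    # the set of gateways, so it can be sent much less often.
    return MAX_PERIOD
```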
The interval of the destination cache timer determines how long
unreferenced PMTU information stays in the destination host's
cache before being deleted. Besides being needed to prevent the
destination's cache growing too large, this timer is also
necessary in case the source host's IP address is re-assigned to
another host which may not implement PMTU Discovery. Note,
however, that the destination host's cache is only referenced by
received PMTU-Query options and by received fragments. Since
there may not be any fragments received over a period when the
PMTU has not decreased, it is important that this destination
cache timer be at least several times larger than the
corresponding value of the resend-query timer (i.e. the value of
the resend-query timer when the destination-host supports the
protocol).
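As an illustration of the destination cache timer, the following
Python sketch removes entries that have not been referenced for
DCACHE_TIMER seconds. The 35 minute value is the one suggested in
section 9. The purge_stale function and the dictionary mapping each
source address to its last reference time are assumptions made for
the example, not part of the protocol.

```python
from typing import Dict, List

DCACHE_TIMER = 35 * 60  # seconds; suggested value from section 9


def purge_stale(last_referenced: Dict[str, float], now: float,
                timeout: float = DCACHE_TIMER) -> List[str]:
    """Delete sources unreferenced for at least `timeout` seconds.

    `last_referenced` maps a source address to the time (in seconds)
    it was last referenced by a PMTU-Query option or a fragment.
    Returns the list of addresses that were removed.
    """
    stale = [src for src, seen in last_referenced.items()
             if now - seen >= timeout]
    for src in stale:
        del last_referenced[src]
    return stale
```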
8.2. Host Requirements
The Host Requirements RFC [3] requires transport/applications to
call IP via the GET_MAXSIZES function when they wish to
(re-)determine the PMTU. (Section 3.4 of [3] specifies that
GET_MAXSIZES returns the PMTU value as MMS_S). With this
already required, this PMTU Discovery protocol places no new
requirements on transport/applications. However, it is
suggested that IP return a warning indication on a send call, in
the event that the packet-size is larger than the current PMTU,
to indicate that the transport/application SHOULD call
GET_MAXSIZES. [Note that [3] provides no way for a transport/
application to determine the MTU of a local interface; this may
be an omission.]
8.3. Mixed Support by Transport/Applications
The benefits from implementing this PMTU Discovery protocol in a
source host are likely to be incrementally achieved, as each of
the transport and/or application protocols takes advantage by
creating packets at the full PMTU size. During this incremental
evolution, a source host may have some protocols dynamically
adjusting to changing PMTU sizes and some which are still
sending jumbo datagrams which ignore PMTU sizes. In this
situation, this proposal does not need to associate PMTU sizes
with specific protocols (or connections). In particular, if all
of a source host's protocols ignore PMTU Discovery, then that host
should not be sending PMTU-Query options. Otherwise, Unexpected
Fragment Report messages are sent not because some applications'
(e.g. NFS) datagrams are being fragmented, but because the PMTU
has changed; the messages pass that information back to the sender
so that the other protocols (those which do not ignore the PMTU)
can be informed.
8.4. Making Room for Options
One of the disadvantages of any usage of IP options (including
usage of the PMTU-Query option) is that they increase the size
of datagrams. It is the transport/application layers which
decide on the size of a packet they give IP to send as a
datagram. When IP has its own algorithms for deciding which
datagrams need to have IP options added to them, there is the
chance that the increased size of a datagram will be larger than
the PMTU, and thus will result in fragmentation. There are two
ways to avoid this:
- have some additional communication between IP and the
transport/applications whereby IP can request the next send
be of a smaller packet-size such that the addition of the
option will not exceed the PMTU. This approach complicates
the interface between IP and TCP and between IP and UDP-
based applications, especially since IP doesn't know which
of them will next send to a particular destination.
- decrease the PMTU which IP advertises to the transport/
applications such that when IP adds an IP option to a
packet of the decreased size the resulting IP datagram does
not exceed the real PMTU. This approach does not
complicate the interface to IP from above, but has the
disadvantage that a small fraction of the bandwidth is
wasted since those datagrams to which no IP option is added
are not quite as large as the PMTU. (See the discussion of
this approach in section 3.3.3 of the Host Requirements RFC
[3].)
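The second approach amounts to a small subtraction, sketched here in
Python. The length of the PMTU-Query option is passed in as a
parameter because this section does not state it; the function name
is hypothetical. Rounding the option length up to a multiple of four
octets reflects the fact that the IP header length is counted in
32-bit words. This is a conservative assumption, since a datagram
that already carries other options may need less padding.

```python
def advertised_pmtu(real_pmtu: int, option_len: int) -> int:
    """Return the PMTU that IP would advertise to transport/applications.

    `option_len` is the length in octets of the IP option that IP may
    add. It is rounded up to a 4-octet boundary because the IP header
    length field counts 32-bit words.
    """
    padded = (option_len + 3) // 4 * 4
    return real_pmtu - padded
```

For example, with a real PMTU of 1500 and a hypothetical 6-octet
option, the advertised value would be 1492, so datagrams sent without
the option are 8 octets smaller than they could have been.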
8.5. No Local Network Usage
Notice that, using the protocol proposed in this memo, a
PMTU-Query option is never sent between two hosts on the same
(sub-)network. Since no PMTU-Query option is ever sent, neither
will any ICMP Path-MTU messages nor any Unexpected Fragment Report
messages ever be sent between two hosts on the same (sub-)network.
8.6. Dual-MTU Networks
A dual-MTU network is a network containing multiple types of
media, each medium having its own MTU value. An example is
an FDDI/Ethernet network in which a bridge interconnects an
Ethernet (MTU = 1500) with an FDDI network (MTU = 4K). A host
on the FDDI part of the network has an MTU of 4K when
communicating with another host to which an all-FDDI path
exists, but an MTU of only 1500 when communicating with an
Ethernet host, or via an Ethernet to another FDDI host. The
method by which the MTU can be determined for a particular
destination is not yet specified, but the significance, from an
IP host/gateway's perspective, is that the MTU of such a network
can be different for different destinations, and will probably
have to be determined dynamically.
Note that the task of obtaining the MTU of such a data link is
required in any event, for the router to determine if the
datagram must be fragmented; processing of the PMTU-Query option
requires no additional determination. Also note that the data
link MTU must be determined by some other means; the protocol
described in this RFC is not meant for that purpose.
9. Suggested Values
This section provides suggested values for the configured
parameters required by protocol implementations.
MIN_PERIOD
Suggested Value: 15 minutes
This value is the interval between sending the IP PMTU-
Query option if either all the gateways in the path or the
destination host support the PMTU Discovery protocol.
MAX_PERIOD
Suggested Value: 2 hours
This value is the maximum interval between sending the IP
PMTU-Query option if no ICMP Path-MTU message is being
received in response.
DCACHE_TIMER
Suggested Value: 35 minutes
This value is the minimum time that a destination's cached
PMTU information for a particular source should be kept
without being referenced.
QUERY_RETRIES
Suggested Value: 3
This value is the number of times a PMTU-Query option is
retried if no answering ICMP Path-MTU message is received
in response, before the source concludes that both the
destination host and at least one gateway on the current
path do not support this PMTU Discovery protocol.
UFR_LIMIT
Suggested value: 3
This value is the maximum number of ICMP Unexpected
Fragment Report messages the receiver can send between
receiving PMTU-Query options. The only reason to set this
greater than one is to protect against the possibility of
an ICMP Unexpected Fragment Report getting lost. Provided
at least one ICMP Unexpected Fragment Report message
arrives, the source will update its Src_PMTU value and send
another PMTU-Query which will refresh the destination's
remaining count to its maximum value ready for the next
PMTU change.
DEFAULT_NONLOCAL_PMTU
Suggested value: 576
This is the value to which the PMTU should be set if no
answering ICMP Path-MTU messages have been received (recently).
10. Acknowledgements
This proposal is the output of the IETF MTU Discovery Working
Group. It is a combination of the ideas of many people. At one
time or another, Steve Deering, Chris Kent, Charles Lynn, and
Jeff Mogul all suggested using an ICMP message to report the
size of fragments. Noel Chiappa suggested the use of the next-
hop field in order to know whether all gateways on the path
support the protocol. Others who have contributed to this
proposal are:
Art Berggreen (ACC),
etc.
etc.
References
[1] C. Kent and J. Mogul.
Fragmentation Considered Harmful.
Proc. ACM SIGCOMM '87 Workshop, August 1987.
[2] J. Mogul, C. Kent, C. Partridge and K. McCloghrie.
IP MTU Discovery Options.
RFC 1063, SRI Network Information Center, July 1988.
[3] R. Braden.
Requirements for Internet Hosts - Communication Layers.
RFC 1122, SRI Network Information Center, October 1989.
[4] S. Deering.
IP and ICMP Extensions for MTU Discovery.
Draft Memo, October 1989.