Draft RFC on Path-MTU Discovery

mogul (Jeffrey Mogul) Fri, 12 January 1990 23:40 UTC

Received: by acetes.pa.dec.com (5.54.5/4.7.34) id AA26665; Fri, 12 Jan 90 15:40:58 PST
From: mogul (Jeffrey Mogul)
Message-Id: <9001122340.AA26665@acetes.pa.dec.com>
Date: 12 Jan 1990 1540-PST (Friday)
To: mtudwg
Cc:
Subject: Draft RFC on Path-MTU Discovery

Network Working Group               K. McCloghrie, Hughes LAN Systems
Request for Comments: DRAFT                R. Fox, Hughes LAN Systems
                                          J. Mogul, Digital Equipment
                                                      12 January 1990

		    WORKING DRAFT - Do not circulate

                       Path-MTU Discovery Protocol


     1.  Status of this Memo

     This memo describes a protocol for discovering the maximum
     transmission unit of an internet path, using one new IP option
     and two new ICMP messages.  This is proposed as an alternative
     to the procedures described in RFC-1063.  This memo does not
     define an Internet standard.   Distribution of this memo is
     unlimited.


     2.  Introduction

     When one IP host has a large amount of data to send to another
     host, the data is transmitted as a series of IP datagrams. It is
     preferable, in the general case, that these datagrams be of the
     largest size which does not require fragmentation during
     transmission.  This size is referred to as the Path-MTU (PMTU)
     of the path from the source to the destination, and is equal to
     the minimum of the MTUs of each hop in the path, where the MTU
     of a hop is the maximum size of an IP datagram on that hop.  A
     shortcoming of the current Internet protocol suite is the lack
     of a standard mechanism for a host to discover the PMTU of an
     arbitrary path.
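
     As a minimal sketch of this definition, the PMTU is simply the
     minimum over the per-hop MTUs.  The function name and the hop
     MTU values below are hypothetical, chosen only for illustration.

```python
def path_mtu(hop_mtus):
    """Return the Path-MTU: the smallest MTU of any hop on the path."""
    if not hop_mtus:
        raise ValueError("a path must contain at least one hop")
    return min(hop_mtus)


# Hypothetical three-hop path; the middle hop is the bottleneck.
example_pmtu = path_mtu([1500, 576, 1006])
```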

     The reasons to avoid fragmentation and the problems it incurs
     are well-documented in [1].  Some of these problems include:

        - use of fragmentation can sometimes lead to "deterministic
          fragment loss", where something in the internetwork causes
          certain fragments to be lost with higher than usual
          probability.  For example, a router with insufficient
          buffer capacity might always drop the 4th packet in a
          burst.  Since fragments are not individually acknowledged,
          this leads to miserable performance, or total failure,
          since retransmissions of the original datagram suffer the
          same fate.  (If the datagrams are not fragmented, even when



     McCloghrie/Fox/Mogul  DRAFT                              [page 1]






     RFC DRAFT              Path-MTU Discovery            January 1990


          they are lost deterministically, a protocol such as TCP will
          make slow but steady progress.)

        - sending datagrams of a size just larger than the PMTU
          causes each original datagram to be fragmented into one
          full-size and one tiny fragment.  Half of the resulting
          datagrams are then tiny fragments, which not only uses
          the bandwidth inefficiently but also forces the gateways
          to forward twice as many datagrams as necessary.

        - IP reassembly depends on having unique IP Identification
          values in each in-flight datagram.  With the Ident field
          being only 16 bits wide, the need to guarantee that every
          datagram in flight from one host to another has a unique
          Ident imposes a restriction on the maximum datagram
          transmission rate.  (See [1] for an example.)

        - IP reassembly is inherently less efficient than transport
          layer "reassembly".  (Again, see [1] for the arguments.)
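
     The "one full-size and one tiny fragment" effect described
     above can be made concrete with a small calculation.  The
     sketch below assumes a 20-octet IP header with no options and
     uses hypothetical datagram and MTU sizes; the name
     fragment_sizes is illustrative only.  It relies on the IP rule
     that the data in every fragment except the last is a multiple
     of 8 octets.

```python
def fragment_sizes(datagram_len, mtu, header_len=20):
    """Return the total length (header + data) of each fragment.

    Assumes a fixed header_len with no IP options (an assumption of
    this sketch, not a statement from the memo).
    """
    if datagram_len <= mtu:
        return [datagram_len]          # fits: no fragmentation needed
    # Data carried by every fragment but the last must be a multiple
    # of 8 octets, so round the usable space down.
    per_fragment = (mtu - header_len) // 8 * 8
    if per_fragment <= 0:
        raise ValueError("MTU too small to carry any data")
    remaining = datagram_len - header_len
    sizes = []
    while remaining > 0:
        chunk = min(per_fragment, remaining)
        sizes.append(header_len + chunk)
        remaining -= chunk
    return sizes


# A datagram only 18 octets larger than a hypothetical 1006-octet
# hop MTU is split into one large and one tiny fragment.
just_too_big = fragment_sizes(1024, 1006)
```

     With these example numbers the second fragment carries only 20
     octets of data behind its own 20-octet header.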


     3.  Summary of the Protocol

     The PMTU Discovery Protocol proposed in this memo is a hybrid of
     two different mechanisms.  Both mechanisms are invoked by the
     sender of a datagram, using the IP PMTU-Query Option.  In the
     primary mechanism, gateways along the path use this option to
     compute the minimum MTU of any hop on the path; the last-hop
     gateway then uses an ICMP Path-MTU message to report the PMTU to
     the sending host. Provisions are made to detect if any of the
     gateways do not implement this option (in which case, the
     computed value is only an upper bound on the PMTU).  This
     mechanism does not involve participation by the receiving host,
     and so can be used without upgrading all end hosts.  The IP
     PMTU-Query option is transmitted periodically, not on all
     datagrams, so that the gateways will not be burdened by
     excessive option-processing.

     The secondary mechanism is used as a backup, in case the IP
     PMTU-Query option is not supported by all the gateways, or in
     case the route changes to one with a lower PMTU before the
     source resends the PMTU-Query.  It does require support from the
     receiving host.  The receiver uses the reception of an IP PMTU-
     Query as an indication that the sender has invoked the protocol,
     and as a request to cache the PMTU of the path from the sender.
     This cached value provides the receiver with a way to detect
     changes in the PMTU, by comparing the size of incoming
     fragmented datagrams against the cached PMTU. When a change in
     the PMTU value is detected, the sender is notified through the
     use of an ICMP Unexpected Fragment Report message. Mechanisms
     are provided to limit how many ICMP Unexpected Fragment Report
     messages are sent, to avoid polluting the network.


     4.  Analysis of Previous Approaches

     4.1.  RFC-1063

     RFC-1063 [2] describes one solution to the PMTU Discovery
     problem. It defines a new IP option that is examined and
     possibly updated by each gateway along a path in such a way that
     it contains the PMTU when it arrives at its destination.
     Another new IP option is used to convey the PMTU back to the
     sender, piggybacked on a datagram going in the other direction.
     Discussion of the RFC-1063 proposal (e.g. in [4]) has
     highlighted a number of drawbacks:

        - for correct operation of the scheme, all gateways along a
          path must be updated to support the new IP option, which,
          for some paths, may take a long time.  If hosts start using
          the scheme before all the gateways have been updated, they
          may end up incurring more fragmentation than if not using
          it, because they may conclude that the PMTU of a given path
          is larger than it really is.  This happens, for example,
          when the two gateways connected by the hop with the
          minimum MTU do not support the new option.

        - this scheme does not work in cases where there is no return
          traffic on which to piggyback the discovered PMTU.  For
          example, a host sending unsolicited datagram "trap" or
          "event" messages to a network monitoring center would be
          unable to discover the PMTU to the monitoring center.

        - most gateways are optimized for forwarding datagrams which
          do not contain IP options.  Thus, an overhead is imposed on
          the gateways' throughput if the MTU option is sent too
          frequently.  However, sending the MTU option too
          infrequently can also cause additional overhead when a
          decrease in the PMTU occurs (e.g. due to an alternate
          routing around a link outage), since the sender will
          continue sending datagrams that need to be fragmented on
          the new path until it next uses the option and discovers
          the decrease.

     4.2.  Report Fragmentation

     To counteract these drawbacks, alternative schemes have been
     suggested (e.g. in [4]) based on the receiver reporting the
     occurrence of fragmentation to the sender via a new ICMP
     message. Such schemes have been called "report-fragmentation".
     They too, however, have a number of drawbacks:

        - to discover increases in the PMTU, report-fragmentation
          schemes require that larger datagrams are occasionally
          sent, to probe the path to determine whether they get
          fragmented.  Since the PMTU of a path changes only
          infrequently, most of these "on-purpose" fragmentations are
          otherwise unnecessary.

        - in situations where data is to be sent to a new
          destination, for which the sender has no (cached) PMTU
          information, the PMTU Discovery cannot take place until
          there is a large datagram to be sent.  In reality, most
          applications initially exchange a number of small messages
          (e.g. connection establishment messages) prior to sending
          their first large datagram. In contrast, a query-based
          scheme (like RFC-1063) can send its probe piggybacked on a
          datagram of any size, and receive the reply containing the
          PMTU during the exchange of the initial small messages so
          that all the subsequent large datagrams can be optimally
          sized.

        - a report-fragmentation scheme must specify which arriving
          fragments cause the sending of an ICMP message to report
          the fragmentation.  Sending on all occurrences would be
          excessive since there can be a round-trip time's worth of
          datagrams in-flight concurrently, each of which would
          generate an ICMP message; on a high-bandwidth/long-delay
          path this could generate hundreds or more unnecessary ICMP
          messages.  It would also be wasteful to send report
          messages to a source which would ignore them.
          Alternatively, datagrams could be marked to request that
          an ICMP message be generated if they arrive fragmented.
          Ideally, the mark would be a spare bit in the IP header
          (see [4]).  If no spare bit is available, then the
          mark presumably needs to be carried via an IP option, but
          this now inherits some of the drawbacks of the RFC-1063
          scheme mentioned above.

        - there are existing applications which ignore the PMTU, and
          always send larger datagrams regardless of whether
          fragmentation will result.  In order not to send
          unnecessary ICMP messages reporting the fragmentation of
          such datagrams, it has (until now) been necessary for an
          implementation to keep track of which applications are
          mindful of PMTU Discovery and which applications ignore it.
          In practice, this is probably too much state information
          to keep in the IP layer.  Thus, some amount of the
          implementation must be done in every transport/application
          layer which cares about avoiding fragmentation.  This not
          only duplicates implementation effort and code, but
          probably also requires separate ICMP messages to be sent
          for each transport/application, which significantly
          increases the overhead.


     5.  The Proposed Protocol

     5.1.  Description

     The PMTU Discovery protocol proposed in this memo defines one
     new IP option and two new ICMP messages.  The new IP option is
     the PMTU-Query option; the two new ICMP messages are the Path-
     MTU message and the Unexpected Fragment Report message.  These
     are used as follows:

          The IP PMTU-Query option can be carried on any datagram and
          asks the gateways through which it passes to update, if
          necessary, its minimum MTU value (like RFC-1063).
          However, the PMTU-Query option also contains other
          information which must be updated by each gateway
          supporting the option.  This provides the means for the
          destination to determine whether all gateways on the path
          taken by that datagram do support the option, and therefore
          whether the minimum MTU value it contains is indeed the
          PMTU or just an upper bound.

          The ICMP Path-MTU message is used to reply to a PMTU-Query
          option in order to carry the PMTU information back to the
          sender.  Since this is sent as an ICMP message (in contrast
          to RFC-1063's replying IP option), it does not require
          there to be return traffic in the other direction.  In
          addition, it can be sent immediately on receipt of the
          PMTU-Query option, rather than having to be queued in the
          receiver awaiting a datagram on which to piggyback it.

          The ICMP Unexpected Fragment Report message is sent by a
          host receiving a fragment of an unexpected size to the
          sender of the fragmented datagram.  These messages are sent
          because the unexpected size indicates that the PMTU has
          changed (i.e., no ICMP message is sent on receiving a
          fragment of the expected size).  After sending a limited
          number of ICMP Unexpected Fragment Report messages, the
          expected size is updated, so that no more will be sent
          before the next change in the PMTU.

          The PMTU-Query option contains a minimum-MTU-so-far field
          and a next-hop address field.  The minimum-MTU-so-far field
          is decreased by any gateway on the path if the MTU of the
          next/previous hop is less than the value which the field
          currently contains.  The next-hop address field is set to
          the address of the next IP gateway to which it is
          forwarded.  As each gateway processes the option, it
          expects to find its own address in the next-hop address
          field; if it does then it updates the next-hop field to the
          address to which it next forwards the datagram; otherwise,
          it sets the next-hop field to zero.  Thus, if the option
          arrives at the destination with its next-hop field non-
          zero, then all gateways along that path have updated the
          option and the minimum-MTU-so-far value is indeed the PMTU;
          otherwise, it is just a best-guess.

          The ICMP Path-MTU message is sent either by the "last-hop"
          gateway or by the destination host.  The "last-hop" gateway
          is the gateway which in processing the datagram recognizes
          that it does not need to be forwarded via any more
          gateways, but can now be sent directly to the destination
          host.  The last-hop gateway sends the ICMP Path-MTU if and
          only if the PMTU-Query's next-hop field contains the
          gateway's own address.  The ICMP Path-MTU message is sent
          by the destination host only if it receives the option with
          its next-hop field set to zero and if the destination host
          is willing to send ICMP Unexpected Fragment Report
          messages.  Notice that the MTU of the last hop (from the
          last-hop gateway to the destination host) is known by the
          last-hop gateway; thus if all gateways along the path
          support the option, the PMTU can be determined and
          communicated back to the source host even if the
          destination host does not support PMTU Discovery.

          The number of ICMP Unexpected Fragment Report messages that
          the destination host is allowed to send for each PMTU
          change is limited. The value of the limit is a configured
          parameter in the destination, but each PMTU-Query option
          carries an implicit refresh of that limit, which restores
          the remaining count of subsequent messages that can be sent
          to the full limit.  The destination host caches both the
          current PMTU value and the remaining count of Unexpected
          Fragment Report messages it can still send.  Whenever the
          destination host receives a fragment (actually, only
          fragment-0 of a datagram), it compares the size of this
          fragment to its cached value of the PMTU; if they are
          different and the remaining count has not yet decremented
          to zero, then it sends an Unexpected Fragment Report message
          and decrements the remaining count; otherwise, no ICMP
          Unexpected Fragment Report message is generated.

          The destination host maintains its cached value of the PMTU
          both from the MTU-value field in a received PMTU-Query
          option, and from the size of arriving fragments (when the
          remaining count gets decremented to zero).  Both the MTU-
          value in a PMTU-Query option with its next-hop field
          containing a valid address and the size of a fragment-0 are
          considered accurate indications of the current PMTU of the
          path, and overwrite the destination host's cached value.
          However, the MTU-value in a PMTU-Query option with its
          next-hop field set to zero is considered only an upper-
          bound on the PMTU of the path, and is never used to
          increase (only to decrease or to initialize) the cached
          PMTU value.
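
          The destination-host behaviour described in the two
          preceding paragraphs can be sketched as follows.  This is
          one reading of the text, not a definitive implementation:
          the class and method names, the REPORT_LIMIT value, and the
          choice to ignore fragments from sources that have never
          sent a PMTU-Query are all assumptions of this sketch.

```python
from dataclasses import dataclass

REPORT_LIMIT = 3  # hypothetical; the memo leaves the limit to configuration


@dataclass
class DestEntry:
    pmtu: int        # cached PMTU of the path from this source
    remaining: int   # Unexpected Fragment Reports still allowed


class DestinationCache:
    def __init__(self, limit=REPORT_LIMIT):
        self.limit = limit
        self.entries = {}  # keyed by source address

    def on_pmtu_query(self, src, min_mtu, next_hop_nonzero):
        """Handle a received PMTU-Query option."""
        entry = self.entries.get(src)
        if entry is None:
            # Either kind of value may initialize the cache.
            self.entries[src] = DestEntry(min_mtu, self.limit)
            return
        entry.remaining = self.limit  # implicit refresh of the limit
        # An upper-bound value (next-hop zero) may only decrease the
        # cached PMTU; an accurate value overwrites it.
        if next_hop_nonzero or min_mtu < entry.pmtu:
            entry.pmtu = min_mtu

    def on_fragment0(self, src, size):
        """Return True if an Unexpected Fragment Report should be sent."""
        entry = self.entries.get(src)
        if entry is None or size == entry.pmtu or entry.remaining == 0:
            return False
        entry.remaining -= 1
        if entry.remaining == 0:
            # Count exhausted: adopt the fragment size as the new PMTU.
            entry.pmtu = size
        return True
```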

          The source host caches the PMTU information it learns as
          part of its information about the route to each non-local-
          network destination, i.e. as an extension to its IP Route
          Cache. Whenever a Route Cache entry is created, the initial
          value for PMTU is set to 576 and a PMTU-Query option is
          sent with the first datagram.  Then, if no further
          information is obtained, the PMTU value will remain at 576
          for the life of the cache entry.

          When a PMTU-Query option is sent, it may fail to get an
          answer.  One reason is that at least one gateway and the
          destination host do not support the option.  Another
          reason is that the datagram carrying it, or the replying
          ICMP Path-MTU message, might get lost in the network.  To
          cater for the latter case, the option needs to be
          retransmitted until an answering Path-MTU message is
          received or a retry count is exhausted.

          In addition, the PMTU-Query option needs to be resent
          periodically to determine if the PMTU has changed, or if
          the status of support for PMTU queries by the gateways on
          the current route has changed.  Maximum and minimum values
          for the period between resends are suggested elsewhere in
          this memo.  The period remains at the minimum while there
          is enough support for the protocol on the current path such
          that at least one of the protocol's mechanisms is working.
          If there is insufficient support for either mechanism, then
          the query is still resent (in order to detect changes in
          the level of support) but the period is exponentially
          backed-off up to the maximum value.
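
          The back-off of the resend period can be sketched as
          below.  The numeric values of MIN_PERIOD and MAX_PERIOD
          are placeholders chosen for illustration only (the memo
          suggests its own values elsewhere), and the function name
          is hypothetical.

```python
MIN_PERIOD = 60     # seconds; placeholder, not the memo's value
MAX_PERIOD = 3840   # seconds; placeholder, not the memo's value


def next_resend_interval(current, mechanism_working):
    """Return the period to wait before the next PMTU-Query.

    mechanism_working is True when at least one of the protocol's
    mechanisms produced a response on the current path.
    """
    if mechanism_working:
        return MIN_PERIOD
    # No support detected: double the period, capped at the maximum.
    return min(current * 2, MAX_PERIOD)
```

          Starting from MIN_PERIOD with no responses, the period
          doubles on each resend until it reaches MAX_PERIOD, and
          drops back to MIN_PERIOD as soon as a response arrives.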

     5.2.  Advantages

     The advantages of the proposed protocol are:

        - the protocol works even when only some of the gateways and
          hosts implement the procedures.

        - its benefits increase incrementally as more of the gateways
          and hosts implement the procedures.

        - it does not require there to be any return traffic, and
          replies to the PMTU-Query are transmitted immediately
          instead of having to be queued waiting for a datagram to be
          sent in the reverse direction.

        - it does not involve sending any large datagrams which are
          expected to get fragmented just to test if the PMTU has
          increased.

        - the number of additional datagrams and IP options due to
          this scheme is limited.

        - this protocol accommodates the fact that existing
          applications do (and will likely continue to) ignore PMTU
          sizes in the datagrams they send.  Such behaviour does not
          induce additional overhead with this scheme, since it is
          NOT the fragmentation of these jumbo datagrams which
          generates the ICMP messages; rather, it is the change in
          the size of the fragments which generates the messages.  In
          fact, only a small specified number of ICMP Unexpected
          Fragment Report messages are sent after a change in the
          PMTU, regardless of whether or not the source
          transport/applications adjust the size of the datagrams
          they send.

        - this protocol can be implemented entirely within the IP
          layer (without needing to keep per-protocol/port state
          information).  This avoids duplicating the implementation
          in each transport and/or application layer, and avoids
          having each transport/application send its own
          queries/responses.  Interaction with the
          transport/application layer is only necessary in the source
          host, where the transport/application can ask IP for the
          PMTU to a particular destination (see GET_MAXSIZES in [3]),
          and be informed when sending a datagram larger than the
          current PMTU (e.g. because the PMTU just decreased).


     6.  IP Option and ICMP Message Formats

     The formats of the new IP option and the new ICMP messages are
     given in the following sections.


     6.1.  IP PMTU-Query Option

     0                   1                   2                   3
     0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
     |      Type     |     Length    |            Min-MTU            |
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
     |                            Next-Hop                           |
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

     Type       <<to be assigned>>

     Length     8

     Min-MTU    The minimum of the MTU values of each hop through
                which the datagram containing this IP option has so
                far been transmitted.

     Next-Hop   Either the Internet address of the next IP entity to
                which this datagram is being forwarded, or zero.  The
                field is set to zero when the option is processed by
                an IP entity if this field does not contain the IP
                address of that entity.  Thus a zero value indicates
                that the Min-MTU field is not necessarily accurate.

     (This option is not copied on fragmentation).
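
     The 8-octet layout of the IP PMTU-Query option in section 6.1
     can be sketched with Python's struct module.  The option Type
     has not been assigned, so PMTU_QUERY_TYPE below is a
     placeholder and not a real assignment; the function names are
     likewise hypothetical.  Fields are encoded in network byte
     order.

```python
import socket
import struct

PMTU_QUERY_TYPE = 0x1E   # placeholder only; the real Type is unassigned
PMTU_QUERY_LEN = 8       # Length field value given in section 6.1


def pack_pmtu_query(min_mtu, next_hop="0.0.0.0"):
    """Encode the option; next_hop is a dotted-quad string, zero if unset."""
    return struct.pack("!BBH4s", PMTU_QUERY_TYPE, PMTU_QUERY_LEN,
                       min_mtu, socket.inet_aton(next_hop))


def unpack_pmtu_query(data):
    """Decode the option, returning (min_mtu, next_hop)."""
    if len(data) < PMTU_QUERY_LEN:
        raise ValueError("option is shorter than 8 octets")
    opt_type, length, min_mtu, hop = struct.unpack("!BBH4s", data[:8])
    if opt_type != PMTU_QUERY_TYPE or length != PMTU_QUERY_LEN:
        raise ValueError("not a PMTU-Query option")
    return min_mtu, socket.inet_ntoa(hop)
```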


     6.2.  ICMP Path-MTU Message

     0                   1                   2                   3
     0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
     |      Type     |      Code     |           Checksum            |
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
     |            Reserved           |             PMTU              |
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
     |      Internet Header + 64 bits of Original Datagram Data      |
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

     Type       <<to be assigned>>

     Code       0 - Valid
                1 - Upper-Bound

     Checksum   The 16-bit one's complement of the one's complement
                sum of the ICMP message, starting with the Type
                field.  For computing the checksum, the Checksum
                field is initialized to zero.

     Reserved   Sent as zero; ignored on reception.

     PMTU       The PMTU value.  If Code = Valid, this is the known
                value; if Code = Upper-Bound, this is the best-guess
                value.  The value is the PMTU of the path taken by the
                PMTU-Query to which this message is a response.

     Internet Header + 64 bits of Original Datagram Data
                These are extracted from the datagram containing
                the IP PMTU-Query option to which this message is a
                response.
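
     Construction of the ICMP Path-MTU message of section 6.2,
     including the checksum rule stated there, can be sketched as
     follows.  The ICMP Type has not been assigned, so
     PATH_MTU_TYPE is a placeholder and not a real assignment; the
     function names are hypothetical.  The caller is assumed to
     supply the original Internet header plus the first 64 bits of
     data as a byte string.

```python
import struct

PATH_MTU_TYPE = 0xFD     # placeholder only; the real Type is unassigned
CODE_VALID = 0
CODE_UPPER_BOUND = 1


def internet_checksum(data):
    """16-bit one's complement of the one's complement sum of data."""
    if len(data) % 2:
        data += b"\x00"                      # pad to a whole 16-bit word
    total = sum(struct.unpack("!%dH" % (len(data) // 2), data))
    while total >> 16:                       # fold carries back in
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def build_path_mtu(code, pmtu, original):
    """Build Type, Code, Checksum, Reserved, PMTU, then the original data."""
    # The Checksum field is zero while the checksum is computed.
    body = struct.pack("!BBHHH", PATH_MTU_TYPE, code, 0, 0, pmtu) + original
    checksum = internet_checksum(body)
    return body[:2] + struct.pack("!H", checksum) + body[4:]
```

     A receiver can validate such a message by running the same
     checksum over the whole message and checking for zero.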


     6.3.  ICMP Unexpected Fragment Report Message

     0                   1                   2                   3
     0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
     |     Type      |     Code      |           Checksum            |
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
     |                           Reserved                            |
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
     |      Internet Header + 64 bits of Original Datagram Data      |
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

     Type       <<to be assigned>>

     Code       0

     Checksum   The 16-bit one's complement of the one's complement
                sum of the ICMP message, starting with the Type
                field.  For computing the checksum, the Checksum
                field is initialized to zero.

     Reserved   Sent as zero; ignored on reception.

     Internet Header + 64 bits of Original Datagram Data
                These are extracted from the datagram, for which
                fragment 0 has been received.


     (This message is a member of the class of ICMP error messages.)


     7.  Use and Implementation of the PMTU Discovery Protocol

     The PMTU Discovery protocol proposed in this memo requires
     enhancements to the procedures implemented by both gateways and
     hosts.  These are explained in the following sections.


     7.1.  Gateways

     Gateways should recognize and process the IP PMTU-Query option
     when it occurs in a datagram they are forwarding.  The
     processing is as follows:

       compare the three values: the Min-MTU field of the PMTU-Query
       option; the MTU of the interface on which it was received; and
       the MTU of the interface through which it will be forwarded.
       If necessary, update the Min-MTU field to contain the minimum
       of the three values.

       compare the Internet Address in the Next-Hop field in the
       PMTU-Query option to the address of the interface on which it
       was received:

         - if not equal, set the Next-Hop field in the option to
         zero.

         - if equal, determine whether the datagram can now be sent
         directly to the destination host or whether it must be
         forwarded via another gateway:

           - if via another gateway, set the Next-Hop field to the
           address of the next gateway.

           - if directly to the destination host, then set the Next-
           Hop field to the destination host's address and generate
           an ICMP Path-MTU Code=Valid message with its PMTU field
           copied from the Min-MTU field in the PMTU-Query option,
           and send it to the source address of the datagram which
           contained the PMTU-Query option.
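
     The gateway processing steps above can be sketched as a single
     function.  This is an illustration under stated assumptions,
     not a definitive implementation: addresses are plain strings,
     "0.0.0.0" stands for a zero Next-Hop field, and the parameter
     forward_to (the next gateway's address, or None when the
     destination is directly reachable) is a hypothetical stand-in
     for the gateway's routing decision.

```python
from dataclasses import dataclass

ZERO = "0.0.0.0"  # represents a Next-Hop field of zero in this sketch


@dataclass
class PmtuQuery:
    min_mtu: int
    next_hop: str


def process_at_gateway(opt, in_addr, in_mtu, out_mtu, forward_to, dest_addr):
    """Update opt in place; return the PMTU to report, or None.

    A non-None result means this gateway is the last hop and should
    send an ICMP Path-MTU Code=Valid message carrying that value to
    the source of the datagram.
    """
    # Min-MTU becomes the minimum of the three values.
    opt.min_mtu = min(opt.min_mtu, in_mtu, out_mtu)
    if opt.next_hop != in_addr:
        # A previous gateway did not process the option.
        opt.next_hop = ZERO
        return None
    if forward_to is not None:
        opt.next_hop = forward_to      # another gateway follows
        return None
    opt.next_hop = dest_addr           # last hop: deliver directly
    return opt.min_mtu
```

     If a gateway that does not implement the option sits earlier
     on the path, the Next-Hop field will not match at the next
     implementing gateway, so the field is zeroed and no Code=Valid
     message is generated by a gateway.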


     7.2.  Source Hosts

     A host which implements this scheme to determine the PMTU of its
     paths to other hosts needs to keep a cache of state information
     for each non-local-network destination.  This state information
     is an extension of the state information which a host should
     already be keeping in its Route Cache (see section 3.3.1.3 in
     the Host Requirements RFC [3]).  The additional state
     information to be kept for PMTU Discovery is as follows:

       Src_PMTU  -  the PMTU of the path which has the local host as
               the source and this cache entry's host as the
               destination.  If the PMTU has not yet been discovered
               then this value must be DEFAULT_NONLOCAL_PMTU; else,
               this value will equal the known or best-guess PMTU.

       Resend_Interval  -  the frequency at which to send PMTU-Query
               options.  This is initially set to the value
               MIN_PERIOD, and increased exponentially by doubling up
               to a maximum value of MAX_PERIOD when no answering
               ICMP Path-MTU message is being received.  It is reset
               to MIN_PERIOD when either an ICMP Path-MTU or an
               Unexpected Fragment Report message is received.

       Retry_Count  -  the count of remaining retries of the IP
               PMTU-Query option while awaiting an answering ICMP
               Path-MTU message.  It is set to zero after the
               answering Path-MTU message is received, to indicate
               that the PMTU-Query is no longer being retried.

       Query_Flag  -  set to indicate a PMTU-Query option should be
               sent on the next datagram to this destination.

     This scheme also requires a timer to be associated with each
     entry in the Route Cache.  At different times during the
     protocol exchanges, this timer operates either as a retry-timer
     or as a resend-timer.
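
     The per-destination state listed above might be represented
     as in the sketch below.  The value 576 for
     DEFAULT_NONLOCAL_PMTU follows the initial PMTU given in
     section 5.1; the values of MIN_PERIOD and QUERY_RETRIES are
     placeholders, and the class and method names are hypothetical.
     The timer_role method reflects the rule that a single timer
     acts as a resend-timer when Retry_Count is zero and as a
     retry-timer otherwise.

```python
from dataclasses import dataclass

DEFAULT_NONLOCAL_PMTU = 576  # initial PMTU stated in section 5.1
MIN_PERIOD = 60              # seconds; placeholder value
QUERY_RETRIES = 3            # placeholder value


@dataclass
class PmtuState:
    """PMTU Discovery extension to one Route Cache entry."""
    src_pmtu: int = DEFAULT_NONLOCAL_PMTU
    resend_interval: int = MIN_PERIOD
    retry_count: int = QUERY_RETRIES
    query_flag: bool = True   # send a PMTU-Query on the next datagram

    def timer_role(self):
        """Say which role the entry's single timer is playing."""
        return "resend" if self.retry_count == 0 else "retry"

    def on_path_mtu(self, pmtu):
        """Record an answering ICMP Path-MTU message."""
        self.retry_count = 0              # the query is no longer retried
        self.src_pmtu = pmtu
        self.resend_interval = MIN_PERIOD
```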

     The procedures followed by a source host need to be updated with
     the following additions:

       1) When IP creates an entry in its Route Cache, the PMTU
          Discovery fields should be initialized as:

               Src_PMTU        = DEFAULT_NONLOCAL_PMTU,
               Resend_Interval = MIN_PERIOD,
               Retry_Count     = QUERY_RETRIES,
               Query_Flag      = set

       2) if IP receives a request to send a datagram to a non-
          local-net destination for which the Route Cache entry has
          its Query_Flag set, then add an IP PMTU-Query option to the
          datagram prior to sending it. The PMTU-Query option should
          be initialized with:

               Min-MTU   = MTU of interface on which the datagram is
                           to be transmitted,
               Next-Hop  = address of first-hop gateway.

          Then, clear the Query_Flag and start a retry timer for this
          Cache entry to go off after an interval of time larger than
          the round-trip time to the destination.

       3) If an ICMP Path-MTU message is received, find the Cache
          entry for the appropriate host.  Set Retry_Count to zero to
          indicate that a response has been received for the last
          PMTU-Query option sent, and set the Src_PMTU value from the
          Min-MTU field in the ICMP message.  Also, set the value of
          Resend_Interval to MIN_PERIOD since there is at least some
          support for PMTU Discovery on the path to this destination.
          Lastly, (re-)start this Cache entry's timer to go off after
          MIN_PERIOD, when the next PMTU-Query needs to be sent.

       4) When a Cache entry's timer goes off, examine the value of
          Retry_Count.  If Retry_Count is zero, then this was a
          resend-timer and it is now time to send a new PMTU-Query,
          so set Retry_Count to QUERY_RETRIES and set the Query_Flag.
          Otherwise (Retry_Count is greater than zero), this was a
          retry-timer and no ICMP Path-MTU message has been received
          in response to the last PMTU-Query sent, so decrement
          Retry_Count.  Provided that decrementing Retry_Count does
          not bring it to zero, the PMTU-Query needs to be
          retransmitted, so set the Query_Flag.

          If Retry_Count has been decremented to zero, then the retry
          limit is exhausted, so the Src_PMTU value must be reset to
          DEFAULT_NONLOCAL_PMTU; this reset of Src_PMTU is necessary
          because the lack of a response could indicate the route has
          changed and that now neither all the gateways nor the
          destination host support this protocol.  Also, increase the
          value of Resend_Interval to twice its current value but not
          greater than MAX_PERIOD.  Then, start a timer to go off
          after the interval given by Resend_Interval, when the next
          PMTU-Query needs to be sent.

       5) If an ICMP Unexpected Fragment Report message is received
          from a host for which there is a Cache entry, and the size
          of fragment-0 as reported in the message is not equal to
          the cached Src_PMTU, then update the cached Src_PMTU to the
          fragment-0 size and set the Cache entry's Query_Flag.
          Setting the Query_Flag here is intended to send another
          PMTU-Query option not only to determine if the PMTU change
          has also caused a change in how many gateways on the path
          support the option, but just as importantly, to restore the
          destination's count of ICMP Unexpected Fragment Report
          messages it can send.
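
     The five source-host steps above can be read as a small state
     machine per Route Cache entry.  The following is a minimal
     sketch, not a definitive implementation: the function names,
     the dictionary used to stand in for the PMTU-Query option, and
     the "timer" field (seconds until the entry's timer fires) are
     assumptions made for illustration.  The constants use the
     suggested values from section 9.

```python
from dataclasses import dataclass
from typing import Optional

# Suggested values from section 9; periods are expressed in seconds.
MIN_PERIOD = 15 * 60
MAX_PERIOD = 2 * 60 * 60
QUERY_RETRIES = 3
DEFAULT_NONLOCAL_PMTU = 576


@dataclass
class RouteCacheEntry:
    """PMTU Discovery fields of one Route Cache entry (step 1)."""
    src_pmtu: int = DEFAULT_NONLOCAL_PMTU
    resend_interval: int = MIN_PERIOD
    retry_count: int = QUERY_RETRIES
    query_flag: bool = True
    timer: Optional[float] = None  # hypothetical: seconds until the timer fires


def on_send(entry, interface_mtu, first_hop, retry_timeout):
    """Step 2: return the PMTU-Query option to attach, or None."""
    if not entry.query_flag:
        return None
    entry.query_flag = False
    entry.timer = retry_timeout  # should exceed the RTT to the destination
    return {"min_mtu": interface_mtu, "next_hop": first_hop}


def on_path_mtu(entry, min_mtu):
    """Step 3: an ICMP Path-MTU message was received."""
    entry.retry_count = 0
    entry.src_pmtu = min_mtu
    entry.resend_interval = MIN_PERIOD
    entry.timer = MIN_PERIOD  # now acting as a resend-timer


def on_timer(entry):
    """Step 4: the entry's timer went off."""
    if entry.retry_count == 0:
        # Resend-timer: begin a fresh round of queries.
        entry.retry_count = QUERY_RETRIES
        entry.query_flag = True
        return
    # Retry-timer: the last PMTU-Query went unanswered.
    entry.retry_count -= 1
    if entry.retry_count > 0:
        entry.query_flag = True  # retransmit on the next datagram
        return
    # Retry limit exhausted: fall back and back off.
    entry.src_pmtu = DEFAULT_NONLOCAL_PMTU
    entry.resend_interval = min(2 * entry.resend_interval, MAX_PERIOD)
    entry.timer = entry.resend_interval


def on_unexpected_fragment_report(entry, frag0_size):
    """Step 5: an ICMP Unexpected Fragment Report was received."""
    if frag0_size != entry.src_pmtu:
        entry.src_pmtu = frag0_size
        entry.query_flag = True
```

     In this sketch the retry-timer is (re-)armed only by on_send,
     which matches step 2: the timer starts when the option actually
     leaves on a datagram, not when the Query_Flag is set.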


     7.3.  Destination Hosts

     To implement this scheme as a destination host, IP must maintain
     a cache of information with one entry per host from which it has
     received an IP PMTU-Query option.  The cache entries should be
     deleted if they have not been referenced for some DCACHE_TIMER
     period.  Each cache entry maintains the following information:

       Address  -  the Internet address of this entry's host.

       UFR_Count -  the remaining count of subsequent ICMP Unexpected
               Fragment Report messages that may be sent to this
               entry's host.

       Dest_PMTU  -  set to the (known or best-guess) PMTU of the
               path which has the local host as the destination and
               this entry's host as the source.

     The procedures followed by a destination host need to be updated
     with the following additions:

       1) On receiving a datagram with a PMTU-Query option, the
          option's Min-MTU and Next-Hop fields should first be
          updated according to the local interface on which the
          datagram was received.  The Min-MTU field should be set to
          the minimum of its own value and the MTU of the local
          interface; the Next-Hop field should be set to zero if its
          value is not the address of the local interface.

          Next, locate the cache entry for the source of the
          datagram; if none exists create one and initialize its
          Dest_PMTU value from the Min-MTU in the option.

          Next, if the Next-Hop field in the option has a non-zero
          value or if the Dest_PMTU value would be decreased, then
          set the Dest_PMTU value from the Min-MTU field in the
          option.  (Note, do not increase Dest_PMTU if the Next-Hop
          field is zero.)

          Lastly, set UFR_Count to the value UFR_LIMIT, and if the
          Next-Hop field in the option has a zero value, generate an
          ICMP Path-MTU message Code=Upper-Bound with its Path-MTU
          field set to the value of Dest_PMTU.

       2) When a datagram fragment is received and it is fragment-0,
          locate the cache entry for the sending host.  If there is
          no cache information for the sending host, then there is no
          PMTU Discovery processing to be done.  This could be the
          case where the source does not implement this PMTU
          Discovery protocol.

          Next, compare the size of the fragment-0 with the Dest_PMTU
          value.  If they are equal then there is nothing to be done
          since this is not a change in the PMTU, i.e., the receipt
          of this fragment provides no new PMTU information.

          However, if the size of fragment-0 is not equal to the
          value of Dest_PMTU, and UFR_Count is greater than zero,
          then decrement UFR_Count and send an ICMP Unexpected
          Fragment Report message.

          If UFR_Count was zero or is now zero, then set the value of
          Dest_PMTU to the size of the fragment.  This is now the
          expected size of incoming datagram initial fragments.
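
     The two destination-host procedures above can be sketched as
     follows.  This is an illustration under stated assumptions, not
     a definitive implementation: the cache is modelled as a plain
     dictionary keyed by source address, the function names are
     hypothetical, and each function returns a value telling the
     caller which ICMP message (if any) to generate rather than
     building it.  Only the cases stated in the text are covered.

```python
from dataclasses import dataclass

UFR_LIMIT = 3  # suggested value from section 9


@dataclass
class DestCacheEntry:
    dest_pmtu: int
    ufr_count: int = UFR_LIMIT


def on_pmtu_query(cache, src, min_mtu, next_hop, local_mtu, local_addr):
    """Step 1.  Return the Path-MTU value to report in an ICMP Path-MTU
    message (Code=Upper-Bound), or None when the text calls for none."""
    # Update the option fields for the receiving interface.
    min_mtu = min(min_mtu, local_mtu)
    if next_hop != local_addr:
        next_hop = 0
    # Locate or create the cache entry for the source.
    entry = cache.get(src)
    if entry is None:
        entry = cache[src] = DestCacheEntry(dest_pmtu=min_mtu)
    # Never increase Dest_PMTU when Next-Hop is zero.
    if next_hop != 0 or min_mtu < entry.dest_pmtu:
        entry.dest_pmtu = min_mtu
    entry.ufr_count = UFR_LIMIT
    return entry.dest_pmtu if next_hop == 0 else None


def on_fragment0(cache, src, frag0_size):
    """Step 2.  Return True when an ICMP Unexpected Fragment Report
    message should be sent to the source."""
    entry = cache.get(src)
    if entry is None:
        return False  # source is not known to use PMTU Discovery
    if frag0_size == entry.dest_pmtu:
        return False  # no new PMTU information
    send_report = False
    if entry.ufr_count > 0:
        entry.ufr_count -= 1
        send_report = True
    if entry.ufr_count == 0:
        # Covers both "was zero" and "is now zero".
        entry.dest_pmtu = frag0_size
    return send_report
```

     With UFR_LIMIT = 3, a run of fragment-0 arrivals of an
     unexpected size yields three reports; on the third, Dest_PMTU is
     updated, so later fragments of that size are no longer
     unexpected.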



     8.  Discussion

     8.1.  Timers

     The PMTU Discovery protocol proposed in this memo uses a number
     of timers: a query-retry timer, a query-resend timer, and a
     destination cache timer.  It is important that the intervals of
     these timers be set correctly.

     For the retry timer, it is ideal to have the interval be just
     larger than the round-trip time (RTT) to the destination, but it
     is important that it not be smaller than the RTT.  Typically, IP
     has no knowledge of the RTT to a particular destination.  One
     possibility is to set the retry interval to a constant which is
     larger than the maximum RTT to any destination, e.g. 2*MSL (i.e.
     4 minutes).  Alternatively, the retry interval could initially be
     set to a smaller value (e.g. 30 seconds) and doubled for every
     retry.
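
     As a worked illustration of the second alternative, the sketch
     below lists the retry-timer intervals for one round of queries.
     The function name is hypothetical; the 30-second start is the
     example value from the text, and the default of three intervals
     assumes QUERY_RETRIES = 3 as suggested in section 9.

```python
def retry_intervals(initial=30.0, retries=3):
    """Retry-timer intervals in seconds, doubling after every retry."""
    return [initial * 2 ** attempt for attempt in range(retries)]


# One round with the doubling scheme versus a constant 2*MSL timer
# (240 seconds, taking 2*MSL as 4 minutes as in the text).
doubling_total = sum(retry_intervals())
constant_total = 3 * 240.0
```

     Under these assumptions an unanswered round takes 210 seconds
     with doubling, against 720 seconds with the constant interval.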

     For the resend-query timer, the interval is set according to the
     degree of support for the protocol on the current path to a
     destination.  In most cases, the interval determines for how
     long fragmentation will occur after a PMTU change.  For example,
     if all gateways are supporting the protocol but not the
     destination host, then it is only these resends which will
     detect a PMTU change.  So the interval needs to be small enough
     to limit the amount of fragmentation if the PMTU decreases. If
     the destination host supports the protocol, then there is (at
     least some) reliance on ICMP Unexpected Fragment Report messages
     to detect PMTU changes.  Here also, the interval needs to be set
     to limit the amount of fragmentation which would occur if all
     the Unexpected Fragment Report messages were to get lost (or the
     destination were rebooted).  Only if neither all the gateways
     nor the destination host supports the protocol can the interval
     be made longer, since in this case the reason to resend the
     query is to occasionally test if the set of gateways on the path
     has changed such that they all now support the protocol.

     The interval of the destination cache timer determines how long
     unreferenced PMTU information stays in the destination host's
     cache before being deleted.  Besides being needed to prevent the
     destination's cache growing too large, this timer is also
     necessary in case the source host's IP address is re-assigned to
     another host which may not implement PMTU Discovery.  Note,
     however, that the destination host's cache is only referenced by
     received PMTU-Query options and by received fragments.  Since
     there may not be any fragments received over a period when the
     PMTU has not decreased, it is important that this destination
     cache timer be at least several times larger than the
     corresponding value of the resend-query timer (i.e. the value of
     the resend-query timer when the destination-host supports the
     protocol).
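
     A destination might enforce this timer with a periodic sweep
     such as the one sketched below.  The representation is an
     assumption made for illustration: a dictionary mapping each
     source address to the time (in seconds) at which its entry was
     last referenced by a PMTU-Query option or a fragment.

```python
DCACHE_TIMER = 35 * 60  # suggested value from section 9, in seconds


def purge_stale(last_referenced, now, timeout=DCACHE_TIMER):
    """Delete entries unreferenced for at least `timeout` seconds.

    Returns the list of source addresses that were removed.
    """
    stale = [src for src, seen in last_referenced.items()
             if now - seen >= timeout]
    for src in stale:
        del last_referenced[src]
    return stale
```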


     8.2.  Host Requirements

     The Host Requirements RFC [3] requires transport/applications to
     call IP via the GET_MAXSIZES function when they wish to
     (re-)determine the PMTU.  (Section 3.4 of [3] specifies that
     GET_MAXSIZES returns the PMTU value as MMS_S.)  With this
     already required, this PMTU Discovery protocol places no new
     requirements on transport/applications.  However, it is
     suggested that IP return a warning indication on a send call, in
     the event that the packet-size is larger than the current PMTU,
     to indicate that the transport/application SHOULD call
     GET_MAXSIZES.  [Note that [3] provides no way for a transport/
     application to determine the MTU of a local interface; this may
     be an omission.]


     8.3.  Mixed Support by Transport/Applications

     The benefits from implementing this PMTU Discovery protocol in a
     source host are likely to be incrementally achieved, as each of
     the transport and/or application protocols takes advantage by
     creating packets at the full PMTU size.  During this incremental
     evolution, a source host may have some protocols dynamically
     adjusting to changing PMTU sizes and some which are still
     sending jumbo datagrams which ignore PMTU sizes.  In this
     situation, this proposal does not need to associate PMTU sizes
     with specific protocols (or connections).  In particular, if all
     of a source host's protocols ignore PMTU Discovery, then that
     host should not be sending PMTU-Query options; otherwise, the
     Unexpected Fragment Report messages are sent not because some
     applications' (e.g. NFS) datagrams are getting fragmented, but
     because the PMTU has changed and the Unexpected Fragment Report
     messages pass that information back to the sender so that
     other protocols (which are not ignoring PMTU) can be informed.


     8.4.  Making Room for Options

     One of the disadvantages of any usage of IP options (including
     usage of the PMTU-Query option) is that they increase the size
     of datagrams.  It is the transport/application layers which
     decide on the size of a packet they give IP to send as a
     datagram.  When IP has its own algorithms for deciding which
     datagrams need to have IP options added to them, there is the
     chance that the increased size of a datagram will be larger than
     the PMTU, and thus will result in fragmentation.  There are two
     ways to avoid this:

        - have some additional communication between IP and the
          transport/applications whereby IP can request the next send
          be of a smaller packet-size such that the addition of the
          option will not exceed the PMTU.  This approach complicates
          the interface between IP and TCP and between IP and UDP-
          based applications, especially since IP doesn't know which
          of them will next send to a particular destination.

        - decrease the PMTU which IP advertises to the transport/
          applications such that when IP adds an IP option to a
          packet of the decreased size the resulting IP datagram does
          not exceed the real PMTU.  This approach does not
          complicate the interface to IP from above, but has the
          disadvantage that a small fraction of the bandwidth is
          wasted since those datagrams to which no IP option is added
          are not quite as large as the PMTU.  (See the discussion of
          this approach in section 3.3.3 of the Host Requirement RFC
          [3].)
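
     The second approach amounts to a subtraction, sketched below.
     The option length is passed in as a parameter because it is not
     restated in this section; the function name is hypothetical, and
     the sketch assumes the PMTU-Query option is the only IP option
     added, so that it is padded on its own to a 32-bit boundary
     (the IP header length is counted in 32-bit words).

```python
def advertised_pmtu(real_pmtu, option_len):
    """PMTU to advertise upward so that adding one IP option of
    `option_len` bytes never pushes a datagram past `real_pmtu`."""
    padded_len = (option_len + 3) // 4 * 4  # round up to 32-bit words
    return real_pmtu - padded_len
```

     For instance, reserving room for a hypothetical 8-byte option on
     a path with a PMTU of 1500 would advertise 1492, so datagrams
     sent without the option fall 8 bytes short of the real PMTU.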


     8.5.  No Local Network Usage

     Notice that using the protocol proposed in this memo, a
     PMTU-Query option is never sent between two hosts on the same
     (sub-)network.  Since no PMTU-Query option is ever sent, neither
     will any ICMP Path-MTU messages nor any Unexpected Fragment
     Report messages ever be sent between two hosts on the same
     (sub-)network.


     8.6.  Dual-MTU Networks

     A dual-MTU network is a network containing multiple types of
     media, with each medium having its own MTU value.  An example is
     an FDDI/Ethernet network in which a bridge interconnects an
     Ethernet (MTU = 1500) with an FDDI network (MTU = 4K).  A host
     on the FDDI part of the network has an MTU of 4K when
     communicating with another host to which an all-FDDI path
     exists, but an MTU of only 1500 when communicating with an
     Ethernet host, or via an Ethernet to another FDDI host.  The
     method by which the MTU can be determined for a particular
     destination is not yet specified, but the significance, from an
     IP host/gateway's perspective, is that the MTU of such a network
     can be different for different destinations, and will probably
     have to be determined dynamically.

     Note that the task of obtaining the MTU of such a data link is
     required in any event, for the router to determine if the
     datagram must be fragmented; processing of the PMTU-Query option
     requires no additional determination.  Also note that the data
     link MTU must be determined by some other means; the protocol
     described in this RFC is not meant for that purpose.



     9.  Suggested Values

     This section provides suggested values for the configured
     parameters required by protocol implementations.

     MIN_PERIOD

          Suggested Value: 15 minutes

          This value is the interval between sending the IP PMTU-
          Query option if either all the gateways in the path or the
          destination host support the PMTU Discovery protocol.

     MAX_PERIOD

          Suggested Value: 2 hours

          This value is the maximum interval between sending the IP
          PMTU-Query option if no ICMP Path-MTU message is being
          received in response.

     DCACHE_TIMER

          Suggested Value: 35 minutes

          This value is the minimum time that a destination's cached
          PMTU information for a particular source should be kept
          without being referenced.

     QUERY_RETRIES

          Suggested Value: 3

          This value is the number of times a PMTU-Query option is
          retried if no answering ICMP Path-MTU message is received
          in response, before the source concludes that both the
          destination host and at least one gateway on the current
          path do not support this PMTU Discovery protocol.

     UFR_LIMIT

          Suggested Value: 3

          This value is the maximum number of ICMP Unexpected
          Fragment Report messages the receiver can send between
          receiving PMTU-Query options.  The only reason to set this
          greater than one is to protect against the possibility of
          an ICMP Unexpected Fragment Report getting lost.  Providing
          at least one ICMP Unexpected Fragment Report message
          arrives, the source will update its Src_PMTU value and send
          another PMTU-Query which will refresh the destination's
          remaining count to its maximum value ready for the next
          PMTU change.

     DEFAULT_NONLOCAL_PMTU

          Suggested Value: 576

          This is the value to which the PMTU should be set if no
          answering ICMP Path-MTU messages have been received
          (recently).
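
     To show how MIN_PERIOD and MAX_PERIOD interact, the sketch below
     lists the Resend_Interval values a source would use after
     successive rounds of unanswered queries, doubling from
     MIN_PERIOD and capping at MAX_PERIOD.  The function name and the
     use of minutes as the unit are choices made for this
     illustration only.

```python
MIN_PERIOD = 15    # minutes (suggested value)
MAX_PERIOD = 120   # minutes (suggested value of 2 hours)


def resend_schedule(failed_rounds):
    """Resend_Interval, in minutes, after each exhausted retry round."""
    interval = MIN_PERIOD
    schedule = []
    for _ in range(failed_rounds):
        interval = min(2 * interval, MAX_PERIOD)
        schedule.append(interval)
    return schedule
```

     With the suggested values, five consecutive failed rounds give
     intervals of 30, 60, 120, 120 and 120 minutes; receipt of any
     ICMP Path-MTU or Unexpected Fragment Report message returns the
     interval to 15 minutes.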


     10.  Acknowledgements

     This proposal is the output of the IETF MTU Discovery Working
     Group.  It is a combination of the ideas of many people.  At one
     time or another, Steve Deering, Chris Kent, Charles Lynn, and
     Jeff Mogul all suggested using an ICMP message to report the
     size of fragments. Noel Chiappa suggested the use of the next-
     hop field in order to know whether all gateways on the path
     support the protocol.  Others who have contributed to this
     proposal are:

          Art Berggreen (ACC),
          etc.
          etc.


     References

     [1]   C. Kent and J. Mogul.
           Fragmentation Considered Harmful.
           Proc. ACM SIGCOMM '87 Workshop, August 1987.

     [2]   J. Mogul, C. Kent, C. Partridge and K. McCloghrie.
           IP MTU Discovery Options.
           RFC 1063, SRI Network Information Center, July 1988.

     [3]   R. Braden, ed.
           Requirements for Internet Hosts -- Communication Layers.
           RFC 1122, SRI Network Information Center, October 1989.

     [4]   S. Deering.
           IP and ICMP Extensions for MTU Discovery.
           Draft Memo, October 1989.