yet another MTU discovery scheme

Steve Deering <deering@pescadero.stanford.edu> Sun, 25 February 1990 06:58 UTC

Received: from decwrl.dec.com by acetes.pa.dec.com (5.54.5/4.7.34) id AA21472; Sat, 24 Feb 90 22:58:01 PST
Received: by decwrl.dec.com; id AA21625; Sat, 24 Feb 90 22:57:58 -0800
Received: by Pescadero.Stanford.EDU (5.59/25-eef) id AA13610; Sat, 24 Feb 90 22:57:53 PDT
Date: Sat, 24 Feb 1990 20:51:00 -0000
From: Steve Deering <deering@pescadero.stanford.edu>
Subject: yet another MTU discovery scheme
To: mtudwg
Message-Id: <90/02/24

My original RF-bit proposal assumes that it's OK for fragmentation and
reassembly to occur once in a while, and uses intentional fragmentation
to learn path-MTUs.

If it is decided that fragmentation (actually, reassembly) is to be
avoided at all costs, in order not to be vulnerable to Identifier
wraparound at high speeds, my proposal could be modified to require
that fragmented packets be discarded at the destination rather than
reassembled (assuming they have the RF bit set and ICMP Fragment
Received messages are sent back to the source).  Compared to the
original proposal, this has the drawback of wasting one round-trip
time to learn the MTU of those paths whose MTU is less than
MIN(first-hop-MTU, MSS).  The time is "wasted" in the sense that
the packets that are used to discover the path-MTU must be thrown
away by the receiver.

Let me try out another idea -- actually, a variation on an old idea.

The old idea is to have senders detect fragmentation by setting the
Don't Fragment (DF) bit on all packets, reducing their packet size
if they receive "ICMP Destination Unreachable: Fragmentation Needed
and DF Set" messages (I'll call them "Can't Fragment" messages, for
short).  The problem with this approach is that the Can't Fragment
messages do not tell the sender what the MTU of the next hop network
is, so it may take several retransmissions to learn the MTU (perhaps
even doing a binary search for the right packet size).

The variation I am proposing is to have the gateway that generates the
Can't Fragment message include the recommended MTU (that is, the MTU
of the next hop) in the 32-bit "unused" field of the Can't Fragment
message.  (I'll call that field the "Recommended MTU field".)
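
To make the message layout concrete, here is a minimal sketch of how a gateway might fill in the Recommended MTU field and how a sender might read it back. The type and code values (3 and 4) are the standard ones for "Fragmentation Needed and DF Set". Treating the whole 32-bit unused field as one big-endian integer is an assumption, since the encoding is not specified above. The function names are hypothetical.

```python
import struct

ICMP_DEST_UNREACH = 3   # ICMP type: Destination Unreachable
ICMP_FRAG_NEEDED = 4    # ICMP code: Fragmentation Needed and DF Set


def internet_checksum(data: bytes) -> int:
    """16-bit one's-complement checksum as used by ICMP."""
    if len(data) % 2:
        data += b"\x00"                      # pad to a whole number of 16-bit words
    total = 0
    for (word,) in struct.iter_unpack("!H", data):
        total += word
    while total >> 16:                       # fold carries back into the low 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def build_cant_fragment(next_hop_mtu: int, offending: bytes) -> bytes:
    """Build a Can't Fragment message carrying a Recommended MTU.

    `offending` is the IP header plus the first 8 data bytes of the
    rejected packet.  ASSUMPTION: the MTU fills the whole 32-bit field.
    """
    header = struct.pack("!BBHI", ICMP_DEST_UNREACH, ICMP_FRAG_NEEDED,
                         0, next_hop_mtu)
    checksum = internet_checksum(header + offending)
    header = struct.pack("!BBHI", ICMP_DEST_UNREACH, ICMP_FRAG_NEEDED,
                         checksum, next_hop_mtu)
    return header + offending


def parse_recommended_mtu(message: bytes) -> int:
    """Return the Recommended MTU; 0 means an unmodified gateway."""
    icmp_type, code, _checksum, mtu = struct.unpack("!BBHI", message[:8])
    if (icmp_type, code) != (ICMP_DEST_UNREACH, ICMP_FRAG_NEEDED):
        raise ValueError("not a Can't Fragment message")
    return mtu
```

An unmodified gateway already sends zero in the unused field, so the same parser distinguishes old gateways from new ones without any extra flag.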

The sender behavior is as follows:

	- start by assuming that the path-MTU is MIN(first-hop-MTU, MSS),
	  and send all packets with the DF bit set.

	- if a Can't Fragment message is received with a non-zero
	  Recommended MTU value less than the currently-assumed
	  path-MTU, adopt the recommended MTU as the new assumed MTU,
	  and re-packetize and retransmit the packet identified by the
	  Can't Fragment message (at the transport layer).  Continue to
	  set the DF bit in all packets.

	- if a Can't Fragment message is received with a zero
	  Recommended MTU value, that means the gateway that generated
	  the message has not been upgraded to the new protocol.
	  If the currently-assumed path-MTU is greater than 576,
	  change it to 576.  Repacketize and retransmit the rejected
	  packet.  *Stop* setting the DF bit in outgoing packets.

	- at fairly large intervals (10 - 20 minutes?), reset the
	  assumed-MTU to MIN(first-hop-MTU,MSS) and start sending
	  DF bits again, in order to learn if the path-MTU has
	  increased.
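
The four rules above can be sketched as a small per-path state object. This is only an illustration under stated assumptions: the class and method names are hypothetical, the 15-minute interval is one arbitrary value inside the "10 - 20 minutes?" range, and a non-zero Recommended MTU that is not smaller than the current assumption is simply ignored, a case the rules do not cover.

```python
FALLBACK_MTU = 576             # conservative size used with unmodified gateways
PROBE_INTERVAL = 15 * 60.0     # seconds; ASSUMPTION, somewhere in 10 - 20 minutes


class PathMtuState:
    """Sender-side state for one destination (hypothetical sketch)."""

    def __init__(self, first_hop_mtu, mss, now=0.0):
        self.first_hop_mtu = first_hop_mtu
        self.mss = mss
        self.reset(now)

    def reset(self, now):
        # Rule 1 (and rule 4): assume MIN(first-hop-MTU, MSS), set DF.
        self.assumed_mtu = min(self.first_hop_mtu, self.mss)
        self.set_df = True
        self.last_reset = now

    def on_cant_fragment(self, recommended_mtu):
        """Return True if the rejected packet should be repacketized
        and retransmitted."""
        if recommended_mtu == 0:
            # Rule 3: unmodified gateway.  Fall back to 576 if we were
            # above it, and stop setting DF.
            if self.assumed_mtu > FALLBACK_MTU:
                self.assumed_mtu = FALLBACK_MTU
            self.set_df = False
            return True
        if recommended_mtu < self.assumed_mtu:
            # Rule 2: adopt the recommended MTU, keep setting DF.
            self.assumed_mtu = recommended_mtu
            return True
        return False               # ASSUMPTION: ignore anything else

    def on_timer(self, now):
        # Rule 4: periodically probe for an increased path-MTU.
        if now - self.last_reset >= PROBE_INTERVAL:
            self.reset(now)
```

For example, a sender with a 4352-byte first hop that receives a Can't Fragment message recommending 1500 ends up with assumed_mtu == 1500 and set_df still True; a later message with a zero field drops it to 576 and clears set_df until the next timer reset.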

No changes are required at the receiver, and gateways can be upgraded
gradually.  No new IP header bits are required.  If a path's MTU shrinks
at an unmodified gateway, the sender ends up reverting to the conservative
576 rule.  As the hosts and gateways are upgraded to use this strategy,
fragmentation is eliminated from the Internet (just as everyone is getting
fast enough to run into the Identifier-wraparound problem).

In many cases, setting the DF bit will *not* trigger any Can't Fragment
messages, now that the NSFnet backbone and most regionals support a 1500
byte MTU.  If FDDI ever gets off the ground, its 4K packets will trigger
Can't Fragment messages from the near-side gateway to any smaller-MTU
subnet, allowing the sender to learn the correct MTU (or 576) in less
than one RTT.  In those rare cases where the MTU shrinks at more than
one point in a path, it will require multiple retransmissions to learn
the path-MTU, but that doesn't seem too serious.
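
A toy simulation shows how the number of rejected packets grows with the number of places the MTU shrinks. Everything here is a hypothetical model, not a measurement: `hop_mtus` lists the MTU of each network after the first hop, and `upgraded[i]` says whether the gateway forwarding onto network i fills in the Recommended MTU field.

```python
def discover(hop_mtus, start_mtu, upgraded=None):
    """Return (final_size, rejections, df_still_set) for one path.

    Each rejection costs the sender one retransmission.
    """
    if upgraded is None:
        upgraded = [True] * len(hop_mtus)    # ASSUMPTION: all gateways upgraded
    size, rejections, df = start_mtu, 0, True
    while df:
        # Find the first network too small for the current packet size.
        blocked = next((i for i, mtu in enumerate(hop_mtus) if mtu < size), None)
        if blocked is None:
            break                            # packet gets through unfragmented
        rejections += 1
        if upgraded[blocked]:
            size = hop_mtus[blocked]         # adopt the Recommended MTU
        else:
            size = min(size, 576)            # zero field: fall back to 576
            df = False                       # and stop setting DF
    return size, rejections, df
```

With a 4352-byte first hop, discover([1500, 1006], 4352) gives (1006, 2, True): two rejections for two shrink points. If the first gateway is unmodified, discover([1500, 1006], 4352, [False, True]) gives (576, 1, False).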

Many of the implementation issues are the same as for the RF-bit scheme,
such as caching of path-MTUs at the IP layer in the sender, participation
of the transport layer in the sender, limiting the number of Can't Fragment
messages generated in response to pipes-full of packets with DF bit set, etc.

Possible problem: gateways that do not send ICMP Can't Fragment messages
when they should.  Are there any such gateways?

Now, I'm not renouncing my original RF-bit proposal (yet).  First, I'd
like to hear some opinions on whether or not the Identifier-wraparound
problem is something we should worry about.

Steve