Re: What's your favorite MTU?

mogul (Jeffrey Mogul) Fri, 13 April 1990 23:08 UTC

Received: by acetes.pa.dec.com (5.54.5/4.7.34) id AA08345; Fri, 13 Apr 90 16:08:28 PDT
From: mogul (Jeffrey Mogul)
Message-Id: <9004132308.AA08345@acetes.pa.dec.com>
Date: 13 Apr 1990 1508-PST (Friday)
To: fab%saturn.ACC.COM@salt.acc.com (Fred Bohle acc_gnsc)
Cc: mtudwg
Subject: Re: What's your favorite MTU?
In-Reply-To: fab%saturn.ACC.COM@salt.acc.com (Fred Bohle acc_gnsc) / Fri, 13 Apr 90 18:15:54 EST. <9004132315.AA02188@saturn.acc.com>

    1. How about a simple binary search to determine the MTU size?
    Looking at the list of real MTU sizes, they are very close to
    a fit to a simple halving to come up with an MTU size which
    will not fragment, e.g.:
	    64000 -> 32000 -> 16000 -> 8000 -> 4000 -> 2000 -> 1000
	    -> 500 -> 250 -> 125 -> 62
    
    This converges in the above sequence in 10 steps for the worst
    case, Hyperchannel to an undefined minimum MTU network.  

The nice thing about binary searches is that they converge quickly
when N is large.  The nasty thing about them is that they converge
poorly when N isn't so large.  I want the common case to converge
quickly, not the weirdest one; FDDI to 802.3 would take a binary
search
	4352 -> 2176 -> 1088 -> 1632 -> 1360
4 steps before you even get close, and because 802.3 has the
unhappy luck to have an MTU just below the next search point,
you need
	1360 -> 1496 -> 1428
before you get within 5% of the right number.  My scheme, on
the other hand, gets the exact answer in 2 steps, and isn't
more than 0.5% wrong if there is an Ethernet rather than an 802.3
network there.
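
Just to make the step counts concrete, here is a throwaway simulation
of that bisection (the loop shape and the stop-when-within-5% test are
only my illustration of the search, not a proposed algorithm):

#include <stdio.h>

#define FIRST_HOP_MTU 4352      /* FDDI */
#define PATH_MTU      1492      /* the 802.3 bottleneck, unknown to the sender */

int main(void)
{
    int lo = 0;                 /* largest size known to get through */
    int hi = FIRST_HOP_MTU;     /* upper end of the search */
    int probe = FIRST_HOP_MTU;  /* first probe is the interface MTU */
    int steps = 0;

    for (;;) {
        steps++;
        printf("probe %d: %4d %s\n", steps, probe,
               probe <= PATH_MTU ? "fits" : "too big");
        if (probe <= PATH_MTU)
            lo = probe;
        else
            hi = probe;
        if (lo > 0 && lo * 105 / 100 >= hi)
            break;              /* bracket has narrowed to within ~5% */
        probe = (lo > 0) ? (lo + hi) / 2 : hi / 2;
    }
    printf("settled on %d after %d probes (real path MTU %d)\n",
           lo, steps, PATH_MTU);
    return 0;
}

Run as-is it prints exactly the 4352 -> 2176 -> 1088 -> 1632 -> 1360 ->
1496 -> 1428 sequence above, seven probes in all.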

I'll grant that you can throw in fixes for FDDI and Ethernet,
but then you are beginning to admit that using knowledge about
the problem domain is better than using a "clean" but knowledge-
free algorithm (I thought the AI people had learned this lesson,
until the boom in neural nets started).  Why not admit that
using a linear search over likely data values converges just as
fast in the worst case, and much faster in the likely cases?
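
For concreteness, here is roughly what I mean by a linear search over
likely values; the table below is purely illustrative (choosing the
right set of values is exactly the problem-domain knowledge I'm arguing
for), not something I'm proposing for the draft:

#include <stdio.h>

static const int likely_mtu[] = {
    65535,      /* Hyperchannel / maximum IP datagram */
    17914,      /* 16Mb token ring */
    8166,       /* 802.4 */
    4352,       /* FDDI */
    2002,       /* 4Mb token ring */
    1492,       /* 802.3 */
    1006,       /* SLIP, ARPANET */
    508,        /* conservative catch-all */
};

/* Next smaller "likely" MTU, used after a Can't Fragment report. */
static int next_lower_mtu(int current)
{
    int i;

    for (i = 0; i < (int)(sizeof(likely_mtu) / sizeof(likely_mtu[0])); i++)
        if (likely_mtu[i] < current)
            return likely_mtu[i];
    return 68;                  /* IP minimum; never go below this */
}

int main(void)
{
    int path_mtu = 1492;        /* 802.3 bottleneck, unknown to the sender */
    int est = 4352;             /* start at the FDDI interface MTU */
    int steps = 0;

    while (est > path_mtu) {    /* each oversized probe draws an ICMP error */
        est = next_lower_mtu(est);
        steps++;
    }
    printf("estimate %d after %d step(s)\n", est, steps);
    return 0;
}

With that table the FDDI-to-802.3 case settles on 1492 in the two steps
claimed above, and an Ethernet in place of the 802.3 takes the same two
steps and leaves the estimate a fraction of a percent low.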

Remember also that at each step in the convergence you are going to
throw some confusion into the TCP slow-start algorithm, an effect
that I don't entirely understand but about which I'd rather
not tempt fate.

    2. One objection to a simple convergence on a value for minimum
    MTU was the lack of notification when a larger MTU became
    available (due to a gateway coming back on line somewhere).
    The discussion on this list has lost track of this line of thinking.

Partly because we are discussing this without the benefit of a draft
RFC to refer to.  (I intend to fix that one way or another within a
week.)  Steve Deering and I have always understood that there could
be some sort of timeout on path MTUs learned this way; our model is
that after (order of) 10 minutes one would simply start from scratch,
i.e. set the Path MTU estimate back to the local interface MTU and
see what happens.  (As it turns out, at least one popular routing daemon
for Unix systems trashes routes older than 6 minutes, anyway.)
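
The aging itself is about as simple as it sounds; a sketch of what I
have in mind (my own illustration, with order-of-magnitude constants):

#include <time.h>

#define PMTU_LIFETIME (10 * 60)         /* seconds; order of magnitude only */

struct pmtu_entry {
    int    est;                 /* current path MTU estimate */
    time_t learned;             /* when the estimate was last lowered */
};

/* Return the estimate to use now, restarting from the local interface
 * MTU if the old estimate has gone stale. */
int current_pmtu(struct pmtu_entry *pe, int if_mtu, time_t now)
{
    if (now - pe->learned > PMTU_LIFETIME) {
        pe->est = if_mtu;       /* back to square one */
        pe->learned = now;      /* and see what happens */
    }
    return pe->est;
}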

    I suggest "probing" for a larger MTU after some amount of time/data
    has passed.  In TCP a measure based on round trip times or
    some number of windows seems reasonable.  Probably multiples
    of those, like maybe 100 RTT's or 10 windows. (Suggestions
    anyone?)

Trying to do this in terms of multiples of the RTT seems too complicated
to me.  If you do it too often, you can get confused by flapping routes;
if the routes are flapping, you might as well stick to the low side
on MTU estimates, since this is the approach most likely to make some
small progress.

    3. Probing would continue the binary search, only in an upward direction,
    remembering the last value which failed, and the last value which
    worked without fragmentation.  Average them to get the size for
    the probe packet.  We could do something tricky here
    to avoid holding up the data transfer for too long.  Send the
    probe packet with "Don't Fragment" set, and if we get an ICMP
    Can't Fragment message, retransmit with DF turned off. 

Among other things, this would make the host implementation changes
a LOT more complicated.  I don't think I realized, back when we
were considering Keith McCloughrie's and Rich Fox's draft, just how
much complexity would have to be added to BSD-based TCPs to play this
kind of game.  The DF-based scheme has the advantage that there is
almost nothing "tricky" going on, especially nothing involving timers
(except for purging things after 10 minutes).
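
To show how little machinery that is, here is a self-contained sketch
of the DF-based side; the next_hop_mtu argument stands in for the field
I assume the new ICMP message format would carry, and the fallback
table is the same illustrative one as before:

#include <stdio.h>

static int pmtu_est = 4352;     /* per-destination estimate; starts at the
                                 * first-hop interface MTU */

/* Same idea as the earlier table walk, abbreviated. */
static int next_lower_mtu(int current)
{
    static const int likely[] = { 4352, 2002, 1492, 1006, 508, 68 };
    int i;

    for (i = 0; i < (int)(sizeof(likely) / sizeof(likely[0])); i++)
        if (likely[i] < current)
            return likely[i];
    return 68;
}

/* Called when an ICMP "fragmentation needed and DF set" arrives.
 * next_hop_mtu is zero if the gateway is an old one that doesn't
 * fill the field in. */
static void icmp_cant_fragment(int next_hop_mtu)
{
    if (next_hop_mtu > 0 && next_hop_mtu < pmtu_est)
        pmtu_est = next_hop_mtu;                /* believe the gateway */
    else
        pmtu_est = next_lower_mtu(pmtu_est);    /* guess downward */
    /* TCP's normal retransmission resends the dropped segment;
     * no new timer is started here. */
}

int main(void)
{
    icmp_cant_fragment(1492);   /* a new-format gateway names the MTU */
    printf("estimate now %d\n", pmtu_est);
    icmp_cant_fragment(0);      /* an old gateway: fall back to the table */
    printf("estimate now %d\n", pmtu_est);
    return 0;
}

The only per-connection state is the estimate itself; everything else
rides on machinery TCP already has.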
    
    Well, it has been a long day, and I can't think any more.  Let
    me know where we take this idea from here.  I still think we are
    on the right track with the DF bit, since it does not need a new
    bit in the IP header.  Just having the new format of the ICMP
    message seems to do it.
    
Here I agree with you, but in part because I see (after having implemented
it) how simple it is.

-Jeff