MTU discovery considered harmful?

From: smb@ulysses.att.com
To: mtudwg
Subject: MTU discovery considered harmful?
Date: Fri, 20 Apr 1990 15:53:12 -0400

I know it's late in the game, but I'm becoming very concerned that
MTU discovery may be fundamentally a Bad Idea.  In particular, I
haven't seen any discussion of the relationship between MTU, window
size, and router hop counts; the latter aspect would again tend to
pull us towards tying in to OSPFIGP.  And without some changes, I
think we're going to be opening a can of worms.

Let's look at a not-so-absurd limiting case:  FDDI rings on the
LANs at both ends, and point-to-point links across a regional net.
FDDI uses a 4K MTU; serial lines, being HDLC, have more or less
arbitrary MTUs, and will likely be set to 4K once FDDI becomes
common.  Current TCPs (at least, many of them) have default window
sizes of 4K.  With a 4K MTU and a 4K window, only one segment can
ever be outstanding -- we've reduced sliding window to send-and-wait.

Even with 8K windows, we haven't helped much here -- the sender will
transmit two 4K packets right away, and then have to wait for the
first ACK; if delayed ACKs are used, we'll quite likely see just
one ACK for both packets.
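
In code form, a back-of-the-envelope sketch (Python; the window
and MTU values are just the defaults discussed above):

    def segments_in_flight(window_bytes, mtu_bytes):
        # How many full-size segments fit in the offered window.
        return window_bytes // mtu_bytes

    print(segments_in_flight(4096, 4096))   # 1: pure send-and-wait
    print(segments_in_flight(8192, 4096))   # 2: barely pipelined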

There's another issue as well:  serialization time on the links.
When a packet is being sent over a wire, there's a non-negligible
transmission time due to the clock speed of the link.  For example,
a DS0 link -- 56K bps -- has a serialization time of 1/7 msec/byte.
For 4K packets, that's 585 msecs just to clock the bits onto the
wire.  Since we're routing packets at the IP level, a gateway has
to accumulate the entire packet before it can retransmit it; thus,
we pay a 585 msec delay penalty for each DS0 hop.  (For DS1 speeds --
1.544M bps -- which are used on today's backbone, the cost is of
course less, about 21 msec for each transmission of a 4K packet.)
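
For anyone who wants to plug in other link speeds, the arithmetic
is trivial (a sketch; 4K here means 4096 bytes):

    def serialization_ms(packet_bytes, link_bps):
        # Time to clock one packet onto the wire, in milliseconds.
        return packet_bytes * 8 * 1000.0 / link_bps

    print(serialization_ms(4096, 56000))    # ~585 ms per DS0 hop
    print(serialization_ms(4096, 1544000))  # ~21 ms per DS1 hop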

Note what happens if that 4K packet is broken up into 4 1K chunks.
We still pay the full serialization price on the first link for all
4K bytes; however, each gateway can now hand a chunk off as soon
as its 1K has arrived.  We thus get overlapped transmissions -- while
the host (or rather, the first long-haul gateway) is still sending
the last packet, the first three are simultaneously being sent over
three other links.  The per-hop cost is therefore only for a 1K packet.
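
The same point in code; this sketch assumes four equal-speed links,
perfect pipelining, and no per-packet overhead, all simplifications:

    def transfer_ms(total_bytes, mtu_bytes, hops, link_bps):
        # Store-and-forward time over `hops` equal-speed links.
        per_packet = mtu_bytes * 8 * 1000.0 / link_bps
        npackets = -(-total_bytes // mtu_bytes)   # ceiling division
        # The first packet must cross every link; each later packet
        # adds just one more serialization time behind it.
        return per_packet * (hops + npackets - 1)

    print(transfer_ms(4096, 4096, 4, 56000))  # one 4K packet: ~2340 ms
    print(transfer_ms(4096, 1024, 4, 56000))  # four 1K packets: ~1024 ms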

To grossly oversimplify, to a (poor) first approximation the
optimum MTU size is the window size divided by the number of
hops, or at least the number of ``slow'' (a term I'll leave undefined)
hops.  That way, each router can be busy sending a packet simultaneously.
(I say that this is a poor approximation because of the considerable
per-packet overhead.  But don't overestimate that overhead; for a
router using slow lines, the serialization time dominates.  For
example, according to some measurements I've done recently, on a
Cisco router the fixed overhead is on the order of 2 ms, plus the
cost per byte -- and for a 40-byte minimum TCP packet on a DS0 line,
the per-byte cost alone comes to 5.7 ms.)
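
To put numbers on that parenthetical, a sketch assuming my 2 ms
figure and a DS0 output line:

    def router_cost_ms(packet_bytes, fixed_ms=2.0, link_bps=56000):
        # Fixed switching overhead plus serialization on the output line.
        return fixed_ms + packet_bytes * 8 * 1000.0 / link_bps

    print(router_cost_ms(40))    # ~7.7 ms for a minimum TCP packet
    print(router_cost_ms(1024))  # ~148 ms: serialization dominates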

The proposal in the draft RFC gives a good mechanism for calculating
the PMTU, but yields no information on the hop count.  Informal
looks at some non-random traceroutes suggest that typical connections
traverse at least 10 hops.  I'd say as a guess, without looking
at maps of the NSFNET backbone or any of the regional nets, that
we can assume 3 hops within a regional net to reach NSFNET, 2 or 3
hops on the backbone, and another 3 hops via the destination regional
net.  This would suggest that the maximum MTU should be approximately 1/8 of
the window size -- a number that's remarkably close to what we're
now using.  That said, has anyone done any throughput measurements
using a TCP that's been hacked to use, say, 1500 byte MTUs?
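
Plugging in my guesses (the 576-byte figure is the classic default
datagram size every host must accept):

    hops = 3 + 3 + 3       # regional + backbone + regional, guessed above
    window = 4096          # a common default TCP window today
    print(window // hops)  # ~455 bytes per slow hop
    print(window // 8)     # 512: close to the 576-byte default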

Our discussions over the last few months make it fairly obvious that
we can't rely on munging the routers to give us hopcount information
via a new path discovery mechanism.  But hosts can adjust their
window sizes.  Let me suggest, off the top of my head, two strategies.
First, a host can more or less reliably detect the other end's use
of Path MTU discovery by noting the arrival of TCP packets with
Don't Fragment set.  If a
host notices that PMTU is in use, it should increase the window size
for that connection by some factor, perhaps (if it knows) using its
own PMTU information as a guess about the other end's PMTU.  By the
same token, a host using PMTU should nevertheless restrict its
maximum effective PMTU to some fraction of the largest receive window
ever advertised to it.  (I realize I'm being TCP-specific here.)
What fraction should we use?  I suspect that a factor of 4 will
work, though it wouldn't hurt to try some experiments.  In today's
world, that means that on an all-Ethernet LAN (InterLAN?  CateLAN?),
the typical local situation, we'll see MTUs of 1K rather than 1500 --
a reduction that isn't serious.  All-FDDI locales will not be common
for a while; first penetration will be in the campus backbone market.
People who really want the 4K MTU in such situations can always
specify SUBNETSARELOCAL, thereby bypassing the whole process.
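
Very roughly, and purely as illustration (the names, the structure,
and the per-connection state are mine, not a proposal for any
particular implementation):

    from dataclasses import dataclass

    @dataclass
    class Conn:                  # hypothetical per-connection state
        recv_window: int = 4096
        max_window_seen: int = 4096
        peer_uses_pmtu: bool = False

    FACTOR = 4                   # the fraction suggested above

    def on_segment_received(conn, df_bit_set):
        # Heuristic 1: Don't Fragment on incoming TCP segments suggests
        # the other end is doing Path MTU discovery; open our window.
        if df_bit_set and not conn.peer_uses_pmtu:
            conn.peer_uses_pmtu = True
            conn.recv_window *= FACTOR

    def effective_pmtu(conn, discovered_pmtu):
        # Heuristic 2: cap our own PMTU at a fraction of the largest
        # receive window the peer has ever advertised to us.
        return min(discovered_pmtu, conn.max_window_seen // FACTOR)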

Comments?

		--Steve Bellovin