Minutes of MTU Discovery Working Group Meeting (7 Feb 1990)

mogul (Jeffrey Mogul) Mon, 19 February 1990 21:08 UTC

Received: by acetes.pa.dec.com (5.54.5/4.7.34) id AA16668; Mon, 19 Feb 90 13:08:39 PST
Date: Mon, 19 Feb 1990 13:08:39 -0800
From: mogul
Message-Id: <9002192108.AA16668@acetes.pa.dec.com>
To: mtudwg
Subject: Minutes of MTU Discovery Working Group Meeting (7 Feb 1990)

	MTU Discovery Working Group
	Chairperson: Jeffrey Mogul/DECWRL

	CURRENT MEETING REPORT
	7 February 1990
	Held at the IETF meeting in Tallahassee, Florida
	Reported by Jeffrey Mogul

	AGENDA

		a) Report on current draft (McCloghrie/Fox/Mogul)
		b) Review other alternatives
		c) Review goals and assumptions
		d) Obtain consensus on approach
		e) Focus on details
		f) What next?

	ATTENDEES

	Art Berggreen		art@salt.acc.com
	Noel Chiappa		jnc@PTT.LCS.MIT.EDU
	Farokh Deboo		sun!iruucp!ntrlink!fjd
	Rich Fox		sytek!rfox@sun.com
	Keith McCloghrie	sytek!kzm@hplabs.HP.COM
	Jeff Mogul		mogul@decwrl.dec.com
	Nuggehalli Pradeep	pradeep@orville.nas.nasa.gov
	James VanBokkelen	jbvb@ftp.com
	Tony Mason		mason@transarc.com
	Drew Perkins		ddp@andrew.cmu.edu
	John Moy		jmoy@proteon.com
	David Paul Zimmerman	dpz@convex.com
	James R. Davin		jrd@ptt.lcs.mit.edu
	Bill Melohn		melohn@eng.sun.com
	Richard Bosch		probe@mit.edu
	Michael Petry		petry@trantor.umd.edu
	Ron Broersma		ron@nosc.mil
	Mark Rosenstein		mar@mit.edu
	Ballard Bare		bare%hprnd@hplabs.hp.com
	John Veizades		veizades@apple.com
	Tony Staw		staw@marvin.enet.dec.com
	John Cavanaugh		John.Cavanaugh@StPaul.ncr.com
	John M. Wobus		JMWobus@suvm.acs.syr.edu
	Steve Willis		swillis@wellfleet.com
	Van Jacobson		van@lbl-csam.arpa
	Mike Karels		karels@berkeley.edu
	Mike Marcinkericz	mdm@gumby.dsd.trw.com

	[Some of these people did not sign the roster, but were obviously
	there.  Some other people who didn't sign the roster have slipped
	my memory.]

	MINUTES

	This was the second meeting of the MTU Discovery Working Group.

	We started with a quick presentation by Keith McCloghrie of
	the draft that he and Rich Fox wrote based on the apparent
	consensus of the December meeting.  Some attendees had not
	read the draft, and we tried to ensure that everyone understood
	the basic outline.  [Summary: senders occasionally attach an
	IP PMTU-Query Option to their datagrams.  Routers update the
	PMTU value in the option; the last-hop router returns the PMTU
	to the sender using the ICMP Path-MTU message.  If the destination
	host detects a change in the MTU (when a fragment is received),
	it sends an ICMP Unexpected Fragment Report message.]
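
	As a minimal sketch of the router-side update summarized
	above: the value carried in the option behaves like a running
	minimum over the link MTUs along the path.  The function name,
	the assumption that the sender seeds the option with its
	first-hop MTU, and the sample MTU values are all hypothetical;
	the draft itself defines the real encoding and rules.

```python
def path_mtu_from_query(first_hop_mtu, link_mtus):
    """Return the PMTU value the option would carry at the last-hop router.

    first_hop_mtu: MTU of the sender's own interface (assumed initial value).
    link_mtus: MTUs of the links each router forwards onto, in path order.
    """
    pmtu = first_hop_mtu
    for mtu in link_mtus:
        # Each router lowers the value if its outgoing link is smaller.
        pmtu = min(pmtu, mtu)
    return pmtu


# Hypothetical path: Ethernet, a 1006-byte link, then Ethernet again.
print(path_mtu_from_query(1500, [1500, 1006, 1500]))  # prints 1006
```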
	
	We also reviewed the "Steve Deering" proposal from last year,
	as there was a realization that it might not be dead, after all.
	Among other things, we now know that there are not 1 but 4 spare
	bits in the IP header (there are 3 unused in the TOS field), and
	that the powers that be might therefore be likely to let us use
	one.  [Summary of Deering proposal: senders often send datagrams
	with "RF" (Report Fragmentation) bit set in the IP header.  A
	host receiving fragment-0 of a datagram with RF set sends an
	ICMP Fragmentation Occurred message.]
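
	The receiver-side rule in the Deering proposal can be sketched
	roughly as follows.  The IPHeader class and its field names are
	hypothetical stand-ins for a parsed IP header; "fragment-0" is
	taken here to mean a datagram with More Fragments set and a
	fragment offset of zero.

```python
from dataclasses import dataclass


@dataclass
class IPHeader:
    rf: bool               # proposed "Report Fragmentation" bit (hypothetical field)
    more_fragments: bool   # standard MF flag
    fragment_offset: int   # in 8-byte units; 0 for the first fragment


def should_report_fragmentation(hdr):
    """True if the host should send an ICMP Fragmentation Occurred message."""
    is_fragment_zero = hdr.more_fragments and hdr.fragment_offset == 0
    # Only the first fragment triggers a report, and only if the sender asked.
    return hdr.rf and is_fragment_zero
```

	Reporting only on fragment-0 keeps the cost to at most one
	ICMP message per fragmented datagram.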

	We then started a fairly unstructured discussion comparing the
	costs and benefits of the two approaches.
	
	    (1) Lifetime of protocol: on the one hand, in principle
	    MTU discovery should be obviated by the coming revolution
	    in routing protocols.  Within "a few" years, the routing
	    protocols will provide path-MTU information, so MTU discovery
	    will be unnecessary.  Of course, we all know about things
	    that are supposed to happen "real soon now"; we particularly
	    all know about relatively new things that "everyone" implements.
	    Still, while avoiding the trap of assuming that the world
	    will be perfect in just a couple of years, it may not be
	    worth trying to solve the problem of MTU discovery for all
	    time, since it may not be useful for that long.

	    (2) Rapidity of deployment:  Clearly, MTU discovery of any
	    form only works for a sender if some subset of the other
	    nodes (routers and/or destinations) support it.  Query-based
	    schemes depend upon support from a large fraction of the
	    routers; RF-style schemes only help if a large fraction of
	    the end-hosts support it.  There was some debate about
	    which population is more likely to upgrade soon (routers or
	    end-hosts).  No consensus was reached.

	    (3) Connection lifetimes:  Van's data suggest that most
	    non-local TCP connections are short (ca. 4 datagrams).
	    This makes some sense (mostly SMTP) although this is only
	    one sample point, and we agreed that more data would be
	    useful.  Van argued that this works against a query-based
	    scheme, since by the time one has useful information,
	    there's not much left to do with it.  His argument in favor
	    of the RF scheme was that the right way to use it is to
	    assume that you can send large datagrams (sized by your
	    first-hop MTU, or perhaps some estimate of the NSFNET PMTU,
	    ca. 1500), and let the destination tell you if you are
	    screwing up.

	    In general, we realize that fragmentation is not inherently
	    evil.  Although it might create some extra overhead for the
	    routers, what we really have to avoid is the "deterministic
	    fragment loss" problem which causes connections to stall.
	    Thus (I hope I am correctly paraphrasing Van's argument),
	    MTU discovery is only worth doing for connections that last
	    a while,
	    either because they are carrying lots of data, or because
	    they are stalled due to fragment loss.  Query-based schemes
	    waste router resources because processing IP options is
	    expensive, and the payoff is unlikely.

	    It was argued that, since the senders cache the MTU values
	    learned by either scheme in the per-host routing entries,
	    querying would not have to be done on every connection to
	    be useful.  Again, Van drew on his traffic studies to
	    suggest that (even over a 12-hour period) there was
	    generally little correlation between connections ... that
	    is, just because one pair of hosts makes a connection does
	    not mean that they will do so again any time soon.  Some of us
	    did not believe that is necessarily true (for example, how
	    much traffic comes from mail-hub machines like DECWRL and
	    UUNET?)  Again, we agreed that it would be nice to have
	    more traffic data available.

	    (4) Complexity: Now that the draft specification for the
	    query-based scheme is done, we realized that it is a lot
	    more complex than we thought.  One problem is the number
	    of tunable parameters.  Since the RF scheme doesn't require
	    the receiver to maintain any state about the sender [actually,
	    this is not quite true, as noted later], doesn't require
	    the sender to schedule when to send the option, doesn't
	    cause the receiver to send notifications when intentional
	    fragmentation occurs [NFS would probably not set RF], and
	    it requires no support at all from the routers, it appears
	    to be simpler [but keep reading].

	After this discussion, it was pretty clear that the consensus
	had shifted to trying to use the RF scheme.  We made the assumption
	that we could get a header bit (Van argued that although the
	RF scheme could be done using an option, the cost/benefit
	analysis might be against it).  The next step was to explore
	how well that would really work.
	
	One problem that came up right away is that James VanBokkelen
	believes there to exist many PC-based systems that
	    (1) do not reassemble fragments
	    (2) do advertise MSS values of 1500 to non-local peers
	Currently, these hosts function because the 576-if-nonlocal
	rule observed by most non-PC hosts means that, given today's
	Internet, even when they advertise an MTU of 1500 to a non-local
	host, the host at the other end will not send datagrams big
	enough to be fragmented.  [I suppose it is unlikely for two
	PCs to talk to each other over long distances.]   However, if
	we use the simplest RF scheme, these hosts are going to get
	fragmented datagrams.  Since we assume that any host which
	implements MTU discovery is also in conformance with the other
	rules (specifically, fragmentation reassembly), we therefore
	know that such sub-standard PCs won't send the ICMP Fragmentation
	Occurred message, and these connections would stall.
	
	The obvious fix is to not invoke MTU discovery (i.e., not send
	segments > 576 bytes) unless you are sure that the other end
	supports it.  This means that you have to have seen a datagram
	with RF set coming back to you from the destination before
	you can send large datagrams.
	
	More subtly, since we don't want to mislead these stupid PCs
	(which apparently don't follow the 576-byte rule in either
	direction) you cannot even send an MSS > 576 to a non-local
	peer until you have seen an RF bit from it.  Thus, since the
	TCP MSS option can only be sent on the SYN datagram, a host
	initiating a TCP connection may not be able to use MTU discovery
	(and large segments) unless it has talked with the other end
	recently.  (The second host is in a better position; since it
	sees the RF bit before it has to send its own MSS option, it
	can set a large MSS immediately.  This is nice for FTP retrieves;
	it doesn't help for SMTP, alas).
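
	The conservative rule described above might be sketched as
	follows.  The function and parameter names are hypothetical,
	and the sketch assumes minimal 20-byte IP and 20-byte TCP
	headers, so that a 576-byte datagram corresponds to an MSS
	of 536.

```python
HEADER_BYTES = 40                             # assumed: 20-byte IP + 20-byte TCP
DEFAULT_NONLOCAL_MSS = 576 - HEADER_BYTES     # 536


def choose_mss(peer_is_local, seen_rf_from_peer, first_hop_mtu):
    """Pick the MSS to advertise on a SYN under the handshake rule."""
    large_mss = first_hop_mtu - HEADER_BYTES
    if peer_is_local or seen_rf_from_peer:
        # Safe to advertise a large MSS: the peer is on our network,
        # or it has shown (via the RF bit) that it supports MTU discovery.
        return large_mss
    # Otherwise stay within the 576-byte non-local default.
    return min(large_mss, DEFAULT_NONLOCAL_MSS)


print(choose_mss(False, False, 1500))  # prints 536
print(choose_mss(False, True, 1500))   # prints 1460
```

	Under this rule the initiator of a connection falls back to
	the small MSS unless a recent exchange has already recorded
	an RF bit from the peer.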
	
	The consensus was that this limitation was acceptable, since
	it erred on the conservative side.  (Although it penalizes
	the most common connection type [SMTP], SMTP connections are
	normally short, so we wouldn't gain much anyway.)
	When two connections are made in quick succession, things work
	nicely (e.g., several mail messages, or the control connection
	of an FTP session followed by the data connection.  The control
	connection will seldom carry large segments, but the exchange
	of RF bits done then will allow the data connection to use
	large segments right away.)
	
	Mike Karels proposed (off-the-cuff, not necessarily believing
	that it was right) that routers fragmenting a datagram with
	RF set could also send the fragmentation-occurred ICMP.  This
	seemed to create problems given the requirement for handshaking
	imposed by the broken-PC crowd, so Mike agreed to go off and
	think about this one.
	
	One question arose about the use of a previously unused bit in
	the IP header: what would current implementations do if they
	see it set?  (We know that we can safely add options, since
	by definition these are ignored if not known.)  While the IP
	spec says these bits must be zero, the "robustness principle"
	implies that routers and hosts should ignore them.  Unfortunately,
	John Moy from Proteon admitted that Proteon routers drop such
	datagrams, and Noel Chiappa says that this is true of other
	implementations based on his old MIT "C-gateway" code.  We have
	to find out just how bad this is going to be; perhaps Proteon
	will be able to upgrade all of its customers before MTU discovery
	is widely implemented.
	
	[Side note: Clearly, implementations contrary to the basic
	IP spec are causing us serious grief.  How much do we twist
	the protocol to accommodate them?]
    
	An orthogonal issue is that in high-speed long-distance
	networks, there might be lots of packets in flight when the
	route changes to one with a lower MTU (e.g., on a satellite
	link with a half-second RTT, 4-Kbyte packets, and a 100 Mbit/sec
	channel, this means about 1500 packets per RTT!)  Since the source
	cannot react to a Fragmentation Occurred message sooner than one
	RTT worth of packets after the one that triggered the message,
	we are concerned that setting the RF bit on every packet could
	lead to positive (i.e., anti-stability) feedback in a network
	that is losing capacity.
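
	The packets-per-RTT figure in the satellite example can be
	checked with a short calculation.  This assumes the packet
	size is 4 kilobytes of 1024 bytes each; the function name is
	hypothetical.

```python
def packets_per_rtt(rtt_seconds, bits_per_second, packet_bytes):
    """Number of packets the channel can hold in one round-trip time."""
    bits_in_flight = rtt_seconds * bits_per_second
    return bits_in_flight / (packet_bytes * 8)


# Half-second RTT, 100 Mbit/sec channel, 4096-byte packets.
print(round(packets_per_rtt(0.5, 100e6, 4096)))  # prints 1526
```

	With 4000-byte packets the result is about 1560, so roughly
	1500 packets per RTT holds under either reading.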
	
	This could be attacked in two ways: limit the rate at which the
	RF bit is sent, or limit the rate at which the ICMP is sent.
	The former could be done "once per RTT", once per some constant
	time period, or perhaps once per window.  It's not clear if
	there is a convenient way of marking out the boundaries
	between windows.
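
	As one illustration of the second option (limiting the rate
	at which the ICMP is sent), a receiver could suppress reports
	to a source that was notified recently.  This is only a
	sketch: the class name, the per-source table, and the choice
	of a fixed interval are assumptions, not something the group
	agreed on.

```python
class IcmpRateLimiter:
    """Allow at most one report per source per fixed interval."""

    def __init__(self, min_interval_seconds):
        self.min_interval = min_interval_seconds
        self.last_sent = {}  # source address -> time of last report

    def allow(self, source, now):
        """Return True if a report to `source` may be sent at time `now`."""
        last = self.last_sent.get(source)
        if last is not None and now - last < self.min_interval:
            return False  # reported too recently; stay quiet
        self.last_sent[source] = now
        return True


limiter = IcmpRateLimiter(0.5)
print(limiter.allow("10.0.0.1", 0.0))  # prints True
print(limiter.allow("10.0.0.1", 0.2))  # prints False
```

	Note that a limiter of this kind requires the receiver to
	keep some state about each sender.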

	ACTION ITEMS
	
	(1) Noel Chiappa and Van Jacobson were assigned to try to
	get the IESG to free up an IP header bit.
	
	(2) Mike Karels was going to think more about having routers
	send ICMPs when they fragment.
	
	(3) We need to determine how many routers will drop packets
	with RF set, and how hard it will be to fix this.  Is it any
	different if we use one of the bits in the TOS area?
	
	(4) Ditto for end-hosts; are there any that drop such packets?
	
	(5) The Router Requirements WG was known to be considering
	changing the way that fragmentation was done (fragment into
	equal-size pieces; currently, routers are supposed to send
	N maximal-size fragments and one smaller one).  This would
	make the RF scheme nearly useless. [Phil Almquist says that
	the RRWG will work with us on this, so it shouldn't be a problem].
	
	(6) Perhaps more traffic studies would be useful.
	
	(7) Someone has to write the next draft.  Keith and Rich were
	thanked for their hard work on their draft, which is now tabled,
	and were not coerced into starting a different document.  Since
	Van was the fiercest proponent of RF at the meeting, he was given
	responsibility to see to it that the draft is written.  He agreed
	but said he was going to try to get Steve Deering to do the work
	(Steve was absent due to serious thesis time-pressure, so maybe
	Van is going to be stuck with it.)  The chair requested a draft
	within one month (7 March 1990).
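
	Regarding action item (5) above, the sketch below compares
	the two fragmentation styles.  It assumes a 20-byte IP header
	with no options, and it assumes (the minutes do not spell this
	out) that the concern is that the size of fragment-0 would no
	longer indicate the constricting MTU.  Function names and the
	sample sizes are hypothetical.

```python
def fragment_sizes_maximal(payload_len, mtu, header_len=20):
    """Payload sizes when sending N maximal fragments and one smaller one."""
    per_frag = (mtu - header_len) // 8 * 8  # payload must be a multiple of 8
    sizes = []
    remaining = payload_len
    while remaining > per_frag:
        sizes.append(per_frag)
        remaining -= per_frag
    sizes.append(remaining)
    return sizes


def fragment_sizes_equal(payload_len, mtu, header_len=20):
    """Payload sizes when splitting into (nearly) equal-size pieces."""
    per_frag_max = (mtu - header_len) // 8 * 8
    count = -(-payload_len // per_frag_max)   # ceiling division
    per_frag = -(-payload_len // count)       # even share, rounded up
    per_frag = -(-per_frag // 8) * 8          # round up to a multiple of 8
    sizes = [per_frag] * (count - 1)
    sizes.append(payload_len - per_frag * (count - 1))
    return sizes


# A 1500-byte datagram (1480 bytes of payload) crossing a 1006-byte link.
print(fragment_sizes_maximal(1480, 1006))  # prints [984, 496]
print(fragment_sizes_equal(1480, 1006))    # prints [744, 736]
```

	In the first case fragment-0 is 1004 bytes with its header,
	within 8 bytes of the link MTU; in the second it is 764 bytes,
	which says little about the MTU.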

	IESG ACTION

	On Thursday, February 8, at the open IESG meeting, the IESG
	was asked to allow a spare IP header bit to be used for MTU discovery.
	I was not there, but I understand that the IESG is willing
	to release this bit if we come to a consensus on a protocol
	that they think is reasonable.

	SCHEDULE

	We expect to meet again at the May IETF meeting.

	At that point, we will probably either adopt one of the
	schemes, or give up.