Re: [tsvwg] UDP-Options: UDP has two "maximums"

Paul Vixie <> Sun, 04 April 2021 03:31 UTC

Date: Sun, 4 Apr 2021 03:31:28 +0000
From: Paul Vixie <>
To: Joseph Touch <>
Cc: Gorry Fairhurst <>, "" <>
Message-ID: <>
References: <> <> <> <> <> <> <> <> <> <>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <>
Archived-At: <>
Subject: Re: [tsvwg] UDP-Options: UDP has two "maximums"
List-Id: Transport Area Working Group <>

On Sat, Apr 03, 2021 at 06:49:46PM -0700, Joseph Touch wrote:
> > On Apr 3, 2021, at 6:29 PM, Paul Vixie <> wrote:
> > 
> > you're implicitly positing a situation wherein a DNS speaker could know
> > that the far endpoint knew about UDP options and could reassemble, and
> > that the local endpoint (kernel) knows about UDP options and could
> > fragment, and that using UDP Fragmentation would be seen as a better
> > choice than leaving out optional data or signalling a need to retry
> > with TCP.
> Yes. Note though that the fragmentation in UDP can be used safely;
> legacy endpoints just see (at most) packets with zero data.

i wasn't considering it unsafe. some kind of initiation signal would
be required though, or else the initiator would see a small set of
apparently empty UDP payloads coming back, and wonder why.

> > i worry about microbursts. ...
> > 
> > what we've learned from NFS, and high volume authoritative UDP DNS,
> > is that the network doesn't love minimum-spaced back-to-back packets,
> > and that if an 8KiB NFS result gets chopped into ~1500B chunks, tail
> > drop is likely. this is the biggest source of operator pain from IP
> > fragmentation, fwiw.
> Although I appreciate this concern, TCP does the same kind of bursts --

TCP has a congestion window that commonly keeps burst size within "range".
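to make the contrast concrete, here is a minimal sketch (names and the
initial-window value are illustrative, not from this thread) of how a
congestion window bounds a TCP sender's burst, where an unpaced UDP sender
has no such bound:

```python
# Illustrative sketch: a TCP sender's back-to-back burst is capped by
# cwnd minus bytes already in flight; a naive UDP sender just emits the
# whole message at minimum spacing.

MSS = 1460          # bytes per segment (typical Ethernet-derived MSS)
cwnd = 10 * MSS     # e.g. an initial window of 10 segments (RFC 6928)

def max_burst_bytes(cwnd, in_flight):
    """A TCP sender may emit at most cwnd - in_flight bytes back to back."""
    return max(0, cwnd - in_flight)

# with nothing in flight, the burst is limited to the window:
print(max_burst_bytes(cwnd, 0))   # 14600
# an 8 KiB UDP message, by contrast, goes out in one unpaced burst:
print(8 * 1024)                   # 8192
```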

> I had thought we knew about this long enough that vendors didn't use
> tail drop; they should have been doing AQM or at least something akin
> to RED.

in routers and endpoints, yes. in switches, no. to a switch (multiport
bridge), the problem is felt too late and too near to the copper. in a fan-in
topology there can be too many gozinta for the gozouta, and this doesn't even
depend on link-layer flow control or whether it works, just ten gallons of
water trying to fit into a five gallon hat.
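the fan-in arithmetic above can be sketched in a few lines; the port count,
line rate, and burst duration here are assumptions chosen only to show the
shape of the problem:

```python
# Back-of-envelope sketch of the fan-in ("gozinta/gozouta") problem:
# N ingress ports can each deliver at line rate toward one egress port,
# so during a microburst the egress queue must absorb the excess.

LINE_RATE_BPS = 10e9     # 10 Gb/s ports (assumed)
N_INGRESS = 4            # four senders converge on one receiver (assumed)
BURST_SECONDS = 100e-6   # a 100 microsecond microburst (assumed)

arriving = N_INGRESS * LINE_RATE_BPS * BURST_SECONDS / 8   # bytes in
departing = 1 * LINE_RATE_BPS * BURST_SECONDS / 8          # bytes out
backlog = arriving - departing

# bytes the egress queue must buffer -- or drop, tail-first:
print(int(backlog))   # 375000
```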

by the time a switch knows this is happening, the ethernet chip has already
DMA'd the contents of the queue. the driver's only recourse may be to reset
the chip to empty the queue, then re-enter the things not-wanted-lost, which
may already have been garbage collected because their refcount went to zero
after the last DMA. so, the things that aren't sent are those which came
last.

but it doesn't matter whether it's tail drop or RED; what's important is
that the loss won't be noticed until timeout and retransmission. in NFS,
most of us just moved over to TCP, but those still using UDP know to set an
explicit rsize/wsize that is comfortably smaller than the PMTU. (if you
wondered why modern switches allow 9100B jumbo frames, it's likely NFS.)
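a worked example of why rsize/wsize is kept under the PMTU and why jumbo
frames help (header sizes here are rough approximations, not exact):

```python
# Sketch: how many IPv4 fragments one 8 KiB UDP datagram generates at a
# given MTU. A train of back-to-back fragments is what tail drop eats;
# losing any one of them loses the whole datagram.

import math

def fragments(payload, mtu, overhead=28):
    """Approximate fragment count for one UDP datagram of `payload` bytes.

    `overhead` roughly covers IPv4 + UDP headers; real fragmentation
    differs slightly (only the first fragment carries the UDP header).
    """
    per_frag = mtu - overhead
    return math.ceil(payload / per_frag)

print(fragments(8192, 1500))   # 6  -- a six-packet burst per NFS read
print(fragments(8192, 9000))   # 1  -- jumbo frames: no fragment train
```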

> A few of us at ISI explored ways to adjust TCP slow start restart to
> avoid this issue too:

good times.

> > further digression: the framer of messages (like TCP, or DNS, or NFS)
> > ought to know the PMTU, which is why PMTUD was originally a non-optional
> > feature of IPv6 until we learned that ICMPv6 was dangerous as hell and
> > threw out PMTUD, thus leaving us with the pessimal and never-expected-
> > to-be-used 1280 and 1232 numbers. if we can get PLPMTUD then we can make
> > IPv6 better than IPv4 in terms of header amortization rather than (as it
> > currently is) worse.
> Agreed; that's aided in UDP with options as per Gorry's draft.

if we implement PLPMTUD/UDP for DNS, we're going to have a decades-long
period during which the far end doesn't understand UDP options, and so
we will have to do fancy guesswork based on transaction timeouts, just as
we did for EDNS in the first 19 years. i hope these time scales are clear.
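the "fancy guesswork" might look something like the following sketch; the
probe callback, candidate sizes, and caching policy are all hypothetical,
in the spirit of how EDNS buffer-size fallback was done:

```python
# Hypothetical sketch of timeout-driven size probing against a peer whose
# UDP-options support is unknown: try growing sizes, stop at first timeout.

def probe_max_size(send_probe, candidates=(1232, 1400, 1472, 4096)):
    """Return the largest candidate size that elicits a timely reply.

    `send_probe(size)` is an assumed callback: True on a timely reply,
    False on timeout. Real code would cache the result per destination
    and re-probe occasionally, since paths change.
    """
    best = None
    for size in candidates:   # small to large
        if send_probe(size):
            best = size
        else:
            break             # larger probes will likely be dropped too
    return best

# e.g. a path that silently drops anything above 1400 bytes:
print(probe_max_size(lambda s: s <= 1400))   # 1400
```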

the original IPng PMTUD model whereby the endpoint's routing table would
remember a discovered MTU for each endpoint was a shinier city than this
on a better hill. i hope to see this outcome in the PLPMTUD world, so that
(for example) TCP, QUIC, NFS, and UDP can all set their MSS accordingly,
without each service having to do its own discovery work per endpoint.
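the shared-table idea can be sketched as below; the class, expiry policy,
and numbers are hypothetical, meant only to show one cache consulted by
every transport instead of per-service discovery:

```python
# Sketch of a shared per-destination PMTU cache: TCP, QUIC, NFS, and UDP
# all look up the same learned value rather than each probing on its own.

import time

class PMTUCache:
    def __init__(self, default=1280, ttl=600):
        self.default = default   # IPv6 minimum MTU as the safe floor
        self.ttl = ttl           # seconds before a learned value goes stale
        self._table = {}         # destination -> (pmtu, learned_at)

    def learn(self, dst, pmtu):
        self._table[dst] = (pmtu, time.monotonic())

    def lookup(self, dst):
        entry = self._table.get(dst)
        if entry is None:
            return self.default
        pmtu, when = entry
        if time.monotonic() - when > self.ttl:
            del self._table[dst]   # stale: fall back to the floor
            return self.default
        return pmtu

cache = PMTUCache()
cache.learn("2001:db8::1", 8952)     # discovered once, e.g. via PLPMTUD
print(cache.lookup("2001:db8::1"))   # 8952 -- every transport reuses it
print(cache.lookup("2001:db8::2"))   # 1280 -- unknown path, safe default
```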

Paul Vixie