Re: Transport requirements for DNS-like protocols

Rob Austein <> Fri, 28 June 2002 19:45 UTC

Date: Fri, 28 Jun 2002 15:34:04 -0400
From: Rob Austein <>
Subject: Re: Transport requirements for DNS-like protocols

At Fri, 28 Jun 2002 09:47:02 -0400, Michael Mealling wrote:
> On Thu, Jun 27, 2002 at 11:41:33PM -0400, Rob Austein wrote:
> > b) The idempotence of normal DNS queries, and the relatively small
> >    amount of work that a DNS server has to do in order to process a
> >    normal query.
> Which is something any designer of a Layer 2 service should keep in mind.

Yes.  As others have suggested, the key word here is "relative": that
is, what one has to look at is the cost of re-running the query versus
the full cost of buffering the answer for retransmission.  My notion
of the "full" cost of retransmission is probably a bit skewed from
what most people think about, due to having spent most of a decade
writing embedded protocol stacks designed to run on bare silicon
(kernel?  we don't need no stinkeen kernel...).
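To make that trade concrete, here is a minimal sketch (the names are hypothetical, not from any real resolver library) of why idempotence favors the stateless side: the server can forget each response the instant it is sent, because the client's whole "reliability layer" is just a retry loop.

```python
def resolve(query, transport, retries=3):
    """Retry an idempotent query instead of buffering its answer.

    `transport` stands in for one UDP round trip: it returns the
    answer, or None if the (simulated) network dropped something.
    Re-running the query is cheap; holding answers for
    retransmission would cost the server state per client.
    """
    for _ in range(retries):
        answer = transport(query)
        if answer is not None:
            return answer
    raise TimeoutError("no answer after %d tries" % retries)

# A deliberately flaky transport that drops the first attempt:
drops = iter([None, "10.0.0.1"])
print(resolve("www.example.com A", lambda q: next(drops)))
```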

> > c) The relatively low probability that any particular DNS response
> >    message will be dropped by the network.
> Is this due to its size? I.e. can that be re-written as:
>  c) The relatively low probability that any particular 512 byte UDP packet
>     will be dropped by the network.

I suspect that the community lacks consensus on whether one should
count by packets or count by bytes.  No doubt the real answer is
somewhere in between.  I tend to count by packets on the theory that
router packet buffers and CPU time are more often a bottleneck than
the number of bits one can shove down a pipe, but that raises the
question of whether one should build that assumption into protocols.

> Ok, this and the rest of the paragraphs assume IP fragmentation. The
> question I have is this: since there is no congestion control at the
> IP layer for fragments, and IP fragmentation seems to be acceptable, 
> why not do it at the application layer? Limit the packet size to the
> minimum MTU for that network and don't do retransmission of packets.
> I'm not a transport guru but it seems that all of the congestion
> control issues come about because of packet retransmission. If you
> just don't _do_ packet level retransmission then congestion control
> becomes a non-issue.

Multiple IP packets sent all at once == congestion issues, even if
those multiple packets are really just fragments of a single packet.
Sorry.  This is where I ran into trouble with my NETBLT-like
suggestion a few years ago.

I think the theory behind preferring IP level fragmentation over
having the application do the same thing is that the latter guarantees
fragmentation while the former only risks it.  That is, even in the
absence of PMTU discovery, there is a chance that a larger than
minimum IP packet might make it through the net unfragmented.
Note that the EDNS0 size mechanism is really just a statement by the
client to the server that the client would like the server to risk
sending a reply up to size N.
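On the wire, that "risk up to size N" statement is a single pseudo-RR; here is a minimal sketch of building it per the RFC 2671 OPT format (the 4096-byte default below is an arbitrary choice for illustration):

```python
import struct

def edns0_opt(udp_payload_size=4096):
    """Build the 11-byte EDNS0 OPT pseudo-RR (RFC 2671 wire format).

    NAME  = 0x00 (the root),
    TYPE  = 41 (OPT),
    CLASS = the UDP payload size the client is willing to risk,
    TTL   = extended RCODE / version / flags (all zero here),
    RDLEN = 0 (no variable options attached).
    """
    return struct.pack("!BHHIH", 0, 41, udp_payload_size, 0, 0)
```

The client appends this record to the additional section of its query; everything except the CLASS field is fixed scaffolding.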

Also note that (in IPv4) fragmentation can happen anywhere along the
path.  Thus, if (count by packets) congestion is a problem near the
server but one can keep the local MTU high on the links nearest the
server, one can defer fragmentation until the response packet is
closer to its destination.  IPv6 fragmentation doesn't work this way
(only the sender may fragment); I haven't yet figured out whether
that's a serious problem.

Finally, note that there are really two different magic numbers that
often get confused: the MTU, and the reassembly buffer size.  If the
network isn't dropping fragments, the MTU is (almost) irrelevant; what
matters is whether the receiver can reassemble the frags upon receipt.
If the network is lossy, however, and if the lossiness is somehow
proportional to the number of packets in flight, MTU issues become
much more important.
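To put numbers on the packets-in-flight point, here is a rough fragment count (ignoring IP options); since losing any one fragment loses the whole datagram, the loss exposure scales with this count:

```python
def ipv4_fragments(udp_payload, path_mtu):
    """Count the IPv4 fragments one UDP datagram turns into.

    Each fragment repeats the 20-byte IP header, and the data a
    non-final fragment carries must be a multiple of 8 bytes,
    because fragment offsets are expressed in 8-byte units.
    """
    datagram = 8 + udp_payload            # UDP header + payload
    per_frag = (path_mtu - 20) // 8 * 8   # usable bytes per fragment
    return -(-datagram // per_frag)       # ceiling division
```

For example, a 1400-byte UDP reply over a 576-byte path MTU becomes three fragments, while the same reply fits a 1500-byte MTU in one piece.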

If you haven't already done so, I recommend reading the Jacobson &
Karels 1988 SIGCOMM paper, "Congestion Avoidance and Control":

Figuring out how it applies to things like IP fragmentation takes some
work, but the basic congestion issues almost certainly apply, even if
the specific techniques for overcoming them don't.
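For readers who haven't seen it, the core of that paper boils down to an additive-increase / multiplicative-decrease window; this toy sketch (my simplification, not code from the paper) shows the shape of it:

```python
def aimd(events, cwnd=1.0, ssthresh=16.0):
    """Toy AIMD congestion window, in the spirit of Jacobson &
    Karels (1988).

    `events` is a sequence of "ack" or "loss" signals; returns the
    window size after each event.
    """
    trace = []
    for ev in events:
        if ev == "loss":
            ssthresh = max(cwnd / 2.0, 1.0)  # multiplicative decrease
            cwnd = ssthresh
        elif cwnd < ssthresh:
            cwnd += 1.0                      # slow start: +1 per ack
        else:
            cwnd += 1.0 / cwnd               # additive increase
        trace.append(cwnd)
    return trace
```

Mapping this onto a burst of IP fragments is exactly the "takes some work" part: there are no per-fragment acks to drive the increase.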

> In other words, instead of negotiating packet size in the first packet,
> have it hardwired to 512 bytes but send multiple packets with
> simple sequence numbers so the application can just piece 'em back
> together. If it fails you can retry the query via TCP (or UDP
> if you feel it might work the second time). But you never ask for individual 
> packets to be retransmitted, and you never ACK. 
> Its the same network profile as IP fragmentation but with a much
> better chance of success at greater than 1500 byte message sizes.

Having just recently had to explain to a bunch of nontechnical folks
that the magic number thirteen in the sentence "the thirteen root name
servers" derives, ultimately, from the hardwired 512 byte message size
specified in RFC 1035, you will understand that I would prefer not to
repeat this particular mistake (I'd rather make new ones...).
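For the curious, a back-of-envelope version of that explanation (my byte counts are approximate and assume DNS name compression throughout) shows how the 512-byte ceiling caps a priming response at roughly a dozen servers:

```python
HEADER   = 12             # DNS message header
QUESTION = 5              # root name (1) + QTYPE (2) + QCLASS (2)
NS_FIXED = 11             # owner "." + type, class, TTL, RDLENGTH
FIRST_NS = NS_FIXED + 20  # "a.root-servers.net" spelled out once
NEXT_NS  = NS_FIXED + 4   # one-letter label + compression pointer
A_RR     = 16             # compressed owner (2) + fixed fields + IPv4 (4)

def priming_size(n):
    """Rough size of a root priming response naming n servers."""
    return HEADER + QUESTION + FIRST_NS + A_RR + (n - 1) * (NEXT_NS + A_RR)
```

By this naive count thirteen servers land comfortably under 512 bytes and one or two more could squeeze in, but not many; the point is that the ceiling, not any property of the root zone, sets the number.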

> > In the long term, if we ever get a real white pages protocol and
> > people stop caring about having cute DNS names, we're still going to
> > need an underlying system for associating long-term identifiers with
> > IP addresses.  At a technical level, such a system will probably look
> > an awful lot like the DNS, but perhaps it'll be sufficently
> > decentralized that a reliable transport protocol wouldn't be such a
> > burden on the servers.
> Awfully prescient!

Can't take full credit; I came to this conclusion by standing on the
toes of others (including John Klensin).  Sadly, all of us who have
been thinking this way for much of the last decade have found that it
is much easier to predict that water which is demonstrably running
downhill will eventually reach bottom than it is to build a pump.