[Driu] Query distribution vs. query routing

Ted Hardie <ted.ietf@gmail.com> Thu, 19 July 2018 22:57 UTC

MIME-Version: 1.0
From: Ted Hardie <ted.ietf@gmail.com>
Date: Thu, 19 Jul 2018 18:57:02 -0400
Message-ID: <CA+9kkMB+yODJARiFqyestCRtX6Od-Ryz7qkjhWmn5ToqYBOLgA@mail.gmail.com>
To: driu@ietf.org
Content-Type: multipart/alternative; boundary="000000000000c507730571621a36"
Archived-At: <https://mailarchive.ietf.org/arch/msg/driu/mmKBircD2UfqZJXPB9XkYHKgVEQ>
Subject: [Driu] Query distribution vs. query routing
Precedence: list

Howdy,

The mic line interaction and related jabber traffic around Mark's draft
were interesting, and they nicely illustrated just how confused I can be.
Sorry for sharing that experience with the group.

Where I think it's useful to dive in a little further is in the distinction
between query distribution and query routing.

Let's start by drawing a parallel: a device on a LAN with a single router
has a default next hop destination; that router may also have a default
next hop destination.  In both cases, all traffic leaving one device will
go through the other (LAN device to LAN router; LAN router to default
upstream destination).  In the router case, it's pretty common for there to
be two different upstream possibilities, with the router making the choice
between upstream possibilities for packets as they arrive from the device.
That choice may be a simple algorithm that load balances between them, like
ECMP, or it may be made by consulting a mapping that tells it which
upstream is a better route to a specific destination network (that mapping
being derived from data distributed by some routing protocol).
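
As a rough illustration of those two kinds of choice, here is a minimal
Python sketch.  The upstream names, the prefixes, and the use of a CRC32
flow hash are all assumptions made up for the example; real ECMP
implementations and routing tables differ in many details.

```python
import ipaddress
import zlib

# Hypothetical upstream names, for illustration only.
UPSTREAMS = ["upstream-a", "upstream-b"]

# Hypothetical mapping of destination prefix -> preferred upstream, standing
# in for data that a routing protocol would normally distribute.
ROUTES = {
    ipaddress.ip_network("192.0.2.0/24"): "upstream-a",
    ipaddress.ip_network("198.51.100.0/24"): "upstream-b",
}


def pick_by_hash(src, dst):
    """ECMP-style choice: hash the flow so a given flow keeps one path."""
    key = f"{src}->{dst}".encode()
    return UPSTREAMS[zlib.crc32(key) % len(UPSTREAMS)]


def pick_by_route(dst):
    """Mapping-based choice: the longest matching prefix wins, if any."""
    addr = ipaddress.ip_address(dst)
    matches = [net for net in ROUTES if addr in net]
    if not matches:
        return None
    return ROUTES[max(matches, key=lambda net: net.prefixlen)]


def next_hop(src, dst):
    """Prefer a specific route; otherwise balance across the upstreams."""
    return pick_by_route(dst) or pick_by_hash(src, dst)
```

The point is only that the first function needs no knowledge about the
destination, while the second depends entirely on distributed mapping
data.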

At a hideously gross approximation, a host talking to a recursive resolver
is in a parallel situation.  It can have a single upstream resolver it
talks to, or it can have more than one.  If it has more than one, it may
direct queries to them as a load balancing function, or as a function of
the names one serves, usually because one is configured to serve the local
names in a split DNS situation.  That configuration is the simple routing
information in the parallel here.
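
The host side of the parallel can be sketched the same way.  This is a
minimal sketch, assuming a hypothetical local suffix and made-up resolver
addresses; it is not how any particular stub resolver is implemented.

```python
import itertools

# Hypothetical configuration: one resolver serves the local names in a
# split DNS setup, the others are interchangeable defaults.
LOCAL_SUFFIXES = {"corp.example.": "10.0.0.53"}
DEFAULT_RESOLVERS = ["192.0.2.53", "198.51.100.53"]
_round_robin = itertools.cycle(DEFAULT_RESOLVERS)


def normalize(name):
    """Lower-case the name and make sure it ends with a single dot."""
    return name.lower().rstrip(".") + "."


def choose_resolver(qname):
    """Use the configured resolver for local names, else load balance."""
    name = normalize(qname)
    for suffix, resolver in LOCAL_SUFFIXES.items():
        # Match on whole labels so "notcorp.example." is not treated as local.
        if name == suffix or name.endswith("." + suffix):
            return resolver
    return next(_round_robin)
```

Here the `LOCAL_SUFFIXES` table is the "simple routing information":
static configuration on the host rather than anything learned from an
upstream.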

What I heard in the jabber room as explanatory text was that this proposal
suggests allowing the default upstream server to name other default
upstream servers, which could share the load of the query stream.  There
would be benefits to the upstreams in load-shedding and benefits to the
device in reducing the information revealed to any particular upstream.  I
have some serious concerns there about whether the default upstream servers
it names could be scoped to avoid this becoming a DoS vector, but even
putting that aside, I think the incentive is poor.  The local device
already has a known good upstream with an established TCP/TLS/HTTP session
going, and in a lot of cases it's going to want to avoid the latency of
establishing new sessions just to load balance.

What I originally thought was in the proposal was closer to distributing
routing information.  There, the server isn't distributing new default
servers, it is distributing information about servers that are suited to
answering queries about specific names.  The choice to use them, in other
words, wouldn't be based on load balancing, but on query routing--sending a
query to the best available query responder from a latency or authority
perspective.  There's a whole bunch of work around securing routing
information that we'd have to import if we went down this path.  Down this
path, split DNS becomes fragmented DNS, in which you may find some answers
are "reachable" and some are not.

I can also read the document to be recommending a hybrid mode, in which
every DoH responder agrees to be a default query path but still tells you
which names it is well suited to resolve (giving you a situation like that
in which a router gets multiple routing announcements that include both
specific networks and a default route).  There are a bunch of interesting
possibilities there, both in optimizations and attacks.  Ultimately,
though, the incentive works only when this actually results in quicker
connection setups and when both the DoH client and DoH responder care a
great deal about that timing.
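
To make the hybrid reading concrete, here is a minimal sketch of how a
client might pick among responders that each announce a default plus some
specific names.  The responder names, the announcement format, and the
"most specific match wins" rule are my assumptions for illustration; the
draft does not define any of them.

```python
# Hypothetical announcements: each responder lists the name suffixes it
# claims to be well suited for; "." plays the role of a default route.
ANNOUNCEMENTS = {
    "doh-a.example": [".", "cdn-a.example."],
    "doh-b.example": [".", "video.cdn-b.example.", "cdn-b.example."],
}


def labels(name):
    """Split a DNS name into lower-case labels; the root has none."""
    name = name.lower().rstrip(".")
    return name.split(".") if name else []


def match_length(qname, suffix):
    """Return the label count of suffix if it covers qname, else -1."""
    q, s = labels(qname), labels(suffix)
    if len(s) > len(q):
        return -1
    if s and q[-len(s):] != s:
        return -1
    return len(s)


def candidates(qname):
    """Responders holding the most specific announcement for qname."""
    best = {
        responder: max(match_length(qname, s) for s in suffixes)
        for responder, suffixes in ANNOUNCEMENTS.items()
    }
    top = max(best.values())
    return sorted(r for r, n in best.items() if n == top)
```

With these tables, `candidates("www.cdn-a.example")` yields only
`doh-a.example`, while a name nobody announces specifically falls back to
every responder that offered a default.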

What I think that turns out to mean, and this is from a very dusty crystal
ball, is that a DoH responder will have to maintain a very robust DNS
recursive resolver infrastructure if it agrees to be a default query path
in order to attract query traffic bound for its network.  That means a big
cache, good connections to other DNS services, the lot.  If it doesn't, its
responses to "default upstream" queries will be slower than those from
other potential services, and a DNS client will eventually cease using it
(presuming it is still optimizing by query response time).  If it still
gets any queries, it will be along the "fragmented DNS"
query model, where it gets only queries where the latency of its answers
for specific networks is very important.
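
The dynamic I have in mind can be sketched with a client that keeps a
smoothed response-time estimate per responder and sends default queries
to the fastest one.  The class name, the smoothing factor, and the
responder labels are hypothetical; I am not claiming any client works
exactly this way.

```python
class LatencyTracker:
    """Tracks a smoothed response time per responder (illustrative only)."""

    def __init__(self, responders, alpha=0.2):
        self.alpha = alpha  # weight given to the newest sample
        self.estimate = {r: None for r in responders}

    def record(self, responder, seconds):
        """Fold a new response-time sample into the running estimate."""
        old = self.estimate[responder]
        if old is None:
            self.estimate[responder] = seconds
        else:
            self.estimate[responder] = (1 - self.alpha) * old + self.alpha * seconds

    def pick_default(self):
        """Try unmeasured responders first, then the lowest estimate."""
        for responder, est in self.estimate.items():
            if est is None:
                return responder
        return min(self.estimate, key=self.estimate.get)
```

Once every responder has been measured, this sketch never re-probes a
slow one, so a responder with a small cache simply stops receiving
default queries.  A client that keeps testing would behave less
abruptly, but the pull toward the biggest infrastructure is the same.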

I think those later in the line that were pointing out the risks of
consolidation were right, in other words, and maybe didn't even go far
enough.  I think the end game of this model is that the user has no control
over where the queries go and the heuristic system underneath them ends up
sending them to the sites willing to offer the highest number of names (more
specific routes) and the biggest DNS query infrastructure.  That's going to
land everything behind a few CDNs, unless I miss my guess.

I'm sure I've made several additional howlers in this note, for which I
apologize early,

Ted
