Re: [DNSOP] Benjamin Kaduk's No Objection on draft-ietf-dnsop-serve-stale-09: (with COMMENT)

Dave Lawrence <tale@dd.org> Thu, 05 December 2019 19:59 UTC

Message-ID: <24041.24980.589811.102928@gro.dd.org>
Date: Thu, 05 Dec 2019 14:59:16 -0500
From: Dave Lawrence <tale@dd.org>
To: The IESG <iesg@ietf.org>, draft-ietf-dnsop-serve-stale@ietf.org, dnsop-chairs@ietf.org, dnsop@ietf.org
In-Reply-To: <157550657421.11138.9701797814145174644.idtracker@ietfa.amsl.com>
References: <157550657421.11138.9701797814145174644.idtracker@ietfa.amsl.com>
Archived-At: <https://mailarchive.ietf.org/arch/msg/dnsop/8wXqir9sUUhGikYBm_X8A4S_uJo>

Thank you very much for your review, Ben.

Benjamin Kaduk via Datatracker writes:
>    For a comprehensive treatment of DNS terms, please see [RFC8499].
> 
> (side note: I myself would not use the word "comprehensive" when it
> explicitly says that "some DNS-related terms are interpreted quite
> differently by different DNS experts", but I understand why it is used
> here.)

Would "For a glossary of ..." work better?  Slightly simpler text then too.

>    There are a number of reasons why an authoritative server may become
>    unreachable, including Denial of Service (DoS) attacks, network
>    issues, and so on.  If a recursive server is unable to contact the
>    authoritative servers for a query but still has relevant data that
> 
> side note: the way this is worded might make a reader wonder if the
> recursive is expected to attempt to contact all known authoritatives
> before declaring failure.

It's slightly complicated.  I'd say the answer to "expected to" is yes
in the general sense, but the devil is in the details.  Most
implementations basically do try all authorities of the set they're
working with, but also have safeguards against getting bogged down, like
what is later described as the query resolution timer.  This keeps
them from being tarpitted on an NS RRset that provides many
authorities that all have some sort of aberrant behavior, whether that
be unreachability or super slow TCP or whatever.  This isn't just
theoretical either, but observed in the wild.

Based on that I'd be reluctant to add either "all" into there, or to
fully describe the situation.
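
The safeguard described above can be sketched roughly as follows.
This is only an illustration under stated assumptions, not how any
particular resolver does it: `try_authorities`, the `query` callback,
and the injectable `clock` are hypothetical names, and the query
resolution timer is modeled as one overall deadline.

```python
import time
from typing import Callable, Iterable, Optional


def try_authorities(
    servers: Iterable[str],
    query: Callable[[str, float], Optional[str]],
    resolution_timeout: float,
    clock: Callable[[], float] = time.monotonic,
) -> Optional[str]:
    """Ask each authority in turn, but never run past the overall deadline.

    `query(server, remaining)` is a hypothetical callback that returns an
    answer, or None when the server is unreachable, slow, or broken.
    """
    deadline = clock() + resolution_timeout
    for server in servers:
        remaining = deadline - clock()
        if remaining <= 0:
            break  # query resolution timer expired; stop trying more servers
        answer = query(server, remaining)
        if answer is not None:
            return answer
    return None  # no usable answer within the time budget
```

With a long NS RRset of misbehaving servers, the loop stops as soon as
the budget is spent instead of visiting every entry.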

>    Several recursive resolver operators, including Akamai, currently use
>    stale data for answers in some way.  A number of recursive resolver
> 
> I did not follow the discussions that led to this wording, but one of my
> colleagues at Akamai suggested that "currently fall back to stale data
> for answers under some circumstances" might be a nicer wording, though I
> note that Adam has already proposed some text here as well, which is
> probably fine.

Per Adam's review I'd already stricken the mention of Akamai's
operations here.  In the earliest versions of the draft it kind of fit
in more naturally as being one member of a set of other explicitly
named providers, but the latter were implementations rather than
specific operations and subsequent edits separated them out.
Ultimately the call-out to a specific operator (and in particular, the
one I worked for at the time) was more meaningful background when
the draft was first introduced than is necessary to have in there now.

> I recommend using "[this document]" in the section references, since a
> reader reading the updated content in the context of RFC 1035 might look
> there instead of here.

Yes, I had also updated this per Adam's review to have the RFC Editor
adjust accordingly.

> side note: I'm slightly surprised that the semantics of the absence of
> Recursion Desired are not more tightly nailed down, but neither is it the
> role of this document to specify them.

We're barely scratching the surface about what might surprise you that
isn't more tightly nailed down in the DNS. Did I ever tell you about
that time in DoH where we argued for metaphorical days over what TC
really means? 

>    When no authorities are able to be reached during a resolution
>    attempt, the resolver should attempt to refresh the delegation and
>    restart the iterative lookup process with the remaining time on the
>    query resolution timer.  This resumption should be done only once
>    during one resolution effort.
> 
> Is the "during one" more like a global cap or more like "during a
> given"?

I confess I don't quite understand the distinction you're drawing, so
perhaps a concrete example helps.  The idea to be communicated is that
if you're trying to resolve foo.example.com but the servers in your
current NS RRset for example.com are all having issues (timeout,
SERVFAIL, whatever), then you should try to refresh the example.com NS
RRset and, if any new servers are learned, go ahead and try them.  If
they fail too, just bail.
How does that square with your thoughts about a global cap?
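
A minimal sketch of that flow, assuming hypothetical helpers: `ask`
stands in for querying one server, `refresh_ns` for re-fetching the
delegation, and the query resolution timer is left out for brevity.

```python
from typing import Callable, List, Optional


def resolve_with_one_refresh(
    name: str,
    cached_ns: List[str],
    refresh_ns: Callable[[], List[str]],
    ask: Callable[[str, str], Optional[str]],
) -> Optional[str]:
    """Try the cached authorities, then refresh the delegation at most once."""
    tried = set()
    for server in cached_ns:
        tried.add(server)
        answer = ask(server, name)
        if answer is not None:
            return answer

    # Every known authority failed: refresh the NS RRset exactly once.
    for server in refresh_ns():
        if server in tried:
            continue  # only newly learned servers are worth trying
        answer = ask(server, name)
        if answer is not None:
            return answer

    return None  # the new servers failed too, so bail
```

The refresh happens once per call, which is the sense of "only once
during one resolution effort" in this sketch.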

>    The client response timer is another variable which deserves
>    consideration.  If this value is too short, there exists the risk
>    that stale answers may be used even when the authoritative server is
>    actually reachable but slow; this may result in sub-optimal answers
>    being returned.  Conversely, waiting too long will negatively impact
>    user experience.
> 
> Not just sub-optimal but potentially even wrong or actively harmful
> answers, no?

True, that's in the bounds of possibility and not well-encapsulated by
the understated "sub-optimal".  s/sub-optimal/undesirable/?
Undesirable does a better job of incorporating wrongness but is
still subtle.  Or would you prefer s/sub-optimal/sub-optimal, wrong,
or even actively harmful/?  Is the latter not covered adequately by
the Security Considerations section?

Realistically there's not a special significance here with regard to
the client response timer vs the whole idea of serve stale in general.
If someone is trying to aggressively attack through this mechanism
then knowing the specific setting of this timer at a resolver would be
only a very minor consideration.  
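
For readers less familiar with the timer in question, the trade-off
can be illustrated with a small sketch.  Everything here is an
assumption made for illustration: `answer_client`, `fetch_fresh`, and
the returned `(answer, is_stale)` pair are hypothetical, and a real
resolver would also bound the background work with the query
resolution timer.

```python
import asyncio
from typing import Awaitable, Callable, Optional, Tuple


async def answer_client(
    fetch_fresh: Callable[[], Awaitable[str]],
    stale_answer: Optional[str],
    client_response_timeout: float,
) -> Tuple[str, bool]:
    """Return (answer, is_stale), preferring fresh data if it arrives in time."""
    task = asyncio.create_task(fetch_fresh())
    try:
        # shield() keeps the lookup running even if the wait times out.
        fresh = await asyncio.wait_for(asyncio.shield(task), client_response_timeout)
        return fresh, False
    except asyncio.TimeoutError:
        if stale_answer is not None:
            return stale_answer, True  # authority may merely be slow
        fresh = await task  # nothing stale to offer, so keep waiting
        return fresh, False
```

A very small timeout makes the stale branch fire even when the
authority is reachable but slow; a large one delays every client
during an outage.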

>    The balance for the failure recheck timer is responsiveness in
>    detecting the renewed availability of authorities versus the extra
>    resource use for resolution.  [...] If this variable is too small,
>    authoritative servers may be rapidly hit with a significant amount of
>    traffic when they become reachable again.
> 
> I think part of the concern is also that setting the value too small
> will cause additional traffic towards the authoritative even while it is
> nonresponsive/nonreachable, which could aggravate any DoS attack ongoing
> against the authoritative.  Which is to say, that perhaps "became
> reachable again" does not quite reflect the full set of considerations.

I agree that is also a risk.  Should it be updated to just, "If this
variable is too small, authoritative servers may be targeted with a
significant amount of excess traffic"?  That would encompass all sorts
of availability situations.
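
One way to picture the failure recheck timer is as a per-name hold-down
after a failed refresh.  The sketch below is an assumption-laden
illustration only: the `FailureRecheck` class, its method names, and
the 30-second value in the usage note are hypothetical.

```python
import time
from typing import Callable, Dict


class FailureRecheck:
    """Suppress new upstream attempts for a name until the timer elapses."""

    def __init__(self, recheck_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.recheck_seconds = recheck_seconds
        self.clock = clock
        self._failed_at: Dict[str, float] = {}

    def record_failure(self, name: str) -> None:
        # Remember when resolution for this name last failed.
        self._failed_at[name] = self.clock()

    def record_success(self, name: str) -> None:
        # A successful refresh clears the hold-down.
        self._failed_at.pop(name, None)

    def may_query(self, name: str) -> bool:
        failed_at = self._failed_at.get(name)
        if failed_at is None:
            return True
        return self.clock() - failed_at >= self.recheck_seconds
```

With `FailureRecheck(30.0)`, a resolver would send at most one refresh
attempt per name every 30 seconds while the authorities stay down.  A
much smaller value lets excess traffic through both during the outage
and at recovery.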

>    There's also no record of TTLs in the wild having the most
>    significant bit set in DNS-OARC's "Day in the Life" samples.  With no
> 
> Should we have a reference for DNS-OARC's samples?

Suresh suggested a reference to
http://www.caida.org/projects/ditl/
or there could also be one to
https://www.dns-oarc.net/oarc/data/ditl
but the actual samples are only available to members.

>    Be aware that Canonical Name (CNAME) and DNAME [RFC6672] records
>    mingled in the expired cache with other records at the same owner
>    name can cause surprising results. [...]
>
> I'm not sure to what extent the lesson from this scenario is limited to
> "CNAME/DNAME are special" versus "when serving stale, serve the
> least-stale you have".

That's a fair point, but I'm not sure how to incorporate anything
about it.  CNAME/DNAME were called out precisely because they are special.

>    Details of Apple's implementation are not currently known.
> 
> I'm amenable to the other reviewer's comment that this section might be
> interesting to keep, RFC 6982 notwithstanding, in which case this might
> be more appropriately worded as "publicly disclosed" -- one assumes that
> the Apple employees that wrote it know what it does!

That's a fair point.  I'm a little hesitant about "publicly disclosed"
though only because while denotatively true it carries a bit of a
connotation that they're hiding something.  In the repo this sentence
currently reads, "Apple's system resolvers are also known to use
stale answers, but the details are not currently known," but that
doesn't really address your remark.

>    The most obvious security issue is the increased likelihood of DNSSEC
>    validation failures when using stale data because signatures could be
>    returned outside their validity period.  Stale negative records can
> 
> We seem to be carefully not giving explicit guidance about using "stale"
> DNSSEC keys in addition to stale resolution records.  If the
> consequences of potentially using expired key material are more severe
> than the consequences of potentially using expired DNS records (as it
> seems to me), perhaps we should explicitly reiterate that serve-stale is
> not an excuse to ignore key validity periods (as we are implicitly doing
> here)?

Hrm, I think this is pretty clearly giving deference to the key
validity period.  Would you like to propose a specific text change?
