Re: [DNSOP] Benjamin Kaduk's No Objection on draft-ietf-dnsop-serve-stale-09: (with COMMENT)

Benjamin Kaduk <kaduk@mit.edu> Thu, 05 December 2019 23:02 UTC

Date: Thu, 05 Dec 2019 15:01:56 -0800
From: Benjamin Kaduk <kaduk@mit.edu>
To: Dave Lawrence <tale@dd.org>
Cc: The IESG <iesg@ietf.org>, draft-ietf-dnsop-serve-stale@ietf.org, dnsop-chairs@ietf.org, dnsop@ietf.org
Message-ID: <20191205230156.GK13890@kduck.mit.edu>
References: <157550657421.11138.9701797814145174644.idtracker@ietfa.amsl.com> <24041.24980.589811.102928@gro.dd.org>
MIME-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Disposition: inline
In-Reply-To: <24041.24980.589811.102928@gro.dd.org>
User-Agent: Mutt/1.12.1 (2019-06-15)
Archived-At: <https://mailarchive.ietf.org/arch/msg/dnsop/-1kqUUrWikZ8qIDM3D34GHxnhcQ>
Subject: Re: [DNSOP] Benjamin Kaduk's No Objection on draft-ietf-dnsop-serve-stale-09: (with COMMENT)
Precedence: list

On Thu, Dec 05, 2019 at 02:59:16PM -0500, Dave Lawrence wrote:
> Thank you very much for your review, Ben.
> 
> Benjamin Kaduk via Datatracker writes:
> >    For a comprehensive treatment of DNS terms, please see [RFC8499].
> > 
> > (side note: I myself would not use the word "comprehensive" when it
> > explicitly says that "some DNS-related terms are interpreted quite
> > differently by different DNS experts", but I understand why it is used
> > here.)
> 
> Would "For a glossary of ..." work better?  Slightly simpler text then too.

Indeed it would!  And now I feel better about making you read a "side note"
that I didn't expect to change anything...

> >    There are a number of reasons why an authoritative server may become
> >    unreachable, including Denial of Service (DoS) attacks, network
> >    issues, and so on.  If a recursive server is unable to contact the
> >    authoritative servers for a query but still has relevant data that
> > 
> > side note: the way this is worded might make a reader wonder if the
> > recursive is expected to attempt to contact all known authoritatives
> > before declaring failure.
> 
> It's slightly complicated but I'd say "expected to" is yes in the
> general sense of it but the devil is in the details.  Most
> implementations basically do try all authorities of the set they're
> working with, but also have safeguards for getting bogged down, like
> what is later described as the query resolution timer.  This keeps
> them from being tarpitted on an NS RRset that provides many
> authorities that all have some sort of aberrant behavior, whether that
> be unreachability or super slow TCP or whatever.  This isn't just
> theoretical either, but observed in the wild.
> 
> Based on that I'd be reluctant to add either "all" into there, or to
> fully describe the situation.

Understood, and that's why it was a "side note" :)
(I was mostly not sure about the scope of "the set they're working with",
not that that matters.)

> >    Several recursive resolver operators, including Akamai, currently use
> >    stale data for answers in some way.  A number of recursive resolver
> > 
> > I did not follow the discussions that led to this wording, but one of my
> > colleagues at Akamai suggested that "currently fall back to stale data
> > for answers under some circumstances" might be a nicer wording, though I
> > note that Adam has already proposed some text here as well, which is
> > probably fine.
> 
> Per Adam's review I'd already stricken the mention of Akamai's
> operations here.  In the earliest versions of the draft it kind of fit
> in more naturally as being one member of a set of other explicitly
> named providers, but the latter were implementations rather than
> specific operations and subsequent edits separated them out.
> Ultimately the call-out to a specific operator (and in particular, the
> one I worked for at the time) was more meaningful background when
> the draft was first introduced than is necessary to have in there now.
> 
> > I recommend using "[this document]" in the section references, since a
> > reader reading the updated content in the context of RFC 1035 might look
> > there instead of here.
> 
> Yes, I had also updated this per Adam's review to have the RFC Editor
> adjust accordingly.
> 
> > side note: I'm slightly surprised that the semantics of the absence of
> > Recursion Desired are not more tightly nailed down, but neither is it the
> > role of this document to specify them.
> 
> We're barely scratching the surface about what might surprise you that
> isn't more tightly nailed down in the DNS. Did I ever tell you about
> that time in DoH where we argued for metaphorical days over what TC
> really means? 

I'll have to ask you about that in Vancouver.  (Over beer.)

> >    When no authorities are able to be reached during a resolution
> >    attempt, the resolver should attempt to refresh the delegation and
> >    restart the iterative lookup process with the remaining time on the
> >    query resolution timer.  This resumption should be done only once
> >    during one resolution effort.
> > 
> > Is the "during one" more like a global cap or more like "during a
> > given"?
> 
> I confess I don't quite understand the distinction you're drawing, so
> perhaps a concrete example helps.  The idea to be communicated is that
> if you're trying to resolve foo.example.com but your current NS RRset
> for example.com are all having issues (timeout, servfail, whatever)
> then you should try to refresh example.com NS again and if any new
> servers are learned, go ahead and try them.  If they fail just bail.
> How does that square with your thoughts about a global cap?

The mis-reading of the sentence that I was thinking of would be more like
"should be done only once, ever, and that one attempt should be done during
one resolution attempt"; in your example that would in effect mean "this
recursive only tries to refresh example.com NS once during runtime", which
is of course nonsensical.  So my suggestion would be to s/during one/per/,
but given how nonsensical the misreading in question is, I would understand
if you ignore it.
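
To make the "per resolution effort" reading concrete, here is a rough
sketch in Python.  The function name and the refresh_ns and query
callables are made up for illustration; this is not how any particular
resolver is written, just one way the "refresh once, then bail" rule could
look under those assumptions.

```python
import time


def resolve_one_effort(qname, zone, cached_ns, refresh_ns, query, deadline,
                       now=time.monotonic):
    """One resolution effort with at most one delegation refresh.

    refresh_ns(zone) and query(server, qname) are hypothetical callables;
    query returns None on timeout, SERVFAIL, or any other failure.
    deadline stands in for the query resolution timer.
    """
    tried = set()

    def try_servers(servers):
        for server in servers:
            if now() >= deadline:      # query resolution timer expired
                return None
            if server in tried:        # never re-ask a server that already failed
                continue
            tried.add(server)
            answer = query(server, qname)
            if answer is not None:
                return answer
        return None

    answer = try_servers(cached_ns)
    if answer is None and now() < deadline:
        # The single refresh allowed for this effort: re-fetch the NS set
        # and try only servers that have not been tried yet.
        answer = try_servers(refresh_ns(zone))
    return answer                      # still None: just bail
```

The refresh budget is local to the call, so each new resolution effort
gets its own single refresh, which is the reading that s/during one/per/
is meant to pin down.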

> >    The client response timer is another variable which deserves
> >    consideration.  If this value is too short, there exists the risk
> >    that stale answers may be used even when the authoritative server is
> >    actually reachable but slow; this may result in sub-optimal answers
> >    being returned.  Conversely, waiting too long will negatively impact
> >    user experience.
> > 
> > Not just sub-optimal but potentially even wrong or actively harmful
> > answers, no?
> 
> True, that's in the bounds of possibility and not well-encapsulated by
> the understated "sub-optimal".  s/sub-optimal/undesirable/?
> Undesirable does a better job of incorporating wrongness but is
> still subtle.  Or would you prefer s/sub-optimal/sub-optimal, wrong,
> or even actively harmful/?  Is the latter not covered adequately by
> the Security Considerations section?

I think just "undesirable" would be fine (but I wouldn't complain if you
did the longer version ;) ).  The current security considerations do pretty
well.

> Realistically there's not a special significance here with regard to
> the client response timer vs the whole idea of serve stale in general.
> If someone is trying to aggressively attack through this mechanism
> then knowing the specific setting of this timer at a resolver would be
> only a very minor consideration.  
> 
> >    The balance for the failure recheck timer is responsiveness in
> >    detecting the renewed availability of authorities versus the extra
> >    resource use for resolution.  [...] If this variable is too small,
> >    authoritative servers may be rapidly hit with a significant amount of
> >    traffic when they become reachable again.
> > 
> > I think part of the concern is also that setting the value too small
> > will cause additional traffic towards the authoritative even while it is
> > nonresponsive/nonreachable, which could aggravate any DoS attack ongoing
> > against the authoritative.  Which is to say, that perhaps "became
> > reachable again" does not quite reflect the full set of considerations.
> 
> I agree that is also a risk.  Should it be updated to just, "If this
> variable is too small, authoritative servers may be targeted with a
> significant amount of excess traffic."  That would encompass all sorts
> of availability situations.

That sounds good to me.
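
For what it's worth, the way I picture the failure recheck timer is
roughly the sketch below (Python; the class name, its methods, and the
30-second default are placeholders of mine, not text from the draft).
The point is only that the same gate suppresses excess traffic both while
the authoritative is down and at the moment it comes back.

```python
class FailureRecheckGate:
    """Suppress refresh attempts for a name until the recheck interval passes.

    Hypothetical helper; interval is in seconds and its default is only
    an illustrative value.
    """

    def __init__(self, interval=30.0):
        self.interval = interval
        self._last_failure = {}   # name -> time of the most recent failed refresh

    def may_attempt(self, name, now):
        last = self._last_failure.get(name)
        return last is None or now - last >= self.interval

    def record_failure(self, name, now):
        self._last_failure[name] = now

    def record_success(self, name):
        self._last_failure.pop(name, None)
```

A smaller interval means more attempts get through the gate per unit
time, regardless of whether the far end is reachable.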

> >    There's also no record of TTLs in the wild having the most
> >    significant bit set in DNS-OARC's "Day in the Life" samples.  With no
> > 
> > Should we have a reference for DNS-OARC's samples?
> 
> Suresh suggested a reference to
> http://www.caida.org/projects/ditl/
> or there could also be one to
> https://www.dns-oarc.net/oarc/data/ditl
> but the actual samples are only available to members.

Even with the membership caveat, having the link seems better than not, to
me.
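
(For anyone who does have access to such samples, the check itself is
tiny.  A sketch in Python, with function names of my own choosing and
assuming the TTLs have already been extracted as unsigned 32-bit
integers:)

```python
def ttl_high_bit_set(ttl):
    """Return True if the most significant bit of a 32-bit TTL is set."""
    if not 0 <= ttl <= 0xFFFFFFFF:
        raise ValueError("TTL must fit in an unsigned 32-bit integer")
    return bool(ttl & 0x80000000)


def count_high_bit_ttls(ttls):
    """Count how many TTLs in an iterable have the high-order bit set."""
    return sum(1 for ttl in ttls if ttl_high_bit_set(ttl))
```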

> >    Be aware that Canonical Name (CNAME) and DNAME [RFC6672] records
> >    mingled in the expired cache with other records at the same owner
> >    name can cause surprising results. [...]
> >
> > I'm not sure to what extent the lesson from this scenario is limited to
> > "CNAME/DNAME are special" versus "when serving stale, serve the
> > least-stale you have".
> 
> That's a fair point, but I'm not sure how to incorporate anything
> about it.  CNAME/DNAME were called out precisely because they are special.
> 
> >    Details of Apple's implementation are not currently known.
> > 
> > I'm amenable to the other reviewer's comment that this section might be
> > interesting to keep, RFC 6982 notwithstanding, in which case this might
> > be more appropriately worded as "publicly disclosed" -- one assumes that
> > the Apple employees that wrote it know what it does!
> 
> That's a fair point.  I'm a little reticent on "publicly disclosed"
> though only because while denotatively true it carries a bit of a
> connotation that they're hiding something.  In the repo this sentence
> currently reads, "Apple's system resolvers are also known to use
> stale answers, but the details are not currently known," but that
> doesn't really address your remark.

Even that's something of an improvement, IMO.  To add to the brainstorming,
perhaps "the details are not readily available"?

> >    The most obvious security issue is the increased likelihood of DNSSEC
> >    validation failures when using stale data because signatures could be
> >    returned outside their validity period.  Stale negative records can
> > 
> > We seem to be carefully not giving explicit guidance about using "stale"
> > DNSSEC keys in addition to stale resolution records.  If the
> > consequences of potentially using expired key material are more severe
> > than the consequences of potentially using expired DNS records (as it
> > seems to me), perhaps we should explicitly reiterate that serve-stale is
> > not an excuse to ignore key validity periods (as we are implicitly doing
> > here)?
> 
> Hrm, I think this is pretty clearly giving deference to the key
> validity period.  Would you like to propose a specific text change?

I guess the relevant sentiment is that "even though RRSIG records are DNS
records and might be served stale, their contents, including signature
validity period, are unchanged when served stale".  But if that doesn't fit
or is too obvious, don't add it just on my account.
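
In case it helps to see the distinction spelled out, a minimal sketch in
Python (the record type and field names are invented for illustration,
and it uses plain integer comparison rather than the serial number
arithmetic that RFC 4034 specifies for the signature time fields):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CachedRRSIG:
    """Hypothetical cache entry for an RRSIG; all times are epoch seconds."""
    inception: int        # signature inception, from the RRSIG RDATA
    expiration: int       # signature expiration, from the RRSIG RDATA
    ttl_expires_at: int   # when the cached record's TTL ran out


def is_stale(sig, now):
    """Staleness is about the TTL only."""
    return now >= sig.ttl_expires_at


def signature_time_valid(sig, now):
    """The validity window comes from the RDATA and ignores staleness."""
    return sig.inception <= now <= sig.expiration
```

The two checks are independent: a stale RRSIG can still be inside its
validity window, and serving it stale does nothing to extend that window.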

Thanks!

-Ben
