[DNSOP] Benjamin Kaduk's No Objection on draft-ietf-dnsop-serve-stale-09: (with COMMENT)

Benjamin Kaduk via Datatracker <noreply@ietf.org> Thu, 05 December 2019 00:42 UTC

From: Benjamin Kaduk via Datatracker <noreply@ietf.org>
To: "The IESG" <iesg@ietf.org>
Cc: draft-ietf-dnsop-serve-stale@ietf.org, Suzanne Woolf <suzworldwide@gmail.com>, dnsop-chairs@ietf.org, suzworldwide@gmail.com, dnsop@ietf.org
Reply-To: Benjamin Kaduk <kaduk@mit.edu>
Message-ID: <157550657421.11138.9701797814145174644.idtracker@ietfa.amsl.com>
Date: Wed, 04 Dec 2019 16:42:54 -0800
Archived-At: <https://mailarchive.ietf.org/arch/msg/dnsop/q-CBzj7Sz0mhqdmKtu6MxC8eXZo>

Benjamin Kaduk has entered the following ballot position for
draft-ietf-dnsop-serve-stale-09: No Objection

When responding, please keep the subject line intact and reply to all
email addresses included in the To and CC lines. (Feel free to cut this
introductory paragraph, however.)


Please refer to https://www.ietf.org/iesg/statement/discuss-criteria.html
for more information about IESG DISCUSS and COMMENT positions.


The document, along with other ballot positions, can be found here:
https://datatracker.ietf.org/doc/draft-ietf-dnsop-serve-stale/



----------------------------------------------------------------------
COMMENT:
----------------------------------------------------------------------

Thanks for this document; it's a good, comprehensive discussion of the
issues related to this topic and will improve the stability of the
internet.  I have several minor comments and a few side notes that
are expected to lead to at most my own elucidation (but no textual
changes).

Section 2

   For a comprehensive treatment of DNS terms, please see [RFC8499].

(side note: I myself would not use the word "comprehensive" when it
explicitly says that "some DNS-related terms are interpreted quite
differently by different DNS experts", but I understand why it is used
here.)

Section 3

   There are a number of reasons why an authoritative server may become
   unreachable, including Denial of Service (DoS) attacks, network
   issues, and so on.  If a recursive server is unable to contact the
   authoritative servers for a query but still has relevant data that

side note: the way this is worded might make a reader wonder if the
recursive is expected to attempt to contact all known authoritatives
before declaring failure.

   Several recursive resolver operators, including Akamai, currently use
   stale data for answers in some way.  A number of recursive resolver

I did not follow the discussions that led to this wording, but one of my
colleagues at Akamai suggested that "currently fall back to stale data
for answers under some circumstances" might be a nicer wording, though I
note that Adam has already proposed some text here as well, which is
probably fine.

Section 4

   The definition of TTL in [RFC1035] Sections 3.2.1 and 4.1.3 is
   amended to read:

   TTL  a 32-bit unsigned integer number of seconds that specifies the
      duration that the resource record MAY be cached before the source
      of the information MUST again be consulted.  Zero values are
      interpreted to mean that the RR can only be used for the
      transaction in progress, and should not be cached.  Values SHOULD
      be capped on the orders of days to weeks, with a recommended cap
      of 604,800 seconds (seven days).  If the data is unable to be
      authoritatively refreshed when the TTL expires, the record MAY be
      used as though it is unexpired.  See the Section 5 and Section 6
      sections for details.

I recommend using "[this document]" in the section references, since a
reader reading the updated content in the context of RFC 1035 might look
there instead of here.

Section 5

   The resolver then checks its cache for any unexpired records that
   satisfy the request and returns them if available.  If it finds no
   relevant unexpired data and the Recursion Desired flag is not set in
   the request, it should immediately return the response without
   consulting the cache for expired records.  Typically this response
   would be a referral to authoritative nameservers covering the zone,
   but the specifics are implementation-dependent.

side note: I'm slightly surprised that the semantics of the absence of
Recursion Desired are not more tightly nailed down, but neither is it the
role of this document to specify them.
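
A minimal sketch of the lookup order described in the quoted paragraph,
assuming a hypothetical single-entry cache model.  CacheEntry,
answer_query, the resolve callback, and the status labels are
illustrative names, not taken from the draft or from any implementation.
The client response timer and any limit on how long stale data is kept
are left out.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class CacheEntry:
    answer: str        # cached answer data (simplified to a string)
    expires_at: float  # absolute expiry time in seconds


def answer_query(
    entry: Optional[CacheEntry],
    rd: bool,
    now: float,
    resolve: Callable[[], Optional[str]],
) -> Tuple[str, Optional[str]]:
    """Return (status, answer) following the order in the quoted text."""
    # 1. Unexpired cached data is returned first.
    if entry is not None and entry.expires_at > now:
        return ("fresh", entry.answer)
    # 2. Without Recursion Desired, respond immediately and do not
    #    consult expired records (typically a referral).
    if not rd:
        return ("referral", None)
    # 3. Otherwise try to resolve; resolve() returns None on failure.
    refreshed = resolve()
    if refreshed is not None:
        return ("resolved", refreshed)
    # 4. Only after a failed resolution fall back to expired data.
    if entry is not None:
        return ("stale", entry.answer)
    return ("servfail", None)  # assumed failure label for this sketch
```

Note that the expired entry is never consulted on the path where RD is
clear, which is the behavior the quoted paragraph asks for.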

   When no authorities are able to be reached during a resolution
   attempt, the resolver should attempt to refresh the delegation and
   restart the iterative lookup process with the remaining time on the
   query resolution timer.  This resumption should be done only once
   during one resolution effort.

Is the "during one" more like a global cap or more like "during a
given"?

Section 6

   The client response timer is another variable which deserves
   consideration.  If this value is too short, there exists the risk
   that stale answers may be used even when the authoritative server is
   actually reachable but slow; this may result in sub-optimal answers
   being returned.  Conversely, waiting too long will negatively impact
   user experience.

Not just sub-optimal but potentially even wrong or actively harmful
answers, no?

   The balance for the failure recheck timer is responsiveness in
   detecting the renewed availability of authorities versus the extra
   resource use for resolution.  If this variable is set too large,
   stale answers may continue to be returned even after the
   authoritative server is reachable; per [RFC2308], Section 7, this
   should be no more than five minutes.  If this variable is too small,
   authoritative servers may be rapidly hit with a significant amount of
   traffic when they become reachable again.

I think part of the concern is also that setting the value too small
will cause additional traffic towards the authoritative even while it is
nonresponsive/nonreachable, which could aggravate any DoS attack ongoing
against the authoritative.  Which is to say, perhaps "become
reachable again" does not quite reflect the full set of considerations.
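
The trade-off can be sketched as a per-name gate on refresh attempts.
This is an illustration under stated assumptions: the class and method
names are hypothetical, and the 30-second default is only an example
value; the quoted text itself gives just the five-minute upper bound.

```python
from typing import Dict


class FailureRecheck:
    """Suppress refresh attempts for a while after a failed resolution."""

    def __init__(self, interval: float = 30.0) -> None:
        # interval is an illustrative value, not one taken from the
        # quoted text.
        self.interval = interval
        self._last_failure: Dict[str, float] = {}

    def record_failure(self, name: str, now: float) -> None:
        self._last_failure[name] = now

    def record_success(self, name: str) -> None:
        self._last_failure.pop(name, None)

    def may_attempt(self, name: str, now: float) -> bool:
        """True if no failure is recorded or the interval has elapsed."""
        last = self._last_failure.get(name)
        return last is None or now - last >= self.interval
```

While the gate is closed the resolver sends nothing upstream for that
name, so a longer interval reduces load both during the outage and at
the moment of recovery, at the cost of serving stale data for longer.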

   Regarding the TTL to set on stale records in the response,
   historically TTLs of zero seconds have been problematic for some
   implementations, and negative values can't effectively be
   communicated to existing software.  Other very short TTLs could lead
   to congestive collapse as TTL-respecting clients rapidly try to
   refresh.  The recommended value of 30 seconds not only sidesteps
   those potential problems with no practical negative consequences, it
   also rate limits further queries from any client that honors the TTL,
   such as a forwarding resolver.

I wonder a little bit whether an RFC 8085 reference would make sense
here, but that's not exactly my area of expertise.
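
As a small illustration of the quoted recommendation, the TTL placed on
a record in a response could be chosen as below.  response_ttl and its
parameters are hypothetical names; only the 30-second value comes from
the quoted text.

```python
STALE_ANSWER_TTL = 30  # seconds, the recommended value in the quoted text


def response_ttl(expires_at: float, now: float) -> int:
    """TTL to put on a record in a response, in whole seconds."""
    if expires_at > now:
        # Still fresh: send the remaining lifetime, never less than 1
        # so that a fresh record is not sent with a zero TTL.
        return max(1, int(expires_at - now))
    # Expired: never send zero (or a "negative" value); use the fixed
    # stale TTL instead.
    return STALE_ANSWER_TTL
```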

   There's also no record of TTLs in the wild having the most
   significant bit set in DNS-OARC's "Day in the Life" samples.  With no

Should we have a reference for DNS-OARC's samples?

   apparent reason for operators to use them intentionally, that leaves
   either errors or non-standard experiments as explanations as to why
   such TTLs might be encountered, with neither providing an obviously
   compelling reason as to why having the leading bit set should be
   treated differently from having any of the next eleven bits set and
   then capped per Section 4.

side note(?): This discussion, as roughly "we can't think of any reason
why the change would be problematic", calls to mind the ongoing
discussions of RFC (text) format changes, where arguments are being made
for more-strict backwards/historical compatibility.  That said, I have
no reason to doubt the WG consensus position here, hence "side note".
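
For illustration, the capping behavior under discussion can be sketched
as follows.  effective_ttl is a hypothetical helper name; the cap is the
604,800 seconds quoted from Section 4.  Because 604,800 is less than
2^20, a TTL with any of bits 20 through 31 set ends up at the cap, which
is the sense in which the leading bit gets no special treatment.

```python
MAX_TTL_SECONDS = 604_800  # recommended cap (seven days) from Section 4


def effective_ttl(wire_ttl: int) -> int:
    """Clamp a 32-bit unsigned TTL from the wire to the recommended cap."""
    if not 0 <= wire_ttl <= 0xFFFFFFFF:
        raise ValueError("TTL must be a 32-bit unsigned integer")
    # The most significant bit is not special-cased: such values are
    # simply very large and are capped like any other large TTL.
    return min(wire_ttl, MAX_TTL_SECONDS)
```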

Section 7

   Be aware that Canonical Name (CNAME) and DNAME [RFC6672] records
   mingled in the expired cache with other records at the same owner
   name can cause surprising results.  This was observed with an initial
   implementation in BIND when a hostname changed from having an IPv4
   Address (A) record to a CNAME.  The version of BIND being used did
   not evict other types in the cache when a CNAME was received, which
   in normal operations is not a significant issue.  However, after both
   records expired and the authorities became unavailable, the fallback
   to stale answers returned the older A instead of the newer CNAME.

I'm not sure to what extent the lesson from this scenario is limited to
"CNAME/DNAME are special" versus "when serving stale, serve the
least-stale you have".
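
A sketch of the "serve the least-stale you have" reading, assuming
hypothetical names (StaleRRset, least_stale) and assuming that staleness
is measured by expiry time.  An implementation might instead compare the
time each record was received; nothing here reflects what BIND does.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class StaleRRset:
    rrtype: str        # e.g. "A" or "CNAME"
    rdata: str
    expired_at: float  # absolute time at which the TTL ran out


def least_stale(candidates: Iterable[StaleRRset]) -> Optional[StaleRRset]:
    """Pick the expired RRset at an owner name that expired most recently."""
    return max(candidates, key=lambda r: r.expired_at, default=None)
```

In the scenario quoted above, if the older A record expired before the
newer CNAME, this selection would return the CNAME.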

Section 8

   Details of Apple's implementation are not currently known.

I'm amenable to the other reviewer's comment that this section might be
interesting to keep, RFC 6982 notwithstanding, in which case this might
be more appropriately worded as "publicly disclosed" -- one assumes that
the Apple employees who wrote it know what it does!

Section 10

   The most obvious security issue is the increased likelihood of DNSSEC
   validation failures when using stale data because signatures could be
   returned outside their validity period.  Stale negative records can

We seem to be carefully not giving explicit guidance about using "stale"
DNSSEC keys in addition to stale resolution records.  If the
consequences of potentially using expired key material are more severe
than the consequences of potentially using expired DNS records (as it
seems to me), perhaps we should explicitly reiterate that serve-stale is
not an excuse to ignore key validity periods (as we are implicitly doing
here)?
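
One way to picture that separation, as a sketch only: the time check on
a signature stays unconditional, and the serve-stale allowance applies
to the TTL alone.  The function names are hypothetical, and the plain
integer comparison is a simplification that ignores the serial number
arithmetic used for the 32-bit RRSIG time fields.

```python
def signature_time_valid(inception: int, expiration: int, now: int) -> bool:
    """True if now lies within the signature validity period (inclusive)."""
    return inception <= now <= expiration


def may_use_signed_record(
    ttl_expired: bool,
    authorities_unreachable: bool,
    inception: int,
    expiration: int,
    now: int,
) -> bool:
    """Serve-stale relaxes the TTL check only, never the validity period."""
    if not signature_time_valid(inception, expiration, now):
        return False
    # A fresh record is always usable; an expired one only as a fallback
    # when the authorities could not be reached.
    return (not ttl_expired) or authorities_unreachable
```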

   In [CloudStrife], it was demonstrated how stale DNS data, namely
   hostnames pointing to addresses that are no longer in use by the
   owner of the name, can be used to co-opt security such as to get
   domain-validated certificates fraudulently issued to an attacker.
   While this document does not create a new vulnerability in this area,
   it does potentially enlarge the window in which such an attack could
   be made.  A proposed mitigation is that certificate authorities
   should fully look up each name starting at the DNS root for every
   name lookup.  Alternatively, CAs should use a resolver that is not
   serving stale data.

[I think Adam has probably already covered this one, but keeping just in
case.]
I note that the target of this guidance (CAs) is not obviously in the
expected readership set for a document about DNS recursive resolver
operational considerations.  Can we do more to expand the visibility of
this guidance to the audience where it would be most useful?  (I don't
see an obvious candidate for, e.g., an additional Updates: relationship,
but perhaps someone has other ideas.)