Re: [DNSOP] Review of draft-ietf-dnsop-serve-stale-02.txt

Dave Lawrence <tale@dd.org> Mon, 05 November 2018 06:37 UTC

Message-ID: <23519.58661.219419.142204@gro.dd.org>
Date: Mon, 05 Nov 2018 01:37:25 -0500
From: Dave Lawrence <tale@dd.org>
To: dnsop@ietf.org
In-Reply-To: <20181103081228.GA32569@naina>
References: <20181103081228.GA32569@naina>
Archived-At: <https://mailarchive.ietf.org/arch/msg/dnsop/eEUA1VPFNzxCk8hJSc0uu7oXjjE>
Subject: Re: [DNSOP] Review of draft-ietf-dnsop-serve-stale-02.txt

Thanks very much for the review, Mukund!  Puneet has already
incorporated the editorial feedback into the GitHub copy.

Mukund Sivaraman writes:
>>  "It is predicated on the observation that authoritative server
>>   unavailability can cause outages even when the underlying data 
>>   those servers would return is typically unchanged."
> 
> While reading this, I wonder whether the last sentence was meant to be
> written as: "It is predicated on the observation that zone data returned
> by authoritative nameservers during a resolver refresh would typically
> be unchanged."

I agree that it's a reasonable rephrasing, and I'm wondering if you
see that there's a practical difference.  Not that I'm particularly
wed to the original text, just wondering if I'm missing something
either in grammar or semantics that makes the rewrite superior.

> >    issues, and so on.  If the recursive server is unable to contact the
> >    authoritative servers for a name but still has relevant data that has
> 
> s/for a name/for a query/

Puneet made this change, but I'll observe that since names are how
authoritative servers get delegated, I think "servers for a name" is an
acceptable, natural way to refer to the process.

> It is a curious thing why it was decided that [TTL] values with the
> high order bit set are not clamped but set to zero. Possibly because
> it can be thought that such high values are bogus and assumed to be
> made in error, and so a resolver should attempt to re-query such
> records instead of caching them for a very long time. OTOH, one can
> think the same of a TTL=2147483647 answer too. :P

Yeah, I don't know the reasoning and haven't searched the dnsext
archives to see if there was discussion on it back in the mid-90s when
the treat-it-as-zero clarification was decided.  Neither zero nor 2^31
seem like good ideas really, and the issue will be in the presentation
to dnsop today.
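
For concreteness, here is a minimal sketch of the two interpretations
being compared: the treat-it-as-zero rule (as I understand RFC 2181,
section 8) and the clamping alternative Mukund raises.  The function
names are hypothetical and not taken from any implementation.

```python
# Largest TTL value that has the high-order bit clear.
MAX_TTL = 2**31 - 1


def _check_32bit(wire_ttl: int) -> None:
    """Reject values that cannot be a 32-bit TTL field."""
    if not 0 <= wire_ttl <= 0xFFFFFFFF:
        raise ValueError("TTL must fit in 32 bits")


def ttl_treat_as_zero(wire_ttl: int) -> int:
    """Current rule: a TTL with the high-order bit set counts as zero."""
    _check_32bit(wire_ttl)
    return 0 if wire_ttl & 0x80000000 else wire_ttl


def ttl_clamped(wire_ttl: int) -> int:
    """Alternative under discussion: clamp oversized TTLs to MAX_TTL."""
    _check_32bit(wire_ttl)
    return min(wire_ttl, MAX_TTL)
```

The two only differ for values with the high-order bit set, where one
yields an immediate re-query and the other a roughly 68-year cache
lifetime, which is why neither extreme looks attractive.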

> This option seems to me too complicated to generate, parse and make use
> of. RRs are re-ordered very late during message rendering in some DNS
> implementations, and updating this syntax in the EDNS option just looks
> too painful. It does not appear parsing (by resolvers) will be easy
> either, and whether this fine granularity in determining staleness is
> generally useful.

I do agree that it is somewhat more complicated, but I'm not sure
about "too complicated".  My thinking when I first offered it is that
if I'm using an option for diagnostic purposes, I want explicit
information returned.  So for something like "dig +any +edns-stale
example.com" (or whatever) when I'm debugging, I can count off the
indices well enough.

I would not expect automated systems that receive the option to really
care much about specifically what RRSets were stale, if they were
concerned about staleness at all.  And if they do care ... they can
count off the indices well enough too.

Here's another fun idea, to specifically identify the relevant parts
of the message:  name compression pointers!  Okay, okay ... yeah it
makes me feel a little skeevy too.  

On the other hand you could also iterate over each name/type with
either recursion disabled or the other EDNS option, so there's that
way of dealing with diagnostics too.

> Would it be better to limit fetches by the resolver for 30 seconds,
> while still returning TTL=0 answers?

It's an interesting thought, but besides my general wariness of TTL 0
records I do note that this would mean keeping more state in the
resolver than it currently has to keep, and when I think about how I'd
implement that in BIND at least there's a fair bit of complexity there.

> Do all implementations mentioned earlier supporting the idea of this
> draft attempt to refresh stale data before serving it? Does this draft
> prescribe if resolvers SHOULD/MUST do so? Because the two approaches
> result in quite different behaviors.

To the best of my knowledge, no, though hopefully I've missed news on
an update to Unbound.  My understanding about how Unbound does its
feature is that it's basically "shoot first and ask questions later".
That is, I think it'll use any data from the cache first, stale or
not, before trying the resolution.   That's part of the reason that I
wrote explicitly in the draft about honoring the intent of the TTL and
using stale data only in unusual circumstances.
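
A rough sketch of the difference between the two approaches, under
stated assumptions: the cache is a plain dict, resolve() is a
hypothetical callable that raises OSError when the authorities cannot
be reached, and the "stale first" function reflects only the
understanding described above, not confirmed behavior of any resolver.

```python
from typing import Callable, Dict, Tuple

# name -> (data, absolute expiry time); a stand-in for a real cache.
Cache = Dict[str, Tuple[str, float]]


def lookup_refresh_first(name: str, cache: Cache,
                         resolve: Callable[[str], str], now: float) -> str:
    """Honor the TTL: use stale data only if resolution fails."""
    entry = cache.get(name)
    if entry is not None and entry[1] > now:
        return entry[0]              # still fresh, normal cache hit
    try:
        return resolve(name)         # expired or missing: try to refresh
    except OSError:
        if entry is not None:
            return entry[0]          # refresh failed: fall back to stale
        raise                        # nothing cached at all


def lookup_stale_first(name: str, cache: Cache,
                       resolve: Callable[[str], str], now: float) -> str:
    """Shoot first: answer from cache, stale or not, before resolving."""
    entry = cache.get(name)
    if entry is not None:
        return entry[0]              # expiry time is not consulted
    return resolve(name)
```

With an expired entry and reachable authorities, the first function
returns the refreshed data while the second keeps returning the old
data, which is the behavioral gap Mukund is pointing at.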

> I think some implementations of this draft do not implement the client
> response timer, and so waiting for the query resolution timer (which may
> be a large duration) may result in application getaddrinfo() timeouts.

That would circumvent pretty much the whole intent to add resiliency
for clients waiting for answers.
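
To illustrate why the client response timer matters, here is a
minimal sketch assuming a hypothetical blocking resolve() callable
and an arbitrary timer value supplied by the caller; it is not how
any particular implementation structures this.

```python
import concurrent.futures
from typing import Callable, Optional, Tuple


def answer_with_client_timer(resolve: Callable[[], str],
                             stale: Optional[str],
                             client_response_timer: float
                             ) -> Tuple[str, bool]:
    """Return (answer, was_stale).

    Resolution is attempted first.  If it has not finished when the
    client response timer fires and stale data exists, the stale data
    is returned while resolution keeps running in the background.
    """
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(resolve)
    try:
        return future.result(timeout=client_response_timer), False
    except concurrent.futures.TimeoutError:
        if stale is not None:
            return stale, True
        # No stale data to offer: keep waiting on the resolution.
        return future.result(), False
    finally:
        # Do not block the client on the still-running resolution.
        pool.shutdown(wait=False)
```

Without the early timer, the only exit is the end of the resolution
attempt itself, so the client can give up before the stale answer is
ever sent.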