Re: [DNSOP] comments on draft-ietf-dnsop-serve-stale-03

Dave Lawrence <tale@dd.org> Fri, 08 March 2019 20:29 UTC

Archived-At: <https://mailarchive.ietf.org/arch/msg/dnsop/KsJNCmxsOvYdS-W0p0XitluvUdI>

Thank you very much for the feedback, Jinmei.  

Together with the changes we already made in response to other
messages about the draft, we expect to republish it before Monday's
IETF 104 submission deadline, after one last review by all of the
co-authors.

Jinmei:
>> The definition of TTL in [RFC1035] Sections 3.2.1 and 4.1.3 is
>> amended to read:
>>
>>    TTL  [...]  If the authority for the data is unavailable
>>    when attempting to refresh, the record MAY be used as though it is
>>    unexpired.
> 
>   On understanding that this is the only real normative description,
>   I'd suggest making some more points explicit to prevent abusing of
>   this leniency:
>   - explicitly say "all authoritative servers" instead of just "the
>     authority"
>   - also explicitly note that this MUST NOT be allowed if at least one
>     authoritative server is available

I can see how the phrasing of "if the authority" could imply to a
novice in the DNS space that only one server would be tried, so I
updated the wording to:

     If the data is unable to be authoritatively refreshed when the
     TTL expires, the record MAY be used as though it is unexpired.

I think this phrasing is sufficient without explicitly saying "all
servers" or "must not if at least one responds".  Existing resolver
implementations are very thorough in trying to get authoritative
answers, and it is hard to imagine that anyone would take this draft
to mean that they should be any less thorough.
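
To illustrate the intended reading, here is a minimal Python sketch,
not taken from the draft or from any resolver.  The names
CachedRecord, resolve, and query are hypothetical, and query stands in
for whatever a real resolver does to ask one authoritative server.
Stale data is considered only after every server has been tried and
has failed.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass
class CachedRecord:
    data: str
    ttl: int          # TTL in seconds as received from the authority
    stored_at: float  # time at which the record entered the cache


def resolve(record: CachedRecord,
            servers: Sequence[str],
            query: Callable[[str], Optional[str]],
            now: float) -> str:
    """Prefer unexpired or freshly fetched data; fall back to stale."""
    if now < record.stored_at + record.ttl:
        return record.data        # TTL has not expired yet
    for server in servers:        # try every authority, not just one
        fresh = query(server)     # None models a failed query
        if fresh is not None:
            return fresh          # any success rules out a stale answer
    return record.data            # all authorities failed: MAY use stale
```

With two servers of which only the second answers, resolve returns the
fresh answer; the expired record is returned only when both fail.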

>   - clarify whether this means a 0-TTL record can be cached and reused
>     under this condition (I assume it must not, but it's not very clear
>     to me)

Added this to the caveats section:

     The continuing prohibition against using data with a 0 second TTL
     beyond the current transaction explicitly extends to it being
     unusable even for stale fallback, as it is not to be cached at all.
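
As a rough illustration of that caveat, the Python sketch below shows
one way a cache could enforce it.  The function names and the
dictionary-based cache are assumptions made for the example, not
anything specified by the draft.  A record with a 0 second TTL never
enters the cache, so it can never turn up later as a stale candidate.

```python
from typing import Dict, Optional, Tuple

# name -> (data, absolute expiry time); a deliberately simple model
Cache = Dict[str, Tuple[str, float]]


def store(cache: Cache, name: str, data: str, ttl: int, now: float) -> bool:
    """Cache a record unless its TTL is 0; report whether it was stored."""
    if ttl <= 0:
        return False  # usable for the current transaction only
    cache[name] = (data, now + ttl)
    return True


def stale_candidate(cache: Cache, name: str) -> Optional[str]:
    """Return cached data for name regardless of expiry, if any exists."""
    entry = cache.get(name)
    return entry[0] if entry is not None else None
```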

>>   If it finds no relevant unexpired data and the Recursion Desired
>>   flag is not set in the request, it SHOULD immediately return the
>>   response without consulting the cache for expired records.

>   It would be nice if it clarified *what* to return in this case (if
>   it's intentionally left open, explicitly say so).

Added:

     Typically this response would be a referral to authoritative
     nameservers covering the zone, but the specifics are implementation
     dependent.

I was surprised to discover when testing against BIND 9.12 (without
serve-stale in play) that dig +norec for an unknown example.com name
gave a referral to com, even when it knew the NS for example.com
either via the parent delegation or even from the apex.

>>   Outside the period of the resolution recheck timer, the resolver
>>   SHOULD start the query resolution timer and begin the iterative
>>   resolution process.
> 
>   It's not clear to me how this timer is related to the 'server-stale'
>   behavior; [...]

I think its main utility in the example method is to emphasize that
even if you send a stale answer to the client while a lengthy
resolution attempt is still playing out, you've got to keep trying.
Admittedly, capping the work of that lengthy attempt is not
specifically relevant, but as you noted, this is an example.
I can see your point about possibly simplifying by removing a few
sentences related to it, but as I also think that capping work is an
important aspect of resiliency, I'm inclined to leave it in.

>   this draft doesn't explain what happens when this timer
>   expires, for example.

Based on "This timer bounds the work done by the resolver
when contacting external authorities" I'd have thought it was
implicitly clear, but I have added:

     If this timer expires on an attempted lookup that is still
     being processed, the resolution effort is abandoned.
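
A minimal sketch of such a cap, in Python with asyncio, might look
like the following.  The names resolve_with_cap, lookup, and
resolution_timeout are hypothetical, and lookup stands in for the
whole iterative resolution process; this is an illustration of
bounding the work, not how any particular resolver implements it.

```python
import asyncio
from typing import Awaitable, Callable, Optional


async def resolve_with_cap(lookup: Callable[[], Awaitable[str]],
                           resolution_timeout: float) -> Optional[str]:
    """Run a lookup, abandoning it when the resolution timer expires."""
    try:
        return await asyncio.wait_for(lookup(), timeout=resolution_timeout)
    except asyncio.TimeoutError:
        # Timer expired while the lookup was still in progress; wait_for
        # has already cancelled it, so the effort is abandoned.
        return None
```

A None result here is the point at which a serve-stale resolver would
fall back to whatever expired data it still holds.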

> Also, in my understanding unbound doesn't have this timer - it
> eventually gives up a resolution if all possible external query
> fails with a per-query timeout, but it doesn't cap the overall
> resolution time.

Interesting.  I know of an Unbound-derived server that definitely caps
work, though that may have been local changes and not incorporated
into mainline.  Tarpitting was a significant issue for the people
involved.

>> Stale data is used only when refreshing has failed, in order to
>> adhere to the original intent of the design of the DNS and the
>> behaviour expected by operators.
> 
>   I agree on this statement, but I wonder how widely this behavior is
>   actually implemented.  As noted in Section 7, unbound doesn't behave
>   this way, and in my understanding it's intentional, mainly due to
>   a concern about related IPR.

Huh.  My understanding from a hallway conversation with Benno was that
the immediate response is only sent for names that would have been
subject to pre-fetching, such that the immediate response in this case
is sufficiently covered under the guidance of a recent attempt being
made.  If that is not the case, and you can get stale answers from
Unbound even without a recent refresh attempt, then I personally think
that is an error in Unbound and not this document.

Also, the IPR is effectively released. (... but I am not a lawyer.)

> So I'm curious about implementation status about this point, and if
> many different implementations intentionally ignore this "caveat"
> for the same reason, I think we should adjust the text to match the
> reality.

I have strong objections to doing that.  Imagine having data that
expired days ago sitting in a cache but unrefreshed because no one
asked for it, then poof sending possibly obsolete information to the
first person who asks even though the authorities were perfectly well
responsive.  I think it is very important that implementations do
their best to honor TTLs as a refresh signal, and removing text that
emphasizes that is a bad idea.
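
To make the distinction concrete, here is a small Python sketch of the
check I have in mind.  The function name, the last_failed_refresh
bookkeeping, and the 30 second default are all assumptions for
illustration only.  A stale answer may go out immediately only when a
refresh was attempted and failed recently; otherwise the resolver has
to try the authorities first.

```python
from typing import Optional


def may_answer_stale_immediately(last_failed_refresh: Optional[float],
                                 now: float,
                                 recheck_interval: float = 30.0) -> bool:
    """True only if a refresh attempt failed within the recheck interval."""
    if last_failed_refresh is None:
        return False  # never tried to refresh: must ask the authorities
    return now - last_failed_refresh < recheck_interval
```

Data that expired days ago with no refresh attempt on record falls
into the first branch, so the first client to ask triggers a real
lookup rather than getting the old answer.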

> - Section 7
> 
>> Unbound has a similar feature for serving stale answers, but will
>> respond with stale data immediately if it has recently tried and
>> failed to refresh the answer by pre-fetching.
> 
>   If I understand the implementation correctly, this is not 100%
>   accurate: unbound always return the stale data if it's found in the
>   cache as long as the "serve-expired" option is enabled. 

Besides the point above, which needs clarification regarding the
Unbound implementation, note that Section 7 will be removed entirely
once the document moves on.

Thanks again.