Re: [DNSOP] state management related to TTL

Dave Lawrence <tale@dd.org> Wed, 17 October 2018 17:16 UTC

Message-ID: <23495.28266.254692.693475@gro.dd.org>
Date: Wed, 17 Oct 2018 13:16:26 -0400
From: Dave Lawrence <tale@dd.org>
To: dnsop@ietf.org
In-Reply-To: <5A0BE237.5090104@redbarn.org>
References: <5A0BE237.5090104@redbarn.org>
Archived-At: <https://mailarchive.ietf.org/arch/msg/dnsop/2o62EmB35OiBKNq9YZLfSMpqL2U>
Subject: Re: [DNSOP] state management related to TTL
Precedence: list

Paul, apologies for taking nearly a year to recall this message and
respond to it:

https://www.ietf.org/mail-archive/web/dnsop/current/msg21367.html

I'll trim down the cited material in my response, but that doesn't
mean the parts I am not responding to are ignored.  For example, I
pretty much agree with the first five paragraphs.

The first part that jumps out to me as bearing further discussion
starts here:

Paul Vixie writes:
> another method that's been deployed of avoiding simultaneous "don't
> have" with "great need" is to liberally reinterpret TTL such that
> RRsets can be reused beyond their explicit TTL lifetime, while their
> refresh queries proceed in the background.  commonly, the authority
> servers responsible for answering these refresh events are down or
> unreachable at the time of most acute need.

Here "in the background" implies to me a perception that the stale
answer is given preference over refreshing the data, so to be clear,
that is not the case.  The draft is explicit that an attempt should be
made -- in the foreground -- to refresh before falling back to stale
data.  Stale data is therefore used only in exceptional circumstances,
and I also would not be inclined to describe the unreachability of
authorities as happening "commonly".  Operational experience bears
this out.
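
To make the ordering concrete, here is a rough sketch of the lookup
path as I understand it, not code from the draft or from any resolver.
The names (StaleServingCache, UpstreamUnreachable, fetch, max_stale)
are all hypothetical, and the cap on how long stale data may be used
is an assumption added for illustration.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, Tuple


@dataclass
class CacheEntry:
    rrset: Tuple[str, ...]
    expires_at: float  # clock value at which the TTL runs out


class UpstreamUnreachable(Exception):
    """Raised by the hypothetical fetch function when no authority answers."""


class StaleServingCache:
    """Illustrative cache: refresh in the foreground, stale only on failure."""

    def __init__(
        self,
        fetch: Callable[[str], Tuple[Tuple[str, ...], float]],
        max_stale: float = 86400.0,  # assumed cap on use beyond the TTL
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.fetch = fetch
        self.max_stale = max_stale
        self.clock = clock
        self.entries: Dict[str, CacheEntry] = {}

    def resolve(self, name: str) -> Tuple[Tuple[str, ...], bool]:
        """Return (rrset, is_stale) for name."""
        now = self.clock()
        entry = self.entries.get(name)

        # Within the TTL: ordinary cache hit, nothing unusual.
        if entry is not None and now < entry.expires_at:
            return entry.rrset, False

        # TTL expired or no entry: try the authorities first, in the
        # foreground, before any stale data is considered.
        try:
            rrset, ttl = self.fetch(name)
        except UpstreamUnreachable:
            # Only when the refresh fails is the expired entry used,
            # and only within the assumed staleness cap.
            if entry is not None and now - entry.expires_at <= self.max_stale:
                return entry.rrset, True
            raise

        self.entries[name] = CacheEntry(rrset, now + ttl)
        return rrset, False
```

The point of the sketch is the order of the branches: the stale
branch is reachable only after the refresh attempt has already failed.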

> the danger of TTL stretching is that reuse beyond TTL may cause
> RRsets that are in fact supposed to be unreachable, to be
> effectively reachable. examples include security-related takedown of
> criminal DNS servers or networks, or failover strategies where end
> systems will not try to reach their backup servers unless they
> cannot reach their primary servers, and the unreachability of those
> primary servers is hidden from them by TTL
> stretching. fundamentally, an RRset and its TTL are the property of
> the zone administrator, and it's controversial for any other party
> to use this data beyond its specified use parameters.

This is the meat of this message to me.  Can you please elaborate on
the scenarios where this takedown situation is a problem?  What are
the circumstances by which a takedown is only able to be effected
through some mechanism which would be subverted by serve-stale?
Removing the delegation still works, as does repointing the
delegation, or rerouting the authority addresses, or physically taking
over the authorities and thereby being able to change their answers.
The only scenario that I can imagine it not working is when you can
physically disable links to the to-be-disabled authorities but have
none of the other remedies available.  Is this something that happens?

I know you don't mean to say this, but it's also hard not to have it
come to mind that this sounds a bit like "we can't do this because bad
people might use it to their advantage."  Maybe, but beyond the
question of how, does that sufficiently outweigh the benefit
non-baddies can get?  We have lots of technology that bad people can
use to do bad things.  It's really hard to evaluate that without a
more detailed look at the threat model.

Similarly, I'm wondering about these other existing systems that rely
on unusable primary delegations to fix those delegations to point to
backup servers, especially with the typical TTLs in TLDs being the
dominating consideration for actually being able to cause the
failover.  That's not to say I doubt such systems exist, because of
course the DNS is constantly monitored by any reasonable
provider.  Every such monitoring system with which I have personal
experience has many checks that would not be impacted by serve-stale.
I'm specifically interested in learning more about the systems for
which serve-stale causes breakage, and how they might end up getting a
stale-serving resolver without an affirmative administrative choice to
install and enable the feature in such a resolver for such a
monitoring system.

> most of us recognize that TTL's will continue to be stretched no
> matter what changes are or are not made to the specification, and so
> we expect the resulting RFC to document current practice _without
> recommending it_ and to also document a new practice _with
> recommendations_ as to its proper uses.

I think you'll need more support for the assertion that "most of us
... expect".  Based on the conversation that's happened around this so
far, and with my best attempt at fairly evaluating feedback both on
the list and in person, my own impression is that most implementers
and operators with whom I've spoken are supportive of the immediate
resilience benefit of serve-stale as described in the draft.

> noone has proposed any new signaling between the stub and the
> recursive, but it's possible that a stub may want a true TTL and so
> we might add signaling from the stub (as initiator) saying, don't
> stretch, or perhaps saying, if this is a stretched TTL, tell me so
> explicitly.

The draft that predated your message by a couple of weeks proposed the
functionality whereby a stub could indeed know explicitly that any
given RRSet in the response was stale.  The recently republished draft
offers a simplified alternative method for discussion.  Personally I
still prefer the more featureful option as providing the most clear
information, but in any event signaling is and has been part of the
document.
