[DNSOP] Review of draft-ietf-dnsop-rfc5011-security-considerations-11

Viktor Dukhovni <ietf-dane@dukhovni.org> Wed, 21 February 2018 21:55 UTC

From: Viktor Dukhovni <ietf-dane@dukhovni.org>
Reply-To: dnsop@ietf.org
Message-Id: <E5735B24-EEF4-40DB-91CF-028F5799EE06@dukhovni.org>
Date: Wed, 21 Feb 2018 16:55:41 -0500
To: dnsop@ietf.org
Archived-At: <https://mailarchive.ietf.org/arch/msg/dnsop/f2VstCRkxJ1e757dPo6HOFbG2xE>
Subject: [DNSOP] Review of draft-ietf-dnsop-rfc5011-security-considerations-11


   1.  Introduction

   Because of this lack of guidance, zone publishers may derive
   incorrect assumptions about safe usage of the RFC5011 DNSKEY

s/derive/arrive at/

   and is intended to complement the guidance offered in RFC5011 (which
   is written to provide timing guidance solely to a Validating
   Resolver's point of view).

s/solely to/solely from/

   1.1.  Document History and Motivation

   To verify this lack of understanding is wide-spread, the authors

s/verify/confirm that/

   All 5 experts answered with an insecure value, and we determined that
   this lack of mathematical understanding might cause security concerns

s/mathematical//

   in deployment.  We hope that this companion document to RFC5011 will
   rectify this understanding and provide better guidance to zone

s/understanding//

   publishers that wish to make use of the RFC5011 rollover process.

s/that/who/

   1.2.  Safely Rolling the Root Zone's KSK in 2017/2018

   One important note about ICANN's (currently in process) 2017/2018 KSK
   rollover plan for the root zone: the timing values chosen for rolling
   the KSK in the root zone appear completely safe, and are not affected
   by the timing concerns introduced by this draft

s/introduced by this draft/discussed in this draft./

   2.  Background

   The RFC5011 process describes a process by which a RFC5011 Resolver

s/The RFC5011 process/RFC5011/
s/a RFC5011/an RFC5011/

   operational guidance or recommendations about the RFC5011 process and
   restricts itself to solely the security and operational ramifications

s/to solely/solely to/

   of switching to exclusively using recently added keys or removing
   revoked keys too soon.

s/of switching to ... too soon/of prematurely switching to .../

   4.  Timing Associated with RFC5011 Processing

   These sections define a high-level overview of [RFC5011] processing.

s/These sections define/The subsections below give/

OLD:
   These steps are not sufficient for proper RFC5011 implementation, but
NEW:
   The description is not by itself sufficient for a full RFC5011 implementation, but

   4.1.  Timing Associated with Publication

   RFC5011's process of safely publishing a new DNSKEY and then assuming
   RFC5011 Resolvers have adopted it for trust falls into a number of
   high-level steps to be performed by the SEP Publisher.  This document

s/falls into/can be broken down into/

   discusses the following scenario, which the principle way RFC5011 is

s/principle/principal/

   5.  Denial of Service Attack Walkthrough

   If an attacker is able to provide a RFC5011 Resolver with past
   responses, such as when it is in-path or able to perform any number

s/in-path/on-path/

   5.1.  Enumerated Attack Example

   The following example settings are used in the example scenario

s/example settings/settings/

   attack.  The timing schedule listed below is based on a SEP Publisher

s/timing schedule listed/timeline/

   "T+0".  All numbers in this sequence refer to days before and after

s/sequence/timeline/

   was introduced into the fictitious zone being discussed.

s/fictitious/example/

   In this dialog, we consider two keys within the example zone:

s/dialog/exposition/

   K_old:  An older KSK and Trust Anchor being replaced.

   K_new:  A new KSK being transitioned into active use and expected to
      become a Trust Anchor via the RFC5011 automated trust anchor
      update process.

   5.1.1.  Attack Timing Breakdown

   The steps shows an attack that foils the adoption of a new DNSKEY by

s/The steps shows/Below we examine/

   6.  Minimum RFC5011 Timing Requirements

   First, we define the term components used in all equations in
   Section 6.1.

s/term components/component terms/

   6.1.6.  timingSafetyMargin

   Mentally, it is easy to assume that the period of time required for

s/Mentally/Naively/

   will be entirely based off the length of the addHoldDownTime.

s/based off/based on/

   protocol and in operational realities in deploying it require waiting

s/and in/and the/

   and additional period of time longer.  In subsections Section 6.1.6.1

s/and/an/

   6.1.6.1.  activeRefreshOffset

   Security analysis of the timing associated with the query rate of

s/Security analysis/An analysis/

   (at time T), the resolver would send checking queries at T+7, T+14,

s/(at time T)/(at time T+0)/

   The activeRefreshOffset term defines this time difference and
   becomes:

    activeRefreshOffset = addHoldDownTime % activeRefresh

   The % symbol denotes the mathematical mod operator (calculating the
   remainder in a division problem).  This will frequently be zero, but
   can be nearly as large as activeRefresh itself.

Given imperfect clocks, lost packets, ... I would argue that it is necessary
to pessimistically just set "activeRefreshOffset = activeRefresh" and
NOT assume that exact divisibility is operationally meaningful.  Very
small differences in either value in the expression can easily change
values near zero to values near the upper bound, so the upper bound is
the only sound choice I would think.  Perhaps the "clockskewDriftMargin"
in the next section accounts for this, but I am a bit skeptical at first
glance.
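
To illustrate the sensitivity, here is a minimal Python sketch.  The
helper name active_refresh_offset, the use of seconds as the unit, and
the 30-day/15-day values are assumptions chosen for illustration, not
values taken from the draft:

```python
DAY = 86400  # seconds (illustrative unit)

def active_refresh_offset(add_hold_down_time, active_refresh):
    """Remainder term quoted above: addHoldDownTime % activeRefresh."""
    return add_hold_down_time % active_refresh

hold_down = 30 * DAY

# Exactly divisible: the offset is zero.
exact = active_refresh_offset(hold_down, 15 * DAY)

# Refresh interval just one second longer: the offset jumps to
# almost the full refresh interval.
skewed = active_refresh_offset(hold_down, 15 * DAY + 1)

print(exact, skewed, 15 * DAY + 1)
```

With these inputs "exact" is 0, while "skewed" is 1295999 seconds, two
seconds short of the 1296001-second refresh interval.  A one-second
change in an input moves the result from the lower bound to essentially
the upper bound.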

   6.1.6.3.  retryDriftMargin

   that it becomes impossible to predict, from the perspective of the
   PEP Publisher, when the final important measurement query will

Should PEP be SEP here?   s/final important/conclusive/

   6.1.6.4.  timingSafetyMargin Value

   The activeRefreshOffset, clockskewDriftMargin, and retryDriftMargin
   parameters all deal with additional wait-periods that must be
   accounted for after analyzing what conditions the client will take
   longer than expected to make its last query while waiting for the
   addHoldDownTime period to pass.  But these values may be merged into
   a single term by waiting the longest of any of them.  We define
   timingSafetyMargin as this "worst case" value:

        timingSafetyMargin = MAX(activeRefreshOffset,
                                 clockskewDriftMargin,
                                 retryDriftMargin)

        timingSafetyMargin = MAX(addWaitTime % activeRefresh,
                                 activeRefresh,
                                 activeRefresh)

        timingSafetyMargin = activeRefresh

Here we see that the choice of "addWaitTime % activeRefresh" vs.
just "activeRefresh" is not material, and could probably have
been made at the outset.
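
A quick way to see this, as a sketch only (the function name
timing_safety_margin and the integer inputs are assumptions for
illustration): a remainder is always strictly smaller than its
divisor, so it can never win the MAX against activeRefresh.

```python
def timing_safety_margin(add_wait_time, active_refresh):
    """MAX of the three terms as substituted in the quoted draft text."""
    active_refresh_offset = add_wait_time % active_refresh  # always < active_refresh
    clockskew_drift_margin = active_refresh
    retry_drift_margin = active_refresh
    return max(active_refresh_offset,
               clockskew_drift_margin,
               retry_drift_margin)

# Illustrative inputs only; the result equals active_refresh every time.
for add_wait_time in (1, 29, 30, 31, 1000):
    for active_refresh in (1, 7, 15, 30):
        assert timing_safety_margin(add_wait_time, active_refresh) == active_refresh
```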

   6.1.7.  retrySafetyMargin

   None the less, we do offer the following as one method considering

s/None the less/Nonetheless/

   numResolvers:  The number of client RFC5011 Resolvers

   With the successRate and numResolvers values selected and the
   definition of retryTime from RFC5011, one method for determining how
   many retryTime intervals to wait in order to reduce the set of
   uncompleted servers to 0 assuming normal probability is thus:

Here the text needs to be considerably clearer.  The first
observation is that "uncompleted servers" is not defined.  It
seems this is the expected number of resolvers that failed to
acquire the new trust anchor.  If so, this should be stated
clearly.  It is also rather unclear what is normally distributed,
and why such a distribution is reasonable to assume.

It seems to me that tuning to achieve zero failure cases for
each size of the resolver population is (to put it kindly) not
necessarily sound.

Instead, one might want to achieve an acceptably low probability
of any chosen resolver failing due to random packet loss, and to
handle non-random "short-term" outages (which may last days if,
say, the US East-coast grid hypothetically goes down for 3 days,
or, much more likely, some home computer is shut down for a few
weeks while the administrator is on vacation).

So here, the zone administrator needs to stretch the retry interval
by a fudge factor and/or to a minimum time they're comfortable with,
but I don't see much relevance of "numResolvers" or plausibility of
any sort of "normal distribution" model.  When an ISP has an outage,
or power is lost, or there's a DDoS attack, lost queries are highly
correlated.

Therefore, I would discard the table, and just recommend a sensible
fudge factor that is likely to work well enough in practice.  Say,
stretch the retry time by a factor of 5 to account for transient
connectivity loss, software restarts, ... and also set a minimum
time for less random "short-term" outages one is willing to tolerate.

No finite time can protect a resolver or set of resolvers subject to
a sustained long-term DoS, and these will need to be manually rekeyed
once reliable connectivity is restored.
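
One possible reading of that suggestion, as a rough Python sketch.  The
function and parameter names are hypothetical, and combining the two
terms with max() is my assumption about how "fudge factor and/or
minimum time" might be applied; only the factor of 5 comes from the
text above:

```python
def retry_safety_margin(retry_time, fudge_factor=5, min_outage=0):
    """Stretch retry_time by a fudge factor, but never go below the
    longest "short-term" outage the operator is willing to tolerate.
    All arguments must use the same time unit."""
    if retry_time <= 0 or fudge_factor <= 0 or min_outage < 0:
        raise ValueError("times and factor must be positive")
    return max(fudge_factor * retry_time, min_outage)

# Illustrative values in days (not taken from the draft):
short_floor = retry_safety_margin(retry_time=1, min_outage=3)   # factor dominates
long_floor = retry_safety_margin(retry_time=1, min_outage=14)   # floor dominates
print(short_floor, long_floor)
```

Note that numResolvers does not appear anywhere: the margin depends
only on the retry interval and on the outage length the zone
administrator chooses to tolerate.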

-- 
	Viktor.
