[DNSOP] Responding to Viktor's comments on RFC5011-security-considerations

Wes Hardaker <wjhns1@hardakers.net> Fri, 23 March 2018 08:20 UTC

From: Wes Hardaker <wjhns1@hardakers.net>
To: dnsop <dnsop@ietf.org>
Date: Fri, 23 Mar 2018 01:20:18 -0700
Message-ID: <yblzi2zdwzh.fsf@wu.hardakers.net>
User-Agent: Gnus/5.130014 (Ma Gnus v0.14) Emacs/25.3 (gnu/linux)
MIME-Version: 1.0
Content-Type: text/plain
Archived-At: <https://mailarchive.ietf.org/arch/msg/dnsop/Og0QlvnAsE2LiVrejmvDXTJjMrQ>
Subject: [DNSOP] Responding to Viktor's comments on RFC5011-security-considerations
Precedence: list

TL;DR:   I've pushed a new (final before IETF LC?) copy of the document.

Viktor,

Thanks for the excellent and thorough review.  I'm very glad you were
willing to take a look at it, since it's a better document because of
it.  I failed to add you to the acknowledgment list for -12, but the
change is queued for the future -13.

All changes were accepted, with a few exceptions and notes below.  '+'
signs mark the start of my responses; all other text is functionally
yours or the document's.



6.1.6.  timingSafetyMargin

   Mentally, it is easy to assume that the period of time required for

s/Mentally/Naively/

   + NOTE: not changing this one; I wanted the emphasis on thinking,
     not assuming

6.1.6.1.  activeRefreshOffset

   Security analysis of the timing associated with the query rate of

s/Security analysis/An analysis/

   + NOTE: changed to "A security analysis"

numResolvers:  The number of client RFC5011 Resolvers

   With the successRate and numResolvers values selected and the
   definition of retryTime from RFC5011, one method for determining how
   many retryTime intervals to wait in order to reduce the set of
   uncompleted servers to 0 assuming normal probability is thus:

Here the text needs to be considerably more clear.  The first
observation is that "uncompleted servers" is not defined.  It
seems this is the expected number of resolvers that failed to
acquire the new trust anchor.  If so, this should be stated
clearly.

It is also rather unclear what is normally distributed,
and why such a distribution is reasonable to assume.

   + good point; changed to: ...to wait in order to reduce the set of
     resolvers that have not accepted the new trust anchor to 0
     is thus:
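
One way to read the calculation discussed above, as a minimal sketch
rather than the draft's own text: assume every retry succeeds
independently with probability successRate, so the expected number of
resolvers still lacking the new trust anchor after k retries is
numResolvers * (1 - successRate)^k.  Solving for the k at which that
expectation falls to 1 gives the count below.  The function name is
illustrative, and the independence assumption is exactly what the
review comment questions.

```python
import math


def retry_count_wait(success_rate: float, num_resolvers: int) -> float:
    """Number of retryTime intervals after which the expected count of
    resolvers without the new trust anchor drops to 1.

    Assumption (not from the draft text quoted here): each retry fails
    independently with probability 1 - success_rate.
    """
    if not 0.0 < success_rate < 1.0:
        raise ValueError("success_rate must be strictly between 0 and 1")
    if num_resolvers < 1:
        raise ValueError("num_resolvers must be at least 1")
    failure_rate = 1.0 - success_rate
    # Solve num_resolvers * failure_rate**k == 1 for k.
    return math.log(num_resolvers) / math.log(1.0 / failure_rate)


if __name__ == "__main__":
    # Hypothetical inputs chosen only to show how the count grows.
    for rate in (0.5, 0.9, 0.99):
        print(rate, math.ceil(retry_count_wait(rate, 10_000_000)))
```

Multiplying the rounded-up count by retryTime gives a waiting period
under this model.  Correlated losses (an ISP outage, a power failure)
break the independence assumption and are not captured here.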

* REJECTED:

    It seems to me that tuning to achieve zero failure cases for
    each size of the resolver population is (to put it kindly) not
    necessarily sound.

    Instead, one might want to achieve an acceptably low probability
    of any chosen resolver failing due to random packet loss, and to
    handle non-random "short-term" outages (which may last days if,
    say, the US East-coast grid hypothetically goes down for 3 days,
    or, much more likely, some home computer is shut down for a few
    weeks while the administrator is on vacation).

    So here, the zone administrator needs to stretch the retry interval
    by a fudge factor and/or to a minimum time they're comfortable with,
    but I don't see much relevance of "numResolvers" or plausibility of
    any sort of "normal distribution" model.  When an ISP has an outage,
    or power is lost, or there's a DDoS attack, lost queries are highly
    correlated.

    Therefore, I would discard the table, and just recommend a sensible
    fudge factor that is likely to work well enough in practice.  Say
    stretch the retry time by a factor of 5 to account for transient
    connectivity loss, software restarts, ... and also set a minimum
    time for less random "short-term" outages one is willing to tolerate.

    No finite time can protect a resolver or set of resolvers subject to
    a sustained long-term DoS, and these will need to be manually rekeyed
    once reliable connectivity is restored.
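
The alternative suggested above can be sketched as follows, under
stated assumptions: the factor of 5 comes from the comment, while the
function name and the 7-day default minimum are placeholders picked
for illustration, not recommended values.

```python
from datetime import timedelta


def padded_wait(retry_time: timedelta,
                fudge_factor: int = 5,
                min_outage: timedelta = timedelta(days=7)) -> timedelta:
    """Stretch retry_time by a fudge factor, but never wait less than
    the longest short-term outage the operator is willing to tolerate.

    fudge_factor=5 follows the suggestion in the comment; min_outage
    is a hypothetical default.
    """
    return max(retry_time * fudge_factor, min_outage)


if __name__ == "__main__":
    print(padded_wait(timedelta(hours=1)))
    print(padded_wait(timedelta(days=1), min_outage=timedelta(days=3)))
```

This takes no numResolvers input at all, which is the point of the
suggestion: the wait depends only on the retry interval and on the
operator's tolerance for outages.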

+ Response: There is no perfect solution.  In fact, I didn't plan on
  including any network loss factors in the document at all, since I
  originally intended it to be just an analysis of the math behind the
  model and not real-world scenarios.  However, consensus was that we
  should include at least some real-world outage estimates.  Mike
  StJohns came up with this model and text.  He put a fair amount of
  analysis into it, and included a Java-based simulation in one mail
  message that showed the model works as expected.  Thus, though I
  think an alternative and potentially simpler model would also work,
  we'll stick with his model since the text is already written and
  complete, and the model has been shown to work in simulations.

  In the end, as stated earlier in the document, I really think a
  much more extensive guide is needed for properly using 5011 in all
  kinds of situations.  Maybe someone will author such a guide in the
  future, with a different model for calculating potential losses, and
  that document can UPDATE this one.


-- 
Wes Hardaker
USC/ISI
