Re: [DNSOP] I-D Action: draft-ietf-dnsop-extended-error-04.txt

Wes Hardaker <wjhns1@hardakers.net> Mon, 11 March 2019 22:11 UTC

From: Wes Hardaker <wjhns1@hardakers.net>
To: Petr Špaček <petr.spacek@nic.cz>
Cc: dnsop@ietf.org
References: <154689301066.32204.17312124670782800354@ietfa.amsl.com> <ybl1s5nxgau.fsf@w7.hardakers.net> <3c2ef704-148f-ed03-26a9-8ea29256acc2@nic.cz>
Date: Mon, 11 Mar 2019 15:10:51 -0700
In-Reply-To: <3c2ef704-148f-ed03-26a9-8ea29256acc2@nic.cz> ("Petr Špaček"'s message of "Thu, 7 Feb 2019 16:47:01 +0100")
Message-ID: <yblpnqx6p04.fsf@wu.hardakers.net>
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/26.1 (gnu/linux)
MIME-Version: 1.0
Content-Type: text/plain
Archived-At: <https://mailarchive.ietf.org/arch/msg/dnsop/rs07ma7XSnFwaYwyNoVDtVbtYYk>
Subject: Re: [DNSOP] I-D Action: draft-ietf-dnsop-extended-error-04.txt


Hi Petr,

Sorry for the delay in responding to your excellent review.  You raised
a large number of good suggestions and clarifications.  Attached are my
more detailed actions and responses to your points.  Look for "results:"
and "response:" for my/our responses to each item.  [Warren and I
discussed many of these items yesterday.]


7 Petr Spacek
=============

  Prelim: first of all I believe this is useful and support the work, but
  still


7.1 TODO implementations needed
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

  needs more work *and implementation experience* before going to LC.

  Here are a couple of specific changes to version 04.

  + results: I believe the WG agrees, and the draft will not likely
    progress until implementations exist.

    --- Minor changes/clarifications ---


7.2 DONE reserved bits
~~~~~~~~~~~~~~~~~~~~~~

  > 2.  Extended Error EDNS0 option format o The RESERVED bits, 15 bits:
  >    these bits are reserved for future use, potentially as additional
  >    flags.  The RESERVED bits MUST be set to 0 by the sender and MUST
  >    be ignored by the receiver.

  IMHO "SHOULD be ignored" is asking for trouble. We just went through
  DNS flag day to clean up implementations which insisted on some fields
  being zero. Can we please use this instead?  "The RESERVED bits MUST
  be set to 0 by the sender and MUST be ignored by the receiver."

  + Result: that makes sense.  Done.
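
  The sender/receiver rule for the RESERVED bits can be sketched as
  follows.  This is only an illustration: it assumes the R flag is the
  most significant bit of the 16-bit flags field (as the option diagram
  suggests), and the function names are hypothetical.

```python
import struct

# Assumption: R is the most significant bit of the 16-bit flags field;
# the remaining 15 bits are RESERVED.
R_FLAG = 0x8000


def encode_flags(retry: bool) -> bytes:
    """Sender side: RESERVED bits are always written as zero."""
    return struct.pack("!H", R_FLAG if retry else 0)


def decode_flags(data: bytes) -> bool:
    """Receiver side: read only the R bit and ignore the RESERVED bits,
    whatever value they carry."""
    (value,) = struct.unpack("!H", data[:2])
    return bool(value & R_FLAG)
```

  A receiver written this way keeps working if a later revision assigns
  meaning to some of the RESERVED bits.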


7.3 DONE EDNS option vs OPT Pseudo-RR
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

  > 3.  Use of the Extended DNS Error option The Extended DNS Error
  >    (EDE) is an EDNS option.  It can be included in any response
  >    (SERVFAIL, NXDOMAIN, REFUSED, etc) to a query that includes an
  >    EDNS option.

  Why "EDNS option" (at very end of the sentence) and not "OPT
  Pseudo-RR"?  AFAIK it is perfectly fine to send EDNS0 OPT without any
  options inside.  Proposed text (only the last line was changed): The
  Extended DNS Error (EDE) is an EDNS option.  It can be included in any
  response (SERVFAIL, NXDOMAIN, REFUSED, etc) to a query that includes
  OPT Pseudo-RR [RFC 6891].

  + Results: accepted; thanks for the text.
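
  For implementers, the condition "the query includes an OPT Pseudo-RR"
  could be checked roughly as below.  This is a minimal sketch that
  walks the wire format by hand; the helper names are hypothetical and
  it does no bounds checking on malformed input.

```python
import struct

OPT_TYPE = 41  # RR type of the OPT pseudo-RR (RFC 6891)


def _skip_name(msg: bytes, off: int) -> int:
    """Return the offset just past the domain name starting at off."""
    while True:
        length = msg[off]
        if length & 0xC0 == 0xC0:  # compression pointer: 2 bytes, ends the name
            return off + 2
        off += 1
        if length == 0:  # root label terminates the name
            return off
        off += length


def query_has_opt(msg: bytes) -> bool:
    """True if the additional section of msg contains an OPT pseudo-RR."""
    qd, an, ns, ar = struct.unpack("!4H", msg[4:12])
    off = 12
    for _ in range(qd):
        off = _skip_name(msg, off) + 4  # skip QTYPE and QCLASS
    for i in range(an + ns + ar):
        off = _skip_name(msg, off)
        rtype, _cls, _ttl, rdlen = struct.unpack("!HHIH", msg[off:off + 10])
        off += 10 + rdlen
        if i >= an + ns and rtype == OPT_TYPE:
            return True
    return False
```

  A responder would attach an EDE option only when this check succeeds,
  regardless of whether the query's OPT carried any options itself.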


7.4 DONE wording issues with the response-code field text
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

  > 3.2.  The RESPONSE-CODE field
  >    This 4-bit value SHOULD be a copy of the RCODE from the primary
  >    DNS packet.  Multiple EDNS0/EDE records may be included in the
  >    response.  When including multiple EDNS0/EDE records in a response
  >    in order to provide additional error information, other
  >    RESPONSE-CODEs MAY use a different RCODE.

  This paragraph worries me for multiple reasons:

  1) Terminology: EDE is an EDNS option, not a record!
  a) If I am an implementer, in what cases might I want to go against
     "4-bit value SHOULD be a copy of the RCODE"?
  b) Terminology: Where is the definition of "primary DNS packet"?
  c) When I read this now, many months after the initial draft, I have
     trouble understanding the logic of why we are duplicating RCODE
     here. There might be good reasons, but we need to state them
     explicitly; otherwise it will get ignored (or misunderstood).

  Unfortunately I have trouble understanding the intent behind this
  description so I'm not able to draft a better text.

  + Response:

  We'll work on the wording, which should hopefully address the lack of
  clarity in the text.  Thank you for pointing out that it's not clear.

  In the past, the WG has discussed (more than once) whether and how to
  divide up the error code range.  There are some slides from past
  IETF meetings, as well as past conversations on the mailing list (see
  the conversation with Donald Eastlake, for example).  A few thoughts
  that came out of the discussions centered around multiple points:

  - the desire to include an organized set of error codes grouped by
    RCODE
  - most of the time, the extended error codes would be directly related
    to a particular RCODE (you found an exception)
  - There was a desire to include multiple extended error codes within a
    response, and sometimes it may be beneficial to return an error code
    associated with another RCODE as a supplemental error code.
  - If two RCODEs needed a similar extended error, there is no reason
    you can't create two separate (likely identical) extended error
    codes attached to two RCODE values.
  - Packing it all into a single 16-bit integer/short width field meant
    implementations could treat the combination as a double-lookup table
    if they'd prefer, or as a single 16-bit error code and it should
    work either way, providing implementations greater flexibility.
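
  The last point can be illustrated with a short sketch.  It assumes
  RESPONSE-CODE occupies the high 4 bits and INFO-CODE the low 12 bits
  of the 16-bit field; the function and table names are hypothetical,
  and the sample entries are the codes discussed in this message.

```python
# RCODE values from RFC 1035.
SERVFAIL = 2
NOTIMP = 4


def pack_code(response_code: int, info_code: int) -> int:
    """Combine a 4-bit RESPONSE-CODE and a 12-bit INFO-CODE into 16 bits."""
    if not 0 <= response_code <= 0xF:
        raise ValueError("RESPONSE-CODE must fit in 4 bits")
    if not 0 <= info_code <= 0xFFF:
        raise ValueError("INFO-CODE must fit in 12 bits")
    return (response_code << 12) | info_code


def unpack_code(combined: int) -> tuple:
    """Split the 16-bit value back into (RESPONSE-CODE, INFO-CODE)."""
    return combined >> 12, combined & 0xFFF


# Double-lookup table: RCODE first, then INFO-CODE.
BY_RCODE = {
    SERVFAIL: {8: "NSEC Missing", 9: "Cached Error", 10: "Server Not Ready"},
    NOTIMP: {1: "Deprecated"},
}

# The same data keyed by the single 16-bit value.
FLAT = {
    pack_code(rc, ic): name
    for rc, codes in BY_RCODE.items()
    for ic, name in codes.items()
}


def describe_nested(combined: int):
    rc, ic = unpack_code(combined)
    return BY_RCODE.get(rc, {}).get(ic)


def describe_flat(combined: int):
    return FLAT.get(combined)
```

  Both lookups return the same description for the same wire value, so
  an implementation can pick whichever table shape suits it.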

  Hopefully that makes sense?  I've added your new proposed stale codes,
  as mentioned below.

  I've changed the text for RESPONSE-CODE and INFO-CODE in order to
  hopefully help.  I'd love your thoughts and suggestions for
  improvements though.


7.5 NOCHANGE why an R flag in unsupported key/ds
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

  > 4.1.1.  NOERROR Extended DNS Error Code 1 - Unsupported DNSKEY
  >    Algorithm
  >    The resolver attempted to perform DNSSEC validation, but a
  >    DNSKEY RRSET contained only unknown algorithms.  The R flag
  >    should be set.
  > 4.1.2.  NOERROR Extended DNS Error Code 2 - Unsupported DS
  >    Algorithm
  >    The resolver attempted to perform DNSSEC validation, but a
  >    DS RRSET contained only unknown algorithms.  The R flag should
  >    be set.

  Why R flag? This is not an error, resolution succeeded, and there is
  nothing to retry. I propose changing both cases to "The R flag should
  not be set."

  + Stephane answered on list with the same answer as given below.

  + Answer: Because other resolvers may understand DS and DNSKEY
    algorithms.  So the client (stub resolver) should keep trying.
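
  A stub resolver acting on the R flag might look roughly like this.
  It is only a sketch: the `ask` callback and its `(answer, retry)`
  return shape are hypothetical stand-ins for a real query function.

```python
from typing import Callable, Iterable, Optional, Tuple


def resolve_with_retry(
    resolvers: Iterable[str],
    ask: Callable[[str], Tuple[str, bool]],
) -> Optional[str]:
    """Query each resolver in turn until one answers without the R flag.

    ask(resolver) is assumed to return (answer, retry), where retry
    mirrors the R flag of an EDE option in the response.
    """
    last_answer = None
    for resolver in resolvers:
        answer, retry = ask(resolver)
        last_answer = answer
        if not retry:
            return answer  # no hint that another resolver might do better
    # Every resolver set R (or none were configured): use the last answer.
    return last_answer
```

  With "Unsupported DNSKEY Algorithm" the first resolver still returns a
  usable answer, but a second resolver that understands the algorithm
  may return a validated one.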


7.6 DONE indeterminate should be NOERROR
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

  > 4.2.2.  SERVFAIL Extended DNS Error Code 2 - DNSSEC Indeterminate
  >    The resolver attempted to perform DNSSEC validation, but
  >    validation ended in the Indeterminate state.  The R flag should
  >    not be set.

  This should be in NOERROR category.

  AFAIK the Indeterminate state is not an error; it is most likely a
  configuration choice on the resolver. E.g. a DNSSEC-validating
  resolver running without any trust anchor is in Indeterminate state.

  + Result: You're right, it should be (according to 4033).


  --- New code points ---

  I propose to add a couple more codes:


7.7 DONE new code: NSEC missing
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

  + SERVFAIL Extended DNS Error Code 8 - NSEC Missing
    The resolver attempted to perform DNSSEC validation, but the
    requested data were missing and covering NSEC was not provided.
    RETRY=0

  + status: good idea and added.  I set the retry bit, though, as
    another resolver may not have the same issues, or may have NSEC data
    cached.


7.8 DONE new code: Cached error
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

  + SERVFAIL Extended DNS Error Code 9 - Cached Error
    The resolver has cached SERVFAIL for this query.  RETRY=1
  Often the SERVFAIL comes from cache, which is unlikely to contain
  specific error details, but it is still useful to distinguish a
  "proper" cached SERVFAIL from other weird errors like running out of
  file descriptors etc. Info text could contain remaining TTL ...

  + status: added


7.9 DONE new code: server not ready
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

  + SERVFAIL Extended DNS Error Code 10 - Server Not Ready
    Server is not up and running (yet).  RETRY=1

  + status: added


7.10 DONE new code: deprecated
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

  + NOTIMP Extended DNS Error Code 1 - Deprecated
  Requested operation or query is not supported because it was
  deprecated.  Retrying request elsewhere is unlikely to yield any other
  results.  RETRY=0
  Intended use:
  - OPCODE=IQUERY
  - OPCODE=QUERY QTYPE={ANY, RRSIG, MAILA, MAILB} etc.

  + status: Added.  Was tempted to set R=1 because other servers may
    support it, but the reality is that if it's deprecated it shouldn't
    be used at all.

  --- More adventurous proposals ---


7.11 new flags
~~~~~~~~~~~~~~

  a) Two more bits to implement "advice for user" (longer explanation
  can be found in archives
  [https://mailarchive.ietf.org/arch/msg/dnsop/b3wtVj_aWm24PXyHr1M9NMj3LJ0])

  I believe this will make the draft way more useful for everyone and
  not just geeks.

  Proposed addition to text:

  > 2.  Extended Error EDNS0 option format
       +---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
    4: | R | N | F |                     RESERVED                      |
       +---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
  proposal


7.11.1 NOCHANGE NEAR flag
-------------------------

  o The NEAR flag, 1 bit; the NEAR bit (N) indicates a flag defined for
     use in this specification.


7.11.2 NOCHANGE FAR flag
------------------------

  o The FAR flag, 1 bit; the FAR bit (F) indicates a flag defined for
     use in this specification.

  > 3.  Use of the Extended DNS Error option

  3.2.  The N (Near) flag
     The N (Near) flag indicates that the error reported is likely
     caused by conditions "near" the sender. Value 1 is a hint for the
     user interface that the user should contact the administrator
     responsible for local DNS.

  For example, a DNS resolver running on CPE will set N=1 in its error
  responses if it detects that all queries to the upstream DNS resolver
  timed out. This likely indicates a link problem and must be fixed
  locally.

  Another example is a DNSSEC validator which detects that the query
  ". IN NS" fails DNSSEC validation because the signature is expired or
  not yet valid. This most likely indicates misconfigured system time
  and needs to be investigated and fixed locally.


  3.3.  The F (Far) flag
     The F (Far) flag indicates that the error reported is likely
     caused by conditions on the "far" end, i.e. typically the
     authoritative side or an upstream forwarder. Value 1 is a hint for
     the user interface to display a message suggesting that the user
     contact the operator of the "far end", because it is unlikely that
     the local operator can fix the problem.

  For example, a DNS resolver might set F=1 if all authoritative
  servers for a given domain are lame.


7.11.3 NOCHANGE Response to both:
---------------------------------

  These seem interesting on their face, and potentially useful for
  receivers as you indicate.  However, they also seem subjective, and
  it is hard to be deterministic about when and how to set them.
  Additionally, most errors should already give a hint as to whether a
  given error is near or far based on the error itself (even better
  hints might be put into the EXTRA-TEXT field).

  I'd (we'd) love to hear other WG member opinions on this subject.


7.12 NOCHANGE optional TTL to the option
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

  b) Another thing to consider is adding an optional TTL value to the
  EDE option.  E.g. there is no point in retrying the query again and
  again while the bogus response is still cached. It is much better to
  display the error message "try again in 10 seconds, if the problem
  persists call X" than just "try again".

  What do you think?

  + Result (Wes): So, I think this adds too much complexity to the
    system that we're otherwise trying to keep simple.  If particular
    errors are likely to be retried successfully after a certain period
    of time, text could be added to the error descriptions to hint at
    that instead.  Otherwise we're adding another layer of caching,
    which spells a lot more code I'd think.


7.13 DONE answer with stale data
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

  Yet another code proposal:
  * answer with stale data

    The resolver was unable to resolve the answer within its time limits
    and decided to answer with stale data instead of answering with an
    error.  This is typically caused by problems on the authoritative
    side, possibly as a result of a DoS attack. Retrying is likely to
    cause load and not yield a fresh answer, RETRY=0.

  The problem here is that this code point is applicable to NOERROR as
  well as NXDOMAIN answers, so I'm not sure how to categorize it. This
  reinforces my unanswered question of why the draft proposes to copy
  RCODE into EDE.

  + Result: Added two codes, one per RCODE, per discussion above.

-- 
Wes Hardaker
USC/ISI