Re: [DNSOP] Dnsdir last call review of draft-ietf-dnsop-caching-resolution-failures-03

"Wessels, Duane" <dwessels@verisign.com> Thu, 29 June 2023 23:58 UTC

From: "Wessels, Duane" <dwessels@verisign.com>
To: Peter van Dijk <peter.van.dijk@powerdns.com>
CC: "dnsdir@ietf.org" <dnsdir@ietf.org>, "dnsop@ietf.org" <dnsop@ietf.org>, "draft-ietf-dnsop-caching-resolution-failures.all@ietf.org" <draft-ietf-dnsop-caching-resolution-failures.all@ietf.org>, "last-call@ietf.org" <last-call@ietf.org>
Archived-At: <https://mailarchive.ietf.org/arch/msg/dnsop/ddL-vh5f2cSFcGCtqcu0QCTvB3g>

Hi Peter,

Thank you for the detailed review.  Responses from the authors are inline below.



> On Jun 26, 2023, at 7:47 AM, Peter van Dijk via Datatracker <noreply@ietf.org> wrote:
> 
> Reviewer: Peter van Dijk
> Review result: Almost Ready
> 
> I have been selected as the DNS Directorate reviewer for this draft. The
> DNS Directorate seeks to review all DNS or DNS-related drafts as
> they pass through IETF last call and IESG review, and sometimes on special
> request. The purpose of the review is to provide assistance to the ADs.
> For more information about the DNS Directorate, please see
> https://wiki.ietf.org/en/group/dnsdir
> 
> This document is generally in good shape. It is not too prescriptive, leaving
> room to implementers to honour the requirements in a way that makes sense for
> their implementations. The document has not seen a lot of WG discussion; I hope
> this means people have read it and generally agree.
> 
> Section 3.3 contains a "FOR DISCUSSION" note. I believe this means the document
> cannot currently pass Last Call. (See below for some notes on that discussion
> item.)
> 
> I have various nits and suggestions. Please see them below. Section numbers are
> for -03.
> 
> ## 2
> 
> I know section 2 is not meant to be exhaustive, but I wonder if FORMERR
> deserves a mention. In theory, a FORMERR response will not improve with time
> until an auth operator actively intervenes (by updating/replacing software to a
> more compliant version). SERVFAILs, by comparison, can be much more transient.

Good suggestion, we’ve added a short section on FORMERR.

> 
> ## 2.1
> 
> current:
> 
>> Authoritative servers, and more specifically secondary servers,
>> return server failure responses when they don't have any valid data
>> for a zone.  That is, a secondary server has been configured to serve
>> a particular zone, but is unable to retrieve or refresh the zone data
>> from the primary server.
> 
> proposed:
> 
>> Authoritative servers, and more specifically secondary servers,
>> return server failure responses when they don't have any valid data
>> for a query.  For example, a secondary server has been configured to serve
>> a particular zone, but is unable to retrieve or refresh the zone data
>> from the primary server.

Yes, thanks.

> 
> ## 2.2
> 
> The first paragraph correctly mentions "policy reasons". The second paragraph
> correctly says "they are not authoritative". I am not sure not being
> authoritative can be considered a policy reason, so perhaps these two
> paragraphs can be connected with an "or"?

I see your point.  We propose this change to the introduction sentence:

A name server returns a message with the RCODE field set to REFUSED
when it refuses to process the query, e.g., for policy or other reasons.


> 
> ## 2.3
> 
> "If, however, the implementation does not join outstanding queries together,
> ..." - this could use a reference to 5452 4.3 and 5452 5, pointing out that
> implementations really should be joining queries together for security reasons
> whenever they can, beside the reason given in the draft of not overloading
> authoritatives.

Added.

> 
> ## 3.1
> 
> "A resolver MUST NOT retry a given query over a server's transport  more than
> twice" - should this be clarified to say "in a short period of time" or
> something like that? Clearly a retry is allowed *eventually*.

For reference, here’s the sentence in question at the start of 3.1:

   A resolver MUST NOT retry a given query over a server's transport more
   than twice (i.e., three queries in total) before considering the
   server's transport unresponsive for that query.

We feel that “a given query” and “for that query” in the sentence sufficiently limit the
scope here, and there is no need to qualify it by some amount of time.

As an example, let’s say that a recursive has been asked to look up www.example.com (our “given” query).  The example.com zone has two name servers, each of which has two IP addresses, and (presumably) two transports.  It can send 3 queries to 199.43.135.53 over UDP (then that transport is unresponsive), 3 queries to 199.43.133.53 over UDP, the same over TCP, over IPv6, and so on.  In total the recursive can send 2 name servers x 2 addresses x 2 transports x 3 queries = 24 queries.  If all servers and all transports are unresponsive, at that point the resolver gives up on that query and returns SERVFAIL.
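
As a rough illustration only (not text from the draft), the count above can be checked with a short Python sketch. It assumes each name server has one IPv4 and one IPv6 address and is reachable over UDP and TCP; the server labels are placeholders.

```python
from itertools import product

MAX_ATTEMPTS = 3  # one initial query plus two retries per server transport

# Hypothetical labels: two name servers, two addresses each, two transports.
servers = ["ns-a", "ns-b"]
address_families = ["IPv4", "IPv6"]
transports = ["UDP", "TCP"]


def worst_case_queries():
    """Count queries sent when every server transport is unresponsive."""
    sent = 0
    for _server, _family, _transport in product(servers, address_families, transports):
        for _attempt in range(MAX_ATTEMPTS):
            sent += 1  # each attempt times out, so all three are used
    return sent


print(worst_case_queries())  # 2 * 2 * 2 * 3 = 24
```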

Then, section 3.2 is about caching and says that the resolution failure MUST be cached for at least 5 seconds, but otherwise gives implementations a lot of freedom in how to do that.  Could be by query tuple, by server/transport, or some other way.
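
To make the "at least 5 seconds" point concrete, here is a minimal sketch of a failure cache, assuming an opaque key chosen by the implementation. The class name, method names, and injectable clock are all hypothetical; the draft does not prescribe any of this.

```python
import time

MIN_FAILURE_TTL = 5.0  # seconds; the lower bound for caching a failure


class FailureCache:
    """Remembers resolution failures under whatever key the implementation picks."""

    def __init__(self, ttl=MIN_FAILURE_TTL, clock=time.monotonic):
        # Never cache for less than the required minimum.
        self.ttl = max(ttl, MIN_FAILURE_TTL)
        self.clock = clock
        self._expiry = {}

    def record_failure(self, key):
        """Mark key as failed from now until the TTL runs out."""
        self._expiry[key] = self.clock() + self.ttl

    def is_failed(self, key):
        """Return True while a cached failure for key is still fresh."""
        expires = self._expiry.get(key)
        if expires is None:
            return False
        if self.clock() >= expires:
            del self._expiry[key]  # entry has aged out; allow a fresh attempt
            return False
        return True
```

The key could be a query tuple such as ("www.example.com.", "IN", "A") or a server/transport pair such as ("199.43.135.53", "udp"); the cache itself does not care which.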




> 
> Also, "MUST NOT" is pretty strong language. Given the various process models of
> resolver implementations, two subprocesses (threads) both retrying the same or
> a similar thing a few times can not always be avoided. Would you settle for
> SHOULD NOT? The "given" in "retry a given query" gives some leeway, but not
> enough, I feel.

We feel that MUST NOT is appropriate but would like more input from working group
members and implementors especially.


> 
> "may retry a given query over a different transport .. believe .. is available"
> - this ignores that some transports have better security properties than
> others. One currently active draft in this area is
> draft-ietf-dprive-unilateral-probing. Perhaps add some wording, without being
> too prescriptive, such as "available, and compatible with the resolver's
> security policies, ..".

We think “compatible with the resolver’s security policies” goes without saying, but don’t mind making it explicit.


> 
> ## 3.2
> 
> A previous review
> (https://mailarchive.ietf.org/arch/msg/dnsop/sJlbyhro-4bDhfGBnXhhD5Htcew/)
> suggested that the then-chosen tuple was not specific enough, and also said it
> was too prescriptive. I agree with both. The current draft prescribes nothing,
> which I'm generally a fan of!
> 
> However, a coworker (the one likely responsible for implementing
> this draft, if it turns out our implementation deviates from its final form)
> told me "some guidance would be nice". After some discussion on
> prescriptiveness, here is our suggestion: do not prescribe, but mention
> (without wanting to be complete) a few tuple formats that might make sense, and
> suggest that implementations document what they choose here.

The relevant text here currently says:

   The implementation might cache different resolution failure conditions
   differently.  For example, DNSSEC validation failures might be cached
   according to the queried name, class, and type, whereas unresponsive
   servers might be cached only according to the server's IP address.

So we provide two examples, although not really phrased as “tuples”.  I guess you’re suggesting that we list more options here and describe them as tuples?

For the documentation suggestion, maybe something like this: “Developers SHOULD document their implementation choices so that operators know what behaviors to expect when resolution failures are cached.”
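
Phrased as tuples, the two examples from the current text might look like the sketch below. This is only an illustration of the idea; the enum, function name, and parameter names are hypothetical, and other key formats are equally plausible.

```python
from enum import Enum


class FailureKind(Enum):
    DNSSEC_VALIDATION = "dnssec-validation"
    UNRESPONSIVE_SERVER = "unresponsive-server"


def failure_cache_key(kind, qname=None, qclass=None, qtype=None, server_ip=None):
    """Build a cache key tuple that depends on the kind of failure."""
    if kind is FailureKind.DNSSEC_VALIDATION:
        # Validation failures: keyed by queried name, class, and type.
        return (kind.value, qname.lower(), qclass, qtype)
    if kind is FailureKind.UNRESPONSIVE_SERVER:
        # Unresponsive servers: keyed only by the server's IP address.
        return (kind.value, server_ip)
    raise ValueError(f"unsupported failure kind: {kind!r}")
```

An implementation that documents its choices could simply list which tuple it uses for each failure kind.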

> 
> ## 3.3
> 
>> FOR DISCUSSION: the requirement quoted above may be problematic
>> today.  e.g., focusing on NS as the query type (a) probably goes
>> against qname minimization, and (b) is not the real problem.  Also
>> RFC 4697 doesn't place any time restriction (TTL) on this.
> 
> *Before* qname minimization, queries that yield delegation answers often did
> not have type NS. With qname minimization, depending on the implementation,
> those queries might in fact be NS. (7816 specifies NS; 9156 relaxes the qtype
> requirement for qname-minimized queries). That said, there is no reason for the
> requery (which, as this draft reiterates, MUST NOT be done) to use NS, and so,
> I do agree the focus on the NS type should be removed.
> 
> As for TTL, the originally received delegation will eventually expire, so the
> requery will in fact happen at some time after that expiry.

First, we apologize for not realizing that this and two other “for discussion” questions were not yet resolved.  We plan to remove the first (from the Introduction).

For the one that was in section 2.6, we propose this updated text and new section 3.4:

2.6.  DNSSEC Validation Failures

   For zones that are signed with DNSSEC, a resolution failure can occur
   when a security-aware resolver believes it should be able to
   establish a chain-of-trust for an RRset but is unable to do so,
   possibly after trying multiple authoritative name servers.  DNSSEC
   validation failures may be due to signature mismatch, missing DNSKEY
   RRs, problems with denial-of-existence records, clock skew, or other
   reasons.

   Section 4.7 of [RFC4035] already discusses the requirements and
   reasons for caching validation failures.  Section 3.4 of this
   document strengthens those requirements.

3.4.  DNSSEC Validation Failures

   Section 4.7 of [RFC4035] states:

   To prevent such unnecessary DNS traffic, security-aware resolvers MAY
   cache data with invalid signatures, with some restrictions.

   This document updates [RFC4035] with the following, stronger
   requirement:

   To prevent such unnecessary DNS traffic, security-aware resolvers
   MUST cache DNSSEC validation failures, with some restrictions.



And for the one in section 3.3 we propose this:  

3.3.  Requerying Delegation Information

   Section 2.1 of [RFC4697] identifies circumstances in which "every
   name server in a zone's NS RRSet is unreachable (e.g., during a
   network outage), unavailable (e.g., the name server process is not
   running on the server host), or misconfigured (e.g., the name server
   is not authoritative for the given zone, also known as 'lame')."  It
   prohibits unnecessary "aggressive requerying" to the parent of a non-
   responsive zone by sending NS queries.

   The problem of aggressive requerying to parent zones is not limited to
   queries of type NS.  This document updates the requirement from
   section 2.1.1 of [RFC4697] to apply more generally: Upon encountering
   a zone whose name servers are all non-responsive, a resolver MUST
   cache the resolution failure.  Furthermore, the resolver MUST limit
   queries to the non-responsive zone's parent zone (and other ancestor
   zones) just as it would limit subsequent queries to the non-
   responsive zone.
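
One possible reading of the proposed 3.3 text, sketched in Python: while a zone's failure is cached, queries for names inside that zone are withheld from the zone's own servers and from its parent and other ancestors alike. The function names are hypothetical, and suppressing queries outright is just one way to "limit" them.

```python
def normalize(name):
    """Lower-case a domain name and make sure it ends with a dot."""
    name = name.lower()
    return name if name.endswith(".") else name + "."


def ancestor_zones(zone):
    """Return the parent and higher ancestors of a zone, nearest first."""
    zone = normalize(zone)
    if zone == ".":
        return []  # the root has no ancestors
    labels = zone.rstrip(".").split(".")
    ancestors = [".".join(labels[i:]) + "." for i in range(1, len(labels))]
    return ancestors + ["."]


def is_within(name, zone):
    """True if name is at or below zone."""
    name, zone = normalize(name), normalize(zone)
    return zone == "." or name == zone or name.endswith("." + zone)


def may_query(qname, target_zone, failed_zones):
    """Decide whether a query for qname may go to target_zone's servers.

    failed_zones holds zones whose name servers were all non-responsive
    and whose failure is still cached.
    """
    target = normalize(target_zone)
    for failed in failed_zones:
        failed = normalize(failed)
        if is_within(qname, failed) and (
            target == failed or target in ancestor_zones(failed)
        ):
            return False  # treat the ancestors just like the failed zone
    return True
```

Names outside the failed zone are unaffected, so a failure cached for example.com does not stop lookups for other names under com.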


DW