Re: [DNSOP] Working Group Last call for draft-ietf-dnsop-dns-error-reporting

Roy Arends <roy@dnss.ec> Wed, 05 July 2023 11:17 UTC

From: Roy Arends <roy@dnss.ec>
In-Reply-To: <ZJn_cwWWOKIn1wbq@straasha.imrryr.org>
Date: Wed, 05 Jul 2023 12:17:34 +0100
Content-Transfer-Encoding: quoted-printable
Message-Id: <76E9FBC8-9F6D-4050-9C6F-E92A2CBEB326@dnss.ec>
References: <ZJn_cwWWOKIn1wbq@straasha.imrryr.org>
To: dnsop@ietf.org
X-Mailer: Apple Mail (2.3731.600.7)
Archived-At: <https://mailarchive.ietf.org/arch/msg/dnsop/wzj-mJ6D9aX5uC_fkKrkYFI9UJQ>
Subject: Re: [DNSOP] Working Group Last call for draft-ietf-dnsop-dns-error-reporting

Viktor, thanks for your feedback. Comments inline.

> On 26 Jun 2023, at 22:13, Viktor Dukhovni <ietf-dane@dukhovni.org> wrote:
> 
> On Thu, Jun 08, 2023 at 11:59:59AM +0200, Benno Overeinder wrote:
> 
>> This starts a two week Working Group Last Call process, and ends on: 
>> June 22nd, 2023.
> 
> I hope my feedback is not too late.  There are a few important elements
> of the draft that could use some changes.
> 
> On Tue, Jun 20, 2023 at 01:14:02PM +0200, Willem Toorop wrote:
>> 
>> I have one nit.
>> 
>> In the Example in section 4.2., a request still "includes an empty ENDS0 
>> report channel". The third paragraph of that same section states 
>> something similar: "As support for DNS error reporting was indicated by 
>> a empty EDNS0 report channel option in the request". But Section 6.1. 
>> Reporting Resolver Specification states: "The EDNS0 report channel 
>> option MUST NOT be included in queries."
> 
> On Tue, Jun 20, 2023 at 12:20:51PM +0100, Roy Arends wrote:
>> 
>> Ah, yes, I will remove that sentence completely!
> 
> So, under what conditions is the authoritative server free to include
> the error reporting channel extension in its reply?
> 
>    - Does the resolver have to explicitly solicit it?

No.

> The reason this is important, is that there is non-negligible population
> of authoritative servers that (EDNS0 requirements notwithstanding) are
> not tolerant of unrecognised EDNS0 options.

Correct.

>  Therefore, soliciting the
> error reporting channel information is (at least initially, while this
> is not widely supported) more likely to lead to errors than to help to
> resolve errors.

Indeed. 

>  This is then not attractive to implement!
> 
> I would prefer to require resolvers to be more tolerant of unexpected 
> options, and would have servers report the channel without explicit
> solicitation.

That is indeed the plan. I shall make that explicit in the new text.

> 
> On Tue, Jun 20, 2023 at 11:35:28PM +0100, Dick Franks wrote:
>>> An authoritative server includes the option if configured to do so
>>> AND if it has a non-null domain name configured as the reporting
>>> channel. It will then reply to each query. This is IMHO better than
>>> having a resolver include the option each and every time. Note that
>>> resolvers will ignore options that are unknown to them.
>> 
>> 6.2.  Authoritative server specification
>> Contains not a shred of normative language saying any of that.
>> 
>> The preliminary waffle in the overview could apply to either the
>> solicited or unsolicited regime.
>> 
>>>> I withdraw my earlier statement that the document is almost ready.
>>>> Now, clearly it is not.
>>> 
>>> I hear you. I do not agree though, and I hope you reconsider
>> Not without further work
> 
> I agree this needs to be made more explicit than just deleting the
> conflicting text.
> 
> On Thu, Jun 22, 2023 at 04:10:46PM -0700, Wes Hardaker wrote:
>> Roy Arends <roy@dnss.ec> writes:
>> 
>>> That, IMHO is already captured by the last paragraph. I did not
>>> explicitly write a recipe of how to do that, and which servers could
>>> be used for that :-). Could you suggest text to improve the last
>>> paragraph without naming services?
>> 
>> Erg.  I hate it when I have to come up with text :-P
>> 
>> How about replacing the last sentence of security considerations with:
>> 
>> This method can be abused by intentionally deploying broken zones with
>> agent domains that are delegated to victims.  This is particularly
>> effective when DNS requests that trigger error messages are sent through
>> open resolvers [RFC8499] or widely distributed network monitoring
>> systems that perform distributed queries from around the globe.
>> Implementations SHOULD rate-limit outgoing error messages to a
>> recipient to no more than 1 a minute.
> 
> What is a "recipient"?  Is it a monitoring agent "zone", or a monitoring
> agent transport endpoint?  If we're concerned about DoS, perhaps it
> should be the latter, since many zones can resolve to the same set of
> underlying nameservers...

I will deal with this text in the update.

> 
> On Fri, Jun 23, 2023 at 01:27:21AM +0000, Ben Schwartz wrote:
> 
>> I want this draft to move forward, but upon review I noted with
>> concern the security section text:
>> 
>>   DNS error reporting is done without any authentication between the
>>   reporting resolver and the authoritative server of the agent domain.
>>   Authentication significantly increases the burden on the reporting
>>   resolver without any benefit to the monitoring agent, authoritative
>>   server or reporting resolver.
>> 
>> Strong authentication (e.g. to a zone identity with DNSSEC) is
>> probably excessive, but the current draft appears to have no defense
>> against even trivial IP spoofing.  Anyone in the world who can spoof
>> IP addresses can impersonate a reputable resolver and pollute the
>> error reports sent to authoritative servers.  As an authoritative
>> server operator, I would place a lot more trust in reports from
>> reputable resolvers than from unrecognized sources.
>> 
>> I think the draft should probably say something like: "To defend
>> against spoofing of source IP addresses used for error reports,
>> reporting resolvers MUST use DNS over TCP [RFC 7766], DNS COOKIE [RFC
>> 7873], or another procedure that defeats IP address spoofing."
> 
> Requiring cookies would be great, but they have not yet seen broad
> adoption.  Can we reasonably expect the monitoring agent zones to
> support them (and ensure consistent cookie keys across the server
> pool behind each server IP)?
> 
> Requiring TCP, combined with per-IP rate limits, is probably simpler.

I will include a note to implementors that reports received over TCP will be more reliable. The rate limiting you mentioned can be managed by resolver caching, right?
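
A minimal sketch of how resolver caching could suppress repeated reports, assuming the reporting resolver caches the answer to each report query for its TTL. The class and method names (ReportCache, should_send, store_answer) are hypothetical and do not come from the draft. Note that the cache is keyed on the full report QNAME, so it only limits repeats of the same report; it is not a per-recipient rate limit.

```python
import time


class ReportCache:
    """Hypothetical cache that suppresses repeated report queries."""

    def __init__(self, clock=time.monotonic):
        self._expiry = {}    # lowercased report QNAME -> cache expiry time
        self._clock = clock  # injectable clock, which makes testing easier

    def should_send(self, report_qname):
        """Return True when no unexpired cached answer exists for this name."""
        expiry = self._expiry.get(report_qname.lower(), float("-inf"))
        return expiry <= self._clock()

    def store_answer(self, report_qname, ttl):
        """Remember that the monitoring agent answered, for ttl seconds."""
        self._expiry[report_qname.lower()] = self._clock() + ttl
```

A resolver would call should_send() before issuing a report query and store_answer() once the monitoring agent has replied.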

> ====== New feedback:
> 
> And last, but not least, as promised, some important suggestions to
> simplify the protocol and improve scalability:
> 
> 
> --- Section 4.  Overview
> 
>> If the authoritative server has indicated support for DNS error
>> reporting and there is an issue that can be reported via extended DNS
>> errors, the reporting resolver encodes the error report in the QNAME
>> of the report query.  The reporting resolver builds this QNAME by
>> concatenating the _er label, the QTYPE, the QNAME that resulted in
>> failure, the extended error code (as described in [RFC8914]), the
>> label "_er" again, and the agent domain.  See the example in
>> Section 4.2.  Note that a regular RCODE is not included because the
>> RCODE is not relevant to the extended error code.
> 
> The proposed qname structure is suboptimal:
> 
>    - There is insufficient justification for the "_er" labels
>      at either end of the error report qname.
> 
>        o  If the monitoring agent wants to see some particular prefix,
>           (perhaps even periodically rotated to quickly drop stale
>           junk) the authoritative server can vend the prefix with the
>           agent domain.  So the "most-significant" parent "_er" is
>           IMNHSO redundant.

The monitoring agent has to determine where the QNAME ends and the agent domain starts. If you assume that a monitoring agent only uses a single agent domain for all its reports, then sure, the "_er" label between the strings is redundant.

If, however, the monitoring agent has domains in use whose least significant labels collide with existing top-level domains, it needs to determine heuristically where the agent domain starts. This is IMHO suboptimal.

>        o The leading "least-significant" "_er" is likewise (see below)
>          not adequately justified.
> 
>        o Making the EDE "info code" more significant than the problem
>          domain makes it harder to disclaim responsibility for an
>          entire DNS subtree (say, all of "xn--p1ai.monitoring.example").
> 
>          Surely the reported domain is *more* significant than the EDE
>          info code.
> 
> Therefore, a much better qname would be:
> 
>        <EDE-info-code>.<qtype>.<qname>.<agent-zone>.

The sole purpose of the leading “least-significant” “_er” is to distinguish between qname-minimized queries (for lack of a better term) and “full” queries. I understand that you argue that a monitoring agent can determine this without the _er labels (as described below), but that seems suboptimal to me.
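
To illustrate, here is a rough sketch of how a monitoring agent could use the two "_er" labels to split a report QNAME without knowing its own agent domains in advance, and to reject minimized queries. The function name parse_report_qname is hypothetical. The sketch assumes names in presentation format without escaped dots, and an agent domain that does not itself contain an "_er" label.

```python
def parse_report_qname(name):
    """Split a report QNAME on its "_er" labels.

    Returns (qtype, qname, info_code, agent_domain), or None when the name
    is not a complete report query (for example, a minimized query).
    """
    labels = name.rstrip(".").split(".")
    lowered = [label.lower() for label in labels]

    # A complete report query always starts with the "_er" label.
    if lowered[0] != "_er":
        return None

    # The rightmost "_er" label marks where the agent domain starts
    # (assumption: the agent domain contains no "_er" label itself).
    sep = len(lowered) - 1 - lowered[::-1].index("_er")

    # At least the QTYPE and info-code labels must sit between the two
    # "_er" labels; a minimized query fails this check.
    if sep < 3:
        return None

    qtype, info_code = labels[1], labels[sep - 1]
    for value in (qtype, info_code):
        if not (value.isascii() and value.isdigit()):
            return None

    qname = ".".join(labels[2:sep - 1]) + "."
    agent_domain = ".".join(labels[sep + 1:]) + "."
    return int(qtype), qname, int(info_code), agent_domain
```

With the example name from the draft, "_er.1.broken.test.7._er.a01.agent-domain.example.", this yields QTYPE 1, QNAME "broken.test.", info code 7 and agent domain "a01.agent-domain.example.". A minimized query such as "_er.a01.agent-domain.example." is rejected.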

>> The resulting report query is sent as a standard DNS query for a TXT
>> DNS resource record type by the reporting resolver.
> 
> Also, qtypes are cheap, and I rather think that a dedicated qtype (one
> that a supporting resolver might refuse to accept in queries from
> clients for example) makes sense here.  There's no need to overload
> TXT here.

This seems counterintuitive to me. A qtype that a supporting resolver might refuse to accept in queries from clients is either a temporary state (it may be accepted in the near future when this qtype will be implemented), or it needs to be specified that this qtype should not be accepted in queries from clients, which makes this qtype not cheap (that is, we won’t be able to simply use the template to request one, as it requires additional work).

Allocating a new QTYPE for this purpose just seems redundant. 

>> This document gives no guidance on the content of the TXT resource
>> record RDATA for this record.
> 
> The dedicated qtype should have an empty payload.

We can require that the returned TXT record should have an empty payload.

> 
>> If the monitoring agent were to respond with NXDOMAIN (name error),
>> [RFC8020] says that any name at or below that domain should be
>> considered unreachable, and negative caching would prohibit
>> subsequent queries for anything at or below that domain for a period
>> of time, depending on the negative TTL [RFC2308].
> 
> As mentioned above, making the "info-code" more significant than the
> domain gets in the way here.
> 
>> The reporting resolver constructs the QNAME
>> "_er.1.broken.test.7._er.a01.agent-domain.example." and resolves it.
>> This QNAME indicates extended DNS error 7 occurred while trying to
>> validate "broken.test." type 1 record.
> 
> Therefore, make that:
> 
> 
>> The QNAME for the report query is constructed by concatenating the
>> following elements, appending each successive element in the list to
>> the right-hand side of the QNAME:
>> 
>> *  A label containing the string "_er".
>> 
>> *  The QTYPE that was used in the query that resulted in the extended
>>    DNS error, presented as a decimal value, in a single DNS label.
>> 
>> *  The QNAME that was used in the query that resulted in the extended
>>    DNS error.  The QNAME may consist of multiple labels and is
>>    concatenated as is, i.e. in DNS wire format.
>> 
>> *  The extended DNS error, presented as a decimal value, in a single
>>    DNS label.
>> 
>> *  A label containing the string "_er".
>> 
>> *  The agent domain.  The agent domain as received in the EDNS0
>>    report channel option set by the authoritative server.
> 
> See above, drop the pointless "_er" labels, and move the info code to
> the leaf label.
> 
>> The "_er" labels allow the monitoring agent to differentiate between
>> the agent domain and the faulty query name.  When the specified agent
>> domain is empty, or a null label (despite being not allowed in this
>> specification), the report query will have "_er" as a top-level
>> domain as a result and not the original query.  The purpose of the
>> first "_er" label is to indicate that a complete report query has
>> been received, instead of a shorter report query due to query
>> minimization.
> 
> Instead, note that qname minimised queries will not have the same qtype
> (be it TXT or dedicated).  Instead they'll typically be "A" or "NS",
> and also the reporting resolver should avoid all qname minimisation
> below the agent domain, unasking the question.

Viktor, your optimisations (removing the _er labels) are premature, as they turn a deterministic process at the monitoring agent into a heuristic process.
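
For reference, the construction as currently specified in the draft (the six elements quoted above) can be sketched as follows. The function name build_report_qname is hypothetical. The sketch works on presentation-format names and does not check DNS length limits.

```python
def build_report_qname(qtype, qname, info_code, agent_domain):
    """Concatenate the report QNAME elements, left to right."""
    failing = qname.rstrip(".")
    agent = agent_domain.rstrip(".")

    parts = ["_er", str(qtype)]       # leading "_er" label, then the QTYPE
    if failing:
        parts.append(failing)         # QNAME that resulted in the error
    parts += [str(info_code), "_er"]  # extended DNS error, separator "_er"
    if agent:
        parts.append(agent)           # agent domain from the EDNS0 option
    return ".".join(parts) + "."
```

Calling build_report_qname(1, "broken.test.", 7, "a01.agent-domain.example.") gives the name used in the draft's example, "_er.1.broken.test.7._er.a01.agent-domain.example.".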

Many, many thanks for your thorough review and for helping us to improve the document!

Warmly,

Roy