Re: [DNSOP] DNS Error Reporting

Brian Dickson <brian.peter.dickson@gmail.com> Wed, 17 March 2021 23:49 UTC

References: <130FD763-B510-4034-9057-5BEC4C5B2E83@dnss.ec> <CAH1iCiqv6J6868ecPHQDCjm9yXehmaQjcJ30CdvhNWsjp4mxvw@mail.gmail.com> <c044d8a8-c621-39c8-c908-873dff3740c9@isc.org>
In-Reply-To: <c044d8a8-c621-39c8-c908-873dff3740c9@isc.org>
From: Brian Dickson <brian.peter.dickson@gmail.com>
Date: Wed, 17 Mar 2021 16:49:08 -0700
Message-ID: <CAH1iCirbmkAR+0rw_VK7dYmJspWGGZZ-+Cp0TXC8bQhcv-AtxA@mail.gmail.com>
To: Petr Špaček <pspacek@isc.org>
Cc: Roy Arends <roy@dnss.ec>, dnsop <dnsop@ietf.org>, Matt Larson <matt.larson@icann.org>
Archived-At: <https://mailarchive.ietf.org/arch/msg/dnsop/xKdjw45skstmppKzD5LilWKFhHQ>

On Wed, Mar 17, 2021 at 2:05 PM Petr Špaček <pspacek@isc.org> wrote:

> On 12. 03. 21 4:47, Brian Dickson wrote:
> > On Fri, Oct 30, 2020 at 10:03 AM Roy Arends <roy@dnss.ec
> > <mailto:roy@dnss.ec>> wrote:
> >
> >     Dear DNS Operations folk,
> >
> >     Matt Larson and I wrote up a method that warns a domain owner of an
> >     issue with their configuration. The idea is loosely based on DMARC
> >     (RFC7489), and on Trust Anchor signalling (RFC8145).
> >
> >     The method involves an EDNS0 exchange, containing an “agent” domain,
> >     sent by the authoritative server, that the resolver can send reports
> >     to in case of a failure.
> >
> >     Please see
> >     https://tools.ietf.org/html/draft-arends-dns-error-reporting-00
> >     <https://tools.ietf.org/html/draft-arends-dns-error-reporting-00>
> >
> >
> > I have a few comments (some were made in the jabber room at IETF-110),
> > mostly suggestions.
> >
> > First, what about a "sampling" field, treated as 1/N for a field value
> > of N. Do rand(N), and report only if you get a value of 0.
> > A value of N=1 would always succeed (not sampled), and a value of N=0 is
> > "don't send a report".
>
> No such field is needed, this can be done on auth side already. Auth
> side can already decide with single-answer granularity when to send the
> option, i.e. the sampling can be simply done by not/sending the option.
> (Keep in mind the option is not cached.)
>

Yeah, I was thinking of this only because I missed the bits about TTL and
cached answers preventing re-reporting.
I think the TTL lets the auth manage the report load adequately, so
sampling isn't really needed.
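
For reference, the 1/N rule from the original suggestion can be written
down in a few lines. This is only a sketch of the semantics described
above (N=0 means never report, N=1 means always report); the function
name and the idea of passing in a random source are illustrative
assumptions, not anything from the draft.

```python
import random


def should_report(n, rng=random):
    """Return True with probability 1/n.

    n == 0 means "never report"; n == 1 means "always report".
    Hypothetical helper, not part of the draft.
    """
    if n <= 0:
        return False
    # randrange(n) yields 0..n-1, so a result of 0 occurs once in n draws.
    return rng.randrange(n) == 0
```

The same few lines would work equally well on the auth side, deciding
per answer whether to attach the option at all.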

>
> For example auth can send the option only on 1/100 of DNSKEY queries and
> nothing else (if desired). Or for 1/100 000 of all queries.
>
> Doing so on auth side is much more flexible and does not complicate the
> protocol, so I do not think the complexity is justified.
>
>
> > Second, what about a "fire drill" field, which is applied with a similar
> > 1/N logic, but triggers a report even if no error occurs.
> > This would be useful for testing the reporting functionality without
> > requiring deliberate errors be introduced.
>
> I feel the complexity (also in security analysis ...) is unwarranted.
> What use-case do you envision?
>

Suppose someone sets up reporting, and wants to (a) confirm that reports
get received, and (b) perform a longitudinal check on which clients send
reports.

The use case for (a) is that an auth server operator who wants reports
should know that the full infrastructure for receiving reports works
before depending on it. Similarly, whenever the reporting system is
changed, the operator should know that the change does not break reporting.
The use case for (b) is, the auth server operator is interested in what is
common between clients who do or do not send reports. For example, if no
clients behind a particular ASN are sending reports, there could be a
network issue (e.g. ACL, firewall, load balancer, NAT, whatever) that
affects all clients behind that ASN.

Both of these use cases require the client (reporter) send a report
regardless of error conditions (but modulo sampling rates).

The two options for this are, (i) break something being requested, for all
clients, to generate reports (not a good option), or (ii) use the "fire
drill" to trigger reports even if there are no errors.

I'm not sure I have done a good job of explaining this. Basically, a fire
drill is performed to simulate a failure, so the alerting/reporting can be
tested. It's similar to the "lamp test" button (which allows you to detect
burned-out indicators on analog display systems like in airplanes or
nuclear reactors.)

I.e. if you are setting up this reporting, it only makes sense to confirm
that it works, and to validate that, you need to know it works from
everywhere, not just from a few randomly chosen places.

The 1/N thing is moot with the TTL cache too, so just having a flag is
sufficient.
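
To make the "just a flag" idea concrete, here is a rough sketch of the
reporter-side decision, assuming the fire-drill flag arrives alongside
the agent domain and that the TTL of the answer to a report query
suppresses repeats until it expires. The class name, the method names
and the use of the report query name as the cache key are all
illustrative assumptions; the draft does not define a fire-drill flag.

```python
import time


class ReportSuppressor:
    """Decide whether a resolver should send a report (hypothetical sketch)."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        # report query name -> time at which the cached answer expires
        self._expires = {}

    def should_send(self, report_name, error_seen, fire_drill):
        # Report on a real error, or unconditionally when the
        # (hypothetical) fire-drill flag is set.
        if not (error_seen or fire_drill):
            return False
        # A cached answer to an earlier report suppresses re-reporting.
        expiry = self._expires.get(report_name)
        return expiry is None or self._clock() >= expiry

    def record_answer(self, report_name, ttl):
        # Called when the answer to a report query comes back; its TTL
        # sets how long further reports for this name are held back.
        self._expires[report_name] = self._clock() + ttl
```

With this shape, the TTL chosen by the receiving side bounds the report
rate per resolver, which is why no separate 1/N value is needed for the
fire drill.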

>
>
> > Third, what about actually sending a response to the "report" query?
> > If noerror, nodata, the reporting agent does nothing.
> > However, if some particular response is received, use that in processing
> > future errors.
> > The idea is to have some value returned, with a TTL used for how long to
> > ignore errors, or that specific error. (Maybe use logic similar to the
> > class and type fields from UPDATE?)
> > That way, if the authoritative server has a problem that can't or won't
> > be fixed for some duration, it can suppress reports until that error is
> > fixed.
>
> That's already part of the draft. The signaling query is normal (albeit
> asynchronous to the triggering query) so normal TTL rules for answers
> apply. The receiving side is expected to send back normal DNS answer and
> can freely use whatever SOA/TTL/whatever it desires.
>

Ah, thanks, I missed that, so, never mind. :-)

>
> > Finally, what about an optional field for resolver operator contact info
> > (e.g. vCard or similar), so the authority operator can follow up with a
> > human if appropriate?
>
> Interesting idea, but it leads to packet bloat caused by data which are
> unnecessary the vast majority of the time.
>
> Are we (as dnsop WG) not concerned with packet bloat anymore?
>

This would add data on the DNS query used for sending the report. DNS
queries are generally very limited in size, typically less than 100 octets
long.
Adding something like "TYPE|LENGTH|mailto:dns-admin@example.com" on small
query packets for reports is not likely to cause problems for anyone,
anywhere.

So, maybe no real concern if the length is limited to some sensible value?
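
As a back-of-the-envelope check on the size, the "TYPE|LENGTH|value"
layout can be sketched using the usual EDNS0 option framing (16-bit
option code, 16-bit length, then the data). The option code below is
taken from the local/experimental range and is purely a placeholder,
and the 64-octet cap is just an assumed example of "some sensible
value"; neither comes from the draft.

```python
import struct

# Placeholder code from the EDNS0 local/experimental range; not assigned.
HYPOTHETICAL_CONTACT_OPTION = 65001
# Assumed cap on the contact string, in octets.
MAX_CONTACT_LEN = 64


def encode_contact_option(contact, max_len=MAX_CONTACT_LEN):
    """Encode a contact URI as option-code | option-length | data."""
    data = contact.encode("ascii")
    if len(data) > max_len:
        raise ValueError("contact info exceeds the assumed length cap")
    # "!HH" packs two unsigned 16-bit integers in network byte order.
    return struct.pack("!HH", HYPOTHETICAL_CONTACT_OPTION, len(data)) + data


option = encode_contact_option("mailto:dns-admin@example.com")
print(len(option))  # 4 octets of framing + 28 octets of data = 32
```

So the example address costs 32 octets on top of the report query,
and a cap in that neighborhood keeps the worst case small.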

Brian
