Re: [DNSOP] Working Group Last call for draft-ietf-dnsop-dns-error-reporting

Viktor Dukhovni <ietf-dane@dukhovni.org> Mon, 10 July 2023 22:35 UTC

Date: Mon, 10 Jul 2023 18:35:54 -0400
From: Viktor Dukhovni <ietf-dane@dukhovni.org>
To: dnsop@ietf.org
Cc: puneets@google.com
Message-ID: <ZKyHyo4Mb8I34rZI@straasha.imrryr.org>
Reply-To: dnsop@ietf.org
References: <ZJn_cwWWOKIn1wbq@straasha.imrryr.org> <76E9FBC8-9F6D-4050-9C6F-E92A2CBEB326@dnss.ec> <ZKw40DEHBUfBEoUI@straasha.imrryr.org> <1583409F-8F04-4172-B9A1-94D9900402AB@dnss.ec>
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"
Content-Disposition: inline
Content-Transfer-Encoding: 8bit
In-Reply-To: <1583409F-8F04-4172-B9A1-94D9900402AB@dnss.ec>
Archived-At: <https://mailarchive.ietf.org/arch/msg/dnsop/ZsdpJdBqSuNNFYwu3R4mivuW5WQ>
Subject: Re: [DNSOP] Working Group Last call for draft-ietf-dnsop-dns-error-reporting

On Mon, Jul 10, 2023 at 10:27:45PM +0100, Roy Arends wrote:

> > Right, but surely the monitoring agent can decide whether to solicit
> > such a prefix label or not.  That is whether an "_er" prefix label is
> > signalled is a *local matter* between the authoritative server
> > signalling the option and the monitoring agent.
> 
> I agree that a monitoring agent can specify a domain that may include
> a separator as the least significant label. However, it also requires
> the monitoring agent to understand that it should sometimes include
> this separator, and that it may be redundant at other times.

If all the monitoring agent's "customers" (authoritative servers that
return its "suffix" in the new option) are informed to signal an
"_er.agent.example" name, there's no "sometimes".  The agent, by mutual
agreement with the nameservers it supports, can choose whatever suffix
format meets its needs, fixed across all customers, or customer-specific.

I haven't yet seen a reason to insist on a fixed suffix pattern.  The
resolver just stutters back the suffix it was handed by the
authoritative server's extension payload.  What problem does mandating
the least significant label of the suffix solve, that can't be solved by
just signalling the desired suffix, special label and all?

> It assumes that those running the authoritative server that returns
> the agent domain and those that run the reporting agent are in sync.
> Those are a lot of assumptions.

If they're not in sync, surely reporting will be broken, whether or not
an "_er" suffix label is used.

> >  Why should resolvers have to be responsible for this?
> 
> Because this separating label is trivial to include and avoids a lot of hassle.

The hassle in question remains unclear.  I see two relevant/likely
deployment models:

    * Self-hosted reporting, directly by the authoritative server:

        - Error reports are special by virtue of a dedicated qname
          suffix and perhaps qtype.

        - No special coördination required, the server both publishes
          and consumes the error reporting suffix.

    * Outsourced/centralised reporting, via server IPs dedicated to
      error report processing.

        - Here again no need for "_er", because all queries are
          presumptively error reports, and if the signal from the
          "customer" auth server was wrong (whether or not an "_er"
          label is included) the error report will not be handled
          correctly.

        - If the signal has the correct (mutually agreed) suffix,
          again no problem.

        - And of course the monitoring agent can specify the use
          of "_er" (or whatever) if that's convenient.

What use-case actually benefits from the "_er" LSL (least-significant
label) in the signal?  How is this benefit not obtained by mutual
agreement between the monitoring agent and its customers?

> >> The sole purpose of the leading “least-significant” “_er” is to
> >> distinguish between qname-minimized queries (for lack of a better
> >> term) and “full” queries. I understand that you argue that a
> >> monitoring agent can determine this without the _er labels (as
> >> described below), but that seems suboptimal to me.
> > 
> > The qname minimised query (whether or not a dedicated qtype is used)
> > will be for "A" or "NS" records, not TXT or the dedicated qtype, so
> > there's no need for "_er" in the first label, the qtype is sufficient.
> 
> RFC9156 contains no hard requirement to use A/NS. So I’m not confident
> that all current and future qname-minimisation implementations use
> A/NS. 

This is where this document can specify that qname minimised error
reports MUST use a qtype other than the qtype for the final error
report.
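
A minimal sketch of how a monitoring agent could then tell the two
apart by qtype alone.  The qtype value is an assumption: no dedicated
error-report qtype has been assigned, so a private-use value stands in,
and the function name is hypothetical.

```python
# Hypothetical stand-in for a dedicated error-report qtype.  65280 is in
# the private-use RR type range; no real assignment is implied.
ER_QTYPE = 65280

QTYPE_A = 1
QTYPE_NS = 2


def classify_query(qtype: int) -> str:
    """Classify an incoming query at the monitoring agent by qtype only.

    Assumes the rule proposed above: minimised queries never use the
    qtype reserved for the final error report.
    """
    if qtype == ER_QTYPE:
        return "report"      # complete error report
    return "minimised"       # qname-minimisation probe (A, NS, ...)
```

With that rule in place the agent needs no leading "_er" label to make
the distinction.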

> > However, to avoid forwarding junk reports to the monitoring agent, a
> > resolver may well sensibly choose to not forward such queries, and
> > only source them internally.
> 
> I’m not following.

If the qtype is "TXT", then an open resolver can easily be made to
proxy forged error reports that claim errors the resolver did not
observe.  Some client of the open resolver sends an explicit query
for:

    <error-reporting-qname>. IN TXT ?

which then looks like an error report from *that* resolver to the
monitoring agent.  If instead we have a dedicated qtype for error
reports, it becomes a simple matter of refusing to iterate queries for 

    <whatever>. IN <ERTYPE> ?

Any resolver wanting to report an error must do so directly, not via a
forwarder.  Especially because the forwarders won't be passing the
agent extension through to their clients!
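
A rough sketch of that resolver-side policy, under stated assumptions:
the qtype value is a private-use placeholder (nothing has been
assigned), and handle_client_query and the recurse callback are
hypothetical names, not taken from any resolver implementation.

```python
ER_QTYPE = 65280       # hypothetical placeholder, private-use range
RCODE_NOERROR = 0
RCODE_REFUSED = 5


def handle_client_query(qname, qtype, recurse):
    """Answer a client query, refusing to iterate error-report qtypes.

    `recurse` is a caller-supplied function performing normal
    resolution; it is never invoked for the error-report qtype, so
    clients cannot use this resolver to relay forged reports.
    """
    if qtype == ER_QTYPE:
        return RCODE_REFUSED, None
    return RCODE_NOERROR, recurse(qname, qtype)
```

With TXT as the report qtype no such check is possible, since ordinary
TXT lookups must still be resolved.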

> > The specification might also recommend that "stub" resolvers that
> > forward most queries to a "full service" resolver, should send error
> > reports *directly* to the monitoring agent.  And, of course, "full
> > service" resolvers MUST NOT *forward* the monitoring agent OPTION to
> > clients, if they send such an option, it should be locally generated
> > to signal the monitoring agent for the resolver itself.
> 
> I’m not following. 

In a forwarder chain:

    stub resolver  <->  full-service resolver  <->  auth server

When the stub resolver wants to report an error, it must contact the
monitoring agent directly, rather than pass it to the full-service
resolver.  Any agent suffix it receives from the full-service resolver
will be the monitoring agent for **that** resolver, not the auth server,
and the reports need to go directly to the endpoint specified by the
authoritative server!

[ Admittedly, in practice stub resolvers are not likely to make
  error reports, and forwarders are unlikely to solicit them. ]


> >> Allocating a new QTYPE for this purpose just seems redundant. 
> > 
> > It is not.  This is not a normal query, it is an error report.
> 
> However, it is a normal query though. All the intermediates
> (forwarders, caches, authoritative servers) have no idea that this
> query is any different than others. There is nothing special in this
> query. I really want to avoid OPCODE subtyping by qtype.

But that's a problem, because forwarding of error reports masks the
origin IP, with problem reports then misattributed to the edge resolver,
which may have had no problems resolving the reported name, and which
may be misused by its clients to forge such reports.

> > I would strongly prefer a dedicated qtype (with support from Puneet
> > Sood).  However, if the WG consensus is TXT, we'll grudgingly cope.
> > Would it make sense to raise this narrow question by the chairs as a
> > consensus call?
> 
> To me, a dedicated qtype vs TXT seems like bike-shedding. 

I disagree.  We're not disagreeing on cosmetic details of the name of a
new qtype, rather we're disagreeing on whether to overload TXT, which
is a substantive difference.

> > I did not see a response to the point about moving the info code to the
> > least-significant label in the query (first or right after the leading
> > "_er" if despite my exhortations that's retained).
> 
> The purpose of keeping the info code right before the separating _er
> label is that it helps to separate incoming reports by “severeness”,
> as in “lame delegation” reports go here,  “expired RRSIG” reports go
> there. This can all be delegated nicely by the monitoring agent.

Though lexically last, THIS is the point I want to most strongly
emphasise.  Putting the info code in the MSL (most significant label) of
the error qname prefixed to the agent suffix breaks NXDOMAIN caching,
because we now have 65536 parent info codes for each domain that the
agent does not serve:

    *.ru.0._er.agent.example. ; signal == _er.agent.example.
    *.ru.1._er.agent.example. ; signal == _er.agent.example.
    *.ru.2._er.agent.example. ; signal == _er.agent.example.
    ....
    *.ru.65535._er.agent.example. ; signal == _er.agent.example.

Whereas, instead and with no loss of ability to group errors by severity
(indeed the LSL is parsed first!) the agent could return NXDOMAIN for:

    *.ru._er.agent.example. ; signal == _er.agent.example.

and be rid of all "*.ru" reports.
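
To make the count concrete, here is a small sketch that enumerates the
subtrees an agent would have to answer NXDOMAIN for under each layout.
It is only an illustration: the function names are hypothetical, and the
qtype label and any leading "_er" are left out for brevity.

```python
SUFFIX = "_er.agent.example."   # the signalled agent suffix from above


def subtree_code_next_to_suffix(tld: str, code: int) -> str:
    """Subtree covering `tld` reports when the info code sits right
    before the agent suffix: one subtree per info code."""
    return f"{tld}.{code}.{SUFFIX}"


def subtree_code_first(tld: str, code: int) -> str:
    """Subtree covering `tld` reports when the info code is the
    least-significant label: the code is not part of the subtree name."""
    return f"{tld}.{SUFFIX}"


codes = range(65536)            # 16-bit info code space
per_code = {subtree_code_next_to_suffix("ru", c) for c in codes}
shared = {subtree_code_first("ru", c) for c in codes}
print(len(per_code), len(shared))   # prints "65536 1"
```

Each element of per_code is a separate negative-cache entry at the
resolver, while the second layout needs just one.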

> >> Viktor, your optimisations (removing the _er labels) are premature as
> >> it turns a deterministic process at the monitoring agent into a
> >> heuristic process. 
> > 
> > I don't see how it becomes heuristic.  The dedicated qtype signals a
> > complete error reporting query, other qtypes are minimised variants.

There's no heuristic.  The agent knows what suffix(es) it serves, and
strips that suffix to recover the error report.
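
As a sketch of how deterministic that is, assuming the agent keeps a
plain list of the suffixes it serves (strip_agent_suffix is a
hypothetical helper, and the naive split on "." ignores escaped dots in
labels):

```python
from typing import List, Optional


def strip_agent_suffix(qname: str, suffixes: List[str]) -> Optional[List[str]]:
    """Return the report labels left of a served suffix, or None.

    Comparison is case-insensitive, and longer suffixes are tried first
    so that a customer-specific suffix wins over a shared one.
    """
    labels = qname.rstrip(".").lower().split(".")
    by_length = sorted(suffixes, key=lambda s: s.count("."), reverse=True)
    for suffix in by_length:
        tail = suffix.rstrip(".").lower().split(".")
        if len(labels) > len(tail) and labels[-len(tail):] == tail:
            return labels[:-len(tail)]
    return None
```

For example, stripping "_er.agent.example." from
"0.foo.ru._er.agent.example." leaves the labels "0", "foo" and "ru",
and a name outside every served suffix yields None.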

> Again, there is no guarantee that a minimised variant does not use the
> dedicated qtype. It is simply easier to recognise a minimised variant
> by checking if the QNAME starts with _er. This is far more reliable
> than assuming a dedicated QTYPE is not minimised.

Though I think the leading "_er" is redundant, it is mostly harmless.
I'd prefer to see it go, but will grudgingly accept it staying.

The main thing is to move the info code to the LSL (least significant
label), modulo any final (redundant) "_er" prefix (the complete query
should be distinguished by its qtype).

Also, resolvers SHOULD NOT do query minimisation below the signalled
error reporting suffix in the first place.  Save everyone needless
latency and potential ENT issues.  Let's specify that too.
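
A simplified sketch of what that would mean for the resolver: minimise
label by label only down to the signalled suffix, then send the full
report qname in one step.  The function name is hypothetical, and real
resolvers minimise along zone cuts and cached delegations rather than
one query per label.

```python
from typing import List


def names_to_query(qname: str, suffix: str) -> List[str]:
    """List the names queried for an error report, in order, with no
    minimisation below the signalled agent suffix."""
    q = qname.rstrip(".").split(".")
    s = suffix.rstrip(".").split(".")
    if [l.lower() for l in q[-len(s):]] != [l.lower() for l in s]:
        raise ValueError("qname is not below the signalled suffix")
    # Minimised walk from the TLD down to the suffix itself.
    steps = [".".join(q[-n:]) + "." for n in range(1, len(s) + 1)]
    # Everything below the suffix goes out as a single full query.
    if len(q) > len(s):
        steps.append(".".join(q) + ".")
    return steps
```

For "0.foo.ru._er.agent.example." with suffix "_er.agent.example." this
gives "example.", "agent.example.", "_er.agent.example." and then the
full name, skipping the intermediate "ru..." and "foo.ru..." probes.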

-- 
    Viktor.