[DNSOP] Quick review of draft-dwmtwc-dnsop-caching-resolution-failures-00

Mukund Sivaraman <muks@mukund.org> Tue, 12 July 2022 15:24 UTC

Archived-At: <https://mailarchive.ietf.org/arch/msg/dnsop/rzxIZw7th1zRZ01FLsHGZa_GIVw>

Some comments from a quick read of this draft, as we're handling a quirky
issue around NS timeouts and it looked relevant.

Firstly, some resolver implementations do cache upstream NS timeouts in
various non-standard ways. The resolver I work on has at least three or
four different mechanisms within the same codebase. Documenting how
timeouts should be handled seems worthwhile, so I support this draft.

> Internet Engineering Task Force                               D. Wessels
> Internet-Draft                                                W. Carroll
> Intended status: Standards Track                               M. Thomas
> Expires: 17 July 2022                                           Verisign
>                                                          13 January 2022


>               Negative Caching of DNS Resolution Failures
>            draft-dwmtwc-dnsop-caching-resolution-failures-00

[snip]

>    [RFC4697] is a Best Current Practice that documents observed
>    resolution misbehaviors.  It describes a number of situations that
>    can lead to excessive queries from recusrive resolvers. including:

There's a spelling mistake in "recusrive", and the period after
"resolvers." should be removed.

[snip]

> 3.2.  TTLs

>    Resolvers MUST cache resolution failures for at least 5 seconds.
>    Resolvers SHOULD employ an exponential backoff algorithm to increase
>    the amount of time for subsequent resolution failures.  For example,
>    the initial negative cache TTL is set to 5 seconds.  The TTL is

I am guessing the authors meant to write "timeout cache TTL" here
instead of negative cache TTL. In any case, the phrase "negative cache
TTL" has a well-understood meaning per RFC 2308, and should not be
overloaded/reused to indicate timeout cache TTL.
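
To make the backoff concrete, here is a minimal sketch of how such a
timeout cache TTL could grow. Only the 5 second starting value comes
from the quoted text; the doubling factor, the 300 second cap and the
name failure_cache_ttl are my assumptions for illustration, since the
quote above is cut off before the draft gives its own numbers.

```python
def failure_cache_ttl(consecutive_failures, initial=5, cap=300):
    """Return the cache TTL in seconds for the Nth consecutive failure.

    initial=5 matches the quoted draft text. Doubling on each further
    failure and the 300 second cap are assumptions for this sketch.
    """
    if consecutive_failures < 1:
        raise ValueError("consecutive_failures must be >= 1")
    return min(initial * 2 ** (consecutive_failures - 1), cap)


# TTLs for the first eight consecutive failures of the same lookup.
ttls = [failure_cache_ttl(n) for n in range(1, 9)]
print(ttls)  # [5, 10, 20, 40, 80, 160, 300, 300]
```

Without some cap, a long-lived failure would push the TTL up without
bound, and the resolver would be slow to notice once the upstream NS
recovers.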

[snip]

> 3.3.  Scope

>    Resolution failures MUST be cached against the specific query tuple
>    <query name, type, class, server IP address>.

Have you considered the effect of caching the timeout against just the
upstream server's IP address? I'm not saying you should, but I wonder
whether the other tuple fields are relevant enough to warrant separate,
more specific timeout cache entries.

In other words, is it necessary for there to be a distinction among
timeouts for:

(1) example.org., A, IN, 10.0.0.1

(2) example.org., TYPE65, IN, 10.0.0.1

(3) example.com., A, IN, 10.0.0.1

Traditionally, a resolver's upstream RTTs and timeouts are tracked
against the nameserver IP address. A failure to respond has been
considered a property of the NS (implementation) or of the path to that
NS.
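
As a rough sketch of the difference, the toy cache below can be keyed
either on the full tuple from section 3.3 or on the server IP address
alone. The FailureCache class, the two key functions and the fixed 5
second TTL are hypothetical and only meant to illustrate the question;
they are not taken from the draft or from any resolver.

```python
import time


class FailureCache:
    """Remembers recent timeouts under a key chosen by key_fn."""

    def __init__(self, key_fn, ttl=5.0, clock=time.monotonic):
        self._key_fn = key_fn
        self._ttl = ttl          # assumed fixed TTL, no backoff here
        self._clock = clock
        self._expiry = {}        # key -> time at which the entry expires

    def record_timeout(self, qname, qtype, qclass, server_ip):
        key = self._key_fn(qname, qtype, qclass, server_ip)
        self._expiry[key] = self._clock() + self._ttl

    def is_blocked(self, qname, qtype, qclass, server_ip):
        key = self._key_fn(qname, qtype, qclass, server_ip)
        expiry = self._expiry.get(key)
        if expiry is None:
            return False
        if self._clock() >= expiry:
            del self._expiry[key]    # entry has expired, forget it
            return False
        return True


def per_tuple(qname, qtype, qclass, server_ip):
    # Scope as written in section 3.3 of the draft.
    return (qname.lower(), qtype, qclass, server_ip)


def per_server(qname, qtype, qclass, server_ip):
    # Traditional scope: the nameserver address only.
    return server_ip


tuple_cache = FailureCache(per_tuple)
server_cache = FailureCache(per_server)

# Only the TYPE65 query to 10.0.0.1 times out.
for cache in (tuple_cache, server_cache):
    cache.record_timeout("example.org.", "TYPE65", "IN", "10.0.0.1")

queries = [
    ("example.org.", "A", "IN", "10.0.0.1"),
    ("example.org.", "TYPE65", "IN", "10.0.0.1"),
    ("example.com.", "A", "IN", "10.0.0.1"),
]
tuple_view = [tuple_cache.is_blocked(*q) for q in queries]
server_view = [server_cache.is_blocked(*q) for q in queries]
print(tuple_view)   # [False, True, False]
print(server_view)  # [True, True, True]
```

With the per-tuple key, one timeout suppresses only query (2). With the
per-server key, the same timeout also suppresses (1) and (3) until the
entry expires.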

My colleagues are handling an issue where an authoritative nameserver
does not respond to TYPE65 queries, but does respond to queries for
common query types such as address records. In this case, absent
mitigating controls, the resolver is a little stumped: it keeps
attempting to contact the upstream NS because it receives some responses
from it. The queries that get no response end up waiting for the maximum
timeout limit because the resolver keeps retrying that NS. On a busy
resolver, these queries consume resources.

We could consider the upstream NS as "bad" if it responds to some
queries but sends no response at all to others. But one-off or transient
timeouts can also occur due to network packet loss.

In our case, if the resolver were to block this zone's upstream NSs as
bad, it wouldn't be able to answer any queries within that zone (not
even for address records). It appears to be a popular country-level
zone, and it's unlikely the upstream operators will fix it to respond to
TYPE65 queries in the short term. In such cases, a heavy-handed approach
may not be practical.

		Mukund