Re: [DNSOP] Quick review of draft-dwmtwc-dnsop-caching-resolution-failures-00

"Wessels, Duane" <dwessels@verisign.com> Tue, 12 July 2022 21:41 UTC

From: "Wessels, Duane" <dwessels@verisign.com>
To: Mukund Sivaraman <muks@mukund.org>
CC: "dnsop@ietf.org" <dnsop@ietf.org>
Date: Tue, 12 Jul 2022 21:41:45 +0000
Message-ID: <1A9BE4F1-448C-4302-8E68-C32D1C0DA1AF@verisign.com>
References: <Ys2SFN8QJkrRAyAz@d1>
In-Reply-To: <Ys2SFN8QJkrRAyAz@d1>
Archived-At: <https://mailarchive.ietf.org/arch/msg/dnsop/9fGqJDPXV39Sz9aW1ZWzdHlrvZ4>
Subject: Re: [DNSOP] Quick review of draft-dwmtwc-dnsop-caching-resolution-failures-00

Hi Mukund,

> On Jul 12, 2022, at 8:24 AM, Mukund Sivaraman <muks@mukund.org> wrote:
> 
> Some comments quickly browsing this draft, as we're handling a quirky
> issue around NS timeouts and it looked relevant.
> 
> Firstly, some resolver implementations do cache upstream NS timeouts in
> various non-standard ways. The resolver I work on has at least 3-4
> different mechanisms within the same codebase. Documentation on how
> timeouts should be handled seems good, so I support this draft.
> 
>> Internet Engineering Task Force                               D. Wessels
>> Internet-Draft                                                W. Carroll
>> Intended status: Standards Track                               M. Thomas
>> Expires: 17 July 2022                                           Verisign
>>                                                         13 January 2022
> 
> 
>>              Negative Caching of DNS Resolution Failures
>>           draft-dwmtwc-dnsop-caching-resolution-failures-00
> 
> [snip]
> 
>>   [RFC4697] is a Best Current Practice that documents observed
>>   resolution misbehaviors.  It describes a number of situations that
>>   can lead to excessive queries from recusrive resolvers. including:
> 
> There's a spelling mistake in "recusrive", and the period after
> "resolvers." should be removed.

Thanks, these will be fixed.


> 
> [snip]
> 
>> 3.2.  TTLs
> 
>>   Resolvers MUST cache resolution failures for at least 5 seconds.
>>   Resolvers SHOULD employ an exponential backoff algorithm to increase
>>   the amount of time for subsequent resolution failures.  For example,
>>   the initial negative cache TTL is set to 5 seconds.  The TTL is
> 
> I am guessing the authors meant to write "timeout cache TTL" here
> instead of negative cache TTL. In any case, the phrase "negative cache
> TTL" has a well-understood meaning per RFC 2308, and should not be
> overloaded/reused to indicate timeout cache TTL.

We didn’t really mean “timeout cache TTL” here.  Rather, the intention
is to specify requirements for caching all types of resolution failures.
That means SERVFAIL, REFUSED, timeouts, and the others listed in Section 2.

I think perhaps your point is that RFC 2308 talks about TTLs for negative
*answers* (NXDOMAIN, NODATA), whereas what we are proposing is different:
it often applies in the absence of any answer.

Would this rewrite alleviate your concern?

   For example,
   the initial TTL for negatively caching a resolution failure is set to
   5 seconds.  The TTL is doubled after each retry that results in
   another resolution failure.  Consistent with [RFC2308], resolution
   failures MUST NOT be cached for longer than 5 minutes.
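
To make the arithmetic of the proposed text concrete, here is a minimal
sketch of the doubling schedule. It assumes the first failure gets the
5 second TTL and that the 5 minute limit is applied as a simple cap; the
names (failure_ttl, INITIAL_TTL, MAX_TTL) are illustrative and do not
appear in the draft.

```python
# Sketch only: one possible reading of the proposed backoff text.
INITIAL_TTL = 5    # seconds, the initial TTL in the proposed text
MAX_TTL = 300      # seconds, the 5 minute upper bound in the proposed text


def failure_ttl(consecutive_failures: int) -> int:
    """Return the cache TTL (seconds) for the Nth consecutive failure.

    N is 1-based: the first failure is cached for INITIAL_TTL, and the
    TTL doubles on each further failure until it reaches MAX_TTL.
    """
    if consecutive_failures < 1:
        raise ValueError("consecutive_failures must be >= 1")
    return min(INITIAL_TTL * 2 ** (consecutive_failures - 1), MAX_TTL)


# TTLs for the first eight consecutive failures.
schedule = [failure_ttl(n) for n in range(1, 9)]
print(schedule)
```

Under these assumptions the cap is reached on the seventh consecutive
failure, since the uncapped value there would be 320 seconds.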


> 
> [snip]
> 
>> 3.3.  Scope
> 
>>   Resolution failures MUST be cached against the specific query tuple
>>   <query name, type, class, server IP address>.
> 
> Have you considered the effect of caching the timeout against just an
> upstream server's IP address? I'm not saying you should, but wondering
> if any of the other tuple fields are relevant to have separate
> more-specific timeout cache entries.
> 
> In other words, is it necessary for there to be a distinction among
> timeouts for:
> 
> (1) example.org., A, IN, 10.0.0.1
> 
> (2) example.org., TYPE65, IN, 10.0.0.1
> 
> (3) example.com., A, IN, 10.0.0.1
> 
> Traditionally, a resolver's upstream RTTs and timeouts are tracked
> against the nameserver IP address. A failure to respond has been
> considered as a property of the NS (implementation) or path to that NS.
> 
> My colleagues are handling an issue where an authoritative nameserver
> does not respond to TYPE65 queries, but responds to queries for common
> query types such as address records. In this case, without mitigating
> with controls, the resolver is a little stumped and keeps attempting to
> contact the upstream NS because it receives some responses from it. The
> queries for which there are no responses eventually end up waiting for
> the maximum timeout limit because the resolver keeps trying to talk to
> it. On a busy resolver, these queries consume resources.
> 
> We could consider the upstream NS as "bad" if it appears to respond to
> some queries but doesn't respond to others with some response. But
> one-off or transient timeouts can occur sometimes due to network packet
> loss.
> 
> In our case, if the resolver were to block this zone's upstream NSs as
> bad, it wouldn't be able to respond to any queries within that zone
> (even address records). It appears to be a popular country-level zone,
> and it's unlikely the upstream operators will fix it to respond to
> TYPE65 queries in the short-term. In such cases, a heavy-handed approach
> may not be practical.

We have not really considered a recommendation to cache against a name
server’s IP address only.  The idea to cache against the 4-tuple comes
from RFC 2308 (Sections 7.1 and 7.2).
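
As a rough illustration of what caching against the 4-tuple means in
practice, the sketch below keys failure entries on <query name, type,
class, server IP address> and doubles the TTL on repeated failures. The
class and method names (FailureKey, FailureCache, record_failure,
is_suppressed) are hypothetical, and the time source is passed in
explicitly to keep the example deterministic; it is not how any
particular resolver implements this.

```python
from dataclasses import dataclass
from typing import Dict, NamedTuple


class FailureKey(NamedTuple):
    """The 4-tuple from Section 3.3 of the draft (names are illustrative)."""
    qname: str
    qtype: str
    qclass: str
    server_ip: str


@dataclass
class FailureEntry:
    ttl: int            # TTL (seconds) used for the most recent failure
    expires_at: float   # time at which the entry stops suppressing queries


class FailureCache:
    INITIAL_TTL = 5     # seconds, per the draft text
    MAX_TTL = 300       # 5 minute cap, per the draft text

    def __init__(self) -> None:
        self._entries: Dict[FailureKey, FailureEntry] = {}

    def record_failure(self, key: FailureKey, now: float) -> int:
        """Cache a failure for this tuple, doubling the previous TTL if any."""
        prev = self._entries.get(key)
        ttl = self.INITIAL_TTL if prev is None else min(prev.ttl * 2, self.MAX_TTL)
        self._entries[key] = FailureEntry(ttl=ttl, expires_at=now + ttl)
        return ttl

    def record_success(self, key: FailureKey) -> None:
        """Assumption: a successful response clears the backoff state."""
        self._entries.pop(key, None)

    def is_suppressed(self, key: FailureKey, now: float) -> bool:
        """True while a cached failure for exactly this tuple is unexpired."""
        entry = self._entries.get(key)
        return entry is not None and now < entry.expires_at


# The first two tuples from the quoted message, against the same server.
cache = FailureCache()
a_query = FailureKey("example.org.", "A", "IN", "10.0.0.1")
type65_query = FailureKey("example.org.", "TYPE65", "IN", "10.0.0.1")

cache.record_failure(type65_query, now=0.0)
print(cache.is_suppressed(type65_query, now=1.0))  # TYPE65 is held back
print(cache.is_suppressed(a_query, now=1.0))       # A queries are unaffected
```

With this keying, a server that ignores TYPE65 queries but answers
address queries is only held back for the TYPE65 tuple, which is the
distinction among cases (1) through (3) above.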

We feel that improved caching based on the 4-tuple would be a big win.
It sounds like perhaps you are suggesting a more aggressive approach might
encourage authoritative operators to fix their systems?  


> 
> 		Mukund
> 
> 

DW