[DNSOP] How Slack didn't turn on DNSSEC

John Levine <johnl@taugh.com> Tue, 30 November 2021 18:38 UTC

Archived-At: <https://mailarchive.ietf.org/arch/msg/dnsop/RB4heseYoNGVI5UoX0rz5OIK07s>

This blog post has been making the rounds. Since it is about a
sequence of DNS operational failures, it seems somewhat relevant here.

https://slack.engineering/what-happened-during-slacks-dnssec-rollout/

tl;dr first try was rolled back due to what turned out to be an unrelated failure at some ISP

second try was rolled back when they found they had a CNAME at a zone
apex, which they had never noticed until it caused DNSSEC validation
errors.
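
A routine pre-signing check for this could be sketched as below. This is
an illustration only: the function name, the (name, type) record format
and example.com are all assumptions, not anything from Slack's tooling.
It relies on the rule that a CNAME may share its owner name only with
DNSSEC's own RRSIG and NSEC records, so an apex CNAME always collides
with the SOA and NS that every apex needs.

```python
from typing import Iterable, List, Tuple

# Types allowed to share an owner name with a CNAME in a signed zone.
ALLOWED_WITH_CNAME = {"CNAME", "RRSIG", "NSEC"}


def apex_cname_conflicts(apex: str, records: Iterable[Tuple[str, str]]) -> List[str]:
    """Return the record types that conflict with a CNAME at the zone apex.

    `records` is a hypothetical flat list of (owner name, type) pairs.
    An empty list means there is no CNAME at the apex.
    """
    norm = apex.rstrip(".").lower()
    apex_types = {
        rtype.upper()
        for name, rtype in records
        if name.rstrip(".").lower() == norm
    }
    if "CNAME" not in apex_types:
        return []
    return sorted(apex_types - ALLOWED_WITH_CNAME)


zone = [
    ("example.com.", "SOA"),
    ("example.com.", "NS"),
    ("example.com.", "CNAME"),
    ("www.example.com.", "A"),
]
print(apex_cname_conflicts("example.com", zone))
```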

third try was rolled back when they found random-looking failures that
they eventually tracked down to bugs in Amazon's Route 53 DNS server.
They had a wildcard with A but not AAAA records. When someone did an
AAAA query, the response was wrong and said there were no records at
all, not just no AAAA records. This caused failures for 8.8.8.8 clients
since Google does aggressive NSEC, but not for 1.1.1.1 clients because
Cloudflare doesn't.
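
To see why a wrong "no records at all" answer hurts more with aggressive
NSEC, here is a deliberately simplified model of a resolver that reuses a
cached NSEC type list to answer later queries itself. The class, its
methods and app.example.com are made up for illustration; a real
resolver also checks signatures, name ranges, wildcards and TTLs, none
of which are modeled here.

```python
from typing import Dict, FrozenSet, Iterable, Optional


class NsecCache:
    """Toy cache of NSEC type bitmaps, keyed by owner name (hypothetical)."""

    def __init__(self) -> None:
        self._bitmaps: Dict[str, FrozenSet[str]] = {}

    def store(self, owner: str, types: Iterable[str]) -> None:
        # Remember which types the NSEC record says exist at this name.
        self._bitmaps[owner.lower()] = frozenset(t.upper() for t in types)

    def synthesize_nodata(self, owner: str, qtype: str) -> Optional[bool]:
        """True: answer 'no data' from cache. False: type exists, go ask.
        None: nothing cached for this name, go ask."""
        bitmap = self._bitmaps.get(owner.lower())
        if bitmap is None:
            return None
        return qtype.upper() not in bitmap


# Correct negative answer to an AAAA query: the name has A but no AAAA.
good = NsecCache()
good.store("app.example.com", ["A", "RRSIG", "NSEC"])

# Wrong negative answer: the type list omits A, as if nothing were there.
bad = NsecCache()
bad.store("app.example.com", ["RRSIG", "NSEC"])

print(good.synthesize_nodata("app.example.com", "A"))  # A is still looked up
print(bad.synthesize_nodata("app.example.com", "A"))   # A is wrongly denied
```

In this model one bad AAAA response is enough to make later A queries
fail from cache, which matches the random-looking failures described
above. A resolver that does not reuse NSEC this way simply asks again
for A and gets the right answer.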

They also got some bad advice. Yes, the .COM zone adds and deletes
records very quickly, but that doesn't mean you can unpublish a DS and
just turn off DNSSEC: the DS has a TTL of a day, so resolvers keep
using their cached copy long after it is gone from the parent. Their
tooling somehow didn't let them republish the DNSKEY at the zone apex
that matched the DS, only a new one that didn't.
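
The timing rule can be written down as simple arithmetic. This is a
sketch under assumptions: the function name, the one-hour safety margin
and the example date are invented, and the only input taken from above
is the one-day DS TTL. The zone has to stay validly signed until every
cached copy of the DS has expired.

```python
from datetime import datetime, timedelta, timezone


def earliest_safe_unsign(
    ds_removed_at: datetime,
    ds_ttl_seconds: int = 86400,               # one day, per the DS TTL above
    margin: timedelta = timedelta(hours=1),    # assumed safety margin
) -> datetime:
    """Earliest time the zone can stop serving valid signatures.

    Until then, a validating resolver may still hold the old DS and will
    treat unsigned or mismatched answers as bogus.
    """
    return ds_removed_at + timedelta(seconds=ds_ttl_seconds) + margin


# Hypothetical removal time, not taken from the incident.
removed = datetime(2021, 1, 1, 12, 0, tzinfo=timezone.utc)
print(earliest_safe_unsign(removed).isoformat())
```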

It is clear from the blog post that this is a fairly sophisticated
group of ops people, who had a reasonable test plan, a bunch of test
points set up in dnsviz and so forth.  Neither of these bugs seems
very exotic, and both could have been caught by routine tests.

Can or should we offer advice on how to do this better, sort of like
RFC 8901 but one level of DNS expertise down?

R's,
John