[Sidrops] RPKI Outage Post-Mortem

Nathalie Trenaman <nathalie@ripe.net> Fri, 08 January 2021 13:56 UTC

Archived-At: <https://mailarchive.ietf.org/arch/msg/sidrops/mlFkEcI0DCLv0ZXLY3uZmM1x2do>

Summary: 
Yesterday, on 7 January 2021, an issue with our RPKI software caused an inconsistent certificate to be published from 15:29 to 16:20 (UTC+1). This may have resulted in outages. We strongly recommend that network operators update their Relying Party software to the latest version.

Details: 
At 15:06 (UTC+1) yesterday, we processed an outgoing transfer of IP resources to another RIR service region. This caused our system to update the corresponding RPKI certificates in our Certificate Authority (CA). 

Unfortunately, our RPKI software published the updated parent certificate (production CA) ahead of its child certificate (member CA). As a result, in the period immediately after the updated parent was published, the child certificate, which was updated later, still contained resources that were no longer on the parent. The child certificate was therefore over-claiming. This was resolved once the child certificate was updated.
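
To make "over-claiming" concrete, here is a minimal Python sketch of the containment check involved. It is an illustration only, not our production code: the function name `overclaimed` and the documentation prefixes are assumptions, and it only tests whether each child prefix fits inside a single parent prefix (it ignores AS numbers and prefixes covered by a union of several parent prefixes).

```python
import ipaddress


def overclaimed(parent_prefixes, child_prefixes):
    """Return the child prefixes that no single parent prefix covers."""
    parents = [ipaddress.ip_network(p) for p in parent_prefixes]
    missing = []
    for text in child_prefixes:
        child = ipaddress.ip_network(text)
        # Compare only within the same address family; subnet_of()
        # raises TypeError when IPv4 and IPv6 networks are mixed.
        covered = any(
            child.version == p.version and child.subnet_of(p)
            for p in parents
        )
        if not covered:
            missing.append(text)
    return missing


# Hypothetical resource sets using documentation prefixes.
parent_before = ["192.0.2.0/24", "198.51.100.0/24"]
parent_after = ["192.0.2.0/24"]          # 198.51.100.0/24 transferred out
child = ["192.0.2.0/25", "198.51.100.0/24"]

print(overclaimed(parent_before, child))  # nothing over-claimed
print(overclaimed(parent_after, child))   # the transferred prefix is over-claimed
```

With the reduced parent, the stale child certificate still lists the transferred prefix, which is the inconsistency described above.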

Currently we have three separate processes:
* One that updates the resources in the registry in RPKI (every 15 min)
* One that updates the resources of the RIPE production CA (parent of all member CAs) from the registry (every hour, takes ~5 min)
* One that updates the resources for member CAs from the registry (every hour, takes ~40 min)

If there is an outgoing transfer and the member CA update runs before the production CA update, over-claiming occurs. The member CA update needs to happen before the production CA resources are reduced, or at the same time (i.e. in the same RRDP delta). This does not happen the other way around (and so is not an issue with incoming resources).
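
The ordering requirement can be illustrated with a small Python sketch that replays publication events and reports the intervals in which the child holds resources the parent no longer has. This is a simplified model under stated assumptions: resources are plain labels in sets, `overclaim_windows` is a hypothetical helper, events sharing the same minute stand in for a single RRDP delta, and the minute offsets are illustrative (51 minutes matches the length of the 15:29 to 16:20 window).

```python
def overclaim_windows(events, parent, child):
    """Replay (minute, 'parent' | 'child', resources) events, sorted by minute.

    Events with the same minute are applied together, like one RRDP delta.
    Returns (start, end) minute pairs during which child is not a subset
    of parent; end is None if the inconsistency is never resolved.
    """
    parent, child = set(parent), set(child)
    windows, start, i = [], None, 0
    while i < len(events):
        minute = events[i][0]
        # Apply every event published at this minute as one atomic step.
        while i < len(events) and events[i][0] == minute:
            _, who, resources = events[i]
            if who == "parent":
                parent = set(resources)
            else:
                child = set(resources)
            i += 1
        bad = not child <= parent
        if bad and start is None:
            start = minute
        elif not bad and start is not None:
            windows.append((start, minute))
            start = None
    if start is not None:
        windows.append((start, None))
    return windows


initial = {"A", "B"}  # resource "B" is being transferred out

# Parent reduced first, child updated 51 minutes later.
print(overclaim_windows([(0, "parent", {"A"}), (51, "child", {"A"})], initial, initial))
# Both updated in the same delta.
print(overclaim_windows([(0, "parent", {"A"}), (0, "child", {"A"})], initial, initial))
# Child reduced before the parent.
print(overclaim_windows([(0, "child", {"A"}), (5, "parent", {"A"})], initial, initial))
```

Only the first ordering produces an over-claiming window; publishing both changes together, or the child first, leaves no inconsistent interval.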

Some older Relying Parties had applied a strict manifest handling interpretation in their validator software. This meant that they were configured to reject all certificates in the manifest if a single entry was invalid. As a consequence, all RPKI certificates covering RIPE resources were rejected by these validators during this period.
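
The difference between strict and lenient handling can be sketched in Python as follows. This is a deliberately simplified model, not the logic of any particular validator: `accepted_objects`, the file names, and the `is_valid` callback are all hypothetical, and real Relying Party software performs many more checks.

```python
def accepted_objects(manifest_entries, is_valid, strict):
    """Return the manifest entries a validator would accept.

    strict=True  : one invalid entry causes every entry to be rejected.
    strict=False : only the invalid entries are dropped.
    """
    valid = [name for name in manifest_entries if is_valid(name)]
    if strict and len(valid) != len(manifest_entries):
        return []
    return valid


# Hypothetical manifest with one over-claiming member certificate.
entries = ["member-a.cer", "member-b.cer", "member-c.cer"]
invalid = {"member-b.cer"}


def check(name):
    return name not in invalid


print(accepted_objects(entries, check, strict=True))   # everything rejected
print(accepted_objects(entries, check, strict=False))  # only member-b.cer dropped
```

Under the strict interpretation, a single over-claiming member certificate is enough to invalidate every sibling listed on the same manifest.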

Based on our access logs, we estimate that 327 instances of Relying Party software were impacted.

On Monday 11 January, we will implement a fix so that every time a RIPE NCC certificate changes, we will look at all members to see if their certificates are over-claiming and force an immediate re-issue if so. This approach does not give us a 100% bullet-proof fix to the problem, but it reduces the period of over-claiming from an hour to a couple of minutes. 
We will work on reducing this time to less than a minute, to further reduce the potential for inconsistency. In the longer term, we will work on implementing atomic publishing of data for this type of situation. 
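
As a rough illustration of the interim fix, the Python sketch below scans all members whenever the parent's resources change and triggers a re-issue for any member that is over-claiming. It is a sketch under assumptions rather than the actual implementation: `on_parent_changed`, `is_overclaiming`, the member handles, and the `reissue` callback are hypothetical names, and only IP prefixes are considered.

```python
import ipaddress


def is_overclaiming(parent_prefixes, member_prefixes):
    """True if any member prefix is not inside a single parent prefix."""
    parents = [ipaddress.ip_network(p) for p in parent_prefixes]
    for text in member_prefixes:
        net = ipaddress.ip_network(text)
        if not any(net.version == p.version and net.subnet_of(p) for p in parents):
            return True
    return False


def on_parent_changed(parent_prefixes, members, reissue):
    """Call reissue(handle) for each over-claiming member; return the handles.

    members maps a member handle to the prefixes on its current certificate.
    """
    reissued = []
    for handle, prefixes in members.items():
        if is_overclaiming(parent_prefixes, prefixes):
            reissue(handle)  # hypothetical hook that forces an immediate re-issue
            reissued.append(handle)
    return reissued


# Hypothetical state right after an outgoing transfer of 198.51.100.0/24.
new_parent = ["192.0.2.0/24"]
members = {
    "member-a": ["192.0.2.0/25"],
    "member-b": ["198.51.100.0/24"],
}
print(on_parent_changed(new_parent, members, lambda h: print("re-issue", h)))
```

Because the scan runs only after the parent has already changed, a short over-claiming interval remains, which is why atomic publishing is the longer-term goal.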

In the meantime, we strongly recommend that network operators update their RPKI Relying Party software to the latest version: 
* Routinator 0.8.2 
* rpki-client 6.8p1 
* FORT 1.4.2 
* octorpki 1.2.2 
* RIPE NCC 3.2-2020.12.10.13.57

Best regards,

Nathalie Trenaman
Routing Security Programme Manager
RIPE NCC