Re: [Sidrops] RPKI Outage Post-Mortem

Russ Housley <> Fri, 08 January 2021 15:33 UTC

From: Russ Housley <>
Message-Id: <>
Content-Type: multipart/alternative; boundary="Apple-Mail=_C58BA7EA-3206-42EF-99E1-B64C0E3A8387"
Mime-Version: 1.0 (Mac OS X Mail 12.4 \(3445.104.17\))
Date: Fri, 8 Jan 2021 10:33:08 -0500
In-Reply-To: <>
Cc: SIDR Operations WG <>
To: Nathalie Trenaman <>
References: <>
X-Mailer: Apple Mail (2.3445.104.17)
Archived-At: <>
Subject: Re: [Sidrops] RPKI Outage Post-Mortem


Thanks for the transparency and the clear description of the events.


> On Jan 8, 2021, at 8:56 AM, Nathalie Trenaman <> wrote:
> Summary: 
> Yesterday, on 7 January 2021, an issue with our RPKI software caused an inconsistent certificate to be published from 15:29-16:20 (UTC+1). This may have resulted in outages. We strongly recommend network operators update their Relying Party software to the latest version.
> Details: 
> At 15:06 (UTC+1) yesterday, we processed an outgoing transfer of IP resources to another RIR service region. This caused our system to update the corresponding RPKI certificates in our Certificate Authority (CA). 
> Unfortunately, our RPKI software published the updated parent certificate (production CA) ahead of its child certificate (member CA). As a result, in the period immediately after the updated parent was published, the child certificate (updated later) still contained resources that were no longer listed on the updated parent, so the child certificate over-claimed. This was resolved once the child certificate was updated.
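
To make the over-claiming condition concrete, here is a minimal sketch of the check, not taken from any RPKI implementation. It assumes certificates carry only IP prefixes (real certificates also carry address ranges and AS numbers), and the function names and the documentation prefixes are hypothetical.

```python
from ipaddress import collapse_addresses, ip_network


def collapse_by_family(prefixes):
    """Merge adjacent prefixes per address family (IPv4 and IPv6 separately)."""
    nets = [ip_network(p) for p in prefixes]
    merged = []
    for version in (4, 6):
        merged.extend(collapse_addresses(n for n in nets if n.version == version))
    return merged


def overclaimed(child_prefixes, parent_prefixes):
    """Return the child prefixes that are not covered by the parent's resources."""
    parents = collapse_by_family(parent_prefixes)
    missing = []
    for prefix in child_prefixes:
        net = ip_network(prefix)
        # subnet_of() raises TypeError across families, so compare versions first.
        covered = any(net.version == p.version and net.subnet_of(p) for p in parents)
        if not covered:
            missing.append(prefix)
    return missing


# Hypothetical data: the parent has already dropped 203.0.113.0/24 after the
# transfer, while the child certificate has not been re-issued yet.
parent_after_transfer = ["192.0.2.0/24", "198.51.100.0/24"]
stale_child = ["192.0.2.0/24", "203.0.113.0/24"]
print(overclaimed(stale_child, parent_after_transfer))
```

An empty result means the child is consistent with its parent; anything else is the window of inconsistency described above.
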
> Currently we have three separate processes:
> * One that updates the resources in the registry in RPKI (every 15min) 
> * One that updates the resources of the RIPE production CA (parent of all member CAs) from the registry (1h, takes ~5 min) 
> * One that updates the resources for member CAs from the registry (1h, takes ~40 min)
> If there is an outgoing transfer and the member CA update runs before the production CA update, the situation with over-claiming occurs. The update of the member CA needs to happen at the same time (i.e. same RRDP delta), or before the production CA resources are reduced. This does not happen the other way around (and so is not an issue with incoming resources). 
> Some older Relying Parties had applied a strict manifest handling interpretation in their validator software. This meant that they were configured to reject all certificates in the manifest if a single entry was invalid. As a consequence, all RPKI certificates covering RIPE resources were rejected by these validators during this period.
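
The difference between strict and lenient manifest handling can be shown with a toy model. This is only a sketch under stated assumptions: a manifest is reduced to a list of (name, is_valid) pairs, and the function and object names are hypothetical rather than the logic of any particular validator.

```python
def accepted_objects(manifest_entries, strict):
    """Return the names of the objects a relying party would accept.

    manifest_entries: list of (name, is_valid) pairs (hypothetical model).
    strict: if True, a single invalid entry causes everything to be rejected.
    """
    has_invalid = any(not ok for _, ok in manifest_entries)
    if strict and has_invalid:
        return []  # strict reading: one bad entry rejects the whole set
    # lenient reading: drop only the invalid entries
    return [name for name, ok in manifest_entries if ok]


# Hypothetical manifest in which one member CA certificate over-claims.
entries = [("member-a.cer", True), ("member-b.cer", False), ("member-c.cer", True)]
print(accepted_objects(entries, strict=True))
print(accepted_objects(entries, strict=False))
```

Under the strict reading one over-claiming child is enough to lose every object listed alongside it, which matches the impact described above.
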
> Based on our access logs, we estimate that 327 instances of Relying Party software were impacted.
> On Monday 11 January, we will implement a fix so that every time a RIPE NCC certificate changes, we will look at all members to see if their certificates are over-claiming and force an immediate re-issue if so. This approach does not give us a 100% bullet-proof fix to the problem, but it reduces the period of over-claiming from an hour to a couple of minutes. 
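
As I read it, the planned fix is a scan that runs whenever the parent certificate changes. The following is a rough sketch of that idea only, not the RIPE NCC implementation. It assumes prefix-only resources, and `on_parent_change` and the `reissue` callback are hypothetical names.

```python
from ipaddress import ip_network


def is_covered(prefix, parent_prefixes):
    """True if prefix lies inside at least one single parent prefix."""
    net = ip_network(prefix)
    return any(
        net.version == p.version and net.subnet_of(p)
        for p in map(ip_network, parent_prefixes)
    )


def on_parent_change(parent_prefixes, members, reissue):
    """Check every member CA and force a re-issue for each one that over-claims.

    members: dict mapping member name to the prefixes on its current certificate.
    reissue: callback invoked with the member name (hypothetical hook).
    """
    reissued = []
    for name, prefixes in members.items():
        if not all(is_covered(p, parent_prefixes) for p in prefixes):
            reissue(name)
            reissued.append(name)
    return reissued


# Hypothetical data: member-b still holds a prefix the parent no longer has.
members = {
    "member-a": ["192.0.2.0/25"],
    "member-b": ["198.51.100.0/24", "203.0.113.0/24"],
}
print(on_parent_change(["192.0.2.0/24", "198.51.100.0/24"], members, reissue=print))
```

The scan narrows the window but does not close it, since the parent is still published before the re-issued children. That is consistent with the note that this is not a complete fix.
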
> We will work on reducing this time to less than a minute, to further reduce the potential for inconsistency. In the longer term, we will work on implementing atomic publishing of data for this type of situation. 
> In the meantime, we strongly recommend that network operators update their RPKI Relying Party software to the latest version: 
> * Routinator 0.8.2 
> * rpki-client 6.8p1 
> * FORT 1.4.2 
> * octorpki 1.2.2 
> * RIPE NCC 3.2-2020.
> Best regards,
> Nathalie Trenaman
> Routing Security Programme Manager