Re: [Sidrops] RPKI Outage Post-Mortem

Nathalie:

Thanks for the transparency and the clear description of the events.

Russ

> On Jan 8, 2021, at 8:56 AM, Nathalie Trenaman <nathalie@ripe.net> wrote:
> 
> Summary: 
> Yesterday, on 7 January 2021, an issue with our RPKI software caused an inconsistent certificate to be published from 15:29 to 16:20 (UTC+1). This may have resulted in outages. We strongly recommend that network operators update their Relying Party software to the latest version.
> 
> Details: 
> At 15:06 (UTC+1) yesterday, we processed an outgoing transfer of IP resources to another RIR service region. This caused our system to update the corresponding RPKI certificates in our Certificate Authority (CA). 
> 
> Unfortunately, our RPKI software published the updated parent certificate (production CA) ahead of its child certificate (member CA). As a result, in the period immediately after the updated parent was published, the child certificate (updated later) still contained resources that were no longer on the updated parent certificate, so the child certificate over-claimed. This was resolved once the child certificate was updated.
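
For readers less familiar with the term, the over-claim condition described above can be sketched as a simple containment check. This is only an illustration using Python's standard ipaddress module; the function name and the documentation prefixes are my own assumptions, and it is not how the RIPE NCC software or any validator implements the check.

```python
from ipaddress import collapse_addresses, ip_network


def overclaimed(child_prefixes, parent_prefixes):
    """Return the child prefixes that are not covered by the parent's prefixes."""
    nets = [ip_network(p) for p in parent_prefixes]
    # Collapse per address family so that adjacent parent prefixes are merged
    # (collapse_addresses does not accept mixed IPv4/IPv6 input).
    parents = list(collapse_addresses(n for n in nets if n.version == 4))
    parents += list(collapse_addresses(n for n in nets if n.version == 6))

    missing = []
    for prefix in child_prefixes:
        child = ip_network(prefix)
        covered = any(
            child.version == parent.version and child.subnet_of(parent)
            for parent in parents
        )
        if not covered:
            missing.append(prefix)
    return missing


# Hypothetical example: the parent loses 198.51.100.0/24 in a transfer,
# but the child certificate has not been re-issued yet.
parent_before = ["192.0.2.0/24", "198.51.100.0/24"]
parent_after = ["192.0.2.0/24"]
child = ["192.0.2.0/25", "198.51.100.0/24"]

print(overclaimed(child, parent_before))  # nothing over-claimed
print(overclaimed(child, parent_after))   # the transferred prefix is over-claimed
```

An empty result means the child certificate is consistent with its parent; anything else is the window of inconsistency the post-mortem describes.
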
> 
> Currently we have three separate processes:
> * One that updates the resources in the registry in RPKI (every 15 min)
> * One that updates the resources of the RIPE production CA (parent of all member CAs) from the registry (hourly, takes ~5 min)
> * One that updates the resources for member CAs from the registry (hourly, takes ~40 min)
> 
> If there is an outgoing transfer and the member CA update runs before the production CA update, the situation with over-claiming occurs. The update of the member CA needs to happen at the same time (i.e. same RRDP delta), or before the production CA resources are reduced. This does not happen the other way around (and so is not an issue with incoming resources). 
> 
> Some older Relying Parties had applied a strict manifest handling interpretation in their validator software. This meant that they were configured to reject all certificates in the manifest if a single entry was invalid. As a consequence, all RPKI certificates covering RIPE resources were rejected by these validators during this period.
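
The difference between the strict interpretation and a per-object one can be shown with a toy model. The function and file names below are hypothetical, and the sketch does not reproduce the behaviour of any particular validator release; it only shows why one invalid entry can take out every certificate listed on the same manifest.

```python
def accepted_objects(manifest_entries, is_valid, strict):
    """Return the manifest entries a relying party would accept.

    manifest_entries: list of object names listed on the manifest
    is_valid: callable mapping an object name to True or False
    strict: if True, a single invalid entry rejects the whole manifest
    """
    valid = [name for name in manifest_entries if is_valid(name)]
    if strict and len(valid) != len(manifest_entries):
        # Strict interpretation: one bad entry invalidates everything.
        return []
    # Per-object interpretation: only the bad entries are dropped.
    return valid


# Hypothetical manifest of the production CA listing three member CA certificates,
# one of which over-claims after the transfer.
entries = ["member-a.cer", "member-b.cer", "member-c.cer"]
overclaiming = {"member-b.cer"}


def check(name):
    return name not in overclaiming


print(accepted_objects(entries, check, strict=True))
print(accepted_objects(entries, check, strict=False))
```

Under the strict setting the result is empty, which matches the "all RPKI certificates covering RIPE resources were rejected" outcome described above.
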
> 
> Based on our access logs, we estimate that 327 instances of Relying Party software were impacted.
> 
> On Monday 11 January, we will implement a fix so that every time a RIPE NCC certificate changes, we will look at all members to see if their certificates are over-claiming and force an immediate re-issue if so. This approach does not give us a 100% bullet-proof fix to the problem, but it reduces the period of over-claiming from an hour to a couple of minutes. 
> We will work on reducing this time to less than a minute, to further reduce the potential for inconsistency. In the longer term, we will work on implementing atomic publishing of data for this type of situation. 
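
As I read it, the interim fix amounts to a scan over all member certificates whenever the parent changes. A minimal sketch of that idea follows, with the caveat that every name and the data layout are assumptions on my part and not the RIPE NCC implementation. For brevity it drops a prefix entirely if no single parent prefix covers it.

```python
from ipaddress import ip_network


def is_covered(prefix, parent_prefixes):
    """True if a single parent prefix of the same address family covers prefix."""
    net = ip_network(prefix)
    for parent in parent_prefixes:
        parent_net = ip_network(parent)
        if net.version == parent_net.version and net.subnet_of(parent_net):
            return True
    return False


def reissue_overclaiming(parent_prefixes, members):
    """Return {member_id: trimmed prefixes} for members that need a re-issue.

    members: dict mapping a (hypothetical) member id to the prefixes
    on that member's current certificate.
    """
    reissued = {}
    for member_id, prefixes in members.items():
        kept = [p for p in prefixes if is_covered(p, parent_prefixes)]
        if len(kept) != len(prefixes):
            # The certificate over-claims: re-issue it with the reduced set.
            reissued[member_id] = kept
    return reissued


# Hypothetical state right after the production CA certificate was reduced.
parent_now = ["192.0.2.0/24", "2001:db8::/32"]
members = {
    "member-a": ["192.0.2.0/25"],
    "member-b": ["192.0.2.128/25", "198.51.100.0/24"],
    "member-c": ["2001:db8:1::/48"],
}

print(reissue_overclaiming(parent_now, members))
```

Running the scan immediately after the parent changes shortens the inconsistent window, but as the post-mortem notes it does not close it; only publishing parent and child in the same delta would.
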
> 
> In the meantime, we strongly recommend that network operators update their RPKI Relying Party software to the latest version: 
> * Routinator 0.8.2 
> * rpki-client 6.8p1 
> * FORT 1.4.2 
> * octorpki 1.2.2 
> * RIPE NCC 3.2-2020.12.10.13.57
> 
> Best regards,
> 
> Nathalie Trenaman
> Routing Security Programme Manager
> RIPE NCC