[Sidrops] weak validation is unfit for production (Was: Reason for Outage report)

Job Snijders <job@ntt.net> Thu, 27 August 2020 14:28 UTC

Date: Thu, 27 Aug 2020 14:28:27 +0000
From: Job Snijders <job@ntt.net>
To: sidrops@ietf.org
Message-ID: <20200827142827.GC88356@bench.sobornost.net>
References: <DE33EFAE-FBD2-478F-92A9-1FBD81CCC43F@arin.net> <727F6FBD-F73C-4F58-AE2D-0276B2A183A3@arin.net> <20200826160001.GF95612@bench.sobornost.net> <20200826202442.232829fc@grisu.home.partim.org>
In-Reply-To: <20200826202442.232829fc@grisu.home.partim.org>
X-Clacks-Overhead: GNU Terry Pratchett
Archived-At: <https://mailarchive.ietf.org/arch/msg/sidrops/cpIS9s33ZDVQahxP2nXaifzOPoo>
Subject: [Sidrops] weak validation is unfit for production (Was: Reason for Outage report)

Dear all,

It pains me to write this email. It appears there is an increasingly
acrimonious situation in which RIPE NCC, Cloudflare, and NLNetLabs
representatives not only produce and publish insecure software, but also
argue towards erosion of the robustness of the object security RPKI
depends on.

I'm drawing harsh conclusions: the reality is that we are now 6 months
into /what should've been/ a simple bug report, but it has turned into a
trench war. Folks are digging in their heels deeper. Attempts in this group to
bridge the knowledge gap have failed so far.

On Wed, Aug 26, 2020 at 08:24:42PM +0200, Martin Hoffmann wrote:
> Job Snijders wrote:
> > The current versions of routinator and ripe ncc's validator have weak
> > (lacking) support for manifest handling, there are other issues in
> > both softwares that don't yield errors where they should yield errors
> > related to manifest handling. Neither implementation handles
> > manifests correctly at the moment, so neither software currently can
> > be used to confirm the correct publication of manifest related
> > data. :-(
> 
> To the best of my knowledge, Routinator and the RIPE NCC RPKI
> Validator handle manifests according to the specifications laid out in
> the relevant standards track IETF documents. 

The implementers of RIPE NCC's validator, Routinator, and OctoRPKI
entirely missed the point of WHY RPKI Manifests exist at all. The bigger
picture is ignored; one can't look at normative terms in a vacuum.

I quote from the INTRODUCTION of RFC6486:

    "A manifest is intended to allow an RP to detect unauthorized object
    removal or the substitution of stale versions of objects at a
    publication point."

A Manifest makes it possible for a validator software to react sanely
when data tampering is detected. Manifests exist to *protect* both the
issuing CA and the RP; failure to acknowledge the purpose of manifests
is akin to the famous quote "the operation was successful - but the
patient died". Did any CA ever wish for an incomplete view of their
routing intentions to be transformed into routing decisions? Zero CAs
want this.

One has to look further than the normative terms. One has to realize
what the implications are for routing in the global system, and inevitably
the conclusion is to err on the side of caution: to be cynical about
what data is provided via an untrusted network input channel. Why
implement a virus scanner which can detect virus files, but
subsequently doesn't do anything about them?

Manifests are the *only* mechanism to verify a publication point's
completeness and integrity. Neither Routinator nor RIPE NCC's software
attach any consequence to integrity issues at a publication point. Both
continue to emit as many VRPs as possible, regardless of whether the
publication point is complete to begin with! 
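
To make "attach a consequence" concrete, here is a minimal sketch in
Python of an all-or-nothing check. It is an illustration only, not code
from any validator: the function names and the dict-based inputs are
hypothetical, and it assumes the manifest itself has already been
verified as valid and current. RFC 6486 manifests list a SHA-256 hash
per file, which is what the sketch compares against.

```python
import hashlib


def check_publication_point(manifest, fetched):
    """Compare fetched files against a manifest's file list.

    manifest: dict of file name -> expected SHA-256 hex digest
              (hypothetical representation of a verified manifest).
    fetched:  dict of file name -> bytes actually retrieved.
    Returns (ok, problems).
    """
    problems = []
    for name, expected in sorted(manifest.items()):
        data = fetched.get(name)
        if data is None:
            problems.append(f"missing: {name}")
        elif hashlib.sha256(data).hexdigest() != expected:
            problems.append(f"hash mismatch: {name}")
    return (not problems, problems)


def usable_objects(manifest, fetched):
    """Strict policy: use every object on the manifest, or none."""
    ok, _ = check_publication_point(manifest, fetched)
    return {name: fetched[name] for name in manifest} if ok else {}
```

With this policy a single missing or altered .roa file means the
publication point contributes no VRPs at all, instead of a partial set.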

The data structure of Route Origin Authorizations (ROAs) allows only a
single origin ASN per .roa file. This means network operators who wish
to grant permission to multiple ASNs (a common example: their own and
their customers' ASNs) to originate parts of their IP space *have*
to create multiple .roa files. The IP block owner's routing intentions
can only be considered when the full bundle of .roa files is available.

Logically, when some .roa files are missing (which according to a valid
current manifest must be present), the remaining .roa files at the
publication point become useless as they represent an *incomplete*
overview of routing intentions; even worse, those files flip from
'useless' to 'dangerous' when they are injected as VRPs into the
operator's routing system.
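
The flip from 'useless' to 'dangerous' can be shown with a small
sketch of origin validation in the style of RFC 6811. This is an
illustration under stated assumptions, not any validator's code: the
function name and the tuple layout are hypothetical, and the prefix and
ASNs are taken from the documentation ranges.

```python
import ipaddress


def rov_state(prefix, origin_asn, vrps):
    """Classify a route against a set of VRPs (RFC 6811 style).

    vrps: iterable of (prefix, max_length, asn) tuples.
    """
    route = ipaddress.ip_network(prefix)
    covered = False
    for vrp_prefix, max_length, asn in vrps:
        vrp_net = ipaddress.ip_network(vrp_prefix)
        # Skip VRPs of the other address family or that do not cover the route.
        if route.version != vrp_net.version or not route.subnet_of(vrp_net):
            continue
        covered = True
        if asn == origin_asn and route.prefixlen <= max_length:
            return "valid"
    return "invalid" if covered else "not-found"


# Two .roa files: the operator's own ASN and a customer's ASN.
complete = [("192.0.2.0/24", 24, 64500), ("192.0.2.0/24", 24, 64501)]
# The customer's .roa file went missing; only one VRP is emitted.
incomplete = complete[:1]

print(rov_state("192.0.2.0/24", 64501, complete))    # valid
print(rov_state("192.0.2.0/24", 64501, incomplete))  # invalid
print(rov_state("192.0.2.0/24", 64501, []))          # not-found
```

With the partial set, the customer's legitimate announcement becomes
'invalid' and is rejected by anyone doing 'invalid == reject'. With no
VRPs from that publication point at all, the same announcement is merely
'not-found'.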

Manifests are analogous to Debian's "Release + Release.gpg" APT
archive concepts. APT (or yum/dnf) do *not* proceed to install packages
when critical dependencies are missing, or when the SIGNED checksums do
not match the checksum of the downloaded .deb file.  An administrator
has to *explicitly* override (-y --force) to install such packages when
dependencies or checksums don't match.

Let me demonstrate what happens when I cherry-pick just a few words you
wrote, and withhold some of your other words. You wrote this email:
https://mailarchive.ietf.org/arch/msg/sidrops/7JxOCNBvYbwDHL7hcPHsfvxto0Q/

*** start of modified email ***
    On Wed, Aug 26, 2020 at 08:24:42PM +0200, Martin Hoffmann wrote:
    > Routinator and the RIPE NCC RPKI Validator have issues.
*** end of modified email ***

Do you see the issue now? I didn't even change the order of your words,
I merely withheld some of the text you wrote, and the resulting text is
entirely contradictory to what you intended to write!

Let's be honest, neither RIPE NCC nor NLNetLabs have real experience
using RPKI ROV 'invalid == reject' in their own networks. RIPE NCC so
far has refused to implement ROV in AS 3333 out of fear, and NLNetLabs'
own ASN is a simple single-homed stub network. Why are both
organisations ignoring the community's pleas to fix a security issue?
Why the hubris? Do you really think you know better? Why does Alexander
Band say that fixing this is "not a priority", why is RIPE NCC refusing
to commit a one-line patch to fix their validator?

Is loss of face the issue? The longer the delay to provide a fix, the
longer NLNetLabs and RIPE NCC keep hurting their users (and dependents).
Is this what one calls 'good for the Internet'? The issue was brought to
attention MONTHS [1] ago; it should've taken a few days to get it patched.

> Given that this topic is currently discussed in this very working
> group and there wasn’t outright consensus on how software should behave
> in these cases, it seems only prudent to delay modifications until
> after such consensus has been achieved.

The only ones arguing against the consensus are RIPE NCC and NLNetLabs
employees. Go figure. Staff and knowledge were exchanged between the two
software houses, so it is easy to see how the misconceptions continued to
proliferate. It is not too late to change course, but catch-up is
needed.

Believe it or not, RIPE NCC, Cloudflare, and NLNetLabs are now at an
existential crisis: your credibility is on the line. Are you going to
produce routing security software which actually improves security, or
not? Will you attempt to absorb decades of PKI and X.509 experience, or
throw it all to the wind?

Currently routinator + ripe ncc's validator + octorpki set their users
up for failure. Operators using these softwares ARE AT NEEDLESS RISK. 

Regards,

Job

[1]: https://github.com/NLnetLabs/routinator/issues/319
https://github.com/RIPE-NCC/rpki-validator-3/issues/232
https://github.com/RIPE-NCC/rpki-validator-3/issues/158
https://github.com/cloudflare/cfrpki/issues/38