Re: [Sidrops] weak validation is unfit for production (Was: Reason for Outage report)

Tim Bruijnzeels <tim@nlnetlabs.nl> Fri, 28 August 2020 06:33 UTC

From: Tim Bruijnzeels <tim@nlnetlabs.nl>
In-Reply-To: <20200827142827.GC88356@bench.sobornost.net>
Date: Fri, 28 Aug 2020 08:33:04 +0200
Cc: sidrops@ietf.org
Content-Transfer-Encoding: quoted-printable
Message-Id: <DEBF83EC-B5B7-490B-9F30-19571991E273@nlnetlabs.nl>
References: <DE33EFAE-FBD2-478F-92A9-1FBD81CCC43F@arin.net> <727F6FBD-F73C-4F58-AE2D-0276B2A183A3@arin.net> <20200826160001.GF95612@bench.sobornost.net> <20200826202442.232829fc@grisu.home.partim.org> <20200827142827.GC88356@bench.sobornost.net>
To: Job Snijders <job@ntt.net>
X-Mailer: Apple Mail (2.3608.80.23.2.2)
Archived-At: <https://mailarchive.ietf.org/arch/msg/sidrops/Z3ziGgKNWgdpoYa8yoN939L9eT0>
Subject: Re: [Sidrops] weak validation is unfit for production (Was: Reason for Outage report)

Dear WG,

Please allow me to zoom out a bit.

Much of this discussion is caused by:
a) different interpretations of a spec
b) differing views on Postel's Law and how it should (not) be applied in a security context

@a: the spec

While I write RPKI CA code, I was not intimately involved in writing validation code related to the particular issue in ARIN's manifest. The discussion, however, was painful. Bug reports were created. They were not blindly followed. People double checked. People argued that it should be discussed in the IETF, and not on GitHub. People had differing interpretations of which encodings are okay. Painful as it was, I believe that in the end a discussion based on the content of this particular issue did take place in this group. ARIN fixed their manifest and the RP implementers planned fixes for inclusion in upcoming releases.

@b: the robustness principle and security

Should any and all violations of spec lead to the rejection of objects?

Arguments can be made, and are being made, that in the context of security the answer to this question is 'yes'. It makes for a simpler world, and simple is good when it comes to security. There is no slippery slope anymore. Taking this even further, one can argue (in the context of manifests) that if an RP knows that it has *all current products* of a CA, and the CA clearly messed up even one object - they should no longer be trusted to know what they are doing.
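
To make the two positions concrete, here is a minimal sketch in Python of how an RP could treat a publication point whose contents do not match its manifest. It is an illustration under assumptions, not the behaviour of any existing validator: the manifest is modelled as a plain mapping of file name to SHA-256 hex digest, and the names check_publication_point and accepted_objects are hypothetical.

```python
import hashlib
from typing import Dict, List


def check_publication_point(manifest: Dict[str, str],
                            fetched: Dict[str, bytes]) -> List[str]:
    """Return a list of problems; an empty list means complete and intact.

    manifest: file name -> expected SHA-256 hex digest (simplified model).
    fetched:  file name -> bytes actually retrieved from the repository.
    """
    problems = []
    for name, expected in sorted(manifest.items()):
        data = fetched.get(name)
        if data is None:
            problems.append(f"missing: {name}")
        elif hashlib.sha256(data).hexdigest() != expected:
            problems.append(f"hash mismatch: {name}")
    return problems


def accepted_objects(manifest: Dict[str, str],
                     fetched: Dict[str, bytes],
                     strict: bool) -> List[str]:
    """Names of the objects an RP would go on to process.

    strict=True:  any problem rejects the whole publication point.
    strict=False: keep every listed object that is present and matches.
    """
    if strict and check_publication_point(manifest, fetched):
        return []
    return sorted(
        name for name, expected in manifest.items()
        if name in fetched
        and hashlib.sha256(fetched[name]).hexdigest() == expected
    )


# Hypothetical publication point: two ROAs listed, only one retrieved.
manifest = {
    "a.roa": hashlib.sha256(b"roa-a").hexdigest(),
    "b.roa": hashlib.sha256(b"roa-b").hexdigest(),
}
fetched = {"a.roa": b"roa-a"}

strict_result = accepted_objects(manifest, fetched, strict=True)
lenient_result = accepted_objects(manifest, fetched, strict=False)
```

With this input the strict policy accepts nothing from the publication point, while the lenient policy still accepts a.roa. Which of the two outcomes is the right one is exactly the question under discussion.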

The counterarguments are that we have existing deployments, and that there are shades of gray in this world.

The bug in ARIN's manifest seemed to me to be a violation of spec, but not necessarily a critical security issue. The algorithms are all specified in the relevant RPKI RFCs - hashes and signatures providing proof of possession worked out. The dialect may have been funny, but the intent was clear.

This is not the only existing encoding issue. The Bouncy Castle library used by both the RIPE NCC and LACNIC code bases produces CMS that is BER, not DER, encoded. The specs say that DER MUST be used. Even so, OpenSSL and other implementations will happily decode BER encoded CMS objects - most other CMS implementations use BER after all. This is a violation of spec, but is this a critical security issue?
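
As an illustration of what a strict DER check looks like compared to a tolerant BER decoder, here is a minimal Python sketch that inspects only the length field of the outermost TLV. DER requires a definite length in the shortest possible form, while BER also permits an indefinite length and non-minimal long forms. The function name length_encoding_problem is hypothetical, the sketch only handles single-octet tags, and I am not claiming that any of these particular forms is what the objects mentioned above contain.

```python
from typing import Optional


def length_encoding_problem(tlv: bytes) -> Optional[str]:
    """Return why the outermost length field is not DER, or None if it is fine.

    Only the first tag and length are inspected; nested values are not parsed.
    """
    if len(tlv) < 2:
        return "truncated"
    if tlv[0] & 0x1F == 0x1F:
        return "multi-octet tag not handled in this sketch"
    first = tlv[1]
    if first < 0x80:
        return None  # short form: always minimal
    if first == 0x80:
        return "indefinite length (allowed in BER, forbidden in DER)"
    if first == 0xFF:
        return "reserved length octet"
    count = first & 0x7F
    length_octets = tlv[2:2 + count]
    if len(length_octets) < count:
        return "truncated"
    if length_octets[0] == 0:
        return "leading zero in long-form length (non-minimal)"
    if int.from_bytes(length_octets, "big") < 0x80:
        return "long form used for a length below 128 (non-minimal)"
    return None


# SEQUENCE { INTEGER 5 } encoded three ways.
der = bytes([0x30, 0x03, 0x02, 0x01, 0x05])
ber_indefinite = bytes([0x30, 0x80, 0x02, 0x01, 0x05, 0x00, 0x00])
ber_long_form = bytes([0x30, 0x81, 0x03, 0x02, 0x01, 0x05])

results = [length_encoding_problem(x)
           for x in (der, ber_indefinite, ber_long_form)]
```

All three byte strings carry the same value. A tolerant decoder accepts all of them; a strict one accepts only the first.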

Other examples were found over the years. And I have no doubt that other deviations from the specs exist today.

Some went unnoticed for years (9 years of deployment) - and are only being found now that more independent RP tools are being built.

This is a good thing. But does this mean that we should reject all objects that do not conform to spec today? That would leave alarmingly few VRPs. Should we apply 'shades of gray' - and allow certain deviations from spec if we feel they are not harmful, indefinitely? Or should we aim for full compliance in years to come, but give CAs time to fix existing issues?

I would like to believe that it's okay to have different viewpoints on this during the discussion phase. And while I may have my own opinions of what I feel is right, I am quite willing to change my mind if we can discuss these matters openly and constructively.

Kind regards,

Tim



> On 27 Aug 2020, at 16:28, Job Snijders <job@ntt.net> wrote:
> 
> Dear all,
> 
> It pains me to write this email. It appears there is an increasingly
> acrimonious situation in which RIPE NCC, Cloudflare, and NLNetLabs
> representatives not only produce and publish insecure software, but also
> argue towards erosion of the robustness of the object security RPKI
> depends on.
> 
> I'm drawing harsh conclusions: the reality is that we are now 6 months
> into /what should've been/ a simple bug report, but turned into a trench
> war. Folks are digging in their heels deeper.  Attempts in this group to
> bridge the knowledge gap have failed so far.
> 
> On Wed, Aug 26, 2020 at 08:24:42PM +0200, Martin Hoffmann wrote:
>> Job Snijders wrote:
>>> The current versions of routinator and ripe ncc's validator have weak
>>> (lacking) support for manifest handling, there are other issues in
>>> both softwares that don't yield errors where they should yield errors
>>> related to manifest handling. Neither implementation handles
>>> manifests correctly at the moment, so neither software currently can
>>> be used to confirm the correct publication of manifest related
>>> data. :-(
>> 
>> To the best of my knowledge, Routinator and the RIPE NCC RPKI
>> Validator handle manifests according to the specifications laid out in
>> the relevant standards track IETF documents. 
> 
> The implementers of RIPE NCC's validator, Routinator, and OctoRPKI
> entirely missed the point of WHY RPKI Manifests exist at all. The bigger
> picture is ignored, one can't look at normative terms in a vacuum.
> 
> I quote from the INTRODUCTION of RFC6486:
> 
>    "A manifest is intended to allow an RP to detect unauthorized object
>    removal or the substitution of stale versions of objects at a
>    publication point."
> 
> A Manifest makes it possible for a validator software to react sanely
> when data tampering is detected. Manifests exist to *protect* both the
> issuing CA and the RP, failure to acknowledge the purpose of manifests
> is akin to the famous quote "the operation was successful - but the
> patient died". Did any CA ever wish for an incomplete view of their
> routing intentions to be transformed into routing decisions? Zero CAs
> want this.
> 
> One has to look further than the normative terms, one has to realize
> what the implications are to routing in the global system and inevitably
> the conclusion is to err on the side of caution. To be cynical about
> what data is provided via an untrusted network input channel. Why
> implement a virus scanner, which can detect virus files, but
> subsequently doesn't do anything about it?
> 
> Manifests are the *only* mechanism to verify a publication point's
> completeness and integrity. Neither Routinator nor RIPE NCC's software
> attach any consequence to integrity issues at a publication point. Both
> continue to emit as many VRPs as possible, regardless of whether the
> publication point is complete to begin with! 
> 
> The data structure of Route Origin Authorizations (ROAs) allows only a
> single origin ASN per .roa file. This means network operators who wish
> to grant permission to multiple ASNs (a common example: their own and
> their customers' ASNs) to originate parts of their IP space *have* to
> create multiple .roa files. The IP Block owner's routing intentions
> can only be considered when the full bundle of .roa files is available.
> 
> Logically, when some .roa files are missing (which according to a valid
> current manifest must be present), the remaining .roa files at the
> publication point become useless as they represent an *incomplete*
> overview of routing intentions; even worse those files flip from
> 'useless' to 'dangerous' when they are injected as VRPs into the
> operator's routing system.
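
The effect described in the quoted paragraph above can be sketched in a few lines of Python. This is a simplified version of route origin validation as I understand RFC 6811 (valid, invalid, or not-found), not code taken from any RP or router. The prefixes and ASNs are documentation values, and the names VRP and validate are hypothetical.

```python
import ipaddress
from typing import List, NamedTuple


class VRP(NamedTuple):
    prefix: str      # e.g. "192.0.2.0/24"
    max_length: int  # longest prefix length the origin may announce
    asn: int         # authorized origin ASN


def validate(route_prefix: str, origin_asn: int, vrps: List[VRP]) -> str:
    """Classify a route as 'valid', 'invalid' or 'not-found'."""
    route = ipaddress.ip_network(route_prefix)
    covered = False
    for vrp in vrps:
        net = ipaddress.ip_network(vrp.prefix)
        if route.version != net.version or not route.subnet_of(net):
            continue  # this VRP does not cover the route
        covered = True
        if vrp.asn == origin_asn and route.prefixlen <= vrp.max_length:
            return "valid"
    # Covered by at least one VRP but matched by none: invalid.
    return "invalid" if covered else "not-found"


# One ROA for the holder's own ASN, one for a customer's ASN.
all_vrps = [
    VRP("192.0.2.0/24", 24, 64500),    # holder
    VRP("192.0.2.128/25", 25, 64501),  # customer
]
# The same publication point with the customer's .roa file missing.
partial_vrps = all_vrps[:1]

with_all = validate("192.0.2.128/25", 64501, all_vrps)
with_partial = validate("192.0.2.128/25", 64501, partial_vrps)
```

With both VRPs the customer's announcement is valid. With only the holder's VRP it is covered but unmatched, so it becomes invalid, and a network applying 'invalid == reject' would drop it.
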
> 
> Manifests are analogous to Debian's "Release + Release.gpg" APT
> archive concepts. APT (or yum/dnf) do *not* proceed to install packages
> when critical dependencies are missing, or when the SIGNED checksums do
> not match the checksum of the downloaded .deb file.  An administrator
> has to *explicitly* override (-y --force) to install such packages when
> dependencies or checksums don't match.
> 
> Let me demonstrate what happens when I cherry-pick just a few words you
> wrote, and withhold some of your other words. You wrote this email:
> https://mailarchive.ietf.org/arch/msg/sidrops/7JxOCNBvYbwDHL7hcPHsfvxto0Q/
> 
> *** start of modified email ***
>    On Wed, Aug 26, 2020 at 08:24:42PM +0200, Martin Hoffmann wrote:
>> Routinator and the RIPE NCC RPKI Validator have issues.
> *** end of modified email ***
> 
> Do you see the issue now? I didn't even change the order of your words,
> I merely withheld some of the text you wrote, and the resulting text is
> entirely contradictory to what you intended to write!
> 
> Let's be honest, neither RIPE NCC nor NLNetLabs have real experience
> using RPKI ROV 'invalid == reject' in their own networks. RIPE NCC so
> far has refused to implement ROV in AS 3333 out of fear, and NLNetLabs'
> own ASN is a simple single-homed stub network. Why are both
> organisations ignoring the community's pleas to fix a security issue?
> Why the hubris? Do you really think you know better? Why does Alexander
> Band say that fixing this is "not a priority", why is RIPE NCC refusing
> to commit a one-line patch to fix their validator?
> 
> Is loss of face the issue? The longer the delay to provide a fix, the
> longer NLNetLabs and RIPE NCC keep hurting their users (and dependents).
> Is this what one calls 'good for the Internet'? The issue was brought to
> attention MONTHS [1] ago, it should've been a few days to get it patched.
> 
>> Given that this topic is currently discussed in this very working
>> group and there wasn’t outright consensus on how software should behave
>> in these cases, it seems only prudent to delay modifications until
>> after such consensus has been achieved.
> 
> The only ones arguing against the consensus are RIPE NCC and NLNetLabs
> employees. Go figure. Staff and knowledge were exchanged between the two
> software houses, a path is visible how the misconceptions continued to
> proliferate. It is not too late to change course, but catch-up is
> needed.
> 
> Believe it or not, RIPE NCC, Cloudflare, and NLNetLabs are now at an
> existential crisis: your credibility is on the line. Are you going to
> produce routing security software which actually improves security, or
> not? Will you attempt to absorb decades of PKI and X.509 experience, or
> throw it all in the wind? 
> 
> Currently routinator + ripe ncc's validator + octorpki set their users
> up for failure. Operators using these softwares ARE AT NEEDLESS RISK. 
> 
> Regards,
> 
> Job
> 
> [1]: https://github.com/NLnetLabs/routinator/issues/319
> https://github.com/RIPE-NCC/rpki-validator-3/issues/232
> https://github.com/RIPE-NCC/rpki-validator-3/issues/158
> https://github.com/cloudflare/cfrpki/issues/38