[sidr] rpki repository and validation issues

Tim Bruijnzeels <tim@ripe.net> Fri, 30 March 2012 15:08 UTC

Return-Path: <tim@ripe.net>
X-Original-To: sidr@ietfa.amsl.com
Delivered-To: sidr@ietfa.amsl.com
Received: from localhost (localhost []) by ietfa.amsl.com (Postfix) with ESMTP id A37A721F865D for <sidr@ietfa.amsl.com>; Fri, 30 Mar 2012 08:08:46 -0700 (PDT)
X-Virus-Scanned: amavisd-new at amsl.com
X-Spam-Flag: NO
X-Spam-Score: -2.299
X-Spam-Status: No, score=-2.299 tagged_above=-999 required=5 tests=[AWL=-0.300, BAYES_00=-2.599, J_CHICKENPOX_53=0.6]
Received: from mail.ietf.org ([]) by localhost (ietfa.amsl.com []) (amavisd-new, port 10024) with ESMTP id GdSeEp-5bbhS for <sidr@ietfa.amsl.com>; Fri, 30 Mar 2012 08:08:46 -0700 (PDT)
Received: from postgirl.ripe.net (postgirl.ipv6.ripe.net [IPv6:2001:67c:2e8:11::c100:1342]) by ietfa.amsl.com (Postfix) with ESMTP id AC69A21F8652 for <sidr@ietf.org>; Fri, 30 Mar 2012 08:08:45 -0700 (PDT)
Received: from dodo.ripe.net ([]) by postgirl.ripe.net with esmtps (TLSv1:AES256-SHA:256) (Exim 4.72) (envelope-from <tim@ripe.net>) id 1SDdRR-0002yX-Lz for sidr@ietf.org; Fri, 30 Mar 2012 17:08:44 +0200
Received: from timbru.vpn.ripe.net ([]) by dodo.ripe.net with esmtps (TLSv1:AES128-SHA:128) (Exim 4.72) (envelope-from <tim@ripe.net>) id 1SDdRP-0008Kv-Ko; Fri, 30 Mar 2012 17:08:41 +0200
From: Tim Bruijnzeels <tim@ripe.net>
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: quoted-printable
Date: Fri, 30 Mar 2012 17:08:27 +0200
To: "sidr@ietf.org list" <sidr@ietf.org>
Message-Id: <D8C3F092-F913-4E60-B6C7-44AB62BDF844@ripe.net>
Mime-Version: 1.0 (Apple Message framework v1084)
X-Mailer: Apple Mail (2.1084)
X-RIPE-Spam-Level: --
X-RIPE-Spam-Report: Spam Total Points: -2.9 points pts rule name description ---- ---------------------- ------------------------------------ -1.0 ALL_TRUSTED Passed through trusted hosts only via SMTP -0.0 T_RP_MATCHES_RCVD Envelope sender domain matches handover relay domain -1.9 BAYES_00 BODY: Bayes spam probability is 0 to 1% [score: 0.0000]
X-RIPE-Signature: 784d7acfe6559f2a0b602ec6519a0719d1749e154c17d3b6e56860fb50b2a7e3
Subject: [sidr] rpki repository and validation issues
X-BeenThere: sidr@ietf.org
X-Mailman-Version: 2.1.12
Precedence: list
List-Id: Secure Interdomain Routing <sidr.ietf.org>
List-Unsubscribe: <https://www.ietf.org/mailman/options/sidr>, <mailto:sidr-request@ietf.org?subject=unsubscribe>
List-Archive: <http://www.ietf.org/mail-archive/web/sidr>
List-Post: <mailto:sidr@ietf.org>
List-Help: <mailto:sidr-request@ietf.org?subject=help>
List-Subscribe: <https://www.ietf.org/mailman/listinfo/sidr>, <mailto:sidr-request@ietf.org?subject=subscribe>
X-List-Received-Date: Fri, 30 Mar 2012 15:08:46 -0000


There are a number of separate discussions about problems with the rpki repository and ways to mitigate those problems going on on the list at the moment.

First of all, let me say that the current system works most of the time, but we are finding issues that I think should be fixed, and that are not trivial to fix.

So my take on this is that we've learned from operational experience and it would now be good to (1) enumerate the problems that we see, (2) refine a list of requirements for improvement, and then (3) find ways forward to address these requirements, without breaking the existing infrastructure. As some of you know from discussions we had, I do actually have some ideas in the solution space, but before going there in detail (beyond proof of concept) I think the WG should address (1) and (2) first.

It may be best if we could discuss this face to face, e.g. at one of the upcoming interim meetings. I am not a huge fan of interim meetings, but I am afraid this is a difficult subject to reach consensus on via the list, and too big to discuss in a 2-hour IETF slot. My preference would be to discuss this at the meeting planned just before the Vancouver IETF: I am already planning to travel to that one, and I expect it will be the easiest for most people to attend.

Since I don't know if and when this will happen, though, let me write down my ideas here, without going into solution space, where I think face-to-face or at least interactive presentations are most needed to make progress.

1 = Current problems we encounter (implementing a validator and running a publication point):

= Updates happen while we are rsync'ing from the validator:
  = We may miss objects that are on the manifest
  = We may find objects that are not on the manifest (we actually ignore these)
  = The CRL may be newer and revoke this MFT
  = The MFT may be newer and the CRL hash value is wrong

  All this makes it very difficult for the RP to make clear automated validation decisions.
  And we all know that *no one* reads the logs... or even if they do, most people won't know what to decide.
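To make this concrete: the race conditions above can at least be detected after the fact by checking the fetched objects against the manifest, so the RP gets an explicit result instead of guessing from logs. A minimal sketch (the dict-based representation of a manifest and of the fetched objects is invented for illustration; a real MFT is a signed object):

```python
import hashlib

def check_against_manifest(manifest, fetched):
    """Compare fetched objects (name -> raw bytes) against a manifest
    (name -> expected SHA-256 hex digest).

    Returns three sets -- missing, extra, mismatched -- so the RP can
    make an explicit, automated decision about this publication point."""
    missing = set(manifest) - set(fetched)    # on the manifest, not fetched
    extra = set(fetched) - set(manifest)      # fetched, but not on the manifest
    mismatched = {
        name for name in set(manifest) & set(fetched)
        if hashlib.sha256(fetched[name]).hexdigest() != manifest[name]
    }
    return missing, extra, mismatched
```

An RP that fetched mid-update would typically see a non-empty `missing` or `extra` set here, which is exactly the situation described above.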

= Prefix origin validation (pfx-validate) assumes knowledge of *all* applicable ROAs, but the RP may not get all data
  = The only way an RP can detect this is by looking at the manifest.
  = But the manifest is considered, at least by people I talk to, to not be authoritative on this.
  = The reason why a ROA might be missing is not clear to the RP
      = A man in the middle may filter bits of information (hold back ROA, MFT, CRL)
      = There may be a bug in the CA
      = There may be a bug or problem at the publisher (eg someone *deleted* the ROA on disk)
      = There may be a race condition - stuff is changing while we look, as described above

= We need to call 'rsync' on the system (there is only one implementation, libraries unavailable for most coding languages)
  = Forks cause CPU overhead and may (temporarily) result in duplicating the memory usage (at least for the JVM)
  = Parallel processing requiring system calls does not scale
  = We need rsync installed and on the path
  = We need to work with whatever version of rsync happens to be installed
  = We need to make sense of exit codes to have useful error messages and take action (or inform user)
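To illustrate the kind of glue code every RP implementation ends up writing here, a hedged sketch of forking the rsync binary and mapping a few of its exit codes to messages. The exit-code list is deliberately partial (taken from the rsync(1) man page), and the command-line flags shown are just one plausible choice:

```python
import subprocess

# Partial map of rsync exit codes (per rsync(1)) to human-readable text.
RSYNC_EXIT_CODES = {
    0: "success",
    5: "error starting client-server protocol",
    10: "error in socket I/O",
    23: "partial transfer due to error",
    24: "partial transfer due to vanished source files",
    30: "timeout in data send/receive",
}

def describe_exit(code):
    """Turn an rsync exit code into a message the RP can log or act on."""
    return RSYNC_EXIT_CODES.get(code, "unknown rsync error (code %d)" % code)

def fetch(uri, local_dir):
    """Fork rsync from the system path -- the only option when no rsync
    library exists for your language. Returns (exit_code, message)."""
    result = subprocess.run(
        ["rsync", "-rtz", "--delete", uri, local_dir],
        capture_output=True, text=True, timeout=300,
    )
    return result.returncode, describe_exit(result.returncode)
```

Note that this assumes rsync is installed and on the path, and that the installed version behaves as expected, which is exactly the fragility described above.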

= Publication point availability
  = We don't know of any commercial CDNs that do rsync
  = Doing this ourselves by having multiple rsync servers (and eg anycast) is not trivial
  = An RP possibly ending up on different (out of sync) mirror servers, and getting inconsistent responses, does not help
  = To avoid load on back-ends the general advice to RPs has been to not be too aggressive, even though they 'want' fresh data
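On the last point: one common way to keep many RPs from hitting a publication point in lock-step is to add random jitter to the polling interval. A small sketch, with made-up interval values, just to show the idea:

```python
import random

def next_poll_delay(base_seconds=3600, jitter_fraction=0.25, rng=random):
    """Return a randomized delay around base_seconds, so RPs that all
    started at the same time spread their fetches over a window instead
    of hammering the publication point simultaneously."""
    jitter = base_seconds * jitter_fraction
    return base_seconds + rng.uniform(-jitter, jitter)
```

This only spreads the load, of course; it does nothing for the freshness that RPs actually 'want'.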

= Local policy knobs in validation
  = The absence of sensible defaults makes it difficult to automate validation
  = Uncertainty here for implementers is even worse for end users: they really just want to know:
       "so, is this thing *valid*, or not?"
  = Giving them a knob confuses them and lowers the trust in this system

These are my major findings, at least; there may be more, e.g. the hierarchical repository lay-out, or the absence thereof, causing issues. Our validator gets around that by having additional configuration for the RIR TAs to do additional pre-fetching of the repositories. This helps, so yes, either this work-around or a change in those repository lay-outs would help RPs. Having said that, the problems that I enumerated above remain, in my opinion.

Chris suggested that we do some more coordinated measurements. I think this is an excellent idea and I would like to help with this effort. If possible it would be very worthwhile to get some quantifiable sense of the issues. Apart from monitoring over time, it would also be interesting to do some aggressive testing, like load-stressing a publication point, setting up a large, high-churn test repository, or setting up different validators to update far more often (e.g. every few minutes) and seeing what breaks.

2 = Possible requirements for moving forward:

This is not a complete or final list of course. I am very interested in your additions. Even though there may be requirements that we're not able to meet, there is still value in listing them and deciding explicitly.

= New schemes should iterate on existing infrastructure without breaking it
= If more than one retrieval mechanism is allowed, then *objects* should be uniquely identifiable
= Inconsistent data to relying parties should be prevented
= It should be detectable for an RP if it does not get all the data a CA intended to publish
= Protocols should have ubiquitous support across programming languages
= Update semantics and protocol should allow for distributed caching
= Local policy knobs because of validation uncertainties should be avoided as much as possible
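On the second requirement: one way to make *objects* uniquely identifiable independent of the retrieval mechanism is to name them by a hash of their DER content, so the same object yields the same identifier no matter which server or protocol delivered it. A minimal sketch (the "sha256:" identifier format is invented here for illustration, not anything specified):

```python
import hashlib

def object_id(der_bytes):
    """Content-addressed identifier for a repository object.

    Two RPs fetching the same object over different mechanisms (or
    from different mirrors) compute the same id, which makes
    inconsistencies between retrieval paths detectable."""
    return "sha256:" + hashlib.sha256(der_bytes).hexdigest()
```

With identifiers like this, distributed caching (another requirement above) also becomes easier, since cached copies can be verified against the id alone.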