Re: [sidr] rpki repository and validation issues

Tim,

Thanks for summing things up so clearly. I basically agree with you on
all points; however, I have one comment. When you say:

> = It should be detectable for an RP if it does not get all the data a CA intended to publish
I've always believed that this was the whole point of the manifest. The
repository should contain exactly the same objects that are listed in
the mft, otherwise the whole retrieved repo copy cannot be trusted. I
understand that just discarding the whole repository can be too harsh
and lead to unstable situations, but as I understand it, the mechanism
for an RP to detect whether the repository was fully retrieved is
already in place.
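
To make that concrete, here is a minimal sketch of the check I have in
mind. It assumes the manifest has already been parsed and validated into
a mapping of file name to SHA-256 hex digest. The function name, the
manifest_entries mapping and the keys of the returned dict are
hypothetical and not taken from any existing validator.

```python
import hashlib
from pathlib import Path


def check_against_manifest(manifest_entries, repo_dir, manifest_name):
    """Compare a retrieved publication point with its manifest.

    manifest_entries: dict of file name -> expected SHA-256 hex digest
                      (assumed to come from an already validated manifest).
    repo_dir:         local directory holding the retrieved copy.
    manifest_name:    file name of the manifest itself, which does not
                      list itself and is therefore excluded.
    """
    repo_dir = Path(repo_dir)
    on_disk = {
        p.name for p in repo_dir.iterdir()
        if p.is_file() and p.name != manifest_name
    }
    listed = set(manifest_entries)

    # Listed on the manifest but not retrieved.
    missing = sorted(listed - on_disk)
    # Retrieved but not listed on the manifest.
    unlisted = sorted(on_disk - listed)
    # Present in both, but the content does not match the listed hash.
    mismatched = sorted(
        name for name in listed & on_disk
        if hashlib.sha256((repo_dir / name).read_bytes()).hexdigest()
        != manifest_entries[name]
    )

    complete = not missing and not mismatched
    return {
        "complete": complete,                 # everything listed was retrieved intact
        "exact": complete and not unlisted,   # and nothing else was found
        "missing": missing,
        "unlisted": unlisted,
        "mismatched": mismatched,
    }
```

What the RP then does with an incomplete result (discard everything,
keep the previous copy, retry) is a separate policy question; the sketch
only shows that the detection part needs nothing beyond the manifest.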

As for doing shared / cooperative measurements, I'm all for it.

regards

Carlos

On 3/30/12 12:08 PM, Tim Bruijnzeels wrote:
> Hi,
>
> There are a number of separate discussions about problems with the rpki repository and ways to mitigate those problems going on on the list at the moment.
>
> First of all let me say: as a starting point the current system works most of the time, but we are finding non-trivial issues that I think should be fixed.
>
> So my take on this is that we've learned from operational experience and it would now be good to (1) enumerate the problems that we see, (2) refine a list of requirements for improvement, and then (3) find ways forward to address these requirements, without breaking the existing infrastructure. As some of you know from discussions we had, I do actually have some ideas in the solution space, but before going there in detail (beyond proof of concept) I think the WG should address 1 and 2.
>
> It may be best if we could discuss this face to face, e.g. at one of the upcoming interim meetings. I am not a huge fan of interim meetings, but I am afraid this is a difficult subject to reach consensus on via the list, and too big to discuss in a 2 hour IETF slot. My preference would be to discuss this in the planned meeting just before the Vancouver IETF: I am already planning to travel to that one, and I expect it will be the easiest to attend for most people.
>
>
> Since I don't know if and when this will happen though, let me write down my ideas, without going into the solution space, where I think face to face or at least interactive presentations are most needed to make progress.
>
>
> 1 = Current problems we encounter (implementing a validator and running a publication point):
>
> = Updates happen while we are rsync'ing from the validator:
>   = We may miss objects that are on the manifest
>   = We may find objects that are not on the manifest (we actually ignore these)
>   = The CRL may be newer and revoke this MFT
>   = The MFT may be newer and the CRL hash value is wrong
>
>   All this makes it very difficult for the RP to make clear automated validation decisions.
>   And we all know that *no one* reads the logs... or even if they do, most people won't know how to decide..
>
> = PFX-validate assumes knowledge of *all* applicable ROAs, but the RP may not get all data
>   = The only way an RP can detect this is by looking at the manifest.
>   = But the manifest is considered, at least by people I talk to, to not be authoritative on this.
>   = The reason why a ROA might be missing is not clear to the RP
>       = A man in the middle may filter bits of information (hold back ROA, MFT, CRL)
>       = There may be a bug in the CA
>       = There may be a bug or problem at the publisher (eg someone *deleted* the ROA on disk)
>       = There may be a race condition - stuff is changing while we look, as described above
>
> = We need to call 'rsync' on the system (there is only one implementation, libraries unavailable for most coding languages)
>   = Forks cause cpu overhead and may (temporarily) result in duplicating the memory usage (at least for jvm)
>   = Parallel processing requiring system calls does not scale
>   = We need rsync installed and on the path
>   = We need to like that version
>   = We need to make sense of exit codes to have useful error messages and take action (or inform user)
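
As an illustration of the exit-code point, a minimal sketch of how a
validator might wrap the rsync binary. The exit codes and descriptions
are a subset of those in the rsync man page; the split into "retry" and
"report", the function names and the chosen flags are assumptions made
for this example, not the behaviour of any existing validator.

```python
import shutil
import subprocess

# Subset of the exit codes documented in the rsync man page.
RSYNC_EXIT_CODES = {
    0: "success",
    1: "syntax or usage error",
    2: "protocol incompatibility",
    5: "error starting client-server protocol",
    10: "error in socket I/O",
    12: "error in rsync protocol data stream",
    23: "partial transfer due to error",
    24: "partial transfer due to vanished source files",
    30: "timeout in data send/receive",
    35: "timeout waiting for daemon connection",
}

# Assumption: these look transient (network trouble, or the repository
# changing while we read it), so fetching again later is reasonable.
RETRYABLE = {10, 12, 23, 24, 30, 35}


def classify_rsync_exit(code):
    """Map an rsync exit code to (action, description)."""
    description = RSYNC_EXIT_CODES.get(code, "unknown rsync exit code")
    if code == 0:
        action = "validate"
    elif code in RETRYABLE:
        action = "retry"
    else:
        action = "report"
    return action, description


def fetch(uri, dest, timeout=300):
    """Run rsync once and return (action, description, stderr)."""
    if shutil.which("rsync") is None:
        # rsync has to be installed and on the PATH.
        return "report", "rsync binary not found on PATH", ""
    proc = subprocess.run(
        ["rsync", "-rt", "--delete", "--timeout=60", uri, dest],
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    action, description = classify_rsync_exit(proc.returncode)
    return action, description, proc.stderr.strip()
```

Exit code 24 is the interesting one here: it is what rsync reports when
source files disappear during the transfer, i.e. one visible symptom of
the update race described above.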
>
> = Publication point availability
>   = We don't know of any commercial CDNs that do rsync
>   = Doing this ourselves by having multiple rsync servers (and eg anycast) is not trivial
>   = An RP possibly ending up on different (out of sync) mirror servers, and getting inconsistent responses, does not help
>   = To avoid load on back-ends the general advice to RPs has been to not be too aggressive, even though they 'want' fresh data
>
> = Local policy knobs in validation
>   = The absence of sensible defaults makes it difficult to automate validation
>   = Uncertainty here is bad for implementers and even worse for end users: they really just want to know:
>        "so, is this thing *valid*, or not?"
>   = Giving them a knob confuses them and lowers the trust in this system
>
>
> These are my major findings at least. There may be more. With regard to the issues caused by the hierarchical repository layout, or the absence thereof: our validator gets around that by having additional configuration for the RIR TAs to pre-fetch the repositories. This helps, so yes, either this work-around or a change in those repository layouts would help RPs. Having said that, the problems that I enumerated above remain, in my opinion.
>
>
> Chris suggested that we do some more coordinated measurements. I think this is an excellent idea and I would like to help with this effort. If possible it would be very worthwhile to get some quantifiable sense of the issues. Apart from monitoring over time, it would also be interesting to do some aggressive testing, like load-stressing a publication point, setting up a large, high-churn test repository, or setting up different validators to update far more often (like every few minutes) to see what breaks.
>
>
>
> 2 = Possible requirements for moving forward:
>
> This is not a complete or final list of course. I am very interested in your additions. Even though there may be requirements that we're not able to meet, there is still value in listing them and deciding.
>
> = New schemes should iterate on existing infrastructure without breaking it
> = If more than one retrieval mechanism is allowed, then *objects* should be uniquely identifiable
> = Inconsistent data to relying parties should be prevented
> = It should be detectable for an RP if it does not get all the data a CA intended to publish
> = Protocols should be ubiquitous with regard to support in coding languages
> = Update semantics and protocol should allow for distributed caching
> = Local policy knobs because of validation uncertainties should be avoided as much as possible
>
>
>
>
>
>
> Regards,
>
> Tim
>
> _______________________________________________
> sidr mailing list
> sidr@ietf.org
> https://www.ietf.org/mailman/listinfo/sidr