Re: [Sidrops] 6486bis: Failed Fetches

Hi Chris, all,

I am happy to discuss all this later today, but I will respond here as well. If you don't have time to read before the meeting, then you will probably hear me say these things again there.

> On 30 Sep 2020, at 08:55, Christopher Morrow <christopher.morrow@gmail.com> wrote:
> 
> First, thanks! having some tested results on an optional path (for
> newly added types) seems cool.
> it's at least informed some of the conversation, which seemed to go a
> bit outside the original point of order (in my re-re-reading anyway)
> 
> I think the salient points in the discussion are really (happy to be
> corrected, of course):
>  1) the -bis is proposing a simpler handling of publication point collection.
>      this simplicity is a bit at odds with some Internet principles
> (robustness principle), but
>      trades this for very clear behaviors in the RP software.
> operations folk (myself, job, mikael though I missed his mail...
> others)
>      were a bit concerned with the different flavors of 'correct'
> that can pop out the RP :( and see that being simpler about
>      the flavors/options may make all/MOST RP get the same set of
> answers... predictability / consistency are nice.

And I completely appreciate that.

>  2) the CA software and operations have to be very clear now about
> what is/not published,
>      keeping in mind that mistakes in the CA Publication Point will
> have repercussions.
>      Perhaps these need to be clearly articulated in the -bis draft?
> (more clearly?)
>      This will, I think, clearly have impact on both CA / PP software
> and on operations of the CA / PP.
>      I think this is good, actually as we're (sidrops folks i mean)
> attempting to push for:
>          o  closer compliance to consistency
>          o  more reliance upon the whole system
>          o better overall security for the routing system

The points I have been raising should not be read as opposition to change. Rather, I am pointing out what are probably unintended repercussions of the choices made. There may be tweaks that can mitigate these, and if not, then at least they should be well understood.

> 
>  3) With the proposed simplicity/strictness comes some potential
> problems, at least:
>      o problems in a CA / PP mean that CA / PP may not have their
> most updated information (routing intent) available
>      o continual problems will lead to fallback to 'unknown' in the
> routing system

This is the case that worries me most. I completely agree with the intention to have a soft landing to 'not found', but as written the bis can actually *cause* invalid routes as well.

Consider a scenario where a CA "alice" holds resources and announces a large prefix, covered by a ROA. She delegates some more specific prefixes to "bob", and he creates ROAs for them. Now "alice" removes one prefix from "bob", but he still has the ROA for it published (RFC 6492 does not allow advance warning of withdrawal). "bob" now has an invalid ROA, so his PP is rejected, and his remaining more specific routes become invalid because only the covering ROA from "alice" is left. The same happens when "alice" delegates a new resource to "bob" and he uses it immediately after requesting the certificate, but before "alice" has had a chance to publish it.
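
To make the scenario concrete, here is a minimal sketch of RFC 6811-style origin validation in Python. The prefixes, AS numbers and the names validate, before and after are all made up for illustration; this is not taken from any RP implementation.

```python
from ipaddress import ip_network

def validate(route_prefix, origin_asn, vrps):
    """Return 'valid', 'invalid' or 'not-found' for one announcement.

    vrps is a list of (prefix, max_length, asn) tuples.
    """
    route = ip_network(route_prefix)
    covered = False
    for vrp_prefix, max_len, asn in vrps:
        vrp = ip_network(vrp_prefix)
        # A VRP covers the route if the route lies inside the VRP prefix.
        if route.version == vrp.version and route.subnet_of(vrp):
            covered = True
            if route.prefixlen <= max_len and asn == origin_asn:
                return "valid"
    return "invalid" if covered else "not-found"

# Hypothetical data: alice is AS64500, bob is AS64501.
alice_roa = ("10.0.0.0/16", 16, 64500)
bob_roas = [("10.0.1.0/24", 24, 64501), ("10.0.2.0/24", 24, 64501)]

before = [alice_roa] + bob_roas  # bob's PP accepted
after = [alice_roa]              # bob's whole PP rejected

print(validate("10.0.1.0/24", 64501, before))  # valid
print(validate("10.0.1.0/24", 64501, after))   # invalid
```

Note that the second result is "invalid" rather than "not-found": the covering VRP from alice is still present once all of bob's ROAs are dropped.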

Of course there are also other failure scenarios for Bob's PP, but the resource changes worry me the most because they are relatively frequent.

One way to mitigate this is to add all of Bob's resources - as listed on the CA certificate he received from Alice for this PP - to an ignore filter. Alice's covering ROA would then be filtered and all routes for the affected resources would go to unknown. This was discussed on a side thread; see the suggested text here:

https://mailarchive.ietf.org/arch/msg/sidrops/pi9v6RNA2kMvEOY9BfOD9VHGJtc/
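
A rough sketch of such an ignore filter, in Python, could look like the following. It assumes that a VRP is dropped whenever its prefix overlaps any ignored resource; the exact matching rule is in the suggested text linked above and may well differ. The function name apply_ignore_filter and all data are hypothetical.

```python
from ipaddress import ip_network

def apply_ignore_filter(vrps, ignored_prefixes):
    """Drop every VRP whose prefix overlaps an ignored resource.

    Assumption: overlap in either direction (covering or covered)
    is enough to filter a VRP.
    """
    ignored = [ip_network(p) for p in ignored_prefixes]
    kept = []
    for prefix, max_len, asn in vrps:
        net = ip_network(prefix)
        if any(net.version == i.version and net.overlaps(i) for i in ignored):
            continue
        kept.append((prefix, max_len, asn))
    return kept

# Hypothetical data: alice's covering ROA plus an unrelated one.
vrps = [("10.0.0.0/16", 16, 64500), ("192.0.2.0/24", 24, 64500)]
# Resources on the CA certificate of bob's rejected PP.
bob_resources = ["10.0.1.0/24", "10.0.2.0/24"]

remaining = apply_ignore_filter(vrps, bob_resources)
print(remaining)  # only the unrelated 192.0.2.0/24 VRP is left
```

With no VRP left that covers Bob's prefixes, announcements for them end up as unknown instead of invalid. Under the overlap rule assumed here, the covering /16 announcement loses its VRP as well.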

The other case here is overclaims in resources held by NIRs under RIRs. Typically these will not lead to invalid routes because the RIRs do not (in practice) issue ROAs for those resources - nor delegate them to other parties. But, in these cases the entire NIR resource set would end up as "not found".

This is why I suggested a mitigation strategy different from the one discussed above. If objects are invalid only because of overclaims, and are otherwise perfectly valid, then this is a recognisable state. RPs could then choose to ignore the intersection of the resources on the overclaiming objects and the resource set issued to them instead. This would leave things valid for resources which were not changed.

The downside of mitigation is of course complexity, but without it we should accept that invalidating children will happen. I am not convinced that this is a good idea, but I can accept it if operators understand and accept the consequences.

As an aside: I want to propose an update to RFC 6492 so that child CAs can be informed when resources will be removed (as far as possible) and when new resource certificates can be expected to be published. But this will take time, and it assumes that the working group is open to this work.

>      o adding new object types seems difficult (just my reading so far)
>         particularly there's a note about 'people won't upgrade their
> RP software'
> 
> I think that the simplicity is a goal I would like to see achieved,
> specifically because inconsistent results in the RP set is problematic
> longer term.
> If we can get to more consistency before we roll a bunch more out
> we'll be able to course correct better/faster (I think).
> 
> I think the idea that people won't upgrade... sure? maybe? Also: "Do
> you  like killing kittens? because not managing your business critical
> systems is a great way to kill kittens." We've been able to push more
> people into RPKI and better routing-intent hygiene, I don't see why we
> can't keep doing that and get folk to treat part of the routing system
> as critical to their business.
> 
> We should not shy away from a solution just because we may have to
> convince people to upgrade.

The following scenarios were discussed:

1. Accept as-is - don't specify types in the -bis

This means that RPs should be updated to support new types, and that CAs can publish the new types once there is deployment or operators have been given 'enough' time. Operators will then be forced to upgrade. This implies that RP implementers should be committed to supporting new types. If the RP tool of choice does not support a new type, then operators will need to switch to another.

It's hard to quantify the delay before new types become deployable here. 6-18 months?

2. Modify the -bis: let RPs check the hash and presence of types they don't understand, but don't validate

This assumes that new types are orthogonal to existing types, i.e. not understanding ASPA is no reason to reject ROAs. The RP would verify that all files are present and that they match the hashes, so it knows there was no error in transit between it and the CA. The presence of an invalid object of a known type would still lead to invalidating the PP. This would allow a more gradual roll-out of new types.

If/when new object types are not orthogonal to existing types, then those types could receive a version change to ensure that they would be accepted only by RPs who understand the set of interconnected object types.

Under this strategy new types can be deployed on publication of RFCs, without impacting the existing RPKI.
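
As a sketch of what option 2 could mean for an RP, assuming the manifest is reduced to a mapping of file name to SHA-256 hash: every listed file must be present and match its hash, but only known types are validated. The set of known extensions, the ".asa" name for the new type, and the validate_known callback are illustrative assumptions, not text from the -bis.

```python
import hashlib
import os.path

# Assumption: object types this hypothetical RP understands.
KNOWN_EXTENSIONS = {".cer", ".crl", ".roa", ".gbr"}

def check_publication_point(manifest, files, validate_known):
    """Return (accepted, skipped_unknown_names).

    manifest: dict of file name -> hex SHA-256 from the manifest
    files:    dict of file name -> fetched bytes
    validate_known(name, data) -> bool, for known object types only
    """
    skipped = []
    for name, expected_hash in manifest.items():
        data = files.get(name)
        # Presence and hash are checked for every type, known or not.
        if data is None:
            return False, []
        if hashlib.sha256(data).hexdigest() != expected_hash:
            return False, []
        ext = os.path.splitext(name)[1]
        if ext not in KNOWN_EXTENSIONS:
            skipped.append(name)  # unknown type: intact, not validated
        elif not validate_known(name, data):
            return False, []      # invalid known object rejects the PP
    return True, skipped

# Hypothetical publication point with one ROA and one unknown object.
files = {"a.roa": b"roa-bytes", "b.asa": b"aspa-bytes"}
manifest = {n: hashlib.sha256(d).hexdigest() for n, d in files.items()}

accepted, skipped = check_publication_point(manifest, files, lambda n, d: True)
print(accepted, skipped)
```

In this sketch a missing or corrupted file of an unknown type still rejects the PP; only the content validation of unknown types is skipped.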

3. Introduce a new explicit SIA and new PPs when new object types are introduced.

As discussed, this is possible. But I believe that this is the most difficult option by far. It requires a lot more thought, and then work for CAs and RPs alike. This would essentially mean postponing ASPA for years. If the WG and ASPA authors are okay with that, then so be it. Just be aware of the repercussions.

4? Your suggestion here...

Tim

> 
> I look forward to the discussion tomorrow/thurs.
> 
> -chris
> 
> 
> 
> On Tue, Sep 29, 2020 at 10:55 AM Tim Bruijnzeels <tim@nlnetlabs.nl> wrote:
>> 
>> Hi,
>> 
>> On 7 Sep 2020, at 13:17, Tim Bruijnzeels <tim@nlnetlabs.nl> wrote:
>>> 
>>>>> 
>>>>> If we do go down this road then I think that we should also look at the manifest object itself, and let it convey which object (types) are critical (and while we are at it, we can specify types instead of using filename extensions). That way future object types could perhaps be introduced more easily - this obviously needs more discussion but it could even allow for semantics like: 1) new object please test, don't use, 2) new objects, use if you can, 3) new objects, critical - fail if you don't understand.
>>>> 
>>>> One could combine the new SIA URI and a revised manifest, in which the manifest contains the per-object flag, rather than redefining the basic object format to accommodate the flag. That would reduce the number of RFCs that need to change. Good idea.
>>> 
>>> Upon reflection I realised that even the introduction of an SIA in the issuing CA certificate will lead to issues. RPs would reject the CA certificate, and as a result the whole PP of the parent CA. This means that the SIA cannot be deployed without leaving a significant number of RP installations behind. E.g. if I run a delegated CA under an RIR which wants to adopt ASPA, and I get a new CA certificate with the additional SIA from my parent, then 1000s (RIPE NCC >12k) other CAs will also be rejected.
>> 
>> 
>> I have done some testing on this.
>> 
>> Section 4.8.8.1 of RFC 6487 seems to say that additional SIA OIDs for CA certificates can be expected and can be ignored. And indeed it seems that all current RP software will accept (and ignore) additional SIAs.
>> 
>> At least that's the result from adding an extra SIA in a Krill branch and running the end to end tests using the following validator versions:
>> 
>> fortvalidator 1.4.0
>> OctoRPKI v1.1.4 (2019-08-06T16:51:07-0700)
>> rcynic version believed to be buildbot-1.0.1544679302
>> routinator 0.7.1
>> rpkiclient 6.7p1
>> rpkivalidator3 3.1-2020.08.06.14.39
>> 
>> I am still not very happy about the overhead that this approach implies:
>> 
>> Additional publication points which need to be maintained, which may be out of sync. What if MFT A has a certain ROA and MFT B doesn't? This can lead to serious operational impact if an announcement would be valid under one, but not the other.
>> 
>> It would also require much more fundamental code changes for producing CAs - one cannot just support a new object type. One has to support an additional publication model. Parent CAs have to be willing to include the new SIAs as well, affecting the ability to deploy for delegated CAs.
>> 
>> But, at least it seems that it can be introduced without breaking things immediately for existing deployed validators.
>> 
>> Tim
>> _______________________________________________
>> Sidrops mailing list
>> Sidrops@ietf.org
>> https://www.ietf.org/mailman/listinfo/sidrops
> 