Re: [Sidrops] 6486bis: Failed Fetches

Tim Bruijnzeels <tim@nlnetlabs.nl> Fri, 28 August 2020 07:45 UTC

From: Tim Bruijnzeels <tim@nlnetlabs.nl>
In-Reply-To: <cf7030be-adc4-dde4-7eda-516339fd6c91@verizon.net>
Date: Fri, 28 Aug 2020 09:45:39 +0200
Cc: sidrops@ietf.org
Message-Id: <E8A259E6-869D-4D2C-87F0-87462B9731E2@nlnetlabs.nl>
References: <20200817163134.29aa1a6b@glaurung.nlnetlabs.nl> <c1a8fffb-9106-d08e-4254-44ddf1a0115a@verizon.net> <20200818083659.1922a98c@grisu.home.partim.org> <6cebcc89-3e07-8a85-3813-b4ae9887d119@verizon.net> <20200826122539.52493813@glaurung.nlnetlabs.nl> <b71c9c88-fb10-b037-d06a-910711e51e04@verizon.net> <cf7030be-adc4-dde4-7eda-516339fd6c91@verizon.net>
To: Stephen Kent <stkent=40verizon.net@dmarc.ietf.org>
Archived-At: <https://mailarchive.ietf.org/arch/msg/sidrops/3cUsef5b70RJQvYJyglGQXX6zHc>
Subject: Re: [Sidrops] 6486bis: Failed Fetches

Hi,

> On 27 Aug 2020, at 17:07, Stephen Kent <stkent=40verizon.net@dmarc.ietf.org> wrote:
> 
> Martin,
>> ...
>>>> "not update anymore" is not how I would state the result. This fetch
>>>> will fail. Because a failed fetch will be reported to the RP
>>>> operations staff, hopefully they will contact the cognizant entity
>>>> for the pub point in question, causing the error to be fixed. Then a
>>>> subsequent fetch can succeed.
>>> That seems like an overly optimistic approach to the issue. Assume the
>>> problem is created by a bug or, worse, design oversight in the CA
>>> software. The turnaround from discovering the issue to deploying a fix
>>> can easily be weeks with some vendors. During all that time, not only
>>> can no ROAs be updated and may child CA certificates slowly expire, the
>>> entire CA’s data will not be available at all for any newly deployed
>>> relying parties. With containerised deployment, this is quite a serious
>>> issue.
>>> 
>>> As a consequence, this approach will make the routing system less
>>> secure for, I’d like to argue, no actual gain.
> 
> When the WG chairs tell me that I misunderstood the WG consensus re strict correctness, I will revisit this topic. However, the recent messages from Mikael and Job suggest that your perception of WG consensus is not accurate. This issue is one that needs to be decided by network operators, not by developers of RP software. I agree with the observations made by Mikael and Job, i.e., that requiring strict conformance with manifest rules is the preferred approach.
> 
> 
>>>>> You could argue “Don’t do that, then” but this approach doesn’t make
>>>>> the RPKI more robust but rather makes it break more easily on simple
>>>>> oversights.
>>>> My sense of the WG discussion was that the majority of  folks chose
>>>> to prioritize correctness over robustness, and I made numerous
>>>> changes to the text to reflect that.
>>> I disagree with the blanket assessment that this approach makes RPKI more
>>> correct. To switch to the example I should have used in the first place:
>>> Ignoring a broken GBR object when producing a list of VRPs does not
>>> make the list less correct. In fact, the opposite is true: Ignoring the
>>> CA or updates to the CA because of a broken GBR makes this list less
>>> correct.
> 
> I suspect we disagree on what constitutes "correct." A correctly functioning CA does not publish objects with bad signatures or format errors, does not use certs that have expired, does not fail to replace expired or stale objects, does not include objects at pub points that are not on the manifest, etc.

I agree that a correctly functioning CA will produce a sound set of objects. Also, it is highly unlikely that a CA will simply publish an expired ROA. The much more likely scenario is that a CA is unable to make or publish updates, so the MFT and CRL go stale and the MFT EE certificate expires well before any ROA or issued certificate would.
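
To make that concrete, a rough Python sketch of how an RP might classify this failure mode; the field names (next_update, ee_not_after) are purely illustrative and not taken from any particular RP implementation:

    from datetime import datetime, timezone

    def publication_point_status(manifest, crl, now=None):
        """Coarse classification of the 'CA stopped publishing' failure mode.

        A CA that can no longer publish usually shows up as a stale MFT/CRL
        or an expired manifest EE certificate long before any ROA or issued
        certificate reaches its own notAfter.
        """
        now = now or datetime.now(timezone.utc)
        if manifest.ee_not_after < now:
            return "manifest EE certificate expired"
        if manifest.next_update < now:
            return "manifest stale"
        if crl.next_update < now:
            return "CRL stale"
        return "current"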

That being said, there are scenarios beyond a CA's control that can lead to invalid objects:

1) rsync

If an RP uses rsync only, it may retrieve an inconsistent data set, because rsync does not support transactions. The RP may get an old MFT and a new CRL or vice versa, or a new MFT but not the new objects it lists, etc. This is a race condition that can occur when an RP fetches while the CA is publishing. It is rare, but it does happen.

I remember that many years ago I wrote code in the then RIPE NCC RPKI Validator 2 to catch this issue and only process publication points if the set was consistent. This was an allowed, but not mandated, interpretation of 6486.
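
For what it's worth, a minimal Python sketch of that kind of consistency check; the data shapes are illustrative only, but the hashes on an RPKI manifest are SHA-256:

    import hashlib

    def publication_point_is_consistent(manifest_entries, fetched_files):
        """manifest_entries: file name -> expected SHA-256 hex digest from the MFT.
        fetched_files: file name -> raw bytes retrieved over rsync.

        Only if every file listed on the manifest is present with a matching
        hash does the RP process this publication point; otherwise it keeps
        its previously validated data and retries later.
        """
        for name, expected_digest in manifest_entries.items():
            data = fetched_files.get(name)
            if data is None:
                return False  # listed on the MFT but missing from the fetch
            if hashlib.sha256(data).hexdigest() != expected_digest:
                return False  # old/new mix: file does not match this MFT snapshot
        return True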

Note that with RRDP we have deltas; solving this issue was one of its design goals. So, phasing out rsync could mitigate this.
(Yes, I realise now that the co-chairs asked me to re-submit the I-D for this, and I still need to do it.)

2) resource changes

If the parent removes any resources from the child before the child has had a chance to clean up its objects, this will lead to invalid objects. Being strict that *all* objects must be valid will then lead to rejection of the CA's products, even if it is only for a few hours.
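
To illustrate the mechanism, a hedged Python sketch using the standard ipaddress module; the data shapes are made up for illustration:

    from ipaddress import ip_network

    def roa_prefixes_covered(ca_cert_resources, roa_prefixes):
        """Strict validation requires every ROA prefix to be covered by the
        resources on the CA certificate. If the parent reissues that
        certificate without a resource an existing ROA still lists, the ROA
        fails this check until the child cleans up.
        """
        covering = [ip_network(p) for p in ca_cert_resources]
        for prefix in roa_prefixes:
            net = ip_network(prefix)
            if not any(net.subnet_of(c) for c in covering if c.version == net.version):
                return False
        return True

    # Example: the parent withdrew 192.0.2.0/24, but an old ROA still lists it.
    print(roa_prefixes_covered(["198.51.100.0/24"], ["192.0.2.0/24"]))  # False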

I believe that this issue could be greatly mitigated if RFC 6492 were extended with a way to warn children about planned resource removals. The length of this warning period would be a function of the contact frequency of child CAs and the depth of the CA hierarchy. If that frequency could also be stipulated (say, SHOULD every 10 minutes), then a warning period of just a few hours would already be plenty.

However, this is a separate discussion. I do not suggest that we block progress on 6486bis for this. I am just saying that if strict checking of *all* objects is where the consensus goes, then the WG must be aware of these consequences and accept them.


> 
>> 
>>>> ...
>>> You absolutely have to deal with this issue in 6486bis in its current
>>> strict form. Any introduction of a new object type will permanently
>>> break CAs that use these objects when validated with a relying party
>>> software that is not aware of this type. I don’t think this is
>>> acceptable, as it effectively blocks the introduction of new types
>>> pretty much forever.
> 
> No, it does not. What I suggested is that when a new object is proposed, it is the responsibility of those proposing the new object to explain how it will be incrementally deployed. That explanation belongs in the RFC defining the new object, and an updated version of 6486 will need to be generated. We have no good examples of new objects that provide a basis for describing how to accommodate incremental deployment, and thus no basis for defining such mechanisms at this time. It might be the case that a new object will be defined that requires the CA to maintain a separate pub point using some newly-defined SIA pointer, indicating that the new pub point contains the new object and thus RPs that don't know how to process the object MUST NOT follow that pointer. There will need to be agreement on how long a CA MUST maintain the old pub point, etc. But, absent a concrete example of a new object type that warrants this sort of effort, it is premature to write a spec.
> 
> 
>>>> Instead I believe it
>>>> makes sense for any new object proposed for inclusion in the RPKI
>>>> repository system to address this question as part of its
>>>> documentation; it's not clear that a uniform approach is appropriate,
>>>> i.e., one size may not fit all. 6486 can be updated to reflect the
>>>> processing approach proposed for any new objects.
>>> It seems to me that the best approach is to simply ignore unknown
>>> objects. We could argue whether they can be ignored completely or
>>> whether one should at least check their manifest hash. Personally, I
>>> think completely ignoring is the better approach as I don’t see any
>>> benefit in rejecting a CA because someone swapped out an object I don’t
>>> care about.
>> 
>> 
> In X.509 certs we mark extensions as critical, or not. An extension marked as critical will cause a cert to be rejected by an RP that does not know how to process that extension. One might revise the generic signed object definition (RFC 6488) to introduce a similar flag. But, first, we would have to figure out how to incrementally deploy the new signed object format, with a new version number, etc. I hope you see why this approach to incremental deployment of new object types would probably entail more than a revision of 6486.

The current version of 6486 allows RPs to treat individual objects as invalid. -bis is trying to change that to say that the whole set of objects must be valid. This changes the consequences for the deployment of new object types, so it is right to consider this now.

The most likely next object type would be the ASPA objects. I don't think that RPs should reject all ROAs because they do not yet understand ASPA. Given that object types in the RPKI get distinct file name extensions, and RPKI signed objects get distinct OIDs, I think it would be reasonable to say that RP software can ignore object *types* that it does not understand. I would still advocate that RPs check for the presence and hashes of these objects as stipulated by the signed MFT, but not consider their content.
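
A rough Python sketch of what I mean; the known-types set and parser registry are illustrative, not taken from any existing RP implementation:

    import hashlib

    KNOWN_TYPES = {".cer", ".crl", ".roa", ".mft", ".gbr"}  # e.g. no ASPA parser yet

    def process_publication_point(manifest_entries, fetched_files, parsers):
        """Verify presence and hash for every file on the MFT, but only parse
        the object types this RP understands; unknown types are ignored
        instead of causing the whole CA to be rejected.
        """
        validated = []
        for name, expected_digest in manifest_entries.items():
            data = fetched_files.get(name)
            if data is None or hashlib.sha256(data).hexdigest() != expected_digest:
                raise ValueError(f"{name}: missing or does not match the manifest")
            suffix = name[name.rfind("."):]
            if suffix in KNOWN_TYPES:
                validated.append(parsers[suffix](data))
            # else: hash checked above, content deliberately not interpreted
        return validated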

If we don't, the consequence will be that new object types only become deployable after a significant percentage of RPs have been upgraded to versions that understand the new type.

Now, if all operators say that they are fine with this (let things go unknown for everyone who did not upgrade, and force them to upgrade), then sure. But it is right to consider this now and make a conscious choice. And by the way, for RPKI use cases where we only have VALID/INVALID but no NOT FOUND, such as BGPsec, this would be problematic.

>>> Ultimately, I feel we’ve swung the pendulum way too far to the other
>>> side. The RPKI isn’t a single data set that needs to synchronized in
>>> full but it consists of multiple data sets that can be treated as
>>> independent: currently these are VRPs, router keys, and GBRs. If I use
>>> the RPKI for route origin validation, I don’t need to synchronize the
>>> router keys or GBRs. Why does it improve route origin validation if
>>> available and correctly signed data is skipped because of issues with
>>> irrelevant data?
> 
> The RPKI was designed to support origin validation first, and BGPsec second. The set of objects that were defined are intended to support these two functions. If the WG decides to extend the set of supported functions it needs to take a hard look at a wide range of RFCs that will be affected, not just 6486.
> 
> Steve
> 
> 