[Sidrops] stale VRP's in production networks

Lukas Tribus <lukas@ltri.eu> Mon, 31 August 2020 15:06 UTC

MIME-Version: 1.0
From: Lukas Tribus <lukas@ltri.eu>
Date: Mon, 31 Aug 2020 17:05:46 +0200
Message-ID: <CACC_My8pt9EbcgLxrHzE10sa+7h5N=hm-qDo-YK_6sd5Eg9qyA@mail.gmail.com>
To: sidrops@ietf.org
Content-Type: text/plain; charset="UTF-8"
Archived-At: <https://mailarchive.ietf.org/arch/msg/sidrops/oqksuxAOIfF2JU5Jb8PydQJbd_Q>
Subject: [Sidrops] stale VRP's in production networks
Precedence: list

Dear all,


I'm concerned about stale VRPs in production networks with ROV enabled.


It is my understanding that RPKI/RTR was developed with the intention
to fail-open, so when the data is non-existent, obsolete or the
validator is broken, we would not trust that data and either failover
to a different RTR server or fail-open completely if there are none.
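
As a rough illustration of that intended behavior, here is a minimal
Python sketch of the router-side decision. The names RtrCache,
select_cache and max_age_seconds are hypothetical and not taken from
any implementation, and the sketch assumes the router can somehow learn
the age of the data behind each RTR server, which is itself an
assumption.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class RtrCache:
    """Hypothetical view of one configured RTR server."""
    name: str
    reachable: bool
    # Assumed to be known: seconds since the validator behind this
    # RTR server last produced a fresh data set.
    data_age_seconds: float


def select_cache(caches: Sequence[RtrCache],
                 max_age_seconds: float) -> Optional[RtrCache]:
    """Return the first reachable cache with fresh data.

    Returning None means "fail open": the router drops all VRPs, so
    every route evaluates to NotFound instead of being judged against
    obsolete data.
    """
    for cache in caches:
        if cache.reachable and cache.data_age_seconds <= max_age_seconds:
            return cache
    return None
```

With two servers configured, a server holding two-hour-old data would
be skipped in favor of one holding fresh data; with only the stale
server left, the function returns None and the router fails open.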

Stale VRPs on production routers are a bad thing, especially because
they could remain undetected for a long time. It is unrealistic to
expect perfect monitoring and immediate manual intervention on subtle
errors from all those operators out there. We need to fail hard and
early in my opinion.


To my surprise, multiple RTR implementations (see below) actually
prefer serving stale data *by design*, for the sake of availability,
as opposed to closing RTR connections or withdrawing VRPs.


I think a software stack where a single bug or honest error (human or
otherwise, like a memory allocation failure, a validator crash due to
a bug, or an admin disabling the validator erroneously) can easily cause
VRPs to become stale on production networks *for a long time* is very
dangerous. Failing and not serving VRPs while broken would be the
expected behavior in my opinion.

Regarding the availability of the RPKI validator/RTR server setup, I
think this is something that the operator needs to address based on the
individual situation/network, not something that RTR software should
care about.

I believe RTR software needs to be able to fail early, without hiding
issues and subsequently serving stale VRPs.
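
On the RTR server side, failing early could look roughly like the
following Python sketch. It is not how any existing RTR server works;
VrpStore, StaleDataError and max_age_seconds are hypothetical names
chosen for illustration, and the VRP tuple layout (prefix, max length,
origin AS) is likewise an assumption.

```python
import time
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

# Assumed VRP layout: (prefix, max length, origin AS).
Vrp = Tuple[str, int, int]


class StaleDataError(Exception):
    """Raised when the validator output is too old to be served."""


@dataclass
class VrpStore:
    max_age_seconds: float
    vrps: List[Vrp] = field(default_factory=list)
    last_update: Optional[float] = None

    def update(self, vrps: List[Vrp], now: Optional[float] = None) -> None:
        """Record a fresh data set delivered by the validator."""
        self.vrps = list(vrps)
        self.last_update = time.time() if now is None else now

    def serve(self, now: Optional[float] = None) -> List[Vrp]:
        """Return the VRPs, or fail hard if they are missing or stale."""
        now = time.time() if now is None else now
        if self.last_update is None:
            raise StaleDataError("no validator data received yet")
        if now - self.last_update > self.max_age_seconds:
            raise StaleDataError("validator data is older than the limit")
        return list(self.vrps)
```

A caller would treat StaleDataError as the signal to close its RTR
sessions, so that routers fail over to another server or fail open
instead of continuing to use obsolete data.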


RPKI-validator-3 [1]:

> The RPKI-RTR server is a separate daemon, that allows routers to connect
> using the RPKI-RTR protocol. It's set up as a separate instance because
> not everyone needs to run this, but more importantly, if you do need to
> run this then a separate daemon allows one to run more than one instance
> for redundancy
> *(it keeps state even when the validator is down)*.



gortr [2]:

> Yes this behavior was implemented on purpose to provide resilience in
> case of a temporary validator issue.
> [...]
> The choice was made to ensure continuity in Origin Validation rather
> than bouncing routes/VRPs (+ added complexity on RTR serial management
> in some cases).



Any inputs and opinions on this would be greatly appreciated,

-- lukas

[1] https://github.com/RIPE-NCC/rpki-validator-3/wiki
[2] https://github.com/cloudflare/gortr/issues/81#issuecomment-676247412
