[Sidrops] stale VRP's in production networks

Lukas Tribus <lukas@ltri.eu> Mon, 31 August 2020 15:06 UTC

From: Lukas Tribus <lukas@ltri.eu>
Date: Mon, 31 Aug 2020 17:05:46 +0200
Message-ID: <CACC_My8pt9EbcgLxrHzE10sa+7h5N=hm-qDo-YK_6sd5Eg9qyA@mail.gmail.com>
To: sidrops@ietf.org
Content-Type: text/plain; charset="UTF-8"
Archived-At: <https://mailarchive.ietf.org/arch/msg/sidrops/oqksuxAOIfF2JU5Jb8PydQJbd_Q>
Subject: [Sidrops] stale VRP's in production networks

Dear all,


I'm concerned about stale VRPs in production networks with ROV enabled.


It is my understanding that RPKI/RTR was designed to fail open: when
the data is missing or obsolete, or the validator is broken, we would
not trust that data and would either fail over to a different RTR
server or, if there are none, fail open completely.
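
To make the fail-over / fail-open behavior described above concrete,
here is a minimal sketch in Python. It is an illustration only, not how
any router implements it. The names RtrCache, select_cache and
validation_mode are hypothetical, and the sketch assumes the router can
somehow learn the age of each cache's data, which is an assumption made
purely for the example.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class RtrCache:
    """Hypothetical view of one RTR server as seen by a router."""
    name: str
    reachable: bool
    # Assumption: the router knows how old the cache's VRP data is.
    data_age_seconds: float


def select_cache(caches: Sequence[RtrCache],
                 max_age_seconds: float) -> Optional[RtrCache]:
    """Return the first reachable cache with fresh data, else None."""
    for cache in caches:
        if cache.reachable and cache.data_age_seconds <= max_age_seconds:
            return cache
    return None


def validation_mode(caches: Sequence[RtrCache],
                    max_age_seconds: float = 3600.0) -> str:
    """Describe what the router does: validate, or fail open."""
    chosen = select_cache(caches, max_age_seconds)
    if chosen is None:
        # No usable VRPs at all: every route ends up NotFound,
        # so nothing is rejected as Invalid.
        return "fail-open: no usable VRPs, all routes treated as NotFound"
    return f"validate using {chosen.name}"
```

With one unreachable cache and one cache whose data is two hours old,
validation_mode() reports fail-open instead of validating against the
stale data. The one-hour threshold is an arbitrary value chosen for the
example.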

Stale VRPs on production routers are a bad thing, especially because
they can remain undetected for a long time. It is unrealistic to
expect perfect monitoring and immediate manual intervention on subtle
errors from every operator out there. We need to fail hard and
early in my opinion.


To my surprise, multiple RTR implementations (see below) actually
prefer serving stale data *by design* as opposed to closing RTR
connections or withdrawing VRPs, for the sake of availability.


I think a software stack in which a single bug or honest error (human
or otherwise, such as a memory allocation failure, a validator crash
due to a bug, or an admin disabling the validator by mistake) can
easily cause VRPs to become stale on production networks *for a long
time* is very dangerous. In my opinion, the expected behavior is to
fail and stop serving VRPs while broken.

As for the availability of the RPKI validator/RTR server setup, I think
that is something the operator needs to address based on the
individual situation/network, not something that RTR software should
care about.

I believe RTR software needs to be able to fail early, without hiding
issues and subsequently serving stale VRPs.
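
As a rough illustration of "fail early" on the RTR server side, the
sketch below refuses to hand out VRPs once the last successful update
from the validator is older than a configured limit. This is a minimal
sketch under stated assumptions, not the behavior of any existing
implementation. VrpStore, StaleDataError and max_age_seconds are
hypothetical names invented for the example.

```python
import time
from dataclasses import dataclass, field
from typing import Iterable, List, Optional, Tuple

# (prefix, max length, origin ASN); a simplified stand-in for a VRP.
Vrp = Tuple[str, int, int]


class StaleDataError(RuntimeError):
    """Raised when the stored VRPs are missing or too old to serve."""


@dataclass
class VrpStore:
    """Hypothetical VRP store that fails instead of serving stale data."""
    max_age_seconds: float
    _vrps: List[Vrp] = field(default_factory=list, init=False)
    _updated_at: Optional[float] = field(default=None, init=False)

    def update(self, vrps: Iterable[Vrp],
               now: Optional[float] = None) -> None:
        """Record a fresh VRP set from a successful validator run."""
        self._vrps = list(vrps)
        self._updated_at = time.monotonic() if now is None else now

    def serve(self, now: Optional[float] = None) -> List[Vrp]:
        """Return the VRPs, or raise if they were never loaded or are stale."""
        now = time.monotonic() if now is None else now
        if self._updated_at is None:
            raise StaleDataError("no VRPs received from the validator yet")
        age = now - self._updated_at
        if age > self.max_age_seconds:
            raise StaleDataError(f"VRPs are {age:.0f}s old, refusing to serve")
        return list(self._vrps)
```

A server built this way would turn a StaleDataError into closing the RTR
session or withdrawing its VRPs, so that routers can fail over or fail
open. How that is signalled on the wire is left out of the sketch.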


RPKI-validator-3 [1]:

> The RPKI-RTR server is a separate daemon, that allows routers to connect
> using the RPKI-RTR protocol. It's set up as a separate instance because
> not everyone needs to run this, but more importantly, if you do need to
> run this then a separate daemon allows one to run more than one instance
> for redundancy
> *(it keeps state even when the validator is down)*.



gortr [2]:

> Yes this behavior was implemented on purpose to provide resilience in
> case of a temporary validator issue.
> [...]
> The choice was made to ensure continuity in Origin Validation rather
> than bouncing routes/VRPs (+ added complexity on RTR serial management
> in some cases).



Any input and opinions on this would be greatly appreciated,

-- lukas

[1] https://github.com/RIPE-NCC/rpki-validator-3/wiki
[2] https://github.com/cloudflare/gortr/issues/81#issuecomment-676247412