Re: [Sidrops] [WG ADOPTION] Adoption call: draft-timbru-sidrops-publication-server-bcp - ENDS 02/08/2024

Tim Bruijnzeels <tbruijnzeels@ripe.net> Wed, 07 February 2024 06:52 UTC

From: Tim Bruijnzeels <tbruijnzeels@ripe.net>
In-Reply-To: <ZcJmeFCmU9Txsk7M@snel>
Date: Wed, 07 Feb 2024 07:51:42 +0100
Cc: Ties de Kock <tdekock@ripe.net>, Russ Housley <housley@vigilsec.com>, IETF SIDRops <sidrops@ietf.org>, IETF SIDRops Chairs <sidrops-chairs@ietf.org>, sidrops-ads@ietf.org, Keyur Patel <keyur@arrcus.com>
Content-Transfer-Encoding: quoted-printable
Message-Id: <52ACEFB1-A3AD-4A15-ADC7-534E85184645@ripe.net>
References: <87h6j1kug1.wl-morrowc@ops-netman.net> <B60D7B39-FA81-45AF-BCBD-2784F91B43C3@vigilsec.com> <ZcFNNfrkMFxKf5hN@snel> <BBE2320C-4525-4713-B4AF-3F00ECD4228A@ripe.net> <ZcIuI7lS1OtOW_xT@snel> <EFFA95AA-F07D-490B-BEC3-0446ED2D3AA2@ripe.net> <ZcJmeFCmU9Txsk7M@snel>
To: Job Snijders <job@fastly.com>
Archived-At: <https://mailarchive.ietf.org/arch/msg/sidrops/PXtdhRKEaO7FB7_pteZA-Lvt7t0>

Hi Job, all,

> On Feb 6, 2024, at 18:03, Job Snijders <job@fastly.com> wrote:
> 
> On Tue, Feb 06, 2024 at 05:19:11PM +0100, Tim Bruijnzeels wrote:
>>> The notification file (by specification) indeed is mutable.
>>> 
>>> I think it is helpful to point out that the RRDP deltas really only
>>> need to be generated once, as some implementers seem to have gotten
>>> this wrong in the past.
>> 
>> To the best of my knowledge no RRDP implementation has intentionally
>> regenerated deltas.
> 
>> There have been issues where a server was restored from backup to an
>> earlier state and it was unaware of the state changes since the
>> backup.
>> 
>> So, I would like to include text in this BCP that instructs the
>> Publication Server to perform a full RRDP session reset in case they
>> restored from backup.
> 
> Yes, the 'reset session' after backup restore is a good point.
> 
>>> The situation I had in mind was an example from last year, when for
>>> one of the regional internet registries all my alerts cleared within
>>> 2 hours after RRDP was disabled.
>>> 
>>> You are right to point out that the RRDP notification file usually
>>> is smaller than the rsync filename transfer list, but in turn, the
>>> rsync transfer list is way smaller than the RRDP snapshot.
>> 
>> I am in two minds about this.
>> 
>> I think the BCP should not recommend disabling RRDP, it should
>> recommend that enough bandwidth is available and/or a CDN is used.
>> 
>> The problem happens when this cannot be done. In that case RRDP
>> degrades badly, as described in the document, because all RPs fall
>> back to full snapshots. This makes the bandwidth load worse.
>> 
>> In the case of your example there were higher layer (non-technical)
>> reasons why the bandwidth capacity could not be increased and a CDN
>> could not be used at the time.
>> 
>> In this specific case disabling RRDP helped because even though there
>> could still be bandwidth issues affecting RPs this allowed enough of
>> them to perform a sync so that subsequent data usage was reduced.
>> 
>> So, yes disabling RRDP helped here, but no, I don’t think it should be
>> recommended best practice. I want to think about wording that captures
>> this...
> 
> Or phrased differently: the best practise is to ensure there is
> sufficient bandwidth available, while disabling RRDP is a dirty hack :-)
> 
> Perhaps some kind of guidance can be included as to what 'sufficient
> bandwidth' is? Here is a starter:
> 
>  "Size of snapshot" times "Number of deployed RP instances" should
>  comfortably fit in 15 minute delivery window.
> 
> (I picked 15 minutes, because rpki-client by default timeboxes
> synchronizing individual repositories into 15 minutes.)

I think Routinator syncs every 10 mins *after* the last run completed, so
it should be close to this as well.

> So, for an RRDP server serving a 207 megabyte snapshot to 3000 RPs, the
> operator would need *at least* (207*8*3000)/900 = 5520 megabit/sec after
> a session reset. To make it fit comfortably, double or triple this number.

That approach sounds about right. And indeed, session resets are where
the full re-syncs happen in the absence of issues. So, servers should plan
their capacity to be able to handle this.
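
As a rough illustration of the arithmetic quoted above, here is a minimal
Python sketch. The function name, the parameter names and the "headroom"
multiplier are hypothetical and chosen only for this example; the 900 second
default is the 15 minute window mentioned above. It is a back-of-the-envelope
estimate, not a capacity planning tool.

```python
def required_mbit_per_sec(snapshot_mbytes: float,
                          rp_count: int,
                          window_secs: float = 900.0,
                          headroom: float = 1.0) -> float:
    """Estimate the bandwidth needed to serve one full snapshot to every RP.

    snapshot_mbytes: size of the RRDP snapshot in megabytes
    rp_count:        number of RP instances expected to fetch it
    window_secs:     delivery window in seconds (assumed: 15 minutes)
    headroom:        safety multiplier (assumed; e.g. 2 or 3 for "comfortable")
    """
    if window_secs <= 0:
        raise ValueError("window_secs must be positive")
    # megabytes -> megabits (x8), times the number of RPs, spread over the window
    return snapshot_mbytes * 8 * rp_count / window_secs * headroom


if __name__ == "__main__":
    # The example from the thread: 207 MB snapshot, 3000 RPs, 15 minutes
    print(required_mbit_per_sec(207, 3000))              # bare minimum
    print(required_mbit_per_sec(207, 3000, headroom=2))  # doubled for comfort
```

With the numbers from the thread this gives 5520 megabit/sec as the bare
minimum, and 11040 megabit/sec when doubled.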

Provided that the server is confident about this, they may wish to initiate
a session reset from time to time to ensure that they have operational
experience.

If the resulting number is sufficiently high, using a CDN is highly
recommended.

But note that this is a worst-case “thundering herd” scenario, and
steady-state data usage will be lower.

> Given the affiliations of the authors, I'm sure that group can do a
> better job and speak from experience how capacity planning is done. :-)

We can dig up some numbers for our own session resets, but they are only
a single data point in space and time. The numbers will change over time as
the snapshot grows, more RPs are deployed and/or their update frequency
increases. So, this is something that servers should monitor for themselves.

For small (e.g. single CA) repositories this may not be an issue at all as their
snapshot is tiny, and they are able to serve RPs on a “normal” link. We talk
about this a bit in the document, but we "RECOMMEND that CAs use a
publication service provided by their RIR, NIR or other parent as much as
possible, because availability issues with such [single CA] repositories are
frequent, and can negatively impact Relying Party software."

The way this degrades in the absence of a CDN can vary. If the capacity is
way too low, then RPs simply won’t be able to get the snapshots. Or they
get them once, then fall back to rsync on a later sync, then try a
full RRDP snapshot again, etc. ad nauseam. But this depends on how quickly
RPs decide to fall back to rsync and retry (full) RRDP. If the fallback and
retries are spread out over time, then it may still recover.

Of course, servers should plan capacity such that this does not happen.
But less aggressive fallback and retries by RPs can help. Rate-limiting
HTTP headers were not considered when RRDP was discussed, but they
could potentially help a bit. The thinking at the time was that data usage
would not be an issue as long as CDNs were used.
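
To make "spread out over time" a bit more concrete, here is a minimal Python
sketch of a randomized ("full jitter") exponential backoff that an RP could
apply before retrying RRDP after a failed fetch. This is an assumption for
illustration only: it does not describe how rpki-client, Routinator or any
other RP actually schedules retries, and the function name, the 600 second
base and the 3600 second cap are hypothetical values.

```python
import random
from typing import Optional


def next_retry_delay(attempt: int,
                     base_secs: float = 600.0,
                     cap_secs: float = 3600.0,
                     rng: Optional[random.Random] = None) -> float:
    """Return a randomized delay in seconds before the next RRDP retry.

    attempt:   number of consecutive failures so far (0 for the first retry)
    base_secs: upper bound for the first retry (assumed value)
    cap_secs:  maximum upper bound for any retry (assumed value)
    rng:       optional random.Random instance, useful for reproducible tests
    """
    if attempt < 0:
        raise ValueError("attempt must be non-negative")
    if rng is None:
        rng = random.Random()
    # The ceiling doubles with each failure, up to the cap.
    ceiling = min(cap_secs, base_secs * (2 ** attempt))
    # Picking uniformly below the ceiling spreads RPs out instead of
    # having them all retry at the same moment.
    return rng.uniform(0.0, ceiling)


if __name__ == "__main__":
    for attempt in range(4):
        print(attempt, round(next_retry_delay(attempt), 1))
```

Because each RP draws its own delay, retries after an outage or session reset
arrive spread over the window rather than as a single burst.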

Tim



> 
> Kind regards,
> 
> Job
>