Re: [sidr] some comments and questions regarding rpki-rtr

Randy Bush <> Sat, 01 October 2011 16:59 UTC

Date: Sun, 02 Oct 2011 02:02:02 +0900
Message-ID: <>
From: Randy Bush <>
To: Tim Bruijnzeels <>
In-Reply-To: <>
References: <>
User-Agent: Wanderlust/2.15.9 (Almost Unreal) Emacs/22.3 Mule/5.0 (SAKAKI)
MIME-Version: 1.0 (generated by SEMI 1.14.6 - "Maruoka")
Content-Type: text/plain; charset="US-ASCII"
Cc: sidr wg list <>
Subject: Re: [sidr] some comments and questions regarding rpki-rtr

> So if I read 5.2 correctly the cache should respond with just 1
> end-of-data pdu if there are no updates since the serial included in a
> serial request.
> But if I read 5.4 I can also interpret this to mean that we should
> respond with 2 pdus: 1 cache response, 0 data records, 1 end-of-data

i added a hammer to 5.2.  it now says

   The cache replies to this query with a Cache Response PDU
   (Section 5.4) if the cache has a, possibly null, record of the
   changes since the serial number specified by the router.  If there
   have been no changes since the router last queried, the cache then
   sends an End Of Data PDU.
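
A minimal sketch of one reading of the revised 5.2 text: the cache always
opens with a Cache Response, then sends whatever change records it has
(possibly none), then closes with an End Of Data. The class names, the
fields, and the answer_serial_query helper are hypothetical stand-ins and
are not the wire format from the draft.

```python
from dataclasses import dataclass
from typing import List


# Hypothetical stand-ins for the PDUs; fields are illustrative only.
@dataclass(frozen=True)
class CacheResponse:
    nonce: int


@dataclass(frozen=True)
class PrefixPDU:
    announce: bool  # True = announce, False = withdraw
    prefix: str
    length: int
    max_len: int
    asn: int


@dataclass(frozen=True)
class EndOfData:
    nonce: int
    serial: int


def answer_serial_query(nonce: int, current_serial: int,
                        changes: List[PrefixPDU]) -> list:
    """Build the reply to a Serial Query when the cache still has the
    change history. `changes` may be empty (the "possibly null" record)."""
    return [CacheResponse(nonce), *changes, EndOfData(nonce, current_serial)]
```

Under this reading, a router that is already up to date gets two PDUs
back: a Cache Response followed directly by an End Of Data.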

> When I read 5.10 the nonce is generated when the cache starts. And reading
> 6.3 the cache may send a cache reset reply to the client when there are no
> incremental updates available.

while it certainly may send a cache reset, i recommend it not.  the
reset is not a normal condition.  read the text

   The cache may respond to a Serial Query with a Cache Reset, informing
   the router that the cache cannot supply an incremental update from
   the serial number specified by the router.  This might be because the
   cache has lost state, or because the router has waited too long
   between polls and the cache has cleaned up old data that it no longer
   believes it needs, or because the cache has run out of storage space
   and had to expire some old data early.
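
The decision described in that text could be sketched as below. The
function name and the oldest_kept_serial parameter are assumptions made
for illustration, and plain integer comparison is used, so serial number
wrap-around is deliberately ignored.

```python
def choose_reply(router_serial: int, oldest_kept_serial: int,
                 current_serial: int) -> str:
    """Return which PDU should open the reply to a Serial Query.

    Assumes the cache can build an incremental update from any serial
    between oldest_kept_serial and current_serial inclusive.
    """
    if oldest_kept_serial <= router_serial <= current_serial:
        return "cache_response"  # incremental update is possible
    # State lost, router waited too long, or old data was expired early.
    return "cache_reset"
```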

> Does this imply that a new cache nonce should be generated?

no it does not

> The nonce is made when the process starts. So when a client sends a
> reset query the same nonce may be kept. The client just gets a new,
> full, data set, up to the current serial for that same nonce.


> If not, then we would have to keep track of nonce-s and serials for each
> connected child

if you could keep track of serial and nonces, then we would not need

> As described in 5.5:
>    The cache server MUST ensure that it has told the router client to
>    have one and only one IPvX PDU for a unique {prefix, len, max-len,
>    asn} at any one point in time.
> So this means that cache should exclude duplicates in a full update
> even if the same unique {prefix, len, max-len, asn} exists more than
> once (same ROA, multiple prefixes, or different ROAs).

i was fine until that last parenthetical.  i simply do not understand

> I probably missed the discussion on this, but can you explain why this
> is?

relieve the router of baroque checks involving counting of announces and

> I don't see a conflict. If I get the same announce twice, it's still
> just one announce?

is it?  or is it two roas?  how many withdraws to remove it?
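
One way a cache might keep that counting on its own side is to reference
count each unique {prefix, len, max-len, asn} and tell the router only
about the first appearance and the last disappearance. This is a sketch
under that assumption; the PrefixRefCount class and its method names are
hypothetical and do not come from the draft.

```python
from collections import Counter
from typing import Tuple

Key = Tuple[str, int, int, int]  # (prefix, len, max_len, asn)


class PrefixRefCount:
    """Counts how many ROAs currently contribute each unique key."""

    def __init__(self) -> None:
        self._refs: Counter = Counter()

    def add(self, key: Key) -> bool:
        """Record one more ROA for key. True means: send an announce."""
        self._refs[key] += 1
        return self._refs[key] == 1

    def remove(self, key: Key) -> bool:
        """Record one ROA fewer for key. True means: send a withdraw."""
        if self._refs[key] == 0:
            raise KeyError(key)
        self._refs[key] -= 1
        if self._refs[key] == 0:
            del self._refs[key]
            return True
        return False
```

With this bookkeeping the router only ever sees one announce and one
withdraw per key, however many ROAs cover it.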

> I am also wondering what this means wrt serial updates. Let me clarify
> by example: 10/16 is announced in serial 2, withdrawn in 3, announced
> again in 4. The router has serial 1. Should the cache then work out
> the exact delta between 1 and 4, or can it send 1-2, followed by 2-3,
> followed by 3-4.

it can do either.  i recommend the former.
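
A sketch of the former option, collapsing per-serial changes into one net
delta. It assumes the cache keeps a per-serial change log and that the log
is consistent (a key is never announced twice in a row without a withdraw
in between). The net_delta function and the layout of the history mapping
are hypothetical.

```python
from typing import Dict, Hashable, List, Set, Tuple

Change = Tuple[bool, Hashable]  # (announce?, key)


def net_delta(history: Dict[int, List[Change]], from_serial: int,
              to_serial: int) -> Tuple[Set[Hashable], Set[Hashable]]:
    """Return (announces, withdraws) taking a router from from_serial
    to to_serial, with intermediate flip-flops cancelled out."""
    first: Dict[Hashable, bool] = {}
    last: Dict[Hashable, bool] = {}
    for serial in range(from_serial + 1, to_serial + 1):
        for announce, key in history.get(serial, []):
            first.setdefault(key, announce)  # first change seen for key
            last[key] = announce             # most recent change for key
    # First change is an announce: key was absent at from_serial.
    # First change is a withdraw: key was present at from_serial.
    announces = {k for k in last if first[k] and last[k]}
    withdraws = {k for k in last if not first[k] and not last[k]}
    return announces, withdraws


# The example from the question: 10/16 announced in serial 2,
# withdrawn in 3, announced again in 4.
key = ("10.0.0.0", 16, 16, 64496)  # the asn here is made up
history = {2: [(True, key)], 3: [(False, key)], 4: [(True, key)]}
```

For a router at serial 1, net_delta(history, 1, 4) yields a single
announce; for a router at serial 2 it yields nothing, since the prefix is
present at both ends.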

> I can imagine that from the router's perspective it's very useful if
> the cache takes care of duplicates and sends just one big delta, and
> not the full history since the router last asked.


> I am afraid though, that this may cause scaling issues when a
> potentially large number of routers use the same cache (cpu), or a
> large number of pre-computed deltas need to be kept (memory). I think
> that if this responsibility were just handled by the routers we would
> have much better scaling on the cache side, and it would be much
> easier for caches to keep incremental updates without having to resort
> to no-incremental-updates like 6.3 describes.

routers cost many hundreds of thousands of euros, have five year old
cpus in the control plane, and are supposed to be moving packets.  when
it comes to a question of who does the extra work, guess who wins.

> As described in 6.1:
>    To limit the length of time a cache must keep the data necessary to
>    generate incremental updates, a router MUST send either a Serial
>    Query or a Reset Query no less frequently than once an hour.  This
>    also acts as a keep alive at the application layer.
> So, we have interpreted this to say that it's probably good on our
> side to drop the connection after 1 hour. It's most likely dead and we
> want our resource back...

otoh you might think that the client might have been otherwise occupied
and be patient for some amount of time.  your choice.

> When we do this, do you think it would be good if we tried to send an
> error pdu just before closure? With a new error code indicating
> session timeout?

we have avoided asynchronous pdus from the cache.

> When the cache is stopped for whatever reason. Server restart, cache had
> irrecoverable internal error, anything else...
> Should we send a new type of notify / error (with a new specific code) to
> all children so that they can gracefully switch over to another cache --
> or wait until we are back?
> Or should we just close the connections?

we have avoided asynchronous pdus from the cache.

a client will see the close or an error when they next send a query.