[sidr] some comments and questions regarding rpki-rtr

timbru@ripe.net Sat, 01 October 2011 09:39 UTC

Message-ID: <49638.80.57.195.122.1317462144.squirrel@webmail.ripe.net>
Date: Sat, 01 Oct 2011 11:42:24 +0200
From: timbru@ripe.net
To: randy@psg.com, sra@hactrn.net
Cc: sidr@ietf.org
Subject: [sidr] some comments and questions regarding rpki-rtr

Hi Randy, Rob, wg,


We have been working on rpki-rtr support in our validator at RIPE NCC over
the past weeks. I found the 6.x sections describing typical scenario
exchanges particularly useful.

I have read the document in the past as well, but as we all know: with
actual implementation come actual questions... so I have a couple:

A = No changes
B = Nonce and cache reset
C = Duplicate announcements / withdrawals
D = Keep alive timeout
E = Cache shutdown


A = No changes
=============
So if I read 5.2 correctly, the cache should respond with just 1
end-of-data pdu if there are no updates since the serial included in a
serial query.
But if I read 5.4, I can also interpret it to mean that we should respond
with 2 pdus: 1 cache response, 0 data records, 1 end-of-data.

Can you please tell me which is correct?

Like this?

   Cache                         Router
     ~                             ~
     | <----- Serial Query ------- | R requests data
     |                             |
     | ----- Cache Response -----> | C confirms request
     | ------  End of Data ------> | C sends End of Data
     |                             |   and sends *same* serial
     ~                             ~

This is what we are doing now.

In any case I think it would be useful to have this somewhere in 6.2, or a
separate 6.x section.
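
To make the second interpretation concrete, here is a minimal Python sketch
of the cache-side reply in the "no changes" case. The class names and
`answer_serial_query` are hypothetical, chosen only for illustration; no
wire encoding is attempted, and the sketch does not settle which reading of
5.2 / 5.4 is correct.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass(frozen=True)
class CacheResponse:
    nonce: int


@dataclass(frozen=True)
class EndOfData:
    nonce: int
    serial: int


Pdu = Union[CacheResponse, EndOfData]


def answer_serial_query(nonce: int, current_serial: int,
                        router_serial: int) -> List[Pdu]:
    """Reply to a Serial Query when the router is already up to date."""
    if router_serial != current_serial:
        # Incremental updates are outside the scope of this sketch.
        raise NotImplementedError("only the no-changes case is sketched")
    # Zero data records between the two PDUs; End of Data carries the
    # *same* serial the router sent.
    return [CacheResponse(nonce), EndOfData(nonce, current_serial)]
```

Under the other reading (5.2 only), the returned list would contain just the
End of Data element.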


B = Nonce and cache reset
=====================

As I read 5.10, the nonce is generated when the cache starts. And reading
6.3, the cache may send a cache reset reply to the client when there are no
incremental updates available.

Does this imply that a new cache nonce should be generated?

I assumed that it did not. The nonce is made when the process starts. So
when a client sends a reset query the same nonce may be kept. The client
just gets a new, full, data set, up to the current serial for that same
nonce.
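
The assumption in the paragraph above can be sketched as follows. This is a
minimal Python illustration, not the validator's actual code: the `Cache`
class, its field names, and the 16-bit nonce width are assumptions made for
the example.

```python
import secrets
from typing import Dict, List, Tuple

# (prefix, len, max_len, asn); a hypothetical in-memory representation
Record = Tuple[str, int, int, int]


class Cache:
    def __init__(self, records: List[Record]) -> None:
        # The nonce is chosen once, when the cache process starts.
        # Assumption: a 16-bit value.
        self.nonce = secrets.randbelow(2 ** 16)
        self.serial = 1
        self.records = list(records)

    def handle_reset_query(self) -> Dict[str, object]:
        # Full data set up to the current serial, under the *same* nonce.
        # Nothing per-router is stored or regenerated here.
        return {
            "nonce": self.nonce,
            "serial": self.serial,
            "records": list(self.records),
        }
```

With this design a second reset query from any router returns the same
nonce, so the cache holds one nonce and one serial regardless of how many
routers are connected.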

If not, then we would have to keep track of nonces and serials for each
connected child, or reset them all. I am afraid that would not scale very
well.

Part of the reason I am asking is that we are currently not yet able to
send incremental updates. So our cache always replies as described in 6.3.
We are not resetting the nonce though, and we are seeing duplicate
announcement errors from the routers.

So: is our cache wrong not to reset the nonce?

Can section 6.3 be amended to be explicit about this?


C = Duplicate announcements / withdrawals
===================================

As described in 5.5:
  The cache server MUST ensure that it has told the router client to
   have one and only one IPvX PDU for a unique {prefix, len, max-len,
   asn} at any one point in time.

So this means that the cache should exclude duplicates in a full update even
if the same unique {prefix, len, max-len, asn} exists more than once (same
ROA, multiple prefixes, or different ROAs).
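
A minimal Python sketch of that exclusion step, assuming the validated ROA
prefixes are available as plain tuples (the `Record` layout, the function
name, and the example AS number are hypothetical):

```python
from typing import Iterable, List, Tuple

# (prefix, len, max_len, asn)
Record = Tuple[str, int, int, int]


def unique_records(roa_prefixes: Iterable[Record]) -> List[Record]:
    """Keep the first occurrence of each {prefix, len, max-len, asn}."""
    # dict.fromkeys preserves insertion order and drops repeated keys.
    return list(dict.fromkeys(roa_prefixes))


# The same tuple coming from two different ROAs is sent only once.
from_roa_1 = [("10.0.0.0", 16, 16, 64496)]
from_roa_2 = [("10.0.0.0", 16, 16, 64496), ("10.1.0.0", 16, 24, 64496)]
full_update = unique_records(from_roa_1 + from_roa_2)
```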

I probably missed the discussion on this, but can you explain why this is?
I don't see a conflict. If I get the same announcement twice, it is still
just one announcement.

I am also wondering what this means wrt serial updates. Let me clarify by
example: 10/16 is announced in serial 2, withdrawn in 3, announced again
in 4. The router has serial 1. Should the cache then work out the exact
delta between 1 and 4, or can it send 1-2, followed by 2-3, followed by
3-4?

I can imagine that from the router's perspective it's very useful if the
cache takes care of duplicates and sends just one big delta, and not the
full history since the router last asked.
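
To illustrate the "one big delta" option, here is a minimal Python sketch
that collapses the per-serial deltas into a single net delta. The function
name and data layout are hypothetical, the AS number is a placeholder, and
the sketch assumes each stored per-serial delta is itself free of duplicates.

```python
from typing import Dict, List, Tuple

Record = Tuple[str, int, int, int]   # (prefix, len, max_len, asn)
Change = Tuple[str, Record]          # ("announce" or "withdraw", record)


def net_delta(deltas: List[List[Change]]) -> List[Change]:
    """Collapse consecutive per-serial deltas into one net delta."""
    balance: Dict[Record, int] = {}
    for delta in deltas:
        for action, record in delta:
            step = 1 if action == "announce" else -1
            balance[record] = balance.get(record, 0) + step
    result: List[Change] = []
    for record, total in balance.items():
        if total > 0:
            result.append(("announce", record))
        elif total < 0:
            result.append(("withdraw", record))
        # total == 0: the changes cancel out, nothing to send
    return result


# 10/16 announced in serial 2, withdrawn in 3, announced again in 4.
r = ("10.0.0.0", 16, 16, 64496)
history = [[("announce", r)], [("withdraw", r)], [("announce", r)]]
single = net_delta(history)  # one announce for the router at serial 1
```

Sending 1-2, 2-3, 3-4 in sequence would instead mean replaying `history`
as stored, with no per-request computation on the cache.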

I am afraid though, that this may cause scaling issues when a potentially
large number of routers use the same cache (cpu), or a large number of
pre-computed deltas need to be kept (memory). I think that if this
responsibility were just handled by the routers we would have much better
scaling on the cache side, and it would be much easier for caches to keep
incremental updates without having to resort to no-incremental-updates
like 6.3 describes.



D = Keep alive timeout
===============

As described in 6.1:
   To limit the length of time a cache must keep the data necessary to
   generate incremental updates, a router MUST send either a Serial
   Query or a Reset Query no less frequently than once an hour.  This
   also acts as a keep alive at the application layer.

So, we have interpreted this to mean that it's probably good on our side to
drop the connection after 1 hour. It's most likely dead and we want our
resources back.
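
A minimal Python sketch of that cache-side check. It assumes the cache
records, per connection, the time of the last Serial Query or Reset Query;
the function and variable names are hypothetical, and measuring the hour
from the last query is our interpretation of the quoted text.

```python
from typing import Dict, List

IDLE_LIMIT_SECONDS = 3600.0  # one hour, per the 6.1 text quoted above


def idle_connections(last_query_at: Dict[str, float],
                     now: float) -> List[str]:
    """Return ids of connections with no query within the idle limit."""
    return [
        conn_id
        for conn_id, seen in last_query_at.items()
        if now - seen > IDLE_LIMIT_SECONDS
    ]


# Timestamps in seconds; "r1" has been silent for more than an hour.
to_drop = idle_connections({"r1": 0.0, "r2": 3000.0}, now=3700.0)
```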

When we do this, do you think it would be good if we tried to send an
error pdu just before closure? With a new error code indicating session
timeout?


E = Cache shutdown
==============

The cache may be stopped for whatever reason: server restart, an
irrecoverable internal error, anything else.

When that happens, should we send a new type of notify / error (with a new
specific code) to all children so that they can gracefully switch over to
another cache -- or wait until we are back?

Or should we just close the connections?



Thanks,
Tim


PS: If you are an rpki-rtr router implementer and you want to do interop
testing with us, please contact me.
