Re: [rrg] IRON-RANGER scalability and support for packets from non-upgraded networks

Robin Whittle <rw@firstpr.com.au> Sat, 13 March 2010 03:11 UTC


Short version:    Exploring the scalability of IRON-RANGER's
                  "bubble"-based registration system - every
                  10 minutes, the two IRON routers (one at each
                  of the two ISPs) send a registration packet
                  to however many VP (Virtual Prefix) IRON
                  routers there are for the VP which covers
                  the I-R PI prefix in question.

                  I think the scaling properties of this
                  system look bad - and I can't yet see how
                  the IRON routers can discover the IP addresses
                  of all the VP routers.


Hi Fred,

You wrote:

>>> IRON-RANGER used to speak of using IPv6 neighbour discovery
>>> as the means for locator liveness testing, dissemination
>>> of routing information, secure redirection, etc. However,
>>> the VET and SEAL mechanisms are being revised to instead
>>> use a different mechanism called the SEAL Control Message
>>> Protocol (SCMP) for tunnel endpoint negotiations that occur
>>> *within* the tunnel sublayer and are therefore not visible
>>> to either the outer IP protocol or the inner network layer
>>> protocol. Hence, the inner network layer protocol could be
>>> anything, including IPv4, IPv6, OSI CLNP, or any other network
>>> layer protocol that is eligible for encapsulation in IP.
>>
>> OK.  I hope you will be able to explain these things not just in
>> terms of high-level concepts, but to give examples of how the whole
>> thing would actually work on a large scale.
> 
> OK if you are talking about an architectural description,
> but please note that both VET and SEAL are already fully
> functional specifications that can be used by software
> developers to produce real code. 

I think I-R needs to be described in a way that someone who is up to
speed on scalable routing in general can read one or perhaps two I-R
documents and have a good idea of how the whole thing is going to
work - including with respect to scaling and security.  This doesn't
require exact bits in headers, but that could be part of it.  I think
it needs to be pretty much self-contained rather than requiring
people to read other documents which are not part of I-R.


>> For instance, how many IRON routers are there in an IPv4 I-R system,
>> and how many individual EID prefixes?
> 
> Let's suppose that each VP is an IPv6 ::/32, and that
> the smallest unit of PI prefix delegation from a VP is
> an IPv6 ::/56. In that case, there can theoretically be
> up to 4B VPs in the IRON RIB and 16M PI prefixes per VP.
> In practice, however, we can expect to see far fewer than
> that until the IPv6 address space reaches exhaustion
> which many believe will be well beyond our lifetimes.

OK.  Still, depending on how the address space was allocated - or at
least that subset of the address space covered by I-R's VPs - there
could be high numbers, approaching 16M perhaps, of I-R PI prefixes
per VP.
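
Here is a quick sanity check of those theoretical limits (a sketch
in Python; the /32 and /56 boundaries are just the assumptions from
your example):

  # Theoretical limits for /32 VPs with /56 PI delegations.
  vp_len = 32                          # each VP is an IPv6 ::/32
  pi_len = 56                          # smallest PI delegation is a ::/56

  max_vps   = 2 ** vp_len              # 4,294,967,296 ~= 4B possible VPs
  pi_per_vp = 2 ** (pi_len - vp_len)   # 2**24 = 16,777,216 ~= 16M per VP

  print(max_vps, pi_per_vp)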

> Still thinking (very) big, let's try sizing the system
> for 100K VPs; each with 100K ::/56 delegated PI prefixes.
> That would give 10B ::/56 PI prefixes, or 1 PI prefix
> for every person on earth (depending on when you sample
> the earth's population). Let's look at the scaling
> considerations under these parameters:

OK, I think this is a good scenario to discuss.  I assume that the
VPs can be of various sizes, so some VPs could be a longer prefix,
covering less space, if there are a larger number of I-R PI prefixes
within that part of the address space.

As far as I know, you don't need VPs covering the entire advertised
subset of global unicast address space.  However, for worst-case
scaling discussions I think it is good to assume this.


>> Then, how do these IRON
>> routers, for each of these EID prefixes continually and repeatedly (I
>> guess every 10 minutes or less) securely inform a given number of VP
>> routers they are the router, or one of the routers, to which packets
>> matching a given EID prefix should be tunneled.  Since there could be
>> multiple VP routers for a given VP, and the IRON routers don't and (I
>> think) can't know where they are, how does this process work securely
>> and scalably?
> 
> Each IRON router R(i) discovers the full map of VPs in
> the IRON through participation in the IRON BGP. 

I recall that some IRON routers handle VPs and others don't.  As I
wrote earlier, assuming VP routers advertise the VP in the DFZ, not
just in the I-R overlay network, then they are acting like LISP PTRs
or Ivip DITRs.  In order for them to do this in a manner which
generally reduces the path length from sending host, via VP router, to
the IRON router which delivers the packet to the destination, I think
that for each VP something like 20 or more IRON routers need to be
advertising the same VP.

I interpret your previous sentence to mean that all the IRON routers
are part of the IRON BGP overlay network, and that each one will
therefore get a single best path for each VP.  That will give it the
IP address of one IRON router which handles this VP.  It won't give
it any information on the full set of IRON routers which handle this VP.
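
To illustrate the visibility problem (a deliberately simplified toy
in Python - real BGP path selection is far more involved, but the
effect on what each router can see is the same):

  # Toy model: a RIB keeps one best path per prefix.  Which path wins
  # doesn't matter here - the point is that the loser disappears.
  advertisements = [("VP(j)", "R(x)"), ("VP(j)", "R(y)")]

  rib = {}
  for vp, router in advertisements:
      if vp not in rib:       # keep only the first ("best") path
          rib[vp] = router

  print(rib)                  # {'VP(j)': 'R(x)'} - R(y) is invisible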

> That
> means that each R(i) would need to perform full database
> synchronization for 100K stable IRON RIB entries that rarely
> if ever change. 

I am not sure what you mean by "full database synchronization".  Only
a subset of IRON routers advertise a VP, and each IRON router would
get a best-path to a single IRON router out of potentially numerous
IRON routers which were advertising a given VP.  So any one IRON
router would not be able to use the IRON BGP overlay system to either
discover the IP addresses (or best paths) to all IRON routers, or to
all the IRON routers which advertise VPs, assuming that some VPs were
advertised by more than one IRON router.


> This doesn't sound terrible even for existing
> core router equipment. As you noted, it is also possible that
> a given VP(j) would be advertised by multiple R(i)s - let's
> say each VP(j) is advertised by 2 R(i)s (call them R(x) and
> R(y)). But, since the IRON RIB is fully populated to all
> R(i)s, each R(i) would discover both R(x) and R(y) that
> advertise VP(j).

I don't see how this would occur.  A given IRON router receives best
paths for each VP, so for VP(j) it will get a best path to (and IP
address of) either R(x) or R(y).


> Now, for IRON router R(i) that is the provider for 100K PI
> prefixes delegated from VP(j), R(i) needs to send a "bubble"
> to both R(x) and R(y) for each PI prefix. 

It's no doubt a relief to less muscle-bound scalable routing
architectures that the routers of IRON-RANGER are hurling about
merely "bubbles" rather than something with greater impact!

> That would amount to 200K bubbles every 600 sec, or 333
> bubbles/sec.  If each bubble is 100bytes, the total bandwidth
> required for updating all of the 100K PI prefixes is 260Kbps.

I am not sure each registration "bubble" would only be 100 bytes of
protocol-level data.  You need to specify, for IPv6:

  1 - The IP address of the IRON router sending the registration
      (16 bytes).

  2 - The prefix the IRON router is registering (18 bytes).

  3 - Nonces and other stuff which invariably accompany messages
      such as this (10 to 20 bytes?).

  4 - Authentication material, such as a digital signature for the
      above, including the public key of the signer (the
      IRON router itself?) and a pointer to one or more PKI CAs or
      whatever so the VP router can ascertain that this really is
      the public key of the signer.  These will be FQDNs - let's
      say 50 bytes or so.

Maybe you could get the whole thing into 100 bytes.  Then add the
IPv6 header - 40 bytes - and a UDP header - 8 bytes - and we are up
to about 150 bytes already.  Add in L2 headers - Ethernet is 46
octets - and we are up to 200 bytes.  Multiply by 8 and this is
1600 bits.

  1600 x 333 = 532,800 bits/sec ~= 0.5Mbps

This is the bandwidth of incoming packets to R(x), and likewise for
R(y), in your description.  This is assuming two IRON routers
("200k bubbles every 600 sec") per I-R PI prefix.

But your description already differs from mine in two other
important respects.

Firstly, if these VP-advertising routers are to operate properly like
DITRs or PTRs, there need to be a lot more than 2 of them per VP.
Let's say 20.  Maybe 10 would be acceptable, maybe more - but 20 will
do.  Let's call them RVP(j, 0) to RVP(j, 19) where, in your example:

  R(x) == RVP(j, 0)
  R(y) == RVP(j, 1)

Secondly, I don't see how R(i) could discover the IP addresses of
more than one of this set of 20 routers.

In my model, if it could be shown how routers such as R(i) which
handle the 100k I-R PI prefixes in VP(j) could discover all the 20
routers RVP(j, 0) to RVP(j, 19), then each of these 20 routers has
this incoming bandwidth.

> Now, let's say that each PI prefix is multihomed to 2 providers,
> then we get 2x the message traffic for 520Kbps total for the
> bubbles needed to keep the 100K PI prefixes refreshed.

You already assumed two IRON routers per I-R PI prefix in your
260kbps figure above, so there's no need to double it again to 520kbps.

2 ISPs seems a reasonable figure, which was already part of my
calculations.
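
A one-line check of where the factor of two sits (using your own
figures of 100k prefixes, 100-byte bubbles and 600-second intervals):
the 200K bubbles already count both IRON routers.

  # 2 IRON routers x 100k prefixes = 200k bubbles per 600 s, 100 bytes each:
  print(200000 * 100 * 8 / 600.0)   # ~266,667 bits/sec ~= 260Kbps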

Each provider has an IRON router which handles a given I-R PI prefix,
and each such IRON router is sending bubbles to all the VP routers
(though I don't yet understand how these VP routers would be
discovered - and I am assuming there are 20 of them while you are
assuming there will be 2 of them).

My figure is 532kbps ~= 0.5Mbps incoming bandwidth per VP router.


>> If the VP routers act like DITRs or PTRs by advertising their VP in
>> the DFZ, then in order to make them work well in this respect - to
>> generally minimise the extra path length taken to and from them
>> compared to the path from the sending host to the proper IRON router
>> - I think you need at least a dozen of them.   This directly drives
>> the scaling problems in the process just mentioned where the IRON
>> routers continually register each of their EID prefixes with the
>> dozen or so VP routers which cover that EID prefix.
> 
> I don't understand why the dozen - I think with IRON VP
> routers, the only reason for multiples is for fault tolerance
> and not for optimal path routing, since path optimization will
> be coordinated by secure redirection. So, just a couple (or a
> few) IRON routers per VP should be enough I think?

Secure redirection works when an IRON router sends the initial packet
to a VP router, but it doesn't apply when the sending router is that
of a non-upgraded network.  To support generally low stretch paths
from those sending networks to the IRON router which is currently the
desired one for forwarding packets to the destination network, I
think you need a larger number.  20 is a rough figure, assuming a
global distribution of sending hosts and IRON routers which handle
the I-R PI prefixes - as is required for real portability.

If all the IRON routers for the I-R PI prefixes of a given VP were in
Europe, then it would suffice to have all the VP routers also in
Europe - so depending on the need for robustness and load sharing,
perhaps you wouldn't need 20 of them.  Maybe 5 would do.  But
generally, for this kind of scaling discussion, I think we need to
assume the goal of global portability of the new kind of address
space, with sending hosts likewise distributed globally.

So I think that for a VP containing 100k I-R PI prefixes, there are
going to be 20 such VP routers, and each is going to get a continual
0.5Mbps stream of registration packets.

This is not counting the work that VP router needs to do in order to
establish the authenticity of those registrations.  As far as I know,
it could only do this by looking up PKI CAs (Certification
Authorities) on a regular basis to ensure the signed registrations
were valid.

There are serious scaling problems per VP router in handling 333
signed registrations per second. That's a lot of crypto stuff to do
just to check the signatures - and a lot more work and packets going
back and forth for regularly checking that the public keys provided
are still valid.


There is also the scaling problem of there being 20 or so of these VP
routers, so the entire Internet needs to handle 20 x 0.5Mbps = 10Mbps
continually just to handle the registration of these 100k I-R PI
prefixes.  Each such prefix requires about 100 bits per second in
continual registration activity - about 5 bits per second per VP
router per I-R PI prefix.  For each VP router, this 5 bits per second
comes in roughly equal parts from the typically two IRON routers
which are registering a given I-R PI prefix.

Checking this: If there was a single VP router and a single IRON
router registering an I-R PI prefix, the IRON router would send 1600
bits every 600 seconds. This is 2.66 bits a second.  Since there are
20 VP routers, the figure per IRON router per I-R PI prefix is 53bps.
Since there are two such IRON routers per I-R PI prefix, the total
registration traffic per I-R PI prefix is 106bps.  With 100k of these
I-R PI prefixes per VP, this is about 10Mbps.  This checks out OK.
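
The same cross-check as a sketch in Python (assumptions as above:
1600-bit bubbles every 600 seconds, 20 VP routers per VP, 2 IRON
routers and 100k PI prefixes per VP):

  bits_per_bubble = 1600
  interval        = 600.0   # seconds between registrations
  vp_routers      = 20      # per VP
  iron_routers    = 2       # per I-R PI prefix
  prefixes        = 100000  # per VP

  per_pair   = bits_per_bubble / interval  # ~2.67 bps: one sender, one VP router
  per_sender = per_pair * vp_routers       # ~53 bps per IRON router per prefix
  per_prefix = per_sender * iron_routers   # ~107 bps per prefix, both senders
  total      = per_prefix * prefixes       # ~10.7 Mbps for the whole VP
  print(per_pair, per_sender, per_prefix, total)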

I think this is an unacceptable continual burden of registration traffic.

Also, this is just for 10 minute registrations.  I recall that the 10
minute time is directly related to the worst-case (10 minute) and
average (5 minute) multihoming service restoration time, as per our
previous discussions.  I think that these are rather long times.


>> Your IDs tend to be very high level and tend to specify external RFCs
>> for how you do important functions in I-R.
> 
> You may be speaking of IRON/RANGER, but the same is not
> true of VET/SEAL. VET and SEAL are fully functional
> specifications from which real code can be and has been
> derived.

Yes - SEAL is a self-contained protocol, but I still found it hard to
navigate my way within the one document.


>> Yet those RFCs say
>> nothing about I-R itself.  I think your I-Ds generally need more
>> material telling the reader specifically how you use these processes
>> in I-R.   Then, for each such process, have a detailed discussion
>> with real worst-case numbers to show that it is scalable at every
>> level for some worst-case numbers of EID prefixes, IRON routers etc.
>> - as well as secure against various kinds of attack.
> 
> Does the analysis I gave above help? If so, I can put
> it in the next version of IRON.

This is the sort of example I am hoping you will add.  But first I
think there are two issues I raised which would need to be
resolved before your example would be realistic according to my
understanding of I-R:

  1 - How does an IRON router discover all the IRON routers
      advertising a VP?  The I-R BGP overlay network does not
      provide this, as far as I know.

  2 - Allow for 20 or so routers each advertising the one VP,
      for the purposes of supporting packets from non-upgraded
      networks.

Assuming 2 is accepted, and 1 is somehow achieved, we now have, for
each of the 20 VP routers, 0.5Mbps of registration traffic.  That's a
lot of traffic and a lot of crypto processing to do.


It is no doubt more efficient than the ~100k extremely
expensive BGP routers of today's DFZ fussing around comparing notes
about 300k prefixes.  However, I don't think it scales as well as an
alternative:

  http://tools.ietf.org/html/draft-whittle-ivip-arch
  http://tools.ietf.org/html/draft-whittle-ivip-drtm

which doesn't involve such continual flows of registration, mapping
and other data unrelated to the traffic flowing to a given micronet,
or to changes in the ETR to which the micronet is mapped.



>>>>   8 - Apart from Ivip's Modified Header Forwarding arrangements,
>>>>       CES architectures involve encapsulation for tunneling
>>>>       packets from ITRs to ETRs (IRON-RANGER doesn't have ITRs and
>>>>       ETRs, but it still requires encapsulated tunneling).  There
>>>>       are some problems with this - but they do not appear to be
>>>>       prohibitive.
>>> IRON-RANGER calls them ITEs/ETEs because it is possible
>>> to also configure a tunnel endpoint on a host and not just
>>> on routers. In terms of routers, the IRON-RANGER ITE/ETE
>>> are exactly equivalent to what the other proposals are
>>> calling ITR/ETR.
>> OK.  In Ivip the sending host can have an "ITR" function - though it
>> is not a router and this "ITR" function doesn't advertise routes to
>> the MABs (Mapped Address Blocks) inside the host.  It does however
>> only handle packets sent by the host's stack which have destination
>> addresses matching any of the MABs.  I am sticking with "ITR" and
>> "ETR" in Ivip, to remain compatible with LISP - and because I think
>> they are easier to pronounce than "ITE" and "ETE".
> 
> I'm not sure about this - an {Ingress/Egress} Tunnel
> *Router* is a router that happens to terminate tunnel
> endpoints. On the other hand, an {Ingress/Egress}
> Tunnel *Endpoint* is tautologically a tunnel
> *endpoint* - so, why not call it that?

I am not suggesting you adopt "ITR" and "ETR" instead of "ITE" and
"ETE" - which I agree are more apt terms.  I was just explaining why,
for now, I will stick with "ITR" and "ETR" for Ivip.

  - Robin