Re: [rrg] IRON-RANGER scalability and support for packets from non-upgraded networks

"Templin, Fred L" <Fred.L.Templin@boeing.com> Tue, 16 March 2010 22:14 UTC

From: "Templin, Fred L" <Fred.L.Templin@boeing.com>
To: Robin Whittle <rw@firstpr.com.au>, RRG <rrg@irtf.org>
Date: Tue, 16 Mar 2010 15:14:33 -0700
Subject: Re: [rrg] IRON-RANGER scalability and support for packets from non-upgraded networks

Hi Robin,

Thanks for your follow-up; I will try to keep my
responses brief (below):

> -----Original Message-----
> From: Robin Whittle [mailto:rw@firstpr.com.au]
> Sent: Tuesday, March 16, 2010 4:40 AM
> To: RRG
> Cc: Templin, Fred L
> Subject: Re: [rrg] IRON-RANGER scalability and support for packets from non-upgraded networks
>
> Short Version:    Fred describes a method for handling packets sent
>                   from non-upgraded networks - roughly similar
>                   to Ivip's DITRs and LISP's PTRs.  However, there
>                   are unresolved questions about the commercial
>                   arrangements for running these.
>
>                   With this arrangement, my prior assumption that
>                   there would be 20 or so routers advertising a
>                   given VP (because I had understood these were
>                   performing the DITR/PTR functions) does not apply.
>
>                   This reduces the scaling difficulties I had
>                   previously mentioned.
>
>                   But there is not yet a clearly defined method of
>                   registering with the 2 or 3 VP routers.  We discuss
>                   some of the scaling problems and workarounds for
>                   them - but a fuller discussion will depend on
>                   Fred's choice of registration mechanism.  He is
>                   contemplating replacing the overlay network's BGP
>                   with OSPF, but maybe there's a way of doing it
>                   while retaining BGP.
>
>
>
> Hi Fred,
>
> You wrote:
>
>
> >> I think I-R needs to be described in a way that someone who is up to
> >> speed on scalable routing in general can read one or perhaps two I-R
> >> documents and have a good idea of how the whole thing is going to
> >> work - including with respect to scaling and security.  This doesn't
> >> require exact bits in headers, but that could be part of it.  I think
> >>  it needs to be pretty-much self-contained rather than requiring
> >> people to read other documents which are not part of I-R.
> >
> > There is room in a future update to IRON to improve on this.
>
> OK.
>
>
> >>>> For instance, how many IRON routers are there in an IPv4 I-R system,
> >>>> and how many individual EID prefixes?
> >>>
> >>> Let's suppose that each VP is an IPv6 ::/32, and that
> >>> the smallest unit of PI prefix delegation from a VP is
> >>> an IPv6 ::/56. In that case, there can theoretically be
> >>> up to 4B VPs in the IRON RIB and 16M PI prefixes per VP.
> >>> In practice, however, we can expect to see far fewer than
> >>> that until the IPv6 address space reaches exhaustion
> >>> which many believe will be well beyond our lifetimes.
> >>
> >> OK.  Still, depending on how the address space was allocated - or at
> >> least that subset of the address space covered by I-R's VPs - there
> >> could be high numbers, approaching 16M perhaps, of I-R PI prefixes
> >> per VP.
> >
> > Well, this is a tunable knob of course. We could for
> > example set the length for VPs to ::/36, ::/40, etc.
> > to reduce the number of PI prefixes per VP.
> >
> > The tradeoff is in managing a RIB containing a large
> > number of VPs (which are likely to be quite stable) vs.
> > managing a large number of PI prefixes per VP (which
> > require periodic keepalives to maintain). So, given a
> > routing protocol that can maintain a large number of
> > VPs in a relatively static topology it seems like a
> > proper balance of PI prefixes per VP can be found.
>
> OK.
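
To make this knob concrete, here is a quick back-of-the-envelope
sketch (Python; it only assumes the /56 PI unit from above stays
fixed while the VP length varies):

    # How the VP prefix length trades off against the number of /56
    # PI prefixes each VP can delegate.
    PI_LEN = 56
    for vp_len in (32, 36, 40):
        pi_per_vp = 2 ** (PI_LEN - vp_len)   # delegations per VP
        print("VP ::/%d -> %d PI prefixes per VP" % (vp_len, pi_per_vp))
    # ::/32 -> 16777216 (~16M, as above); ::/36 -> 1048576; ::/40 -> 65536

Shorter VP lengths mean fewer, larger VPs in the (stable) RIB;
longer ones mean more VPs but fewer keepalive-maintained PI
entries per VP.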
>
> >>> Still thinking (very) big, let's try sizing the system
> >>> for 100K VPs; each with 100K ::/56 delegated PI prefixes.
> >>> That would give 10B ::/56 PI prefixes, or 1 PI prefix
> >>> for every person on earth (depending on when you sample
> >>> the earth's population). Let's look at the scaling
> >>> considerations under these parameters:
> >>
> >> OK, I think this is a good scenario to discuss.  I assume that the
> >> VPs can be of various sizes, so some VPs could be a longer prefix,
> >> covering less space, if there are a larger number of I-R PI prefixes
> >> within that part of the address space.
> >
> > The length of the VPs is a tunable. It may be that there
> > can be VPs of varying lengths, but I chose to discuss as
> > all VPs having the same length for simplicity.
>
> OK.
>
>
> >> As far as I know, you don't need VPs covering the entire advertised
> >> subset of global unicast address space.  However, for worst-case
> >> scaling discussions I think it is good to assume this.
> >>
> >>>> Then, how do these IRON
> >>>> routers, for each of these EID prefixes continually and repeatedly (I
> >>>> guess every 10 minutes or less) securely inform a given number of VP
> >>>> routers they are the router, or one of the routers, to which packets
> >>>> matching a given EID prefix should be tunneled.  Since there could be
> >>>> multiple VP routers for a given VP, and the IRON routers don't and (I
> >>>> think) can't know where they are, how does this process work securely
> >>>> and scalably?
> >>>
> >>> Each IRON router R(i) discovers the full map of VPs in
> >>> the IRON through participation in the IRON BGP.
> >>
> >> I recall that some IRON routers handle VPs and others don't.  As I
> >
> > Not quite. All IRON routers by definition connect to the
> > IRON. So, all IRON routers discover all VPs in the IRON,
> > and *some* IRON routers also connect to the DFZ. Those
> > that connect to the DFZ advertise one or a few very short
> > prefixes (e.g., 4000::/3) that cover the set of all VPs
> > in the IRON.
>
> OK - so these are the I-R equivalents of Ivip's DITRs (Default ITRs
> in the DFZ) and LISP PTRs.  In my previous message, I assumed that VP
> routers were also advertising their VPs in the DFZ.  I recall I got
> this from something you wrote, but it doesn't matter now.
>
> But what are the scaling properties of these routers I will refer to
> as being "DITR-like"?
>
> Who runs them?  They are doing work, handling packets addressed to
> very large numbers of I-R end-user network prefixes - and those
> end-user networks are the parties which benefit.  So I think there
> needs to be an arrangement
> for money to flow from those end-user networks, in rough proportion
> to the traffic each DITR-like router handles for each end-user
> network.  This is handled in Ivip, but with DITRs which advertise
> specific subsets of the MABs (Mapped Address Blocks):
>
>   http://tools.ietf.org/html/draft-whittle-ivip-arch-04#section-8.1.2
>
> I suggest you devise a business case for these "DITR-like" routers -
> and give them a name.

The more I think about it, the more these specialized
VP routers are really just Default Mappers, similar to
those discussed in APT. On the IRON, they advertise
"default", and on the DFZ they advertise one or a few
short prefixes (e.g., 4000::/3) that cover all of the
VPs in use on the IRON.
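
To illustrate the division of labor with a small sketch (Python;
the 4000::/3 value is from above, while the VP and destination
addresses are made up for the example):

    import ipaddress

    # The DFZ carries only the short covering prefix; the IDM holds
    # the full VP table on the IRON side.
    dfz_route = ipaddress.ip_network("4000::/3")    # advertised in the DFZ
    vp = ipaddress.ip_network("4001:db8::/32")      # known only on the IRON
    dst = ipaddress.ip_address("4001:db8:0:1::1")

    assert dst in dfz_route  # enough for a DFZ router to pick an IDM egress
    assert dst in vp         # the IDM matches the VP to forward on the IRON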

> They are going to be busy, depending on where they are located, the
> traffic patterns, how many of them there are etc.   So they need to
> be able to handle the cached mapping of some potentially large number
> of I-R end-user network prefixes.

In the case of IPv6, I think whether the IRON Default
Mappers (IDMs) will be very busy depends on how large
the IPv6 DFZ becomes. In my understanding, the IPv6 DFZ
is not very big yet. So, if most IPv6 growth occurs in
the IRON and not in the IPv6 DFZ, the packet forwarding
load on the IDMs might not be so great.

> >> wrote earlier, assuming VP routers advertise the VP in the DFZ, not
> >> just in the I-R overlay network, then they are acting like LISP PTRs
> >> or Ivip DITRs.  In order for them to do this in a manner which
> >> generally reduces the path length from sending host, via VP router to
> >> the IRON router which delivers the packet to the destination, I think
> >> that for each VP something like 20 or more IRON routers need to be
> >> advertising the same VP.
> >
> > No; those IRON routers that also connect to the DFZ
> > advertise very short prefixes into the DFZ; they do
> > not advertise each individual VP into the DFZ else
> > there would be no routing scaling suppression gain.
>
> I think there would be, since each VP covers multiple individual
> end-user network prefixes.  If there are 10^7 of these prefixes, and
> on average each VP covers 100 of them, then there are 10^5 VPs and we
> have excellent routing scalability, saving 9.9 million prefixes from
> being advertised in the DFZ while providing 10 million prefixes for
> end-user networks who use them to achieve portability, multihoming
> and inbound TE.

That's good, but I think I'd still rather have the
IDMs only advertise the highly-aggregated short prefixes.

> >> I interpret your previous sentence to mean that all the IRON routers
> >> are part of the IRON BGP overlay network, and that each one will
> >> therefore get a single best path for each VP.  That will give it the
> >> IP address of one IRON router which handles this VP.  It won't give
> >> it any information on the full set of IRON routers which handle this VP.
> >
> > Here, it could be that my cursory understanding of BGP
> > is not matching well with reality. Let's say IRON routers
> > A and B both advertise VP1. Then, for any IRON router C,
> > C needs to learn that VP1 is reachable through both A and
> > B. I was hoping this could be done with BGP, but I believe
> > this could only happen if BGP supported an NBMA link model
> > and could push next hop information along with advertised
> > VPs. Do you know whether this arrangement could be realized
> > using standard BGP?
>
> Sorry, I can't reliably tell you what can and can't be done with BGP
> - I don't try to do anything special with it with Ivip.
>
> Still, if you assume that something could be done with BGP, consider
> the potential scaling problems.  Somehow, for every one of X VPs, and
>  for every Y IRON routers which handles a given VP, then you want
> each IRON router to learn via BGP the address of every one of these
> VP-advertising routers, and which VPs each one advertises.  This is
> (X * Y) items of information you are expecting BGP to deliver to
> every IRON router - so every BGP router needs to handle this
> information.
>
> The scaling properties of this would depend on how you get BGP to do
> it, and how many VPs there are, and how many IRON routers advertise
> the same VP.
>
>
> > If we are expecting too much with BGP, then I believe we can
> > turn to OSPF or some other dynamic routing protocol that
> > supports an NBMA link model. In discussions with colleagues,
> > we believe that the example arrangement I cited above can
> > be achieved with OSPF.
>
> OK . . . so you are considering using OSPF on the I-R overlay network
> rather than BGP.  I can't discuss that without doing a lot of reading
> - which I am not inclined to do.  But see below where I propose
> methods of doing the registration within the limits imposed by BGP.

I will think about both routing alternatives more. But, if
we use OSPF in the IRON overlay, routing would work the same
way as at any other layer of RANGER recursion. The list of
IDMs could be kept in the DNS under the special domain name
"isatapv2.net" which I have set aside for this purpose. All
other IRON routers can discover the list of IDMs by simply
resolving the name "isatapv2.net".
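
As a minimal sketch of that discovery step (Python; this assumes
the IDM addresses are simply published as ordinary address records
under that name):

    import socket

    def discover_idms(name="isatapv2.net"):
        # Resolve the special domain name and collect the set of
        # IDM addresses published under it.
        infos = socket.getaddrinfo(name, None)
        return sorted({info[4][0] for info in infos})

An IRON router could re-run this resolution periodically to pick
up IDM additions and removals.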

> >>> That
> >>> means that each R(i) would need to perform full database
> >>> synchronization for 100K stable IRON RIB entries that rarely
> >>> if ever change.
> >>
> >> I am not sure what you mean by "full database synchronization".  Only
> >> a subset of IRON routers advertise a VP, and each IRON router would
> >> get a best-path to a single IRON router out of potentially numerous
> >> IRON routers which were advertising a given VP.  So any one IRON
> >> router would not be able to use the IRON BGP overlay system to either
> >> discover the IP addresses (or best paths) to all IRON routers, or to
> >> all the IRON routers which advertise VPs, assuming that some VPs were
> >> advertised by more than one IRON router.
> >
> > What we need here is a dynamic routing protocol that
> > supports an NBMA link model, and the IRON is treated
> > as a gigantic NBMA link on which all IRON routers are
> > attached. Maybe BGP won't fill the bill for that, but
> > other dynamic routing protocols such as OSPF show some
> > promise.
>
>
> >>> This doesn't sound terrible even for existing
> >>> core router equipment. As you noted, it is also possible that
> >>> a given VP(j) would be advertised by multiple R(i)s - let's
> >>> say each VP(j) is advertised by 2 R(i)s (call them R(x) and
> >>> R(y)). But, since the IRON RIB is fully populated to all
> >>> R(i)s, each R(i) would discover both R(x) and R(y) that
> >>> advertise VP(j).
> >>
> >> I don't see how this would occur.  A given IRON router receives best
> >> paths for each VP, so for VP(j) it will get a best path to (and IP
> >> address of) either R(x) or R(y).
> >
> > As above.
> >
> >>> Now, for IRON router R(i) that is the provider for 100K PI
> >>> prefixes delegated from VP(j), R(i) needs to send a "bubble"
> >>> to both R(x) and R(y) for each PI prefix.
> >>
> >> It's no doubt a relief to less muscle-bound scalable routing
> >> architectures that the routers of IRON-RANGER are hurling about
> >> merely "bubbles" rather than something with greater impact!
> >
> > No worries; they are harmless, and not at all weapons
> > of war.
>
> Good!

The term "bubbles" came from teredo (RFC4380). Maybe we can
think of a better term to use for IRON-RANGER?

> >>> That would amount to 200K bubbles every 600 sec, or 333
> >>> bubbles/sec.  If each bubble is 100 bytes, the total bandwidth
> >>> required for updating all of the 100K PI prefixes is 260Kbps.
> >>
> >> I am not sure each registration "bubble" would only be 100 bytes of
> >> protocol-level data.  You need to specify, for IPv6:
> >>
> >>   1 - The IP address of the IRON sending the registration (16 bytes).
> >
> > You mean in the data portion of the bubble or in the header?
> > For IPv6-over-IPv4, the bubble does not need to include an IPv6
> > header; it need only include the IPv4 header, since VET stateless
> > address mapping allows the IPv6 link-local address to be discovered
> > by knowing only the IPv4 address. I can't see why an IPv6 address
> > would also be required in the data portion of the bubble if it can
> > already be inferred from the IPv4 header?
>
> I am definitely not going to try to think about mixed IPv4/v6
> implementations of I-R.  I can handle thinking about purely IPv4 and
> purely IPv6.

I choose to think of mixed IPv4/IPv6 for at least three
reasons:

1) We already have global deployment of IPv4, and that won't
   go away overnight when IPv6 begins to deploy.

2) IPv4 is fully built-out, so new growth will come via IPv6.

3) IPv6 addresses can embed IPv4 addresses such that there
   is stateless address mapping between an EID nexthop and
   an RLOC.
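
On point 3, here is a rough sketch of the kind of stateless
mapping I mean, in the style of ISATAP link-local addresses
(RFC 5214); the concrete addresses are illustrative only:

    import ipaddress

    def ipv4_to_linklocal(ipv4_str):
        # Embed the IPv4 RLOC in the low 32 bits of the IPv6 IID, so
        # the EID nexthop is derivable from the IPv4 header alone.
        v4 = ipaddress.IPv4Address(ipv4_str)
        return ipaddress.IPv6Address("fe80::5efe:%s" % v4)

    def linklocal_to_ipv4(ipv6_addr):
        # Recover the IPv4 RLOC statelessly from the low 32 bits.
        return ipaddress.IPv4Address(int(ipv6_addr) & 0xffffffff)

    ll = ipv4_to_linklocal("192.0.2.1")
    assert str(linklocal_to_ipv4(ll)) == "192.0.2.1"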

> >>   2 - The prefix the IRON router is registering (18 bytes).
> >
> > Not necessarily 18 bytes; prefix plus length is all that
> > is needed. For a ::/32, that would be 4 bytes of prefix
> > plus 1 length byte = 5 bytes. Since IPv6 likes to do
> > things in blocks of 8, however, let's round up to 16
> > to be safe.
>
> OK.
>
> >>   3 - Nonces and other stuff which invariably accompany messages
> >>       such as this (10 to 20 bytes?).
> >
> > The SEAL header with a sequence number that also
> > serves as a nonce is used for this - the SEAL
> > header plus sequence number length is accounted
> > for below:
>
> OK.
>
> >>   4 - Authentication material, such as a digital signature for the
> >>       above, including the public key of the signer (the
> >>       IRON router itself?) and a pointer to one or more PKI CAs or
> >>       whatever so the VP router can ascertain that this really is
> >>       the public key of the signer.  These will be FQDNs - lets
> >>       say 50 bytes or so.
> >
> > I honestly do not know how much this would be. I will
> > take your 50 byte estimation.
>
> OK.
>
>
> >> Maybe you could get the whole thing into 100 bytes.  Then add the
> >> IPv6 header - 40 bytes - and a UDP header 8 bytes - and we are up to
> >> about 150 bytes already.
> >
> > No IPv6 header; only an IPv4 header (20 bytes) plus a SEAL
> > header (8 bytes) plus possibly also a UDP header (8 bytes)
> > for a total of 36.
> >
> >> Add in L2 headers - Ethernet is 46 octets -
> >
> > I guess you are counting everything from the preamble to the
> > end of the interframe gap? I come up with 42 (when 802.1Q header
> > is added), but I'll use your 46 to be conservative.
> >
> >>  and we are up to 200 bytes.  Multiply by 8 and this is 1600 bits.
> >
> > I have (36 + 16 + 50 + 46) = 148. So, call it 150 to be
> > safe, and the guesstimate is midway between your 200 and
> > the 100 I said initially.
>
> OK.
>
> >>   1600 x 333 = 532,800 bits/sec ~=0.5Mbps
> >
> > I get 1200 * 333 = 399,600 bps ~=0.4Mbps
>
> OK.
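
Spelling the agreed figures out (Python; all numbers are the
estimates from the exchange above):

    # Per-bubble size accounting.
    ipv4_hdr, seal_hdr, udp_hdr = 20, 8, 8      # outer headers
    prefix_field, auth_field, l2 = 16, 50, 46   # payload fields + L2 framing
    bubble = ipv4_hdr + seal_hdr + udp_hdr + prefix_field + auth_field + l2
    assert bubble == 148                        # "call it 150 to be safe"

    rate = 200000 / 600.0                       # 200K bubbles every 600 sec
    bps = 150 * 8 * rate
    # rate ~= 333 bubbles/sec; bps ~= 400,000 (~0.4Mbps), as above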
>
> >> This is the bandwidth of incoming packets to R(x) and likewise for
> >> R(y) in your description.   This is assuming a two IRON routers
> >> ("200k bubbles every 600 sec") per I-R PI prefix.
> >>
> >> But your description varies from mine already in two other important
> >> respects.
> >>
> >> Firstly, if these VP-advertising routers are to operate properly like
> >> DITRs or PTRs, there need to be a lot more than 2 of them per VP.
> >
> > No, because all that needs to be injected into the DFZ is
> > one or a few very short prefixes (e.g., 4000::/3). It doesn't
> > matter then which IRON router is chosen as the egress to get
> > off of the DFZ, since that router will also have visibility
> > to all VPs on the IRON.
>
> OK.
>
> Since you have what to me is a new "DITR-like" router plan for
> supporting packets sent from non-upgraded networks, there is no need
> for the larger number of VP routers as I assumed in my previous
> message.  As long as you have two or three, that should be fine, I think.
>
> There are two reasons an IRON router M might need to know which
> other IRON routers A, B and C advertise a given VP:
>
>  1 - When M has a traffic packet.  (M is either an ordinary IRON
>      router and advertises the I-R "edge" space in its own network
>      or it is a "DITR-like" router advertising this space in the
>      DFZ.)  M needs to tunnel the packet to one of these VP routers.
>
>      The VP router will tunnel it to the IRON router Z it chooses as
>      the best one to deliver the packet to the destination network
>      and will send a "mapping" packet to M which will cache this
>      information and from then on tunnel packets matching the
>      end-user network prefix in the "mapping" to Z (or some other
>      IRON router like Z, if there were two or more in the "mapping").
>
>      In this case, M needs only the address of one of the A, B or C
>      routers.  Ideally it would have the address of the closest one -
>      but it doesn't matter too much if it has the address of a more
>      distant one.  That would involve a somewhat longer trip to the
>      VP router, and perhaps a longer or shorter trip from there to Z.
>      (This would typically be shorter than the path taken through
>      LISP-ALT's overlay network.)
>
>      After M gets the "mapping", it tunnels traffic packets to Z - so
>      the distance to the VP router no longer affects the path of
>      traffic packets.
>
>      In this case, BGP on the overlay would be perfectly good - since
>      it provides the best path to one of A, B or C - typically that
>      of the "closest" (in BGP terms).
>
>
>  2 - When M is one of potentially multiple IRON routers which
>      delivers packets to a given end-user network - packets whose
>      destination address matches a given end-user network prefix P.
>
>      M needs to "blow bubbles" (highly technical term from this
>      R&D phase of IRON-RANGER) to A, B and C.  The most obvious
>      way to do this is for M to be able to know, via the overlay
>      network the addresses of all VP routers which advertise a given
>      VP.  There may be two or three or a few more of these.  They
>      could be anywhere in the world.
>
>      BGP does not appear to be a suitable mechanism for this, since
>      its "best path" basic functions would only provide M with
>      the IP address of one of A, B and C.
>
>      You could do it with BGP, by having A, B and C all know about
>      each other, and with all three sending everything they get to
>      the others.  This is not too bad in scaling terms for two,
>      three or four such VP routers.
>
>      Then, M sends its registration to one of them - whichever it
>      gets the address of via the BGP of the overlay network - and
>      A, B and C compare notes so they all get the registration.
>
>      I will call this the "VP router flooding system".

This is a nice idea. If I get what you are suggesting, the
IRON routers that advertise the same VP (e.g., VP(x)) would
need to engage in a routing protocol instance with one
another to track all of the PI prefix registrations. The
problem I have with it is that it would make for perhaps
10^5 or more of these little routing protocol instances, as
well as lots and lots of manually-configured peering
arrangements between the IRON routers that advertise VP(x).

For these reasons, I believe it is better for IRON router
M to know about all three of A, B and C and direct bubbles
to each of them. I think we can achieve this using OSPF
with the NBMA link model in the IRON overlay.

Please note: the EID-based IRON overlay is configured over
the DFZ, which is using BGP to disseminate RLOC-based
prefix information. So, it is BGP in the underlay and
OSPF in the overlay - weird, but I think it works.
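
A sketch of the registration flow I have in mind under that model
(Python; the router names, data structures and 600sec interval are
illustrative only):

    import time

    def vp_routers_for(vp, overlay_rib):
        # With an NBMA-capable IGP such as OSPF, the overlay RIB can
        # hold *every* router advertising a VP, not just a single
        # BGP-style best path.
        return overlay_rib[vp]                  # e.g., {"A", "B", "C"}

    def refresh_loop(pi_prefix, vp, overlay_rib, send_bubble, interval=600):
        while True:
            for router in vp_routers_for(vp, overlay_rib):
                send_bubble(router, pi_prefix)  # crypto on first bubble only
            time.sleep(interval)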

>      Later I suggest another alternative which would also work
>      with BGP.
>
>
> If you adopted something like the above-mentioned "VP-router flooding
> system" I think you can retain BGP for the overlay network.  This
> will tell each IRON router a best-path to one of the potentially
> multiple routers which advertise a given VP.  If there are three such
> routers for a given VP, known as A, B and C, and if for a given IRON
> router, the BGP overlay network gives it a best path to B, then all
> is well.  B will tend to be closer than the others.  If B dies or
> becomes unreachable, this will cause the BGP overlay network to
> withdraw the best path to B and then provide a best path to A or C
> instead.
>
>
>
> >> Let's say 20.  Maybe 10 would be acceptable, maybe more - but 20 will
> >> do.  Let's call them RVP(j, 0) to RVP(j, 19) where, in your example:
> >>
> >>   R(x) == RVP(j, 0)
> >>   R(y) == RVP(j, 1)
> >>
> >> Secondly, I don't see how R(i) could discover the IP addresses of
> >> more than one of this set of 20 routers.
> >
> > As above, it is only 2-3 IRON routers per VP; not 20.
>
> OK.
>
> >> In my model, if it could be shown how routers such as R(i) which
> >> handle the 100k I-R PI prefixes in VP(j) could discover all the 20
> >> routers RVP(j, 0) to RVP(j, 19), then each of these 20 routers has
> >> this incoming bandwidth.
> >>
> >>> Now, let's say that each PI prefix is multihomed to 2 providers,
> >>> then we get 2x the message traffic for 520Kbps total for the
> >>> bubbles needed to keep the 100K PI prefixes refreshed.
> >> You already assumed two IRON routers per I-R PI prefix in your
> >> 260kbps figure above, so there's no need to double it again to 520kbps.
> >>
> >> 2 ISPs seems a reasonable figure, which was already part of my
> >> calculations.
> >>
> >> Each provider has an IRON router which handles a given I-R IP prefix,
> >> and each such IRON router is sending bubbles to all the VP routers
> >> (though I don't yet understand how these VP routers would be
> >> discovered - and I am assuming there are 20 of them while you are
> >> assuming there will be 2 of them).
> >>
> >> My figure is 532kbps ~= 0.5Mbps incoming bandwidth per VP router.
> >>
> >>
> >>>> If the VP routers act like DITRs or PTRs by advertising their VP in
> >>>> the DFZ, then in order to make them work well in this respect - to
> >>>> generally minimise the extra path length taken to and from them
> >>>> compared to the path from the sending host to the proper IRON router
> >>>> - I think you need at least a dozen of them.   This directly drives
> >>>> the scaling problems in the process just mentioned where the IRON
> >>>> routers continually register each of their EID prefixes with the
> >>>> dozen or so VP routers which cover that EID prefix.
> >>>
> >>> I don't understand why the dozen - I think with IRON VP
> >>> routers, the only reason for multiples is for fault tolerance
> >>> and not for optimal path routing, since path optimization will
> >>> be coordinated by secure redirection. So, just a couple (or a
> >>> few) IRON routers per VP should be enough I think?
> >>
> >> Secure redirection works when an IRON router sends the initial packet
> >> to a VP router, but it doesn't apply when the sending router is that
> >> of a non-upgraded network.  To support generally low stretch paths
> >> from those sending networks to the IRON router which is currently the
> >> desired one for forwarding packets to the destination network, I
> >> think you need a larger number.  20 is a rough figure, assuming a
> >> global distribution of sending hosts and IRON routers which handle
> >> the I-R PI prefixes - as is required for real portability.
> >
> > Again, DFZ routers on the non-upgraded network would select
> > the closest IRON router that advertises, e.g., 4000::/3 as
> > the router that can get off the DFZ and onto the IRON. So,
> > it would not be the case that all VPs would be injected into
> > the DFZ.
>
> OK - as per what to me is a new "DITR-like router" arrangement for
> handling packets sent from non-upgraded networks.
>
>
>
> >> If all the IRON routers for the I-R PI prefixes of a given VP were in
> >> Europe, then it would suffice to have all the VP routers also in
> >> Europe - so depending on the need for robustness and load sharing,
> >> perhaps you wouldn't need 20 of them.  Maybe 5 would do.  But
> >> generally, for this kind of scaling discussion, I think we need to
> >> assume the goal of global portability of the new kind of address
> >> space, with sending hosts likewise distributed globally.
> >>
> >> So I think that for a VP containing 100k I-R PI prefixes, there are
> >> going to be 20 such VP routers, and each is going to get a continual
> >> 1Mbps stream of registration packets.
> >
> > Not 20; only 2 or 3. And, it would be less than 1Mbps per
> > VP router.
>
> OK.
>
>
> >> This is not counting the work that VP router needs to do in order to
> >> establish the authenticity of those registrations.  As far as I know,
> >> it could only do this by looking up PKI CAs (Certification
> >> Authorities) on a regular basis to ensure the signed registrations
> >> were valid.
> >>
> >> There are serious scaling problems per VP router in handling 333
> >> signed registrations per second. That's a lot of crypto stuff to do
> >> just to check the signatures - and a lot more work and packets going
> >> back and forth for regularly checking that the public keys provided
> >> are still valid.
> >
> > Crypto overhead can be greatly relaxed if the IRON router
> > performs crypto only for the initial prefix registration
> > then accepts bubbles without performing the crypto for
> > subsequent prefix refreshes. This is because, using
> > SEAL, there are synchronized sequence numbers for blocking
> > off-path injections of bogus bubbles.
>
>    (I had to Google "bogus bubbles" - what a great little phrase!
>     There's no obvious domain name or rock-band or DJ using it.)
>
> I agree, there may be some way of reducing the crypto overhead with
> these sequence numbers.  But sooner or later, there would need to be
> a check of the PKI to ensure the signature was made with a public key
> which is still valid.
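
To make the relaxation concrete, here is a sketch (Python; the
window size and verify_sig() are placeholders rather than the
actual SEAL mechanism):

    WINDOW = 64     # acceptance window for sequence numbers (placeholder)
    state = {}      # pi_prefix -> next expected sequence number

    def accept_bubble(pi_prefix, seq, sig, verify_sig):
        if pi_prefix not in state:
            if not verify_sig(pi_prefix, sig):   # crypto: first bubble only
                return False
        elif not (state[pi_prefix] <= seq < state[pi_prefix] + WINDOW):
            return False                         # off-path bogus bubble
        state[pi_prefix] = seq + 1               # in-window refresh, no crypto
        return True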
>
>
> >> There is also the scaling problem of there being 20 or so of these VP
> >> routers, so the entire Internet needs to handle 20 x 0.5Mbps = 10Mbps
> >> continually just to handle the registration of these 100k I-R PI
> >> prefixes.  Each such prefix requires 100 bits per second in continual
> >> registration activity - 5 bits per second per VP router per I-R PI
> >> prefix.  For each VP router, 5 bits per second on average comes from
> >> each of the typically two IRON routers which are registering a given
> >> I-R PI prefix.
> >>
> >> Checking this: If there was a single VP router and a single IRON
> >> router registering an I-R PI prefix, the IRON router would send 1600
> >> bits every 600 seconds. This is 2.66 bits a second.  Since there are
> >> 20 VP routers, the figure per IRON router per I-R PI prefix is 53bps.
> >>  Since there are two such IRON routers per I-R PI prefix, the
> >> total per I-R PI prefix is 106bps.  With 100k of these I-R
> >> PI prefixes per VP, this is about 10Mbps.  This checks out OK.
> >
> > You are off by a factor of 10 here, because there only need
> > to be 2 VP routers per VP.
>
> Yes - with my new understanding of the "DITR-like" routers.
>
>
> >> I think this is an unacceptable continual burden of registration traffic.
> >>
> >> Also, this is just for 10 minute registrations.  I recall that the 10
> >> minute time is directly related to the worst-case (10 minute) and
> >> average (5 minute) multihoming service restoration time, as per our
> >> previous discussions.  I think that these are rather long times.
> >
> > Well, let's touch on this a moment. The real mechanism
> > used for multihoming service restoration is Neighbor
> > Unreachability Detection. Neighbor Unreachability
> > Detection uses "hints of forward progress" to tell if
> > a neighbor has gone unreachable, and uses a default
> > staletime of 30sec after which a reachability probe
> > must be sent. This staletime can be cranked down even
> > further if there needs to be a more timely response to
> > path failure. This means that the PI prefix-refreshing
> > "bubbles" can be spaced out much longer - perhaps 1 every
> > 10hrs instead of 10min. (Maybe even 1 every 10 days!)
>
> OK, I am not sure if I ever knew the details of "Neighbor
> Unreachability Detection" - but shortening the time for these
> mechanisms raises its own scaling problems.
>
> Can you give some examples of how this would work?

I want to go back on this notion of extended inter-bubble
intervals, and return to something shorter like 600sec
or even 60sec. There needs to be a timely flow of bubbles
in case one or a few IRON routers go down and need to
have their PI prefix registrations refreshed.
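
In other words, the registrations are soft state. A sketch
(Python; the 3x grace multiple is an assumption for illustration):

    import time

    INTERVAL = 600                # inter-bubble interval (sec)
    GRACE = 3 * INTERVAL          # drop state after a few missed bubbles
    registrations = {}            # pi_prefix -> time of last bubble

    def on_bubble(pi_prefix):
        registrations[pi_prefix] = time.time()

    def expire_stale():
        now = time.time()
        for pfx in [p for p, t in registrations.items() if now - t > GRACE]:
            del registrations[pfx]   # registering router presumed down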

> > In this way, the PI prefix registration process begins
> > to very much resemble DHCP prefix delegation.
>
> I will pass on this for the moment.
>
> At present, I can see these choices for this registration mechanism:
>
>   1 - Keep BGP as the overlay protocol and use my proposed "VP router
>       flooding system".
>
>   2 - Retain your current plan of each IRON router like M needing to
>       know the addresses of all the routers handing a given VP (A, B
>       and C) which BGP can't do.  So you could:
>
>       2a - keep BGP and add some other mechanism.  Maybe M sends a
>            message to the one of A, B or C it has a best path to,
>            requesting the full list of all routers A, B and C which
>            handle a given VP.  When M gets the list, it sends
>            registration "bubbles" to the routers on the list.  This
>            needs to be repeated from time-to-time to discover
>            new VP routers.
>
>       2b - use something different from BGP which provides all the
>            A, B and C router addresses to every IRON router, such as
>            M.  This needs to dynamically change as A, B and C die and
>            are restarted, or joined by others.

Right - I am still leaning toward OSPF with its NBMA
link model capabilities. The good news is that the
IRON topology itself should be relatively stable, so
not much churn due to dynamic updates.

> >>>> Your IDs tend to be very high level and tend to specify external RFCs
> >>>> for how you do important functions in I-R.
> >>>
> >>> You may be speaking of IRON/RANGER, but the same is not
> >>> true of VET/SEAL. VET and SEAL are fully functional
> >>> specifications from which real code can be and has been
> >>> derived.
> >>
> >> Yes - SEAL is a self-contained protocol, but I still found it hard to
> >> navigate my way within the one document.
> >
> > The IRON document has a lot of room to add more
> > descriptive text on the architecture. But, the
> > mechanisms are already specified in VET and SEAL.
>
> OK.
>
>
> >>>> Yet those RFCs say
> >>>> nothing about I-R itself.  I think your I-Ds generally need more
> >>>> material telling the reader specifically how you use these processes
> >>>> in I-R.   Then, for each such process, have a detailed discussion
> >>>> with real worst-case numbers to show that it is scalable at every
> >>>> level for some worst-case numbers of EID prefixes, IRON routers etc.
> >>>> - as well as secure against various kinds of attack.
> >>>
> >>> Does the analysis I gave above help? If so, I can put
> >>> it in the next version of IRON.
> >>
> >> This is the sort of example I am hoping you will add.  But first I
> >> think there are two questions I raised which would need to be
> >> resolved before your example would be realistic according to my
> >> understanding of I-R:
> >>
> >>   1 - How does an IRON router discover all the IRON routers
> >>       advertising a VP?  The I-R BGP overlay network does not
> >>       provide this, as far as I know.
> >
> > We believe that OSPF with NBMA link model (or equivalent)
> > could be used.
>
> OK.
>
>
> >>   2 - Allow for 20 or so routers each advertising the one VP,
> >>       for the purposes of supporting packets from non-upgraded
> >>       networks.
> >
> > We don't need 20; we only need 2-3. And, the bubble
> > interval (aka the "lease lifetime") can probably be
> > pushed out by a factor of ~100.
>
> OK.
>
>
> >> Assuming 2 is accepted, and 1 is somehow achieved, we now have, for
> >> each of the 20 VP routers, 0.5Mbps of registration traffic.  That's a
> >> lot of traffic and a lot of crypto processing to do.
> >
> > Crypto is not needed on each and every bubble;
> > only on the first bubble.
>
> OK.
>
>
> >> It is no doubt more efficient than the ~100k or so extremely
> >> expensive BGP routers of today's DFZ fussing around comparing notes
> >> about 300k prefixes.  However, I don't think it scales as well as an
> >> alternative:
> >>
> >>   http://tools.ietf.org/html/draft-whittle-ivip-arch
> >>   http://tools.ietf.org/html/draft-whittle-ivip-drtm
> >>
> >> which doesn't have such continual flows of registration, mapping etc.
> >> data, unrelated to the traffic flowing to a given micronet, or to
> >> changes in the ETR to which the micronet is mapped.
> >
> > I think we have learned a few things about the scaling,
> > and there are solutions. Consider now the bubble interval
> > as being analogous to the DHCP lease lifetime, and scaling
> > can be greatly improved for (much) longer bubble intervals.
>
> OK - but you still need to design a registration mechanism before we
> can think in detail about scaling.

Let's forget about the DHCP lease lifetimes analogy
for a bit and get back to the assumption that the
inter-bubble interval is the mechanism that keeps
PI registrations refreshed in a timely fashion.

Thanks - Fred
fred.l.templin@boeing.com

> >> I am not suggesting you adopt "ITR" and "ETR" instead of "ITE" and
> >> "ETE" - which I agree are more apt terms.  I was just explaining why,
> >> for now, I will stick with "ITR" and "ETR" for Ivip.
> >
> > OK - Fred
>
> OK.
>
>    - Robin