Re: [rrg] IRON-RANGER scalability and support for packets from non-upgraded networks

Robin Whittle <rw@firstpr.com.au> Wed, 17 March 2010 00:30 UTC


Short version:   Further discussion on "DITR-like" routers, now
                 called "IRON Default Mappers" (IDMs), how an IRON
                 router registers an end-user network EID prefix
                 with multiple VP routers, and whether to use OSPF
                 rather than BGP for the IRON overlay network.


Hi Fred,

You wrote:

>> OK - so these are the I-R equivalents of Ivip's DITRs (Default ITRs
>> in the DFZ) and LISP PTRs.  In my previous message, I assumed that VP
>> routers were also advertising their VPs in the DFZ.  I recall I got
>> this from something you wrote, but it doesn't matter now.
>>
>> But what are the scaling properties of these routers I will refer to
>> as being "DITR-like"?
>>
>> Who runs them?  They are doing work, handling packets addressed to
>> very large numbers of I-R end-user network prefixes, who are the
>> parties which benefit.  So I think there needs to be an arrangement
>> for money to flow from those end-user networks, in rough proportion
>> to the traffic each DITR-like router handles for each end-user
>> network.  This is handled in Ivip, but with DITRs which advertise
>> specific subsets of the MABs (Mapped Address Blocks):
>>
>>   http://tools.ietf.org/html/draft-whittle-ivip-arch-04#section-8.1.2
>>
>> I suggest you devise a business case for these "DITR-like" routers -
>> and give them a name.
> 
> The more I think about it, the more these specialized
> VP routers are really just Default Mappers, i.e.,
> similar to those discussed in APT. On the IRON, they
> advertise "default", and on the DFZ they advertise one
> or a few short prefixes (e.g., 4000::/3) that cover all
> of the VPs in use on the IRON.

This is different from what I understood from your previous message.

I understood there was a subset of IRON routers which we call "VP
routers".  Each such "VP router" advertises in the IRON network (the
tunnel-based overlay network between all IRON routers, currently
implemented with BGP) a VP (Virtual Prefix).  There are typically two
or perhaps more such routers advertising a given VP.  Each such VP
router may also advertise other VPs, but for this discussion, let's
think of IRON routers A, B and C all advertising VP "P" on the IRON
network - and for the purposes of discussion not advertising any
other VPs.

After your msg06274, when I wrote msg06278, I understood that the VP
routers also advertised their VPs in the DFZ, and that this was the
mechanism by which I-R supported packets sent by hosts in
non-upgraded networks.  It doesn't matter now why I thought this.

From the most recent pair of messages, your msg06305 and my msg06315,
I thought that this role, which I described as "DITR-like", was
performed by a subset of IRON routers which advertise one or a few
prefixes in the DFZ which cover the entire I-R "edge" subset of the
global unicast address space.  This was on the basis of your:

   >> Firstly, if these VP-advertising routers are to operate
   >> properly like DITRs or PTRs, there needs to be a lot more than
   >> 2 of them per VP.
   >
   > No, because all that needs to be injected into the DFZ is
   > one or a few very short prefixes (e.g., 4000::/3). It doesn't
   > matter then which IRON router is chosen as the egress to get
   > off of the DFZ, since that router will also have visibility
   > to all VPs on the IRON.

   > Again, DFZ routers on the non-upgraded network would select
   > the closest IRON router that advertises, e.g., 4000::/3 as
   > the router that can get off the DFZ and onto the IRON. So,
   > it would not be the case that all VPs would be injected into
   > the DFZ.

I assumed that these "DITR-like" routers were not necessarily VP routers.

Here is my understanding on what you just wrote:

> The more I think about it, the more these specialized
> VP routers

I think you mean the "DITR-like" routers are VP routers. Later you
refer to these as "IRON Default Mappers (IDMs)".  I had assumed they
either were not VP routers, or that they need not be VP routers.

> are really just Default Mappers, i.e., similar
> to those discussed in APT. On the IRON, they
> advertise "default", and on the DFZ they advertise one
> or a few short prefixes (e.g., 4000::/3) that cover all
> of the VPs in use on the IRON.

APT's DMs certainly advertised into the routing systems of the
networks they were located within.  I recall they advertised a set of
prefixes covering all the "edge" end-user (EID) prefixes of any
end-user network which was using an ISP in the same APT island.
There could be multiple APT islands - sets of APT-adopting ISPs which
were linked by direct BGP links and therefore which were able to
share all their mapping information, which was carried over those
direct BGP links.   (If this was extended with tunnels, then all APT
adopting ISPs would be part of a single global APT island, and this
would enable EID space to be split more finely than the IPv4 /24
limit.  With separate islands, any EID prefixes longer than /24 would
need to use the same island.)

I recall that the DMs also advertised these covering prefixes to
neighbouring ISPs - AKA "advertising them in the DFZ".  But this
assumes the DMs were border routers, which I recall was not
necessarily the case.   So based on what I remember about APT, in an
single APT island, the subset of DMs which were BRs would act in much
the same way as LISP's PTRs or Ivip's DITRs, except that with Ivip's
DITRs each such DITR normally only advertises a subset of the total
Ivip "edge" space, while these APT DMs would advertise it all.

If there was a single global APT island, then all the DMs which were
BRs would advertise in the DFZ the complete set of APT "edge" space.
I understand from what you just wrote that "these specialised VP
routers" (IDMs, below) in I-R are also BRs and that each one also
advertises the complete set of "edge" address space in the I-R system.

However, this part:

> On the IRON, they advertise "default"

makes no sense to me.  I don't recall any IRON router advertising
"default" on the IRON overlay network.  I understand that a VP router
advertises its one or more VPs.


>> They are going to be busy, depending on where they are located, the
>> traffic patterns, how many of them there are etc.   So they need to
>> be able to handle the cached mapping of some potentially large number
>> of I-R end-user network prefixes.
> 
> In the case of IPv6, I think whether the IRON Default
> Mappers (IDMs) will be very busy depends on how large
> the IPv6 DFZ becomes. In my understanding, the IPv6 DFZ
> is not very big yet. So, if most IPv6 growth occurs in
> the IRON and not in the IPv6 DFZ the packet forwarding
> load on the IDMs might not be so great.

This would only be true if you could convince most networks adopting
IPv6 to adopt I-R at the same time.


>>>> wrote earlier, assuming VP routers advertise the VP in the DFZ, not
>>>> just in the I-R overlay network, then they are acting like LISP PTRs
>>>> or Ivip DITRs.  In order for them to do this in a manner which
>>>> generally reduces the path length from sending host, via VP router to
>>>> the IRON router which delivers the packet to the destination, I think
>>>> that for each VP something like 20 or more IRON routers need to be
>>>> advertising the same VP.
>>>
>>> No; those IRON routers that also connect to the DFZ
>>> advertise very short prefixes into the DFZ; they do
>>> not advertise each individual VP into the DFZ else
>>> there would be no routing scaling suppression gain.
>>
>> I think there would be, since each VP covers multiple individual
>> end-user network prefixes.  If there are 10^7 of these prefixes, and
>> on average each VP covers 100 of them, then there are 10^5 VPs and we
>> have excellent routing scalability, saving 9.9 million prefixes from
>> being advertised in the DFZ while providing 10 million prefixes for
>> end-user networks who use them to achieve portability, multihoming
>> and inbound TE.
> 
> That's good, but I think I'd still rather have the
> IDMs only advertise the highly-aggregated short prefixes.

OK.
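The aggregation arithmetic in the paragraph I quoted above is easy to
check; the 10^7 prefixes and 100-prefixes-per-VP figures are just the
example values from that paragraph, not measured data:

```python
# Routing-scaling arithmetic: if each VP (Virtual Prefix) covers many
# end-user (EID) prefixes, only the VPs - not the individual EID
# prefixes - would need to appear in the DFZ routing table.
eid_prefixes = 10**7                          # end-user network prefixes
prefixes_per_vp = 100                         # average EID prefixes per VP
vps = eid_prefixes // prefixes_per_vp         # 100,000 VPs
dfz_routes_saved = eid_prefixes - vps         # 9,900,000 prefixes kept out

print(vps, dfz_routes_saved)
```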


>>>> I interpret your previous sentence to mean that all the IRON routers
>>>> are part of the IRON BGP overlay network, and that each one will
>>>> therefore get a single best path for each VP.  That will give it the
>>>> IP address of one IRON router which handles this VP.  It won't give
>>>> it any information on the full set of IRON routers which handle this VP.
>>>
>>> Here, it could be that my cursory understanding of BGP
>>> is not matching well with reality. Let's say IRON routers
>>> A and B both advertise VP1. Then, for any IRON router C,
>>> C needs to learn that VP1 is reachable through both A and
>>> B. I was hoping this could be done with BGP, but I believe
>>> this could only happen if BGP supported an NBMA link model
>>> and could push next hop information along with advertised
>>> VPs. Do you know whether this arrangement could be realized
>>> using standard BGP?
>>
>> Sorry, I can't reliably tell you what can and can't be done with BGP
>> - I don't try to do anything special with it with Ivip.
>>
>> Still, if you assume that something could be done with BGP, consider
>> the potential scaling problems.  Somehow, for every one of X VPs, and
>>  for every Y IRON routers which handles a given VP, then you want
>> each IRON router to learn via BGP the address of every one of these
>> VP-advertising routers, and which VPs each one advertises.  This is
>> (X * Y) items of information you are expecting BGP to deliver to
>> every IRON router - so every BGP router needs to handle this
>> information.
>>
>> The scaling properties of this would depend on how you get BGP to do
>> it, and how many VPs there are, and how many IRON routers advertise
>> the same VP.
>>
>>
>>> If we are expecting too much with BGP, then I believe we can
>>> turn to OSPF or some other dynamic routing protocol that
>>> supports an NBMA link model. In discussions with colleagues,
>>> we believe that the example arrangement I cited above can
>>> be achieved with OSPF.
>>
>> OK . . . so you are considering using OSPF on the I-R overlay network
>> rather than BGP.  I can't discuss that without doing a lot of reading
>> - which I am not inclined to do.  But see below where I propose
>> methods of doing the registration within the limits imposed by BGP.
> 
> I will think about both routing alternatives more. But, if
> we use OSPF in the IRON overlay, routing would work the same
> way as at any other layer of RANGER recursion. The list of
> IDMs could be kept in the DNS under the special domain name
> "isatapv2.net" which I have set aside for this purpose. All
> other IRON routers can discover the list of IDMs by simply
> resolving the name "isatapv2.net".

OK.
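For what it's worth, the IDM discovery step you describe could be
sketched as below - assuming the IDM addresses are simply published as
ordinary address records under that well-known name, which is my
assumption since you haven't specified the record format:

```python
import socket

def discover_idms(name="isatapv2.net"):
    """Resolve a well-known name to a list of IDM addresses.
    Assumes the IDMs are published as A/AAAA records under `name`
    (an assumption - the actual publication format is unspecified)."""
    try:
        infos = socket.getaddrinfo(name, None)
    except socket.gaierror:
        return []          # name does not resolve: no IDMs discovered
    seen, addrs = set(), []
    for *_, sockaddr in infos:
        addr = sockaddr[0]
        if addr not in seen:       # deduplicate, preserving order
            seen.add(addr)
            addrs.append(addr)
    return addrs
```

An IRON router would re-resolve the name from time to time to pick up
changes in the IDM set.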


> The term "bubbles" came from teredo (RFC4380). Maybe we can
> think of a better term to use for IRON-RANGER?

OK.  I don't think "bubbles" is appropriate for the registration
methods you have described so far, or that I have suggested.


>> I am definitely not going to try to think about mixed IPv4/v6
>> implementations of I-R.  I can handle thinking about purely IPv4 and
>> purely IPv6.
> 
> I choose to think of mixed IPv4/IPv6 for at least three
> reasons:
> 
> 1) We already have global deployment of IPv4, and that won't
>    go away overnight when IPv6 begins to deploy.

I agree.

> 2) IPv4 is fully built-out, so new growth will come via IPv6.

I don't agree with this at all.  I think there's plenty of scope for
more growth in the IPv4 Internet.  Fig. 11 at:

  http://www.potaroo.net/tools/ipv4/

shows 130 /8s worth of space is currently advertised.  Fig. 5 shows
this in more detail.  Of the /8s up to 223, a handful can't be used
(127, and perhaps 0).  There are still a bunch of /8s which are
unadvertised.  As time progresses, this space will become too valuable
to keep using internally, and probably inefficiently - so I expect
quite a lot of it will be made available and advertised too.

Then there are ways of using space more efficiently, as Ivip, LISP
and probably IRON-RANGER could do, by slicing and dicing it into much
smaller chunks than is possible with the /24 limit on prefixes in the
DFZ.

I think that most growth in Internet usage will occur in the IPv4
Internet for at least the rest of this decade.  The only time it
would make sense to use IPv6 instead of direct IPv4 or IPv4 behind
NAT would be for some service where it wasn't important to be able to
connect to IPv4.  At present, you couldn't sell any such service. I
guess that it may be possible to do this for large IP cell-phone
deployments where there are enough IPv6 services available to do a
reasonable subset of what people want in a hand-held device, and
where tunneling to a server which provides behind-NAT IPv4
connectivity would also be possible.


> 3) IPv6 addresses can embed IPv4 addresses such that there
>    is stateless address mapping between an EID nexthop and
>    an RLOC.

Can you explain this with an example?  I can't clearly envisage what
you mean.

If I am to keep up with mixed IPv4/IPv6 IRON-RANGER, you will need to
explain things with detailed examples.
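To illustrate the kind of embedding I imagine you might mean - and this
is purely my own guess at the mapping, using the ISATAP-style interface
identifier, not necessarily the format you intend:

```python
import ipaddress

def isatap_style_address(prefix, ipv4):
    """Embed an IPv4 RLOC in an IPv6 address using the ISATAP-style
    interface identifier ::0:5efe:<ipv4> (RFC 5214).  This is one
    illustration of a stateless EID-nexthop <-> RLOC mapping; the
    actual IRON-RANGER format is an open question."""
    net = ipaddress.IPv6Network(prefix)
    v4 = ipaddress.IPv4Address(ipv4)
    iid = (0x00005EFE << 32) | int(v4)
    return ipaddress.IPv6Address(int(net.network_address) | iid)

def embedded_ipv4(v6):
    """Recover the IPv4 RLOC statelessly from the low 32 bits."""
    return ipaddress.IPv4Address(int(ipaddress.IPv6Address(v6)) & 0xFFFFFFFF)

# e.g. isatap_style_address("2001:db8::/64", "192.0.2.1")
#      -> 2001:db8::5efe:c000:201, from which 192.0.2.1 is recoverable
```

The point of such a scheme is that no mapping lookup is needed to get
from the IPv6 EID nexthop back to the IPv4 RLOC - it is a pure bit
manipulation in both directions.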


>> Since you have what to me is a new "DITR-like" router plan for
>> supporting packets sent from non-upgraded networks, there is no need
>> for the larger number of VP routers as I assumed in my previous
>> message.  As long as you have two or three, that should be fine, I think.
>>
>> There are two reasons an IRON router M might need to know about which
>> other IRON routers A, B and C advertise a given VP:
>>
>>  1 - When M has a traffic packet.  (M is either an ordinary IRON
>>      router and advertises the I-R "edge" space in its own network
>>      or it is a "DITR-like" router advertising this space in the
>>      DFZ.)  M needs to tunnel the packet to one of these VP routers.
>>
>>      The VP router will tunnel it to the IRON router Z it chooses as
>>      the best one to deliver the packet to the destination network
>>      and will send a "mapping" packet to M which will cache this
>>      information and from then on tunnel packets matching the
>>      end-user network prefix in the "mapping" to Z (or some other
>>      IRON router like Z, if there were two or more in the "mapping").
>>
>>      In this case, M needs only the address of one of the A, B or C
>>      routers.  Ideally it would have the address of the closest one -
>>      but it doesn't matter too much if it has the address of a more
>>      distant one.  That would involve a somewhat longer trip to the
>>      VP router, and perhaps a longer or shorter trip from there to Z.
>>      (This would typically be shorter than the path taken through
>>      LISP-ALT's overlay network.)
>>
>>      After M gets the "mapping", it tunnels traffic packets to Z - so
>>      the distance to the VP router no longer affects the path of
>>      traffic packets.
>>
>>      In this case, BGP on the overlay would be perfectly good - since
>>      it provides the best path to one of A, B or C - typically that
>>      of the "closest" (in BGP terms).
>>
>>
>>  2 - When M is one of potentially multiple IRON routers which
>>      delivers packets to a given end-user network - packets whose
>>      destination address matches a given end-user network prefix P.
>>
>>      M needs to "blow bubbles" (highly technical term from this
>>      R&D phase of IRON-RANGER) to A, B and C.  The most obvious
>>      way to do this is for M to be able to know, via the overlay
>>      network the addresses of all VP routers which advertise a given
>>      VP.  There may be two or three or a few more of these.  They
>>      could be anywhere in the world.
>>
>>      BGP does not appear to be a suitable mechanism for this, since
>>      its "best path" basic functions would only provide M with
>>      the IP address of one of A, B and C.
>>
>>      You could do it with BGP, by having A, B and C all know about
>>      each other, and with all three sending everything they get to
>>      the others.  This is not too bad in scaling terms for two,
>>      three or four such VP routers.
>>
>>      Then, M sends its registration to one of them - whichever it
>>      gets the address of via the BGP of the overlay network - and
>>      A, B and C compare notes so they all get the registration.
>>
>>      I will call this the "VP router flooding system".
> 
> This is a nice idea. If I get what you are suggesting, each
> IRON router that advertises the same VP (e.g., VP(x)) would
> need to engage in a routing protocol instance with one
> another to track all of the PI prefix registrations. The
> problem I have with it is that that would make for perhaps
> 10^5 or more of these little routing protocol instances as
> well as lots and lots of manually-configured peering
> arrangements between the IRON routers that advertise VP(x).

Something like this - but I am not sure what you mean by "routing
protocol instance".  I understand that the two or three VP routers
for any one VP "P" do need to cooperate and share their various
registrations.  You could either create a fresh protocol to do this,
or push into service some existing protocol, including perhaps a
routing protocol.

You haven't specified anything other than manual configuration for
how an IRON router becomes a VP router.  VP routers have extra
workload, so whoever runs such a router must have a reason to do
this, probably involving payment of money in some way from the
end-user networks whose EID prefixes are covered by this VP.

If there are two or three IRON routers acting as VP routers for a
given VP, then some organisation is responsible for that VP, is
collecting payments as described above and is therefore the one
organisation driving the existence of these two or three VP routers.
So manual configuration seems OK to me - I don't think there needs
to be a fancy automated system by which one VP router for a given VP
"P" would auto-discover any other VP router for "P" in the whole I-R
system.  However, these VP routers for the one VP do need to work
together to share registrations, and to quickly detect when one or
more of the set becomes unreachable.
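The "VP router flooding system" I proposed could look something like
this toy model - the message handling and peer configuration are my
own assumptions, since IRON-RANGER specifies none of this yet:

```python
import time

class VPRouter:
    """Toy model of one VP router: it accepts EID-prefix registrations
    from IRON routers and floods each one to its peers (the other VP
    routers for the same VP), so all of them hold the full set."""

    def __init__(self, name):
        self.name = name
        self.peers = []          # other VP routers for this VP (manual config)
        self.registrations = {}  # EID prefix -> (registering IRON router, time)

    def register(self, eid_prefix, iron_router, flood=True):
        self.registrations[eid_prefix] = (iron_router, time.time())
        if flood:
            for peer in self.peers:
                # Peers store the registration but do not re-flood it,
                # so the flood terminates after one hop.
                peer.register(eid_prefix, iron_router, flood=False)

# Routers A, B and C all advertise VP "P" and peer with each other.
a, b, c = VPRouter("A"), VPRouter("B"), VPRouter("C")
for r in (a, b, c):
    r.peers = [p for p in (a, b, c) if p is not r]

# IRON router M registers with whichever of A/B/C the overlay's best
# path points at; the other two learn the registration via the flood.
a.register("2001:db8:100::/48", iron_router="M")
```

With only two to four VP routers per VP, this all-to-all sharing stays
cheap; detecting an unreachable peer would need a separate keepalive,
which the model above omits.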


> For these reasons, I believe it is better for IRON router
> M to know about all three of A, B and C and direct bubbles
> to each of them. I think we can achieve this using OSPF
> with the NBMA link model in the IRON overlay.

OK - but I guess that means not running BGP.  I don't know anything
about OSPF or its scaling properties.  BGP has no central
coordination - something which is understandably attractive to many
people.  Does OSPF have central coordination, single points of
failure etc.?


> Please note: the EID-based IRON overlay is configured over
> the DFZ, which is using BGP to disseminate RLOC-based
> prefix information. So, it is BGP in the underlay and
> OSPF in the overlay - weird, but I think it works.

Yes, the DFZ uses BGP and the overlay uses . . . originally I-R used
BGP (a separate instance of BGP in each such router).  Also, IRON
routers don't need to be DFZ routers and in many or most cases are
not DFZ (BR) routers - but they all communicate via tunnels which are
carried between networks via the ordinary Internet (using the DFZ).

I guess these tunnels between IRON routers will need to be manually
configured, since they are typically between physically and
topologically nearby routers.

>>>> Also, this is just for 10 minute registrations.  I recall that the 10
>>>> minute time is directly related to the worst-case (10 minute) and
>>>> average (5 minute) multihoming service restoration time, as per our
>>>> previous discussions.  I think that these are rather long times.
>>>
>>> Well, let's touch on this a moment. The real mechanism
>>> used for multihoming service restoration is Neighbor
>>> Unreachability Detection. Neighbor Unreachability
>>> Detection uses "hints of forward progress" to tell if
>>> a neighbor has gone unreachable, and uses a default
>>> staletime of 30sec after which a reachability probe
>>> must be sent. This staletime can be cranked down even
>>> further if there needs to be a more timely response to
>>> path failure. This means that the PI prefix-refreshing
>>> "bubbles" can be spaced out much longer - perhaps 1 every
>>> 10hrs instead of 10min. (Maybe even 1 every 10 days!)
>>
>> OK, I am not sure if I ever knew the details of "Neighbor
>> Unreachability Detection" - but shortening the time for these
>> mechanisms raises its own scaling problems.
>>
>> Can you give some examples of how this would work?
> 
> I want to go back on this notion of extended inter-bubble
> intervals, and return to something shorter like 600sec
> or even 60sec. There needs to be a timely flow of bubbles
> in case one or a few IRON routers goes down and needs to
> have its PI prefix registrations refreshed.

OK - I will stay tuned for further details.
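The soft-state behaviour you describe - registrations kept alive by
periodic "bubbles" and expiring otherwise - can be modelled in a few
lines.  The 600-second interval is your figure; the three-interval
grace period before expiry is my assumption:

```python
BUBBLE_INTERVAL = 600          # seconds between refresh bubbles (Fred's figure)
EXPIRY = 3 * BUBBLE_INTERVAL   # grace period before a registration lapses
                               # (assumed - not specified in I-R)

def live_registrations(registrations, now):
    """Drop PI-prefix registrations whose last bubble is too old.
    `registrations` maps EID prefix -> timestamp of the last bubble."""
    return {p: t for p, t in registrations.items() if now - t < EXPIRY}

regs = {"2001:db8:100::/48": 1000.0,   # refreshed 200s ago at now=1200
        "2001:db8:200::/48": -5000.0}  # last bubble 6200s ago: lapsed

print(sorted(live_registrations(regs, now=1200.0)))
# ['2001:db8:100::/48']
```

The trade-off is the one we have been circling: a shorter interval
restores state faster after a VP router restart, but multiplies the
bubble traffic each VP router must absorb.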


>> At present, I can see these choices for this registration mechanism:
>>
>>   1 - Keep BGP as the overlay protocol and use my proposed "VP router
>>       flooding system".
>>
>>   2 - Retain your current plan of each IRON router like M needing to
>>       know the addresses of all the routers handing a given VP (A, B
>>       and C) which BGP can't do.  So you could:
>>
>>       2a - keep BGP and add some other mechanism.  Maybe M sends a
>>            message to the one of A, B or C it has a best path to,
>>            requesting the full list of all routers A, B and C which
>>            handle a given VP.  When M gets the list, it sends
>>            registration "bubbles" to the routers on the list.  This
>>            needs to be repeated from time-to-time to discover
>>            new VP routers.
>>
>>       2b - use something different from BGP which provides all the
>>            A, B and C router addresses to every IRON router, such as
>>            M.  This needs to dynamically change as A, B and C die and
>>            are restarted, or joined by others.
> 
> Right - I am still leaning toward OSPF with its NBMA
> link model capabilities. The good news is that the
> IRON topology itself should be relatively stable, so
> not much churn due to dynamic updates.

OK.  Since the IRON routers have their own IP addresses and are
generally in networks multihomed by existing BGP techniques, then any
outages don't affect the IRON routers' IP addresses or their
tunneling arrangements.  There would still be transitory breaks in
connectivity, before the BGP multihoming arrangements kick in.  If
you could ignore those by some means in the overlay's routing system
(BGP or OSPF) then yes, the IRON routers should be pretty stable.



>> OK - but you still need to design a registration mechanism before we
>> can think in detail about scaling.
> 
> Let's forget about the DHCP lease lifetimes analogy
> for a bit and get back onto the assumption that the
> inter-bubble interval is the mechanism that keeps
> PI registrations refreshed in a timely fashion.

OK.

  - Robin