[rrg] LISP-ALT mobility and scaling to 10M, 100M, 1B etc. EIDs?

Robin Whittle <rw@firstpr.com.au> Tue, 10 March 2009 02:42 UTC

Message-ID: <49B5D32F.2040804@firstpr.com.au>
Date: Tue, 10 Mar 2009 13:40:47 +1100
From: Robin Whittle <rw@firstpr.com.au>
Organization: First Principles
User-Agent: Thunderbird 2.0.0.19 (Windows/20081209)
MIME-Version: 1.0
To: RRG <rrg@irtf.org>
Content-Type: text/plain; charset="ISO-8859-1"
Content-Transfer-Encoding: 7bit
Subject: [rrg] LISP-ALT mobility and scaling to 10M, 100M, 1B etc. EIDs?
List-Id: IRTF Routing Research Group <rrg.irtf.org>

Hi David,

I am replying to part of what you wrote in:

   LISP Map Server I-D & updated draft-farinacci-lisp- 2 stages of
   caching mapping

I have not yet succeeded in prompting you or any other LISP-ALT folks
to describe how the ALT network will scale to handle however many
EIDs, Map-Servers, Map-Resolvers etc. there would be in a
full-scale deployment.

Here is another, more explicit, attempt.  I could go into more detail
still, but the following should suffice.

By describing more fully my best imagined understanding of how ALT
could scale and then pointing out the problems I see in this, I hope
you and your colleagues will point out why my critique doesn't apply
- most likely by providing a fully detailed description of how the
whole ALT network would really work, and work well.

You wrote:

>   Robin,
>
>>     Map-Servers needing to make GRE tunnels to a large
>>     number (hundreds, thousands) of level 1 ALT
>>     aggregation routers - and likewise those routers
>>     needing to make GRE tunnels to large numbers of
>>     Map-Servers.
>
>   Actually, those numbers (100s, 1000s) are speculation.

In the absence of more detailed plans from the LISP team, all I can
do is speculate.


>   There are many ways folks might deploy this technology
>   (e.g, with hierarchy of some kind, as one example). So
>   lets see what how people actually deploy stuff before
>   speculating on what they might do (and I would strike the
>   not-so-parenthetical comment from future discussions).

The scaling problems will only occur when you get to a million EIDs,
10 million, a billion or whatever.

I understand you are planning on developing some experimental
LISP-ALT, Map-Server (and -Resolver) RFCs by mid-2010:

  http://www.ietf.org/mail-archive/web/lisp/current/msg00251.html

The only way you will get the scaling problems is with a real global
deployment, so these problems won't arise in any trial deployment
arising from experimental I-Ds and RFCs.

I don't find your:

   "So lets see what how people actually deploy stuff before
    speculating ..."

at all convincing for a proposal you presumably think has a chance of
being the best possible solution to the routing and addressing
scaling problem.

The ALT I-D http://tools.ietf.org/html/draft-fuller-lisp-alt-00 of
2007-10 was published 3 months after the first APT and Ivip I-Ds.

Now you are preparing to race ahead with ALT in an IETF WG.  That's
fine, since it is for experimental purposes.

But right from the start, before you published the I-D, surely you
would have been aware of the scaling problems of the ALT network,
which I attempt to describe later in this message.

The sole justification for ALT or any other global query server
approach (CONS and TRRP) over a scheme with the full mapping database
in query servers at each ISP (APT and Ivip, or LISP-NERD with the
full database in each ITR) is that ALT can scale to arbitrarily large
numbers of EIDs.

Indeed, ALT, CONS and TRRP can scale arbitrarily in this regard.

But there's no problem with scaling APT's or Ivip's local query
servers to 10 million or so EIDs, end-user networks etc.

Even with the complex multi-RLOC mapping format of APT (which
resembles that of LISP) and even with long IPv6 EIDs and RLOCs, the
complete mapping data could easily fit in a few gigabytes of RAM in
today's COTS servers.

For an ISP, the rate (average and peak) of mapping updates is not a
problem either.  Nor is the cost of the update stream's bandwidth.
The entire mapping database would be no bigger than an HD movie
download.  So if there were 10% churn per day, the cost to the ISP of
getting the updates per day would be their wholesale cost of 0.5 to
1.0 Gbytes of incoming traffic, which is trivial.
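
To make that arithmetic explicit, here is a minimal sketch.  The
database sizes of 5 and 10 Gbytes are my assumption for what "a few
gigabytes" means, the 10% churn figure is the one used above, and the
function name is purely illustrative.

```python
def daily_update_gbytes(db_gbytes: float, churn_per_day: float) -> float:
    """Gbytes of mapping updates an ISP would receive per day.

    db_gbytes     - size of the full mapping database (assumed value)
    churn_per_day - fraction of the database which changes each day
    """
    if db_gbytes < 0 or not 0.0 <= churn_per_day <= 1.0:
        raise ValueError("size must be >= 0 and churn must be in [0, 1]")
    return db_gbytes * churn_per_day


if __name__ == "__main__":
    for size in (5.0, 10.0):  # assumed database sizes in Gbytes
        print(size, "Gbyte database:",
              daily_update_gbytes(size, 0.10), "Gbytes of updates per day")
```

With these assumed sizes, 10% churn gives the 0.5 to 1.0 Gbytes per
day mentioned above.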


If the frequent criticism of other proposals such as APT and Ivip
from the LISP-ALT camp is to have any credibility, you need to show
that LISP-ALT will be scalable to some much higher number of EIDs,
end-user networks etc.

Before you do this, however, I think you need to demonstrate a
realistic demand scenario for larger numbers, such as 100M, 1B, 5B or
10B.

I propose that we consider "~~10M" (meaning 5, 10, 20 maybe 30
million) as the absolute maximum number of fixed (non-mobile)
end-user networks which will ever conceivably want or need
multihoming or portability.

The LISP-ALT plans seem to assume this conventional kind of end-user
network - a fixed network, with two or more links to two or more
ISPs.  That's fine, but it seems you are not planning for whatever
other classes of end-user network would make up the balance of 100M,
1B, 5B etc.

I propose we agree to something like this:

   1 - If and when a core-edge separation approach to the routing
       and addressing scaling problem ever has significantly more
       EIDs than ~~10M, those in excess of this figure will be
       for mobile end-user networks, which physically consist of
       a portable device, with some mixture of wired and
       wireless Ethernet and other forms of wireless connectivity.

   2 - Such mobile end-user networks are likely to have only one
       EID prefix (micronet).

   3 - Therefore, only those core-edge separation schemes which
       provide some Mobility benefits by providing Mobile Nodes (MNs)
       with their own EID (micronet) will need to scale beyond ~~10M
       EIDs etc.

Does that sound reasonable?  I will proceed on the assumption it is.


I think the LISP-ALT team needs to show how LISP-ALT could provide
mobility benefits to 1B or whatever MNs before you criticise other
schemes for being unable to scale to the high number of EIDs expected
- implicitly 100M, 1B, 5B, 10B etc.


How would LISP-ALT support mobility?

You clearly can't have the ETRs in the end-user network because its
Care of Address (CoA) is constantly changing.  This is at odds with
the recently stated LISP-ALT team position that the ETRs are
definitely in the end-user network.

Nor can you support mobility with the ETRs in the "ISP" - the access
network.  This is because a MN is frequently changing access networks
and may have no lasting business relationship at all with its access
networks.  (For instance WiFi hot-spots, or plugging a laptop into
the network at a friend's home, which gives the laptop an address
behind NAT via DSL or whatever.)

I figured this out in June 2007 and proposed the TTR (Translating
Tunnel Router) architecture:

  http://www.ietf.org/mail-archive/web/ram/current/msg01518.html

Steve Russert and I wrote this up fully in August last year:

  http://www.firstpr.com.au/ip/ivip/TTR-Mobility.pdf


We put the ETR near the MN, but not in the access network.  We call
this new kind of ETR a TTR.

The MN owner pays a company which has a network of TTRs.  As far as
the core-edge separation scheme is concerned, they are identical to
any other ETR - in terms of being able to tunnel packets to the
end-user network.

The MN makes a two-way encrypted tunnel to a nearby (e.g. 1000km or
less) TTR and uses it as an ITR too.  Then the mapping of the MN's
one or more EIDs (micronets) is changed so all ITRs tunnel packets to
this TTR.

Most people seem to assume:

   (map-encap + mobility) = mapping change for every change of
                            access network

but that is on the assumption that the ETR must be in the ISP (access
network) or in the end-user network (MN).

TTR mobility does not involve rapid changes to mapping.  Most people
don't move more than 1000km (or whatever) very often.  There's no
absolute need for a mapping change when they do - but by using a TTR
which is closer to wherever they move to, overall path lengths are
reduced.


LISP-ALT can work with the TTR Mobility architecture.  However, you
will need to make something other than the ETRs the authoritative
source of mapping.  In practice, the MN owner whose EID is being
mapped will enable the TTR company to control their mapping.  This
should be fine - the TTR company will probably run its own
Map-Servers and so be a part of the ALT network.

I think Ivip would support mobility better than APT, because Ivip's
essentially real-time control of ITR behavior enables a rapid
selection of a new TTR compared to the delays inherent in the
slow push mapping distribution system of APT.

I think LISP-ALT could work quite well with the TTR mobility
architecture.  Mapping changes are not frequent, but it would be best
if they were propagated quickly.  The current ETR (TTR) could prompt
currently active ITRs to update their mapping after the mapping has
changed to the new TTR.  So LISP-ALT could have a fast response to
mapping changes like Ivip.


I think you have not yet established a realistic demand scenario for
LISP-ALT handling more than ~~10M EIDs.  Until you do so (and I think
you can, with TTR mobility) I don't think you should criticise other
schemes for not being able to scale to 100M or more EIDs.


Before making such criticisms, I think you should also describe in
detail how the ALT network is going to scale to 100M, 1B or whatever
EIDs.  Presumably, for most of these EIDs, the end-user network will
only have one or at most a few EIDs, so you are also discussing how
the system would scale to about this number of separate end-user
networks.

As far as I can see, you need to discuss this in terms of LISP-ALT
supporting at most ~~10M fixed networks (the type you seem to assume
in ALT development at present) and the balance of the 100M, 1B etc.
being mobile end-user networks.

Ignoring for a moment where the ETRs are and what devices are the
authoritative sources of mapping, here are the scaling challenges I
think the LISP-ALT team needs to address.


 1 - The challenge of ensuring robustness in a highly aggregated
     network.

     The ALT network is defined as being highly aggregated.  This
     is the only way to ensure (on average) a minimal number of ALT
     routers in the path from one ITR (or Map-Resolver), up the
     ALT hierarchy to whatever level it is fully meshed, and then
     down the hierarchy to the ETR (perhaps via a Map-Server).

        (Not all Map-Requests will need to ascend to the fully
         meshed level, but enough of them will for this to be a
         major limiting factor in overall performance of the ALT
         network.)

     High aggregation is easy to achieve:

        1 -  Make any one ALT router the sole one which handles
             some prefix.

        2 -  Make it advertise that prefix to its single upstream
             router as a single entity (so the constituent
             longer prefixes don't propagate above to the next
             level).

        3 -  Make sure all the lower level routers in the hierarchy
             which handle the longer prefixes that this prefix is
             split into are its neighbours.

             This is easy to do with GRE tunnels, so the ALT
             routers don't need to be physically close or connected.

     The hierarchy is then a simple upside-down tree - and at
     some level near the theoretical top, all the routers at that
     level are fully meshed - directly connected to each other.

     The question now is how do you achieve this maximal level of
     aggregation and make it robust against the failure of any one
     router or link?

     With this simple model:

       Level

       N + 1                  X    a.b.0.0/16
                              |
                  -----------------------------------
                  |               |                 |
                  |               |                 |
       N          A  a.b.0.0/20   B a.b.16.0/20 ...  P  a.b.240.0/20
                                  |
                  -----------------------------------
                  |               |
                  |               |
       N - 1      Q  a.b.16.0/24  R  a.b.17.0/24 ...


     the router at each level aggregates a bunch of prefixes - 16 in
     the above example - and presents them as one larger block of
     space (shorter prefix) to the level above.
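
     As a minimal sketch of this aggregation, the following uses
     Python's standard ipaddress module.  10.1.0.0/16 is only a
     stand-in I have chosen for a.b.0.0/16, and the helper name is
     illustrative.  It splits the /16 into the 16 /20s handled at
     level N, then splits the second /20 (router B's) into the 16
     /24s handled below it.

```python
import ipaddress


def split_prefix(prefix: str, new_prefixlen: int) -> list:
    """Return the longer prefixes which `prefix` divides into."""
    network = ipaddress.ip_network(prefix)
    return list(network.subnets(new_prefix=new_prefixlen))


# 10.1.0.0/16 is an assumed stand-in for a.b.0.0/16 (router X).
level_n = split_prefix("10.1.0.0/16", 20)     # prefixes of routers A .. P
below_b = split_prefix(str(level_n[1]), 24)   # prefixes below router B

if __name__ == "__main__":
    print(len(level_n), "prefixes at level N, first:", level_n[0],
          "last:", level_n[-1])
    print(len(below_b), "prefixes below B, first:", below_b[0])
```

     Each router advertises only its own shorter prefix upwards, so
     X sees 16 routes rather than 256.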

     This network has optimal aggregation - but how can it tolerate
     failures?

     Assuming the GRE tunnels are statically defined ...

         (If they are dynamically established, you need a really
          secure system for orchestrating this.)

     ... the system can't reconfigure itself if the B router fails,
     to somehow put in place a replacement for B, in a totally
     different location.

         (A different location is needed, since maybe the outage
          is a persistent - more than a few seconds - failure which
          affects wherever B was located.)

     Do you propose that each "N - 1" router below A to P have
     additional, already established, GRE tunnels to router X?  That
     doesn't sound scalable.

     What if some router Q below B, on level "N - 1", can no longer
     reach B?

     Do you suggest every router such as Q and R also have GRE
     tunnels to some other level N routers?  Maybe this is possible,
     with the non-B level N routers somehow not advertising the
     longer prefix from Q and R etc. unless Q and R cannot reach B?

     I am trying to make up answers to these questions on your
     behalf.  It would be better if the LISP-ALT team explained how
     this would work in a fully deployed system serving 100M, 1B etc.
     EIDs.

     For some number of EIDs, such as 100M or 1B, this would involve
     giving realistic details about:

     a - How many GRE tunnels and so BGP neighbours each ALT
         router might have.

     b - The aggregation factor per level.

     c - Whether Map-Resolvers and Map-Servers are single-homed
         to a single ALT router, or whether they are multihomed
         and so have to take part in the full BGP of the ALT network.

     d - How many levels there would be.

     e - How many ALT routers there would be in the highest
         level - the one where all at that level connect to
         each other in a fully meshed manner.

     f - How many separate ETRs and ITRs the Map-Resolvers
         and Map-Servers might handle.

     g - Estimates of peak and average query volumes in a fully
         deployed (100M, 1B, 5B etc.) network.

     h - Therefore, estimates of traffic volumes at the top
         level, where quite a lot of the packets will need
         to ascend to in order to descend towards their
         destination.

     i - Likely business scenarios for who would own and run
         all these ALT routers.  This includes how they would
         be paid for the traffic flowing on them, and where
         geographically, these routers would be located.

     j - Likely scenarios for physical locations of Map-Servers
         regarding the EIDs of the ETRs in their networks and
         the physical location of the lowest level ALT router
         for the encompassing prefix of that EID - which I guess
         the Map-Server will need to make a GRE tunnel to.
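
     To show the kind of estimate I mean for items a, b, d and e,
     here is a rough sketch.  It assumes a pure tree in which every
     ALT router aggregates the same number of prefixes from the level
     below.  That uniform aggregation factor and both function names
     are my assumptions, not anything from the LISP-ALT I-Ds.

```python
def alt_levels(total_eid_prefixes: int, aggregation_factor: int) -> list:
    """Number of ALT routers at each level, lowest level first.

    Assumes every router aggregates exactly `aggregation_factor`
    prefixes from the level below (a hypothetical, uniform tree).
    """
    if total_eid_prefixes < 1 or aggregation_factor < 2:
        raise ValueError("need at least 1 prefix and a factor of 2 or more")
    counts = []
    remaining = total_eid_prefixes
    while remaining > 1:
        # Ceiling division: a partly filled router still counts as one.
        remaining = -(-remaining // aggregation_factor)
        counts.append(remaining)
    return counts


def full_mesh_tunnels(routers: int) -> int:
    """GRE tunnels needed to fully mesh `routers` routers at one level."""
    return routers * (routers - 1) // 2


if __name__ == "__main__":
    levels = alt_levels(1_000_000_000, 16)
    for depth, count in enumerate(levels, start=1):
        print("level", depth, "routers:", count,
              "tunnels if fully meshed:", full_mesh_tunnels(count))
```

     The level chosen for the full mesh then sets the trade-off
     between the number of levels a Map-Request must traverse and the
     number of tunnels each top-level router must maintain.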

 2 - The long path problem:

     Based on the above, you should also be able to estimate
     average and worst-case travel times, according to the
     number of ALT levels and routers the request needs to
     traverse before it gets to the ETR.

     With the geographical information and some estimate of the
     underlying structure of the DFZ between the ALT routers,
     you should be able to estimate actual travel times and
     the cumulative sensitivity of packet loss to the number of
     DFZ routers in the entire path.

     More on this problem at:

     LISP-ALT's long path problem yet again   2008-12-24
     http://www.ietf.org/mail-archive/web/rrg/current/msg04097.html
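
     A minimal sketch of the packet loss part of that estimate
     follows.  It assumes each DFZ router hop drops packets
     independently with the same probability, which is a
     simplification of mine.  The function name and the example
     figures are illustrative only.

```python
def path_loss(per_hop_loss: float, hops: int) -> float:
    """Probability that a Map-Request is lost somewhere on the path.

    Assumes `hops` independent hops, each losing a packet with
    probability `per_hop_loss` (a simplifying assumption).
    """
    if not 0.0 <= per_hop_loss <= 1.0 or hops < 0:
        raise ValueError("loss must be in [0, 1] and hops must be >= 0")
    return 1.0 - (1.0 - per_hop_loss) ** hops


if __name__ == "__main__":
    # Illustrative figures only: 0.1% loss per hop, various path lengths.
    for hops in (10, 30, 100):
        print(hops, "hops:", path_loss(0.001, hops))
```

     The point is simply that loss accumulates with every extra DFZ
     router in the path, so longer ALT paths mean more Map-Requests
     which never get a reply.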

     Many people are concerned about some initial packets being
     delayed until the ITR gets a mapping reply from the global
     ALT network.  I understand a typical scenario would be to
     drop the initial packet if there was no mapping, and wait for
     the sending host to send another packet, by which time
     hopefully the mapping would have arrived.  In that case, the
     delay would be longer than the mapping reply delay.  Also,
     when the sending host tries again, it may not try the same
     destination host - it might try some other one, in which
     case the whole dropped packet and delay process could
     repeat itself as often as the sending host tries to send
     packets to different destination hosts.

     Also, this behavior of the ITR dropping a first packet and
     waiting for a retry might encourage application programmers
     to fire off a series of packets, to minimise the time after
     the mapping arrives before one of them is tunneled to the
     correct ETR.

     Alternatively, you could have ITRs buffer all packets and
     send each one whenever the mapping arrives.

     (Sometimes, due to dropped packets in the ALT network,
      a reply will never arrive to a single request.)

     But then, the ITR may be sending a packet well after the
     sending host sent it, which can lead to duplicate
     packets, extra reply traffic from the destination host,
     and other forms of waste and confusion.

     This long-path problem doesn't just affect a single packet.
     It can affect the first packets in each direction between
     two hosts on EID addresses.  Also, since end-user networks
     will often run their own DNS servers, it could delay DNS
     lookups too.  There can be a delay in getting the
     DNS request to the DNS server when the DNS server is on
     an EID the ITR has no mapping for.  If the request
     comes from a host on an EID address, the DNS server's
     ITR will probably also need to do a mapping lookup.

     So you could have the delay problem four or more times in
     a row.  This is assuming that the .com.au server is not on
     an EID address, which is reasonable.

     Hosts A and B are on EID addresses.

         A  ------------------->      .com.au DNS server
         A <------------- DELAY       .com.au DNS server

         A  DELAY ------------->   xxx.com.au DNS server
         A <------------- DELAY    xxx.com.au DNS server

         A  DELAY ------------->   www.xxx.com.au
         A <------------- DELAY    www.xxx.com.au
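
     The sequence above can be tallied with a small sketch.  It
     assumes no ITR has any of the needed mappings cached and that
     every mapping lookup adds the same delay.  The 100 ms figure
     and the names are hypothetical, chosen only for illustration.

```python
# Each leg: (sender, receiver, True if the sender's ITR needs a lookup).
LEGS = [
    ("A", ".com.au DNS server", False),
    (".com.au DNS server", "A", True),
    ("A", "xxx.com.au DNS server", True),
    ("xxx.com.au DNS server", "A", True),
    ("A", "www.xxx.com.au", True),
    ("www.xxx.com.au", "A", True),
]


def extra_delay_ms(legs: list, lookup_ms: float) -> float:
    """Total delay added by mapping lookups over all legs.

    Assumes nothing is cached and each lookup costs `lookup_ms`.
    """
    return sum(lookup_ms for _src, _dst, needs_lookup in legs if needs_lookup)


if __name__ == "__main__":
    # 100 ms per lookup is an assumed figure, not a measurement.
    print("extra delay:", extra_delay_ms(LEGS, 100.0), "ms")
```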


Sorry to break your reply up with such a long suggestion.

>   BTW, the RRG (or IETF, or ITU-T, or ATIS, ...) can
>   specify any technology they want. 

Sure, but before anyone can be confident of LISP-ALT being a viable
solution to the routing scaling problem, or even the best solution,
they would want the designers to describe at least one way these
fundamental challenges can be decisively resolved.


>   However, SPs if deploy
>   it, they will deploy it in the way that makes most
>   business sense for their particular predicament, and that
>   may or may not bear any relationship to what is said in
>   SDO deliberations.  And unfortunately, we have precious
>   little input from those folks our (SDO) fora.

Yes, but before any Standards Development Organisations, the IETF
included, will be motivated to take LISP-ALT seriously as a practical
solution to the routing scaling problem, you will need to show
convincing plans for every aspect of its operation when fully deployed.

Since you criticise other approaches for not being able to scale, you
must have some figure like 100M, 1B, 5B or whatever EIDs in mind.

So I think you need to show technically how LISP-ALT could scale like
this, and remain robust.

Then I think you need to describe at least one set of business
arrangements where everyone involved is sufficiently motivated to
play their part - in a way in which the great majority of existing
and new end-user networks will voluntarily adopt LISP-ALT-mapped
address space.


>>      Two Map-Servers for an end-user network's two ETRs
>>      will generally link directly (via a GRE tunnel) to
>>      the same level 1 ALT aggregation router.  There,
>>      the router will (I guess) send packets for this
>>      network's EID prefix only to one of these, since they
>>      both have the same AS hop count of 1.  Which will
>>      depend on the AS number of the ISP.
>
>   I don't know what level 1 ALT means. The ALT doesn't
>   carry level identifiers (we thought about this in CONS
>   but discarded it for various reasons). And again, just
>   like everything else, you build scalability with
>   hierarchy. Like every hierarchy, there will be a "top
>   level", but not everyone will or needs to connect to such
>   a top level.

I meant "level 1" as the lowest in the inverted tree structure I
depicted above.

I will respond to the rest of your reply in the original thread.

  - Robin