Re: [rrg] LMS alternative critique

Robin Whittle <rw@firstpr.com.au> Thu, 25 February 2010 02:46 UTC

Message-ID: <4B85E519.3080609@firstpr.com.au>
Date: Thu, 25 Feb 2010 13:48:57 +1100
From: Robin Whittle <rw@firstpr.com.au>
Organization: First Principles
To: RRG <rrg@irtf.org>
References: <4B7F9E39.2030800@firstpr.com.au> <4eb512451002240117y4fe3a056r6376981034c9ca5@mail.gmail.com>
In-Reply-To: <4eb512451002240117y4fe3a056r6376981034c9ca5@mail.gmail.com>

Hi Letong,

Thanks for your reply, in which you wrote:

> Hi Robin:
>
>     Thank you for your critique. My response is inline.

>     Layered Mapping System
>     ----------------------
> 
>     LMS is a step towards designing a complete Core-Edge Separation routing
>     scaling solution for both IPv4 and IPv6, somewhat along the lines of
>     LISP-ALT.
> 
>     There are insufficient details in the proposal to evaluate how well the
>     basic infrastructure of ITRs and ETRs would work, considering the
>     unknown nature of mapping delays, 
> 
> We did a simulation of LMS using real data collected from a
> campus border router to show that, when ITRs are equipped with the
> two-stage cache mechanism (as stated in our proposal), the number of
> request hops is small (94% no hops: cache hits; 5.5% two hops; 0.5%
> four hops).  These hops are logical, since mapping servers
> talk to each other over tunnels, while the delay between two random
> nodes may not be unacceptable (one estimate is 115 ms [1]).  The
> redundancy of logical mapping servers may help to reduce the delay
> between two mapping servers, since a mapping server may choose a
> nearby server to communicate with.
>  
> [1]: S. Ratnasamy, P. Francis, M. Handley, R. Karp, and S. Shenker, "A
> scalable content-addressable network," in Proc. of ACM SIGCOMM '01, San
> Diego, CA, USA, Aug. 2001, pp. 161-172.

I am not sure how this paper:

  http://www.cs.cornell.edu/People/francis/p13-ratnasamy.pdf

relates to the LMS proposal or to an ALT-like structure.  Nor have
you explained how your system would incorporate multiple physical
MSes (Map Servers), which would presumably be reachable by your
system in topologically different ways - both in terms of the LMS
routers and their tunnels, and in terms of the actual paths those
tunnels take across DFZ and other routers and their physical data links.

Because your structure is highly compressed, with few levels, I am
not surprised that you can get small numbers of hops.  ITR caching
will of course mean that for a sequence of two to many thousands of
packets sent from one host to another, there would be a single
mapping lookup - or perhaps no lookup, due to the ITR already having
the mapping it needs in its cache.
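
For concreteness, here is a minimal sketch of how I read the two-stage
cache at an ITR - a mapping cache plus a cache of leaf map server
locators.  The structures and function names are purely illustrative,
not taken from your proposal:

  # Hypothetical sketch of a two-stage ITR cache: stage 1 caches mappings
  # (edge prefix -> ETR locator), stage 2 caches the locator of the leaf
  # map server responsible for a prefix, so a stage-1 miss can often be
  # resolved by querying that leaf map server directly rather than
  # traversing the whole LMS hierarchy.  Illustrative code only.

  def query_leaf_ms(ms_addr, prefix):
      # Stand-in for a direct request to a cached leaf map server (~2 hops).
      return "etr-for-" + prefix

  def query_via_root(prefix):
      # Stand-in for a request routed up and down the LMS tree (~4 hops);
      # returns the mapping plus the responsible leaf map server's locator.
      return "etr-for-" + prefix, "ms-for-" + prefix

  class TwoStageItrCache:
      def __init__(self):
          self.mappings = {}     # prefix -> ETR locator      (stage 1)
          self.ms_locators = {}  # prefix -> leaf MS locator  (stage 2)

      def lookup(self, prefix):
          if prefix in self.mappings:            # stage 1 hit: no query
              return self.mappings[prefix]
          if prefix in self.ms_locators:         # stage 2 hit: short query
              mapping = query_leaf_ms(self.ms_locators[prefix], prefix)
          else:                                  # both miss: full traversal
              mapping, ms = query_via_root(prefix)
              self.ms_locators[prefix] = ms
          self.mappings[prefix] = mapping
          return mapping

  cache = TwoStageItrCache()
  print(cache.lookup("192.0.2.0/24"))   # first packet: full traversal
  print(cache.lookup("192.0.2.0/24"))   # subsequent packets: cache hit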

When I wrote:

> There are insufficient details in the proposal to evaluate how well
> the basic infrastructure of ITRs and ETRs would work, considering
> the unknown nature of mapping delays,

I meant that the "mapping delays" are unknown, in part
because I don't understand how the system would work at all - how
routers could maintain millions of BGP sessions and tunnels, how
redundant routers and MSes could be accommodated, etc.


>     reliability of the global mapping system,
> 
> The reliability of the global mapping system lies in its
> redundancy and in efficient communication between mapping servers.
> Testing this needs a real environment and an actual implementation;
> however, I do not see a logical fault in the layered mapping system.
> Perhaps we can draw lessons from DNS, which runs effectively.

I think you are just providing vague references to DNS, when you need
to show exactly how your newly proposed structure would work, be
robust against single points of failure, and be possible to implement
with realistic amounts of RAM, CPU power, etc.

>     the problems of Path MTU Discovery (due to tunneling via
>     encapsulation)
>  
> The MTU issue is inherent in the map-and-encap scheme.  However, address
> rewriting needs intermediate routers to change the packet headers, which
> is insecure and brings many issues such as checksum
> conformity; combining both core and edge addresses in one form only
> handles the problem by shrinking the size of the namespace.  You have
> proposed to estimate the MTU along the tunnels, yet I do not know
> whether that is effective.

If you want to have a complete proposal for scalable routing, I think
you need to show exactly how you would solve any problems such as
PMTUD for the tunnels you are using.  I think you need to specify
exactly what tunneling protocols you would use, and how they handle
PMTUD for IPv4 and IPv6, including when these tunnels themselves are
going through other tunnels you can't control or foresee.


>     and difficulties with an ITR reliably probing
>     reachability of the destination network via multiple ETRs.
> 
> An ITR receives multiple mappings in the response, each with a different
> priority.  The ITR can select the most preferred ETR to forward packets
> to, while using the others as backups.

Yes - this is the basic LISP approach.  But how does the ITR decide
which ETRs can currently send packets to the destination network?
The LISP people have been grappling with this since LISP's inception
three or so years ago.  The ITR doesn't have a way of probing through
the ETR to a host or router in the destination network, since it
doesn't have any information on what IP address to test, or how to
probe reachability to them.
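
To make the difficulty concrete, here is a hedged sketch of the kind of
priority/weight selection a LISP-style ITR performs.  The code and its
is_reachable() test are my own illustration - and that test is exactly
the part nobody knows how to do reliably, because the ITR has no address
inside the destination network it knows it may probe:

  # Illustrative sketch only: LISP-style ETR selection by priority (lower
  # is preferred) and weight (load sharing among equal priorities).

  import random

  def is_reachable(etr_address):
      # The unsolved part: probing the ETR itself says nothing definite
      # about whether packets can get from that ETR into the destination
      # network.  Returning True here is a placeholder, not a solution.
      return True

  def choose_etr(mapping):
      # mapping: list of (etr_address, priority, weight) tuples
      usable = [m for m in mapping if is_reachable(m[0])]
      if not usable:
          return None
      best = min(priority for _, priority, _ in usable)
      candidates = [(addr, w) for addr, p, w in usable if p == best]
      pick = random.uniform(0, sum(w for _, w in candidates))
      for addr, w in candidates:
          pick -= w
          if pick <= 0:
              return addr
      return candidates[-1][0]

  mapping = [("203.0.113.1", 1, 50), ("198.51.100.7", 1, 50),
             ("192.0.2.9", 2, 100)]
  print(choose_etr(mapping))   # one of the priority-1 ETRs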

With Ivip, the ITRs make no such decisions.  The mapping is a single
ETR address.  The end-user whose network this is either controls the
mapping themselves, or appoints a multihoming monitoring company to
probe reachability however the end-user likes, and then controls the
mapping to restore service in the event of a failure of an ETR, its
ISP or the link from the ETR to the destination network.


> I do not know whether the following answers your question; I write it
> down for clarity.  A mapping server is the
> authoritative source for the mapping information of the edge addresses
> it is in charge of.  It should (and could) know the connectivity of the
> ETRs and of those edge addresses in a timely way.  When an ITR caches
> the locator information of mapping servers which it thinks may be useful
> (as suggested in the two-stage cache mechanism), the ITR can
> periodically ask those mapping servers for the mapping information it
> is interested in, to get current reachability information.

You could do this.  In this scenario there are no options in the
mapping - the mapping would contain a single ETR address, and the
mapping server somehow tests reachability and changes the mapping
accordingly.  But then, in order to achieve, for instance, 30-second
response times to outages, every ITR in the world would need to
request mapping every 30 seconds (as long as it is still handling
packets addressed to the destination network).  This will place a
huge load on your system, especially since I think you are sending
traffic packets through your network of tunnels and routers as
data-driven mapping requests, so the packets get delivered via the
overlay network.  Even if you just sent short mapping requests, rather
than data packets, you would still have an enormous load on the
mapping network and the map servers.
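
A back-of-envelope estimate of that refresh load - every number below
is my own assumption, not a figure from your proposal - illustrates the
scale of the problem:

  # Rough estimate of the steady query load if every ITR re-requests the
  # mapping of each active destination prefix every 30 seconds.  The ITR
  # count and per-ITR cache activity are assumptions for illustration.

  itrs             = 200_000   # assumed ITRs worldwide at full adoption
  active_per_itr   = 5_000     # assumed EID prefixes active in each ITR's cache
  refresh_interval = 30        # seconds, to get ~30 second restoration times

  queries_per_sec = itrs * active_per_itr / refresh_interval
  print(f"{queries_per_sec:,.0f} mapping requests per second system-wide")
  # ~33 million requests/s under these assumptions, each of which must be
  # answered by the map server authoritative for that prefix.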

LISP doesn't do this.  It returns mapping information with longer
caching times than the desired times for multihoming service
restoration.  The mapping consists of multiple ETR addresses with
priorities.  Then the difficulty is how the ITRs decide which of the
ETR addresses to use.  As noted above, Ivip is completely different -
and separates out the reachability testing and mapping
decision-making from the CES architecture itself, making it the
responsibility of the end-user network.

>     Most of the proposal concerns a new global, distributed, mapping query
>     system based on the ALT concepts of each node in the tree-like structure
>     being a router, using tunnels to other such nodes, and all such nodes
>     using BGP to develop best paths to neighbours.
> 
>     By a series of mathematical transformations the designers arrive at
>     certain design choices for IPv4 and IPv6.  For IPv4, if the entire
>     address space was to be mapped (and at present there is no need to do so
>     beyond 224.0.0.0) there would be 256 Layer 2 nodes.  Therefore, each
>     such node would be the authoritative query server for mapping for
>     however much "edge" space was used in an entire /8 of 16 million IPv4
>     addresses.  This is a wildly unrealistic demand of any single physical
>     server, considering that there are likely to be millions of separate
>     "edge" "EID" (LISP) prefixes in each such /8.
> 
> We made a constraint study of the processing capability of mapping
> servers.  Assume `Pa' is the fraction of mappings that are requested
> per second and `N' is the total number of edge blocks; thus `Pa*N' is
> the total number of requests per second sent into the mapping system.
> A mapping server can forward `R' requests per second.  In an L-layered
> system, a request may traverse `L+1' mapping nodes in the worst case.
> Assuming requests are distributed evenly among `M' leaf mapping
> servers, we have the constraint:
>     R * M > Pa * N * (L + 1).  (1)
> `Pa' is estimated to be less than 0.001 [2]; we set `N' here to be 2^32
> in the IPv4 case and `L' to 2, so the right-hand side of (1) is
> O(10^6).  Note that a single router can forward 10^8 packets per
> second [2],

That is a big router if it is handling potentially long traffic
packets.  Also, average traffic levels probably need to be 20% or
less of the peak rate the system can handle, so as not to drop too
many packets when the traffic level briefly fluctuates to 5 or 10
times higher than average.

> thus
> we see no problem with the leaf mapping servers handling requests from
> all ITRs.  Our own simulations validate this (LMS: Section
> 3.2.3).  Moreover, the redundancy mechanism for mapping servers and
> the two-stage cache mechanism can further relieve the burden on each
> mapping server.  The caching of the locators of leaf mapping servers
> can especially relieve the load on root mapping servers.
>  
> [2]: H. Luo, Y. Qin, H. Zhang, "A DHT-based Identifier-to-locator Mapping
> Approach for a Scalable Internet," IEEE Transactions on Parallel and
> Distributed Systems, Vol. 20, No. 10, 2009.

As far as I know, you have not provided details of how you would
achieve this redundancy.  It's not good enough, I think, to point to a
paper and expect readers to understand how that paper, written for
another purpose, could be applied to your proposal.  I think you need
to explain the principles, give detailed examples, and then show how
the whole thing could scale and still be robust to some specified
level of mass adoption which reflects the greatest level of queries,
traffic or whatever you think the system would ever have to cope with.
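
Returning to constraint (1) quoted above: a rough numeric check, using
the quoted Pa, N and L plus a peak-to-average allowance of the kind I
mentioned, already shows how much the answer depends on assumptions.
The peak factor and the per-server rate R below are my assumptions,
not figures from the proposal:

  # Rough evaluation of constraint (1), R * M > Pa * N * (L + 1), using
  # the quoted numbers plus an assumed peak-to-average allowance.

  Pa = 0.001        # quoted upper bound on the fraction requested per second
  N  = 2 ** 32      # edge blocks, treating all of IPv4 as potential "edge"
  L  = 2            # layers, so a request crosses at most L + 1 mapping nodes
  peak_factor = 5   # assumed ratio of brief peak load to average load

  demand = Pa * N * (L + 1) * peak_factor   # requests/s to be absorbed at peak
  R = 10 ** 6       # assumed sustainable request rate per mapping server
                    # (well below a headline 10^8 packets/s forwarding figure)
  print(f"peak demand ~{demand:,.0f} req/s; "
        f"needs at least {demand / R:,.1f} leaf servers of capacity R")
  # Note that (1) assumes requests spread evenly over the M servers; a busy
  # /8 concentrated on one server would see far more than its even share.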


>     A single address in a
>     single prefix of such "edge" space could be extremely intensively used,
>     with tens or hundreds of thousands of hosts connecting to it in any 1
>     hour period - and in many instances, each such connection resulting in a
>     mapping query.
> 
>     If such an arrangement were feasible, there's no obvious reason why BGP
>     and tunnels would be used at all, since each ITR could easily be
>     configured with, or automatically discover, the IP addresses of all
>     these nodes.
> 
> In the IPv4 case, caching the locator information of all mapping servers
> in an ITR _is_ reasonable; that is why we think FIRMS [3] could work well
> in the IPv4 case.

OK - so why do the 224, 256 or whatever small number of mapping
servers and their routers need to be connected with BGP?

I don't know how a single mapping server could handle all the
requests for a busy /8.  If the entire /8 were "edge" space, there
would be 16 million IPv4 addresses.  Many end-user networks will only
need one or a few IPv4 addresses, so there could easily be 10 million
separately mapped "EID prefixes" (to use LISP terminology).  Some of
these might have many mapping requests per second.  I don't see how
your mathematically derived decision to have a single map server
handle so many EID prefixes is scalable.
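
As a rough illustration of what that density implies - both numbers
below are assumptions of mine, not measurements:

  # Rough estimate of the query load one map server would face if it were
  # authoritative for a single busy /8 of "edge" space.

  eid_prefixes_in_slash8 = 10_000_000   # separately mapped EID prefixes (assumed)
  new_correspondents_s   = 0.05         # cache-missing new flows per prefix per
                                        # second, averaged over quiet and busy ones
  queries_per_sec = eid_prefixes_in_slash8 * new_correspondents_s
  print(f"~{queries_per_sec:,.0f} queries/s for one /8")   # ~500,000 req/s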


> However, if IPv6 is used for edge addresses, there are
> many more mapping servers, and storing all their locators would
> become infeasible.
>  
> [3]: M. Menth, M. Hartmann, M. Hofling. FIRMS: a Future InteRnet Mapping
> System. EuroView2008, 2008.

http://www3.informatik.uni-wuerzburg.de/~menth/Publications/papers/Menth09-FIRMS.pdf

As I wrote, I don't think your mathematically derived decision about
how to structure the IPv6 system would scale well either.  I think
you need to explain exactly how these things will work in terms of
bits and bytes, RAM and CPU power, delay times in fibre at 200 km per
millisecond, actual details of network structure, physical and
topological locations of all the elements etc.
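
That 200 km per millisecond figure gives a quick bound on what each
extra logical hop between distant mapping nodes costs - a tiny sketch,
with purely illustrative distances:

  # Propagation delay in fibre is roughly 200 km per millisecond (about
  # two thirds of c), so each logical hop between topologically distant
  # mapping nodes adds real delay to the first packet of a new flow.

  def one_way_ms(km):
      return km / 200.0

  for km in (600, 3_000, 12_000):   # regional, continental, intercontinental
      print(f"{km:>6} km: ~{one_way_ms(km):.0f} ms one way")
  # A few such hops can easily add tens of milliseconds before the initial
  # packet is delivered, on top of queuing and processing at each node.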


>     A more realistic arrangement might be to have the "edge" space broken
>     into a larger number of sections, such as 2^22 (4 million) divisions of
>     2^10 IPv4 addresses each.  If (and even this is questionable, though it
>     would frequently be true) each such division could be managed by a
>     single node (a single authoritative query server), then it would be
>     perfectly feasible for each ITR to automatically discover the IP
>     addresses of all 2^22 potential query servers.  In practice, in some
>     cases, no such query server would be required, since none of those 1024
>     IPv4 addresses were "edge" space.  In other cases, due to low enough
>     actual query rates, a single server could handle queries for multiple
>     sets of 1024 IPv4 ranges of "edge" space.
> 
>     A simple ground-up engineering evaluation would thus produce a much more
>     practical solution than the highly contrived top-down mathematical model
>     - which considered only storage requirements for mapping data in each
>     query server, and not the volume of queries.
> 
> Firstly, we did not provide a comprehensive analysis of the processing
> capability of mapping servers; secondly, what we did in the proposal is
> just a constraint study, to show that the layered structure is scalable
> in providing an efficient mapping service while keeping the storage and
> processing load acceptable.  The arrangement is a specific example; the
> actual number of layers and prefix widths may well vary in practice,
> in different parts of the world.

OK - I think your mathematical model didn't include enough detail to
anticipate real operational conditions, such as peak numbers of
requests or the actual density of EID prefixes, which will be far
more numerous than prefixes which are advertised in the DFZ today.


>     The IPv6 arrangement of two layers (a third layer is the single top
>     node) seems even more unrealistic.  Although the IPv6 address space is
>     likely to remain sparsely used for a long time, due to its inherent
>     vastness, the LMS plan calls for up to 2^24 Layer 2 nodes, each with up
>     to 2^24 Layer 3 nodes underneath.  This proposal seems wildly
>     unrealistic as stated - since each such node would need to accommodate
>     up to 2^24 + 1 tunnels and BGP sessions with its neighbours.
> 
> Firstly, by a calculation similar to the one for the IPv4 case, the
> processing constraint can be met in the IPv6 case.  One important virtue
> of LMS is that it can be constructed incrementally: a mapping server
> need not be constructed if none of the edge addresses it is in charge of
> are used.  This especially makes sense in the IPv6 case.

I agree - much of the IPv6 space is unused, so this reduces the total
size of the system you are proposing.


> As IPv6 addresses become more widely used, we do not see it as
> impractical for a node to accommodate 2^24 tunnels and sessions with
> neighbors.

I think that if you looked into the software, CPU processing and
state requirements of tunnels and BGP communications, you would arrive
at a figure more like 2^6 or maybe 2^8.  I think your proposal is
highly theoretical, and that your estimate of what is practical is
about 100,000 times greater than what is actually practical.
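
Even counting memory alone - and the per-session figure below is my
assumption - 2^24 sessions is implausible on one node:

  # Memory-only estimate for 2^24 tunnels plus BGP sessions on one node,
  # ignoring the CPU cost of keepalives, UPDATE processing and tunnel
  # maintenance entirely.  The per-session figure is an assumption.

  sessions          = 2 ** 24
  bytes_per_session = 50 * 1024     # assumed TCP + BGP + tunnel state per peer

  total_gib = sessions * bytes_per_session / 2 ** 30
  print(f"~{total_gib:,.0f} GiB of state for {sessions:,} sessions")  # ~800 GiB
  # ...before counting any routes learned over those sessions - consistent
  # with the 2^6 to 2^8 figure suggested above.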


>     Furthermore, the top node (Layer 1) has to cope with most query packets,
>     since there is no suggestion that Layer 2 nodes would be fully or even
>     partially meshed.
> 
> Using the two-stage cache mechanism, our experiment validates that the
> requests sent to the root node would be sharply reduced (to nearly zero
> from our router simulated as an ITR).  This is because ITRs would
> directly query the responsible bottom-layer nodes if they have cached
> their locator information (the hit rate of this cache is high).
> Moreover, the success of DNS may also support the feasibility of the
> tree structure.

In IPv4, obviously, an ITR could cache the addresses of all 224 map
servers.  In IPv6, an ITR could cache some larger number, but you
can't rely on your broad-brush mathematical model to tell you at what
levels of the address hierarchy you need map servers, because it is
so dependent on how the "edge" space is assigned, and the nature of
the traffic to the EID prefixes in that space.

I think a "bottom-up" analysis, starting with some explicit but
realistic assumptions about address usage, traffic patterns etc.,
would provide meaningful constraints on how high in the inverted tree
you can place map servers before any one map server would be
overloaded.  Then you could think about how you would implement
multiple map servers with the same responsibilities, for redundancy
and/or load sharing.  Only then could you start to design the overall
structure by which queries are sent to these map servers.  This would
be realistic and meaningful, if the assumptions were realistic.
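
To show the kind of exercise I mean, here is a small sketch of such a
bottom-up sizing calculation.  Every input is an assumption chosen for
illustration, not a measurement:

  # Bottom-up sizing sketch: given assumed query rates per EID prefix and
  # an assumed per-server capacity, work out how much edge space one map
  # server can be authoritative for before it is overloaded.

  queries_per_prefix_s = 0.05      # average mapping requests per EID prefix
  prefixes_per_1024    = 600       # assumed EID prefixes per 1024 IPv4 addresses
  server_capacity      = 50_000    # requests/s one server can sustain at peak
  peak_factor          = 5         # assumed peak-to-average traffic ratio

  load_per_1024 = prefixes_per_1024 * queries_per_prefix_s * peak_factor
  blocks_per_server = int(server_capacity / load_per_1024)
  print(f"one server covers ~{blocks_per_server} blocks of 1024 addresses "
        f"(~{blocks_per_server * 1024:,} addresses)")
  # With these inputs a busy /8 (16,384 such blocks) needs dozens of map
  # servers, not one - the opposite of the top-down model's conclusion.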

The highly theoretical, inadequately detailed, top-down approach you
have taken produces results which are at odds with the actual
capabilities of servers and routers, and which are not necessarily
grounded in foreseeable traffic patterns in a new system of millions
of EIDs - a situation quite different from the current one, which
produced the traffic patterns used in your modelling.  Starting with
a few numbers and throwing them into a highly contrived formula, which
is what you have done, does not produce results which properly respect
the real engineering constraints on a practical solution.


>     Even with some unspecified caching times, the prodigious query rates
>     considered in section 3.2.2 cannot be suitably handled by either of the
>     IPv4 or IPv6 structures in the proposal.
> 
> The cache timeout is 5 minutes and the cache size limit is 30,000
> entries.  I am sorry for omitting this information.  As stated, we
> extrapolated the query rate from one ITR to the whole world (according
> to the proportion of our campus address space to the whole IPv4 space),
> and we see that the processing capability constraint can be satisfied.

Maybe so, but what of the future, when there are far more EID prefixes
than there are DFZ prefixes today?  With the plan you mentioned
above, your 5-minute caching time means multihoming service
restoration times could be as long as 5 minutes plus whatever time it
takes the mapping servers to figure out there is a problem and change
the mapping accordingly.  I think this time is too long.  You can
shorten it to 30 seconds, but then the level of queries to the map
servers goes up by a factor of ten - which completely changes the
limits on how much address space any one map server can be
responsible for.
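
The arithmetic behind that factor is simply that cache-driven refresh
queries scale inversely with the cache timeout:

  # For prefixes with continuous traffic, refresh queries per ITR scale
  # as 1 / cache_timeout, so tightening the timeout from 5 minutes to
  # 30 seconds multiplies the refresh load accordingly.

  old_timeout = 5 * 60   # seconds
  new_timeout = 30       # seconds
  print(old_timeout / new_timeout)   # 10.0 - ten times more refresh queries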


>     While there is reference to redundant hardware for each node, there is
>     no discussion of how to implement this.  This is one of the problems
>     which so far has bedevilled LISP-ALT: - how to add nodes to this highly
>     aggregated network structure so that there is no single point of
>     failure.  For instance, how in the IPv6 system could there be two
>     physical nodes, each performing the role of a given Level 2 node, in
>     topologically diverse locations - without adding great complexity and
>     greater numbers of tunnels and BGP sessions?
> 
> I do not think adding physical nodes to provide redundancy would
> complicate the system much. 

You haven't explained how you would do it in the LMS structure.


> DNS uses mirrors and can provide redundancy
> effectively. 

DNS is not a routing system - and you are proposing a routing system
like ALT, with different decisions about levels of aggregation, but
also with ITRs caching addresses of map servers they discover
initially via traffic packets acting as map requests traversing the
LMS overlay network.

A domain is delegated, normally, to two or more topologically
separated authoritative query servers, and delegation is achieved by
passing on the IP addresses of these servers.

You can create two or more redundant mapping servers and place them
in physically and topologically different locations.  Now the task is
to show how the LMS routers handle this doubled number of map
servers, and automatically propagate their reachability up and down
the LMS system with BGP, in a scalable fashion, without there
being any single point of failure such as reliance on any particular
LMS router.  How would you structure LMS so that for every router,
there is another which will automatically take its place if it fails?
You can't point to DNS as a solution, since it doesn't involve routers.


> However, the practical issues should be found and solved
> through actual implementation.  Thus a thorough and
> large-scale experiment (like the LISP interworking) is much needed.

The LISP test network is, and always will be, too small to display
the scaling problems which are obvious for ALT or for LMS.

The whole idea of good architecture is to take everything into
consideration, especially all the physical details (which are
sometimes described disparagingly as merely "engineering details")
and devise a plan which looks like it can scale to the massive sizes
required for a fully adopted routing scaling solution.  Only when you
have such a design is it worth experimenting, I think.

To say "It looks like we don't know how to make it work on a large
enough scale to be a real solution - so let's try building it on a
small scale, and then ideally a larger scale, and see if we can figure
out how to make it work" is the antithesis of good design.  I think
the LISP team did something like this.  They place great importance
on actually trying things.  I am not opposed to trying things, but
when an architecture such as ALT has very obvious failings, which
were mentioned within months of its announcement:

  ALT's strong aggregation often leads to *very* long paths
  http://www.ietf.org/mail-archive/web/rrg/current/msg01171.html

it would have been better to create a better design rather than
experimenting with one which had obvious problems with no apparent
solution.


>     The suggestion (section 5) that core (DFZ) routers not maintain a full
>     FIB, but rather hold a packet which does not match any FIB prefix, pay
>     for a mapping lookup, await the result, update the FIB (a frequently
>     expensive operation) and then forward the packet - is also wildly
>     unrealistic.
> 
> If the mapping system already exists, or highly provisioned routers
> would provide mapping services, why is it unrealistic for routers that
> cannot afford to hold the global routing table to discard specific
> routes they are not interested in and query for them when needed?

Any router which dropped or delayed packets like this would not be
usable in the DFZ.  DFZ routers need to reliably and instantly
forward any packet whose destination address matches one of the
prefixes advertised in the DFZ.  If it can't do this, no-one would
want to forward packets to that router.


>     It is important to research alternative approaches when existing methods
>     are perceived as facing serious problems, as is the case with LISP-ALT.
>      In this case, the proposed solution is not likely to be any improvement
>     on any ALT arrangement which is likely to arise by a more hand-crafted
>     design methodology.
> 
>     The LMS proposal, as it currently stands, is far too incomplete to be
>     considered suitable for further development ahead of some other
>     proposals.  It represents the efforts of a creative team to improve on
>     LISP-ALT, and does not necessarily mean that all such attempts at
>     improvement would lead to such impractical choices.
>  
> Waiting for your reply. Thank you.
>  
> Best wishes,
> Letong

OK - thanks for your reply.

  - Robin