Re: [rrg] LMS alternative critique
Robin Whittle <rw@firstpr.com.au> Thu, 25 February 2010 02:46 UTC
Hi Letong, Thanks for your reply, in which you wrote:

> Hi Robin:
>
> Thank you for your critique. My response is inline.

> Layered Mapping System
> ----------------------
>
> LMS is a step towards designing a complete Core-Edge Separation routing
> scaling solution for both IPv4 and IPv6, somewhat along the lines of
> LISP-ALT.
>
> There are insufficient details in the proposal to evaluate how well the
> basic infrastructure of ITRs and ETRs would work, considering the
> unknown nature of mapping delays,
>
> We did a simulation of LMS using real data collected from a
> campus border router to show that, when equipped with the two-stage
> cache mechanism at ITRs (as stated in our proposal), the request
> hop counts are quite small (94% no hop: cache hits; 5.5% two hops; 0.5%
> four hops). These hops are logical, as mapping servers
> talk using tunnels, while the delay between two random nodes may not be
> unacceptable (some estimated it at 115 ms [1]). The redundancy for
> logical mapping servers may help to reduce the delays between two
> mapping servers, since a mapping server may choose a nearby server it
> wants to communicate with.
>
> [1]: S. Ratnasamy, P. Francis, M. Handley, R. Karp, and S. Shenker, "A
> scalable content-addressable network," in Proc. of ACM SIGCOMM '01, San
> Diego, CA, USA, Aug. 2001, pp. 161-172.

I am not sure how this paper:

  http://www.cs.cornell.edu/People/francis/p13-ratnasamy.pdf

relates to the LMS proposal or to an ALT-like structure. Nor have you
mentioned how your system would incorporate multiple physical MSes (Map
Servers), presumably reachable by your system in ways which are
topologically different both in terms of LMS routers and their tunnels
and the actual paths these tunnels take on DFZ and other routers and
their physical data links.

Because your structure is highly compressed, with few levels, I am not
surprised that you can get small numbers of hops. ITR caching will of
course mean that for a sequence of two to many thousands of packets sent
from one host to another, there would be a single mapping lookup, or
perhaps no lookup at all due to the ITR already having the mapping it
needs in its cache.

When I wrote:

> There are insufficient details in the proposal to evaluate how well
> the basic infrastructure of ITRs and ETRs would work, considering
> the unknown nature of mapping delays,

I meant to imply that the "mapping delays" are unknown, in part because
I don't understand how the system would work at all - how routers could
maintain millions of BGP sessions and tunnels, how redundant routers and
MSes could be accommodated etc.

> reliability of the global mapping system,
>
> The reliability of the global mapping system lies in its
> redundancy and efficient communications between mapping servers. The
> test needs a real environment and actual implementation; however, I do
> not see a logic fault in the layered mapping system. Perhaps we can
> draw lessons from DNS, which runs effectively.

I think you are just providing vague references to DNS, when you need to
show exactly how your newly proposed structure would work, be robust
against single points of failure, and be possible to implement with
realistic amounts of RAM, CPU power etc.

> the problems of Path MTU Discovery (due to tunneling via
> encapsulation)
>
> The MTU issue is inherent in the map-and-encap scheme.
> However, address
> rewriting needs intermediate routers to change the packet headers, which
> is insecure and brings many issues such as checksum
> conformity; comprising both core and edge addresses in one form only
> handles the problem by shrinking the size of the namespace. You have
> proposed to estimate MTU along the tunnels, yet I do not know whether it
> is effective.

If you want to have a complete proposal for scalable routing, I think
you need to show exactly how you would solve any problems such as PMTUD
for the tunnels you are using. I think you need to specify exactly what
tunneling protocols you would use, and how they handle PMTUD for IPv4
and IPv6, including when these tunnels themselves are going through
other tunnels you can't control or foresee.

> and difficulties with an ITR reliably probing
> reachability of the destination network via multiple ETRs.
>
> The ITR receives multiple mappings in the response, each with a
> different priority. The ITR can select the most preferred ETR to forward
> packets to, while using the others as backups.

Yes - this is the basic LISP approach. But how does the ITR decide which
ETRs can currently send packets to the destination network? The LISP
people have been grappling with this since LISP's inception three or so
years ago. The ITR doesn't have a way of probing through the ETR to a
host or router in the destination network, since it doesn't have any
information on what IP address to test, or how to probe reachability to
them.

With Ivip, the ITRs make no such decisions. The mapping is a single ETR
address. The end-user whose network this is either controls the mapping
themselves, or appoints a multihoming monitoring company to probe
reachability however the end-user likes, and then controls the mapping
to restore service in the event of a failure of an ETR, its ISP or the
link from the ETR to the destination network.

> I do not know whether the following answers your question; I just write
> it down to improve the clarity: A mapping server is the
> authoritative source for the mapping information of the edge addresses
> it is in charge of. It should (and could) know the connectivity of ETRs
> and its charged edge addresses in time. When an ITR caches the locator
> information of mapping servers which it thinks may be useful (as
> suggested in the two-stage cache mechanism), the ITR can periodically
> query those mapping servers for the mappings it is interested in, to get
> the current reachability information.

You could do this. In this scenario there are no options in the mapping
- the mapping would contain a single ETR address, and the mapping server
somehow tests reachability and changes the mapping accordingly. But
then, in order to achieve, for instance, 30 second response times to
outages, every ITR in the world would need to request mapping every 30
seconds (as long as it is still handling packets addressed to the
destination network). This will place a huge load on your system,
especially since I think you are sending traffic packets through your
network of tunnels and routers, as data-driven mapping requests, and so
the packets get delivered via the mapping network. Even if you just sent
short mapping requests, rather than data packets, you would still have
an enormous load on the mapping network and the map servers.

LISP doesn't do this. It returns mapping information with longer caching
times than the desired times for multihoming service restoration. The
mapping consists of multiple ETR addresses with priorities.
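To illustrate what I mean, here is a rough sketch, in Python, of a
LISP-style mapping entry with prioritised ETR locators and the naive
selection an ITR might perform. The field names and the "reachable"
flag are my own illustrative assumptions, not anything defined in LISP
or LMS - and the reachable flag is exactly the information the ITR does
not actually have.

# Illustrative sketch only - field names and the "reachable" flag are
# assumptions for this example, not part of any LISP or LMS specification.

from dataclasses import dataclass
from typing import List, Optional

@dataclass
class EtrLocator:
    address: str        # ETR's "core" (RLOC) address
    priority: int       # lower value = more preferred
    weight: int         # for load sharing among equal priorities
    reachable: bool     # the hard part: how does the ITR know this?

@dataclass
class MapEntry:
    eid_prefix: str             # "edge" prefix this mapping covers
    locators: List[EtrLocator]
    ttl_seconds: int            # how long an ITR may cache the mapping

def choose_etr(entry: MapEntry) -> Optional[EtrLocator]:
    """Pick the most preferred locator the ITR believes is reachable."""
    candidates = [loc for loc in entry.locators if loc.reachable]
    if not candidates:
        return None             # no ETR believed reachable
    return min(candidates, key=lambda loc: loc.priority)

# Example: two ETRs for one multihomed "edge" prefix.
entry = MapEntry("192.0.2.0/24",
                 [EtrLocator("203.0.113.1", priority=1, weight=100, reachable=True),
                  EtrLocator("198.51.100.1", priority=2, weight=100, reachable=True)],
                 ttl_seconds=300)
print(choose_etr(entry).address)    # 203.0.113.1 while the primary looks reachable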
Then the difficulty is how the ITRs decide which of the ETR addresses to
use. As noted above, Ivip is completely different - it separates the
reachability testing and mapping decision-making from the CES
architecture itself, making it the responsibility of the end-user
network.

> Most of the proposal concerns a new global, distributed, mapping query
> system based on the ALT concepts of each node in the tree-like structure
> being a router, using tunnels to other such nodes, and all such nodes
> using BGP to develop best paths to neighbours.
>
> By a series of mathematical transformations the designers arrive at
> certain design choices for IPv4 and IPv6. For IPv4, if the entire
> address space was to be mapped (and at present there is no need to do so
> beyond 224.0.0.0) there would be 256 Layer 2 nodes. Therefore, each
> such node would be the authoritative query server for mapping for
> however much "edge" space was used in an entire /8 of 16 million IPv4
> addresses. This is a wildly unrealistic demand of any single physical
> server, considering that there are likely to be millions of separate
> "edge" "EID" (LISP) prefixes in each such /8.
>
> We make a constraint study of the processing capability of mapping
> servers. Assume `Pa' is the percentage of mappings per second that are
> requested, `N' is the total number of edge blocks; thus `Pa*N' is the
> total number of requests sent into the mapping system. A mapping
> server can forward `R' requests per second. In an L-layered system, a
> request may traverse `L+1' MNs in the worst case. Assuming requests are
> distributed evenly among `M' leaf mapping servers, we have the
> constraint:
>
>     R * M > Pa * N * (L + 1).     (1)
>
> `Pa' is estimated to be less than 0.001 [2], we set `N' here to be 2^32
> in the IPv4 case, `L' is 2, so the right part of (1) is O(10^6). Note
> that a single router can forward 10^8 packets per second [2],

That is a big router if it is handling potentially long traffic packets.
Also, average traffic levels probably need to be 20% or less of the peak
rate the system can handle, so as not to drop too many packets when the
traffic level briefly fluctuates to 5 or 10 times higher than average.

> thus
> we see no problem with the leaf mapping servers handling requests from
> all ITRs. Our own simulations validate this (LMS: Section
> 3.2.3). Moreover, the mechanism of redundancy for mapping servers and
> the two-stage cache mechanism can further relieve the burden on each
> mapping server. The caching of locators of leaf mapping servers can
> especially relieve the load on the root mapping servers.
>
> [2]: H. Luo, Y. Qin, H. Zhang. A DHT-based Identifier-to-locator Mapping
> Approach for a Scalable Internet. IEEE Transactions on Parallel and
> Distributed Systems, Vol. 20, No. 10, 2009.

As far as I know, you have not provided details of how you would achieve
this redundancy. It's not good enough, I think, to point to a paper and
expect readers to understand how that paper, written for another
purpose, could be applied to your proposal. I think you need to explain
the principles, give detailed examples, and then show how the whole
thing could scale and still be robust to some specified level of mass
adoption which reflects the greatest level of queries, traffic or
whatever you think the system would ever have to cope with.
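As an example of the kind of concrete arithmetic I mean, here is a small
back-of-envelope sketch in Python that plugs the numbers quoted above
into constraint (1), and then derates the quoted figure of 10^8 packets
per second by the 20% average-to-peak ratio I just mentioned. Setting Pa
to exactly 0.001 and the 20% derating factor are my own assumptions for
illustration.

# Rough arithmetic for constraint (1): R * M > Pa * N * (L + 1).
# Pa, N, L and the 10^8 requests/s figure are the values quoted above;
# the 20% average-to-peak derating is my own illustrative assumption.

Pa = 0.001        # fraction of edge blocks whose mapping is requested per second
N  = 2 ** 32      # total number of edge blocks (IPv4 case, whole address space)
L  = 2            # layers, so a request may traverse L + 1 mapping nodes

demand = Pa * N * (L + 1)       # total request-hops per second, worst case
print(f"right-hand side of (1): {demand:.2e} request-hops/s")

R_peak   = 1e8                  # quoted peak forwarding rate of one router
R_usable = 0.2 * R_peak         # keep average load at ~20% of peak (my assumption)
M        = 256                  # leaf mapping servers in the IPv4 design

print(f"derated capacity R * M: {R_usable * M:.2e} requests/s")

# Note this aggregate figure says nothing about how the load is
# distributed: a busy /8 could see far more than 1/256th of all queries,
# so the per-server question still has to be answered separately.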
> A single address in a
> single prefix of such "edge" space could be extremely intensively used,
> with tens or hundreds of thousands of hosts connecting to it in any 1
> hour period - and in many instances, each such connection resulting in a
> mapping query.
>
> If such an arrangement were feasible, there's no obvious reason why BGP
> and tunnels would be used at all, since each ITR could easily be
> configured with, or automatically discover, the IP addresses of all
> these nodes.
>
> In the IPv4 case, caching the locator information of all mapping servers
> in an ITR _is_ reasonable; that's why we think FIRMS [3] could work well
> in the IPv4 case.

OK - so why do the 224, 256 or whatever small number of mapping servers
and their routers need to be connected with BGP?

I don't know how a single mapping server could handle all the requests
for a busy /8. If the entire /8 was "edge" space, there are 16 million
IPv4 addresses. Many end-user networks will only need one or a few IPv4
addresses, so there could easily be 10 million separately mapped "EID
prefixes" (to use LISP terminology). Some of these might have many
mapping requests per second. I don't see how your mathematically derived
decision to have a single map server handle so many EID prefixes is
scalable.

> However, if IPv6 is used for edge addresses, there are
> many more mapping servers. Storing all their locators would become
> infeasible.
>
> [3]: M. Menth, M. Hartmann, M. Hofling. FIRMS: a Future InteRnet Mapping
> System. EuroView2008, 2008.
> http://www3.informatik.uni-wuerzburg.de/~menth/Publications/papers/Menth09-FIRMS.pdf

As I wrote, I don't think your mathematically derived decision about how
to structure the IPv6 system would scale well either. I think you need
to explain exactly how these things will work in terms of bits and
bytes, RAM and CPU power, delay times in fibre at 200km per millisecond,
actual details of network structure, physical and topological locations
of all the elements etc.

> A more realistic arrangement might be to have the "edge" space broken
> into a larger number of sections, such as 2^22 (4 million) divisions of
> 2^10 IPv4 addresses each. If (and even this is questionable, though it
> would frequently be true) each such division could be managed by a
> single node (a single authoritative query server), then it would be
> perfectly feasible for each ITR to automatically discover the IP
> addresses of all 2^22 potential query servers. In practice, in some
> cases, no such query server would be required, since none of those 1024
> IPv4 addresses were "edge" space. In other cases, due to low enough
> actual query rates, a single server could handle queries for multiple
> sets of 1024 IPv4 ranges of "edge" space.
>
> A simple ground-up engineering evaluation would thus produce a much more
> practical solution than the highly contrived top-down mathematical model
> - which considered only storage requirements for mapping data in each
> query server, and not the volume of queries.
>
> Firstly, we did not provide a complete analysis of the processing
> capability of mapping servers; secondly, what we did in the proposal is
> just a constraint study, to show that the layer structure is scalable in
> providing an efficient mapping service while keeping the storage and
> processing load acceptable. The arrangement is a specific example, while
> the actual layer number and prefix width may well vary in practice
> across different parts of the world.
OK - I think your mathematical model didn't include enough detail to
anticipate real operational conditions, such as peak numbers of requests
or the actual density of EID prefixes, which will be far more numerous
than the prefixes which are advertised in the DFZ today.

> The IPv6 arrangement of two layers (a third layer is the single top
> node) seems even more unrealistic. Although the IPv6 address space is
> likely to remain sparsely used for a long time, due to its inherent
> vastness, the LMS plan calls for up to 2^24 Layer 2 nodes, each with up
> to 2^24 Layer 3 nodes underneath. This proposal seems wildly
> unrealistic as stated - since each such node would need to accommodate
> up to 2^24 + 1 tunnels and BGP sessions with its neighbours.
>
> Firstly, a similar calculation to the one for the IPv4 case shows that
> the processing constraint can be met in the IPv6 case; one important
> virtue of LMS is that it can be incrementally constructed. A mapping
> server need not be constructed if none of the edge addresses it is
> charged with are used. This especially makes sense in the IPv6 case.

I agree - much of the IPv6 space is unused, so this reduces the total
size of the system you are proposing.

> As IPv6 addresses become more widely used, we do not see it as
> impractical for a node to be able to accommodate 2^24 tunnels and
> sessions with neighbors.

I think that if you looked into the software, CPU processing and state
requirements of tunnels and BGP communications you would arrive at a
figure more like 2^6 or maybe 2^8. I think your proposal is highly
theoretical and that your understanding of what is practical is about
100,000 times greater than what is actually practical.

> Furthermore, the top node (Layer 1) has to cope with most query packets,
> since there is no suggestion that Layer 2 nodes would be fully or even
> partially meshed.
>
> Using the two-stage cache mechanism, our experiment validates that the
> requests sent to the root node would be sharply reduced (nearly zero
> from our router simulated as an ITR). This is because ITRs would
> directly query the responsible bottom-layer nodes if they cache the
> locator information of those nodes (the hit rate of this cache is high).
> Moreover, the success of DNS may also support the feasibility of the
> tree structure.

In IPv4, obviously, an ITR could cache the addresses of all 224 map
servers. In IPv6, an ITR could cache some larger number, but you can't
rely on your broad-brush mathematical model to tell you at what levels
of the address hierarchy you need map servers, because it is so
dependent on how the "edge" space is assigned, and on the nature of the
traffic to the EID prefixes in that space.

I think a "bottom-up" analysis, starting with some explicit but
realistic assumptions about address usage, traffic patterns etc., would
provide meaningful constraints on how high in the inverted tree you can
place map servers before any one map server would be overloaded. Then
you could think about how you would implement multiple map servers with
the same responsibilities for redundancy and/or load sharing. Only then
could you start to design the overall structure by which queries are
sent to these map servers. This would be realistic and meaningful, if
the assumptions were realistic.
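To make the "bottom-up" idea concrete, here is the shape of the
calculation I have in mind, as a Python sketch. Every number in it (EID
prefix density, per-prefix query rate, cache miss ratio, server
capacity, peak factor) is a placeholder I have invented purely for
illustration - the real exercise would start from measured or at least
defensible figures.

# Bottom-up sketch: given assumed traffic figures, how much "edge" space
# can one map server be authoritative for before it is overloaded?
# Every number here is an invented placeholder, not a measured value.

eid_prefixes_per_slash8  = 10_000_000  # separately mapped EID prefixes in a busy /8
queries_per_prefix_per_s = 0.2         # average map requests generated per prefix
cache_miss_ratio         = 0.06        # lookups not answered from ITR caches
server_capacity_qps      = 50_000      # sustained queries/s one server can answer
peak_to_average          = 5           # provision for peaks ~5x the average

avg_load         = eid_prefixes_per_slash8 * queries_per_prefix_per_s * cache_miss_ratio
provisioned_load = avg_load * peak_to_average
servers_needed   = provisioned_load / server_capacity_qps

print(f"average query load for the /8: {avg_load:,.0f} q/s")
print(f"load to provision for (peaks): {provisioned_load:,.0f} q/s")
print(f"servers needed for this /8:    {servers_needed:.0f}")

# With these placeholder figures, one authoritative server per /8 is not
# nearly enough - which is the kind of constraint a bottom-up design
# would surface before the layer structure is fixed.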
The highly theoretical, inadequately detailed, top-down approach you
have taken produces results which are at odds with the actual
capabilities of servers and routers, and which are not necessarily
grounded in foreseeable traffic patterns in a new system of millions of
EIDs - a situation quite different from the current one which produced
the traffic patterns used in your modelling. Starting with a few numbers
and throwing them into a highly contrived formula, which is what you
have done, does not produce results which properly respect the real
engineering constraints on a practical solution.

> Even with some unspecified caching times, the prodigious query rates
> considered in section 3.2.2 cannot be suitably handled by either of the
> IPv4 or IPv6 structures in the proposal.
>
> The cache timeout is 5 minutes and the cache size limit is 30,000
> entries. I am sorry for omitting this information. As stated, we
> extrapolate the query rate from one ITR to the whole world (according
> to the proportion of our campus address space to the whole IPv4 space),
> and we see that the process capability constraint can be satisfied.

Maybe so, but what of the future when there are far more EID prefixes
than there are DFZ prefixes today? With the plan you mentioned above,
your 5 minute caching time means multihoming service restoration times
could be as long as 5 minutes plus whatever time it takes the mapping
servers to figure out there is a problem and change the mapping
accordingly. I think this time is too long. You can shorten it to 30
seconds, but then the level of queries to map servers goes up by a
factor of 10 - which completely changes the limits on how much address
space any one map server can be responsible for.

> While there is reference to redundant hardware for each node, there is
> no discussion of how to implement this. This is one of the problems
> which so far has bedevilled LISP-ALT - how to add nodes to this highly
> aggregated network structure so that there is no single point of
> failure. For instance, how in the IPv6 system could there be two
> physical nodes, each performing the role of a given Level 2 node, in
> topologically diverse locations - without adding great complexity and
> greater numbers of tunnels and BGP sessions?
>
> I do not think adding physical nodes to provide redundancy would
> complicate the system much.

You haven't explained how you would do it in the LMS structure.

> DNS uses mirrors and can provide redundancy
> effectively.

DNS is not a routing system - and you are proposing a routing system
like ALT, with different decisions about levels of aggregation, but also
with ITRs caching addresses of map servers they discover initially via
traffic packets acting as map requests traversing the LMS overlay
network.

A domain is normally delegated to two or more topologically separated
authoritative query servers, and delegation is achieved by passing on
the IP addresses of these servers. You can create two or more redundant
mapping servers and place them in physically and topologically different
locations. Now the task is to show how the LMS routers handle this
doubled number of map servers, and automatically propagate their
reachability up and down the LMS system with BGP, in a scalable fashion,
without there being any single point of failure such as reliance on any
particular LMS router. How would you structure LMS so that for every
router, there is another which will automatically take its place if it
fails?
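To be clear about where I think the difficulty lies: the client side of
map server redundancy is the easy part. Something like the following
sketch (my own invented structure, in Python, not anything from the LMS
draft) is roughly all an ITR needs in order to fail over between two
replicas holding the same mappings. The unanswered question is
everything underneath it - how the LMS routers, tunnels and BGP sessions
would be arranged so that a query still reaches some live replica when
routers or links fail.

# Client-side failover between redundant map servers is straightforward.
# This is my own illustrative structure, not anything from the LMS draft.
# The open question is the layer below: how the LMS routers and their
# BGP sessions get the query to *some* live replica in the first place.

import random
from typing import Callable, List, Optional

class MapServerUnreachable(Exception):
    """Raised by the transport when a replica does not answer in time."""
    pass

def query_with_failover(eid: str,
                        replica_locators: List[str],
                        send_query: Callable[[str, str], dict]) -> Optional[dict]:
    """Try each replica of the authoritative map server, in random order."""
    for locator in random.sample(replica_locators, k=len(replica_locators)):
        try:
            return send_query(locator, eid)   # may raise MapServerUnreachable
        except MapServerUnreachable:
            continue                          # try the next replica
    return None                               # every replica failed or was unreachable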
You can't point to DNS as a solution, since it doesn't involve routers.

> However, the practical issues should be found and solved
> through actual implementation. Thus a thorough and
> large-scale experiment (such as the LISP interworking) is much needed.

The LISP test network is, and always will be, too small to display the
scaling problems which are obvious for ALT or for LMS.

The whole idea of good architecture is to take everything into
consideration, especially all the physical details (which are sometimes
described disparagingly as merely "engineering details") and devise a
plan which looks like it can scale to the massive sizes required for a
fully adopted routing scaling solution. Only when you have such a design
is it worth experimenting, I think.

To say "It looks like we don't know how to make it work on a large
enough scale to be a real solution - so let's try building it on a small
scale, and then ideally a larger scale, and see if we can figure out how
to make it work." is the antithesis of good design. I think the LISP
team did something like this. They place great importance on actually
trying things.

I am not opposed to trying things, but when an architecture such as ALT
has very obvious failings, which were mentioned within months of its
announcement:

  ALT's strong aggregation often leads to *very* long paths
  http://www.ietf.org/mail-archive/web/rrg/current/msg01171.html

it would have been better to create a better design rather than
experimenting with one which had obvious problems with no apparent
solution.

> The suggestion (section 5) that core (DFZ) routers not maintain a full
> FIB, but rather hold a packet which does not match any FIB prefix, pay
> for a mapping lookup, await the result, update the FIB (a frequently
> expensive operation) and then forward the packet - is also wildly
> unrealistic.
>
> If the mapping system already exists, or highly capable routers
> provide mapping services, why is it unrealistic that routers which
> cannot afford to hold the global routing table discard specific routes
> they are not interested in and query for them when needed?

Any router which dropped or delayed packets like this would not be
usable in the DFZ. DFZ routers need to reliably and instantly forward
any packet whose destination address matches one of the prefixes
advertised in the DFZ. If it can't do this, no-one would want to forward
packets to that router.

> It is important to research alternative approaches when existing methods
> are perceived as facing serious problems, as is the case with LISP-ALT.
> In this case, the proposed solution is not likely to be any improvement
> on any ALT arrangement which is likely to arise by a more hand-crafted
> design methodology.
>
> The LMS proposal, as it currently stands, is far too incomplete to be
> considered suitable for further development ahead of some other
> proposals. It represents the efforts of a creative team to improve on
> LISP-ALT, and does not necessarily mean that all such attempts at
> improvement would lead to such impractical choices.
>
> Waiting for your reply. Thank you.
>
> Best wishes,
> Letong

OK - thanks for your reply.

  - Robin