[rrg] Evolution Towards Global Routing Scalability

Lan Wang <lanwang@memphis.edu> Wed, 28 October 2009 12:31 UTC

To: RRG Mailing List <rrg@irtf.org>
From: Lan Wang <lanwang@memphis.edu>
Date: Wed, 28 Oct 2009 07:31:07 -0500
Subject: [rrg] Evolution Towards Global Routing Scalability

Hi all,

The following draft serves as our input to the RRG recommendation  
discussion.  Your feedback is very welcome.

Lan
---------
Network Working Group                                           B. Zhang
Internet-Draft                                          Univ. of Arizona
Intended status: Informational                                  L. Zhang
Expires: April 29, 2010                                             UCLA
                                                                  L. Wang
                                                         Univ. of Memphis
                                                          October 26, 2009


               Evolution Towards Global Routing Scalability
                    draft-zhang-zhang-evolution-02.txt

Status of this Memo

    This Internet-Draft is submitted to IETF in full conformance with
    the provisions of BCP 78 and BCP 79.

    Internet-Drafts are working documents of the Internet Engineering
    Task Force (IETF), its areas, and its working groups.  Note that
    other groups may also distribute working documents as Internet-
    Drafts.

    Internet-Drafts are draft documents valid for a maximum of six
    months and may be updated, replaced, or obsoleted by other documents
    at any time.  It is inappropriate to use Internet-Drafts as
    reference material or to cite them other than as "work in progress."

    The list of current Internet-Drafts can be accessed at
    http://www.ietf.org/ietf/1id-abstracts.txt.

    The list of Internet-Draft Shadow Directories can be accessed at
    http://www.ietf.org/shadow.html.

    This Internet-Draft will expire on April 29, 2010.

Copyright Notice

    Copyright (c) 2009 IETF Trust and the persons identified as the
    document authors.  All rights reserved.

    This document is subject to BCP 78 and the IETF Trust's Legal
    Provisions Relating to IETF Documents in effect on the date of
    publication of this document (http://trustee.ietf.org/license-info).
    Please review these documents carefully, as they describe your
    rights and restrictions with respect to this document.

    This document may contain material from IETF Documents or IETF
    Contributions published or made publicly available before November
    10, 2008.  The person(s) controlling the copyright in some of this
    material may not have granted the IETF Trust the right to allow
    modifications of such material outside the IETF Standards Process.
    Without obtaining an adequate license from the person(s) controlling
    the copyright in such materials, this document may not be modified
    outside the IETF Standards Process, and derivative works of it may
    not be created outside the IETF Standards Process, except to format
    it for publication as an RFC or to translate it into languages other
    than English.

Abstract

    Internet routing scalability has long been considered a serious
    problem.  Although many efforts have been devoted to this problem
    over the years, the IETF community as a whole has yet to reach a
    shared understanding of the best way forward.  In this draft, we
    step up a level to re-examine the problem and the ongoing efforts.
    We conclude that, to effectively solve the routing scalability
    problem, we first need a clear understanding of how to introduce
    solutions into the Internet, a system deployed at global scale.  We
    then sketch out our reasoning on the need for an evolutionary path
    towards scaling the global routing system, instead of attempting to
    introduce a brand new design.

Table of Contents

    1.  Introduction
    2.  Difficulties in Deploying New Solutions
    3.  An Evolutionary Path towards Scalable Routing
      3.1.  Step One: Local FIB Size Reduction
      3.2.  Step Two: Network-Coordinated FIB Size Reduction
      3.3.  Step Three: Reducing Adjacent AS Virtual Aggregation
            Overhead
      3.4.  Step Four: Reducing RIB Size
      3.5.  Step Five: Insulating the Core from Edge Churns
      3.6.  Summary
    4.  Evolution versus Incremental Deployability
    5.  Acknowledgements
    6.  Security Considerations
    7.  Informative References
    Authors' Addresses

1.  Introduction

    Internet routing scalability has long been an outstanding problem.
    Over the years many efforts, including our own, have been devoted to
    solving this problem.  Since the 2006 IAB Workshop on Internet
    Routing and Addressing [RFC4984], new IRTF/IETF efforts have been
    devoted to developing a scalable routing architecture, and a number
    of proposals have been put on the table [RRG].  We contributed a new
    design dubbed APT [APT]; another new design, LISP, already has
    running code [LISP].  Yet no clear consensus has emerged in the
    community as to the best way forward.

    Assuming the routing scalability problem is real and we can find a
    new design that is technically sound, why is it so difficult to
    agree on deploying a new design that solves the problem?  We put in
    the effort to understand the fundamental roadblocks in rolling out
    APT [APT], and came to a new understanding of the problem at hand:
    when facing a problem, as engineers we naturally tend to design a
    new system to solve it, hoping that the new design would be rolled
    out to replace the old, problematic one.  This kind of "solution by
    new design" approach can be effective in solving problems at small
    scale (e.g., one could easily replace an old computer with a new
    one), but it does not work for the deployed Internet.  Instead,
    Internet-scale systems need to resolve problems through an
    evolutionary path, not a revolutionary new design.

    In this draft we first discuss the major difficulties in rolling out
    a new design to solve the global routing scalability problem.  These
    difficulties suggest that the Internet routing infrastructure needs
    an evolutionary path to move forward.  We then show evidence that
    such an evolutionary path indeed exists.  To address the concern
    about getting stuck at a "local optimum" with incremental changes,
    we sketch out a solution scenario which demonstrates the feasibility
    of moving the routing system towards a scalable architecture through
    incremental steps (see Section 3).  This evolutionary path is driven
    by the most severe aspect of the routing scalability problem that
    each individual ISP faces at each stage, yet an interesting outcome
    is that the overall routing architecture evolves towards the
    separation between customer networks and provider networks, on which
    our APT design was based.  The major difference between the
    evolutionary design and the initial APT design is that the
    separation is not a starting point of the design, but rather a
    natural result of the evolutionary process.

    We also draw a distinction between an evolutionary process towards a
    final solution direction and the "incremental deployability" that
    many previously proposed new designs claim to have (see Section 4).
    One must not mistake co-existence between a new design and the
    existing system for incremental deployability, because the latter
    requires offering immediate gains to first movers; otherwise no one
    has an incentive to move first.


2.  Difficulties in Deploying New Solutions

    Two fundamental properties of the Internet are its distributed
    governance and its diversity along multiple dimensions.  The
    Internet is an interconnection of tens of thousands of independently
    administered networks, each with its own budget, planning, business
    model and operational practices.  As a result, not everyone shares
    the same view as far as the routing scalability issue is concerned.
    For example, many customer networks and small regional providers do
    not carry the full BGP routing table internally; instead they
    propagate only internal routes inside their networks and use default
    routing to reach the rest of the Internet through one or a few exit
    points.  On the other hand, large networks in general carry the full
    routing table internally for efficient data delivery to a large
    number of destinations.  As a result, the former may not feel the
    pain of routing table growth, but the latter do.

    Even among the networks that do carry the full routing table
    internally, some (such as content providers) are able to upgrade
    their routing infrastructure every few years to keep up with the
    demand of the ever-growing BGP table; others may not be able to
    afford doing so.  For example, we learned from a few large ISPs
    that, although they were able to upgrade the relatively small number
    of core routers with the latest technology that can handle a million
    or more routes, they could not afford to upgrade all their edge
    routers, which may number a thousand or more, even though some of
    them are more than 10 years old.  Consequently, some networks may
    encounter FIB or RIB size limitations earlier than others; some may
    experience severe problems while others may not feel the problem at
    all.  Even within the same network, some routers can handle the
    increasing routing table size while others cannot.  Several
    incidents have occurred recently that were caused by edge router RIB
    or FIB overflow.  Although these incidents may have been triggered
    by other problems (e.g., route leaks) that inflated the RIB size,
    they did show that a large RIB can easily push old edge routers off
    the cliff.

    Therefore, although finding a way to control routing table size is
    necessary in the long run, especially in light of increasing IPv6
    deployment, different networks can have different degrees of
    incentive to solve the problem, and some may not see a need to take
    any action towards fixing the problem for the time being.


    Yet another important issue in solution evaluation is network
    economics.  A new solution usually calls for software upgrades or
    even new hardware, both of which require additional investment as
    well as new expertise in managing and troubleshooting the new
    technology.  The affordability of deploying a new design varies
    greatly among different networks.  Even if a network suffers pain
    from the growing routing table size, it still may not be able to
    deploy a new solution if the cost is considered prohibitively high.
    Instead, people tend to look for simple twists of the existing
    system that can provide effective relief from the RIB/FIB growth
    pressure.  One such simple patch was presented at the October 2008
    NANOG meeting [NANOG44].

    Each network makes its own business decision on whether to deploy a
    new design, based on its evaluation of the severity of the problem
    and whether it can afford to deploy the solution.  Given the scale
    and diversity of the Internet, it is certain that the buy-in of any
    new solution will not be uniform.  Even for those networks that
    require a solution to handle routing scalability, the deployment
    will likely be a gradual process consisting of several stages.
    Furthermore, the day when the global Internet as a whole deploys a
    new solution may never come.

    To summarize: we see that

    o  Different parties have different perceptions of the routing
       scalability problem due to their differences in economic
       conditions and operational practices; some are yet to be
       convinced that the routing scalability problem is serious
       [BGP2008].
    o  For networks that face the routing scalability problem, the
       severity can still differ from router to router.
    o  Networks that experience routing scalability problems are also
       likely to have affordability concerns about new solutions.
    o  If any new solution gets rolled out, it is certain to start from
       one or a few parties first, and may or may not ever reach the
       entire Internet.

    The above argues that we should attack the routing scalability
    problem with an evolutionary approach.  By evolution we mean that
    (1) the solution should be deployable by an individual AS that deems
    an action necessary, without needing coordination with neighbor
    ASes; (2) the solution can bring immediate gain to a single first
    mover; (3) even within an AS, the solution should enable routing
    table size reduction at only those routers whose capacity falls
    behind the FIB or RIB growth curve; and (4) the solution should be
    an incremental step on top of the existing system to minimize the
    cost while being effective.  Building a solution on top of the
    existing system makes it much cheaper and easier to roll out, and
    makes it
    likely to work transparently with the rest of the system that has
    not made the change, or does not make it at the same time.


3.  An Evolutionary Path towards Scalable Routing

    Based on our current understanding of the problem and the solution
    space, in this section we sketch out an evolutionary path towards
    scaling inter-domain routing.  As the Internet continues to evolve
    over time, our understanding will likely evolve as well, thus the
    specific path sketched out in this draft may, or is likely to,
    change.  The main point of this draft is not any specific
    evolutionary path that the Internet may take; rather, we aim to show
    strong evidence that such an evolutionary path both exists and is
    feasible; that we should aim for an evolutionary path to address the
    routing scalability problem, rather than attempting a brand new
    design; and, most of all, that such incremental steps need not
    result in a "local maximum" situation; instead they can indeed move
    the overall system towards the routing architecture that our new
    design, APT, aimed for.

    At this time we can see several steps in evolving today's BGP
    routing system towards controllable growth of the routing table
    size.  At each step we identify the potentially most severe pain
    that warrants a fix.  We then identify a fix that has a reasonable
    cost, can be carried out by individual networks, and can be built on
    top of existing operations, so that it does not break any other
    parts of the global routing system.  Note that any such simple fix
    necessarily has its limitations.  As the fix gets widely deployed,
    its limitations are likely to become more pronounced, and can become
    the next problem to address.  At the same time, other aspects of the
    routing scalability problem that were not addressed by these fixes
    may become more severe.  These issues will lead to the next step of
    evolving the system forward.

3.1.  Step One: Local FIB Size Reduction

    During early 2009 we conducted a quick survey on routing scalability
    among a small group of people with operational expertise.  The
    results identified the fast-growing FIB size as the highest priority
    concern in routing scalability; this is also consistent with the
    results from the 2006 IAB Workshop on Routing and Addressing
    [RFC4984].  Therefore, we consider reducing FIB size the first issue
    to address, and we believe there is no major disagreement regarding
    this problem statement.

    The proposed solutions for resolving this FIB scalability problem,
    on the other hand, differ significantly.  Most of the proposals
    presented to the IRTF Routing Research Group (including our own
    earlier work, APT) took the direction of a basic architectural
    change.  Not only is an architectural change likely to take a long
    time to go through the IETF standardization process and be costly to
    roll out, it also suffers from a more fundamental problem: the
    difficulty of bringing immediate benefits to first movers while
    remaining compatible with today's deployed base.  We discuss the
    difficulties in deploying a new design with architectural changes
    further in Section 4.

    A different direction to reduce FIB size has also been proposed.
    Proposals in this direction do not call for immediate architectural
    changes; instead they take pragmatic approaches to reducing FIB
    size.  A simple idea to compress the FIB has been suggested
    independently by multiple people for some time.  It works in the
    following way: if all longer prefixes, say those under 1.0.0.0/8,
    share the same next hop with their covering prefix 1.0.0.0/8, then
    only 1.0.0.0/8 needs to be installed in the FIB.  A recent study
    [FIBAggregate] refined this basic idea into an effective FIB
    Aggregation scheme, and proposed an efficient FIB update mechanism
    for when the next hop of the covering or some covered prefix(es)
    changes and when prefixes need to be added to or removed from the
    FIB.  Preliminary evaluation shows that different FIB Aggregation
    techniques can reduce the FIB size by 50% to 70% or more with no
    impact on the correctness of packet forwarding.
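
    The following Python fragment is a minimal sketch of this basic
    idea, not the algorithm of [FIBAggregate]; the table contents and
    names are made up.  It drops every entry whose next hop matches the
    next hop it would inherit from the longest covering prefix that is
    kept:

      import ipaddress

      def aggregate_fib(fib):
          # fib maps prefix strings (e.g. '1.0.0.0/8') to next hops.
          # Keep an entry only if dropping it would change some
          # longest-prefix-match result; O(n^2) for brevity (a real
          # scheme would use a prefix trie).
          plen = lambda s: ipaddress.ip_network(s).prefixlen
          kept = {}
          for p in sorted(fib, key=plen):        # shortest prefixes first
              net, inherited, best = ipaddress.ip_network(p), None, -1
              for k, nh in kept.items():
                  knet = ipaddress.ip_network(k)
                  if knet.prefixlen > best and net.subnet_of(knet):
                      inherited, best = nh, knet.prefixlen
              if fib[p] != inherited:            # forwarding would change
                  kept[p] = fib[p]
          return kept

      fib = {'1.0.0.0/8': 'A', '1.1.0.0/16': 'A',
             '1.2.0.0/16': 'A', '2.0.0.0/8': 'B'}
      print(aggregate_fib(fib))   # {'1.0.0.0/8': 'A', '2.0.0.0/8': 'B'}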

    FIB aggregation requires no protocol changes.  However, the
    effectiveness of FIB aggregation depends on the aggregatability of
    covered and covering prefixes, hence there is a limit on how much it
    can reduce the FIB size (i.e., it cannot shrink the FIB to an
    arbitrarily small size).  Another proposal for FIB size reduction is
    Virtual Aggregation (VA) by Francis and Xu [Virtual_Aggregation].
    VA has the potential to reduce the FIB size by a factor of 10 or
    more.

3.2.  Step Two: Network-Coordinated FIB Size Reduction

    Briefly, Virtual Aggregation works as follows.  An ISP can reduce
    its routers' FIB size by configuring a router, dubbed an Aggregation
    Point Router (APR), to announce a short prefix, say 1.0.0.0/8, into
    its own network in place of multiple longer prefixes that fall
    within 1.0.0.0/8.  This short prefix is called a virtual prefix.
    The APR maintains FIB entries for all the longer prefixes (e.g.,
    1.1.0.0/16) covered by the virtual prefix it announces, while other
    routers in the network only maintain the virtual prefix 1.0.0.0/8.
    When a router R receives a packet to be forwarded to 1.1.0.0/16,
    R's FIB will direct the packet to the APR, and the APR then
    encapsulates the packet to the egress router for the actual prefix
    1.1.0.0/16.
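
    The toy fragment below illustrates this forwarding behavior; the
    table contents and names are hypothetical and this is not an
    implementation of [Virtual_Aggregation].  A non-APR router carries
    only the virtual prefix pointing at the APR, while the APR maps the
    real prefix to the corresponding egress router and encapsulates:

      import ipaddress

      ROUTER_FIB = {'1.0.0.0/8': 'APR'}          # virtual prefix only
      APR_TABLE = {'1.1.0.0/16': 'egress-E1',    # real (covered) prefixes
                   '1.2.0.0/16': 'egress-E2'}

      def lpm(table, dst):
          # Longest-prefix match of destination address dst in table.
          addr = ipaddress.ip_address(dst)
          hits = [p for p in table if addr in ipaddress.ip_network(p)]
          if not hits:
              return None
          best = max(hits, key=lambda p: ipaddress.ip_network(p).prefixlen)
          return table[best]

      def forward(dst):
          hop = lpm(ROUTER_FIB, dst)
          if hop == 'APR':
              # The virtual prefix steers the packet to the APR, which
              # tunnels it to the egress for the actual prefix.
              return ('encapsulate-to', lpm(APR_TABLE, dst))
          return ('forward-to', hop)

      print(forward('1.1.2.3'))   # ('encapsulate-to', 'egress-E1')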

    Both FIB Aggregation and Virtual Aggregation represent evolutionary
    steps towards scaling the FIB size.  Each of them can be done by an
    individual ISP to effectively shrink the FIB size of its routers,
    and has no impact on the routing operations of any other network.
    Virtual Aggregation can be more effective in FIB size reduction;
    however, because all packets destined to the prefixes that have been
    aggregated away will go through the APR, Virtual Aggregation
    introduces additional delivery delay (i.e., path stretch) and
    encapsulation overhead, as well as the potential for the APR to
    become a traffic concentration point.  Several operational steps can
    be applied to mitigate these problems.

    o  Do not aggregate prefixes that carry heavy volumes of traffic
       (popular prefixes).  Based on the assumption that most traffic
       falls into a small percentage of prefixes, avoiding aggregation
       of popular prefixes can keep the majority of data traffic free of
       path stretch and keep the APR from being overloaded.
    o  One can also control APR load by using more APRs to share the
       load.
    o  Proper positioning of APRs can minimize the path stretch
       [VA_performance].
    o  Finally, if an APR receives a heavy volume of traffic from a
       certain ingress router, the APR can send that ingress router the
       FIB entries that its traffic is destined to, so that the ingress
       router can cache the FIB entries and encapsulate the packets
       towards the egress routers directly (sketched below).  This both
       reduces the APR load and eliminates the path stretch.
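
    A rough sketch of this cache-push behavior follows; the counters,
    threshold, and names are hypothetical, for illustration only:

      from collections import defaultdict

      HOT = 10000                    # hypothetical packet-count threshold
      apr_count = defaultdict(int)   # (ingress, prefix) -> packets seen
      pushed = defaultdict(dict)     # ingress -> {prefix: egress}

      def apr_relay(ingress, prefix, egress):
          # Called at the APR for each packet relayed on behalf of
          # `ingress`; once traffic for (ingress, prefix) is hot, push
          # the entry so the ingress can encapsulate directly.
          apr_count[(ingress, prefix)] += 1
          if apr_count[(ingress, prefix)] == HOT:
              pushed[ingress][prefix] = egress

      def ingress_forward(ingress, prefix):
          egress = pushed[ingress].get(prefix)
          return ('encap-to', egress) if egress else ('send-via', 'APR')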

    The caching technique in the last bullet makes an APR perform more
    or less in the same way as a Default Mapper (DM) in our APT design
    [APT], however with one fundamental difference: deploying an APR
    does not require any new protocol or a new functional box (the DM
    node) that an APT deployment would require.  Instead, an operator
    can simply configure a router to be an APR.  Only if the APR rollout
    becomes successful and the APR load becomes an issue would the
    operator consider additional changes to make the ingress routers
    handle caching.

    We believe that FIB Aggregation (FA) and Virtual Aggregation (VA)
    can be deployed sequentially or in parallel to effectively reduce
    the FIB size at most routers.  Virtual Aggregation can be viewed as
    a poor man's map-encap within one AS.  The APR holds the mapping
    from the prefixes covered by the virtual prefix to the egress
    routers through which those prefixes can be reached.  This mapping
    information is directly derived from BGP routing updates, without a
    new mapping distribution system.  The APR then encapsulates packets
    to those egress routers.

    How much time can FA and VA buy us in curtailing FIB size growth?
    It seems only time can tell.  But if we look ahead one step: as the
    Internet continues to grow, and as IPv6 deployment starts rolling
    out, more networks may face the FIB size problem and adopt FA and VA
    as solutions.  When two or more adjacent ASes all deploy Virtual
    Aggregation, packets that traverse these ASes will experience the
    cumulative path stretch and encapsulation/decapsulation cost of all
    the ASes along their paths.  The need to resolve this new problem
    (of cumulative path stretch and overhead) can naturally lead to the
    next step of evolution towards better routing scalability.

3.3.  Step Three: Reducing Adjacent AS Virtual Aggregation Overhead

    Assume the AS path a packet takes is W-X-Y-Z, and both X and Y have
    deployed Virtual Aggregation.  Then, instead of encapsulating to X's
    own egress router, we would like X's APR to encapsulate the packet
    directly to the egress router of Y that connects Y to Z.  This
    reduces the path stretch, and the packet only needs to be
    encapsulated/decapsulated once instead of twice.

    To enable such inter-AS Virtual Aggregation, X's APR needs to know
    Y's egress router for a given destination prefix P.  This mapping
    information (i.e., the mapping from a destination prefix P to an
    egress router) needs to be propagated from Y to X.  The path of
    least resistance is to piggyback such mapping information on the
    existing BGP announcement for prefix P.  Francis and Xu have
    proposed such an extension to BGP, which carries the mapping
    information in a new BGP attribute [InterDomainVA]; the APT team was
    also looking into more or less the same design when the
    above-mentioned draft was published.
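
    As a toy illustration of the intent (the update representation and
    attribute name below are hypothetical, not the encoding defined in
    [InterDomainVA]): Y attaches its egress router for a prefix to the
    BGP announcement, and X's APR encapsulates to that remote egress
    instead of to X's own egress towards Y:

      # Hypothetical shape of a BGP UPDATE from VA-enabled neighbor Y.
      update_from_Y = {'prefix': '1.1.0.0/16',
                       'as_path': ['Y', 'Z'],
                       'egress_attr': 'Y-egress-to-Z'}   # new attribute

      x_apr_mapping = {}   # X's APR: prefix -> encapsulation target

      def on_update(update, own_egress_to_neighbor):
          # Prefer the remote egress advertised by the neighbor; fall
          # back to X's own egress towards that neighbor otherwise.
          x_apr_mapping[update['prefix']] = (
              update.get('egress_attr') or own_egress_to_neighbor)

      on_update(update_from_Y, 'X-egress-to-Y')
      print(x_apr_mapping['1.1.0.0/16'])   # 'Y-egress-to-Z': one encap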

    We show the feasibility of this step by the following reasoning.
    First, this step towards better routing scalability will take place
    only after at least two adjacent networks (X and Y in our example)
    have deployed VA and benefited from it.  Therefore we reason that
    they would not want to move away from VA, but would like to minimize
    VA's cost in path stretch and encapsulation to improve the
    performance for their customers.  Second, the required BGP
    implementation changes are backward compatible, meaning that
    networks that have deployed this solution can easily interwork with
    networks that have not.  Furthermore, adjacent VA-enabled ASes may
    not need to exchange the mapping for all of their prefixes.  Again,
    if we take the common assumption that the majority of traffic falls
    within a relatively small number of prefixes, then AS Y may only
    need to send AS X the mapping for a small number of prefixes to
    improve the performance for the bulk of the traffic.

    As a side note we would also like to point out that this virtual
    aggregation mapping exchange *closely* resembles the early design of
    APT mapping information exchange between Default Mappers [APT-00].
    The content of the mapping exchanges is somewhat different, but a
    more fundamental difference between what we discuss in this section
    and that early APT design from 2007 is the following: here we sketch
    out an evolutionary path forward which does not require, as a
    starting point, any protocol change or information exchange across
    multiple ASes, as the early APT design did.  Rather, the need for
    mapping exchange arises only after the FIB size reduction has been
    achieved, and the mapping exchange can start with two adjacent ASes
    after each of them has deployed Virtual Aggregation.

3.4.  Step Four: Reducing RIB Size

    Piggybacking the virtual aggregation mapping information on BGP can
    work well when the mapping table is small.  When more networks have
    adopted Virtual Aggregation, the mapping table is likely to grow
    large, which may make it no longer feasible to piggyback all the
    mapping information on the existing BGP sessions.  The main problem,
    as we perceive it today, would be RIB size growth: a BGP router will
    receive the same mapping information from multiple neighboring BGP
    routers, and store all of it in its Adj-RIBs-In.  Thus BGP routers
    may end up storing multiple copies of the same mapping information.
    For example, assuming ISP ASes W, X, Y, and Z have full-mesh
    connectivity among themselves, and AS W propagates a mapping entry
    [egress router R, customer prefix P], then X will receive three
    copies of this mapping entry: one each from W, Y, and Z.

    This issue was pointed out back in 2007 when the early APT design
    was discussed, and one suggestion to get around the problem is to
    use separate BGP sessions for mapping information exchange.  Since
    the mapping information is global in scope (i.e., the pair [egress
    router R, customer prefix P] is independent of which path one uses
    to reach egress router R), this separate session can apply special
    rules to remove duplicate entries.
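
    One such rule could be as simple as keying the mapping table on the
    [egress router, prefix] pair and discarding later copies.  The
    sketch below (hypothetical names, not a specification) illustrates
    the idea:

      mapping_rib = {}   # prefix -> egress router; path independent

      def on_mapping(prefix, egress, learned_from):
          # Install [egress router, prefix] once, no matter how many
          # neighbors re-announce the same mapping.
          if mapping_rib.get(prefix) == egress:
              return 'duplicate, ignored'
          mapping_rib[prefix] = egress
          return 'installed'

      for neighbor in ('W', 'Y', 'Z'):
          print(on_mapping('P', 'R', neighbor))
      # installed once, then two duplicates ignored: one stored copy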

    Another factor is that, after a network X has deployed Virtual
    Aggregation for a while and has gained sufficient operational
    experience, it may become clear that many of its routers no longer
    need to keep the full RIB.  If an internal router has a small FIB
    and relies on APRs to route packets towards all other destinations,
    it does not need a full RIB to build its FIB.  Theoretically
    speaking, all border routers of X that connect to legacy networks
    (i.e., those that have not deployed VA) would still need to keep the
    full RIB in order to make BGP announcements to the legacy neighbors.
    In practice, however, only the customer-facing border routers need a
    full RIB.  The other border routers, those that face either peer or
    provider legacy neighbors, only need to announce X's own customer
    prefixes to them.  Careful engineering analysis and
    configuration can eliminate the need for many routers to keep the
    full RIB; among those that do keep the full RIB will be the ones
    serving as APRs.

    As we look further into the future, the picture becomes blurrier, so
    what we forecast here may or may not turn out to be accurate.
    Having said that, we perceive that the combination of the
    aforementioned two factors (relieving regular routers from storing
    the mapping table and the full RIB) would lead to moving the mapping
    dissemination from the regular BGP instance (which is used for
    inter-domain routing) to a separate BGP instance running only
    between APRs via multi-hop BGP sessions.  Though the protocol is
    still BGP for ease of deployment, APRs would run a different session
    (e.g., on a different TCP port) for mapping dissemination only.
    Other routers run the regular BGP instance for inter-domain routing,
    but are relieved of the overhead of storing and propagating mapping
    information or the full RIB.

    When the RIB size for most routers (other than the APRs) is reduced,
    which prefixes get dropped out of the RIB?  Since APRs (or ingress
    routers, if they are upgraded to handle caching) must encapsulate
    packets towards the egress routers that connect to the more specific
    prefixes that have been aggregated away, the ASes must exchange
    reachability information about their own topologies, so that routers
    in different ASes know how to reach each other.  The prefixes that
    get aggregated out of the core routing system would be those that
    belong to the edge ASes.  As such, Virtual Aggregation plus mapping
    exchange effectively drives the overall routing system towards the
    separation of edge site prefixes from transit network routing, the
    scalable routing architecture that the APT design depicted
    [eFIT_IPv6].

    Again we cannot help but point out the close resemblance between the
    system we depicted above and the original APT design.  On the
    surface, it seems the only noticeable difference is the names: here
    we have APRs instead of APT's Default Mappers that use BGP to
    exchange mapping information.  But once again we must not forget an
    essential difference: we reach this perceived stage towards scalable
    routing through an evolutionary path, instead of requiring
    installation of a new design from day one.

3.5.  Step Five: Insulating the Core from Edge Churns

    In the current Internet, flaps of customer prefixes are propagated
    to the rest of the Internet in the form of BGP updates, i.e.,
    routing churn.  With Virtual Aggregation and mapping exchange, these
    churns would be reflected as mapping updates, disseminated through
    the interconnections of APRs.  We perceive this as a benefit, as
    other non-APR routers can be sheltered from updates due to edge
    instabilities.

    Our earlier measurement and analysis study [TopologyGrowth] has
    shown that most Internet topology growth comes from the addition of
    customer edge ASes.  It is conceivable that, as the number of
    customer sites continues to increase, the amount of churn may become
    too much to handle in a cost-effective way.  A solution to this edge
    churn problem is to insulate the mapping dissemination system from
    the edge dynamics.  Based on current BGP data, our estimation shows
    that, if we could remove BGP updates induced by customer prefix
    instabilities, we would reduce the total amount of routing churn by
    an order of magnitude [eFIT_IPv6].  Ideally, when the link
    connecting a customer site to a provider fails, the mapping system
    should propagate this failure information only when the failure has
    a long duration, so that every network will be aware of the failure
    and choose an alternative path to reach the affected customer site.
    But long-lasting failures probably do not happen frequently.  Short
    failures, which are frequent, should not be propagated through the
    mapping system.  Instead, they should be handled by other means.
    For example, in the APT design, failure handling is data-driven,
    i.e., a link failure to an edge network is not reported unless and
    until there are data packets heading towards the failed link.  We
    are actively working on an evolutionary solution that can provide
    the same data-driven handling of edge failures as APT does.
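
    The sketch below illustrates such data-driven suppression; the
    threshold, names, and return values are hypothetical, not part of
    the APT specification:

      import time

      LONG_FAILURE = 300.0    # hypothetical "long outage" threshold (s)
      failed_since = {}       # customer prefix -> time its link went down

      def link_down(prefix):
          failed_since[prefix] = time.monotonic()

      def link_up(prefix):
          failed_since.pop(prefix, None)

      def on_data_packet(prefix):
          # Invoked only when traffic actually heads towards the prefix,
          # i.e., failure handling is data-driven.
          down_at = failed_since.get(prefix)
          if down_at is None:
              return 'deliver'
          if time.monotonic() - down_at < LONG_FAILURE:
              # Short, frequent failures stay local; no mapping churn.
              return 'handle-locally'
          # A long-lasting failure with real traffic heading to it: now
          # it is worth telling the mapping system so that other
          # networks pick an alternative path.
          return 'withdraw-mapping'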

3.6.  Summary

    If we imagine a picture in which all the networks in the Internet
    had deployed all the steps of routing scalability improvement
    sketched above, then the Internet routing system would have
    converged to a new map-encap routing architecture that resembles
    APT.  Then what is the fundamental difference between the
    evolutionary path described in this draft and the deployment of APT?

    First, we emphasize that the fundamental goal is to reduce the
    routing system size, and that the separation of edges and core (or
    EIDs from Locators, in LISP's terminology) itself should *not* be a
    required starting point.  Second, we show that the evolutionary
    path, which goes through several steps with clearly identified
    benefits and minimal cost at each step, can naturally converge
    towards the separation as a result.  We make two points from this
    last statement: (1) This can be taken as evidence that evolution can
    indeed lead to architectural changes, as it moves the system towards
    the same point that a new design points to.  (2) We used the phrase
    "converge towards separation", rather than "achieving separation",
    because we believe that, even after a long time, when many networks
    have adopted the solution, some networks will most likely remain at
    various early stages of the evolution, and some may not have made a
    single change.  This is the nature of the Internet, due to the two
    properties we mentioned at the beginning: its distributed governance
    and its diversity along economic and operational dimensions.


4.  Evolution versus Incremental Deployability

    So far our discussion has focused on a possible evolutionary path of
    the routing system towards a scalable design.  In this section we
    broaden the discussion to a more general question: many new designs
    claim to be incrementally deployable.  So what is the difference
    between an "incrementally deployable" new design and an evolutionary
    path?

    We believe that one fundamental difference is that all new designs
    carry an implicit assumption that the entire system will eventually
    move to the new design.  No matter how much effort the designer puts
    into the incremental deployment of a new design, the design itself
    does not start with the assumption that significant portions of the
    system may never adopt it.  Therefore, it is likely that the assumed
    benefit of the new design would be achieved only after a majority,
    if not the whole, of the system has deployed the design, and that
    the cost of incremental deployment would be minimized only then as
    well.  The incremental deployment machinery simply glues together
    the part that has made the change and the rest that has not, to make
    the system function at the intermediate, and hopefully transient,
    stage.  However, the system as a whole would be in a sub-optimal
    state until the new design gets fully deployed.  LISP can serve as
    an example here.

    In contrast, gradual evolution in a large system depicts a picture
    where changes may happen here and there as needed, but there is no
    expectation that the system as a whole must change.  Whoever adopts
    a step forward gains the benefit, without waiting for others to take
    action.

    The evolutionary approach recognizes that changes to the Internet
    can only be a gradual process with multiple stages.  At each stage,
    networks that make the changes must have the incentive to do so.
    More specifically,

    1.  Each stage focuses on an immediate problem with enough economic
        impact that warrants a fix.
    2.  Each stage offers a solution that solves the problem, does not
        break other parts of the Internet, and can be deployed with a
        reasonable cost considering the specific problem.
    3.  As the solution is deployed by more and more networks, its
        downsides may become more pronounced and eventually require a
        fix, which leads to the next stage of the evolution.

    Like many others, we too hoped that our new design, APT, could
    eventually be deployed everywhere to put routing scalability under
    control.  We gradually realized that it is infeasible to attempt to
    roll out a new routing framework (i.e., a clean separation of edge
    prefixes from the core routing system) in a vast deployed system.
    The Internet Protocol, IP, was designed to accommodate heterogeneity
    at the subnet technology level.  Today, the intrinsic heterogeneity
    and distributed governance of the Internet require accommodating
    heterogeneity at the network control plane.  Solutions to routing
    scalability should be control knobs on top of the deployed base,
    available to those parties who need them, and there should not be an
    expectation that the entire Internet would (eventually) move to a
    new design.

    An evolutionary process accommodates differences in different parts
    of the system, as new functions are built on top of, and hence can
    peacefully co-exist with, the deployed base.  A revolutionary new
    design, on the other hand, focuses on the final outcome once the
    replacement of the old by the new is done throughout the Internet.
    The latter would offer a clean picture of the overall system,
    assuming the final stage could be reached.  The former would present
    a much messier or more complex picture, both because we twist old
    protocols for new functions and because different parts may do
    different things.  As pointed out by Haldane over 80 years ago
    [SizeImpact], in biological systems, "The higher animals are not
    larger than the lower because they are more complicated.  They are
    more complicated because they are larger."  We believe the same is
    true for man-made systems: as a system grows larger in size, it
    necessarily becomes more complicated.


5.  Acknowledgements

    The authors are part of the APT team.  The APT project is funded by
    the US National Science Foundation.  The survey mentioned in
    Section 3.1 was conducted by Dan Jen.


6.  Security Considerations

    This draft discusses the need for the Internet to follow an
    evolutionary path forward.  There is no direct impact on
    Internet security.


7.  Informative References

    [APT]      Jen, D., Meisel, M., Massey, D., Wang, L., Zhang, B., and
               L. Zhang, "APT: A Practical Tunneling Architecture for
               Routing Scalability", UCLA Computer Science Department
               Technical Report 080004, March 2008.

    [APT-00]   Jen, D., Meisel, M., Massey, D., Wang, L., Zhang, B., and
               L. Zhang, "APT: A Practical Transit Mapping Service",
               draft-jen-apt-00, July 2007.

    [BGP2008]  Huston, G., "BGP IN 2008 - what's changed", APRICOT
               presentation, 2009, <http://apricot2009.net/
               index.php?option=content&task=view&id=51>.

    [FIBAggregate]
               Zhang, B., Wang, L., Zhao, X., Liu, Y., and L. Zhang,
               "FIB Aggregation", draft-zhang-fibaggregation-01.txt,
               October 2009.

    [InterDomainVA]
               Xu, X. and P. Francis, "Simple Tunnel Endpoint Signaling
               in BGP", draft-xu-tunnel-00, February 2009.

    [LISP]     Farinacci, D., Fuller, V., Meyer, D., and D. Lewis,
               "Location/ID Separation Protocol (LISP)",
               draft-farinacci-lisp-12, March 2009.

    [NANOG44]  Roisman, D., "Extending the Life of Layer 3 Switches in a
               256k+ Route World", NANOG44, October 2008, <http://
               www.nanog.org/meetings/nanog44/presentations/Monday/
               Roisman_lightning.pdf>.

    [RFC4984]  Meyer, D., Zhang, L., and K. Fall, "Report from the IAB
               Workshop on Routing and Addressing", RFC 4984,
               September 2007.

    [RRG]      RRG, "IRTF Routing Research Group Home Page", <http://
               tools.ietf.org/group/irtf/trac/wiki/RoutingResearchGroup>.

    [SIRA]     Zhang, B. and et. al., "A Secure and Scalable Internet
               Routing Architecture", ACM SIGCOMM 2006 Poster Session.

    [SizeImpact]
               Haldane, "On Being the Right Size", 1928,
               <http://irl.cs.ucla.edu/papers/right-size.html>.

    [TopologyGrowth]
               Oliveira, R., Zhang, B., and L. Zhang, "Observing the
               Evolution of Internet AS Topology", ACM SIGCOMM 2007.

    [VA_performance]
               Ballani, B., Francis, P., Jen, D., Xu, X., and L. Zhang,
               "FIB Aggregation", draft-ietf-grow-va-perf-00.txt, July
               2009.

    [Virtual_Aggregation]
               Francis, P., Xu, X., and H. Ballani, "FIB Suppression
               with Virtual Aggregation and Default Routes",
               draft-francis-idr-intra-va-01, September 2008.

    [eFIT_IPv6]
               Massey, D. and et. al., "A Scalable Routing System Design
               for Future Internet", ACM SIGCOMM 2007 IPv6 Workshop.


Authors' Addresses

    Beichuan Zhang
    Univ. of Arizona

    Email: bzhang@arizona.edu


    Lixia Zhang
    UCLA

    Email: lixia@cs.ucla.edu


    Lan Wang
    Univ. of Memphis

    Email: lanwang@cs.memphis.edu
