Re: Routing Directorate Review for "Use of BGP for routing in large-scale data centers" (adding RTG WG)
Alia Atlas <akatlas@gmail.com> Mon, 25 April 2016 19:02 UTC
To: "Acee Lindem (acee)" <acee@cisco.com>
Archived-At: <http://mailarchive.ietf.org/arch/msg/rtgwg/uS3cloRquF3rAi3g17bI5GmSdBk>
Cc: "draft-ietf-rtgwg-bgp-routing-large-dc@ietf.org" <draft-ietf-rtgwg-bgp-routing-large-dc@ietf.org>, Routing WG <rtgwg@ietf.org>, Routing Directorate <rtg-dir@ietf.org>, Routing ADs <rtg-ads@tools.ietf.org>

Hi Acee,

Thank you very much for your review.

Authors, could you please respond soon? I am hoping to get this out to
IETF Last Call by Thursday - and on the telechat for May 19. That depends
on timely updates from the authors and shepherd.

Thanks,
Alia

On Mon, Apr 25, 2016 at 1:16 PM, Acee Lindem (acee) <acee@cisco.com> wrote:
> Hello,
>
> I have been selected as the Routing Directorate reviewer for this draft.
> The Routing Directorate seeks to review all routing or routing-related
> drafts as they pass through IETF last call and IESG review, and sometimes
> on special request. The purpose of the review is to provide assistance to
> the Routing ADs. For more information about the Routing Directorate,
> please see http://trac.tools.ietf.org/area/rtg/trac/wiki/RtgDir
>
> Although these comments are primarily for the use of the Routing ADs, it
> would be helpful if you could consider them along with any other IETF Last
> Call comments that you receive, and strive to resolve them through
> discussion or by updating the draft.
>
> Document: draft-ietf-rtgwg-bgp-routing-large-dc-09.txt
> Reviewer: Acee Lindem
> Review Date: 4/25/16
> IETF LC End Date: Not started
> Intended Status: Informational
>
> Summary:
> This document is basically ready for publication, but has some minor
> issues and nits that should be resolved prior to publication.
>
> Comments:
> The document starts with the requirements for MSDC routing and then
> provides an overview of Clos data center topologies and data center
> network design. This overview attempts to cover a lot of material in a
> very small amount of text. While not completely successful, the overview
> provides a lot of good information and references. The bulk of the
> document covers the usage of EBGP as the sole data center routing protocol
> and other aspects of the routing design including ECMP, summarization
> issues, and convergence. These sections provide a very good guide for
> using EBGP in a Clos data center and an excellent discussion of the
> deployment issues (based on real deployment experience).
>
> The technical content of the document is excellent. The readability
> could be improved by breaking up some of the run-on sentences and with the
> suggested editorial changes (see Nits below).
>
> Major Issues:
>
> I have no major issues with the document.
>
> Minor Issues:
>
> Section 4.2: Can an informative reference be added for Direct Server
> Return (DSR)?
> Section 5.2.4 and 7.4: Define precisely what is meant by "scale-out"
> topology somewhere in the document.
> Section 5.2.5: Can you add a backward reference to the discussion of
> "lack of peer links inside every tier"? Also, it would be good to describe
> how this would allow for summarization and under what failure conditions.
> Section 7.4: Should you add a reference to
> https://www.ietf.org/id/draft-ietf-rtgwg-bgp-pic-00.txt to the penultimate
> paragraph in this section?
>
> Nits:
>
> ***************
> *** 143,149 ****
> network stability so that a small group of people can effectively
> support a significantly sized network.
>
> ! Experimentation and extensive testing has shown that External BGP
> (EBGP) [RFC4271] is well suited as a stand-alone routing protocol for
> these type of data center applications. This is in contrast with
> more traditional DC designs, which may se simple tree topologies and
> --- 143,149 ----
> network stability so that a small group of people can effectively
> support a significantly sized network.
>
> ! Experimentation and extensive testing have shown that External BGP
> (EBGP) [RFC4271] is well suited as a stand-alone routing protocol for
> these type of data center applications. This is in contrast with
> more traditional DC designs, which may use simple tree topologies and
> ***************
> *** 178,191 ****
> 2.1. Bandwidth and Traffic Patterns
>
> The primary requirement when building an interconnection network for
> ! large number of servers is to accommodate application bandwidth and
> latency requirements. Until recently it was quite common to see the
> majority of traffic entering and leaving the data center, commonly
> referred to as "north-south" traffic. Traditional "tree" topologies
> were sufficient to accommodate such flows, even with high
> oversubscription ratios between the layers of the network. If more
> bandwidth was required, it was added by "scaling up" the network
> ! elements, e.g. by upgrading the device's linecards or fabrics or
> replacing the device with one with higher port density.
>
> Today many large-scale data centers host applications generating
> --- 178,191 ----
> 2.1. Bandwidth and Traffic Patterns
>
> The primary requirement when building an interconnection network for
> ! a large number of servers is to accommodate application bandwidth and
> latency requirements. Until recently it was quite common to see the
> majority of traffic entering and leaving the data center, commonly
> referred to as "north-south" traffic. Traditional "tree" topologies
> were sufficient to accommodate such flows, even with high
> oversubscription ratios between the layers of the network. If more
> bandwidth was required, it was added by "scaling up" the network
> ! elements, e.g., by upgrading the device's linecards or fabrics or
> replacing the device with one with higher port density.
>
> Today many large-scale data centers host applications generating
> ***************
> *** 195,201 ****
> [HADOOP], massive data replication between clusters needed by certain
> applications, or virtual machine migrations. Scaling traditional
> tree topologies to match these bandwidth demands becomes either too
> ! expensive or impossible due to physical limitations, e.g. port
> density in a switch.
>
> 2.2. CAPEX Minimization
> --- 195,201 ----
> [HADOOP], massive data replication between clusters needed by certain
> applications, or virtual machine migrations. Scaling traditional
> tree topologies to match these bandwidth demands becomes either too
> ! expensive or impossible due to physical limitations, e.g., port
> density in a switch.
>
> 2.2. CAPEX Minimization
> ***************
> *** 209,215 ****
>
> o Unifying all network elements, preferably using the same hardware
> type or even the same device. This allows for volume pricing on
> ! bulk purchases and reduced maintenance and sparing costs.
>
> o Driving costs down using competitive pressures, by introducing
> multiple network equipment vendors.
> --- 209,215 ----
>
> o Unifying all network elements, preferably using the same hardware
> type or even the same device. This allows for volume pricing on
> ! bulk purchases and reduced maintenance and inventory costs.
>
> o Driving costs down using competitive pressures, by introducing
> multiple network equipment vendors.
> ***************
> *** 234,244 ****
> minimizes software issue-related failures.
>
> An important aspect of Operational Expenditure (OPEX) minimization is
> ! reducing size of failure domains in the network. Ethernet networks
> are known to be susceptible to broadcast or unicast traffic storms
> that can have a dramatic impact on network performance and
> availability. The use of a fully routed design significantly reduces
> ! the size of the data plane failure domains - i.e. limits them to the
> lowest level in the network hierarchy. However, such designs
> introduce the problem of distributed control plane failures. This
> observation calls for simpler and less control plane protocols to
> --- 234,244 ----
> minimizes software issue-related failures.
>
> An important aspect of Operational Expenditure (OPEX) minimization is
> ! reducing the size of failure domains in the network. Ethernet
> networks
> are known to be susceptible to broadcast or unicast traffic storms
> that can have a dramatic impact on network performance and
> availability. The use of a fully routed design significantly reduces
> ! the size of the data plane failure domains, i.e., limits them to the
> lowest level in the network hierarchy. However, such designs
> introduce the problem of distributed control plane failures. This
> observation calls for simpler and less control plane protocols to
> ***************
> *** 253,259 ****
> performed by network devices. Traditionally, load balancers are
> deployed as dedicated devices in the traffic forwarding path. The
> problem arises in scaling load balancers under growing traffic
> ! demand. A preferable solution would be able to scale load balancing
> layer horizontally, by adding more of the uniform nodes and
> distributing incoming traffic across these nodes. In situations like
> this, an ideal choice would be to use network infrastructure itself
> --- 253,259 ----
> performed by network devices. Traditionally, load balancers are
> deployed as dedicated devices in the traffic forwarding path. The
> problem arises in scaling load balancers under growing traffic
> ! demand. A preferable solution would be able to scale the load
> balancing
> layer horizontally, by adding more of the uniform nodes and
> distributing incoming traffic across these nodes. In situations like
> this, an ideal choice would be to use network infrastructure itself
> ***************
> *** 305,311 ****
> 3.1. Traditional DC Topology
>
> In the networking industry, a common design choice for data centers
> ! typically look like a (upside down) tree with redundant uplinks and
> three layers of hierarchy namely; core, aggregation/distribution and
> access layers (see Figure 1). To accommodate bandwidth demands, each
> higher layer, from server towards DC egress or WAN, has higher port
> --- 305,311 ----
> 3.1. Traditional DC Topology
>
> In the networking industry, a common design choice for data centers
> ! typically look like an (upside down) tree with redundant uplinks and
> three layers of hierarchy namely; core, aggregation/distribution and
> access layers (see Figure 1). To accommodate bandwidth demands, each
> higher layer, from server towards DC egress or WAN, has higher port
> ***************
> *** 373,379 ****
> topology, sometimes called "fat-tree" (see, for example, [INTERCON]
> and [ALFARES2008]). This topology features an odd number of stages
> (sometimes known as dimensions) and is commonly made of uniform
> ! elements, e.g. network switches with the same port count. Therefore,
> the choice of folded Clos topology satisfies REQ1 and facilitates
> REQ2. See Figure 2 below for an example of a folded 3-stage Clos
> topology (3 stages counting Tier-2 stage twice, when tracing a packet
> --- 373,379 ----
> topology, sometimes called "fat-tree" (see, for example, [INTERCON]
> and [ALFARES2008]). This topology features an odd number of stages
> (sometimes known as dimensions) and is commonly made of uniform
> ! elements, e.g., network switches with the same port count. Therefore,
> the choice of folded Clos topology satisfies REQ1 and facilitates
> REQ2. See Figure 2 below for an example of a folded 3-stage Clos
> topology (3 stages counting Tier-2 stage twice, when tracing a packet
> ***************
> *** 460,466 ****
> 3.2.3. Scaling the Clos topology
>
> A Clos topology can be scaled either by increasing network element
> ! port density or adding more stages, e.g. moving to a 5-stage Clos, as
> illustrated in Figure 3 below:
>
> Tier-1
> --- 460,466 ----
> 3.2.3. Scaling the Clos topology
>
> A Clos topology can be scaled either by increasing network element
> ! port density or adding more stages, e.g., moving to a 5-stage Clos, as
> illustrated in Figure 3 below:
>
> Tier-1
> ***************
> *** 523,529 ****
> 3.2.4. Managing the Size of Clos Topology Tiers
>
> If a data center network size is small, it is possible to reduce the
> ! number of switches in Tier-1 or Tier-2 of Clos topology by a factor
> of two. To understand how this could be done, take Tier-1 as an
> example. Every Tier-2 device connects to a single group of Tier-1
> devices. If half of the ports on each of the Tier-1 devices are not
> --- 523,529 ----
> 3.2.4. Managing the Size of Clos Topology Tiers
>
> If a data center network size is small, it is possible to reduce the
> ! number of switches in Tier-1 or Tier-2 of a Clos topology by a factor
> of two. To understand how this could be done, take Tier-1 as an
> example. Every Tier-2 device connects to a single group of Tier-1
> devices. If half of the ports on each of the Tier-1 devices are not
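To make the Clos scaling arithmetic in the two hunks above concrete, here
is a rough Python sketch. The half-down/half-up port split and the k-ary
fat-tree layout for the 5-stage case are assumptions borrowed from
[ALFARES2008]-style designs, not requirements stated in the draft:

    # Illustrative only: server capacity of folded Clos fabrics built
    # from uniform k-port switches, assuming each Tier-3 switch splits
    # its ports evenly between servers and uplinks.

    def three_stage_servers(k: int) -> int:
        # Leaf-spine: k/2 spines, each leaf uses k/2 uplinks (one per
        # spine) and k/2 server ports; a k-port spine attaches k leaves.
        return k * (k // 2)                 # k^2 / 2 servers

    def five_stage_servers(k: int) -> int:
        # k-ary fat-tree layout as in [ALFARES2008]: k^3 / 4 servers.
        return k ** 3 // 4

    for k in (32, 64):
        print(f"{k}-port: 3-stage={three_stage_servers(k)}, "
              f"5-stage={five_stage_servers(k)}")

Adding stages is what lets the fabric grow without swapping devices for
higher port density ones, which is the trade-off the 460,466 hunk
describes.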
> ***************
> *** 574,580 ****
> originally defined in [IEEE8021D-1990] for loop free topology
> creation, typically utilizing variants of the traditional DC topology
> described in Section 3.1. At the time, many DC switches either did
> ! not support Layer 3 routed protocols or supported it with additional
> licensing fees, which played a part in the design choice. Although
> many enhancements have been made through the introduction of Rapid
> Spanning Tree Protocol (RSTP) in the latest revision of
> --- 574,580 ----
> originally defined in [IEEE8021D-1990] for loop free topology
> creation, typically utilizing variants of the traditional DC topology
> described in Section 3.1. At the time, many DC switches either did
> ! not support Layer 3 routing protocols or supported them with
> additional
> licensing fees, which played a part in the design choice. Although
> many enhancements have been made through the introduction of Rapid
> Spanning Tree Protocol (RSTP) in the latest revision of
> ***************
> *** 599,605 ****
> as the backup for loop prevention. The major downsides of this
> approach are the lack of ability to scale linearly past two in most
> implementations, lack of standards based implementations, and added
> ! failure domain risk of keeping state between the devices.
>
> It should be noted that building large, horizontally scalable, Layer
> 2 only networks without STP is possible recently through the
> --- 599,605 ----
> as the backup for loop prevention. The major downsides of this
> approach are the lack of ability to scale linearly past two in most
> implementations, lack of standards based implementations, and added
> ! the failure domain risk of syncing state between the devices.
>
> It should be noted that building large, horizontally scalable, Layer
> 2 only networks without STP is possible recently through the
> ***************
> *** 621,631 ****
> Finally, neither the base TRILL specification nor the M-LAG approach
> totally eliminate the problem of the shared broadcast domain, that is
> so detrimental to the operations of any Layer 2, Ethernet based
> ! solutions. Later TRILL extensions have been proposed to solve the
> this problem statement primarily based on the approaches outlined in
> [RFC7067], but this even further limits the number of available
> ! interoperable implementations that can be used to build a fabric,
> ! therefore TRILL based designs have issues meeting REQ2, REQ3, and
> REQ4.
>
> 4.2. Hybrid L2/L3 Designs
> --- 621,631 ----
> Finally, neither the base TRILL specification nor the M-LAG approach
> totally eliminate the problem of the shared broadcast domain, that is
> so detrimental to the operations of any Layer 2, Ethernet based
> ! solution. Later TRILL extensions have been proposed to solve the
> this problem statement primarily based on the approaches outlined in
> [RFC7067], but this even further limits the number of available
> ! interoperable implementations that can be used to build a fabric.
> ! Therefore, TRILL based designs have issues meeting REQ2, REQ3, and
> REQ4.
>
> 4.2. Hybrid L2/L3 Designs
> ***************
> *** 635,641 ****
> in either the Tier-1 or Tier-2 parts of the network and dividing the
> Layer 2 domain into numerous, smaller domains. This design has
> allowed data centers to scale up, but at the cost of complexity in
> ! the network managing multiple protocols. For the following reasons,
> operators have retained Layer 2 in either the access (Tier-3) or both
> access and aggregation (Tier-3 and Tier-2) parts of the network:
>
> --- 635,641 ----
> in either the Tier-1 or Tier-2 parts of the network and dividing the
> Layer 2 domain into numerous, smaller domains. This design has
> allowed data centers to scale up, but at the cost of complexity in
> ! managing multiple network protocols. For the following reasons,
> operators have retained Layer 2 in either the access (Tier-3) or both
> access and aggregation (Tier-3 and Tier-2) parts of the network:
>
> ***************
> *** 644,650 ****
>
> o Seamless mobility for virtual machines that require the
> preservation of IP addresses when a virtual machine moves to
> ! different Tier-3 switch.
>
> o Simplified IP addressing = less IP subnets are required for the
> data center.
> --- 644,650 ----
>
> o Seamless mobility for virtual machines that require the
> preservation of IP addresses when a virtual machine moves to
> ! a different Tier-3 switch.
>
> o Simplified IP addressing = less IP subnets are required for the
> data center.
> ***************
> *** 679,686 ****
> adoption in networks where large Layer 2 adjacency and larger size
> Layer 3 subnets are not as critical compared to network scalability
> and stability. Application providers and network operators continue
> ! to also develop new solutions to meet some of the requirements that
> ! previously have driven large Layer 2 domains by using various overlay
> or tunneling techniques.
>
> 5. Routing Protocol Selection and Design
> --- 679,686 ----
> adoption in networks where large Layer 2 adjacency and larger size
> Layer 3 subnets are not as critical compared to network scalability
> and stability. Application providers and network operators continue
> ! to develop new solutions to meet some of the requirements that
> ! previously had driven large Layer 2 domains using various overlay
> or tunneling techniques.
>
> 5. Routing Protocol Selection and Design
> ***************
> *** 700,706 ****
> design.
>
> Although EBGP is the protocol used for almost all inter-domain
> ! routing on the Internet and has wide support from both vendor and
> service provider communities, it is not generally deployed as the
> primary routing protocol within the data center for a number of
> reasons (some of which are interrelated):
> --- 700,706 ----
> design.
>
> Although EBGP is the protocol used for almost all inter-domain
> ! routing in the Internet and has wide support from both vendor and
> service provider communities, it is not generally deployed as the
> primary routing protocol within the data center for a number of
> reasons (some of which are interrelated):
> ***************
> *** 741,754 ****
> state IGPs. Since every BGP router calculates and propagates only
> the best-path selected, a network failure is masked as soon as the
> BGP speaker finds an alternate path, which exists when highly
> ! symmetric topologies, such as Clos, are coupled with EBGP only
> design. In contrast, the event propagation scope of a link-state
> IGP is an entire area, regardless of the failure type. In this
> way, BGP better meets REQ3 and REQ4. It is also worth mentioning
> that all widely deployed link-state IGPs feature periodic
> ! refreshes of routing information, even if this rarely causes
> ! impact to modern router control planes, while BGP does not expire
> ! routing state.
>
> o BGP supports third-party (recursively resolved) next-hops. This
> allows for manipulating multipath to be non-ECMP based or
> --- 741,754 ----
> state IGPs. Since every BGP router calculates and propagates only
> the best-path selected, a network failure is masked as soon as the
> BGP speaker finds an alternate path, which exists when highly
> ! symmetric topologies, such as Clos, are coupled with an EBGP only
> design. In contrast, the event propagation scope of a link-state
> IGP is an entire area, regardless of the failure type. In this
> way, BGP better meets REQ3 and REQ4. It is also worth mentioning
> that all widely deployed link-state IGPs feature periodic
> ! refreshes of routing information while BGP does not expire
> ! routing state, although this rarely impacts modern router control
> ! planes.
>
> o BGP supports third-party (recursively resolved) next-hops. This
> allows for manipulating multipath to be non-ECMP based or
> ***************
> *** 765,775 ****
> controlled and complex unwanted paths will be ignored. See
> Section 5.2 for an example of a working ASN allocation scheme. In
> a link-state IGP accomplishing the same goal would require multi-
> ! (instance/topology/processes) support, typically not available in
> all DC devices and quite complex to configure and troubleshoot.
> Using a traditional single flooding domain, which most DC designs
> utilize, under certain failure conditions may pick up unwanted
> ! lengthy paths, e.g. traversing multiple Tier-2 devices.
>
> o EBGP configuration that is implemented with minimal routing policy
> is easier to troubleshoot for network reachability issues. In
> --- 765,775 ----
> controlled and complex unwanted paths will be ignored. See
> Section 5.2 for an example of a working ASN allocation scheme. In
> a link-state IGP accomplishing the same goal would require multi-
> ! (instance/topology/process) support, typically not available in
> all DC devices and quite complex to configure and troubleshoot.
> Using a traditional single flooding domain, which most DC designs
> utilize, under certain failure conditions may pick up unwanted
> ! lengthy paths, e.g., traversing multiple Tier-2 devices.
>
> o EBGP configuration that is implemented with minimal routing policy
> is easier to troubleshoot for network reachability issues. In
> ***************
> *** 806,812 ****
> loopback sessions are used even in the case of multiple links
> between the same pair of nodes.
>
> ! o Private Use ASNs from the range 64512-65534 are used so as to
> avoid ASN conflicts.
>
> o A single ASN is allocated to all of the Clos topology's Tier-1
> --- 806,812 ----
> loopback sessions are used even in the case of multiple links
> between the same pair of nodes.
>
> ! o Private Use ASNs from the range 64512-65534 are used to
> avoid ASN conflicts.
>
> o A single ASN is allocated to all of the Clos topology's Tier-1
> ***************
> *** 815,821 ****
> o A unique ASN is allocated to each set of Tier-2 devices in the
> same cluster.
>
> ! o A unique ASN is allocated to every Tier-3 device (e.g. ToR) in
> this topology.
>
>
> --- 815,821 ----
> o A unique ASN is allocated to each set of Tier-2 devices in the
> same cluster.
>
> ! o A unique ASN is allocated to every Tier-3 device (e.g., ToR) in
> this topology.
>
>
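The ASN scheme in the 806-821 hunks above is easy to mechanize. The toy
allocator below is purely illustrative (the names and sizes are mine, not
the draft's); it draws from the 16-bit Private Use range 64512-65534
[RFC6996] and shows how quickly that pool is consumed, which is what
motivates the Four-Octet ASN discussion in the next hunk:

    # One shared ASN for the whole Tier-1 layer, one ASN per Tier-2
    # cluster, one ASN per Tier-3 (ToR) device.
    PRIVATE_USE_16BIT = range(64512, 65535)     # 1023 usable ASNs

    def allocate_asns(n_clusters: int, tors_per_cluster: int) -> dict:
        pool = iter(PRIVATE_USE_16BIT)          # raises StopIteration
        plan = {"tier1": next(pool)}            # when the pool runs out
        plan["tier2"] = {c: next(pool) for c in range(n_clusters)}
        plan["tier3"] = {(c, t): next(pool)
                         for c in range(n_clusters)
                         for t in range(tors_per_cluster)}
        return plan

    plan = allocate_asns(n_clusters=8, tors_per_cluster=32)
    print(plan["tier1"], plan["tier2"][0], plan["tier3"][(0, 0)])
    # 8 clusters x 32 ToRs already consumes 265 of the 1023 ASNs.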
> ***************
> *** 903,922 ****
>
> Another solution to this problem would be using Four-Octet ASNs
> ([RFC6793]), where there are additional Private Use ASNs available,
> ! see [IANA.AS]. Use of Four-Octet ASNs put additional protocol
> ! complexity in the BGP implementation so should be considered against
> the complexity of re-use when considering REQ3 and REQ4. Perhaps
> more importantly, they are not yet supported by all BGP
> implementations, which may limit vendor selection of DC equipment.
> ! When supported, ensure that implementations in use are able to remove
> ! the Private Use ASNs if required for external connectivity
> ! (Section 5.2.4).
>
> 5.2.3. Prefix Advertisement
>
> A Clos topology features a large number of point-to-point links and
> associated prefixes. Advertising all of these routes into BGP may
> ! create FIB overload conditions in the network devices. Advertising
> these links also puts additional path computation stress on the BGP
> control plane for little benefit. There are two possible solutions:
>
> --- 903,922 ----
>
> Another solution to this problem would be using Four-Octet ASNs
> ([RFC6793]), where there are additional Private Use ASNs available,
> ! see [IANA.AS]. Use of Four-Octet ASNs puts additional protocol
> ! complexity in the BGP implementation and should be balanced against
> the complexity of re-use when considering REQ3 and REQ4. Perhaps
> more importantly, they are not yet supported by all BGP
> implementations, which may limit vendor selection of DC equipment.
> ! When supported, ensure that deployed implementations are able to
> remove
> ! the Private Use ASNs when external connectivity to these ASes is
> ! required (Section 5.2.4).
>
> 5.2.3. Prefix Advertisement
>
> A Clos topology features a large number of point-to-point links and
> associated prefixes. Advertising all of these routes into BGP may
> ! create FIB overload in the network devices. Advertising
> these links also puts additional path computation stress on the BGP
> control plane for little benefit. There are two possible solutions:
>
> ***************
> *** 925,951 ****
> device, distant networks will automatically be reachable via the
> advertising EBGP peer and do not require reachability to these
> prefixes. However, this may complicate operations or monitoring:
> ! e.g. using the popular "traceroute" tool will display IP addresses
> that are not reachable.
>
> o Advertise point-to-point links, but summarize them on every
> device. This requires an address allocation scheme such as
> allocating a consecutive block of IP addresses per Tier-1 and
> Tier-2 device to be used for point-to-point interface addressing
> ! to the lower layers (Tier-2 uplinks will be numbered out of Tier-1
> ! addressing and so forth).
>
> Server subnets on Tier-3 devices must be announced into BGP without
> using route summarization on Tier-2 and Tier-1 devices. Summarizing
> subnets in a Clos topology results in route black-holing under a
> ! single link failure (e.g. between Tier-2 and Tier-3 devices) and
> hence must be avoided. The use of peer links within the same tier to
> resolve the black-holing problem by providing "bypass paths" is
> undesirable due to O(N^2) complexity of the peering mesh and waste of
> ports on the devices. An alternative to the full-mesh of peer-links
> ! would be using a simpler bypass topology, e.g. a "ring" as described
> in [FB4POST], but such a topology adds extra hops and has very
> ! limited bisection bandwidth, in addition requiring special tweaks to
>
>
>
> --- 925,951 ----
> device, distant networks will automatically be reachable via the
> advertising EBGP peer and do not require reachability to these
> prefixes. However, this may complicate operations or monitoring:
> ! e.g., using the popular "traceroute" tool will display IP addresses
> that are not reachable.
>
> o Advertise point-to-point links, but summarize them on every
> device. This requires an address allocation scheme such as
> allocating a consecutive block of IP addresses per Tier-1 and
> Tier-2 device to be used for point-to-point interface addressing
> ! to the lower layers (Tier-2 uplink addresses will be allocated
> ! from Tier-1 address blocks and so forth).
>
> Server subnets on Tier-3 devices must be announced into BGP without
> using route summarization on Tier-2 and Tier-1 devices. Summarizing
> subnets in a Clos topology results in route black-holing under a
> ! single link failure (e.g., between Tier-2 and Tier-3 devices) and
> hence must be avoided. The use of peer links within the same tier to
> resolve the black-holing problem by providing "bypass paths" is
> undesirable due to O(N^2) complexity of the peering mesh and waste of
> ports on the devices. An alternative to the full-mesh of peer-links
> ! would be using a simpler bypass topology, e.g., a "ring" as described
> in [FB4POST], but such a topology adds extra hops and has very
> ! limited bisectional bandwidth. Additionally requiring special tweaks
> to
>
>
>
> ***************
> *** 956,963 ****
>
> make BGP routing work - such as possibly splitting every device into
> an ASN on its own. Later in this document, Section 8.2 introduces a
> ! less intrusive method for performing a limited form route
> ! summarization in Clos networks and discusses it's associated trade-
> offs.
>
> 5.2.4. External Connectivity
> --- 956,963 ----
>
> make BGP routing work - such as possibly splitting every device into
> an ASN on its own. Later in this document, Section 8.2 introduces a
> ! less intrusive method for performing a limited form of route
> ! summarization in Clos networks and discusses its associated trade-
> offs.
>
> 5.2.4. External Connectivity
> ***************
> *** 972,985 ****
> document. These devices have to perform a few special functions:
>
> o Hide network topology information when advertising paths to WAN
> ! routers, i.e. remove Private Use ASNs [RFC6996] from the AS_PATH
> attribute. This is typically done to avoid ASN number collisions
> between different data centers and also to provide a uniform
> AS_PATH length to the WAN for purposes of WAN ECMP to Anycast
> prefixes originated in the topology. An implementation specific
> BGP feature typically called "Remove Private AS" is commonly used
> to accomplish this. Depending on implementation, the feature
> ! should strip a contiguous sequence of Private Use ASNs found in
> AS_PATH attribute prior to advertising the path to a neighbor.
> This assumes that all ASNs used for intra data center numbering
> are from the Private Use ranges. The process for stripping the
> --- 972,985 ----
> document. These devices have to perform a few special functions:
>
> o Hide network topology information when advertising paths to WAN
> ! routers, i.e., remove Private Use ASNs [RFC6996] from the AS_PATH
> attribute. This is typically done to avoid ASN number collisions
> between different data centers and also to provide a uniform
> AS_PATH length to the WAN for purposes of WAN ECMP to Anycast
> prefixes originated in the topology. An implementation specific
> BGP feature typically called "Remove Private AS" is commonly used
> to accomplish this. Depending on implementation, the feature
> ! should strip a contiguous sequence of Private Use ASNs found in an
> AS_PATH attribute prior to advertising the path to a neighbor.
> This assumes that all ASNs used for intra data center numbering
> are from the Private Use ranges. The process for stripping the
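The "Remove Private AS" behavior described in the 972,985 hunk can be
sketched in a few lines. Implementations differ in the details; this
version, as the text suggests, strips only the contiguous run of Private
Use ASNs ([RFC6996] ranges) from the front of the AS_PATH:

    PRIVATE_16 = range(64512, 65535)              # 64512-65534
    PRIVATE_32 = range(4200000000, 4294967295)    # 4200000000-4294967294

    def is_private(asn: int) -> bool:
        return asn in PRIVATE_16 or asn in PRIVATE_32

    def remove_private_as(as_path: list) -> list:
        """Drop the leading run of Private Use ASNs (the intra-DC hops)."""
        i = 0
        while i < len(as_path) and is_private(as_path[i]):
            i += 1
        return as_path[i:]

    # ToR 65001 -> cluster 64901 -> Tier-1 64512: nothing is left, so
    # every DC-originated path presents a uniform AS_PATH length to the
    # WAN, which is what makes WAN ECMP to Anycast prefixes work.
    print(remove_private_as([65001, 64901, 64512]))        # []
    print(remove_private_as([65001, 64901, 64512, 3356]))  # [3356]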
> ***************
> *** 998,1005 ****
> to the WAN Routers upstream, to provide resistance to a single-
> link failure causing the black-holing of traffic. To prevent
> black-holing in the situation when all of the EBGP sessions to the
> ! WAN routers fail simultaneously on a given device it is more
> ! desirable to take the "relaying" approach rather than introducing
> the default route via complicated conditional route origination
> schemes provided by some implementations [CONDITIONALROUTE].
>
> --- 998,1005 ----
> to the WAN Routers upstream, to provide resistance to a single-
> link failure causing the black-holing of traffic. To prevent
> black-holing in the situation when all of the EBGP sessions to the
> ! WAN routers fail simultaneously on a given device, it is more
> ! desirable to readvertise the default route rather than originating
> the default route via complicated conditional route origination
> schemes provided by some implementations [CONDITIONALROUTE].
>
> ***************
> *** 1017,1023 ****
> prefixes originated from within the data center in a fully routed
> network design. For example, a network with 2000 Tier-3 devices will
> have at least 2000 servers subnets advertised into BGP, along with
> ! the infrastructure or other prefixes. However, as discussed before,
> the proposed network design does not allow for route summarization
> due to the lack of peer links inside every tier.
>
> --- 1017,1023 ----
> prefixes originated from within the data center in a fully routed
> network design. For example, a network with 2000 Tier-3 devices will
> have at least 2000 servers subnets advertised into BGP, along with
> ! the infrastructure and link prefixes. However, as discussed before,
> the proposed network design does not allow for route summarization
> due to the lack of peer links inside every tier.
>
> ***************
> *** 1028,1037 ****
> o Interconnect the Border Routers using a full-mesh of physical
> links or using any other "peer-mesh" topology, such as ring or
> hub-and-spoke. Configure BGP accordingly on all Border Leafs to
> ! exchange network reachability information - e.g. by adding a mesh
> of IBGP sessions. The interconnecting peer links need to be
> appropriately sized for traffic that will be present in the case
> ! of a device or link failure underneath the Border Routers.
>
> o Tier-1 devices may have additional physical links provisioned
> toward the Border Routers (which are Tier-2 devices from the
> --- 1028,1037 ----
> o Interconnect the Border Routers using a full-mesh of physical
> links or using any other "peer-mesh" topology, such as ring or
> hub-and-spoke. Configure BGP accordingly on all Border Leafs to
> ! exchange network reachability information, e.g., by adding a mesh
> of IBGP sessions. The interconnecting peer links need to be
> appropriately sized for traffic that will be present in the case
> ! of a device or link failure in the mesh connecting the Border
> Routers.
>
> o Tier-1 devices may have additional physical links provisioned
> toward the Border Routers (which are Tier-2 devices from the
> ***************
> *** 1043,1049 ****
> device compared with the other devices in the Clos. This also
> reduces the number of ports available to "regular" Tier-2 switches
> and hence the number of clusters that could be interconnected via
> ! Tier-1 layer.
>
> If any of the above options are implemented, it is possible to
> perform route summarization at the Border Routers toward the WAN
> --- 1043,1049 ----
> device compared with the other devices in the Clos. This also
> reduces the number of ports available to "regular" Tier-2 switches
> and hence the number of clusters that could be interconnected via
> ! the Tier-1 layer.
>
> If any of the above options are implemented, it is possible to
> perform route summarization at the Border Routers toward the WAN
> ***************
> *** 1071,1079 ****
> ECMP is the fundamental load sharing mechanism used by a Clos
> topology. Effectively, every lower-tier device will use all of its
> directly attached upper-tier devices to load share traffic destined
> ! to the same IP prefix. Number of ECMP paths between any two Tier-3
> devices in Clos topology equals to the number of the devices in the
> ! middle stage (Tier-1). For example, Figure 5 illustrates the
> topology where Tier-3 device A has four paths to reach servers X and
> Y, via Tier-2 devices B and C and then Tier-1 devices 1, 2, 3, and 4
> respectively.
> --- 1071,1079 ----
> ECMP is the fundamental load sharing mechanism used by a Clos
> topology. Effectively, every lower-tier device will use all of its
> directly attached upper-tier devices to load share traffic destined
> ! to the same IP prefix. The number of ECMP paths between any two
> Tier-3
> devices in Clos topology equals to the number of the devices in the
> ! middle stage (Tier-1). For example, Figure 5 illustrates a
> topology where Tier-3 device A has four paths to reach servers X and
> Y, via Tier-2 devices B and C and then Tier-1 devices 1, 2, 3, and 4
> respectively.
> ***************
> *** 1105,1116 ****
>
> The ECMP requirement implies that the BGP implementation must support
> multipath fan-out for up to the maximum number of devices directly
> ! attached at any point in the topology in upstream or downstream
> direction. Normally, this number does not exceed half of the ports
> found on a device in the topology. For example, an ECMP fan-out of
> 32 would be required when building a Clos network using 64-port
> devices. The Border Routers may need to have wider fan-out to be
> ! able to connect to multitude of Tier-1 devices if route summarization
> at Border Router level is implemented as described in Section 5.2.5.
> If a device's hardware does not support wider ECMP, logical link-
> grouping (link-aggregation at layer 2) could be used to provide
> --- 1105,1116 ----
>
> The ECMP requirement implies that the BGP implementation must support
> multipath fan-out for up to the maximum number of devices directly
> ! attached at any point in the topology in the upstream or downstream
> direction. Normally, this number does not exceed half of the ports
> found on a device in the topology. For example, an ECMP fan-out of
> 32 would be required when building a Clos network using 64-port
> devices. The Border Routers may need to have wider fan-out to be
> ! able to connect to a multitude of Tier-1 devices if route
> summarization
> at Border Router level is implemented as described in Section 5.2.5.
> If a device's hardware does not support wider ECMP, logical link-
> grouping (link-aggregation at layer 2) could be used to provide
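A quick sanity check of the fan-out arithmetic in the two hunks above,
using nothing beyond what the text itself states:

    # Fan-out never exceeds half the ports on a uniform device, and the
    # ECMP path count between two Tier-3 devices equals the number of
    # middle-stage (Tier-1) devices.

    def required_fanout(ports: int) -> int:
        return ports // 2

    def tier3_to_tier3_paths(num_tier1: int) -> int:
        return num_tier1

    assert required_fanout(64) == 32      # the 64-port example above
    print(tier3_to_tier3_paths(4))        # Figure 5's example: 4 paths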
> ***************
> *** 1122,1131 ****
> Internet-Draft draft-ietf-rtgwg-bgp-routing-large-dc March 2016
>
>
> ! "hierarchical" ECMP (Layer 3 ECMP followed by Layer 2 ECMP) to
> compensate for fan-out limitations. Such approach, however,
> increases the risk of flow polarization, as less entropy will be
> ! available to the second stage of ECMP.
>
> Most BGP implementations declare paths to be equal from an ECMP
> perspective if they match up to and including step (e) in
> --- 1122,1131 ----
> Internet-Draft draft-ietf-rtgwg-bgp-routing-large-dc March 2016
>
>
> ! "hierarchical" ECMP (Layer 3 ECMP coupled with Layer 2 ECMP) to
> compensate for fan-out limitations. Such approach, however,
> increases the risk of flow polarization, as less entropy will be
> ! available at the second stage of ECMP.
>
> Most BGP implementations declare paths to be equal from an ECMP
> perspective if they match up to and including step (e) in
> ***************
> *** 1148,1154 ****
> perspective of other devices, such a prefix would have BGP paths with
> different AS_PATH attribute values, while having the same AS_PATH
> attribute lengths. Therefore, BGP implementations must support load
> ! sharing over above-mentioned paths. This feature is sometimes known
> as "multipath relax" or "multipath multiple-as" and effectively
> allows for ECMP to be done across different neighboring ASNs if all
> other attributes are equal as already described in the previous
> --- 1148,1154 ----
> perspective of other devices, such a prefix would have BGP paths with
> different AS_PATH attribute values, while having the same AS_PATH
> attribute lengths. Therefore, BGP implementations must support load
> ! sharing over the above-mentioned paths. This feature is sometimes
> known
> as "multipath relax" or "multipath multiple-as" and effectively
> allows for ECMP to be done across different neighboring ASNs if all
> other attributes are equal as already described in the previous
> ***************
> *** 1182,1199 ****
>
> It is often desirable to have the hashing function used for ECMP to
> be consistent (see [CONS-HASH]), to minimize the impact on flow to
> ! next-hop affinity changes when a next-hop is added or removed to ECMP
> group. This could be used if the network device is used as a load
> balancer, mapping flows toward multiple destinations - in this case,
> ! losing or adding a destination will not have detrimental effect of
> currently established flows. One particular recommendation on
> implementing consistent hashing is provided in [RFC2992], though
> other implementations are possible. This functionality could be
> naturally combined with weighted ECMP, with the impact of the next-
> hop changes being proportional to the weight of the given next-hop.
> The downside of consistent hashing is increased load on hardware
> ! resource utilization, as typically more space is required to
> ! implement a consistent-hashing region.
>
> 7. Routing Convergence Properties
>
> --- 1182,1199 ----
>
> It is often desirable to have the hashing function used for ECMP to
> be consistent (see [CONS-HASH]), to minimize the impact on flow to
> ! next-hop affinity changes when a next-hop is added or removed to an
> ECMP
> group. This could be used if the network device is used as a load
> balancer, mapping flows toward multiple destinations - in this case,
> ! losing or adding a destination will not have a detrimental effect on
> currently established flows. One particular recommendation on
> implementing consistent hashing is provided in [RFC2992], though
> other implementations are possible. This functionality could be
> naturally combined with weighted ECMP, with the impact of the next-
> hop changes being proportional to the weight of the given next-hop.
> The downside of consistent hashing is increased load on hardware
> ! resource utilization, as typically more resources (e.g., TCAM space)
> ! are required to implement a consistent-hashing function.
>
> 7. Routing Convergence Properties
>
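For readers who want to see the "consistent" property in action, here is
one well-known construction: highest-random-weight (rendezvous) hashing.
To be clear, this is an illustration of the property only, not the
hash-threshold method of [RFC2992] nor any particular vendor's
implementation:

    import hashlib

    def pick_next_hop(flow: str, next_hops: list) -> str:
        # Each (flow, next-hop) pair gets a pseudo-random weight; the
        # flow uses whichever next-hop scores highest.
        def weight(nh: str) -> int:
            digest = hashlib.sha256(f"{flow}|{nh}".encode()).digest()
            return int.from_bytes(digest[:8], "big")
        return max(next_hops, key=weight)

    hops = ["tier1-1", "tier1-2", "tier1-3", "tier1-4"]
    flows = [f"10.0.0.{i}:80->198.51.100.1:443" for i in range(256)]
    before = {f: pick_next_hop(f, hops) for f in flows}
    after = {f: pick_next_hop(f, hops[:-1]) for f in flows}  # one fails
    moved = sum(before[f] != after[f] for f in flows)
    print(f"{moved}/256 flows moved")   # only tier1-4's flows remap

Losing a next-hop only remaps the flows that were pinned to it; every
other flow keeps its affinity, which is the behavior the text is after.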
> ***************
> *** 1209,1224 ****
> driven mechanism to obtain updates on IGP state changes. The
> proposed routing design does not use an IGP, so the remaining
> mechanisms that could be used for fault detection are BGP keep-alive
> ! process (or any other type of keep-alive mechanism) and link-failure
> triggers.
>
> Relying solely on BGP keep-alive packets may result in high
> ! convergence delays, in the order of multiple seconds (on many BGP
> implementations the minimum configurable BGP hold timer value is
> three seconds). However, many BGP implementations can shut down
> local EBGP peering sessions in response to the "link down" event for
> the outgoing interface used for BGP peering. This feature is
> ! sometimes called as "fast fallover". Since links in modern data
> centers are predominantly point-to-point fiber connections, a
> physical interface failure is often detected in milliseconds and
> subsequently triggers a BGP re-convergence.
> --- 1209,1224 ----
> driven mechanism to obtain updates on IGP state changes. The
> proposed routing design does not use an IGP, so the remaining
> mechanisms that could be used for fault detection are BGP keep-alive
> ! time-out (or any other type of keep-alive mechanism) and link-failure
> triggers.
>
> Relying solely on BGP keep-alive packets may result in high
> ! convergence delays, on the order of multiple seconds (on many BGP
> implementations the minimum configurable BGP hold timer value is
> three seconds). However, many BGP implementations can shut down
> local EBGP peering sessions in response to the "link down" event for
> the outgoing interface used for BGP peering. This feature is
> ! sometimes called "fast fallover". Since links in modern data
> centers are predominantly point-to-point fiber connections, a
> physical interface failure is often detected in milliseconds and
> subsequently triggers a BGP re-convergence.
> ***************
> *** 1236,1242 ****
>
> Alternatively, some platforms may support Bidirectional Forwarding
> Detection (BFD) [RFC5880] to allow for sub-second failure detection
> ! and fault signaling to the BGP process. However, use of either of
> these presents additional requirements to vendor software and
> possibly hardware, and may contradict REQ1. Until recently with
> [RFC7130], BFD also did not allow detection of a single member link
> --- 1236,1242 ----
>
> Alternatively, some platforms may support Bidirectional Forwarding
> Detection (BFD) [RFC5880] to allow for sub-second failure detection
> ! and fault signaling to the BGP process. However, the use of either of
> these presents additional requirements to vendor software and
> possibly hardware, and may contradict REQ1. Until recently with
> [RFC7130], BFD also did not allow detection of a single member link
> ***************
> *** 1245,1251 ****
>
> 7.2. Event Propagation Timing
>
> ! In the proposed design the impact of BGP Minimum Route Advertisement
> Interval (MRAI) timer (See section 9.2.1.1 of [RFC4271]) should be
> considered. Per the standard it is required for BGP implementations
> to space out consecutive BGP UPDATE messages by at least MRAI
> --- 1245,1251 ----
>
> 7.2. Event Propagation Timing
>
> ! In the proposed design the impact of the BGP Minimum Route
> Advertisement
> Interval (MRAI) timer (See section 9.2.1.1 of [RFC4271]) should be
> considered. Per the standard it is required for BGP implementations
> to space out consecutive BGP UPDATE messages by at least MRAI
> ***************
> *** 1258,1270 ****
> In a Clos topology each EBGP speaker typically has either one path
> (Tier-2 devices don't accept paths from other Tier-2 in the same
> cluster due to same ASN) or N paths for the same prefix, where N is a
> ! significantly large number, e.g. N=32 (the ECMP fan-out to the next
> Tier). Therefore, if a link fails to another device from which a
> ! path is received there is either no backup path at all (e.g. from
> perspective of a Tier-2 switch losing link to a Tier-3 device), or
> ! the backup is readily available in BGP Loc-RIB (e.g. from perspective
> of a Tier-2 device losing link to a Tier-1 switch). In the former
> ! case, the BGP withdrawal announcement will propagate un-delayed and
> trigger re-convergence on affected devices. In the latter case, the
> best-path will be re-evaluated and the local ECMP group corresponding
> to the new next-hop set changed. If the BGP path was the best-path
> --- 1258,1270 ----
> In a Clos topology each EBGP speaker typically has either one path
> (Tier-2 devices don't accept paths from other Tier-2 in the same
> cluster due to same ASN) or N paths for the same prefix, where N is a
> ! significantly large number, e.g., N=32 (the ECMP fan-out to the next
> Tier). Therefore, if a link fails to another device from which a
> ! path is received there is either no backup path at all (e.g., from the
> perspective of a Tier-2 switch losing link to a Tier-3 device), or
> ! the backup is readily available in BGP Loc-RIB (e.g., from perspective
> of a Tier-2 device losing link to a Tier-1 switch). In the former
> ! case, the BGP withdrawal announcement will propagate without delay and
> trigger re-convergence on affected devices. In the latter case, the
> best-path will be re-evaluated and the local ECMP group corresponding
> to the new next-hop set changed. If the BGP path was the best-path
> ***************
> *** 1279,1285 ****
> situation when a link between Tier-3 and Tier-2 device fails, the
> Tier-2 device will send BGP UPDATE messages to all upstream Tier-1
> devices, withdrawing the affected prefixes. The Tier-1 devices, in
> ! turn, will relay those messages to all downstream Tier-2 devices
> (except for the originator). Tier-2 devices other than the one
> originating the UPDATE should then wait for ALL upstream Tier-1
>
> --- 1279,1285 ----
> situation when a link between Tier-3 and Tier-2 device fails, the
> Tier-2 device will send BGP UPDATE messages to all upstream Tier-1
> devices, withdrawing the affected prefixes. The Tier-1 devices, in
> ! turn, will relay these messages to all downstream Tier-2 devices
> (except for the originator). Tier-2 devices other than the one
> originating the UPDATE should then wait for ALL upstream Tier-1
>
> ***************
> *** 1307,1313 ****
> features that vendors include to reduce the control plane impact of
> rapidly flapping prefixes. However, due to issues described with
> false positives in these implementations especially under such
> ! "dispersion" events, it is not recommended to turn this feature on in
> this design. More background and issues with "route flap dampening"
> and possible implementation changes that could affect this are well
> described in [RFC7196].
> --- 1307,1313 ----
> features that vendors include to reduce the control plane impact of
> rapidly flapping prefixes. However, due to issues described with
> false positives in these implementations especially under such
> ! "dispersion" events, it is not recommended to enable this feature in
> this design. More background and issues with "route flap dampening"
> and possible implementation changes that could affect this are well
> described in [RFC7196].
> ***************
> *** 1316,1324 ****
>
> A network is declared to converge in response to a failure once all
> devices within the failure impact scope are notified of the event and
> ! have re-calculated their RIB's and consequently updated their FIB's.
> Larger failure impact scope typically means slower convergence since
> ! more devices have to be notified, and additionally results in a less
> stable network. In this section we describe BGP's advantages over
> link-state routing protocols in reducing failure impact scope for a
> Clos topology.
> --- 1316,1324 ----
>
> A network is declared to converge in response to a failure once all
> devices within the failure impact scope are notified of the event and
> ! have re-calculated their RIBs and consequently updated their FIBs.
> Larger failure impact scope typically means slower convergence since
> ! more devices have to be notified, and results in a less
> stable network. In this section we describe BGP's advantages over
> link-state routing protocols in reducing failure impact scope for a
> Clos topology.
> ***************
> *** 1327,1335 ****
> the best path from the point of view of the local router is sent to
> neighbors. As such, some failures are masked if the local node can
> immediately find a backup path and does not have to send any updates
> ! further. Notice that in the worst case ALL devices in a data center
> topology have to either withdraw a prefix completely or update the
> ! ECMP groups in the FIB. However, many failures will not result in
> such a wide impact. There are two main failure types where impact
> scope is reduced:
>
> --- 1327,1335 ----
> the best path from the point of view of the local router is sent to
> neighbors. As such, some failures are masked if the local node can
> immediately find a backup path and does not have to send any updates
> ! further. Notice that in the worst case, all devices in a data center
> topology have to either withdraw a prefix completely or update the
> ! ECMP groups in their FIBs. However, many failures will not result in
> such a wide impact. There are two main failure types where impact
> scope is reduced:
>
> ***************
> *** 1357,1367 ****
>
> o Failure of a Tier-1 device: In this case, all Tier-2 devices
> directly attached to the failed node will have to update their
> ! ECMP groups for all IP prefixes from non-local cluster. The
> Tier-3 devices are once again not involved in the re-convergence
> process, but may receive "implicit withdraws" as described above.
>
> ! Even though in case of such failures multiple IP prefixes will have
> to be reprogrammed in the FIB, it is worth noting that ALL of these
> prefixes share a single ECMP group on Tier-2 device. Therefore, in
> the case of implementations with a hierarchical FIB, only a single
> --- 1357,1367 ----
>
> o Failure of a Tier-1 device: In this case, all Tier-2 devices
> directly attached to the failed node will have to update their
> ! ECMP groups for all IP prefixes from a non-local cluster. The
> Tier-3 devices are once again not involved in the re-convergence
> process, but may receive "implicit withdraws" as described above.
>
> ! Even though in the case of such failures multiple IP prefixes will have
> to be reprogrammed in the FIB, it is worth noting that ALL of these
> prefixes share a single ECMP group on Tier-2 device. Therefore, in
> the case of implementations with a hierarchical FIB, only a single
> ***************
> *** 1375,1381 ****
> possible with the proposed design, since using this technique may
> create routing black-holes as mentioned previously. Therefore, the
> worst control plane failure impact scope is the network as a whole,
> ! for instance in a case of a link failure between Tier-2 and Tier-3
> devices. The amount of impacted prefixes in this case would be much
> less than in the case of a failure in the upper layers of a Clos
> network topology. The property of having such large failure scope is
> --- 1375,1381 ----
> possible with the proposed design, since using this technique may
> create routing black-holes as mentioned previously. Therefore, the
> worst control plane failure impact scope is the network as a whole,
> ! for instance in the case of a link failure between Tier-2 and Tier-3
> devices. The amount of impacted prefixes in this case would be much
> less than in the case of a failure in the upper layers of a Clos
> network topology. The property of having such large failure scope is
> ***************
> *** 1384,1397 ****
>
> 7.5. Routing Micro-Loops
>
> ! When a downstream device, e.g. Tier-2 device, loses all paths for a
> prefix, it normally has the default route pointing toward the
> upstream device, in this case the Tier-1 device. As a result, it is
> ! possible to get in the situation when Tier-2 switch loses a prefix,
> ! but Tier-1 switch still has the path pointing to the Tier-2 device,
> ! which results in transient micro-loop, since Tier-1 switch will keep
> passing packets to the affected prefix back to Tier-2 device, and
> ! Tier-2 will bounce it back again using the default route. This
> micro-loop will last for the duration of time it takes the upstream
> device to fully update its forwarding tables.
>
> --- 1384,1397 ----
>
> 7.5. Routing Micro-Loops
>
> ! When a downstream device, e.g., Tier-2 device, loses all paths for a
> prefix, it normally has the default route pointing toward the
> upstream device, in this case the Tier-1 device. As a result, it is
> ! possible to get in the situation where a Tier-2 switch loses a prefix,
> ! but a Tier-1 switch still has the path pointing to the Tier-2 device,
> ! which results in transient micro-loop, since the Tier-1 switch will
> keep
> passing packets to the affected prefix back to Tier-2 device, and
> ! the Tier-2 will bounce it back again using the default route. This
> micro-loop will last for the duration of time it takes the upstream
> device to fully update its forwarding tables.
>
> ***************
> *** 1402,1408 ****
> Internet-Draft draft-ietf-rtgwg-bgp-routing-large-dc March 2016
>
>
> ! To minimize impact of the micro-loops, Tier-2 and Tier-1 switches can
> be configured with static "discard" or "null" routes that will be
> more specific than the default route for prefixes missing during
> network convergence. For Tier-2 switches, the discard route should
> --- 1402,1408 ----
> Internet-Draft draft-ietf-rtgwg-bgp-routing-large-dc March 2016
>
>
> ! To minimize the impact of such micro-loops, Tier-2 and Tier-1
> switches can
> be configured with static "discard" or "null" routes that will be
> more specific than the default route for prefixes missing during
> network convergence. For Tier-2 switches, the discard route should
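The discard-route fix in the 1402,1408 hunk is just longest-prefix-match
at work. A minimal illustration (the addresses and FIB layout are made
up for the example):

    import ipaddress

    def lookup(fib: dict, destination: str):
        dst = ipaddress.ip_address(destination)
        matches = [p for p in fib if dst in p]
        if not matches:
            return None
        return fib[max(matches, key=lambda p: p.prefixlen)]

    net = ipaddress.ip_network
    tier2_fib = {
        net("0.0.0.0/0"): "tier1-uplinks",   # default toward Tier-1
        net("10.1.0.0/16"): "discard",       # static cover for the
                                             # local cluster aggregate
        # 10.1.3.0/24 (a Tier-3 subnet) was just withdrawn, so there is
        # no more specific entry for it during convergence.
    }
    print(lookup(tier2_fib, "10.1.3.5"))     # "discard"

Without the /16 discard entry the lookup would fall through to the
default route and bounce between Tier-2 and Tier-1 until the upstream
device finishes updating its forwarding tables.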
> *** 1417,1423 ****
> 
>   8.1.  Third-party Route Injection
> 
> ! BGP allows for a "third-party", i.e. directly attached, BGP speaker
>   to inject routes anywhere in the network topology, meeting REQ5.
>   This can be achieved by peering via a multihop BGP session with some
>   or even all devices in the topology.  Furthermore, BGP diverse path
> --- 1417,1423 ----
> 
>   8.1.  Third-party Route Injection
> 
> ! BGP allows for a "third-party", i.e., directly attached, BGP speaker
>   to inject routes anywhere in the network topology, meeting REQ5.
>   This can be achieved by peering via a multihop BGP session with some
>   or even all devices in the topology.  Furthermore, BGP diverse path
> ***************
> *** 1427,1433 ****
>   implementation.  Unfortunately, in many implementations ADD-PATH has
>   been found to only support IBGP properly due to the use cases it was
>   originally optimized for, which limits the "third-party" peering to
> ! IBGP only, if the feature is used.
> 
>   To implement route injection in the proposed design, a third-party
>   BGP speaker may peer with Tier-3 and Tier-1 switches, injecting the
> --- 1427,1433 ----
>   implementation.  Unfortunately, in many implementations ADD-PATH has
>   been found to only support IBGP properly due to the use cases it was
>   originally optimized for, which limits the "third-party" peering to
> ! IBGP only.
> 
>   To implement route injection in the proposed design, a third-party
>   BGP speaker may peer with Tier-3 and Tier-1 switches, injecting the
> ***************
> *** 1442,1453 ****
>   As mentioned previously, route summarization is not possible within
>   the proposed Clos topology since it makes the network susceptible to
>   route black-holing under single link failures.  The main problem is
> ! the limited number of redundant paths between network elements, e.g.
>   there is only a single path between any pair of Tier-1 and Tier-3
>   devices.  However, some operators may find route aggregation
>   desirable to improve control plane stability.
> 
> ! If planning on using any technique to summarize within the topology
>   modeling of the routing behavior and potential for black-holing
>   should be done not only for single or multiple link failures, but
> 
> --- 1442,1453 ----
>   As mentioned previously, route summarization is not possible within
>   the proposed Clos topology since it makes the network susceptible to
>   route black-holing under single link failures.  The main problem is
> ! the limited number of redundant paths between network elements, e.g.,
>   there is only a single path between any pair of Tier-1 and Tier-3
>   devices.  However, some operators may find route aggregation
>   desirable to improve control plane stability.
> 
> ! If any technique to summarize within the topology is planned,
>   modeling of the routing behavior and potential for black-holing
>   should be done not only for single or multiple link failures, but
> 
> ***************
> *** 1458,1468 ****
>   Internet-Draft    draft-ietf-rtgwg-bgp-routing-large-dc      March 2016
> 
> 
> ! also fiber pathway failures or optical domain failures if the
>   topology extends beyond a physical location.  Simple modeling can be
>   done by checking the reachability on devices doing summarization
>   under the condition of a link or pathway failure between a set of
> ! devices in every tier as well as to the WAN routers if external
>   connectivity is present.
> 
>   Route summarization would be possible with a small modification to
> --- 1458,1468 ----
>   Internet-Draft    draft-ietf-rtgwg-bgp-routing-large-dc      March 2016
> 
> 
> ! also fiber pathway failures or optical domain failures when the
>   topology extends beyond a physical location.  Simple modeling can be
>   done by checking the reachability on devices doing summarization
>   under the condition of a link or pathway failure between a set of
> ! devices in every tier as well as to the WAN routers when external
>   connectivity is present.
> 
>   Route summarization would be possible with a small modification to
> ***************
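The "simple modeling" suggested in this hunk amounts to exhaustively failing candidate links and re-checking reachability from the summarizing device. A rough Python sketch follows; the five-link toy topology and node names are invented, and a real model would enumerate shared-fate groups (fiber pathways, optical domains) rather than raw links:

    # Rough sketch of the failure modeling suggested above: fail every
    # combination of one or two links and re-check that the summarizing
    # device still reaches the WAN router.  The topology is invented.
    from itertools import combinations

    links = {("t3-1", "t2-1"), ("t3-1", "t2-2"),
             ("t2-1", "t1-1"), ("t2-2", "t1-1"), ("t1-1", "wan-1")}

    def reachable(src, dst, up):
        seen, stack = {src}, [src]
        while stack:
            node = stack.pop()
            if node == dst:
                return True
            for a, b in up:
                nxt = b if a == node else a if b == node else None
                if nxt and nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        return False

    for size in (1, 2):
        for failed in combinations(links, size):
            if not reachable("t3-1", "wan-1", links - set(failed)):
                print("potential black-hole if these fail together:", failed)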
> *** 1519,1544 ****
>   cluster from Tier-2 devices since each of them has only a single path
>   down to this prefix.  It would require dual-homed servers to
>   accomplish that.  Also note that this design is only resilient to
> ! single link failure.  It is possible for a double link failure to
>   isolate a Tier-2 device from all paths toward a specific Tier-3
>   device, thus causing a routing black-hole.
> 
> ! A result of the proposed topology modification would be reduction of
>   Tier-1 devices port capacity.  This limits the maximum number of
>   attached Tier-2 devices and therefore will limit the maximum DC
>   network size.  A larger network would require different Tier-1
>   devices that have higher port density to implement this change.
> 
>   Another problem is traffic re-balancing under link failures.  Since
> ! three are two paths from Tier-1 to Tier-3, a failure of the link
>   between Tier-1 and Tier-2 switch would result in all traffic that was
>   taking the failed link to switch to the remaining path.  This will
> ! result in doubling of link utilization on the remaining link.
> 
>   8.2.2.  Simple Virtual Aggregation
> 
>   A completely different approach to route summarization is possible,
> ! provided that the main goal is to reduce the FIB pressure, while
>   allowing the control plane to disseminate full routing information.
>   Firstly, it could be easily noted that in many cases multiple
>   prefixes, some of which are less specific, share the same set of the
> --- 1519,1544 ----
>   cluster from Tier-2 devices since each of them has only a single path
>   down to this prefix.  It would require dual-homed servers to
>   accomplish that.  Also note that this design is only resilient to
> ! single link failures.  It is possible for a double link failure to
>   isolate a Tier-2 device from all paths toward a specific Tier-3
>   device, thus causing a routing black-hole.
> 
> ! A result of the proposed topology modification would be a reduction of
>   Tier-1 devices port capacity.  This limits the maximum number of
>   attached Tier-2 devices and therefore will limit the maximum DC
>   network size.  A larger network would require different Tier-1
>   devices that have higher port density to implement this change.
> 
>   Another problem is traffic re-balancing under link failures.  Since
> ! there are two paths from Tier-1 to Tier-3, a failure of the link
>   between Tier-1 and Tier-2 switch would result in all traffic that was
>   taking the failed link to switch to the remaining path.  This will
> ! result in doubling the utilization of the remaining link.
> 
>   8.2.2.  Simple Virtual Aggregation
> 
>   A completely different approach to route summarization is possible,
> ! provided that the main goal is to reduce the FIB size, while
>   allowing the control plane to disseminate full routing information.
>   Firstly, it could be easily noted that in many cases multiple
>   prefixes, some of which are less specific, share the same set of the
> ***************
> *** 1550,1563 ****
>   [RFC6769] and only install the least specific route in the FIB,
>   ignoring more specific routes if they share the same next-hop set.
>   For example, under normal network conditions, only the default route
> ! need to be programmed into FIB.
> 
>   Furthermore, if the Tier-2 devices are configured with summary
> ! prefixes covering all of their attached Tier-3 device's prefixes the
>   same logic could be applied in Tier-1 devices as well, and, by
>   induction to Tier-2/Tier-3 switches in different clusters.  These
>   summary routes should still allow for more specific prefixes to leak
> ! to Tier-1 devices, to enable for detection of mismatches in the next-
>   hop sets if a particular link fails, changing the next-hop set for a
>   specific prefix.
> 
> --- 1550,1563 ----
>   [RFC6769] and only install the least specific route in the FIB,
>   ignoring more specific routes if they share the same next-hop set.
>   For example, under normal network conditions, only the default route
> ! needs to be programmed into the FIB.
> 
>   Furthermore, if the Tier-2 devices are configured with summary
> ! prefixes covering all of their attached Tier-3 devices' prefixes, the
>   same logic could be applied in Tier-1 devices as well, and, by
>   induction to Tier-2/Tier-3 switches in different clusters.  These
>   summary routes should still allow for more specific prefixes to leak
> ! to Tier-1 devices, to enable detection of mismatches in the next-
>   hop sets if a particular link fails, changing the next-hop set for a
>   specific prefix.
> 
> ***************
> *** 1571,1584 ****
> 
> 
>   Re-stating once again, this technique does not reduce the amount of
> ! control plane state (i.e. BGP UPDATEs/BGP LocRIB sizing), but only
> ! allows for more efficient FIB utilization, by spotting more specific
> ! prefixes that share their next-hops with less specifics.
> 
>   8.3.  ICMP Unreachable Message Masquerading
> 
>   This section discusses some operational aspects of not advertising
> ! point-to-point link subnets into BGP, as previously outlined as an
>   option in Section 5.2.3.  The operational impact of this decision
>   could be seen when using the well-known "traceroute" tool.
>   Specifically, IP addresses displayed by the tool will be the link's
> --- 1571,1585 ----
> 
> 
>   Re-stating once again, this technique does not reduce the amount of
> ! control plane state (i.e., BGP UPDATEs/BGP Loc-RIB size), but only
> ! allows for more efficient FIB utilization, by detecting more specific
> ! prefixes that share their next-hop set with a subsuming less specific
> ! prefix.
> 
>   8.3.  ICMP Unreachable Message Masquerading
> 
>   This section discusses some operational aspects of not advertising
> ! point-to-point link subnets into BGP, as previously identified as an
>   option in Section 5.2.3.  The operational impact of this decision
>   could be seen when using the well-known "traceroute" tool.
>   Specifically, IP addresses displayed by the tool will be the link's
> ***************
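The suppression logic of Simple Virtual Aggregation can be sketched as follows; the prefixes and next-hop names are invented, and this is only an illustration of the RFC 6769 idea, not an implementation of it. A more specific route is programmed into the FIB only when its next-hop set differs from that of a covering, less specific route:

    # Sketch of the Virtual Aggregation idea above (invented prefixes
    # and next-hop names): program a more specific prefix into the FIB
    # only when its next-hop set differs from a covering route's set.
    import ipaddress

    rib = {
        ipaddress.ip_network("0.0.0.0/0"):   {"t1-a", "t1-b"},
        ipaddress.ip_network("10.3.0.0/24"): {"t1-a", "t1-b"},  # same set
        ipaddress.ip_network("10.4.0.0/24"): {"t1-a"},          # link failed
    }

    def compress(rib):
        fib = {}
        # Walk from least to most specific so covering routes come first.
        for net, nhs in sorted(rib.items(), key=lambda kv: kv[0].prefixlen):
            covered = any(net != c and net.subnet_of(c) and fib[c] == nhs
                          for c in fib)
            if not covered:
                fib[net] = nhs
        return fib

    fib = compress(rib)
    assert ipaddress.ip_network("10.3.0.0/24") not in fib  # suppressed
    assert ipaddress.ip_network("10.4.0.0/24") in fib      # mismatch, installed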
> *** 1587,1605 ****
>   complicated.
> 
>   One way to overcome this limitation is by using the DNS subsystem to
> ! create the "reverse" entries for the IP addresses of the same device
> ! pointing to the same name.  The connectivity then can be made by
> ! resolving this name to the "primary" IP address of the devices, e.g.
>   its Loopback interface, which is always advertised into BGP.
>   However, this creates a dependency on the DNS subsystem, which may be
>   unavailable during an outage.
> 
>   Another option is to make the network device perform IP address
>   masquerading, that is rewriting the source IP addresses of the
> ! appropriate ICMP messages sent off of the device with the "primary"
>   IP address of the device.  Specifically, the ICMP Destination
>   Unreachable Message (type 3) codes 3 (port unreachable) and ICMP Time
> ! Exceeded (type 11) code 0, which are involved in proper working of
>   the "traceroute" tool.  With this modification, the "traceroute"
>   probes sent to the devices will always be sent back with the
>   "primary" IP address as the source, allowing the operator to discover
> --- 1588,1606 ----
>   complicated.
> 
>   One way to overcome this limitation is by using the DNS subsystem to
> ! create the "reverse" entries for these point-to-point IP addresses
> ! pointing to the same name as the loopback address.  The connectivity
> ! then can be made by resolving this name to the "primary" IP address
> ! of the devices, e.g.,
>   its Loopback interface, which is always advertised into BGP.
>   However, this creates a dependency on the DNS subsystem, which may be
>   unavailable during an outage.
> 
>   Another option is to make the network device perform IP address
>   masquerading, that is rewriting the source IP addresses of the
> ! appropriate ICMP messages sent by the device with the "primary"
>   IP address of the device.  Specifically, the ICMP Destination
>   Unreachable Message (type 3) codes 3 (port unreachable) and ICMP Time
> ! Exceeded (type 11) code 0, which are required for correct operation of
>   the "traceroute" tool.  With this modification, the "traceroute"
>   probes sent to the devices will always be sent back with the
>   "primary" IP address as the source, allowing the operator to discover
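The masquerading rule itself is small enough to sketch; the loopback and link addresses below are invented, and a real implementation would apply this in the device's packet path rather than in Python. Only the two ICMP message types that traceroute depends on have their source rewritten to the "primary" address:

    # Sketch of the ICMP source-address masquerading described above
    # (invented addresses; illustrative only).
    LOOPBACK = "192.0.2.1"  # "primary" address, advertised into BGP

    def icmp_source(icmp_type, icmp_code, link_addr):
        # Rewrite only the messages traceroute depends on: Time Exceeded
        # (type 11, code 0) and Destination Unreachable / port
        # unreachable (type 3, code 3).
        if (icmp_type, icmp_code) in ((11, 0), (3, 3)):
            return LOOPBACK
        return link_addr

    assert icmp_source(11, 0, "10.255.0.9") == "192.0.2.1"   # probe rewritten
    assert icmp_source(3, 1, "10.255.0.9") == "10.255.0.9"   # others untouched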
> 
> Thanks,
> Acee
> 
> _______________________________________________
> rtgwg mailing list
> rtgwg@ietf.org
> https://www.ietf.org/mailman/listinfo/rtgwg