Re: [RTG-DIR] Routing Directorate Review for "Use of BGP for routing in large-scale data centers"

Alia Atlas <akatlas@gmail.com> Wed, 27 April 2016 20:30 UTC

Date: Wed, 27 Apr 2016 16:30:35 -0400
Message-ID: <CAG4d1rch2iCuRiWKYRZtg2zWOHS2c-1uoHHX=6gKz97W+sQpzA@mail.gmail.com>
From: Alia Atlas <akatlas@gmail.com>
To: "Acee Lindem (acee)" <acee@cisco.com>
Archived-At: <http://mailarchive.ietf.org/arch/msg/rtg-dir/NgeY1yqemc40psopxwcPScz87cM>
Cc: "draft-ietf-rtgwg-bgp-routing-large-dc@ietf.org" <draft-ietf-rtgwg-bgp-routing-large-dc@ietf.org>, Routing Directorate <rtg-dir@ietf.org>, Routing ADs <rtg-ads@tools.ietf.org>
Subject: Re: [RTG-DIR] Routing Directorate Review for "Use of BGP for routing in large-scale data centers"

Hi Acee,

Thanks very much for your thorough review!

Authors and Shepherd,
Could you please work to address these comments and submit an updated draft?
I still need to do my AD review and hope to have this on the IESG telechat
on May 19, which means getting it into IETF Last Call this week.

Thanks,
Alia

On Mon, Apr 25, 2016 at 1:13 PM, Acee Lindem (acee) <acee@cisco.com> wrote:

> Hello,
>
> I have been selected as the Routing Directorate reviewer for this draft.
> The Routing Directorate seeks to review all routing or routing-related
> drafts as they pass through IETF last call and IESG review, and sometimes
> on special request. The purpose of the review is to provide assistance to
> the Routing ADs. For more information about the Routing Directorate,
> please see http://trac.tools.ietf.org/area/rtg/trac/wiki/RtgDir
>
> Although these comments are primarily for the use of the Routing ADs, it
> would be helpful if you could consider them along with any other IETF Last
> Call comments that you receive, and strive to resolve them through
> discussion or by updating the draft.
>
> Document: draft-ietf-rtgwg-bgp-routing-large-dc-09.txt
> Reviewer: Acee Lindem
> Review Date: 4/25/16
> IETF LC End Date: Not started
> Intended Status: Informational
>
> Summary:
>     This document is basically ready for publication, but has some minor
> issues and nits that should be resolved prior to publication.
>
> Comments:
>     The document starts with the requirements for MSDC routing and then
> provides an overview of Clos topologies and data center network
> design. This overview attempts to cover a lot of material in a very
> small amount of text. While not completely successful, the overview
> provides a lot of good information and references. The bulk of the
> document covers the usage of EBGP as the sole data center routing protocol
> and other aspects of the routing design including ECMP, summarization
> issues, and convergence. These sections provide a very good guide for
> using EBGP in a Clos data center and an excellent discussion of the
> deployment issues (based on real deployment experience).
>
>     The technical content of the document is excellent. The readability
> could be improved by breaking up some of the run-on sentences and with the
> suggested editorial changes (see Nits below).
>
>
> Major Issues:
>
>     I have no major issues with the document.
>
> Minor Issues:
>
>     Section 4.2: Can an informative reference be added for Direct Server
> Return (DSR)?
>     Section 5.2.4 and 7.4: Define precisely what is meant by "scale-out"
> topology somewhere in the document.
>     Section 5.2.5: Can you add a backward reference to the discussion of
> "lack of peer links inside every peer”? Also, it would be good to describe
> how this would allow for summarization and under what failure conditions.
>     Section 7.4: Should you add a reference to
> https://www.ietf.org/id/draft-ietf-rtgwg-bgp-pic-00.txt to the penultimate
> paragraph in this section?
>
> Nits:
>
> ***************
> *** 143,149 ****
>      network stability so that a small group of people can effectively
>      support a significantly sized network.
>
> !    Experimentation and extensive testing has shown that External BGP
>      (EBGP) [RFC4271] is well suited as a stand-alone routing protocol for
>      these type of data center applications.  This is in contrast with
>      more traditional DC designs, which may se simple tree topologies and
> --- 143,149 ----
>      network stability so that a small group of people can effectively
>      support a significantly sized network.
>
> !    Experimentation and extensive testing have shown that External BGP
>      (EBGP) [RFC4271] is well suited as a stand-alone routing protocol for
>      these type of data center applications.  This is in contrast with
>      more traditional DC designs, which may use simple tree topologies and
> ***************
> *** 178,191 ****
>   2.1.  Bandwidth and Traffic Patterns
>
>      The primary requirement when building an interconnection network for
> !    large number of servers is to accommodate application bandwidth and
>      latency requirements.  Until recently it was quite common to see the
>      majority of traffic entering and leaving the data center, commonly
>      referred to as "north-south" traffic.  Traditional "tree" topologies
>      were sufficient to accommodate such flows, even with high
>      oversubscription ratios between the layers of the network.  If more
>      bandwidth was required, it was added by "scaling up" the network
> !    elements, e.g. by upgrading the device's linecards or fabrics or
>      replacing the device with one with higher port density.
>
>      Today many large-scale data centers host applications generating
> --- 178,191 ----
>   2.1.  Bandwidth and Traffic Patterns
>
>      The primary requirement when building an interconnection network for
> !    a large number of servers is to accommodate application bandwidth and
>      latency requirements.  Until recently it was quite common to see the
>      majority of traffic entering and leaving the data center, commonly
>      referred to as "north-south" traffic.  Traditional "tree" topologies
>      were sufficient to accommodate such flows, even with high
>      oversubscription ratios between the layers of the network.  If more
>      bandwidth was required, it was added by "scaling up" the network
> !    elements, e.g., by upgrading the device's linecards or fabrics or
>      replacing the device with one with higher port density.
>
>      Today many large-scale data centers host applications generating
> ***************
> *** 195,201 ****
>      [HADOOP], massive data replication between clusters needed by certain
>      applications, or virtual machine migrations.  Scaling traditional
>      tree topologies to match these bandwidth demands becomes either too
> !    expensive or impossible due to physical limitations, e.g. port
>      density in a switch.
>
>   2.2.  CAPEX Minimization
> --- 195,201 ----
>      [HADOOP], massive data replication between clusters needed by certain
>      applications, or virtual machine migrations.  Scaling traditional
>      tree topologies to match these bandwidth demands becomes either too
> !    expensive or impossible due to physical limitations, e.g., port
>      density in a switch.
>
>   2.2.  CAPEX Minimization
> ***************
> *** 209,215 ****
>
>      o  Unifying all network elements, preferably using the same hardware
>         type or even the same device.  This allows for volume pricing on
> !       bulk purchases and reduced maintenance and sparing costs.
>
>      o  Driving costs down using competitive pressures, by introducing
>         multiple network equipment vendors.
> --- 209,215 ----
>
>      o  Unifying all network elements, preferably using the same hardware
>         type or even the same device.  This allows for volume pricing on
> !       bulk purchases and reduced maintenance and inventory costs.
>
>      o  Driving costs down using competitive pressures, by introducing
>         multiple network equipment vendors.
> ***************
> *** 234,244 ****
>      minimizes software issue-related failures.
>
>      An important aspect of Operational Expenditure (OPEX) minimization is
> !    reducing size of failure domains in the network.  Ethernet networks
>      are known to be susceptible to broadcast or unicast traffic storms
>      that can have a dramatic impact on network performance and
>      availability.  The use of a fully routed design significantly reduces
> !    the size of the data plane failure domains - i.e. limits them to the
>      lowest level in the network hierarchy.  However, such designs
>      introduce the problem of distributed control plane failures.  This
>      observation calls for simpler and less control plane protocols to
> --- 234,244 ----
>      minimizes software issue-related failures.
>
>      An important aspect of Operational Expenditure (OPEX) minimization is
> !    reducing the size of failure domains in the network.  Ethernet networks
>      are known to be susceptible to broadcast or unicast traffic storms
>      that can have a dramatic impact on network performance and
>      availability.  The use of a fully routed design significantly reduces
> !    the size of the data plane failure domains, i.e., limits them to the
>      lowest level in the network hierarchy.  However, such designs
>      introduce the problem of distributed control plane failures.  This
>      observation calls for simpler and less control plane protocols to
> ***************
> *** 253,259 ****
>      performed by network devices.  Traditionally, load balancers are
>      deployed as dedicated devices in the traffic forwarding path.  The
>      problem arises in scaling load balancers under growing traffic
> !    demand.  A preferable solution would be able to scale load balancing
>      layer horizontally, by adding more of the uniform nodes and
>      distributing incoming traffic across these nodes.  In situations like
>      this, an ideal choice would be to use network infrastructure itself
> --- 253,259 ----
>      performed by network devices.  Traditionally, load balancers are
>      deployed as dedicated devices in the traffic forwarding path.  The
>      problem arises in scaling load balancers under growing traffic
> !    demand.  A preferable solution would be able to scale the load balancing
>      layer horizontally, by adding more of the uniform nodes and
>      distributing incoming traffic across these nodes.  In situations like
>      this, an ideal choice would be to use network infrastructure itself
> ***************
> *** 305,311 ****
>   3.1.  Traditional DC Topology
>
>      In the networking industry, a common design choice for data centers
> !    typically look like a (upside down) tree with redundant uplinks and
>      three layers of hierarchy namely; core, aggregation/distribution and
>      access layers (see Figure 1).  To accommodate bandwidth demands, each
>      higher layer, from server towards DC egress or WAN, has higher port
> --- 305,311 ----
>   3.1.  Traditional DC Topology
>
>      In the networking industry, a common design choice for data centers
> !    typically look like an (upside down) tree with redundant uplinks and
>      three layers of hierarchy namely; core, aggregation/distribution and
>      access layers (see Figure 1).  To accommodate bandwidth demands, each
>      higher layer, from server towards DC egress or WAN, has higher port
> ***************
> *** 373,379 ****
>      topology, sometimes called "fat-tree" (see, for example, [INTERCON]
>      and [ALFARES2008]).  This topology features an odd number of stages
>      (sometimes known as dimensions) and is commonly made of uniform
> !    elements, e.g. network switches with the same port count.  Therefore,
>      the choice of folded Clos topology satisfies REQ1 and facilitates
>      REQ2.  See Figure 2 below for an example of a folded 3-stage Clos
>      topology (3 stages counting Tier-2 stage twice, when tracing a packet
> --- 373,379 ----
>      topology, sometimes called "fat-tree" (see, for example, [INTERCON]
>      and [ALFARES2008]).  This topology features an odd number of stages
>      (sometimes known as dimensions) and is commonly made of uniform
> !    elements, e.g., network switches with the same port count.  Therefore,
>      the choice of folded Clos topology satisfies REQ1 and facilitates
>      REQ2.  See Figure 2 below for an example of a folded 3-stage Clos
>      topology (3 stages counting Tier-2 stage twice, when tracing a packet
> ***************
> *** 460,466 ****
>   3.2.3.  Scaling the Clos topology
>
>      A Clos topology can be scaled either by increasing network element
> !    port density or adding more stages, e.g. moving to a 5-stage Clos, as
>      illustrated in Figure 3 below:
>
>                                         Tier-1
> --- 460,466 ----
>   3.2.3.  Scaling the Clos topology
>
>      A Clos topology can be scaled either by increasing network element
> !    port density or adding more stages, e.g., moving to a 5-stage Clos, as
>      illustrated in Figure 3 below:
>
>                                         Tier-1
> ***************
> *** 523,529 ****
>   3.2.4.  Managing the Size of Clos Topology Tiers
>
>      If a data center network size is small, it is possible to reduce the
> !    number of switches in Tier-1 or Tier-2 of Clos topology by a factor
>      of two.  To understand how this could be done, take Tier-1 as an
>      example.  Every Tier-2 device connects to a single group of Tier-1
>      devices.  If half of the ports on each of the Tier-1 devices are not
> --- 523,529 ----
>   3.2.4.  Managing the Size of Clos Topology Tiers
>
>      If a data center network size is small, it is possible to reduce the
> !    number of switches in Tier-1 or Tier-2 of a Clos topology by a factor
>      of two.  To understand how this could be done, take Tier-1 as an
>      example.  Every Tier-2 device connects to a single group of Tier-1
>      devices.  If half of the ports on each of the Tier-1 devices are not
> ***************
> *** 574,580 ****
>      originally defined in [IEEE8021D-1990] for loop free topology
>      creation, typically utilizing variants of the traditional DC topology
>      described in Section 3.1.  At the time, many DC switches either did
> !    not support Layer 3 routed protocols or supported it with additional
>      licensing fees, which played a part in the design choice.  Although
>      many enhancements have been made through the introduction of Rapid
>      Spanning Tree Protocol (RSTP) in the latest revision of
> --- 574,580 ----
>      originally defined in [IEEE8021D-1990] for loop free topology
>      creation, typically utilizing variants of the traditional DC topology
>      described in Section 3.1.  At the time, many DC switches either did
> !    not support Layer 3 routing protocols or supported them with additional
>      licensing fees, which played a part in the design choice.  Although
>      many enhancements have been made through the introduction of Rapid
>      Spanning Tree Protocol (RSTP) in the latest revision of
> ***************
> *** 599,605 ****
>      as the backup for loop prevention.  The major downsides of this
>      approach are the lack of ability to scale linearly past two in most
>      implementations, lack of standards based implementations, and added
> !    failure domain risk of keeping state between the devices.
>
>      It should be noted that building large, horizontally scalable, Layer
>      2 only networks without STP is possible recently through the
> --- 599,605 ----
>      as the backup for loop prevention.  The major downsides of this
>      approach are the lack of ability to scale linearly past two in most
>      implementations, lack of standards based implementations, and added
> !    the failure domain risk of syncing state between the devices.
>
>      It should be noted that building large, horizontally scalable, Layer
>      2 only networks without STP is possible recently through the
> ***************
> *** 621,631 ****
>      Finally, neither the base TRILL specification nor the M-LAG approach
>      totally eliminate the problem of the shared broadcast domain, that is
>      so detrimental to the operations of any Layer 2, Ethernet based
> !    solutions.  Later TRILL extensions have been proposed to solve the
>      this problem statement primarily based on the approaches outlined in
>      [RFC7067], but this even further limits the number of available
> !    interoperable implementations that can be used to build a fabric,
> !    therefore TRILL based designs have issues meeting REQ2, REQ3, and
>      REQ4.
>
>   4.2.  Hybrid L2/L3 Designs
> --- 621,631 ----
>      Finally, neither the base TRILL specification nor the M-LAG approach
>      totally eliminate the problem of the shared broadcast domain, that is
>      so detrimental to the operations of any Layer 2, Ethernet based
> !    solution.  Later TRILL extensions have been proposed to solve the
>      this problem statement primarily based on the approaches outlined in
>      [RFC7067], but this even further limits the number of available
> !    interoperable implementations that can be used to build a fabric.
> !    Therefore, TRILL based designs have issues meeting REQ2, REQ3, and
>      REQ4.
>
>   4.2.  Hybrid L2/L3 Designs
> ***************
> *** 635,641 ****
>      in either the Tier-1 or Tier-2 parts of the network and dividing the
>      Layer 2 domain into numerous, smaller domains.  This design has
>      allowed data centers to scale up, but at the cost of complexity in
> !    the network managing multiple protocols.  For the following reasons,
>      operators have retained Layer 2 in either the access (Tier-3) or both
>      access and aggregation (Tier-3 and Tier-2) parts of the network:
>
> --- 635,641 ----
>      in either the Tier-1 or Tier-2 parts of the network and dividing the
>      Layer 2 domain into numerous, smaller domains.  This design has
>      allowed data centers to scale up, but at the cost of complexity in
> !    managing multiple network protocols.  For the following reasons,
>      operators have retained Layer 2 in either the access (Tier-3) or both
>      access and aggregation (Tier-3 and Tier-2) parts of the network:
>
> ***************
> *** 644,650 ****
>
>      o  Seamless mobility for virtual machines that require the
>         preservation of IP addresses when a virtual machine moves to
> !       different Tier-3 switch.
>
>      o  Simplified IP addressing = less IP subnets are required for the
>         data center.
> --- 644,650 ----
>
>      o  Seamless mobility for virtual machines that require the
>         preservation of IP addresses when a virtual machine moves to
> !       a different Tier-3 switch.
>
>      o  Simplified IP addressing = less IP subnets are required for the
>         data center.
> ***************
> *** 679,686 ****
>      adoption in networks where large Layer 2 adjacency and larger size
>      Layer 3 subnets are not as critical compared to network scalability
>      and stability.  Application providers and network operators continue
> !    to also develop new solutions to meet some of the requirements that
> !    previously have driven large Layer 2 domains by using various overlay
>      or tunneling techniques.
>
>   5.  Routing Protocol Selection and Design
> --- 679,686 ----
>      adoption in networks where large Layer 2 adjacency and larger size
>      Layer 3 subnets are not as critical compared to network scalability
>      and stability.  Application providers and network operators continue
> !    to develop new solutions to meet some of the requirements that
> !    previously had driven large Layer 2 domains using various overlay
>      or tunneling techniques.
>
>   5.  Routing Protocol Selection and Design
> ***************
> *** 700,706 ****
>      design.
>
>      Although EBGP is the protocol used for almost all inter-domain
> !    routing on the Internet and has wide support from both vendor and
>      service provider communities, it is not generally deployed as the
>      primary routing protocol within the data center for a number of
>      reasons (some of which are interrelated):
> --- 700,706 ----
>      design.
>
>      Although EBGP is the protocol used for almost all inter-domain
> !    routing in the Internet and has wide support from both vendor and
>      service provider communities, it is not generally deployed as the
>      primary routing protocol within the data center for a number of
>      reasons (some of which are interrelated):
> ***************
> *** 741,754 ****
>         state IGPs.  Since every BGP router calculates and propagates only
>         the best-path selected, a network failure is masked as soon as the
>         BGP speaker finds an alternate path, which exists when highly
> !       symmetric topologies, such as Clos, are coupled with EBGP only
>         design.  In contrast, the event propagation scope of a link-state
>         IGP is an entire area, regardless of the failure type.  In this
>         way, BGP better meets REQ3 and REQ4.  It is also worth mentioning
>         that all widely deployed link-state IGPs feature periodic
> !       refreshes of routing information, even if this rarely causes
> !       impact to modern router control planes, while BGP does not expire
> !       routing state.
>
>      o  BGP supports third-party (recursively resolved) next-hops.  This
>         allows for manipulating multipath to be non-ECMP based or
> --- 741,754 ----
>         state IGPs.  Since every BGP router calculates and propagates only
>         the best-path selected, a network failure is masked as soon as the
>         BGP speaker finds an alternate path, which exists when highly
> !       symmetric topologies, such as Clos, are coupled with an EBGP only
>         design.  In contrast, the event propagation scope of a link-state
>         IGP is an entire area, regardless of the failure type.  In this
>         way, BGP better meets REQ3 and REQ4.  It is also worth mentioning
>         that all widely deployed link-state IGPs feature periodic
> !       refreshes of routing information while BGP does not expire
> !       routing state, although this rarely impacts modern router control
> !       planes.
>
>      o  BGP supports third-party (recursively resolved) next-hops.  This
>         allows for manipulating multipath to be non-ECMP based or
> ***************
> *** 765,775 ****
>         controlled and complex unwanted paths will be ignored.  See
>         Section 5.2 for an example of a working ASN allocation scheme.  In
>         a link-state IGP accomplishing the same goal would require multi-
> !       (instance/topology/processes) support, typically not available in
>         all DC devices and quite complex to configure and troubleshoot.
>         Using a traditional single flooding domain, which most DC designs
>         utilize, under certain failure conditions may pick up unwanted
> !       lengthy paths, e.g. traversing multiple Tier-2 devices.
>
>      o  EBGP configuration that is implemented with minimal routing policy
>         is easier to troubleshoot for network reachability issues.  In
> --- 765,775 ----
>         controlled and complex unwanted paths will be ignored.  See
>         Section 5.2 for an example of a working ASN allocation scheme.  In
>         a link-state IGP accomplishing the same goal would require multi-
> !       (instance/topology/process) support, typically not available in
>         all DC devices and quite complex to configure and troubleshoot.
>         Using a traditional single flooding domain, which most DC designs
>         utilize, under certain failure conditions may pick up unwanted
> !       lengthy paths, e.g., traversing multiple Tier-2 devices.
>
>      o  EBGP configuration that is implemented with minimal routing policy
>         is easier to troubleshoot for network reachability issues.  In
> ***************
> *** 806,812 ****
>         loopback sessions are used even in the case of multiple links
>         between the same pair of nodes.
>
> !    o  Private Use ASNs from the range 64512-65534 are used so as to
>         avoid ASN conflicts.
>
>      o  A single ASN is allocated to all of the Clos topology's Tier-1
> --- 806,812 ----
>         loopback sessions are used even in the case of multiple links
>         between the same pair of nodes.
>
> !    o  Private Use ASNs from the range 64512-65534 are used to
>         avoid ASN conflicts.
>
>      o  A single ASN is allocated to all of the Clos topology's Tier-1
> ***************
> *** 815,821 ****
>      o  A unique ASN is allocated to each set of Tier-2 devices in the
>         same cluster.
>
> !    o  A unique ASN is allocated to every Tier-3 device (e.g.  ToR) in
>         this topology.
>
>
> --- 815,821 ----
>      o  A unique ASN is allocated to each set of Tier-2 devices in the
>         same cluster.
>
> !    o  A unique ASN is allocated to every Tier-3 device (e.g.,  ToR) in
>         this topology.
>
>
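>      [Purely illustrative aside, not text from the draft: the ASN
>      numbering scheme in the bullets above can be sketched in a few
>      lines of Python.  The helper and all names below are hypothetical;
>      the only facts assumed are the 16-bit Private Use range
>      64512-65534 [RFC6996] and the per-tier allocation rules listed.]
>
>          # Sketch of the Section 5.2 ASN allocation scheme (illustrative only):
>          # one shared ASN for Tier-1, one ASN per Tier-2 cluster, and one ASN
>          # per Tier-3 (ToR) device, all from the Private Use range 64512-65534.
>          PRIVATE_ASN_MIN, PRIVATE_ASN_MAX = 64512, 65534
>
>          def allocate_asns(num_clusters, tors_per_cluster):
>              next_asn = PRIVATE_ASN_MIN
>
>              def take():
>                  nonlocal next_asn
>                  if next_asn > PRIVATE_ASN_MAX:
>                      raise ValueError("Private Use ASN range exhausted")
>                  asn, next_asn = next_asn, next_asn + 1
>                  return asn
>
>              plan = {"tier1": take(), "tier2": {}, "tier3": {}}
>              for cluster in range(num_clusters):
>                  plan["tier2"][cluster] = take()
>                  plan["tier3"][cluster] = [take() for _ in range(tors_per_cluster)]
>              return plan
>
>          # Example: 8 clusters of 32 ToRs -> 1 + 8 + 256 = 265 ASNs, well within
>          # the 1023 Private Use ASNs available before 4-octet ASNs are needed.
>          print(allocate_asns(8, 32)["tier1"])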
> ***************
> *** 903,922 ****
>
>      Another solution to this problem would be using Four-Octet ASNs
>      ([RFC6793]), where there are additional Private Use ASNs available,
> !    see [IANA.AS].  Use of Four-Octet ASNs put additional protocol
> !    complexity in the BGP implementation so should be considered against
>      the complexity of re-use when considering REQ3 and REQ4.  Perhaps
>      more importantly, they are not yet supported by all BGP
>      implementations, which may limit vendor selection of DC equipment.
> !    When supported, ensure that implementations in use are able to remove
> !    the Private Use ASNs if required for external connectivity
> !    (Section 5.2.4).
>
>   5.2.3.  Prefix Advertisement
>
>      A Clos topology features a large number of point-to-point links and
>      associated prefixes.  Advertising all of these routes into BGP may
> !    create FIB overload conditions in the network devices.  Advertising
>      these links also puts additional path computation stress on the BGP
>      control plane for little benefit.  There are two possible solutions:
>
> --- 903,922 ----
>
>      Another solution to this problem would be using Four-Octet ASNs
>      ([RFC6793]), where there are additional Private Use ASNs available,
> !    see [IANA.AS].  Use of Four-Octet ASNs puts additional protocol
> !    complexity in the BGP implementation and should be balanced against
>      the complexity of re-use when considering REQ3 and REQ4.  Perhaps
>      more importantly, they are not yet supported by all BGP
>      implementations, which may limit vendor selection of DC equipment.
> !    When supported, ensure that deployed implementations are able to remove
> !    the Private Use ASNs when external connectivity to these ASes is
> !    required (Section 5.2.4).
>
>   5.2.3.  Prefix Advertisement
>
>      A Clos topology features a large number of point-to-point links and
>      associated prefixes.  Advertising all of these routes into BGP may
> !    create FIB overload in the network devices.  Advertising
>      these links also puts additional path computation stress on the BGP
>      control plane for little benefit.  There are two possible solutions:
>
> ***************
> *** 925,951 ****
>         device, distant networks will automatically be reachable via the
>         advertising EBGP peer and do not require reachability to these
>         prefixes.  However, this may complicate operations or monitoring:
> !       e.g. using the popular "traceroute" tool will display IP addresses
>         that are not reachable.
>
>      o  Advertise point-to-point links, but summarize them on every
>         device.  This requires an address allocation scheme such as
>         allocating a consecutive block of IP addresses per Tier-1 and
>         Tier-2 device to be used for point-to-point interface addressing
> !       to the lower layers (Tier-2 uplinks will be numbered out of Tier-1
> !       addressing and so forth).
>
>      Server subnets on Tier-3 devices must be announced into BGP without
>      using route summarization on Tier-2 and Tier-1 devices.  Summarizing
>      subnets in a Clos topology results in route black-holing under a
> !    single link failure (e.g. between Tier-2 and Tier-3 devices) and
>      hence must be avoided.  The use of peer links within the same tier to
>      resolve the black-holing problem by providing "bypass paths" is
>      undesirable due to O(N^2) complexity of the peering mesh and waste of
>      ports on the devices.  An alternative to the full-mesh of peer-links
> !    would be using a simpler bypass topology, e.g. a "ring" as described
>      in [FB4POST], but such a topology adds extra hops and has very
> !    limited bisection bandwidth, in addition requiring special tweaks to
>
>
>
> --- 925,951 ----
>         device, distant networks will automatically be reachable via the
>         advertising EBGP peer and do not require reachability to these
>         prefixes.  However, this may complicate operations or monitoring:
> !       e.g., using the popular "traceroute" tool will display IP addresses
>         that are not reachable.
>
>      o  Advertise point-to-point links, but summarize them on every
>         device.  This requires an address allocation scheme such as
>         allocating a consecutive block of IP addresses per Tier-1 and
>         Tier-2 device to be used for point-to-point interface addressing
> !       to the lower layers (Tier-2 uplink addresses will be allocated
> !       from Tier-1 address blocks and so forth).
>
>      Server subnets on Tier-3 devices must be announced into BGP without
>      using route summarization on Tier-2 and Tier-1 devices.  Summarizing
>      subnets in a Clos topology results in route black-holing under a
> !    single link failure (e.g., between Tier-2 and Tier-3 devices) and
>      hence must be avoided.  The use of peer links within the same tier to
>      resolve the black-holing problem by providing "bypass paths" is
>      undesirable due to O(N^2) complexity of the peering mesh and waste of
>      ports on the devices.  An alternative to the full-mesh of peer-links
> !    would be using a simpler bypass topology, e.g., a "ring" as described
>      in [FB4POST], but such a topology adds extra hops and has very
> !    limited bisectional bandwidth. Additionally requiring special tweaks to
>
>
>
> ***************
> *** 956,963 ****
>
>      make BGP routing work - such as possibly splitting every device into
>      an ASN on its own.  Later in this document, Section 8.2 introduces a
> !    less intrusive method for performing a limited form route
> !    summarization in Clos networks and discusses it's associated trade-
>      offs.
>
>   5.2.4.  External Connectivity
> --- 956,963 ----
>
>      make BGP routing work - such as possibly splitting every device into
>      an ASN on its own.  Later in this document, Section 8.2 introduces a
> !    less intrusive method for performing a limited form of route
> !    summarization in Clos networks and discusses its associated trade-
>      offs.
>
>   5.2.4.  External Connectivity
> ***************
> *** 972,985 ****
>      document.  These devices have to perform a few special functions:
>
>      o  Hide network topology information when advertising paths to WAN
> !       routers, i.e. remove Private Use ASNs [RFC6996] from the AS_PATH
>         attribute.  This is typically done to avoid ASN number collisions
>         between different data centers and also to provide a uniform
>         AS_PATH length to the WAN for purposes of WAN ECMP to Anycast
>         prefixes originated in the topology.  An implementation specific
>         BGP feature typically called "Remove Private AS" is commonly used
>         to accomplish this.  Depending on implementation, the feature
> !       should strip a contiguous sequence of Private Use ASNs found in
>         AS_PATH attribute prior to advertising the path to a neighbor.
>         This assumes that all ASNs used for intra data center numbering
>         are from the Private Use ranges.  The process for stripping the
> --- 972,985 ----
>      document.  These devices have to perform a few special functions:
>
>      o  Hide network topology information when advertising paths to WAN
> !       routers, i.e., remove Private Use ASNs [RFC6996] from the AS_PATH
>         attribute.  This is typically done to avoid ASN number collisions
>         between different data centers and also to provide a uniform
>         AS_PATH length to the WAN for purposes of WAN ECMP to Anycast
>         prefixes originated in the topology.  An implementation specific
>         BGP feature typically called "Remove Private AS" is commonly used
>         to accomplish this.  Depending on implementation, the feature
> !       should strip a contiguous sequence of Private Use ASNs found in an
>         AS_PATH attribute prior to advertising the path to a neighbor.
>         This assumes that all ASNs used for intra data center numbering
>         are from the Private Use ranges.  The process for stripping the
> ***************
> *** 998,1005 ****
>         to the WAN Routers upstream, to provide resistance to a single-
>         link failure causing the black-holing of traffic.  To prevent
>         black-holing in the situation when all of the EBGP sessions to the
> !       WAN routers fail simultaneously on a given device it is more
> !       desirable to take the "relaying" approach rather than introducing
>         the default route via complicated conditional route origination
>         schemes provided by some implementations [CONDITIONALROUTE].
>
> --- 998,1005 ----
>         to the WAN Routers upstream, to provide resistance to a single-
>         link failure causing the black-holing of traffic.  To prevent
>         black-holing in the situation when all of the EBGP sessions to the
> !       WAN routers fail simultaneously on a given device, it is more
> !       desirable to readvertise the default route rather than originating
>         the default route via complicated conditional route origination
>         schemes provided by some implementations [CONDITIONALROUTE].
>
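>      [Illustrative aside, not draft text: the "Remove Private AS"
>      behavior described in the excerpts above - stripping a contiguous
>      sequence of Private Use ASNs from the AS_PATH before advertising
>      a path toward the WAN - can be sketched as follows.  Function and
>      variable names are hypothetical.]
>
>          def is_private_asn(asn):
>              # 16-bit Private Use range [RFC6996]; the 4-octet range is omitted.
>              return 64512 <= asn <= 65534
>
>          def strip_private_asns(as_path):
>              # Drop the leading contiguous run of Private Use ASNs, as a border
>              # router would do before advertising the path to a WAN peer.
>              i = 0
>              while i < len(as_path) and is_private_asn(as_path[i]):
>                  i += 1
>              return as_path[i:]
>
>          # A path built inside the DC is all Private Use ASNs; after stripping,
>          # the WAN peer sees only the border router's own ASN prepended as usual.
>          print(strip_private_asns([65021, 64901, 64601]))   # -> []
>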
> ***************
> *** 1017,1023 ****
>      prefixes originated from within the data center in a fully routed
>      network design.  For example, a network with 2000 Tier-3 devices will
>      have at least 2000 servers subnets advertised into BGP, along with
> !    the infrastructure or other prefixes.  However, as discussed before,
>      the proposed network design does not allow for route summarization
>      due to the lack of peer links inside every tier.
>
> --- 1017,1023 ----
>      prefixes originated from within the data center in a fully routed
>      network design.  For example, a network with 2000 Tier-3 devices will
>      have at least 2000 servers subnets advertised into BGP, along with
> !    the infrastructure and link prefixes.  However, as discussed before,
>      the proposed network design does not allow for route summarization
>      due to the lack of peer links inside every tier.
>
> ***************
> *** 1028,1037 ****
>      o  Interconnect the Border Routers using a full-mesh of physical
>         links or using any other "peer-mesh" topology, such as ring or
>         hub-and-spoke.  Configure BGP accordingly on all Border Leafs to
> !       exchange network reachability information - e.g. by adding a mesh
>         of IBGP sessions.  The interconnecting peer links need to be
>         appropriately sized for traffic that will be present in the case
> !       of a device or link failure underneath the Border Routers.
>
>      o  Tier-1 devices may have additional physical links provisioned
>         toward the Border Routers (which are Tier-2 devices from the
> --- 1028,1037 ----
>      o  Interconnect the Border Routers using a full-mesh of physical
>         links or using any other "peer-mesh" topology, such as ring or
>         hub-and-spoke.  Configure BGP accordingly on all Border Leafs to
> !       exchange network reachability information, e.g., by adding a mesh
>         of IBGP sessions.  The interconnecting peer links need to be
>         appropriately sized for traffic that will be present in the case
> !       of a device or link failure in the mesh connecting the Border Routers.
>
>      o  Tier-1 devices may have additional physical links provisioned
>         toward the Border Routers (which are Tier-2 devices from the
> ***************
> *** 1043,1049 ****
>         device compared with the other devices in the Clos.  This also
>         reduces the number of ports available to "regular" Tier-2 switches
>         and hence the number of clusters that could be interconnected via
> !       Tier-1 layer.
>
>      If any of the above options are implemented, it is possible to
>      perform route summarization at the Border Routers toward the WAN
> --- 1043,1049 ----
>         device compared with the other devices in the Clos.  This also
>         reduces the number of ports available to "regular" Tier-2 switches
>         and hence the number of clusters that could be interconnected via
> !       the Tier-1 layer.
>
>      If any of the above options are implemented, it is possible to
>      perform route summarization at the Border Routers toward the WAN
> ***************
> *** 1071,1079 ****
>      ECMP is the fundamental load sharing mechanism used by a Clos
>      topology.  Effectively, every lower-tier device will use all of its
>      directly attached upper-tier devices to load share traffic destined
> !    to the same IP prefix.  Number of ECMP paths between any two Tier-3
>      devices in Clos topology equals to the number of the devices in the
> !    middle stage (Tier-1).  For example, Figure 5 illustrates the
>      topology where Tier-3 device A has four paths to reach servers X and
>      Y, via Tier-2 devices B and C and then Tier-1 devices 1, 2, 3, and 4
>      respectively.
> --- 1071,1079 ----
>      ECMP is the fundamental load sharing mechanism used by a Clos
>      topology.  Effectively, every lower-tier device will use all of its
>      directly attached upper-tier devices to load share traffic destined
> !    to the same IP prefix.  The number of ECMP paths between any two Tier-3
>      devices in Clos topology equals to the number of the devices in the
> !    middle stage (Tier-1).  For example, Figure 5 illustrates a
>      topology where Tier-3 device A has four paths to reach servers X and
>      Y, via Tier-2 devices B and C and then Tier-1 devices 1, 2, 3, and 4
>      respectively.
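>
>      [Illustrative aside, not draft text: the fan-out arithmetic quoted
>      here and in the next excerpt is easy to sanity-check.  The helper
>      below is hypothetical and assumes a non-oversubscribed Clos built
>      from uniform devices with half the ports facing up and half down.]
>
>          def max_ecmp_fanout(ports_per_switch):
>              # ECMP fan-out normally does not exceed half of a device's ports.
>              return ports_per_switch // 2
>
>          print(max_ecmp_fanout(64))   # 64-port devices -> ECMP fan-out of 32
>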
> ***************
> *** 1105,1116 ****
>
>      The ECMP requirement implies that the BGP implementation must support
>      multipath fan-out for up to the maximum number of devices directly
> !    attached at any point in the topology in upstream or downstream
>      direction.  Normally, this number does not exceed half of the ports
>      found on a device in the topology.  For example, an ECMP fan-out of
>      32 would be required when building a Clos network using 64-port
>      devices.  The Border Routers may need to have wider fan-out to be
> !    able to connect to multitude of Tier-1 devices if route summarization
>      at Border Router level is implemented as described in Section 5.2.5.
>      If a device's hardware does not support wider ECMP, logical link-
>      grouping (link-aggregation at layer 2) could be used to provide
> --- 1105,1116 ----
>
>      The ECMP requirement implies that the BGP implementation must support
>      multipath fan-out for up to the maximum number of devices directly
> !    attached at any point in the topology in the upstream or downstream
>      direction.  Normally, this number does not exceed half of the ports
>      found on a device in the topology.  For example, an ECMP fan-out of
>      32 would be required when building a Clos network using 64-port
>      devices.  The Border Routers may need to have wider fan-out to be
> !    able to connect to a multitude of Tier-1 devices if route summarization
>      at Border Router level is implemented as described in Section 5.2.5.
>      If a device's hardware does not support wider ECMP, logical link-
>      grouping (link-aggregation at layer 2) could be used to provide
> ***************
> *** 1122,1131 ****
>   Internet-Draft    draft-ietf-rtgwg-bgp-routing-large-dc       March 2016
>
>
> !    "hierarchical" ECMP (Layer 3 ECMP followed by Layer 2 ECMP) to
>      compensate for fan-out limitations.  Such approach, however,
>      increases the risk of flow polarization, as less entropy will be
> !    available to the second stage of ECMP.
>
>      Most BGP implementations declare paths to be equal from an ECMP
>      perspective if they match up to and including step (e) in
> --- 1122,1131 ----
>   Internet-Draft    draft-ietf-rtgwg-bgp-routing-large-dc       March 2016
>
>
> !    "hierarchical" ECMP (Layer 3 ECMP coupled with Layer 2 ECMP) to
>      compensate for fan-out limitations.  Such approach, however,
>      increases the risk of flow polarization, as less entropy will be
> !    available at the second stage of ECMP.
>
>      Most BGP implementations declare paths to be equal from an ECMP
>      perspective if they match up to and including step (e) in
> ***************
> *** 1148,1154 ****
>      perspective of other devices, such a prefix would have BGP paths with
>      different AS_PATH attribute values, while having the same AS_PATH
>      attribute lengths.  Therefore, BGP implementations must support load
> !    sharing over above-mentioned paths.  This feature is sometimes known
>      as "multipath relax" or "multipath multiple-as" and effectively
>      allows for ECMP to be done across different neighboring ASNs if all
>      other attributes are equal as already described in the previous
> --- 1148,1154 ----
>      perspective of other devices, such a prefix would have BGP paths with
>      different AS_PATH attribute values, while having the same AS_PATH
>      attribute lengths.  Therefore, BGP implementations must support load
> !    sharing over the above-mentioned paths.  This feature is sometimes known
>      as "multipath relax" or "multipath multiple-as" and effectively
>      allows for ECMP to be done across different neighboring ASNs if all
>      other attributes are equal as already described in the previous
> ***************
> *** 1182,1199 ****
>
>      It is often desirable to have the hashing function used for ECMP to
>      be consistent (see [CONS-HASH]), to minimize the impact on flow to
> !    next-hop affinity changes when a next-hop is added or removed to ECMP
>      group.  This could be used if the network device is used as a load
>      balancer, mapping flows toward multiple destinations - in this case,
> !    losing or adding a destination will not have detrimental effect of
>      currently established flows.  One particular recommendation on
>      implementing consistent hashing is provided in [RFC2992], though
>      other implementations are possible.  This functionality could be
>      naturally combined with weighted ECMP, with the impact of the next-
>      hop changes being proportional to the weight of the given next-hop.
>      The downside of consistent hashing is increased load on hardware
> !    resource utilization, as typically more space is required to
> !    implement a consistent-hashing region.
>
>   7.  Routing Convergence Properties
>
> --- 1182,1199 ----
>
>      It is often desirable to have the hashing function used for ECMP to
>      be consistent (see [CONS-HASH]), to minimize the impact on flow to
> !    next-hop affinity changes when a next-hop is added or removed to an ECMP
>      group.  This could be used if the network device is used as a load
>      balancer, mapping flows toward multiple destinations - in this case,
> !    losing or adding a destination will not have a detrimental effect on
>      currently established flows.  One particular recommendation on
>      implementing consistent hashing is provided in [RFC2992], though
>      other implementations are possible.  This functionality could be
>      naturally combined with weighted ECMP, with the impact of the next-
>      hop changes being proportional to the weight of the given next-hop.
>      The downside of consistent hashing is increased load on hardware
> !    resource utilization, as typically more resources (e.g., TCAM space)
> !    are required to implement a consistent-hashing function.
>
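>      [Illustrative aside, not from the draft or [RFC2992]: a generic
>      hash-ring sketch showing the affinity property described above.
>      Removing one next-hop only remaps the flows that were pinned to
>      it; flows on the surviving next-hops keep their mapping.  All
>      names below are hypothetical.]
>
>          import hashlib
>
>          def _point(value):
>              return int(hashlib.md5(value.encode()).hexdigest(), 16)
>
>          def pick_next_hop(flow, next_hops, vnodes=64):
>              # Place several virtual points per next-hop on a hash ring and
>              # map the flow to the first point at or after its own hash.
>              ring = sorted((_point(f"{nh}#{i}"), nh)
>                            for nh in next_hops for i in range(vnodes))
>              p = _point(flow)
>              for ring_point, nh in ring:
>                  if ring_point >= p:
>                      return nh
>              return ring[0][1]          # wrap around the ring
>
>          flows = [f"10.0.0.{i}:49152->192.0.2.10:80" for i in range(100)]
>          before = {f: pick_next_hop(f, ["nh1", "nh2", "nh3", "nh4"]) for f in flows}
>          after  = {f: pick_next_hop(f, ["nh1", "nh2", "nh3"]) for f in flows}
>          moved  = sum(1 for f in flows
>                       if before[f] != "nh4" and after[f] != before[f])
>          print("flows moved off surviving next-hops:", moved)   # prints 0
>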
>   7.  Routing Convergence Properties
>
> ***************
> *** 1209,1224 ****
>      driven mechanism to obtain updates on IGP state changes.  The
>      proposed routing design does not use an IGP, so the remaining
>      mechanisms that could be used for fault detection are BGP keep-alive
> !    process (or any other type of keep-alive mechanism) and link-failure
>      triggers.
>
>      Relying solely on BGP keep-alive packets may result in high
> !    convergence delays, in the order of multiple seconds (on many BGP
>      implementations the minimum configurable BGP hold timer value is
>      three seconds).  However, many BGP implementations can shut down
>      local EBGP peering sessions in response to the "link down" event for
>      the outgoing interface used for BGP peering.  This feature is
> !    sometimes called as "fast fallover".  Since links in modern data
>      centers are predominantly point-to-point fiber connections, a
>      physical interface failure is often detected in milliseconds and
>      subsequently triggers a BGP re-convergence.
> --- 1209,1224 ----
>      driven mechanism to obtain updates on IGP state changes.  The
>      proposed routing design does not use an IGP, so the remaining
>      mechanisms that could be used for fault detection are BGP keep-alive
> !    time-out (or any other type of keep-alive mechanism) and link-failure
>      triggers.
>
>      Relying solely on BGP keep-alive packets may result in high
> !    convergence delays, on the order of multiple seconds (on many BGP
>      implementations the minimum configurable BGP hold timer value is
>      three seconds).  However, many BGP implementations can shut down
>      local EBGP peering sessions in response to the "link down" event for
>      the outgoing interface used for BGP peering.  This feature is
> !    sometimes called "fast fallover".  Since links in modern data
>      centers are predominantly point-to-point fiber connections, a
>      physical interface failure is often detected in milliseconds and
>      subsequently triggers a BGP re-convergence.
> ***************
> *** 1236,1242 ****
>
>      Alternatively, some platforms may support Bidirectional Forwarding
>      Detection (BFD) [RFC5880] to allow for sub-second failure detection
> !    and fault signaling to the BGP process.  However, use of either of
>      these presents additional requirements to vendor software and
>      possibly hardware, and may contradict REQ1.  Until recently with
>      [RFC7130], BFD also did not allow detection of a single member link
> --- 1236,1242 ----
>
>      Alternatively, some platforms may support Bidirectional Forwarding
>      Detection (BFD) [RFC5880] to allow for sub-second failure detection
> !    and fault signaling to the BGP process.  However, the use of either of
>      these presents additional requirements to vendor software and
>      possibly hardware, and may contradict REQ1.  Until recently with
>      [RFC7130], BFD also did not allow detection of a single member link
> ***************
> *** 1245,1251 ****
>
>   7.2.  Event Propagation Timing
>
> !    In the proposed design the impact of BGP Minimum Route Advertisement
>      Interval (MRAI) timer (See section 9.2.1.1 of [RFC4271]) should be
>      considered.  Per the standard it is required for BGP implementations
>      to space out consecutive BGP UPDATE messages by at least MRAI
> --- 1245,1251 ----
>
>   7.2.  Event Propagation Timing
>
> !    In the proposed design the impact of the BGP Minimum Route Advertisement
>      Interval (MRAI) timer (See section 9.2.1.1 of [RFC4271]) should be
>      considered.  Per the standard it is required for BGP implementations
>      to space out consecutive BGP UPDATE messages by at least MRAI
> ***************
> *** 1258,1270 ****
>      In a Clos topology each EBGP speaker typically has either one path
>      (Tier-2 devices don't accept paths from other Tier-2 in the same
>      cluster due to same ASN) or N paths for the same prefix, where N is a
> !    significantly large number, e.g.  N=32 (the ECMP fan-out to the next
>      Tier).  Therefore, if a link fails to another device from which a
> !    path is received there is either no backup path at all (e.g. from
>      perspective of a Tier-2 switch losing link to a Tier-3 device), or
> !    the backup is readily available in BGP Loc-RIB (e.g. from perspective
>      of a Tier-2 device losing link to a Tier-1 switch).  In the former
> !    case, the BGP withdrawal announcement will propagate un-delayed and
>      trigger re-convergence on affected devices.  In the latter case, the
>      best-path will be re-evaluated and the local ECMP group corresponding
>      to the new next-hop set changed.  If the BGP path was the best-path
> --- 1258,1270 ----
>      In a Clos topology each EBGP speaker typically has either one path
>      (Tier-2 devices don't accept paths from other Tier-2 in the same
>      cluster due to same ASN) or N paths for the same prefix, where N is a
> !    significantly large number, e.g.,  N=32 (the ECMP fan-out to the next
>      Tier).  Therefore, if a link fails to another device from which a
> !    path is received there is either no backup path at all (e.g., from the
>      perspective of a Tier-2 switch losing link to a Tier-3 device), or
> !    the backup is readily available in BGP Loc-RIB (e.g., from perspective
>      of a Tier-2 device losing link to a Tier-1 switch).  In the former
> !    case, the BGP withdrawal announcement will propagate without delay and
>      trigger re-convergence on affected devices.  In the latter case, the
>      best-path will be re-evaluated and the local ECMP group corresponding
>      to the new next-hop set changed.  If the BGP path was the best-path
> ***************
> *** 1279,1285 ****
>      situation when a link between Tier-3 and Tier-2 device fails, the
>      Tier-2 device will send BGP UPDATE messages to all upstream Tier-1
>      devices, withdrawing the affected prefixes.  The Tier-1 devices, in
> !    turn, will relay those messages to all downstream Tier-2 devices
>      (except for the originator).  Tier-2 devices other than the one
>      originating the UPDATE should then wait for ALL upstream Tier-1
>
> --- 1279,1285 ----
>      situation when a link between Tier-3 and Tier-2 device fails, the
>      Tier-2 device will send BGP UPDATE messages to all upstream Tier-1
>      devices, withdrawing the affected prefixes.  The Tier-1 devices, in
> !    turn, will relay these messages to all downstream Tier-2 devices
>      (except for the originator).  Tier-2 devices other than the one
>      originating the UPDATE should then wait for ALL upstream Tier-1
>
> ***************
> *** 1307,1313 ****
>      features that vendors include to reduce the control plane impact of
>      rapidly flapping prefixes.  However, due to issues described with
>      false positives in these implementations especially under such
> !    "dispersion" events, it is not recommended to turn this feature on in
>      this design.  More background and issues with "route flap dampening"
>      and possible implementation changes that could affect this are well
>      described in [RFC7196].
> --- 1307,1313 ----
>      features that vendors include to reduce the control plane impact of
>      rapidly flapping prefixes.  However, due to issues described with
>      false positives in these implementations especially under such
> !    "dispersion" events, it is not recommended to enable this feature in
>      this design.  More background and issues with "route flap dampening"
>      and possible implementation changes that could affect this are well
>      described in [RFC7196].
> ***************
> *** 1316,1324 ****
>
>      A network is declared to converge in response to a failure once all
>      devices within the failure impact scope are notified of the event and
> !    have re-calculated their RIB's and consequently updated their FIB's.
>      Larger failure impact scope typically means slower convergence since
> !    more devices have to be notified, and additionally results in a less
>      stable network.  In this section we describe BGP's advantages over
>      link-state routing protocols in reducing failure impact scope for a
>      Clos topology.
> --- 1316,1324 ----
>
>      A network is declared to converge in response to a failure once all
>      devices within the failure impact scope are notified of the event and
> !    have re-calculated their RIBs and consequently updated their FIBs.
>      Larger failure impact scope typically means slower convergence since
> !    more devices have to be notified, and results in a less
>      stable network.  In this section we describe BGP's advantages over
>      link-state routing protocols in reducing failure impact scope for a
>      Clos topology.
> ***************
> *** 1327,1335 ****
>      the best path from the point of view of the local router is sent to
>      neighbors.  As such, some failures are masked if the local node can
>      immediately find a backup path and does not have to send any updates
> !    further.  Notice that in the worst case ALL devices in a data center
>      topology have to either withdraw a prefix completely or update the
> !    ECMP groups in the FIB.  However, many failures will not result in
>      such a wide impact.  There are two main failure types where impact
>      scope is reduced:
>
> --- 1327,1335 ----
>      the best path from the point of view of the local router is sent to
>      neighbors.  As such, some failures are masked if the local node can
>      immediately find a backup path and does not have to send any updates
> !    further.  Notice that in the worst case, all devices in a data center
>      topology have to either withdraw a prefix completely or update the
> !    ECMP groups in their FIBs.  However, many failures will not result in
>      such a wide impact.  There are two main failure types where impact
>      scope is reduced:
>
> ***************
> *** 1357,1367 ****
>
>      o  Failure of a Tier-1 device: In this case, all Tier-2 devices
>         directly attached to the failed node will have to update their
> !       ECMP groups for all IP prefixes from non-local cluster.  The
>         Tier-3 devices are once again not involved in the re-convergence
>         process, but may receive "implicit withdraws" as described above.
>
> !    Even though in case of such failures multiple IP prefixes will have
>      to be reprogrammed in the FIB, it is worth noting that ALL of these
>      prefixes share a single ECMP group on Tier-2 device.  Therefore, in
>      the case of implementations with a hierarchical FIB, only a single
> --- 1357,1367 ----
>
>      o  Failure of a Tier-1 device: In this case, all Tier-2 devices
>         directly attached to the failed node will have to update their
> !       ECMP groups for all IP prefixes from a non-local cluster.  The
>         Tier-3 devices are once again not involved in the re-convergence
>         process, but may receive "implicit withdraws" as described above.
>
> !    Even though in the case of such failures multiple IP prefixes will have
>      to be reprogrammed in the FIB, it is worth noting that ALL of these
>      prefixes share a single ECMP group on Tier-2 device.  Therefore, in
>      the case of implementations with a hierarchical FIB, only a single
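
As a side note, the hierarchical-FIB behaviour this hunk describes is easy to
picture with a toy data structure: many prefixes reference one shared ECMP
group object, so a Tier-1 failure is handled by rewriting that single group.
A rough Python sketch (all names and prefixes are made up, not from the draft):

    # Toy hierarchical FIB on a Tier-2 device: prefixes reference a shared
    # ECMP group object, so a Tier-1 failure is handled by rewriting one
    # group rather than every prefix entry.  All names are illustrative.

    ecmp_groups = {"UPSTREAM": {"tier1-a", "tier1-b", "tier1-c", "tier1-d"}}

    fib = {                                  # thousands of prefixes in practice
        "10.1.0.0/24": "UPSTREAM",
        "10.2.0.0/24": "UPSTREAM",
        "10.3.0.0/24": "UPSTREAM",
    }

    def handle_tier1_failure(failed_next_hop):
        """Drop a failed Tier-1 next-hop from every shared ECMP group."""
        for members in ecmp_groups.values():
            members.discard(failed_next_hop)

    handle_tier1_failure("tier1-b")
    # Every prefix in 'fib' now load-balances over the three remaining
    # next-hops, although only the single group object was reprogrammed.
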
> ***************
> *** 1375,1381 ****
>      possible with the proposed design, since using this technique may
>      create routing black-holes as mentioned previously.  Therefore, the
>      worst control plane failure impact scope is the network as a whole,
> !    for instance in a case of a link failure between Tier-2 and Tier-3
>      devices.  The amount of impacted prefixes in this case would be much
>      less than in the case of a failure in the upper layers of a Clos
>      network topology.  The property of having such large failure scope is
> --- 1375,1381 ----
>      possible with the proposed design, since using this technique may
>      create routing black-holes as mentioned previously.  Therefore, the
>      worst control plane failure impact scope is the network as a whole,
> !    for instance in the case of a link failure between Tier-2 and Tier-3
>      devices.  The amount of impacted prefixes in this case would be much
>      less than in the case of a failure in the upper layers of a Clos
>      network topology.  The property of having such large failure scope is
> ***************
> *** 1384,1397 ****
>
>   7.5.  Routing Micro-Loops
>
> !    When a downstream device, e.g.  Tier-2 device, loses all paths for a
>      prefix, it normally has the default route pointing toward the
>      upstream device, in this case the Tier-1 device.  As a result, it is
> !    possible to get in the situation when Tier-2 switch loses a prefix,
> !    but Tier-1 switch still has the path pointing to the Tier-2 device,
> !    which results in transient micro-loop, since Tier-1 switch will keep
>      passing packets to the affected prefix back to Tier-2 device, and
> !    Tier-2 will bounce it back again using the default route.  This
>      micro-loop will last for the duration of time it takes the upstream
>      device to fully update its forwarding tables.
>
> --- 1384,1397 ----
>
>   7.5.  Routing Micro-Loops
>
> !    When a downstream device, e.g.,  Tier-2 device, loses all paths for a
>      prefix, it normally has the default route pointing toward the
>      upstream device, in this case the Tier-1 device.  As a result, it is
> !    possible to get into a situation where a Tier-2 switch loses a prefix,
> !    but a Tier-1 switch still has the path pointing to the Tier-2 device,
> !    which results in a transient micro-loop, since the Tier-1 switch will keep
>      passing packets to the affected prefix back to Tier-2 device, and
> !    the Tier-2 will bounce it back again using the default route.  This
>      micro-loop will last for the duration of time it takes the upstream
>      device to fully update its forwarding tables.
>
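
To make the micro-loop concrete, here is a toy Python model of the two FIBs
involved, plus the more-specific discard route that the following hunk
recommends as mitigation; prefixes and device names are hypothetical:

    import ipaddress

    def lookup(fib, dst):
        """Toy longest-prefix-match over a {prefix: next_hop} table."""
        dst = ipaddress.ip_address(dst)
        matches = [p for p in fib if dst in ipaddress.ip_network(p)]
        return fib[max(matches, key=lambda p: ipaddress.ip_network(p).prefixlen)]

    # Tier-2 has lost the specific prefix and retains only the default route;
    # Tier-1 has not yet reprogrammed its FIB and still points at Tier-2.
    tier2_fib = {"0.0.0.0/0": "tier1"}
    tier1_fib = {"10.1.1.0/24": "tier2", "0.0.0.0/0": "wan"}

    hops, fib = [], tier2_fib
    for _ in range(4):                      # packet bounces until its TTL expires
        nxt = lookup(fib, "10.1.1.10")
        hops.append(nxt)
        fib = tier1_fib if nxt == "tier1" else tier2_fib
    print(hops)                             # ['tier1', 'tier2', 'tier1', 'tier2']

    # A static discard route more specific than the default breaks the loop:
    tier2_fib["10.1.0.0/16"] = "discard"
    print(lookup(tier2_fib, "10.1.1.10"))   # 'discard' -> packet dropped instead
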
> ***************
> *** 1402,1408 ****
>   Internet-Draft    draft-ietf-rtgwg-bgp-routing-large-dc       March 2016
>
>
> !    To minimize impact of the micro-loops, Tier-2 and Tier-1 switches can
>      be configured with static "discard" or "null" routes that will be
>      more specific than the default route for prefixes missing during
>      network convergence.  For Tier-2 switches, the discard route should
> --- 1402,1408 ----
>   Internet-Draft    draft-ietf-rtgwg-bgp-routing-large-dc       March 2016
>
>
> !    To minimize the impact of such micro-loops, Tier-2 and Tier-1 switches can
>      be configured with static "discard" or "null" routes that will be
>      more specific than the default route for prefixes missing during
>      network convergence.  For Tier-2 switches, the discard route should
> ***************
> *** 1417,1423 ****
>
>   8.1.  Third-party Route Injection
>
> !    BGP allows for a "third-party", i.e. directly attached, BGP speaker
>      to inject routes anywhere in the network topology, meeting REQ5.
>      This can be achieved by peering via a multihop BGP session with some
>      or even all devices in the topology.  Furthermore, BGP diverse path
> --- 1417,1423 ----
>
>   8.1.  Third-party Route Injection
>
> !    BGP allows for a "third-party", i.e., directly attached, BGP speaker
>      to inject routes anywhere in the network topology, meeting REQ5.
>      This can be achieved by peering via a multihop BGP session with some
>      or even all devices in the topology.  Furthermore, BGP diverse path
> ***************
> *** 1427,1433 ****
>      implementation.  Unfortunately, in many implementations ADD-PATH has
>      been found to only support IBGP properly due to the use cases it was
>      originally optimized for, which limits the "third-party" peering to
> !    IBGP only, if the feature is used.
>
>      To implement route injection in the proposed design, a third-party
>      BGP speaker may peer with Tier-3 and Tier-1 switches, injecting the
> --- 1427,1433 ----
>      implementation.  Unfortunately, in many implementations ADD-PATH has
>      been found to only support IBGP properly due to the use cases it was
>      originally optimized for, which limits the "third-party" peering to
> !    IBGP only.
>
>      To implement route injection in the proposed design, a third-party
>      BGP speaker may peer with Tier-3 and Tier-1 switches, injecting the
> ***************
> *** 1442,1453 ****
>      As mentioned previously, route summarization is not possible within
>      the proposed Clos topology since it makes the network susceptible to
>      route black-holing under single link failures.  The main problem is
> !    the limited number of redundant paths between network elements, e.g.
>      there is only a single path between any pair of Tier-1 and Tier-3
>      devices.  However, some operators may find route aggregation
>      desirable to improve control plane stability.
>
> !    If planning on using any technique to summarize within the topology
>      modeling of the routing behavior and potential for black-holing
>      should be done not only for single or multiple link failures, but
>
> --- 1442,1453 ----
>      As mentioned previously, route summarization is not possible within
>      the proposed Clos topology since it makes the network susceptible to
>      route black-holing under single link failures.  The main problem is
> !    the limited number of redundant paths between network elements, e.g.,
>      there is only a single path between any pair of Tier-1 and Tier-3
>      devices.  However, some operators may find route aggregation
>      desirable to improve control plane stability.
>
> !    If any technique to summarize within the topology is planned,
>      modeling of the routing behavior and potential for black-holing
>      should be done not only for single or multiple link failures, but
>
> ***************
> *** 1458,1468 ****
>   Internet-Draft    draft-ietf-rtgwg-bgp-routing-large-dc       March 2016
>
>
> !    also fiber pathway failures or optical domain failures if the
>      topology extends beyond a physical location.  Simple modeling can be
>      done by checking the reachability on devices doing summarization
>      under the condition of a link or pathway failure between a set of
> !    devices in every tier as well as to the WAN routers if external
>      connectivity is present.
>
>      Route summarization would be possible with a small modification to
> --- 1458,1468 ----
>   Internet-Draft    draft-ietf-rtgwg-bgp-routing-large-dc       March 2016
>
>
> !    also fiber pathway failures or optical domain failures when the
>      topology extends beyond a physical location.  Simple modeling can be
>      done by checking the reachability on devices doing summarization
>      under the condition of a link or pathway failure between a set of
> !    devices in every tier as well as to the WAN routers when external
>      connectivity is present.
>
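
For what it's worth, the modeling suggested here can be prototyped in a few
lines: exhaustively fail single links and link pairs in an adjacency graph and
check that every summarizing device still reaches every leaf.  A pure-Python
sketch over a made-up two-spine/two-leaf topology:

    from collections import deque
    from itertools import combinations

    # Toy adjacency map; device names and topology are illustrative only.
    links = {("spine1", "leaf1"), ("spine1", "leaf2"),
             ("spine2", "leaf1"), ("spine2", "leaf2")}

    def reachable(src, dst, live_links):
        """BFS reachability over an undirected set of links."""
        adj = {}
        for a, b in live_links:
            adj.setdefault(a, set()).add(b)
            adj.setdefault(b, set()).add(a)
        seen, queue = {src}, deque([src])
        while queue:
            node = queue.popleft()
            if node == dst:
                return True
            for nbr in adj.get(node, ()):
                if nbr not in seen:
                    seen.add(nbr)
                    queue.append(nbr)
        return False

    # Fail every single link and every pair of links, then verify that each
    # summarizing (spine) device can still reach each leaf whose prefixes it
    # would keep advertising in a summary.
    for n_failures in (1, 2):
        for failed in combinations(links, n_failures):
            live = links - set(failed)
            for spine in ("spine1", "spine2"):
                for leaf in ("leaf1", "leaf2"):
                    if not reachable(spine, leaf, live):
                        print(f"black-hole risk: {spine}->{leaf} with {failed} down")
    # In this toy topology nothing is printed for single failures; only some
    # double failures isolate a spine, i.e. summarization here is safe only
    # against single link failures.
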
>      Route summarization would be possible with a small modification to
> ***************
> *** 1519,1544 ****
>      cluster from Tier-2 devices since each of them has only a single path
>      down to this prefix.  It would require dual-homed servers to
>      accomplish that.  Also note that this design is only resilient to
> !    single link failure.  It is possible for a double link failure to
>      isolate a Tier-2 device from all paths toward a specific Tier-3
>      device, thus causing a routing black-hole.
>
> !    A result of the proposed topology modification would be reduction of
>      Tier-1 devices port capacity.  This limits the maximum number of
>      attached Tier-2 devices and therefore will limit the maximum DC
>      network size.  A larger network would require different Tier-1
>      devices that have higher port density to implement this change.
>
>      Another problem is traffic re-balancing under link failures.  Since
> !    three are two paths from Tier-1 to Tier-3, a failure of the link
>      between Tier-1 and Tier-2 switch would result in all traffic that was
>      taking the failed link to switch to the remaining path.  This will
> !    result in doubling of link utilization on the remaining link.
>
>   8.2.2.  Simple Virtual Aggregation
>
>      A completely different approach to route summarization is possible,
> !    provided that the main goal is to reduce the FIB pressure, while
>      allowing the control plane to disseminate full routing information.
>      Firstly, it could be easily noted that in many cases multiple
>      prefixes, some of which are less specific, share the same set of the
> --- 1519,1544 ----
>      cluster from Tier-2 devices since each of them has only a single path
>      down to this prefix.  It would require dual-homed servers to
>      accomplish that.  Also note that this design is only resilient to
> !    single link failures.  It is possible for a double link failure to
>      isolate a Tier-2 device from all paths toward a specific Tier-3
>      device, thus causing a routing black-hole.
>
> !    A result of the proposed topology modification would be a reduction of
>      Tier-1 devices port capacity.  This limits the maximum number of
>      attached Tier-2 devices and therefore will limit the maximum DC
>      network size.  A larger network would require different Tier-1
>      devices that have higher port density to implement this change.
>
>      Another problem is traffic re-balancing under link failures.  Since
> !    there are two paths from Tier-1 to Tier-3, a failure of the link
>      between Tier-1 and Tier-2 switch would result in all traffic that was
>      taking the failed link to switch to the remaining path.  This will
> !    result in doubling the link utilization of the remaining link.
>
>   8.2.2.  Simple Virtual Aggregation
>
>      A completely different approach to route summarization is possible,
> !    provided that the main goal is to reduce the FIB size, while
>      allowing the control plane to disseminate full routing information.
>      Firstly, it could be easily noted that in many cases multiple
>      prefixes, some of which are less specific, share the same set of the
> ***************
> *** 1550,1563 ****
>      [RFC6769] and only install the least specific route in the FIB,
>      ignoring more specific routes if they share the same next-hop set.
>      For example, under normal network conditions, only the default route
> !    need to be programmed into FIB.
>
>      Furthermore, if the Tier-2 devices are configured with summary
> !    prefixes covering all of their attached Tier-3 device's prefixes the
>      same logic could be applied in Tier-1 devices as well, and, by
>      induction to Tier-2/Tier-3 switches in different clusters.  These
>      summary routes should still allow for more specific prefixes to leak
> !    to Tier-1 devices, to enable for detection of mismatches in the next-
>      hop sets if a particular link fails, changing the next-hop set for a
>      specific prefix.
>
> --- 1550,1563 ----
>      [RFC6769] and only install the least specific route in the FIB,
>      ignoring more specific routes if they share the same next-hop set.
>      For example, under normal network conditions, only the default route
> !    needs to be programmed into the FIB.
>
>      Furthermore, if the Tier-2 devices are configured with summary
> !    prefixes covering all of their attached Tier-3 devices' prefixes, the
>      same logic could be applied in Tier-1 devices as well, and, by
>      induction to Tier-2/Tier-3 switches in different clusters.  These
>      summary routes should still allow for more specific prefixes to leak
> !    to Tier-1 devices, to enable detection of mismatches in the next-
>      hop sets if a particular link fails, changing the next-hop set for a
>      specific prefix.
>
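
A small Python sketch of the FIB-suppression idea in this section (in the
spirit of the Virtual Aggregation behaviour referenced via [RFC6769]): a
more-specific prefix is left out of the FIB only when a covering less-specific
prefix has exactly the same next-hop set.  Prefixes and next-hops are invented
for illustration:

    import ipaddress

    # Loc-RIB view on a Tier-2 device: prefix -> frozen set of next-hops.
    rib = {
        "0.0.0.0/0":   frozenset({"tier1-a", "tier1-b"}),
        "10.1.0.0/16": frozenset({"tier1-a", "tier1-b"}),   # same as default
        "10.1.1.0/24": frozenset({"tier1-a", "tier1-b"}),   # same as default
        "10.2.2.0/24": frozenset({"tier1-a"}),              # differs: a link failed
    }

    def fib_entries(rib):
        """Suppress a prefix when a covering, less-specific prefix has the
        identical next-hop set; keep it otherwise."""
        nets = {p: ipaddress.ip_network(p) for p in rib}
        keep = {}
        for p, nhops in rib.items():
            covered = any(
                q != p
                and nets[q].prefixlen < nets[p].prefixlen
                and nets[p].subnet_of(nets[q])
                and rib[q] == nhops
                for q in rib
            )
            if not covered:
                keep[p] = nhops
        return keep

    print(fib_entries(rib))
    # Only 0.0.0.0/0 and 10.2.2.0/24 need FIB slots; the other prefixes share
    # the default route's next-hop set and can stay out of the FIB.
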
> ***************
> *** 1571,1584 ****
>
>
>      Re-stating once again, this technique does not reduce the amount of
> !    control plane state (i.e.  BGP UPDATEs/BGP LocRIB sizing), but only
> !    allows for more efficient FIB utilization, by spotting more specific
> !    prefixes that share their next-hops with less specifics.
>
>   8.3.  ICMP Unreachable Message Masquerading
>
>      This section discusses some operational aspects of not advertising
> !    point-to-point link subnets into BGP, as previously outlined as an
>      option in Section 5.2.3.  The operational impact of this decision
>      could be seen when using the well-known "traceroute" tool.
>      Specifically, IP addresses displayed by the tool will be the link's
> --- 1571,1585 ----
>
>
>      Re-stating once again, this technique does not reduce the amount of
> !    control plane state (i.e.,  BGP UPDATEs/BGP Loc-RIB size), but only
> !    allows for more efficient FIB utilization, by detecting more specific
> !    prefixes that share their next-hop set with a subsuming less specific
> !    prefix.
>
>   8.3.  ICMP Unreachable Message Masquerading
>
>      This section discusses some operational aspects of not advertising
> !    point-to-point link subnets into BGP, as previously identified as an
>      option in Section 5.2.3.  The operational impact of this decision
>      could be seen when using the well-known "traceroute" tool.
>      Specifically, IP addresses displayed by the tool will be the link's
> ***************
> *** 1587,1605 ****
>      complicated.
>
>      One way to overcome this limitation is by using the DNS subsystem to
> !    create the "reverse" entries for the IP addresses of the same device
> !    pointing to the same name.  The connectivity then can be made by
> !    resolving this name to the "primary" IP address of the devices, e.g.
>      its Loopback interface, which is always advertised into BGP.
>      However, this creates a dependency on the DNS subsystem, which may be
>      unavailable during an outage.
>
>      Another option is to make the network device perform IP address
>      masquerading, that is rewriting the source IP addresses of the
> !    appropriate ICMP messages sent off of the device with the "primary"
>      IP address of the device.  Specifically, the ICMP Destination
>      Unreachable Message (type 3) codes 3 (port unreachable) and ICMP Time
> !    Exceeded (type 11) code 0, which are involved in proper working of
>      the "traceroute" tool.  With this modification, the "traceroute"
>      probes sent to the devices will always be sent back with the
>      "primary" IP address as the source, allowing the operator to discover
> --- 1588,1606 ----
>      complicated.
>
>      One way to overcome this limitation is by using the DNS subsystem to
> !    create the "reverse" entries for these point-to-point IP addresses
> pointing
> !    to a the same name as the loopback address.  The connectivity then
> can be made by
> !    resolving this name to the "primary" IP address of the devices, e.g.,
>      its Loopback interface, which is always advertised into BGP.
>      However, this creates a dependency on the DNS subsystem, which may be
>      unavailable during an outage.
>
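
The DNS workaround above amounts to a PTR lookup followed by a forward lookup;
a short Python sketch, assuming the operator has populated the reverse zone as
described (the function name and addresses are hypothetical):

    import socket

    def primary_address_for_hop(hop_ip):
        """Resolve a point-to-point link address seen in traceroute to the
        device's 'primary' (loopback) address, assuming all of the device's
        addresses have PTR records pointing at the same name."""
        device_name, _aliases, _addrs = socket.gethostbyaddr(hop_ip)  # PTR lookup
        return socket.gethostbyname(device_name)                      # forward lookup

    # Example (hypothetical link address on a Tier-2 switch):
    # primary_address_for_hop("192.0.2.33") -> "198.51.100.7"
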
>      Another option is to make the network device perform IP address
>      masquerading, that is rewriting the source IP addresses of the
> !    appropriate ICMP messages sent by the device with the "primary"
>      IP address of the device.  Specifically, the ICMP Destination
>      Unreachable Message (type 3) codes 3 (port unreachable) and ICMP Time
> !    Exceeded (type 11) code 0, which are required for correct operation of
>      the "traceroute" tool.  With this modification, the "traceroute"
>      probes sent to the devices will always be sent back with the
>      "primary" IP address as the source, allowing the operator to discover
>
> Thanks,
> Acee
>
>