Re: Routing Directorate Review for "Use of BGP for routing in large-scale data centers" (adding RTG WG)
Alia Atlas <akatlas@gmail.com> Mon, 25 April 2016 19:02 UTC
To: "Acee Lindem (acee)" <acee@cisco.com>
Archived-At: <http://mailarchive.ietf.org/arch/msg/rtgwg/uS3cloRquF3rAi3g17bI5GmSdBk>
Cc: "draft-ietf-rtgwg-bgp-routing-large-dc@ietf.org" <draft-ietf-rtgwg-bgp-routing-large-dc@ietf.org>, Routing WG <rtgwg@ietf.org>, Routing Directorate <rtg-dir@ietf.org>, Routing ADs <rtg-ads@tools.ietf.org>

Hi Acee,

Thank you very much for your review.

Authors, could you please respond soon? I am hoping to get this out to
IETF Last Call by Thursday - and on the telechat for May 19. That depends
on timely updates from the authors and shepherd.

Thanks,
Alia

On Mon, Apr 25, 2016 at 1:16 PM, Acee Lindem (acee) <acee@cisco.com> wrote:
> Hello,
>
> I have been selected as the Routing Directorate reviewer for this draft.
> The Routing Directorate seeks to review all routing or routing-related
> drafts as they pass through IETF last call and IESG review, and sometimes
> on special request. The purpose of the review is to provide assistance to
> the Routing ADs. For more information about the Routing Directorate,
> please see http://trac.tools.ietf.org/area/rtg/trac/wiki/RtgDir
>
> Although these comments are primarily for the use of the Routing ADs, it
> would be helpful if you could consider them along with any other IETF Last
> Call comments that you receive, and strive to resolve them through
> discussion or by updating the draft.
>
> Document: draft-ietf-rtgwg-bgp-routing-large-dc-09.txt
> Reviewer: Acee Lindem
> Review Date: 4/25/16
> IETF LC End Date: Not started
> Intended Status: Informational
>
> Summary:
> This document is basically ready for publication, but has some minor
> issues and nits that should be resolved prior to publication.
>
> Comments:
> The document starts with the requirements for MSDC routing and then
> provides an overview of Clos data center topologies and data center
> network design. This overview attempts to cover a lot of material in a
> very small amount of text. While not completely successful, the overview
> provides a lot of good information and references. The bulk of the
> document covers the usage of EBGP as the sole data center routing protocol
> and other aspects of the routing design including ECMP, summarization
> issues, and convergence. These sections provide a very good guide for
> using EBGP in a Clos data center and an excellent discussion of the
> deployment issues (based on real deployment experience).
>
> The technical content of the document is excellent. The readability
> could be improved by breaking up some of the run-on sentences and with the
> suggested editorial changes (see Nits below).
>
> Major Issues:
>
> I have no major issues with the document.
>
> Minor Issues:
>
> Section 4.2: Can an informative reference be added for Direct Server
> Return (DSR)?
> Section 5.2.4 and 7.4: Define precisely what is meant by "scale-out"
> topology somewhere in the document.
> Section 5.2.5: Can you add a backward reference to the discussion of
> "lack of peer links inside every tier"? Also, it would be good to describe
> how this would allow for summarization and under what failure conditions.
> Section 7.4: Should you add a reference to
> https://www.ietf.org/id/draft-ietf-rtgwg-bgp-pic-00.txt to the penultimate
> paragraph in this section?
>
> Nits:
>
> ***************
> *** 143,149 ****
> network stability so that a small group of people can effectively
> support a significantly sized network.
>
> ! Experimentation and extensive testing has shown that External BGP
> (EBGP) [RFC4271] is well suited as a stand-alone routing protocol for
> these type of data center applications. This is in contrast with
> more traditional DC designs, which may se simple tree topologies and
> --- 143,149 ----
> network stability so that a small group of people can effectively
> support a significantly sized network.
>
> ! Experimentation and extensive testing have shown that External BGP
> (EBGP) [RFC4271] is well suited as a stand-alone routing protocol for
> these type of data center applications. This is in contrast with
> more traditional DC designs, which may use simple tree topologies and
> ***************
> *** 178,191 ****
> 2.1. Bandwidth and Traffic Patterns
>
> The primary requirement when building an interconnection network for
> ! large number of servers is to accommodate application bandwidth and
> latency requirements. Until recently it was quite common to see the
> majority of traffic entering and leaving the data center, commonly
> referred to as "north-south" traffic. Traditional "tree" topologies
> were sufficient to accommodate such flows, even with high
> oversubscription ratios between the layers of the network. If more
> bandwidth was required, it was added by "scaling up" the network
> ! elements, e.g. by upgrading the device's linecards or fabrics or
> replacing the device with one with higher port density.
>
> Today many large-scale data centers host applications generating
> --- 178,191 ----
> 2.1. Bandwidth and Traffic Patterns
>
> The primary requirement when building an interconnection network for
> ! a large number of servers is to accommodate application bandwidth and
> latency requirements. Until recently it was quite common to see the
> majority of traffic entering and leaving the data center, commonly
> referred to as "north-south" traffic. Traditional "tree" topologies
> were sufficient to accommodate such flows, even with high
> oversubscription ratios between the layers of the network. If more
> bandwidth was required, it was added by "scaling up" the network
> ! elements, e.g., by upgrading the device's linecards or fabrics or
> replacing the device with one with higher port density.
>
> Today many large-scale data centers host applications generating
> ***************
> *** 195,201 ****
> [HADOOP], massive data replication between clusters needed by certain
> applications, or virtual machine migrations. Scaling traditional
> tree topologies to match these bandwidth demands becomes either too
> ! expensive or impossible due to physical limitations, e.g. port
> density in a switch.
>
> 2.2. CAPEX Minimization
> --- 195,201 ----
> [HADOOP], massive data replication between clusters needed by certain
> applications, or virtual machine migrations. Scaling traditional
> tree topologies to match these bandwidth demands becomes either too
> ! expensive or impossible due to physical limitations, e.g., port
> density in a switch.
>
> 2.2. CAPEX Minimization
> ***************
> *** 209,215 ****
>
> o Unifying all network elements, preferably using the same hardware
> type or even the same device. This allows for volume pricing on
> ! bulk purchases and reduced maintenance and sparing costs.
>
> o Driving costs down using competitive pressures, by introducing
> multiple network equipment vendors.
> --- 209,215 ----
>
> o Unifying all network elements, preferably using the same hardware
> type or even the same device. This allows for volume pricing on
> ! bulk purchases and reduced maintenance and inventory costs.
>
> o Driving costs down using competitive pressures, by introducing
> multiple network equipment vendors.
> ***************
> *** 234,244 ****
> minimizes software issue-related failures.
>
> An important aspect of Operational Expenditure (OPEX) minimization is
> ! reducing size of failure domains in the network. Ethernet networks
> are known to be susceptible to broadcast or unicast traffic storms
> that can have a dramatic impact on network performance and
> availability. The use of a fully routed design significantly reduces
> ! the size of the data plane failure domains - i.e. limits them to the
> lowest level in the network hierarchy. However, such designs
> introduce the problem of distributed control plane failures. This
> observation calls for simpler and less control plane protocols to
> --- 234,244 ----
> minimizes software issue-related failures.
>
> An important aspect of Operational Expenditure (OPEX) minimization is
> ! reducing the size of failure domains in the network. Ethernet
> networks
> are known to be susceptible to broadcast or unicast traffic storms
> that can have a dramatic impact on network performance and
> availability. The use of a fully routed design significantly reduces
> ! the size of the data plane failure domains, i.e., limits them to the
> lowest level in the network hierarchy. However, such designs
> introduce the problem of distributed control plane failures. This
> observation calls for simpler and less control plane protocols to
> ***************
> *** 253,259 ****
> performed by network devices. Traditionally, load balancers are
> deployed as dedicated devices in the traffic forwarding path. The
> problem arises in scaling load balancers under growing traffic
> ! demand. A preferable solution would be able to scale load balancing
> layer horizontally, by adding more of the uniform nodes and
> distributing incoming traffic across these nodes. In situations like
> this, an ideal choice would be to use network infrastructure itself
> --- 253,259 ----
> performed by network devices. Traditionally, load balancers are
> deployed as dedicated devices in the traffic forwarding path. The
> problem arises in scaling load balancers under growing traffic
> ! demand. A preferable solution would be able to scale the load
> balancing
> layer horizontally, by adding more of the uniform nodes and
> distributing incoming traffic across these nodes. In situations like
> this, an ideal choice would be to use network infrastructure itself
> ***************
> *** 305,311 ****
> 3.1. Traditional DC Topology
>
> In the networking industry, a common design choice for data centers
> ! typically look like a (upside down) tree with redundant uplinks and
> three layers of hierarchy namely; core, aggregation/distribution and
> access layers (see Figure 1). To accommodate bandwidth demands, each
> higher layer, from server towards DC egress or WAN, has higher port
> --- 305,311 ----
> 3.1. Traditional DC Topology
>
> In the networking industry, a common design choice for data centers
> ! typically look like an (upside down) tree with redundant uplinks and
> three layers of hierarchy namely; core, aggregation/distribution and
> access layers (see Figure 1). To accommodate bandwidth demands, each
> higher layer, from server towards DC egress or WAN, has higher port
> ***************
> *** 373,379 ****
> topology, sometimes called "fat-tree" (see, for example, [INTERCON]
> and [ALFARES2008]). This topology features an odd number of stages
> (sometimes known as dimensions) and is commonly made of uniform
> ! elements, e.g. network switches with the same port count. Therefore,
> the choice of folded Clos topology satisfies REQ1 and facilitates
> REQ2. See Figure 2 below for an example of a folded 3-stage Clos
> topology (3 stages counting Tier-2 stage twice, when tracing a packet
> --- 373,379 ----
> topology, sometimes called "fat-tree" (see, for example, [INTERCON]
> and [ALFARES2008]). This topology features an odd number of stages
> (sometimes known as dimensions) and is commonly made of uniform
> ! elements, e.g., network switches with the same port count. Therefore,
> the choice of folded Clos topology satisfies REQ1 and facilitates
> REQ2. See Figure 2 below for an example of a folded 3-stage Clos
> topology (3 stages counting Tier-2 stage twice, when tracing a packet
> ***************
> *** 460,466 ****
> 3.2.3. Scaling the Clos topology
>
> A Clos topology can be scaled either by increasing network element
> ! port density or adding more stages, e.g. moving to a 5-stage Clos, as
> illustrated in Figure 3 below:
>
> Tier-1
> --- 460,466 ----
> 3.2.3. Scaling the Clos topology
>
> A Clos topology can be scaled either by increasing network element
> ! port density or adding more stages, e.g., moving to a 5-stage Clos, as
> illustrated in Figure 3 below:
>
> Tier-1
> ***************
> *** 523,529 ****
> 3.2.4. Managing the Size of Clos Topology Tiers
>
> If a data center network size is small, it is possible to reduce the
> ! number of switches in Tier-1 or Tier-2 of Clos topology by a factor
> of two. To understand how this could be done, take Tier-1 as an
> example. Every Tier-2 device connects to a single group of Tier-1
> devices. If half of the ports on each of the Tier-1 devices are not
> --- 523,529 ----
> 3.2.4. Managing the Size of Clos Topology Tiers
>
> If a data center network size is small, it is possible to reduce the
> ! number of switches in Tier-1 or Tier-2 of a Clos topology by a factor
> of two. To understand how this could be done, take Tier-1 as an
> example. Every Tier-2 device connects to a single group of Tier-1
> devices. If half of the ports on each of the Tier-1 devices are not
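To make the Clos scaling arithmetic in the two hunks above concrete, here
is a rough Python sketch. The half-down/half-up port split and the k-ary
fat-tree layout for the 5-stage case are assumptions borrowed from
[ALFARES2008]-style designs, not requirements stated in the draft:

    # Illustrative only: server capacity of folded Clos fabrics built
    # from uniform k-port switches, assuming each Tier-3 switch splits
    # its ports evenly between servers and uplinks.

    def three_stage_servers(k: int) -> int:
        # Leaf-spine: k/2 spines, each leaf uses k/2 uplinks (one per
        # spine) and k/2 server ports; a k-port spine attaches k leaves.
        return k * (k // 2)                 # k^2 / 2 servers

    def five_stage_servers(k: int) -> int:
        # k-ary fat-tree layout as in [ALFARES2008]: k^3 / 4 servers.
        return k ** 3 // 4

    for k in (32, 64):
        print(f"{k}-port: 3-stage={three_stage_servers(k)}, "
              f"5-stage={five_stage_servers(k)}")

Adding stages is what lets the fabric grow without swapping devices for
higher port density ones, which is the trade-off the 460,466 hunk
describes.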
> ***************
> *** 574,580 ****
> originally defined in [IEEE8021D-1990] for loop free topology
> creation, typically utilizing variants of the traditional DC topology
> described in Section 3.1. At the time, many DC switches either did
> ! not support Layer 3 routed protocols or supported it with additional
> licensing fees, which played a part in the design choice. Although
> many enhancements have been made through the introduction of Rapid
> Spanning Tree Protocol (RSTP) in the latest revision of
> --- 574,580 ----
> originally defined in [IEEE8021D-1990] for loop free topology
> creation, typically utilizing variants of the traditional DC topology
> described in Section 3.1. At the time, many DC switches either did
> ! not support Layer 3 routing protocols or supported them with
> additional
> licensing fees, which played a part in the design choice. Although
> many enhancements have been made through the introduction of Rapid
> Spanning Tree Protocol (RSTP) in the latest revision of
> ***************
> *** 599,605 ****
> as the backup for loop prevention. The major downsides of this
> approach are the lack of ability to scale linearly past two in most
> implementations, lack of standards based implementations, and added
> ! failure domain risk of keeping state between the devices.
>
> It should be noted that building large, horizontally scalable, Layer
> 2 only networks without STP is possible recently through the
> --- 599,605 ----
> as the backup for loop prevention. The major downsides of this
> approach are the lack of ability to scale linearly past two in most
> implementations, lack of standards based implementations, and added
> ! the failure domain risk of syncing state between the devices.
>
> It should be noted that building large, horizontally scalable, Layer
> 2 only networks without STP is possible recently through the
> ***************
> *** 621,631 ****
> Finally, neither the base TRILL specification nor the M-LAG approach
> totally eliminate the problem of the shared broadcast domain, that is
> so detrimental to the operations of any Layer 2, Ethernet based
> ! solutions. Later TRILL extensions have been proposed to solve the
> this problem statement primarily based on the approaches outlined in
> [RFC7067], but this even further limits the number of available
> ! interoperable implementations that can be used to build a fabric,
> ! therefore TRILL based designs have issues meeting REQ2, REQ3, and
> REQ4.
>
> 4.2. Hybrid L2/L3 Designs
> --- 621,631 ----
> Finally, neither the base TRILL specification nor the M-LAG approach
> totally eliminate the problem of the shared broadcast domain, that is
> so detrimental to the operations of any Layer 2, Ethernet based
> ! solution. Later TRILL extensions have been proposed to solve the
> this problem statement primarily based on the approaches outlined in
> [RFC7067], but this even further limits the number of available
> ! interoperable implementations that can be used to build a fabric.
> ! Therefore, TRILL based designs have issues meeting REQ2, REQ3, and
> REQ4.
>
> 4.2. Hybrid L2/L3 Designs
> ***************
> *** 635,641 ****
> in either the Tier-1 or Tier-2 parts of the network and dividing the
> Layer 2 domain into numerous, smaller domains. This design has
> allowed data centers to scale up, but at the cost of complexity in
> ! the network managing multiple protocols. For the following reasons,
> operators have retained Layer 2 in either the access (Tier-3) or both
> access and aggregation (Tier-3 and Tier-2) parts of the network:
>
> --- 635,641 ----
> in either the Tier-1 or Tier-2 parts of the network and dividing the
> Layer 2 domain into numerous, smaller domains. This design has
> allowed data centers to scale up, but at the cost of complexity in
> ! managing multiple network protocols. For the following reasons,
> operators have retained Layer 2 in either the access (Tier-3) or both
> access and aggregation (Tier-3 and Tier-2) parts of the network:
>
> ***************
> *** 644,650 ****
>
> o Seamless mobility for virtual machines that require the
> preservation of IP addresses when a virtual machine moves to
> ! different Tier-3 switch.
>
> o Simplified IP addressing = less IP subnets are required for the
> data center.
> --- 644,650 ----
>
> o Seamless mobility for virtual machines that require the
> preservation of IP addresses when a virtual machine moves to
> ! a different Tier-3 switch.
>
> o Simplified IP addressing = less IP subnets are required for the
> data center.
> ***************
> *** 679,686 ****
> adoption in networks where large Layer 2 adjacency and larger size
> Layer 3 subnets are not as critical compared to network scalability
> and stability. Application providers and network operators continue
> ! to also develop new solutions to meet some of the requirements that
> ! previously have driven large Layer 2 domains by using various overlay
> or tunneling techniques.
>
> 5. Routing Protocol Selection and Design
> --- 679,686 ----
> adoption in networks where large Layer 2 adjacency and larger size
> Layer 3 subnets are not as critical compared to network scalability
> and stability. Application providers and network operators continue
> ! to develop new solutions to meet some of the requirements that
> ! previously had driven large Layer 2 domains using various overlay
> or tunneling techniques.
>
> 5. Routing Protocol Selection and Design
> ***************
> *** 700,706 ****
> design.
>
> Although EBGP is the protocol used for almost all inter-domain
> ! routing on the Internet and has wide support from both vendor and
> service provider communities, it is not generally deployed as the
> primary routing protocol within the data center for a number of
> reasons (some of which are interrelated):
> --- 700,706 ----
> design.
>
> Although EBGP is the protocol used for almost all inter-domain
> ! routing in the Internet and has wide support from both vendor and
> service provider communities, it is not generally deployed as the
> primary routing protocol within the data center for a number of
> reasons (some of which are interrelated):
> ***************
> *** 741,754 ****
> state IGPs. Since every BGP router calculates and propagates only
> the best-path selected, a network failure is masked as soon as the
> BGP speaker finds an alternate path, which exists when highly
> ! symmetric topologies, such as Clos, are coupled with EBGP only
> design. In contrast, the event propagation scope of a link-state
> IGP is an entire area, regardless of the failure type. In this
> way, BGP better meets REQ3 and REQ4. It is also worth mentioning
> that all widely deployed link-state IGPs feature periodic
> ! refreshes of routing information, even if this rarely causes
> ! impact to modern router control planes, while BGP does not expire
> ! routing state.
>
> o BGP supports third-party (recursively resolved) next-hops. This
> allows for manipulating multipath to be non-ECMP based or
> --- 741,754 ----
> state IGPs. Since every BGP router calculates and propagates only
> the best-path selected, a network failure is masked as soon as the
> BGP speaker finds an alternate path, which exists when highly
> ! symmetric topologies, such as Clos, are coupled with an EBGP only
> design. In contrast, the event propagation scope of a link-state
> IGP is an entire area, regardless of the failure type. In this
> way, BGP better meets REQ3 and REQ4. It is also worth mentioning
> that all widely deployed link-state IGPs feature periodic
> ! refreshes of routing information while BGP does not expire
> ! routing state, although this rarely impacts modern router control
> ! planes.
>
> o BGP supports third-party (recursively resolved) next-hops. This
> allows for manipulating multipath to be non-ECMP based or
> ***************
> *** 765,775 ****
> controlled and complex unwanted paths will be ignored. See
> Section 5.2 for an example of a working ASN allocation scheme. In
> a link-state IGP accomplishing the same goal would require multi-
> ! (instance/topology/processes) support, typically not available in
> all DC devices and quite complex to configure and troubleshoot.
> Using a traditional single flooding domain, which most DC designs
> utilize, under certain failure conditions may pick up unwanted
> ! lengthy paths, e.g. traversing multiple Tier-2 devices.
>
> o EBGP configuration that is implemented with minimal routing policy
> is easier to troubleshoot for network reachability issues. In
> --- 765,775 ----
> controlled and complex unwanted paths will be ignored. See
> Section 5.2 for an example of a working ASN allocation scheme. In
> a link-state IGP accomplishing the same goal would require multi-
> ! (instance/topology/process) support, typically not available in
> all DC devices and quite complex to configure and troubleshoot.
> Using a traditional single flooding domain, which most DC designs
> utilize, under certain failure conditions may pick up unwanted
> ! lengthy paths, e.g., traversing multiple Tier-2 devices.
>
> o EBGP configuration that is implemented with minimal routing policy
> is easier to troubleshoot for network reachability issues. In
> ***************
> *** 806,812 ****
> loopback sessions are used even in the case of multiple links
> between the same pair of nodes.
>
> ! o Private Use ASNs from the range 64512-65534 are used so as to
> avoid ASN conflicts.
>
> o A single ASN is allocated to all of the Clos topology's Tier-1
> --- 806,812 ----
> loopback sessions are used even in the case of multiple links
> between the same pair of nodes.
>
> ! o Private Use ASNs from the range 64512-65534 are used to
> avoid ASN conflicts.
>
> o A single ASN is allocated to all of the Clos topology's Tier-1
> ***************
> *** 815,821 ****
> o A unique ASN is allocated to each set of Tier-2 devices in the
> same cluster.
>
> ! o A unique ASN is allocated to every Tier-3 device (e.g. ToR) in
> this topology.
>
>
> --- 815,821 ----
> o A unique ASN is allocated to each set of Tier-2 devices in the
> same cluster.
>
> ! o A unique ASN is allocated to every Tier-3 device (e.g., ToR) in
> this topology.
>
>
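The ASN scheme in the 806-821 hunks above is easy to mechanize. The toy
allocator below is purely illustrative (the names and sizes are mine, not
the draft's); it draws from the 16-bit Private Use range 64512-65534
[RFC6996] and shows how quickly that pool is consumed, which is what
motivates the Four-Octet ASN discussion in the next hunk:

    # One shared ASN for the whole Tier-1 layer, one ASN per Tier-2
    # cluster, one ASN per Tier-3 (ToR) device.
    PRIVATE_USE_16BIT = range(64512, 65535)     # 1023 usable ASNs

    def allocate_asns(n_clusters: int, tors_per_cluster: int) -> dict:
        pool = iter(PRIVATE_USE_16BIT)          # raises StopIteration
        plan = {"tier1": next(pool)}            # when the pool runs out
        plan["tier2"] = {c: next(pool) for c in range(n_clusters)}
        plan["tier3"] = {(c, t): next(pool)
                         for c in range(n_clusters)
                         for t in range(tors_per_cluster)}
        return plan

    plan = allocate_asns(n_clusters=8, tors_per_cluster=32)
    print(plan["tier1"], plan["tier2"][0], plan["tier3"][(0, 0)])
    # 8 clusters x 32 ToRs already consumes 265 of the 1023 ASNs.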
> ***************
> *** 903,922 ****
>
> Another solution to this problem would be using Four-Octet ASNs
> ([RFC6793]), where there are additional Private Use ASNs available,
> ! see [IANA.AS]. Use of Four-Octet ASNs put additional protocol
> ! complexity in the BGP implementation so should be considered against
> the complexity of re-use when considering REQ3 and REQ4. Perhaps
> more importantly, they are not yet supported by all BGP
> implementations, which may limit vendor selection of DC equipment.
> ! When supported, ensure that implementations in use are able to remove
> ! the Private Use ASNs if required for external connectivity
> ! (Section 5.2.4).
>
> 5.2.3. Prefix Advertisement
>
> A Clos topology features a large number of point-to-point links and
> associated prefixes. Advertising all of these routes into BGP may
> ! create FIB overload conditions in the network devices. Advertising
> these links also puts additional path computation stress on the BGP
> control plane for little benefit. There are two possible solutions:
>
> --- 903,922 ----
>
> Another solution to this problem would be using Four-Octet ASNs
> ([RFC6793]), where there are additional Private Use ASNs available,
> ! see [IANA.AS]. Use of Four-Octet ASNs puts additional protocol
> ! complexity in the BGP implementation and should be balanced against
> the complexity of re-use when considering REQ3 and REQ4. Perhaps
> more importantly, they are not yet supported by all BGP
> implementations, which may limit vendor selection of DC equipment.
> ! When supported, ensure that deployed implementations are able to
> remove
> ! the Private Use ASNs when external connectivity to these ASes is
> ! required (Section 5.2.4).
>
> 5.2.3. Prefix Advertisement
>
> A Clos topology features a large number of point-to-point links and
> associated prefixes. Advertising all of these routes into BGP may
> ! create FIB overload in the network devices. Advertising
> these links also puts additional path computation stress on the BGP
> control plane for little benefit. There are two possible solutions:
>
> ***************
> *** 925,951 ****
> device, distant networks will automatically be reachable via the
> advertising EBGP peer and do not require reachability to these
> prefixes. However, this may complicate operations or monitoring:
> ! e.g. using the popular "traceroute" tool will display IP addresses
> that are not reachable.
>
> o Advertise point-to-point links, but summarize them on every
> device. This requires an address allocation scheme such as
> allocating a consecutive block of IP addresses per Tier-1 and
> Tier-2 device to be used for point-to-point interface addressing
> ! to the lower layers (Tier-2 uplinks will be numbered out of Tier-1
> ! addressing and so forth).
>
> Server subnets on Tier-3 devices must be announced into BGP without
> using route summarization on Tier-2 and Tier-1 devices. Summarizing
> subnets in a Clos topology results in route black-holing under a
> ! single link failure (e.g. between Tier-2 and Tier-3 devices) and
> hence must be avoided. The use of peer links within the same tier to
> resolve the black-holing problem by providing "bypass paths" is
> undesirable due to O(N^2) complexity of the peering mesh and waste of
> ports on the devices. An alternative to the full-mesh of peer-links
> ! would be using a simpler bypass topology, e.g. a "ring" as described
> in [FB4POST], but such a topology adds extra hops and has very
> ! limited bisection bandwidth, in addition requiring special tweaks to
>
>
>
> --- 925,951 ----
> device, distant networks will automatically be reachable via the
> advertising EBGP peer and do not require reachability to these
> prefixes. However, this may complicate operations or monitoring:
> ! e.g., using the popular "traceroute" tool will display IP addresses
> that are not reachable.
>
> o Advertise point-to-point links, but summarize them on every
> device. This requires an address allocation scheme such as
> allocating a consecutive block of IP addresses per Tier-1 and
> Tier-2 device to be used for point-to-point interface addressing
> ! to the lower layers (Tier-2 uplink addresses will be allocated
> ! from Tier-1 address blocks and so forth).
>
> Server subnets on Tier-3 devices must be announced into BGP without
> using route summarization on Tier-2 and Tier-1 devices. Summarizing
> subnets in a Clos topology results in route black-holing under a
> ! single link failure (e.g., between Tier-2 and Tier-3 devices) and
> hence must be avoided. The use of peer links within the same tier to
> resolve the black-holing problem by providing "bypass paths" is
> undesirable due to O(N^2) complexity of the peering mesh and waste of
> ports on the devices. An alternative to the full-mesh of peer-links
> ! would be using a simpler bypass topology, e.g., a "ring" as described
> in [FB4POST], but such a topology adds extra hops and has very
> ! limited bisectional bandwidth. Additionally requiring special tweaks
> to
>
>
>
> ***************
> *** 956,963 ****
>
> make BGP routing work - such as possibly splitting every device into
> an ASN on its own. Later in this document, Section 8.2 introduces a
> ! less intrusive method for performing a limited form route
> ! summarization in Clos networks and discusses it's associated trade-
> offs.
>
> 5.2.4. External Connectivity
> --- 956,963 ----
>
> make BGP routing work - such as possibly splitting every device into
> an ASN on its own. Later in this document, Section 8.2 introduces a
> ! less intrusive method for performing a limited form of route
> ! summarization in Clos networks and discusses its associated trade-
> offs.
>
> 5.2.4. External Connectivity
> ***************
> *** 972,985 ****
> document. These devices have to perform a few special functions:
>
> o Hide network topology information when advertising paths to WAN
> ! routers, i.e. remove Private Use ASNs [RFC6996] from the AS_PATH
> attribute. This is typically done to avoid ASN number collisions
> between different data centers and also to provide a uniform
> AS_PATH length to the WAN for purposes of WAN ECMP to Anycast
> prefixes originated in the topology. An implementation specific
> BGP feature typically called "Remove Private AS" is commonly used
> to accomplish this. Depending on implementation, the feature
> ! should strip a contiguous sequence of Private Use ASNs found in
> AS_PATH attribute prior to advertising the path to a neighbor.
> This assumes that all ASNs used for intra data center numbering
> are from the Private Use ranges. The process for stripping the
> --- 972,985 ----
> document. These devices have to perform a few special functions:
>
> o Hide network topology information when advertising paths to WAN
> ! routers, i.e., remove Private Use ASNs [RFC6996] from the AS_PATH
> attribute. This is typically done to avoid ASN number collisions
> between different data centers and also to provide a uniform
> AS_PATH length to the WAN for purposes of WAN ECMP to Anycast
> prefixes originated in the topology. An implementation specific
> BGP feature typically called "Remove Private AS" is commonly used
> to accomplish this. Depending on implementation, the feature
> ! should strip a contiguous sequence of Private Use ASNs found in an
> AS_PATH attribute prior to advertising the path to a neighbor.
> This assumes that all ASNs used for intra data center numbering
> are from the Private Use ranges. The process for stripping the
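The "Remove Private AS" behavior described in the 972,985 hunk can be
sketched in a few lines. Implementations differ in the details; this
version, as the text suggests, strips only the contiguous run of Private
Use ASNs ([RFC6996] ranges) from the front of the AS_PATH:

    PRIVATE_16 = range(64512, 65535)              # 64512-65534
    PRIVATE_32 = range(4200000000, 4294967295)    # 4200000000-4294967294

    def is_private(asn: int) -> bool:
        return asn in PRIVATE_16 or asn in PRIVATE_32

    def remove_private_as(as_path: list) -> list:
        """Drop the leading run of Private Use ASNs (the intra-DC hops)."""
        i = 0
        while i < len(as_path) and is_private(as_path[i]):
            i += 1
        return as_path[i:]

    # ToR 65001 -> cluster 64901 -> Tier-1 64512: nothing is left, so
    # every DC-originated path presents a uniform AS_PATH length to the
    # WAN, which is what makes WAN ECMP to Anycast prefixes work.
    print(remove_private_as([65001, 64901, 64512]))        # []
    print(remove_private_as([65001, 64901, 64512, 3356]))  # [3356]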
> ***************
> *** 998,1005 ****
> to the WAN Routers upstream, to provide resistance to a single-
> link failure causing the black-holing of traffic. To prevent
> black-holing in the situation when all of the EBGP sessions to the
> ! WAN routers fail simultaneously on a given device it is more
> ! desirable to take the "relaying" approach rather than introducing
> the default route via complicated conditional route origination
> schemes provided by some implementations [CONDITIONALROUTE].
>
> --- 998,1005 ----
> to the WAN Routers upstream, to provide resistance to a single-
> link failure causing the black-holing of traffic. To prevent
> black-holing in the situation when all of the EBGP sessions to the
> ! WAN routers fail simultaneously on a given device, it is more
> ! desirable to readvertise the default route rather than originating
> the default route via complicated conditional route origination
> schemes provided by some implementations [CONDITIONALROUTE].
>
> ***************
> *** 1017,1023 ****
> prefixes originated from within the data center in a fully routed
> network design. For example, a network with 2000 Tier-3 devices will
> have at least 2000 servers subnets advertised into BGP, along with
> ! the infrastructure or other prefixes. However, as discussed before,
> the proposed network design does not allow for route summarization
> due to the lack of peer links inside every tier.
>
> --- 1017,1023 ----
> prefixes originated from within the data center in a fully routed
> network design. For example, a network with 2000 Tier-3 devices will
> have at least 2000 servers subnets advertised into BGP, along with
> ! the infrastructure and link prefixes. However, as discussed before,
> the proposed network design does not allow for route summarization
> due to the lack of peer links inside every tier.
>
> ***************
> *** 1028,1037 ****
> o Interconnect the Border Routers using a full-mesh of physical
> links or using any other "peer-mesh" topology, such as ring or
> hub-and-spoke. Configure BGP accordingly on all Border Leafs to
> ! exchange network reachability information - e.g. by adding a mesh
> of IBGP sessions. The interconnecting peer links need to be
> appropriately sized for traffic that will be present in the case
> ! of a device or link failure underneath the Border Routers.
>
> o Tier-1 devices may have additional physical links provisioned
> toward the Border Routers (which are Tier-2 devices from the
> --- 1028,1037 ----
> o Interconnect the Border Routers using a full-mesh of physical
> links or using any other "peer-mesh" topology, such as ring or
> hub-and-spoke. Configure BGP accordingly on all Border Leafs to
> ! exchange network reachability information, e.g., by adding a mesh
> of IBGP sessions. The interconnecting peer links need to be
> appropriately sized for traffic that will be present in the case
> ! of a device or link failure in the mesh connecting the Border
> Routers.
>
> o Tier-1 devices may have additional physical links provisioned
> toward the Border Routers (which are Tier-2 devices from the
> ***************
> *** 1043,1049 ****
> device compared with the other devices in the Clos. This also
> reduces the number of ports available to "regular" Tier-2 switches
> and hence the number of clusters that could be interconnected via
> ! Tier-1 layer.
>
> If any of the above options are implemented, it is possible to
> perform route summarization at the Border Routers toward the WAN
> --- 1043,1049 ----
> device compared with the other devices in the Clos. This also
> reduces the number of ports available to "regular" Tier-2 switches
> and hence the number of clusters that could be interconnected via
> ! the Tier-1 layer.
>
> If any of the above options are implemented, it is possible to
> perform route summarization at the Border Routers toward the WAN
> ***************
> *** 1071,1079 ****
> ECMP is the fundamental load sharing mechanism used by a Clos
> topology. Effectively, every lower-tier device will use all of its
> directly attached upper-tier devices to load share traffic destined
> ! to the same IP prefix. Number of ECMP paths between any two Tier-3
> devices in Clos topology equals to the number of the devices in the
> ! middle stage (Tier-1). For example, Figure 5 illustrates the
> topology where Tier-3 device A has four paths to reach servers X and
> Y, via Tier-2 devices B and C and then Tier-1 devices 1, 2, 3, and 4
> respectively.
> --- 1071,1079 ----
> ECMP is the fundamental load sharing mechanism used by a Clos
> topology. Effectively, every lower-tier device will use all of its
> directly attached upper-tier devices to load share traffic destined
> ! to the same IP prefix. The number of ECMP paths between any two
> Tier-3
> devices in Clos topology equals to the number of the devices in the
> ! middle stage (Tier-1). For example, Figure 5 illustrates a
> topology where Tier-3 device A has four paths to reach servers X and
> Y, via Tier-2 devices B and C and then Tier-1 devices 1, 2, 3, and 4
> respectively.
> ***************
> *** 1105,1116 ****
>
> The ECMP requirement implies that the BGP implementation must support
> multipath fan-out for up to the maximum number of devices directly
> ! attached at any point in the topology in upstream or downstream
> direction. Normally, this number does not exceed half of the ports
> found on a device in the topology. For example, an ECMP fan-out of
> 32 would be required when building a Clos network using 64-port
> devices. The Border Routers may need to have wider fan-out to be
> ! able to connect to multitude of Tier-1 devices if route summarization
> at Border Router level is implemented as described in Section 5.2.5.
> If a device's hardware does not support wider ECMP, logical link-
> grouping (link-aggregation at layer 2) could be used to provide
> --- 1105,1116 ----
>
> The ECMP requirement implies that the BGP implementation must support
> multipath fan-out for up to the maximum number of devices directly
> ! attached at any point in the topology in the upstream or downstream
> direction. Normally, this number does not exceed half of the ports
> found on a device in the topology. For example, an ECMP fan-out of
> 32 would be required when building a Clos network using 64-port
> devices. The Border Routers may need to have wider fan-out to be
> ! able to connect to a multitude of Tier-1 devices if route
> summarization
> at Border Router level is implemented as described in Section 5.2.5.
> If a device's hardware does not support wider ECMP, logical link-
> grouping (link-aggregation at layer 2) could be used to provide
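A quick sanity check of the fan-out arithmetic in the two hunks above,
using nothing beyond what the text itself states:

    # Fan-out never exceeds half the ports on a uniform device, and the
    # ECMP path count between two Tier-3 devices equals the number of
    # middle-stage (Tier-1) devices.

    def required_fanout(ports: int) -> int:
        return ports // 2

    def tier3_to_tier3_paths(num_tier1: int) -> int:
        return num_tier1

    assert required_fanout(64) == 32      # the 64-port example above
    print(tier3_to_tier3_paths(4))        # Figure 5's example: 4 paths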
> ***************
> *** 1122,1131 ****
> Internet-Draft draft-ietf-rtgwg-bgp-routing-large-dc March 2016
>
>
> ! "hierarchical" ECMP (Layer 3 ECMP followed by Layer 2 ECMP) to
> compensate for fan-out limitations. Such approach, however,
> increases the risk of flow polarization, as less entropy will be
> ! available to the second stage of ECMP.
>
> Most BGP implementations declare paths to be equal from an ECMP
> perspective if they match up to and including step (e) in
> --- 1122,1131 ----
> Internet-Draft draft-ietf-rtgwg-bgp-routing-large-dc March 2016
>
>
> ! "hierarchical" ECMP (Layer 3 ECMP coupled with Layer 2 ECMP) to
> compensate for fan-out limitations. Such approach, however,
> increases the risk of flow polarization, as less entropy will be
> ! available at the second stage of ECMP.
>
> Most BGP implementations declare paths to be equal from an ECMP
> perspective if they match up to and including step (e) in
> ***************
> *** 1148,1154 ****
> perspective of other devices, such a prefix would have BGP paths with
> different AS_PATH attribute values, while having the same AS_PATH
> attribute lengths. Therefore, BGP implementations must support load
> ! sharing over above-mentioned paths. This feature is sometimes known
> as "multipath relax" or "multipath multiple-as" and effectively
> allows for ECMP to be done across different neighboring ASNs if all
> other attributes are equal as already described in the previous
> --- 1148,1154 ----
> perspective of other devices, such a prefix would have BGP paths with
> different AS_PATH attribute values, while having the same AS_PATH
> attribute lengths. Therefore, BGP implementations must support load
> ! sharing over the above-mentioned paths. This feature is sometimes
> known
> as "multipath relax" or "multipath multiple-as" and effectively
> allows for ECMP to be done across different neighboring ASNs if all
> other attributes are equal as already described in the previous
> ***************
> *** 1182,1199 ****
>
> It is often desirable to have the hashing function used for ECMP to
> be consistent (see [CONS-HASH]), to minimize the impact on flow to
> ! next-hop affinity changes when a next-hop is added or removed to ECMP
> group. This could be used if the network device is used as a load
> balancer, mapping flows toward multiple destinations - in this case,
> ! losing or adding a destination will not have detrimental effect of
> currently established flows. One particular recommendation on
> implementing consistent hashing is provided in [RFC2992], though
> other implementations are possible. This functionality could be
> naturally combined with weighted ECMP, with the impact of the next-
> hop changes being proportional to the weight of the given next-hop.
> The downside of consistent hashing is increased load on hardware
> ! resource utilization, as typically more space is required to
> ! implement a consistent-hashing region.
>
> 7. Routing Convergence Properties
>
> --- 1182,1199 ----
>
> It is often desirable to have the hashing function used for ECMP to
> be consistent (see [CONS-HASH]), to minimize the impact on flow to
> ! next-hop affinity changes when a next-hop is added or removed to an
> ECMP
> group. This could be used if the network device is used as a load
> balancer, mapping flows toward multiple destinations - in this case,
> ! losing or adding a destination will not have a detrimental effect on
> currently established flows. One particular recommendation on
> implementing consistent hashing is provided in [RFC2992], though
> other implementations are possible. This functionality could be
> naturally combined with weighted ECMP, with the impact of the next-
> hop changes being proportional to the weight of the given next-hop.
> The downside of consistent hashing is increased load on hardware
> ! resource utilization, as typically more resources (e.g., TCAM space)
> ! are required to implement a consistent-hashing function.
>
> 7. Routing Convergence Properties
>
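For readers who want to see the "consistent" property in action, here is
one well-known construction: highest-random-weight (rendezvous) hashing.
To be clear, this is an illustration of the property only, not the
hash-threshold method of [RFC2992] nor any particular vendor's
implementation:

    import hashlib

    def pick_next_hop(flow: str, next_hops: list) -> str:
        # Each (flow, next-hop) pair gets a pseudo-random weight; the
        # flow uses whichever next-hop scores highest.
        def weight(nh: str) -> int:
            digest = hashlib.sha256(f"{flow}|{nh}".encode()).digest()
            return int.from_bytes(digest[:8], "big")
        return max(next_hops, key=weight)

    hops = ["tier1-1", "tier1-2", "tier1-3", "tier1-4"]
    flows = [f"10.0.0.{i}:80->198.51.100.1:443" for i in range(256)]
    before = {f: pick_next_hop(f, hops) for f in flows}
    after = {f: pick_next_hop(f, hops[:-1]) for f in flows}  # one fails
    moved = sum(before[f] != after[f] for f in flows)
    print(f"{moved}/256 flows moved")   # only tier1-4's flows remap

Losing a next-hop only remaps the flows that were pinned to it; every
other flow keeps its affinity, which is the behavior the text is after.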
> ***************
> *** 1209,1224 ****
> driven mechanism to obtain updates on IGP state changes. The
> proposed routing design does not use an IGP, so the remaining
> mechanisms that could be used for fault detection are BGP keep-alive
> ! process (or any other type of keep-alive mechanism) and link-failure
> triggers.
>
> Relying solely on BGP keep-alive packets may result in high
> ! convergence delays, in the order of multiple seconds (on many BGP
> implementations the minimum configurable BGP hold timer value is
> three seconds). However, many BGP implementations can shut down
> local EBGP peering sessions in response to the "link down" event for
> the outgoing interface used for BGP peering. This feature is
> ! sometimes called as "fast fallover". Since links in modern data
> centers are predominantly point-to-point fiber connections, a
> physical interface failure is often detected in milliseconds and
> subsequently triggers a BGP re-convergence.
> --- 1209,1224 ----
> driven mechanism to obtain updates on IGP state changes. The
> proposed routing design does not use an IGP, so the remaining
> mechanisms that could be used for fault detection are BGP keep-alive
> ! time-out (or any other type of keep-alive mechanism) and link-failure
> triggers.
>
> Relying solely on BGP keep-alive packets may result in high
> ! convergence delays, on the order of multiple seconds (on many BGP
> implementations the minimum configurable BGP hold timer value is
> three seconds). However, many BGP implementations can shut down
> local EBGP peering sessions in response to the "link down" event for
> the outgoing interface used for BGP peering. This feature is
> ! sometimes called "fast fallover". Since links in modern data
> centers are predominantly point-to-point fiber connections, a
> physical interface failure is often detected in milliseconds and
> subsequently triggers a BGP re-convergence.
> ***************
> *** 1236,1242 ****
>
> Alternatively, some platforms may support Bidirectional Forwarding
> Detection (BFD) [RFC5880] to allow for sub-second failure detection
> ! and fault signaling to the BGP process. However, use of either of
> these presents additional requirements to vendor software and
> possibly hardware, and may contradict REQ1. Until recently with
> [RFC7130], BFD also did not allow detection of a single member link
> --- 1236,1242 ----
>
> Alternatively, some platforms may support Bidirectional Forwarding
> Detection (BFD) [RFC5880] to allow for sub-second failure detection
> ! and fault signaling to the BGP process. However, the use of either of
> these presents additional requirements to vendor software and
> possibly hardware, and may contradict REQ1. Until recently with
> [RFC7130], BFD also did not allow detection of a single member link
> ***************
> *** 1245,1251 ****
>
> 7.2. Event Propagation Timing
>
> ! In the proposed design the impact of BGP Minimum Route Advertisement
> Interval (MRAI) timer (See section 9.2.1.1 of [RFC4271]) should be
> considered. Per the standard it is required for BGP implementations
> to space out consecutive BGP UPDATE messages by at least MRAI
> --- 1245,1251 ----
>
> 7.2. Event Propagation Timing
>
> ! In the proposed design the impact of the BGP Minimum Route
> Advertisement
> Interval (MRAI) timer (See section 9.2.1.1 of [RFC4271]) should be
> considered. Per the standard it is required for BGP implementations
> to space out consecutive BGP UPDATE messages by at least MRAI
> ***************
> *** 1258,1270 ****
> In a Clos topology each EBGP speaker typically has either one path
> (Tier-2 devices don't accept paths from other Tier-2 in the same
> cluster due to same ASN) or N paths for the same prefix, where N is a
> ! significantly large number, e.g. N=32 (the ECMP fan-out to the next
> Tier). Therefore, if a link fails to another device from which a
> ! path is received there is either no backup path at all (e.g. from
> perspective of a Tier-2 switch losing link to a Tier-3 device), or
> ! the backup is readily available in BGP Loc-RIB (e.g. from perspective
> of a Tier-2 device losing link to a Tier-1 switch). In the former
> ! case, the BGP withdrawal announcement will propagate un-delayed and
> trigger re-convergence on affected devices. In the latter case, the
> best-path will be re-evaluated and the local ECMP group corresponding
> to the new next-hop set changed. If the BGP path was the best-path
> --- 1258,1270 ----
> In a Clos topology each EBGP speaker typically has either one path
> (Tier-2 devices don't accept paths from other Tier-2 in the same
> cluster due to same ASN) or N paths for the same prefix, where N is a
> ! significantly large number, e.g., N=32 (the ECMP fan-out to the next
> Tier). Therefore, if a link fails to another device from which a
> ! path is received there is either no backup path at all (e.g., from the
> perspective of a Tier-2 switch losing link to a Tier-3 device), or
> ! the backup is readily available in BGP Loc-RIB (e.g., from perspective
> of a Tier-2 device losing link to a Tier-1 switch). In the former
> ! case, the BGP withdrawal announcement will propagate without delay and
> trigger re-convergence on affected devices. In the latter case, the
> best-path will be re-evaluated and the local ECMP group corresponding
> to the new next-hop set changed. If the BGP path was the best-path
> ***************
> *** 1279,1285 ****
> situation when a link between Tier-3 and Tier-2 device fails, the
> Tier-2 device will send BGP UPDATE messages to all upstream Tier-1
> devices, withdrawing the affected prefixes. The Tier-1 devices, in
> ! turn, will relay those messages to all downstream Tier-2 devices
> (except for the originator). Tier-2 devices other than the one
> originating the UPDATE should then wait for ALL upstream Tier-1
>
> --- 1279,1285 ----
> situation when a link between Tier-3 and Tier-2 device fails, the
> Tier-2 device will send BGP UPDATE messages to all upstream Tier-1
> devices, withdrawing the affected prefixes. The Tier-1 devices, in
> ! turn, will relay these messages to all downstream Tier-2 devices
> (except for the originator). Tier-2 devices other than the one
> originating the UPDATE should then wait for ALL upstream Tier-1
>
> ***************
> *** 1307,1313 ****
> features that vendors include to reduce the control plane impact of
> rapidly flapping prefixes. However, due to issues described with
> false positives in these implementations especially under such
> ! "dispersion" events, it is not recommended to turn this feature on in
> this design. More background and issues with "route flap dampening"
> and possible implementation changes that could affect this are well
> described in [RFC7196].
> --- 1307,1313 ----
> features that vendors include to reduce the control plane impact of
> rapidly flapping prefixes. However, due to issues described with
> false positives in these implementations especially under such
> ! "dispersion" events, it is not recommended to enable this feature in
> this design. More background and issues with "route flap dampening"
> and possible implementation changes that could affect this are well
> described in [RFC7196].
> ***************
> *** 1316,1324 ****
>
> A network is declared to converge in response to a failure once all
> devices within the failure impact scope are notified of the event and
> ! have re-calculated their RIB's and consequently updated their FIB's.
> Larger failure impact scope typically means slower convergence since
> ! more devices have to be notified, and additionally results in a less
> stable network. In this section we describe BGP's advantages over
> link-state routing protocols in reducing failure impact scope for a
> Clos topology.
> --- 1316,1324 ----
>
> A network is declared to converge in response to a failure once all
> devices within the failure impact scope are notified of the event and
> ! have re-calculated their RIBs and consequently updated their FIBs.
> Larger failure impact scope typically means slower convergence since
> ! more devices have to be notified, and results in a less
> stable network. In this section we describe BGP's advantages over
> link-state routing protocols in reducing failure impact scope for a
> Clos topology.
> ***************
> *** 1327,1335 ****
> the best path from the point of view of the local router is sent to
> neighbors. As such, some failures are masked if the local node can
> immediately find a backup path and does not have to send any updates
> ! further. Notice that in the worst case ALL devices in a data center
> topology have to either withdraw a prefix completely or update the
> ! ECMP groups in the FIB. However, many failures will not result in
> such a wide impact. There are two main failure types where impact
> scope is reduced:
>
> --- 1327,1335 ----
> the best path from the point of view of the local router is sent to
> neighbors. As such, some failures are masked if the local node can
> immediately find a backup path and does not have to send any updates
> ! further. Notice that in the worst case, all devices in a data center
> topology have to either withdraw a prefix completely or update the
> ! ECMP groups in their FIBs. However, many failures will not result in
> such a wide impact. There are two main failure types where impact
> scope is reduced:
>
> ***************
> *** 1357,1367 ****
>
> o Failure of a Tier-1 device: In this case, all Tier-2 devices
> directly attached to the failed node will have to update their
> ! ECMP groups for all IP prefixes from non-local cluster. The
> Tier-3 devices are once again not involved in the re-convergence
> process, but may receive "implicit withdraws" as described above.
>
> ! Even though in case of such failures multiple IP prefixes will have
> to be reprogrammed in the FIB, it is worth noting that ALL of these
> prefixes share a single ECMP group on Tier-2 device. Therefore, in
> the case of implementations with a hierarchical FIB, only a single
> --- 1357,1367 ----
>
> o Failure of a Tier-1 device: In this case, all Tier-2 devices
> directly attached to the failed node will have to update their
> ! ECMP groups for all IP prefixes from a non-local cluster. The
> Tier-3 devices are once again not involved in the re-convergence
> process, but may receive "implicit withdraws" as described above.
>
> ! Even though in the case of such failures multiple IP prefixes will have
> to be reprogrammed in the FIB, it is worth noting that ALL of these
> prefixes share a single ECMP group on Tier-2 device. Therefore, in
> the case of implementations with a hierarchical FIB, only a single
> ***************
> *** 1375,1381 ****
> possible with the proposed design, since using this technique may
> create routing black-holes as mentioned previously. Therefore, the
> worst control plane failure impact scope is the network as a whole,
> ! for instance in a case of a link failure between Tier-2 and Tier-3
> devices. The amount of impacted prefixes in this case would be much
> less than in the case of a failure in the upper layers of a Clos
> network topology. The property of having such large failure scope is
> --- 1375,1381 ----
> possible with the proposed design, since using this technique may
> create routing black-holes as mentioned previously. Therefore, the
> worst control plane failure impact scope is the network as a whole,
> ! for instance in the case of a link failure between Tier-2 and Tier-3
> devices. The amount of impacted prefixes in this case would be much
> less than in the case of a failure in the upper layers of a Clos
> network topology. The property of having such large failure scope is
> ***************
> *** 1384,1397 ****
>
> 7.5. Routing Micro-Loops
>
> ! When a downstream device, e.g. Tier-2 device, loses all paths for a
> prefix, it normally has the default route pointing toward the
> upstream device, in this case the Tier-1 device. As a result, it is
> ! possible to get in the situation when Tier-2 switch loses a prefix,
> ! but Tier-1 switch still has the path pointing to the Tier-2 device,
> ! which results in transient micro-loop, since Tier-1 switch will keep
> passing packets to the affected prefix back to Tier-2 device, and
> ! Tier-2 will bounce it back again using the default route. This
> micro-loop will last for the duration of time it takes the upstream
> device to fully update its forwarding tables.
>
> --- 1384,1397 ----
>
> 7.5. Routing Micro-Loops
>
> ! When a downstream device, e.g., Tier-2 device, loses all paths for a
> prefix, it normally has the default route pointing toward the
> upstream device, in this case the Tier-1 device. As a result, it is
> ! possible to get in the situation where a Tier-2 switch loses a prefix,
> ! but a Tier-1 switch still has the path pointing to the Tier-2 device,
> ! which results in transient micro-loop, since the Tier-1 switch will
> keep
> passing packets to the affected prefix back to Tier-2 device, and
> ! the Tier-2 will bounce it back again using the default route. This
> micro-loop will last for the duration of time it takes the upstream
> device to fully update its forwarding tables.
>
> ***************
> *** 1402,1408 ****
> Internet-Draft draft-ietf-rtgwg-bgp-routing-large-dc March 2016
>
>
> ! To minimize impact of the micro-loops, Tier-2 and Tier-1 switches can
> be configured with static "discard" or "null" routes that will be
> more specific than the default route for prefixes missing during
> network convergence. For Tier-2 switches, the discard route should
> --- 1402,1408 ----
> Internet-Draft draft-ietf-rtgwg-bgp-routing-large-dc March 2016
>
>
> ! To minimize the impact of such micro-loops, Tier-2 and Tier-1
> switches can
> be configured with static "discard" or "null" routes that will be
> more specific than the default route for prefixes missing during
> network convergence. For Tier-2 switches, the discard route should
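The discard-route fix in the 1402,1408 hunk is just longest-prefix-match
at work. A minimal illustration (the addresses and FIB layout are made
up for the example):

    import ipaddress

    def lookup(fib: dict, destination: str):
        dst = ipaddress.ip_address(destination)
        matches = [p for p in fib if dst in p]
        if not matches:
            return None
        return fib[max(matches, key=lambda p: p.prefixlen)]

    net = ipaddress.ip_network
    tier2_fib = {
        net("0.0.0.0/0"): "tier1-uplinks",   # default toward Tier-1
        net("10.1.0.0/16"): "discard",       # static cover for the
                                             # local cluster aggregate
        # 10.1.3.0/24 (a Tier-3 subnet) was just withdrawn, so there is
        # no more specific entry for it during convergence.
    }
    print(lookup(tier2_fib, "10.1.3.5"))     # "discard"

Without the /16 discard entry the lookup would fall through to the
default route and bounce between Tier-2 and Tier-1 until the upstream
device finishes updating its forwarding tables.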
> *** 1417,1423 ****
> 
>   8.1.  Third-party Route Injection
> 
> ! BGP allows for a "third-party", i.e. directly attached, BGP speaker
>   to inject routes anywhere in the network topology, meeting REQ5.
>   This can be achieved by peering via a multihop BGP session with some
>   or even all devices in the topology.  Furthermore, BGP diverse path
> --- 1417,1423 ----
> 
>   8.1.  Third-party Route Injection
> 
> ! BGP allows for a "third-party", i.e., directly attached, BGP speaker
>   to inject routes anywhere in the network topology, meeting REQ5.
>   This can be achieved by peering via a multihop BGP session with some
>   or even all devices in the topology.  Furthermore, BGP diverse path
> ***************
> *** 1427,1433 ****
>   implementation.  Unfortunately, in many implementations ADD-PATH has
>   been found to only support IBGP properly due to the use cases it was
>   originally optimized for, which limits the "third-party" peering to
> ! IBGP only, if the feature is used.
> 
>   To implement route injection in the proposed design, a third-party
>   BGP speaker may peer with Tier-3 and Tier-1 switches, injecting the
> --- 1427,1433 ----
>   implementation.  Unfortunately, in many implementations ADD-PATH has
>   been found to only support IBGP properly due to the use cases it was
>   originally optimized for, which limits the "third-party" peering to
> ! IBGP only.
> 
>   To implement route injection in the proposed design, a third-party
>   BGP speaker may peer with Tier-3 and Tier-1 switches, injecting the
> ***************
> *** 1442,1453 ****
>   As mentioned previously, route summarization is not possible within
>   the proposed Clos topology since it makes the network susceptible to
>   route black-holing under single link failures.  The main problem is
> ! the limited number of redundant paths between network elements, e.g.
>   there is only a single path between any pair of Tier-1 and Tier-3
>   devices.  However, some operators may find route aggregation
>   desirable to improve control plane stability.
> 
> ! If planning on using any technique to summarize within the topology
>   modeling of the routing behavior and potential for black-holing
>   should be done not only for single or multiple link failures, but
> 
> --- 1442,1453 ----
>   As mentioned previously, route summarization is not possible within
>   the proposed Clos topology since it makes the network susceptible to
>   route black-holing under single link failures.  The main problem is
> ! the limited number of redundant paths between network elements, e.g.,
>   there is only a single path between any pair of Tier-1 and Tier-3
>   devices.  However, some operators may find route aggregation
>   desirable to improve control plane stability.
> 
> ! If any technique to summarize within the topology is planned,
>   modeling of the routing behavior and potential for black-holing
>   should be done not only for single or multiple link failures, but
> 
> ***************
> *** 1458,1468 ****
>   Internet-Draft    draft-ietf-rtgwg-bgp-routing-large-dc      March 2016
> 
> 
> ! also fiber pathway failures or optical domain failures if the
>   topology extends beyond a physical location.  Simple modeling can be
>   done by checking the reachability on devices doing summarization
>   under the condition of a link or pathway failure between a set of
> ! devices in every tier as well as to the WAN routers if external
>   connectivity is present.
> 
>   Route summarization would be possible with a small modification to
> --- 1458,1468 ----
>   Internet-Draft    draft-ietf-rtgwg-bgp-routing-large-dc      March 2016
> 
> 
> ! also fiber pathway failures or optical domain failures when the
>   topology extends beyond a physical location.  Simple modeling can be
>   done by checking the reachability on devices doing summarization
>   under the condition of a link or pathway failure between a set of
> ! devices in every tier as well as to the WAN routers when external
>   connectivity is present.
> 
>   Route summarization would be possible with a small modification to
> ***************
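The "simple modeling" suggested in this hunk amounts to exhaustively failing candidate links and re-checking reachability from the summarizing device. A rough Python sketch follows; the five-link toy topology and node names are invented, and a real model would enumerate shared-fate groups (fiber pathways, optical domains) rather than raw links:

    # Rough sketch of the failure modeling suggested above: fail every
    # combination of one or two links and re-check that the summarizing
    # device still reaches the WAN router.  The topology is invented.
    from itertools import combinations

    links = {("t3-1", "t2-1"), ("t3-1", "t2-2"),
             ("t2-1", "t1-1"), ("t2-2", "t1-1"), ("t1-1", "wan-1")}

    def reachable(src, dst, up):
        seen, stack = {src}, [src]
        while stack:
            node = stack.pop()
            if node == dst:
                return True
            for a, b in up:
                nxt = b if a == node else a if b == node else None
                if nxt and nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        return False

    for size in (1, 2):
        for failed in combinations(links, size):
            if not reachable("t3-1", "wan-1", links - set(failed)):
                print("potential black-hole if these fail together:", failed)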
> *** 1519,1544 ****
>   cluster from Tier-2 devices since each of them has only a single path
>   down to this prefix.  It would require dual-homed servers to
>   accomplish that.  Also note that this design is only resilient to
> ! single link failure.  It is possible for a double link failure to
>   isolate a Tier-2 device from all paths toward a specific Tier-3
>   device, thus causing a routing black-hole.
> 
> ! A result of the proposed topology modification would be reduction of
>   Tier-1 devices port capacity.  This limits the maximum number of
>   attached Tier-2 devices and therefore will limit the maximum DC
>   network size.  A larger network would require different Tier-1
>   devices that have higher port density to implement this change.
> 
>   Another problem is traffic re-balancing under link failures.  Since
> ! three are two paths from Tier-1 to Tier-3, a failure of the link
>   between Tier-1 and Tier-2 switch would result in all traffic that was
>   taking the failed link to switch to the remaining path.  This will
> ! result in doubling of link utilization on the remaining link.
> 
>   8.2.2.  Simple Virtual Aggregation
> 
>   A completely different approach to route summarization is possible,
> ! provided that the main goal is to reduce the FIB pressure, while
>   allowing the control plane to disseminate full routing information.
>   Firstly, it could be easily noted that in many cases multiple
>   prefixes, some of which are less specific, share the same set of the
> --- 1519,1544 ----
>   cluster from Tier-2 devices since each of them has only a single path
>   down to this prefix.  It would require dual-homed servers to
>   accomplish that.  Also note that this design is only resilient to
> ! single link failures.  It is possible for a double link failure to
>   isolate a Tier-2 device from all paths toward a specific Tier-3
>   device, thus causing a routing black-hole.
> 
> ! A result of the proposed topology modification would be a reduction of
>   Tier-1 devices port capacity.  This limits the maximum number of
>   attached Tier-2 devices and therefore will limit the maximum DC
>   network size.  A larger network would require different Tier-1
>   devices that have higher port density to implement this change.
> 
>   Another problem is traffic re-balancing under link failures.  Since
> ! there are two paths from Tier-1 to Tier-3, a failure of the link
>   between Tier-1 and Tier-2 switch would result in all traffic that was
>   taking the failed link to switch to the remaining path.  This will
> ! result in doubling the utilization of the remaining link.
> 
>   8.2.2.  Simple Virtual Aggregation
> 
>   A completely different approach to route summarization is possible,
> ! provided that the main goal is to reduce the FIB size, while
>   allowing the control plane to disseminate full routing information.
>   Firstly, it could be easily noted that in many cases multiple
>   prefixes, some of which are less specific, share the same set of the
> ***************
> *** 1550,1563 ****
>   [RFC6769] and only install the least specific route in the FIB,
>   ignoring more specific routes if they share the same next-hop set.
>   For example, under normal network conditions, only the default route
> ! need to be programmed into FIB.
> 
>   Furthermore, if the Tier-2 devices are configured with summary
> ! prefixes covering all of their attached Tier-3 device's prefixes the
>   same logic could be applied in Tier-1 devices as well, and, by
>   induction to Tier-2/Tier-3 switches in different clusters.  These
>   summary routes should still allow for more specific prefixes to leak
> ! to Tier-1 devices, to enable for detection of mismatches in the next-
>   hop sets if a particular link fails, changing the next-hop set for a
>   specific prefix.
> 
> --- 1550,1563 ----
>   [RFC6769] and only install the least specific route in the FIB,
>   ignoring more specific routes if they share the same next-hop set.
>   For example, under normal network conditions, only the default route
> ! needs to be programmed into the FIB.
> 
>   Furthermore, if the Tier-2 devices are configured with summary
> ! prefixes covering all of their attached Tier-3 devices' prefixes, the
>   same logic could be applied in Tier-1 devices as well, and, by
>   induction to Tier-2/Tier-3 switches in different clusters.  These
>   summary routes should still allow for more specific prefixes to leak
> ! to Tier-1 devices, to enable detection of mismatches in the next-
>   hop sets if a particular link fails, changing the next-hop set for a
>   specific prefix.
> 
> ***************
> *** 1571,1584 ****
> 
> 
>   Re-stating once again, this technique does not reduce the amount of
> ! control plane state (i.e. BGP UPDATEs/BGP LocRIB sizing), but only
> ! allows for more efficient FIB utilization, by spotting more specific
> ! prefixes that share their next-hops with less specifics.
> 
>   8.3.  ICMP Unreachable Message Masquerading
> 
>   This section discusses some operational aspects of not advertising
> ! point-to-point link subnets into BGP, as previously outlined as an
>   option in Section 5.2.3.  The operational impact of this decision
>   could be seen when using the well-known "traceroute" tool.
>   Specifically, IP addresses displayed by the tool will be the link's
> --- 1571,1585 ----
> 
> 
>   Re-stating once again, this technique does not reduce the amount of
> ! control plane state (i.e., BGP UPDATEs/BGP Loc-RIB size), but only
> ! allows for more efficient FIB utilization, by detecting more specific
> ! prefixes that share their next-hop set with a subsuming less specific
> ! prefix.
> 
>   8.3.  ICMP Unreachable Message Masquerading
> 
>   This section discusses some operational aspects of not advertising
> ! point-to-point link subnets into BGP, as previously identified as an
>   option in Section 5.2.3.  The operational impact of this decision
>   could be seen when using the well-known "traceroute" tool.
>   Specifically, IP addresses displayed by the tool will be the link's
> ***************
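The suppression logic of Simple Virtual Aggregation can be sketched as follows; the prefixes and next-hop names are invented, and this is only an illustration of the RFC 6769 idea, not an implementation of it. A more specific route is programmed into the FIB only when its next-hop set differs from that of a covering, less specific route:

    # Sketch of the Virtual Aggregation idea above (invented prefixes
    # and next-hop names): program a more specific prefix into the FIB
    # only when its next-hop set differs from a covering route's set.
    import ipaddress

    rib = {
        ipaddress.ip_network("0.0.0.0/0"):   {"t1-a", "t1-b"},
        ipaddress.ip_network("10.3.0.0/24"): {"t1-a", "t1-b"},  # same set
        ipaddress.ip_network("10.4.0.0/24"): {"t1-a"},          # link failed
    }

    def compress(rib):
        fib = {}
        # Walk from least to most specific so covering routes come first.
        for net, nhs in sorted(rib.items(), key=lambda kv: kv[0].prefixlen):
            covered = any(net != c and net.subnet_of(c) and fib[c] == nhs
                          for c in fib)
            if not covered:
                fib[net] = nhs
        return fib

    fib = compress(rib)
    assert ipaddress.ip_network("10.3.0.0/24") not in fib  # suppressed
    assert ipaddress.ip_network("10.4.0.0/24") in fib      # mismatch, installed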
> *** 1587,1605 ****
>   complicated.
> 
>   One way to overcome this limitation is by using the DNS subsystem to
> ! create the "reverse" entries for the IP addresses of the same device
> ! pointing to the same name.  The connectivity then can be made by
> ! resolving this name to the "primary" IP address of the devices, e.g.
>   its Loopback interface, which is always advertised into BGP.
>   However, this creates a dependency on the DNS subsystem, which may be
>   unavailable during an outage.
> 
>   Another option is to make the network device perform IP address
>   masquerading, that is rewriting the source IP addresses of the
> ! appropriate ICMP messages sent off of the device with the "primary"
>   IP address of the device.  Specifically, the ICMP Destination
>   Unreachable Message (type 3) codes 3 (port unreachable) and ICMP Time
> ! Exceeded (type 11) code 0, which are involved in proper working of
>   the "traceroute" tool.  With this modification, the "traceroute"
>   probes sent to the devices will always be sent back with the
>   "primary" IP address as the source, allowing the operator to discover
> --- 1588,1606 ----
>   complicated.
> 
>   One way to overcome this limitation is by using the DNS subsystem to
> ! create the "reverse" entries for these point-to-point IP addresses
> ! pointing to the same name as the loopback address.  The connectivity
> ! then can be made by resolving this name to the "primary" IP address
> ! of the devices, e.g.,
>   its Loopback interface, which is always advertised into BGP.
>   However, this creates a dependency on the DNS subsystem, which may be
>   unavailable during an outage.
> 
>   Another option is to make the network device perform IP address
>   masquerading, that is rewriting the source IP addresses of the
> ! appropriate ICMP messages sent by the device with the "primary"
>   IP address of the device.  Specifically, the ICMP Destination
>   Unreachable Message (type 3) codes 3 (port unreachable) and ICMP Time
> ! Exceeded (type 11) code 0, which are required for correct operation of
>   the "traceroute" tool.  With this modification, the "traceroute"
>   probes sent to the devices will always be sent back with the
>   "primary" IP address as the source, allowing the operator to discover
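The masquerading rule itself is small enough to sketch; the loopback and link addresses below are invented, and a real implementation would apply this in the device's packet path rather than in Python. Only the two ICMP message types that traceroute depends on have their source rewritten to the "primary" address:

    # Sketch of the ICMP source-address masquerading described above
    # (invented addresses; illustrative only).
    LOOPBACK = "192.0.2.1"  # "primary" address, advertised into BGP

    def icmp_source(icmp_type, icmp_code, link_addr):
        # Rewrite only the messages traceroute depends on: Time Exceeded
        # (type 11, code 0) and Destination Unreachable / port
        # unreachable (type 3, code 3).
        if (icmp_type, icmp_code) in ((11, 0), (3, 3)):
            return LOOPBACK
        return link_addr

    assert icmp_source(11, 0, "10.255.0.9") == "192.0.2.1"   # probe rewritten
    assert icmp_source(3, 1, "10.255.0.9") == "10.255.0.9"   # others untouched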
> 
> Thanks,
> Acee
> 
> _______________________________________________
> rtgwg mailing list
> rtgwg@ietf.org
> https://www.ietf.org/mailman/listinfo/rtgwg