RE: Routing Directorate Review for "Use of BGP for routing in large-scale data centers" (adding RTG WG)

Petr Lapukhov <petr@fb.com> Mon, 25 April 2016 19:14 UTC

From: Petr Lapukhov <petr@fb.com>
To: Alia Atlas <akatlas@gmail.com>, "Acee Lindem (acee)" <acee@cisco.com>
Subject: RE: Routing Directorate Review for "Use of BGP for routing in large-scale data centers" (adding RTG WG)
Date: Mon, 25 Apr 2016 19:14:18 +0000
Archived-At: <http://mailarchive.ietf.org/arch/msg/rtgwg/1Vs3RcZ8WeSqsGIGavLco67YKzY>
Cc: "draft-ietf-rtgwg-bgp-routing-large-dc@ietf.org" <draft-ietf-rtgwg-bgp-routing-large-dc@ietf.org>, Routing ADs <rtg-ads@tools.ietf.org>, Routing Directorate <rtg-dir@ietf.org>, Routing WG <rtgwg@ietf.org>

Acee, thank you so much for finishing the review!

Alia, we'll work on addressing the feedback ASAP.

Regards,

Petr

________________________________
From: rtgwg [rtgwg-bounces@ietf.org] on behalf of Alia Atlas [akatlas@gmail.com]
Sent: Monday, April 25, 2016 12:01 PM
To: Acee Lindem (acee)
Cc: draft-ietf-rtgwg-bgp-routing-large-dc@ietf.org; Routing WG; Routing Directorate; Routing ADs
Subject: Re: Routing Directorate Review for "Use of BGP for routing in large-scale data centers" (adding RTG WG)

Hi Acee,

Thank you very much for your review.

Authors, could you please respond soon?  I am hoping to get this out to IETF Last Call
by Thursday - and on the telechat for May 19.  That depends on timely updates from
the authors and shepherd.

Thanks,
Alia



On Mon, Apr 25, 2016 at 1:16 PM, Acee Lindem (acee) <acee@cisco.com> wrote:
Hello,

I have been selected as the Routing Directorate reviewer for this draft.
The Routing Directorate seeks to review all routing or routing-related
drafts as they pass through IETF last call and IESG review, and sometimes
on special request. The purpose of the review is to provide assistance to
the Routing ADs. For more information about the Routing Directorate,
please see http://trac.tools.ietf.org/area/rtg/trac/wiki/RtgDir

Although these comments are primarily for the use of the Routing ADs, it
would be helpful if you could consider them along with any other IETF Last
Call comments that you receive, and strive to resolve them through
discussion or by updating the draft.

Document: draft-ietf-rtgwg-bgp-routing-large-dc-09.txt
Reviewer: Acee Lindem
Review Date: 4/25/16
IETF LC End Date: Not started
Intended Status: Informational

Summary:
    This document is basically ready for publication, but has some minor
issues and nits that should be resolved prior to publication.

Comments:
    The document starts with the requirements for MSDC routing and then
provides an overview of Clos topologies and data center network
design. This overview attempts to cover a lot of material in a very
small amount of text. While not completely successful, the overview
provides a lot of good information and references. The bulk of the
document covers the usage of EBGP as the sole data center routing protocol
and other aspects of the routing design including ECMP, summarization
issues, and convergence. These sections provide a very good guide for
using EBGP in a Clos data center and an excellent discussion of the
deployment issues (based on real deployment experience).

    The technical content of the document is excellent. The readability
could be improved by breaking up some of the run-on sentences and with the
suggested editorial changes (see Nits below).


Major Issues:

    I have no major issues with the document.

Minor Issues:

    Section 4.2: Can an informative reference be added for Direct Server
Return (DSR)?
    Section 5.2.4 and 7.4: Define precisely what is meant by "scale-out"
topology somewhere in the document.
    Section 5.2.5: Can you add a backward reference to the discussion of
"lack of peer links inside every tier"? Also, it would be good to describe
how this would allow for summarization and under what failure conditions.
    Section 7.4: Should you add a reference to
https://www.ietf.org/id/draft-ietf-rtgwg-bgp-pic-00.txt to the penultimate
paragraph in this section?

Nits:

***************
*** 143,149 ****
     network stability so that a small group of people can effectively
     support a significantly sized network.

!    Experimentation and extensive testing has shown that External BGP
     (EBGP) [RFC4271] is well suited as a stand-alone routing protocol for
     these type of data center applications.  This is in contrast with
     more traditional DC designs, which may se simple tree topologies and
--- 143,149 ----
     network stability so that a small group of people can effectively
     support a significantly sized network.

!    Experimentation and extensive testing have shown that External BGP
     (EBGP) [RFC4271] is well suited as a stand-alone routing protocol for
     these type of data center applications.  This is in contrast with
     more traditional DC designs, which may use simple tree topologies and
***************
*** 178,191 ****
  2.1.  Bandwidth and Traffic Patterns

     The primary requirement when building an interconnection network for
!    large number of servers is to accommodate application bandwidth and
     latency requirements.  Until recently it was quite common to see the
     majority of traffic entering and leaving the data center, commonly
     referred to as "north-south" traffic.  Traditional "tree" topologies
     were sufficient to accommodate such flows, even with high
     oversubscription ratios between the layers of the network.  If more
     bandwidth was required, it was added by "scaling up" the network
!    elements, e.g. by upgrading the device's linecards or fabrics or
     replacing the device with one with higher port density.

     Today many large-scale data centers host applications generating
--- 178,191 ----
  2.1.  Bandwidth and Traffic Patterns

     The primary requirement when building an interconnection network for
!    a large number of servers is to accommodate application bandwidth and
     latency requirements.  Until recently it was quite common to see the
     majority of traffic entering and leaving the data center, commonly
     referred to as "north-south" traffic.  Traditional "tree" topologies
     were sufficient to accommodate such flows, even with high
     oversubscription ratios between the layers of the network.  If more
     bandwidth was required, it was added by "scaling up" the network
!    elements, e.g., by upgrading the device's linecards or fabrics or
     replacing the device with one with higher port density.

     Today many large-scale data centers host applications generating
***************
*** 195,201 ****
     [HADOOP], massive data replication between clusters needed by certain
     applications, or virtual machine migrations.  Scaling traditional
     tree topologies to match these bandwidth demands becomes either too
!    expensive or impossible due to physical limitations, e.g. port
     density in a switch.

  2.2.  CAPEX Minimization
--- 195,201 ----
     [HADOOP], massive data replication between clusters needed by certain
     applications, or virtual machine migrations.  Scaling traditional
     tree topologies to match these bandwidth demands becomes either too
!    expensive or impossible due to physical limitations, e.g., port
     density in a switch.

  2.2.  CAPEX Minimization
***************
*** 209,215 ****

     o  Unifying all network elements, preferably using the same hardware
        type or even the same device.  This allows for volume pricing on
!       bulk purchases and reduced maintenance and sparing costs.

     o  Driving costs down using competitive pressures, by introducing
        multiple network equipment vendors.
--- 209,215 ----

     o  Unifying all network elements, preferably using the same hardware
        type or even the same device.  This allows for volume pricing on
!       bulk purchases and reduced maintenance and inventory costs.

     o  Driving costs down using competitive pressures, by introducing
        multiple network equipment vendors.
***************
*** 234,244 ****
     minimizes software issue-related failures.

     An important aspect of Operational Expenditure (OPEX) minimization is
!    reducing size of failure domains in the network.  Ethernet networks
     are known to be susceptible to broadcast or unicast traffic storms
     that can have a dramatic impact on network performance and
     availability.  The use of a fully routed design significantly reduces
!    the size of the data plane failure domains - i.e. limits them to the
     lowest level in the network hierarchy.  However, such designs
     introduce the problem of distributed control plane failures.  This
     observation calls for simpler and less control plane protocols to
--- 234,244 ----
     minimizes software issue-related failures.

     An important aspect of Operational Expenditure (OPEX) minimization is
!    reducing the size of failure domains in the network.  Ethernet networks
     are known to be susceptible to broadcast or unicast traffic storms
     that can have a dramatic impact on network performance and
     availability.  The use of a fully routed design significantly reduces
!    the size of the data plane failure domains, i.e., limits them to the
     lowest level in the network hierarchy.  However, such designs
     introduce the problem of distributed control plane failures.  This
     observation calls for simpler and less control plane protocols to
***************
*** 253,259 ****
     performed by network devices.  Traditionally, load balancers are
     deployed as dedicated devices in the traffic forwarding path.  The
     problem arises in scaling load balancers under growing traffic
!    demand.  A preferable solution would be able to scale load balancing
     layer horizontally, by adding more of the uniform nodes and
     distributing incoming traffic across these nodes.  In situations like
     this, an ideal choice would be to use network infrastructure itself
--- 253,259 ----
     performed by network devices.  Traditionally, load balancers are
     deployed as dedicated devices in the traffic forwarding path.  The
     problem arises in scaling load balancers under growing traffic
!    demand.  A preferable solution would be able to scale the load balancing
     layer horizontally, by adding more of the uniform nodes and
     distributing incoming traffic across these nodes.  In situations like
     this, an ideal choice would be to use network infrastructure itself
***************
*** 305,311 ****
  3.1.  Traditional DC Topology

     In the networking industry, a common design choice for data centers
!    typically look like a (upside down) tree with redundant uplinks and
     three layers of hierarchy namely; core, aggregation/distribution and
     access layers (see Figure 1).  To accommodate bandwidth demands, each
     higher layer, from server towards DC egress or WAN, has higher port
--- 305,311 ----
  3.1.  Traditional DC Topology

     In the networking industry, a common design choice for data centers
!    typically look like an (upside down) tree with redundant uplinks and
     three layers of hierarchy namely; core, aggregation/distribution and
     access layers (see Figure 1).  To accommodate bandwidth demands, each
     higher layer, from server towards DC egress or WAN, has higher port
***************
*** 373,379 ****
     topology, sometimes called "fat-tree" (see, for example, [INTERCON]
     and [ALFARES2008]).  This topology features an odd number of stages
     (sometimes known as dimensions) and is commonly made of uniform
!    elements, e.g. network switches with the same port count.  Therefore,
     the choice of folded Clos topology satisfies REQ1 and facilitates
     REQ2.  See Figure 2 below for an example of a folded 3-stage Clos
     topology (3 stages counting Tier-2 stage twice, when tracing a packet
--- 373,379 ----
     topology, sometimes called "fat-tree" (see, for example, [INTERCON]
     and [ALFARES2008]).  This topology features an odd number of stages
     (sometimes known as dimensions) and is commonly made of uniform
!    elements, e.g., network switches with the same port count.  Therefore,
     the choice of folded Clos topology satisfies REQ1 and facilitates
     REQ2.  See Figure 2 below for an example of a folded 3-stage Clos
     topology (3 stages counting Tier-2 stage twice, when tracing a packet
***************
*** 460,466 ****
  3.2.3.  Scaling the Clos topology

     A Clos topology can be scaled either by increasing network element
!    port density or adding more stages, e.g. moving to a 5-stage Clos, as
     illustrated in Figure 3 below:

                                        Tier-1
--- 460,466 ----
  3.2.3.  Scaling the Clos topology

     A Clos topology can be scaled either by increasing network element
!    port density or adding more stages, e.g., moving to a 5-stage Clos, as
     illustrated in Figure 3 below:

                                        Tier-1
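
As a rough illustration of the scaling arithmetic above, the following
Python sketch computes maximum server counts for 3-stage and 5-stage
folded Clos fabrics built from identical P-port switches, assuming no
oversubscription. The formulas follow [ALFARES2008]; the code itself is
illustrative and not part of the draft.

    # Server capacity of folded Clos ("fat-tree") fabrics built from
    # identical P-port switches, with no oversubscription: each leaf
    # splits its ports evenly between servers and fabric uplinks.
    def clos_servers(p: int, stages: int) -> int:
        if stages == 3:
            return p * p // 2    # P leaves, P/2 server ports each
        if stages == 5:
            return p ** 3 // 4   # P pods of P/2 leaves each
        raise ValueError("sketch covers 3- and 5-stage fabrics only")

    # For example, with 64-port devices:
    assert clos_servers(64, 3) == 2048
    assert clos_servers(64, 5) == 65536
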
***************
*** 523,529 ****
  3.2.4.  Managing the Size of Clos Topology Tiers

     If a data center network size is small, it is possible to reduce the
!    number of switches in Tier-1 or Tier-2 of Clos topology by a factor
     of two.  To understand how this could be done, take Tier-1 as an
     example.  Every Tier-2 device connects to a single group of Tier-1
     devices.  If half of the ports on each of the Tier-1 devices are not
--- 523,529 ----
  3.2.4.  Managing the Size of Clos Topology Tiers

     If a data center network size is small, it is possible to reduce the
!    number of switches in Tier-1 or Tier-2 of a Clos topology by a factor
     of two.  To understand how this could be done, take Tier-1 as an
     example.  Every Tier-2 device connects to a single group of Tier-1
     devices.  If half of the ports on each of the Tier-1 devices are not
***************
*** 574,580 ****
     originally defined in [IEEE8021D-1990] for loop free topology
     creation, typically utilizing variants of the traditional DC topology
     described in Section 3.1.  At the time, many DC switches either did
!    not support Layer 3 routed protocols or supported it with additional
     licensing fees, which played a part in the design choice.  Although
     many enhancements have been made through the introduction of Rapid
     Spanning Tree Protocol (RSTP) in the latest revision of
--- 574,580 ----
     originally defined in [IEEE8021D-1990] for loop free topology
     creation, typically utilizing variants of the traditional DC topology
     described in Section 3.1.  At the time, many DC switches either did
!    not support Layer 3 routing protocols or supported them with additional
     licensing fees, which played a part in the design choice.  Although
     many enhancements have been made through the introduction of Rapid
     Spanning Tree Protocol (RSTP) in the latest revision of
***************
*** 599,605 ****
     as the backup for loop prevention.  The major downsides of this
     approach are the lack of ability to scale linearly past two in most
     implementations, lack of standards based implementations, and added
!    failure domain risk of keeping state between the devices.

     It should be noted that building large, horizontally scalable, Layer
     2 only networks without STP is possible recently through the
--- 599,605 ----
     as the backup for loop prevention.  The major downsides of this
     approach are the lack of ability to scale linearly past two in most
     implementations, lack of standards based implementations, and added
!    the failure domain risk of syncing state between the devices.

     It should be noted that building large, horizontally scalable, Layer
     2 only networks without STP is possible recently through the
***************
*** 621,631 ****
     Finally, neither the base TRILL specification nor the M-LAG approach
     totally eliminate the problem of the shared broadcast domain, that is
     so detrimental to the operations of any Layer 2, Ethernet based
!    solutions.  Later TRILL extensions have been proposed to solve the
     this problem statement primarily based on the approaches outlined in
     [RFC7067], but this even further limits the number of available
!    interoperable implementations that can be used to build a fabric,
!    therefore TRILL based designs have issues meeting REQ2, REQ3, and
     REQ4.

  4.2.  Hybrid L2/L3 Designs
--- 621,631 ----
     Finally, neither the base TRILL specification nor the M-LAG approach
     totally eliminate the problem of the shared broadcast domain, that is
     so detrimental to the operations of any Layer 2, Ethernet based
!    solution.  Later TRILL extensions have been proposed to solve the
     this problem statement primarily based on the approaches outlined in
     [RFC7067], but this even further limits the number of available
!    interoperable implementations that can be used to build a fabric.
!    Therefore, TRILL based designs have issues meeting REQ2, REQ3, and
     REQ4.

  4.2.  Hybrid L2/L3 Designs
***************
*** 635,641 ****
     in either the Tier-1 or Tier-2 parts of the network and dividing the
     Layer 2 domain into numerous, smaller domains.  This design has
     allowed data centers to scale up, but at the cost of complexity in
!    the network managing multiple protocols.  For the following reasons,
     operators have retained Layer 2 in either the access (Tier-3) or both
     access and aggregation (Tier-3 and Tier-2) parts of the network:

--- 635,641 ----
     in either the Tier-1 or Tier-2 parts of the network and dividing the
     Layer 2 domain into numerous, smaller domains.  This design has
     allowed data centers to scale up, but at the cost of complexity in
!    managing multiple network protocols.  For the following reasons,
     operators have retained Layer 2 in either the access (Tier-3) or both
     access and aggregation (Tier-3 and Tier-2) parts of the network:

***************
*** 644,650 ****

     o  Seamless mobility for virtual machines that require the
        preservation of IP addresses when a virtual machine moves to
!       different Tier-3 switch.

     o  Simplified IP addressing = less IP subnets are required for the
        data center.
--- 644,650 ----

     o  Seamless mobility for virtual machines that require the
        preservation of IP addresses when a virtual machine moves to
!       a different Tier-3 switch.

     o  Simplified IP addressing = less IP subnets are required for the
        data center.
***************
*** 679,686 ****
     adoption in networks where large Layer 2 adjacency and larger size
     Layer 3 subnets are not as critical compared to network scalability
     and stability.  Application providers and network operators continue
!    to also develop new solutions to meet some of the requirements that
!    previously have driven large Layer 2 domains by using various overlay
     or tunneling techniques.

  5.  Routing Protocol Selection and Design
--- 679,686 ----
     adoption in networks where large Layer 2 adjacency and larger size
     Layer 3 subnets are not as critical compared to network scalability
     and stability.  Application providers and network operators continue
!    to develop new solutions to meet some of the requirements that
!    previously had driven large Layer 2 domains using various overlay
     or tunneling techniques.

  5.  Routing Protocol Selection and Design
***************
*** 700,706 ****
     design.

     Although EBGP is the protocol used for almost all inter-domain
!    routing on the Internet and has wide support from both vendor and
     service provider communities, it is not generally deployed as the
     primary routing protocol within the data center for a number of
     reasons (some of which are interrelated):
--- 700,706 ----
     design.

     Although EBGP is the protocol used for almost all inter-domain
!    routing in the Internet and has wide support from both vendor and
     service provider communities, it is not generally deployed as the
     primary routing protocol within the data center for a number of
     reasons (some of which are interrelated):
***************
*** 741,754 ****
        state IGPs.  Since every BGP router calculates and propagates only
        the best-path selected, a network failure is masked as soon as the
        BGP speaker finds an alternate path, which exists when highly
!       symmetric topologies, such as Clos, are coupled with EBGP only
        design.  In contrast, the event propagation scope of a link-state
        IGP is an entire area, regardless of the failure type.  In this
        way, BGP better meets REQ3 and REQ4.  It is also worth mentioning
        that all widely deployed link-state IGPs feature periodic
!       refreshes of routing information, even if this rarely causes
!       impact to modern router control planes, while BGP does not expire
!       routing state.

     o  BGP supports third-party (recursively resolved) next-hops.  This
        allows for manipulating multipath to be non-ECMP based or
--- 741,754 ----
        state IGPs.  Since every BGP router calculates and propagates only
        the best-path selected, a network failure is masked as soon as the
        BGP speaker finds an alternate path, which exists when highly
!       symmetric topologies, such as Clos, are coupled with an EBGP only
        design.  In contrast, the event propagation scope of a link-state
        IGP is an entire area, regardless of the failure type.  In this
        way, BGP better meets REQ3 and REQ4.  It is also worth mentioning
        that all widely deployed link-state IGPs feature periodic
!       refreshes of routing information while BGP does not expire
!       routing state, although this rarely impacts modern router control
!       planes.

     o  BGP supports third-party (recursively resolved) next-hops.  This
        allows for manipulating multipath to be non-ECMP based or
***************
*** 765,775 ****
        controlled and complex unwanted paths will be ignored.  See
        Section 5.2 for an example of a working ASN allocation scheme.  In
        a link-state IGP accomplishing the same goal would require multi-
!       (instance/topology/processes) support, typically not available in
        all DC devices and quite complex to configure and troubleshoot.
        Using a traditional single flooding domain, which most DC designs
        utilize, under certain failure conditions may pick up unwanted
!       lengthy paths, e.g. traversing multiple Tier-2 devices.

     o  EBGP configuration that is implemented with minimal routing policy
        is easier to troubleshoot for network reachability issues.  In
--- 765,775 ----
        controlled and complex unwanted paths will be ignored.  See
        Section 5.2 for an example of a working ASN allocation scheme.  In
        a link-state IGP accomplishing the same goal would require multi-
!       (instance/topology/process) support, typically not available in
        all DC devices and quite complex to configure and troubleshoot.
        Using a traditional single flooding domain, which most DC designs
        utilize, under certain failure conditions may pick up unwanted
!       lengthy paths, e.g., traversing multiple Tier-2 devices.

     o  EBGP configuration that is implemented with minimal routing policy
        is easier to troubleshoot for network reachability issues.  In
***************
*** 806,812 ****
        loopback sessions are used even in the case of multiple links
        between the same pair of nodes.

!    o  Private Use ASNs from the range 64512-65534 are used so as to
        avoid ASN conflicts.

     o  A single ASN is allocated to all of the Clos topology's Tier-1
--- 806,812 ----
        loopback sessions are used even in the case of multiple links
        between the same pair of nodes.

!    o  Private Use ASNs from the range 64512-65534 are used to
        avoid ASN conflicts.

     o  A single ASN is allocated to all of the Clos topology's Tier-1
***************
*** 815,821 ****
     o  A unique ASN is allocated to each set of Tier-2 devices in the
        same cluster.

!    o  A unique ASN is allocated to every Tier-3 device (e.g.  ToR) in
        this topology.


--- 815,821 ----
     o  A unique ASN is allocated to each set of Tier-2 devices in the
        same cluster.

!    o  A unique ASN is allocated to every Tier-3 device (e.g., ToR) in
        this topology.
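
To make the ASN scheme above concrete, here is a minimal Python sketch;
the base values below are hypothetical, and only the scheme itself (a
shared Tier-1 ASN, one ASN per Tier-2 cluster, one ASN per Tier-3
device, all from the Private Use range) comes from the draft.

    # Sketch of the ASN numbering scheme described above, drawn from
    # the 16-bit Private Use range 64512-65534 [RFC6996].
    PRIVATE_ASN_MIN, PRIVATE_ASN_MAX = 64512, 65534
    TIER1_ASN = 64512        # single ASN shared by all Tier-1 devices
    TIER2_BASE = 64600       # one ASN per cluster of Tier-2 devices
    TIER3_BASE = 64700       # one ASN per Tier-3 (ToR) device

    def tier2_asn(cluster: int) -> int:
        return TIER2_BASE + cluster

    def tier3_asn(tor: int) -> int:
        asn = TIER3_BASE + tor
        if asn > PRIVATE_ASN_MAX:
            # The range exhaustion problem discussed in the next hunk.
            raise ValueError("16-bit Private Use ASN range exhausted")
        return asn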


***************
*** 903,922 ****

     Another solution to this problem would be using Four-Octet ASNs
     ([RFC6793]), where there are additional Private Use ASNs available,
!    see [IANA.AS].  Use of Four-Octet ASNs put additional protocol
!    complexity in the BGP implementation so should be considered against
     the complexity of re-use when considering REQ3 and REQ4.  Perhaps
     more importantly, they are not yet supported by all BGP
     implementations, which may limit vendor selection of DC equipment.
!    When supported, ensure that implementations in use are able to remove
!    the Private Use ASNs if required for external connectivity
!    (Section 5.2.4).

  5.2.3.  Prefix Advertisement

     A Clos topology features a large number of point-to-point links and
     associated prefixes.  Advertising all of these routes into BGP may
!    create FIB overload conditions in the network devices.  Advertising
     these links also puts additional path computation stress on the BGP
     control plane for little benefit.  There are two possible solutions:

--- 903,922 ----

     Another solution to this problem would be using Four-Octet ASNs
     ([RFC6793]), where there are additional Private Use ASNs available,
!    see [IANA.AS].  Use of Four-Octet ASNs puts additional protocol
!    complexity in the BGP implementation and should be balanced against
     the complexity of re-use when considering REQ3 and REQ4.  Perhaps
     more importantly, they are not yet supported by all BGP
     implementations, which may limit vendor selection of DC equipment.
!    When supported, ensure that deployed implementations are able to remove
!    the Private Use ASNs when external connectivity to these ASes is
!    required (Section 5.2.4).

  5.2.3.  Prefix Advertisement

     A Clos topology features a large number of point-to-point links and
     associated prefixes.  Advertising all of these routes into BGP may
!    create FIB overload in the network devices.  Advertising
     these links also puts additional path computation stress on the BGP
     control plane for little benefit.  There are two possible solutions:

***************
*** 925,951 ****
        device, distant networks will automatically be reachable via the
        advertising EBGP peer and do not require reachability to these
        prefixes.  However, this may complicate operations or monitoring:
!       e.g. using the popular "traceroute" tool will display IP addresses
        that are not reachable.

     o  Advertise point-to-point links, but summarize them on every
        device.  This requires an address allocation scheme such as
        allocating a consecutive block of IP addresses per Tier-1 and
        Tier-2 device to be used for point-to-point interface addressing
!       to the lower layers (Tier-2 uplinks will be numbered out of Tier-1
!       addressing and so forth).

     Server subnets on Tier-3 devices must be announced into BGP without
     using route summarization on Tier-2 and Tier-1 devices.  Summarizing
     subnets in a Clos topology results in route black-holing under a
!    single link failure (e.g. between Tier-2 and Tier-3 devices) and
     hence must be avoided.  The use of peer links within the same tier to
     resolve the black-holing problem by providing "bypass paths" is
     undesirable due to O(N^2) complexity of the peering mesh and waste of
     ports on the devices.  An alternative to the full-mesh of peer-links
!    would be using a simpler bypass topology, e.g. a "ring" as described
     in [FB4POST], but such a topology adds extra hops and has very
!    limited bisection bandwidth, in addition requiring special tweaks to



--- 925,951 ----
        device, distant networks will automatically be reachable via the
        advertising EBGP peer and do not require reachability to these
        prefixes.  However, this may complicate operations or monitoring:
!       e.g., using the popular "traceroute" tool will display IP addresses
        that are not reachable.

     o  Advertise point-to-point links, but summarize them on every
        device.  This requires an address allocation scheme such as
        allocating a consecutive block of IP addresses per Tier-1 and
        Tier-2 device to be used for point-to-point interface addressing
!       to the lower layers (Tier-2 uplink addresses will be allocated
!       from Tier-1 address blocks and so forth).

     Server subnets on Tier-3 devices must be announced into BGP without
     using route summarization on Tier-2 and Tier-1 devices.  Summarizing
     subnets in a Clos topology results in route black-holing under a
!    single link failure (e.g., between Tier-2 and Tier-3 devices) and
     hence must be avoided.  The use of peer links within the same tier to
     resolve the black-holing problem by providing "bypass paths" is
     undesirable due to O(N^2) complexity of the peering mesh and waste of
     ports on the devices.  An alternative to the full-mesh of peer-links
!    would be using a simpler bypass topology, e.g., a "ring" as described
     in [FB4POST], but such a topology adds extra hops and has very
!    limited bisectional bandwidth. Additionally requiring special tweaks to



***************
*** 956,963 ****

     make BGP routing work - such as possibly splitting every device into
     an ASN on its own.  Later in this document, Section 8.2 introduces a
!    less intrusive method for performing a limited form route
!    summarization in Clos networks and discusses it's associated trade-
     offs.

  5.2.4.  External Connectivity
--- 956,963 ----

     make BGP routing work - such as possibly splitting every device into
     an ASN on its own.  Later in this document, Section 8.2 introduces a
!    less intrusive method for performing a limited form of route
!    summarization in Clos networks and discusses its associated trade-
     offs.

  5.2.4.  External Connectivity
***************
*** 972,985 ****
     document.  These devices have to perform a few special functions:

     o  Hide network topology information when advertising paths to WAN
!       routers, i.e. remove Private Use ASNs [RFC6996] from the AS_PATH
        attribute.  This is typically done to avoid ASN number collisions
        between different data centers and also to provide a uniform
        AS_PATH length to the WAN for purposes of WAN ECMP to Anycast
        prefixes originated in the topology.  An implementation specific
        BGP feature typically called "Remove Private AS" is commonly used
        to accomplish this.  Depending on implementation, the feature
!       should strip a contiguous sequence of Private Use ASNs found in
        AS_PATH attribute prior to advertising the path to a neighbor.
        This assumes that all ASNs used for intra data center numbering
        are from the Private Use ranges.  The process for stripping the
--- 972,985 ----
     document.  These devices have to perform a few special functions:

     o  Hide network topology information when advertising paths to WAN
!       routers, i.e., remove Private Use ASNs [RFC6996] from the AS_PATH
        attribute.  This is typically done to avoid ASN number collisions
        between different data centers and also to provide a uniform
        AS_PATH length to the WAN for purposes of WAN ECMP to Anycast
        prefixes originated in the topology.  An implementation specific
        BGP feature typically called "Remove Private AS" is commonly used
        to accomplish this.  Depending on implementation, the feature
!       should strip a contiguous sequence of Private Use ASNs found in an
        AS_PATH attribute prior to advertising the path to a neighbor.
        This assumes that all ASNs used for intra data center numbering
        are from the Private Use ranges.  The process for stripping the
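
A minimal sketch of the "Remove Private AS" stripping behavior
described above, assuming the AS_PATH is a plain sequence of ASNs (real
implementations also handle AS_SETs and other corner cases):

    # Strip the contiguous leading run of Private Use ASNs [RFC6996]
    # from an AS_PATH before advertising the path to a WAN neighbor.
    def is_private(asn: int) -> bool:
        return 64512 <= asn <= 65534 or 4200000000 <= asn <= 4294967294

    def remove_private_as(as_path: list) -> list:
        i = 0
        while i < len(as_path) and is_private(as_path[i]):
            i += 1
        return as_path[i:]

    # A path that crossed three private-ASN tiers, then a public AS:
    assert remove_private_as([65021, 64601, 64512, 174]) == [174]
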
***************
*** 998,1005 ****
        to the WAN Routers upstream, to provide resistance to a single-
        link failure causing the black-holing of traffic.  To prevent
        black-holing in the situation when all of the EBGP sessions to the
!       WAN routers fail simultaneously on a given device it is more
!       desirable to take the "relaying" approach rather than introducing
        the default route via complicated conditional route origination
        schemes provided by some implementations [CONDITIONALROUTE].

--- 998,1005 ----
        to the WAN Routers upstream, to provide resistance to a single-
        link failure causing the black-holing of traffic.  To prevent
        black-holing in the situation when all of the EBGP sessions to the
!       WAN routers fail simultaneously on a given device, it is more
!       desirable to readvertise the default route rather than originating
        the default route via complicated conditional route origination
        schemes provided by some implementations [CONDITIONALROUTE].

***************
*** 1017,1023 ****
     prefixes originated from within the data center in a fully routed
     network design.  For example, a network with 2000 Tier-3 devices will
     have at least 2000 servers subnets advertised into BGP, along with
!    the infrastructure or other prefixes.  However, as discussed before,
     the proposed network design does not allow for route summarization
     due to the lack of peer links inside every tier.

--- 1017,1023 ----
     prefixes originated from within the data center in a fully routed
     network design.  For example, a network with 2000 Tier-3 devices will
     have at least 2000 servers subnets advertised into BGP, along with
!    the infrastructure and link prefixes.  However, as discussed before,
     the proposed network design does not allow for route summarization
     due to the lack of peer links inside every tier.

***************
*** 1028,1037 ****
     o  Interconnect the Border Routers using a full-mesh of physical
        links or using any other "peer-mesh" topology, such as ring or
        hub-and-spoke.  Configure BGP accordingly on all Border Leafs to
!       exchange network reachability information - e.g. by adding a mesh
        of IBGP sessions.  The interconnecting peer links need to be
        appropriately sized for traffic that will be present in the case
!       of a device or link failure underneath the Border Routers.

     o  Tier-1 devices may have additional physical links provisioned
        toward the Border Routers (which are Tier-2 devices from the
--- 1028,1037 ----
     o  Interconnect the Border Routers using a full-mesh of physical
        links or using any other "peer-mesh" topology, such as ring or
        hub-and-spoke.  Configure BGP accordingly on all Border Leafs to
!       exchange network reachability information, e.g., by adding a mesh
        of IBGP sessions.  The interconnecting peer links need to be
        appropriately sized for traffic that will be present in the case
!       of a device or link failure in the mesh connecting the Border Routers.

     o  Tier-1 devices may have additional physical links provisioned
        toward the Border Routers (which are Tier-2 devices from the
***************
*** 1043,1049 ****
        device compared with the other devices in the Clos.  This also
        reduces the number of ports available to "regular" Tier-2 switches
        and hence the number of clusters that could be interconnected via
!       Tier-1 layer.

     If any of the above options are implemented, it is possible to
     perform route summarization at the Border Routers toward the WAN
--- 1043,1049 ----
        device compared with the other devices in the Clos.  This also
        reduces the number of ports available to "regular" Tier-2 switches
        and hence the number of clusters that could be interconnected via
!       the Tier-1 layer.

     If any of the above options are implemented, it is possible to
     perform route summarization at the Border Routers toward the WAN
***************
*** 1071,1079 ****
     ECMP is the fundamental load sharing mechanism used by a Clos
     topology.  Effectively, every lower-tier device will use all of its
     directly attached upper-tier devices to load share traffic destined
!    to the same IP prefix.  Number of ECMP paths between any two Tier-3
     devices in Clos topology equals to the number of the devices in the
!    middle stage (Tier-1).  For example, Figure 5 illustrates the
     topology where Tier-3 device A has four paths to reach servers X and
     Y, via Tier-2 devices B and C and then Tier-1 devices 1, 2, 3, and 4
     respectively.
--- 1071,1079 ----
     ECMP is the fundamental load sharing mechanism used by a Clos
     topology.  Effectively, every lower-tier device will use all of its
     directly attached upper-tier devices to load share traffic destined
!    to the same IP prefix.  The number of ECMP paths between any two Tier-3
     devices in Clos topology equals to the number of the devices in the
!    middle stage (Tier-1).  For example, Figure 5 illustrates a
     topology where Tier-3 device A has four paths to reach servers X and
     Y, via Tier-2 devices B and C and then Tier-1 devices 1, 2, 3, and 4
     respectively.
***************
*** 1105,1116 ****

     The ECMP requirement implies that the BGP implementation must support
     multipath fan-out for up to the maximum number of devices directly
!    attached at any point in the topology in upstream or downstream
     direction.  Normally, this number does not exceed half of the ports
     found on a device in the topology.  For example, an ECMP fan-out of
     32 would be required when building a Clos network using 64-port
     devices.  The Border Routers may need to have wider fan-out to be
!    able to connect to multitude of Tier-1 devices if route summarization
     at Border Router level is implemented as described in Section 5.2.5.
     If a device's hardware does not support wider ECMP, logical link-
     grouping (link-aggregation at layer 2) could be used to provide
--- 1105,1116 ----

     The ECMP requirement implies that the BGP implementation must support
     multipath fan-out for up to the maximum number of devices directly
!    attached at any point in the topology in the upstream or downstream
     direction.  Normally, this number does not exceed half of the ports
     found on a device in the topology.  For example, an ECMP fan-out of
     32 would be required when building a Clos network using 64-port
     devices.  The Border Routers may need to have wider fan-out to be
!    able to connect to a multitude of Tier-1 devices if route summarization
     at Border Router level is implemented as described in Section 5.2.5.
     If a device's hardware does not support wider ECMP, logical link-
     grouping (link-aggregation at layer 2) could be used to provide
***************
*** 1122,1131 ****
  Internet-Draft    draft-ietf-rtgwg-bgp-routing-large-dc       March 2016


!    "hierarchical" ECMP (Layer 3 ECMP followed by Layer 2 ECMP) to
     compensate for fan-out limitations.  Such approach, however,
     increases the risk of flow polarization, as less entropy will be
!    available to the second stage of ECMP.

     Most BGP implementations declare paths to be equal from an ECMP
     perspective if they match up to and including step (e) in
--- 1122,1131 ----
  Internet-Draft    draft-ietf-rtgwg-bgp-routing-large-dc       March 2016


!    "hierarchical" ECMP (Layer 3 ECMP coupled with Layer 2 ECMP) to
     compensate for fan-out limitations.  Such approach, however,
     increases the risk of flow polarization, as less entropy will be
!    available at the second stage of ECMP.

     Most BGP implementations declare paths to be equal from an ECMP
     perspective if they match up to and including step (e) in
***************
*** 1148,1154 ****
     perspective of other devices, such a prefix would have BGP paths with
     different AS_PATH attribute values, while having the same AS_PATH
     attribute lengths.  Therefore, BGP implementations must support load
!    sharing over above-mentioned paths.  This feature is sometimes known
     as "multipath relax" or "multipath multiple-as" and effectively
     allows for ECMP to be done across different neighboring ASNs if all
     other attributes are equal as already described in the previous
--- 1148,1154 ----
     perspective of other devices, such a prefix would have BGP paths with
     different AS_PATH attribute values, while having the same AS_PATH
     attribute lengths.  Therefore, BGP implementations must support load
!    sharing over the above-mentioned paths.  This feature is sometimes known
     as "multipath relax" or "multipath multiple-as" and effectively
     allows for ECMP to be done across different neighboring ASNs if all
     other attributes are equal as already described in the previous
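
The "multipath relax" comparison can be sketched as follows; the
attribute names are illustrative and the decision process is reduced to
the tie-breaks through step (e) of [RFC4271], comparing only the
AS_PATH length, never its content:

    from dataclasses import dataclass

    @dataclass
    class Path:
        as_path: tuple       # sequence of ASNs
        local_pref: int = 100
        origin: int = 0      # IGP < EGP < INCOMPLETE
        med: int = 0
        igp_cost: int = 0

    def ecmp_eligible(a: Path, b: Path) -> bool:
        # "Multipath relax": AS_PATH length matters, content does not.
        key = lambda p: (p.local_pref, len(p.as_path), p.origin,
                         p.med, p.igp_cost)
        return key(a) == key(b)

    # Equal-length paths via different neighboring ASNs still qualify:
    assert ecmp_eligible(Path(as_path=(64601, 65021)),
                         Path(as_path=(64602, 65021)))
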
***************
*** 1182,1199 ****

     It is often desirable to have the hashing function used for ECMP to
     be consistent (see [CONS-HASH]), to minimize the impact on flow to
!    next-hop affinity changes when a next-hop is added or removed to ECMP
     group.  This could be used if the network device is used as a load
     balancer, mapping flows toward multiple destinations - in this case,
!    losing or adding a destination will not have detrimental effect of
     currently established flows.  One particular recommendation on
     implementing consistent hashing is provided in [RFC2992], though
     other implementations are possible.  This functionality could be
     naturally combined with weighted ECMP, with the impact of the next-
     hop changes being proportional to the weight of the given next-hop.
     The downside of consistent hashing is increased load on hardware
!    resource utilization, as typically more space is required to
!    implement a consistent-hashing region.

  7.  Routing Convergence Properties

--- 1182,1199 ----

     It is often desirable to have the hashing function used for ECMP to
     be consistent (see [CONS-HASH]), to minimize the impact on flow to
!    next-hop affinity changes when a next-hop is added or removed to an ECMP
     group.  This could be used if the network device is used as a load
     balancer, mapping flows toward multiple destinations - in this case,
!    losing or adding a destination will not have a detrimental effect on
     currently established flows.  One particular recommendation on
     implementing consistent hashing is provided in [RFC2992], though
     other implementations are possible.  This functionality could be
     naturally combined with weighted ECMP, with the impact of the next-
     hop changes being proportional to the weight of the given next-hop.
     The downside of consistent hashing is increased load on hardware
!    resource utilization, as typically more resources (e.g., TCAM space)
!    are required to implement a consistent-hashing function.
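
To illustrate the consistent hashing property discussed above (in the
spirit of [RFC2992], though actual hardware implementations differ),
here is a small Python sketch of a hash ring over an ECMP group:

    import hashlib

    def _h(key: str) -> int:
        return int.from_bytes(hashlib.md5(key.encode()).digest()[:8],
                              "big")

    def build_ring(next_hops, vnodes=64):
        # Each next-hop owns many points on the ring for even spread.
        return sorted((_h("%s#%d" % (nh, i)), nh)
                      for nh in next_hops for i in range(vnodes))

    def pick(ring, flow: str):
        x = _h(flow)
        for point, nh in ring:
            if point >= x:
                return nh
        return ring[0][1]    # wrap around

    ring = build_ring(["nh1", "nh2", "nh3", "nh4"])
    before = {f: pick(ring, f) for f in ("flowA", "flowB", "flowC")}
    ring2 = build_ring(["nh1", "nh2", "nh3"])    # nh4 removed
    # Only flows that mapped to nh4 move; the rest keep their next-hop.
    for f, nh in before.items():
        if nh != "nh4":
            assert pick(ring2, f) == nh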

  7.  Routing Convergence Properties

***************
*** 1209,1224 ****
     driven mechanism to obtain updates on IGP state changes.  The
     proposed routing design does not use an IGP, so the remaining
     mechanisms that could be used for fault detection are BGP keep-alive
!    process (or any other type of keep-alive mechanism) and link-failure
     triggers.

     Relying solely on BGP keep-alive packets may result in high
!    convergence delays, in the order of multiple seconds (on many BGP
     implementations the minimum configurable BGP hold timer value is
     three seconds).  However, many BGP implementations can shut down
     local EBGP peering sessions in response to the "link down" event for
     the outgoing interface used for BGP peering.  This feature is
!    sometimes called as "fast fallover".  Since links in modern data
     centers are predominantly point-to-point fiber connections, a
     physical interface failure is often detected in milliseconds and
     subsequently triggers a BGP re-convergence.
--- 1209,1224 ----
     driven mechanism to obtain updates on IGP state changes.  The
     proposed routing design does not use an IGP, so the remaining
     mechanisms that could be used for fault detection are BGP keep-alive
!    time-out (or any other type of keep-alive mechanism) and link-failure
     triggers.

     Relying solely on BGP keep-alive packets may result in high
!    convergence delays, on the order of multiple seconds (on many BGP
     implementations the minimum configurable BGP hold timer value is
     three seconds).  However, many BGP implementations can shut down
     local EBGP peering sessions in response to the "link down" event for
     the outgoing interface used for BGP peering.  This feature is
!    sometimes called "fast fallover".  Since links in modern data
     centers are predominantly point-to-point fiber connections, a
     physical interface failure is often detected in milliseconds and
     subsequently triggers a BGP re-convergence.
***************
*** 1236,1242 ****

     Alternatively, some platforms may support Bidirectional Forwarding
     Detection (BFD) [RFC5880] to allow for sub-second failure detection
!    and fault signaling to the BGP process.  However, use of either of
     these presents additional requirements to vendor software and
     possibly hardware, and may contradict REQ1.  Until recently with
     [RFC7130], BFD also did not allow detection of a single member link
--- 1236,1242 ----

     Alternatively, some platforms may support Bidirectional Forwarding
     Detection (BFD) [RFC5880] to allow for sub-second failure detection
!    and fault signaling to the BGP process.  However, the use of either of
     these presents additional requirements to vendor software and
     possibly hardware, and may contradict REQ1.  Until recently with
     [RFC7130], BFD also did not allow detection of a single member link
***************
*** 1245,1251 ****

  7.2.  Event Propagation Timing

!    In the proposed design the impact of BGP Minimum Route Advertisement
     Interval (MRAI) timer (See section 9.2.1.1 of [RFC4271]) should be
     considered.  Per the standard it is required for BGP implementations
     to space out consecutive BGP UPDATE messages by at least MRAI
--- 1245,1251 ----

  7.2.  Event Propagation Timing

!    In the proposed design the impact of the BGP Minimum Route Advertisement
     Interval (MRAI) timer (See section 9.2.1.1 of [RFC4271]) should be
     considered.  Per the standard it is required for BGP implementations
     to space out consecutive BGP UPDATE messages by at least MRAI
***************
*** 1258,1270 ****
     In a Clos topology each EBGP speaker typically has either one path
     (Tier-2 devices don't accept paths from other Tier-2 in the same
     cluster due to same ASN) or N paths for the same prefix, where N is a
!    significantly large number, e.g.  N=32 (the ECMP fan-out to the next
     Tier).  Therefore, if a link fails to another device from which a
!    path is received there is either no backup path at all (e.g. from
     perspective of a Tier-2 switch losing link to a Tier-3 device), or
!    the backup is readily available in BGP Loc-RIB (e.g. from perspective
     of a Tier-2 device losing link to a Tier-1 switch).  In the former
!    case, the BGP withdrawal announcement will propagate un-delayed and
     trigger re-convergence on affected devices.  In the latter case, the
     best-path will be re-evaluated and the local ECMP group corresponding
     to the new next-hop set changed.  If the BGP path was the best-path
--- 1258,1270 ----
     In a Clos topology each EBGP speaker typically has either one path
     (Tier-2 devices don't accept paths from other Tier-2 in the same
     cluster due to same ASN) or N paths for the same prefix, where N is a
!    significantly large number, e.g., N=32 (the ECMP fan-out to the next
     Tier).  Therefore, if a link fails to another device from which a
!    path is received there is either no backup path at all (e.g., from the
     perspective of a Tier-2 switch losing link to a Tier-3 device), or
!    the backup is readily available in BGP Loc-RIB (e.g., from perspective
     of a Tier-2 device losing link to a Tier-1 switch).  In the former
!    case, the BGP withdrawal announcement will propagate without delay and
     trigger re-convergence on affected devices.  In the latter case, the
     best-path will be re-evaluated and the local ECMP group corresponding
     to the new next-hop set changed.  If the BGP path was the best-path
***************
*** 1279,1285 ****
     situation when a link between Tier-3 and Tier-2 device fails, the
     Tier-2 device will send BGP UPDATE messages to all upstream Tier-1
     devices, withdrawing the affected prefixes.  The Tier-1 devices, in
!    turn, will relay those messages to all downstream Tier-2 devices
     (except for the originator).  Tier-2 devices other than the one
     originating the UPDATE should then wait for ALL upstream Tier-1

--- 1279,1285 ----
     situation when a link between Tier-3 and Tier-2 device fails, the
     Tier-2 device will send BGP UPDATE messages to all upstream Tier-1
     devices, withdrawing the affected prefixes.  The Tier-1 devices, in
!    turn, will relay these messages to all downstream Tier-2 devices
     (except for the originator).  Tier-2 devices other than the one
     originating the UPDATE should then wait for ALL upstream Tier-1

***************
*** 1307,1313 ****
     features that vendors include to reduce the control plane impact of
     rapidly flapping prefixes.  However, due to issues described with
     false positives in these implementations especially under such
!    "dispersion" events, it is not recommended to turn this feature on in
     this design.  More background and issues with "route flap dampening"
     and possible implementation changes that could affect this are well
     described in [RFC7196].
--- 1307,1313 ----
     features that vendors include to reduce the control plane impact of
     rapidly flapping prefixes.  However, due to issues described with
     false positives in these implementations especially under such
!    "dispersion" events, it is not recommended to enable this feature in
     this design.  More background and issues with "route flap dampening"
     and possible implementation changes that could affect this are well
     described in [RFC7196].
***************
*** 1316,1324 ****

     A network is declared to converge in response to a failure once all
     devices within the failure impact scope are notified of the event and
!    have re-calculated their RIB's and consequently updated their FIB's.
     Larger failure impact scope typically means slower convergence since
!    more devices have to be notified, and additionally results in a less
     stable network.  In this section we describe BGP's advantages over
     link-state routing protocols in reducing failure impact scope for a
     Clos topology.
--- 1316,1324 ----

     A network is declared to converge in response to a failure once all
     devices within the failure impact scope are notified of the event and
!    have re-calculated their RIBs and consequently updated their FIBs.
     Larger failure impact scope typically means slower convergence since
!    more devices have to be notified, and results in a less
     stable network.  In this section we describe BGP's advantages over
     link-state routing protocols in reducing failure impact scope for a
     Clos topology.
***************
*** 1327,1335 ****
     the best path from the point of view of the local router is sent to
     neighbors.  As such, some failures are masked if the local node can
     immediately find a backup path and does not have to send any updates
!    further.  Notice that in the worst case ALL devices in a data center
     topology have to either withdraw a prefix completely or update the
!    ECMP groups in the FIB.  However, many failures will not result in
     such a wide impact.  There are two main failure types where impact
     scope is reduced:

--- 1327,1335 ----
     the best path from the point of view of the local router is sent to
     neighbors.  As such, some failures are masked if the local node can
     immediately find a backup path and does not have to send any updates
!    further.  Notice that in the worst case, all devices in a data center
     topology have to either withdraw a prefix completely or update the
!    ECMP groups in their FIBs.  However, many failures will not result in
     such a wide impact.  There are two main failure types where impact
     scope is reduced:

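The failure-masking property described in this hunk can be sketched as
follows; the sketch assumes a toy Loc-RIB keyed by peer, with invented
names and addresses:

    # If the Loc-RIB already holds a backup path, the failure is
    # handled locally and no UPDATE needs to be sent further.
    loc_rib = {"198.51.100.0/24": {"peer1": "10.0.0.1",
                                   "peer2": "10.0.0.2"}}
    fib = {}

    def on_link_failure(prefix, failed_peer):
        paths = loc_rib[prefix]
        paths.pop(failed_peer, None)
        if paths:
            # Backup path(s) available: update the local ECMP group only.
            fib[prefix] = sorted(paths.values())
            return []                      # nothing to advertise
        # No backup: the prefix must be withdrawn from all neighbors.
        fib.pop(prefix, None)
        return [("WITHDRAW", prefix)]

    print(on_link_failure("198.51.100.0/24", "peer1"))  # [] -- masked
    print(on_link_failure("198.51.100.0/24", "peer2"))  # [('WITHDRAW', ...)]
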
***************
*** 1357,1367 ****

     o  Failure of a Tier-1 device: In this case, all Tier-2 devices
        directly attached to the failed node will have to update their
!       ECMP groups for all IP prefixes from non-local cluster.  The
        Tier-3 devices are once again not involved in the re-convergence
        process, but may receive "implicit withdraws" as described above.

!    Even though in case of such failures multiple IP prefixes will have
     to be reprogrammed in the FIB, it is worth noting that ALL of these
     prefixes share a single ECMP group on Tier-2 device.  Therefore, in
     the case of implementations with a hierarchical FIB, only a single
--- 1357,1367 ----

     o  Failure of a Tier-1 device: In this case, all Tier-2 devices
        directly attached to the failed node will have to update their
!       ECMP groups for all IP prefixes from a non-local cluster.  The
        Tier-3 devices are once again not involved in the re-convergence
        process, but may receive "implicit withdraws" as described above.

!    Even though, in the case of such failures, multiple IP prefixes will have
     to be reprogrammed in the FIB, it is worth noting that ALL of these
     prefixes share a single ECMP group on Tier-2 device.  Therefore, in
     the case of implementations with a hierarchical FIB, only a single
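A minimal sketch of the hierarchical-FIB point above: many prefix
entries can reference one shared ECMP group object, so the failure of a
Tier-1 device is repaired by rewriting a single group. All names and
counts below are illustrative:

    # Many prefixes point at one shared ECMP group; one group update
    # repairs every dependent FIB entry at once.
    class EcmpGroup:
        def __init__(self, next_hops):
            self.next_hops = set(next_hops)

    # One shared group for all prefixes learned from a remote cluster.
    group = EcmpGroup({"10.1.1.1", "10.1.1.2", "10.1.1.3", "10.1.1.4"})
    fib = {f"203.0.113.{i}/32": group for i in range(100)}

    # Tier-1 device 10.1.1.3 fails: each FIB entry holds a pointer to
    # the same group object, so a single update fixes all 100 entries.
    group.next_hops.discard("10.1.1.3")

    assert all(e.next_hops == group.next_hops for e in fib.values())
    print(f"{len(fib)} prefixes repaired with one ECMP group update")
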
***************
*** 1375,1381 ****
     possible with the proposed design, since using this technique may
     create routing black-holes as mentioned previously.  Therefore, the
     worst control plane failure impact scope is the network as a whole,
!    for instance in a case of a link failure between Tier-2 and Tier-3
     devices.  The amount of impacted prefixes in this case would be much
     less than in the case of a failure in the upper layers of a Clos
     network topology.  The property of having such large failure scope is
--- 1375,1381 ----
     possible with the proposed design, since using this technique may
     create routing black-holes as mentioned previously.  Therefore, the
     worst control plane failure impact scope is the network as a whole,
!    for instance in the case of a link failure between Tier-2 and Tier-3
     devices.  The amount of impacted prefixes in this case would be much
     less than in the case of a failure in the upper layers of a Clos
     network topology.  The property of having such large failure scope is
***************
*** 1384,1397 ****

  7.5.  Routing Micro-Loops

!    When a downstream device, e.g.  Tier-2 device, loses all paths for a
     prefix, it normally has the default route pointing toward the
     upstream device, in this case the Tier-1 device.  As a result, it is
!    possible to get in the situation when Tier-2 switch loses a prefix,
!    but Tier-1 switch still has the path pointing to the Tier-2 device,
!    which results in transient micro-loop, since Tier-1 switch will keep
     passing packets to the affected prefix back to Tier-2 device, and
!    Tier-2 will bounce it back again using the default route.  This
     micro-loop will last for the duration of time it takes the upstream
     device to fully update its forwarding tables.

--- 1384,1397 ----

  7.5.  Routing Micro-Loops

!    When a downstream device, e.g., a Tier-2 device, loses all paths for a
     prefix, it normally has the default route pointing toward the
     upstream device, in this case the Tier-1 device.  As a result, it is
!    possible to get in the situation where a Tier-2 switch loses a prefix,
!    but a Tier-1 switch still has the path pointing to the Tier-2 device,
!    which results in a transient micro-loop, since the Tier-1 switch
!    will keep
     passing packets to the affected prefix back to Tier-2 device, and
!    the Tier-2 device will bounce it back using the default route.  This
     micro-loop will last for the duration of time it takes the upstream
     device to fully update its forwarding tables.

***************
*** 1402,1408 ****
  Internet-Draft    draft-ietf-rtgwg-bgp-routing-large-dc       March 2016


!    To minimize impact of the micro-loops, Tier-2 and Tier-1 switches can
     be configured with static "discard" or "null" routes that will be
     more specific than the default route for prefixes missing during
     network convergence.  For Tier-2 switches, the discard route should
--- 1402,1408 ----
  Internet-Draft    draft-ietf-rtgwg-bgp-routing-large-dc       March 2016


!    To minimize the impact of such micro-loops, Tier-2 and Tier-1
!    switches can
     be configured with static "discard" or "null" routes that will be
     more specific than the default route for prefixes missing during
     network convergence.  For Tier-2 switches, the discard route should
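
A small longest-prefix-match model shows why a static discard route
more specific than the default stops the bounce-back: once the affected
prefix disappears during convergence, traffic matches the discard route
rather than the default. The addresses below are invented:

    # Toy longest-prefix-match lookup over a tiny RIB.
    import ipaddress

    rib = {
        ipaddress.ip_network("0.0.0.0/0"): "via-tier1",      # default
        ipaddress.ip_network("10.128.0.0/9"): "discard",     # static null
        ipaddress.ip_network("10.128.10.0/24"): "via-tier3", # live subnet
    }

    def lookup(addr):
        dst = ipaddress.ip_address(addr)
        matches = [n for n in rib if dst in n]
        return rib[max(matches, key=lambda n: n.prefixlen)]

    print(lookup("10.128.10.5"))   # via-tier3 (normal forwarding)
    del rib[ipaddress.ip_network("10.128.10.0/24")]  # prefix lost
    print(lookup("10.128.10.5"))   # discard -- not bounced via default
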
***************
*** 1417,1423 ****

  8.1.  Third-party Route Injection

!    BGP allows for a "third-party", i.e. directly attached, BGP speaker
     to inject routes anywhere in the network topology, meeting REQ5.
     This can be achieved by peering via a multihop BGP session with some
     or even all devices in the topology.  Furthermore, BGP diverse path
--- 1417,1423 ----

  8.1.  Third-party Route Injection

!    BGP allows for a "third-party", i.e., directly attached, BGP speaker
     to inject routes anywhere in the network topology, meeting REQ5.
     This can be achieved by peering via a multihop BGP session with some
     or even all devices in the topology.  Furthermore, BGP diverse path
***************
*** 1427,1433 ****
     implementation.  Unfortunately, in many implementations ADD-PATH has
     been found to only support IBGP properly due to the use cases it was
     originally optimized for, which limits the "third-party" peering to
!    IBGP only, if the feature is used.

     To implement route injection in the proposed design, a third-party
     BGP speaker may peer with Tier-3 and Tier-1 switches, injecting the
--- 1427,1433 ----
     implementation.  Unfortunately, in many implementations ADD-PATH has
     been found to only support IBGP properly due to the use cases it was
     originally optimized for, which limits the "third-party" peering to
!    IBGP only.

     To implement route injection in the proposed design, a third-party
     BGP speaker may peer with Tier-3 and Tier-1 switches, injecting the
***************
*** 1442,1453 ****
     As mentioned previously, route summarization is not possible within
     the proposed Clos topology since it makes the network susceptible to
     route black-holing under single link failures.  The main problem is
!    the limited number of redundant paths between network elements, e.g.
     there is only a single path between any pair of Tier-1 and Tier-3
     devices.  However, some operators may find route aggregation
     desirable to improve control plane stability.

!    If planning on using any technique to summarize within the topology
     modeling of the routing behavior and potential for black-holing
     should be done not only for single or multiple link failures, but

--- 1442,1453 ----
     As mentioned previously, route summarization is not possible within
     the proposed Clos topology since it makes the network susceptible to
     route black-holing under single link failures.  The main problem is
!    the limited number of redundant paths between network elements, e.g.,
     there is only a single path between any pair of Tier-1 and Tier-3
     devices.  However, some operators may find route aggregation
     desirable to improve control plane stability.

!    If any technique to summarize within the topology is planned,
     modeling of the routing behavior and potential for black-holing
     should be done not only for single or multiple link failures, but

***************
*** 1458,1468 ****
  Internet-Draft    draft-ietf-rtgwg-bgp-routing-large-dc       March 2016


!    also fiber pathway failures or optical domain failures if the
     topology extends beyond a physical location.  Simple modeling can be
     done by checking the reachability on devices doing summarization
     under the condition of a link or pathway failure between a set of
!    devices in every tier as well as to the WAN routers if external
     connectivity is present.

     Route summarization would be possible with a small modification to
--- 1458,1468 ----
  Internet-Draft    draft-ietf-rtgwg-bgp-routing-large-dc       March 2016


!    also for fiber pathway failures or optical domain failures when the
     topology extends beyond a physical location.  Simple modeling can be
     done by checking the reachability on devices doing summarization
     under the condition of a link or pathway failure between a set of
!    devices in every tier as well as to the WAN routers when external
     connectivity is present.

     Route summarization would be possible with a small modification to
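
The reachability modeling suggested in this hunk can be approximated
with a simple graph search: remove one or two links at a time and
verify that a summarizing device still reaches the WAN routers. The toy
topology below is made up and much smaller than the reference design:

    # Check reachability under single- and double-link failures.
    from itertools import combinations

    links = {("t3-1", "t2-1"), ("t3-1", "t2-2"), ("t2-1", "t1-1"),
             ("t2-2", "t1-1"), ("t1-1", "wan-1")}

    def reachable(src, dst, up_links):
        adj = {}
        for a, b in up_links:
            adj.setdefault(a, set()).add(b)
            adj.setdefault(b, set()).add(a)
        seen, stack = {src}, [src]
        while stack:
            node = stack.pop()
            if node == dst:
                return True
            for nxt in adj.get(node, ()):
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        return False

    for failed in links:                    # single-link failures
        if not reachable("t3-1", "wan-1", links - {failed}):
            print(f"black-hole risk if {failed} fails")

    for f1, f2 in combinations(links, 2):   # double-link failures
        if not reachable("t3-1", "wan-1", links - {f1, f2}):
            print(f"black-hole risk if {f1} and {f2} both fail")
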
***************
*** 1519,1544 ****
     cluster from Tier-2 devices since each of them has only a single path
     down to this prefix.  It would require dual-homed servers to
     accomplish that.  Also note that this design is only resilient to
!    single link failure.  It is possible for a double link failure to
     isolate a Tier-2 device from all paths toward a specific Tier-3
     device, thus causing a routing black-hole.

!    A result of the proposed topology modification would be reduction of
     Tier-1 devices port capacity.  This limits the maximum number of
     attached Tier-2 devices and therefore will limit the maximum DC
     network size.  A larger network would require different Tier-1
     devices that have higher port density to implement this change.

     Another problem is traffic re-balancing under link failures.  Since
!    three are two paths from Tier-1 to Tier-3, a failure of the link
     between Tier-1 and Tier-2 switch would result in all traffic that was
     taking the failed link to switch to the remaining path.  This will
!    result in doubling of link utilization on the remaining link.

  8.2.2.  Simple Virtual Aggregation

     A completely different approach to route summarization is possible,
!    provided that the main goal is to reduce the FIB pressure, while
     allowing the control plane to disseminate full routing information.
     Firstly, it could be easily noted that in many cases multiple
     prefixes, some of which are less specific, share the same set of the
--- 1519,1544 ----
     cluster from Tier-2 devices since each of them has only a single path
     down to this prefix.  It would require dual-homed servers to
     accomplish that.  Also note that this design is only resilient to
!    single link failures.  It is possible for a double link failure to
     isolate a Tier-2 device from all paths toward a specific Tier-3
     device, thus causing a routing black-hole.

!    A result of the proposed topology modification would be a reduction of
     Tier-1 devices port capacity.  This limits the maximum number of
     attached Tier-2 devices and therefore will limit the maximum DC
     network size.  A larger network would require different Tier-1
     devices that have higher port density to implement this change.

     Another problem is traffic re-balancing under link failures.  Since
!    there are two paths from Tier-1 to Tier-3, a failure of the link
     between Tier-1 and Tier-2 switch would result in all traffic that was
     taking the failed link to switch to the remaining path.  This will
!    result in doubling the utilization of the remaining link.

  8.2.2.  Simple Virtual Aggregation

     A completely different approach to route summarization is possible,
!    provided that the main goal is to reduce the FIB size, while
     allowing the control plane to disseminate full routing information.
     Firstly, it could be easily noted that in many cases multiple
     prefixes, some of which are less specific, share the same set of the
***************
*** 1550,1563 ****
     [RFC6769] and only install the least specific route in the FIB,
     ignoring more specific routes if they share the same next-hop set.
     For example, under normal network conditions, only the default route
!    need to be programmed into FIB.

     Furthermore, if the Tier-2 devices are configured with summary
!    prefixes covering all of their attached Tier-3 device's prefixes the
     same logic could be applied in Tier-1 devices as well, and, by
     induction to Tier-2/Tier-3 switches in different clusters.  These
     summary routes should still allow for more specific prefixes to leak
!    to Tier-1 devices, to enable for detection of mismatches in the next-
     hop sets if a particular link fails, changing the next-hop set for a
     specific prefix.

--- 1550,1563 ----
     [RFC6769] and only install the least specific route in the FIB,
     ignoring more specific routes if they share the same next-hop set.
     For example, under normal network conditions, only the default route
!    needs to be programmed into the FIB.

     Furthermore, if the Tier-2 devices are configured with summary
!    prefixes covering all of their attached Tier-3 devices' prefixes, the
     same logic could be applied in Tier-1 devices as well, and, by
     induction to Tier-2/Tier-3 switches in different clusters.  These
     summary routes should still allow for more specific prefixes to leak
!    to Tier-1 devices, to enable detection of mismatches in the next-
     hop sets if a particular link fails, changing the next-hop set for a
     specific prefix.

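The Simple Virtual Aggregation logic can be sketched as a
FIB-compression pass: install a more specific prefix only when its
next-hop set differs from that of the covering less specific prefix
already installed. Prefixes and next-hops below are illustrative:

    # Suppress FIB installation of more-specific prefixes whose
    # next-hop set equals that of their covering prefix.
    import ipaddress

    rib = [  # (prefix, next-hop set)
        (ipaddress.ip_network("0.0.0.0/0"), {"10.0.0.1", "10.0.0.2"}),
        (ipaddress.ip_network("10.128.0.0/24"),
         {"10.0.0.1", "10.0.0.2"}),                        # same set
        (ipaddress.ip_network("10.128.1.0/24"), {"10.0.0.1"}),  # differs
    ]

    fib = []
    for prefix, nhset in sorted(rib, key=lambda e: e[0].prefixlen):
        covering = [f for f in fib if prefix.subnet_of(f[0])]
        if covering and \
           max(covering, key=lambda f: f[0].prefixlen)[1] == nhset:
            continue  # suppressed: shares next-hop set with cover
        fib.append((prefix, nhset))

    print([str(p) for p, _ in fib])  # 0.0.0.0/0 and 10.128.1.0/24 only
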
***************
*** 1571,1584 ****


     Re-stating once again, this technique does not reduce the amount of
!    control plane state (i.e.  BGP UPDATEs/BGP LocRIB sizing), but only
!    allows for more efficient FIB utilization, by spotting more specific
!    prefixes that share their next-hops with less specifics.

  8.3.  ICMP Unreachable Message Masquerading

     This section discusses some operational aspects of not advertising
!    point-to-point link subnets into BGP, as previously outlined as an
     option in Section 5.2.3.  The operational impact of this decision
     could be seen when using the well-known "traceroute" tool.
     Specifically, IP addresses displayed by the tool will be the link's
--- 1571,1585 ----


     Re-stating once again, this technique does not reduce the amount of
!    control plane state (i.e., BGP UPDATEs/BGP Loc-RIB size), but only
!    allows for more efficient FIB utilization by detecting more specific
!    prefixes that share their next-hop set with a subsuming less specific
!    prefix.

  8.3.  ICMP Unreachable Message Masquerading

     This section discusses some operational aspects of not advertising
!    point-to-point link subnets into BGP, as previously identified as an
     option in Section 5.2.3.  The operational impact of this decision
     could be seen when using the well-known "traceroute" tool.
     Specifically, IP addresses displayed by the tool will be the link's
***************
*** 1587,1605 ****
     complicated.

     One way to overcome this limitation is by using the DNS subsystem to
!    create the "reverse" entries for the IP addresses of the same device
!    pointing to the same name.  The connectivity then can be made by
!    resolving this name to the "primary" IP address of the devices, e.g.
     its Loopback interface, which is always advertised into BGP.
     However, this creates a dependency on the DNS subsystem, which may be
     unavailable during an outage.

     Another option is to make the network device perform IP address
     masquerading, that is rewriting the source IP addresses of the
!    appropriate ICMP messages sent off of the device with the "primary"
     IP address of the device.  Specifically, the ICMP Destination
     Unreachable Message (type 3) codes 3 (port unreachable) and ICMP Time
!    Exceeded (type 11) code 0, which are involved in proper working of
     the "traceroute" tool.  With this modification, the "traceroute"
     probes sent to the devices will always be sent back with the
     "primary" IP address as the source, allowing the operator to discover
--- 1588,1606 ----
     complicated.

     One way to overcome this limitation is by using the DNS subsystem to
!    create the "reverse" entries for these point-to-point IP addresses
pointing
!    to a the same name as the loopback address.  The connectivity then
can be made by
!    resolving this name to the "primary" IP address of the devices, e.g.,
     its Loopback interface, which is always advertised into BGP.
     However, this creates a dependency on the DNS subsystem, which may be
     unavailable during an outage.

     Another option is to make the network device perform IP address
     masquerading, that is rewriting the source IP addresses of the
!    appropriate ICMP messages sent by the device with the "primary"
     IP address of the device.  Specifically, the ICMP Destination
     Unreachable Message (type 3) codes 3 (port unreachable) and ICMP Time
!    Exceeded (type 11) code 0 messages are required for correct operation of
     the "traceroute" tool.  With this modification, the "traceroute"
     probes sent to the devices will always be sent back with the
     "primary" IP address as the source, allowing the operator to discover

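The masquerading rule reduces to a small decision function: rewrite the
source address only for the two ICMP type/code pairs that "traceroute"
depends on. Addresses below are examples:

    # Pick the source IP for a locally generated ICMP message.
    PRIMARY_IP = "192.0.2.1"           # loopback, advertised into BGP
    MASQUERADED = {(3, 3), (11, 0)}    # codes traceroute relies on

    def icmp_source(icmp_type, icmp_code, egress_link_ip):
        if (icmp_type, icmp_code) in MASQUERADED:
            return PRIMARY_IP          # reachable, has a DNS entry
        return egress_link_ip          # unadvertised p2p address is
                                       # acceptable for other codes

    print(icmp_source(11, 0, "10.255.0.1"))  # 192.0.2.1 -> traceroute works
    print(icmp_source(3, 1, "10.255.0.1"))   # keeps the link address
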
Thanks,
Acee

_______________________________________________
rtgwg mailing list
rtgwg@ietf.org
https://www.ietf.org/mailman/listinfo/rtgwg