Re: AD review of draft-ietf-rtgwg-bgp-pic-18
Alvaro Retana <aretana.ietf@gmail.com> Thu, 13 October 2022 18:15 UTC
Return-Path: <aretana.ietf@gmail.com>
X-Original-To: rtgwg@ietfa.amsl.com
Delivered-To: rtgwg@ietfa.amsl.com
Received: from localhost (localhost [127.0.0.1]) by ietfa.amsl.com (Postfix) with ESMTP id 6B9E1C15271F; Thu, 13 Oct 2022 11:15:34 -0700 (PDT)
X-Virus-Scanned: amavisd-new at amsl.com
X-Spam-Flag: NO
X-Spam-Score: -0.862
X-Spam-Level:
X-Spam-Status: No, score=-0.862 tagged_above=-999 required=5 tests=[BAYES_00=-1.9, DKIM_SIGNED=0.1, DKIM_VALID=-0.1, DKIM_VALID_AU=-0.1, DKIM_VALID_EF=-0.1, FREEMAIL_FROM=0.001, HTML_MESSAGE=0.001, NORMAL_HTTP_TO_IP=0.001, NUMERIC_HTTP_ADDR=1.242, SPF_HELO_NONE=0.001, SPF_PASS=-0.001, T_SCC_BODY_TEXT_LINE=-0.01, URIBL_BLOCKED=0.001, URIBL_DBL_BLOCKED_OPENDNS=0.001, URIBL_ZEN_BLOCKED_OPENDNS=0.001] autolearn=ham autolearn_force=no
Authentication-Results: ietfa.amsl.com (amavisd-new); dkim=pass (2048-bit key) header.d=gmail.com
Received: from mail.ietf.org ([50.223.129.194]) by localhost (ietfa.amsl.com [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id Kisrzm3Ar-JH; Thu, 13 Oct 2022 11:15:27 -0700 (PDT)
Received: from mail-pg1-x530.google.com (mail-pg1-x530.google.com [IPv6:2607:f8b0:4864:20::530]) (using TLSv1.3 with cipher TLS_AES_128_GCM_SHA256 (128/128 bits) key-exchange X25519 server-signature RSA-PSS (2048 bits) server-digest SHA256) (No client certificate requested) by ietfa.amsl.com (Postfix) with ESMTPS id 586E0C15271E; Thu, 13 Oct 2022 11:15:24 -0700 (PDT)
Received: by mail-pg1-x530.google.com with SMTP id b5so2235698pgb.6; Thu, 13 Oct 2022 11:15:24 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20210112; h=cc:to:subject:message-id:date:mime-version:references:in-reply-to :from:from:to:cc:subject:date:message-id:reply-to; bh=cpRKKokZgogo3DlFClh+DB7Jpddo0+NnSew0uEVAGaY=; b=a2UYcUVztUbVpe1EYHpzU3nhZJNtPdOZ0YKnMDOulX2sPVX/Nh2nOYCOfyr2CYI2RQ +JpFwkTiNBYbKOlUsL9KWoIxmIUqJVYTvZZ6W3zSArfEysm48KUmQNlsgHPzxNr5tw/u iQA3nanE2GwG3NR5LHRwMjPgbWZFLyTJB5wk7g7yL/ndIFBnfZk5nzYiDTFdFbHGyTB3 +rLreCb34FWkoHKMQ5qAYtwzq/uMErCZvPm/+xZTScrtnIxIsmay+vfbjnv7vVBDUTcL I5d/WgzAWMpCyZDcbcdW8mzQnEt0MpOHO+D7HlINMuiG6WN2juWF6pvyxU/74dOO6Tze N/bw==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=cc:to:subject:message-id:date:mime-version:references:in-reply-to :from:x-gm-message-state:from:to:cc:subject:date:message-id:reply-to; bh=cpRKKokZgogo3DlFClh+DB7Jpddo0+NnSew0uEVAGaY=; b=MMR4QvoBTkTibRUgt5d65x02q3SDAbw2iAN6Yr8x5yP6SmIR9jpgB2gem/AoYUE+WB Dh/PXOMxoJHMPSkx8r5K49C6CCfJY5+MfkGsidItruy6Jrb+7v7ePUta393eMi1acwut sQPZ8KkzZJMonvRBoWJkhTLWV6op7JCQGcPSq+c0/CkFiQzSnjFO5qetSC87BLaLHKa5 xi/oZ8soIDst/NZCuoN2o7hivE+NALYvaSL1Lu1nBH/S8X/FNtcJBrCVuQyR1Ieeyp52 7DDiKM39u3F6eUFy7zIaxG/asusX2FlNxjOLG86HAnWTa7FYyssDBOD1F1glhJXTZLQj bVxg==
X-Gm-Message-State: ACrzQf1bpPMI5jXZd92PdKHoQpowy/exuedWM00opU+zp6zuW5Gk9DgH qqIIcCn818tx6XzLnlpmAY97+MTnGC2ODIP8jU+QdBT9
X-Google-Smtp-Source: AMsMyM6iQ+qHu1G7C309VvMLXvzQgM00R0anOJ2L/HwwPxqerKWhfVV3tVxzlNYFD163wtCgAZ5L/JDGSePdZ2K4XEk=
X-Received: by 2002:a05:6a00:a04:b0:534:d8a6:40ce with SMTP id p4-20020a056a000a0400b00534d8a640cemr799428pfh.15.1665684923201; Thu, 13 Oct 2022 11:15:23 -0700 (PDT)
Received: from 1058052472880 named unknown by gmailapi.google.com with HTTPREST; Thu, 13 Oct 2022 11:15:21 -0700
From: Alvaro Retana <aretana.ietf@gmail.com>
In-Reply-To: <CAMMESszTT8bRMt1UMS+OvsqePhZDJMHtOJ9sfcUs8zu19rtjsQ@mail.gmail.com>
References: <CAMMESsw1bdYc-p7TD7GtHNHa6zrO3Ms20VZX89Hza_JtWpK=Bw@mail.gmail.com> <CAMMESszTT8bRMt1UMS+OvsqePhZDJMHtOJ9sfcUs8zu19rtjsQ@mail.gmail.com>
MIME-Version: 1.0
Date: Thu, 13 Oct 2022 11:15:21 -0700
Message-ID: <CAMMESsxHLphxBGD749HcHF8OntE-oKBPdRiK8pu=fMxt6eUdsQ@mail.gmail.com>
Subject: Re: AD review of draft-ietf-rtgwg-bgp-pic-18
To: abashandy.ietf@gmail.com, draft-ietf-rtgwg-bgp-pic@ietf.org
Cc: Yingzhen Qu <yingzhen.ietf@gmail.com>, RTGWG <rtgwg@ietf.org>, rtgwg-chairs@ietf.org
Content-Type: multipart/alternative; boundary="00000000000027be8a05eaee7e94"
Archived-At: <https://mailarchive.ietf.org/arch/msg/rtgwg/4TI7y_Oq028YnE_NDoFPnHLLxdw>
X-BeenThere: rtgwg@ietf.org
X-Mailman-Version: 2.1.39
Precedence: list
List-Id: Routing Area Working Group <rtgwg.ietf.org>
List-Unsubscribe: <https://www.ietf.org/mailman/options/rtgwg>, <mailto:rtgwg-request@ietf.org?subject=unsubscribe>
List-Archive: <https://mailarchive.ietf.org/arch/browse/rtgwg/>
List-Post: <mailto:rtgwg@ietf.org>
List-Help: <mailto:rtgwg-request@ietf.org?subject=help>
List-Subscribe: <https://www.ietf.org/mailman/listinfo/rtgwg>, <mailto:rtgwg-request@ietf.org?subject=subscribe>
X-List-Received-Date: Thu, 13 Oct 2022 18:15:34 -0000
Ping! On September 8, 2022 at 11:46:05 AM, Alvaro Retana (aretana.ietf@gmail.com) wrote: Ping! On August 2, 2022 at 4:23:53 PM, Alvaro Retana (aretana.ietf@gmail.com) wrote: Dear authors: Thank you for documenting this important feature. After reading the document, I think that it still needs work: (1) The terminology used is not aligned with rfc4271. Of major importance is the description of the Routing Table, the Decision Process, the use of best routes (not paths!), etc. I pointed out multiple occurrences below, buy I need you to check the whole docuent for consistency. (2) The choice to describe the mechanism using a complex scenario (VPNs, etc.) has resulted in the text being more complex than it needs to be. It is probably too late for this, but a better approach (especially because this is an Informational document) would have been to describe the functionality using a simple IP-only network. A more elaborate example could have then been included in an Appendix. (3) The text (mainly in §5) related to hardware limitations is interesting but also highly speculative. Given that it is not part of the main mechanism, I would suggest putting this information in an appendix. I put detailed comments below. To the Chairs: I couldn't find a request to the idr WG for review. Did I miss it? Given the subject, we need to give idr the opportunity to comment. Thanks! Alvaro. [Line numbers from idnits.] ... 12 Abstract 14 In a network comprising thousands of BGP peers exchanging millions of 15 routes, many routes are reachable via more than one next-hop. Given 16 the large scaling targets, it is desirable to restore traffic after 17 failure in a time period that does not depend on the number of BGP 18 prefixes. In this document we proposed an architecture by which 19 traffic can be re-routed to ECMP or pre-calculated backup paths in a 20 timeframe that does not depend on the number of BGP prefixes. The 21 objective is achieved through organizing the forwarding data 22 structures in a hierarchical manner and sharing forwarding elements 23 among the maximum possible number of routes. The proposed technique 24 achieves prefix independent convergence while ensuring incremental 25 deployment, complete automation, and zero management and provisioning 26 effort. It is noteworthy to mention that the benefits of BGP Prefix 27 Independent Convergence (BGP-PIC) are hinged on the existence of more 28 than one path whether as ECMP or primary-backup. [style nit] "we proposed" Please don't write using the first person ("we propose"), use the third person instead ("this document proposes"). [minor] s/we proposed an architecture/describes an architecture Other places also say that the document "proposes" -- at this point in the process we're past proposals. :-) [minor] Please expand ECMP in the Abstract, and later on first use. [] "The proposed technique achieves prefix independent convergence while ensuring incremental deployment, complete automation, and zero management and provisioning effort." Great marketing! This is a technical document, please avoid selling the solution. [minor] Consider breaking the Abstract into two paragraphs. ... 109 1. Introduction 111 BGP speakers exchange reachability information about 112 prefixes[1][2] and, for labeled address families, namely AFI/SAFI 113 1/4, 2/4, 1/128, and 2/128, an edge router assigns local labels to 114 prefixes and associates the local label with each advertised 115 prefix using technologies such as L3VPN [9], 6PE [10], and 116 Softwire [8] using BGP label unicast (BGP-LU) technique[3]. A BGP 117 speaker then applies the path selection steps to choose the best 118 path. In modern networks, it is not uncommon to have a prefix 119 reachable via multiple edge routers. In addition to proprietary 120 techniques, multiple techniques have been proposed to allow for 121 BGP to advertise more than one path for a given prefix 122 [7][12][13], whether in the form of equal cost multipath or 123 primary-backup. Another common and widely deployed scenario is 124 L3VPN with multi-homed VPN sites with unique Route Distinguisher. 125 It is advantageous to utilize the commonality among paths used by 126 NLRIs[1] to significantly improve convergence in case of topology 127 modifications. [] I still haven't read the rest of the document... PIC is a valuable and simple technique. Just from the first sentence (and peeking ahead at some of the figures and examples), it seems to me that the description could have been significantly simpler. :-( [major] rfc7322 requires that RFCs be referenced as [RFCXXXX], please update the citations. [minor nit] Multiple citations don't have spaces around the brackets ("[]"). [minor] The reference to rfc4271 is the general reference for BGP, no need to include a reference to rfc4760 with it. [] "for labeled address families...L3VPN with multi-homed VPN sites with unique Route Distinguisher" The detail about labeled AFs/L3VPN seems unnecessary given that PIC doesn't just apply to them. [major] "BGP label unicast (BGP-LU) technique[3]" Note that rfc8277 doesn't use the terms BGP-LU, "label unicast", or "labeled unicast". I know that BGP-LU is commonly used, but it is not defined in any IETF document (that I could find). Please define it in the terminology section. Also, don't refer to rfc8277 every time you use the term. [major] "A BGP speaker then applies the path selection steps to choose the best path." rfc4271 doesn't talk about "path selection" or finding a "best path". It specifies a Decision Process that results in best routes. Please use consistent terminology. [minor] "In addition to proprietary techniques..." These proprietary techniques are not described anywhere. Please don't mention them -- it will only raise unnecessary questions. [nit] "techniques have been proposed...[7][12][13]" Add-path and Diverse paths are not just "proposed". [minor] "It is advantageous to utilize the commonality among paths used by NLRIs[1] to significantly improve convergence in case of topology modifications." You may have to be more explicit about which commonality -- mention it explicitly. 129 This document proposes a hierarchical and shared forwarding chain 130 organization that allows traffic to be restored to pre-calculated 131 alternative equal cost primary path or backup path in a time 132 period that does not depend on the number of BGP prefixes. The 133 technique relies on internal router behavior that is completely 134 transparent to the operator and can be incrementally deployed and 135 enabled with zero operator intervention. In other words, once it 136 is implemented and deployed on a router, nothing is required from 137 the operator to make it work. It is noteworthy to mention that 138 this document describes FIB architecture that can be implemented 139 in both hardware and/or software. [nit] s/to pre-calculated/to a pre-calculated [nit] s/describes FIB/describes a FIB [minor] Expand FIB on first mention. 141 1.1. Terminology [major] Please don't deviate (at least) from terminology used in standards track specifications. Please don't redefine terms to mean something different in this document. If using terminology defined elsewhere, please just reference those documents, and don't redefine the terms here. 143 This section defines the terms used in this document. For ease of 144 use, we will use terms similar to those used by L3VPN [9]. [major] Similar? If you're using the terminology from rfc4364, then please just use it. Redefining terms doesn't make anything easier. To be clear: remove any terms that are already defined in rfc4364 and make the reference Normative. 146 o BGP prefix: A prefix P/m (of any AFI/SAFI) that a BGP speaker 147 has a path for. [major] rfc4271 uses the concept of an "IP prefix" carried in a "route" and described by a "path" (from §1.1): Route A unit of information that pairs a set of destinations with the attributes of a path to those destinations. The set of destinations are systems whose IP addresses are contained in one IP address prefix carried in the Network Layer Reachability Information (NLRI) field of an UPDATE message. The path is the information reported in the path attributes field of the same UPDATE message. 149 o IGP prefix: A prefix P/m (of any AFI/SAFI) that is learnt via 150 an Interior Gateway Protocol (IGP), such as OSPF and ISIS. The 151 prefix may be learnt directly through the IGP or redistributed 152 from other protocol(s). [major] rfc4271 uses the term "IGP route". [minor] The "P/m" notation is not explained -- and is not used outside of this section. [major] "is learnt via an...IGP...may be learnt directly through the IGP or redistributed from other protocol(s)" There's a conflict in the definition: does it come directly from the IGP or from something else? Might might those "other protocol(s)" be? 154 o CE[7]: An external router through which an egress PE can reach 155 a prefix P/m. [minor] Please expand CE. [major] The reference to draft-ietf-idr-best-external seems incorrect; that draft doesn't use CE at all. This seems to be a term that comes from RFC4364. [minor] Please expand PE in first use. 157 o Egress PE[7], "ePE": A BGP speaker that learns about a prefix 158 through an eBGP peer and chooses that eBGP peer as the next-hop 159 for that prefix. [major] The reference to draft-ietf-idr-best-external seems incorrect; that draft doesn't use PE at all. This seems to be a term that comes from RFC4364. [minor] s/eBGP/EBGP/g That is how rfc4271 uses it. [minor] Expand EBGP on first use. 161 o Ingress PE, "iPE": A BGP speaker that learns about a prefix 162 through a iBGP peer and chooses an egress PE as the next-hop for 163 the prefix. [minor] s/iBGP/IBGP/g That is how rfc4271 uses it. [minor] Expand IBGP on first use. 165 o Path: The next-hop in a sequence of nodes starting from the 166 current node and ending with the destination node or network 167 identified by the prefix. The nodes may not be directly 168 connected. [major] See the definition above from rfc4271. [major] Some of the definitions refer to BGP, but others to the FIB characteristics....differentiate. "FIB Path" vs "BGP Path", for example. Note that rfc4271 uses the term "Routing Table" as follows (from §3.2: Routing Information Base): Routing information that the BGP speaker uses to forward packets (or to construct the forwarding table used for packet forwarding) is maintained in the Routing Table. The Routing Table accumulates routes to directly connected networks, static routes, routes learned from the IGP protocols, and routes learned from BGP. Whether a specific BGP route should be installed in the Routing Table, and whether a BGP route should override a route to the same destination installed by another source, is a local policy decision, and is not specified in this document. In addition to actual packet forwarding, the Routing Table is used for resolution of the next-hop addresses specified in BGP updates (see Section 5.1.3). 170 o Recursive path: A path consisting only of the IP address of the 171 next-hop without the outgoing interface. Subsequent lookups are 172 necessary to determine the outgoing interface and a directly 173 connected next-hop. [major] rfc4271 already used "recursive route". 175 o Non-recursive path: A path consisting of the IP address of a 176 directly connected next-hop and outgoing interface. [] See above. ... 181 o Primary path: A recursive or non-recursive path that can be 182 used all the time as long as a walk starting from this path can 183 end to an adjacency. A prefix can have more than one primary 184 path. [minor] "can be used all the time" I'm not sure what you mean here. [minor] The concept of a "walk" hasn't been introduced -- please put a pointer to where that is discussed. ... 189 o Leaf: A container data structure for a prefix or local label. 190 Alternatively, it is the data structure that contains prefix 191 specific information. [] So a leaf can be many things: "prefix or local label...[or] prefix specific information". Hmmm.... ... 198 o Pathlist: An array of paths used by one or more prefixes to 199 forward traffic to destination(s) covered by an IP prefix. Each 200 path in the pathlist carries its "path-index" that identifies its 201 position in the array of paths. In general, the value of the 202 "path-index" stored in path may not necessarily have the same 203 value of the location of the path in the pathlist. For example 204 the 3rd path may carry path-index value of 1. A pathlist may 205 contain a mix of primary and backup paths. [nit] s/its "path-index"/a "path-index" 207 o OutLabel-List: Each labeled prefix is associated with an 208 OutLabel-List. The OutLabel-List is an array of one or more 209 outgoing labels and/or label actions where each label or label 210 action has 1-to-1 correspondence to a path in the pathlist. 211 Label actions[6] are: push the label, pop the label, swap the 212 incoming label with the label in the OutLabel-List entry, or 213 don't push anything at all in case of "unlabeled". The prefix 214 may be an IGP or BGP prefix. [major] If the label label actions are defined in rfc3031, please just point to it -- is the "swap the incoming label with the label in the OutLabel-List entry" defined there too? It seems to me that the action defined here is part of a general action taken by the FIB, and not specific only to labels (as in a different encapsulation, for example). 216 o Forwarding chain: It is a compound data structure consisting of 217 multiple connected block that a forwarding engine walks one 218 block at a time to forward the packet out of an interface. 219 Section 2.2 explains an example of a forwarding chain. 220 Subsequent sections provide additional examples [nit] s/multiple connected block/multiple connected blocks [nit] s/additional examples/additional examples. [minor] Note that both "forwarding chain" and "forwarding Chain" are used throughout the text. Please be consistent. ... 229 o Route: A prefix with one or more paths associated with it. The 230 minimum set of objects needed to construct a route is a leaf 231 and a pathlist. [major] See the comments above about rfc4271 and the FIB in general. 233 2. Overview 235 The idea of BGP-PIC is based on two pillars [minor] The descriptions below are very thick (hard to read through) without more context. Suggestion: keep the description simple in this section and point to where there is more detail. 236 o A shared hierarchical forwarding chain: It is not uncommon to see 237 multiple destinations reachable via the same list of next-hops. 238 Instead of having a separate list of next-hops for each 239 destination, all destinations sharing the same list of next-hops 240 can point to a single copy of this list thereby allowing fast 241 convergence by making changes to a single shared list of next- 242 hops rather than possibly a large number of destinations. Because 243 paths in a pathlist may be recursive, a hierarchy is formed 244 between pathlist and the resolving prefix whereby the pathlist 245 depends on the resolving prefix. 247 o A forwarding plane that supports multiple levels of indirection: 248 A forwarding chain that starts with a destination and ends with 249 an outgoing interface is not a simple flat structure. Instead a 250 forwarding entry is constructed via multiple levels of 251 dependency. A BGP NLRI uses a recursive next-hop, which in turn 252 resolves via an IGP next-hop, which in turn resolves via an 253 adjacency consisting of one or more outgoing interface(s) and 254 next-hop(s). [nit] s/Instead a/Instead, a [minor] Are "multiple levels of indirection" and "multiple levels of dependency" the same thing? The latter is used to describe the former. [] Next-hop is also a term that needs to be differentiated. In this paragraph is it used in the context of a BGP route, an IGP route, and a local address on the FIB. See rfc4271. 256 Designing a forwarding plane that constructs multi-level forwarding 257 chains with maximal sharing of forwarding objects allows rerouting a 258 large number of destinations by modifying a small number of objects 259 thereby achieving convergence in a time frame that does not depend 260 on the number of destinations. For example, if the IGP prefix that 261 resolves a recursive next-hop is updated there is no need to update 262 the possibly large number of BGP NLRIs that use this recursive next- 263 hop. [] The example is orders of magnitue simpler that the actual explanation. BTW, none of the words used ("multi-level forwarding chains with maximal sharing of forwarding objects") are mentioned in the description of the pillars. 265 2.1. Dependency 267 This section describes the required functionalities in the 268 forwarding and control planes to support BGP-PIC described in this 269 document. [nit] s/BGP-PIC described/BGP-PIC as described 271 2.1.1. Hierarchical Hardware FIB (Forwarding Information Base) [major] The Introduction says that "this document describes FIB architecture that can be implemented in both hardware and/or software", but this section requires a hardware implementation. ?? 273 BGP-PIC requires a hierarchical hardware FIB support: for each BGP 274 forwarded packet, a BGP leaf is looked up, then a BGP Pathlist is 275 consulted, then an IGP Pathlist, then an Adjacency. [major] "BGP forwarded packet" I think you mean something along the lines of a packet destined to an external destination, or to a destination learned through BGP, or ... ?? [major] The definitions of both a leaf and a pathlist don't assume protocol origin. IOW, the reader has to make significant assumptions to understand that you mean "a leaf/pathlist that resulted from BGP/IGP-learned information" (or something like that). If the identification of the origin protocol is required, please also say so. [major] "Pathlist" and "Adjancency" are capitalized, but "leaf" isn't. Personally, I like capitalizing terms with specific meaning. Not all instances (throughout the document) are treated the same. Please be consistent throughout. [major] The general process above: lookup the destination (BGP leaf), which points to possible alternatives (BGP pathlist), which recursively resolve... assumes knowledge of how a FIB is generally built and the operations that forwarding a packet requires. This may not be common knowledge to every reader. Please either explain the assumptions or reference a place where the explanation exists already. 277 An alternative method consists in "flattening" the dependencies when 278 programming the BGP destinations into HW FIB resulting in 279 potentially eliminating both the BGP Path-List and IGP Path-List 280 consultation. Such an approach decreases the number of memory 281 lookups per forwarding operation at the expense of HW FIB memory 282 increase (flattening means less sharing thereby less duplication), 283 loss of ECMP properties (flattening means less pathlist entropy) and 284 loss of BGP-PIC properties. [minor] HW is mentioned here again...and the concept of programming, which goes back to the general knowledge assumptions that you're making. [minor] s/Path-List/Pathlist/g [major] Am I to assume that this "alternative method" shouldn't be considered/used, especially if it results in loss of the "BGP-PIC properties"?? This section started by mentioning a requirement -- if this other approach is not what is required the why even mention it? 286 2.1.2. Availability of more than one BGP next-hops 288 When the primary BGP next-hop fails, BGP-PIC depends on the 289 availability of one or more pre-computed and pre-installed secondary 290 BGP next-hop(s) in the BGP Pathlist. [major] "primary BGP next-hop" At this point in the document you've hinted at the possibility of having multiple BGP routes (with potentially different NEXT_HOPs), but the concept of "primary BGP next-hop" hasn't been explained. I'm assuming that by this you mean the BGP best route [rfc4271]. (Again, you're making assumptions of the knowledge of the reader.) I can see that this section talks about "secondary next-hops" later. Perhaps change the order of the text so that the conditions precede the explanation of the actions. [major] "BGP next-hop fails" rfc4271 talks about the the address in the NEXT_HOP attribute becoming not resolvable. Please use established terminology! [major] "pre-computed and pre-installed...in the BGP Pathlist" There's an example of "pre-computed" later on, but I couldn't find an explanation of "pre-installed...in the BGP Pathlist". 292 The existence of a secondary next-hop is clearly required for the 293 following reason: a service caring for network availability will 294 require two disjoint network connections resulting in two BGP next- 295 hops. [minor] You mean "clearly required" for BGP-PIC, right? Not for the service. 297 The BGP distribution of secondary next-hops is available thanks to 298 the following BGP mechanisms: Add-Path [12], BGP Best-External [7], 299 diverse path [13], and the frequent use in VPN deployments of 300 different VPN RD's per PE. Another option to learn multiple BGP 301 NH/path is simply to receive IBGP paths from multiple BGP RR 302 selection a different path as best. It is noteworthy to mention that 303 the availability of another BGP path does not mean that all failure 304 scenarios can be covered by simply forwarding traffic to the 305 available secondary path. The discussion of how to cover various 306 failure scenarios is beyond the scope of this document. [minor] "BGP NH/path" is a new term. Even if using "NH" seems obvious, please mention it at least once. Also, you've been talking (incorrectly) about BGP paths (instead of routes) or BGP next-hops and now you're combining the two. This is the only instance. [nit] s/is simply to receive/is to receive [nit] s/multiple BGP RR selection/multiple BGP RRs selecting [minor] You might want to reference rfc9107 as an example of "multiple BGP RRs selecting a different path as best". 308 2.2. BGP-PIC Illustration 310 To illustrate the two pillars above as well as the platform 311 dependency, we will use an example of a simple multihomed L3VPN [9] 312 prefix in a BGP-free core running LDP [4] or segment routing over 313 MPLS forwarding plane [5]. [nit] s/a simple multihomed/a multihomed What is "simple" to you may not be simple to others. Just state the facts. [minor] s/L3VPN [9]/L3VPN The citation is not needed all the time. [minor] "core running LDP [4] or segment routing over MPLS forwarding plane [5]" This is an example. To simplify, pick one. 315 +--------------------------------+ 316 | | 317 | ePE2 (IGP-IP1 192.0.2.1, Loopback) 318 | | \ 319 | | \ 320 | | \ 321 iPE | CE....VRF "Blue", ASnum 65000 322 | | / (VPN-IP1 198.51.100.0/24) 323 | | / (VPN-IP2 203.0.113.0/24) 324 | LDP/Segment-Routing Core | / 325 | ePE1 (IGP-IP2 192.0.2.2, Loopback) 326 | | 327 +--------------------------------+ 328 Figure 1 VPN prefix reachable via multiple PEs [nit] s/ASnum/ASN Expand on first use. 330 Referring to Figure 1, suppose the iPE (the ingress PE) receives 331 NLRIs for the VPN prefixes VPN-IP1 and VPN-IP2 from two egress PEs, 332 ePE1 and ePE2 with next-hop BGP-NH1 and BGP-NH2, respectively. 333 Assume that ePE1 advertise the VPN labels VPN-L11 and VPN-L12 while 334 ePE2 advertise the VPN labels VPN-L21 and VPN-L22 for VPN-IP1 and 335 VPN-IP2, respectively. Suppose that BGP-NH1 and BGP-NH2 are resolved 336 via the IGP prefixes IGP-IP1 and IGP-IP2, where each happen to have 337 2 equal cost paths with IGP-NH1 and IGP-NH2 reachable via the 338 interfaces I1 and I2, respectively. Suppose that local labels 339 (whether LDP [4] or segment routing [5]) on the downstream LSRs[4] 340 for IGP-IP1 are IGP-L11 and IGP-L12 while for IGP-IP2 are IGP-L21 341 and IGP-L22. As such, the routing table at iPE is as follows: [minor] "receives NLRIs" I other places you're talked about paths or routes...please be consistent! [minor] "IGP-NH1 and IGP-NH2 reachable via the interfaces I1 and I2" I'm sure you're talking about interfaces on the iPE, but others may find that the picture is lacking detail when compared to the description: the core is just a box.. 343 65000: 198.51.100.0/24 344 via ePE1 (192.0.2.1), VPN Label: VPN-L11 345 via ePE2 (192.0.2.2), VPN Label: VPN-L21 [minor] You need to explain the notation and, for example, why some entries have an ASN and others don't. [minor] It might help if the description included the IP addresses. For example, when mentioning BGP-NH1, indicate that it refers to 192.0.2.1. ... 362 IP Leaf: Pathlist: IP Leaf: Pathlist: 363 -------- +-------+ -------- +----------+ 364 VPN-IP1-->|BGP-NH1|-->IGP-IP1(BGP NH1)--->|IGP-NH1,I1|--->Adjacency1 365 | |BGP-NH2|-->.... | |IGP-NH2,I2|--->Adjacency2 366 | +-------+ | +----------+ 367 | | 368 | | 369 v v 370 OutLabel-List: OutLabel-List: 371 +----------------------+ +----------------------+ 372 |VPN-L11 (VPN-IP1, NH1)| |IGP-L11 (IGP-IP1, NH1)| 373 |VPN-L21 (VPN-IP1, NH2)| |IGP-L12 (IGP-IP1, NH2)| 374 +----------------------+ +----------------------+ 376 Figure 2 Shared Hierarchical Forwarding Chain at iPE 378 The forwarding chain depicted in Figure 2 illustrates the first 379 pillar, which is sharing and hierarchy. We can see that the BGP 380 pathlist consisting of BGP-NH1 and BGP-NH2 is shared by all NLRIs 381 reachable via ePE1 and ePE2. As such, it is possible to make changes 382 to the pathlist without having to make changes to the NLRIs. For 383 example, if BGP-NH2 becomes unreachable, there is no need to modify 384 any of the possibly large number of NLRIs. Instead only the shared 385 pathlist needs to be modified. Likewise, due to the hierarchical 386 structure of the forwarding chain, it is possible to make 387 modifications to the IGP routes without having to make any changes 388 to the BGP NLRIs. For example, if the interface "I2" goes down, only 389 the shared IGP pathlist needs to be updated, but none of the IGP 390 prefixes sharing the IGP pathlist nor the BGP NLRIs using the IGP 391 prefixes for resolution need to be modified. [minor] "We can see that the BGP pathlist consisting of BGP-NH1 and BGP-NH2 is shared by all NLRIs reachable via ePE1 and ePE2." I only see VPN-IP1 using that pathlist. I know it is not easy to draw using ASCII, but you may want to provide a conceptual figure showing how the leafs and Pathlists relate to each other and then describe the contents. OR You can also integrate an SVI drawing in the XML. [minor] The representation of the Pathlists is not consistent: the first one shows just BGP-NHs, while the second one has an extra element. It may be obvious to you that the interface element points at the corresponding Adjacencies, but please explain that too. [major] The description is not complete -- it jumps from using one Pathlist to the other, but it doesn't cover the recursive nature of the resolution: resolving the BGP-NHs in the Pathlist through another leaf. [major] The outlabel-list is not explained anywhere. 393 Figure 2 can also be used to illustrate the second BGP-PIC pillar. 394 Having a deep forwarding chain such as the one illustrated in Figure 395 2 requires a forwarding plane that is capable of accessing multiple 396 levels of indirection in order to calculate the outgoing 397 interface(s) and next-hops(s). While a deeper forwarding chain 398 minimizes the re-convergence time on topology change, there will 399 always exist platforms with limited capabilities and hence imposing 400 a limit on the depth of the forwarding chain. Section 5 describes 401 how to gracefully trade off convergence speed with the number of 402 hierarchical levels to support platforms with different 403 capabilities. [minor] "Figure 2 can also be used to illustrate the second BGP-PIC pillar." Ok. But the rest of the paragraph doesn't explain how that is needed in the example. 405 Another example using IPv6 addresses can be something like the 406 following 408 65000: 2001:DB8:1::/48 409 via ePE1 (192::1), VPN Label: VPN6-L11 410 via ePE2 (192::2), VPN Label: VPN6-L21 412 65000: 2001:DB8:2:/48 413 via ePE1 (192::1), VPN Label: VPN6-L12 414 via ePE2 (192::2), VPN Label: VPN6-L22 416 192::1/128 417 via Core, Label: IGP6-L11 418 via Core, Label: IGP6-L12 420 192::2/128 421 via Core, Label: IGP6-L21 422 via Core, Label: IGP6-L22 424 The same hierarchical forwarding chain described can be constructed 425 for IPv6 addresses/prefixes. [major] Please use the addresses from rfc3849. [] BTW, sorry to be blunt, but this is a lazy attempt at using IPv6 in an example. You should at least show a figure and maybe a different scenario...not just similar numbers. 427 3. Constructing the Shared Hierarchical Forwarding Chain 429 Constructing the forwarding chain is an application of the two 430 pillars described in Section 2. This section describes how to 431 construct the forwarding chain in hierarchical shared manner. [nit] s/in hierarchical/in a hierarchical 433 3.1. Constructing the BGP-PIC forwarding Chain 435 The whole process starts when BGP downloads a prefix to FIB. The 436 prefix contains one or more outgoing paths. For certain labeled 437 prefixes, such as VPN [9] prefixes, each path may be associated with 438 an outgoing label and the prefix itself may be assigned a local 439 label. The list of outgoing paths defines a pathlist. If such 440 pathlist does not already exist, then FIB manager (software or 441 hardware entity responsible for managing the FIB) creates a new 442 pathlist, otherwise the existing pathlist is used. The BGP prefix is 443 added as a dependent of the pathlist. [major] "BGP downloads a prefix to FIB" rfc4271 talks about "routes...installed in the Routing Table". Please, as much as possible, use the terminology from established RFCS. Note that this document doesn't define what a FIB is. [major] "The prefix contains one or more outgoing paths." This may be another terminology issue: a BGP route doesn't contain multiple next-hops. If the routes came from ADD_PATH or ORR, etc., there will be multiple BGP routes with a common NLRI and different next-hops. IOW, it looks like the Pathlist would be built incrementally as routes for the same NLRI are installed. The end result should be the same. Terminology matters! [minor] "VPN [9]" The other references to rfc4364 use "L3VPN", should this one use it too? [major] "If such pathlist does not already exist..." It should be mentioned somewhere the reason for this Pathlist to already exist: there's probably another BGP route using it. This is where the connection to the shared nature of the chain needs to be made. Otherwise, the description sounds as if there might simply be a Pathlist structure available... [nit] s/FIB manager/the FIB manager/g [] It might be easier/better/clearer to explain the process as a series of steps. Note that the description above is really a series of steps: BGP has a route to install, the route is accepted in the routing table (local policy), a leaf for this NLRI exists (or not), the Pathlist exists (or not), etc... 445 The previous step constructs the upper part of the hierarchical 446 forwarding chain. The forwarding chain is completed by resolving the 447 paths of the pathlist. A BGP path usually consists of a next-hop. 448 The next-hop is resolved by finding a matching IGP prefix. [major] "The forwarding chain is completed by resolving the paths of the pathlist." The example in the last section shows that this "lower part" (is that the right term?) could also include a Pathlist. The explanation of how the "upper part" is constructed didn't mention that the process was to "resolve" the BGP NLRI. As a result, there's a disconnect between what was done before and what is assumed here. Again, a series of steps would serve the purpose better. Also, be mindful of the use of "resolve" in rfc4271. [major] "A BGP path usually consists of a next-hop." As opposed to what? I mean the NEXT_HOP attribute is mandatory, a route includes multiple other Path Attributes... Not sure what the point it. [minor] "The next-hop is resolved by finding a matching IGP prefix." Yes, as an extreme simplification. §9.1.2.1/rfc4271 offers a better description. 450 The end result is a hierarchical shared forwarding chain where the 451 BGP pathlist is shared by all BGP prefixes that use the same list of 452 paths and the IGP prefix is shared by all pathlists that have a path 453 resolving via that IGP prefix. It is noteworthy to mention that the 454 forwarding chain is constructed without any operator intervention at 455 all. [minor] "The end result is a hierarchical shared forwarding chain where the BGP pathlist is shared..." As I mentioned above, more emphasis should be places on (when explaining the steps) the shared nature of the Pathlists and how its dependents are independent of each other too. [minor] "without any operator intervention" Yes, this was mentioned as an advantage somewhere. It would be better if the document was explicit from the start on the point that it is describing a set of abstract structures and procedures to be internally instantiated. Just like protocol structures (neighbors, routes, etc.) that are transparent to the user. ... 460 3.2. Example: Primary-Backup Path Scenario 462 Consider the egress PE ePE1 in the case of the multi-homed VPN 463 prefixes in the BGP-free core depicted in Figure 1. Suppose ePE1 464 determines that the primary path is the external path but the backup 465 path is the iBGP path to the other PE ePE2 with next-hop BGP-NH2. 466 ePE2 constructs the forwarding chain depicted in Figure 3. We are 467 only showing a single VPN prefix for simplicity. But all prefixes 468 that are multihomed to ePE1 and ePE2 share the BGP pathlist. [minor] "BGP-free core depicted in Figure 1" It doesn't really make a difference, but it wasn't mentioned before that Figure 1 has a BGP-free core. I'm mentioning this because there are inconsistencies in the description. Also, this is a detail that is not important to the description, but that can create confusion for other readers and generate unnecessary questions and comments. [minor] "Suppose ePE1 determines..." It would be great if there was either a pointer to how ePE1 does all this (where are backups specified?), or a little bit more explanation of the assumptions. [nit] s/external path but the backup/external path, while the backup 470 BGP OutLabel-List 471 VPN-L11 +---------+ 472 (Label-leaf)---+---->|Unlabeled| 473 | +---------+ 474 v | VPN-L21 | 475 | | (swap) | 476 | +---------+ 477 | 478 | BGP Pathlist 479 | +------------+ Connected route 480 | | CE-NH |------>(to the CE) 481 | |path-index=0| 482 | +------------+ 483 | | VPN-NH2 | 484 VPN-IP1 -----+------------------>| (backup) |------>IGP Leaf 485 (IP leaf) |path-index=1| (Towards ePE2) 486 | +------------+ 487 | 488 | BGP OutLabel-List 489 | +---------+ 490 +------------->|Unlabeled| 491 +---------+ 492 | VPN-L21 | 493 | (push) | 494 +---------+ 496 Figure 3 : VPN Prefix Forwarding Chain with eiBGP paths on egress PE [minor] Figure 2 provided an easy-to-follow dependency order: Leaf to Pathlist to Leaf to Pathlist. The dependencies in Figure 3 are not straightforward, please provide a description of the figure before explaining the differences with respect to the other example. [major] The definition of Pathlist (§1.1) implies that all its entries can be used for forwarding, but the Pathlist above (and this use case) clearly indicate that not all the entries in a Pathlist are used in the same way, or at the same time. 498 The example depicted in Figure 3 differs from the example in Figure 499 2 in two main aspects. First, as long as the primary path towards 500 the CE (external path) is useable, it will be the only path used for 501 forwarding while the OutLabel-List contains both the unlabeled label 502 (primary path) and the VPN label (backup path) advertised by the 503 backup path ePE2. The second aspect is presence of the label leaf 504 corresponding to the VPN prefix. This label leaf is used to match 505 VPN traffic arriving from the core. Note that the label leaf shares 506 the pathlist with the IP prefix. [major] "the...path...is useable" What does "useable" mean? Please describe this term in the terminology section. [major] "primary path towards the CE (external path) is useable, it will be the only path used" The arrow from VPN-IP1 points to the VPN-NH2 entry in the Pathlist, so the description doesn't seem to match. [major] "the OutLabel-List" There are two boxes named "BGP OutLabel-List", and they contain different backup actions (swap vs push). [minor] "unlabeled label" rfc3031 talks about unlabeled packets, not labels. [minor] A third difference is that this figure includes a "path-index" -- but that is not explained. 508 4. Forwarding Behavior ... 513 When a packet arrives at a router, it matches a leaf. A labeled 514 packet matches a label leaf while an IP packet matches an IP leaf. 515 The forwarding engines walks the forwarding chain starting from the 516 leaf until the walk terminates on an adjacency. Thus when a packet 517 arrives, the chain is walked as follows: [minor] Instead of summarizing, simply start with the steps. 519 1. Lookup the leaf based on the destination address or the label at 520 the top of the packet. [major] Lookup how? How do you know you found the right leaf? I think rfc3031 has some text related to the forwarding of labeled packets. But I don't know of a reference for IP forwarding, so you should at least mention something around a longest-prefix-match. 522 2. Retrieve the parent pathlist of the leaf. [minor] While it is related, the terminology section defines the dependency relationship as the leaf being the child of the Pathlist -- keeping consistency is important. 524 3. Pick the outgoing path "Pi" from the list of resolved paths in 525 the pathlist. The method by which the outgoing path is picked is 526 beyond the scope of this document (e.g. flow-preserving hash 527 exploiting entropy within the MPLS stack and IP header). Let the 528 "path-index" of the outgoing path "Pi" be "j". [minor] s/Pick the outgoing path "Pi"/Pick an outgoing path "Pi" [] "Let the "path-index" of the outgoing path "Pi" be "j"." I'm not sure if you mean that "j is the path-index of Pi" or "assign a path-index of j". If the former, how is the path-index assigned? 530 4. If the prefix is labeled, use the "path-index" "j" to retrieve 531 the jth label "Lj" stored the jth entry in the OutLabel-List and 532 apply the label action of the label on the packet (e.g. for VPN 533 label on the ingress PE, the label action is "push"). As 534 mentioned in Section 1.1 the value of the "path-index" stored 535 in path may not necessarily be the same value of the location of 536 the path in the pathlist. [minor] "retrieve the jth label "Lj" stored the jth entry in the OutLabel-List" This sounds as if there were (at least) j labels in each of (at least) j entries in the OutLabel-List. But I assume that there is really only one label (or label stack) in that entry. ?? 538 5. Move to the parent of the chosen path "Pi". 540 6. If the chosen path "Pi" is recursive, move to its parent prefix 541 and go to step 2. 543 7. If the chosen path is non-recursive move to its parent adjacency. 544 Otherwise go to the next step. [minor] These 2 steps say the same thing: "move to the parent of..."Pi"", "path "Pi"...move to its parent", and "move to its parent". I think you need to better differentiate (?) the options. ... 549 Let's apply the above forwarding steps to the forwarding chain 550 depicted in Figure 2 in Section 2. Suppose a packet arrives at 551 ingress PE iPE from an external neighbor. Assume the packet matches 552 the VPN prefix VPN-IP1. While walking the forwarding chain, the 553 forwarding engine applies a hashing algorithm to choose the path and 554 the hashing at the BGP level yields path 0 while the hashing at the 555 IGP level yields path 1. In that case, the packet will be sent out 556 of interface I2 with the label stack "IGP-L12,VPN-L11". [minor] s/path 0...path 1/path-index 0...path-index 1 [minor] The example would be better understood if it was presented as steps matching the process described above (vs a narrative explanation). Also, be specific in which path is the one with each path-index... [major] Figure 2 doesn't contain any indication of a path-index. 558 5. Handling Platforms with Limited Levels of Hierarchy [] This whole section, which is longer than the text describing the "normal" mechanism (!), seems to be more appropriate for an Appendix than the main body of the document. Please move it. ... 564 o Being able to reduce the number of hierarchical levels from any 565 arbitrary value to a smaller arbitrary value that can be 566 supported by the forwarding engine. [?] What is the worst case in terms of number of levels? 3? Just curious. I'm asking simply because an "arbitrary value" sounds like it could be 10s o even 1000s, when in reality it is not more than a handful. Are there many platforms out there that wouldn't support 2-3 levels (for the number of routes/paths they normally would carry)? [minor] "smaller arbitrary value" This smaller value is not really arbitrary (as in random), it is determined by the platform. Maybe something along the lines of a "value supported by the platform" would be better. ... 571 5.1. Flattening the Forwarding Chain ... 584 1. The forwarding engine is now at leaf "R1". ... 605 9. Now the forwarding engine continues the walk to the parent of 606 "Qj". [minor] This list of steps is just a restatement of the process in the last section. IMO there's no need to repeat the steps here. Point at the 2-level example, and jump directly to the next paragraph. To reference from the explanation below, "a picture is worth a thousand words". ;-) ... 614 1. FIB manager wants to reduce the number of levels used by "Pi" by 615 1. [nit] This is not really a step... ... 622 4. FIB manager also extracts the OutLabel-list(R2) associated with 623 the leaf "R2". Remember that OutLabel-list(R2) = <L1, L2,..., 624 Lm>. [] New syntax: "OutLabel-list(R2)". ... 643 It is noteworthy to mention that the labels in the OutLabel-list 644 associated with the "flattened" pathlist may be stored in the same 645 memory location as the path itself to avoid additional memory 646 access. But that is an implementation detail that is beyond the 647 scope of this document. [] Why even mention it then? 649 The same steps can be applied to all paths in the pathlist <P1, 650 P2,..., Pn> so that all paths are "flattened" thereby reducing the 651 number of hierarchical levels by one. Note that that "flattening" a 652 pathlist pulls in all paths of the parent paths, a desired feature 653 to utilize all ECMP/UCMP paths at all levels. A platform that has a 654 limit on the number of paths in a pathlist for any given leaf may 655 choose to reduce the number paths using methods that are beyond the 656 scope of this document. [minor] Expand UCMP. ... 663 Because a flattened pathlist may have an associated OutLabel-list 664 the forwarding behavior has to be slightly modified. The 665 modification is done by adding the following step right after step 4 666 in Section 4. 668 5. If there is an OutLabel-list associated with the pathlist, then 669 if the path "Pi" is chosen by the hashing algorithm, retrieve the 670 label at location "i" in that OutLabel-list and apply the label 671 action of that label on the packet. [major] Step 4 says this: 4. If the prefix is labeled, use the "path-index" "j" to retrieve the jth label "Lj" stored the jth entry in the OutLabel-List and apply the label action of the label on the packet... Besides using "j" instead of "i" as the path-index, both seem to say the same thing. Am I missing something? ... 676 5.2. Example: Flattening a forwarding chain. 678 This example uses a case of inter-AS option C [9] where there are 3 679 levels of hierarchy. Figure 4 illustrates the sample topology. To 680 force 3 levels of hierarchy, the ASBRs[9] on the ingress domain 681 (domain 1) advertise the core routers of the egress domain (domain 682 2) to the ingress PE (iPE) via BGP-LU [3] instead of redistributing 683 them into the IGP of domain 1. The end result is that the ingress PE 684 (iPE) has 2 levels of recursion for the VPN prefixes VPN-IP1 and 685 VPN-IP2. [] Suggestion: OLD> To force 3 levels of hierarchy, the ASBRs[9] on the ingress domain (domain 1) advertise the core routers of the egress domain (domain 2) to the ingress PE (iPE) via BGP-LU [3] instead of redistributing them into the IGP of domain 1. NEW> The ASBRs on the ingress domain (Domain 1) use BGP to advertise the core routers (ASBRs and ePEs) of the egress domain (Domain 2) to the iPE. [minor] Expand ASBR on first use. No need to add a reference. 687 Domain 1 Domain 2 688 +-------------+ +-------------+ 689 | | | | 690 | LDP/SR Core | | LDP/SR core | 691 | | | | 692 | (192.0.2.4) | | 693 | ASBR11-------ASBR21........ePE1(192.0.2.1) 694 | | \ / | . . |\ 695 | | \ / | . . | \ 696 | | \ / | . . | \ 697 | | \/ | .. | \VPN-IP1(198.51.100.0/24) 698 | | /\ | . . | /VRF "Blue" ASn: 65000 699 | | / \ | . . | / 700 | | / \ | . . | / 701 | | / \ | . . |/ 702 iPE ASBR12-------ASBR22........ePE2 (192.0.2.2) 703 | (192.0.2.5) | |\ 704 | | | | \ 705 | | | | \ 706 | | | | \VRF "Blue" ASn: 65000 707 | | | | /VPN-IP2(203.0.113.0/24) 708 | | | | / 709 | | | | / 710 | | | |/ 711 | ASBR13-------ASBR23........ePE3(192.0.2.3) 712 | (192.0.2.6) | | 713 | | | | 714 | | | | 715 +-------------+ +-------------+ 716 <=========== <========= <============ 717 Advertise ePEx Advertise Redistribute 718 Using iBGP-LU ePEx Using IGP into 719 eBGP-LU BGP 721 Figure 4 : Sample 3-level hierarchy topology [minor] s/ASn/ASN/g [major] "Redistribute IGP into BGP" ?? I'm sure you don't mean that, but probably more something like "redistribute the ePEx routes only"... 723 We will make the following assumptions about connectivity [nit] s/connectivity/connectivity: 725 o In "domain 2", both ASBR21 and ASBR22 can reach both ePE1 and 726 ePE2 using the same distance. [nit] The figure capitalizes "Domain", but the text doesn't. Please be consistent! [minor] By "distance" I assume you mean "IGP metric". Please use that term instead. Apply to other occurences. ... 733 We will make the following assumptions about the labels [nit] s/labels/labels: ... 740 o The labels advertised by ASBR11 to iPE using BGP-LU [3] for the 741 egress PEs ePE1 and ePE2 are LASBR111(ePE1) and LASBR112(ePE2), 742 respectively. [] Why change the notation of the labels? This new notation ("LASBR111(ePE1)") just makes the explanation more complex. :-( ... 764 65000: 198.51.100.0/24 765 via ePE1 (192.0.2.1), VPN Label: VPN-L11 766 via ePE2 (192.0.2.2), VPN Label: VPN-L21 767 65000: 203.0.113.0/24 768 via ePE2 (192.0.2.2), VPN Label: VPN-L22 769 via ePE3 (192.0.2.3), VPN Label: VPN-L32 771 192.0.2.1/32 (ePE1) 772 Via ASBR11, BGP-LU Label: LASBR111(ePE1) 773 Via ASBR12, BGP-LU Label: LASBR121(ePE1) 774 192.0.2.2/32 (ePE2) 775 Via ASBR11, BGP-LU Label: LASBR112(ePE2) 776 Via ASBR12, BGP-LU Label: LASBR122(ePE2) 777 192.0.2.3/32 (ePE3) 778 Via ASBR13, BGP-LU Label: LASBR13(ePE3) 780 192.0.2.4/32 (ASBR11) 781 via Core, Label: IGP-L11 782 192.0.2.5/32 (ASBR12) 783 via Core, Label: IGP-L12 784 192.0.2.6/32 (ASBR13) 785 via Core, Label: IGP-L13 [nit] s/Via/via/g [minor] Only some of the addresses are annotated ("192.0.2.1/32 (ePE1)"). This is helpful, but inconsistent -- both within this example and with previous ones. ... 797 o The index inside the pathlist entry indicates the label that will 798 be picked from the OutLabel-List associated with the child leaf 799 if that path is chosen by the forwarding engine hashing function. [] s/index/path-index 801 OutLabel-List OutLabel-List 802 For VPN-IP1 For VPN-IP2 803 +------------+ +--------+ +-------+ +------------+ 804 | VPN-L11 |<---| VPN-IP1| |VPN-IP2|-->| VPN-L22 | 805 +------------+ +---+----+ +---+---+ +------------+ 806 | VPN-L21 | | | | VPN-L32 | 807 +------------+ | | +------------+ 808 | | 809 V V 810 +---+---+ +---+---+ 811 | 0 | 1 | | 0 | 1 | 812 +-|-+-\-+ +-/-+-\-+ 813 | \ / \ 814 | \ / \ 815 | \ / \ 816 | \ / \ 817 v \ / \ 818 +-----+ +-----+ +-----+ 819 +----+ ePE1| |ePE2 +-----+ | ePE3+-----+ 820 | +--+--+ +-----+ | +--+--+ | 821 v | / v | v 822 +--------------+ | / +--------------+ | +-------------+ 823 |LASBR111(ePE1)| | / |LASBR112(ePE2)| | |LASBR13(ePE3)| 824 +--------------+ | / +--------------+ | +-------------+ 825 |LASBR121(ePE1)| | / |LASBR122(ePE2)| | OutLabel-List 826 +--------------+ | / +--------------+ | For ePE3 827 OutLabel-List | / OutLabel-List | 828 For ePE1 | / For ePE2 | 829 | / | 830 | / | 831 | / | 832 v / v 833 +---+---+ Shared Pathlist +---+ Pathlist 834 | 0 | 1 | For ePE1 and ePE2 | 0 | For ePE3 835 +-|-+-\-+ +-|-+ 836 | \ | 837 | \ | 838 | \ | 839 | \ | 840 v \ v 841 +------+ +------+ +------+ 842 +---+ASBR11| |ASBR12+--+ |ASBR13+---+ 843 | +------+ +------+ | +------+ | 844 v v v 845 +-------+ +-------+ +-------+ 846 |IGP-L11| |IGP-L12| |IGP-L13| 847 +-------+ +-------+ +-------+ 849 Figure 5 : Forwarding Chain for hardware supporting 3 Levels [] The OutLabel-Lists at the bottom are not tagged as such. [] Some down arrows ("v") are missing -- on the non-vertical lines. 851 Now suppose the hardware on iPE (the ingress PE) supports 2 levels 852 of hierarchy only. In that case, the 3-levels forwarding chain in 853 Figure 5 needs to be "flattened" into 2 levels only. 855 OutLabel-List OutLabel-List 856 For VPN-IP1 For VPN-IP2 857 +------------+ +-------+ +-------+ +------------+ 858 | VPN-L11 |<---|VPN-IP1| | VPN-IP2|--->| VPN-L22 | 859 +------------+ +---+---+ +---+---+ +------------+ 860 | VPN-L21 | | | | VPN-L32 | 861 +------------+ | | +------------+ 862 | | 863 | | 864 | | 865 Flattened | | Flattened 866 pathlist V V pathlist 867 +===+===+ +===+===+===+ +==============+ 868 +--------+ 0 | 1 | | 0 | 0 | 1 +---->|LASBR112(ePE2)| 869 | +=|=+=\=+ +=/=+=/=+=\=+ +==============+ 870 v | \ / / \ |LASBR122(ePE2)| 871 +==============+ | \ +-----+ / \ +==============+ 872 |LASBR111(ePE1)| | \/ / \ |LASBR13(ePE3) | 873 +==============+ | /\ / \ +==============+ 874 |LASBR121(ePE1)| | / \ / \ 875 +==============+ | / \ / \ 876 | / \ / \ 877 | / + + \ 878 | + | | \ 879 | | | | \ 880 v v v v \ 881 +------+ +------+ +------+ 882 +----|ASBR11| |ASBR12+---+ |ASBR13+---+ 883 | +------+ +------+ | +------+ | 884 v v v 885 +-------+ +-------+ +-------+ 886 |IGP-L11| |IGP-L12| |IGP-L13| 887 +-------+ +-------+ +-------+ [] Same comments as above. 889 Figure 6 : Flattening 3 levels to 2 levels of Hierarchy on iPE 891 Figure 6 represents one way to "flatten" a 3 levels hierarchy into 892 two levels. There are few important points: [nit] s/are few/are a few ... 908 o Let's take a look at the flattened pathlist used by the prefix 909 "VPN-IP2", The pathlist associated with the prefix "VPN-IP2" has 910 three entries. [nit] s/"VPN-IP2", The/"VPN-IP2". The 912 o The first and second entry have index "0". This is because 913 both entries correspond to ePE2. Thus when hashing performed 914 by the forwarding engine results in using first or the second 915 entry in the pathlist, the forwarding engine will pick the 916 correct VPN label "VPN-L22", which is the label advertised by 917 ePE2 for the prefix "VPN-IP2". [nit] s/using first/using the first ... 952 o So the packet is forwarded towards the ASBR "ASBR12" and the IGP 953 label at the top will be "L12". [minor] The figure calls the label "IGP-L12", not just "L12". There are other instances of this below. ... 982 6. Forwarding Chain Adjustment at a Failure 984 The hierarchical and shared structure of the forwarding chain 985 explained in the previous section allows modifying a small number of 986 forwarding chain objects to re-route traffic to a pre-calculated 987 equal-cost or backup path without the need to modify the possibly 988 very large number of BGP prefixes. In this section, we go over 989 various core and edge failure scenarios to illustrate how FIB 990 manager can utilize the forwarding chain structure to achieve BGP 991 prefix independent convergence. [nit] s/how FIB manager/how FIB the manager 993 6.1. BGP-PIC core [minor] The meaning of "core" and "edge" should be explained somwehre. ... 1001 When a remote link or node fails, IGP on the ingress PE receives 1002 advertisement indicating a topology change so IGP re-converges to 1003 either find a new next-hop and/or outgoing interface or remove the 1004 path completely from the IGP prefix used to resolve BGP next-hops. 1005 IGP and/or LDP download the modified IGP leaves with modified 1006 outgoing labels for labeled core. [nit] s/IGP/the IGP/g [nit] s/receives advertisement/receives an advertisement [nit] s/for labeled core/for the labeled core [] "download" This is the second time that the "download" concept is used without explanation. See my comment in §3.1. 1008 When a local link fails, FIB manager detects the failure almost 1009 immediately. The FIB manager marks the impacted path(s) as unusable 1010 so that only useable paths are used to forward packets. Hence only 1011 IGP pathlists with paths using the failed local link need to be 1012 modified. All other pathlists are not impacted. Note that in this 1013 particular case there is actually no need even to backwalk to IGP 1014 leaves to adjust the OutLabel-Lists because FIB can rely on the 1015 path-index stored in the useable paths in the pathlist to pick the 1016 right label. [nit] s/is actually no need even to/is no need to [nit] s/to IGP leaves/to the IGP leaves [major] The term "backwalk" is introduced here, but not explained. 1018 It is noteworthy to mention that because FIB manager modifies the 1019 forwarding chain starting from the IGP leaves only. BGP pathlists 1020 and leaves are not modified. Hence traffic restoration occurs within 1021 the time frame of IGP convergence, and, for local link failure, 1022 assuming a backup path has been precomputed, within the timeframe of 1023 local detection (e.g. 50ms). Examples of solutions that pre- 1024 computing backup paths are IP FRR [16] remote LFA [17], Ti-LFA [15] 1025 and MRT [18] or eBGP path having a backup path [11]. [major] The case on the previous paragraph deals with a local link failure that results in the routes being "unusable", but this paragraph mentions restoration "assuming a backup path has been precomputed" -- the restoration timeframe also applies to the case where there is no backup path (as explained above). Note that the next paragraph also talks about a backup route and seems to ignore the "unusable" case above. [nit] s/solutions that pre- computing/solutions that can pre-compute [nit] s/Ti-LFA/TI-LFA [minor] Expand FRR, LFA, TI-LFA, and MRT on first use. [nit] s/IP FRR [16] remote LFA...[18] or eBGP/IP FRR [16], remote LFA...[18], or eBGP [minor] "eBGP path having a backup path [11]" I don't think that's the right reference. 1027 Let's apply the procedure mentioned in this subsection to the 1028 forwarding chain depicted in Figure 2. Suppose a remote link failure 1029 occurs and impacts the first ECMP IGP path to the remote BGP next- 1030 hop. Upon IGP convergence, the IGP pathlist used by the BGP next-hop 1031 is updated to reflect the new topology (one path instead of two). As 1032 soon as the IGP convergence is effective for the BGP next-hop entry, 1033 the new forwarding state is immediately available to all dependent 1034 BGP prefixes. The same behavior would occur if the failure was local 1035 such as an interface going down. As soon as the IGP convergence is 1036 complete for the BGP next-hop IGP route, all its BGP depending 1037 routes benefit from the new path. In fact, upon local failure, if 1038 LFA protection is enabled for the IGP route to the BGP next-hop and 1039 a backup path was pre-computed and installed in the pathlist, upon 1040 the local interface failure, the LFA backup path is immediately 1041 activated (e.g. sub-50msec) and thus protection benefits all the 1042 depending BGP traffic through the hierarchical forwarding dependency 1043 between the routes. [minor] "Upon IGP convergence... As soon as the IGP convergence is effective for the BGP next-hop entry," There's some redundancy in the text. Also, by using different text, you're implying that there is a difference between "IGP convergence" and "IGP convergence...effective for the BGP next-hop entry". [major] "The same behavior would occur if the failure was local such as an interface going down. As soon as the IGP convergence..." According to the text above (2 or 3 paragraphs before), IGP convergence is not necessary in the local failure case. ... 1050 6.2.1. Adjusting forwarding Chain in egress node failure [minor] Most of the text talks about an "egress node". I assume that is the same as an "edge node" -- is it? If so, please be consistent and/or explain somewhere that you're referring to the same thing. 1052 When an edge node fails, IGP on neighboring core nodes send route 1053 updates indicating that the edge node is no longer reachable. IGP 1054 running on the iPE instructs FIB to remove the IP and label leaves 1055 corresponding to the failed edge node from FIB. So FIB manager 1056 performs the following steps: [major] "IGP on neighboring core nodes send route updates indicating that the edge node is no longer reachable" §6.1 says that "Node failures are treated as link failures." Is that the same in this section? The neighbor nodes can't detect that the node is down -- they can only detect the link being down. The rest of the explanation assumes that there was a single link connecting the egress node to the core, or that all the links were reported as down. Neither case is explained. If there is more than one link there shouldn't be a failure in reaching the next-hop -- except for the change in the local router(s), similar to a core failure. ... 1073 6.2.2. Adjusting Forwarding Chain on PE-CE link Failure ... 1082 In the first case, the rest of iBGP peers will remain unaware of the 1083 link failure and will continue to forward traffic to the edge node 1084 until the edge node attached to the failed link withdraws the BGP 1085 prefixes. If the destination prefixes are multi-homed to another 1086 iBGP peer, say ePE2, then FIB manager on the edge router detecting 1087 the link failure applies the following steps (see Figure 3): [minor] Figure 3 is an example -- it does show a sample chain. Figure 4 may be used as a reference for the type of backup described here. ... 1100 o The label entry in OutLabel-List corresponding to the 1101 internal path to backup egress PE has swap action to the 1102 label advertised by backup egress PE. [nit] s/has swap action/has a swap action [nit] s/by backup/by the backup 1104 o For an arriving label packet (e.g. VPN), the top label is 1105 swapped with the label advertised by backup egress PE and the 1106 packet is sent towards that backup egress PE. [nit] s/label packet/labeled packet [major] Even though you're expanding on an example, it is important for the text to be precise. - The label mentioned is "e.g. VPN", but you're been using more specific labels -- from figure 3, "VPN-L21"... - The OutLabel-Lists in Figure 3 show both "VPN-L21 (swap)" and "VPN-L21 (push)". The same label is used, only the swap action is mentioned here. What am I missing? - As I mentioned before, the diagram in Figure 3 is not clear. 1108 o For unlabeled traffic, packets are simply redirected towards 1109 backup egress PE. [nit] s/backup/the backups [major] "ackets are simply redirected towards backup egress PE" How? The backup ePE is presumably not connected (which is why a label was needed)...?? 1111 In the second case where the edge router uses the IP address of the 1112 failed link as the BGP next-hop, the edge router will still perform 1113 the previous steps. But, unlike the case of next-hop self, IGP on 1114 failed edge node informs the rest of the iBGP peers that IP address 1115 of the failed link is no longer reachable. Hence the FIB manager on 1116 iBGP peers will delete the IGP leaf corresponding to the IP prefix 1117 of the failed link. The behavior of the iBGP peers will be identical 1118 to the case of edge node failure outlined in Section 6.2.1. [nit] s/IGP on failed edge node/the IGP on the failed edge node [nit] s/that IP address/that the IP address 1120 It is noteworthy to mention that because the edge link failure is 1121 local to the edge router, sub-50 msec convergence can be achieved as 1122 described in [11]. [] But only at the edge router -- not in other places of the network where the IGP has to converge. Please keep the numbers in the Appendix. IOW, there's no need to claim this performance here. 1124 Let's try to apply the case of next-hop self to the forwarding chain 1125 depicted in Figure 3. After failure of the link between ePE1 and CE, 1126 the forwarding engine will route traffic arriving from the core 1127 towards VPN-NH2 with path-index=1. A packet arriving from the core 1128 will contain the label VPN-L11 at top. The label VPN-L11 is swapped 1129 with the label VPN-L21 and the packet is forwarded towards ePE2. [minor] This paragraph is out of place: you already tried to explain this case a few paragraphs above. Please move the text and deduplicate. 1131 6.3. Handling Failures for Flattened Forwarding Chains 1133 As explained in the in Section 5 if the number of hierarchy levels 1134 of a platform cannot support the native number of hierarchy levels 1135 of a recursive forwarding chain, the instantiated forwarding chain 1136 is constructed by flattening two or more levels. Hence a 3 levels 1137 chain in Figure 5 is flattened into the 2 levels chain in Figure 6. [nit] s/a 3 levels/the 3-level [nit] s/2 levels/2-level 1139 While reducing the benefits of BGP-PIC, flattening one hierarchy 1140 into a shallower hierarchy does not always result in a complete loss 1141 of the benefits of the BGP-PIC. To illustrate this fact suppose 1142 ASBR12 is no longer reachable in domain 1. If the platform supports 1143 the full hierarchy depth, the forwarding chain is the one depicted 1144 in Figure 5 and hence the FIB manager needs to backwalk one level to 1145 the pathlist shared by "ePE1" and "ePE2" and adjust it. If the 1146 platform supports 2 levels of hierarchy, then a useable forwarding 1147 chain is the one depicted in Figure 6. In that case, if ASBR12 is no 1148 longer reachable, the FIB manager has to backwalk to the two 1149 flattened pathlists and updates both of them. 1151 The main observation is that the loss of convergence speed due to 1152 the loss of hierarchy depth depends on the structure of the 1153 forwarding chain itself. To illustrate this fact, let's take two 1154 extremes. Suppose the forwarding objects in level i+1 depend on the 1155 forwarding objects in level i. If every object on level i+1 depends 1156 on a separate object in level i, then flattening level i into level 1157 i+1 will not result in loss of convergence speed. Now let's take the 1158 other extreme. Suppose "n" objects in level i+1 depend on 1 object 1159 in level i. Now suppose FIB flattens level i into level i+1. If a 1160 topology change results in modifying the single object in level i, 1161 then FIB has to backwalk and modify "n" objects in the flattened 1162 level, thereby losing all the benefit of BGP-PIC. Experience shows 1163 that flattening forwarding chains usually results in moderate loss 1164 of BGP-PIC benefits. Further analysis is needed to corroborate and 1165 quantify this statement. [major] "The main observation is that the loss of convergence speed due to the loss of hierarchy depth... Further analysis is needed to corroborate and quantify this statement." The explanation above sounds plausible. However, if you really don't know, and there is a dependency on the structure, and it depends on the failure, and there's no alternative (if the platform supports less levels)... Why even include this information? It feels like the equivalent of "hand waving". :-( 1167 7. Properties [minor] This section is really a summary. I wonder if it would serve a better purpose at the start of the document as a "road map". 1169 7.1. Coverage 1171 All the possible failures, except CE node failure, are covered, 1172 whether they impact a local or remote IGP path or a local or remote 1173 BGP next-hop as described in Section 6. This section provides 1174 details for each failure and how the hierarchical and shared FIB 1175 structure proposed in this document allows recovery that does not 1176 depend on number of BGP prefixes. [major] "except CE node failure" Isn't that covered in §6.2.2? 1178 7.1.1. A remote failure on the path to a BGP next-hop 1180 Upon IGP convergence, the IGP leaf for the BGP next-hop is updated 1181 upon IGP convergence and all the BGP depending routes leverage the 1182 new IGP forwarding state immediately. Details of this behavior can 1183 be found in Section 6.1. [minor] Redundant text: "Upon IGP convergence, the IGP leaf for the BGP next-hop is updated upon IGP convergence..." 1185 This BGP resiliency property only depends on IGP convergence and is 1186 independent of the number of BGP prefixes impacted. [] "BGP resiliency property" The rest of the text (and the name of the document!) refers to convergence. It isn't until this section that you change the terminology. Please be consistent. 1188 7.1.2. A local failure on the path to a BGP next-hop 1190 Upon LFA protection, the IGP leaf for the BGP next-hop is updated to 1191 use the precomputed LFA backup path and all the BGP depending routes 1192 leverage this LFA protection. Details of this behavior can be found 1193 in Section 6.1. [minor] Redundant text: LFA is mentioned 3 times in the first sentence. [major] What about the case where an LFA doesn't exist? ... 1238 7.3. Automated 1240 The BGP-PIC solution does not require any operator involvement. The 1241 process is entirely automated as part of the FIB implementation. [] Ok -- it is not as much an automation as it is an implementation choice. 1243 The salient points enabling this automation are: [minor] I"m not sure if you're saying that the list below is what enables the "automation", or if the "automation" results in that. 1245 o Extension of the BGP Best Path to compute more than one primary 1246 ([12]and [13]) or backup BGP next-hop ([7] and [14]). [major] This functionality is available regardless of the structure of the FIB. 1248 o Sharing of BGP Path-list across BGP destinations with same 1249 primary and backup BGP next-hop. [nit] s/same/the same 1251 o Hierarchical indirection and dependency between BGP pathlist and 1252 IGP pathlist. [] The last two bullets are a characteristic of the implementation. 1254 7.4. Incremental Deployment 1256 As soon as one router supports BGP-PIC solution, it benefits from 1257 all its benefits (most notably convergence that does not depend in 1258 the number of prefixes) without any requirement for other routers to 1259 support BGP-PIC. [nit] Reduncant text: "it benefits from all its benefits" [major] The assertion above is true. However, (in the extreme) a single router supporting BGP-PIC doesn't necessarily translate in to better overall convergence in the network (it depends on where the router is, the failure, etc.). Also, having a mix of routers could result in transient micro loops as the speed of convergence is not the same. 1261 8. Security Considerations 1263 The behavior described in this document is internal functionality 1264 to a router that result in significant improvement to convergence 1265 time as well as reduction in CPU and memory used by FIB while not 1266 showing change in basic routing and forwarding functionality. As 1267 such no additional security risk is introduced by using the 1268 mechanisms proposed in this document. [] I can't think of anything else to say here, but you did not show (nor do I need you to) "significant...reduction in CPU and memory used". It sounds more like a marketing statement. :-( 1270 9. IANA Considerations 1272 No requirements for IANA [major] s/.../This document has no IANA actions. 1274 10. Conclusions [] This section is not needed, it contains redundant information. Please remove it. ... 1283 11. References [major] The RFC Editor requires that the OID be listed for all references [1]. Please update the references -- here's an example of what they should look like [2]: [RFC4271] Rekhter, Y., Ed., Li, T., Ed., and S. Hares, Ed., "A Border Gateway Protocol 4 (BGP-4)", RFC 4271, DOI 10.17487/RFC4271, January 2006, <https://www.rfc-editor.org/info/rfc4271>. [1] https://www.rfc-editor.org/styleguide/part2/ [2] https://www.rfc-editor.org/refs/ref4271.txt 1285 11.1. Normative References ... 1293 [3] E. Rosen, " Carrying Label Information in BGP-4", RFC 8277, 1294 October 2017 [minor] This reference can be Informative. 1296 [4] Andersson, L., Minei, I., and B. Thomas, "LDP Specification", 1297 RFC 5036, October 2007 [minor] This reference can be Informative. 1299 [5] A. Bashandy, C. Filsfils, S. Previdi, B. Decraene, S. 1300 Litkowski, M. Horneffer, R. Shakir, "Segment Routing with MPLS 1301 data plane", RFC 8660, December 2019 [minor] This reference can be Informative. ... 1377 Appendix A. Perspective ... 1392 o BGP Convergence per BGP destination ~ 200usec conservative, 1394 ~ 100usec optimistic [minor] "BGP Convergence" The explanation below talks about recalculating the best route -- which is different than convergence (which includes sending BGP Updates, etc.). [EoR -18]
- AD review of draft-ietf-rtgwg-bgp-pic-18 Alvaro Retana
- Re: AD review of draft-ietf-rtgwg-bgp-pic-18 Alvaro Retana
- Re: AD review of draft-ietf-rtgwg-bgp-pic-18 Alvaro Retana
- RE: AD review of draft-ietf-rtgwg-bgp-pic-18 Susan Hares
- RE: AD review of draft-ietf-rtgwg-bgp-pic-18 Alvaro Retana
- Re: AD review of draft-ietf-rtgwg-bgp-pic-18 Yingzhen Qu
- Re: AD review of draft-ietf-rtgwg-bgp-pic-18 Alvaro Retana
- Re: AD review of draft-ietf-rtgwg-bgp-pic-18 Ahmed Bashandy
- Re: AD review of draft-ietf-rtgwg-bgp-pic-18 Alvaro Retana
- Re: AD review of draft-ietf-rtgwg-bgp-pic-18 Ahmed Bashandy
- Re: AD review of draft-ietf-rtgwg-bgp-pic-18 Alvaro Retana
- Re: AD review of draft-ietf-rtgwg-bgp-pic-18 Yingzhen Qu
- Re: AD review of draft-ietf-rtgwg-bgp-pic-18 Ahmed Bashandy