Re: [Rift] AD Review of draft-ietf-rift-rift-12 (Part 1)
Tony Przygienda <tonysietf@gmail.com> Sun, 07 March 2021 17:31 UTC
References: <CAMMESsxTQnUDMGRiLPhB+Ci090xkE7Ea9HLC8E4SLQ7rv+qFnQ@mail.gmail.com>
In-Reply-To: <CAMMESsxTQnUDMGRiLPhB+Ci090xkE7Ea9HLC8E4SLQ7rv+qFnQ@mail.gmail.com>
From: Tony Przygienda <tonysietf@gmail.com>
Date: Sun, 07 Mar 2021 18:31:05 +0100
Message-ID: <CA+wi2hO6FBHutqdWy5d-vpwdHjwUofHdRV3ts+oyUzW3k7wVfQ@mail.gmail.com>
To: Alvaro Retana <aretana.ietf@gmail.com>
Cc: "draft-ietf-rift-rift@ietf.org" <draft-ietf-rift-rift@ietf.org>, "zhang.zheng" <zhang.zheng@zte.com.cn>, "rift-chairs@ietf.org" <rift-chairs@ietf.org>, "rift@ietf.org" <rift@ietf.org>
Archived-At: <https://mailarchive.ietf.org/arch/msg/rift/Ui4mHqc_5qZws1pcJfqlgMd14QA>
Subject: Re: [Rift] AD Review of draft-ietf-rift-rift-12 (Part 1)
List-Id: Discussion of Routing in Fat Trees <rift.ietf.org>
Alvaro, after the meetings here, my answers are inline.

On Fri, Jan 15, 2021 at 8:56 PM Alvaro Retana <aretana.ietf@gmail.com> wrote:

> Part 1 includes the introductory text through the end of §4.1 (Overview).
>
> While I appreciate the overview of the topologies and the protocol, I
> think that this extended introductory material is at times too
> long/wordy and complex (for the average reader, sometimes being forced
> to make assumptions or jump ahead), but also incomplete. To be more
> specific:
>
> - The description of the general topology (§4.1.2) gets into significant
>   detail and complexity, including the way in which the concepts are
>   depicted in the figures. (ASCII art is not the best, you might want to
>   take advantage of using SVG.) I noticed that none of the
>   figures/sections in §4.1.2 are referenced elsewhere (but the simple
>   Figures 2 and 3 are), which makes me think about the value of the
>   in-depth treatment if it will not be explicitly considered later on.
>
>   In several places I was under the impression that a DC design guide was
>   being presented. Which brings up the question: does RIFT require the
>   topologies to be exactly as the ones described to operate correctly?
>
> - My main expectation of the overview was to get a high-level idea of the
>   operation of RIFT, but that is not done there. Besides a quick mention
>   in §4.1.1 and some text in the introduction, the focus of the overview
>   is on fallen leaves/disaggregation. I understand that may be a
>   significant issue/feature, but it shouldn't be the dominating topic in
>   the overview. Maybe some of the other pieces are more "well-known"
>   (neighbor discovery, flooding, etc.), but even ZTP (even if optional)
>   is not mentioned.

Alvaro, after a meeting of the authors, here is a multi-part answer on how
we will address that:

a) Jordan will extend the document with SVG figures (it seems SVG finally
   works). The FSMs will also get SVG figures, which we already have but
   couldn't get to work before.

b) We will flatten the ToC at the front and add a reader's guide that
   tells people which parts they have to read. We already have the "what
   part you need to implement" text in a section but will probably move it
   higher up in the doc.

c) The abstract will be shortened as you suggest, and we will move the
   removed material (what the protocol does and provides) into the intro.

d) I'll rewrite the text in the third person.

e) All your comments/nits are fine and will be taken care of, except the
   forward references to sections in the glossary. IMO this would simply
   muddle the draft, and at a certain point we would get calls for an
   "index" of terms and sections, which the RFC format does not support
   and which IME never served any purpose in a book. A glossary is simply
   a glossary: go over the terms, read the definitions (some of which may
   not be clear at first), and keep it close when reading the following
   sections.

-- tony

> In general, I don't think that the deep-/complex treatment is
> necessary. You may still decide to keep it (see specific comments
> inline below), but I think it will represent a significant distraction
> for other reviewers.
>
> Thanks!
>
> Alvaro.
>
> [1] https://datatracker.ietf.org/doc/ad/alvaro.retana
>
> [Line numbers from idnits.]
>
> ...
> 17  Abstract
>
> 19  This document defines a specialized, dynamic routing protocol for
> 20  Clos and fat-tree network topologies optimized towards minimization
> 21  of configuration and operational complexity. The protocol
>
> [opinion] While this is a nice Abstract, I think that it is too long
> and is not completely reflected in the Introduction. Personally, I
> consider the first paragraph enough for an Abstract. It would be very
> nice if the list below was instead moved to the Introduction with
> pointers to where these protocol characteristics are
> specified/explained.
>
> 23  o deals with no configuration, fully automated construction of fat-
> 24    tree topologies based on detection of links,
>
> 26  o minimizes the amount of routing state held at each level,
>
> 28  o automatically prunes and load balances topology flooding exchanges
> 29    over a sufficient subset of links,
>
> 31  o supports automatic disaggregation of prefixes on link and node
> 32    failures to prevent black-holing and suboptimal routing,
>
> 34  o allows traffic steering and re-routing policies,
>
> 36  o allows loop-free non-ECMP forwarding,
>
> 38  o automatically re-balances traffic towards the spines based on
> 39    bandwidth available and finally
>
> 41  o provides mechanisms to synchronize a limited key-value data-store
> 42    that can be used after protocol convergence to e.g. bootstrap
> 43    higher levels of functionality on nodes.
>
> ...
> 77  Table of Contents
>
> 79  1. Authors . . . 6
> 80  2. Introduction . . . 6
> 81    2.1. Requirements Language . . . 8
> 82  3. Reference Frame . . . 8
> 83    3.1. Terminology . . . 8
> 84    3.2. Topology . . . 13
> 85  4. RIFT: Routing in Fat Trees . . . 15
> 86    4.1. Overview . . . 16
> 87      4.1.1. Properties . . . 16
> 88      4.1.2. Generalized Topology View . . . 17
> 89        4.1.2.1. Terminology . . . 17
> 90        4.1.2.2. Clos as Crossed Crossbars . . . 18
> 91      4.1.3. Fallen Leaf Problem . . . 28
> 92      4.1.4. Discovering Fallen Leaves . . . 30
> 93      4.1.5. Addressing the Fallen Leaves Problem . . . 31
> 94    4.2. Specification . . . 32
> 95      4.2.1. Transport . . . 33
> 96      4.2.2. Link (Neighbor) Discovery (LIE Exchange) . . . 33
> 97        4.2.2.1. LIE FSM . . . 36
>
> [nit] I don't think we need all this detail in the TOC. Maybe
> limiting the entries to 2 or 3 levels is enough (e.g. 4.2 or 4.2.2).
>
> ...
> 256  1. Authors
>
> 258  This work is a product of a list of individuals which are all to be
> 259  considered major contributors independent of the fact whether their
> 260  name made it to the limited boilerplate author's list or not.
>
> [minor] Please move this section to one called "Contributors" and
> place it after the Acknowledgments. Only the people not on the front
> page should be listed there.
>
> https://tools.ietf.org/html/rfc7322#section-4.11
>
> ...
> 273  2. Introduction
>
> 275  Clos [CLOS] and Fat-Tree [FATTREE] topologies have gained prominence
> 276  in today's networking, primarily as result of the paradigm shift
> 277  towards a centralized data-center based architecture that is poised
> 278  to deliver a majority of computation and storage services in the
> 279  future. Today's current routing protocols were geared towards a
> 280  network with an irregular topology and low degree of connectivity
> 281  originally but given they were the only available options,
> 282  consequently several attempts to apply those protocols to Clos have
> 283  been made. Most successfully BGP [RFC4271] [RFC7938] has been
> 284  extended to this purpose, not as much due to its inherent suitability
> 285  but rather because the perceived capability to easily modify BGP and
> 286  the immanent difficulties with link-state [DIJKSTRA] based protocols
> 287  to optimize topology exchange and converge quickly in large scale
> 288  densely meshed topologies. The incumbent protocols precondition
> 289  normally extensive configuration or provisioning during bring up and
> 290  re-dimensioning. This tends to be viable only for a set of
> 291  organizations with according networking operation skills and budgets.
> 292  For many IP fabric builders a desirable protocol would be one that
> 293  auto-configures itself and deals with failures and mis-configurations
> 294  with a minimum of human intervention only. Such a solution would
> 295  allow local IP fabric bandwidth to be consumed in a 'standard
> 296  component' fashion, i.e. provision it much faster and operate it at
> 297  much lower costs than today, much like compute or storage is consumed
> 298  already.
>
> [nit] s/Fat-Tree/Fat Tree/g To be consistent with the terminology
> section.
>
> ...
> 318  For the visually oriented reader, Figure 1 presents a first level
> 319  simplified view of the resulting information and routes on a RIFT
> 320  fabric. The top of the fabric is holding in its link-state database
> 321  the nodes below it and the routes to them. In the second row of the
> 322  database table we indicate that partial information of other nodes in
> 323  the same level is available as well. The details of how this is
> 324  achieved will be postponed for the moment. When we look at the
> 325  "bottom" of the fabric, the leaves, we see that the topology is
> 326  basically empty and they only hold a load balanced default route to
> 327  the next level under normal conditions.
>
> [nit] s/holding...the nodes below/holding...information about the nodes
> below
>
> [style nit] Some portions of the text are written in first person ("we
> indicate"). Personally, in this type of documents I prefer to not see
> that treatment ("the table indicates"). This is just a personal
> preference, a nit. No need to take any action -- unless you really
> want to. ;-)
>
> [minor] "details of how this is achieved will be postponed for the
> moment." Sure, this is just the Introduction. A pointer to where the
> details are would be very nice.
>
> [nit] s/and they only hold a load balanced default route to the next
> level under normal conditions./and, under normal conditions, they only
> hold a load balanced default route to the next level.
>
> 329  The balance of this document details a dedicated IP fabric routing
> 330  protocol, fills in the specification details and ultimately includes
> 331  resulting security considerations.
>
> [] As I mentioned above, moving the list from the Abstract to the
> Introduction would be beneficial. Given that this is a long document,
> providing some type of roadmap/reader's guide (based on that list, for
> example) would be great!
>
> ...
> 357  2.1. Requirements Language
>
> 359  The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
> 360  "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
> 361  document are to be interpreted as described in RFC 8174 [RFC8174].
>
> [major] Use the template exactly as written in rfc8174.
>
> 363  3. Reference Frame
>
> 365  3.1. Terminology
>
> [minor] Where possible/appropriate, please add forward references to
> the sections where the terms are further specified.
>
> 367  This section presents the terminology used in this document. It is
> 368  assumed that the reader is thoroughly familiar with the terms and
> 369  concepts used in OSPF [RFC2328] and IS-IS [ISO10589-Second-Edition],
> 370  [ISO10589] as well as the according graph theoretical concepts of
> 371  shortest path first (SPF) [DIJKSTRA] computation and DAGs.
>
> [minor] The two references to ISO10589 look the same to me. Do we need
> both?
>
> ...
> 383  Directed Acyclic Graph (DAG): A finite directed graph with no
> 384    directed cycles (loops). If links in Clos are considered as
> 385    either being all directed towards the top or vice versa, each of
> 386    such two graphs is a DAG.
>
> [nit] s/in Clos/in a Clos
>
> 388  Folded Spine-and-Leaf: In case Clos fabric input and output stages
> 389    are analogous, the fabric can be "folded" to build a "superspine"
> 390    or top which we will call Top of Fabric (ToF) in this document.
>
> [nit] s/In case Clos/In case the Clos
>
> [minor] This term is only used in this section. Does it really need
> to be defined? I'm wondering if the terminology can be simplified.
>
> ...
> 401  Superspine vs. Aggregation and Spine vs. Edge/Leaf:
> 402    Traditional level names in 5-stages folded Clos for Level 2, 1 and
> 403    0 respectively. We normalize this language to talk about top-of-
> 404    fabric (ToF), top-of-pod (ToP) and leaves.
>
> [minor] Instead of adding an entry (even if the name uses "vs.", it is
> not a comparison) for these names just to reference the normalized
> names, which are also defined later, just mention these traditional
> names in those entries.
>
> 406  Zero Touch Provisioning (ZTP): Optional RIFT mechanism which allows
> 407    to derive node levels automatically based on minimum configuration
> 408    (only ToF property has to be provisioned on according nodes).
>
> [?] I can't parse the text in parenthesis.
>
> 410  Point of Delivery (PoD): A self-contained vertical slice or subset
> 411    of a Clos or Fat Tree network containing normally only level 0 and
> 412    level 1 nodes. A node in a PoD communicates with nodes in other
> 413    PoDs via the Top-of-Fabric. We number PoDs to distinguish them
> 414    and use PoD #0 to denote "undefined" PoD.
>
> [minor] "level 0 and level 1" The definition of level doesn't
> mention numbers -- maybe add something there to point at the fact that
> level 0 and a leaf are equivalent (position-wise).
>
> ...
> 429  Leaf: A node without southbound adjacencies. Its level is 0 (except
> 430    cases where it is deriving its level via ZTP and is running
> 431    without LEAF_ONLY which will be explained in Section 4.2.7).
>
> [minor] s/(except...Section 4.2.7)./(see Section 4.2.7).
>
> 433  Top-of-fabric Plane or Partition: In large fabrics top-of-fabric
> 434    switches may not have enough ports to aggregate all switches south
> 435    of them and with that, the ToF is 'split' into multiple
> 436    independent planes. Introduction and Section 4.1.2 explains the
> 437    concept in more detail. A plane is subset of ToF nodes that see
> 438    each other through south reflection or E-W links.
>
> [minor] "Introduction..." I didn't see related text there.
>
> [nit] s/is subset/is a subset
>
> 440  Radix: A radix of a switch is basically number of switching ports it
> 441    provides. It's sometimes called fanout as well.
>
> [nit] s/A radix of a switch is basically number of switching ports it
> provides./The number of switching ports it provides.
>
> ...
> 460  East-West Link: A link between two nodes at the same level. East-
> 461    West links are normally not part of Clos or "fat-tree" topologies.
>
> [minor] s/East-West Link/East-West (E-W) Link
>
> [minor] "...normally not part of Clos or "fat-tree" topologies." But
> they are used by RIFT in several places. To avoid confusion maybe the
> last sentence is not needed.
>
> ...
> 476  South Reflection: Often abbreviated just as "reflection" it defines
> 477    a mechanism where South Node TIEs are "reflected" from the level
> 478    south back up north to allow nodes in the same level without E-W
> 479    links to "see" each other's node TIEs.
>
> [nit] s/"reflection" it/"reflection", it
>
> [minor] Please expand TIE. I know the definition is in the very next
> paragraph, but it is good practice to expand on first use.
>
> ...
> 490  Node TIE: This stands as acronym for a "Node Topology Information
> 491    Element" that contains all adjacencies the node discovered and
> 492    information about node itself. Node TIE should NOT be confused
> 493    with a North TIE since "node" defines the type of TIE rather than
> 494    its direction.
>
> [nit] s/This stands as acronym for a "Node Topology Information
> Element" that contains/An acronym for a "Node Topology Information
> Element", which contains
>
> [nit] s/about node/about the node
>
> [minor] s/NOT/not/g This is not one of the rfc2119 keywords, so it
> should not be capitalized. I know you're doing it for emphasis, but
> it will be eventually changed -- so let's take care of it now.
>
> ...
> 501  Key Value TIE: A South TIE that is carrying a set of key value pairs
> 502    [DYNAMO]. It can be used to distribute information in the
> 503    southbound direction within the protocol.
>
> [minor] "Key Value TIE" is not used anywhere else in the document.
> Also, the definition talks about a South TIE carrying (only? -- that's
> what the definition sounds like) key value pairs...but §4.2.3.2
> mentions other information and even North TIEs carrying key value
> pairs.
>
> 505  TIDE: Topology Information Description Element, equivalent to CSNP
> 506    in ISIS.
>
> [minor] Please expand CSNP.
>
> 508  TIRE: Topology Information Request Element, equivalent to PSNP in
> 509    ISIS. It can both confirm received and request missing TIEs.
>
> [minor] Please expand PSNP.
>
> 511  De-aggregation/Disaggregation: Process in which a node decides to
> 512    advertise more specific prefixes Southwards, either positively to
> 513    attract the corresponding traffic, or negatively to repel it.
> 514    Disaggregation is performed to prevent black-holing and suboptimal
> 515    routing to the more specific prefixes.
>
> [nit] "De-aggregation/Disaggregation" It would be very nice if you
> settled on one word. Disaggregation seems to be used the most, but
> dis-aggregation also shows up a couple of times.
>
> ...
> 521  Flood Repeater (FR): A node can designate one or more northbound
> 522    neighbor nodes to be flood repeaters. The flood repeaters are
> 523    responsible for flooding northbound TIEs further north. They are
> 524    similar to MPR in OSLR. The document sometimes calls them flood
> 525    leaders as well.
>
> [minor] Please expand both MPR and OLSR.
>
> [minor] Also, please add a reference. I see that MPR/OLSR are only
> mentioned once more (§4.2.3.9), and wonder if we even need to make
> reference to them. The first paragraph in §4.2.3.9 already pretty
> much covers the intent of an MPR.
>
> 527  Bandwidth Adjusted Distance (BAD): Each RIFT node can calculate the
> 528    amount of northbound bandwidth available towards a node compared
> 529    to other nodes at the same level and can modify the route distance
> 530    accordingly to allow for the lower level to adjust their load
> 531    balancing towards spines.
>
> [minor] A reference to §4.3.6.1 would be very nice.
>
> 533  Overloaded: Applies to a node advertising `overload` attribute as
> 534    set. The semantics closely follow the meaning of the same
> 535    attribute in [ISO10589-Second-Edition].
>
> [nit] s/advertising `overload` attribute/advertising the `overload`
> attribute
>
> [minor] There is no overload attribute in ISO10589, just an Overload
> Bit. Also, §4.3.1 (please add a reference) calls it the overload bit.
>
> ...
> 540  Three-Way Adjacency: RIFT tries to form a unique adjacency over an
> 541    interface and exchange local configuration and necessary ZTP
> 542    information. An adjacency is only advertised in node TIEs and
> 543    used for computations after it achieved three-way state, i.e. both
> 544    routers reflected each other in LIEs including relevant security
> 545    information. LIEs before three-way state is reached may carry ZTP
> 546    related information already.
> > [minor] s/tries to form a unique adjacency/forms a unique adjacency > > > [nit] s/and exchange local/and exchanges local > > > [nit] s/after it achieved three-way state/after the three-way state is > achieved > > > [minor] Note that three-way, threeway and three way are all used in > different places. Please be consistent. > > > ... > 554 Neighbor: Once a three-way adjacency has been formed a > neighborship > 555 relationship contains the neighbor's properties. Multiple > 556 adjacencies can be formed to a remote node via parallel > interfaces > 557 but such adjacencies are NOT sharing a neighbor structure. > Saying > 558 "neighbor" is thus equivalent to saying "a three-way > adjacency". > > [] How is load balancing achieved through parallel links between the > same pair of routers? Just putting this comment here so I don't > forget it later. > > > ... > 566 Shortest-Path First (SPF): A well-known graph algorithm > attributed > 567 to Dijkstra that establishes a tree of shortest paths from a > 568 source to destinations on the graph. We use SPF acronym due > to > 569 its familiarity as general term for the node reachability > 570 calculations RIFT can employ to ultimately calculate routes > of > 571 which Dijkstra algorithm is one. > > 573 North SPF (N-SPF): A reachability calculation that is > progressing > 574 northbound, as example SPF that is using South Node TIEs > only. > 575 Normally it progresses a single hop only and installs default > 576 routes. > > 578 South SPF (S-SPF): A reachability calculation that is > progressing > 579 southbound, as example SPF that is using North Node TIEs > only. > > [minor] Please add a reference to where the specific algorithm used by > RIFT is specified. > > > ... > 585 3.2. 
Topology > 586 ^ N +--------+ +--------+ > 587 Level 2 | |ToF 21| |ToF 22| > 588 E <-*-> W ++-+--+-++ ++-+--+-++ > 589 | | | | | | | | | > 590 S v P111/2 P121/2 | | | | > 591 ^ ^ ^ ^ | | | | > 592 | | | | | | | | > 593 +--------------+ | +-----------+ | | | > +---------------+ > 594 | | | | | | | > | > 595 South +-----------------------------+ | | > ^ > 596 | | | | | | | > All TIEs > 597 0/0 0/0 0/0 +-----------------------------+ > | > 598 v v v | | | | > | > 599 | | +-+ +<-0/0----------+ | > | > 600 | | | | | | | > | > 601 +-+----++ optional +-+----++ ++----+-+ > ++-----++ > 602 Level 1 | | E/W link | | | | | > | > 603 |Spin111+----------+Spin112| |Spin121| > |Spin122| > 604 +-+---+-+ ++----+-+ +-+---+-+ > ++---+--+ > 605 | | | South | | | | > 606 | +---0/0--->-----+ 0/0 | +----------------+ | > 607 0/0 | | | | | | | > 608 | +---<-0/0-----+ | v | +--------------+ | | > 609 v | | | | | | | > 610 +-+---+-+ +--+--+-+ +-+---+-+ > +---+-+-+ > 611 Level 0 | | (L2L) | | | | | > | > 612 |Leaf111+~~~~~~~~~~+Leaf112| |Leaf121| > |Leaf122| > 613 +-+-----+ +-+---+-+ +--+--+-+ > +-+-----+ > 614 + + \ / + + > 615 Prefix111 Prefix112 \ / Prefix121 > Prefix122 > 616 multi-homed > 617 Prefix > 618 +---------- PoD 1 ---------+ +---------- PoD 2 > ---------+ > > 620 Figure 2: A Three Level Spine-and-Leaf Topology > 621 .+--------+ +--------+ +--------+ +--------+ > 622 .|ToF A1| |ToF B1| |ToF B2| |ToF A2| > 623 .++-+-----+ ++-+-----+ ++-+-----+ ++-+-----+ > 624 . | | | | | | | | > 625 . | | | | | +---------------+ > 626 . | | | | | | | | > 627 . | | | +-------------------------+ | > 628 . | | | | | | | | > 629 . | +-----------------------+ | | | | > 630 . | | | | | | | | > 631 . | | +---------+ | +---------+ | | > 632 . | | | | | | | | > 633 . | +---------------------------------+ | | > 634 . | | | | | | | | > 635 .++-+-----+ ++-+-----+ +--+-+---+ +----+-+-+ > 636 .|Spine111| |Spine112| |Spine121| |Spine122| > 637 .+-+---+--+ ++----+--+ +-+---+--+ ++---+---+ > 638 . 
| | | | | | | | > 639 . | +--------+ | | +--------+ | > 640 . | | | | | | | | > 641 . | -------+ | | | +------+ | | > 642 . | | | | | | | | > 643 .+-+---+-+ +--+--+-+ +-+---+-+ +---+-+-+ > 644 .|Leaf111| |Leaf112| |Leaf121| |Leaf122| > 645 .+-------+ +-------+ +-------+ +-------+ > > 647 Figure 3: Topology with Multiple Planes > > 649 We will use topology in Figure 2 (called commonly a fat > tree/network > 650 in modern IP fabric considerations [VAHDAT08] as homonym to the > 651 original definition of the term [FATTREE]) in all further > 652 considerations. This figure depicts a generic "single plane > fat- > 653 tree" and the concepts explained using three levels apply by > 654 induction to further levels and higher degrees of connectivity. > 655 Further, this document will deal also with designs that provide > only > 656 sparser connectivity and "partitioned spines" as shown in > Figure 3 > 657 and explained further in Section 4.1.2. > > [minor] The first sentence introduces another source to define fat > tree, which is not mentioned in the Introduction nor in the > Terminology. This is not a huge deal, but it would be nice to keep > consistency throughout. IOW, include the new reference somewhere in > the first couple of sections, settle on one, or simply just don't add > a new reference. > > > [minor] For completeness, it would be nice to explain that the figures > are incomplete: for example, Figure 2 shows only some of the TIEs, > Figure 3 shows none of them, etc... > > > [] BTW, SVI graphics are now supported in xmltorfcv3. Some of the > figures might be easier to visualize that way than using ASCII art. > > > 659 4. 
RIFT: Routing in Fat Trees > > 661 We present here a detailed outline of a protocol optimized for > 662 Routing in Fat Trees (RIFT) that in most abstract terms has many > 663 properties of a modified link-state protocol > 664 [RFC2328][ISO10589-Second-Edition] when distributing information > 665 northbound and distance vector [RFC4271] protocol when > distributing > 666 information southbound. While this is an unusual combination, > it > 667 does quite naturally exhibit the desirable properties we seek. > > [minor] s/detailed outline/detailed specification > > > [nit] s/and distance vector/and a distance vector > > > [] The references to OSPF/ISIS/BGP seem superfluous because those > documents don't define generic link-state or distance vector protocols > -- in fact, many would argue that BGP is a path vector protocol. Just > my opinion. I would be very happy if the references are not included. > I wonder if there are generic references that can be used instead of > specific ones (for information purposes). > > > 669 4.1. Overview > > 671 4.1.1. Properties > > 673 The most singular property of RIFT is that it floods flat > link-state > 674 information northbound only so that each level obtains the full > 675 topology of levels south of it. Link-State information is, > with some > 676 exceptions, never flooded East-West or back South again. > Exceptions > 677 like south reflection is explained in detail in Section 4.2.5.1 > and > 678 east-west flooding at ToF level in multi-plane fabrics is > outlined in > 679 Section 4.1.2. In southbound direction, the protocol operates > like a > 680 "fully summarizing, unidirectional" path vector protocol or > rather a > 681 distance vector with implicit split horizon. Routing > information, > 682 normally just the default route, propagates one hop south and > is 're- > 683 advertised' by nodes at next lower level. 
However, RIFT uses > 684 flooding in the southern direction as well to avoid the > overhead of > 685 building an update per adjacency. We omit describing the > East-West > 686 direction for the moment. > > [minor] What is "flat link-state information"? It looks like this is > the only place where "flat" is used. Maybe s/flat/ > > > [nit] s/In southbound direction/In the southbound direction > > > [] "...the protocol operates like a "fully summarizing, > unidirectional" path vector protocol or rather a distance vector with > implicit split horizon." I hope that the operation is specified > elsewhere, and that the document doesn't depend on these descriptions. > Personal opinion: simple and direct language may serve you better. > > > 688 Those information flow constraints create not only an > anisotropic > 689 protocol (i.e. the information is not distributed "evenly" or > 690 "clumped" but summarized along the N-S gradient) but also a > "smooth" > 691 information propagation where nodes do not receive the same > 692 information from multiple directions at the same time. > Normally, > 693 accepting the same reachability on any link, without > understanding > 694 its topological significance, forces tie-breaking on some kind > of > 695 distance metric. And such tie-breaking leads ultimately in > hop-by- > 696 hop forwarding to shortest paths only. In contrast to that, > RIFT, > 697 under normal conditions, does not need to tie-break same > reachability > 698 information from multiple directions. Its computation > principles > 699 (south forwarding direction is always preferred) leads to > valley-free > 700 forwarding behavior. And since valley free routing is > loop-free, it > 701 can use all feasible paths which is another highly desirable > property > 702 if available bandwidth should be utilized to the maximum extent > 703 possible. > > [] "anisotropic" This is my word of the day. I learned a new one! 
> :-)
>
> [nit] s/tie-break same/tie-break the same
>
> [minor] "valley-free" Reference?
>
> 705 To account for the "northern" and the "southern" information split
> 706 the link state database is partitioned accordingly into "north
> 707 representation" and "south representation" TIEs. In simplest terms
> 708 the North TIEs contain a link state topology description of lower
> 709 levels and and South TIEs carry simply default routes towards the
> 710 level above. This oversimplified view will be refined gradually in
> 711 following sections while introducing protocol procedures and state
> 712 machines at the same time.
>
> [nit] s/in following/in the following
>
> 714 4.1.2. Generalized Topology View
>
> 716 This section will shed some light on the topologies RIFT addresses,
> 717 including multi plane fabrics and their implications. Readers that
> 718 are only interested in single plane designs, i.e. all top-of-fabric
> 719 nodes being topologically equal and initially connected to all the
> 720 switches at the level below them, can skip the rest of Section 4.1.2
> 721 and resulting Section 4.2.5.2 as well.
>
> [minor] "Readers...can skip the rest of Section 4.1.2 and resulting
> Section 4.2.5.2 as well." I can see how a reader can skip a part of
> the overview, but §4.2* is where the specification is. Are you saying
> that §4.2.5.2 doesn't have to be implemented/supported in some cases?
> Are there other sections that are also not needed in some cases? Does
> this result in the ability to implement subsets of RIFT to support
> specific topologies? Where is that discussed?
>
> ...
> 737 4.1.2.1. Terminology
> ...
> 746 K: Denotes the number of ports in radix of a switch pointing north or
> 747 south. Further, K_LEAF denotes number of ports pointing south,
> 748 i.e. towards leaves, and K_TOP for number of ports pointing north
> 749 towards a higher spine level.
> To simplify the visual aids,
> 750 notations and further considerations, K will be mostly set to
> 751 Radix/2.
>
> [minor] Radix is defined in §3.1 as the number of ports. s/Denotes
> the number of ports in radix of a switch/Denotes the radix of a switch
>
> ...
> 757 N: Denote the number of independent ToF planes in a topology.
>
> [nit] s/Denote/Denotes
>
> ...
> 766 4.1.2.2. Clos as Crossed Crossbars
>
> 768 The typical topology for which RIFT is defined is built of P number
> 769 of PoDs and connected together by S number of ToF nodes. A PoD node
> 770 has K number of ports (also called Radix). We consider half of them
> 771 (K=Radix/2) as connecting host devices from the south, and the other
> 772 half connecting to interleaved PoD Top-Level switches to the north.
> 773 Ratio K can be chosen differently without loss of generality when
> 774 port speeds differ or the fabric is oversubscribed but K=R/2 allows
> 775 for more readable representation whereby there are as many ports
> 776 facing north as south on any intermediate node. We represent a node
> 777 hence in a schematic fashion with ports "sticking out" to its north
> 778 and south rather than by the usual real-world front faceplate designs
> 779 of the day.
>
> [nit] s/Ratio K can be chosen differently/The K ratio can be chosen
> differently
>
> [minor] "K=R/2" R is defined in §4.1.2.1 as the redundancy, not the radix.
>
> 781 Figure 4 provides a view of a leaf node as seen from the north, i.e.
> 782 showing ports that connect northbound. For lack of a better symbol,
> 783 we have chosen to use the "o" as ASCII visualisation of a single
> 784 port. In this example, K_LEAF has 6 ports. Observe that the number
> 785 of PoDs is not related to Radix unless the ToF Nodes are constrained
> 786 to be the same as the PoD nodes in a particular deployment.
>
> [minor] "showing ports that connect northbound...K_LEAF has 6 ports"
> The ports that connect north are K_TOP.
> > > 788 Top view > 789 +---+ > 790 | | > 791 | o | e.g., Radix = 12, K_LEAF = 6 > 792 | | > 793 | o | > 794 | | ------------------------- > 795 | o ------- Physical Port (Ethernet) ----+ > 796 | | ------------------------- | > 797 | o | | > 798 | | | > 799 | o | | > 800 | | | > 801 | o | | > 802 | | | > 803 +---+ | > > 805 || || || || || || || > 806 +----+ > +------------------------------------------------+ > 807 | | | > | > 808 +----+ > +------------------------------------------------+ > 809 || || || || || || || > 810 Side views > > 812 Figure 4: A Leaf Node, K_LEAF=6 > > 814 The Radix of a PoD's top node may be different than that of the > leaf > 815 node. Though, more often than not, a same type of node is used > for > 816 both, effectively forming a square (K*K). In general case, we > could > 817 have switches with K_TOP southern ports on nodes at the top of > the > 818 PoD which are not necessarily the same as K_LEAF. For > instance, in > 819 the representations below, we pick a 6 port K_LEAF and a 8 port > 820 K_TOP. In order to form a crossbar, we need K_TOP Leaf Nodes as > 821 illustrated in Figure 5. > > [nit] s/In general case/In the general case > > > [minor] "K_TOP southern ports" Aren't K_TOP the ports pointing north? > The description is confusing because the terminology from the last > section is not used in the same way -- the description mixes the > terminology with the number represented. For example, "K_TOP Leaf > Nodes" doesn't make sense if the terminology is strictly applied, > where K_TOP is the "number of ports pointing north". Also (if I > understood Figure 4 correctly), each node below has 6 K_TOP ports -- > presumably the node at the top has 8 K_LEAF ports. 
> > > 823 +---+ +---+ +---+ +---+ +---+ +---+ +---+ +---+ > 824 | | | | | | | | | | | | | | | | > 825 | o | | o | | o | | o | | o | | o | | o | | o | > 826 | | | | | | | | | | | | | | | | > 827 | o | | o | | o | | o | | o | | o | | o | | o | > 828 | | | | | | | | | | | | | | | | > 829 | o | | o | | o | | o | | o | | o | | o | | o | > 830 | | | | | | | | | | | | | | | | > 831 | o | | o | | o | | o | | o | | o | | o | | o | > 832 | | | | | | | | | | | | | | | | > 833 | o | | o | | o | | o | | o | | o | | o | | o | > 834 | | | | | | | | | | | | | | | | > 835 | o | | o | | o | | o | | o | | o | | o | | o | > 836 | | | | | | | | | | | | | | | | > 837 +---+ +---+ +---+ +---+ +---+ +---+ +---+ +---+ > > 839 Figure 5: Southern View of a PoD, K_TOP=8 > > 841 As further visualized in Figure 6 the K_TOP Leaf Nodes are fully > 842 interconnected with the K_LEAF PoD-top nodes, providing > connectivity > 843 that can be represented as a crossbar when "looked at" from the > 844 north. The result is that, in the absence of a failure, a > packet > 845 entering the PoD from the north on any port can be routed to > any port > 846 in the south of the PoD and vice versa. And that is precisely > why it > 847 makes sense to talk about a "switching matrix". > > [minor] "K_TOP Leaf Nodes are fully interconnected with the K_LEAF > PoD-top nodes" Same comment about the terminology... I only see one > "PoD top Node" with one connection to a switch, not a full > interconnect. > > > [minor] The figure also doesn't show the connection between the > switches (if any)...and I'm not sure what the "connectors" (?) on the > switch at the top/bottom are (there seem to be more of them than > ports). 
> > > 849 E<-*->W > > 851 +---+ +---+ +---+ +---+ +---+ +---+ +---+ +---+ > 852 | | | | | | | | | | | | | | | | > 853 +--------------------------------------------------------+ > 854 | o o o o o o o o | > 855 +--------------------------------------------------------+ > 856 +--------------------------------------------------------+ > 857 | o o o o o o o o | > 858 +--------------------------------------------------------+ > 859 +--------------------------------------------------------+ > 860 | o o o o o o o o | > 861 +--------------------------------------------------------+ > 862 +--------------------------------------------------------+ > 863 | o o o o o o o o | > 864 +--------------------------------------------------------+ > 865 +--------------------------------------------------------+ > 866 | o o o o o o o o > |<-+ > 867 +--------------------------------------------------------+ > | > 868 +--------------------------------------------------------+ > | > 869 | o o o o o o o o | > | > 870 +--------------------------------------------------------+ > | > 871 | | | | | | | | | | | | | | | | > | > 872 +---+ +---+ +---+ +---+ +---+ +---+ +---+ +---+ > | > 873 ^ > | > 874 | > | > 875 | ---------- --------------------- > | > 876 +----- Leaf Node PoD top Node (Spine) > --+ > 877 ---------- --------------------- > > 879 Figure 6: Northern View of a PoD's Spines, K_TOP=8 > > 881 Side views of this PoD is illustrated in Figure 7 and Figure 8. 
> > 883 Connecting to Spine > > 885 || || || || || || || || > 886 > +----------------------------------------------------------------+ N > 887 | PoD top Node seen sideways > | ^ > 888 > +----------------------------------------------------------------+ | > 889 || || || || || || || || > * > 890 +----+ +----+ +----+ +----+ +----+ +----+ +----+ +----+ > | > 891 | | | | | | | | | | | | | | | | > v > 892 +----+ +----+ +----+ +----+ +----+ +----+ +----+ +----+ > S > 893 || || || || || || || || > > 895 Connecting to Client nodes > > 897 Figure 7: Side View of a PoD, K_TOP=8, K_LEAF=6 > > [minor] I count 8 connections to the south in the top node...and just > one on the switches below it. > > > 899 Connecting to Spine > > 901 || || || || || || > 902 +----+ +----+ +----+ +----+ +----+ +----+ > N > 903 | | | | | | | | | | | PoD top Nodes > ^ > 904 +----+ +----+ +----+ +----+ +----+ +----+ > | > 905 || || || || || || > * > 906 +------------------------------------------------+ > | > 907 | Leaf seen sideways | > v > 908 +------------------------------------------------+ > S > 909 || || || || || || > > 911 Connecting to Client nodes > > [minor] A leaf doesn't have southbound ports/adjacencies. What is > this leaf connected to? > > > 913 Figure 8: Other Side View of a PoD, K_TOP=8, K_LEAF=6, 90o > turn in > 914 E-W Plane > > [minor] In this case I count a leaf with 6 northbound interfaces. > > > [minor] "90o turn in E-W Plane" I don't know what that is. > > > 916 As next step, let us observe that a resulting PoD can be > abstracted > 917 as a bigger node with a number K of K_POD= K_TOP * K_LEAF, and > the > 918 design can recurse. > > [minor] K is already defined as the number of ports (§4.1.2.1). > > > [minor] Lost again. If the PoD is abstracted as a single node, then > it would have K_TOP + K_LEAF nodes, not sure where the "*" comes from > or what is trying to denote. 
>
> 920 It will be critical at this point that, before progressing further,
> 921 the concept and the picture of "crossed crossbars" is clear. Else,
> 922 the following considerations might be difficult to comprehend.
>
> [] The concept is clear to me -- I don't find the explanation and the
> corresponding pictures specially helpful.
>
> ...
> 929 This topology is also referred to as a single plane configuration and
> 930 is quite popular due to its simplicity. In order to reach a 1:1
> 931 connectivity ratio between the ToF and the leaves, it results that
> 932 there are K_TOP ToF nodes, because each port of a ToP node connects
> 933 to a different ToF node, and K_LEAF ToP nodes for the same reason.
> 934 Consequently, it will take (P * K_LEAF) ports on a ToF node to
> 935 connect to each of the K_LEAF ToP nodes of the P PoDs, as shown in
> 936 Figure 9.
>
> [minor] "there are K_TOP ToF nodes...and K_LEAF ToP nodes" As with
> other places, the terminology is not used as defined earlier. K_*
> refer to the number of ports in a specific switch, so their use is
> relative to that switch. In this case each ToP has K_TOP links, and
> each ToF has K_LEAF links. The use without the reference point is
> confusing.
>
> [minor] "(P * K_LEAF)" This calculation is clear once one realizes
> that the previous discussion was for the number of ports per PoD, not
> total (as the definition of K_* suggests).
> > > > 938 [ ] [ ] [ ] [ ] [ ] [ ] [ ] [ ] <-----+ > 939 | | | | | | | | | > 940 [=================================] | ----------- > 941 | | | | | | | | +----- Top-of-Fabric > 942 [ ] [ ] [ ] [ ] [ ] [ ] [ ] [ ] +----- Node > -------+ > 943 | ----------- > | > 944 | > v > 945 +-+ +-+ +-+ +-+ +-+ +-+ +-+ +-+ <-----+ > +-+ > 946 | | | | | | | | | | | | | | | | > | | > 947 [ |o| |o| |o| |o| |o| |o| |o| |o| ] > | | > 948 [ |o| |o| |o| |o| |o| |o| |o| |o| ] > ------------------------- | | > 949 [ |o| |o| |o| |o| |o| |o| |o| |o<--- Physical Port > (Ethernet) | | > 950 [ |o| |o| |o| |o| |o| |o| |o| |o| ] > ------------------------- | | > 951 [ |o| |o| |o| |o| |o| |o| |o| |o| ] > | | > 952 [ |o| |o| |o| |o| |o| |o| |o| |o| ] > | | > 953 | | | | | | | | | | | | | | | | > | | > 954 [ |o| |o| |o| |o| |o| |o| |o| |o| ] > | | > 955 [ |o| |o| |o| |o| |o| |o| |o| |o| ] -------------- > | | > 956 [ |o| |o| |o| |o| |o| |o| |o| |o| ] <--- PoD top level > | | > 957 [ |o| |o| |o| |o| |o| |o| |o| |o| ] node (Spine) ---+ > | | > 958 [ |o| |o| |o| |o| |o| |o| |o| |o| ] -------------- | > | | > 959 [ |o| |o| |o| |o| |o| |o| |o| |o| ] | > | | > 960 | | | | | | | | | | | | | | | | -+ +- +-+ v > | | > 961 [ |o| |o| |o| |o| |o| |o| |o| |o| ] | | --| |--[ > ]--| | > 962 [ |o| |o| |o| |o| |o| |o| |o| |o| ] | ----- | --| |--[ > ]--| | > 963 [ |o| |o| |o| |o| |o| |o| |o| |o| ] +--- PoD ---+ --| |--[ > ]--| | > 964 [ |o| |o| |o| |o| |o| |o| |o| |o| ] | ----- | --| |--[ > ]--| | > 965 [ |o| |o| |o| |o| |o| |o| |o| |o| ] | | --| |--[ > ]--| | > 966 [ |o| |o| |o| |o| |o| |o| |o| |o| ] | | --| |--[ > ]--| | > 967 | | | | | | | | | | | | | | | | -+ +- +-+ > | | > 968 +-+ +-+ +-+ +-+ +-+ +-+ +-+ +-+ > +-+ > > 970 Figure 9: Fabric Spines and TOFs in Single Plane Design, 3 > PoDs > > [minor] I believe you when you say that this figure shows how "it will > take (P * K_LEAF) ports on a ToF node to connect to each of the K_LEAF > ToP nodes of the P PoDs", but the drawing is not straight forward to > 
interpret. Among other reasons because there seem to be 3 different
> connection types/interpretations/? to the ToF -- a different one for
> each PoD.
>
> 972 The top view can be collapsed into a third dimension where the hidden
> 973 depth index is representing the PoD number. We can then show one PoD
> 974 as a class of PoDs and hence save one dimension in our
> 975 representation. The Spine Node expands in the depth and the vertical
> 976 dimensions, whereas the PoD top level Nodes are constrained, in
> 977 horizontal dimension. A port in the 2-D representation represents
> 978 effectively the class of all the ports at the same position in all
> 979 the PoDs that are projected in its position along the depth axis.
> 980 This is shown in Figure 10.
>
> [] Do we really need this extra representation?
>
> ...
> 1003 As simple as single plane deployment is it introduces a limit due to
> 1004 the bound on the available radix of the ToF nodes that has to be at
> 1005 least P * K_LEAF. Nevertheless, we will see that a distinct
> 1006 advantage of a connected or non-partitioned Top-of-Fabric is that all
> 1007 failures can be resolved by simple, non-transitive, positive
> 1008 disaggregation (i.e. nodes advertising more specific prefixes with
> 1009 the default to the level below them that is however not propagated
> 1010 further down the fabric) as described in Section 4.2.5.1 . In other
> 1011 words; non-partitioned ToF nodes can always reach nodes below or
> 1012 withdraw the routes from PoDs they cannot reach unambiguously. And
> 1013 with this, positive disaggregation can heal all failures and still
> 1014 allow all the ToF nodes to see each other via south reflection.
> 1015 Disaggregation will be explained in further detail in Section 4.2.5.
>
> [nit] s/deployment is it introduces/deployment is, it introduces
>
> 1017 In order to scale beyond the "single plane limit", the Top-of-Fabric
> 1018 can be partitioned by a N number of identically wired planes where N
> 1019 is an integer divider of K_LEAF. The 1:1 ratio and the desired
> 1020 symmetry are still served, this time with (K_TOP * N) ToF nodes, each
> 1021 of (P * K_LEAF / N) ports. N=1 represents a non-partitioned Spine
> 1022 and N=K_LEAF is a maximally partitioned Spine. Further, if R is any
> 1023 integer divisor of K_LEAF, then N=K_LEAF/R is a feasible number of
> 1024 planes and R a redundancy factor. If proves convenient for
> 1025 deployments to use a radix for the leaf nodes that is a power of 2 so
> 1026 they can pick a number of planes that is a lower power of 2. The
> 1027 example in Figure 11 splits the Spine in 2 planes with a redundancy
> 1028 factor R=3, meaning that there are 3 non-intersecting paths between
> 1029 any leaf node and any ToF node. A ToF node must have, in this case,
> 1030 at least 3*P ports, and be directly connected to 3 of the 6 PoD-ToP
> 1031 nodes (spines) in each PoD.
>
> [nit] s/by a N number/by an N number
>
> [minor] "(K_TOP * N) ToF nodes, each of (P * K_LEAF / N) ports"
> Again, the use of the terminology without a reference assumes a
> specific interpretation by the reader.
>
> [minor] "if R is any integer divisor of K_LEAF, then N=K_LEAF/R is a
> feasible number of planes and R a redundancy factor." Please expand
> on the meaning of the redundancy factor.
>
> [minor] "6 PoD-ToP nodes" I count 8.
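The sizing arithmetic in the quoted paragraph (draft lines 1017-1031) can be checked with a small sketch. It only restates the formulas as quoted: N = K_LEAF / R planes, (K_TOP * N) ToF nodes, each with (P * K_LEAF / N) ports. The function name tof_sizing and its parameter names are hypothetical and are not part of the draft or of any RIFT implementation. K_LEAF=6 and K_TOP=8 are taken from the figure captions, and P=3 is assumed from the "3 PoDs" in the Figure 9 caption.

```python
def tof_sizing(k_leaf: int, k_top: int, pods: int, redundancy: int) -> dict:
    """Restate the quoted multi-plane sizing formulas (illustrative only).

    k_leaf, k_top, pods and redundancy stand for K_LEAF, K_TOP, P and R
    as the quoted draft text uses them; the names are assumptions.
    """
    if redundancy <= 0 or k_leaf % redundancy != 0:
        raise ValueError("R must be a positive integer divisor of K_LEAF")
    planes = k_leaf // redundancy              # N = K_LEAF / R
    return {
        "planes": planes,
        "tof_nodes": k_top * planes,           # (K_TOP * N) ToF nodes
        # (P * K_LEAF / N) ports per ToF node; exact because N divides K_LEAF
        "ports_per_tof": pods * k_leaf // planes,
    }


if __name__ == "__main__":
    # R=3 is the Figure 11 case; R=6 gives N=1 (single plane) and
    # R=1 gives N=K_LEAF (maximally partitioned).
    for r in (6, 3, 1):
        print(r, tof_sizing(k_leaf=6, k_top=8, pods=3, redundancy=r))
```

With R=3 this yields N=2 and 3*P ports per ToF node, which agrees with the "at least 3*P ports" statement in the quoted text. With R=6 it reduces to the single-plane bound of P * K_LEAF ports.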
> > > 1033 +---+ +---+ +---+ +---+ +---+ +---+ +---+ +---+ > 1034 +-| |--| |--| |--| |--| |--| |--| |--| |-+ > 1035 | | o | | o | | o | | o | | o | | o | | o | | o | | > 1036 +-| |--| |--| |--| |--| |--| |--| |--| |-+ > 1037 +-| |--| |--| |--| |--| |--| |--| |--| |-+ > 1038 | | o | | o | | o | | o | | o | | o | | o | | o | | > 1039 +-| |--| |--| |--| |--| |--| |--| |--| |-+ > 1040 +-| |--| |--| |--| |--| |--| |--| |--| |-+ > 1041 | | o | | o | | o | | o | | o | | o | | o | | o | | > 1042 +-| |--| |--| |--| |--| |--| |--| |--| |-+ > 1043 +---+ +---+ +---+ +---+ +---+ +---+ +---+ +---+ > > 1045 Plane 1 > 1046 ----------- . ------------ . ------------ . ------------ . > -------- > 1047 Plane 2 > > 1049 +---+ +---+ +---+ +---+ +---+ +---+ +---+ +---+ > 1050 +-| |--| |--| |--| |--| |--| |--| |--| |-+ > 1051 | | o | | o | | o | | o | | o | | o | | o | | o | | > 1052 +-| |--| |--| |--| |--| |--| |--| |--| |-+ > 1053 +-| |--| |--| |--| |--| |--| |--| |--| |-+ > 1054 | | o | | o | | o | | o | | o | | o | | o | | o | | > 1055 +-| |--| |--| |--| |--| |--| |--| |--| |-+ > 1056 +-| |--| |--| |--| |--| |--| |--| |--| |-+ > 1057 | | o | | o | | o | | o | | o | | o | | o | | o | | > 1058 +-| |--| |--| |--| |--| |--| |--| |--| |-+ > 1059 +---+ +---+ +---+ +---+ +---+ +---+ +---+ +---+ > 1060 ^ > 1061 | > 1062 | ---------------- > 1063 +----- Top-of-Fabric node > 1064 "across" depth > 1065 ---------------- > > 1067 Figure 11: Northern View of a Multi-Plane ToF Level, K_LEAF=6, > N=2 > > 1069 At the extreme end of the spectrum it is even possible to fully > 1070 partition the spine with N = K_LEAF and R=1, while maintaining > 1071 connectivity between each leaf node and each Top-of-Fabric > node. In > 1072 that case the ToF node connects to a single Port per PoD, so it > 1073 appears as a single port in the projected view represented in > 1074 Figure 12. The number of ports required on the Spine Node is > more or > 1075 equal to P, the number of PoDs. 
>
> [minor] "more or equal to P" ??
>
> ...
> 1121 4.1.3. Fallen Leaf Problem
> ...
> 1140 In a maximally partitioned fabric, the redundancy factor is R= 1, so
> 1141 any breakage in the fabric may cause one or more fallen leaves.
> 1142 However, not all cases require disaggregation. The following cases
> 1143 do not require particular action in such scenario:
>
> [major] A quick look at §4.2.5.1 doesn't explicitly mention how a node
> considers the redundancy factor...but that may be included in the "DAG
> computation" mentioned in the first step. I'm putting this comment
> here so I don't forget later...
>
> 1145 If a southern link on a node goes down, then connectivity through
> 1146 that node is lost for all nodes south of it. There is no need to
> 1147 disaggregate since the connectivity to this node is lost for all
> 1148 spine nodes in a same fashion.
>
> 1150 If a ToF Node goes down, then northern traffic towards it is
> 1151 routed via alternate ToF nodes in the same plane and there is no
> 1152 need to disaggregate routes.
> ...
> 1159 If the breakage is the last northern link from a ToP node to a ToF
> 1160 node going down, then the fallen leaf problem affects only The ToF
> 1161 node, and the connectivity to all the nodes in the PoD is lost
> 1162 from that ToF node. This can be observed by other ToF nodes
> 1163 within the plane where the ToP node is located and positively
> 1164 disaggregated within that plane.
>
> [nit] s/only The ToF/only the ToF
>
> 1166 On the other hand, there is a need to disaggregate the routes to
> 1167 Fallen Leaves in a transitive fashion, all the way to the other
> 1168 leaves in the following cases:
>
> [] Without having seen the specific mechanism, this overview is hard to
> digest.
>
> 1170 o If the breakage is the last northern link from a leaf node within
> 1171 a plane (there is only one such link in a maximally partitioned
> 1172 fabric) that goes down, then connectivity to all unicast prefixes
> 1173 attached to the leaf node is lost within the plane where the link
> 1174 is located. Southern Reflection by a leaf node, e.g., between ToP
> 1175 nodes, if the PoD has only 2 levels, happens in between planes,
> 1176 allowing the ToP nodes to detect the problem within the PoD where
> 1177 it occurs and positively disaggregate. The breakage can be
> 1178 observed by the ToF nodes in the same plane through the North
> 1179 flooding of TIEs from the ToP nodes. The ToF nodes however need
> 1180 to be aware of all the affected prefixes for the negative,
> 1181 possibly transitive disaggregation to be fully effective (i.e. a
> 1182 node advertising in control plane that it cannot reach a certain
> 1183 more specific prefix than default whereas such disaggregation must
> 1184 in extreme condition propagate further down southbound). The
> 1185 problem can also be observed by the ToF nodes in the other planes
> 1186 through the flooding of North TIEs from the affected leaf nodes,
> 1187 together with non-node North TIEs which indicate the affected
> 1188 prefixes. To be effective in that case, the positive
> 1189 disaggregation must reach down to the nodes that make the plane
> 1190 selection, which are typically the ingress leaf nodes. The
> 1191 information is not useful for routing in the intermediate levels.
>
> [nit] s/in control plane/in the control plane
>
> [nit] s/in extreme condition/in the extreme condition
>
> 1193 o If the breakage is a ToP node in a maximally partitioned fabric -
> 1194 in which case it is the only ToP node serving the plane in that
> 1195 PoD - goes down, then the connectivity to all the nodes in the PoD
> 1196 is lost within the plane where the ToP node is located.
> 1197 Consequently, all leaves of the PoD fall in this plane. Since the
> 1198 Southern Reflection between the ToF nodes happens only within a
> 1199 plane, ToF nodes in other planes cannot discover fallen leaves in
> 1200 a different plane. They also cannot determine beyond their local
> 1201 plane whether a leaf node that was initially reachable has become
> 1202 unreachable. As the breakage can be observed by the ToF nodes in
> 1203 the plane where the breakage happened, the ToF nodes in the plane
> 1204 need to be aware of all the affected prefixes for the negative
> 1205 disaggregation to be fully effective. The problem can also be
> 1206 observed by the ToF nodes in the other planes through the flooding
> 1207 of North TIEs from the affected leaf nodes, if there are only 3
> 1208 levels and the ToP nodes are directly connected to the leaf nodes,
> 1209 and then again it can only be effective it is propagated
> 1210 transitively to the leaf, and useless above that level.
>
> [nit] s/fabric -...- goes down,/fabric -...-,
>
> 1212 For the sake of easy comprehension let us roll the abstractions back
> 1213 into a simple example and observe that in Figure 3 the loss of link
> 1214 Spine 122 to Leaf 122 will make Leaf 122 a fallen leaf for Top-of-
> 1215 Fabric plane B. Worse, if the cabling was never present in first
> 1216 place, plane B will not even be able to know that such a fallen leaf
> 1217 exists. Hence partitioning without further treatment results in two
> 1218 grave problems:
>
> [] "For the sake of easy comprehension...Figure 3..." Finally!
> Hmmm...sorry...I mean, it is a little ironic that after all the new
> terminology, detailed descriptions and figures, the clearer
> explanation uses the simplest drawing.
>
> [nit] s/in first place/in the first place
>
> 1220 o Leaf 111 trying to route to Leaf 122 MUST choose Spine 111 in
> 1221 plane A as its next hop since plane B will inevitably blackhole
> 1222 the packet when forwarding using default routes or do excessive
> 1223 bow tying. This information must be in its routing table.
>
> [major] s/MUST/must This is not a Normative statement, just a
> statement of fact (inside an example).
>
> 1225 o Any kind of "flooding" or distance vector trying to deal with the
> 1226 problem by distributing host routes will be able to converge only
> 1227 using paths through leaves. The flooding of information on Leaf
> 1228 122 would have to go up to Top-of-Fabric A and then "loopback"
> 1229 over other leaves to ToF B leading in extreme cases to traffic for
> 1230 Leaf 122 when presented to plane B taking an "inverted fabric"
> 1231 path where leaves start to serve as TOFs, at least for the
> 1232 duration of a protocol's convergence.
>
> [] "Any kind of "flooding" or distance vector..." I can guess the
> meaning, but it would be better that I don't have to. Maybe
> something like: "Any advertisement..."
>
> [minor] "information on Leaf 122" s/on/ about (?), or maybe from. ??
>
> 1234 4.1.4. Discovering Fallen Leaves
>
> 1236 As illustrated later, and without further proof, the way to deal with
> 1237 fallen leaves in multi-plane designs, when aggregation is used, is
> 1238 that RIFT requires all the ToF nodes to share the same north topology
> 1239 database. This happens naturally in single plane design by the means
> 1240 of northbound flooding and south reflection but needs additional
> 1241 considerations in multi-plane fabrics. To satisfy this RIFT, in
> 1242 multi-plane designs, relies at the ToF level on ring interconnection
> 1243 of switches in multiple planes.
> Other solutions are possible but
> 1244 they either need more cabling or end up having much longer flooding
> 1245 paths and/or single points of failure.
>
> [minor] "As illustrated later..." Where?
>
> [] "and without further proof" I hope this is at least specified at
> that later point.
>
> [nit] s/To satisfy this RIFT, in multi-plane designs, relies/To
> satisfy this need in multi-plane designs, RIFT relies
>
> 1247 In detail, by reserving two ports on each Top-of-Fabric node it is
> 1248 possible to connect them together by interplane bi-directional rings
> 1249 as illustrated in Figure 13. The rings will be used to exchange full
> 1250 north topology information between planes. All ToFs having same
> 1251 north topology allows by the means of transitive, negative
> 1252 disaggregation described in Section 4.2.5.2 to efficiently fix any
> 1253 possible fallen leaf scenario. Somewhat as a side-effect, the
> 1254 exchange of information fulfills the ask to present full view of the
> 1255 fabric topology at the Top-of-Fabric level, without the need to
> 1256 collate it from multiple points by additional complexity of
> 1257 technologies like [RFC7752].
>
> [nit] s/fulfills the ask to present full view/fulfills the requirement
> to have a full view
>
> [] "..., without the need to collate it from multiple points by
> additional complexity of technologies like [RFC7752]." This last
> phrase is unnecessary: because carrying RIFT information in BGP-LS is
> not defined, and more importantly, there's no need to criticize other
> technology to make RIFT look better.
> > > 1259 +---+ +---+ +---+ +---+ +---+ +---+ +--------+ > 1260 | | | | | | | | | | | | | | > 1261 | | | | | | | | > 1262 +-o-+ +-o-+ +-o-+ +-o-+ +-o-+ +-o-+ +-o-+ | > 1263 +-| |--| |--| |--| |--| |--| |--| |-+ | > 1264 | | o | | o | | o | | o | | o | | o | | o | | | > Plane A > 1265 +-| |--| |--| |--| |--| |--| |--| |-+ | > 1266 +-o-+ +-o-+ +-o-+ +-o-+ +-o-+ +-o-+ +-o-+ | > 1267 | | | | | | | | > 1268 +-o-+ +-o-+ +-o-+ +-o-+ +-o-+ +-o-+ +-o-+ | > 1269 +-| |--| |--| |--| |--| |--| |--| |-+ | > 1270 | | o | | o | | o | | o | | o | | o | | o | | | > Plane B > 1271 +-| |--| |--| |--| |--| |--| |--| |-+ | > 1272 +-o-+ +-o-+ +-o-+ +-o-+ +-o-+ +-o-+ +-o-+ | > 1273 | | | | | | | | > 1274 ... | > 1275 | | | | | | | | > 1276 +-o-+ +-o-+ +-o-+ +-o-+ +-o-+ +-o-+ +-o-+ | > 1277 +-| |--| |--| |--| |--| |--| |--| |-+ | > 1278 | | o | | o | | o | | o | | o | | o | | o | | | > Plane X > 1279 +-| |--| |--| |--| |--| |--| |--| |-+ | > 1280 +-o-+ +-o-+ +-o-+ +-o-+ +-o-+ +-o-+ +-o-+ | > 1281 | | | | | | | | > 1282 | | | | | | | | | | | | | | > 1283 +---+ +---+ +---+ +---+ +---+ +---+ +--------+ > 1284 Rings 1 2 3 4 5 6 7 > > 1286 Figure 13: Connecting Top-of-Fabric Nodes Across Planes by > Rings > > [minor] Is that one ring per plane, multiple rings per plane or a big > ring for all the planes? The drawing is not clear to me. :-( > > > 1288 4.1.5. Addressing the Fallen Leaves Problem > > 1290 One consequence of the "Fallen Leaf" problem is that some > prefixes > 1291 attached to the fallen leaf become unreachable from some of the > ToF > 1292 nodes. RIFT proposes two methods to address this issue, the > positive > 1293 and the negative disaggregation. Both methods flood South TIEs > to > 1294 advertise the impacted prefix(es). > > [nit] s/RIFT proposes two methods/RIFT defines two methods > > > [End of Review - Part 1] > > _______________________________________________ > RIFT mailing list > RIFT@ietf.org > https://www.ietf.org/mailman/listinfo/rift >