[Rift] RtgDir Early review: draft-ietf-rift-rift-06

<russ@riw.us> Tue, 23 July 2019 19:33 UTC

Return-Path: <russ@riw.us>
X-Original-To: rift@ietfa.amsl.com
Delivered-To: rift@ietfa.amsl.com
Received: from localhost (localhost [127.0.0.1]) by ietfa.amsl.com (Postfix) with ESMTP id 8FDEE12090C; Tue, 23 Jul 2019 12:33:40 -0700 (PDT)
X-Virus-Scanned: amavisd-new at amsl.com
X-Spam-Flag: NO
X-Spam-Score: -2.598
X-Spam-Level:
X-Spam-Status: No, score=-2.598 tagged_above=-999 required=5 tests=[BAYES_00=-1.9, DKIM_SIGNED=0.1, DKIM_VALID=-0.1, RCVD_IN_DNSWL_LOW=-0.7, SPF_NONE=0.001, URIBL_BLOCKED=0.001] autolearn=ham autolearn_force=no
Authentication-Results: ietfa.amsl.com (amavisd-new); dkim=pass (2048-bit key) header.d=messagingengine.com
Received: from mail.ietf.org ([4.31.198.44]) by localhost (ietfa.amsl.com [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id nGq2N2e8ol10; Tue, 23 Jul 2019 12:33:37 -0700 (PDT)
Received: from wout5-smtp.messagingengine.com (wout5-smtp.messagingengine.com [64.147.123.21]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by ietfa.amsl.com (Postfix) with ESMTPS id 38E7C12084D; Tue, 23 Jul 2019 12:33:36 -0700 (PDT)
Received: from compute6.internal (compute6.nyi.internal [10.202.2.46]) by mailout.west.internal (Postfix) with ESMTP id 85A1261F; Tue, 23 Jul 2019 15:33:35 -0400 (EDT)
Received: from mailfrontend1 ([10.202.2.162]) by compute6.internal (MEProxy); Tue, 23 Jul 2019 15:33:35 -0400
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d= messagingengine.com; h=cc:content-transfer-encoding:content-type :date:from:message-id:mime-version:subject:to:x-me-proxy :x-me-proxy:x-me-sender:x-me-sender:x-sasl-enc; s=fm3; bh=76vP4D 3jRRMRNNRnjPEbmSY8KOURTZJRL1GYuJd8UKM=; b=cPsihbD6t17CZPV9lpO1Pe /12BkC+bQRUfzlUwUMcpUEdWkzn7Rj+UmzMbvPi4Rq793AiffeFLN54A4vSXjaI0 8yNZ7ZBQWT/IvC45kUrp9SxzOdddzFcz1RNkbFvlMIL9aEQlJNp0d7/3ICPoJfJo zBpIgvGdMoyn0zhnl7sIIA6psK8iqpgYiNxkAbcjOOPLJkNKJFKZMzeq2AGlcE0G 7TRjp0YC+E0CGBziUZgycLyTF3XSuPFR9fUFQNlN2NQPfQmH4xqz8RVtt1ble+eW P+d4dZ/SZZdGjta7b1W6ClTi5upBFa/8GkRlZBDVXlpQNDrPOxi0Oq5f8ER6SDnw ==
X-ME-Sender: <xms:DmE3XW_622iSGVHnpy9BlygR87T17PP7OHorlEaYPckkps4II9r3IQ>
X-ME-Proxy-Cause: gggruggvucftvghtrhhoucdtuddrgeduvddrjeekgddufeekucetufdoteggodetrfdotf fvucfrrhhofhhilhgvmecuhfgrshhtofgrihhlpdfqfgfvpdfurfetoffkrfgpnffqhgen uceurghilhhouhhtmecufedttdenucenucfjughrpefhvffufffkgggtgffothesthhqgh dtvddtjeenucfhrhhomhepoehruhhsshesrhhifidruhhsqeenucffohhmrghinhepihgv thhfrdhorhhgnecukfhppeefuddrudeffedrudefkedrudeijeenucfrrghrrghmpehmrg hilhhfrhhomheprhhushhssehrihifrdhushenucevlhhushhtvghrufhiiigvpedt
X-ME-Proxy: <xmx:DmE3XTHb3aktbtKeAce9i-w9NiUvUzfrAGL0GTYdEsfGYGBtTbIVVQ> <xmx:DmE3XQfrS3gobAgTSQP--91yCjp9oUX3sTfwKlSrnCYSk3KFX9IXkg> <xmx:DmE3XclwJLG-JzdHP1mEHaHgx5iv7LxaEKC4hAAkokEyBhmJ7PzYvg> <xmx:D2E3XZZddOXV_Wl4F5v6mR4vMeBl93FllOYsyGcuxUdPtTsMSyCqxQ>
Received: from RussPC (dhcp-8aa7.meeting.ietf.org [31.133.138.167]) by mail.messagingengine.com (Postfix) with ESMTPA id 7A67080060; Tue, 23 Jul 2019 15:33:34 -0400 (EDT)
From: russ@riw.us
To: rtg-wg-chairs@ietf.org, draft-ietf-rift-rift@ietf.org
Cc: rtg-dir@ietf.org, rift@ietf.org
Date: Tue, 23 Jul 2019 15:33:34 -0400
Message-ID: <097e01d5418d$847ee300$8d7ca900$@riw.us>
MIME-Version: 1.0
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
X-Mailer: Microsoft Outlook 16.0
Thread-Index: AdVBjP5d4GzHvFj0ReaDFe2tM6C8FA==
Content-Language: en-us
Archived-At: <https://mailarchive.ietf.org/arch/msg/rift/ng76ClJP4-wkKWCKzEeuhX4gSR4>
Subject: [Rift] RtgDir Early review: draft-ietf-rift-rift-06
X-BeenThere: rift@ietf.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: Discussion of Routing in Fat Trees <rift.ietf.org>
List-Unsubscribe: <https://www.ietf.org/mailman/options/rift>, <mailto:rift-request@ietf.org?subject=unsubscribe>
List-Archive: <https://mailarchive.ietf.org/arch/browse/rift/>
List-Post: <mailto:rift@ietf.org>
List-Help: <mailto:rift-request@ietf.org?subject=help>
List-Subscribe: <https://www.ietf.org/mailman/listinfo/rift>, <mailto:rift-request@ietf.org?subject=subscribe>
X-List-Received-Date: Tue, 23 Jul 2019 19:33:48 -0000

Hello

I have been selected to do a routing directorate “early” review of this draft. 
​https://datatracker.ietf.org/doc/draft-foo-name/

The routing directorate will, on request from the working group chair, perform an “early” review of a draft before it is submitted for publication to the IESG. The early review can be performed at any time during the draft’s lifetime as a working group document. The purpose of the early review depends on the stage that the document has reached.

As this document has recently been adopted by the working group, my focus for the review is on providing a new perspective on the work, with the intention of catching any issues early on in the document's life cycle.

For more information about the Routing Directorate, please see ​http://trac.tools.ietf.org/area/rtg/trac/wiki/RtgDir

Document: draft-ietf-rift-rift-06
Reviewer: Russ White
Review Date: 23 July 2019
Intended Status: Standards Track

Summary: 

I have significant concerns about this document. It needs more work before being submitted to the IESG.

Comments:

Overall this draft is very readable, although I have suggested wording changes below. The protocol described is useful and interesting. I largely have questions about the structure and wording of the draft, although there are a couple of technical questions in the comments below (probably just things I missed the explanation for, however).

Structurally, this document almost feels like two different documents. The first specifies a set of use cases and requirements, while the second specifies a protocol. I wonder if it wouldn't be better to split this document into two pieces, making each one more specific, and possibly a bit easier to read? We might too late in the process to make such a change -- if so, that's okay, just thought it might be worth considering.

Throughout -- "fat tree," "Clos," and other terms are used in various places. As a reader, given the different meanings for these terms, I found this a bit confusing. It might be better to use "spine and leaf" throughout.

Throughout -- the term "folded" has two very different meanings... The first is a multistage fabric on which traffic flows bi-directionally, the second is a way of drawing spine and leaf fabrics in blocks. It might be useful to clarify this from the beginning somehow, so readers who are expecting one meaning don't misread things because a different meaning is intended (?). I think it's always used here in the "way of drawing a spine and leaf fabric" here.

==
Page 9

The definitions of the various forms of TIE might be better if they clearly differentiated between topology and reachability information (?). The OSPF router LSA, for instance, contains both kinds of information. In IS-IS, adjacencies, which are called Node TIES (I think) are tlv22, while the Prefix TIE would be a tlv128 and 236. I don't think there is an equivalent separation between topology (reachable neighbors) and reachability (reachable prefixes) in OSPF. 

==
REQ13: "Taking a path through the spine in cases where a shorter path is available is highly undesirable."

This doesn't seem to be related to the previous statements in the requirement... Not certain this needs to be included? Or maybe it wants to be someplace else in the list?

==
REQ10 and REQ13 both seem to say the same thing -- although REQ13 is better worded. Maybe both of these are not needed?

==
REQ15: "...links of a single node having same addresses." -- might be better as "...links of a single node sharing the same address."

==
REQ18: "...security mechanisms that allow to restrict nodes, especially leafs without proper credentials from forming three-way adjacencies."

Might be better as:

"...security mechanisms that allow the operator to restrict nodes, especially leaf nodes without proper credentials, from forming a three-way adjacency."

==
PEND2: 500,000 seems way too small for the scale numbers I've heard in the past -- especially if we are including the underlay protocol running on the hosts in the mix. If there are 300k ports is at least 1 prefix per edge port, and potentially 10-15 prefixes per edge port. If you guess around 20% of the workloads attached to the fabric might be mobile, then the number here is more likely 1 million at the base, with a top end of around 2-3 million prefixes. It might be worth adjusting this number a bit larger (?).

==
Section 5, paragraph after the header

"...when "pointing north" and path vector [RFC4271] protocol when "pointing south". While this is an unusual combination,..."

Elsewhere in the document "distance-vector" is used. It might be better to be consistent?

==
Section 5.1.1

In the first paragraph ("The most singular property..."), omitting east-west control plane information flow is mentioned twice; could probably just be once.

The first sentence in the second paragraph is a bit of a run-on; might be best to omit some of this, or break it into two sentences.

Second paragraph: "The application of those principle lead to RIFT..." doesn't seem right. Might be better as: "The application of these principles lead to RIFT..."

==
Section 5.1.2, paragraph beginning: "Given the difficulty of visualizing multi plane design which are..." First, this is one long sentence. Second, what's illustrated looks like a crossbar, but it's not a crossbar. I find this a little confusing (?). The problem is -- I don't know of another term to use here.

==
Section 5.1.2, paragraph beginning: "The typical topology for which RIFT is defined is built of a number P..." This seems to describe a five stage butterfly or pod-and-spine fabric, but does not seem to describe a seven stage (?).

Overall -- I'm not certain trying to describe these topologies at this level in a protocol specification draft is all that useful. The terminology is fluid (used by different people in different ways), and this depth of explanation does not seem to be useful to describe the protocol operation. This is one place where splitting the document into two pieces might be helpful.

==
Paragraph beginning: "It is critical at this junction that the concept and the picture of..." What is shown is not really a crossbar fabric... I'm not certain calling this a crossbar is helpful here (?).

==
Section 5.1.4: "This happens naturally in single plane design but needs additional considerations in multiplane fabrics." When aggregation is used, as well -- but not when no aggregation of reachability information is deployed in the fabric. It might be worth adding that qualification here.

==
Section 5.1.4

Sentence beginning: "In more detail, by reserving two ports on each Top-of-Fabric node it is possible to connect them together in an interplane bi-directional ring as illustrated in Figure 13..."

I might have missed it, but this document does not seem to address how traffic will be prevented from flowing through these links. It's probably important to note either that traffic will flow along these links, hence they need to be sized appropriately, or how traffic is prevented from flowing along these links. 

==
Section 5.1.5

Sentence beginning: "The effect of a positive disaggregation..." If positive deaggregation takes place, it could draw all the traffic to the deaggregated destinations along a small set of links, causing a hot spot in the fabric. This might be taken care of "naturally" in the way RIFT does positive deaggregation, but it might be good to explain this in the document.

A more general observation: positive deaggregation, as described in the document, can potentially cause cascading failures where a group of links or nodes fail, causing a lot of new information to be pushed into lower levels rather quickly, which could potentially cause those nodes to overrun their table or processing space and fail in turn, causing more information to be dumped into the control plane, etc. It may be worth connsidering having some form of "hysteresis," timer or other mechanism to prevent cascading failures of this kind.

==
Section 5.2.2

"All RIFT routers MUST support IPv4 forwarding and MAY support IPv6..."

So an IPv6 only fabric is not possible? 

==
Section 5.2.6

"After the SPF is run, it is necessary to attach according prefixes."

I think this might want to mean --

"After SPF is run, it is necessary to attach the resulting reachability information in the form of prefixes." 

(?)

==
Section 5.2.6

It seems like the first and third paragraphs describe the same procedure? Are both descriptions necessary?

==
Section 5.2.6

"An exemplary implementation..." should probably be "an example implementation..." ??

==
Section 5.2.6

"After the positive prefixes are attached and tie-broken, negative prefixes are attached and used in case of northbound computation, ideally from the shortest length to the longest."

It seems like the prefixes should be "tie-broken" before being attached. This sentence is generally confusing, perhaps:

"After the positive prefixes have been attached, the negative prefixes will be attached in their prefix-length order (from the shortest to the longest). These negative prefixes will only be used when computing northbound reachability."

==
Section 5.2.6

"Observe that despite seeming quite computationally expensive the operations are only necessary if the only available advertisements for a prefix are negative since tie-breaking always prefers positive information for the prefix which stops any kind of recursion since positive information never inherits next hops."

Just because something is not done all that often does not mean it's not computationally expensive. :-) Maybe something like this would be better:

"Although these operations can be computationally expensive, the overall load on devices in the network is low because these computations are not run very often, as positive route advertisements are always preferred over negative ones. This prevents recursion in most cases because positive reachability information never inherits next hops."

Or something like this?

==
Section 5.2.6

"Negative 2001:db8::/32 entry inherits from"

Probably wants to be "The negative 2001:db8::/32 entry..."

==
Section 5.2.6

"Finally, let us look at the case where S3 becomes available again as default gateway, ..."

"...the default gateway..." or "...a default gateway..."

==
Section 5.2.7

"Each RIFT node can optionally operate in zero touch provisioning..."

"optionally" is not needed here

==
Section 5.2.7.2

"RIFT identifies each node via a SystemID which is a 64 bits wide integer. It is relatively simple to derive a, for all practical purposes collision free, value for each node on startup. For that purpose, a node MUST use as system ID EUI-64 MA-L format [EUI64] where the organizationally governed 24 bits can be used to generate system IDs for multiple RIFT instances running on the system."

This seems too wordy -- maybe:

"RIFT nodes require a 64 bit SystemID in the EUI-64 MA-L format formed as described in [EUI64]. The organizationally goverened portion of this ID  (24 bits) can be used to generate multiple IDs if required to indicate more than one RIFT instance."

==
Section 5.3.1

"Overload Bit MUST be respected..."

should be 

"The overload bit MUST be respected..."

==
Section 5.3.2

"Since the leafs do see only "one hop away" they do not need to run a full SPF but can simply gather prefix candidates from their neighbors and build the according routing table."

Seems like this should be 

"Since leaf nodes only have one hop of topology information, they do not need to run SPF. Instead, they can gather the available prefix candidates and build the routing table accordingly."

==
Section 5.3.3

"time stamp: With this method, the infrastructure memorizes the..."

I think the correct word here is "records," rather than "memorizes" (?)

==
Section 5.3.3

"sequence counter: With this method, a mobile node notifies its point..."

Another problem here is the node might not "know" it has been moved from one location to another in the fabric, particularly if a layer 2 only attached process is moved along with it's ARP cache, and something like the EVPN default MAC is used to enable the move. I don't know if this is worth mentioning, but it does seem like a case that needs to be thought about to make certain it works with the process described.

==
Section 5.3.3

"The leaf node MUST advertise a time stamp of the latest sighting..."

Is this for all prefixes, or just for prefixes that are somehow marked or configured as potentially mobile on the fabric?

==
Section 5.3.3

"RIFT also defines an abstract negative clock (ANSC) that compares as less than any other clock."

Does this mean lower priority, or always older than any other clock? It seems like this means "older" based on 5.3.3.1, but the language could probably be clearer.

==
Section 5.3.3

One thing not covered here is what happens if some workload moves between two RIFT fabrics interconnected via some other protocol, such as BGP. It seems the most obvious solution is to have RIFT treat redistributed destinations the same as directly attached destinations in terms of clocks, etc., but this is not called out in the document. It might be worth mentioning.

==
Section 5.3.3.3

This section seems a little ... muddled? You want to both have the ability to use only the most recently available versions of any anycast destination, while also not requiring all the anycast advertisements to have the same time stamp (?). I'm not certain this section entirely solves that problem. This sentence might be creating the confusion in my reading of the section, as I don't really understand what it is saying: 

"An anycast prefix does not carry a clock or all clock attributes MUST be the same under the rules of Section 5.3.3.1."

MUST be treated the same as prefixes advertised under the rules of 5.3.3.1? I'm not certain.

==
Section 5.3.3.4

"RIFT is agnostic whether any overlay technology like [MIP, LISP, VxLAN, NVO3] and the associated signaling is deployed over it. But it is expected that leaf nodes, and possibly Top-of-Fabric nodes can perform the according encapsulation."

"according" should be "correct," I think. 

==
Section 5.3.6.1

"All the multiplications and additions are saturating, i.e. when exceeding range of the bandwidth type are set to highest possible value of the type."

Maybe better:

"If a calculation produces a result exceeding the range of the type, the result is set to the highest possible value for that type."