Re: [Rift] RtgDir Early review: draft-ietf-rift-rift-06

Tony Przygienda <tonysietf@gmail.com> Tue, 23 July 2019 20:27 UTC

Return-Path: <tonysietf@gmail.com>
X-Original-To: rift@ietfa.amsl.com
Delivered-To: rift@ietfa.amsl.com
Received: from localhost (localhost [127.0.0.1]) by ietfa.amsl.com (Postfix) with ESMTP id BB10E120950; Tue, 23 Jul 2019 13:27:21 -0700 (PDT)
X-Virus-Scanned: amavisd-new at amsl.com
X-Spam-Flag: NO
X-Spam-Score: -1.997
X-Spam-Level:
X-Spam-Status: No, score=-1.997 tagged_above=-999 required=5 tests=[BAYES_00=-1.9, DKIM_SIGNED=0.1, DKIM_VALID=-0.1, DKIM_VALID_AU=-0.1, FREEMAIL_FROM=0.001, HTML_MESSAGE=0.001, RCVD_IN_DNSWL_NONE=-0.0001, SPF_HELO_NONE=0.001, SPF_PASS=-0.001, URIBL_BLOCKED=0.001] autolearn=ham autolearn_force=no
Authentication-Results: ietfa.amsl.com (amavisd-new); dkim=pass (2048-bit key) header.d=gmail.com
Received: from mail.ietf.org ([4.31.198.44]) by localhost (ietfa.amsl.com [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id kN5_U66O6vty; Tue, 23 Jul 2019 13:27:17 -0700 (PDT)
Received: from mail-ed1-x52b.google.com (mail-ed1-x52b.google.com [IPv6:2a00:1450:4864:20::52b]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by ietfa.amsl.com (Postfix) with ESMTPS id C82331203C2; Tue, 23 Jul 2019 13:27:13 -0700 (PDT)
Received: by mail-ed1-x52b.google.com with SMTP id s49so10309358edb.1; Tue, 23 Jul 2019 13:27:13 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=mime-version:references:in-reply-to:from:date:message-id:subject:to :cc; bh=GwyPMv1j9avH8Vnfm3Rmt3e1Fpn0qM+URyG5VmS8ZRw=; b=MXAh9Uk/uDFl0zGFzjX9HuNF3HgsVD1uauZHLlCQ5VVtaNVNz/+xIomYbbi+e/hJ27 DXoxPPw+aZpNqb06iApLWOAq8aajEz+McqaLa0tbwHc3px4NCeLXRgk7SlrcAodJa5BF nx06GIcHlycLIfYj1rKPsLQfC174r6MKZX3vCEnzrJEc8ld6MDUABnxhd7XYeWCLMqbW cDrW++MondNQ0UvpII/psj6NLHFBXwOjDE40RCE1uTiCFelHYDc601LeOwckXjejMSrI tRlcL7Dt3eujYTeLeWG9TdcjUdH32Q/LEEwqulTR4yxFbqb8yYM489FeMcY3wyomd59N u9Eg==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:mime-version:references:in-reply-to:from:date :message-id:subject:to:cc; bh=GwyPMv1j9avH8Vnfm3Rmt3e1Fpn0qM+URyG5VmS8ZRw=; b=V+MNFQvbjMQgpB/oXNJq6M7VxQasVUCj7gcLxBQy2pb/4qbdsOz8rE4U1IdjYgaXiV bdc2/lJ+csUQ8a03JWI2SEeaUjoUl26ojtWP8HUbvL7mjRo2/PHT+4yDWxxK8hd5WZ7i Tzb/BqVatw34Swrk6HJ1lbgfRZ0hb/1x4dzASHcZPIxReLbi+nmsMcKheW4C6ju8eS7o c71VNuF6/J3cxs5DSTWYwsum0RLT7pt7pMrHgZ5kseksXl//RxIASLfRHJa3pBR1mzSF 3OF+76PxF5sMkEXyJDXUa5wDyYSTZeLUeZIkSNMczqhEU8VRplYL+EcYe65xEO5h+ov1 cNNw==
X-Gm-Message-State: APjAAAWXBmXHc99TVe2x6x6iKiBXbwoJVvwyNxF+YGXaiu/4/KS0ppEk jerpNPwj9ogy5WCcO/zhySKHOuICbnBCgfGYFDqGZQEECwE=
X-Google-Smtp-Source: APXvYqxZ46g02b/rhNnbkW561ppp1yj7jIQ1i0b82DaEKS7fLfQyLIHEGF9gfKQBIEzF1n/kU7H2muf19UY8HUV/U3U=
X-Received: by 2002:a50:b87c:: with SMTP id k57mr66624955ede.226.1563913632155; Tue, 23 Jul 2019 13:27:12 -0700 (PDT)
MIME-Version: 1.0
References: <097e01d5418d$847ee300$8d7ca900$@riw.us>
In-Reply-To: <097e01d5418d$847ee300$8d7ca900$@riw.us>
From: Tony Przygienda <tonysietf@gmail.com>
Date: Tue, 23 Jul 2019 16:26:35 -0400
Message-ID: <CA+wi2hO0SMadg1LHNvORCxD37kkriA9LUJEbevSURbcNTvk3hQ@mail.gmail.com>
To: Russ White <russ@riw.us>
Cc: rtg-wg-chairs@ietf.org, draft-ietf-rift-rift@ietf.org, rtg-dir@ietf.org, rift@ietf.org
Content-Type: multipart/alternative; boundary="000000000000810aab058e5f044a"
Archived-At: <https://mailarchive.ietf.org/arch/msg/rift/WfceSWboY4G1kyJRkp6u1AyZkC8>
Subject: Re: [Rift] RtgDir Early review: draft-ietf-rift-rift-06
X-BeenThere: rift@ietf.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: Discussion of Routing in Fat Trees <rift.ietf.org>
List-Unsubscribe: <https://www.ietf.org/mailman/options/rift>, <mailto:rift-request@ietf.org?subject=unsubscribe>
List-Archive: <https://mailarchive.ietf.org/arch/browse/rift/>
List-Post: <mailto:rift@ietf.org>
List-Help: <mailto:rift-request@ietf.org?subject=help>
List-Subscribe: <https://www.ietf.org/mailman/listinfo/rift>, <mailto:rift-request@ietf.org?subject=subscribe>
X-List-Received-Date: Tue, 23 Jul 2019 20:27:22 -0000

thanks for quick review, Russ. Flew over it. most stuff seems clarification
and editorials to me. Let's sit with authors in the bar and go over it, I
have -07 pending anyway right after IETF due to the v4/v6 router
requirement comments and bruno's interop so this is a great time in fact to
file off the last rough edges ...

--- tony


On Tue, Jul 23, 2019 at 3:33 PM <russ@riw.us> wrote:

> Hello
>
> I have been selected to do a routing directorate “early” review of this
> draft.
> ​https://datatracker.ietf.org/doc/draft-foo-name/
>
> The routing directorate will, on request from the working group chair,
> perform an “early” review of a draft before it is submitted for publication
> to the IESG. The early review can be performed at any time during the
> draft’s lifetime as a working group document. The purpose of the early
> review depends on the stage that the document has reached.
>
> As this document has recently been adopted by the working group, my focus
> for the review is on providing a new perspective on the work, with the
> intention of catching any issues early on in the document's life cycle.
>
> For more information about the Routing Directorate, please see ​
> http://trac.tools.ietf.org/area/rtg/trac/wiki/RtgDir
>
> Document: draft-ietf-rift-rift-06
> Reviewer: Russ White
> Review Date: 23 July 2019
> Intended Status: Standards Track
>
> Summary:
>
> I have significant concerns about this document. It needs more work before
> being submitted to the IESG.
>
> Comments:
>
> Overall this draft is very readable, although I have suggested wording
> changes below. The protocol described is useful and interesting. I largely
> have questions about the structure and wording of the draft, although there
> are a couple of technical questions in the comments below (probably just
> things I missed the explanation for, however).
>
> Structurally, this document almost feels like two different documents. The
> first specifies a set of use cases and requirements, while the second
> specifies a protocol. I wonder if it wouldn't be better to split this
> document into two pieces, making each one more specific, and possibly a bit
> easier to read? We might too late in the process to make such a change --
> if so, that's okay, just thought it might be worth considering.
>
> Throughout -- "fat tree," "Clos," and other terms are used in various
> places. As a reader, given the different meanings for these terms, I found
> this a bit confusing. It might be better to use "spine and leaf" throughout.
>
> Throughout -- the term "folded" has two very different meanings... The
> first is a multistage fabric on which traffic flows bi-directionally, the
> second is a way of drawing spine and leaf fabrics in blocks. It might be
> useful to clarify this from the beginning somehow, so readers who are
> expecting one meaning don't misread things because a different meaning is
> intended (?). I think it's always used here in the "way of drawing a spine
> and leaf fabric" here.
>
> ==
> Page 9
>
> The definitions of the various forms of TIE might be better if they
> clearly differentiated between topology and reachability information (?).
> The OSPF router LSA, for instance, contains both kinds of information. In
> IS-IS, adjacencies, which are called Node TIES (I think) are tlv22, while
> the Prefix TIE would be a tlv128 and 236. I don't think there is an
> equivalent separation between topology (reachable neighbors) and
> reachability (reachable prefixes) in OSPF.
>
> ==
> REQ13: "Taking a path through the spine in cases where a shorter path is
> available is highly undesirable."
>
> This doesn't seem to be related to the previous statements in the
> requirement... Not certain this needs to be included? Or maybe it wants to
> be someplace else in the list?
>
> ==
> REQ10 and REQ13 both seem to say the same thing -- although REQ13 is
> better worded. Maybe both of these are not needed?
>
> ==
> REQ15: "...links of a single node having same addresses." -- might be
> better as "...links of a single node sharing the same address."
>
> ==
> REQ18: "...security mechanisms that allow to restrict nodes, especially
> leafs without proper credentials from forming three-way adjacencies."
>
> Might be better as:
>
> "...security mechanisms that allow the operator to restrict nodes,
> especially leaf nodes without proper credentials, from forming a three-way
> adjacency."
>
> ==
> PEND2: 500,000 seems way too small for the scale numbers I've heard in the
> past -- especially if we are including the underlay protocol running on the
> hosts in the mix. If there are 300k ports is at least 1 prefix per edge
> port, and potentially 10-15 prefixes per edge port. If you guess around 20%
> of the workloads attached to the fabric might be mobile, then the number
> here is more likely 1 million at the base, with a top end of around 2-3
> million prefixes. It might be worth adjusting this number a bit larger (?).
>
> ==
> Section 5, paragraph after the header
>
> "...when "pointing north" and path vector [RFC4271] protocol when
> "pointing south". While this is an unusual combination,..."
>
> Elsewhere in the document "distance-vector" is used. It might be better to
> be consistent?
>
> ==
> Section 5.1.1
>
> In the first paragraph ("The most singular property..."), omitting
> east-west control plane information flow is mentioned twice; could probably
> just be once.
>
> The first sentence in the second paragraph is a bit of a run-on; might be
> best to omit some of this, or break it into two sentences.
>
> Second paragraph: "The application of those principle lead to RIFT..."
> doesn't seem right. Might be better as: "The application of these
> principles lead to RIFT..."
>
> ==
> Section 5.1.2, paragraph beginning: "Given the difficulty of visualizing
> multi plane design which are..." First, this is one long sentence. Second,
> what's illustrated looks like a crossbar, but it's not a crossbar. I find
> this a little confusing (?). The problem is -- I don't know of another term
> to use here.
>
> ==
> Section 5.1.2, paragraph beginning: "The typical topology for which RIFT
> is defined is built of a number P..." This seems to describe a five stage
> butterfly or pod-and-spine fabric, but does not seem to describe a seven
> stage (?).
>
> Overall -- I'm not certain trying to describe these topologies at this
> level in a protocol specification draft is all that useful. The terminology
> is fluid (used by different people in different ways), and this depth of
> explanation does not seem to be useful to describe the protocol operation.
> This is one place where splitting the document into two pieces might be
> helpful.
>
> ==
> Paragraph beginning: "It is critical at this junction that the concept and
> the picture of..." What is shown is not really a crossbar fabric... I'm not
> certain calling this a crossbar is helpful here (?).
>
> ==
> Section 5.1.4: "This happens naturally in single plane design but needs
> additional considerations in multiplane fabrics." When aggregation is used,
> as well -- but not when no aggregation of reachability information is
> deployed in the fabric. It might be worth adding that qualification here.
>
> ==
> Section 5.1.4
>
> Sentence beginning: "In more detail, by reserving two ports on each
> Top-of-Fabric node it is possible to connect them together in an interplane
> bi-directional ring as illustrated in Figure 13..."
>
> I might have missed it, but this document does not seem to address how
> traffic will be prevented from flowing through these links. It's probably
> important to note either that traffic will flow along these links, hence
> they need to be sized appropriately, or how traffic is prevented from
> flowing along these links.
>
> ==
> Section 5.1.5
>
> Sentence beginning: "The effect of a positive disaggregation..." If
> positive deaggregation takes place, it could draw all the traffic to the
> deaggregated destinations along a small set of links, causing a hot spot in
> the fabric. This might be taken care of "naturally" in the way RIFT does
> positive deaggregation, but it might be good to explain this in the
> document.
>
> A more general observation: positive deaggregation, as described in the
> document, can potentially cause cascading failures where a group of links
> or nodes fail, causing a lot of new information to be pushed into lower
> levels rather quickly, which could potentially cause those nodes to overrun
> their table or processing space and fail in turn, causing more information
> to be dumped into the control plane, etc. It may be worth connsidering
> having some form of "hysteresis," timer or other mechanism to prevent
> cascading failures of this kind.
>
> ==
> Section 5.2.2
>
> "All RIFT routers MUST support IPv4 forwarding and MAY support IPv6..."
>
> So an IPv6 only fabric is not possible?
>
> ==
> Section 5.2.6
>
> "After the SPF is run, it is necessary to attach according prefixes."
>
> I think this might want to mean --
>
> "After SPF is run, it is necessary to attach the resulting reachability
> information in the form of prefixes."
>
> (?)
>
> ==
> Section 5.2.6
>
> It seems like the first and third paragraphs describe the same procedure?
> Are both descriptions necessary?
>
> ==
> Section 5.2.6
>
> "An exemplary implementation..." should probably be "an example
> implementation..." ??
>
> ==
> Section 5.2.6
>
> "After the positive prefixes are attached and tie-broken, negative
> prefixes are attached and used in case of northbound computation, ideally
> from the shortest length to the longest."
>
> It seems like the prefixes should be "tie-broken" before being attached.
> This sentence is generally confusing, perhaps:
>
> "After the positive prefixes have been attached, the negative prefixes
> will be attached in their prefix-length order (from the shortest to the
> longest). These negative prefixes will only be used when computing
> northbound reachability."
>
> ==
> Section 5.2.6
>
> "Observe that despite seeming quite computationally expensive the
> operations are only necessary if the only available advertisements for a
> prefix are negative since tie-breaking always prefers positive information
> for the prefix which stops any kind of recursion since positive information
> never inherits next hops."
>
> Just because something is not done all that often does not mean it's not
> computationally expensive. :-) Maybe something like this would be better:
>
> "Although these operations can be computationally expensive, the overall
> load on devices in the network is low because these computations are not
> run very often, as positive route advertisements are always preferred over
> negative ones. This prevents recursion in most cases because positive
> reachability information never inherits next hops."
>
> Or something like this?
>
> ==
> Section 5.2.6
>
> "Negative 2001:db8::/32 entry inherits from"
>
> Probably wants to be "The negative 2001:db8::/32 entry..."
>
> ==
> Section 5.2.6
>
> "Finally, let us look at the case where S3 becomes available again as
> default gateway, ..."
>
> "...the default gateway..." or "...a default gateway..."
>
> ==
> Section 5.2.7
>
> "Each RIFT node can optionally operate in zero touch provisioning..."
>
> "optionally" is not needed here
>
> ==
> Section 5.2.7.2
>
> "RIFT identifies each node via a SystemID which is a 64 bits wide integer.
> It is relatively simple to derive a, for all practical purposes collision
> free, value for each node on startup. For that purpose, a node MUST use as
> system ID EUI-64 MA-L format [EUI64] where the organizationally governed 24
> bits can be used to generate system IDs for multiple RIFT instances running
> on the system."
>
> This seems too wordy -- maybe:
>
> "RIFT nodes require a 64 bit SystemID in the EUI-64 MA-L format formed as
> described in [EUI64]. The organizationally goverened portion of this ID
> (24 bits) can be used to generate multiple IDs if required to indicate more
> than one RIFT instance."
>
> ==
> Section 5.3.1
>
> "Overload Bit MUST be respected..."
>
> should be
>
> "The overload bit MUST be respected..."
>
> ==
> Section 5.3.2
>
> "Since the leafs do see only "one hop away" they do not need to run a full
> SPF but can simply gather prefix candidates from their neighbors and build
> the according routing table."
>
> Seems like this should be
>
> "Since leaf nodes only have one hop of topology information, they do not
> need to run SPF. Instead, they can gather the available prefix candidates
> and build the routing table accordingly."
>
> ==
> Section 5.3.3
>
> "time stamp: With this method, the infrastructure memorizes the..."
>
> I think the correct word here is "records," rather than "memorizes" (?)
>
> ==
> Section 5.3.3
>
> "sequence counter: With this method, a mobile node notifies its point..."
>
> Another problem here is the node might not "know" it has been moved from
> one location to another in the fabric, particularly if a layer 2 only
> attached process is moved along with it's ARP cache, and something like the
> EVPN default MAC is used to enable the move. I don't know if this is worth
> mentioning, but it does seem like a case that needs to be thought about to
> make certain it works with the process described.
>
> ==
> Section 5.3.3
>
> "The leaf node MUST advertise a time stamp of the latest sighting..."
>
> Is this for all prefixes, or just for prefixes that are somehow marked or
> configured as potentially mobile on the fabric?
>
> ==
> Section 5.3.3
>
> "RIFT also defines an abstract negative clock (ANSC) that compares as less
> than any other clock."
>
> Does this mean lower priority, or always older than any other clock? It
> seems like this means "older" based on 5.3.3.1, but the language could
> probably be clearer.
>
> ==
> Section 5.3.3
>
> One thing not covered here is what happens if some workload moves between
> two RIFT fabrics interconnected via some other protocol, such as BGP. It
> seems the most obvious solution is to have RIFT treat redistributed
> destinations the same as directly attached destinations in terms of clocks,
> etc., but this is not called out in the document. It might be worth
> mentioning.
>
> ==
> Section 5.3.3.3
>
> This section seems a little ... muddled? You want to both have the ability
> to use only the most recently available versions of any anycast
> destination, while also not requiring all the anycast advertisements to
> have the same time stamp (?). I'm not certain this section entirely solves
> that problem. This sentence might be creating the confusion in my reading
> of the section, as I don't really understand what it is saying:
>
> "An anycast prefix does not carry a clock or all clock attributes MUST be
> the same under the rules of Section 5.3.3.1."
>
> MUST be treated the same as prefixes advertised under the rules of
> 5.3.3.1? I'm not certain.
>
> ==
> Section 5.3.3.4
>
> "RIFT is agnostic whether any overlay technology like [MIP, LISP, VxLAN,
> NVO3] and the associated signaling is deployed over it. But it is expected
> that leaf nodes, and possibly Top-of-Fabric nodes can perform the according
> encapsulation."
>
> "according" should be "correct," I think.
>
> ==
> Section 5.3.6.1
>
> "All the multiplications and additions are saturating, i.e. when exceeding
> range of the bandwidth type are set to highest possible value of the type."
>
> Maybe better:
>
> "If a calculation produces a result exceeding the range of the type, the
> result is set to the highest possible value for that type."
>
>
>
> _______________________________________________
> RIFT mailing list
> RIFT@ietf.org
> https://www.ietf.org/mailman/listinfo/rift
>