Re: [Rift] RIFT

Kris Price <kris@krisprice.nz> Thu, 18 April 2019 17:28 UTC

MIME-Version: 1.0
References: <CACqcHa05D9mCNWPtkHMw4t-0opbz33PsnB9Ts=wadfM1UD4cNA@mail.gmail.com> <MWHPR05MB32798B45DD99D8ABF75B875AAC280@MWHPR05MB3279.namprd05.prod.outlook.com>
In-Reply-To: <MWHPR05MB32798B45DD99D8ABF75B875AAC280@MWHPR05MB3279.namprd05.prod.outlook.com>
From: Kris Price <kris@krisprice.nz>
Date: Thu, 18 Apr 2019 13:28:14 -0400
Message-ID: <CACqcHa3TnRS76Rnr5Wkq4_L47i5ZQLQiZFy5aNt3zmTr487LrA@mail.gmail.com>
To: Antoni Przygienda <prz@juniper.net>
Cc: "brunorijsman@gmail.com" <brunorijsman@gmail.com>, "rift@ietf.org" <rift@ietf.org>
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
Archived-At: <https://mailarchive.ietf.org/arch/msg/rift/uhYZKZOSRXytqIZs-uc9jcEHCIM>
Subject: Re: [Rift] RIFT
X-BeenThere: rift@ietf.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: Discussion of Routing in Fat Trees <rift.ietf.org>

Hey Tony,

On the rings: Ahh! I get it, okay that makes it better. I was also
wondering if some kind of designated 'S-TIE' reflector / virtual links
/ or explicitly configured multi-hop adjacencies solution could be
used (the issue being one of how do you route these packets between
the peers without needing to do something like source route multiple
hops southbound before being default routed northbound).

Back on the subject of disaggregation:

The other reason for asking for the always disaggregate option is to
prevent the transient congestion that can occur on link failures. But
I do see now on rereading the draft you've called this out in the
second to last paragraph of 5.2.5.1, but it's left as an
implementation-specific problem to solve.

It seems this would arise frequently at the bottom two tiers of the
network. Any loss of any single link to any rack (tier 0) would result
in all other nodes at tier 1 disaggregating the prefix(es) for that
rack and causing the potential transient incast-like congestion. I'm a
bit concerned that this may be a noticeable event in some cases (e.g.
a storage row/cluster or maybe where RoCE is in use), and one that
would be fairly annoying to debug and remedy post transition to RIFT
if you didn't foresee it and have the tools (knobs) in place to
prevent it from happening without a PR and s/w upgrade. Should
implementations have a conscious solution in advance for this, and
what's the best way to ensure that? The 'always-disaggregate' knob is
one. Another might be something like a 'min-next-hops' option where
the local RIFT instance on tier 0 won't install a prefix unless it has
received it from a minimum number of upstreams.
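
To make the 'min-next-hops' idea concrete, here is a rough Python
sketch. Everything in it is hypothetical (the function name, the
`min_next_hops` parameter, the node names and the shape of the input);
none of it comes from the draft or from any implementation. It only
illustrates the install decision on a tier 0 node: a prefix is held
back until enough upstreams have advertised it.

```python
from typing import Dict, Set


def prefixes_to_install(
    received: Dict[str, Set[str]],
    min_next_hops: int,
) -> Dict[str, Set[str]]:
    """Return only the prefixes learned from at least min_next_hops upstreams.

    `received` maps a prefix to the set of upstream (tier 1) neighbors
    that have advertised it southbound so far. This data model is an
    assumption made for illustration only.
    """
    if min_next_hops < 1:
        raise ValueError("min_next_hops must be at least 1")
    return {
        prefix: set(next_hops)
        for prefix, next_hops in received.items()
        if len(next_hops) >= min_next_hops
    }


# Transient state on a leaf with four upstreams: the default route is
# known via all of them, but a freshly disaggregated rack prefix has so
# far arrived from only one tier 1 node.
received = {
    "0.0.0.0/0": {"t1-a", "t1-b", "t1-c", "t1-d"},
    "10.1.2.0/24": {"t1-b"},
}
installed = prefixes_to_install(received, min_next_hops=2)
```

With `min_next_hops=2` the leaf keeps using the default route and does
not yet steer all traffic for 10.1.2.0/24 onto the single link to
`t1-b`. What the right threshold is, and what happens to traffic in the
meantime, is exactly the open question.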

Both of these do run counter to the low-configuration nature of RIFT.
Another might be a protocol change, something like nodes
disaggregating prefixes by default until they know they are more than
1 hop from the bottom of the fabric? (This may run into other convergence
issues during fabric bring-up and cold start, and maybe there are other
issues with it that need doodling out.)
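
As a strawman only, the rule could be as small as the Python below.
The function name and the `hops_from_bottom` input are made up for
illustration; how a node would actually learn its distance from the
bottom of the fabric is not addressed here.

```python
from typing import Optional


def disaggregate_by_default(hops_from_bottom: Optional[int]) -> bool:
    """Decide whether a node should disaggregate all prefixes southbound.

    `hops_from_bottom` is None while the node does not yet know its
    distance from the bottom of the fabric (e.g. during cold start).
    The node keeps disaggregating until it knows it is more than
    1 hop from the bottom.
    """
    if hops_from_bottom is None:
        return True
    if hops_from_bottom < 0:
        raise ValueError("hops_from_bottom cannot be negative")
    return hops_from_bottom <= 1
```

The `None` case is where the bring-up and cold-start concerns would
show up, since every node starts out disaggregating.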

/2c

Cheers :-)
Kris


On Fri, Apr 12, 2019 at 1:44 PM Antoni Przygienda <prz@juniper.net> wrote:
>
> Hey Kris, great to see you engaging back ;-) I cc: rift mailing list for posterity
>
>
> 1) yepp, very nice python implementation, especially if one wants to understand RIFT as running protocol rather than paper spec ;-)
> 2) Yepp, multi-planes did lead to lots of discussions in core-team meetings around acceptable solutions and how we'd explain it properly. Explanation largely due to Pascal's and Ilya's work, it takes a bit to soak the ASCII pictures but once you start to grok the concept of "crossbars crossbaring crossbars" being Clos ;-) it's very easy to think through the stuff.
> 3) A knob to basically "always disaggregate southbound" is a simple thing to do really. Just like Bruno has it in his computation first phase mostly decides _which_ nodes need disaggregation. The result can be simply replaced by all southbound nodes & then disaggregation happens naturally. Observe that you still want the default origination since a PoD doesn't see other PoDs except via spine and disaggregation is _not_ transitive (I'm talking positive now). There are other cases you want to advertise southbound some prefixes besides default and it's a normal thing to do, nothing says that you only advertise default southbound anywhere.
> 4) Observe that we do NOT have a single ring! A ring is only as long as the #planes you have. No one will have 1000 planes ;-) So let's say you have 64 switches in each plane and 4 planes. You will have 64 rings of length 4. Obviously you can double-ring or ring within the plane as well to improve reliability but basically the topology is coherent until 2! links in the same ring break.
> 5) Always enabling disaggregation: That's point 3. Observe that does NOT solve your multi-plane problem on breakages since positive disaggregation is NOT transitive. Yes, you can turn southbound PGP on and blast whole fabric with all prefixes which basically makes your blast radius uncontained on any change (kind of flat IGP or rather IGP with DV prefixes ;-) and a single link coming/going may lead to massive amount of convergence traffic due to prefix reachability changes. Moreover, all your leafs (which is servers in extreme case) need FIB size of the size of fabric host routes ...
>
> Spec authorship is still open ;-) so if you feel like improving/adding to draft, just let me know your moniker on bitbucket since that's where the newest spec versions live ...
>
> https://bitbucket.org/riftrfc/rift_draft/src/master/
>
> --- tony
>
>
> ________________________________
> From: Kris Price <kris@krisprice.nz>
> Sent: Friday, April 12, 2019 10:22 AM
> To: Antoni Przygienda; brunorijsman@gmail.com
> Subject: RIFT
>
> Hey guys,
>
> ... couple of thoughts for you:
>
> I see someone much better at explaining things than me talked to you since we last spoke :-) so you've caveated the case with how middle tiers in a Clos don't get full visibility of all nodes at that tier via the reflection because by very nature of the Clos they're not fully striped to that lower tier (that is the essence of how Clos topologies scale so is present in any large fabric). You've covered that under the description as "hyper planes" and work around is to use a ring at those tiers to pass control plane traffic.
>
> Would it be possible to somehow /safely/ have a 'knob' to turn off the aggregation at a tier so it'll always advertise all prefixes southbound? Particularly this is useful in a network that might migrate to this protocol that does not want to go back and cable a ring. And in some cases cabling a ring will be undesirable to some potential users anyway. (In some real world networks that could mean cabling a ring connecting >1000 switches that form the "spine columns". That’s a long brittle ring.)
>
> Larger question, would it be possible to disable aggregation network wide? (For someone that might be interested in using RIFT, but for reasons other than the aggregation capability and where that capability may be seen as undesirable.)
>
> There's a semi-related issue at the very bottom of the fabric but it's a bit difficult to explain, and I'm not sure it's really RIFT's problem to solve (in part really it's due to a bad network design choice that exists), so I might draw that up later.
>
> Cheers
> Kris
>
>
>
>