Re: [Rift] RIFT

Antoni Przygienda <prz@juniper.net> Thu, 18 April 2019 23:10 UTC

From: Antoni Przygienda <prz@juniper.net>
To: Kris Price <kris@krisprice.nz>
CC: "brunorijsman@gmail.com" <brunorijsman@gmail.com>, "rift@ietf.org" <rift@ietf.org>
Date: Thu, 18 Apr 2019 23:10:40 +0000


________________________________
From: Kris Price <kris@krisprice.nz>
Sent: Thursday, April 18, 2019 1:07 PM
To: Antoni Przygienda
Cc: brunorijsman@gmail.com; rift@ietf.org
Subject: Re: RIFT

Hey Tony, inline:

[snip]
> On the rings: Ahh! I get it, okay that makes it better. I was also
> wondering if some kind of designated 'S-TIE' reflector / virtual links
> / or explicitly configured multi-hop adjacencies solution could be
> used (the issue being one of how do you route these packets between
> the peers without needing to do something like source route multiple
> hops southbound before being default routed northbound).
>
> good, I know it takes a bit to grok the stuff. We did the best we could with ASCII and language, but the concepts need some chewing for sure, even if you have been around big fabrics for a bit ;-) So, nothing like route reflectors and so on: within a plane, normal south reflection takes care of sync'ing up all you need; outside the plane, the ring takes care of sync'ing up the planes (for flooding, horizontal links below the ToF are south and @ ToF level north basically, and with that you have all the topology to figure out negative disaggregation). I explicitly killed any "virtual link" suggestions; I went through this particular hell in my life more than once and don't want to visit it anymore ;-) ...

[KP]: I'm a bit skeptical of buy-in to rings as a solution, but if you
have customers buying into that then that's cool. (I omitted
describing the physical *shudder* when I wrote "virtual links" ;-))

We spent a lot of time chewing on different design points. In multi-plane, your choices are either "flat host routes everywhere" or "in case of failures your servers may become the top of your fabric" ... or you put a ring on top (which BTW some of the top 5 already do, and generally 90% of people are happy with single-plane, where you don't need any rings @ ToF ;-) ... You missed the core team discussions that went on for weeks and months ;-) Look @ the recordings pls ...

[snip]
> It seems this would arise frequently at the bottom two tiers of the
> network. Any loss of any single link to any rack (tier 0) would result
> in all other nodes at tier 1 disaggregating the prefix(es) for that
> rack and causing the potential transient incast-like congestion. I'm a
> bit concerned that this may be a noticeable event in some cases (e.g.
> a storage row/cluster or maybe where RoCE is in use), and one that
> would be fairly annoying to debug and remedy post transition to RIFT
> if you didn't foresee it and have the tools (knobs) in place to
> prevent it from happening without a PR and s/w upgrade.
>
> yepp, you call a spade a spade but you're a bit too pessimistic methinks. Let's assume 2 ToRs dual-homing a rack or a couple racks of servers. If you lose a link on a multi-homed server you basically end up having the other ToR disaggregating just this server's prefix to other servers (even if you run some Kubernetes @ scale you may have 100 prefixes or so I'd say; I can't imagine a server hosting thousands really) ... Then, if you think about the ToRs on top of a PoD, it's not as bad as you think. If you lose a single ToR in a PoD towards a spine (I'm loose with terminology here) then you will NOT see disaggregation as long as the other ToRs in the PoD are still connected to the PoD. Draw pictures & run the public consumption package ;-)  More interesting discussions are bandwidth balancing on link losses (which I think we solved well northbound) and whether it should even be done southbound, since the notion of "available bandwidth southbound" is confounding ... The spec doesn't forbid it (the beauty of loop-free valley-free routing, which gives you an insane amount of leeway in how you choose to forward), BTW, if someone is smart enough to figure that out ;-) ...

[KP]: You're not the first to describe me as a pessimist. :-) I don't
follow the 2x ToRs and multi-homed servers part; I haven't seen that
used in a very long time. Granted, I've been out of the game for a
bit, but is anyone still multihoming servers at scale? Maybe in
certain enterprise use cases, but they're not pushing the boundaries
of scale, so they don't need the aggregation anyway.

Modern architectures I see will, IMO, be moving to a good extent to ROTH due to micro-segmentation and tunnel origination on servers.

[KP]: A top-of-rack switch ("tier-1" let's say) may be connected to 8
or 16 switches (or more) northbound (naturally let's call that next
tier "tier-2"). If any single link between a tier-1 and tier-2 switch
goes down (let's say between tier-1-1 and tier-2-1), all other nodes
in tier-2 (that is tier-2-2, tier-2-3, to tier-2-n) will determine
that tier-2-1 no longer has southbound reachability for tier-1-1's
prefixes and that they each need to disaggregate these to prevent
tier-1-2..n from sending any traffic for tier-1-1 via tier-2-1 (which
would then need to forward up to tier-3 and back down).
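
[KP]: A rough Python sketch of the trigger as I understand it (the
names and data structures here are made up for illustration, not from
any actual implementation):

    # Southbound adjacencies per tier-2 node, as each node would learn
    # them for the whole level via south reflection.
    south_adj = {
        "tier-2-1": {"tier-1-2", "tier-1-3"},  # lost its link to tier-1-1
        "tier-2-2": {"tier-1-1", "tier-1-2", "tier-1-3"},
        "tier-2-3": {"tier-1-1", "tier-1-2", "tier-1-3"},
    }

    def should_positively_disaggregate(me, origin):
        # Advertise origin's prefixes south as more-specifics only if I
        # still reach origin southbound but some peer at my level cannot.
        if origin not in south_adj[me]:
            return False
        return any(origin not in adj
                   for peer, adj in south_adj.items() if peer != me)

    print([n for n in south_adj
           if should_positively_disaggregate(n, "tier-1-1")])
    # -> ['tier-2-2', 'tier-2-3']: every surviving spine disaggregates.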

I think we have a disconnect here. The ToF level will only disaggregate if a ToF loses _all_ ToP connections to a PoD in a single-plane design, so I don't follow your argument. If you run a multi-plane design you should multi-home each PoD multiple times into your plane as well. If you don't, duh, you must disaggregate, since the plane will blackhole.
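
A minimal sketch of that condition as I'd paraphrase it (toy names, not spec text; a single link loss leaves the count above zero and changes nothing):

    tof_pod_links = {                       # live ToF-to-ToP link counts per PoD
        "tof-1": {"pod-1": 0, "pod-2": 2},  # tof-1 lost every link into pod-1
        "tof-2": {"pod-1": 2, "pod-2": 2},
    }

    def pod_triggers_disaggregation(pod):
        counts = [links[pod] for links in tof_pod_links.values()]
        # Fires only when some ToF has zero links into the PoD while
        # another ToF still reaches it.
        return 0 in counts and any(c > 0 for c in counts)

    print(pod_triggers_disaggregation("pod-1"))  # True
    print(pod_triggers_disaggregation("pod-2"))  # False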

[KP]: With positive disaggregation we can introduce transient
congestion if there's a lot of traffic from tier-1-2..n to tier-1-1,
because a switch may get the prefix from one upstream node first and
install that before getting it from the remaining upstream nodes. (So
we could for a brief instant go from 8-path ECMP to 1 path, then back
up to 7 paths.) On the other hand, if all prefixes are disaggregated,
then when the link between tier-1-1 and tier-2-1 fails, tier-2-1 is
only announcing a withdraw for the affected prefixes to tier-1-2..n,
and we avoid generating this temporary incast-like scenario by design.
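
[KP]: A toy timeline of the transient (assumed FIB behavior for
illustration, nothing measured):

    # 8-way ECMP on the default route northbound before the failure.
    ecmp_default = ["tier-2-%d" % i for i in range(1, 9)]
    fib = {"prefix(tier-1-1)": list(ecmp_default)}
    print(len(fib["prefix(tier-1-1)"]), "paths before failure")

    # Positive disaggregation advertisements from the surviving spines
    # arrive one by one; the more-specific route wins immediately on
    # longest match, so the ECMP set collapses and then regrows.
    specific = []
    for spine in ["tier-2-2", "tier-2-3", "tier-2-4"]:  # first arrivals
        specific.append(spine)
        fib["prefix(tier-1-1)"] = list(specific)
        print(len(fib["prefix(tier-1-1)"]), "path(s) for tier-1-1")
    # Prints 1, 2, 3 ...: briefly all of tier-1-1's traffic rides one
    # next-hop, which is the incast window described above.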

[KP]: It's a preference for more deterministic behavior of the fabric
over less deterministic behavior.

Well, having a blast radius of the whole fabric is in a sense deterministic, with every server changing/rebooting shaking the whole fabric. I wouldn't call it optimal though.

Far more helpful than "deterministic" is to think, as in control system theory (https://en.wikipedia.org/wiki/Stability_theory), about "stability", where the desirable positive stability is correlated with minimal blast radius. The more your inputs shake more of your system, the less "stability" you have.

But again, if you want to disaggregate everything all the time, RIFT won't stop you, and you will still benefit from flood reduction and N-flooding-only in RIFT, which makes for about 25% of normal flooding volume based on empirical data here ...

> Should
> implementations have a conscious solution in advance for this, and
> what's the best way to ensure that? The 'always-disaggregate' knob is
> one. Another might be something like a 'min-next-hops' option where
> the local RIFT instance on tier 0 won't install a prefix unless it has
> received it from a minimum number of up streams
>
> The always-disaggregate knob is something you can do per level if you desire, but it's basically a big hammer buying you a much bigger blast radius in normal operation. And if you pull RIFT onto servers in multi-plane fabrics, your FIB may blow up if you do that (unless we expect server adapters with 2M FIB size; probably ain't gonna happen ;-)).

[KP]: Blast radius doesn't seem bigger to me. FIB explosion is a
design consideration anyone must weigh before taking routes from
servers. At small numbers, selectively, it's fine and practiced, e.g.
advertising prefixes from servers that are doing software load
balancing.

Then we disconnect. Think about flat host routing & what rebooting one server does to you in terms of flooding & resulting computation and so on, versus what the RIFT blast radius is. There is a world of difference.

[KP]: But advertising them say for every VM so you can move VMs
anywhere... that's still going to have impacts on your network design.
With FIB sizes as they are these days, most people below the top 5 (or
so) are going to be fine. And anyone in the top 5 (or so) is still
going to be running into trouble. And if you do something like use the
same switch at the top-of-rack layer as at further layers in the Clos,
then this RIFT scaling feature doesn't apply to you anyway, as you
have the same FIB size at all tiers.

[KP]: In any case, disaggregation at the bottom two tiers, where this
is much more likely to be a problem, still permits aggregation higher
up.

Obviously, your choice. If all servers in a PoD want to have all server prefixes disaggregated, RIFT won't prevent you, if you get the implementation knob.

> The other idea I don't grok, you have to explain in more detail.

[KP]: As an alternative to disaggregation and announcing a withdraw
when the link between tier-1-1 and tier-2-1 goes down: we could have
all the RIFT instances on tier-1 configured to know that they should
not install a prefix *unless* they have seen it advertised by some
minimum number of tier-2 nodes. E.g. if there are 8 tier-2 nodes, we
might set that to, say, 4. Now we somewhat avoid the incast scenario
where the switch installs the disaggregated prefix with one next-hop
into its FIB. Instead it'll wait until it has a minimum of four
next-hops. (This is spitballing; it may open other problems.)
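
[KP]: In sketch form (pure spitball; MIN_NEXT_HOPS is a made-up knob,
not anything in the spec):

    MIN_NEXT_HOPS = 4  # e.g. with 8 tier-2 nodes northbound

    class PendingPrefix:
        def __init__(self):
            self.next_hops = set()

        def on_advertisement(self, tier2_node):
            # Record one advertisement; report when the prefix has
            # enough next-hops to be installable.
            self.next_hops.add(tier2_node)
            return len(self.next_hops) >= MIN_NEXT_HOPS

    p = PendingPrefix()
    for nbr in ["tier-2-2", "tier-2-3", "tier-2-4", "tier-2-5"]:
        if p.on_advertisement(nbr):
            print("install with:", sorted(p.next_hops))  # fires on the 4th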

Yes, that's always-negative-disaggregation, which is possible but much harder to implement, and you would somehow need to ring the ToP to have all the necessary topology information to achieve that (that's why we ring the ToF in multi-plane designs). The argument has been made before; we spent tons of time with Pascal going over pros and cons until the current design was found to be the best choice.