Re: [Rift] RIFT

Kris Price <kris@krisprice.nz> Sat, 20 April 2019 18:58 UTC

From: Kris Price <kris@krisprice.nz>
Date: Sat, 20 Apr 2019 14:58:41 -0400
Message-ID: <CACqcHa2UumgmnhdY-W6s-nzZhz8Ct+WEV=z+opvyeZ7aX=E9CA@mail.gmail.com>
To: Antoni Przygienda <prz@juniper.net>
Cc: "brunorijsman@gmail.com" <brunorijsman@gmail.com>, "rift@ietf.org" <rift@ietf.org>
Archived-At: <https://mailarchive.ietf.org/arch/msg/rift/UeyVgYniYmOaXPlIITYwSjhQfHs>
Subject: Re: [Rift] RIFT

Inline:

[snip]

> [Prz] Modern architectures I see will be moving to good extent to ROTH IMO due to micro-segmentation and tunnel origination on servers.

[KP]: Will respond to this further down.

> [KP]: A top of rack switch ("tier-1" lets say) may be connected to 8
> or 16 switches (or more) northbound (naturally let's call that next
> tier "tier-2"). If any single link between a tier-1 and tier-2 switch
> goes down (let's say between tier-1-1 and tier-2-1), all other nodes
> in tier-2 (that is tier-2-2, tier-2-3, to tier-2-n) will determine
> that tier-2-1 no longer has southbound reachability for tier-1-1's
> prefixes and that they each need to disaggregate these to prevent
> tier-1-2..n from sending any traffic for tier-1-1 via tier-2-1 (which
> would then need to forward up to tier-3 and back down).
>
> [Prz] I think we have a disconnect here. ToF level will only disaggregate if a ToF loses _all_ ToP connections to a PoD in a single plane design so I don't follow your argument. If you run multi-plane design you should multi-home each PoD multiple times into your plane as well. If you don't, dugh, you must disaggregate since the plane will blackhole.

[KP]: I think the disconnect is due to my not using RIFT labels for
devices. I'm not talking about top of fabric. In RIFT labels I'm
describing a PoD, where the Leafs are top of rack switches. Then when
a Top of PoD<->Leaf link fails, the other Top of PoD switches will
disaggregate the prefixes on and below that Leaf leading to the incast
problem described. (This is described in the draft.)
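As a rough illustration of the behavior described above, here is a minimal sketch of which Top of PoD switches would disaggregate which prefixes after a single Top of PoD<->Leaf link failure. It is a simplification, not the algorithm from the draft: the function name, the switch names, and the prefixes are all hypothetical, and it only models one PoD with a set of links that are up.

```python
def positive_disaggregation(tops, links_up, leaf_prefixes):
    """Return {top: prefixes it would advertise southbound as disaggregated}.

    tops          -- names of the Top of PoD switches (hypothetical)
    links_up      -- set of (top, leaf) pairs whose link is currently up
    leaf_prefixes -- {leaf: [prefixes on and below that leaf]}
    """
    result = {}
    for top in tops:
        advertise = []
        for leaf, prefixes in leaf_prefixes.items():
            if (top, leaf) not in links_up:
                continue  # this Top of PoD cannot reach the leaf itself
            # Peers at the same level that have lost their link to this leaf.
            peers_missing = [
                t for t in tops if t != top and (t, leaf) not in links_up
            ]
            if peers_missing:
                # Other leaves must not send this traffic via those peers,
                # so advertise the more specific prefixes southbound.
                advertise.extend(prefixes)
        result[top] = sorted(advertise)
    return result


# Hypothetical PoD: three Top of PoD switches, two leaves, full mesh
# except for the failed top-1 <-> leaf-1 link.
tops = ["top-1", "top-2", "top-3"]
leaf_prefixes = {"leaf-1": ["10.0.1.0/24"], "leaf-2": ["10.0.2.0/24"]}
links_up = {(t, l) for t in tops for l in leaf_prefixes}
links_up.discard(("top-1", "leaf-1"))

plan = positive_disaggregation(tops, links_up, leaf_prefixes)
```

In this toy topology top-2 and top-3 each end up advertising leaf-1's prefix, while top-1 advertises nothing extra.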

[snip]

> [KP]: It's a preference for more deterministic behavior of the fabric
> over less deterministic behavior.
>
> [Prz] Well, having blast radius of whole fabric is in a sense deterministic with every server changing/rebooting shaking whole fabric. I wouldn't call it optimal though.

[KP]: Will respond to this further down.

> Far more helpful than "deterministic" is in control system theory (https://en.wikipedia.org/wiki/Stability_theory) to think about "stability" where desirable positive stability is correlated with minimal blast radius. The more inputs shake more of your system the less "stability" you have.

[KP]: Absolutely, I'm not a mathematician, but reducing the amount of
change under small perturbations is a concern at the back of my head
when I described a preference for more deterministic behavior. Aside
from addressing the incast concern, it was an intuition that adding
routes, and subtracting routes as they come and go would be less
change than mass adds/removals when disaggregating. E.g. sticking
with the example of one PoD, where the link between a top of rack and
top of PoD switch goes down: one top of PoD device withdraws one
route (or maybe 1*many routes if routing on host is happening), and
that is less churn than, say, 7 other top of PoD switches advertising
7*many new routes.
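
The comparison above is simple arithmetic, and can be sketched as follows. The function names and the numbers (8 Top of PoD switches, 50 routes behind the affected rack) are assumptions picked for illustration only; the counts are route updates originated, not updates received per leaf.

```python
def routes_withdrawn(routes_behind_leaf):
    """Without disaggregation: the one Top of PoD switch that lost
    the link withdraws the routes it learned over that link."""
    return 1 * routes_behind_leaf


def routes_disaggregated(tops_in_pod, routes_behind_leaf):
    """With positive disaggregation: every *other* Top of PoD switch
    advertises the affected routes as more specifics."""
    return (tops_in_pod - 1) * routes_behind_leaf


many = 50          # hypothetical number of routes behind the rack
tops_in_pod = 8    # hypothetical: one failed peer plus 7 others

withdrawn = routes_withdrawn(many)
disaggregated = routes_disaggregated(tops_in_pod, many)
```

With these assumed numbers that is 50 withdrawals against 350 new advertisements, i.e. the 1*many vs. 7*many ratio from the text.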

> [Prz] But again, if you want to disaggregate everything all the time, RIFT won't stop you and you still will be benefiting from flood reduction and N-flooding-only in RIFT which makes for about 25% of normal flooding volume based on empirical data here ...

[KP]: That's good, that was my hope. There are benefits to RIFT beyond
its positive disaggregation feature.

[snip]

> [KP]: Blast radius doesn't seem bigger to me. FIB explosion is a
> design consideration for anyone before thinking about routes from
> servers. At a small number selectively it's fine and practiced, e.g.
> advertising prefixes from servers that are doing software load
> balancing.
>
> [Prz] Then we disconnect. Think about flat host routing & what rebooting one server does to you in terms of flooding & resulting computation and so on and what RIFT blast radius is. There is a world of difference.

[KP]: I don't fully follow the statements about routing on the host
and shaking the fabric. Sure, it would be a bad idea[tm] to do this in
a flat OSPF domain. As I understood it in RIFT, with its link state
up, distance vector down design, if we have a route come and go at the
edge then it will be propagated all the way northbound, but will not
be propagated southbound. If we were disaggregated between the Top of
Rack switches and the Top of PoD switches, then a route appearing or
disappearing does propagate back down from the Top of PoD to the Top
of Rack, but that's it. It doesn't shake the wider fabric any more
than without disaggregation. If we were disaggregated between the Top
of Fabric and Top of PoD, routes would be propagated back down to the
other Top of PoDs, but not below.
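
To make the propagation scope in the paragraph above concrete, here is a small sketch of which groups of nodes would see a single edge route come or go, depending on where disaggregation is in effect. It only encodes the understanding stated in this message; the function name, parameter names, and group labels are hypothetical, and it is not taken from the draft.

```python
def nodes_seeing_change(disagg_top_to_rack=False, disagg_tof_to_top=False):
    """Return the groups of nodes that learn about one route change
    at the edge, per the link state up, distance vector down reading."""
    # Northbound: the change is always propagated all the way up.
    seen = {
        "origin top of rack",
        "tops of origin PoD",
        "top of fabric",
    }
    if disagg_top_to_rack:
        # Propagated back down from the Top of PoD to the Top of Rack.
        seen.add("other tops of rack in origin PoD")
    if disagg_tof_to_top:
        # Propagated back down to the other Top of PoDs, but not below.
        seen.add("tops of other PoDs")
    return seen


aggregated = nodes_seeing_change()
in_pod = nodes_seeing_change(disagg_top_to_rack=True)
at_fabric = nodes_seeing_change(disagg_tof_to_top=True)
```

In none of the three cases do the Top of Rack switches in other PoDs appear in the result, which is the point being made about the wider fabric.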

[KP]: WRT routing on the host, it also seems in contradiction with
concerns about fabric stability. If fabric stability is a concern I
would think you still want addressing hierarchy and to use another
layer of indirection to achieve service mobility, to keep the fabric
unaware of the services constantly popping in and out of existence.
Yes, selectively this works, e.g. in the software load balancer
scenario, tunnel ingest, etc. That is fine and widely practiced within
limits. But the way you're describing it sounds like it's expected to
be used generously, with every host announcing prefixes, and there's
an expectation to move those prefixes such that you end up with a
random distribution. (Which is fine, I am probably out of touch with
fashion.) So with disaggregation in the PoD, servers being single-homed
would still see just the default. But all Top of Rack switches will
see the prefixes from other servers in the PoD, vs. when aggregation
is in effect and they'll only see the default (plus any disaggregated
due to a failure). The Top of PoD will have all prefixes in the PoD
and below, and further up in higher layers they'll all need to scale
up their FIB requirements to see all fabric routes. That's the same in
all cases with RIFT due to link state up.

[KP]: From purely the scaling perspective the aggregation feature is
useful where the number of routes produced by the servers in a PoD can
overwhelm the top of rack switches in that PoD but not the Top of PoD
switches, and so on up higher layers in the fabric. The FIB available
in devices these days would seem to preclude that.

[snip]

> yes, it's always-negative-disaggregation which is possible, however much harder to implement and you would somehow need to ring the ToP to have all the necessary topology information to achieve that (that's why we ring ToF in multi-plane design). Argument has been made before, we spent tons of time with Pascal going pro and cons until the current design was found the best choice.[snip]

[KP]: It looks like negative disaggregation could be an elegant
protocol level solution if feasible and reliable.

[KP]: I think the answer is that RIFT has looked into the problem and
prospective operators are happy with the risks and trade-offs, so I
accept that. :-)

Thanks!
Kris