Re: [Rift] RIFT

Tony Przygienda <tonysietf@gmail.com> Sun, 21 April 2019 01:36 UTC

From: Tony Przygienda <tonysietf@gmail.com>
Date: Sat, 20 Apr 2019 18:35:52 -0700
Message-ID: <CA+wi2hNK=mEd9Y96kyJexYJZC4Q8V4e66FRBbhGnu91pmYJJxg@mail.gmail.com>
To: Kris Price <kris@krisprice.nz>
Cc: Antoni Przygienda <prz@juniper.net>, "rift@ietf.org" <rift@ietf.org>, "brunorijsman@gmail.com" <brunorijsman@gmail.com>
Archived-At: <https://mailarchive.ietf.org/arch/msg/rift/xbN6Ah0ZOGqw4C37ipNGPtZ1pKc>
Subject: Re: [Rift] RIFT

On Sat, Apr 20, 2019 at 11:58 AM Kris Price <kris@krisprice.nz> wrote:

> ...
>
> > [KP]: A top of rack switch ("tier-1" let's say) may be connected to 8
> > or 16 switches (or more) northbound (naturally let's call that next
> > tier "tier-2"). If any single link between a tier-1 and tier-2 switch
> > goes down (let's say between tier-1-1 and tier-2-1), all other nodes
> > in tier-2 (that is tier-2-2, tier-2-3, to tier-2-n) will determine
> > that tier-2-1 no longer has southbound reachability for tier-1-1's
> > prefixes and that they each need to disaggregate these to prevent
> > tier-1-2..n from sending any traffic for tier-1-1 via tier-2-1 (which
> > would then need to forward up to tier-3 and back down).
>

Let's look at a figure:

              .                                  [A,B,C,D]
              .                                  [E]
              .             +-----+      +-----+
 Level 2      .             |  E  |      |  F  | A/32 @ [C,D]
              .             +-+-+-+      +-+-+-+ B/32 @ [C,D]
              .               | |          | |   C/32 @ C
              .               | |    +-----+ |   D/32 @ D
              .               | |    |       |
              .               | +------+     |
              .               |      | |     |
 Level 1      .       [A,B] +-+---+  | | +---+-+ [A,B]
              .       [D]   |  C  +--+ +-+  D  | [C]
              .             +-+-+-+      +-+-+-+
              .  0/0  @ [E,F] | |          | |   0/0  @ [E,F]
              .  A/32 @ A     | |    +-----+ |   A/32 @ A
              .  B/32 @ B     | |    |       |   B/32 @ B
              .               | +------+     |
              .               |      | |     |
              .             +-+---+  | | +---+-+
              .             |  A  +--+ +-+  B  |
 Level 0      . 0/0 @ [C,D] +-----+      +-----+ 0/0 @ [C,D]

Let's call A a ToR and say it's holding 8 server addresses. If you lose the
D-A link, the only disaggregation you will see is C disaggregating the 8
addresses to B. This is unavoidable. I assume we agree.
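To make the mechanics concrete, here is a rough Python sketch (illustrative only; the function and data structures are invented for this mail, not taken from any RIFT implementation) of the decision C makes: a Level 1 node disaggregates exactly those southbound prefixes that some same-level peer can no longer reach.

```python
def prefixes_to_disaggregate(my_south_prefixes, peer_south_prefixes):
    """Return the prefixes this node must advertise south as more
    specifics: everything it reaches southbound that at least one
    same-level peer (still advertising default south) cannot reach."""
    missing = set()
    for reachable in peer_south_prefixes.values():
        missing |= my_south_prefixes - reachable
    return missing

# The figure's scenario: the D-A link is lost, so D can no longer
# reach A's prefixes southbound; C must disaggregate them to B.
c_reaches = {"A/32", "B/32"}
d_reaches = {"B/32"}  # D lost its link to A
print(prefixes_to_disaggregate(c_reaches, {"D": d_reaches}))  # -> {'A/32'}
```

When no link has failed, every peer reaches the same set and the function returns the empty set, i.e. nothing beyond the default is advertised south.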


>
> > [Prz] I think we have a disconnect here. The ToF level will only
> disaggregate if a ToF loses _all_ ToP connections to a PoD in a
> single-plane design, so I don't follow your argument. If you run a
> multi-plane design you should multi-home each PoD multiple times into
> your plane as well. If you don't, duh, you must disaggregate since the
> plane will blackhole.
>
> [KP]: I think the disconnect is due to my not using RIFT labels for
> devices. I'm not talking about the Top of Fabric. In RIFT labels I'm
> describing a PoD, where the Leafs are top of rack switches. Then when
> a Top of PoD<->Leaf link fails, the other Top of PoD switches will
> disaggregate the prefixes on and below that Leaf, leading to the incast
> problem described. (This is described in the draft.)
>

Right. It's really the unavoidable consequence of having aggregation: you
have to react to link failures by disaggregating.


>
> > Far more helpful than "deterministic" is to think, as in control
> system theory (https://en.wikipedia.org/wiki/Stability_theory), about
> "stability", where desirable positive stability is correlated with
> minimal blast radius. The more inputs shake more of your system, the
> less "stability" you have.
>
> [KP]: Absolutely, I'm not a mathematician, but reducing the amount of
> change under small perturbations was a concern at the back of my head
> when I described a preference for more deterministic behavior. Aside
> from addressing the incast concern, it was an intuition that adding
> and subtracting routes as they come and go would be less change than
> mass adds/removals when disaggregating. So e.g., sticking with the
> example of one PoD, where the link between a top of rack and top of
> PoD switch goes down: that means one top of PoD device withdraws one
> route (or maybe 1*many routes if routing on the host is happening),
> and that is less churn than say 7 other top of PoD switches
> advertising 7*many new routes.
>

If that's your preference, you can simply configure all your Level 1
switches to always disaggregate, at the cost of extra flooding on every
address change and increased FIB size in Level 0.
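As a sketch of what such a knob could look like (purely hypothetical; the key names are invented for illustration and are not options of any actual RIFT implementation), with the trade-off spelled out in comments:

```python
# Hypothetical configuration for a Level 1 (Top of PoD) node.
level1_config = {
    "level": 1,
    "southbound": {
        "advertise_default": True,
        # Always send more-specifics south instead of only on failure.
        # Cost: every server address change floods into the PoD, and
        # every Level 0 FIB must hold all of the PoD's prefixes.
        "always_disaggregate": True,
    },
}
```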


> > [KP]: Blast radius doesn't seem bigger to me. FIB explosion is a
> > design consideration for anyone before thinking about routes from
> > servers. Selectively, at small numbers, it's fine and practiced, e.g.
> > advertising prefixes from servers that are doing software load
> > balancing.
>

Your blast radius is somewhat bigger: every server coming up will affect
all Level 0s in the PoD.



>
> [KP]: I don't fully follow the statements about routing on the host
> and shaking the fabric. Sure, it would be a bad idea[tm] to do this in
> a flat OSPF domain. As I understood it, in RIFT with its link-state
> up, distance-vector down design, if we have a route come and go at the
> edge then it will be propagated all the way northbound, but will not
> be propagated southbound. If we were disaggregated between the Top of
> Rack switches and the Top of PoD switches, then a route appearing or
> disappearing does propagate back down from the Top of PoD to the Top
> of Rack, but that's it. It doesn't shake the wider fabric any more
> than without disaggregation. If we were disaggregated between the Top
> of Fabric and Top of PoD, routes would be propagated back down to the
> other Top of PoDs, but not below.
>

Yes, I meant if you run a flat OSPF domain with host routes (as people
actually do if the scale holds up).

Otherwise, yes, we agree. We just needed to talk in the same words about
the same things ;-)


>
> [KP]: WRT routing on the host, it also seems in contradiction with
> concerns about fabric stability. If fabric stability is a concern I
> would think you still want addressing hierarchy and to use another
> layer of indirection to achieve service mobility, to keep the fabric
> unaware of the services constantly popping in and out of existence.
>

Yes and no. It depends on what you need. If you want to multi-home
servers (since the impact on your services is non-negligible if you lose,
e.g., a ToR) and need automatic bandwidth balancing north (a nice thing),
no need for MC-LAG in L2, tunnel origination on the server without
stitching, automatic disaggregation on failures, a view of the full
topology at the top of fabric, and so on, this has lots of appeal.
Whereas if you start doing things like running BIRD on the host and then
redistributing a default route and so on, you don't get many of these
capabilities, and on top you have another layer/protocol instance to
manage.


> Yes, selectively this works, e.g. in the software load balancer
> scenario, tunnel ingest, etc. That is fine and widely practiced within
> limits. But the way you're describing it sounds like it's expected to
> be used generously, with every host announcing prefixes, and there's
> an expectation to move those prefixes such that you end up with a
> random distribution. (Which is fine, I am probably out of touch with
> fashion.)


Yup, that is the expectation (i.e. RIFT is designed to be able to support
that if needed). Look at the mobility section ;-)  Then you really have a
"fabric" vs. a "network", i.e. something that gives you bandwidth the same
way chips give you RAM. You don't think about which RAM bank your
allocation has to reside in to work, so why should you be concerned about
where and how you hook stuff up and whether your services move addresses,
if all you need is just "more bandwidth"?


> So with disaggregation in the PoD, servers being single-homed
> would still see just the default.


If your server is single-homed, running any kind of routing protocol
seems a waste really (unless you statically provision addresses and want
them carried through rather than using DHCP and so on). You may as well
point a static route out; it's not like you can load-balance, react to
failures, or do anything much.


> But all Top of Rack switches will
> see the prefixes from other servers in the PoD, whereas when
> aggregation is in effect they'll only see the default (plus anything
> disaggregated due to a failure). The Top of PoD will have all prefixes
> in the PoD and below, and further up in higher layers they'll all need
> to scale up their FIB requirements to see all fabric routes. That's
> the same in all cases with RIFT due to link state up.
>

Only the ToF needs all routes (which is Level 2 in a 5-stage folded
fabric) in the case of a single-plane fabric. In a multi-plane fabric
things are more complex. Any reasonable failure should be healed by
negative disaggregation in the levels higher up, but one could construct
completely pathological scenarios where you have to propagate all the way
down: if a server can reach another server through certain planes only,
it must know which planes to avoid to prevent an
up-fabric/down-fabric/up-fabric path, effectively turning other servers
into a ToF (which we call "fabric inversion" and which seems extremely
undesirable; BTW, in such scenarios your flooding in normal protocols
also has to go up/down/up, so once that happens you really don't have any
kind of "hierarchical fabric" but a bunch of nodes & links where traffic
tries to get places somehow). I think the draft explains that decently
well.


>
> [KP]: From purely the scaling perspective the aggregation feature is
> useful where the number of routes produced by the servers in a PoD can
> overwhelm the top of rack switches in that PoD but not the Top of PoD
> switches, and so on up higher layers in the fabric. The FIB available
> in devices these days would seem to preclude that.
>
>
Right, so it's an interesting discussion, and you're very focused on the
way you prefer to deploy it, and then that all makes sense. But if you,
for the reasons above, pull RIFT all the way down into multi-homed
servers, you realize that your FIB is small and that storing your
underlay routes competes directly with your overlay routes, which are the
ones paying the bills, so the dynamic changes. If your ToRs are
originating overlays (as in EVPN, e.g.) you'll face the same calculus.
RIFT is agnostic: run it to the ToR and disaggregate servers if you want,
it will work fine; but it also allows you to pull it all the way to RotH
(routing on the host) with fast mobility of addresses, or to run EVPN
origination on the ToRs and use all-active or MC-LAG or whatever from the
servers; it will all work.

So, we have an applicability document pending and this kind of stuff should
all go into it IMO. Feel free to drum up the crowd and start/massage it.


> [snip]
>
> > yes, it's always-negative-disaggregation which is possible, however
> much harder to implement, and you would somehow need to ring the ToP to
> have all the necessary topology information to achieve that (that's why
> we ring the ToF in multi-plane design). The argument has been made
> before; we spent tons of time with Pascal going over pros and cons
> until the current design was found to be the best choice. [snip]
>
> [KP]: It looks like negative disaggregation could be an elegant
> protocol level solution if feasible and reliable.
>

We spec'ed it out solid, methinks, and you'll find examples and so on in
the spec. Implementation doesn't look very challenging; the most
interesting part is the recursive FIB hole punching in the case of
negative disaggregates, but in fabrics it seems very unlikely people will
carry lots of aggregates together with more specifics, so then the
problem doesn't even exist. Silicon is oblivious to it BTW; it all
happens in the control plane. If you read that and have further input,
all interested in that ...
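For the curious, the hole punching can be sketched roughly like this (a simplification with invented names, not the spec's exact procedure; it assumes positive routes map prefixes to next-hop sets and a negative disaggregate names the next hops to avoid): a destination inherits the next hops of its closest covering positive route, then subtracts every next hop that negatively disaggregated it.

```python
import ipaddress

def effective_next_hops(dest, positive, negative):
    """Resolve next hops for `dest` in the presence of negative
    disaggregates.

    positive: {prefix: set of next hops that advertised it}
    negative: {prefix: set of next hops that sent a negative
               disaggregate, i.e. "don't send me traffic for this"}
    """
    d = ipaddress.ip_network(dest)
    # Closest (longest) covering positive route; may be the default.
    covering = max(
        (p for p in positive if ipaddress.ip_network(p).supernet_of(d)),
        key=lambda p: ipaddress.ip_network(p).prefixlen,
        default=None,
    )
    if covering is None:
        return set()
    hops = set(positive[covering])
    floor = ipaddress.ip_network(covering).prefixlen
    # Punch holes: a negative covering `dest`, at least as specific as
    # the positive route it is carving into, removes its next hops.
    for p, banned in negative.items():
        pn = ipaddress.ip_network(p)
        if pn.supernet_of(d) and pn.prefixlen >= floor:
            hops -= banned
    return hops

# A ToF lost its southbound path to one server prefix in its plane and
# floods a negative disaggregate; traffic to that prefix avoids F while
# everything else still load-balances over both spines.
pos = {"0.0.0.0/0": {"E", "F"}}
neg = {"10.0.0.1/32": {"F"}}
print(effective_next_hops("10.0.0.1/32", pos, neg))  # -> {'E'}
print(effective_next_hops("10.0.0.2/32", pos, neg))  # both E and F remain
```

The "recursive" part is visible in the inheritance step: if an aggregate sits between the default and the negative, holes have to be re-punched into each more-specific positive route as well, which is exactly the bookkeeping that disappears when fabrics don't carry aggregates alongside more specifics.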

Nice that you're drilling in. I think lots of people have looked at the
stuff over the last year+ and we closed all the holes and discussed all
the design choices, but one more pair of experienced eyes never hurts.

--- tony