Re: [Rift] RIFT
Tony Przygienda <tonysietf@gmail.com> Sun, 21 April 2019 01:36 UTC
From: Tony Przygienda <tonysietf@gmail.com>
Date: Sat, 20 Apr 2019 18:35:52 -0700
Message-ID: <CA+wi2hNK=mEd9Y96kyJexYJZC4Q8V4e66FRBbhGnu91pmYJJxg@mail.gmail.com>
To: Kris Price <kris@krisprice.nz>
Cc: Antoni Przygienda <prz@juniper.net>, "rift@ietf.org" <rift@ietf.org>, "brunorijsman@gmail.com" <brunorijsman@gmail.com>
Archived-At: <https://mailarchive.ietf.org/arch/msg/rift/xbN6Ah0ZOGqw4C37ipNGPtZ1pKc>
List-Id: Discussion of Routing in Fat Trees <rift.ietf.org>
On Sat, Apr 20, 2019 at 11:58 AM Kris Price <kris@krisprice.nz> wrote:

> ...
>
> [KP]: A top of rack switch ("tier-1" let's say) may be connected to 8
> or 16 switches (or more) northbound (naturally let's call that next
> tier "tier-2"). If any single link between a tier-1 and tier-2 switch
> goes down (let's say between tier-1-1 and tier-2-1), all other nodes
> in tier-2 (that is tier-2-2, tier-2-3, to tier-2-n) will determine
> that tier-2-1 no longer has southbound reachability for tier-1-1's
> prefixes and that they each need to disaggregate these to prevent
> tier-1-2..n from sending any traffic for tier-1-1 via tier-2-1 (which
> would then need to forward up to tier-3 and back down).

Let's look at a figure:

. [A,B,C,D]
.                           [E]
.              +-----+      +-----+        Level 2
.              |  E  |      |  F  |        A/32 @ [C,D]
.              +-+-+-+      +-+-+-+        B/32 @ [C,D]
.                | |          | |          C/32 @ C
.                | |    +-----+ |          D/32 @ D
.                | |    |       |
.                | +------+     |
.                |      | |     |          Level 1
. [A,B]        +-+---+  | | +---+-+        [A,B]
. [D]          |  C  +--+ +-+  D  |        [C]
.              +-+-+-+      +-+-+-+
. 0/0 @ [E,F]    | |          | |          0/0 @ [E,F]
. A/32 @ A       | |    +-----+ |          A/32 @ A
. B/32 @ B       | |    |       |          B/32 @ B
.                | +------+     |
.                |      | |     |
.              +-+---+  | | +---+-+
.              |  A  +--+ +-+  B  |        Level 0
. 0/0 @ [C,D]  +-----+      +-----+        0/0 @ [C,D]

Let's call A a ToR holding 8 server addresses. If you lose D-A, the
only disaggregation you will see is C disaggregating the 8 addresses
to B. This is unavoidable. I assume we agree.

> [Prz]: I think we have a disconnect here. The ToF level will only
> disaggregate if a ToF loses _all_ ToP connections to a PoD in a
> single-plane design, so I don't follow your argument. If you run a
> multi-plane design you should multi-home each PoD multiple times into
> your plane as well. If you don't, duh, you must disaggregate since
> the plane will blackhole.
>
> [KP]: I think the disconnect is due to my not using RIFT labels for
> devices. I'm not talking about top of fabric. In RIFT labels I'm
> describing a PoD, where the Leafs are top of rack switches.
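The trigger condition being described here (a healthy node at level N
disaggregating any southbound prefix a same-level peer has lost) can be
sketched roughly as a set difference. This is a toy illustration only;
the function and data structures are hypothetical, and the normative
rules are in the RIFT draft:

```python
# Hedged sketch of the positive-disaggregation trigger discussed above.
# All names are illustrative, not taken from the RIFT specification.

def prefixes_to_disaggregate(my_south_prefixes, peer_south_prefixes):
    """A node compares its own southbound reachability with that of
    its same-level peers. Any prefix it can reach south that some peer
    cannot must be advertised south as a more-specific (disaggregated)
    route, so traffic is not default-routed into the broken peer."""
    to_disaggregate = set()
    for peer, prefixes in peer_south_prefixes.items():
        to_disaggregate |= my_south_prefixes - prefixes
    return to_disaggregate

# Using the figure: C still reaches A and B; D has lost its link to A.
mine = {"A/32", "B/32"}
peers = {"D": {"B/32"}}
print(prefixes_to_disaggregate(mine, peers))  # {'A/32'}
```

This matches the figure above: C disaggregates A/32 toward B, and
nothing else changes.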
> Then when a Top of PoD<->Leaf link fails, the other Top of PoD
> switches will disaggregate the prefixes on and below that Leaf,
> leading to the incast problem described. (This is described in the
> draft.)

Right. It's really the only possible choice between having aggregation
and having to react to link failure by disaggregating.

> > Far more helpful than "deterministic" is, in control system theory
> > (https://en.wikipedia.org/wiki/Stability_theory), to think about
> > "stability", where desirable positive stability is correlated with
> > minimal blast radius. The more inputs shake more of your system,
> > the less "stability" you have.
>
> [KP]: Absolutely. I'm not a mathematician, but reducing the amount of
> change under small perturbations was a concern at the back of my head
> when I described a preference for more deterministic behavior. Aside
> from addressing the incast concern, it was an intuition that adding
> and subtracting routes as they come and go would be less change than
> mass adds/removals when disaggregating. So e.g., sticking with the
> example of one PoD where the link between a top of rack and top of
> PoD switch goes down: that means one top of PoD device withdraws one
> route (or maybe 1*many routes if routing on host is happening), and
> that is less churn than, say, 7 other top of PoD switches advertising
> 7*many new routes.

If that's your preference you can simply configure all your Level 1
switches to always disaggregate, at the cost of extra flooding on
every address change and FIB size in Level 0.

> [KP]: Blast radius doesn't seem bigger to me. FIB explosion is a
> design consideration for anyone before thinking about routes from
> servers. At a small number, selectively, it's fine and practiced,
> e.g. advertising prefixes from servers that are doing software load
> balancing.

Your blast radius is somewhat bigger: every server coming up will
affect all Level 0s in the PoD.
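The churn comparison in Kris's example above is simple arithmetic; the
sketch below just restates it (illustrative numbers, not from the
draft): on a ToR<->ToP link failure, host routing withdraws one batch
of routes at one node, while disaggregation adds a batch at every
surviving ToP.

```python
# Back-of-envelope churn comparison for a ToR <-> Top-of-PoD link
# failure, with `tops` Top-of-PoD switches and `m` prefixes below the
# affected ToR (hypothetical parameters, mirroring the "1*many vs.
# 7*many" argument in the text above).

def churn_with_host_routes(m):
    # The one affected ToP withdraws the m routes it lost.
    return m

def churn_with_disaggregation(tops, m):
    # The (tops - 1) healthy ToPs each advertise m more-specifics south.
    return (tops - 1) * m

print(churn_with_host_routes(100))        # 100
print(churn_with_disaggregation(8, 100))  # 700
```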
> [KP]: I don't fully follow the statements about routing on the host
> and shaking the fabric. Sure, it would be a bad idea[tm] to do this
> in a flat OSPF domain. As I understood it, in RIFT with its link
> state up, distance vector down design, if we have a route come and go
> at the edge then it will be propagated all the way northbound, but
> will not be propagated southbound. If we were disaggregated between
> the Top of Rack switches and the Top of PoD switches, then a route
> appearing or disappearing does propagate back down from the Top of
> PoD to the Top of Rack, but that's it. It doesn't shake the wider
> fabric any more than without disaggregation. If we were disaggregated
> between the Top of Fabric and Top of PoD, routes would be propagated
> back down to the other Top of PoDs, but not below.

Yes, I meant if you run a flat OSPF domain with host routes (as people
actually do if the scale holds up). Otherwise, yes, we agree. We just
needed to talk in the same words about the same things ;-)

> [KP]: WRT the routing on host, it also seems in contradiction with
> concerns about fabric stability. If fabric stability is a concern I
> would think you still want addressing hierarchy and to use another
> layer of indirection to achieve service mobility, to keep the fabric
> unaware of the services constantly popping in and out of existence.

Yes and no, depends what you need. If you want to multi-home servers
(since the impact on your services is non-negligible if you lose e.g.
a ToR) and need automatic bandwidth balancing north (nice thing), no
need for MC-LAG in L2, tunnel origination on the server without
stitching, automatic disaggregation on failures, a view of the full
topology at the top of fabric and so on, this has lots of appeal. And
then, if you start to do things like running BIRD on the host and
redistributing a default route and so on, you don't have lots of these
capabilities, and on top another layer/protocol instance to manage.

> Yes, selectively this works, e.g.
> in the software load balancer scenario, tunnel ingest, etc. That is
> fine and widely practiced within limits. But the way you're
> describing it sounds like it's expected to be used generously, with
> every host announcing prefixes, and there's an expectation to move
> those prefixes such that you end up with a random distribution.
> (Which is fine, I am probably out of touch with fashion.)

Yup, that is the expectation (i.e. RIFT is designed to be able to
support that if needed). Look at the mobility section ;-) Then you
really have a "fabric" vs. a "network", i.e. something that gives you
bandwidth the same way chips give you RAM. You don't think about which
RAM bank your allocation has to reside in to work, so why should you
be concerned where and how you hook stuff up and whether your services
move addresses if all you need is just "more bandwidth".

> So with disaggregation in the PoD, servers being single homed would
> still see just the default.

If your server is single homed, running any kind of routing protocol
seems a waste really (unless you statically provision addresses and
want them carried through rather than using DHCP and so on). You can
as well point a static out; it's not like you can load balance, react
to failures or do anything much.

> But all Top of Rack switches will see the prefixes from other servers
> in the PoD, vs. when aggregation is in effect they'll only see the
> default (plus any disaggregated due to a failure). The Top of PoD
> will have all prefixes in the PoD and below, and further up in higher
> layers they'll all need to scale up their FIB requirements to see all
> fabric routes. That's the same in all cases with RIFT due to link
> state up.

Only the ToF needs all routes (which is Level 2 in a 5-stage folded
fabric) in the case of a single-plane fabric. In a multi-plane fabric
things are more complex.
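The single-plane scaling point just made (with aggregation, only the
ToF carries everything) can be sketched as a rough per-level route
count. The formula and numbers are illustrative assumptions, not from
the draft:

```python
# Illustrative route-count estimate per level in a single-plane,
# aggregated, 5-stage folded fabric, following the "only the ToF needs
# all routes" observation above. All parameters are hypothetical.

def routes_at(level, pods, tors_per_pod, prefixes_per_tor):
    if level == 0:  # ToR: its own prefixes plus a default north
        return prefixes_per_tor + 1
    if level == 1:  # ToP: everything in its own PoD plus a default north
        return tors_per_pod * prefixes_per_tor + 1
    # ToF (Level 2): all routes in the fabric
    return pods * tors_per_pod * prefixes_per_tor

for lvl in (0, 1, 2):
    print(lvl, routes_at(lvl, pods=16, tors_per_pod=16, prefixes_per_tor=48))
```

With these example numbers the ToR holds 49 routes, the ToP 769, and
the ToF 12288, which is the asymmetry being discussed.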
Any reasonable failure should be healed by negative disaggregation in
the levels higher up, but one could construct completely pathological
scenarios where you have to propagate all the way down: if a server
can reach another server through certain planes only, it must know
which planes to avoid to prevent an up-fabric/down-fabric/up-fabric
path, effectively turning other servers into ToF (which we call
"fabric inversion" and which seems extremely undesirable; BTW, in such
scenarios your flooding on normal protocols also has to go up/down/up,
so once that happens you really don't have any kind of "hierarchical
fabric" but a bunch of nodes & links where traffic tries to get places
somehow). I think the draft explains that decently well.

> [KP]: From purely the scaling perspective, the aggregation feature
> is useful where the number of routes produced by the servers in a
> PoD can overwhelm the top of rack switches in that PoD but not the
> Top of PoD switches, and so on up higher layers in the fabric. The
> FIB available in devices these days would seem to preclude that.

Right, so it's an interesting discussion, and you're very focused on
the way you prefer to deploy it, and then that all makes sense. But if
you have, for the reasons above, pulled RIFT all the way down into a
multi-homed server, you realize that your FIB is small and that
storing your underlay routes competes directly with your overlay
routes, which are the ones paying the bills, so the dynamic changes.
If your ToRs are originating overlays (as in EVPN e.g.) you'll face
the same calculus. RIFT is agnostic: run it to the ToR and
disaggregate servers if you want, it will work fine, but it also
allows you to pull it all the way to ROTH with fast mobility of
addresses, or run EVPN origination on the ToRs and use all-active or
MC-LAG or whatever from the servers; it will all work. So, we have an
applicability document pending and this kind of stuff should all go
into it IMO. Feel free to drum up the crowd and start/massage it.
> [snip]
>
> > yes, it's always-negative-disaggregation which is possible, however
> > much harder to implement, and you would somehow need to ring the
> > ToP to have all the necessary topology information to achieve that
> > (that's why we ring the ToF in a multi-plane design). The argument
> > has been made before; we spent tons of time with Pascal going over
> > pros and cons until the current design was found to be the best
> > choice. [snip]
>
> [KP]: It looks like negative disaggregation could be an elegant
> protocol level solution if feasible and reliable.

We spec'ed it out solid, methinks, & you'll find examples and so on in
the spec. Implementation doesn't look very challenging; the most
interesting part is the recursive FIB hole punching in the case of
negative disaggregates, but in fabrics it seems very unlikely people
will carry lots of aggregates together with more specifics, so then
the problem doesn't even exist. Silicon is oblivious to it BTW; it all
happens in the control plane. If you read that and have further input,
all interested in that ... nice that you're drilling. I think lots of
people have looked at the stuff over the last year+ and we closed all
the holes and discussed out all the design choices, but one more pair
of experienced eyes never hurts.

--- tony
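The "recursive FIB hole punching" mentioned above can be illustrated
with a toy model (purely illustrative; the normative procedure is in
the RIFT draft, and all table contents here are made up): a negative
disaggregate for a prefix subtracts the negating next hops from
whatever the closest covering route provides, recursing up toward the
default.

```python
import ipaddress

# Toy model of negative-disaggregation hole punching in a FIB.
# positive: prefix -> next hops; negative: prefix -> next hops that
# advertised a negative disaggregate for it (hypothetical data).

positive = {
    "0.0.0.0/0": {"spine1", "spine2"},   # southbound default
}
negative = {
    "10.1.0.0/16": {"spine2"},           # spine2 cannot reach 10.1/16
}

def covering(prefix, table):
    """Longest strictly-shorter covering prefix present in `table`."""
    net = ipaddress.ip_network(prefix)
    best = None
    for cand in table:
        c = ipaddress.ip_network(cand)
        if c.prefixlen < net.prefixlen and net.subnet_of(c):
            if best is None or c.prefixlen > ipaddress.ip_network(best).prefixlen:
                best = cand
    return best

def next_hops(prefix):
    """Effective next hops after recursively punching negative holes."""
    if prefix in positive:
        return set(positive[prefix])
    cover = covering(prefix, {**positive, **negative})
    if cover is None:
        return set()
    hops = next_hops(cover)           # recurse toward the default
    if prefix in negative:
        hops -= negative[prefix]      # punch the hole
    return hops

print(next_hops("10.1.0.0/16"))  # {'spine1'}
```

The recursion only matters when aggregates and more-specifics coexist,
which, as noted above, is unlikely in practice inside a fabric.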