Re: [Rift] RIFT

Kris Price <kris@krisprice.nz> Sat, 20 April 2019 16:16 UTC

References: <CACqcHa05D9mCNWPtkHMw4t-0opbz33PsnB9Ts=wadfM1UD4cNA@mail.gmail.com> <MWHPR05MB32798B45DD99D8ABF75B875AAC280@MWHPR05MB3279.namprd05.prod.outlook.com> <CACqcHa3TnRS76Rnr5Wkq4_L47i5ZQLQiZFy5aNt3zmTr487LrA@mail.gmail.com> <MWHPR05MB32798005D0A97DCC996CCB11AC260@MWHPR05MB3279.namprd05.prod.outlook.com> <CACqcHa1=iLcakOE-O1cWWHH+7qu0hvoaT5hq_5FfjcqNsi3MUw@mail.gmail.com> <8B6D2FC6-15D3-4133-AAB4-160E1D82A827@gmail.com>
In-Reply-To: <8B6D2FC6-15D3-4133-AAB4-160E1D82A827@gmail.com>
From: Kris Price <kris@krisprice.nz>
Date: Sat, 20 Apr 2019 12:16:27 -0400
Message-ID: <CACqcHa0-s6sP6YG7MVASByna_aqQ1SrsCnf00eJH-Q_z7U_LUw@mail.gmail.com>
To: Bruno Rijsman <brunorijsman@gmail.com>
Cc: Tony Przygienda <prz@juniper.net>, "rift@ietf.org" <rift@ietf.org>
Archived-At: <https://mailarchive.ietf.org/arch/msg/rift/M9Nm6o8ILZUJV0r-ybo2-cui3Rg>
Subject: Re: [Rift] RIFT

Hi Bruno,

Not sure if I followed negative disaggregation correctly. Is this used
at levels below the top of fabric or was it something that was
discussed as a possibility?

Cheers
Kris

On Thu, Apr 18, 2019 at 6:10 PM Bruno Rijsman <brunorijsman@gmail.com> wrote:
>
> Kris,
>
> What is your opinion on negative disaggregation as a solution for the transient incast-like congestion after a failure with positive disaggregation?
>
> — Bruno
>
> > On Apr 18, 2019, at 5:07 PM, Kris Price <kris@krisprice.nz> wrote:
> >
> > Hey Tony, inline:
> >
> > [snip]
> >> On the rings: Ahh! I get it, okay that makes it better. I was also
> >> wondering if some kind of designated 'S-TIE' reflector / virtual links
> >> / or explicitly configured multi-hop adjacencies solution could be
> >> used (the issue being one of how do you route these packets between
> >> the peers without needing to do something like source route multiple
> >> hops southbound before being default routed northbound).
> >>
> >> good, I know it takes a bit to grok the stuff. We did the best we could with ASCII and language but the concepts need some chewing for sure, even if you have been around big fabrics for a bit ;-) So, nothing like route reflectors and so on: within a plane normal south reflection takes care of sync'ing up all you need, outside the plane the ring takes care of sync'ing up planes (for flooding, horizontal links below ToF are south and @ ToF level north, basically), and with that you have all the topology to figure out negative disaggregation.  I explicitly killed any "virtual link" suggestions, I went through this particular hell in my life more than once and don't want to visit it anymore ;-) ...
> >
> > [KP]: I'm a bit skeptical of buy-in to rings as a solution, but if you
> > have customers buying into that then that's cool. (I omitted
> > describing the physical *shudder* when I wrote "virtual links" ;-))
> >
> > [snip]
> >> It seems this would arise frequently at the bottom two tiers of the
> >> network. Any loss of any single link to any rack (tier 0) would result
> >> in all other nodes at tier 1 disaggregating the prefix(es) for that
> >> rack and causing the potential transient incast-like congestion. I'm a
> >> bit concerned that this may be a noticeable event in some cases (e.g.
> >> a storage row/cluster or maybe where RoCE is in use), and one that
> >> would be fairly annoying to debug and remedy post transition to RIFT
> >> if you didn't foresee it and have the tools (knobs) in place to
> >> prevent it from happening without a PR and s/w upgrade.
> >>
> >> yep, you call a spade a spade, but you're a bit too pessimistic, methinks. Let's assume 2 ToRs dual-homing a rack or a couple of racks of servers. If you lose a link in a multi-homed server you basically end up having the other ToR de-aggregating just this server prefix to other servers (even if you run some kubernetes @ scale you may have 100 prefixes or so I'd say, I can't imagine a server hosting thousands really) ... Then, if you think about the ToRs on top of PoD then it's not as bad as you think. If you lose a single ToR in a PoD towards a spine (I'm loose with terminology here) then you will NOT see disaggregation as long as the other ToRs in the PoD are still connected to the PoD. Draw pictures & run the public consumption package ;-)  More interesting discussions are bandwidth balancing on link losses (which I think we solved well northbound) and whether it even should be done southbound since the notion of "available bandwidth southbound" is confounding ... Spec doesn't forbid it (the beauty of loop-free valley-free routing that gives you an insane amount of leeway in how you choose to forward) BTW if someone is smart enough to figure that out ;-) ...
> >
> > [KP]: You're not the first to describe me as a pessimist. :-) I don't
> > follow the 2x ToRs and multi-homed servers part, I haven't seen that
> > used in a very long time, and granted I've been out of the game for a
> > bit but is anyone still multihoming servers at scale? Maybe certain
> > enterprise use cases, but they're not pushing the boundaries of scale
> > so don't need the aggregation anyway.
> >
> > [KP]: A top of rack switch ("tier-1" let's say) may be connected to 8
> > or 16 switches (or more) northbound (naturally let's call that next
> > tier "tier-2"). If any single link between a tier-1 and tier-2 switch
> > goes down (let's say between tier-1-1 and tier-2-1), all other nodes
> > in tier-2 (that is tier-2-2, tier-2-3, to tier-2-n) will determine
> > that tier-2-1 no longer has southbound reachability for tier-1-1's
> > prefixes and that they each need to disaggregate these to prevent
> > tier-1-2..n from sending any traffic for tier-1-1 via tier-2-1 (which
> > would then need to forward up to tier-3 and back down).
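The decision described above can be sketched roughly as follows. This is a simplified model, not the algorithm from the RIFT spec: the function name, the node names, and the dict-of-sets layout are illustrative assumptions, and it only looks at direct tier-2 to tier-1 adjacencies.

```python
# Simplified model of the positive disaggregation decision.
# All names and the data layout are assumptions made for illustration.

def prefixes_to_disaggregate(me, south_adjacencies):
    """Return the tier-1 nodes whose prefixes `me` should advertise as more specifics.

    south_adjacencies maps each tier-2 node to the set of tier-1 nodes it can
    still reach directly. `me` disaggregates a tier-1 node when it can reach
    that node itself and at least one tier-2 peer cannot.
    """
    mine = south_adjacencies[me]
    result = set()
    for peer, reachable in south_adjacencies.items():
        if peer == me:
            continue
        # Anything I can reach that this peer cannot needs a more specific route.
        result |= mine - reachable
    return result


# 8 tier-2 nodes, 4 tier-1 nodes, fully meshed before the failure.
tier1 = {f"tier-1-{i}" for i in range(1, 5)}
adjacencies = {f"tier-2-{i}": set(tier1) for i in range(1, 9)}
adjacencies["tier-2-1"].discard("tier-1-1")  # the failed link

for node in sorted(adjacencies):
    print(node, sorted(prefixes_to_disaggregate(node, adjacencies)))
```

In this model tier-2-1 disaggregates nothing, while each of tier-2-2 through tier-2-8 independently decides to disaggregate tier-1-1's prefixes.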
> >
> > [KP]: With positive disaggregation we can introduce transient
> > congestion if there's a lot of traffic from tier-1-2..n to tier-1-1
> > because a switch may get the prefix from one upstream node first and
> > install that before getting it from the remaining upstream nodes. (So
> > we could for a brief instant go from 8 paths ECMP to 1 path then back
> > up to 7 paths.) On the other hand if all prefixes are disaggregated,
> > and when the link between tier-1-1 and tier-2-1 fails, tier-2-1 is now
> > only announcing a withdraw for the affected prefixes to tier-1-2..n,
> > we can avoid generating this temporary incast-like scenario by design.
> >
> > [KP]: It's a preference for more deterministic behavior of the fabric
> > over less deterministic behavior.
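To make the "8 paths to 1 path then back up to 7" sequence concrete, here is a small sketch of how the ECMP width toward tier-1-1's prefix might evolve on another tier-1 switch. It assumes the more specific route wins as soon as the first copy arrives and that the remaining copies arrive one at a time; the function name and parameters are hypothetical.

```python
# Sketch of the transient ECMP narrowing with positive disaggregation.
# Assumes longest-prefix match picks the disaggregated route as soon as one
# upstream has advertised it, and that advertisements arrive one by one.

def ecmp_width_over_time(upstream_count, disaggregating_upstreams):
    """Return the number of next hops in use after each received advertisement."""
    widths = [upstream_count]  # before the failure: default route via every upstream
    for received in range(1, disaggregating_upstreams + 1):
        widths.append(received)  # the specific route now has `received` next hops
    return widths


widths = ecmp_width_over_time(upstream_count=8, disaggregating_upstreams=7)
print(widths)  # [8, 1, 2, 3, 4, 5, 6, 7]

# With even hashing, the share of the affected traffic carried by each next hop:
for width in widths:
    print(f"{width} next hop(s): {1 / width:.1%} of the traffic each")
```

The second entry is the incast-like instant: all traffic for the affected prefix briefly lands on a single upstream.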
> >
> >> Should
> >> implementations have a conscious solution in advance for this, and
> >> what's the best way to ensure that? The 'always-disaggregate' knob is
> >> one. Another might be something like a 'min-next-hops' option where
> >> the local RIFT instance on tier 0 won't install a prefix unless it has
> >> received it from a minimum number of upstreams.
> >>
> >> The always disaggregate knob is something you can do per level if you desire but it's basically a big hammer buying you much bigger blast radius in normal operation. And if you pull RIFT onto servers in multi-plane fabrics your FIB may blow up if you do that (unless we think server adapters with 2M FIB size, probably ain't gonna happen ;-).
> >
> > [KP]: Blast radius doesn't seem bigger to me. FIB explosion is a
> > design consideration for anyone before thinking about routes from
> > servers. At a small number selectively it's fine and practiced, e.g.
> > advertising prefixes from servers that are doing software load
> > balancing.
> >
> > [KP]: But advertising them say for every VM so you can move VMs
> > anywhere... that's still going to have impacts on your network design.
> > With FIB sizes as they are these days most people below the top 5 (or
> > so) are going to be fine. And anyone in the top 5 (or so) is still
> > going to be running into trouble. And if you do something like use the
> > same switch for top of rack layer as at further layers in the Clos,
> > then this RIFT scaling feature doesn't apply to you anyway as you have
> > the same FIB size at all tiers.
> >
> > [KP]: In any case, disaggregation at the bottom two tiers, where this is
> > much more likely to be a problem, still permits aggregation higher up.
> >
> >> The other idea I don't grok, you have to explain in more detail.
> >
> > [KP]: As an alternative to disaggregation and announcing a withdraw
> > when the link between tier-1-1 and tier-2-1 goes down, it could be
> > that we have all the RIFT instances on tier-1 configured to know that
> > they should not install a prefix *unless* they have seen it advertised
> > from some minimum number of tier-2 nodes. E.g. if there are 8 tier-2
> > nodes, we might set that to say 4. Now we somewhat avoid the incast
> > scenario where the switch installs the disaggregated prefix with one
> > next hop into its FIB. Instead it'll wait until it has a minimum of
> > four next-hops. (This is spitballing; it may open up other problems.)
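A rough sketch of the 'min-next-hops' idea, assuming a hypothetical per-prefix check on the tier-1 switch. This is not an existing RIFT knob or API; the function and parameter names are made up for illustration.

```python
# Sketch of a hypothetical 'min-next-hops' install rule on a tier-1 switch.
# Not an existing RIFT option; names and behavior are assumptions.

def installed_next_hops(default_next_hops, specific_next_hops, min_next_hops):
    """Pick the next hops used for the affected prefix.

    The more specific route is used only once at least `min_next_hops`
    upstreams have advertised it; until then traffic keeps following the
    default route.
    """
    if len(specific_next_hops) >= min_next_hops:
        return sorted(specific_next_hops)
    return sorted(default_next_hops)


default = {f"tier-2-{i}" for i in range(1, 9)}  # 8 upstreams carry the default
received = set()
history = []
for i in range(2, 9):  # tier-2-2 .. tier-2-8 disaggregate one after another
    received.add(f"tier-2-{i}")
    history.append(len(installed_next_hops(default, received, min_next_hops=4)))

print(history)  # [8, 8, 8, 4, 5, 6, 7]
```

The width never drops below 4 in this model. The trade-off is that while the switch is waiting, the default route still includes tier-2-1, so part of the traffic takes the detour up to tier-3 and back down for a little longer.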
> >
> > Cheers :-)
> > Kris
>