Re: [Rift] RIFT

Antoni Przygienda <prz@juniper.net> Sat, 20 April 2019 16:58 UTC

From: Antoni Przygienda <prz@juniper.net>
To: Kris Price <kris@krisprice.nz>, Bruno Rijsman <brunorijsman@gmail.com>
CC: "rift@ietf.org" <rift@ietf.org>
Thread-Topic: RIFT
Date: Sat, 20 Apr 2019 16:57:56 +0000
Message-ID: <MWHPR05MB3279BEF36FCF955D93B90100AC200@MWHPR05MB3279.namprd05.prod.outlook.com>
References: <CACqcHa05D9mCNWPtkHMw4t-0opbz33PsnB9Ts=wadfM1UD4cNA@mail.gmail.com> <MWHPR05MB32798B45DD99D8ABF75B875AAC280@MWHPR05MB3279.namprd05.prod.outlook.com> <CACqcHa3TnRS76Rnr5Wkq4_L47i5ZQLQiZFy5aNt3zmTr487LrA@mail.gmail.com> <MWHPR05MB32798005D0A97DCC996CCB11AC260@MWHPR05MB3279.namprd05.prod.outlook.com> <CACqcHa1=iLcakOE-O1cWWHH+7qu0hvoaT5hq_5FfjcqNsi3MUw@mail.gmail.com> <8B6D2FC6-15D3-4133-AAB4-160E1D82A827@gmail.com>, <CACqcHa0-s6sP6YG7MVASByna_aqQ1SrsCnf00eJH-Q_z7U_LUw@mail.gmail.com>
In-Reply-To: <CACqcHa0-s6sP6YG7MVASByna_aqQ1SrsCnf00eJH-Q_z7U_LUw@mail.gmail.com>
Archived-At: <https://mailarchive.ietf.org/arch/msg/rift/GLmy84rFtgSdw3DkBlLbwjEnFqk>
Subject: Re: [Rift] RIFT

Kris, negative disaggregation is used if and only if

a) the fabric has multiple planes
b) a node gets completely separated, in terms of cross-sectional bandwidth, from a plane; call those "fallen leaves"

Negative disaggregation is transitive only to the level where the breakage is healed.

Sections 5.2.5.2 and 6.5 explain all of that in detail and spec out the necessary mechanisms. It has all been in the protocol for a bit now ...
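
The two conditions above can be sketched as a small predicate. This is only an illustration, not the spec's procedure (sections 5.2.5.2 and 6.5 define the real mechanisms): the function name, the plane names, and the per-plane reachability map are all hypothetical, and treating a node cut off from every plane as "nothing to disaggregate" is an assumption of this sketch.

```python
from typing import Dict, List


def planes_needing_negative_disaggregation(reachable: Dict[str, bool]) -> List[str]:
    """Return the planes from which a node has "fallen".

    `reachable` maps a (hypothetical) plane name to whether the node can
    still be reached through that plane.
    """
    # Condition (a): a single-plane fabric never uses negative disaggregation.
    if len(reachable) < 2:
        return []
    # Assumption of this sketch: a node cut off from every plane is simply
    # unreachable, so there is no healthy plane to steer traffic toward.
    if not any(reachable.values()):
        return []
    # Condition (b): the node is completely separated from these planes.
    return sorted(plane for plane, ok in reachable.items() if not ok)


print(planes_needing_negative_disaggregation({"plane-1": True, "plane-2": False}))
```

With one healthy plane and one plane that lost the node, only the second plane is reported; a fabric with a single plane always yields an empty list.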

thanks

--- tony
________________________________
From: Kris Price <kris@krisprice.nz>
Sent: Saturday, April 20, 2019 9:16 AM
To: Bruno Rijsman
Cc: Antoni Przygienda; rift@ietf.org
Subject: Re: RIFT

Hi Bruno,

Not sure if I followed negative disaggregation correctly. Is this used
at levels below the top of fabric or was it something that was
discussed as a possibility?

Cheers
Kris

On Thu, Apr 18, 2019 at 6:10 PM Bruno Rijsman <brunorijsman@gmail.com> wrote:
>
> Kris,
>
> What is your opinion on negative disaggregation as a solution for the transient incast-like congestion after a failure with positive disaggregation?
>
> — Bruno
>
> > On Apr 18, 2019, at 5:07 PM, Kris Price <kris@krisprice.nz> wrote:
> >
> > Hey Tony, inline:
> >
> > [snip]
> >> On the rings: Ahh! I get it, okay that makes it better. I was also
> >> wondering if some kind of designated 'S-TIE' reflector / virtual links
> >> / or explicitly configured multi-hop adjacencies solution could be
> >> used (the issue being one of how do you route these packets between
> >> the peers without needing to do something like source route multiple
> >> hops southbound before being default routed northbound).
> >>
> >> good, I know it takes a bit to grok the stuff. We did the best we could with ASCII and language but the concepts need some chewing for sure, even if you have been around big fabrics for a bit ;-) So, nothing like route reflectors and so on: within a plane, normal south reflection takes care of sync'ing up all you need; outside the plane, the ring takes care of sync'ing up planes (for flooding, horizontal links below ToF are south and @ ToF level north, basically), and with that you have all the topology to figure out negative disaggregation.  I explicitly killed any "virtual link" suggestions, I went through this particular hell in my life more than once and don't want to visit it anymore ;-) ...
> >
> > [KP]: I'm a bit skeptical of buy in to rings as a solution, but if you
> > have customers buying into that then that's cool. (I omitted
> > describing the physical *shudder* when I wrote "virtual links" ;-))
> >
> > [snip]
> >> It seems this would arise frequently at the bottom two tiers of the
> >> network. Any loss of any single link to any rack (tier 0) would result
> >> in all other nodes at tier 1 disaggregating the prefix(es) for that
> >> rack and causing the potential transient incast-like congestion. I'm a
> >> bit concerned that this may be a noticeable event in some cases (e.g.
> >> a storage row/cluster or maybe where RoCE is in use), and one that
> >> would be fairly annoying to debug and remedy post transition to RIFT
> >> if you didn't foresee it and have the tools (knobs) in place to
> >> prevent it from happening without a PR and s/w upgrade.
> >>
> >> yepp, you call a spade a spade, but you're a bit too pessimistic methinks. Let's assume 2 ToRs dual-homing a rack or a couple of racks of servers. If you lose a link in a multi-homed server you basically end up having the other ToR de-aggregating just this server prefix to other servers (even if you run some kubernetes @ scale you may have 100 prefixes or so I'd say, I can't imagine a server hosting thousands really) ... Then, if you think about the ToRs on top of PoD then it's not as bad as you think. If you lose a single ToR in a PoD towards a spine (I'm loose with terminology here) then you will NOT see disaggregation as long as the other ToRs in the PoD are still connected to the PoD. Draw pictures & run the public consumption package ;-)  More interesting discussions are bandwidth balancing on link losses (which I think we solved well northbound) and whether it even should be done southbound since the notion of "available bandwidth southbound" is confounding ... Spec doesn't forbid it (the beauty of loop-free valley-free routing that gives you an insane amount of leeway in how you choose to forward) BTW if someone is smart enough to figure that out ;-) ...
> >
> > [KP]: You're not the first to describe me as a pessimist. :-) I don't
> > follow the 2x ToRs and multi-homed servers part, I haven't seen that
> > used in a very long time, and granted I've been out of the game for a
> > bit but is anyone still multihoming servers at scale? Maybe certain
> > enterprise use cases, but they're not pushing the boundaries of scale
> > so don't need the aggregation anyway.
> >
> > [KP]: A top of rack switch ("tier-1" let's say) may be connected to 8
> > or 16 switches (or more) northbound (naturally let's call that next
> > tier "tier-2"). If any single link between a tier-1 and tier-2 switch
> > goes down (let's say between tier-1-1 and tier-2-1), all other nodes
> > in tier-2 (that is tier-2-2, tier-2-3, to tier-2-n) will determine
> > that tier-2-1 no longer has southbound reachability for tier-1-1's
> > prefixes and that they each need to disaggregate these to prevent
> > tier-1-2..n from sending any traffic for tier-1-1 via tier-2-1 (which
> > would then need to forward up to tier-3 and back down).
> >
> > [KP]: With positive disaggregation we can introduce transient
> > congestion if there's a lot of traffic from tier-1-2..n to tier-1-1
> > because a switch may get the prefix from one upstream node first and
> > install that before getting it from the remaining upstream nodes. (So
> > we could for a brief instant go from 8 paths ECMP to 1 path then back
> > up to 7 paths.) On the other hand if all prefixes are disaggregated,
> > and when the link between tier-1-1 and tier-2-1 fails, tier-2-1 is now
> > only announcing a withdraw for the affected prefixes to tier-1-2..n,
> > we can avoid generating this temporary incast-like scenario by design.
> >
> > [KP]: It's a preference for more deterministic behavior of the fabric
> > over less deterministic behavior.
> >
> >> Should
> >> implementations have a conscious solution in advance for this, and
> >> what's the best way to ensure that? The 'always-disaggregate' knob is
> >> one. Another might be something like a 'min-next-hops' option where
> >> the local RIFT instance on tier 0 won't install a prefix unless it has
> >> received it from a minimum number of upstreams.
> >>
> >> The always disaggregate knob is something you can do per level if you desire but it's basically a big hammer buying you much bigger blast radius in normal operation. And if you pull RIFT onto servers in multi-plane fabrics your FIB may blow up if you do that (unless we think server adapters with 2M FIB size, probably ain't gonna happen ;-).
> >
> > [KP]: Blast radius doesn't seem bigger to me. FIB explosion is a
> > design consideration for anyone before thinking about routes from
> > servers. At a small number selectively it's fine and practiced, e.g.
> > advertising prefixes from servers that are doing software load
> > balancing.
> >
> > [KP]: But advertising them say for every VM so you can move VMs
> > anywhere... that's still going to have impacts on your network design.
> > With FIB sizes as they are these days most people below the top 5 (or
> > so) are going to be fine. And anyone in the top 5 (or so) are still
> > going to be running into trouble. And if you do something like use the
> > same switch for top of rack layer as at further layers in the Clos,
> > then this RIFT scaling feature doesn't apply to you anyway as you have
> > the same FIB size at all tiers.
> >
> > [KP]: In any case disaggregation at the bottom two tiers where this is
> > much more likely to be a problem, still permits aggregation higher.
> >
> >> The other idea I don't grok, you have to explain in more detail.
> >
> > [KP]: As an alternative to disaggregation and announcing a withdraw
> > when the link between tier-1-1 and tier-2-1 goes down. It could be
> > that we have all the RIFT instances on tier-1 configured to know that
> > they should not install a prefix *unless* they have seen it advertised
> > from some minimum number of tier-2 nodes. E.g. if there are 8 tier-2
> > nodes, we might set that to say 4. Now we somewhat avoid the incast
> > scenario where the switch installs the disaggregated prefix with one
> > next hop into its FIB. Instead it'll wait until it has a minimum of
> > four next-hops. (This is spit balling, it may open other problems.)
> >
> > Cheers :-)
> > Kris
>
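
The transient described in the quoted discussion (8 paths ECMP, then 1, then back up to 7) and the 'min-next-hops' idea can be put side by side in a toy model. This is only a sketch under simplifying assumptions: disaggregated advertisements arrive one at a time, the more-specific prefix wins over the default as soon as it is installed, and nothing else changes meanwhile. The function name and its parameters are hypothetical and do not come from the RIFT spec or any implementation.

```python
from typing import List


def ecmp_width_over_time(upstreams: int, failed: int = 1,
                         min_next_hops: int = 1) -> List[int]:
    """Next-hop count used toward the affected prefix after each
    disaggregated advertisement arrives (index 0: before any arrives)."""
    # Before any more-specific prefix is installed, traffic follows the
    # default route and is spread across all upstreams.
    widths = [upstreams]
    healthy = upstreams - failed
    for received in range(1, healthy + 1):
        if received >= min_next_hops:
            # The more-specific prefix is installed with the next hops
            # received so far; longest-prefix match now prefers it.
            widths.append(received)
        else:
            # Below the threshold: keep forwarding on the default route.
            widths.append(upstreams)
    return widths


print(ecmp_width_over_time(8))                   # [8, 1, 2, 3, 4, 5, 6, 7]
print(ecmp_width_over_time(8, min_next_hops=4))  # [8, 8, 8, 8, 4, 5, 6, 7]
```

With the threshold at 4 the fan-out never drops below four next hops, at the price of keeping the default route (including the upstream that lost its link, which has to forward up and back down) in use a little longer.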
