Re: [Rift] RIFT

Antoni Przygienda <prz@juniper.net> Thu, 18 April 2019 18:14 UTC

From: Antoni Przygienda <prz@juniper.net>
To: Kris Price <kris@krisprice.nz>
CC: "brunorijsman@gmail.com" <brunorijsman@gmail.com>, "rift@ietf.org" <rift@ietf.org>
Thread-Topic: RIFT
Thread-Index: AQHU8VRUPBOno7miBkKQqRko+b0GOKY4xxUkgAludgCAAAZsrQ==
Date: Thu, 18 Apr 2019 18:14:09 +0000
Message-ID: <MWHPR05MB32798005D0A97DCC996CCB11AC260@MWHPR05MB3279.namprd05.prod.outlook.com>
References: <CACqcHa05D9mCNWPtkHMw4t-0opbz33PsnB9Ts=wadfM1UD4cNA@mail.gmail.com> <MWHPR05MB32798B45DD99D8ABF75B875AAC280@MWHPR05MB3279.namprd05.prod.outlook.com>, <CACqcHa3TnRS76Rnr5Wkq4_L47i5ZQLQiZFy5aNt3zmTr487LrA@mail.gmail.com>
In-Reply-To: <CACqcHa3TnRS76Rnr5Wkq4_L47i5ZQLQiZFy5aNt3zmTr487LrA@mail.gmail.com>
Archived-At: <https://mailarchive.ietf.org/arch/msg/rift/FdsPv5rvgPrPfpdgPoLIS6bFkMI>
Subject: Re: [Rift] RIFT

Hey, Kris, inline

________________________________
From: Kris Price <kris@krisprice.nz>
Sent: Thursday, April 18, 2019 10:28 AM
To: Antoni Przygienda
Cc: brunorijsman@gmail.com; rift@ietf.org
Subject: Re: RIFT

Hey Tony,

On the rings: Ahh! I get it, okay that makes it better. I was also
wondering if some kind of designated 'S-TIE' reflector / virtual links
/ or explicitly configured multi-hop adjacencies solution could be
used (the issue being one of how do you route these packets between
the peers without needing to do something like source route multiple
hops southbound before being default routed northbound).

Good, I know it takes a bit to grok the stuff. We did the best we could with ASCII and language, but the concepts need some chewing for sure, even if you have been around big fabrics for a bit ;-) So, nothing like route reflectors and so on: within a plane, normal south reflection takes care of sync'ing up all you need; outside the plane, the ring takes care of sync'ing up planes (for flooding, horizontal links below ToF are treated as south and @ ToF level as north, basically), and with that you have all the topology to figure out negative disaggregation. I explicitly killed any "virtual link" suggestions; I went through this particular hell in my life more than once and don't want to visit it anymore ;-) ...

Back on the subject of disaggregation:

The other reason for asking for the always disaggregate option is to
prevent the transient congestion that can occur on link failures. But
I do see now on rereading the draft you've called this out in the
second to last paragraph of 5.2.5.1., but it's left as an
implementation-specific problem to solve.

Well, yes, no free lunch: either you gum up your fabric with all the stuff and suffer a large blast radius, or you dig the beauty of having minimum blast radius and minimal topology info everywhere, but on massive failures stuff needs to be sloshed around so everyone has enough info to not blackhole. Funnily enough, today's networks, especially fabrics, allow insane flooding rates without breaking half a sweat (first thing I played with when thinking about RIFT design ;-) and I learned some lessons here from looking @ p2p networks BTW. If you run my free package you'll easily see a convergence rate of 7-10+K TIEs in the database per second, and that's the "usable rate" in the sense that there is much more flooding on the links and it's the "best TIEs in LSDB" rate already. UDP is really quite phenomenal, and with a bit of additional help (look @ the packet number & the "you flood too fast" indications) you can dynamically adjust flooding to walk the edge of having losses. Brave new world ...
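
To make the "dynamically adjust flooding" remark a bit more concrete, here is a minimal sketch of one way a sender could pace TIE flooding from loss and "you flood too fast" feedback. It uses a plain additive-increase/multiplicative-decrease rule, which is an assumption for illustration and not something the draft or the thread prescribes; the class name, field names, and all numeric defaults are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class FloodRateController:
    """Hypothetical AIMD-style pacing of TIE flooding on one adjacency."""
    rate: float = 1000.0      # TIEs per second currently allowed (assumed start value)
    min_rate: float = 50.0    # never pace below this (assumed)
    max_rate: float = 20000.0 # never pace above this (assumed)
    increase: float = 100.0   # additive step while the peer keeps up
    backoff: float = 0.5      # multiplicative cut on loss or "too fast" feedback

    def on_feedback(self, lost_packets: int, peer_says_too_fast: bool) -> float:
        """Update and return the allowed rate after one feedback interval."""
        if lost_packets > 0 or peer_says_too_fast:
            # Gaps in packet numbers or an explicit indication: back off.
            self.rate = max(self.min_rate, self.rate * self.backoff)
        else:
            # No trouble seen: probe a little closer to the edge.
            self.rate = min(self.max_rate, self.rate + self.increase)
        return self.rate
```

Feeding it one clean interval and then one with losses moves the rate from 1000 to 1100 and then down to 550 TIEs per second, so the sender keeps hovering near the point where losses start.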

It seems this would arise frequently at the bottom two tiers of the
network. Any loss of any single link to any rack (tier 0) would result
in all other nodes at tier 1 disaggregating the prefix(es) for that
rack and causing the potential transient incast-like congestion. I'm a
bit concerned that this may be a noticeable event in some cases (e.g.
a storage row/cluster or maybe where RoCE is in use), and one that
would be fairly annoying to debug and remedy post transition to RIFT
if you didn't foresee it and have the tools (knobs) in place to
prevent it from happening without a PR and s/w upgrade.

Yepp, you call a spade a spade, but you're a bit too pessimistic methinks. Let's assume 2 ToRs dual-homing a rack or a couple of racks of servers. If you lose a link to a multi-homed server, you basically end up having the other ToR de-aggregating just this server's prefixes to the other servers (even if you run some kubernetes @ scale you may have 100 prefixes or so I'd say, I can't imagine a server hosting thousands really) ... Then, if you think about the ToRs on top of a PoD, it's not as bad as you think. If you lose a single ToR in a PoD towards a spine (I'm loose with terminology here) then you will NOT see disaggregation as long as the other ToRs in the PoD are still connected to the PoD. Draw pictures & run the public consumption package ;-)  More interesting discussions are bandwidth balancing on link losses (which I think we solved well northbound) and whether it even should be done southbound, since the notion of "available bandwidth southbound" is confounding ... Spec doesn't forbid it (the beauty of loop-free valley-free routing that gives you an insane amount of leeway in how you choose to forward) BTW, if someone is smart enough to figure that out ;-) ...
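
The dual-homed server case above can be sketched as a small set computation. This is a deliberately simplified model of the disaggregation decision, not the procedure from the spec: it assumes each node already knows, for every same-level peer, which southbound neighbors that peer still reaches. The function name, argument shapes, and addresses are hypothetical.

```python
from __future__ import annotations


def prefixes_to_disaggregate(my_south: dict[str, set[str]],
                             peers_south: dict[str, set[str]]) -> set[str]:
    """Return the prefixes this node should advertise as more-specifics.

    my_south:    southbound neighbor -> prefixes this node reaches through it
    peers_south: same-level peer -> southbound neighbors that peer still reaches
    """
    result: set[str] = set()
    for neighbor, prefixes in my_south.items():
        # If at least one peer has lost this neighbor, traffic following that
        # peer's default route would blackhole, so pull it here explicitly.
        if any(neighbor not in reachable for reachable in peers_south.values()):
            result |= prefixes
    return result


# ToR1's view: it reaches both servers, ToR2 has lost its link to server1.
tor1_south = {"server1": {"10.0.1.1/32"}, "server2": {"10.0.1.2/32"}}
tor2_view = {"tor2": {"server2"}}
disaggregated = prefixes_to_disaggregate(tor1_south, tor2_view)
```

In this toy case only the prefix of `server1` ends up in `disaggregated`; the prefix of `server2` stays covered by the default route, which matches the "just this server prefix" point above.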

Should
implementations have a conscious solution in advance for this, and
what's the best way to ensure that? The 'always-disaggregate' knob is
one. Another might be something like a 'min-next-hops' option where
the local RIFT instance on tier 0 won't install a prefix unless it has
received it from a minimum number of upstreams.
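
One possible reading of the 'min-next-hops' idea, purely as an illustration: a filter applied at tier 0 before route installation. Nothing like this exists in the draft; the function name, the `min_next_hops` parameter, and the data shape are all hypothetical.

```python
from __future__ import annotations


def installable_routes(advertisements: dict[str, set[str]],
                       min_next_hops: int) -> dict[str, set[str]]:
    """Keep only prefixes advertised by at least `min_next_hops` upstreams.

    advertisements: prefix -> set of upstream neighbors advertising it
    """
    return {prefix: next_hops
            for prefix, next_hops in advertisements.items()
            if len(next_hops) >= min_next_hops}


# A leaf hears the default from four upstreams and one disaggregated
# prefix from a single upstream.
heard = {
    "0.0.0.0/0": {"spine1", "spine2", "spine3", "spine4"},
    "10.0.7.0/24": {"spine2"},
}
installed = installable_routes(heard, min_next_hops=2)
```

With `min_next_hops=2` only the default route is installed here, so traffic to `10.0.7.0/24` keeps following the default; whether that is safe when the more-specific exists precisely because some upstream lost reachability is the open question.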

The always-disaggregate knob is something you can do per level if you desire, but it's basically a big hammer buying you a much bigger blast radius in normal operation. And if you pull RIFT onto servers in multi-plane fabrics, your FIB may blow up if you do that (unless we think of server adapters with 2M FIB size, probably ain't gonna happen ;-).

The other idea I don't grok, you have to explain in more detail.

Both of these do run counter to the low-configuration nature of RIFT.
Another might be a protocol change, something like nodes
disaggregating prefixes by default until they know they are more than
1 hop from the bottom of fabric? (This may run into other convergence
issues during fabric bring up and cold start and maybe there are other
issues with it that need doodling out.)

Yeah, doodle, I'm not concerned about the convergence and I'm interested in your ideas. ZTP has no beef with prefixes and will work regardless. So far I've seen no indications that the size of the fabric, up to any reasonable bound of nodes, will prevent it from cold-booting properly. ZTP FSM has no timers for that reason BTW (another no-no I put down there ;-) and flooding only has a single retransmit timer.

And, yes, the more stuff like forced disaggregation you start to fiddle with, the more you lose ZTP (depending on whether you want it or not, RIFT will work either way). BTW, same with security: the more security you desire, the less ZTP you get ;-)  Nature of the beast ...

--- tony