[Isis-wg] Some comments on draft-white-openfabric-02

Erik Auerswald <auerswald@fg-networking.de> Wed, 12 April 2017 08:50 UTC

Date: Wed, 12 Apr 2017 10:50:18 +0200
From: Erik Auerswald <auerswald@fg-networking.de>
To: Russ White <7riw77@gmail.com>, rtgwg@ietf.org, isis-wg@ietf.org
Cc: Erik Auerswald <auerswald@fg-networking.de>
Message-ID: <20170412085018.GA29441@fg-networking.de>
Archived-At: <https://mailarchive.ietf.org/arch/msg/isis-wg/LcF9az82-1HfuD94y7BnOxuzEU0>

Hi all,

I have read draft-white-openfabric-02 and would like to comment
on a few points. I'll start at the top of the draft and continue
through the text.

Please keep my e-mail address in replies, because I am not subscribed
to the isis-wg and rtgwg mailing lists.

   The abstract states "[...]topology information is extracted
through broad based connections." I do not understand that sentence.

   Section 1.1., Goals, mentions large scale data centers. Would
it be appropriate to reference RFC 7938, Use of BGP for Routing
in Large-Scale Data Centers, here? Said RFC proposes a Clos topology
for the network, which seems to be similar to the spine and leaf
topology of openfabric.

   In section 1.3., Simplification, I noticed a spelling mistake:
mutliaccess (should be multiaccess).

   In section 1.5., Sample Network, a spine and leaf network is
shown in figure 1. The topology shown in that figure is different
from the 5-stage Clos topology shown in RFC 7938, figure 3. The
5-stage Clos topology from RFC 7938 represents the network topology
used by Facebook for the Altoona data center, as publicized in a
Facebook blog post.

Another generalization of the 3-stage Clos network to more than
3 stages, called a Beneš network, is described on Wikipedia.

Both of these 5-stage networks differ from figure 1 of the
openfabric draft insofar as each T2 switch is connected to a
proper subset of T1 switches (openfabric designation) in both the
RFC 7938 "Clos" topology and the Beneš network. This is crucial
for increasing the number of input and output ports without
using bigger switches.

Since this is important for later comments, I have adapted figure 3
from RFC 7938 into the following drawing:

        +----+                  +----+
        |L1.1|                  |L1.2|             (T0)
        +----+                  +----+
         |   \________________  /   |
         |    ________________\/    |
         |   /                 \    |
        +----+                  +----+
        |F1.1|                  |F1.2|             (T1)
        +----+                  +----+
        /    \                  /    \
       /      \                /      \
   +----+    +----+        +----+    +----+
   |S1.1|    |S1.2|        |S2.1|    |S2.2|        (T2)
   +----+    +----+        +----+    +----+
       \      /                \      /
        \    /                  \    /
        +----+                  +----+
        |F2.1|                  |F2.2|             (T1)
        +----+                  +----+
         |   \________________  /   |
         |    ________________\/    |
         |   /                 \    |
        +----+                  +----+
        |L2.1|                  |L2.2|             (T0)
        +----+                  +----+

       Lx.y: Leaf switches (a.k.a. Top of Rack (ToR) switches)
       Fx.y: Fabric switches
       Sx.y: Spine switches

     Inter-switch connections:
       Lx.y is connected to Fx.*
       Fx.y is connected to Lx.* and Sy.*
       Sx.y is connected to F*.x 

   Figure 2: 5-Stage Clos Topology (adapted from [RFC7938], Figure 3)
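
The inter-switch connection rules above can be written down as a small
sketch. This is only an illustration of figure 2, not something taken
from the draft; the function name build_fabric and its parameters are
my own assumptions. It also checks that each T2 (spine) switch is
connected to a proper subset of the T1 (fabric) switches.

```python
def build_fabric(pods=2, per_tier=2):
    """Return an adjacency map following the connection rules of figure 2.

    pods and per_tier are illustrative parameters; the figure uses 2 and 2.
    """
    adj = {}

    def link(a, b):
        # Links are bidirectional.
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)

    ids = range(1, per_tier + 1)
    for x in range(1, pods + 1):
        for y in ids:
            for k in ids:
                link(f"L{x}.{k}", f"F{x}.{y}")  # Fx.y is connected to Lx.*
                link(f"F{x}.{y}", f"S{y}.{k}")  # Fx.y is connected to Sy.*
    return adj


adj = build_fabric()
fabric_switches = {n for n in adj if n.startswith("F")}
for spine in sorted(n for n in adj if n.startswith("S")):
    # A proper subset: the spine sees some, but not all, fabric switches.
    assert adj[spine] < fabric_switches
    print(spine, "->", sorted(adj[spine]))
```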

I have used the name "Fabric switch" similar to Facebook's use
of that name in the above referenced blog post, just to have
distinct names and single letter abbreviations for each tier.

A reference to RFC 7938, section 3.2, Clos Network Topology, would
fit into this section.

   It might be appropriate to mention the use of timeouts and
exponential back-off for initial adjacency formation in section 2.
Something like sequentially trying all discovered neighbors and
using exponentially increasing random timeouts for subsequent
rounds until the first adjacency is formed. A "Happy Eyeballs"
(RFC 6555) like approach of trying to form two adjacencies with
a slight delay in-between might be nice as well.
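
A rough sketch of what I have in mind follows. Nothing here is from
the draft: the function name, the try_adjacency callback, and the
default timer values are all assumptions made for illustration. The
"Happy Eyeballs" style variant is not shown.

```python
import random
import time


def form_first_adjacency(neighbors, try_adjacency, base_delay=1.0,
                         max_rounds=6, sleep=time.sleep, rng=random.random):
    """Try all discovered neighbors in turn until one adjacency forms.

    try_adjacency(neighbor) is a hypothetical callback returning True
    on success. After each unsuccessful round, wait a random time whose
    upper bound doubles with every round (exponential back-off).
    """
    for round_no in range(max_rounds):
        for neighbor in neighbors:
            if try_adjacency(neighbor):
                return neighbor
        # Random timeout in [0, base_delay * 2**round_no).
        sleep(rng() * base_delay * 2 ** round_no)
    return None
```

The sleep and rng parameters are only there so that the timing can be
replaced in tests; a real implementation would use protocol timers.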

   Section 3., Determining Location on the Fabric, relies on the
special topology from figure 1 of the openfabric draft. In both
Beneš networks and the topology shown in figure 2 (of this mail),
FD == TD and TD == 4 holds for non-T0 switches. One example is
S1.1 from figure 2. It can be easily seen from that figure that
for all switches in that topology FD == TD == 4. Thus the algorithms
from sections 3.1., Determining T0, and 3.2., Determining T1 and
above, do not work for general fabric topologies.
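
The distance claim can be checked with a short breadth-first search
over the topology of figure 2. This sketch assumes that the distances
in question are plain hop counts to the farthest switch; the helper
names are my own and not taken from the draft.

```python
from collections import deque


def build_figure2():
    """Adjacency map for figure 2 (2 pods, 2 switches per group)."""
    adj = {}

    def link(a, b):
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)

    for x in (1, 2):
        for y in (1, 2):
            for k in (1, 2):
                link(f"L{x}.{k}", f"F{x}.{y}")  # Fx.y <-> Lx.*
                link(f"F{x}.{y}", f"S{y}.{k}")  # Fx.y <-> Sy.*
    return adj


def farthest_distance(adj, start):
    """Hop count from start to the farthest reachable switch (BFS)."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nbr in adj[node]:
            if nbr not in dist:
                dist[nbr] = dist[node] + 1
                queue.append(nbr)
    return max(dist.values())


adj = build_figure2()
distances = {node: farthest_distance(adj, node) for node in sorted(adj)}
print(distances)
```

Under these assumptions every switch, including the leaf switches,
reports the same farthest distance of 4, so the distance alone cannot
distinguish the tiers.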

   The algorithm described in section 4, Flooding Optimization, does
not work for the 5-stage "Clos" topology (see figure 2). An example
for this is a change that pertains only to switches S1.1 and F1.1 in
figure 2 (e.g. a link between these two switches fails). Because
the T0 switches Lx.y receive the LSPs as DNR, the LSPs do not reach
switches Fx.2 and S2.y during flooding. The failure recovery
mechanism of section 4.1., Flooding Failures, is needed to propagate
the LSPs by design, but this is clearly thought of as a backup
mechanism that is not needed for normal operation.
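
The example can be reproduced with a deliberately simplified flooding
model. It is not the draft's algorithm: it only assumes that every
non-T0 switch refloods a new LSP to all of its neighbors, while T0
switches accept the LSP but do not reflood it (DNR). All names are my
own.

```python
from collections import deque


def build_figure2():
    """Adjacency map for figure 2 (2 pods, 2 switches per group)."""
    adj = {}

    def link(a, b):
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)

    for x in (1, 2):
        for y in (1, 2):
            for k in (1, 2):
                link(f"L{x}.{k}", f"F{x}.{y}")  # Fx.y <-> Lx.*
                link(f"F{x}.{y}", f"S{y}.{k}")  # Fx.y <-> Sy.*
    return adj


def flood(adj, origins):
    """Return the set of switches that receive the LSPs."""
    reached = set(origins)
    queue = deque(origins)
    while queue:
        node = queue.popleft()
        if node.startswith("L"):
            continue  # Simplified DNR: T0 switches do not reflood.
        for nbr in adj[node]:
            if nbr not in reached:
                reached.add(nbr)
                queue.append(nbr)
    return reached


adj = build_figure2()
reached = flood(adj, ["F1.1", "S1.1"])  # link S1.1 <-> F1.1 changed
missed = sorted(set(adj) - reached)
print(missed)  # ['F1.2', 'F2.2', 'S2.1', 'S2.2']
```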

   Section 5.1., Transit Link Reachability, would benefit from
a reference to RFC 5837, Extending ICMP for Interface and Next-Hop
Identification.

   Section 6., Openfabric and Route Aggregation, should disallow
route summarization. Otherwise the failure of a single link will
result in traffic black-holing without intra-tier links. See e.g.
RFC 7938, sections 8.2. and 8.2.1. But intra-tier links are
disallowed in section 1.5, Sample Network.

Since the reason for disallowing intra-tier links, topology
auto-detection, is not yet solved (see my comment on section 3.
above), you might allow the combination of intra-tier links and
route summarization. I would prefer disallowing both for
openfabric, because the added
complexity of route summarization and its effects on resiliency
in the case of failures seem a bad trade-off for the reduced
routing table size.

Thanks for reading this far. :-)

Best regards,
Dipl.-Inform. Erik Auerswald         http://www.fg-networking.de/
auerswald@fg-networking.de T:+49-631-4149988-0 M:+49-176-64228513

Gesellschaft für Fundamental Generic Networking mbH
Geschäftsführung: Volker Bauer, Jörg Mayer
Gerichtsstand: Amtsgericht Kaiserslautern - HRB: 3630