Re: [GROW] comments on draft-lapukhov-bgp-routing-large-dc-06

Jon Mitchell <jrmitche@puck.nether.net> Mon, 09 September 2013 19:35 UTC

Return-Path: <jrmitche@puck.nether.net>
X-Original-To: grow@ietfa.amsl.com
Delivered-To: grow@ietfa.amsl.com
Received: from localhost (localhost [127.0.0.1]) by ietfa.amsl.com (Postfix) with ESMTP id 2834221E80B7 for <grow@ietfa.amsl.com>; Mon, 9 Sep 2013 12:35:56 -0700 (PDT)
X-Virus-Scanned: amavisd-new at amsl.com
X-Spam-Flag: NO
X-Spam-Score: -2.599
X-Spam-Level:
X-Spam-Status: No, score=-2.599 tagged_above=-999 required=5 tests=[BAYES_00=-2.599]
Received: from mail.ietf.org ([12.22.58.30]) by localhost (ietfa.amsl.com [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id M8tJFXq7OQH6 for <grow@ietfa.amsl.com>; Mon, 9 Sep 2013 12:35:55 -0700 (PDT)
Received: from puck.nether.net (puck.nether.net [IPv6:2001:418:3f4::5]) by ietfa.amsl.com (Postfix) with ESMTP id 3516021E80AE for <grow@ietf.org>; Mon, 9 Sep 2013 12:35:55 -0700 (PDT)
Received: from puck.nether.net (localhost [127.0.0.1]) by puck.nether.net (8.14.7/8.14.5) with ESMTP id r89JZoHP001691; Mon, 9 Sep 2013 15:35:50 -0400
Received: (from jrmitche@localhost) by puck.nether.net (8.14.7/8.14.7/Submit) id r89JZoMR001680; Mon, 9 Sep 2013 15:35:50 -0400
Date: Mon, 09 Sep 2013 15:35:50 -0400
From: Jon Mitchell <jrmitche@puck.nether.net>
To: Nick Hilliard <nick@inex.ie>
Message-ID: <20130909193550.GA17348@puck.nether.net>
References: <20130906024523.GA27854@puck.nether.net> <522C72A0.6000004@inex.ie>
MIME-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Disposition: inline
In-Reply-To: <522C72A0.6000004@inex.ie>
User-Agent: Mutt/1.5.21 (2010-09-15)
X-Greylist: Sender IP whitelisted, not delayed by milter-greylist-4.5.1 (puck.nether.net [127.0.0.1]); Mon, 09 Sep 2013 15:35:51 -0400 (EDT)
Cc: grow@ietf.org
Subject: Re: [GROW] comments on draft-lapukhov-bgp-routing-large-dc-06
X-BeenThere: grow@ietf.org
X-Mailman-Version: 2.1.12
Precedence: list
List-Id: Grow Working Group Mailing List <grow.ietf.org>
List-Unsubscribe: <https://www.ietf.org/mailman/options/grow>, <mailto:grow-request@ietf.org?subject=unsubscribe>
List-Archive: <http://www.ietf.org/mail-archive/web/grow>
List-Post: <mailto:grow@ietf.org>
List-Help: <mailto:grow-request@ietf.org?subject=help>
List-Subscribe: <https://www.ietf.org/mailman/listinfo/grow>, <mailto:grow-request@ietf.org?subject=subscribe>
X-List-Received-Date: Mon, 09 Sep 2013 19:35:56 -0000

Nick - thanks for your comments, [JM] inline...

On 08/09/13 13:50 +0100, Nick Hilliard wrote:
> On 06/09/2013 03:45, Jon Mitchell wrote:
> > We would like to solicit any further comments on
> > draft-lapukhov-bgp-routing-large-dc-06.  Originally this draft was
> > presented by Petr in Vancouver in both IDR and GROW and we feel this
> > draft is useful to the IETF community as it documents a working
> > scalable design for building large DCs.
> 
> i think this draft is really interesting.  A couple of things come to mind:
> 
> - default route origination is a real pain in larger scale networks for
> obvious reasons, not least because you can often end up with non-optimal
> egress choice (e.g. the RR's choice of default route).  You also need to
> take special care to ensure that the scope of the default route is limited
> to only router sets that actually need it.  I prefer the idea of using a
> floating anycast default route.  i.e. you inject an anycast prefix from all
> or a selection of all routers which can handle default free traffic flow,
> then on small-fib routers you can install a static default pointing to this
> floating default and depend on recursive route lookup to figure out the
> best path (original idea from Saku Ytti - http://goo.gl/Nj69OZ).  This
> allows much tighter control of default route propagation, with better
> egress choice characteristics.

[JM] Petr and Robert already responded.  In this draft, and in most
of our environments, the default comes from one set of border routers
as described, but I do think this is an interesting idea that could
be useful in other cases, depending on how you connect these large DC
fabrics together.  Right now we carefully limit default propagation,
in the odd case where it's necessary, via policy.  It's worth noting
that even if we include such an idea, the discussion would be
limited, since the primary focus of this document is covering a
known, working, deployed design.
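To make the recursive-lookup mechanics of the floating anycast
default concrete, here is a rough sketch (hypothetical addresses, and
a plain dict standing in for the RIB; this is an illustration of the
idea, not any vendor's implementation): small-FIB routers carry only
a static default whose next hop is the anycast address, and recursive
resolution finds the best path currently advertising that anycast
prefix.

```python
import ipaddress

# Hypothetical RIB: the anycast /32 is learned via BGP from whichever
# default-free routers inject it; the static default points at the
# anycast address rather than at any specific router.
rib = {
    ipaddress.ip_network("192.0.2.1/32"): "peer-A",  # anycast, current best path
    ipaddress.ip_network("0.0.0.0/0"): "192.0.2.1",  # static default -> anycast
}

def lookup(dest: str) -> str:
    """Longest-prefix match over the RIB."""
    addr = ipaddress.ip_address(dest)
    matches = [p for p in rib if addr in p]
    return rib[max(matches, key=lambda p: p.prefixlen)]

def resolve(dest: str) -> str:
    """Recursively resolve next hops until one is a directly usable peer."""
    nh = lookup(dest)
    while not nh.startswith("peer-"):
        nh = lookup(nh)
    return nh

# Off-net traffic follows the default, which resolves through the
# anycast /32 to whichever egress is currently best.
print(resolve("198.51.100.7"))  # -> peer-A
```

If the router behind peer-A withdraws the anycast prefix, BGP best
path selection repoints the /32 and the static default follows it
automatically, with no change to the small-FIB router's config.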

> 
> - in virtualised networks hosting third party tenancies, it is often useful
> to extend L3 to the hypervisor.  With current tech, running thousands of
> vms per cabinet is not unrealistic, and this number will undoubtedly
> increase over time.  This brings up the issue of both how to handle address
> assignment in an optimal manner and also how to be able to assign one or
> more IP addresses per vm or vm cluster, without the problems associated
> with flat shared lans, without using NAT and its consequent problems, but
> also without wasting precious public IP addresses on link networks which
> typically have abysmal assignment efficiency (e.g. 50% for /30, 75% for
> /29, etc).  There are models out there which suggest using routing
> protocols, but BGP may not always be the best choice due to scalability
> issues: if you have a ToR switch, it will be pretty limited in terms of the
> number of bgp sessions it can handle.  Some people have approached this by
> using RIP to the client VMs.  The advantage of this is that RIP is fully
> stateless, whereas BGP has session overhead.  On a small ToR box with an
> itty bitty RP, the overhead associated with high hundreds or even
> low thousands of BGP sessions may be too much.  I'm not sure if this is in
> the scope of this draft though.

[JM] I think this is worth a whole different doc or WG even :-)

> - i think you skimp over the problems associated with bgp session failure
> detection / reconvergence.  Mandating ebgp will get rid of the problems
> associated with the traditional loopback-to-loopback configuration of most
> ibgp networks, but there are still situations where dead link detection is
> going to be necessary using some form of keepalive mechanism which works
> faster than ebgp timers.  It would be good to have a little more discussion
> about bfd - if we can point vendors to good use cases here, they will be
> more inclined to support bfd on tor boxes.

[JM] I agree, we could definitely look at using BFD, and it is
mentioned in the doc, most appropriately draft-ietf-bfd-on-lags-01,
once the smaller vendors are likely to support it and we have done
enough testing to make sure it's not causing more problems than it
adds value.  Again, the focus of the document has been what's
actually been shown to have value today in real deployments.  BFD, as
Petr mentioned, is not providing a lot of value over LACP detection:
although it validates the L3 path, its session only gets pinned to
one member of the LAG.  It may still be slightly better than fast
fallover mixed with adjusted BGP timers, but we focused on reducing
the number of protocols until a majority of vendors have better BFD
implementations (also supporting the IPv6 draft).  Right now we've
chosen to use the limited centralized control planes on these
platforms for BGP and LACP session maintenance.
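For a sense of the detection-time tradeoff, here is some rough
arithmetic (the timer values are illustrative examples, not numbers
from the draft): BFD declares a session down after a configured
number of missed packets, whereas BGP without BFD or a link-down
event must wait out its hold timer.

```python
def bfd_detect_s(tx_interval_ms: int, multiplier: int) -> float:
    # BFD worst-case detection: `multiplier` consecutive missed
    # packets at the negotiated transmit interval.
    return tx_interval_ms * multiplier / 1000.0

def bgp_detect_s(hold_time_s: int) -> float:
    # BGP alone waits out the full hold timer before tearing down.
    return float(hold_time_s)

print(bfd_detect_s(300, 3))  # 0.9 s with a 300 ms interval, multiplier 3
print(bgp_detect_s(3))       # 3.0 s even with aggressively tuned timers
print(bgp_detect_s(90))      # 90.0 s with common default-style timers
```

The sub-second number is why BFD is attractive on paper; the open
question above is whether it adds enough over LACP detection on a LAG
to justify another protocol on a small ToR control plane.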

> - typo: s/it's/its/g

[JM] Next rev, thanks!