Re: [GROW] comments on draft-lapukhov-bgp-routing-large-dc-06

Nick Hilliard <nick@inex.ie> Sun, 08 September 2013 22:47 UTC

Message-ID: <522C72A0.6000004@inex.ie>
Date: Sun, 08 Sep 2013 13:50:40 +0100
From: Nick Hilliard <nick@inex.ie>
To: Jon Mitchell <jrmitche@puck.nether.net>
References: <20130906024523.GA27854@puck.nether.net>
In-Reply-To: <20130906024523.GA27854@puck.nether.net>
Cc: grow@ietf.org
Subject: Re: [GROW] comments on draft-lapukhov-bgp-routing-large-dc-06

On 06/09/2013 03:45, Jon Mitchell wrote:
> We would like to solicit any further comments on
> draft-lapukhov-bgp-routing-large-dc-06.  Originally this draft was
> presented by Petr in Vancouver in both IDR and GROW and we feel this
> draft is useful to the IETF community as it documents a working
> scalable design for building large DC's.

I think this draft is really interesting.  A few things come to mind:

- default route origination is a real pain in larger scale networks for
obvious reasons, not least because you can often end up with non-optimal
egress choice (e.g. the RR's choice of default route).  You also need to
take special care to ensure that the scope of the default route is limited
to only router sets that actually need it.  I prefer the idea of using a
floating anycast default route: you inject an anycast prefix from all, or a
selection, of the routers which can handle default-free traffic flow,
then on small-fib routers you can install a static default pointing to this
floating default and depend on recursive route lookup to figure out the
best path (original idea from Saku Ytti - http://goo.gl/Nj69OZ).  This
allows much tighter control of default route propagation, with better
egress choice characteristics.
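
To make the recursive lookup concrete, here is a rough Python sketch of how
a small-fib router might resolve the static default through the floating
anycast prefix.  Everything in it is an assumption for illustration only:
the anycast address 192.0.2.1, the /31 link subnets, the interface names
and the helper functions are hypothetical and do not come from the draft or
from any vendor implementation.

```python
import ipaddress


def longest_match(table, addr):
    """Return the (prefix, hop) entry with the longest prefix covering addr."""
    ip = ipaddress.ip_address(addr)
    matches = [(net, hop) for net, hop in table.items() if ip in net]
    return max(matches, key=lambda m: m[0].prefixlen) if matches else None


def resolve_egress(table, addr, max_depth=8):
    """Follow next hops recursively until a connected interface is reached."""
    target = addr
    for _ in range(max_depth):
        match = longest_match(table, target)
        if match is None:
            return None
        _, hop = match
        if "interface" in hop:
            return hop["interface"]
        # Not directly connected: look up the next hop itself.
        target = hop["via"]
    # Lookup never reached a connected route (e.g. anycast prefix withdrawn).
    return None


def make_table(anycast_via):
    """Hypothetical small-fib table; anycast_via is the BGP best-path next hop."""
    net = ipaddress.ip_network
    return {
        # Static default pointing at the floating anycast address (assumed).
        net("0.0.0.0/0"): {"via": "192.0.2.1"},
        # Anycast prefix as learnt from BGP; the next hop moves with best path.
        net("192.0.2.1/32"): {"via": anycast_via},
        # Directly connected uplinks (assumed addressing).
        net("10.0.0.0/31"): {"interface": "uplink-a"},
        net("10.0.0.2/31"): {"interface": "uplink-b"},
    }
```

The static default never changes here; only the BGP next hop for the
anycast prefix does, and the egress follows it.  If the anycast prefix is
withdrawn, the default in this sketch stops resolving at all.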

- in virtualised networks hosting third party tenancies, it is often useful
to extend L3 to the hypervisor.  With current tech, running thousands of
vms per cabinet is not unrealistic, and this number will undoubtedly
increase over time.  This brings up the issue of both how to handle address
assignment in an optimal manner and also how to be able to assign one or
more IP addresses per vm or vm cluster, without the problems associated
with flat shared lans, without using NAT and its consequent problems, but
also without wasting precious public IP addresses on link networks which
typically have abysmal assignment efficiency (e.g. 50% for /30, 75% for
/29, etc).  There are models out there which suggest using routing
protocols, but BGP may not always be the best choice due to scalability
issues: if you have a ToR switch, it will be pretty limited in terms of the
number of bgp sessions it can handle.  Some people have approached this by
using RIP to the client VMs.  The advantage of this is that RIP keeps no
per-neighbour session state, whereas BGP does.  On a small ToR box with an
itty bitty RP, the overhead associated with high hundreds or even
low thousands of BGP sessions may be too much.  I'm not sure if this is in
the scope of this draft though.
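
The assignment efficiency figures above can be checked with a few lines of
Python.  This is only a sketch: it assumes a classic IPv4 link subnet where
the network and broadcast addresses are unusable, and the function name is
made up for the example.

```python
import ipaddress


def link_efficiency(prefix_len):
    """Usable fraction of an IPv4 link subnet, excluding network and broadcast."""
    net = ipaddress.ip_network(f"10.0.0.0/{prefix_len}")
    usable = net.num_addresses - 2
    return usable / net.num_addresses


for plen in (30, 29, 28):
    print(f"/{plen}: {link_efficiency(plen):.1%}")
```

If one more address per subnet goes to the gateway, the share left for VMs
is lower still (a /30 would leave a single address out of four).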

- I think you skim over the problems associated with bgp session failure
detection / reconvergence.  Mandating ebgp will get rid of the problems
associated with the traditional loopback-to-loopback configuration of most
ibgp networks, but there are still situations where dead link detection is
going to be necessary using some form of keepalive mechanism which works
faster than ebgp timers.  It would be good to have a little more discussion
about bfd - if we can point vendors to good use cases here, they will be
more inclined to support bfd on tor boxes.
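
As a rough illustration of the gap between the two mechanisms, the sketch
below compares a BFD detection time with a BGP hold time.  The numbers are
assumptions, not recommendations: a 300 ms BFD interval with a multiplier
of 3, and the 90 second hold time suggested in RFC 4271 (many
implementations ship different defaults).

```python
def bfd_detection_time_ms(interval_ms, multiplier):
    """Detection time is the negotiated interval times the detect multiplier."""
    return interval_ms * multiplier


# Assumed example values; real deployments tune these per platform.
BFD_INTERVAL_MS = 300
BFD_MULTIPLIER = 3
BGP_HOLD_TIME_MS = 90 * 1000  # suggested hold time in RFC 4271

bfd_ms = bfd_detection_time_ms(BFD_INTERVAL_MS, BFD_MULTIPLIER)
print(f"BFD detects failure in about {bfd_ms} ms")
print(f"BGP hold timer expires after {BGP_HOLD_TIME_MS} ms")
print(f"Ratio: {BGP_HOLD_TIME_MS / bfd_ms:.0f}x")
```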

- typo: s/it's/its/g

Nick