Re: [GROW] comments on draft-lapukhov-bgp-routing-large-dc-06

Petr Lapukhov <petrlapu@microsoft.com> Mon, 09 September 2013 16:23 UTC

From: Petr Lapukhov <petrlapu@microsoft.com>
To: Nick Hilliard <nick@inex.ie>, Jon Mitchell <jrmitche@puck.nether.net>
Date: Mon, 09 Sep 2013 16:07:26 +0000
Cc: "grow@ietf.org" <grow@ietf.org>
Subject: Re: [GROW] comments on draft-lapukhov-bgp-routing-large-dc-06

Hi Nick,

Thanks for your feedback! To answer your questions:

1) In the proposed design, the network topology is symmetric and homogeneous, and the default route is simply relayed from the "WAN-facing" border routers. It is only used to steer traffic to destinations outside of the data center, and the issues of "WAN default routing" are outside the scope of the document. Keep in mind that the default route is supplied in addition to the full routing information for all in-data-center destinations. FIB size is generally not an issue in DCs built on modern merchant silicon switches, even in data centers with as many as 200K bare-metal servers. However, if needed, simple virtual aggregation could take care of it, since large groups of prefixes share the same next-hop sets.
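As a minimal sketch of that grouping idea (prefixes and next-hop names here are hypothetical, not from the draft), prefixes sharing an ECMP next-hop set can be bucketed and collapsed into fewer covering FIB entries:

    from collections import defaultdict
    from ipaddress import ip_network, collapse_addresses

    # Hypothetical RIB: prefix -> ECMP next-hop set.
    rib = {
        ip_network("10.1.0.0/24"): ("leaf1", "leaf2"),
        ip_network("10.1.1.0/24"): ("leaf1", "leaf2"),
        ip_network("10.2.0.0/24"): ("leaf3", "leaf4"),
    }

    # Bucket prefixes by their next-hop set, then collapse each bucket
    # into the smallest set of covering aggregates.
    buckets = defaultdict(list)
    for prefix, nexthops in rib.items():
        buckets[tuple(sorted(nexthops))].append(prefix)

    fib = {}
    for nexthops, prefixes in buckets.items():
        for aggregate in collapse_addresses(prefixes):
            fib[aggregate] = nexthops

    for prefix, nexthops in sorted(fib.items()):
        print(prefix, "->", nexthops)
    # 10.1.0.0/23 -> ('leaf1', 'leaf2')
    # 10.2.0.0/24 -> ('leaf3', 'leaf4')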

2) Server virtualization is outside the scope of the document. However, if required, overlay techniques could be used to isolate tenant IP addressing/signaling from the fabric. Hypervisors may participate in an overlay control plane, but we avoided this topic on purpose, since it opens a much broader discussion. Let's put it this way - the proposed design is for the "underlay" bare metal network :)

3) The main idea for failure detection was to rely on the p2p nature of the interconnection links and to leverage the optical layer for failure detection. BFD could be an option, though many merchant silicon solutions do not support hardware generation of BFD packets; in that case BFD is not much different from the various control-plane keepalives, since we would not be sharing it across multiple upstream protocols anyway. LACP and other technologies could add an extra layer of health probing, though this does not change the overall fault-detection logic...
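For a rough sense of the timer gap, here is a back-of-the-envelope comparison (the values are commonly cited defaults, assumed for illustration, not numbers from the draft):

    # BGP keepalive/hold timers vs. BFD async mode (assumed defaults).
    bgp_hold_time_s = 90          # worst-case detection via hold timer
    bfd_tx_interval_ms = 300      # software-generated BFD on merchant silicon
    bfd_detect_multiplier = 3

    bfd_detect_ms = bfd_tx_interval_ms * bfd_detect_multiplier
    print(f"BGP hold-timer detection: up to {bgp_hold_time_s} s")
    print(f"BFD detection: {bfd_detect_ms} ms "
          f"({bgp_hold_time_s * 1000 // bfd_detect_ms}x faster)")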

Regards,

Petr

-----Original Message-----
From: grow-bounces@ietf.org [mailto:grow-bounces@ietf.org] On Behalf Of Nick Hilliard
Sent: Sunday, September 8, 2013 5:51 AM
To: Jon Mitchell
Cc: grow@ietf.org
Subject: Re: [GROW] comments on draft-lapukhov-bgp-routing-large-dc-06

On 06/09/2013 03:45, Jon Mitchell wrote:
> We would like to solicit any further comments on 
> draft-lapukhov-bgp-routing-large-dc-06.  Originally this draft was 
> presented by Petr in Vancouver in both IDR and GROW and we feel this 
> draft is useful to the IETF community as it documents a working 
> scalable design for building large DC's.

I think this draft is really interesting.  A couple of things come to mind:

- default route origination is a real pain in larger scale networks for obvious reasons, not least because you can often end up with a non-optimal egress choice (e.g. the RR's choice of default route).  You also need to take special care to ensure that the scope of the default route is limited to only the router sets that actually need it.  I prefer the idea of using a floating anycast default route: i.e. you inject an anycast prefix from all (or a selection of) the routers which can handle default-free traffic flow, then on small-FIB routers you install a static default pointing to this anycast prefix and depend on recursive route lookup to figure out the best path (original idea from Saku Ytti - http://goo.gl/Nj69OZ).  This allows much tighter control of default route propagation, with better egress choice characteristics.
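A minimal Python sketch of the recursion (addresses and router names are made up): the small-FIB box carries only a static default whose next hop is the anycast address, and the actual egress falls out of resolving that next hop against the dynamically learned routes:

    from ipaddress import ip_address, ip_network

    # Dynamically learned routes; the anycast /32 below is injected by
    # every router currently able to carry default-free traffic, so the
    # best path to it "floats" with network conditions.
    dynamic_rib = {
        ip_network("192.0.2.1/32"): "border-router-A",
    }

    # The only default state configured on the small-FIB router:
    # 0.0.0.0/0 via the anycast next hop.
    anycast_nexthop = ip_address("192.0.2.1")

    def resolve(nexthop):
        """Recursively resolve a next hop through the dynamic RIB
        (longest-prefix-match bookkeeping omitted for brevity)."""
        for prefix, egress in dynamic_rib.items():
            if nexthop in prefix:
                return egress
        return None

    print("default ->", resolve(anycast_nexthop))  # border-router-A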

- in virtualised networks hosting third-party tenancies, it is often useful to extend L3 to the hypervisor.  With current tech, running thousands of VMs per cabinet is not unrealistic, and this number will undoubtedly increase over time.  This brings up the issue of how to handle address assignment in an optimal manner, and also how to assign one or more IP addresses per VM or VM cluster: without the problems associated with flat shared LANs, without using NAT and its consequent problems, but also without wasting precious public IP addresses on link networks, which typically have abysmal assignment efficiency (e.g. 50% for /30, 75% for /29, etc).  There are models out there which suggest using routing protocols, but BGP may not always be the best choice due to scalability issues: a ToR switch will be pretty limited in terms of the number of BGP sessions it can handle.  Some people have approached this by using RIP to the client VMs.  The advantage of this is that RIP is fully stateless, whereas BGP has session overhead.  On a small ToR box with an itty-bitty RP, the overhead associated with high hundreds or even low thousands of BGP sessions may be too much.  I'm not sure if this is in the scope of this draft though.
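Those efficiency figures are quick to check; a /31 per RFC 3021, where supported, is included below for comparison:

    # Usable host addresses vs. total addresses per link-network size.
    for plen in (29, 30, 31):
        total = 2 ** (32 - plen)
        # A /31 point-to-point link (RFC 3021) uses both addresses;
        # shorter prefixes lose the network and broadcast addresses.
        usable = total if plen == 31 else total - 2
        print(f"/{plen}: {usable}/{total} usable = {usable / total:.0%}")
    # /29: 6/8 usable = 75%
    # /30: 2/4 usable = 50%
    # /31: 2/2 usable = 100%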

- I think you skimp on the problems associated with BGP session failure detection / reconvergence.  Mandating eBGP will get rid of the problems associated with the traditional loopback-to-loopback configuration of most iBGP networks, but there are still situations where dead-link detection is going to be necessary using some form of keepalive mechanism which works faster than eBGP timers.  It would be good to have a little more discussion about BFD - if we can point vendors to good use cases here, they will be more inclined to support BFD on ToR boxes.
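To put rough numbers on why detection speed matters (link rate and timer values are assumed for illustration only), the traffic a dead link can blackhole before detection scales directly with the detection time:

    # Worst-case traffic sent into a dead link before detection,
    # on a 10 Gb/s interface (assumed rates and timers).
    link_gbps = 10
    for name, detect_s in (("eBGP default hold timer (90 s)", 90.0),
                           ("eBGP tuned hold timer (3 s)", 3.0),
                           ("BFD 300 ms x 3 (0.9 s)", 0.9)):
        lost_gbytes = link_gbps * detect_s / 8
        print(f"{name}: up to {lost_gbytes:g} GB blackholed")
    # eBGP default hold timer (90 s): up to 112.5 GB blackholed
    # eBGP tuned hold timer (3 s): up to 3.75 GB blackholed
    # BFD 300 ms x 3 (0.9 s): up to 1.125 GB blackholed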

- typo: s/it's/its/g

Nick
