Re: draft-marques-l3vpn-mcast-edge-00

Pedro Marques <pedro.r.marques@gmail.com> Tue, 29 May 2012 16:05 UTC

Subject: Re: draft-marques-l3vpn-mcast-edge-00
From: Pedro Marques <pedro.r.marques@gmail.com>
In-Reply-To: <F1688F301726A74C86D199FA0CAE08E514ED3F1D@TK5EX14MBXC299.redmond.corp.microsoft.com>
Date: Tue, 29 May 2012 09:06:00 -0700
Message-Id: <4544C62C-990C-4321-ADD8-3FC077799C8D@gmail.com>
References: <F1688F301726A74C86D199FA0CAE08E514ED3625@TK5EX14MBXC299.redmond.corp.microsoft.com> <02A52D67-8614-4299-A301-6D1321ACA185@contrailsystems.com> <F1688F301726A74C86D199FA0CAE08E514ED3F1D@TK5EX14MBXC299.redmond.corp.microsoft.com>
To: Petr Lapukhov <petrlapu@microsoft.com>
Cc: "derick.winkworth@fisglobal.com" <derick.winkworth@fisglobal.com>, "l3vpn@ietf.org" <l3vpn@ietf.org>

Petr,

On May 28, 2012, at 10:22 PM, Petr Lapukhov wrote:

> Pedro,
> 
> Thanks for your response! Firstly, I should have noted that I find the proposed multicast "emulation" to be very helpful in some realistic scenarios, specifically in the cases where multicast could not be easily turned on (e.g. risking a software problem, since PIM SM is not a nice protocol to implement).

With PIM-SM, in an environment where virtual networks are deployed to support multi-tenancy and/or per-application access groups, the multicast group state pertains to the virtual networks and can easily exceed the forwarding capacity of individual switching elements in the fabric, not to mention the operational burden of carrying virtual-network signaling within the fabric.

I'd expect most network operators to default to ingress replication, which was the original design point for multicast support in L3VPN environments. In my opinion, the edge replication solution offers a different trade-off: the number of replicas that cross a single uplink is kept to an upper bound, and there is no requirement to support multicast forwarding natively in the fabric.
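As a hypothetical numeric sketch of that trade-off (the group size and bound below are my own, purely for illustration): for a single packet, ingress replication puts one copy per remote member on the ingress uplink, while an edge replication tree caps the copies any one forwarder sends at its configured out-degree.

    # Illustrative only: copies one VPN forwarder places on its uplink
    # for a single multicast packet.
    def uplink_copies(group_size, out_degree_bound):
        ingress = group_size - 1                      # one copy per remote member
        edge = min(out_degree_bound, group_size - 1)  # capped by the configured degree
        return ingress, edge

    print(uplink_copies(200, 8))  # (199, 8)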

> 
> 
> I still believe that optimizing multicast routing in "dense" networks is a separate, interesting problem that does have "some" solutions, and I'm sure you are well aware of the research done in that field.

In a data-center environment I would expect multicast to be very sparse. Many operators are talking about very large data-centers (10k+ machines, each supporting 100+ guests)... very few applications would be "dense" in this environment, and those would probably be best advised to choose a different design point anyway.

If the multicast membership is sparse, link level multicast replication may not afford any real benefits.

> Of course, in the real world of "custom" Clos fabrics, use of regular PIM SM with SPTs has obvious optimality implications (e.g. load-sharing the SPTs, optimum fan-out distribution, and so on).
> 
> As for congestion management in Clos networks, we have observed very high buffer utilization watermarks on practically all of our "spine" switches (unicast traffic only), due to peak loads of various compute traffic.

There are two approaches that I'm aware of:
1) Increase the ratio between link bandwidth and the maximum flow rate (a rough sketch of this is below).
2) Deploy a network-wide congestion management algorithm from ingress to egress (you can't assume that the buffers in the intermediate nodes are useful for congestion management).
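As a hypothetical back-of-the-envelope for approach 1 (the rates below are made up): if at most k flows hash onto an uplink and each flow is capped at rate f, keeping f well below B/k bounds the worst-case utilization without any in-network congestion management.

    # Illustrative only: worst-case utilization of one uplink when
    # `flows` flows hash onto it, each capped at `flow_rate_gbps`.
    def worst_case_utilization(flows, flow_rate_gbps, link_gbps):
        return flows * flow_rate_gbps / link_gbps

    print(worst_case_utilization(8, 1.0, 40.0))   # 0.2 -> comfortably non-blocking
    print(worst_case_utilization(8, 10.0, 40.0))  # 2.0 -> ratio too low; congestion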

A lot of work has gone into both of these approaches, with very strong opinions on both sides. Let me just say that, from my perspective, it would seem unwise to compound the congestion management problem by attempting to also cover multicast.

> Furthermore, experimentation has shown that even a very simple QoS policy with two xWRR queues differentiating bulk and query traffic results in significant performance improvements.

No doubt. Instantaneous drop selection can be beneficial. That is totally different from end-to-end congestion avoidance.

> However, I agree that adding multicast flows to the mix here may create interesting complications in the switch internal fabrics.

Yes, which is the reason I would expect most network operators to choose ingress replication in order to support IP broadcast/multicast for applications that are discovering membership.

> 
> My last question was whether there exists any analysis/research on the tradeoffs associated with edge replication using an "overlay" tree, e.g. having replicated packets cross the bisection twice as part of the constructed distribution tree.

The math is rather straightforward. Each "node" in the tree adds measurable latency, and for the analysis you can assume that the fabric is over-provisioned and thus "free". The trade-off is the selection of the replication out-degree: a higher replication factor increases the number of packets sent from a single "VPN forwarder" but decreases the height of the tree.

The multicast edge replication document is written such that this degree can be controlled by the network operator, based on real experimentation with the conditions of a particular network.
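For a rough, back-of-the-envelope illustration of that trade-off (the helper and numbers below are my own, not text from the draft): with out-degree k and N group members, each forwarder emits at most k copies per packet, and the height of the tree grows roughly as ceil(log_k N).

    # Illustrative only: out-degree vs. tree height for an edge-replicated
    # distribution tree; not part of the draft.
    import math

    def tree_height(n_members, out_degree):
        # Height of a tree in which every node replicates to at most
        # `out_degree` children.
        return math.ceil(math.log(max(n_members, 2), out_degree))

    for k in (2, 4, 8, 16):
        print("out-degree %2d -> at most %2d copies per forwarder, height %d"
              % (k, k, tree_height(1000, k)))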

thanks,
  Pedro.

> 
> Thank you,
> 
> Petr Lapukhov
> Microsoft
> 
> -----Original Message-----
> From: Pedro Marques [mailto:roque@contrailsystems.com] 
> Sent: Monday, May 28, 2012 8:58 PM
> To: Petr Lapukhov
> Cc: Yiqun Cai; derick.winkworth@fisglobal.com; lufang@cisco.com; l3vpn@ietf.org
> Subject: Re: draft-marques-l3vpn-mcast-edge-00
> 
> 
> On May 27, 2012, at 11:11 PM, Petr Lapukhov wrote:
> 
>> Hi Pedro,
> 
> Petr,
> Thank you for your comments. Answers inline.
> 
>> 
>> Thanks for an interesting read! However, I have some concerns regarding the problem statement in the document:
>> 
>>> For Clos topologies with multiple stages native multicast support 
>>> within the switching infrastructure is both unnecessary and 
>>> undesirable.  By definition the Clos network has enough bandwidth to 
>>> deliver a packet from any input port to any output port.  Native 
>>> multicast support would however make it such that the network would 
>>> no longer be non-blocking.  Bringing with it the need to devise 
>>> congestion management procedures.
>> 
>> Here they are:
>> 
>> 1) Multicast routing over Clos topology could be non-blocking provided that some criteria on Clos topology dimensions are met and multicast distribution tree fan-outs are properly balanced at ingress and middle stages of the Clos fabric.
> 
> Multicast over a Clos topology creates congestion management issues. One way to address the problem, in large-scale Clos topologies, is to eliminate native multicast in the fabric. That is an approach taken in several networks, including networks that are fully enclosed in a chassis or a set of chassis.
> 
>> 
>> 2) Congestion management in Clos networks would be necessary in any case, due to statistical multiplexing and possibility of (N -> 1) port traffic flow.
> 
> In practice, many networks are running Clos topologies with no congestion management support. The assumption is that if hash-based load balancing of flows is "good enough" and the flows are small compared to the link size, then the fabric is non-blocking. This allows one to build very large scale Clos fabrics with off-the-shelf and/or heterogeneous components, where each switch works independently. Congestion management at large scale is a very thorny issue...
> 
> I believe that there are several efforts in the IEEE, under the umbrella of "data-center Ethernet", to bring global congestion notification/flow control into a heterogeneous environment. It is my understanding that there is a non-trivial number of networks that prefer to operate with a simple hash-based mechanism.
> 
>> 3) The "ingress unicast replication" in VPN forwarder creates the following issues:
>> 
>> 3.1) If done at software hypervisor level, it will most likely 
>> overload physical uplink(s) on the server: N replicas sent as opposed 
>> to 1 in case of native multicast
> 
> This is the main rationale for this work. One could have started with just plain ingress replication, but in that case the ingress would have to replicate to the full membership of the group. With an edge replication tree, the number of copies is limited to N.
> As with any other network design, it is a question of trade-offs. The authors believe there is a non-trivial number of applications (e.g. discovery) where this is a useful approach.
> 
>> 3.2) If done at hardware switch level (edge of physical Clos topology), it cannot leverage hardware capabilities for multicast replication, and thus could be difficult to implement and will stress the switch internal fabric.
> 
> Building hardware with no multicast support can also simplify the hardware design.
> 
>> 
>> 4) If L3 VPN spans WAN for Inter-DC communications, unicast replication makes any WAN multicast optimization impossible, unless there is a "translating" WAN gateway that will forward packets as native multicast.
> 
> The document only covers intra-DC scenarios, as of now. For WAN traffic, we do assume that there are systems that support L3VPN multicast as defined currently.
> 
>> 5) Optimizing overlay multicast distribution tree could be difficult, since underlying network metrics may be hidden from VPN gateways.
> 
> In several practical scenarios I am aware of, the intra-DC network has two cost points: same rack and different racks. Even in scenarios where there are multiple metrics, the BGP signaling gateway can be made aware of the physical topology of the network. My understanding is that the intra-DC network can be optimized.
> 
>> 
>> I'm reviewing the rest of the document, and hopefully can come up with more comments later.
> 
> Thank you very much for your attention.
> 
>> 
>> Best regards,
>> 
>> Petr Lapukhov
>> Microsoft
>> 
> 
> 
>