RE: draft-marques-l3vpn-mcast-edge-00

Petr Lapukhov <petrlapu@microsoft.com> Tue, 29 May 2012 05:22 UTC

From: Petr Lapukhov <petrlapu@microsoft.com>
To: Pedro Marques <roque@contrailsystems.com>
Subject: RE: draft-marques-l3vpn-mcast-edge-00
Date: Tue, 29 May 2012 05:22:09 +0000
Cc: "derick.winkworth@fisglobal.com" <derick.winkworth@fisglobal.com>, "l3vpn@ietf.org" <l3vpn@ietf.org>

Pedro,

Thanks for your response! First, I should have noted that I find the proposed multicast "emulation" very helpful in some realistic scenarios, specifically in cases where multicast cannot easily be turned on (e.g. because of the risk of software problems, since PIM SM is not a nice protocol to implement).

I still believe that optimizing multicast routing in "dense" networks is a separate, interesting problem that does have "some" solutions, and I'm sure you are well aware of the research done in that field. Of course, in the real world of "custom" Clos fabrics, the use of regular PIM SM with SPT trees has obvious optimality implications (e.g. load-sharing across the SPTs, optimum fan-out distribution, and so on).

On congestion management in Clos networks: we have observed very high buffer utilization watermarks on practically all of our "spine" switches (unicast traffic only), due to peak loads from various compute traffic. Furthermore, experimentation has shown that even a very simple QoS policy with two xWRR queues, differentiating bulk and query traffic, results in significant performance improvements. However, I agree that adding multicast flows to the mix may create interesting complications in the switches' internal fabrics.
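
To illustrate the kind of policy I have in mind (a toy Python sketch with made-up weights, not our production configuration), the idea is essentially a two-queue weighted round-robin:

from collections import deque

# Toy two-queue weighted round-robin: "query" traffic gets more
# transmission opportunities per round than "bulk" traffic.
queues = {"query": deque(), "bulk": deque()}
weights = {"query": 4, "bulk": 1}   # illustrative weights only

def enqueue(klass, pkt):
    queues[klass].append(pkt)

def drain_round():
    """Serve each class up to its weight; one scheduling round."""
    sent = []
    for klass, weight in weights.items():
        for _ in range(weight):
            if queues[klass]:
                sent.append(queues[klass].popleft())
    return sent

# Example: a bulk backlog does not starve the query traffic.
for i in range(10):
    enqueue("bulk", "bulk-%d" % i)
for i in range(3):
    enqueue("query", "query-%d" % i)
print(drain_round())   # query-0..2 plus one bulk packet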

My last question was whether there exists any analysis or research on the tradeoffs associated with edge replication using an "overlay" tree, e.g. having replicated packets cross the bisection twice as part of the constructed distribution tree.
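
To make the concern concrete, here is a back-of-the-envelope Python sketch (with a hypothetical placement of VMs on the two halves of the fabric, not measured data) counting how many times the same payload crosses the bisection:

# The "side" map is a hypothetical placement of VMs on the two halves
# of the fabric; r1 plays the role of an overlay replicator.
side = {"src": "A", "r1": "B", "m1": "A", "m2": "A", "m3": "B"}

def crossings(edges):
    """Count distribution-tree edges whose endpoints sit on opposite sides."""
    return sum(1 for a, b in edges if side[a] != side[b])

# Native multicast: the fabric sends one copy across the bisection and
# replicates on the far side, so the payload crosses at most once here.
native = len({side[m] for m in ("m1", "m2", "m3")} - {side["src"]})

# Overlay edge replication: src sends to replicator r1 on side B, which
# replicates back to m1 and m2 on side A (and on to m3).
overlay = crossings([("src", "r1"), ("r1", "m1"), ("r1", "m2"), ("r1", "m3")])

print(native, overlay)   # -> 1 3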

Thank you,

Petr Lapukhov
Microsoft

-----Original Message-----
From: Pedro Marques [mailto:roque@contrailsystems.com] 
Sent: Monday, May 28, 2012 8:58 PM
To: Petr Lapukhov
Cc: Yiqun Cai; derick.winkworth@fisglobal.com; lufang@cisco.com; l3vpn@ietf.org
Subject: Re: draft-marques-l3vpn-mcast-edge-00


On May 27, 2012, at 11:11 PM, Petr Lapukhov wrote:

> Hi Pedro,

Petr,
Thank you for your comments. Answers inline.

> 
> Thanks for an interesting read! However, I have some concerns regarding the problem statement in the document:
> 
>> For Clos topologies with multiple stages native multicast support 
>> within the switching infrastructure is both unnecessary and 
>> undesirable.  By definition the Clos network has enough bandwidth to 
>> deliver a packet from any input port to any output port.  Native 
>> multicast support would however make it such that the network would 
>> no longer be non-blocking.  Bringing with it the need to devise 
>> congestion management procedures.
> 
> Here they are:
> 
> 1) Multicast routing over Clos topology could be non-blocking provided that some criteria on Clos topology dimensions are met and multicast distribution tree fan-outs are properly balanced at ingress and middle stages of the Clos fabric.

Multicast over a CLOS topology creates congestion management issues. One way to address the problem, in large scale CLOS topologies, is to eliminate native multicast in the fabric. That is an approach taken in several networks, including networks that are fully enclosed in a chassis or set of chassis.

> 
> 2) Congestion management in Clos networks would be necessary in any case, due to statistical multiplexing and possibility of (N -> 1) port traffic flow.

In practice, many networks run CLOS topologies with no congestion management support. The assumption is that if hash-based load balancing of flows is "good enough" and the flows are small compared to the link size, the fabric is effectively non-blocking. This allows one to build very large-scale CLOS fabrics with off-the-shelf and/or heterogeneous components, where each switch works independently. Congestion management at large scale is a very thorny issue...
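
As a minimal sketch of the mechanism being relied upon (hypothetical 5-tuple hashing in Python, not any particular switch's actual hash function):

import hashlib

UPLINKS = 8   # hypothetical number of equal-cost uplinks

def pick_uplink(flow):
    """Pin a flow (5-tuple) to one uplink with a stable hash."""
    digest = hashlib.sha1(repr(flow).encode()).digest()
    return int.from_bytes(digest[:4], "big") % UPLINKS

# Every packet of a given flow takes the same path; with many flows
# that are small relative to link capacity, load evens out statistically.
flow = ("10.0.0.1", "10.0.1.2", 6, 33412, 80)   # src, dst, proto, sport, dport
print(pick_uplink(flow))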

I believe there are several efforts in the IEEE, under the umbrella of "data-center ethernet", to bring global congestion notification/flow control into a heterogeneous environment. It is my understanding that there is a non-trivial number of networks that prefer to operate with a simple hash-based mechanism.

> 3) The "ingress unicast replication" in VPN forwarder creates the following issues:
> 
> 3.1) If done at software hypervisor level, it will most likely 
> overload physical uplink(s) on the server: N replicas sent as opposed 
> to 1 in case of native multicast

This is the main rationale for this work. One could have started with just plain ingress replication, but in that case the ingress would have to replicate to the full membership of the group. With an edge replication tree, the number of copies any single node has to generate is bounded by the tree fan-out rather than by the group size.
As with any other network design, it is a question of trade-offs. The authors believe there is a non-trivial number of applications (e.g. discovery) where this is a useful approach.
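
As a rough sketch of that trade-off (illustrative Python; the group size and fan-out below are arbitrary, not values prescribed by the draft):

import math

N = 100       # hypothetical group size
FANOUT = 4    # per-node replication bound; arbitrary for illustration

ingress_copies = N                            # flat ingress replication
per_node_copies = min(FANOUT, N)              # with an edge replication tree
tree_depth = math.ceil(math.log(N, FANOUT))   # roughly the extra hops of latency

print(ingress_copies, per_node_copies, tree_depth)   # -> 100 4 4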

> 3.2) If done at hardware switch level (edge of physical Clos topology), it cannot leverage hardware capabilities for multicast replication, and thus could be difficult to implement and will stress the switch internal fabric.

Building hardware with no multicast support can also simplify the hardware design.

> 
> 4) If L3 VPN spans WAN for Inter-DC communications, unicast replication makes any WAN multicast optimization impossible, unless there is a "translating" WAN gateway that will forward packets as native multicast.

The document only covers intra-DC scenarios, as of now. For WAN traffic, we do assume that there are systems that support L3VPN multicast as defined currently.

> 5) Optimizing overlay multicast distribution tree could be difficult, since underlying network metrics may be hidden from VPN gateways.

In several practical scenarios I am aware of, the intra-DC network has two cost points: same rack and different racks. Even in scenarios where there are multiple metrics, the BGP signaling gateway can be made aware of the physical topology of the network. My understanding is that the intra-DC network can be optimized.
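
As an illustration of how little topology information is needed for that (a Python sketch with a made-up rack map, not the signaling defined in the draft):

# Hypothetical rack map; the BGP signaling gateway only needs to know
# which rack each attachment point lives in.
rack = {"vm1": "r1", "vm2": "r1", "vm3": "r2", "vm4": "r2", "vm5": "r3"}

def cost(a, b):
    """Two-level metric: 0 for same rack, 1 for different racks."""
    return 0 if rack[a] == rack[b] else 1

def rack_heads(members):
    """Pick one head per rack; only heads need to receive inter-rack copies."""
    heads = {}
    for m in members:
        heads.setdefault(rack[m], m)
    return heads

members = ["vm1", "vm2", "vm3", "vm4", "vm5"]
print(rack_heads(members))                     # {'r1': 'vm1', 'r2': 'vm3', 'r3': 'vm5'}
print(cost("vm1", "vm2"), cost("vm1", "vm3"))  # 0 1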

> 
> I'm reviewing the rest of the document, and hopefully can come up with more comments later.

Thank you very much for your attention.

> 
> Best regards,
> 
> Petr Lapukhov
> Microsoft
>