Re: draft-marques-l3vpn-mcast-edge-00

Pedro Roque Marques <pedro.r.marques@gmail.com> Tue, 29 May 2012 03:59 UTC

Resending from the email address I'm subscribed to...

> From: Pedro Marques <roque@contrailsystems.com>
> Subject: Re: draft-marques-l3vpn-mcast-edge-00
> Date: May 28, 2012 8:57:53 PM PDT
> To: Petr Lapukhov <petrlapu@microsoft.com>
> Cc: Yiqun Cai <yiqunc@microsoft.com>, "derick.winkworth@fisglobal.com" <derick.winkworth@fisglobal.com>, "lufang@cisco.com" <lufang@cisco.com>, "l3vpn@ietf.org" <l3vpn@ietf.org>
> 
> 
> On May 27, 2012, at 11:11 PM, Petr Lapukhov wrote:
> 
>> Hi Pedro,
> 
> Petr,
> Thank you for your comments. Answers inline.
> 
>> 
>> Thanks for an interesting read! However, I have some concerns regarding the problem statement in the document:
>> 
>>> For Clos topologies with multiple stages native multicast support
>>> within the switching infrastructure is both unnecessary and
>>> undesirable.  By definition the Clos network has enough bandwidth to
>>> deliver a packet from any input port to any output port.  Native
>>> multicast support would however make it such that the network would
>>> no longer be non-blocking.  Bringing with it the need to devise
>>> congestion management procedures.
>> 
>> Here they are:
>> 
>> 1) Multicast routing over Clos topology could be non-blocking provided that some criteria on Clos topology dimensions are met and multicast distribution tree fan-outs are properly balanced at ingress and middle stages of the Clos fabric.
> 
> Multicast over a Clos topology creates congestion management issues. One way to address the problem in large-scale Clos topologies is to eliminate native multicast in the fabric. That approach is taken in several networks, including networks that are fully enclosed in a chassis or set of chassis.
> 
>> 
>> 2) Congestion management in Clos networks would be necessary in any case, due to statistical multiplexing and possibility of (N -> 1) port traffic flow.
> 
> In practice, many networks run Clos topologies with no congestion management support. The assumption is that if hash-based load balancing of flows is "good enough" and the flows are small compared to the link size, the fabric is non-blocking. This allows one to build very large-scale Clos fabrics with off-the-shelf and/or heterogeneous components, where each switch works independently. Congestion management at large scale is a very thorny issue…
> 
> I believe that there are several efforts in the IEEE, under the umbrella of "data-center ethernet", to bring global congestion notification/flow control into a heterogeneous environment. It is my understanding that a non-trivial number of networks prefer to operate with a simple hash-based mechanism.
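
[Editorial sketch] The hash-based load balancing described above can be illustrated roughly as follows. This is a minimal sketch, not any vendor's algorithm: the names `pick_uplink` and `link_load`, the choice of SHA-256, and the tuple used as a flow key are all assumptions made for illustration. Each switch picks an uplink from the flow's header fields alone, which is why switches can work independently, and why the result is only "good enough" when individual flows are small relative to the link.

```python
import hashlib
from collections import Counter


def pick_uplink(flow, num_uplinks, salt=b""):
    """Map a flow key (e.g. a 5-tuple) to one of num_uplinks links.

    Hypothetical helper: the same flow always lands on the same link,
    so packets of one flow are not reordered, and no state is shared
    between switches.
    """
    key = salt + "|".join(str(field) for field in flow).encode()
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:8], "big") % num_uplinks


def link_load(flows, num_uplinks):
    """Sum the offered load per uplink for (flow, size) pairs."""
    load = Counter()
    for flow, size in flows:
        load[pick_uplink(flow, num_uplinks)] += size
    return load


# Illustrative traffic: 1000 small flows of equal size over 4 uplinks.
flows = [(("10.0.0.1", "10.0.1.1", 6, 10000 + i, 80), 1) for i in range(1000)]
loads = link_load(flows, 4)
```

With many small flows the per-link totals in `loads` tend to even out; a single large flow would still occupy one link, which is the case the hash-based approach does not address.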
> 
>> 3) The "ingress unicast replication" in VPN forwarder creates the following issues:
>> 
>> 3.1) If done at software hypervisor level, it will most likely overload physical uplink(s) on the server: N replicas sent as opposed to 1 in case of native multicast
> 
> This is the main rationale for this work. One could have started with just plain ingress replication. But in that case the ingress would have to replicate to the full membership of the group. With an edge replication tree, the number of copies is limited to N.
> As with any other network design, it is a question of trade-offs. The authors believe there is a non-trivial number of applications (e.g. discovery) where this is a useful approach.
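
[Editorial sketch] The difference between plain ingress replication and an edge replication tree can be made concrete with a small example. This is a sketch under stated assumptions: it reads N as the maximum number of copies any single node sends, and it lays the tree out in a simple array order. The function names and the tree layout are hypothetical and are not taken from the draft, which may build the tree differently.

```python
def ingress_copies(num_receivers):
    """Plain ingress replication: the sender emits one copy per receiver."""
    return num_receivers


def build_edge_tree(source, receivers, fanout):
    """Arrange source and receivers in a tree where no node has more
    than `fanout` children (hypothetical array-style layout: the node at
    index i is a child of the node at index (i - 1) // fanout).

    Node names must be unique. Returns a dict mapping each node to the
    list of nodes it replicates to.
    """
    nodes = [source] + list(receivers)
    children = {node: [] for node in nodes}
    for i, node in enumerate(nodes[1:], start=1):
        parent = nodes[(i - 1) // fanout]
        children[parent].append(node)
    return children


# Illustrative group: one sender, 10 receivers, fan-out limit N = 3.
receivers = [f"vm{i}" for i in range(10)]
tree = build_edge_tree("sender", receivers, fanout=3)
sender_copies = len(tree["sender"])
```

In this example the sender's uplink carries `sender_copies` copies rather than `ingress_copies(10)`. The total number of copies in the network is unchanged (one per receiver); the tree spreads the replication work across the edge nodes, at the cost of extra hops for receivers deeper in the tree.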
> 
>> 3.2) If done at hardware switch level (edge of physical Clos topology), it cannot leverage hardware capabilities for multicast replication, and thus could be difficult to implement and will stress the switch internal fabric.
> 
> Building hardware with no multicast support can also simplify the hardware design.
> 
>> 
>> 4) If L3 VPN spans WAN for Inter-DC communications, unicast replication makes any WAN multicast optimization impossible, unless there is a "translating" WAN gateway that will forward packets as native multicast.
> 
> As of now, the document only covers intra-DC scenarios. For WAN traffic, we assume that there are systems that support L3VPN multicast as currently defined.
> 
>> 5) Optimizing overlay multicast distribution tree could be difficult, since underlying network metrics may be hidden from VPN gateways.
> 
> In several practical scenarios I am aware of, the intra-DC network has two cost points: same rack and different racks. Even in scenarios where there are multiple metrics, the BGP signaling gateway can be made aware of the physical topology of the network. My understanding is that the intra-DC network can be optimized.
> 
>> 
>> I'm reviewing the rest of the document, and hopefully can come up with more comments later.
> 
> Thank you very much for your attention.
> 
>> 
>> Best regards,
>> 
>> Petr Lapukhov
>> Microsoft
>> 
>