Re: [bess] Comments on draft-sajassi-bess-evpn-mvpn-seamless-interop

Eric C Rosen <erosen@juniper.net> Mon, 10 September 2018 17:39 UTC

To: "Ali Sajassi (sajassi)" <sajassi@cisco.com>, Bess WG <bess@ietf.org>
References: <d1e53751-289d-6ac9-d019-2fe07cc33602@juniper.net> <0A6CB14F-993C-4BCB-8678-26C3AE0AFE52@cisco.com>
From: Eric C Rosen <erosen@juniper.net>
Message-ID: <a593031b-2c8e-823d-ea2e-13dcfbdfd558@juniper.net>
Date: Mon, 10 Sep 2018 13:39:04 -0400
Archived-At: <https://mailarchive.ietf.org/arch/msg/bess/hnV5mRX51I_aZ2CI6GHYfVbTyHA>

Eric> 1. It seems that the proposal does not do correct ethernet 
emulation.  Intra-subnet multicast only sometimes preserves MAC SA and 
IP TTL, sometimes not, depending upon the topology.

Ali> EVPN doesn't provide LAN service per IEEE 802.1Q but rather an 
emulation of LAN service. This document defines what that emulation means

The fact that the proposal doesn't do correct ethernet emulation cannot 
be resolved by having the proposal redefine "emulation" to mean 
"whatever the proposal does".

EVPN needs to ensure that whatever works on a real ethernet will work on 
the emulated ethernet as well; the externally visible service 
characteristics on which the higher layers may depend must be properly 
offered by the emulation.  This applies to both unicast and multicast 
equally.

Otherwise anyone attempting to replace a real ethernet with EVPN will 
find that not every application and/or protocol working on the real 
ethernet will continue to work on the EVPN.

Eric> TTL handling for inter-subnet multicast seems inconsistent as 
well, depending upon the topology.

Ali> BTW, TTL handling for inter-subnet IP multicast traffic is done 
consistent!

Consider the following in a pure MVPN environment:

- Source S is on subnet1, which is attached to PE1.

- Receivers R1 and R2 are on subnet2, which is attached to both PE1 and PE2.

- Subnet1 and subnet2 are different subnets.

Now every (S,G) packet will follow the same path: either (a) 
subnet1-->PE1-->subnet2 or (b) subnet1-->PE1-->PE2-->subnet2.

Both paths cannot be used at the same time, because L3 multicast will 
not allow both PE1 and PE2 to transmit the (S,G) flow to subnet2.  So an 
(S,G) packet received by R1 will always have the same TTL as the same 
packet received by R2.  TTL scoping will therefore work consistently; 
depending on the routing, and from the perspective of any given flow, 
the two subnets are either one hop away from each other, or two hops 
away from each other.

In the so-called "seamless-mcast" scheme, on the other hand, if R1 and 
R2 get the same (S,G) packet, each may see a different TTL. Suppose R1 
is on an ES attached to PE1 but not to PE2, S is on an ES attached to 
PE1 but not to PE2, and R2 is on an ES attached to PE2 but not to PE1.  
Then a given (S,G) packet received by R1 will have a larger TTL than 
the same packet received by R2, even though R1 and R2 are on the same 
subnet.

Note that the seamless-mcast proposal does not provide the behavior that 
would be provided by MVPN, despite the claim that it is "just MVPN".

This user-visible inconsistency may break any use of TTL scoping, and is 
just the sort of thing that tends to generate a stream of service calls 
from customers that pay attention to this sort of stuff.

In general, TTL should be decremented by 0 for intra-subnet and by 1 
(within the EVPN domain) for inter-subnet.  Failure to handle the TTL 
decrement properly will break anything that depends upon RFC 3682 ("The 
Generalized TTL Security Mechanism").  Have you concluded that no use of 
multicast together with RFC 3682 will, now or in the future, ever need 
to run over EVPN?  I'd like to know how that conclusion is supported.  
You may also wish to do a google search for "multicast ttl scoping".

A related issue is that the number of PEs through which a packet passes 
should not be inferrable by a tenant.  Any sort of multicast traceroute 
tool used by a tenant will give unexpected results if TTL is not handled 
properly; at the very least this will result in service calls.

The OISM proposal (as described in the irb-mcast draft) will decrement 
TTL by 1 when packets go from one subnet to another, as an IP multicast 
frame is distributed unchanged to the PEs that need it, and its TTL is 
decremented by 1 if an egress PE needs to deliver it to a subnet other 
than its source subnet.
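
As a rough illustration of the decrement rule argued for above (0 for 
intra-subnet, 1 for inter-subnet within the EVPN domain), here is a 
minimal Python sketch.  The function name and the subnet labels are 
made up for the example; nothing here is taken from either draft.

```python
def ttl_after_evpn(ttl: int, src_subnet: str, dst_subnet: str) -> int:
    """Return the TTL a receiver should see for a packet sent with `ttl`.

    Intra-subnet delivery is bridged, so the TTL is left untouched.
    Inter-subnet delivery counts as exactly one routed hop inside the
    EVPN domain, no matter how many PEs the packet crossed.
    """
    if src_subnet == dst_subnet:
        return ttl
    return ttl - 1


# Two receivers on subnet2 get the same TTL for a packet sourced on
# subnet1, whichever PE each of them happens to sit behind.
r1_ttl = ttl_after_evpn(64, "subnet1", "subnet2")
r2_ttl = ttl_after_evpn(64, "subnet1", "subnet2")
print(r1_ttl, r2_ttl)
```

The point of the sketch is that the result depends only on the source 
and destination subnets, so a tenant cannot infer the number of PEs 
from the TTL.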


The draft still makes the following peculiar claim:

"Based on past experiences with MVPN over last dozen years for supported 
IP multicast applications, layer-3 forwarding of intra-subnet multicast 
traffic should be fine."

Since MVPN does not do intra-subnet multicast, experience with MVPN has 
no bearing whatsoever on the needs of intra-subnet multicast.

Eric> 2. In order to do inter-subnet multicast in EVPN, the proposal 
requires L3VPN/MVPN configuration on ALL the EVPN PEs.  This is required 
even when there is no need for MVPN/EVPN interworking. This is portrayed 
as a "low provisioning" solution!

Ali> Using MVPN constructs doesn't requires additional configuration on 
EVPN PEs beyond multicast configuration needed for IRB-mcast operation.

I think you'll find that if you don't reconfigure all the BGP sessions 
to carry AFI/SAFIs 1/128, 2/128, 1/5, and 2/5, you'll have quite a bit 
of trouble running any of the native MVPN procedures ;-) This is perhaps 
the simplest example of additional configuration that is needed.
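
To make the address-family point concrete, the following sketch lists 
the four AFI/SAFI pairs mentioned above and reports which of them are 
absent from a BGP session.  The helper name and the session contents 
are hypothetical; 25/70 is the L2VPN/EVPN address family.

```python
# (AFI, SAFI) pairs needed for the native L3VPN/MVPN procedures.
MVPN_FAMILIES = {
    (1, 128): "VPN-IPv4 unicast",
    (2, 128): "VPN-IPv6 unicast",
    (1, 5): "IPv4 MCAST-VPN",
    (2, 5): "IPv6 MCAST-VPN",
}
EVPN_FAMILY = (25, 70)  # L2VPN / EVPN


def missing_mvpn_families(negotiated):
    """Return the MVPN-related (AFI, SAFI) pairs not in `negotiated`."""
    return sorted(set(MVPN_FAMILIES) - set(negotiated))


# A session that was set up for EVPN only lacks all four families.
evpn_only_session = {EVPN_FAMILY}
for family in missing_mvpn_families(evpn_only_session):
    print(family, MVPN_FAMILIES[family])
```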

If doing MVPN/EVPN interworking, one needs to go to every EVPN PE and 
set up all the RTs used to control the distribution of routes within the 
L3VPN domain.  One has to consider whether the RDs already used by EVPN 
are distinct from the RDs already used by L3VPN.  One has to enable the 
tunneling mechanisms that are used in the L3VPN domain (hopefully the 
EVPN PEs can support those tunneling techniques).  If the L3VPN 
deployment has been set up with particular routing policies (special 
communities carried, or whatever), these need to be configured on every 
EVPN PE.  One needs to take account of whether the L3VPN deployment uses 
segmented P-tunnels or non-segmented P-tunnels, and whether it depends 
upon the use of (C-*,C-*) S-PMSI A-D routes or not.  One needs to 
configure whether the L3VPN is expecting procedures of RFC6514 Section 
13 ("rpt-spt") or whether it is expecting procedures of RFC6514 Section 
14 ("spt-only").  I think there are quite a few other configuration 
items (various timers, and additional stuff that I probably don't even 
know about) that may need to be coordinated with the L3VPN deployment 
with which one is attempting to interwork.

To do interworking between EVPN and L3VPN/MVPN, the L3VPN/MVPN stuff 
obviously needs to be configured at the interworking points.  The 
REQUIREMENT to do ALL this configuration at EVERY single EVPN PE is what 
seems excessive.

Even if one is not doing MVPN/EVPN interworking, all this stuff still 
has to be configured; one just wouldn't have to worry in that case 
about coordinating with a pre-existing MVPN deployment. But no one ever 
called L3VPN a "low provisioning solution".  EVPN (unlike MVPN) has a 
fair amount of auto-provisioning built in, and one loses the advantages 
of that if one has to do MVPN provisioning on every PE.

Eric> 3. The draft claims that the exact same control plane should be 
used for EVPN and MVPN, despite the fact that MVPN's control plane is 
unaware of certain information that is very important in EVPN (e.g., 
EVIs, TagIDs).

Ali> IP multicast described in the draft is done at the tenant's level 
(IP-VRF) and not BD level !! So, BD level info such as tagIDs are not 
relevant.

The failure to carry BD level info is what causes the ethernet emulation 
to be done incorrectly.  Remember that most of EVPN is taken from L3VPN, 
with modifications to add stuff that is needed to correctly emulate the 
ethernet service.

Certainly if you look at the control plane used by EVPN to distribute 
unicast IP addresses, you'll see that it does not "just use L3VPN", but 
instead has lots of EVPN-specific stuff.

It's also worth pointing out that the draft does not really use the 
exact same control plane as MVPN, as it seems to require that each IP 
host address be advertised in two routes (an EVPN-specific route and a 
VPN-IP route), and the EVPN-specific routes (types 2 or 5) are now 
required to carry attributes that are typically carried only by the 
VPN-IP routes.  Also, there are the intra-ES tunnels (discussed below), 
something that doesn't exist in MVPN.  And then there are those 
under-specified EVPN-specific 'gateways' (discussed below) that are used 
to connect tunnels of different types.

Eric> 4. The draft proposes to use the same tunnels for MVPN and EVPN, 
i.e., to have tunnels that traverse both the MVPN and the EVPN domains.  
Various "requirements" are stated that seem to require this solution.  
Somewhere along the line it was realized that this requirement cannot be 
met if MVPN and EVPN do not use the same tunnel types.  So for this very 
common scenario, a completely different solution is proposed, that (a) 
tries to keep the EVPN control plane out of the MVPN domain, and vice 
versa, and (b) uses different tunnels in the two domains.  Perhaps the 
"requirements" that suggest using a single cross-domain tunnel are not 
really requirements!

Ali> There are SPDCs with MPLS underlay and there are SPDCs with VxLAN 
underlay. We need a solution that is optimum for both. Just the same way 
that we need both ASBR and GWs to optimize connectivity for inter-AS 
scenarios.

My point is that the document states "requirements", but applies them 
very selectively and very inconsistently.  There is a "requirement" to 
"use the same tunnels for MVPN and EVPN", but there are many deployment 
scenarios in which this "requirement" simply cannot be met.  If the 
"requirement" were stated as "only use tunnels that provide value",  I'd 
have no problem with it.  It seems that many of the specified 
requirements were reverse engineered from the solution as it was 
originally proposed, and then are silently ignored whenever it is 
discovered that they can't be met.

Eric> 5a. In some cases, the "requirements" for optimality in one or 
another respect (e.g., routing, replication) are really only 
considerations that an operator should be able to trade off against 
other considerations.  The real requirement is to be able to create a 
deployment scenario in which such optimality is achievable.  Other 
deployment scenarios, that optimize for other considerations, should not 
be prohibited.

Ali> What deployment scenarios do you think are prohibited ?

The draft does not appear to support scenarios in which the MVPN/EVPN 
interworking procedures are confined to a subset of the EVPN PEs, and 
not even visible to the majority of the EVPN PEs.

Eric> While the authors have realized that one cannot have cross-domain 
tunnels when EVPN uses VxLAN and MVPN uses MPLS, they do not seem to 
have acknowledged the multitude of other scenarios in which cross-domain 
tunnels cannot be used.  For instance, MVPN may be using mLDP, while 
EVPN is using IR.  Or EVPN may be using "Assisted Replication", which 
does not exist in MVPN.  Or MVPN may be using PIM while EVPN is using 
RSVP-TE P2MP. Etc., etc.  I suspect that "different tunnel types" will 
be the common case, especially when trying to interwork existing MVPN 
and EVPN deployments.

I note that the latest rev of the draft still does not take this into 
account.

Eric> The gateway-based proposal for interworking MVPN and EVPN when 
they use different tunnel types is severely underspecified.

Ali> Agreed. This will be covered in the subsequent revisions.

It doesn't seem to be in the latest revision.

Eric> One possible approach to this would be to have a single MVPN 
domain that includes the EVPN PEs, and to use MVPN tunnel segmentation 
at the boundary. While that is a complicated solution, at least it is 
known to work. However, that does not seem to be what is being proposed.

Ali> It is not clear to me exactly what you are suggesting here. At the 
boundary, is there any mcast address lookup or not?

If I were working on a proposal like the one in seamless-multicast, I 
would consider whether the MVPN inter-AS segmented P-tunnels feature 
could be leveraged at the border nodes between domains that use 
different tunnel types.  After all, one of the main purposes of MVPN 
inter-AS segmentation is to connect domains that use different tunnel 
types.  Done properly, that does not require any IP lookups at the 
ASBRs.  The draft seems to be trying to reinvent MVPN P-tunnel 
segmentation from scratch.  This is a very intricate part of the MVPN 
specs and you can't just make it up as you go along.

Here is just a selection of some of the problems with section 10.1 
("Control Plane Interconnect") of the -02 revision:

- Much of the document seems to assume that the RTs used in the MVPN 
domain will be the same as the RTs used in the EVPN domain.  If that is 
the case, all the A-D routes from one domain will propagate into the 
other.  This does not appear to be compatible with the sketchy 
description of "gateway" behavior given in Section 10.

- Section 10.1 states that the RD in a Source Active A-D route needs to 
be changed when such a route is re-originated by a gateway. 
Unfortunately, MVPN requires that the SA A-D route for (S,G) have the 
same RD as the unicast route for S.  So you would need to block all the 
IPVPN routes at the gateway and reoriginate them with new RDs.  The spec 
fails to mention this.  Note that this is not even possible if the EVPN 
PEs share the RTs of the MVPN domain.

- Interesting effects could arise if an EVPN PE chooses a gateway as the 
UMH, but the gateway chooses an EVPN PE as the UMH.  Can you demonstrate 
that this is impossible?

- Section 10.1 says that the C-multicast routes originated by the 
gateway carry the "exported RT list on the IP-VRF".  In MVPN, 
C-multicast routes do not carry the exported RT list, they carry an RT 
created from the VRF Route Import EC of the Selected UMH route.

- Section 10.1 talks about putting the BGP Encapsulation EC on the 
C-multicast routes sent into the MVPN domain.  However, MVPN does not 
make any use of this EC.

- Section 10.1 states that the S-PMSI A-D routes just propagate from one 
domain to the other, but with some unspecified "modifications".

- Leaf A-D routes are not discussed at all, nor is the setting of the 
LIR flag in the PMSI Tunnel attribute.

- Inter-AS I-PMSI A-D routes are not discussed.
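
The RD item in the list above can be sketched as a consistency check.  
This is only an illustration: the class names, RD values, and 
addresses are invented, and the check models nothing beyond the rule 
that the SA A-D route for (S,G) carries the same RD as the unicast 
route for S.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UnicastRoute:
    rd: str
    prefix: str  # address of the multicast source S


@dataclass(frozen=True)
class SourceActiveRoute:
    rd: str
    source: str
    group: str


def sa_route_is_consistent(sa, unicast_routes):
    """True if some unicast route for sa.source has the same RD as `sa`."""
    return any(u.prefix == sa.source and u.rd == sa.rd for u in unicast_routes)


unicast = [UnicastRoute(rd="10.0.0.1:1", prefix="192.0.2.10")]

original = SourceActiveRoute(rd="10.0.0.1:1", source="192.0.2.10", group="232.1.1.1")
# A gateway that rewrites only the SA route's RD breaks the match.
reoriginated = SourceActiveRoute(rd="10.0.0.9:1", source="192.0.2.10", group="232.1.1.1")

print(sa_route_is_consistent(original, unicast))
print(sa_route_is_consistent(reoriginated, unicast))
```

The second check fails unless the gateway also re-originates the 
unicast route for S with the new RD.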

This section is still severely underspecified.  It seems to be inventing 
a new way of interconnecting two L3VPN/MVPN domains, but it's not 
"option A", "option B", or "option C", and it's not "segmented 
P-tunnels".  So what is it exactly, and how do we know it works?

Have you thought about cases where multiple domains (i.e., more than 2) 
using different tunnel types are interconnected, perhaps in a cycle?

I think the issue of how to interwork domains that use different tunnel 
types is quite important.  If one wants to interwork an MVPN domain that 
uses mLDP-based P2MP LSPs with an EVPN domain that uses IR, I don't 
think one wants to tell customers that interoperability requires them to 
start using mLDP inside the EVPN domain.  If one is using assisted 
replication (AR) within the EVPN domain, I don't think anyone will want 
to hear "sorry, AR is not supported by MVPN".  I don't think the 
interworking between two domains can be called "seamless" if one has to 
change the tunnel types  of either domain.  But the details for how to 
do the interworking between different tunnel types just don't seem to be 
present.

Furthermore, it is pretty clear that some sort of gateway is going to be 
needed to provide interoperability with RFC 7432 nodes that do not 
implement MVPN; this needs to be addressed as well.

Eric> Another approach would be to set up two independent MVPN domains 
and carefully assign RTs to ensure that routes are not leaked from one 
domain to another.  One would also have to ensure that the boundary 
points send the proper set of routes into the "other" domain.  (This 
includes the unicast routes as well as the multicast routes.)  And one 
would have to include a whole bunch of applicability restrictions, such 
as "don't use the same RR to hold routes of both domains".  I think 
that's what's being proposed, but there isn't enough discussion of RT 
and RD management to be sure, and there isn't much discussion of what 
information the boundary points send into each domain.

Ali> I will expand on that with the RD and RT management aspects. But 
the intention is with a single MVPN domain where both EVPN and MVPN PEs 
participate.

Note that the use of a single RT by both MVPN and EVPN nodes will cause 
routes to be distributed throughout the "single MVPN domain", with no 
opportunity for a gateway to modify the routes.  But section 10.1 does 
seem to require a gateway to modify routes in order to connect tunnels 
of different types.
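
A small sketch of why a shared RT leaves the gateway no role.  The PE 
names and RT strings are hypothetical, and route import is reduced to 
"import if any RT on the route matches an import RT".

```python
def importing_pes(route_rts, import_rts_by_pe):
    """Return the PEs whose import RTs intersect the RTs on the route."""
    return sorted(pe for pe, rts in import_rts_by_pe.items() if route_rts & rts)


# Single RT shared by everyone: the route reaches every PE directly,
# so there is nothing left for a gateway to modify.
shared = {"MVPN-PE1": {"RT:1"}, "EVPN-PE1": {"RT:1"}, "GW": {"RT:1"}}
print(importing_pes({"RT:1"}, shared))

# Distinct RTs per domain: only the gateway sees routes from both
# sides, and can re-originate them into the other domain.
split = {
    "MVPN-PE1": {"RT:mvpn"},
    "EVPN-PE1": {"RT:evpn"},
    "GW": {"RT:mvpn", "RT:evpn"},
}
print(importing_pes({"RT:mvpn"}, split))
```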

Eric> 7. The proposal requires that EVPN export a host route to MVPN for 
each EVPN-attached multicast source.  It's a good thing that there is no 
requirement like "do not burden existing MVPN deployments with a whole 
bunch of additional host routes".  Wait a minute, maybe there is such a 
requirement.

Eric> In fact, whether the host routes are necessary to achieve optimal 
routing depends on the topology.  And this is a case where an operator 
might well want to sacrifice some routing optimality to reduce the 
routing burden on the MVPN nodes.

Ali> If there is mobility, then there is host route advertisement If 
there is no mobility, then prefixes can be advertised.

It seems to me that this is simply not true.  Consider the following 
example:

- BD1 has subnet 192.168.168.0/24.

- BD1 exists on ES1, which is attached to PE1.

- BD2 exists on ES2, which is attached to PE2.  (ES1 and ES2 are not the 
same ES.)

- On BD1/ES1, there are hosts 192.168.168.1, 192.168.168.103, 
192.168.168.204.

- On BD2/ES2 there are hosts 192.168.168.2, 192.168.168.104, 192.168.168.203.

Assume there is no mobility.

In this scenario, I don't see how either PE1 or PE2 can advertise any 
prefix shorter than a /32.  And I don't see how one will prevent all 
these /32 routes from being distributed to all the MVPN nodes.

The fundamental issue here is that while IP addresses can be aggregated 
on a per-BD basis, they cannot be aggregated on a per-ES basis.
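
The aggregation problem can be checked with Python's standard 
ipaddress module.  This sketch assumes all six hosts in the example 
lie in 192.168.168.0/24 (so the last host behind PE2 is taken to be 
192.168.168.203); covering_prefix is a helper written for this 
example.

```python
import ipaddress


def covering_prefix(hosts):
    """Return the longest single prefix that contains every host."""
    addrs = [ipaddress.ip_address(h) for h in hosts]
    net = ipaddress.ip_network(hosts[0] + "/32")
    while not all(a in net for a in addrs):
        net = net.supernet()
    return net


pe1_hosts = ["192.168.168.1", "192.168.168.103", "192.168.168.204"]
pe2_hosts = ["192.168.168.2", "192.168.168.104", "192.168.168.203"]

# The only single prefix covering PE1's hosts is the whole subnet,
# which also (wrongly) covers the hosts behind PE2.
pe1_cover = covering_prefix(pe1_hosts)
overlap = [h for h in pe2_hosts if ipaddress.ip_address(h) in pe1_cover]
print(pe1_cover, overlap)

# Exact summarization does not help either: none of the /32s merge.
pe1_exact = list(
    ipaddress.collapse_addresses(ipaddress.ip_network(h + "/32") for h in pe1_hosts)
)
print(pe1_exact)
```

So each PE is left advertising one /32 per host if it wants to 
describe exactly what is attached to it.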

I don't think you get "seamless" interworking by requiring all the MVPN 
nodes to receive an unbounded number of host routes.

Eric> 8. The proposal simply does not work when MVPN receivers are 
interested in multicast flows from EVPN sources that are attached to 
all-active multi-homed ethernet segments.

Ali> This issue has been addressed in the new revision.

Yes, this is an improvement.

Suppose PE1 receives an (S,G) IP multicast frame over a local AC from 
BD1/ES1.  And suppose PE2,...,PEn are also attached to ES1. Per the new 
revision, PE1 transmits a copy of the frame on an EVPN-specific tunnel 
to PE2,...,PEn (an "intra-ES1" tunnel), as well as transmitting a copy 
of the contained IP datagram on whatever MVPN tunnel it uses to carry 
(S,G) packets.  Now any EVPN PE attached to the source ES can be 
selected as the UMH by an MVPN node, because all such EVPN PEs get the 
(S,G) frames and can forward them to MVPN receivers.

It's good to see the draft recognizing that IP multicast frames do need 
to be transmitted as frames on EVPN-specific tunnels, in addition to 
being transmitted as packets on MVPN tunnels.  (Of course this solution 
violates the stated "requirement" that a given IP multicast packet not 
be transmitted on two different tunnels. Sigh, another example of 
"requirements" being applied inconsistently.)

However, there are still several problems with this solution.

No control plane is described to support this intra-ES tunneling. Is 
that "for the next revision"? ;-)

There's a suggestion that this solution is trivial, because no one would 
ever home an ES to more than two PEs, and therefore you just have to 
unicast a copy to the other PE.

But the PE receiving a frame has to figure out whether the frame was 
sent to it on an intra-ES tunnel or not, and if so, which ES the tunnel 
is associated with.  It is not clear how the receiving PE is supposed to 
make this determination.  One needs to say something more than "just use 
ingress replication".

The draft also suggests that "multi-homed" always means "dual-homed", 
which I don't think is acceptable.

Note also that a scheme like this causes EVERY (S,G) frame to get sent 
to EVERY PE that is attached to S's source ES.  This happens even if 
there are NO receivers anywhere interested in (S,G) at all. In effect, 
the LAG hashing algorithm is defeated.  If a switch is multi-homed to n 
PEs, it uses a LAG hashing algorithm to ensure that any given packet is 
sent to just one of those PEs.  Then one EVPN PE gets the packet and 
sends it to the other n-1 PEs, who have to treat the packet as if it had 
just arrived on the AC from the multi-homed switch.  It would be better 
to have a "pull model" where PEx gets the (S,G) packet from PE1 only if 
some MVPN PE has sent a C-multicast (S,G) or (*,G) route to PEx.
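
One way to picture the "pull model" is as a filter on the intra-ES 
replication list.  The data structures below are hypothetical (a map 
from each PE to the C-multicast (S,G) and (*,G) state it has received 
from MVPN PEs); this is a sketch of the idea, not a proposed 
procedure.

```python
def intra_es_targets(flow, es_peers, joins_by_pe):
    """Return the peers on the source ES that actually need `flow`.

    A peer needs the flow only if some MVPN PE has sent it a
    C-multicast route for (S,G) or for (*,G).
    """
    source, group = flow
    wanted = {(source, group), ("*", group)}
    return [pe for pe in es_peers if joins_by_pe.get(pe, set()) & wanted]


flow = ("10.1.1.1", "239.1.1.1")
es_peers = ["PE2", "PE3"]

# No receivers anywhere: nothing is copied to the other PEs on the ES.
print(intra_es_targets(flow, es_peers, {}))

# Only PE3 was chosen as UMH by some MVPN PE, so only PE3 pulls the flow.
print(intra_es_targets(flow, es_peers, {"PE3": {("*", "239.1.1.1")}}))
```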

In addition, the latest rev of the draft is still confused about the way 
UMH selection is done.  It seems to assume all the PEs will select the 
same "Upstream PE" for a given (S,G).  While this is one possible option 
(generally referred to as Single Forwarder Selection), it is not 
required, and I believe the most common deployment scenario is to use 
the "Installed UMH Route" as the "Selected UMH Route".  (See section 
5.1.3 of RFC 6513.)  This means that it is always possible for a PE to 
receive more than one copy of an (S,G) packet, and the PE must therefore 
always be able to apply the "discard from the wrong PE" procedures of 
RFC 6513 Section 9.1.1.

Suppose for example that EVPN-PE1 transmits its IP multicast frames on 
an I-PMSI that is instantiated by a P2MP LSP.  EVPN-PE2 will have to 
join that I-PMSI.  If PE1 and PE2 are both attached to BD1/ES1, then 
when PE1 gets an (S,G) IP multicast frame from BD1/ES1, PE2 will get two 
copies: one on the intra-ES1 tunnel from PE1 and one on the I-PMSI 
tunnel from PE1.  PE2 will probably choose itself as the "Upstream PE" 
for (S,G), in which case it needs to discard the copy that arrives on 
the I-PMSI tunnel from PE1, while accepting the copy that arrives on the 
intra-ES1 tunnel from PE1.  (If PE2 for some reason chose PE1 as the 
Upstream PE for (S,G), it would have to discard the copy arriving on the 
intra-ES1 tunnel and accept the copy arriving on the I-PMSI tunnel.)  
The draft seems to imply, incorrectly, that the "discard from the wrong 
PE" procedure is not necessary.
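
The accept/discard decision in the PE1/PE2 example above can be 
written down as follows.  The function and the tunnel labels 
("intra-es", "pmsi") are illustrative names only, and the sketch 
assumes the egress PE can identify both the sending PE and the kind 
of tunnel a copy arrived on.

```python
def accept_copy(local_pe, upstream_pe, sender_pe, tunnel):
    """Decide whether to accept one copy of an (S,G) packet.

    `upstream_pe` is the Upstream PE that `local_pe` selected for (S,G).
    `tunnel` is "intra-es" or "pmsi".
    """
    if upstream_pe == local_pe:
        # The local PE is its own upstream, so the only valid remote
        # copy is the one standing in for the local AC.
        return tunnel == "intra-es"
    # Otherwise accept only the PMSI copy sent by the selected upstream PE.
    return tunnel == "pmsi" and sender_pe == upstream_pe


# PE2 selects itself as Upstream PE for (S,G).
print(accept_copy("PE2", "PE2", "PE1", "intra-es"))  # accepted
print(accept_copy("PE2", "PE2", "PE1", "pmsi"))      # discarded

# PE2 selects PE1 as Upstream PE for (S,G).
print(accept_copy("PE2", "PE1", "PE1", "pmsi"))      # accepted
print(accept_copy("PE2", "PE1", "PE1", "intra-es"))  # discarded
```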

The "discard from the wrong PE" procedures are also needed to handle the 
case where the source is at a site homed to two or more MVPN PEs, and 
there are MVPN receivers that do not do single forwarder selection.  
This may cause some packets to appear on multiple I-PMSIs, and each 
EVPN-PE will have to join all the I-PMSIs, of course.

(The use of S-PMSIs rather than I-PMSIs does not eliminate this 
problem.  A given S-PMSI from PE1 might carry a flow that PE2 needs from 
PE1, and it might also carry a flow that PE2 is getting on an S-PMSI 
from PE3.)

Please note that if MPLS ingress replication is being used, the "discard 
from the wrong PE" functionality requires that the egress PE be able to 
tell from a packet's encapsulation when a packet is from the wrong 
ingress PE.

If the MVPN nodes are using the "extranet" feature (RFC 7900), "discard 
from the wrong PE" is not actually sufficient, one needs to "discard 
from the wrong ingress VRF".

Since there is no clean layering between MVPN and EVPN protocols in this 
proposal, every little nit and corner case of MVPN has to be examined to 
make sure it will also work in the EVPN domain.

Another problem: according to the draft, if an EVPN PE, say PE1, learns 
of a source via a locally attached all-active multi-homed ES, it will 
originate an IP route for that source.  Consider another PE, say PE2, 
attached to the same multi-homed ES.  When PE2 receives that IP route 
from PE1, PE2 will then originate its own IP route for that source.  
Since PE1 receives PE2's route, it is not clear how the route ever gets 
withdrawn.  If PE1 stops seeing the local traffic, it will still see 
PE2's route, and hence will still originate its own route.  One might 
think this is easily fixed by attaching to PE1's route an EC that 
declares that route to be "authoritative"; PE2's route would not have 
that EC.  Note though that the addition or removal of this "authoritative" 
EC will cause some churn that will be visible to the MVPN-only nodes, 
even though it does not provide them with any useful information.
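
The withdrawal problem is easy to reproduce in a toy simulation.  The 
origination rule below is my reading of the behavior described above 
(a PE originates the IP route if it sees local traffic from the 
source, or if it has received the route from another PE on the same 
ES); the function names and the round-based model are assumptions 
made for the sketch.

```python
def step(routes, local_traffic, pes):
    """One round: return the set of PEs that originate the route."""
    return {pe for pe in pes if local_traffic[pe] or (routes - {pe})}


def settle(routes, local_traffic, pes, max_rounds=10):
    """Repeat `step` until the set of originators stops changing."""
    for _ in range(max_rounds):
        nxt = step(routes, local_traffic, pes)
        if nxt == routes:
            break
        routes = nxt
    return routes


pes = ["PE1", "PE2"]

# The source starts sending and PE1 sees the traffic on its local AC.
routes = settle(set(), {"PE1": True, "PE2": False}, pes)
print(sorted(routes))

# The source goes quiet.  Each PE still sees the other's route, so
# neither one ever withdraws its own.
routes = settle(routes, {"PE1": False, "PE2": False}, pes)
print(sorted(routes))
```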

I would also like to take note of the following issue.  From the draft:

"The EVPN PEs terminate ...  PIM messages from tenant routers on their 
IRB interfaces, thus avoid sending these messages over MPLS/IP core."

A PIM control message from a given PIM router needs to reach whichever 
other PIM router is a possible unicast next hop for any multicast source 
or RP.  The scheme of having each EVPN PE terminate the PIM messages 
presupposes that each tenant router will have the nearest EVPN PE as its 
unicast next hop towards the multicast source or RP.

This is likely to be a common scenario, but it certainly is not the only 
scenario.  A tenant might have several PIM routers on a given BD, where 
each PIM router is attached to a different PE.  The PIM routers could be 
IGP neighbors in the tenant's IGP, and may be exchanging IGP updates 
with each other.  In this case, PIM control messages from one tenant PIM 
router on the BD need to reach the other tenant routers on the BD.

For example, suppose Tenant Router R1 on BD1 attaches to PE1, and Tenant 
Router R2 on a different ES of BD1 attaches to PE2.  If R1 and R2 are 
IGP neighbors, R2 may see R1 as the next hop to a given source S.  In 
that case, R2 may choose to target a PIM Join(S,G) to R1.

In this scenario, the PIM control messages between R1 and R2 have to be 
sent between PE1 and PE2.  Since PIM control messages have a TTL of 1, 
they would have to be sent on BD1's BUM tunnels rather than on the IP 
multicast tunnels.

Now the question is, if R2 sends PIM Join(S,G) to R1, how does R2 get 
the (S,G) traffic from R1?  Either PE1 has to send it on BD1's BUM 
tunnel, or else PE2 has to figure out that it needs to pull (S,G) 
traffic from PE1 on an IP multicast tunnel.  The spec needs to explain 
how this situation is handled.  If the (S,G) traffic travels on BD1's 
BUM tunnel, the spec also has to make it clear how that traffic gets to 
other BDs.

BTW, section 6.5 of the draft says that any frame containing an IP 
packet whose destination address is in the range 224/8 is sent as a BUM 
frame.  I suspect that 224.0.0/24 is what is meant, as that seems to be 
the IPv4 multicast link-local address space.
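
The difference between the two ranges is easy to see with Python's 
standard ipaddress module.  The two sample addresses are chosen only 
for illustration: 224.0.0.13 is the ALL-PIM-ROUTERS group, and 
224.2.127.254 is a group inside 224/8 that is not link-local.

```python
import ipaddress

ALL_OF_224 = ipaddress.ip_network("224.0.0.0/8")
LINK_LOCAL = ipaddress.ip_network("224.0.0.0/24")

results = {}
for addr in ("224.0.0.13", "224.2.127.254"):
    ip = ipaddress.ip_address(addr)
    # Record membership in the wide range and in the link-local block.
    results[addr] = (ip in ALL_OF_224, ip in LINK_LOCAL)
    print(addr, results[addr])
```

Matching on 224/8 would send the second address as a BUM frame too, 
even though it is an ordinary routable group.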

One more thing.  The draft says that SPT-ONLY (RFC 6514 section 14) mode 
should be the default configuration.  This has several problems:

- SPT-ONLY mode requires each PE to function as an RP, which creates a 
considerable amount of additional work for the PE (handling the register 
messages and maintaining a large number of (S,G) states). It also 
requires the PE to originate a Source Active A-D route for each (S,G), a 
route that would not otherwise be needed.

- If the tenant or MVPN customer already has a multicast infrastructure 
with Rendezvous Points (RPs), it may be impossible to use SPT-ONLY mode, 
as this mode may not be compatible with the customer/tenant's 
infrastructure.  However, it may still be desirable to have RP-free 
operation for multicast groups whose sources and receivers are all in 
the EVPN domain.

- SPT-ONLY mode can sometimes be made compatible with an existing 
tenant/customer multicast infrastructure by having the PEs participate 
in the BSR or Auto-RP protocols, and/or by having the PEs participate in 
MSDP.  This would not generally be regarded as a simplification.

- If one is interworking with an MVPN whose PEs are configured to use 
RPT-SPT mode (RFC 6514 section 13), one must configure the EVPN-PEs to 
use RPT-SPT mode as well, because the two modes are not interoperable.  
I believe most MVPN deployments use RPT-SPT mode.

So I don't see the grounds for recommending the SPT-ONLY mode as the 
default.  The choice between SPT-ONLY mode and RPT-SPT mode depends on 
many factors and requires knowledge of (a) a particular tenant's 
deployment scenario, and (b) if MVPN interworking is being done, the 
mode that is being used by the MVPN nodes.
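
The interworking constraint alone already rules out a fixed default.  
A minimal sketch, with invented function names and the two modes 
written as the strings "spt-only" and "rpt-spt":

```python
VALID_MODES = {"spt-only", "rpt-spt"}


def required_evpn_mode(mvpn_mode):
    """Return the mode the EVPN PEs must use to match the MVPN PEs.

    Returns None when there is no MVPN deployment to interwork with,
    in which case the choice depends on the tenant's own deployment.
    """
    if mvpn_mode is None:
        return None
    if mvpn_mode not in VALID_MODES:
        raise ValueError(f"unknown mode: {mvpn_mode}")
    # The two modes do not interoperate, so the EVPN PEs must match.
    return mvpn_mode


print(required_evpn_mode("rpt-spt"))
print(required_evpn_mode(None))
```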
