Re: Comments/questions on morin-fast-failover

Yakov Rekhter <yakov@juniper.net> Fri, 19 March 2010 19:51 UTC

Message-ID: <201003191948.o2JJmjD41986@magenta.juniper.net>
To: erosen@cisco.com
Subject: Re: Comments/questions on morin-fast-failover
In-Reply-To: <32336.1268411458@erosen-linux>
References: <32336.1268411458@erosen-linux>
MIME-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-ID: <80386.1269028124.1@juniper.net>
Date: Fri, 19 Mar 2010 12:48:45 -0700
From: Yakov Rekhter <yakov@juniper.net>
Cc: l3vpn@ietf.org
Precedence: list

Eric,

> 
> I have a few comments on draft-morin-l3vpn-mvpn-fast-failover-04.
> 
> The proposed UMH selection procedure is:
> 
>    o  first, the UMH candidates that either (a) advertise a PMSI bound
>       to a tunnel that is "up", or (b) do not advertise any I- or S-
>       PMSI applicable to the said (C-S,C-G) but have associated a VRF
>       Route Import BGP attribute to the unicast VPN route for S (this is
>       necessary to avoid considering invalid some UMH PEs that use a
>       policy where no I-PMSI is advertised for a said VRF and where only
>       S-PMSI are used, the S-PMSI advertisement being possibly done only
>       after the upstream PE receives a C-multicast route for (C-S,
>       C-G)/(C-*, C-G) to be carried over the advertised S-PMSI)
> 
> I think condition (a) really has to be "advertise a PMSI bound to a tunnel,
> where the specified tunnel is not known to be down".  Saying "Not known to
> be down" is not the same as saying "up".

Agreed - we'll change it to "not known to be down".

> Otherwise, consider the following:
> 
> - PE-R sees:
> 
>   * a route to C-S via PE-S1
> 
>   * a route to C-S via PE-S2
> 
>   where the route via PE-S2 is preferable
> 
> - PE-S1 has bound (C-S,C-G) to an S-PMSI instantiated by P-tunnel P1
> 
> - PE-S2 has bound (C-S,C-G) to an S-PMSI instantiated by P-tunnel P2
> 
> - PE-R happens to be already joined to P-tunnel P1, and is actively
>   receiving traffic on it
> 
> - PE-R is not already joined to P-tunnel P2
> 
> Clearly PE-R will regard P1 as being up.  But what about P2?  PE-R cannot
> know whether P2 is up, as PE-R is not joined to it.  Yet we certainly don't
> want PE-R to pick PE-S1 as the UMH for (C-S,C-G) in this case.
> 
> A consequence is that PE-R has to keep track, for each (C-S, C-G), of the
> set of UMH candidates whose tunnels are known to be down.  This is necessary
> so that PE-R can redo the UMH selection whenever one of those tunnels is no
> longer known to be down.  (Assuming of course that one wants to reoptimize
> the routing when a tunnel comes back up.)  The draft should say something
> about how one knows to return to the optimal route when a tunnel comes up.

At the *conceptual* level PE-R may maintain an ordered list of *all*
UMH candidates, irrespective of whether the tunnels advertised by
these candidates are known to be down (the order is determined by
the BGP route selection procedures).  The first one on that list
whose tunnel is "not known to be down" is the one currently used.
PE-R monitors the status of the tunnels of the UMHs that are ahead
of the current one.  Whenever PE-R determines that one of these
tunnels is no longer "known to be down", PE-R selects the UMH of
that tunnel.
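
As a rough illustration only (this is not text from the draft), the
conceptual procedure above could be sketched in Python as follows.
The class name, method names and PE names are hypothetical, and the
candidate list is assumed to be already ordered by the BGP route
selection procedures:

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class UmhSelector:
    """Conceptual UMH selection for one (C-S,C-G) at a downstream PE."""

    # All UMH candidates, most preferred first (order assumed to come
    # from BGP route selection; not computed here).
    candidates: List[str]
    # Candidates whose advertised tunnel is currently known to be down.
    known_down: Set[str] = field(default_factory=set)

    def current_umh(self) -> Optional[str]:
        # First candidate whose tunnel is "not known to be down".
        for pe in self.candidates:
            if pe not in self.known_down:
                return pe
        return None

    def tunnel_down(self, pe: str) -> Optional[str]:
        # Record the failure and redo the selection.
        self.known_down.add(pe)
        return self.current_umh()

    def tunnel_restored(self, pe: str) -> Optional[str]:
        # The tunnel is no longer known to be down; redo the selection
        # so that a more preferred UMH is picked up again.
        self.known_down.discard(pe)
        return self.current_umh()


selector = UmhSelector(["PE-S2", "PE-S1"])
print(selector.current_umh())             # PE-S2 (status unknown is not "down")
print(selector.tunnel_down("PE-S2"))      # PE-S1
print(selector.tunnel_restored("PE-S2"))  # PE-S2
```

With Eric's example, PE-S2 is selected even though PE-R has not joined
P2, because P2 is merely "not known to be down".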

To reflect this in the document we could add something like the
following after the 2nd paragraph in Section 3:

  A downstream PE monitors the status of the tunnels of UMHs that
  are ahead of the current one. Whenever the downstream PE determines
  that one of these tunnels is no longer "known to be down", the PE
  selects the UMH corresponding to that tunnel as the new UMH.

  
> I also have a question about whether changing UMH can really be considered
> "fast failover" under most circumstances.  If PE-R is already joined to
> PE-S1's tunnel when PE-S2's tunnel fails, then PE-R can certainly switch
> quickly from receiving (C-S,C-G) on one tunnel to receiving it on the other.
> But suppose PE-R is not already joined to PE-S1's tunnel.  Now PE-R must
> engage in control plane activity in order to keep getting the (C-S,C-G)
> traffic. So where's the "fast restoration"?  Fast restoration schemes are
> usually focused on keeping the data flowing without waiting for control
> plane activity to complete.

The above is based on an incorrect assumption that PE-R "is not already
joined to PE-S1's tunnel". To see why this assumption is incorrect,
note the following from 4.1:

   Note that, when a PE advertizes such a Standby C-multicast join for
   an (S,G) it must join the corresponding P-tunnel.

> This is particularly relevant when the "tunnel down" status is inferred by
> PE-R from the fact, locally detected, that the last hop link of the tunnel
> is down.  In this case, the tunnel will get automatically reconstituted
> after the IGP distributes the link status change.  If PE-R changes UMH, it
> may need to send a Join (whether in PIM or in BGP) to the new UMH, and the
> new UMH itself may need to send a C-PIM Join out a VRF interface.  

If PE-R changes UMH, then in order to start receiving traffic from
the new UMH PE-R does *not* need to send a C-multicast route to the
new UMH.

And if one uses "warm root standby" procedures (see 4.2), then the
new UMH does *not* even need to send a C-PIM Join out a VRF interface.

> needs to add itself to the new P-tunnel, and this may involve a fair amount
> of signaling, both in the tunnel setup protocol (mLDP or RSVP-TE) and in BGP
> (e.g., the new UMH may need to send a new S-PMSI A-D route, and PE-R may
> then need to send a new Leaf A-D route.)  For this to count as a "fast
> restoration" scheme, all this has to be done in less time than it would take
> for the IGP to react to the link outage.

First of all, note that using the last hop link to determine that the
tunnel is down (section 3.1.2) is applicable only when there is no
fast restoration mechanism on that link.  Thus, for all practical
purposes, using last hop link status to determine that the tunnel
is down is *not* applicable when the tunnel setup protocol is RSVP-TE;
it is applicable *only* when the tunnel setup protocol is mLDP or PIM,
and, for mLDP, only when mLDP is used without FRR link protection.

Second, automatically reconstituting the tunnel after the last hop
link failure may take longer than it takes the IGP to react to the
link outage (as the MPLS forwarding state of the tunnel can be
reconstituted only after the IGP finishes reacting to the link outage).
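
To make the applicability rule in the first point above concrete, here
is a small Python sketch.  It is only a literal reading of this message,
not text from the draft; the enum and function names are hypothetical:

```python
from enum import Enum


class TunnelProtocol(Enum):
    RSVP_TE = "rsvp-te"
    MLDP = "mldp"
    PIM = "pim"


def last_hop_link_check_applicable(protocol: TunnelProtocol,
                                   frr_link_protection: bool) -> bool:
    """Whether last hop link status may be used to infer "tunnel down"."""
    if protocol is TunnelProtocol.RSVP_TE:
        # For all practical purposes not applicable with RSVP-TE.
        return False
    if protocol is TunnelProtocol.MLDP:
        # Applicable only when mLDP runs without FRR link protection.
        return not frr_link_protection
    # PIM: this message states no further condition, so the flag is
    # ignored here (an assumption of this sketch).
    return protocol is TunnelProtocol.PIM


print(last_hop_link_check_applicable(TunnelProtocol.MLDP, False))     # True
print(last_hop_link_check_applicable(TunnelProtocol.MLDP, True))      # False
print(last_hop_link_check_applicable(TunnelProtocol.RSVP_TE, False))  # False
```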

> If "hot root standby" is being used, then these concerns don't apply unless
> the standby tunnel and the primary tunnel happen to have the same last link
> (which is certainly possible).
> 
> The draft does say:
> 
>    if the PE can determine that there is no fast
>    restoration mechanism (such as MPLS FRR [RFC4090]) in place for the
>    P-tunnel, it can update the UMH immediately.  Else, it should wait
>    before updating the UMH, to let the P-tunnel restoration mechanisms
>    happen
> 
> But this doesn't really address the issue I raise above.  If the timer has to
> be long enough to let the IGP converge, there's not much point in having it
> at all.

The timer is used *only* if there is a fast restoration mechanism. The
timer has nothing to do with how long it would take the IGP to converge.

To clarify we'll add the following:

   This allows fast tunnel restoration to take place before the
   downstream PE updates the UMH.
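
The intended behaviour can be sketched in Python as follows.  This is
an illustration under stated assumptions, not text from the draft: the
function name and the hold_time_s parameter are hypothetical, and no
particular timer value is implied:

```python
def umh_update_delay(fast_restoration_in_place: bool,
                     hold_time_s: float) -> float:
    """Seconds to wait, after a P-tunnel is detected down, before
    the downstream PE updates the UMH."""
    if hold_time_s < 0:
        raise ValueError("hold_time_s must be non-negative")
    # The timer applies only when a fast restoration mechanism (such as
    # MPLS FRR) may repair the P-tunnel; otherwise update immediately.
    # IGP convergence time plays no part in the choice.
    return hold_time_s if fast_restoration_in_place else 0.0


print(umh_update_delay(False, 0.5))  # 0.0
print(umh_update_delay(True, 0.5))   # 0.5
```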
  
> With regard to the procedures for use of the "standby community":
> 
> - For the "hot root standby" scheme, I don't think the community is
>   necessary, as it doesn't seem to play any role.

Even when the secondary PE is to be "already sending traffic", the
Standby C-multicast Join is useful because the standby route allows
easier provisioning, as explained in section 5:
  
   Note that the same level of protection would be achievable with a
   simple C-multicast Source Tree Join route advertised to both the
   primary and secondary upstream PEs (carrying as Route Target extended
   communities, the values of the VRF Route Import attribute of each VPN 
   route from each upstream PEs).  The advantage of using the Standby   
   semantic for is that, supposing that downstream PEs always advertise  
   a Standby C-multicast route to the secondary upstream PE, it allows
   to choose the protection level through a change of configuration on
   the secondary upstream PE, without requiring any reconfiguration of
   all the downstream PEs.

> - For the cold and warm root standby schemes, it seems that the standby PE
>   has to know who the primary PE for a given (C-S,C-G) is.  I don't really
>   see how this will work, since each downstream PE could have a different
>   primary PE.

To "see how this will work" please read the first paragraph of
Section 4:

   The procedures described below are limited to the case where the site
   that contains C-S is connected to exactly two PEs.  The procedures
   require all the PEs of that MVPN to follow the single forwarder PE
   selection, as specified in [I-D.ietf-l3vpn-2547bis-mcast].  The
   procedures assume that if a site of a given MVPN that contains C-S is
   dual-homed to two PEs, then all the other sites of that MVPN would
   have two unicast VPN routes (VPN-IPv4 or VPN-IPv6) routes to C-S,
   each with its own RD.

Yakov.
