[Pals] draft-ietf-pals-endpoint-fast-protection

Stewart Bryant <stbryant@cisco.com> Fri, 07 August 2015 16:19 UTC

From: Stewart Bryant <stbryant@cisco.com>
To: draft-ietf-pals-endpoint-fast-protection@tools.ietf.org
Message-ID: <55C4DAB5.40103@cisco.com>
Date: Fri, 07 Aug 2015 17:20:05 +0100
Archived-At: <http://mailarchive.ietf.org/arch/msg/pals/Qlyf_d_VdUEm1YBTYwCwi5AHJCQ>
Cc: "pals-chairs@tools.ietf.org" <pals-chairs@tools.ietf.org>, "pals@ietf.org" <pals@ietf.org>
Subject: [Pals] draft-ietf-pals-endpoint-fast-protection

Hi Authors,

I have some more comments on draft-ietf-pals-endpoint-fast-protection-00.

There will be another WGLC on the draft so please consider
these comments along with any others you receive.


This document needs a careful look at the 2119 language.

I commented on some of the SHOULDs, but I think there are
a whole lot more where you need to s/SHOULD/MUST/.
In each case, when I ask myself "what would happen if the
implementer ignored the suggestion?", the system
would break, which leads me to conclude that it needs
to be a MUST.


=========
  1.  Introduction


    Today, fast protection against ingress AC failure and ingress (T-)PE
    failure can be achievable by using a multi-homed CE and redundant
    ACs, such as multi-chassis link aggregation group (MC-LAG).  Fast
    protection against failure of intermediate router of transport tunnel

SB> nit: "of an intermediate"

========

    This document is intended to serve the above need.  It specifies a
    fast protection mechanism based on local repair to protect PWs
    against the following egress endpoint failures.

    a.  Egress AC failure.

    b.  Egress PE failure: Link or node failure of an egress PE of an SS-
        PW, or a T-PE of an MS-PW.

    c.  Switching PE failure: Link or node failure of an S-PE of an MS-
        PW.

SB> Suggest Switching PE (S-PE) failure or
     S-PE failure

  ===========



       Primary egress AC: CE2-PE2

SB> Egress is PE2-CE2 - it is a vector.

       Backup ingress AC: CE1-PE3

       Backup ingress PE: PE3

       Backup PW: PW2

       Backup egress PE: PE4

       Backup egress AC: CE2-PE4

SB> As above


    Based on this schema, this document describes egress endpoint
    failures and the fast protection mechanism on the per-active-path and
    per-direction basis.  In this case, an egress AC failure refers to
    the failure of the AC CE2-PE2, and an egress node failure refers to

SB> Shouldn't that be PE2-CE2 - that is the egress packet direction.

    the failure of PE2.  The ultimate goal is that when a failure occurs,
    the traffic should be locally repaired, so that it can eventually
    reach CE2 via the backup egress PE (PE4) and the backup egress AC
    (CE2-PE4).

SB> Again, PE4-CE2 would be a more natural direction for an egress AC.

    Subsequent to the local repair, either the active path should heal
    after control plane converges on the new topology, or the ingress CE
    should switch traffic from the primary path to the backup path,
    depending on the failure scenario.  In the later case, the ingress CE
    may perform such switchover based on end-to-end OAM (in-band or out-
    band), PW status notification, CE-PE control protocols (e.g.  LACP),
    etc.  In the active-standby mode, this will promote the standby path
    to new active path.  In the active-active mode, it will make the
    other active path carry all the traffic.

SB> Surely in active-active it was already carrying all the traffic?

========


    In this document, the following primary and backup roles are assigned
    for the traffic going from CE1 to CE2:

       Primary ingress AC: CE1-TPE1

       Primary ingress T-PE: TPE1

       Primary PW: PW1

       Primary S-PE: SPE1

       Primary egress T-PE: TPE2

       Primary egress AC: CE2-TPE2

SB> Same comment as for SS-PW concerning directionality

      Backup ingress AC: CE1-TPE3

       Backup ingress T-PE: TPE3

       Backup PW: PW2

       Backup S-PE: SPE2

       Backup egress T-PE: TPE4

       Backup egress AC: CE2-TPE4

SB> Same comment as for SS-PW concerning directionality

  =========


4.1.  Applicability

    The mechanism is applicable to LDP signaled PWs.  It is applicable to
    an environment where an egress CE is multi-homed to a primary PE and
    a backup PE and there exists a backup PW in the network.  In S-PE
    node protection, it also assumes a backup S-PE on the backup PW.

    The mechanism assumes IP/MPLS transport tunnels for PWs.  If
    transport tunnels are LDP and there is a possibility of EMCP to a
    primary PE, it is recommended to enable control word for PWs.

SB> Suggest: it is recommended that the PW control word (CW) is used.

     Imagine a scenario where an LDP tunnel traverse a router with ECMP to
    the primary PE, and the ECMP includes a direct link to the primary
    PE.  If a PW does not have control word, its traffic may be forwarded

SB> Suggest: If a PW is not using the CW...

    in a load balance fashion over multiple branches of the ECMP,
    including this link.  When the link fails, the router will treat it
    as an egress PE failure and reroute the portion of traffic traversing
    the link.  Meanwhile, the rest of traffic will remain on the other
    ECMP branches to the primary PE.  This will create a situation where
    the egress CE will receive traffic from both primary PE and backup
    PE, which may not be desirable for a service sensitive to packet
    misordering.

SB> I am not sure yet how you are detecting failure, but if it is VCCV
SB> then ECMP may mean that you get the CC packet but do not get all the
SB> data traffic.
SB>
SB> You should also consider the FAT PW case.
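
The split-delivery scenario in the quoted paragraph can be illustrated with a
minimal sketch. It assumes a toy ECMP hash (not any real router's algorithm),
and the names ecmp_branch, delivering_pe, pes_seen_by_ce, "direct" and "via-P"
are hypothetical. It assumes that without the CW some payload bits enter the
hash, and that with the CW only the label stack is hashed.

```python
def ecmp_branch(pw_label, payload_hash, use_cw, branches):
    """Pick an ECMP branch for one packet (illustrative hash, not a real router's)."""
    # Assumption: with the CW present the router hashes on the label stack only,
    # so every packet of the PW lands on the same branch. Without the CW,
    # payload bits are assumed to leak into the hash and spread the PW.
    key = pw_label if use_cw else pw_label ^ payload_hash
    return branches[key % len(branches)]


def delivering_pe(branch, failed_link, primary="PE2", backup="PE4"):
    """After the link fails, only traffic hashed onto that link is rerouted."""
    return backup if branch == failed_link else primary


def pes_seen_by_ce(pw_label, payload_hashes, use_cw, branches, failed_link):
    """Set of PEs from which the egress CE receives this PW's traffic."""
    return {
        delivering_pe(ecmp_branch(pw_label, h, use_cw, branches), failed_link)
        for h in payload_hashes
    }
```

Under these assumptions, a PW without the CW is delivered by both PEs once the
direct link fails, while a PW with the CW is delivered by exactly one of them.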

    The mechanism is also assumed to be used in conjunction with global
    repair and control plane repair, in such a manner that the mechanism
    temporarily repairs traffic by using a bypass tunnel, and global
    repair and control plane repair eventually move traffic to a fully
    functional path.

=========



    A PLR can realize its role based on configuration or the signaling of
    transport tunnel.  For example, in the case where the transport
    tunnel is signaled by RSVP, the penultimate hop router can realize
    that it is the PLR for egress (T-)PE or S-PE failure based on the RRO
    in Resv message, which should indicate that the router is one hop
    away from the PE.  The detail of how this could be achieved on a per-
    protocol basis is out of the scope of this document.

SB> I don't see how the above works, particularly in the LDP case, which
SB> from the earlier text seems to be in scope.
SB> P4 is just a P router and knows nothing about the traffic it carries
SB> over its LSP. How does it know to send this traffic to SPE2, but
SB> other FRR traffic to some other node to try to bypass the failed
SB> link/node?


    In all scenarios, when a PLR reroutes traffic through a bypass tunnel
    to a protector during local repair, it MUST keep the label of the
    primary PW intact in the packets.  This obviates the need for the PLR
    to maintain bypass routes on a per-PW basis, and allows a bypass
    tunnel to be shared by multiple PWs.

    The procedure also requires that the protector SHOULD be able to

SB> Surely that is a MUST?

    forward the traffic based on a PW label that is assigned by the
    primary PE, and ensure the traffic to eventually reach the target CE.

SB> Suggest: and ensure that the traffic eventually reaches the target CE.

    From the protector's perspective, this PW label is an upstream
    assigned label (RFC 5331).  To accomplish this, the protector SHOULD
SB> [RFC5331] is the normal reference style in an RFC.
SB> Again, surely it is a MUST?

    learning the PW label from the primary PE prior to the failure, and
SB> s/learning/learn/

    install proper forwarding state for the PW label in a dedicated label
    space associated with the primary PE.  During local repair, the
    protector SHOULD perform PW label lookup in this label space.

SB> Surely s/SHOULD/MUST/
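
To show why the lookup in the dedicated label space looks like a MUST rather
than a SHOULD, here is a minimal sketch of a protector that keeps one label
table per primary PE. The class Protector, its methods, and the AC names are
hypothetical and are not taken from the draft. The point is only that the same
PW label value may be assigned by two different primary PEs.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Protector:
    # One label table per primary PE (hypothetical structure): the PW label
    # is assigned by the primary PE, so it is only unique within that PE's space.
    label_spaces: Dict[str, Dict[int, str]] = field(default_factory=dict)

    def learn(self, primary_pe: str, pw_label: int, backup_ac: str) -> None:
        """Install forwarding state learned from the primary PE before any failure."""
        self.label_spaces.setdefault(primary_pe, {})[pw_label] = backup_ac

    def forward(self, primary_pe: str, pw_label: int) -> Optional[str]:
        """During local repair, look the PW label up in that primary PE's space only."""
        return self.label_spaces.get(primary_pe, {}).get(pw_label)
```

If the protector did not learn the label beforehand, or looked it up in the
wrong space, forward() would return nothing or the wrong AC, which is the
"system would break" case.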

============


4.3.1.  Semantics

    The semantics of a context identifier is twofold.

    o  It identifies a primary PE and an associated protector.  In other
       words, it identifies a primary PE on a per protector basis.  A
       given primary PE may be protected by multiple protectors, each for
       a subset of the primary PWs terminated on the primary PE.  A
       distinct context identifier MUST be assigned to the primary PE and
       each protector.

       For each primary PW, its ingress PE MUST set up or resolve a
       transport tunnel with destination as the context identifier of the
       {primary PE, protector}, rather than a private IP address of the
       primary PE.  This not only allows the transport tunnel to reach
       the primary PE, but also conveys the identity of the protector to
       the PLR(s) along the transport tunnel.  Each PLR can in turn use
       this information to set up a bypass tunnel to the protector
       without relying on local configuration.

    o  It indentifies the primary PE's label space on the protector.  The
       protector may protect PWs for multiple primary PEs.  For each
       primary PE, it MUST maintain a separate label space to store the

SB> What is "it"? Surely not the context identifier, which is the subject
SB> of this section - a context identifier is an identifier, not an object
SB> that holds a label space.


========


4.4.2.  Centralized Protector

    In this model, the protector is a dedicated P router or PE router
    that serves the role.  In egress AC protection and egress PE node
    protection, the protector MAY or MAY NOT be a backup PE with a direct

SB> Nits objects to the "MAY or MAY NOT", and in any case it is not
SB> RFC2119 material, since you are stating a fact, not giving permission
SB> to the implementer.

    connection to the target CE.  In S-PE node protection, the protector
    MAY or MAY NOT be a backup S-PE on the backup PW.

SB> As above

==========

6.1.  Egress Protection Capability TLV

    A protector MUST advertise the Egress Protection Capability TLV in
    its Initialization message and Capability message, over the LDP
    session with a primary PE.  In the centralized protector model, the
    protector MUST also advertise the TLV over the LDP session with a
    backup PE.  The TLV carries one or multiple context identifiers.  To



Yimin Shen, et al.      Expires November 7, 2015               [Page 21]

Internet-Draft     PW Endpoint Fast Failure Protection          May 2015


    the primary PE, the TLV SHOULD carry the context identifier of the
    {primary PE, protector}. In the centralized protector model, the TLV
    SHOULD carry to the backup PE multiple context identifiers, one for
    each {primary PE, protector} where the backup PE serves as a backup
    for the primary PE.  This TLV SHOULD NOT be advertised by the primary
    PE or the backup PE to the protector.


SB> Shouldn't the SHOULDs all be MUSTs? The same applies to the following text.


    The processing of the Egress Protection Capability TLV by a receiving
    router SHOULD follow the procedures defined in RFC 5561.  In
    particular, the router SHOULD advertise PW information to the
    protector by using the Protection FEC Element TLV, only after it has
    received the Egress Protection Capability TLV from the protector.  It
    SHOULD validate each context identifier included in the TLV, and
    advertise the information of only those PWs that are associated with
    the context identifier.  It SHOULD withdraw previously advertised
    Protection FEC TLVs, when the protector has withdrawn a previously
    advertised context identifier or the entire Egress Protection
    Capability TLV via Capability message.
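
The ordering described in the quoted paragraph (advertise only after the
capability is received, validate each context identifier, withdraw on
withdrawal) can be sketched as below. This is an illustration under stated
assumptions, not LDP code: the class PwAdvertiser, its methods, and the
"ctx1"-style identifiers are hypothetical, and message encoding is ignored.

```python
class PwAdvertiser:
    """Router-side view of one LDP session with a protector (illustrative only)."""

    def __init__(self, pws_by_context):
        self.pws_by_context = pws_by_context  # context id -> list of PW names
        self.advertised = {}                  # context id -> PWs currently advertised

    def on_capability(self, context_ids):
        """Capability TLV received: advertise PWs only for contexts we recognize."""
        for cid in context_ids:
            if cid in self.pws_by_context:    # validation step
                self.advertised[cid] = list(self.pws_by_context[cid])

    def on_withdraw(self, context_ids=None):
        """Withdraw the listed contexts, or everything if the whole TLV is withdrawn."""
        targets = list(self.advertised) if context_ids is None else context_ids
        for cid in targets:
            self.advertised.pop(cid, None)
```

Each step here is one whose omission leaves the protector with missing or stale
PW state, which is the argument for MUST in this text.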

==========

  7.  IANA Considerations

    This document defines the encoding of the Capability Parameter TLV
    for the new "Egress Protection Capability" in Section 6.  This would
    require IANA to assign a TLV Code Point to it.

SB> You need to specify this exactly as IANA would lay it out
SB> in their registry.

     This document defines a new LDP Protection FEC Element TLV in
    Section 6.  IANA has assigned the type value 0x83 to it.

SB> You should specify the registry in which IANA has made this assignment.


  8.  Security Considerations

    The security considerations discussed in RFC 5036, RFC 5331, RFC
    3209, and RFC 4090 apply to this document.


SB> Are you sure that there are no new security considerations?
SB> If there are none, state that explicitly.

========

- Stewart