Re: [Teas] Alia Atlas' Discuss on draft-ietf-teas-gmpls-lsp-fastreroute-10: (with DISCUSS and COMMENT)

Hi Alia,

Thank you for the detailed review of the document and for your comments. Please see my replies inline, marked with <RG>.

On 2017-08-02, 3:00 PM, "Teas on behalf of Alia Atlas" <teas-bounces@ietf.org on behalf of akatlas@gmail.com> wrote:

    Alia Atlas has entered the following ballot position for
    draft-ietf-teas-gmpls-lsp-fastreroute-10: Discuss

    When responding, please keep the subject line intact and reply to all
    email addresses included in the To and CC lines. (Feel free to cut this
    introductory paragraph, however.)

    Please refer to https://www.ietf.org/iesg/statement/discuss-criteria.html
    for more information about IESG DISCUSS and COMMENT positions.

    The document, along with other ballot positions, can be found here:
    https://datatracker.ietf.org/doc/draft-ietf-teas-gmpls-lsp-fastreroute/

    ----------------------------------------------------------------------
    DISCUSS:
    ----------------------------------------------------------------------

    1) In Sec 4.5.1: "The downstream PLR can assign a bypass tunnel when
       processing the first Path message of the protected LSP, however, it
       can not update the forwarding plane until it receives the Resv
       message containing the downstream MP label."

    Please explain how the downstream PLR can assign a bypass tunnel if the LSP
    has a loose ERO - so the downstream PLR does not know the next-next-hop that
    would be the MP for a node-protecting LSP.

<RG> We will update this sentence by prefixing it with “With the exception of the ABR node protection case, where the bypass tunnel starts and ends in different domains,”

    2) Sec 4.5.1: "An upstream PLR (downstream MP) SHOULD check all BYPASS_ASSIGNMENT
       subobjects in the Path RRO in order to assign a reverse bypass
       tunnel.  The upstream PLR that detects a BYPASS_ASSIGNMENT subobject,
       selects a reverse bypass tunnel that terminates locally with the
       destination address and tunnel-ID from the subobject, and has a
       source address matching the Node-ID address."

    This isn't very clear - particularly given that there will be many
    BYPASS_ASSIGNMENT subobjects in the path RRO.  The case of BYPASS_ASSIGNMENT
    sub-objects being removed or changed is not addressed at all.  In addition, I
    *assume* that the failure to treat the destination IP address in the
    BYPASS_ASSIGNMENT as the source IP address for the upstream Bypass tunnel is an
    oversight?

    I believe that what is meant  is:

    "An upstream PLR (downstream MP) SHOULD check all BYPASS_ASSIGNMENT sub-objects
    in the Path RRO to see if the destination IP address in the BYPASS_ASSIGNMENT
    matches an address of the upstream PLR.  For each BYPASS_ASSIGNMENT sub-object
    that matches, the upstream PLR looks for a local bypass tunnel that has a
    destination matching the downstream PLR that inserted the BYPASS_ASSIGNMENT, as
    indicated by the Node-ID address, and the same tunnel-ID as indicated in the
    BYPASS_ASSIGNMENT."

<RG> Your suggested text looks good.
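
The matching procedure in the suggested text can be sketched as follows. This is a minimal illustration only, not text from the draft: the class names, field names, and the function select_reverse_bypass are hypothetical, and it assumes the Node-ID address of the downstream PLR has already been associated with each BYPASS_ASSIGNMENT subobject while parsing the Path RRO.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Set


@dataclass(frozen=True)
class BypassAssignment:
    """Hypothetical view of one BYPASS_ASSIGNMENT subobject from the Path RRO."""
    node_id: str      # Node-ID address of the downstream PLR that inserted it (assumed pre-associated)
    destination: str  # bypass tunnel destination address carried in the subobject
    tunnel_id: int    # bypass tunnel-ID carried in the subobject


@dataclass(frozen=True)
class BypassTunnel:
    """Hypothetical view of a bypass tunnel that originates on the local node."""
    source: str
    destination: str
    tunnel_id: int


def select_reverse_bypass(
    local_addresses: Set[str],
    subobjects: Iterable[BypassAssignment],
    local_tunnels: Iterable[BypassTunnel],
) -> Optional[BypassTunnel]:
    """Return the reverse bypass tunnel for the first matching subobject, if any."""
    tunnels = list(local_tunnels)
    for sub in subobjects:
        # Only subobjects whose destination is one of our own addresses apply to us.
        if sub.destination not in local_addresses:
            continue
        for tunnel in tunnels:
            # The reverse tunnel must end at the downstream PLR and reuse the tunnel-ID.
            if tunnel.destination == sub.node_id and tunnel.tunnel_id == sub.tunnel_id:
                return tunnel
    return None
```

Returning None here stands for "no usable reverse bypass tunnel"; what the upstream PLR does in that case is covered by Sec 4.5.3 and is not modeled in this sketch.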

    I recall that tunnel-ID is usually scoped by the address of the ingress LSR;
    this seems to assume that the same tunnel-ID is provided to both the downstream
    PLR and upstream PLR??? Alternately, I am misunderstanding - and the
    information in the BYPASS_ASSIGNMENT is really intended to be bypass tunnel to
    be used by the upstream PLR, which the downstream PLR somehow(??details, hints
    in the document please) knows .

<RG> The PLR adds the FEC of the bypass tunnel (Source/Destination/tunnel-ID). The MP uses the FEC for lookup.

    Then there needs to be text to handle the case where the previous PATH message
    contained a particular BYPASS_ASSIGNMENT sub-object and that sub-object has
    been removed or changed.

<RG> Yes, we can add a sentence to clarify.

    3) Sec 4.5.3: "In both examples above, the upstream PLR SHOULD send a Notify message
       [RFC3473] with Error-code - FRR Bypass Assignment Error (value: TBA1)
       and Sub-code - Bypass Assignment Cannot Be Used (value: TBA2) to the
       downstream PLR to indicate that it cannot use the bypass tunnel
       assignment in the reverse direction.  Upon receiving this error, the
       downstream PLR MAY remove the bypass tunnel assignment and select an
       alternate bypass tunnel if one available."

    This section is problematic because it creates the use of local policy when the
    ingress has a clear way to signal what type of protection is desired and
    because it provides an error message to where it will only cause pointless
    churn (the MP is the MP based on the type of protection desired - certainly for
    bypass) rather than to the ingress where it could at least be acted upon.  The
    dynamics at time of failure also do not seem to be well considered; asymmetry
    is unfortunate, but worse is lack of protection.

    Consider the case in Example 1.  If R5 suffers a node failure, then there is no
    protection for the upstream LSP from R6 if it prefers the link protection.  It
    simply doesn't matter what bypass tunnel R4 picks! Sending a Notify message to
    R4 asking for a different tunnel is not productive.  If the ingress has
    requested node-protection, then there is simply nothing that can be done for
    this topology by R5.  It could be helpful to send a Notify to the ingress or
    have a flag set in the RESV RRO to indicate the issue, but that's about it.

    For the question about creating local policy, how are the SESSION_ATTRIBUTES
    used?  Obviously, they are available in the PATH message that has the
    BYPASS_ASSIGNMENTs.  Why would the "Node Protection Desired" flag not be
    relevant here?

<RG> The document highlights the issue that can occur when the PLR and MP nodes have different local policies; to avoid it, they should not be provisioned that way :)
<RG> We can add a sentence to indicate that the SESSION_ATTRIBUTE flags are carried in both the forward and reverse directions and can be used by the PLR and MP nodes when their local policies differ.

    4) Sec 5: "   o  Upstream PLR reroutes traffic upon detecting the link failure or
          upon receiving RSVP Path message over the bidirectional bypass
          tunnel.

       o  Upstream PLR also reroutes RSVP Resv signaling after receiving
          RSVP Path message over the bidirectional bypass tunnel. "

    How does the upstream PLR detect that the message was received over the bypass
    tunnel?  Is the assumption that the bypass LSP doesn't do penultimate hop
    popping? Is the assumption that the PLR can tell because RSVP indicates the
    downstream PLR as the previous hop in its signaling?  Please clarify and
    describe how this detection is done - to ease interoperability.

<RG> RFC 4090 describes what changes when the primary LSP messages are sent over the bypass tunnel. This document requires no change to that processing for the MP to detect FRR.
<RG> We could add “using the procedure defined in [RFC4090]”.

    5) In Sec 5.1.2:  "When upstream PLR R4 receives the protected LSP Path messages over
          the restored link, if not already done, it starts sending Resv
          messages and traffic flow of the protected LSP over the restored
          link and stops sending them over the bypass tunnel."

    Is there a reason that "when the downstream PLR receives the protected LSP RESV
    messages over the restored link, if not already done, it starts sending Path
    messages and traffic flow of the protected LSP over the restored link and stops
    sending them over the bypass tunnel." doesn't also make sense to put in this
    section?

    If this is not a good idea, please explain clearly the issues that it causes.

<RG> This was updated in the last revision to keep the processing symmetric – before FRR and for restoration after FRR.

    I am assuming that "after the link is restored" implies that bidirectional
    communication has been successfully tested - not merely that the physical layer
    is up but also that an IGP or BFD is successful across it. (But this is
    standard for RSVP-TE FRR).

    6) Sec 5.2.2: The behavior of R4 is not described.  When the link from R3-R4
    fails, R4 will redirect traffic to R2. As written at the start of Sec 5, R4
    does not start sending its Resv across the bypass tunnel and R2 is thus not
    triggered to use its bypass tunnel.  Please clearly describe this and why.  It
    is this asymmetry in behavior for the downstream PLR and upstream PLR that
    causes the downstream PLR's bypass tunnel to be prioritized.

<RG> R4 is not involved in the re-corouting phase. It performs normal FRR processing (e.g., Section 5.1.1).

    7) Sec 5.2.2: The need for the PRR to look up the bypass tunnel and then
    reprogram the forwarding plane is quite concerning for having this operate at
    significant scale.  What could be done if one assumes that the selected bypass
    tunnel - from the BYPASS_PROTECTION handling - is used?  Is there a reason that
    decision has to be redone here? What is the issue that the solution is trying
    to work around?   I can certainly imagine scenarios with BFD sessions so that
    the PRR can be rapidly failed over as the result of the BFD session going down.
     What scale of LSPs are you expecting this scenario to handle?

<RG> This is not a “normal” case.

    8) Sec 5.2.2: Given that the PRR will TEAR DOWN the LSP if it can't find a
    matching bypass tunnel, it would be quite useful for the ingress to have
    visibility as to the protection available.  In RFC 4090, Sec 4.4 defines both
    "local protection available" and "local protection in use" flags in the
    IPv4/IPv6 sub-objects.  Clearly, that isn't sufficient for the co-routed case
    because the ingress needs to know also that "local upstream protection
    available" and perhaps "local upstream protection in use".

<RG> Yes, these flags are definitely used, see Section 4.4.

    9) Sec 5.2.3: "   o  The upstream PLR R4 starts sending the traffic flow of the
          protected LSP over the restored link towards downstream PLR R3 and
          forwarding the Path messages towards PRR R5 and stops sending the
          traffic over the bypass tunnel.

       o  When upstream PLR R4 receives the protected LSP Path messages over
          the restored link, if not already done, it starts sending Resv
          messages and traffic flow over the restored link towards
          downstream PLR R3 and forwarding the Path messages towards PRR R5
          and stops sending them over the bypass tunnel."

    In the referenced figures, R4 is NOT an upstream PLR; that is R5.  R4 could
    have forgotten all state associated with the bi-directional LSP.   Please fix
    the text to actually describe the behavior.

<RG> R4 is the node where the restored link is detected in Figure 3, so it performs the upstream PLR processing for the link restoration case.

    10) Sec 5.3: "   Unidirectional link failures can result in the traffic flowing on
       asymmetric paths in the forward and reverse directions.  In addition,
       unidirectional link failures can cause RSVP soft-state timeout in the
       control-plane in some cases.  As an example, if the unidirectional
       link failure is in the upstream direction (from R4 to R3 in Figures 1
       and 2), the downstream PLR (node R3) can stop receiving the Resv
       messages of the protected LSP from the upstream PLR (node R4 in
       Figures 1 and 2) and this can cause RSVP soft-state timeout to occur
       on the downstream PLR (node R3)."

    Is the assumption that there is no IGP or BFD running on the link? If not, then
    the IGP or BFD session will go down on the link first, making it unavailable to
    RSVP-TE and should trigger the fast-reroute.

    Also - given this issue, why does the upstream MP not start using the bypass
    tunnel when receiving Resv through a bypass tunnel? There is no explanation in
    the draft and there should be - to prevent incorrect "optimizations".  Ideally,
    the draft would specify something like MUST NOT or SHOULD NOT with explanation
    - if that is the case.

<RG> GMPLS signaling has a master/slave model: the forward direction is always the master and the reverse direction is the slave. This avoids oscillations that would occur if the two sides started making independent decisions.

    11) Sec 7.1: The description for the BYPASS_ASSIGNMENT completely fails to be
    clear as to whether the contents are for the bypass tunnel used by the node
    inserting it into the RRO or whether the contents are a direction for the node
    that receives it - based on the Node ID that is included.

<RG> The node inserts the FEC of the bypass tunnel that it assigns locally; the MP then uses that FEC for lookup.

    ----------------------------------------------------------------------
    COMMENT:
    ----------------------------------------------------------------------

    a) Sec 5.2.2.1: The approach suggested here seems fairly intensive from a
    forwarding plane perspective.  It would be very helpful to indicate the range
    of expected/desired time for the fail-over.

<RG> This is the same as the control-plane case, except that FRR on the MP side is detected by the data plane.

    b) Sec 5.2:  This section is about node failures - but while the bypass tunnels
    are node-protecting, the failures discussed are only link.  A brief example
    that describes the expected signaling for an actual node failure would be
    helpful.

<RG> There should be no difference in processing whether a link or a node fails, as long as the bypass tunnel is a next-next-hop tunnel.

Thanks,
Rakesh
