Re: AD review comments on draft-ietf-rtgwg-remote-lfa-08

Stewart Bryant <stbryant@cisco.com> Tue, 16 December 2014 17:57 UTC

Message-ID: <54907286.7040703@cisco.com>
Date: Tue, 16 Dec 2014 17:57:26 +0000
From: Stewart Bryant <stbryant@cisco.com>
To: Alia Atlas <akatlas@gmail.com>, "draft-ietf-rtgwg-remote-lfa@tools.ietf.org" <draft-ietf-rtgwg-remote-lfa@tools.ietf.org>
Subject: Re: AD review comments on draft-ietf-rtgwg-remote-lfa-08
References: <CAG4d1red0acjgUbsAuss2L+WNdnAUq5Uynv_MzV8DccvTtDPnA@mail.gmail.com>
In-Reply-To: <CAG4d1red0acjgUbsAuss2L+WNdnAUq5Uynv_MzV8DccvTtDPnA@mail.gmail.com>
Content-Type: multipart/alternative; boundary="------------070804020404070009070102"
Archived-At: http://mailarchive.ietf.org/arch/msg/rtgwg/9_kWO7FP-NB4sYv_mFt65_f3ygo
Cc: "rtgwg@ietf.org" <rtgwg@ietf.org>

On 12/12/2014 00:00, Alia Atlas wrote:

Alia, thank you for your review.

Here are my responses and the changes made in -09.
> Minor Comments:
>
> 1) In Sec 2, 3rd paragraph, in the sentence:
> "The single node in both S's P-space and E's Q-space is C; thus node C 
> is selected as the repair tunnel's end-point."
> it should be "S's extended P-space"
Correct - changed
>
> 2) In Sec 2, it says: "The non-failure traffic distribution is not 
> disrupted by the provision of such a tunnel since it is only used for 
> repair traffic and MUST NOT be used for normal traffic."
> This is obviously correct and good - but I think it would be very 
> useful to clarify that OAM traffic to test the rLFA may transit the 
> tunnel at any time.  Otherwise, the MUST NOT could cause some 
> confusion - depending on how one thinks about "normal traffic".
This now says:

The non-failure traffic distribution is not disrupted by the provision 
of such a tunnel since it is only used for repair traffic and MUST NOT 
be used for normal traffic. Note that OAM traffic specifically to verify 
the viability of the repair MAY traverse the tunnel prior to a failure.

I used viability rather than for example "availability" to cover any 
form of OAM test (CC, CV, delay, jitter.....)

I toyed with saying "normal data traffic" and not adding the OAM 
sentence, but that would have allowed routing and network management 
traffic (other than OAM) which we also need to exclude.

>
> 3) In Sec 3:  I can't parse "Examples of worse failures are node 
> failures (see Section 6 ), and through the failure of a shared risk 
> link group (SRLG), the through the independent concurrent failure of 
> multiple links, and these are out of scope for this specification."
>
> I think you mean "Examples of worse failures are node failures (see 
> Section 6), the failure of a shared risk link group (SRLG), the 
> independent concurrent failures of multiple links; protecting against 
> such worse failures is out of scope for this specification."  I would 
> add in the failure of broadcast interfaces and NBMA interfaces for 
> completeness, even though that was mentioned in Sec 2.
This now says:

Examples of worse failures are node failures (see Section 6), the 
failure of a shared risk link group (SRLG), the independent concurrent 
failures of multiple links, broadcast or non-broadcast multi-access 
(NBMA) links [Section 2]; protecting against such worse failures is out 
of scope for this specification.

>
> 4) In Sec 4.2: "Provided both these requirements are met, packets 
> forwarded over the repair tunnel will reach their destination and will 
> not loop."  Please change to:
> "will not loop after the single link failure".  Of course, looping can 
> happen if a worse failure than protected against occurs - as with 
> LFA.  This could also be mitigated by requiring that the PQ node is 
> downstream of the PLR, as  is mentioned in Sec 4.2.2.
Correct

This now says:

Provided both these requirements are met, packets forwarded over the 
repair tunnel will reach their destination, and will not loop after a 
single link failure.
>
> 5) In Sec 4.2.1.2: "This may be calculated by 
> computing an SPT at each of S's neighbors (excluding E) and excising 
> the subtree reached via the path N->S->E."
> As described here, a node Y that is reached via N->S->A would be 
> considered to be in S's extended P-space.  I realize that one would 
> assume that Y would be in S's P-space anyway and thus it is safe to 
> not care about this edge case.  However, the ECMP considerations make 
> it more complex so please at a minimum add in the same caveat as in 
> Sec 4.2.1.2  "(including those routers which are members of an ECMP 
> that includes link S-E)" suitably modified.  In the cost-based version 
> in Compute_Extended_P_Space, this is handled by ignoring any potential 
> node from N whose shortest path goes back through S.  It'd be nice if 
> the two methods were consistent.
I have changed the text to:

This may be calculated by computing an SPT at each of S's neighbors 
(excluding E) and excising the subtree reached via the path N->S->E. 
Note that this will excise those routers which are reachable through all 
ECMPs that include link S-E.

I am not sure that this clarification is strictly needed since "removal 
of the subtree reached via the path N->S->E" would include "those 
routers which are members of any ECMP that includes link S-E".

Would it be less confusing if we changed "excising the subtree reached" 
to "excising the routers reached"?
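
To make the calculation concrete, here is a minimal sketch of how the 
spaces could be computed from shortest-path costs. It is not taken from 
the draft. The topology (a six-node unit-cost ring), the function names 
and the use of symmetric link costs are all assumptions made for 
illustration. A router is excised as soon as any equal-cost shortest 
path crosses S->E, which is my reading of the ECMP caveat discussed 
above.

```python
import heapq

def shortest_costs(graph, root):
    """Dijkstra: cost of the shortest path from root to every reachable node."""
    dist = {root: 0}
    heap = [(0, root)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale heap entry
        for v, cost in graph[u].items():
            if d + cost < dist.get(v, float("inf")):
                dist[v] = d + cost
                heapq.heappush(heap, (d + cost, v))
    return dist

def p_space(graph, root, s, e):
    """Nodes root reaches with no shortest path (any ECMP member) over S->E."""
    d_root = shortest_costs(graph, root)
    d_e = shortest_costs(graph, e)

    def via_se(y):
        # Cost of the best path from root to y that is forced over link S->E.
        return d_root[s] + graph[s][e] + d_e[y]

    # Strict inequality: an equal-cost path over S->E also excises y.
    return {y for y in graph if y != s and d_root[y] < via_se(y)}

def extended_p_space(graph, s, e):
    """Union of the P-spaces of S's neighbors, excluding E."""
    result = set()
    for n in graph[s]:
        if n != e:
            result |= p_space(graph, n, s, e)
    return result

def q_space(graph, s, e):
    """Nodes whose shortest paths to E never cross S->E (symmetric costs)."""
    d_s = shortest_costs(graph, s)
    d_e = shortest_costs(graph, e)
    return {y for y in graph if y != s and d_e[y] < d_s[y] + graph[s][e]}

# Assumed topology: unit-cost ring S-A-B-C-D-E-S, protecting link S-E.
links = [("S", "A"), ("A", "B"), ("B", "C"), ("C", "D"), ("D", "E"), ("E", "S")]
graph = {}
for a, b in links:
    graph.setdefault(a, {})[b] = 1
    graph.setdefault(b, {})[a] = 1

p = p_space(graph, "S", "S", "E")
ext_p = extended_p_space(graph, "S", "E")
q = q_space(graph, "S", "E")
pq = ext_p & q
print(sorted(p), sorted(ext_p), sorted(q), sorted(pq))
```

On this assumed ring the sketch gives P-space {A, B}, extended P-space 
{A, B, C} and Q-space {C, D, E}, so C is the only PQ node, and only 
when the extended P-space is used.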
>
> 6) In Sec 4.2.2: "As described in [RFC5286], always selecting a PQ 
> node that is downstream with respect to the repairing node, prevents 
> the formation of loops when the failure is worse than expected." 
>  Could you clarify that the PQ node is downstream with respect to the 
> repairing node and the destination - rather than the proxy destination 
> E?  I'm fairly certain that the latter wouldn't work (but don't have 
> an example topology created).  If you disagree, let me know and I'll 
> work on creating one.   This is the constraint that is expressed in 
> Apply_Downstream_Constraint().
I don't think there is a problem in practice. If PQ needed to be 
downstream to E with respect to S, D_opt(PQ,E) < D_opt(S,E) would apply, 
and in a unit cost network there would be no PQ nodes since we would need 
D_opt(PQ,E) < 1, i.e. a link metric from PQ to E of less than one. PQ 
nodes would be so rare that this would not be a practical solution.

I have changed the text to:

As described in [RFC5286], always selecting a PQ node that is downstream 
to the destination with respect to the repairing node, prevents the 
formation of loops when the failure is worse than expected. The use of 
downstream nodes reduces the repair coverage, and operators are advised 
to determine whether adequate coverage is achieved before enabling this 
selection feature.
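
As an illustration of the coverage cost, here is a minimal sketch of 
the downstream check, keeping a PQ node for destination D only when 
D_opt(PQ,D) < D_opt(S,D). It is not the draft's 
Apply_Downstream_Constraint() text. The function name 
downstream_pq_nodes, the six-node unit-cost ring and the symmetric 
link costs are assumptions made for illustration.

```python
import heapq

def shortest_costs(graph, root):
    """Dijkstra: cost of the shortest path from root to every reachable node."""
    dist = {root: 0}
    heap = [(0, root)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale heap entry
        for v, cost in graph[u].items():
            if d + cost < dist.get(v, float("inf")):
                dist[v] = d + cost
                heapq.heappush(heap, (d + cost, v))
    return dist

def downstream_pq_nodes(graph, s, dest, pq_nodes):
    """Keep only PQ nodes strictly closer to dest than the repairing node S."""
    cost_from_s = shortest_costs(graph, s)[dest]
    # Costs are symmetric here, so costs from dest equal costs to dest.
    cost_to_dest = shortest_costs(graph, dest)
    return {pq for pq in pq_nodes if cost_to_dest[pq] < cost_from_s}

# Assumed topology: unit-cost ring S-A-B-C-D-E-S, protecting link S-E,
# with C as the only PQ node.
links = [("S", "A"), ("A", "B"), ("B", "C"), ("C", "D"), ("D", "E"), ("E", "S")]
graph = {}
for a, b in links:
    graph.setdefault(a, {})[b] = 1
    graph.setdefault(b, {})[a] = 1

kept_for_d = downstream_pq_nodes(graph, "S", "D", {"C"})
kept_for_e = downstream_pq_nodes(graph, "S", "E", {"C"})
print(sorted(kept_for_d), sorted(kept_for_e))
```

On this assumed ring, C is kept for destination D (cost 1 from C 
against cost 2 from S) but rejected for destination E (cost 2 from C 
against cost 1 from S). The second case is the unit-cost situation 
described above, where the check would need D_opt(PQ,E) < 1.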
>
> 7) In Sec 4.3: "The reader is referred to 
> [I-D.psarkar-rtgwg-rlfa-node-protection] for further information 
> on the use of RLFA for node repairs." Can you add "and broadcast or 
> NBMA link repairs"?   Do you feel that is accurate?
>
I cannot see any text on broadcast or NBMA in that draft, which is now 
draft-ietf-rtgwg-rlfa-node-protection (the reference is updated in the text).

I have made no text change on the substantive point.
> 8) In Sec 6: s/"When the failure is a node failure rather than a link 
> failure"/"When the failure is a node failure rather than a 
> point-to-point link failure"
Done
>
> 9) In Sec 6: "Alternatively one might choose to assume that the 
> probability of a node failure and microloops forming is sufficiently 
> rare that the case can be ignored."  Can you please clarify from 
> microloops to "microloops forming due to use of alternates"?  We know 
> that in cases where a rLFA is necessary, that neighbor isn't loop-free 
> and so regular microloops due to reconvergence will form.
>
It took a while to understand the comment but I think I know what you mean.

I have changed the text to:

Alternatively one might choose to assume that the probability of a node 
failure is sufficiently rare that the issue of looping RLFA repairs can 
be ignored.
> 10) In Sec 7: "In the absence of a protocol to learn the preferred IP 
> address for targeted LDP, an LSR should attempt a targeted LDP session 
> with the Router ID [RFC2328] [RFC5305] [RFC5340], unless it is 
> configured otherwise."  Can you please add in some text for how this 
> would work for IPv6?  I believe that there are current drafts 
> discussing carrying Routable IP addresses (e.g. 
> http://datatracker.ietf.org/doc/draft-ietf-ospf-routable-ip-address/ 
> ).  We know that there is interest in having IPv6 only networks with 
> MPLS - so it'd be good not to create new gaps.

It now says:

In the absence of a protocol to learn the preferred IP address for 
targeted LDP, an LSR should attempt a targeted LDP session with the 
Router ID [RFC2328] [RFC5305] [RFC5340] [RFC6119] 
[I-D.ietf-ospf-routable-ip-address], unless it is configured otherwise.

>
> 11) In Sec 8.4: "In an MPLS network, this is achieved without any 
> scaleability impact, as the tunnels to the PQ nodes are always present 
> as a property of an LDP-based deployment."  The targeted LDP sessions 
> don't have a scaleability impact?  That the repair tunnels don't need 
> to be specifically created as new tunnels, I agree with - but this 
> statement is overselling. Please make the technical point more clearly.
>
I have cut this back to:

As shown in the table, remote LFA provides close to 100% prefix 
protection against link failure in 11 of the 14 topologies studied, and 
provides a significant improvement in two of the remaining three cases. 
Note that in an MPLS network the tunnels to the PQ nodes are always 
present as a property of an LDP-based deployment.

> 12) In Sec 9:  I feel like here is a good place at least mention the 
> issues with microloops from reconvergence.  Since reconvergence after 
> rLFA is going to result in a local microloop (depending on timing), at 
> least a reference to 
> https://tools.ietf.org/html/draft-litkowski-rtgwg-uloop-delay-03 with 
> a recommendation to consider it is important. Otherwise, the rLFA 
> repair happens and then traffic microloops and is lost.  The fact that 
> these local microloops occur with real impact much more with rLFA (or 
> any advanced FRR technique) is an important management consideration.
I have added the following new para:

When the network re-converges, microloops [RFC5715] may form due to 
transient inconsistencies in the router FIBs. If it is determined that 
microloops are a significant issue in the deployment, then a suitable 
loop-free convergence method, such as one of those described in 
[RFC5715], [RFC6976] or [I-D.litkowski-rtgwg-uloop-delay], should be 
implemented.

>
> 13) Sec 12:  Saying "To prevent their use as an attack vector the 
> repair tunnel endpoints SHOULD be assigned from a set of addresses 
> that are not reachable from outside the routing domain." is basically 
> empty words without more behind Sec 7 default of using Router IDs.  
> Can you find a reference that talks about a BCP for Router IDs not 
> being reachable addresses outside the routing domain? Can you describe 
> how to use the IGP extensions?
Router IDs are used for T-LDP and normal MPLS security applies.

Again with MPLS repair tunnels normal MPLS security applies.

The Section 12 reference was to IP tunnels in an IP rather than MPLS 
network. I have changed the text to:

The security considerations of [RFC5286] also apply.

Targeted LDP sessions and MPLS tunnels are normal features of an MPLS 
network and their use in this application raises no additional security 
concerns.

To prevent their use as an attack vector, IP repair tunnel endpoints 
(where used) SHOULD be assigned from a set of addresses that are not 
reachable from outside the routing domain.

>
> Nits:
>
> a) In Sec 4.2.1.1: "The exclusion of routers 
> reachable via an ECMP that includes S-E prevents the forwarding 
> subsystem attempting to a repair endpoint via the failed link S-E."
> s/attempting to a repair/from attempting to use a repair
Done
>
> b) In Sec 10: "We propose "Remote LFA" as a natural second step." 
>  This is going to be an RFC - so rather than propose, try specify.
>
I have changed this to:

The purpose of LFA FRR technology is to provide for a simple FRR 
solution when such a solution is possible. The first step along this 
simplicity approach was "local" LFA [RFC5286]. This specification of 
"Remote LFA" is a natural second step.

Hopefully these resolutions are acceptable to all. If not, please let me 
know.

New version at http://datatracker.ietf.org/doc/draft-ietf-rtgwg-remote-lfa/

Diffs at 
http://www.ietf.org/rfcdiff?url1=draft-ietf-rtgwg-remote-lfa-08&difftype=--html&submit=Go!&url2=draft-ietf-rtgwg-remote-lfa-09

- Stewart