AD review comments on draft-ietf-rtgwg-remote-lfa-08

Alia Atlas <akatlas@gmail.com> Fri, 12 December 2014 00:00 UTC

Date: Thu, 11 Dec 2014 19:00:00 -0500
Message-ID: <CAG4d1red0acjgUbsAuss2L+WNdnAUq5Uynv_MzV8DccvTtDPnA@mail.gmail.com>
Subject: AD review comments on draft-ietf-rtgwg-remote-lfa-08
From: Alia Atlas <akatlas@gmail.com>
To: "draft-ietf-rtgwg-remote-lfa@tools.ietf.org" <draft-ietf-rtgwg-remote-lfa@tools.ietf.org>
Archived-At: http://mailarchive.ietf.org/arch/msg/rtgwg/yhgn0xFqJCoV_GQyAlo-XTK21d4
Cc: "rtgwg@ietf.org" <rtgwg@ietf.org>

I have done my usual AD review of this draft before progressing it.
Thanks for the hard work on this!

I have a number of different comments on the draft, as given below.
None of them are sufficient for me to be concerned about starting the
IETF Last Call, but please address them as soon as possible.

Assuming an updated draft appears and the IETF Last Call is quiet, I expect
this to be on the IESG telechat in early January.


Minor Comments:

1) In Sec 2, 3rd paragraph, in the sentence:
"The single node in both S's P-space and E's Q-space is C; thus node C is
selected as the repair tunnel's end-point."
it should be "S's extended P-space"
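
As an illustration of the distinction in comment 1, here is a minimal
sketch of how P-space, extended P-space and Q-space could be computed
from shortest-path costs.  The topology (a six-node ring with unit link
costs) and every function name are assumptions made for this example,
not taken from the draft; the strict inequalities are one way to exclude
nodes reached via an ECMP that includes link S-E.

```python
import heapq

# Assumed topology (hypothetical): six-node ring, every link cost 1.
#   S---E
#   |   |
#   A   D
#   |   |
#   B---C
LINKS = [("S", "E"), ("E", "D"), ("D", "C"), ("C", "B"), ("B", "A"), ("A", "S")]


def build_graph(links, cost=1):
    """Build a symmetric adjacency map {node: {neighbor: cost}}."""
    graph = {}
    for u, v in links:
        graph.setdefault(u, {})[v] = cost
        graph.setdefault(v, {})[u] = cost
    return graph


def dijkstra(graph, root):
    """Return the shortest-path cost from root to every reachable node."""
    dist = {root: 0}
    heap = [(0, root)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale heap entry
        for v, w in graph[u].items():
            if d + w < dist.get(v, float("inf")):
                dist[v] = d + w
                heapq.heappush(heap, (d + w, v))
    return dist


GRAPH = build_graph(LINKS)
DIST = {n: dijkstra(GRAPH, n) for n in GRAPH}  # DIST[x][y] = cost of x -> y


def p_space(root, s, e):
    """Nodes root reaches with no shortest path (ECMP included) over S->E."""
    return {y for y in GRAPH
            if DIST[root][y] < DIST[root][s] + GRAPH[s][e] + DIST[e][y]}


def extended_p_space(s, e):
    """Union of the P-spaces of S and of each neighbor of S other than E."""
    result = p_space(s, s, e)
    for n in GRAPH[s]:
        if n != e:
            result |= p_space(n, s, e)
    return result


def q_space(s, e):
    """Nodes that reach E with no shortest path (ECMP included) over S->E."""
    return {y for y in GRAPH if DIST[y][e] < DIST[y][s] + GRAPH[s][e]}


P = p_space("S", "S", "E")
EXT_P = extended_p_space("S", "E")
Q = q_space("S", "E")

print(sorted(P & Q))      # [] : plain P-space and Q-space do not intersect
print(sorted(EXT_P & Q))  # ['C'] : C qualifies only via extended P-space
```

On this assumed ring, C is equidistant from S via either direction, so
it falls outside S's plain P-space and only appears once the neighbor A
is used as the root.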

2) In Sec 2, it says: "The non-failure traffic distribution is not
disrupted by the provision of such a tunnel since it is only used for
repair traffic and MUST NOT be used for normal traffic."
This is obviously correct and good - but I think it would be very useful to
clarify that OAM traffic to test the rLFA may transit the tunnel at any
time.  Otherwise, the MUST NOT could cause some confusion - depending on
how one thinks about "normal traffic".

3) In Sec 3:  I can't parse "Examples of worse failures are node failures
(see Section 6 ), and through the failure of a shared risk link group
(SRLG), the through the independent concurrent failure of multiple links,
and these are out of scope for this specification."

I think you mean "Examples of worse failures are node failures (see Section
6), the failure of a shared risk link group (SRLG), and the independent
concurrent failures of multiple links; protecting against such worse
failures is out of scope for this specification."  I would also add the
failure of broadcast interfaces and NBMA interfaces for completeness, even
though that was mentioned in Sec 2.

4) In Sec 4.2: "Provided both these requirements are met, packets
forwarded over the repair tunnel will reach their destination and will not
loop."  Please change to:
"will not loop after the single link failure".  Of course, looping can
happen if a failure worse than the one protected against occurs - as with
LFA.  This could also be mitigated by requiring that the PQ node is
downstream of the PLR, as is mentioned in Sec 4.2.2.

5) In Sec 4.2.1.2: "This may be calculated by computing an SPT at each of
S's neighbors (excluding E) and excising the subtree reached via the path
N->S->E."
As described here, a node Y that is reached via N->S->A would be considered
to be in S's extended P-space.  I realize that one would assume that Y
would be in S's P-space anyway and thus it is safe to not care about this
edge case.  However, the ECMP considerations make it more complex so please
at a minimum add in the same caveat as in Sec 4.2.1.1 "(including those
routers which are members of an ECMP that includes link S-E)" suitably
modified.  In the cost-based version in Compute_Extended_P_Space, this is
handled by ignoring any potential node from N whose shortest path goes back
through S.  It'd be nice if the two methods were consistent.
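
The difference between the two methods in comment 5 can be sketched as
follows.  This is a hedged illustration only: the five-node topology,
the unit costs and both function names are hypothetical, and
cost_based_rule() paraphrases the behaviour described above (ignore any
node whose shortest path from N goes back through S) rather than
reproducing the draft's pseudocode.

```python
import heapq

# Hypothetical topology, every link cost 1: N and A hang off S, Y is behind A.
#   N---S---E
#       |
#       A---Y
LINKS = [("S", "E"), ("S", "N"), ("S", "A"), ("A", "Y")]


def build_graph(links, cost=1):
    """Build a symmetric adjacency map {node: {neighbor: cost}}."""
    graph = {}
    for u, v in links:
        graph.setdefault(u, {})[v] = cost
        graph.setdefault(v, {})[u] = cost
    return graph


def dijkstra(graph, root):
    """Return the shortest-path cost from root to every reachable node."""
    dist = {root: 0}
    heap = [(0, root)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale heap entry
        for v, w in graph[u].items():
            if d + w < dist.get(v, float("inf")):
                dist[v] = d + w
                heapq.heappush(heap, (d + w, v))
    return dist


GRAPH = build_graph(LINKS)
DIST = {n: dijkstra(GRAPH, n) for n in GRAPH}  # DIST[x][y] = cost of x -> y


def literal_subtree_rule(root, s, e):
    """Keep y unless some shortest path root->y (ECMP included) crosses S->E."""
    return {y for y in GRAPH
            if DIST[root][y] < DIST[root][s] + GRAPH[s][e] + DIST[e][y]}


def cost_based_rule(n, s):
    """Keep y only if no shortest path N->y (ECMP included) passes through S."""
    return {y for y in GRAPH if DIST[n][y] < DIST[n][s] + DIST[s][y]}


from_n_literal = literal_subtree_rule("N", "S", "E")
from_n_cost = cost_based_rule("N", "S")

print(sorted(from_n_literal))  # ['A', 'N', 'S', 'Y'] : Y kept via N->S->A
print(sorted(from_n_cost))     # ['N'] : everything reached through S dropped
```

In this sketch Y is also in S's own P-space, so the final union is the
same either way; the two rules differ only in what they attribute to
neighbor N.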

6) In Sec 4.2.2: "As described in [RFC5286], always selecting a PQ node
that is downstream with respect to the repairing node, prevents the
formation of loops when the failure is worse than expected."  Could you
clarify that the PQ node is downstream with respect to the repairing node
and the destination - rather than the proxy destination E?  I'm fairly
certain that the latter wouldn't work (but don't have an example topology
created).  If you disagree, let me know and I'll work on creating one.
This is the constraint that is expressed in Apply_Downstream_Constraint().
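
To make the two readings in comment 6 concrete, here is a minimal sketch
of a per-destination downstream check.  It assumes a six-node unit-cost
ring and a hypothetical helper is_downstream(); it only shows that
checking against the real destination and checking against the proxy
destination E are different tests, and does not attempt to construct
the looping topology mentioned above.

```python
import heapq

# Assumed topology (hypothetical): six-node ring, every link cost 1.
#   S---E
#   |   |
#   A   D
#   |   |
#   B---C
LINKS = [("S", "E"), ("E", "D"), ("D", "C"), ("C", "B"), ("B", "A"), ("A", "S")]


def build_graph(links, cost=1):
    """Build a symmetric adjacency map {node: {neighbor: cost}}."""
    graph = {}
    for u, v in links:
        graph.setdefault(u, {})[v] = cost
        graph.setdefault(v, {})[u] = cost
    return graph


def dijkstra(graph, root):
    """Return the shortest-path cost from root to every reachable node."""
    dist = {root: 0}
    heap = [(0, root)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale heap entry
        for v, w in graph[u].items():
            if d + w < dist.get(v, float("inf")):
                dist[v] = d + w
                heapq.heappush(heap, (d + w, v))
    return dist


GRAPH = build_graph(LINKS)
DIST = {n: dijkstra(GRAPH, n) for n in GRAPH}  # DIST[x][y] = cost of x -> y


def is_downstream(pq, s, target):
    """True if pq is strictly closer to target than the repairing node s is."""
    return DIST[pq][target] < DIST[s][target]


# S repairs traffic for destination D (normally sent over S-E) via PQ node C.
wrt_destination = is_downstream("C", "S", "D")  # cost 1 versus cost 2
wrt_proxy = is_downstream("C", "S", "E")        # cost 2 versus cost 1

print(wrt_destination, wrt_proxy)  # True False
```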

7) In Sec 4.3: "The reader is referred to
[I-D.psarkar-rtgwg-rlfa-node-protection]
for further information on the use of RLFA for node repairs." Can you add
"and broadcast or NBMA link repairs"?   Do you feel that is accurate?

8) In Sec 6: s/"When the failure is a node failure rather than a link
failure"/"When the failure is a node failure rather than a point-to-point
link failure"

9) In Sec 6: "Alternatively one might choose to assume that the probability
of a node failure and microloops forming is sufficiently rare that the case
can be ignored."  Can you please change "microloops" to "microloops
forming due to use of alternates"?  We know that in cases where a rLFA is
necessary, that neighbor isn't loop-free and so regular microloops due to
reconvergence will form.

10) In Sec 7: "In the absence of a protocol to learn the preferred IP
address for targeted LDP, an LSR should attempt a targeted LDP session with
the Router ID [RFC2328] [RFC5305] [RFC5340], unless it is configured
otherwise."
Can you please add in some text for how this would work for IPv6?  I
believe that there are current drafts discussing carrying Routable IP
addresses (e.g.
http://datatracker.ietf.org/doc/draft-ietf-ospf-routable-ip-address/ ).  We
know that there is interest in having IPv6 only networks with MPLS - so
it'd be good not to create new gaps.

11) In Sec 8.4: "In an MPLS network, this is achieved without any
scaleability impact, as the tunnels to the PQ nodes are always present as a
property of an LDP-based deployment."  The targeted LDP sessions don't have
a scaleability impact?  That the repair tunnels don't need to be
specifically created as new tunnels, I agree with - but this statement is
overselling.  Please make the technical point more clearly.

12) In Sec 9:  I feel this is a good place to at least mention the issues
with microloops from reconvergence.  Since reconvergence after rLFA is
going to result in a local microloop (depending on timing), at least a
reference to
https://tools.ietf.org/html/draft-litkowski-rtgwg-uloop-delay-03 with a
recommendation to consider it is important.  Otherwise, the rLFA repair
happens and then traffic microloops and is lost.  The fact that these local
microloops occur with real impact much more with rLFA (or any advanced FRR
technique) is an important management consideration.

13) Sec 12:  Saying "To prevent their use as an attack vector the repair
tunnel endpoints SHOULD be assigned from a set of addresses that are not
reachable from outside the routing domain." is basically empty words
without more behind the Sec 7 default of using Router IDs.  Can you find a
reference that talks about a BCP for Router IDs not being reachable
addresses outside the routing domain?  Can you describe how to use the IGP
extensions?

Nits:

a) In Sec 4.2.1.1: "The exclusion of routers reachable via an ECMP that
includes S-E prevents the forwarding subsystem attempting to a repair
endpoint via the failed link S-E."
s/attempting to a repair/from attempting to use a repair

b) In Sec 10: "We propose "Remote LFA" as a natural second step."  This is
going to be an RFC - so rather than "propose", try "specify".

Thanks,
Alia
