Gunter Van de Velde's Discuss on draft-ietf-rtgwg-segment-routing-ti-lfa-13: (with DISCUSS and COMMENT)

Gunter Van de Velde via Datatracker <noreply@ietf.org> Wed, 17 April 2024 12:04 UTC

From: Gunter Van de Velde via Datatracker <noreply@ietf.org>
To: The IESG <iesg@ietf.org>
Cc: draft-ietf-rtgwg-segment-routing-ti-lfa@ietf.org, rtgwg-chairs@ietf.org, rtgwg@ietf.org, stewart.bryant@gmail.com, stewart.bryant@gmail.com, gunter.van_de_velde@nokia.com
Subject: Gunter Van de Velde's Discuss on draft-ietf-rtgwg-segment-routing-ti-lfa-13: (with DISCUSS and COMMENT)
Reply-To: Gunter Van de Velde <gunter.van_de_velde@nokia.com>
Message-ID: <171335549592.57146.16093238893560291616@ietfa.amsl.com>
Date: Wed, 17 Apr 2024 05:04:55 -0700
Archived-At: <https://mailarchive.ietf.org/arch/msg/rtgwg/aekhtJHT4h8L732j4ORqyJXVa34>

Gunter Van de Velde has entered the following ballot position for
draft-ietf-rtgwg-segment-routing-ti-lfa-13: Discuss

When responding, please keep the subject line intact and reply to all
email addresses included in the To and CC lines. (Feel free to cut this
introductory paragraph, however.)


Please refer to https://www.ietf.org/about/groups/iesg/statements/handling-ballot-positions/ 
for more information about how to handle DISCUSS and COMMENT positions.


The document, along with other ballot positions, can be found here:
https://datatracker.ietf.org/doc/draft-ietf-rtgwg-segment-routing-ti-lfa/



----------------------------------------------------------------------
DISCUSS:
----------------------------------------------------------------------

# Gunter Van de Velde, RTG AD, comments for draft-ietf-rtgwg-segment-routing-ti-lfa-13

Please find below two blocking DISCUSS points (easy to address), and a series of
non-blocking COMMENTs and some nits.

Many thanks for the RTGDIR reviews from Stewart Bryant,
Andy Smith and Ben Niven-Jenkins during the seven-year development
period of the TI-LFA specification. Many thanks also for the shepherd
write-up by Stewart Bryant, which provides a brief overview of the
progress of the draft through the WG and the current state of the art.

Thank you to the authors of this document. I really appreciate the
effort and believe it captures the TI-LFA normative procedures well.
Reviewing it with fresh eyes, I've made several comments that could
help further improve the quality. I hope these insights will be
valuable for the authors and the Working Group as you continue
to refine the document.

DISCUSS:
========

DISCUSS#1
In Section 9, 'TI-LFA and SR algorithms', I found the text to be written from an
SR-MPLS perspective. SRv6 has different considerations.

637	   and Q-Space as well as the post-convergence path.  An implementation
638	   MUST only use Node-SIDs bound to the FlexAlgo and/or Adj-SIDs that
639	   are unprotected to build the repair list.

The above seems written from an SR-MPLS perspective. For SRv6, the Adj-SID is bound
to a Locator and consequently bound to an algorithm. As a result, the observed limitation
of SR-MPLS does not really apply to SRv6. For SRv6, an implementation can
use a protected Adj-SID in the repair path without breaking algorithm-aware
topology requirements. Consider allowing protected SRv6 Adj-SIDs for TI-LFA.

In addition, some text about Adj-SIDs and locators in
"Section 8.2.  SRv6 dataplane considerations" could be beneficial.
With SR-MPLS, an Adj-SID has no correlation to the segment routing algorithm, whereas
with the SRv6 dataplane the Adj-SID Locator is correlated to an algorithm.
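
A minimal Python sketch of the eligibility rule discussed in DISCUSS#1, under
stated assumptions: the `Sid` record, the `usable_in_repair_list` helper and the
dataplane strings are hypothetical names used for illustration and are not taken
from the draft. The SR-MPLS branch mirrors the quoted text (lines 637-639); the
SRv6 branch shows the relaxation being asked for, where an Adj-SID is accepted
when the algorithm of its Locator matches, whether or not the SID is protected.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Sid:
    """Hypothetical SID record; field names are illustrative only."""
    kind: str             # "node" or "adj"
    algo: Optional[int]   # algorithm the SID is bound to, None if not bound
    protected: bool       # True if the SID itself is FRR-protected


def usable_in_repair_list(sid: Sid, flex_algo: int, dataplane: str) -> bool:
    """Return True if `sid` may appear in a repair list for `flex_algo`."""
    if sid.kind == "node":
        # Node-SIDs must be bound to the FlexAlgo being protected.
        return sid.algo == flex_algo
    if dataplane == "srv6":
        # Assumed relaxation: the SRv6 Adj-SID inherits the algorithm of its
        # Locator, so protection status does not matter here.
        return sid.algo == flex_algo
    # SR-MPLS, as in the quoted draft text: only unprotected Adj-SIDs.
    return not sid.protected
```

For example, `usable_in_repair_list(Sid("adj", 128, True), 128, "srv6")` is
accepted, while a protected SR-MPLS Adj-SID is rejected.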

DISCUSS#2
Sections 11 and 12 do not introduce any supplementary artifacts to the normative
procedures outlined for TI-LFA. The information within Sections 11 and 12 is provided
in extensive detail. Should the Working Group (WG) prefer to maintain this
level of specificity, it is advisable to consider relocating the detailed
content to an appendix unless there is a strong reason to keep it in the main
body of the document.


----------------------------------------------------------------------
COMMENT:
----------------------------------------------------------------------

High level comments:
====================

* TI-LFA is based upon Segment Routing, however the document seems to use
  mostly SR-MPLS dataplane language. The SRv6 dataplane is first mentioned
  on line 493, almost halfway through the document. Consider mentioning
  support for the SRv6 dataplane earlier.
* There are 6 people on the front page. Did all authors edit text in the draft?
* The operational impact text may want to mention explicitly that there is no
  interop complexity, because TI-LFA is a node-local operation.
* The document makes use of the term 'we' and other anthropomorphisms. This
  may not be the best approach in a formal document. Who is 'we'? The editor,
  authors, WG, IETF community, operators, etc.? Policies have no awareness
  or emotions.

Detailed review COMMENTS ([minor] and [major])
==============================================
(Line numbers are rendered using idnits rendering)

19         This document presents Topology Independent Loop-free Alternate Fast
20         Re-route (TI-LFA), aimed at providing protection of node and
21         adjacency segments within the Segment Routing (SR) framework.  This

[minor]
s/Re-route/Reroute/

[major]
The description states that TI-LFA provides protection of node and adjacency
segments. It does not specify what 'protection' means, or that
'protection' is constrained to single link or node failures. For comparison, RFC 5286 has
explicit text in the abstract about single-failure applicability.

24         (DLFA).  It extends these concepts to provide guaranteed coverage in
25         any two connected networks using a link-state IGP.  A key aspect of

[major]
This sentence references 'two connected networks', while earlier in the
paragraph the stated aim is 'protection of node and adjacency segments'.
How do two connected networks correlate with the segments?

25         any two connected networks using a link-state IGP.  A key aspect of
26         TI-LFA is the FRR path selection approach establishing protection
27         over the expected post-convergence paths from the point of local
28         repair, reducing the operational need to control the tie-breaks among
29         various FRR options.

[minor]
Suggested rewrite to make the text more readable:
A principal attribute of TI-LFA is the FRR path selection methodology, which
establishes protection over the anticipated post-convergence paths from the
point of local repair. This approach diminishes the operational necessity
to manage the tie-breaks among various FRR alternatives.

[minor]
Why is the path selection better? Can a hint be given as to why it is better,
beyond a statement proclaiming that it is?

138        *  TI-LFA: Topology Independant LFA.

[minor]
s/Independant/Independent/

144        Segment Routing aims at supporting services with tight SLA guarantees
145        [RFC8402].  By relying on SR this document provides a local repair

[major]
The term SLA does not appear even once in RFC8402. How can the claim of
tight SLA guarantees be justified with RFC8402? Can a better pointer for the claim be
inserted?

[minor]
s/Segment Routing/Segment Routing (SR)/

145        [RFC8402].  By relying on SR this document provides a local repair
146        mechanism for standard link-state IGP shortest path capable of
147        restoring end-to-end connectivity in the case of a sudden directly
148        connected failure of a network component.  Non-SR mechanisms for

[minor]
readability rewrite:
This document outlines a local repair mechanism that leverages Segment
Routing (SR) to restore end-to-end connectivity in the event of an
abrupt failure involving a directly connected network component.
This mechanism is designed for standard link-state Interior Gateway
Protocol (IGP) shortest path scenarios.

153        The term topology independent (TI) refers to the ability to provide a
154        loop free backup path irrespective of the topologies used in the
155        network.  This provides a major improvement compared to LFA [RFC5286]
156        and remote LFA [RFC7490] which cannot provide a complete protection
157        coverage in some topologies as described in [RFC6571].

[minor]
I think what the text is trying to say is:
The term topology independent (TI) describes the capability of
providing a loop-free backup path that is effective across all network
topologies. This represents a significant enhancement over Loop-Free
Alternate (LFA) [RFC5286] and Remote LFA as outlined in
[RFC7490], both of which do not offer comprehensive protection coverage
in certain topological configurations as detailed in [RFC6571]. TI-LFA
ensures the availability of a backup path if a post-convergence path
exists, regardless of the network topology.

167        TI-LFA is a local operation applied by the PLR when it detects
168        failure of one of its local links.  As such, it does not affect:

[minor]
It would be welcome to spell out explicitly that TI-LFA provides protection against
a single local link failure.

[minor]
It was mentioned that TI-LFA provides protection against link and node failures.
In this section, the abrupt failure of a link is mentioned as the trigger for FRR. How is
node protection achieved with TI-LFA, and how does the PLR learn that a neighboring
node is no longer operational? It is elaborated upon later in this
section, but maybe a brief hint could be provided here too?

167        TI-LFA is a local operation applied by the PLR when it detects
168        failure of one of its local links.  As such, it does not affect:

170        *  Micro-loops that appear - or do not appear – as part of the
171           distributed IGP convergence [RFC5715] on the paths to the
172           destination that do not pass thru TI-LFA paths:

174           -  As explained in [RFC5714], such micro-loops may result in the
175              traffic not reaching the PLR and therefore not following TI-LFA
176              paths.

178        *  Micro-loops that appear – or do not appear - when the failed link
179           is repaired.

[minor]
This does not parse very well. I read this paragraph a few times
and believe it could be rewritten as follows:

"TI-LFA operates locally at the Point of Local Repair (PLR) upon detecting
a failure in one of its direct links. Consequently, this local operation
does not influence:

* Micro-loops that may or may not form during the distributed Interior
Gateway Protocol (IGP) convergence as delineated in RFC 5715.

- These micro-loops occur on routes directed towards the destination that
do not traverse TI-LFA-configured paths. According to [RFC5714], the formation
of such micro-loops can prevent traffic from reaching the PLR, thereby
bypassing the TI-LFA paths established for rerouting.

* Micro-loops that may or may not develop when the previously failed link
is restored to functionality.

This specification highlights that while TI-LFA effectively addresses specific
link failures, it does not extend its impact to managing micro-loops
associated with broader IGP convergence issues or subsequent link repairs."

181        TI-LFA paths are loop-free.  What’s more, they follow the post-
182        convergence paths, and, therefore, not subject to micro-loops due to
183        difference in the IGP convergence times of the nodes thru which they
184        pass.

[minor]
This is a rather informal writing style. What about the following:

TI-LFA paths are inherently loop-free and align with post-convergence routes.
Consequently, they are not susceptible to micro-loops that may arise due to
variations in the IGP convergence times across different nodes through
which these paths traverse. This ensures a stable and predictable routing
environment, minimizing disruptions typically associated with asynchronous
network behavior.

186        TI-LFA paths are applied from the moment the PLR detects failure of a
187        local link and until IGP convergence at the PLR is completed.

[minor]
readability rewrite:
TI-LFA paths are activated from the instant the PLR detects a failure in a
local link and remain in effect until the Interior Gateway Protocol (IGP)
convergence at the PLR is fully achieved.

190        micro-loops, especially if these paths have been computed using the
191        methods described in Section Section 6.2, Section 6.3, or Section 6.4
192        of the draft.  One of the possible ways to prevent such micro-loops

[minor]
Instead of simply referencing the sections 6.2, 6.3 and 6.4, maybe line up the
conditions in which this occurs combined with the section references. This could
be something in the style 'if the FRR path is not using a direct neighbor
then... etc etc etc'

206        For each destination in the network, TI-LFA pre-installs a backup

[minor]
What exactly does 'destination' mean? Is it a /32 or /128 node? Or is it
a router-id? Is any other abstraction intended?

224        By using SR, TI-LFA does not require the establishment of TLDP
225        sessions (Targeted Label Distribution Protocol) with remote nodes in
226        order to take advantage of the applicability of remote LFAs (RLFA)
227        [RFC7490][RFC7916] or remote LFAs with directed forwarding
228        (DLFA)[RFC5714].  All the Segment Identifiers (SIDs) are available in
229        the link state database (LSDB) of the IGP.  As a result, preferring
230        LFAs over RLFAs or DLFAs, as well as minimizing the number of RLFA or
231        DLFA repair nodes is not required anymore.

[minor]
possible rewrite for readability and simplicity:

"
By utilizing Segment Routing (SR), TI-LFA eliminates the need to establish
Targeted Label Distribution Protocol (TLDP) sessions with remote nodes for
leveraging the benefits of Remote Loop-Free Alternates (RLFA) [RFC7490][RFC7916]
or Directed Loop-Free Alternates (DLFA) [RFC5714]. All the Segment Identifiers
(SIDs) required are present within the Link State Database (LSDB) of the
Interior Gateway Protocol (IGP). Consequently, there is no longer a necessity
to prefer LFAs over RLFAs or DLFAs, nor is there a need to minimize the number
of RLFA or DLFA repair nodes.
"

233        By using SR, there is no need to create state in the network in order
234        to enforce an explicit FRR path.  This relieves the nodes themselves
235        from having to maintain extra state, and it relieves the operator
236        from having to deploy an extra protocol or extra protocol sessions
237        just to enhance the protection coverage.

[minor]
what about this blob of text:
"
Utilizing SR removes the need to establish additional
state within the network for enforcing explicit Fast Reroute (FRR) paths.
This alleviation spares the nodes from maintaining supplementary state and
frees the operator from the necessity to implement additional protocols or
protocol sessions solely to augment protection coverage.
"

239        Although not a Ti-LFA requirement or constraint, TI-LFA also brings

s/Ti-LFA/TI-LFA/

242        reduces the need of locally configured policies that drive the backup

[minor]
It is unclear what 'drive' means here. Would it be better to say
'describe the backup...'?

243        path selection ([RFC7916]).  The easiest way to express the expected
244        post-convergence path in a loop-free manner is to encode it as a list
245        of adjacency segments.  However, this may create a long SID list that

[major]
You write 'is to encode it'. What is the 'it'? I understand this as
suggesting the use of Adj-SIDs. I also believe that simply having a list of Adj-SIDs is
not sufficient, and that an "ordered" list of Adj-SIDs is needed.
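
To illustrate why ordering matters, here is a small Python sketch, assuming a
hypothetical path and hypothetical Adj-SID values (none of them come from the
draft). The helper `adj_sid_list` emits one Adj-SID per hop, in hop order. The
same SIDs in another order would not describe the same path, and the
one-SID-per-hop encoding is also what makes the list long.

```python
def adj_sid_list(path, adj_sid):
    """Return one Adj-SID per hop of `path`, in forwarding order."""
    return [adj_sid[(a, b)] for a, b in zip(path, path[1:])]


# Hypothetical Adj-SID values keyed by directed adjacency (not from the draft).
adj_sid = {
    ("S", "R1"): 24001,
    ("R1", "R2"): 24012,
    ("R2", "D"): 24023,
}

# Hypothetical post-convergence path from S to D.
repair_list = adj_sid_list(["S", "R1", "R2", "D"], adj_sid)
print(repair_list)  # [24001, 24012, 24023]
```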

245        of adjacency segments.  However, this may create a long SID list that
246        some hardware may not be able to push.  One of the challenges of TI-

[minor]
Should the text say 'push' or 'program'? 'Push' seems specific to the SR-MPLS dataplane, while
TI-LFA is also applicable to SRv6.

248        adjacency segments and node segments.  Each implementation will be
249        free to have its own SID list optimization algorithm.  This document
250        details the basic concepts that could be used to build the SR backup
251        path as well as the associated dataplane procedures

possible rewrite:
"
Each implementation may independently develop its own algorithm for
optimizing the ordered SID list. This document provides an outline of the
fundamental concepts applicable to constructing the SR backup path, along
with the related dataplane procedures.
"

288        We define the main notations used in this document as the following.

290        We refer to "old" and "new" topologies as the LSDB state before and
291        after the considered failure.

[minor]
I would prefer not to use the word 'we'. It is undefined who
that is. Is it the editor, the authors, the WG, the Internet community, etc.?

286     3.  Terminology

[minor]
Would section 3 be better located before section 2 for clarity?

[major]
Later in the document there is usage of P(S,X) and Q(D,X) while
the terminology section only documents P(R,X). Maybe add some text
to clarify the intended use.

321        EP(P, Q) is an explicit SR-based path from a node P to a node Q.

[minor]
Why not simply use 'SR path' instead of 'SR-based path'? Does the
suffix '-based' add any value?

335        An implementation is free to use any local optimization to provide
336        smaller SID lists by combining Node SIDs and Adjacency SIDs.  In

[minor]
The intent seems to be to integrate Adj-SIDs and Node-SIDs into the SID lists.
I am not sure that multiple SIDs are being combined into fewer SIDs:
"An implementation may employ any local optimization strategy to reduce
the size of SID lists by integrating Node SIDs and Adjacency SIDs into
the SID lists."

342     5.  Intersecting P-Space and Q-Space with post-convergence paths
343
344        One of the challenges of defining an SR path following the expected
345        post-convergence path is to reduce the size of the segment list.  In

[minor]
At the end of section 4 is written "These optimizations are out of scope of
this document," and then the first paragraph of section 5 identifies reducing the SID
list as one of the challenges. For something that is out of scope of the
document, it is nonetheless perceived as a rather important problem to address. If
truly out of scope of this document, then maybe state explicitly that section 5
is entirely informational.

[minor]
in some places the term 'segment lists' is used, in others 'SID lists'. Could a
single terminology be used throughout the document?

[major]
In the Terminology section, the P-space, extended P-space and Q-space are
explained. I am not sure why all of this is explained again in more explicit steps. It
makes me wonder whether section 5 can be shortened by reusing the terminology from
section 3 and focusing on it.

356        We want to determine which nodes on the post-convergence path from

[minor]
who is 'we'?

358        regard to resource X (X can be a link or a set of links adjacent to
359        the PLR, or a neighbor node of the PLR).

[minor]
In section 3 (Terminology), resource X was defined, but
with a different definition: 'resource X (e.g. a link S-F, a node F, or a SRLG)'.
Which one is correct? Maybe reuse the Terminology definition for consistency.

378        This can be found by intersecting the set of nodes belonging to the
379        post-convergence path from R to D, assuming the failure of X, with
380        Q(D, X).

[minor]
In terminology section 3, Q(R, X) is described using 'R', while
in this section 5.2 the term Q(D, X) uses 'D'.
Is this intentional? Why not add this to the Terminology
section as well? Or make the Terminology section agnostic
to the letter used (e.g. 'R' or 'D') and describe the
intent of the Q(...) function?
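
The intersection in the quoted lines 378-380 can be sketched in Python as
follows. This is an illustration under stated assumptions, not the draft's
algorithm: the five-node topology, the metrics and the helper names
(`make_graph`, `dijkstra`, `q_space`, `post_convergence_path`) are hypothetical,
X is a single link, metrics are symmetric, and the graph stays connected after
the failure.

```python
import heapq


def make_graph(links):
    """Build a symmetric adjacency map from {(a, b): metric}."""
    graph = {}
    for (a, b), metric in links.items():
        graph.setdefault(a, {})[b] = metric
        graph.setdefault(b, {})[a] = metric
    return graph


def dijkstra(graph, src, skip=None):
    """Shortest distances and one predecessor per node; `skip` is a link to ignore."""
    dist, prev = {src: 0}, {}
    heap = [(0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry
        for v, w in graph[u].items():
            if skip and {u, v} == set(skip):
                continue
            if d + w < dist.get(v, float("inf")):
                dist[v], prev[v] = d + w, u
                heapq.heappush(heap, (d + w, v))
    return dist, prev


def q_space(graph, dest, link):
    """Nodes for which no shortest path to `dest` crosses `link`."""
    u, v = link
    w = graph[u][v]
    d_dest, _ = dijkstra(graph, dest)
    d_u, _ = dijkstra(graph, u)
    d_v, _ = dijkstra(graph, v)
    return {
        n for n in graph
        if d_u[n] + w + d_v[dest] != d_dest[n]      # crossing u -> v
        and d_v[n] + w + d_u[dest] != d_dest[n]     # crossing v -> u
    }


def post_convergence_path(graph, src, dest, link):
    """One shortest path from `src` to `dest` once `link` has failed."""
    _, prev = dijkstra(graph, src, skip=link)
    path = [dest]
    while path[-1] != src:
        path.append(prev[path[-1]])
    return path[::-1]


# Hypothetical topology: S protects link S-F on the way to D.
links = {("S", "F"): 1, ("F", "D"): 1, ("S", "A"): 1, ("A", "B"): 1, ("B", "D"): 1}
graph = make_graph(links)
failed = ("S", "F")

q = q_space(graph, "D", failed)
path = post_convergence_path(graph, "S", "D", failed)
on_path_in_q = [n for n in path if n in q]
print(on_path_in_q)  # ['A', 'B', 'D']
```

With symmetric metrics, the same membership test run from the PLR instead of
the destination gives the P-space, which is one reason a letter-agnostic
definition of P(...) and Q(...) in the Terminology section would help.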

397        protected resource X and, at the same time, is guaranteed to be loop-
398        free irrespective of the state of FIBs along the nodes belonging to
399        the explicit path.  Thus, there is no need for any co-ordination or

[minor]
There is an assumption here that only SR programs the FIB. There may be
out-of-band FIB programming that does cause loops. Maybe frame the
claim better by stating the assumption made to warrant loop-free paths.

460     6.2.  FRR path using a PQ node

[minor]
Is there a reason that there are no considerations for an implementer
to select the PQ node closest to the S or closest to the D?

499        interface for the packet, S-F.  The failure of the primary outgoing

[minor]
what is the 'F' in the S-F?

512        We define hereafter the FRR behavior applied by S for any packet
513        received with an active adjacency segment S-F for which protection
514        was enabled.  As protection has been enabled for the segment S-F and
515        signaled in the IGP (for instance using protocol extensions from
516        [RFC8667] and [RFC8665]), any SR policy using this segment knows that
517        it may be transiently rerouted out of S-F in case of S-F failure.

[minor]
A policy is a configuration. A policy does not 'know' anything. Can the
statement be made without anthropomorphism?

637        and Q-Space as well as the post-convergence path.  An implementation
638        MUST only use Node-SIDs bound to the FlexAlgo and/or Adj-SIDs that
639        are unprotected to build the repair list.

[major]
This is written from an SR-MPLS perspective. For SRv6, the Adj-SID is bound to an
algorithm and this condition does not apply.

647                S --- R2 --- R3 --- R4 --- R5 --- D
648                         \    |  \  /
649                            R7 -- R8
650                             |    |
651                            R9 -- R10

653                                       Figure 2

655        In Figure 2, all the metrics are equal to 1 except
656        R2-R7,R7-R8,R8-R4,R7-R9 which have a metric of 1000.  Considering R2

[minor]
The drawing here is in a different style from Figure 1, where - and * are used to
visualize the different link metrics. Maybe a consistent drawing style should be
used in the document?

665        To avoid the possibility of this double FRR activation, an
666        implementation of TI-LFA MAY pick only non protected adjacency
667        segments when building the repair list.  However, this is important

[minor]
While double failures may initially sound like an exotic event, they may be
more frequent than initially assumed when SRLGs are considered. At some operators,
multiple 'links' use the same optical cable, and if one fiber gets cut then
many links may be impacted, causing double failures. It may be worth mentioning
that double failures are not as rare as one may believe.
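
As a small illustration of the shared-fate point, here is a Python sketch with a
hypothetical link-to-SRLG mapping (link names, group names and the
`links_failed_by` helper are assumptions, not data from the draft or from any
operator). A single fiber cut takes down every link that is a member of that SRLG.

```python
# Hypothetical mapping from link to the SRLGs (shared fibers) it rides on.
srlg_of = {
    ("R1", "R2"): {"fiber-1"},
    ("R1", "R3"): {"fiber-1"},
    ("R2", "R4"): {"fiber-2"},
}


def links_failed_by(srlg, srlg_of):
    """Return the sorted list of links that fail together when `srlg` is cut."""
    return sorted(link for link, groups in srlg_of.items() if srlg in groups)


print(links_failed_by("fiber-1", srlg_of))  # [('R1', 'R2'), ('R1', 'R3')]
```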

676     11.  Advantages of using the expected post-convergence path during FRR

[minor]
This section is a complex, detailed read and seems, at first reading, overly detailed.
Can the description of the advantages not be simplified? Is this level of detail necessary at
this place in the document? Alternatively, consider moving this section into
an appendix. Also consider removing anthropomorphism in this section. TI-LFA has no
awareness; the constraints may however be opaque to it (e.g. 'TI-LFA cannot be
aware of such path constraints and').

783     12.  Analysis based on real network topologies

[major]
Consider placing this section into an appendix. The shared information
does not add further considerations to the TI-LFA procedure description.