John Scudder's Discuss on draft-ietf-rtgwg-segment-routing-ti-lfa-13: (with DISCUSS and COMMENT)

John Scudder via Datatracker <noreply@ietf.org> Tue, 16 April 2024 20:14 UTC

Return-Path: <noreply@ietf.org>
X-Original-To: rtgwg@ietf.org
Delivered-To: rtgwg@ietfa.amsl.com
Received: from ietfa.amsl.com (localhost [IPv6:::1]) by ietfa.amsl.com (Postfix) with ESMTP id 1C6DBC14F73F; Tue, 16 Apr 2024 13:14:10 -0700 (PDT)
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: 8bit
From: John Scudder via Datatracker <noreply@ietf.org>
To: The IESG <iesg@ietf.org>
Cc: draft-ietf-rtgwg-segment-routing-ti-lfa@ietf.org, rtgwg-chairs@ietf.org, rtgwg@ietf.org, stewart.bryant@gmail.com, stewart.bryant@gmail.com
Subject: John Scudder's Discuss on draft-ietf-rtgwg-segment-routing-ti-lfa-13: (with DISCUSS and COMMENT)
X-Test-IDTracker: no
X-IETF-IDTracker: 12.10.0
Auto-Submitted: auto-generated
Precedence: bulk
Reply-To: John Scudder <jgs@juniper.net>
Message-ID: <171329845008.26808.12738157689303839289@ietfa.amsl.com>
Date: Tue, 16 Apr 2024 13:14:10 -0700
Archived-At: <https://mailarchive.ietf.org/arch/msg/rtgwg/VV_ah5qxyUiOu8KUMHYVusrO1OU>
X-BeenThere: rtgwg@ietf.org
X-Mailman-Version: 2.1.39
List-Id: Routing Area Working Group <rtgwg.ietf.org>
List-Unsubscribe: <https://www.ietf.org/mailman/options/rtgwg>, <mailto:rtgwg-request@ietf.org?subject=unsubscribe>
List-Archive: <https://mailarchive.ietf.org/arch/browse/rtgwg/>
List-Post: <mailto:rtgwg@ietf.org>
List-Help: <mailto:rtgwg-request@ietf.org?subject=help>
List-Subscribe: <https://www.ietf.org/mailman/listinfo/rtgwg>, <mailto:rtgwg-request@ietf.org?subject=subscribe>
X-List-Received-Date: Tue, 16 Apr 2024 20:14:10 -0000

John Scudder has entered the following ballot position for
draft-ietf-rtgwg-segment-routing-ti-lfa-13: Discuss

When responding, please keep the subject line intact and reply to all
email addresses included in the To and CC lines. (Feel free to cut this
introductory paragraph, however.)


Please refer to https://www.ietf.org/about/groups/iesg/statements/handling-ballot-positions/ 
for more information about how to handle DISCUSS and COMMENT positions.


The document, along with other ballot positions, can be found here:
https://datatracker.ietf.org/doc/draft-ietf-rtgwg-segment-routing-ti-lfa/



----------------------------------------------------------------------
DISCUSS:
----------------------------------------------------------------------

# John Scudder, RTG AD, comments for draft-ietf-rtgwg-segment-routing-ti-lfa-13
CC @jgscudder

Thanks for this document. The technology is valuable, and the underlying
techniques sound. Despite what you might guess from my abundance of comments, I
like this document. I suspect it has suffered from having been edited
repeatedly by the same set of experts, it becomes hard to see it as a new
reader might. So, please accept my comments in the spirit of a new set of eyes
looking over some long-established text.

Although most of my remarks are non-blocking COMMENTs, I am putting two in as
DISCUSS points. Some of my other comments are of a similar nature, but I think
they're less serious for the overall coherence of the document.

## DISCUSS

### Whole document, is post-convergence of the essence, or not?

The document seems to be arguing with itself about whether following the
post-convergence path is, or is not an essential/required feature. In the
Abstract:

   A key aspect of
   TI-LFA is the FRR path selection approach establishing protection
   over the expected post-convergence paths from the point of local
   repair

It's a *key* aspect! OK! But then, in the Introduction:

   Although not a Ti-LFA requirement or constraint, TI-LFA also brings
   the benefit of the ability to provide a backup path that follows the
   expected post-convergence path

Wait, it's "key" but "not a requirement or constraint"? Moving on to Section 6,

   The repair list encodes the
   explicit post-convergence path to the destination

So it "encodes the explicit post-convergence path". "Encodes", not "might
encode" or "can encode". So the Abstract is right and Section 2 is wrong. But
wait there's more! Later in Section 11,

   traffic can
   be steered by the PLR onto its expected post-convergence path during
   the FRR phase

So it "can"... which implies "doesn't have to be".

There's more, for example, all of Section 5 talks about post-convergence paths,
and there are many more mentions in Section 6 too.

Given that Sections 5-8 seem to come closest to being the normative ones
(though the document is sadly not very precise in this regard) I'm left with
the impression that the Abstract is right ("key"), and the quoted passages of
Sections 2 and 11 are wrong. In any case, I think this needs to be resolved in
some way.

### Section 10, multiple unrelated failures

   Implementations of TI-LFA should deal with the
   occurence of multiple unrelated failures in accordance to the IP Fast
   Reroute Framework [RFC5714].

(Nit, you misspelled "occurrence".)

Can you explain what you mean by this sentence? I haven't reviewed it carefully
but RFC 5714 is a framework, and I don’t know what it means to be "in
accordance to" it. The only relevant text I was able to find in it was RFC 5714
Section 5.2.6,

   However, it is important that the occurrence of
   a second failure while one failure is undergoing repair should not
   result in a level of service which is significantly worse than that
   which would have been achieved in the absence of any repair strategy.

Putting that together with the quote from your specification, I come up with an
interpretation like "it's important to behave reasonably in the face of
multiple failures, we aren't going tell you how to do it, this other document
we are citing isn't going to tell you how to do it either, it's just going to
tell you that our specification was supposed to cover this, but we didn't.”


----------------------------------------------------------------------
COMMENT:
----------------------------------------------------------------------

## COMMENTS

### General Gripe, Y U no XML?

It looks like you must have uploaded the TXT rendering instead of the XML
source. Whenever it is you upload your next revision, please consider uploading
the XML source instead. The set of renderings available when you upload text is
inferior to the renderings available when you upload source (notably, the
modern HTML rendering is not available). If there's some reason uploading
source is difficult or impossible for you, disregard this request of course.
(The only reason I can think of for it to be hard is if you used a less-common
tool to produce your draft, for example, nroff or the Microsoft Word template.
If you use the more mainstream XML or MD workflows, it should be easy enough.)

### Abstract, for want of a hyphen the meaning was lost

   It extends these concepts to provide guaranteed coverage in
   any two connected networks using a link-state IGP.

The only way I can make sense of this is to add a hyphen:

   It extends these concepts to provide guaranteed coverage in
   any two-connected networks using a link-state IGP.

Is that what you meant? I would also suggest changing "using" to "that uses",
as in,

   It extends these concepts to provide guaranteed coverage in
   any two-connected network that uses a link-state IGP.

I'm not sure the concept of two-connectedness is universally understood enough
that it's suitable for use in an Abstract, so I'd support further editing to
get rid of the graph theory term entirely, but at least this edit makes it
correct.

### Section 1, orphan definitions

- RSPT is defined, but never used. Delete?

- SLA is defined, only used once (and in a comment below I suggest deleting
that use!). Delete and just expand on first use, if the first-and-only use is
kept at all?

- SPT is defined, but never used. Delete?

- SRGB is defined, only used once. Delete and just expand on first use?

- TLDP is defined, only used once. Delete and just expand on first use? (If you
keep it, please correct the nit from Ben Niven-Jenkins’ RTGAREA review.)

### Section 2, "aims at"

   Segment Routing aims at supporting services with tight SLA guarantees
   [RFC8402].

This seems to be rewriting history. That is to say, sure SR “aims at supporting
services with tight SLA guarantees”, but only in the sense that every
general-purpose packet transport does; without specifics this isn’t very
meaningful. In any case, the citation doesn’t supply evidence for the
statement, indeed the string “SLA” or “service level” never occurs in RFC 8402.
Maybe just remove this sentence? It doesn’t appear essential in any case.

### Section 2, can't parse this sentence

   By relying on SR this document provides a local repair
   mechanism for standard link-state IGP shortest path capable of
   restoring end-to-end connectivity in the case of a sudden directly
   connected failure of a network component.

I can’t parse this sentence. I am *guessing* that what you mean is something
like,

NEW:
   This document leverages Segment Routing (SR) to provide a local
   repair mechanism for a shortest path computed by a standard
   link-state IGP. The local repair is capable of restoring end-to-end
   connectivity in the case of a failure of a directly connected
   network component.

If this is what you mean, you’re welcome to use this text if you want. If it’s
not what you mean, please explain?

Note I removed “sudden” in my NEW text because I’m guessing you don’t mean to
exclude a gradual or a foreseeable failure. A failure is a failure, after all.

### Section 2, When the network reconverges

   When the network reconverges

I suggest,

NEW:
   When the network reconverges after a failure

See also the next comment.

### Section 2, microloops

Much of Section 2 is a discussion of microloops and exactly how TI-LFA relates
to them. This doesn’t seem like introductory material, especially because the
rest of the specification doesn't talk about microloops at all. I suggest
moving that material to a new (sub)section for that purpose and just mentioning
it in the Introduction, as in something like "microloops are not addressed by
TI-LFA and can be a concern in some deployments. This is discussed in <xref>."

I don't insist that you make this change, the document is still usable without
it. However, I think that the Introduction as it stands is not very useful *as
an introduction* because of the abundance of non-introductory material. (Some
of my subsequent comments fall under the same heading.)

### Section 2, “primary link”

You mention the “primary link” in a few places here (and nowhere else). What is
the “primary link”? Please clarify or re-word. For example, maybe you mean “the
link whose failure is detected”.

### Section 2, what value does comparing to older FRR techniques add to an
intro?

   By using SR, TI-LFA does not require the establishment of TLDP
   sessions (Targeted Label Distribution Protocol) with remote nodes in
   order to take advantage of the applicability of remote LFAs (RLFA)
   [RFC7490][RFC7916] or remote LFAs with directed forwarding
   (DLFA)[RFC5714].  All the Segment Identifiers (SIDs) are available in
   the link state database (LSDB) of the IGP.  As a result, preferring
   LFAs over RLFAs or DLFAs, as well as minimizing the number of RLFA or
   DLFA repair nodes is not required anymore.

I see why you wanted this paragraph during the process of developing this spec
and persuading the WG of its value. I don’t see how it contributes any value to
the final spec. Similarly,

   By using SR, there is no need to create state in the network in order
   to enforce an explicit FRR path.  This relieves the nodes themselves
   from having to maintain extra state, and it relieves the operator
   from having to deploy an extra protocol or extra protocol sessions
   just to enhance the protection coverage.

Appears to just be a restatement of the value proposition of SR itself. I don’t
see value in restating it in this document.

I think you could remove both these paragraphs without harm, and it would make
the document a quicker and clearer read.

### Section 2, encoding challenges

   One of the challenges of TI-
   LFA is to encode the expected post-convergence path by combining
   adjacency segments and node segments.

Do you mean “compactly encode”, “efficiently encode”, or similar? Is encoding,
per se, the challenge?

### Section 2, roadmap should be a roadmap

I agree with Ben Niven-Jenkins that the omission of Sections 8-10 in the
overview list at the end of the Introduction is a bit jarring to the reader. (I
would add Section 13, too.)

### Section 3, exclude from that set

I think this is wrong:

   Exclude from that set of neighbors that are reachable from R using X.

Did you mean,

NEW:
   Exclude from that set, the neighbors that are reachable from R using X.

### Section 3, defined but not used

   A symmetric network is a network such that the IGP metric of each
   link is the same in both directions of the link.

This definition is never used.

### Section 6, you can't guarantee that

   is guaranteed to be loop-
   free irrespective of the state of FIBs along the nodes belonging to
   the explicit path.

As written, there’s no way to guarantee that. (Trivial proof, one possible
state of a FIB is to point back to the preceding node along the path. That
might not be an *expected* state, but it is *a* state.) I think this is just a
case of overly casual writing, and you mean that the loop-free property will
exist regardless of whether the nodes belonging to the explicit path have
converged to recognize failure X or not. Consider rewriting along those lines?

### Section 6 and others

Please supply definitions for “P node” and “Q node”.

### Section 6, terminology is inconsistent and unclear

I can understand NodeSID(R1) well enough even though you haven't supplied a
definition, it’s the node SID for router R1. But what is “Node_SID(P)”? P isn’t
a router, it’s a set of routers, or a space. Please clarify, whether with a
definition or otherwise. Probably your clarification will fix both this and the
previous point.

While you’re at it, you might as well make your terminology consistent. In one
of the node SID cases above you use an underscore, and in the other, you don’t.
Also, your node SID notation is inconsistent with your adjacency SID notation
which looks like AdjSID_R1R2 — so in one case an underscore and parentheses, in
another case parentheses with no underscore, and in the final case an
underscore with no parentheses. Pick one.

(And then there's Section 12 with "node-SID"...)

### Sections 6.1, 6.2, 6.3, SHOULD NOT use SHOULD

What work are the SHOULDs doing here? Considering that in other parts of the
document you leave the computation of the repair path up to the implementation,
why are you mandating it here? And, if you’re mandating it, why not mandate it
all the way with a MUST?

It seems to me that if you don’t want to mandate implementation, it would be
sensible to take the RFC 2119 language out altogether. If you do want to
mandate implementation, I don’t see why you wouldn’t make this a MUST. If you
really do want it to be SHOULD, Please explain your reasoning.

### Section 7, primary outgoing interface

The existence of a "primary outgoing interface" seems to imply the existence of
a secondary outgoing interface, tertiary outgoing interface, etc. Please define
primary outgoing interface, or if this isn't an important distinction, consider
whether you can simplify to just say "outgoing interface".

### Section 7.1, first

   The active segment becomes the first
   segment of the repair list.

By “first” do you mean “first to be pushed, last to be processed”? If so, I
suggest clarifying that in the text, because the plain English reading of
"first" is the opposite.

### Section 7.2, "as stated"... where?

   As stated in Section 2, when SR
   policies are involved and a strict compliance of the policy is
   required, an end-to-end protection should be preferred over a local
   repair mechanism.

I don’t see this in section 2 (I searched for “end-to-end”, “prefer”, “policy”
and “policies”). Can you help me understand what text in section 2 you’re
talking about?

### Section 7.2.1, what's an Adj()?

Please define Adj(). Is this yet another terminology variation? (c.f. Section 6
comment on AdjSID_R1R2)

### Section 8.1, what's the tail end of a node segment?

   1.  If the active segment is a node segment that has been signaled
       with penultimate hop popping and the repair list ends with an
       adjacency segment terminating on the tail-end of the active
       segment, then the active segment MUST be popped before pushing
       the repair list.

What is the tail end of a node segment? I can’t figure out what that means.

I think I know what you’re trying to say, but please find a way to reword it
that doesn’t end up making me try to parse the above with my "what did the
authors actually *mean*?" glasses on.

### Section 8.2 and others, description of SRv6 behaviors

RFC 8754 describes forwarding behaviors using a kind of line-numbered
pseudocode, and later documents that modify forwarding behaviors specify
updates to the pseudocode. (Examples: RFC 8986,
draft-ietf-spring-srv6-srh-compression-15,
draft-ietf-rtgwg-srv6-egress-protection-16,
draft-ietf-spring-sr-redundancy-protection-03,
draft-ietf-spring-srv6-path-segment-07)

You don’t do this, you use a descriptive approach instead. I'm ok with this in
isolation, but I’d like to know if you made an affirmative decision to diverge
from the usual SRv6 way of doing things, and if so, why, and if the SPRING
working group specifically considered this and is OK with it.

### Section 8.2, shorter than what?

   In such case, there is no need for a preceding
   Prefix SID and the resulting repair list is likely shorter.

Shorter than what? This is the first place in this document the string “Prefix
SID” occurs, so I’m confused.

### Section 11, limit the implementation of local FRR policies

   Based on this assumption, in order to facilitate the operation of
   FRR, and limit the implementation of local FRR policies

Do you mean "limit the need for implementation of local FRR policies"? (And you
could drop “implementation of” for that matter.)

### Section 11, TI-LFA and SR policies don't mix?

The last paragraph, regarding the use of SR policies, and also Section 9,
leaves me wondering whether a simpler statement would be that TI-LFA is
inappropriate for use in a network that makes use of SR policies. Is this a
fair characterization?

### Section 12, can't this be an appendix?

Shouldn’t this be an Appendix? In general, that seems like a common (and good!)
practice for inessential information like this, especially when it has a
potentially limited shelf-life.

Also, although I appreciate that you provided some rudimentary parameterization
of the topologies in Table 1, I think it would be helpful to at least say what
time period the topologies reflect —
draft-francois-rtgwg-segment-routing-ti-lfa-00 dates to summer of 2015; are we
talking about the topologies that were in vogue in 2015? Those of 2023? Etc.

### Section 12, granularity wut

      We do not cover the
      case for 2 SIDs (Section 6.3) separately because there was no
      granularity in the result.

I don’t understand what this means. Can you rephrase it? Generally, I find the
words “granular”. “granularity”, and related have almost zero descriptive
power. :-(

### Section 12, "2 or more" or "2", "3", and no more?

In your description of the table, you say,

The convention that we use is as follows

   *  0 SIDs: the calculated repair path starts with a directly
...
   *  1 SIDs: the repair node is a PQ node, in which case only 1 SID is
...
   *  2 or more SIDs: The repair path consists of 2 or more SIDs as
      described in Section 6.3 and Section 6.4.  We do not cover the
      case for 2 SIDs (Section 6.3) separately
...

But the table headers show:

   +-------------+------------+------------+------------+------------+
   |   Network   |    0 SIDs  |    1 SID   |   2 SIDs   |   3 SIDs   |
   +-------------+------------+------------+------------+------------+

I.e. the table headers don’t show “2 or more” they show 2, and 3, broken out
distinctly, and no "or more" case. Seems like these need to be reconciled.

### Section 13, guaranteed upper bound

   The techniques described in this document are internal
   functionalities to a router that result in the ability to guarantee
   an upper bound on the time taken to restore traffic flow upon the
   failure of a directly connected link or node.

This is the only place in the document where you talk about guaranteed upper
bound. This is a fairly strong promise to make, I think you shouldn't be
mentioning it unless you provide some kind of support for how the guarantee is
provided. Note, I don't question that TI-LFA can be part of the machinery
providing such a guarantee, but without showing your work I don't think you can
make this claim.

## Notes

This review is in the ["IETF Comments" Markdown format][ICMF], You can use the
[`ietf-comments` tool][ICT] to automatically convert this review into
individual GitHub issues.

[ICMF]: https://github.com/mnot/ietf-comments/blob/main/format.md
[ICT]: https://github.com/mnot/ietf-comments