Alvaro Retana's Discuss on draft-ietf-rtgwg-backoff-algo-07: (with DISCUSS and COMMENT)

Alvaro Retana <aretana.ietf@gmail.com> Tue, 20 February 2018 06:02 UTC

From: Alvaro Retana <aretana.ietf@gmail.com>
To: The IESG <iesg@ietf.org>
Cc: draft-ietf-rtgwg-backoff-algo@ietf.org, Uma Chunduri <uma.chunduri@huawei.com>, rtgwg-chairs@ietf.org, rtgwg@ietf.org
Subject: Alvaro Retana's Discuss on draft-ietf-rtgwg-backoff-algo-07: (with DISCUSS and COMMENT)
Date: Mon, 19 Feb 2018 22:02:48 -0800
Archived-At: <https://mailarchive.ietf.org/arch/msg/rtgwg/UnVg1EaFAZrn_cYHzncO5987Qf0>

Alvaro Retana has entered the following ballot position for
draft-ietf-rtgwg-backoff-algo-07: Discuss

When responding, please keep the subject line intact and reply to all
email addresses included in the To and CC lines. (Feel free to cut this
introductory paragraph, however.)


Please refer to https://www.ietf.org/iesg/statement/discuss-criteria.html
for more information about IESG DISCUSS and COMMENT positions.


The document, along with other ballot positions, can be found here:
https://datatracker.ietf.org/doc/draft-ietf-rtgwg-backoff-algo/



----------------------------------------------------------------------
DISCUSS:
----------------------------------------------------------------------

I am balloting DISCUSS because I believe that this document presents an
incomplete and vague description of a specification, which (as is) won't result
in consistent implementations.  Consistency, through the specification of a
standard algorithm, is used as the basis to justify this work:  "To allow
multi-vendor networks to have all routers delay their SPF computations for the
same duration, this document specifies a standard algorithm."

I am specifically concerned about the fact that there are no
defaults or even suggested values:

   This document does not propose default values for the parameters because
   these values are expected to be context dependent. Implementations are free
   to propose their own default values.

If the whole purpose of standardizing an algorithm is for different
implementations to behave the same way and (specifically) "to have all routers
delay their SPF computations for the same duration", then not defining defaults
(and not being clear in the recommendations -- more on this below) makes the
specification incomplete and vague!

Section 6 tries to provide guidelines about defaults, but it falls short!

   In order to satisfy the goals stated in Section 2, operators are
   RECOMMENDED to configure delay intervals such that SPF_INITIAL_DELAY
   <= SPF_SHORT_DELAY and SPF_SHORT_DELAY <= SPF_LONG_DELAY.

Why are the operators not REQUIRED to meet that relationship?  Are there cases
when it is ok not to follow those guidelines?  Would (for example) the
SPF_LONG_DELAY ever be less than SPF_INITIAL_DELAY?
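
To illustrate what making that relationship mandatory could look like, here is
a minimal sketch of a check an implementation might run when the delays are
configured.  The function name and the use of milliseconds are assumptions made
for illustration; nothing like this appears in the draft, and the values passed
in are examples, not proposed defaults.

```python
def check_delay_ordering(initial_ms, short_ms, long_ms):
    """Return the ordering violations among the three SPF delays (empty if none)."""
    problems = []
    # Section 6 relationship: SPF_INITIAL_DELAY <= SPF_SHORT_DELAY
    if initial_ms > short_ms:
        problems.append("SPF_INITIAL_DELAY > SPF_SHORT_DELAY")
    # Section 6 relationship: SPF_SHORT_DELAY <= SPF_LONG_DELAY
    if short_ms > long_ms:
        problems.append("SPF_SHORT_DELAY > SPF_LONG_DELAY")
    return problems


# Illustrative values only; the draft defines no defaults.
print(check_delay_ordering(0, 50, 2000))     # []
print(check_delay_ordering(5000, 50, 2000))  # ['SPF_INITIAL_DELAY > SPF_SHORT_DELAY']
```

With a REQUIRED relationship, a non-empty result would be grounds to reject the
configuration; with RECOMMENDED, it is unclear what an implementation should do
with it.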

The other Normative Language in this section can't really be enforced, and
provides (at best) very weak guidance.

   When setting (default) values, one SHOULD consider the customers and
   their application requirements, the computational power of the
   routers, the size of the network, and, in particular, the number of
   IP prefixes advertised in the IGP, the frequency and number of IGP
   events, the number of protocols reactions/computations triggered by
   IGP SPF (e.g., BGP, PCEP, Traffic Engineering CSPF, Fast ReRoute
   computations).

"SHOULD consider..."  How can this statement be Normatively enforced?  Using
"SHOULD" implies that it is ok to only partially consider the list you
provided, or even a different set of criteria.

Based on the suggestions above, I can't imagine how a vendor can set default
values (even if "free to propose their own")...or how the average network
operator could configure anything beyond the numbers that you mentioned as
examples in the text.  For example, the average network operator might ask:
under the same circumstances, should my bigger routers (ones with presumably
more computational power) have lower or higher delays with respect to my
smaller routers?  ...

   Note that some or all of these factors may change over the life of
   the network.  In case of doubt, it's RECOMMENDED to play it safe and
   start with safe, i.e., longer timers.

How can "playing it safe" be Normatively enforced?

   For the standard algorithm to be effective in mitigating micro-loops,
   it is RECOMMENDED that all routers in the IGP domain, or at least all
   the routers in the same area/level, have exactly the same configured
   values.

[A similar statement is made in Section 7.]

If it is so important, why is consistency not mandatory?  In other words, why
is it only "RECOMMENDED" and not "REQUIRED"?  When is it ok to not do it?

Back to the point of this DISCUSS, the importance of consistent values is
clear!  Based on the experience of existing implementations, please specify
"safe" default values.


----------------------------------------------------------------------
COMMENT:
----------------------------------------------------------------------

[I know that some of these comments have been brought up in the SecDir and
GenArt reviews, but I have not seen an update yet.]

(1) Besides the lack of guidance (see above), there are several other
inconsistencies throughout the document:

(1.1) Section 3: "The HOLDDOWN_INTERVAL MUST be defaulted or configured to be
longer than the TIME_TO_LEARN_INTERVAL."  Which one, defaulted OR configured? 
Is it ok for the implementation to provide a default value that doesn't comply
with the expectation that the operator will configure the correct value?  It
seems to me that the definition of MUST doesn't fit with an option.

(1.2) Section 4: "If subsequent IGP events are received in a short period of
time (TIME_TO_LEARN_INTERVAL)...In this situation, we delay the routing
computation by SHORT_SPF_DELAY."  Note that Section 3 provided example values
for TIME_TO_LEARN_INTERVAL and SHORT_SPF_DELAY as 1 sec and 50-100 ms,
respectively.  If IGP events are received within the TIME_TO_LEARN_INTERVAL
window, then the SPF_DELAY ("delay between the first IGP event...and the start
of that routing table computation") set to SHORT_SPF_DELAY will be triggered
before TIME_TO_LEARN_INTERVAL...which means that the SPF run after
SHORT_SPF_DELAY won't cover all the changes.  Is that what you meant, or are
you assuming that the SPF_DELAY will start *after* the TIME_TO_LEARN_INTERVAL?

(1.3) Section 5.1: "LONG_WAIT: State reached after TIME_TO_LEARN_INTERVAL.  In
other words, state reached after TIME_TO_LEARN_INTERVAL in state SHORT_WAIT." 
But Section 3 defines TIME_TO_LEARN_INTERVAL as "the maximum duration typically
needed to learn all the IGP events related to a single component failure" --
why don't the events from that single failure start while in QUIET state?  OR
are you saying (in 5.1) that the TIME_TO_LEARN_INTERVAL is not measured from
the initial IGP Event?

(1.4) What is the relationship between HOLDDOWN_INTERVAL and the *_SPF_DELAY? 
I would assume that *_SPF_DELAY is always less than HOLDDOWN_INTERVAL, but the
document doesn't specify that relationship anywhere.

(1.5) Section 6: "All the parameters MUST be configurable
[I-D.ietf-isis-yang-isis-cfg] [I-D.ietf-ospf-yang] at the protocol instance
granularity."  Given that the references to the YANG models are listed as
Informative, what does that statement mean?  Is it a directive to what must be
included in the models?  What about implementations that don't use YANG (yet)?
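
To make (1.2) and (1.4) concrete, the sketch below works through the arithmetic
with the example values from Section 3.  Several things in it are assumptions
made for illustration and are not taken from the draft: the function names, the
event arrival times, the HOLDDOWN_INTERVAL and LONG_SPF_DELAY values, the reading
that the delay is measured from the first IGP event, and the strict "shorter
than HOLDDOWN_INTERVAL" rule for the delays (which is the relationship assumed
in (1.4), not one the draft states).

```python
def events_missed(event_times_ms, spf_delay_ms):
    """(1.2) Events that arrive after the SPF run scheduled by the first event."""
    spf_start = min(event_times_ms) + spf_delay_ms
    return [t for t in event_times_ms if t > spf_start]


def check_holddown(holddown_ms, time_to_learn_ms, spf_delays_ms):
    """(1.1)/(1.4) Return violated constraints; spf_delays_ms maps name -> ms."""
    problems = []
    # Section 3: HOLDDOWN_INTERVAL MUST be longer than TIME_TO_LEARN_INTERVAL.
    if holddown_ms <= time_to_learn_ms:
        problems.append("HOLDDOWN_INTERVAL <= TIME_TO_LEARN_INTERVAL")
    # Assumed, not in the draft: every SPF delay is shorter than HOLDDOWN_INTERVAL.
    for name, value in spf_delays_ms.items():
        if value >= holddown_ms:
            problems.append(f"{name} >= HOLDDOWN_INTERVAL")
    return problems


# TIME_TO_LEARN_INTERVAL = 1 sec and SHORT_SPF_DELAY = 100 ms (Section 3 examples).
# Hypothetical arrival times (ms), all inside the 1 sec learning window.
print(events_missed([0, 40, 300, 900], 100))  # [300, 900]

# Hypothetical HOLDDOWN_INTERVAL (3 sec) and LONG_SPF_DELAY (5 sec).
print(check_holddown(3000, 1000, {"SHORT_SPF_DELAY": 100, "LONG_SPF_DELAY": 5000}))
# ['LONG_SPF_DELAY >= HOLDDOWN_INTERVAL']
```

Under these assumptions, the SPF run at 100 ms misses the events at 300 ms and
900 ms even though both fall inside the TIME_TO_LEARN_INTERVAL, which is the
gap (1.2) asks about.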

(2) Section 5.4 (FSM Events)

(2.1) When will "Transition 7: SPF_TIMER expiration, while in QUIET" happen? 
Because when an IGP Event occurs in QUIET state, the FSM moves to SHORT_WAIT,
the SPF_TIMER should never expire in QUIET state.

(2.2) "Transition 3: LEARN_TIMER expiration." is defined between SHORT_WAIT and
LONG_WAIT, which (at first glance) seems to match how 5.1 defines
"LONG_WAIT:...state reached after TIME_TO_LEARN_INTERVAL in state SHORT_WAIT." 
However, the LEARN_TIMER is only started when an IGP event happens in
QUIET state (transition 1).

(2.3) For completeness, the HOLDDOWN_TIMER expiration events (5 and 6) should
include resetting all the timers, just in case...and to be consistent with the
initialization description.

(3) From Section 3: "Routing table computation: Computation of the routing
table..."  This is a circular definition [1].  I'm sure the authors can figure
out a clear way to explain the meaning without using the terms being defined...

(4) "Note that previously implemented SPF delay algorithms counted the number
of SPF computations."  References?  Knowing that the references may not be
stable (pointing to a vendor's website), you might want to remove this
sentence and simply make the point in the paragraph as to why a time interval
is used.  Note that the point of this document is not to compare the
specification to "previously implemented algorithms".

(5) I am surprised that no other documents "must be read to understand or
implement the technology" [2], resulting in no Normative References (beyond
rfc2119).  I would think that at least the OSPF and ISIS specs should be
Normative.

(6) For the rtgwg-chairs/Shepherd: A quick scan of the mail archive shows that
this document wasn't reviewed by the ospf/isis WGs.  Given that what is
specified here affects the protocols directly, I would think that formal review
is needed.  [I note that a couple of the Chairs of the ospf/isis WGs are
co-authors of this document, and that a note was indeed sent when the -00
version of the individual draft was published -- still, it would have been nice
to at least explicitly inform those WGs of the progress.]

(7) Nit: "(e.g.  Loop Free Alternates..."   The closing parenthesis is missing.

(8) Nit: Please put a forward reference to 5.1 when the QUIET state is
mentioned in Section 4.

(9) Nit: s/QUIET_STATE/QUIET state/

[1] https://en.wikipedia.org/wiki/Circular_definition
[2] https://www.ietf.org/iesg/statement/normative-informative.html