Re: [secdir] secdir review of draft-ietf-rtgwg-backoff-algo-07

Benjamin Kaduk <kaduk@mit.edu> Fri, 16 February 2018 00:04 UTC

Date: Thu, 15 Feb 2018 18:04:11 -0600
From: Benjamin Kaduk <kaduk@mit.edu>
To: "Acee Lindem (acee)" <acee@cisco.com>
Cc: "bruno.decraene@orange.com" <bruno.decraene@orange.com>, "draft-ietf-rtgwg-backoff-algo.all@ietf.org" <draft-ietf-rtgwg-backoff-algo.all@ietf.org>, "iesg@ietf.org" <iesg@ietf.org>, "secdir@ietf.org" <secdir@ietf.org>
Message-ID: <20180216000410.GP12363@mit.edu>
References: <20180214211017.GI12363@mit.edu> <9677_1518711435_5A85B28A_9677_280_1_53C29892C857584299CBF5D05346208A4799B57B@OPEXCLILM21.corporate.adroot.infra.ftgroup> <EDE93099-A028-4A97-9ECB-49983E2B7A9D@cisco.com>
Archived-At: <https://mailarchive.ietf.org/arch/msg/secdir/i_PISvaLu2DVfgIqus1PDNR950U>
Subject: Re: [secdir] secdir review of draft-ietf-rtgwg-backoff-algo-07

[also inline]

On Thu, Feb 15, 2018 at 07:50:34PM +0000, Acee Lindem (acee) wrote:
> Hi Bruno, Benjamin, 
> 
> Thanks to Benjamin for review and Bruno for the detailed response. See my responses preceded by [Acee]. 
> 
> 
> On 2/15/18, 11:17 AM, "bruno.decraene@orange.com" <bruno.decraene@orange.com> wrote:
> 
>     Hi Benjamin, 
>     
>     Many thanks for your careful and useful review.
>     Please see inline [Bruno] the proposed resolution.
>     
>     Regarding the posting of draft -08, I'd propose to wait 24H for your possible follow up, plus possible OPSDIR & GENART review; unless Alia has other instructions. (Note that I'll be on PTO next week).
>     
>     Acee,
>     There are a few questions for you. Please look for "Acee, any opinion on this?"
>     
>     
>      > -----Original Message-----
>      > From: Benjamin Kaduk [mailto:kaduk@mit.edu]
>      > Sent: Wednesday, February 14, 2018 10:10 PM
>      > To: draft-ietf-rtgwg-backoff-algo.all@ietf.org; iesg@ietf.org; secdir@ietf.org
>      > Subject: secdir review of draft-ietf-rtgwg-backoff-algo-07
>      > 
>      > Hi all,
>      > 
>      > I have reviewed this document as part of the security directorate's
>      > ongoing effort to review all IETF documents being processed by the
>      > IESG.  These comments were written primarily for the benefit of the
>      > security area directors.  Document editors and WG chairs should treat
>      > these comments just like any other last call comments.
>      > 
>      > From a security perspective, this document is Ready.
>      
>     [Bruno] Excellent, thanks.
>     
>      >  It specifies a
>      > standard scheme that can be used to back off SPF calculations during
>      > periods of frequent IGP events, avoiding excessive resource
>      > consumption performing calculations that would be rendered redundant
>      > (or just be useless) soon.  The security considerations correctly
>      > note that an attacker that can generate IGP events would be able to
>      > delay the IGP convergence time, which is true both for this scheme
>      > and all schemes previously in use.  (I might use more words to say
>      > the same thing if I was writing it, but that probably reflects more
>      > on me than the document.)
>      > 
>      > 
>      > I do have some questions about the actual proposed FSM, though -- I
>      > suspect that I am just making some implicit assumptions that may not
>      > be grounded in reality.  In particular, I am basically assuming that
>      > INITIAL_SPF_DELAY < SHORT_SPF_DELAY < LONG_SPF_DELAY <
>      > HOLDDOWN_INTERVAL.  (The draft itself only has it as RECOMMENDED for
>      > SPF_INITIAL_DELAY <= SPF_SHORT_DELAY <= SPF_LONG_DELAY in Section 6,
>      > yes, with the different spellings, but has a MUST for
>      > HOLDDOWN_INTERVAL > TIME_TO_LEARN_INTERVAL.)
>      > This would potentially affect the state machine for events 1, 6, and
>      > 7.
>     
>     [Bruno] Thanks for detecting the misspelling. Corrected. 
>      
>      > In transition 1, we say to only start the SPF_TIMER if it is not already
>      > running, but I do not see a way for it to already be running unless
>      > the HOLDDOWN_TIMER value is less than one or more of the SPF_TIMER
>      > values.
>      
>     [Bruno] I agree.
>     2 comments:
>     - IMO, it feels safer to check that SPF_TIMER is not already running, plus this is more consistent with other states
>     - As you have noted, the draft does not mandate that HOLDDOWN_TIMER > *_SPF_DELAY. Hence this check is indeed required.
>      
>      > Similarly, I don't see how transition 7 could ever happen, since an IGP
>      > event moves us out of the QUIET state, and I assume that the
>      > SPF_TIMER would fire before the HOLDDOWN_TIMER, since the latter is
>      > reset on every IGP event and I assume the latter has a larger value.
>      
>     [Bruno] Same answer: the draft does not mandate that HOLDDOWN_TIMER > *_SPF_DELAY.
>     
>      > Transition 6 is a little less clear, but is also similar -- if
>      > HOLDDOWN_TIMER is larger than LEARN_TIMER, then LEARN_TIMER must
>      > fire before HOLDDOWN_TIMER, and we leave SHORT_WAIT to go to
>      > LONG_WAIT before we could consider leaving SHORT_WAIT to go back to
>      > QUIET.
>      
>     [Bruno] I agree with you. i.e. transition 6 should never be used if HOLDDOWN_INTERVAL > TIME_TO_LEARN_INTERVAL which is a MUST in the draft.
>     At this point, I'd rather keep it as this gives more robustness to the FSM. (I'm not fully confident that any implementation/configuration interface would enforce it) Also, during previous reviews, we were rather asked to indicate more/all transitions rather than main ones.
>     However, I'm open to other opinions.
>     Acee, any opinion on this?
> 
> [Acee] I agree we should keep it based on the draft history. 
> 

It's probably okay to leave the transition in.  Mostly I just wanted to make
sure that I wasn't missing something that would cause it to happen.

>    Transition 6: HOLDDOWN_TIMER expiration, while in SHORT_WAIT. This
>                             transition would normally not occur since
>                             (HOLDDOWN_INTERVAL > TIME_TO_LEARN_INTERVAL). 

Sounds good to me.

>      
>      > So, am I making some flawed assumptions?  (Are there examples of
>      > situations that clearly demonstrate the flaw?)
>      
>     [Bruno] 
>     Your assumptions are logical but not mandated by the draft, hence the FSM needs to work even without those assumptions.
>     One could argue that your assumptions are valid. But another one may argue that he wants freedom in choosing the timers' value; plus that the FSM should be robust to any timers' values.
>      

Okay, so it sounds like we don't have any concrete scenarios in mind
but want to leave some flexibility for implementors/configuration
choices.  That may well be a reasonable tradeoff.
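
To make the distinction between the mandated and the merely assumed
orderings concrete, here is a minimal sketch of a configuration check.
It is only an illustration based on this thread: the class and function
names are hypothetical, the HOLDDOWN_INTERVAL value of 3000 ms is an
assumed example, and the third check encodes my own assumption, which
the draft does not require.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SpfBackoffConfig:
    """Timer parameters in milliseconds (hypothetical container type)."""
    initial_spf_delay: int
    short_spf_delay: int
    long_spf_delay: int
    time_to_learn_interval: int
    holddown_interval: int


def check(cfg):
    """Return (errors, warnings) for one set of timer values."""
    errors, warnings = [], []
    # The only ordering stated as a MUST in this thread.
    if not cfg.holddown_interval > cfg.time_to_learn_interval:
        errors.append("HOLDDOWN_INTERVAL must be > TIME_TO_LEARN_INTERVAL")
    # RECOMMENDED ordering from Section 6 of the draft.
    if not (cfg.initial_spf_delay <= cfg.short_spf_delay <= cfg.long_spf_delay):
        warnings.append("expected INITIAL <= SHORT <= LONG SPF delays")
    # Not required by the draft: my implicit assumption from the review.
    # When it fails, SPF_TIMER can outlive HOLDDOWN_TIMER, so transition 7
    # and the "already running" check in transition 1 become reachable.
    if cfg.holddown_interval <= cfg.long_spf_delay:
        warnings.append("HOLDDOWN_INTERVAL <= LONG_SPF_DELAY")
    return errors, warnings


# Example values from this thread; the 3000 ms holddown is an assumption.
example = SpfBackoffConfig(0, 50, 2000, 1000, 3000)
print(check(example))
```

An FSM that stays robust for any timer values, as Bruno describes, would
treat only the first check as fatal and the other two as advisory.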

>      > Also, to confirm my understanding, suppose a scenario happens where
>      > an IGP participant sees an event, then a gap of 100ms, then three
>      > IGP events at equal 10ms intervals, with the SPF delays at the
>      > example values of 0/50/2000 ms and TIME_TO_LEARN of 1s.  The first
>      > event triggers a SPF computation immediately, then we go to
>      > SHORT_WAIT, the first of the three events kicks off a SPF_TIMER for
>      > 50ms, 
>     
>     [Bruno] Agreed
>     
>      > which is reset by the next two events,
>     
>     [Bruno] I don't think I agree.
>     This case is handled by transition "2: IGP event", which triggers the following actions
>     
>             o  Reset HOLDDOWN_TIMER to HOLDDOWN_INTERVAL.
>             o  If SPF_TIMER is not already running, start it with value  SHORT_SPF_DELAY.
>             o  Remain in current state.
>     
>     
>     So in your example, the second (of the three events) arrives 10ms after the start of the 50ms SPF_TIMER. i.e. SPF_TIMER is already running and hence not changed.
>     

Good point, thanks for the correction.
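
For my own benefit, the corrected reading of transition 2 can be
sketched as follows. This is a minimal illustration of the three actions
quoted above, not an implementation from the draft; the Timer class, the
function name, and the 3000 ms holddown value are assumptions made for
the example.

```python
class Timer:
    """One-shot timer tracked by absolute expiry time in ms (hypothetical)."""

    def __init__(self):
        self.expires_at = None  # None means the timer is not running

    def running(self):
        return self.expires_at is not None

    def start(self, now, interval):
        # Starting a running timer restarts it from 'now'.
        self.expires_at = now + interval


def on_igp_event_in_short_wait(now, spf_timer, holddown_timer,
                               short_spf_delay=50, holddown_interval=3000):
    """Apply the actions of transition 2 (IGP event while in SHORT_WAIT)."""
    # Reset HOLDDOWN_TIMER to HOLDDOWN_INTERVAL on every event.
    holddown_timer.start(now, holddown_interval)
    # Start SPF_TIMER only if it is not already running.
    if not spf_timer.running():
        spf_timer.start(now, short_spf_delay)
    # Remain in current state: nothing else to do.


spf, holddown = Timer(), Timer()
for t in (100, 110, 120):  # the three closely spaced events from my example
    on_igp_event_in_short_wait(t, spf, holddown)
print(spf.expires_at, holddown.expires_at)  # 150 3120
```

The SPF_TIMER expiry stays at 150 ms, set by the first of the three
events, while only the HOLDDOWN_TIMER is pushed back by the later ones.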

>      >  the SPF timer fires and
>      > we recompute the SPF, then TIME_TO_LEARN fires and we go to
>      > LONG_WAIT until the HOLDDOWN_TIMER fires.
>     
>     [Bruno] Agreed with the above.
>     
>      >  Or maybe the SPF
>      > calculation takes more than 200ms, so when the second IGP event
>      > fires, we abort the currently in-progress calculation and don't
>      > start another one until 50ms after the last event? 
>     
>     [Bruno] The FSM does not take into account the SPF computation time. Hence the behavior is not changed by the SPF computation time.
>     The draft does not talk about aborting the SPF computation. I guess that one implementation may choose to abort the SPF computation, but it must not change the FSM state/timers due to this abortion. (otherwise, this implementation would be out of sync with other nodes/implementations)
>     

Agreed.  I'm also inclined to agree with the GenArt reviewer's
suggestion to add some discussion of (not) aborting SPF computation
to the document, but do not insist on it.

>      >  I bring this up
>      > because of the text in the second paragraph of Section 4 that talks
>      > of computing the post-failure routing table in "a single route
>      > computation".
>     
>     [Bruno] The point is that the number of SPF computations may be lower than the number of IGP events.
>     In the FSM, this is achieved with the following action " o  If SPF_TIMER is not already running, start it with value LONG_SPF_DELAY.". Which, IOW (negation form), says that if the SPF_TIMER is already running, we do nothing (new) and hence the new IGP event does not trigger an additional SPF computation.
>     

I agree that the number of SPF computations may be lower than the
number of IGP events.  I am not sure that "a single route
computation" is correct; perhaps "a single additional route
computation" is better.
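
To check my counting, the scenario from my original message can be
replayed through a simplified model of the FSM. This is a sketch of my
reading of the transitions as discussed in this thread, not the draft's
normative text; the function name, the 1 ms tick granularity, the
tie-breaking order within a tick, and the 3000 ms HOLDDOWN_INTERVAL are
all assumptions.

```python
def simulate(igp_events, initial_ms=0, short_ms=50, long_ms=2000,
             learn_ms=1000, holddown_ms=3000, horizon_ms=10_000):
    """Replay IGP event times (ms) and return (SPF run times, final state)."""
    events = set(igp_events)
    delay = {"QUIET": initial_ms, "SHORT_WAIT": short_ms, "LONG_WAIT": long_ms}
    state = "QUIET"
    spf_at = learn_at = hold_at = None  # absolute expiry; None = not running
    spf_runs = []
    for now in range(horizon_ms + 1):
        if now in events:
            # Start SPF_TIMER only if it is not already running.
            if spf_at is None:
                spf_at = now + delay[state]
            # Every IGP event (re)starts HOLDDOWN_TIMER.
            hold_at = now + holddown_ms
            if state == "QUIET":
                learn_at = now + learn_ms
                state = "SHORT_WAIT"
        if spf_at is not None and spf_at <= now:
            spf_at = None
            spf_runs.append(now)  # SPF computation; state is unchanged
        if learn_at is not None and learn_at <= now:
            learn_at = None
            if state == "SHORT_WAIT":
                state = "LONG_WAIT"
        if hold_at is not None and hold_at <= now:
            hold_at = learn_at = None
            state = "QUIET"
    return spf_runs, state


# One event, a 100 ms gap, then three events 10 ms apart.
print(simulate([0, 100, 110, 120]))  # ([0, 150], 'QUIET')
```

Under these assumptions the four events produce two SPF computations
(at 0 ms and 150 ms), which is why "a single additional route
computation" reads as more accurate to me.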

>      >   But if I understand correctly, the *single*
>      > computation only happens in the second case here, when the
>      > calculation takes some hundreds of milliseconds; otherwise we still
>      > have *two* computations (one triggered while we're in QUIET and the
>      > second triggered in SHORT_WAIT).  So I'm not sure I fully
>      > understand the expected scenario.
>      
>     [Bruno] The expected scenario is that multiple IGP events may be handled by a single SPF computation.
>     The typical real-life situation is a node failure. This is a single failure, but a link-state IGP will trigger and flood N IGP_events (one per IGP neighbor of the failed node). This is because (in short) a link-state IGP cannot advertise the failure of a node, but only the failure of a link.
>     Ideally, we should wait for these N IGP_events before running the SPF computation because:
>     - it's only by taking into account the N IGP_events that we correctly reflect the real network topology (i.e. the node failure).
>     - computing an SPF before receiving all N events will require computing another SPF shortly after, i.e. the first computation is wasted resources.
>     
>     The issue is that we don't know how many IGP events we should wait for. Hence the FSM defines and uses duration "TIME_TO_LEARN_INTERVAL". This duration "should" be able to be evaluated a priori by the network operator (as it is the max of the detection time, origination time, and flooding time).
>     
>      > I also am probably having some problems with terminology, presumably
>      > just my misunderstanding, which hopefully can be set straight
>      > easily.
>      
>     [Bruno] Please comment/ask questions if the above is not clear or does not address your point.
>      

My main question here can be summarized with the proposal to add the
word "additional", above.

>      > In the Introduction, we have a "desire to compute a new Shortest
>      > Path First (SPF) as soon as a failure is detected", which is using
>      > SPF as it is a data structure (e.g., the result of an algorithm),
>      > whereas my intuition has SPF referring to the algorithm [class] but
>      > not its output.
>      
>     [Bruno] You are right that SPF is the algo (and SPT the result).
>     Unfortunately, this gets too subtle for my level of English.  "Acee, any opinion on this?"
> 
> [Acee] Ben is technically correct. However, informally, we often refer to an "SPF" generically to refer both to the algorithm and an instance of the algorithm computation. We could change it to: 
> 
> 
>   OLD: In general, when the network is stable, there is a desire to compute
>            a new Shortest Path First (SPF) as soon as a failure is detected in
>   NEW: In general, when the network is stable, there is a desire to trigger 
>             a new Shortest Path First (SPF) computation as soon as a failure is detected in

That would help me, the naive reader, thanks.  (But if the other
usage is accepted among experts, there's no need to change it just
on my account.)

>      
>      > In section 3, we talk of "computation of the routing table, by the
>      > IGP", which gets me confused about whether "the IGP" represents a
>      > network protocol for conveying (e.g.) link state information, an
>      > algorithm for SPF computation, or a router that performs SPF
>      > computations.
>      
>     [Bruno] IGP is usually a protocol. In this sentence, it is meant as the IGP process of the router.
>     Again, I'm open to reformulation. "Acee, any opinion on this?"
> 
> [Acee] I don't think we need to change this. IGP is a well-known acronym. 
>              https://www.rfc-editor.org/materials/abbrev.expansion.txt

Perhaps my question was not well phrased.  I propose

OLD: computation of the routing table, by the IGP

NEW: computation of the routing table, by the IGP participant

(or something similar), since the IGP just serves to distribute the
LSDB (conceptually), and the computation of the routing table is
done by each router internally (i.e., not directly using the IGP in
question).  Or is the previous sentence not true?

>      
>      > In section 6 we talk of "the number of protocols
>      > reactions/computations triggered by IGP SPF".  Is this just in the sense
>      > of "each SPF calculation triggers a bunch of other stuff"? 
>     
>     [Bruno] Yes, exactly. Again by "protocol reaction" it's meant router's processes implementing those protocols.
>     FYI, typical protocols I can think of are BGP and PCE, but possibly also other IGPs (or similar) in case of route redistribution.
>     
>      > I think
>      > this is another case about me being confused whether "SPF" means an
>      > algorithm, a specific computation using that algorithm, etc.
>      
>     [Bruno] I agree that this is the same case. "Acee, any opinion on this?"
> 
> [Acee] We could change "IGP SPF" to "IGP SPF computation". 

Sounds good to me.

> 
>     > 
>      > 
>      > Some other editorial notes:
>      > 
>      > It's probably better to cite RFC 8174 instead of/in addition to RFC
>      > 2119, especially since there is at least a lowercase "may" present.
>      
>     [Bruno] ok, done.
>      
>      > It's unclear that "temporally close" in "multiple temporally close
>      > failures over a short time" really adds any value, in the
>      > Introduction.
>      
>     [Bruno] ok, done:
>     
>     OLD: However, when the network is experiencing multiple temporally close failures over a short period of time, there is a conflicting desire to limit the frequency of SPF computations.
>     
>     NEW: However, when the network is experiencing multiple failures over a short period of time, there is a conflicting desire to limit the frequency of SPF computations.
>     
>      
>      > In section 2, last bullet point on page 3, "SPF_DELAY timers values"
>      > probably doesn't need the plural "timers" (so, either "timer" or
>      > the possessive "timers'"), though I am mindful of the recent
>      > discussion on ietf@ about (non-)American English.  The second
>      > sentence of the bullet is also a sentence fragment and not a
>      > complete sentence.
>     
>     [Bruno] ok:
>     - I trust you on your first point and picked the possessive option
>     - I agree with you on the second point
>     
>     Currently changed to:
>     OLD:
>     Always try to avoid different SPF_DELAY timers values across different routers in the area/level. Even though not all routers will receive IGP messages at the same time, due to differences both in the distance from the originator of the IGP event and in flooding implementations.
>     
>     NEW:
>     Always try to avoid different SPF_DELAY timers' values across different routers in the area/level. This requires specific consideration as different routers may receive IGP messages at different interval or even order, due to differences both in the distance from the originator of the IGP event and in flooding implementations.
> 
> 
>     That being said, I'm not a native English speaker and Acee is kind enough to spend time correcting my errors. Therefore, Acee and obviously the RFC editor may further edit this text.
> 
> [Acee] I think "SPF_DELAY timer values" reads better as a single plural compound noun. Do you disagree? See clarification below:
>     
>  Always try to avoid different SPF_DELAY timer values across different routers in the area/level. This requires specific consideration as different routers may receive IGP messages at a different interval or even in a different order, due to differences both in the distance from the originator of the IGP event and in flooding implementations. 

This would be my preference, but I did not want to bias the authors
with my initial message.

>     
>     
>      
>      > SRLG is used without expansion in multiple places, but does not
>      > appear on https://www.rfc-editor.org/materials/abbrev.expansion.txt
>      > as a "well-known" abbreviation.
>     
>     [Bruno] ok, expanded on first use.
>      
>      
>      > In section 6, we find the awkward construction "play it safe and
>      > start with safe, i.e., longer timers".  Probably we want to say
>      > "safe values" as the noun, and maybe consider rewording to avoid the
>      > duplicate "safe" and/or the colloquialism "play it safe".
>      
>     [Bruno] ok
>     
>     OLD: In case of doubt, it's RECOMMENDED to play it safe and start with safe, i.e., longer timers.
>     NEW: In case of doubt, it's RECOMMENDED to start with safer (i.e. longer) timer values.
>     
>     Again, text may be subject to further revision.
>      
> [Acee]: In case of doubt, it's RECOMMENDED to start with safer (i.e., longer) timer values.      
> 
>      > Section 8 says:
>      > 
>      >    [...]. FIBs
>      >    are installed after multiple steps such as flooding of the IGP event
>      >    across the network, SPF wait time, SPF computation, FIB distribution
>      >    across line cards, and FIB update.  This document only addresses the
>      >    first contribution.
>      > 
>      > which makes me try to match up "the first contribution" with the
>      > flooding, when I assume it's meant to match up with the SPF wait
>      > time.
>      
>     [Bruno] You are absolutely right. Thanks for the catch.
>     
>     OLD:  FIBs are installed after multiple steps such as flooding of the IGP event across the network, SPF wait time, SPF computation, FIB distribution across line cards, and FIB update. This document only addresses the first contribution.
>     NEW: FIBs are installed after multiple steps such as flooding of the IGP event across the network, SPF wait time, SPF computation, FIB distribution across line cards, and FIB update. This document only addresses the contribution from the SPF wait time.

Sounds good.

>     Thanks again for your careful review.
> 
> Yes - Thank you, 

You're welcome!

-Benjamin