Re: [secdir] secdir review of draft-ietf-rtgwg-backoff-algo-07

<bruno.decraene@orange.com> Thu, 15 February 2018 16:17 UTC

From: <bruno.decraene@orange.com>
To: Benjamin Kaduk <kaduk@mit.edu>, "Acee Lindem (acee)" <acee@cisco.com>
CC: "draft-ietf-rtgwg-backoff-algo.all@ietf.org" <draft-ietf-rtgwg-backoff-algo.all@ietf.org>, "iesg@ietf.org" <iesg@ietf.org>, "secdir@ietf.org" <secdir@ietf.org>
Date: Thu, 15 Feb 2018 16:17:14 +0000
Message-ID: <9677_1518711435_5A85B28A_9677_280_1_53C29892C857584299CBF5D05346208A4799B57B@OPEXCLILM21.corporate.adroot.infra.ftgroup>
References: <20180214211017.GI12363@mit.edu>
In-Reply-To: <20180214211017.GI12363@mit.edu>
Archived-At: <https://mailarchive.ietf.org/arch/msg/secdir/NBYKBmXMkebi93tC_mfxLZFRR9E>
Subject: Re: [secdir] secdir review of draft-ietf-rtgwg-backoff-algo-07

Hi Benjamin, 

Many thanks for your careful and useful review.
Please see inline [Bruno] the proposed resolution.

Regarding the posting of draft -08, I'd propose to wait 24 hours for your possible follow-up, plus possible OPSDIR and GENART reviews, unless Alia has other instructions. (Note that I'll be on PTO next week.)

Acee,
There are a few questions for you. Please look for "Acee, any opinion on this?"


 > -----Original Message-----
 > From: Benjamin Kaduk [mailto:kaduk@mit.edu]
 > Sent: Wednesday, February 14, 2018 10:10 PM
 > To: draft-ietf-rtgwg-backoff-algo.all@ietf.org; iesg@ietf.org; secdir@ietf.org
 > Subject: secdir review of draft-ietf-rtgwg-backoff-algo-07
 > 
 > Hi all,
 > 
 > I have reviewed this document as part of the security directorate's
 > ongoing effort to review all IETF documents being processed by the
 > IESG.  These comments were written primarily for the benefit of the
 > security area directors.  Document editors and WG chairs should treat
 > these comments just like any other last call comments.
 > 
 > From a security perspective, this document is Ready.
 
[Bruno] Excellent, thanks.

 >  It specifies a
 > standard scheme that can be used to back off SPF calculations during
 > periods of frequent IGP events, avoiding excessive resource
 > consumption performing calculations that would be rendered redundant
 > (or just be useless) soon.  The security considerations correctly
 > note that an attacker that can generate IGP events would be able to
 > delay the IGP convergence time, which is true both for this scheme
 > and all schemes previously in use.  (I might use more words to say
 > the same thing if I was writing it, but that probably reflects more
 > on me than the document.)
 > 
 > 
 > I do have some questions about the actual proposed FSM, though -- I
 > suspect that I am just making some implicit assumptions that may not
 > be grounded in reality.  In particular, I am basically assuming that
 > INITIAL_SPF_DELAY < SHORT_SPF_DELAY < LONG_SPF_DELAY <
 > HOLDDOWN_INTERVAL.  (The draft itself only has it as RECOMMENDED for
 > SPF_INITIAL_DELAY <= SPF_SHORT_DELAY <= SPF_LONG_DELAY in Section 6,
 > yes, with the different spellings, but has a MUST for
 > HOLDDOWN_INTERVAL > TIME_TO_LEARN_INTERVAL.)
 > This would potentially affect the state machine for events 1, 6, and
 > 7.

[Bruno] Thanks for detecting the misspelling. Corrected. 
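For reference, the timer constraints discussed above can be checked mechanically. The following Python sketch is illustrative only (the function name and the (errors, warnings) return convention are assumptions, not from the draft): it treats HOLDDOWN_INTERVAL > TIME_TO_LEARN_INTERVAL as an error, since the draft has a MUST for it, and INITIAL_SPF_DELAY <= SHORT_SPF_DELAY <= LONG_SPF_DELAY as a warning, since that ordering is only RECOMMENDED. It deliberately does not require HOLDDOWN_INTERVAL to exceed the SPF delays, because the draft does not require it either.

```python
def check_backoff_timers(initial_spf_delay, short_spf_delay, long_spf_delay,
                         time_to_learn_interval, holddown_interval):
    """Hypothetical validator for SPF back-off timer values (all in ms).

    Returns (errors, warnings): errors violate the draft's MUST,
    warnings only depart from its RECOMMENDED ordering.
    """
    errors, warnings = [], []
    # MUST: HOLDDOWN_INTERVAL > TIME_TO_LEARN_INTERVAL
    if not holddown_interval > time_to_learn_interval:
        errors.append("HOLDDOWN_INTERVAL must exceed TIME_TO_LEARN_INTERVAL")
    # RECOMMENDED: INITIAL_SPF_DELAY <= SHORT_SPF_DELAY <= LONG_SPF_DELAY
    if not initial_spf_delay <= short_spf_delay <= long_spf_delay:
        warnings.append("SPF delays are not in non-decreasing order")
    # No check that HOLDDOWN_INTERVAL exceeds the SPF delays: the draft
    # does not mandate it, so the FSM has to cope with either ordering.
    return errors, warnings


# Reviewer's example delays, with an assumed 5 s HOLDDOWN_INTERVAL
print(check_backoff_timers(0, 50, 2000, 1000, 5000))  # ([], [])
print(check_backoff_timers(0, 50, 2000, 1000, 500))
```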
 
 > In transition 1, we say to only start the SPF_TIMER if it is not already
 > running, but I do not see a way for it to already be running unless
 > the HOLDDOWN_TIMER value is less than one or more of the SPF_TIMER
 > values.
 
[Bruno] I agree.
2 comments:
- IMO, it feels safer to check that SPF_TIMER is not already running; plus, this is more consistent with the other states.
- As you have noted, the draft does not mandate that HOLDDOWN_TIMER > *_SPF_DELAY. Hence this check is indeed required.
 
 > Similarly, I don't see how transition 7 could ever happen, since an IGP
 > event moves us out of the QUIET state, and I assume that the
 > SPF_TIMER would fire before the HOLDDOWN_TIMER, since the latter is
 > reset on every IGP event and I assume the latter has a larger value.
 
[Bruno] Same answer: the draft does not mandate that HOLDDOWN_TIMER > *_SPF_DELAY.
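To make this concrete, here is a minimal Python sketch of the FSM as described in this thread (states QUIET, SHORT_WAIT, LONG_WAIT; timers SPF_TIMER, LEARN_TIMER, HOLDDOWN_TIMER). It is an illustration, not the draft's normative text: the class and function names are made up, time is simulated in integer milliseconds, and a timer expiring at the same instant as an IGP event is processed first, an ordering the draft does not specify. With HOLDDOWN_INTERVAL (500 ms) greater than TIME_TO_LEARN_INTERVAL but shorter than LONG_SPF_DELAY (2000 ms), the FSM goes back to QUIET while SPF_TIMER is still running, which exercises both the transition 1 guard and transition 7.

```python
from dataclasses import dataclass


@dataclass
class BackoffTimers:
    """Timer values in ms; field names mirror the draft's parameters."""
    initial_spf_delay: int
    short_spf_delay: int
    long_spf_delay: int
    time_to_learn_interval: int
    holddown_interval: int


class SpfBackoffFsm:
    """Illustrative FSM with states QUIET, SHORT_WAIT and LONG_WAIT."""

    def __init__(self, timers):
        self.timers = timers
        self.state = "QUIET"
        # Absolute expiry time of each timer, or None when it is not running.
        self.expiry = {"SPF_TIMER": None, "LEARN_TIMER": None,
                       "HOLDDOWN_TIMER": None}
        self.spf_runs = []  # (time, state) for every SPF computation

    def _start_spf_if_not_running(self, now, delay):
        if self.expiry["SPF_TIMER"] is None:
            self.expiry["SPF_TIMER"] = now + delay

    def igp_event(self, now):
        t = self.timers
        # Every IGP event starts (or resets) HOLDDOWN_TIMER.
        self.expiry["HOLDDOWN_TIMER"] = now + t.holddown_interval
        if self.state == "QUIET":          # transition 1
            self._start_spf_if_not_running(now, t.initial_spf_delay)
            self.expiry["LEARN_TIMER"] = now + t.time_to_learn_interval
            self.state = "SHORT_WAIT"
        elif self.state == "SHORT_WAIT":   # transition 2
            self._start_spf_if_not_running(now, t.short_spf_delay)
        else:                              # IGP event in LONG_WAIT
            self._start_spf_if_not_running(now, t.long_spf_delay)

    def timer_expired(self, name, now):
        self.expiry[name] = None
        if name == "SPF_TIMER":
            # Compute SPF, stay in the current state (transition 7 in QUIET).
            self.spf_runs.append((now, self.state))
        elif name == "LEARN_TIMER":
            self.state = "LONG_WAIT"       # only ever running in SHORT_WAIT
        else:
            # HOLDDOWN_TIMER: back to QUIET; from SHORT_WAIT this is
            # transition 6, which also kills LEARN_TIMER.
            self.expiry["LEARN_TIMER"] = None
            self.state = "QUIET"


def simulate(timers, event_times):
    """Feed IGP events (ms) to the FSM, firing timers in time order."""
    fsm = SpfBackoffFsm(timers)
    pending = sorted(event_times)
    while True:
        running = [(at, name) for name, at in fsm.expiry.items()
                   if at is not None]
        if not running and not pending:
            return fsm
        nxt = min(running) if running else None
        # Assumption: a timer due at the same instant as an IGP event fires first.
        if nxt is not None and (not pending or nxt[0] <= pending[0]):
            fsm.timer_expired(nxt[1], nxt[0])
        else:
            fsm.igp_event(pending.pop(0))


fsm = simulate(BackoffTimers(0, 50, 2000, 100, 500), [0, 200, 800])
print(fsm.spf_runs)  # [(0, 'SHORT_WAIT'), (2200, 'QUIET')]
```

Here the IGP event at 200 ms arrives in LONG_WAIT and arms SPF_TIMER for 2200 ms; HOLDDOWN_TIMER returns the FSM to QUIET at 700 ms, so the event at 800 ms finds SPF_TIMER already running (transition 1 does not restart it), and the computation finally runs in QUIET (transition 7). Feeding the reviewer's scenario instead (`BackoffTimers(0, 50, 2000, 1000, 5000)`, with the 5 s holddown an assumed value, and events at 0, 100, 110 and 120 ms) yields computations at 0 and 150 ms, both in SHORT_WAIT.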

 > Transition 6 is a little less clear, but is also similar -- if
 > HOLDDOWN_TIMER is larger than LEARN_TIMER, then LEARN_TIMER must
 > fire before HOLDDOWN_TIMER, and we leave SHORT_WAIT to go to
 > LONG_WAIT before we could consider leaving SHORT_WAIT to go back to
 > QUIET.
 
[Bruno] I agree with you, i.e., transition 6 should never be used if HOLDDOWN_INTERVAL > TIME_TO_LEARN_INTERVAL, which is a MUST in the draft.
At this point, I'd rather keep it, as it makes the FSM more robust (I'm not fully confident that every implementation/configuration interface would enforce the MUST). Also, during previous reviews, we were asked to list all transitions rather than only the main ones.
However, I'm open to other opinions.
Acee, any opinion on this?
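The argument can be checked with a little arithmetic: LEARN_TIMER is started once, on entry to SHORT_WAIT, and never reset, whereas every IGP event pushes HOLDDOWN_TIMER further out, so its expiry is never earlier than the first event plus HOLDDOWN_INTERVAL. The helper below is hypothetical; it assumes the FSM is in QUIET before the first event and that a timer due at the same instant as an event fires first (the draft does not specify this). Ties between the two timers, which cannot occur when the MUST holds, are reported as LEARN_TIMER.

```python
def short_wait_exit(event_times, time_to_learn_interval, holddown_interval):
    """Which timer takes the FSM out of SHORT_WAIT (hypothetical helper, ms).

    Assumes the FSM is in QUIET before the first event. LEARN_TIMER is
    started once, on entry to SHORT_WAIT, and never reset; HOLDDOWN_TIMER
    is restarted by every IGP event.
    """
    events = sorted(event_times)
    learn_expiry = events[0] + time_to_learn_interval
    holddown_expiry = events[0] + holddown_interval
    for t in events[1:]:
        # Only events strictly before both expiries are seen in SHORT_WAIT.
        if t >= min(learn_expiry, holddown_expiry):
            break
        holddown_expiry = t + holddown_interval
    if holddown_expiry < learn_expiry:
        return "HOLDDOWN_TIMER: SHORT_WAIT -> QUIET (transition 6)"
    return "LEARN_TIMER: SHORT_WAIT -> LONG_WAIT"


# Valid configuration: prints "LEARN_TIMER: SHORT_WAIT -> LONG_WAIT"
print(short_wait_exit([0, 100, 110, 120], 1000, 5000))
# HOLDDOWN_INTERVAL < TIME_TO_LEARN_INTERVAL (violates the MUST): transition 6
print(short_wait_exit([0, 100], 1000, 500))
```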

 
 > So, am I making some flawed assumptions?  (Are there examples of
 > situations that clearly demonstrate the flaw?)
 
[Bruno] 
Your assumptions are logical but not mandated by the draft, hence the FSM needs to work even without those assumptions.
One could argue that your assumptions are valid. But someone else may argue that they want freedom in choosing the timer values, and that the FSM should be robust to any timer values.
 
 > Also, to confirm my understanding, suppose a scenario happens where
 > an IGP participant sees an event, then a gap of 100ms, then three
 > IGP events at equal 10ms intervals, with the SPF delays at the
 > example values of 0/50/2000 ms and TIME_TO_LEARN of 1s.  The first
 > event triggers a SPF computation immediately, then we go to
 > SHORT_WAIT, the first of the three events kicks off a SPF_TIMER for
 > 50ms, 

[Bruno] Agreed

 > which is reset by the next two events,

[Bruno] I don't think I agree.
This case is handled by transition "2: IGP event", which triggers the following actions:

        o  Reset HOLDDOWN_TIMER to HOLDDOWN_INTERVAL.
        o  If SPF_TIMER is not already running, start it with value SHORT_SPF_DELAY.
        o  Remain in current state.


So in your example, the second of the three events arrives 10ms after the start of the 50ms SPF_TIMER, i.e., SPF_TIMER is already running and hence is not changed. The same applies to the third event.
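The difference between the draft's rule and a timer that is restarted on every event shows up in when the second computation runs. The sketch below is limited to a burst that starts in QUIET and whose later events all arrive in SHORT_WAIT (within TIME_TO_LEARN_INTERVAL of the first, with a longer HOLDDOWN_INTERVAL); the function and its `restart_on_event` flag are hypothetical, and `restart_on_event=True` models the "reset by the next two events" reading, which is not what the draft specifies.

```python
def spf_run_times(event_times, initial_spf_delay, short_spf_delay,
                  restart_on_event=False):
    """SPF computation times (ms) for a burst handled in QUIET then SHORT_WAIT.

    restart_on_event=False follows transition 2 ("if SPF_TIMER is not
    already running, start it"); True models restarting it on each event.
    """
    runs = []
    spf_expiry = None  # None means SPF_TIMER is not running
    for i, t in enumerate(sorted(event_times)):
        if spf_expiry is not None and spf_expiry <= t:
            runs.append(spf_expiry)  # the timer fired before this event
            spf_expiry = None
        # First event is handled in QUIET, the rest in SHORT_WAIT.
        delay = initial_spf_delay if i == 0 else short_spf_delay
        if spf_expiry is None or restart_on_event:
            spf_expiry = t + delay
    if spf_expiry is not None:
        runs.append(spf_expiry)
    return runs


burst = [0, 100, 110, 120]
print(spf_run_times(burst, 0, 50))                         # [0, 150]
print(spf_run_times(burst, 0, 50, restart_on_event=True))  # [0, 170]
```

Both readings give two computations for four events here; the draft's rule bounds the wait from the first event of the sub-burst (150 ms) instead of letting each new event push it back (170 ms).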

 >  the SPF timer fires and
 > we recompute the SPF, then TIME_TO_LEARN fires and we go to
 > LONG_WAIT until the HOLDDOWN_TIMER fires.

[Bruno] Agreed with the above.

 >  Or maybe the SPF
 > calculation takes more than 200ms, so when the second IGP event
 > fires, we abort the currently in-progress calculation and don't
 > start another one until 50ms after the last event? 

[Bruno] The FSM does not take the SPF computation time into account. Hence, the behavior does not change with the SPF computation time.
The draft does not talk about aborting the SPF computation. I guess that an implementation may choose to abort the SPF computation, but it must not change the FSM state/timers because of this abort (otherwise, this implementation would be out of sync with other nodes/implementations).

 >  I bring this up
 > because of the text in the second paragraph of Section 4 that talks
 > of computing the post-failure routing table in "a single route
 > computation".

[Bruno] The point is that the number of SPF computations may be lower than the number of IGP events.
In the FSM, this is achieved with the following action: "If SPF_TIMER is not already running, start it with value LONG_SPF_DELAY." In other words (negated form), if SPF_TIMER is already running, we do nothing new, and hence the new IGP event does not trigger an additional SPF computation.

 >   But if I understand correctly, the *single*
 > computation only happens in the second case here, when the
 > calculation takes some hundreds of milliseconds; otherwise we still
 > have *two* computations (one triggered while we're in QUIET and the
 > second triggered in SHORT_DELAY).  So I'm not sure I fully
 > understand the expected scenario.
 
[Bruno] The expected scenario is that multiple IGP events may be handled by a single SPF computation.
The typical real-life situation is a node failure. This is a single failure, but a link-state IGP will trigger and flood N IGP events (one per IGP neighbor of the failed node). This is because (in short) a link-state IGP cannot advertise the failure of a node, only the failure of a link.
Ideally, we should wait for these N IGP events before computing the SPF because:
- it's only by taking into account the N IGP events that we correctly reflect the real network topology (i.e., the node failure).
- computing an SPF before receiving all N events will require computing another SPF shortly after, i.e., the first computation wastes resources.

The issue is that we don't know how many IGP events we should wait for. Hence, the FSM defines and uses the duration TIME_TO_LEARN_INTERVAL. The network operator "should" be able to evaluate this duration a priori (as it is the max of the detection time, origination time, and flooding time).
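To illustrate the role of TIME_TO_LEARN_INTERVAL in the node-failure case: only the advertisements that arrive before LEARN_TIMER expires are handled in SHORT_WAIT; later ones no longer get SHORT_SPF_DELAY. The helper and the arrival times below are invented for illustration, and the sketch assumes the router was in QUIET when the first advertisement arrived and that HOLDDOWN_INTERVAL > TIME_TO_LEARN_INTERVAL (the draft's MUST), so SHORT_WAIT can only be left when LEARN_TIMER expires.

```python
def split_by_learn_window(arrival_times, time_to_learn_interval):
    """Split IGP event arrival times (ms) around LEARN_TIMER expiry.

    Hypothetical helper. Late events are no longer handled with
    SHORT_SPF_DELAY: they land in LONG_WAIT, or start a fresh cycle from
    QUIET if HOLDDOWN_TIMER has expired in the meantime.
    """
    arrivals = sorted(arrival_times)
    learn_expiry = arrivals[0] + time_to_learn_interval
    # Assumption: an event at exactly learn_expiry is seen after the timer fires.
    in_short_wait = [t for t in arrivals if t < learn_expiry]
    late = [t for t in arrivals if t >= learn_expiry]
    return in_short_wait, late


# Invented arrival times for the N = 4 advertisements of one node failure
print(split_by_learn_window([0, 40, 300, 1200], 1000))  # ([0, 40, 300], [1200])
```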

 > I also am probably having some problems with terminology, presumably
 > just my misunderstanding, which hopefully can be set straight
 > easily.
 
[Bruno] Please comment/ask questions if the above is not clear or does not address your point.
 
 > In the Introduction, we have a "desire to compute a new Shortest
 > Path First (SPF) as soon as a failure is detected", which is using
 > SPF as it is a data structure (e.g., the result of an algorithm),
 > whereas my intuition has SPF referring to the algorithm [class] but
 > not its output.
 
[Bruno] You are right that SPF is the algo (and SPT the result).
Unfortunately, this gets too subtle for my level of English.  "Acee, any opinion on this?"
 
 > In section 3, we talk of "computation of the routing table, by the
 > IGP", which gets me confused about whether "the IGP" represents a
 > network protocol for conveying (e.g.) link state information, an
 > algorithm for SPF computation, or a router that performs SPF
 > computations.
 
[Bruno] IGP usually refers to a protocol. In this sentence, it means the IGP process of the router.
Again, I'm open to reformulation. "Acee, any opinion on this?"
 
 > In section 6 we talk of "the number of protocols
 > reactions/computations triggered by IGP SPF".  Is this just in the sense
 > of "each SPF calculation triggers a bunch of other stuff"? 

[Bruno] Yes, exactly. Again, by "protocol reaction" we mean the router's processes implementing those protocols.
FYI, typical protocols I can think of are BGP and PCE, and possibly other IGPs (or IGP-like protocols) in case of route redistribution.

 > I think
 > this is another case about me being confused whether "SPF" means an
 > algorithm, a specific computation using that algorithm, etc.
 
[Bruno] I agree that this is the same case. "Acee, any opinion on this?"
 > 
 > 
 > Some other editorial notes:
 > 
 > It's probably better to cite RFC 8174 instead of/in addition to RFC
 > 2119, especially since there is at least a lowercase "may" present.
 
[Bruno] ok, done.
 
 > It's unclear that "temporally close" in "multiple temporally close
 > failures over a short time" really adds any value, in the
 > Introduction.
 
[Bruno] ok, done:

OLD: However, when the network is experiencing multiple temporally close failures over a short period of time, there is a conflicting desire to limit the frequency of SPF computations.

NEW: However, when the network is experiencing multiple failures over a short period of time, there is a conflicting desire to limit the frequency of SPF computations.

 
 > In section 2, last bullet point on page 3, "SPF_DELAY timers values"
 > probably doesn't need the plural "timers" (so, either "timer" or
 > the possessive "timers'"), though I am mindful of the recent
 > discussion on ietf@ about (non-)American English.  The second
 > sentence of the bullet is also a sentence fragment and not a
 > complete sentence.

[Bruno] ok:
- I trust you on your first point and picked the possessive option
- I agree with you on the second point

Currently changed to:
OLD:
Always try to avoid different SPF_DELAY timers values across different routers in the area/level. Even though not all routers will receive IGP messages at the same time, due to differences both in the distance from the originator of the IGP event and in flooding implementations.

NEW:
Always try to avoid different SPF_DELAY timers' values across different routers in the area/level. This requires specific consideration, as different routers may receive IGP messages at different times or even in a different order, due to differences both in the distance from the originator of the IGP event and in flooding implementations.


That being said, I'm not a native English speaker and Acee is kind enough to spend time correcting my errors. Therefore, Acee and obviously the RFC editor may further edit this text.

 
 > SRLG is used without expansion in multiple places, but does not
 > appear on https://www.rfc-editor.org/materials/abbrev.expansion.txt
 > as a "well-known" abbreviation.

[Bruno] ok, expanded on first use.
 
 
 > In section 6, we find the awkward construction "play it safe and
 > start with safe, i.e., longer timers".  Probably we want to say
 > "safe values" as the noun, and maybe consider rewording to avoid the
 > duplicate "safe" and/or the colloquialism "play it safe".
 
[Bruno] ok

OLD: In case of doubt, it's RECOMMENDED to play it safe and start with safe, i.e., longer timers.
NEW: In case of doubt, it's RECOMMENDED to start with safer (i.e. longer) timer values.

Again, the text may be subject to further revision.
 
 > Section 8 says:
 > 
 >    [...]. FIBs
 >    are installed after multiple steps such as flooding of the IGP event
 >    across the network, SPF wait time, SPF computation, FIB distribution
 >    across line cards, and FIB update.  This document only addresses the
 >    first contribution.
 > 
 > which makes me try to match up "the first contribution" with the
 > flooding, when I assume it's meant to match up with the SPF wait
 > time.
 
[Bruno] You are absolutely right. Thanks for the catch.

OLD:  FIBs are installed after multiple steps such as flooding of the IGP event across the network, SPF wait time, SPF computation, FIB distribution across line cards, and FIB update. This document only addresses the first contribution.
NEW: FIBs are installed after multiple steps such as flooding of the IGP event across the network, SPF wait time, SPF computation, FIB distribution across line cards, and FIB update. This document only addresses the contribution from the SPF wait time.

Thanks again for your careful review.

--Bruno 

 > -Benjamin
