Re: [CCAMP] Benjamin Kaduk's Discuss on draft-ietf-ccamp-alarm-module-08: (with DISCUSS and COMMENT)

Benjamin Kaduk <kaduk@mit.edu> Wed, 10 April 2019 02:30 UTC

Date: Tue, 09 Apr 2019 21:30:22 -0500
From: Benjamin Kaduk <kaduk@mit.edu>
To: stefan vallin <stefan@wallan.se>
Cc: The IESG <iesg@ietf.org>, draft-ietf-ccamp-alarm-module@ietf.org, Daniele Ceccarelli <daniele.ceccarelli@ericsson.com>, ccamp-chairs@ietf.org, ccamp@ietf.org
Message-ID: <20190410023022.GC18549@kduck.mit.edu>
References: <155475628903.30058.8136578311950144931.idtracker@ietfa.amsl.com> <41BF0E2E-DDDD-45B7-B1F5-BBDB00AA7BC4@wallan.se>
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"
Content-Disposition: inline
Content-Transfer-Encoding: 8bit
In-Reply-To: <41BF0E2E-DDDD-45B7-B1F5-BBDB00AA7BC4@wallan.se>
User-Agent: Mutt/1.10.1 (2018-07-13)
Archived-At: <https://mailarchive.ietf.org/arch/msg/ccamp/JFlR3OMcOr4WsQOIAYeK5CuCLxQ>
Subject: Re: [CCAMP] Benjamin Kaduk's Discuss on draft-ietf-ccamp-alarm-module-08: (with DISCUSS and COMMENT)
Precedence: list

On Tue, Apr 09, 2019 at 02:08:46PM +0200, stefan vallin wrote:
> Hi Benjamin!
> Thanks for your review!
> See inline
> 
> br Stefan and Martin
> 
> > On 8 Apr 2019, at 22:44, Benjamin Kaduk via Datatracker <noreply@ietf.org> wrote:
> > 
> > Benjamin Kaduk has entered the following ballot position for
> > draft-ietf-ccamp-alarm-module-08: Discuss
> > 
> > When responding, please keep the subject line intact and reply to all
> > email addresses included in the To and CC lines. (Feel free to cut this
> > introductory paragraph, however.)
> > 
> > 
> > Please refer to https://www.ietf.org/iesg/statement/discuss-criteria.html
> > for more information about IESG DISCUSS and COMMENT positions.
> > 
> > 
> > The document, along with other ballot positions, can be found here:
> > https://datatracker.ietf.org/doc/draft-ietf-ccamp-alarm-module/
> > 
> > 
> > 
> > ----------------------------------------------------------------------
> > DISCUSS:
> > ----------------------------------------------------------------------
> > 
> > I think we may need to double-check that the example in Appendix C is
> > fully compliant with the main spec.  In particular, are the
> > <status-change> elments properly sorted?  (I'm less sure whether the
> > <last-changed> time needs to match a <status-change> or whether the
> > <operator-state-change> is good enough for that.)
> 
> Thanks for catching this! The example was as intended but the description
> clauses where wrong.

Ah, I'm happy to be ableto help!
These changes look good to bring the example and spec into line with the
intended behavior.

> list status-change {
>     description
>     OLD:
>             This list is ordered according to the timestamps of alarm
>             state changes.  The last item corresponds to the latest
>             state change.
> 
>     NEW:
>          This list is ordered according to the timestamps of alarm
>          status changes.  The first item corresponds to the latest
>          status change.
> 
> Both <status-change> and <operator-state-change> will update
> <last-changed>
> 
> leaf last-changed {
>    description
>    OLD:
>         "A timestamp when the alarm status was last changed.  Status
>         changes are changes to 'is-cleared', 'perceived-severity',
>         and 'alarm-text'.";
> 
>    NEW:
>         "A timestamp when the 'status-change' or 'operator-state-change' list
>          was last changed.";
> 
> > 
> > 
> > ----------------------------------------------------------------------
> > COMMENT:
> > ----------------------------------------------------------------------
> > 
> > Section 1.1
> > 
> >   o  Fault: A fault is the underlying cause of an undesired behaviour.
> >      There is no trivial one-to-one mapping between faults and alarms.
> >      One fault may result in several alarms in case the system lacks
> >      root-cause and correlation capabilities.  An alarm might not have
> >      an underlying fault as a cause, imagine a bad MOS score alarm from
> >      a VOIP probe and the cause being non-optimal QoS configuration.
> > 
> > nit: this is a comma splice
> change to 
> An alarm might not have an underlying fault as a cause. For example,
> imagine a bad Mean Opppinion Score (MOS) alarm from a Voice Over IP
> (VOIP) probe and the cause being non-optimal QoS configuration.
> > 
> >   o  Alarm Type: An alarm type identifies a possible unique alarm state
> >      for a resource.  Alarm types are names to identify the state like
> >      "link-alarm", "jitter-violation", "high-disk-utilization".
> > 
> > Are these only intended for human consumption?
> No, these are YANG identities and intended for automation software as well

I think my follow-up question was going to be whether there was value in a
registry for them, to avoid proliferation of nearly-identical types from
different vendors.  (I honestly don't know.)

> > 
> >   o  Cleared alarm: A cleared alarm is an alarm where the system/server
> >      considers the undesired state to be cleared.  Operators can not
> >      clear alarms, clearance is managed by the system.  A linkUp
> >      notification can be considered a clear condition for a linkDown
> >      state.
> > 
> > nit: I'd suggest "For example, " before "a linkUp notification […]”
> Ok, fixed
> 
> > 
> >   o  Corrective Action: An action taken by an operator or automation
> >      routine in order to minimize the impact of the alarm or resolving
> >      the root cause.
> > 
> > nit: "or resolve the root cause”
> Ok, fixed
> > 
> > Section 2
> > 
> >   o  Clear definition of "alarm" in order to exclude general events
> >      that should not be forwarded as alarm notifications.
> > 
> > I'm not sure I am parsing this correctly.  Is the objective to *provide*
> > such a clear definition?  (Similarly for the next item.)  Part of my
> > confusion probably stems from the dual use of the word "clear" as a verb
> > and a noun in English.
> Good comment, suggestion below:
> OLD:
>     Clear definition of "alarm" in order to exclude general events
>     that should not be forwarded as alarm notifications.
> NEW:
>     to provide a precise definition of "alarm" in order to exclude general events
>     that should not be forwarded as alarm notifications.

Thanks!

> > Section 3.2
> > 
> >   In order to allow for dynamic addition of alarm types the alarm
> >   module allows for further qualification of the identity based alarm
> >   type using a string.  A potential drawback of this is that there is a
> >   big risk that alarm operators will receive alarm types as a surprise,
> >   they do not know how to resolve the problem since a defined alarm
> >   procedure does not necessarily exist.  [...]
> > 
> > nit: this is a comma splice.
> Correct, will change the comma to a period.
> 
> > 
> > Section 3.5.1
> > 
> >   From a resource perspective, an alarm can for example have the
> >   following life-cycle: raise, change severity, change severity, clear,
> > 
> > Is the duplicate "change severity" intentional?
> Yes. An alarm can toggle minor, major, minor for example

Okay.

> > For the alarm history functionality, a given 'time' leaf is clearly the
> > time that a change occurred, but are the sibling leafs the state before
> > that change or after that change?  (It looks like the actual YANG module
> > descriptions (e.g., for 'status-change') indicate that this is the state
> > after that change, but I don't know if it makes sense to also mention
> > that here or not.)
> As described in the YANG module it is the state after that change
> > 
> > Section 3.7
> > 
> >   (blocked/filtered).  Shelved alarms appear in a dedicated shelved
> >   alarm list in order not to disturb the relevant alarms.  Shelved
> > 
> > nit: this sentence could probably be reworded for greater clarity (what
> > does "disturb" mean, and perhaps what "relevant alarms" is may not be
> > clear just from context).
> Good comment!
> OLD:
> (blocked/filtered).  Shelved alarms appear in a dedicated shelved
>  alarm list in order not to disturb the relevant alarms.  Shelved
> NEW:
>  (blocked/filtered).  Shelved alarms appear in a dedicated shelved
>  alarm list in order to be filtered out from the alarm-list.  Shelved

That helps, thanks.  Perhaps "in order for the main alarm-list to only
contain entries of interest" could be a further improvement?

> 
> 
> > 
> > Section 4.1
> > 
> >   The "/alarms/control/notify-status-changes" leaf controls if
> >   notifications are sent for all state changes, only raise and clear,
> >   or only notifications more severe than a configured level.  This
> >   feature in combination with alarm shelving corresponds to the ITU
> >   Alarm Report Control functionality.
> > 
> > Is there a specific section reference for the ITU functionality?  (If
> > not, it's probably fine to leave this as-is).
> Good comment, changed to:
> This feature in combination with alarm shelving corresponds to the ITU
> Alarm Report Control functionality, see Appendix F.2.4.
> 
> > 
> > Section 4.2
> > 
> >      The system might not instrument all defined alarm type identities,
> >      and some alarm identities are abstract.
> > 
> > I just wanted to double-check: the intent is indeed to use "instrument"
> > here (as opposed to, say, "implement”)?
> 
> It was intentional but changed to "implement"

(No need to change it just on my account; I can see that "instrument" has a
valid meaning if that's the intent.)

> > Section 4.7
> > 
> >   /alarms/alarm-list/purge-alarms:  Delete alarms from the "alarm-list"
> >      according to specific criteria, for example all cleared alarms
> >      older than a specific date.
> > 
> >   /alarms/alarm-list/compress-alarms:  Compress the "status-change"
> >      list for the alarms.
> > 
> > It's a bit confusing to me to have such different language for these two
> > action nodes, since as far as I can tell the compress-alarms action also
> > allows for specific criteria (e.g., resources) to be provided to limit
> > the scope of the action.  (Similarly for compress-shelved-alarms.)
> Ok, I do not think we will consider changing the action names at this point

Sorry, I did not mean the name of the YANG leafs, but rather the text about
"according to specific criteria".

> > 
> > Section 6
> > 
> >     feature service-impact-analysis {
> >       description
> >         "The system supports identifying candidate impacted
> >          resources for an alarm, for example a link being impacted
> >          by an interface alarm.";
> > 
> > nit: it feels a bit odd to say that the link is "impacted by an alarm",
> > since normally the causality flow would be the other way -- the link
> > state changes, and then an alarm stat changes.
> OLD:
>        "The system supports identifying candidate impacted
>         resources for an alarm, for example a link being impacted
>         by an interface alarm.";
> NEW:
>        "The system supports identifying candidate impacted
>         resources for an alarm. For example, an interface state change
>         resulting in a link alarm which can refer to a link as being impacted.";
> > 
> > I'm not sure I understand the distinction between 'warning' and 'minor'
> > severities -- it seems that neither is supposed to be indicating a
> > service-affecting fault.
> 
> Can agree the difference is subtle and even unclear. However these are
> the original definitions from X.733 and they are being used in the industry.
> * for minor there is a fault: (indicates the existence of a non-service affecting fault)
> * for warning there might be one coming up: (detection of a potential or impending)

Okay; I am happy to concede that I'm not an expert here :)

> > 
> > For the 'age-spec' choice, am I reading RFC 7950 correctly that I choose
> > exactly one case of the choice, so I can't have both a "minute part"
> > and an "hours part"?  If so, then I would suggest using different
> > description text for the choices, since the current text implies that
> > you can combine parts with different units to assemble a compound
> > timespec.
> 
> Good comment:
> OLD:
>  Seconds part.
> 
> NEW:
>  Age expressed in seconds.
> 
> And the same change for the other cases

Thanks

> 
> > 
> >         list alarm {
> >              [...]
> >              Entries appear in the alarm list the first time an alarm
> >              becomes active for a given alarm-type and resource.
> >              Entries do not get deleted when the alarm is cleared, this
> >              is a boolean state in the alarm.
> > 
> > nit: this is a comma splice.
> OLD:
>             Entries appear in the alarm list the first time an alarm
>             becomes active for a given alarm-type and resource.
>             Entries do not get deleted when the alarm is cleared, this
>             is a boolean state in the alarm.
> NEW:    
> 	     Entries appear in the alarm list the first time an alarm
>             becomes active for a given alarm-type and resource.
>             Entries do not get deleted when the alarm is cleared.
>             Clear status is represented as a boolean flag.
> 
> 
> 
> > 
> > Section 10
> > 
> > It's a bit of a stretch as far as an attack, but an attacker that can
> > temporarily reduce max-alarm-status-changes to a small value (e.g., one)
> > could use that to hide their tracks to some extent if they can cause the
> > alarm to clear after they've done what they intended to do.  (They could
> > then restore the max-alarm-status-changes value, too.)
> Notifications will still be sent. I think this is a stretch and leave it out.

Good point about notifications, thanks.

> > It seems that the list of alarms itself may be potentially sensitive, in
> > that it potentially gives an attacker an authoritative picture of the
> > (broken) state of the network.
> 
> Ok, added the following:
> 
>         The list of alarms itself may be potentially sensitive from a
>         security perspective, in that it potentially gives an attacker
>         an authoritative picture of the (broken) state of the network.
> 
> 
> 
> > 
> > Appendix F.1
> > 
> > nit: We say that X.733 alarms "also use the basic criteria of deviation from
> > normal condition", but it's not entirely clear whether the "also"
> > indicates that they share this behavior with some external reference
> > point or is a synonym for "additionally”.
> 
> OLD:
> 	It also use the basic criteria of deviation from
> 	normal condition
> NEW:
> 	X.733 defines an alarm as a deviation from
>        normal condition, but without the requirement
>        that it requires corrective actions.
> 
> > 
> > nit: I think the semicolon in the ISA "comment" should be a regular
> > colon.
> Fixed
> > 
> > The paragraph at the end talks about "evolution" and "moves from" -- is
> > the table sorted chronologically?
> 
> Mostly, also sorted to show exactly that. I think the phrase is ok

The phrase is definitely fine -- I just didn't do the check for the
publication dates of the various specs, but am happy to trust that they are
sorted properly.

> > Appendix G
> > 
> > Are tables 4 and 5 talking about alarm rates or notification rates?
> 
> These are usability requirements and therefore the alarm rate rather than the notification rates.
> This is also an effect of this module, not focusing on a list of notifications but alarm states as a result of notifications.

Okay, thanks for confirming.

-Ben

[CCAMP] Benjamin Kaduk's Discuss on draft-ietf-cc… Benjamin Kaduk via Datatracker
Re: [CCAMP] Benjamin Kaduk's Discuss on draft-iet… stefan vallin
Re: [CCAMP] Benjamin Kaduk's Discuss on draft-iet… Benjamin Kaduk
Re: [CCAMP] Benjamin Kaduk's Discuss on draft-iet… stefan vallin
Re: [CCAMP] Benjamin Kaduk's Discuss on draft-iet… Martin Bjorklund