Re: [CCAMP] Genart last call review of draft-ietf-ccamp-alarm-module-07

stefan vallin <stefan@wallan.se> Tue, 19 March 2019 15:11 UTC

From: stefan vallin <stefan@wallan.se>
In-Reply-To: <155294084679.26073.4005125072161491147@ietfa.amsl.com>
Date: Tue, 19 Mar 2019 16:11:39 +0100
Cc: gen-art@ietf.org, draft-ietf-ccamp-alarm-module.all@ietf.org, ccamp@ietf.org, ietf@ietf.org
Message-Id: <956268DE-965F-40C3-A845-D236424CE07D@wallan.se>
References: <155294084679.26073.4005125072161491147@ietfa.amsl.com>
To: Dan Romascanu <dromasca@gmail.com>
Archived-At: <https://mailarchive.ietf.org/arch/msg/ccamp/LwddCoNgX6CiVt-hCT1CH73a1vE>
Subject: Re: [CCAMP] Genart last call review of draft-ietf-ccamp-alarm-module-07
Precedence: list

Hi Dan!
Thanks for your review, an honour to have RFC 3877 in the loop :)
See inline
br Stefan


> 
> 
> Major issues:
> 
> 1. The definition of Alarm is key for the whole model. It reads like this:
> Alarm (the general concept): An alarm signifies an undesirable state in a
> resource that requires corrective action.
> 
> However, RFC 3877 already defined a number of concepts including:
>  Error
>      A deviation of a system from normal operation.
> 
>   Fault
>      Lasting error or warning condition.
> 
>   ....
> 
>   Alarm
>      Persistent indication of a fault.
> 
> I believe that there is a need to show why the model defined by RFC 3877 needs
> to be changed, and why the difference that RFC 3877 was making between a Fault
> and an Alarm is no longer needed.

Good comment. You are right, and we need to keep the distinction between fault and alarm.
That distinction is used in X.733, 3GPP IRP and others. The general pattern is that “fault”
refers to what is really broken, and the alarm is the manifestation of that underlying cause.
There is no simple 1-1 relationship between a fault and an alarm:
* 1 fault may have many alarms due to limited root-cause analysis capabilities of the system
* There might be no underlying fault behind an alarm. Consider a non-optimal QoS configuration
  that gives bad quality in VoIP calls. There will certainly be a MOS alarm from the VoIP probe,
  but there is no “fault” as such (unless you consider a non-optimal config a fault)
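
As a rough illustration of the two bullets above (a sketch only: the class names, field
names and example alarms are hypothetical and are not taken from the draft, X.733 or 3GPP):

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Fault:
    """Hypothetical record for the underlying cause of a malfunction."""
    description: str


@dataclass(frozen=True)
class Alarm:
    """Hypothetical record for the manifestation that the operator sees."""
    resource: str
    alarm_type: str
    fault: Optional[Fault] = None  # None: no underlying fault as such


# One fault surfacing as several alarms (limited root-cause analysis).
fiber_cut = Fault("fiber cut")
alarms = [
    Alarm("if-1", "link-down", fiber_cut),
    Alarm("if-2", "link-down", fiber_cut),
    # MOS alarm caused by a non-optimal QoS configuration: nothing is broken.
    Alarm("voip-probe-1", "low-mos"),
]


def alarms_per_fault(alarm_list):
    """Group alarms by underlying fault; the None key collects fault-less alarms."""
    groups = {}
    for alarm in alarm_list:
        groups.setdefault(alarm.fault, []).append(alarm)
    return groups


groups = alarms_per_fault(alarms)
```

Here groups[fiber_cut] holds two alarms and groups[None] holds the MOS alarm, so neither
direction of the mapping is 1-1.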

For comparison, the existing definitions are:
X.733 fault: The physical or algorithmic cause of a malfunction
3GPP fault: a deviation of a system from normal operation, which may result in the loss of operational capabilities of the element or the loss of redundancy in case of a redundant configuration

I suggest we add the following to terminology:
Fault: the underlying cause of an undesired behaviour

If we then turn to the term “alarm”, I have added two aspects to the definition of an alarm:

An alarm signifies an *undesirable state* in a resource that *requires corrective action*.

These are mostly based on the alarm standardization work in the process industry (see draft references).

1) Rather than “deviation from normal”, we say “undesirable”, a subtle difference.
  In IT environments it is easier to define what is normal, for example a normal load on a web server,
  and any deviation from that normal load could be an alarm.
  In networking, things are more dynamic, and a deviation from normal might be the desired state.
  So the definition stresses that it is an undesired state, not just a deviation from normal.

2) Adding the requirement that an alarm by definition should require an action. This is a sound
  requirement that constrains what qualifies as an alarm and limits the number of alarms.
  (See for example the EEMUA and ISA182 references in the draft.) The 3GPP Alarm standard
  also added this to its definition in later revisions to address the alarm overload problem.





> Also, RFC 3877 defined in Section 3 a
> Framework and an Architecture that was consistent with X.733. This document has
> no such section, and while acknowledging the need for a mapping to X.733 it
> states as a goal:
> Mapping to X.733, which is a requirement for some alarm systems. Still, keep
> some of the X.733 concepts out of the core model in order to make the model
> small and easy to understand
> 
> More details about what is left out and why these are not needed would help.
The alarm YANG model does not *require* the X.733 parameter
definitions, for example the probable-cause enum values. Today, most networking devices
and management systems do not rely on those enumerations.

Those are defined in the X.733 augmentation module in order to keep the core model as
small and useful as possible. X.733 requirements come more often from telecom environments.

 
> 
> Minor issues:
> 
> 1. Section 2 makes a statement that includes
> ... While IETF has not really addressed alarm management
> 
> This is actually not accurate. RFC 3877 addressed Alarm Management. Maybe
> there is a need to revise that approach, but this should be done explicitly,
> not by stating that it did not exist.
Correct, bad wording.
OLD TEXT:
Address alarm usability requirements, see Appendix G.  While IETF
      has not really addressed alarm management, telecom standards has
      addressed it purely from a protocol perspective.  The process
      industry has published several relevant standards addressing
      requirements for a useful alarm interface; [EEMUA], [ISA182].
      This alarm module defines usability requirements as well as a YANG
      data model.
SUGGESTION:
Address alarm usability requirements, see Appendix G.  While IETF
      and telecom standards have addressed alarms mostly from a 
      protocol perspective, the process industry has published 
      several relevant standards addressing requirements for a useful 
      alarm interface; [EEMUA], [ISA182].
      This alarm module defines usability requirements as well as a YANG
      data model.

> 
> 2. Section 3.5:
> Closing an alarm implies that the operator considers the corrective action
> performed.
> 
> Is this always true? The undesirable state may have been cancelled by some
> other event than corrective action, for example the resource is no longer used,
> or the time elapsed may have made the undesirable state irrelevant.

I think it is important to keep the two perspectives in mind. An operator closing an
alarm is only a flag from the operations team that the alarm does not need an action.
It might or might not have been cleared by the system.

So in your first example, the alarm is probably cleared by the instrumentation
correlating “the other event”.

If the resource is no longer used, a shelf should be created.

If time has passed, it depends, ….

> 
> 3. In section 3.5.1:
> Alarms are not cleared by operators, only the underlying instrumentation can
> clear an alarm.  Operators can close alarms.
> 
> So, the document makes a distinction between clearing an alarm and closing an
> alarm. It may be good to define the two concepts to make the distinction clear.

Good point!

Suggested terminology additions:
* Cleared alarm: a cleared alarm is an alarm where the system/server considers the
undesired state to be cleared. Operators cannot clear alarms; clearance is managed
by the system. A linkUp notification can be considered a clear condition for a linkDown state.

* Closed alarm: operators can close alarms irrespective of whether the alarm is cleared or not.
A closed alarm indicates that the alarm does not need attention, either because the corrective
action has been taken or because it can be ignored for other reasons.
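
To make the split in ownership concrete, here is a minimal sketch, assuming one flag owned
by the system and one owned by the operator. The names (Alarm, is_cleared, is_closed,
on_notification, operator_close) are hypothetical illustrations, not the definitions in the
YANG module:

```python
from dataclasses import dataclass


@dataclass
class Alarm:
    resource: str
    alarm_type: str
    is_cleared: bool = False  # owned by the system/instrumentation
    is_closed: bool = False   # owned by the operator


def on_notification(alarm: Alarm, event: str) -> None:
    """System side: a linkUp is treated as the clear condition for link-down."""
    if alarm.alarm_type == "link-down" and event == "linkUp":
        alarm.is_cleared = True


def operator_close(alarm: Alarm) -> None:
    """Operator side: close the alarm whether or not it has been cleared."""
    alarm.is_closed = True


link_alarm = Alarm("if-1", "link-down")

operator_close(link_alarm)  # closed while the state is still not cleared
state_after_close = (link_alarm.is_cleared, link_alarm.is_closed)

on_notification(link_alarm, "linkUp")  # the system clears it later
state_after_link_up = (link_alarm.is_cleared, link_alarm.is_closed)
```

The sketch deliberately offers no operator function that touches is_cleared, which mirrors
the rule that only the system can clear an alarm.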


> 
> 4. Appendix F.1:
> The alarm MIB is state oriented rather than notification oriented, an alarm
> is a "lasting condition", not a discrete notification reporting about a
> condition state change.
Good catch, will rephrase. The alarm MIB and the alarm YANG both have a stateful view
of alarms, not a notification-focused one.

Suggested change:
OLD
RFC 3877 defines alarm referring back to "a deviation from normal operation". This is
problematic, since this might not require an  operator action. The alarm MIB is state 
oriented rather than notification oriented,  an alarm is a "lasting  condition", not a 
discrete notification reporting about a condition state change.
NEW:
RFC 3877 defines alarm referring back to "a deviation from normal operation". The Alarm YANG
model adds the requirement that it should require a corrective action and should be undesired,
not only a deviation from normal. The alarm MIB is state oriented in the same way as the Alarm YANG;
it focuses on the "lasting condition", not the individual notifications.


> 
> I am not sure that I understand this comment. Alarm states are defined also in
> this document, and Alarms as defined here are also different than ' a discrete
> notification reporting about a condition state change'. So, what does this
> comment really try to say?
> 
> Nits/editorial comments:
> 
>
