Re: [Gen-art] Genart last call review of draft-ietf-ccamp-alarm-module-07

Dan Romascanu <dromasca@gmail.com> Tue, 19 March 2019 18:54 UTC

References: <155294084679.26073.4005125072161491147@ietfa.amsl.com> <956268DE-965F-40C3-A845-D236424CE07D@wallan.se>
In-Reply-To: <956268DE-965F-40C3-A845-D236424CE07D@wallan.se>
From: Dan Romascanu <dromasca@gmail.com>
Date: Tue, 19 Mar 2019 20:54:08 +0200
Message-ID: <CAFgnS4WOQL3Q=ec_8efvKtMXb8RSXyaRGx+iyMs28RnWk=HGtQ@mail.gmail.com>
To: stefan vallin <stefan@wallan.se>
Cc: gen-art <gen-art@ietf.org>, draft-ietf-ccamp-alarm-module.all@ietf.org, ccamp@ietf.org, ietf <ietf@ietf.org>
Archived-At: <https://mailarchive.ietf.org/arch/msg/gen-art/Y4FyFGpJxoFcFbiJ4FvSHhYw4fU>

Hi Stefan,

Thank you for your answer and for addressing my concerns. I am comfortable
with your proposals. If your AD agrees, I would include these in a revised
version before the document is submitted to the IESG for approval.

Regards,

Dan


On Tue, Mar 19, 2019 at 5:11 PM stefan vallin <stefan@wallan.se> wrote:

> Hi Dan!
> Thanks for your review, an honour to have RFC 3877 in the loop :)
> See inline
> br Stefan
>
>
> >
> >
> > Major issues:
> >
> > 1. The definition of Alarm is key for the whole model. It reads like this:
> > Alarm (the general concept): An alarm signifies an undesirable state in a
> > resource that requires corrective action.
> >
> > However, RFC 3877 already defined a number of concepts including:
> >  Error
> >      A deviation of a system from normal operation.
> >
> >   Fault
> >      Lasting error or warning condition.
> >
> >   ....
> >
> >   Alarm
> >      Persistent indication of a fault.
> >
> > I believe that there is a need to show why the model defined by RFC 3877
> > needs to be changed, and why the difference that RFC 3877 was making
> > between a Fault and an Alarm is no longer needed.
>
> Good comment, you are right, and we need to keep the distinction between
> fault and alarm.
> That distinction is used in X.733, 3GPP IRP and others. The general
> pattern is that “fault” refers to what is really broken, and the alarm is
> the manifestation of that underlying cause.
> There is not a simple 1-1 relationship between a fault and an alarm:
> * 1 fault may have many alarms due to limited root cause capabilities of
>   the system
> * There might be no underlying fault to an alarm. Consider a non-optimal
>   QoS configuration which gives bad quality in VoIP calls. There is
>   certainly a MOS alarm from the VoIP probe, but there is no “fault” as
>   such (if you do not consider a non-optimal config as a fault)
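
The non-1-1 relationship described above can be illustrated with a minimal
Python sketch. The class and field names (Fault, Alarm, alarm_type) and the
sample resources are hypothetical, chosen only for illustration; they are
not taken from the draft's YANG module.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Fault:
    """Hypothetical record of what is really broken."""
    description: str


@dataclass
class Alarm:
    """Hypothetical alarm record: the manifestation seen by operators."""
    resource: str
    alarm_type: str
    fault: Optional[Fault] = None  # None: no underlying fault as such


def alarms_for(fault: Fault, alarms: List[Alarm]) -> List[Alarm]:
    """Return every alarm that is a manifestation of the given fault."""
    return [a for a in alarms if a.fault is fault]


# One fault, many alarms (limited root-cause capabilities of the system).
fiber_cut = Fault("fiber cut")
alarms = [
    Alarm("if-1", "link-down", fiber_cut),
    Alarm("if-2", "link-down", fiber_cut),
    # An alarm without a fault: poor VoIP quality from a non-optimal QoS config.
    Alarm("voip-probe", "low-mos", None),
]
```

Here alarms_for(fiber_cut, alarms) returns the two link-down alarms, while
the MOS alarm stands on its own with no fault attached.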
>
> So:
> X.733 fault: The physical or algorithmic cause of a malfunction
> 3GPP fault: a deviation of a system from normal operation, which may
> result in the loss of operational capabilities of the element or the loss
> of redundancy in case of a redundant configuration
>
> I suggest we add the following to terminology:
> Fault: the underlying cause of an undesired behaviour
>
> If we then turn to the term “alarm”, I have added two aspects to the
> definition of an alarm:
>
> An alarm signifies an *undesirable state* in a resource that *requires
> corrective action*.
>
> This is mostly based on the alarm standardization work in the process
> industry (see draft references).
>
> 1) Rather than “deviation from normal”, we say “undesirable”, a subtle
>   difference.
>   In IT environments it is easier to define what is normal, for example a
>   normal load on a web server, and any deviation from that normal load
>   could be an alarm.
>   In networking, things are more dynamic, and deviation from normal might
>   be the desired state.
>   So the definition stresses the fact that it is an undesired state, not
>   just a deviation from normal.
>
> 2) Adding the requirement that an alarm per definition should require an
>   action. This is a sound requirement that constrains what qualifies as
>   an alarm and limits the number of alarms.
>   (See for example the EEMUA and ISA182 references in the draft). The
>   3GPP Alarm standard also added this to its definition in later
>   revisions to address the alarm overload problem.
>
>
>
>
>
> > Also, RFC 3877 defined in Section 3 a
> > Framework and an Architecture that was consistent with X.733. This
> > document has no such section, and while acknowledging the need for a
> > mapping to X.733 it states as a goal:
> > Mapping to X.733, which is a requirement for some alarm systems. Still,
> > keep some of the X.733 concepts out of the core model in order to make
> > the model small and easy to understand
> >
> > More details about what is left out and why these are not needed would
> > help.
> The alarm YANG model does not *require* the X.733 parameter
> definitions, for example the probable-cause enum values. Today, most
> networking devices and management systems do not rely on those
> enumerations.
>
> Those are defined in the X.733 augmentation module in order to keep the
> core model as small and useful as possible. X.733 requirements come more
> often from telecom environments.
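
The split between a small core and an optional X.733 augmentation can be
sketched roughly as follows. This is Python rather than YANG, and every name
in it (Alarm, X733Params, event_type, probable_cause) is a hypothetical
stand-in, not the draft's actual schema. It only shows the idea that the
core record stays usable when the X.733 parameters are absent.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class X733Params:
    """Hypothetical optional extension carrying X.733-style parameters."""
    event_type: str
    probable_cause: int  # assumed to be an enumeration value


@dataclass
class Alarm:
    """Hypothetical core alarm record; does not require X.733 data."""
    resource: str
    alarm_type: str
    x733: Optional[X733Params] = None  # only set where X.733 mapping is needed


# A typical networking device: core fields only.
plain = Alarm("if-1", "link-down")

# A telecom environment that needs the X.733 mapping adds the extension.
telecom = Alarm("if-1", "link-down", X733Params("communications-alarm", 1))
```

Consumers that do not care about X.733 can ignore the x733 field entirely.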
>
>
> >
> > Minor issues:
> >
> > 1. Section 2 makes a statement that includes
> > ... While IETF has not really addressed alarm management
> >
> > This is actually not accurate. RFC 3877 addressed Alarm Management.
> > Maybe there is a need to revise that approach, but this should be done
> > explicitly, not by stating that it did not exist.
> Correct, bad wording.
> OLD TEXT:
> Address alarm usability requirements, see Appendix G.  While IETF
>       has not really addressed alarm management, telecom standards has
>       addressed it purely from a protocol perspective.  The process
>       industry has published several relevant standards addressing
>       requirements for a useful alarm interface; [EEMUA], [ISA182].
>       This alarm module defines usability requirements as well as a YANG
>       data model.
> SUGGESTION:
> Address alarm usability requirements, see Appendix G.  While IETF
>       and telecom standards have addressed alarms mostly from a
>       protocol perspective, the process industry has published
>       several relevant standards addressing requirements for a useful
>       alarm interface; [EEMUA], [ISA182].
>       This alarm module defines usability requirements as well as a YANG
>       data model.
>
> >
> > 2. Section 3.5:
> > Closing an alarm implies that the operator considers the corrective
> > action performed.
> >
> > Is this always true? The undesirable state may have been cancelled by
> > some other event than corrective action, for example the resource is no
> > longer used, or the time elapsed may have made the undesirable state
> > irrelevant.
>
> I think it is important to keep the two perspectives in mind. An operator
> closing an alarm is only a flag from the operations team that the alarm
> does not need an action.
> It might be cleared or not cleared by the system.
>
> So in your first example, the alarm is probably cleared by the
> instrumentation, correlating “the other event”.
>
> If the resource is no longer used, a shelf should be created.
>
> If time has passed, depends, ….
>
> >
> > 3. In section 3.5.1:
> > Alarms are not cleared by operators, only the underlying instrumentation
> > can clear an alarm.  Operators can close alarms.
> >
> > So, the document makes a distinction between clearing an alarm and
> > closing an alarm. It may be good to define the two concepts to make the
> > distinction clear.
>
> Good point!
>
> Suggested terminology additions:
> * Cleared alarm: a cleared alarm is an alarm where the system/server
> considers the undesired state to be cleared. Operators cannot clear
> alarms; clearance is managed by the system. A linkUp notification can be
> considered a clear condition for a linkDown state.
>
> * Closed alarm: operators can close alarms irrespective of the alarm being
> cleared or not. A closed alarm indicates that the alarm does not need
> attention, either because the corrective action has been taken or because
> it can be ignored for other reasons.
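
The two terms above can be read as two independent flags on an alarm, which
the following minimal Python sketch tries to capture. The names (AlarmState,
clear, close, SYSTEM, OPERATOR) are hypothetical and only illustrate the
proposed terminology; they do not reproduce the draft's data model.

```python
from dataclasses import dataclass

SYSTEM = "system"
OPERATOR = "operator"


@dataclass
class AlarmState:
    """Hypothetical alarm state with two independent flags."""
    is_cleared: bool = False  # set by the system/instrumentation only
    is_closed: bool = False   # set by the operations team


def clear(alarm: AlarmState, actor: str) -> None:
    """Clearance is managed by the system; operators cannot clear alarms."""
    if actor != SYSTEM:
        raise PermissionError("only the system can clear an alarm")
    alarm.is_cleared = True


def close(alarm: AlarmState) -> None:
    """An operator closes the alarm, whether or not it has been cleared."""
    alarm.is_closed = True


alarm = AlarmState()
close(alarm)          # closed while the undesired state is still present
clear(alarm, SYSTEM)  # e.g. a linkUp later clears a linkDown alarm
```

Calling clear(alarm, OPERATOR) raises PermissionError in this sketch, which
mirrors the statement that only the system manages clearance.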
>
>
> >
> > 4. Appendix F.1:
> > The alarm MIB is state oriented rather than notification oriented, an
> > alarm is a "lasting condition", not a discrete notification reporting
> > about a condition state change.
> Good catch, will rephrase. The alarm MIB and the alarm YANG have a
> stateful view of alarms, not a notification-focused one.
>
> Suggested change:
> OLD
> RFC 3877 defines alarm referring back to "a deviation from normal
> operation". This is problematic, since this might not require an operator
> action. The alarm MIB is state oriented rather than notification
> oriented, an alarm is a "lasting condition", not a discrete notification
> reporting about a condition state change.
> NEW:
> RFC 3877 defines alarm referring back to "a deviation from normal
> operation". The Alarm YANG model adds the requirement that it should
> require a corrective action and should be undesired, not only a
> deviation from normal. The alarm MIB is state oriented in the same way as
> the Alarm YANG, it focuses on the "lasting condition", not the individual
> notifications.
>
>
> >
> > I am not sure that I understand this comment. Alarm states are defined
> > also in this document, and Alarms as defined here are also different
> > than 'a discrete notification reporting about a condition state
> > change'. So, what does this comment really try to say?
> >
> > Nits/editorial comments:
> >
> >
>
>
