Re: [CCAMP] [Gen-art] Genart last call review of draft-ietf-ccamp-alarm-module-07

Dan, thanks for your review. Stefan, thank you for making the corresponding changes. I entered a No Objection ballot.

Alissa

> On Mar 20, 2019, at 3:51 AM, stefan vallin <stefan@wallan.se> wrote:
> 
> Thanks Dan!
> 
>> On 19 Mar 2019, at 19:54, Dan Romascanu <dromasca@gmail.com> wrote:
>> 
>> Hi Stefan, 
>> 
>> Thank you for your answer and for addressing my concerns. I am comfortable with your proposals. If your AD agrees, I would include these in a revised version before the document is submitted to the IESG for approval.
>> 
>> Regards,
>> 
>> Dan
>> 
>> 
>> On Tue, Mar 19, 2019 at 5:11 PM stefan vallin <stefan@wallan.se> wrote:
>> Hi Dan!
>> Thanks for your review, an honour to have RFC 3877 in the loop :)
>> See inline
>> br Stefan
>> 
>> 
>> > 
>> > 
>> > Major issues:
>> > 
>> > 1. The definition of Alarm is key for the whole model. It reads like this:
>> > Alarm (the general concept): An alarm signifies an undesirable state in a
>> > resource that requires corrective action.
>> > 
>> > However, RFC 3877 already defined a number of concepts including:
>> >  Error
>> >      A deviation of a system from normal operation.
>> > 
>> >   Fault
>> >      Lasting error or warning condition.
>> > 
>> >   ....
>> > 
>> >   Alarm
>> >      Persistent indication of a fault.
>> > 
>> > I believe that there is a need to show why the model defined by RFC 3877 needs
>> > to be changed, and why the difference that RFC 3877 was making between a Fault
>> > and an Alarm is no longer needed.
>> 
>> Good comment, you are right, and we need to keep the distinction between fault and alarm.
>> That distinction is used in X.733, 3GPP IRP and others. The general pattern is that “fault”
>> refers to what is really broken, and the alarm is the manifestation of that underlying cause.
>> There is not a simple 1-1 relationship between a fault and an alarm:
>> * 1 fault may have many alarms due to limited root cause capabilities of the system
>> * There might be no underlying fault to an alarm, consider a non-optimal QoS configuration 
>>   which gives bad quality in VOIP calls. Certainly a MOS alarm from the VOIP probe, but there
>>   is no “fault” as such (if you do not consider a non-optimal config as a fault)
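[Editorial sketch, not part of the original message or the draft.] The two bullets above could be modelled roughly as follows, assuming hypothetical `Fault` and `Alarm` record types. The fiber-cut fault and all resource names are invented for illustration; only the MOS alarm without a fault comes from the text.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Fault:
    """Hypothetical record for the underlying cause of an undesired behaviour."""
    description: str


@dataclass(frozen=True)
class Alarm:
    """Hypothetical alarm record; field names are illustrative only."""
    resource: str
    alarm_type: str
    fault: Optional[Fault] = None  # None: no underlying fault is known or exists


def alarms_caused_by(fault: Fault, alarms: List[Alarm]) -> List[Alarm]:
    """Return every alarm that manifests the given fault."""
    return [a for a in alarms if a.fault == fault]


# One fault, many alarms (invented example of limited root-cause capability).
fiber_cut = Fault("fiber cut between node A and node B")
alarms = [
    Alarm("nodeA/if1", "link-down", fiber_cut),
    Alarm("nodeB/if7", "link-down", fiber_cut),
    # An alarm with no fault: a MOS alarm caused by a non-optimal QoS config.
    Alarm("voip-probe-1", "low-mos"),
]
```

Making `fault` optional on the alarm, rather than listing alarms on the fault, is one way to represent both cases without forcing a 1-1 mapping.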
>> 
>> So, the existing definitions are:
>> X.733 fault: The physical or algorithmic cause of a malfunction
>> 3GPP fault: a deviation of a system from normal operation, which may result in the loss of operational capabilities of the element or the loss of redundancy in case of a redundant configuration
>> 
>> I suggest we add the following to terminology:
>> Fault: the underlying cause of an undesired behaviour
>> 
>> If we then turn to the term “alarm”. I have added two aspects to the definition of an alarm:
>> 
>> An alarm signifies an *undesirable state* in a resource that *requires corrective action*.
>> 
>> Mostly based on the alarm standardization work in the process industry (see draft references).
>> 
>> 1) Rather than “deviation from normal”, we say “undesirable”, subtle difference.
>>   In IT environments it is easier to define what is normal, for example a normal load on a web server.
>>   And any deviation from that normal load could be an alarm.
>>   In networking, things are more dynamic, and deviation from normal might be the desired state.
>>   So the definition stresses the fact that it is an undesired state, not just deviation from normal.
>> 
>> 2) Adding the requirement that an alarm per definition should require an action. This is a sound
>>   requirement that constrains what qualifies as an alarm and limits the number of alarms.
>>   (See for example the EEMUA and ISA182 references in the draft.) The 3GPP Alarm standard
>>   also added this to its definition in later revisions to address the alarm overload problem.
>> 
>> 
>> 
>> 
>> 
>> > Also, RFC 3877 defined in Section 3 a
>> > Framework and an Architecture that was consistent with X.733. This document has
>> > no such section, and while acknowledging the need for a mapping to X.733 it
>> > states as a goal:
>> > Mapping to X.733, which is a requirement for some alarm systems. Still, keep
>> > some of the X.733 concepts out of the core model in order to make the model
>> > small and easy to understand
>> > 
>> > More details about what is left out and why these are not needed would help.
>> The alarm YANG model does not *require* the X.733 parameter
>> definitions of, for example, probable-cause enum values. Today, most networking devices
>> and management systems do not rely on those enumerations.
>> 
>> Those are defined in the X.733 augmentation module in order to keep the core model as
>> small and useful as possible. X.733 requirements come more often from telecom environments.
>> 
>> 
>> > 
>> > Minor issues:
>> > 
>> > 1. Section 2 makes a statement that includes
>> > ... While IETF has not really addressed alarm management
>> > 
>> > This is actually not accurate. RFC 3877 addressed Alarm Management. Maybe
>> > there is a need to revise that approach, but this should be done explicitly,
>> > not by stating that it did not exist.
>> Correct, bad wording.
>> OLD TEXT:
>> Address alarm usability requirements, see Appendix G.  While IETF
>>       has not really addressed alarm management, telecom standards has
>>       addressed it purely from a protocol perspective.  The process
>>       industry has published several relevant standards addressing
>>       requirements for a useful alarm interface; [EEMUA], [ISA182].
>>       This alarm module defines usability requirements as well as a YANG
>>       data model.
>> SUGGESTION:
>> Address alarm usability requirements, see Appendix G.  While IETF
>>       and telecom standards have addressed alarms mostly from a 
>>       protocol perspective, the process industry has published 
>>       several relevant standards addressing requirements for a useful 
>>       alarm interface; [EEMUA], [ISA182].
>>       This alarm module defines usability requirements as well as a YANG
>>       data model.
>> 
>> > 
>> > 2. Section 3.5:
>> > Closing an alarm implies that the operator considers the corrective action
>> > performed.
>> > 
>> > Is this always true? The undesirable state may have been cancelled by some
>> > other event than corrective action, for example the resource is no longer used,
>> > or the time elapsed may have made the undesirable state irrelevant.
>> 
>> I think it is important to keep the two perspectives in mind. An operator closing an
>> alarm is only a flag from the operations team that the alarm does not need an action.
>> It might be cleared or not cleared by the system.
>> 
>> So in your first example, the alarm is probably cleared by the instrumentation, 
>> correlating “the other event”.
>> 
>> If the resource is no longer used, a shelf should be created.
>> 
>> If time has passed, it depends, ….
>> 
>> > 
>> > 3. In section 3.5.1:
>> > Alarms are not cleared by operators, only the underlying instrumentation can
>> > clear an alarm.  Operators can close alarms.
>> > 
>> > So, the document makes a distinction between clearing an alarm and closing an
>> > alarm. It may be good to define the two concepts to make the distinction clear.
>> 
>> Good point!
>> 
>> Suggested terminology additions:
>> * Cleared alarm: a cleared alarm is an alarm where the system/server considers the
>> undesired state to be cleared. Operators cannot clear alarms; clearance is managed
>> by the system. A linkUp notification can be considered a clear condition for a linkDown state.
>> 
>> * Closed alarm: operators can close alarms irrespective of the alarm being cleared or not.
>> A closed alarm indicates that the alarm does not need attention, either since the corrective
>> action has been taken or that it can be ignored for other reasons.
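[Editorial sketch, not part of the original message or the draft.] The distinction between the two suggested terms can be pictured as two independent flags with different owners. The class, field, and method names below are hypothetical and do not come from the YANG module.

```python
from dataclasses import dataclass


@dataclass
class Alarm:
    """Hypothetical alarm state; names are illustrative, not from the draft."""
    resource: str
    alarm_type: str
    is_cleared: bool = False  # set only by the system/instrumentation
    is_closed: bool = False   # set only by an operator

    def system_clear(self) -> None:
        # The instrumentation reports that the undesired state is gone,
        # e.g. a linkUp notification following a linkDown state.
        self.is_cleared = True

    def operator_close(self) -> None:
        # An operator flags that no further attention is needed.
        # This is allowed whether or not the alarm has been cleared.
        self.is_closed = True


alarm = Alarm(resource="eth0", alarm_type="link-down")
alarm.operator_close()  # closed by the operator, still not cleared
```

Keeping the flags separate means an alarm can be in any of the four combinations, which matches the statement that closing is independent of clearing.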
>> 
>> 
>> > 
>> > 4. Appendix F.1:
>> > The alarm MIB is state oriented rather than notification oriented, an alarm
>> > is a "lasting condition", not a discrete notification reporting about a
>> > condition state change.
>> Good catch, will rephrase: the alarm MIB and the alarm YANG both have a stateful view
>> of alarms, not a notification-focused one.
>> 
>> Suggested change:
>> OLD
>> RFC 3877 defines alarm referring back to "a deviation from normal operation". This is
>> problematic, since this might not require an  operator action. The alarm MIB is state 
>> oriented rather than notification oriented,  an alarm is a "lasting  condition", not a 
>> discrete notification reporting about a condition state change.
>> NEW:
>> RFC 3877 defines alarm referring back to "a deviation from normal operation". The Alarm YANG
>> model adds the requirement that it should require a corrective action and should be undesired,
>> not only a deviation from normal. The alarm MIB is state oriented in the same way as the Alarm YANG:
>> it focuses on the "lasting condition", not the individual notifications.
>> 
>> 
>> > 
>> > I am not sure that I understand this comment. Alarm states are defined also in
>> > this document, and Alarms as defined here are also different from 'a discrete
>> > notification reporting about a condition state change'. So, what does this
>> > comment really try to say?
>> > 
>> > Nits/editorial comments:
>> > 
>> > 
>> 
> 