Re: [CCAMP] Review of draft-ietf-ccamp-alarm-module-01

stefan vallin <stefan@wallan.se> Mon, 14 May 2018 06:56 UTC

Content-Type: text/plain; charset="utf-8"
Mime-Version: 1.0 (Mac OS X Mail 9.3 \(3124\))
From: stefan vallin <stefan@wallan.se>
In-Reply-To: <B8F9A780D330094D99AF023C5877DABA9AE169D1@nkgeml513-mbs.china.huawei.com>
Date: Mon, 14 May 2018 08:56:15 +0200
Cc: "ccamp@ietf.org" <ccamp@ietf.org>, Zhangcuimin <zhangcuimin@huawei.com>, "Zhangmingyu (Jason)" <jason.zhangmingyu@huawei.com>
Content-Transfer-Encoding: quoted-printable
Message-Id: <64E6D80F-6282-4D14-A81F-5A5D95C19FE4@wallan.se>
References: <B8F9A780D330094D99AF023C5877DABA9AE169D1@nkgeml513-mbs.china.huawei.com>
To: Qin Wu <bill.wu@huawei.com>
Archived-At: <https://mailarchive.ietf.org/arch/msg/ccamp/VY8JattoxadpwhBZxul1ENtOkH0>
Subject: Re: [CCAMP] Review of draft-ietf-ccamp-alarm-module-01
Precedence: list

Hi Qin!
Thanks for your review!
See my comments inline:

Stefan Vallin
stefan@wallan.se
+46705233262

> On 10 May 2018, at 13:02, Qin Wu <bill.wu@huawei.com> wrote:
> 
> Hi, Stefan:
> Thanks for the updated draft (v-01). I found some time to re-read this draft. Here are a few comments and suggestions:
> 1.       Section 3.2
> Section 3.2 provides two kinds of concrete alarm types: the first uses a YANG identity derived from the base YANG identity to describe a concrete alarm type; the second uses a derived YANG identity combined with a string-type identifier to describe a concrete alarm type. However, I don’t think these two kinds of concrete alarm type can be supported at the same time in this model; if you choose one, you should give up the other. Also, in the examples of concrete-alarm-type:
> “
>  
>      // Alternative 1: concrete alarm type identity
>      import ietf-alarms {
>        prefix al;
>      }
>      identity environmental-alarm {
>        base al:alarm-type;
>        description "Abstract alarm type";
>      }
>      identity smoke {
>        base environmental-alarm;
>        description "Concrete alarm type";
>      }
>  
>      // Alternative 2: concrete alarm type qualifier
>      import ietf-alarms {
>        prefix al;
>      }
>      identity environmental-alarm {
>        base al:alarm-type;
>        description "Abstract alarm type";
>      }
>      identity external-detector {
>        base environmental-alarm;
>        description
>          "Abstract alarm type, a run-time configuration
>           procedure sets the type of alarm detected. This will
>           be reported in the alarm-type-qualifier.";
>      }
>  
> ”
> I don’t see the clear difference between concrete alarm type identity in alternative 1 and concrete alarm type qualifier in alternative 2 from YANG language perspective.
> In the alarm list:
>       +--ro alarm* [resource alarm-type-id alarm-type-qualifier]
>          +--ro resource                 resource
>          +--ro alarm-type-id            alarm-type-id
>          +--ro alarm-type-qualifier     alarm-type-qualifier

A system should always strive to use design-time alarm types. Using the identities makes alarm integration much simpler.
However, there might be occasions where the alarm types are not known at design time, such as a digital input. Binding the input to the detector type is then a run-time configuration task.

A system can support both kinds without any problem. Only escape to alarm-type-qualifier if the alarm types are not known at design time.

> We can see alarm-type-id and alarm-type-qualifier are both mandatory and used as unique key index.

Yes, but qualifier is the empty string “” for all design-time alarms.
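
A minimal sketch of how both kinds can share one alarm list key, assuming the three-part key shown in the tree above. The resource strings and the `door-open` qualifier are hypothetical values chosen for illustration; only `smoke` and `external-detector` come from the quoted examples.

```python
from typing import NamedTuple


class AlarmKey(NamedTuple):
    resource: str
    alarm_type_id: str
    # Empty string for alarm types known at design time.
    alarm_type_qualifier: str = ""


# Design-time alarm type: the identity alone is concrete.
smoke = AlarmKey("/rooms/room[name='lab-1']", "smoke")

# Run-time alarm type: abstract identity plus a configured qualifier
# (hypothetical binding of a digital input to a door sensor).
door = AlarmKey("/inputs/input[id='3']", "external-detector", "door-open")

# Both kinds coexist in the same list; values here are severities.
alarm_list = {smoke: "major", door: "minor"}
```

The qualifier is always present in the key, so a client never has to branch on which alternative a given alarm type uses.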

> So it looks we go with the second choice to use both alarm-type identity and alarm-type-qualifier to describe concrete alarm type. Therefore I am not sure there is value to keep alarm-inventory
>      +--ro alarm-inventory
>         +--ro alarm-type* [alarm-type-id alarm-type-qualifier]
>            +--ro alarm-type-id           alarm-type-id
>            +--ro alarm-type-qualifier    alarm-type-qualifier
>            +--ro resource*               resource-match
>            +--ro has-clear               boolean
>            +--ro severity-levels*        severity
>            +--ro description             string
>  
> Since any new alarm type can be added at any time, not only at design time, but also in the running time.
In general, it is not a matter of picking one of the two for a system. A very limited subset of the alarm types for a system might need the run-time configuration mechanism.
Note well that both the identifier and the qualifier are always part of the key for all alarms.

I do not understand your comment “So it looks we go with the second choice to use both alarm-type identity and alarm-type-qualifier to describe concrete alarm type”.
These are always part of the key, but qualifier is the empty string for alarm types known at design time.

The alarm inventory is extremely important for exactly this purpose. We will make it mandatory in the next revision.
1) A system might not “support” all statically defined alarm types. Assume an enterprise-specific YANG module defines 10 alarm identities; a specific system might support only a subset of those.
2) At run time, a system might be configured with a number of dynamic alarm types (alarm-type-qualifier). It is extremely important to publish those.
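
The two points above can be sketched as follows. This is an illustration only: `build_inventory` and its argument names are hypothetical, and the inventory is reduced to its (alarm-type-id, alarm-type-qualifier) key.

```python
def build_inventory(supported_identities, runtime_bindings):
    """Return the (alarm-type-id, alarm-type-qualifier) pairs to publish.

    supported_identities: design-time identities this system can raise,
        possibly a subset of what the YANG modules define.
    runtime_bindings: abstract identity -> qualifiers configured at run time.
    """
    # Design-time alarm types always carry the empty qualifier.
    inventory = {(identity, "") for identity in supported_identities}
    # Run-time alarm types are published per configured qualifier.
    for identity, qualifiers in runtime_bindings.items():
        inventory.update((identity, q) for q in qualifiers)
    return inventory


inventory = build_inventory(
    {"smoke", "link-down"},
    {"external-detector": ["door-open", "water-leak"]},
)
```

A client reading this inventory learns both which static identities are actually in use and which qualifiers currently exist.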


>  
> 2.       Section 4.1
> I doubt we need alarm control parameters, since we have NETCONF Subtree Filter Components, why NETCONF subtree filter is not sufficient, why we should define additional alarm control mechanism to filter
I do not understand this comment. The control parameters define whether the system shall emit notifications for all state changes or only for raise and clear.
This cannot be captured by a filter.
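
A rough sketch of the difference between the two behaviours. The mode names `all-state-changes` and `raise-and-clear` and the function `emitted` are illustrative labels, not taken from the draft. One way to read the point about filters: whether a change counts as a "raise" depends on the alarm's previous state, which a filter over individual notifications does not see.

```python
def emitted(severities, mode):
    """Return the severity changes that would be sent as notifications.

    severities: successive perceived severities for one alarm,
        with "cleared" marking a clear.
    mode: "all-state-changes" or "raise-and-clear" (hypothetical names).
    """
    out = []
    active = False  # whether the alarm is currently raised
    for sev in severities:
        is_clear = sev == "cleared"
        if mode == "all-state-changes":
            out.append(sev)
        elif is_clear and active:
            out.append(sev)  # clear of an active alarm
        elif not is_clear and not active:
            out.append(sev)  # raise of an inactive alarm
        active = not is_clear
    return out


changes = ["minor", "major", "critical", "cleared", "minor"]
all_changes = emitted(changes, "all-state-changes")
raise_clear = emitted(changes, "raise-and-clear")
```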
> Unnecessary alarm, to decide when to move filtered or alarmed in and when to move them out.
> Even if we need these schemes, I think we can consider defining alarm control in a separate draft called alarm policy control :-). Also, it is not clear how we distinguish an alarm that has been blocked or filtered from an alarm that has been suppressed by the system automatically when the system detects a duplicate alarm.
> If this is true, I think shelved alarm list should also been taken out.
I do not understand this comment. A shelved alarm is the result of a configuration by the client to shelve/block alarms matching specific criteria based on alarm type, resource, etc.

>  
> 3.       Is alarm list include the alarm that has been deleted since you introduce purge-alarm RPC support?

I do not fully understand this comment.
If a client purges an alarm, it is permanently removed from the alarm list, so by definition it is not part of the alarm list until the instrumentation detects a new state change.
> Apparently not, but since the alarm can be removed or deleted, why not count them in the alarm-summary?
It is deleted, so it is not counted.

> Also it is not clear the difference between remove ,delete? I am wondering if we really need this functionality since
> If the alarm can be managed, why we introduce too many human involvement, even it is administrator?
I will make sure the next revision does not use remove, delete and purge to mean the same thing. The rpc is purge, which means that the alarm list entry is “deleted” until the instrumentation detects a new state change. I realize that the three different terms are confusing, will fix!
But it is really important to be able to say “purge all alarms for this resource and this alarm-type”, so this will stay.
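
A minimal sketch of the purge semantics described here, under simplifying assumptions: the class `AlarmList` and its methods are hypothetical, and the key is reduced to (resource, alarm type) with the qualifier left out for brevity.

```python
class AlarmList:
    """Toy alarm list keyed by (resource, alarm type)."""

    def __init__(self):
        self.entries = {}

    def state_change(self, resource, alarm_type, severity):
        # The instrumentation reports a state change: create or update the entry.
        self.entries[(resource, alarm_type)] = severity

    def purge(self, resource, alarm_type):
        # Remove the entry; return True if something was purged.
        return self.entries.pop((resource, alarm_type), None) is not None


alarms = AlarmList()
alarms.state_change("eth0", "link-down", "cleared")
purged = alarms.purge("eth0", "link-down")       # entry is gone from the list
empty_after_purge = len(alarms.entries) == 0
alarms.state_change("eth0", "link-down", "major")  # new state change: entry returns
```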


>  
> 4.       How root cause resource is related to the resource associated with the current alarm in the alarm list? One to one relationship, many to one relationship?
One to many, which follows from it being a leaf-list:
     leaf-list root-cause-resource

It is an optional leaf-list to indicate any root-cause candidates as hints.

> 5.       How impacted resource is related to the resource associated with the current alarm in the alarm list?
This is totally dependent on the alarm-type and the specific system. It is an optional field for a system to say:
“I have an alarm on this resource, it might impact the following resources”
For example, the alarm resource might be an interface and the impacted resources might be a VPN.

> 6.       Is there any relationship between impacted resource and alarms listed in the related alarm?
Nothing prescribed by this module, depends on the specific system/alarm-type and on the specific correlation made.
> 7.       How each alarm in the alarm list is related to alarms in the related-alarm? Which alarm will be impacted by which alarm, which alarm trigger another alarm? Is there nesting relationship between each other?
Nothing prescribed by this module, depends on the specific system/alarm-type and on the specific correlation made.
> How does alarm and related alarm, impact resource, root-cause-resource help to diagnostic the network and identify the root cause of the problem? Can you provide an example to explain this?
Going back to the definition of an alarm: it requires a corrective action.
A “good” alarm has the thing that can be acted upon or fixed as the resource leaf.
Any side effects, that is, resources that have degraded functionality, are referred to in the impacted-resources in the same alarm.

In this way the operator is notified where to act.

Example 1:
Take a disc full alarm as an example that impacts a number of applications and their performance.
The alarm shall refer to the disc as the resource and the applications as impacted-resources.

Always strive for alarms like the one above: as few alarms as possible, pointing directly to the root cause and indicating which resources are impacted.

Example 2:
In some situations the instrumentation does not know the root cause, that is, where to act.
Take an active test tool for example: it can detect that a certain service does not comply with the SLA, but the underlying root cause is not 100% clear.
An alarm in that case might look like:
- use the tested service as the resource
- use topology knowledge to put all the supporting resources, like the devices in the path in the root-cause-resources. In this way the operator has a hint where to look/act.

Note well that in the two examples above we have not used the related alarms, since we only have one alarm. Instead the impacted/root-cause resources leafs are used.
This is the preferred way. 
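
Examples 1 and 2 can be written down as data. This is only a sketch: the `Alarm` class is a hypothetical stand-in for an alarm list entry, and every resource and alarm-type name below is invented for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Alarm:
    resource: str
    alarm_type_id: str
    impacted_resources: list = field(default_factory=list)
    root_cause_resources: list = field(default_factory=list)
    related_alarms: list = field(default_factory=list)


# Example 1: the root cause is known, so it is the alarm resource.
disc_full = Alarm(
    resource="disc-1",
    alarm_type_id="disc-full",
    impacted_resources=["app-a", "app-b", "app-c"],
)

# Example 2: the root cause is unknown, so candidates are given as hints.
sla_violation = Alarm(
    resource="service-x",
    alarm_type_id="sla-violation",
    root_cause_resources=["router-1", "router-2", "switch-7"],
)
```

In both cases there is a single alarm and `related_alarms` stays empty.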

Example 3:
A mid-level manager might receive alarms from underlying systems and just try to group them together. This is a scenario where the related alarms leaf can be used.
Going back to example 1, the system might represent 1 disc alarm and 3 alarms for three different applications, and link them together with the related alarms leaf.
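
A small sketch of example 3, with alarms identified by a (resource, alarm type) pair. All names are hypothetical, and linking in both directions is an assumption made here; the direction of the links is up to the system doing the correlation.

```python
disc_alarm = ("disc-1", "disc-full")
app_alarms = [(app, "degraded-performance") for app in ("app-a", "app-b", "app-c")]

# related[alarm] lists the alarms that this alarm refers to.
related = {disc_alarm: list(app_alarms)}
for app_alarm in app_alarms:
    related[app_alarm] = [disc_alarm]
```

This gives four alarms instead of one, which is why the single-alarm form of example 1 is preferable when the system has the knowledge to produce it.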

So the leafs
- root-cause-resource
- impacted-resource
- related-alarm

are there to support various kinds of correlation capabilities, depending on your system. Pick the one(s) that match your capabilities and best guide the operator in how to act upon the alarm.
Minimize the number of alarms, and try to embed as much useful information as possible in each alarm.



> 8.       I am not sure we should introduce operator life cycle management or administrative life cycle management, it looks to me we need to rely on many manual provision or action, why not only focus on system generated alarm, system cleared alarmed, why we still look for traditional alarm management and require operator to engage and ack alarm, close alarm.
Note well that both operator actions and administrative actions are YANG features, so a system chooses whether to implement them or not.
Whether they make sense depends on your specific system. If they do not, do not support these features.

> 9.       Even though we still require operator life cycle management, I don’t understand we need administrative alarm life cycle management? Since clear alarm can be re-raised by the system, therefore when the system clear alarm, or operator close alarm, these closed alarm or cleared alarm by system can be reraised again, so why we need administrative alarm life cycle management? I am struggling about this.
The capability of purging alarms is relevant in order to remove irrelevant history from the alarm list. It helps focus. If you had a serious outage a week ago, and all of it is dealt with and documented in the trouble ticketing system, it makes sense to purge the alarms from the devices to make the alarm list smaller and more relevant to operations.

> 10.   What I like to hope is to only consider system generated alarm, the alarm in the alarm list can be classified into new raised alarm, cleared alarm, re-raised alarm,also alarm severity can be increased or decreased, how do we keep track of change of alarm severity changes?
This is available in the status-changes for each alarm:
         +--ro status-change* [time] {alarm-history}?
         |  +--ro time                  yang:date-and-time
         |  +--ro perceived-severity    severity-with-clear
         |  +--ro alarm-text            alarm-text



> I know in the alarm list, we have alarm text to indicate which alarm is raised alarm,which alarm is cleared alarm, that will indicate whether alarm is active alarm or present alarm or the alarm that has been cleared by the system
This is not the purpose of the alarm text. Whether an alarm is cleared or not is indicated by the is-cleared flag:
         +--ro is-cleared               boolean
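
A minimal sketch of how the status-change list and the is-cleared flag could fit together. The `Alarm` class is hypothetical, and deriving is-cleared from the most recent severity is an assumption made for illustration, not something stated in the module.

```python
from datetime import datetime, timezone


class Alarm:
    """Toy alarm entry that records every severity change."""

    def __init__(self):
        self.status_change = []  # (time, perceived-severity, alarm-text)
        self.is_cleared = False

    def update(self, severity, text, when=None):
        when = when or datetime.now(timezone.utc)
        self.status_change.append((when, severity, text))
        # Assumption: the flag follows the latest reported severity.
        self.is_cleared = severity == "cleared"


fan = Alarm()
fan.update("minor", "fan slow")
fan.update("major", "fan stopped")
fan.update("cleared", "fan ok")
history = [severity for _, severity, _ in fan.status_change]
```

The alarm text is free-form and plays no part in deciding whether the alarm is cleared.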



> , but we may also interested in all the history alarm, e.g., in operator lifecycle management, the closed alarms are also history alarm, maybe shelved alarm are also history alarm? But I am not sure it is a good idea to introduce manual action related alarms in this model?

This is available (note that shelving actions are captured in the operator-state-change)
         +--ro status-change* [time] {alarm-history}?
         |  +--ro time                  yang:date-and-time
         |  +--ro perceived-severity    severity-with-clear
         |  +--ro alarm-text            alarm-text
         +--ro operator-state-change* [time] {operator-actions}?
         |  +--ro time        yang:date-and-time
         |  +--ro operator    string
         |  +--ro state       operator-state
         |  +--ro text?       string

I do not understand what you mean by “manual action”. Are you referring to the operator actions? Again, it is a YANG feature.
> 11.   As for common type resource,
> “
>      typedef resource {
>        type union {
>          type instance-identifier {
>            require-instance false;
>          }
>          type yang:object-identifier;
>          type string;
>        }
> ”
> I am wondering resource common type is extensible to support UUID which is also a common type defined in RFC6991?
> 12.   Regarding resource-match
>      typedef resource-match {
>        type union {
>       type yang:xpath1.0;
>          type yang:object-identifier;
>          type string;
>        }
> I am wondering whether resource type or objected type need to be considered to add more fine granularity. Look forward to your answer to these questions, thanks very much.
Good point, will consider adding it in the next revision.
However, there is a danger here in that developers might resort to throwing UUIDs at operators. As an operator in a NOC it is hard to know what to do with a UUID.
In many cases UUIDs are a sign of using the alarms as a log/debug mechanism for developers.

Thanks for all your comments, I will use them to improve the descriptions in the next revision.


>  
> -Qin
