Re: [CCAMP] Second review of draft-ietf-ccamp-alarm-module-01

Qin Wu <bill.wu@huawei.com> Thu, 09 August 2018 02:54 UTC

From: Qin Wu <bill.wu@huawei.com>
To: stefan vallin <stefan@wallan.se>
CC: "ccamp@ietf.org" <ccamp@ietf.org>
Date: Thu, 9 Aug 2018 02:54:24 +0000
Archived-At: <https://mailarchive.ietf.org/arch/msg/ccamp/zdJ4EwECzHcAOGHEuAsO374_-2Y>

Thanks for your update in v-(02):
https://www.ietf.org/rfcdiff?url2=draft-ietf-ccamp-alarm-module-02
Why not have a generic model applicable to both the controller and the device? I see this model as an alarm monitoring framework. Also, the draft says in its introduction:
“
   The purpose is to define a standardised alarm interface for network
   devices that can be easily integrated into management applications.
   The model is also applicable as a northbound alarm interface in the
   management applications.

”
In addition, I believe you haven’t addressed my follow-up comments posted at:
https://www.ietf.org/mail-archive/web/ccamp/current/msg18904.html
These comments are not specific to controller support, and I would appreciate your response to them.
The 4 issues are highlighted below:

1.       Alarm-type-id supports union of identity and string

I know that defining alarm-type-id as an identity makes it more extensible, but it wastes more space than using an enum.

I am wondering why not define alarm-type-id as a uint32, or as a string with an embedded format such as groupid-alarmid (e.g., "2310-36700394"); this would make it easier to manage millions of alarm types.

Defining alarm-type-id as an identity seems to waste a lot of space, and it makes millions of alarm types hard to deal with at design time, since enumerating them requires a human to enter every alarm type in the YANG file.
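
As a rough illustration of the groupid-alarmid string format proposed above, the value could be split into its two numeric parts as follows. This is only a sketch in Python; the names AlarmTypeRef and parse_alarm_type are hypothetical and are not part of the draft, and the draft itself models alarm-type-id as a YANG identity.

```python
import re
from typing import NamedTuple


class AlarmTypeRef(NamedTuple):
    """Hypothetical holder for the two parts of a 'groupid-alarmid' string."""
    group_id: int
    alarm_id: int


# Two runs of ASCII digits separated by a single hyphen.
_PATTERN = re.compile(r"(\d+)-(\d+)", flags=re.ASCII)


def parse_alarm_type(text: str) -> AlarmTypeRef:
    """Split a 'groupid-alarmid' string; raise ValueError if it does not match."""
    match = _PATTERN.fullmatch(text)
    if match is None:
        raise ValueError(f"not a groupid-alarmid string: {text!r}")
    return AlarmTypeRef(int(match.group(1)), int(match.group(2)))


ref = parse_alarm_type("2310-36700394")
```

Note that such a string carries no design-time declaration of which alarm types exist, which is the property the identity-based approach is meant to provide.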



2.       Alarm-name or alarm-serial-no field support for alarm and alarm inventory

Suppose we have alarm-name or alarm-serial-no: I believe it is easier to identify each alarm instance based on one field rather than on the 3-tuple (resource, alarm-type-id, alarm-type-qualifier).

Most importantly, this will simplify operation and management.



3.       Alarm notification category support

Do we rely on the ‘is-cleared’ and ‘status-change’ fields to tell whether the same notification is reporting a newly raised alarm, a cleared alarm, or a change of the text?

How we know that a notification is for a newly raised alarm is not clear to me, since we don’t have a raised field.



4.       Consistency between alarm list construct and alarm notification construct
Why can the alarm notification not be used to report the time when the alarm entry was created, rather than just the time when the alarm status changed?
Why can the alarm notification not be used to report whether the alarm is cleared or not?
To address this, the proposal is to make the alarm list construct and the alarm notification construct consistent. Does that make sense?

Regards!
-Qin
From: stefan vallin [mailto:stefan@wallan.se]
Sent: 9 August 2018 1:36
To: Qin Wu
Cc: ccamp@ietf.org
Subject: Re: Second review of draft-ietf-ccamp-alarm-module-01

Hi!
Sorry for slow response!
Thanks again for your comments.
The larger the scope, the more complexity.
I think it is important to prove the model in the scope of a NE/device first. Then extend with requirements for the controller/mid-level manager in a later revision or a separate augmenting module.
I am also convinced that the current model works as a base for the controller based on implementation experience. We had some more leafs in the controller than in the device.

So in summary, I would like to progress this to an RFC targeting the NE scope in a first step before adding more features targeting the controller.
Br Stefan




On 23 Jul 2018, at 11:39, Qin Wu <bill.wu@huawei.com> wrote:

Are you saying that the controller model should be different from the device model, or that the model on the southbound interface of the controller should be different from the model used on the northbound interface of the network device?
Or that the model used on the northbound interface of the controller should be different from the one used on the northbound interface of the network device?
Why not have one generic model which can be applied to both southbound and northbound interfaces?

-Qin
From: stefan vallin [mailto:stefan@wallan.se]
Sent: 23 July 2018 2:37
To: Qin Wu; ccamp@ietf.org
Subject: Re: Second review of draft-ietf-ccamp-alarm-module-01

Hi again!
An addition to #8:
You could augment with a device leaf in your mgmt app.

The module scope is primarily within one device.

Br stefan
Best regards, stefan
+46(0)705233262

On 22 July 2018 at 20:17, stefan vallin <stefan@wallan.se> wrote:
Hi Qin!
Thanks for your review and comments, see inline below:



On 21 Jul 2018, at 14:16, Qin Wu <bill.wu@huawei.com> wrote:

Hi, Stefan:
Before the next version of the alarm model comes out, I would like to offer the following suggestions and comments:
1.       UUID support for the type of resource under alarm list
Last time you said:
“
Good point, will consider adding it in the next revision.
However, there is a danger here in that developers might take the easy way out and throw UUIDs at operators. As an operator in a NOC, it is hard to know what to do with a UUID.
In many cases UUIDs are a sign of using the alarms as a log/debug tool for developers.

typedef resource {
        type union {
          type instance-identifier {
            require-instance false;
          }
          type yang:object-identifier;
          type string;
        }
}
“
However, in our implementation we did allow operators in a NOC to use UUIDs to correlate resource objects in the alarm-inventory.
We have added UUID to the upcoming version:
  typedef resource {
    type union {
      type instance-identifier {
        require-instance false;
      }
      type yang:object-identifier;
      type yang:uuid;
      type string;
    }
  }

Resource-match is also updated to handle UUIDs.
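
To make the union above a little more concrete, here is a minimal Python sketch of how a client might guess which member type a resource value matches, trying the members in the order they are listed. The function name resource_kind is hypothetical, the instance-identifier and object-identifier checks are deliberately simplified heuristics, and the UUID pattern follows the hyphenated form used by yang:uuid.

```python
import re

# Simplified dotted-decimal check; the real yang:object-identifier pattern
# is stricter about the first two arcs.
_OID_RE = re.compile(r"\d+(\.\d+)+", flags=re.ASCII)

# Hyphenated 8-4-4-4-12 hexadecimal form, as used by yang:uuid.
_UUID_RE = re.compile(
    r"[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-"
    r"[0-9a-fA-F]{4}-[0-9a-fA-F]{12}"
)


def resource_kind(value: str) -> str:
    """Return the name of the first union member that `value` looks like."""
    if value.startswith("/"):          # heuristic: absolute path expression
        return "instance-identifier"
    if _OID_RE.fullmatch(value):
        return "object-identifier"
    if _UUID_RE.fullmatch(value):
        return "uuid"
    return "string"                    # the union's catch-all member
```

Because the string member accepts anything, it has to stay last in the union for the more specific members to be meaningful.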







2.       Dependency between root-cause-resource, impacted-resource, related-alarm
Under alarm list, there are three dependent parameters: root-cause-resource, impacted-resource, related-alarm
It is still not clear to me how root-cause-resource and impacted-resource are used together with the resource parameter under related-alarm. Why are root-cause-resource and impacted-resource not part of related-alarm?
If they are not, then for the root-cause-resource leaf-list I am wondering why not add an is-root-cause parameter to indicate that a specific alarm in the alarm list is a root-cause alarm. Only when is-root-cause is set to true would root-cause-resource be provided. Does this make sense?
In our practice, we usually design one root-cause alarm and several derived alarms; each derived alarm uses a leafref to point to the root-cause alarm. I am wondering whether we assume that each alarm in the alarm list is a root-cause alarm and that the related-alarm entries are derived alarms. If the answer is no, I think we should add one new parameter under the related-alarm list to reference the root-cause alarm.
We have updated the text in the RFC document on this topic:
3.6.  Root Cause, Impacted Resources and Related Alarms

   The general principle of this alarm module is to limit the amount of
   alarms.  The alarm has two leaf-lists to identify possible impacted
   resources and possible root-cause resources.  The system should not
   represent individual alarms for the possible root-cause resources and
   impacted resources.  These serve as hints only.  It is up to the
   client application to use this information to present the overall
   status.

   A system should always strive to identify the resource that can be
   acted upon as the "resource" leaf.  The "impacted-resource" leaf-list
   shall be used to identify any side-effects of the alarm.  The
   impacted resources can not be acted upon to fix the problem.  An
   example of this kind of alarm might be a disc full problem which
   impacts a number of databases.

   In some occasions the system might not be capable of detecting the
   root cause, the resource that can be acted upon.  The instrumentation
   in this case only monitors the side-effect and needs to represent an
   alarm that indicates a situation that needs acting upon.  The
   instrumentation still might identify possible candidates for the
   root-cause resource.  In this case the "root-cause-resource" leaf-
   list can be used to indicate the candidate root-cause resources.  An
   example of this kind of alarm might be an active test tool that
   detects an SLA violation on a VPN connection and identifies the
   devices along the chain as candidate root causes.

   The alarm module also supports a way to associate different alarms to
   each other with the "related-alarm" list.  This list enables the
   server to inform the client that certain alarms are related to other
   alarms.

   Note well that this module does not prescribe any dependencies or
   preference between the above alarm correlation mechanisms.  Different
   systems have different capabilities and the above described
   mechanisms are available to support the instrumentation features.




3.       Consolidate tuple corresponding to a single alarm instance into pair
This YANG alarm module uses the tuple (resource, alarm type identifier, alarm type qualifier) to identify a single alarm instance. I am wondering whether the tuple can be reduced to (resource, alarm-type identifier), with the alarm-type identifier allowed to be a union of identity and string. The reason is that deriving identities from the base alarm-type identity is not sufficient when alarm types can be classified at fine granularity into hundreds of types.

No, that will not work; read the text in the RFC document. The alarm type identifier is static (design time); the qualifier is run-time and a refinement of the alarm-type identifier.
See updated text in the upcoming version of the RFC:
3.2.  Alarm Type

   This document defines an alarm type with an alarm type id and an
   alarm type qualifier.

   The alarm type id is modeled as a YANG identity.  With YANG
   identities, new alarm types can be defined in a distributed fashion.
   YANG identities are hierarchical, which means that a hierarchy of
   alarm types can be defined.

   Standards and vendors should define their own alarm type identities
   based on this definition.
   The use of YANG identities means that all possible alarms are
   identified at design time.  This explicit declaration of alarm types
   makes it easier to allow for alarm qualification reviews and
   preparation of alarm actions and documentation.

   There are occasions where the alarm types are not known at design
   time.  For example, a system with digital inputs that allows users to
   connect detectors (e.g., smoke detector) to the inputs.  In this
   case it is a configuration action that says that certain connectors
   are fire alarms for example.  A potential drawback of this is that
   there is a big risk that alarm operators will receive alarm types as
   a surprise, they do not know how to resolve the problem since a
   defined alarm procedure does not necessarily exist.  To avoid this
   risk the system MUST publish all possible alarm types in the alarm
   inventory, see Section 4.2.

   In order to allow for dynamic addition of alarm types the alarm
   module also allows for further qualification of the identity based
   alarm type using a string.

   A vendor or standard can then define their own alarm-type hierarchy.
   The example below shows a hierarchy based on X.733 event types:

     import ietf-alarms {
       prefix al;
     }
     identity vendor-alarms {
       base al:alarm-type;
     }
     identity communications-alarm {
       base vendor-alarms;
     }
     identity link-alarm {
       base communications-alarm;
     }

   Alarm types can be abstract.  An abstract alarm type is used as a
   base for defining hierarchical alarm types.  Concrete alarm types are
   used for alarm states and appear in the alarm inventory.  There are
   two kinds of concrete alarm types:

   1.  The last subordinate identity in the "alarm-type-id" hierarchy is
       concrete, for example: "alarm-identity.environmental-
       alarm.smoke".  In this example "alarm-identity" and
       "environmental-alarm" are abstract YANG identities, whereas
       "smoke" is a concrete YANG identity.





Vallin & Bjorklund      Expires January 11, 2019                [Page 6]
Internet-Draft              YANG Alarm Module                  July 2018


   2.  The YANG identity hierarchy is abstract and the concrete alarm
       type is defined by the dynamic alarm qualifier string, for
       example: "alarm-identity.environmental-alarm.external-detector"
       with alarm-type-qualifier "smoke".

   For example:

     // Alternative 1: concrete alarm type identity
     import ietf-alarms {
       prefix al;
     }
     identity environmental-alarm {
       base al:alarm-type;
       description "Abstract alarm type";
     }
     identity smoke {
       base environmental-alarm;
       description "Concrete alarm type";
     }

     // Alternative 2: concrete alarm type qualifier
     import ietf-alarms {
       prefix al;
     }
     identity environmental-alarm {
       base al:alarm-type;
       description "Abstract alarm type";
     }
     identity external-detector {
       base environmental-alarm;
       description
         "Abstract alarm type, a run-time configuration
          procedure sets the type of alarm detected. This will
          be reported in the alarm-type-qualifier.";
     }

   A server SHOULD strive to minimize the number of dynamically defined
   alarm types.
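
The two alternatives in the quoted section can be sketched in a few lines of Python. This is an illustration only: the BASE table mirrors the identities from the YANG examples above, the inventory is modelled as a set of (alarm-type-id, alarm-type-qualifier) pairs, and using an empty string for an unused qualifier is an assumption of this sketch. The function names are hypothetical.

```python
# Base identity of each identity, mirroring the YANG examples above.
BASE = {
    "environmental-alarm": "alarm-type",
    "smoke": "environmental-alarm",
    "external-detector": "environmental-alarm",
}


def derived_from(identity: str, ancestor: str) -> bool:
    """True if `identity` is derived, directly or indirectly, from `ancestor`."""
    current = BASE.get(identity)
    while current is not None:
        if current == ancestor:
            return True
        current = BASE.get(current)
    return False


# Concrete alarm types published in the (hypothetical) alarm inventory.
INVENTORY = {
    ("smoke", ""),                   # alternative 1: concrete identity
    ("external-detector", "smoke"),  # alternative 2: abstract identity + qualifier
}


def is_known(alarm_type_id: str, qualifier: str = "") -> bool:
    """True if the alarm type appears in the inventory, so it is no surprise."""
    return (alarm_type_id, qualifier) in INVENTORY
```

With this reading, "external-detector" on its own is not a concrete alarm type; it only becomes one together with a qualifier that the inventory lists.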





4.       Semantic difference between description under alarm-inventory and alarm-text under alarm list
See description definition and alarm-text definition as follows:
“
description: A description of the possible alarm.  It SHOULD include information on possible underlying root causes and corrective actions.
alarm-text: The string used to inform operators about the alarm. This MUST contain enough information for an operator to be able to understand the problem and how to resolve it.  If this string contains structure, this format should be clearly documented for programs to be able to parse that information.
   “
   I am not sure there is any semantic difference between description and alarm-text; why not replace one with the other? Or we could further break down description/alarm-text into root-cause and corrective-actions. I believe they are the key information we want to convey through description/alarm-text.
The alarm text is dynamic/run-time; it conveys relevant information for the specific alarm state change.
The description in the inventory is static and cannot convey dynamic state change information.




5.       Alarm arrival time support
Under operator-state-change, we have a time parameter representing the timestamp of the operator action on the alarm. I am wondering whether we need to add alarm-arrive-time to represent the time when the alarm arrives at the management system.
It is useful information for alarm management.
The alarm has a leaf representing the real time the state change appeared:
    +--ro alarm* [resource alarm-type-id alarm-type-qualifier]
          ...
       +--ro last-changed               yang:date-and-time
       +--ro status-change* [time]
          +--ro time                    yang:date-and-time
This should represent the time it really happened. Not the time the notification arrived at the management system. If you need that, that is something you can add in your mgmt system.
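
A minimal sketch of adding the arrival time on the management-system side, as suggested above, could look like this in Python. The ReceivedAlarm class and the stamp and delivery_delay helpers are hypothetical names for illustration; only the event time comes from the device.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass(frozen=True)
class ReceivedAlarm:
    resource: str
    alarm_type_id: str
    alarm_type_qualifier: str
    event_time: datetime    # "time" as reported by the device
    arrival_time: datetime  # stamped locally by the management system


def stamp(resource: str, alarm_type_id: str, qualifier: str,
          event_time: datetime, now: Optional[datetime] = None) -> ReceivedAlarm:
    """Attach the local arrival time to a state change reported by a device."""
    arrival = now if now is not None else datetime.now(timezone.utc)
    return ReceivedAlarm(resource, alarm_type_id, qualifier, event_time, arrival)


def delivery_delay(alarm: ReceivedAlarm) -> timedelta:
    """Time between the state change on the device and its arrival."""
    return alarm.arrival_time - alarm.event_time
```

Keeping the two timestamps separate means the device model does not need to know anything about transport or collection delays.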


6.       Alarm-name field support for alarm and alarm inventory
In the current model, each alarm in the alarm list is uniquely identified by a three-leaf key (resource, alarm type identifier, alarm type qualifier). Would it be more desirable to define a single leaf key, e.g., add alarm-name or alarm-no to uniquely identify each alarm? That would simplify alarm management from the management system perspective. Does that make sense?
A string no…
This is a fundamental design principle in the alarm module. The key, the tuple, carries semantic information; there is no doubt how to match notifications to the alarm state.
3GPP Alarm IRP, for example, introduced a confusing single alarmId key, which created paradoxes:
if you have different alarmIds for the same alarm type and resource, what does it mean?




7.       Reason-id support for alarm list and alarm inventory
In the current model, is the root-cause resource the reason each alarm is generated? If not, I propose adding a reason-id for each alarm in the alarm list and alarm inventory.
See answer to #2

8.       Alarm generating device or location support for alarm list and alarm inventory
In the current model, it seems the resource type can potentially indicate the device or location where the alarm is generated, but not explicitly. I am wondering why not add two parameters, alarm-generating-device and alarm-generating-location, to explicitly indicate the device or location where the alarm is generated. That would simplify alarm management. Does that make sense?

I guess you are considering a management application and not the device?
The resource is a leafier which could/should include the device in your model in your management application.



9.       Alarm notification category support
In the current model, alarm notification is defined as follows:
“
This notification is used to report a state change for an alarm. The same notification is used for reporting a newly raised alarm, a cleared alarm or changing the text and/or
severity of an existing alarm.

”
However, it is not clear how to distinguish an alarm notification for a newly raised alarm from one for a cleared alarm. Would it be more sensible to add alarm notification category support, something as follows:
“
leaf category {
         type enumeration {
           enum fault {
             description
               "Alarm raised.";
           }
           enum recovery {
             description
               "Alarm cleared.";
           }
           enum Change {
             description
               "Alarm changed.";
           }
         }
}
”
Not needed, this is obvious when you map the notification towards the key tuple.
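
One way to read "map the notification towards the key tuple" is sketched below in Python. It assumes that a clear is signalled by a severity value of "cleared" in the notification, and it tracks only currently active alarms, whereas a real alarm list may also retain cleared entries. The names AlarmKey and classify are hypothetical.

```python
from typing import Dict, NamedTuple


class AlarmKey(NamedTuple):
    resource: str
    alarm_type_id: str
    alarm_type_qualifier: str


def classify(active: Dict[AlarmKey, str], key: AlarmKey, severity: str) -> str:
    """Classify a notification and update `active` (key -> current severity).

    Returns "raised", "changed" or "cleared".
    """
    if severity == "cleared":          # assumed clear indication
        active.pop(key, None)
        return "cleared"
    kind = "changed" if key in active else "raised"
    active[key] = severity
    return kind


active: Dict[AlarmKey, str] = {}
key = AlarmKey("/ietf-interfaces:interfaces/interface[name='eth0']", "link-alarm", "")
first = classify(active, key, "major")
```

In this reading no separate category leaf is needed: the client's own state for the key tuple tells it whether the notification is a raise, a change or a clear.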



10.   Consistency between alarm list construct and alarm notification construct
We see that the difference between the alarm list construct and the alarm notification construct is the operator action defined under the alarm notification construct and the operator state change under the alarm list construct.
As specified in RFC7950,
“
An action MUST NOT be defined within an rpc, another action, or a
   notification
”
I am not sure an action can be allowed within the alarm-notification construct; in that case, I would propose removing the operator action from the alarm notification construct.
In addition, the operator parameter under operator-state-change can be removed or consolidated into set-operator-state action.
I do not understand
The action is not defined in the notification.




11.   Additionalinfo support for alarm list
I think we should allow vendor specific extension to be added as part of alarm list, the vendor specific extension can be defined in TLV format.
The alarm module does not restrict any vendor additions; it is better to use augmentation.




12.   Alarm-no support for set-operator-state
If we believe set-operator-state is a useful action under the alarm list, I am wondering whether we can add alarm-no or alarm-name to identify each alarm under set-operator-state. This would help a lot for alarm ack operations based on each alarm number.
See above



13.   Is-acked for alarm list
Since we have is-cleared parameter under alarm list to indicate the current clearance state of the alarm, why not add is-acked parameter under alarm list to indicate the current acked state of the alarm, make sense?
You can get that from the operator-state-change list.
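
As a sketch of how a client could derive an acked flag from the operator-state-change list, the Python below looks at the most recent entry. The entry layout (a dict with "time" and "state") and the state name "ack" are assumptions of this sketch, and treating only the latest entry as decisive is one possible interpretation.

```python
from datetime import datetime
from typing import Dict, Iterable, Union

Entry = Dict[str, Union[datetime, str]]


def is_acked(operator_state_changes: Iterable[Entry]) -> bool:
    """True if the most recent operator state change is an acknowledgement."""
    latest = max(operator_state_changes, key=lambda e: e["time"], default=None)
    return latest is not None and latest["state"] == "ack"


changes = [
    {"time": datetime(2018, 7, 21, 9, 0), "state": "none"},
    {"time": datetime(2018, 7, 21, 10, 0), "state": "ack"},
]
acked = is_acked(changes)
```

An alarm with no operator state changes at all is reported as not acked.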




Br Stefan