Re: [CCAMP] Genart last call review of draft-ietf-ccamp-alarm-module-07

stefan vallin <> Wed, 20 March 2019 07:51 UTC

Cc: gen-art <>,,, ietf <>
To: Dan Romascanu <>

Thanks Dan!

> On 19 Mar 2019, at 19:54, Dan Romascanu <> wrote:
> Hi Stefan, 
> Thank you for your answer and for addressing my concerns. I am comfortable with your proposals. If your AD agrees, I would include these in a revised version before submitting it for IESG approval. 
> Regards,
> Dan
> On Tue, Mar 19, 2019 at 5:11 PM stefan vallin < <>> wrote:
> Hi Dan!
> Thanks for your review, an honour to have RFC 3877 in the loop :)
> See inline
> br Stefan
> > 
> > 
> > Major issues:
> > 
> > 1. The definition of Alarm is key for the whole model. It reads like this:
> > Alarm (the general concept): An alarm signifies an undesirable state in a
> > resource that requires corrective action.
> > 
> > However, RFC 3877 already defined a number of concepts including:
> >  Error
> >      A deviation of a system from normal operation.
> > 
> >   Fault
> >      Lasting error or warning condition.
> > 
> >   ....
> > 
> >   Alarm
> >      Persistent indication of a fault.
> > 
> > I believe that there is a need to show why the model defined by RFC 3877 needs
> > to be changed, and why the difference that RFC 3877 was making between a Fault
> > and an Alarm is no longer needed.
> Good comment, you are right, and we need to keep the distinction between fault and alarm.
> That distinction is used in X.733, the 3GPP IRP and other standards. The general pattern is that “fault”
> refers to what is actually broken, and “alarm” to the manifestation of that underlying cause. 
> There is not a simple 1-1 relationship between a fault and an alarm:
> * One fault may result in many alarms, due to the limited root-cause analysis capabilities of the system.
> * There might be no underlying fault behind an alarm. Consider a non-optimal QoS configuration
>   that gives bad quality in VoIP calls. There will certainly be a MOS alarm from the VoIP probe, but there
>   is no “fault” as such (unless you consider a non-optimal configuration to be a fault).
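A minimal Python sketch of the many-to-many point above, assuming a hypothetical `Alarm` record with an optional `fault_id` field (this field is not part of the alarm YANG module). It only shows that one fault can surface as several alarms and that an alarm can exist with no underlying fault at all; the resource names and alarm types are invented for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Alarm:
    resource: str            # resource in the undesirable state
    alarm_type: str          # hypothetical alarm type label
    fault_id: Optional[str]  # hypothetical link to an underlying fault; None if there is none


def alarms_per_fault(alarms: list[Alarm]) -> dict[Optional[str], list[Alarm]]:
    """Group alarms by underlying fault; the None key collects alarms without a fault."""
    groups: dict[Optional[str], list[Alarm]] = {}
    for alarm in alarms:
        groups.setdefault(alarm.fault_id, []).append(alarm)
    return groups


alarms = [
    # One fault (a hypothetical fiber cut) shows up as alarms on several resources.
    Alarm("port-1/1", "loss-of-signal", "fiber-cut-17"),
    Alarm("port-1/2", "loss-of-signal", "fiber-cut-17"),
    Alarm("lsp-42", "path-down", "fiber-cut-17"),
    # A non-optimal QoS configuration: a MOS alarm with no "fault" as such.
    Alarm("voip-probe-3", "low-mos", None),
]

groups = alarms_per_fault(alarms)
print(len(groups["fiber-cut-17"]))  # 3
print(len(groups[None]))            # 1
```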
> For reference, the existing definitions are:
> * X.733 fault: the physical or algorithmic cause of a malfunction.
> * 3GPP fault: a deviation of a system from normal operation, which may result in the loss of operational capabilities of the element or the loss of redundancy in case of a redundant configuration.
> I suggest we add the following to terminology:
> Fault: the underlying cause of an undesired behaviour
> Turning to the term “alarm”, I have added two aspects to the definition of an alarm:
> An alarm signifies an *undesirable state* in a resource that *requires corrective action*.
> Both are mostly based on the alarm standardization work in the process industry (see the references in the draft).
> 1) Rather than “deviation from normal”, we say “undesirable”; a subtle difference.
>   In IT environments it is easier to define what is normal, for example a normal load on a web server,
>   and any deviation from that normal load could be an alarm.
>   In networking, things are more dynamic, and a deviation from normal might be the desired state.
>   So the definition stresses that it is an undesired state, not just a deviation from normal.
> 2) We add the requirement that an alarm, by definition, requires an action. This is a sound
>   requirement: it constrains what qualifies as an alarm and limits the number of alarms
>   (see, for example, the EEMUA and ISA182 references in the draft). The 3GPP alarm standard
>   also added this to its definition in later revisions to address the alarm overload problem.
> > Also, RFC 3877 defined in Section 3 a
> > Framework and an Architecture that was consistent with X.733. This document has
> > no such section, and while acknowledging the need for a mapping to X.733 it
> > states as a goal:
> > Mapping to X.733, which is a requirement for some alarm systems. Still, keep
> > some of the X.733 concepts out of the core model in order to make the model
> > small and easy to understand
> > 
> > More details about what is left out and why these are not needed would help.
> The alarm YANG model does not *require* the X.733 parameter
> definitions, for example the probable-cause enumeration values. Today, most networking devices
> and management systems do not rely on those enumerations.
> They are defined in the X.733 augmentation module in order to keep the core model as
> small and useful as possible. X.733 requirements more often come from telecom environments.
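A rough Python analogy of the core-plus-augmentation split described above. The core `Alarm` record carries only generic fields, while X.733-style parameters live in a separate, optional `X733Info` structure. All class names, fields and values here are hypothetical; in YANG the split is achieved with a separate module that uses `augment`, which an optional field only loosely approximates.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class X733Info:
    """Optional X.733-style parameters, kept outside the core record."""
    event_type: str      # e.g. "communicationsAlarm"
    probable_cause: str  # illustrative value; X.733 defines the actual enumeration


@dataclass
class Alarm:
    """Core alarm record: only the fields every implementation needs (hypothetical)."""
    resource: str
    alarm_type: str
    severity: str
    text: str
    x733: Optional[X733Info] = None  # set only where an X.733 mapping is required


def describe(alarm: Alarm) -> str:
    line = f"{alarm.resource} {alarm.alarm_type} [{alarm.severity}]: {alarm.text}"
    if alarm.x733 is not None:
        line += (f" (event-type={alarm.x733.event_type}, "
                 f"probable-cause={alarm.x733.probable_cause})")
    return line


# A typical networking device: core fields only.
router_alarm = Alarm("eth0", "link-down", "major", "Link down")
# A telecom environment that needs the X.733 mapping.
telecom_alarm = Alarm("och-1", "loss-of-signal", "critical", "Loss of signal",
                      X733Info("communicationsAlarm", "lossOfSignal"))

print(describe(router_alarm))
print(describe(telecom_alarm))
```

Implementations without X.733 requirements never touch `X733Info`, which is the point of keeping those parameters out of the core.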
> > 
> > Minor issues:
> > 
> > 1. Section 2 makes a statement that includes
> > ... While IETF has not really addressed alarm management
> > 
> > This is actually not accurate. RFC 3877 addressed Alarm Management. Maybe
> > there is a need to revise that approach, but this should be done explicitly,
> > not by stating that it did not exist.
> Correct, bad wording. Suggested change:
> OLD:
> Address alarm usability requirements, see Appendix G.  While IETF
>       has not really addressed alarm management, telecom standards has
>       addressed it purely from a protocol perspective.  The process
>       industry has published several relevant standards addressing
>       requirements for a useful alarm interface; [EEMUA], [ISA182].
>       This alarm module defines usability requirements as well as a YANG
>       data model.
> NEW:
> Address alarm usability requirements, see Appendix G.  While IETF
>       and telecom standards have addressed alarms mostly from a 
>       protocol perspective, the process industry has published 
>       several relevant standards addressing requirements for a useful 
>       alarm interface; [EEMUA], [ISA182].
>       This alarm module defines usability requirements as well as a YANG
>       data model.
> > 
> > 2. Section 3.5:
> > Closing an alarm implies that the operator considers the corrective action
> > performed.
> > 
> > Is this always true? The undesirable state may have been cancelled by some
> > other event than corrective action, for example the resource is no longer used,
> > or the time elapsed may have made the undesirable state irrelevant.
> I think it is important to keep the two perspectives, the operator's and the system's, in mind.
> An operator closing an alarm is only a flag from the operations team that the alarm does not need an action.
> The alarm might or might not have been cleared by the system.
> So in your first example, the alarm is probably cleared by the instrumentation,
> which correlates “the other event”.
> If the resource is no longer used, a shelf should be created.
> If time has passed, it depends, …
> > 
> > 3. In section 3.5.1:
> > Alarms are not cleared by operators, only the underlying instrumentation can
> > clear an alarm.  Operators can close alarms.
> > 
> > So, the document makes a distinction between clearing an alarm and closing an
> > alarm. It may be good to define the two concepts to make the distinction clear.
> Good point!
> Suggested terminology additions:
> * Cleared alarm: an alarm for which the system/server considers the
> undesired state to be cleared. Operators cannot clear alarms; clearance is managed
> by the system. For example, a linkUp notification can be considered a clear condition for a linkDown state.
> * Closed alarm: operators can close alarms irrespective of whether the alarm is cleared or not.
> A closed alarm indicates that the alarm does not need attention, either because the corrective
> action has been taken or because it can be ignored for other reasons.
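A small Python sketch of the proposed cleared/closed distinction, using hypothetical class and method names: the system owns the cleared flag, the operator owns the closed flag, and the two vary independently. It illustrates the terminology only and is not the state model of the YANG module.

```python
from dataclasses import dataclass


class OperatorActionNotAllowed(Exception):
    """Raised when an operator attempts an action reserved for the system."""


@dataclass
class AlarmStatus:
    is_cleared: bool = False  # owned by the system/instrumentation
    is_closed: bool = False   # owned by the operator

    def system_clear(self) -> None:
        """Instrumentation detected a clear condition (e.g. linkUp after linkDown)."""
        self.is_cleared = True

    def operator_close(self) -> None:
        """Operator marks the alarm as needing no further attention, cleared or not."""
        self.is_closed = True

    def operator_clear(self) -> None:
        """Operators cannot clear alarms; only the system can."""
        raise OperatorActionNotAllowed("only the system can clear an alarm")


# Closed by the operator while the undesired state persists.
ignored = AlarmStatus()
ignored.operator_close()

# Cleared by the system but not yet closed by anyone.
recovered = AlarmStatus()
recovered.system_clear()

try:
    recovered.operator_clear()
except OperatorActionNotAllowed as err:
    print(err)  # only the system can clear an alarm
```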
> > 
> > 4. Appendix F.1:
> > The alarm MIB is state oriented rather than notification oriented, an alarm
> > is a "lasting condition", not a discrete notification reporting about a
> > condition state change.
> Good catch, I will rephrase. Both the alarm MIB and the alarm YANG model have a stateful view
> of alarms rather than a notification-focused one.
> Suggested change:
> OLD:
> RFC 3877 defines alarm referring back to "a deviation from normal operation". This is
> problematic, since this might not require an  operator action. The alarm MIB is state 
> oriented rather than notification oriented,  an alarm is a "lasting  condition", not a 
> discrete notification reporting about a condition state change.
> NEW:
> RFC 3877 defines alarm by referring back to "a deviation from normal operation". The Alarm YANG
> model adds the requirements that an alarm should require a corrective action and should signify an undesired state,
> not only a deviation from normal. The alarm MIB is state oriented in the same way as the Alarm YANG model;
> it focuses on the "lasting condition", not the individual notifications.
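To illustrate the state-oriented view shared by the alarm MIB and the alarm YANG model, here is a minimal Python sketch in which notifications update one entry per alarm instead of being stored as separate records. Keying entries by `(resource, alarm_type)` is an assumption made for the example, as are all class and method names.

```python
from dataclasses import dataclass, field


@dataclass
class AlarmEntry:
    is_cleared: bool
    history: list[str] = field(default_factory=list)  # state changes, oldest first


class AlarmList:
    """State-oriented alarm list: one entry per (resource, alarm type)."""

    def __init__(self) -> None:
        self.entries: dict[tuple[str, str], AlarmEntry] = {}

    def on_notification(self, resource: str, alarm_type: str, cleared: bool) -> None:
        # A notification updates the existing entry instead of creating a new one.
        entry = self.entries.setdefault((resource, alarm_type), AlarmEntry(is_cleared=cleared))
        entry.is_cleared = cleared
        entry.history.append("cleared" if cleared else "raised")


alarm_list = AlarmList()
alarm_list.on_notification("eth0", "link-alarm", cleared=False)  # e.g. linkDown
alarm_list.on_notification("eth0", "link-alarm", cleared=True)   # e.g. linkUp

entry = alarm_list.entries[("eth0", "link-alarm")]
print(len(alarm_list.entries), entry.is_cleared, entry.history)
# 1 True ['raised', 'cleared']
```

With this design, a linkDown followed by a linkUp leaves a single, cleared entry rather than two unrelated notifications.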
> > 
> > I am not sure that I understand this comment. Alarm states are defined also in
> > this document, and Alarms as defined here are also different than ' a discrete
> > notification reporting about a condition state change'. So, what does this
> > comment really try to say?
> > 
> > Nits/editorial comments:
> > 
> >