Re: [CCAMP] Benjamin Kaduk's Discuss on draft-ietf-ccamp-alarm-module-08: (with DISCUSS and COMMENT)

Martin Bjorklund <mbj@tail-f.com> Thu, 11 April 2019 15:23 UTC

Return-Path: <mbj@tail-f.com>
X-Original-To: ccamp@ietfa.amsl.com
Delivered-To: ccamp@ietfa.amsl.com
Received: from localhost (localhost [127.0.0.1]) by ietfa.amsl.com (Postfix) with ESMTP id 96D7812039E; Thu, 11 Apr 2019 08:23:06 -0700 (PDT)
X-Virus-Scanned: amavisd-new at amsl.com
X-Spam-Flag: NO
X-Spam-Score: -1.9
X-Spam-Level:
X-Spam-Status: No, score=-1.9 tagged_above=-999 required=5 tests=[BAYES_00=-1.9, SPF_PASS=-0.001, URIBL_BLOCKED=0.001] autolearn=ham autolearn_force=no
Received: from mail.ietf.org ([4.31.198.44]) by localhost (ietfa.amsl.com [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id Jr9prX0zkUIN; Thu, 11 Apr 2019 08:23:03 -0700 (PDT)
Received: from mail.tail-f.com (mail.tail-f.com [46.21.102.45]) by ietfa.amsl.com (Postfix) with ESMTP id 957F5120396; Thu, 11 Apr 2019 08:23:02 -0700 (PDT)
Received: from localhost (unknown [173.38.220.61]) by mail.tail-f.com (Postfix) with ESMTPSA id 5CC301AE0498; Thu, 11 Apr 2019 17:23:01 +0200 (CEST)
Date: Thu, 11 Apr 2019 17:23:03 +0200 (CEST)
Message-Id: <20190411.172303.950933990774253893.mbj@tail-f.com>
To: kaduk@mit.edu
Cc: stefan@wallan.se, iesg@ietf.org, draft-ietf-ccamp-alarm-module@ietf.org, daniele.ceccarelli@ericsson.com, ccamp-chairs@ietf.org, ccamp@ietf.org
From: Martin Bjorklund <mbj@tail-f.com>
In-Reply-To: <20190410023022.GC18549@kduck.mit.edu>
References: <155475628903.30058.8136578311950144931.idtracker@ietfa.amsl.com> <41BF0E2E-DDDD-45B7-B1F5-BBDB00AA7BC4@wallan.se> <20190410023022.GC18549@kduck.mit.edu>
X-Mailer: Mew version 6.7 on Emacs 25.2 / Mule 6.0 (HANACHIRUSATO)
Mime-Version: 1.0
Content-Type: Text/Plain; charset=utf-8
Content-Transfer-Encoding: base64
Archived-At: <https://mailarchive.ietf.org/arch/msg/ccamp/QbJax9axNcz0cM405sBcykUWkDM>
Subject: Re: [CCAMP] Benjamin Kaduk's Discuss on draft-ietf-ccamp-alarm-module-08: (with DISCUSS and COMMENT)
X-BeenThere: ccamp@ietf.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: Discussion list for the CCAMP working group <ccamp.ietf.org>
List-Unsubscribe: <https://www.ietf.org/mailman/options/ccamp>, <mailto:ccamp-request@ietf.org?subject=unsubscribe>
List-Archive: <https://mailarchive.ietf.org/arch/browse/ccamp/>
List-Post: <mailto:ccamp@ietf.org>
List-Help: <mailto:ccamp-request@ietf.org?subject=help>
List-Subscribe: <https://www.ietf.org/mailman/listinfo/ccamp>, <mailto:ccamp-request@ietf.org?subject=subscribe>
X-List-Received-Date: Thu, 11 Apr 2019 15:23:07 -0000

Hi,

We have now posted draft-ietf-ccamp-alarm-module-09.  Please verify
that this version addresses your DISUSS.


/martin & stefan



Benjamin Kaduk <kaduk@mit.edu>; wrote:
> On Tue, Apr 09, 2019 at 02:08:46PM +0200, stefan vallin wrote:
> > Hi Benjamin!
> > Thanks for your review!
> > See inline
> > 
> > br Stefan and Martin
> > 
> > > On 8 Apr 2019, at 22:44, Benjamin Kaduk via Datatracker <noreply@ietf.org>; wrote:
> > > 
> > > Benjamin Kaduk has entered the following ballot position for
> > > draft-ietf-ccamp-alarm-module-08: Discuss
> > > 
> > > When responding, please keep the subject line intact and reply to all
> > > email addresses included in the To and CC lines. (Feel free to cut this
> > > introductory paragraph, however.)
> > > 
> > > 
> > > Please refer to https://www.ietf.org/iesg/statement/discuss-criteria.html
> > > for more information about IESG DISCUSS and COMMENT positions.
> > > 
> > > 
> > > The document, along with other ballot positions, can be found here:
> > > https://datatracker.ietf.org/doc/draft-ietf-ccamp-alarm-module/
> > > 
> > > 
> > > 
> > > ----------------------------------------------------------------------
> > > DISCUSS:
> > > ----------------------------------------------------------------------
> > > 
> > > I think we may need to double-check that the example in Appendix C is
> > > fully compliant with the main spec.  In particular, are the
> > > <status-change> elments properly sorted?  (I'm less sure whether the
> > > <last-changed> time needs to match a <status-change> or whether the
> > > <operator-state-change> is good enough for that.)
> > 
> > Thanks for catching this! The example was as intended but the description
> > clauses where wrong.
> 
> Ah, I'm happy to be ableto help!
> These changes look good to bring the example and spec into line with the
> intended behavior.
> 
> > list status-change {
> >     description
> >     OLD:
> >             This list is ordered according to the timestamps of alarm
> >             state changes.  The last item corresponds to the latest
> >             state change.
> > 
> >     NEW:
> >          This list is ordered according to the timestamps of alarm
> >          status changes.  The first item corresponds to the latest
> >          status change.
> > 
> > Both <status-change> and <operator-state-change> will update
> > <last-changed>
> > 
> > leaf last-changed {
> >    description
> >    OLD:
> >         "A timestamp when the alarm status was last changed.  Status
> >         changes are changes to 'is-cleared', 'perceived-severity',
> >         and 'alarm-text'.";
> > 
> >    NEW:
> >         "A timestamp when the 'status-change' or 'operator-state-change' list
> >          was last changed.";
> > 
> > > 
> > > 
> > > ----------------------------------------------------------------------
> > > COMMENT:
> > > ----------------------------------------------------------------------
> > > 
> > > Section 1.1
> > > 
> > >   o  Fault: A fault is the underlying cause of an undesired behaviour.
> > >      There is no trivial one-to-one mapping between faults and alarms.
> > >      One fault may result in several alarms in case the system lacks
> > >      root-cause and correlation capabilities.  An alarm might not have
> > >      an underlying fault as a cause, imagine a bad MOS score alarm from
> > >      a VOIP probe and the cause being non-optimal QoS configuration.
> > > 
> > > nit: this is a comma splice
> > change to 
> > An alarm might not have an underlying fault as a cause. For example,
> > imagine a bad Mean Opppinion Score (MOS) alarm from a Voice Over IP
> > (VOIP) probe and the cause being non-optimal QoS configuration.
> > > 
> > >   o  Alarm Type: An alarm type identifies a possible unique alarm state
> > >      for a resource.  Alarm types are names to identify the state like
> > >      "link-alarm", "jitter-violation", "high-disk-utilization".
> > > 
> > > Are these only intended for human consumption?
> > No, these are YANG identities and intended for automation software as well
> 
> I think my follow-up question was going to be whether there was value in a
> registry for them, to avoid proliferation of nearly-identical types from
> different vendors.  (I honestly don't know.)
> 
> > > 
> > >   o  Cleared alarm: A cleared alarm is an alarm where the system/server
> > >      considers the undesired state to be cleared.  Operators can not
> > >      clear alarms, clearance is managed by the system.  A linkUp
> > >      notification can be considered a clear condition for a linkDown
> > >      state.
> > > 
> > > nit: I'd suggest "For example, " before "a linkUp notification […]”
> > Ok, fixed
> > 
> > > 
> > >   o  Corrective Action: An action taken by an operator or automation
> > >      routine in order to minimize the impact of the alarm or resolving
> > >      the root cause.
> > > 
> > > nit: "or resolve the root cause”
> > Ok, fixed
> > > 
> > > Section 2
> > > 
> > >   o  Clear definition of "alarm" in order to exclude general events
> > >      that should not be forwarded as alarm notifications.
> > > 
> > > I'm not sure I am parsing this correctly.  Is the objective to *provide*
> > > such a clear definition?  (Similarly for the next item.)  Part of my
> > > confusion probably stems from the dual use of the word "clear" as a verb
> > > and a noun in English.
> > Good comment, suggestion below:
> > OLD:
> >     Clear definition of "alarm" in order to exclude general events
> >     that should not be forwarded as alarm notifications.
> > NEW:
> >     to provide a precise definition of "alarm" in order to exclude general events
> >     that should not be forwarded as alarm notifications.
> 
> Thanks!
> 
> > > Section 3.2
> > > 
> > >   In order to allow for dynamic addition of alarm types the alarm
> > >   module allows for further qualification of the identity based alarm
> > >   type using a string.  A potential drawback of this is that there is a
> > >   big risk that alarm operators will receive alarm types as a surprise,
> > >   they do not know how to resolve the problem since a defined alarm
> > >   procedure does not necessarily exist.  [...]
> > > 
> > > nit: this is a comma splice.
> > Correct, will change the comma to a period.
> > 
> > > 
> > > Section 3.5.1
> > > 
> > >   From a resource perspective, an alarm can for example have the
> > >   following life-cycle: raise, change severity, change severity, clear,
> > > 
> > > Is the duplicate "change severity" intentional?
> > Yes. An alarm can toggle minor, major, minor for example
> 
> Okay.
> 
> > > For the alarm history functionality, a given 'time' leaf is clearly the
> > > time that a change occurred, but are the sibling leafs the state before
> > > that change or after that change?  (It looks like the actual YANG module
> > > descriptions (e.g., for 'status-change') indicate that this is the state
> > > after that change, but I don't know if it makes sense to also mention
> > > that here or not.)
> > As described in the YANG module it is the state after that change
> > > 
> > > Section 3.7
> > > 
> > >   (blocked/filtered).  Shelved alarms appear in a dedicated shelved
> > >   alarm list in order not to disturb the relevant alarms.  Shelved
> > > 
> > > nit: this sentence could probably be reworded for greater clarity (what
> > > does "disturb" mean, and perhaps what "relevant alarms" is may not be
> > > clear just from context).
> > Good comment!
> > OLD:
> > (blocked/filtered).  Shelved alarms appear in a dedicated shelved
> >  alarm list in order not to disturb the relevant alarms.  Shelved
> > NEW:
> >  (blocked/filtered).  Shelved alarms appear in a dedicated shelved
> >  alarm list in order to be filtered out from the alarm-list.  Shelved
> 
> That helps, thanks.  Perhaps "in order for the main alarm-list to only
> contain entries of interest" could be a further improvement?
> 
> > 
> > 
> > > 
> > > Section 4.1
> > > 
> > >   The "/alarms/control/notify-status-changes" leaf controls if
> > >   notifications are sent for all state changes, only raise and clear,
> > >   or only notifications more severe than a configured level.  This
> > >   feature in combination with alarm shelving corresponds to the ITU
> > >   Alarm Report Control functionality.
> > > 
> > > Is there a specific section reference for the ITU functionality?  (If
> > > not, it's probably fine to leave this as-is).
> > Good comment, changed to:
> > This feature in combination with alarm shelving corresponds to the ITU
> > Alarm Report Control functionality, see Appendix F.2.4.
> > 
> > > 
> > > Section 4.2
> > > 
> > >      The system might not instrument all defined alarm type identities,
> > >      and some alarm identities are abstract.
> > > 
> > > I just wanted to double-check: the intent is indeed to use "instrument"
> > > here (as opposed to, say, "implement”)?
> > 
> > It was intentional but changed to "implement"
> 
> (No need to change it just on my account; I can see that "instrument" has a
> valid meaning if that's the intent.)
> 
> > > Section 4.7
> > > 
> > >   /alarms/alarm-list/purge-alarms:  Delete alarms from the "alarm-list"
> > >      according to specific criteria, for example all cleared alarms
> > >      older than a specific date.
> > > 
> > >   /alarms/alarm-list/compress-alarms:  Compress the "status-change"
> > >      list for the alarms.
> > > 
> > > It's a bit confusing to me to have such different language for these two
> > > action nodes, since as far as I can tell the compress-alarms action also
> > > allows for specific criteria (e.g., resources) to be provided to limit
> > > the scope of the action.  (Similarly for compress-shelved-alarms.)
> > Ok, I do not think we will consider changing the action names at this point
> 
> Sorry, I did not mean the name of the YANG leafs, but rather the text about
> "according to specific criteria".
> 
> > > 
> > > Section 6
> > > 
> > >     feature service-impact-analysis {
> > >       description
> > >         "The system supports identifying candidate impacted
> > >          resources for an alarm, for example a link being impacted
> > >          by an interface alarm.";
> > > 
> > > nit: it feels a bit odd to say that the link is "impacted by an alarm",
> > > since normally the causality flow would be the other way -- the link
> > > state changes, and then an alarm stat changes.
> > OLD:
> >        "The system supports identifying candidate impacted
> >         resources for an alarm, for example a link being impacted
> >         by an interface alarm.";
> > NEW:
> >        "The system supports identifying candidate impacted
> >         resources for an alarm. For example, an interface state change
> >         resulting in a link alarm which can refer to a link as being impacted.";
> > > 
> > > I'm not sure I understand the distinction between 'warning' and 'minor'
> > > severities -- it seems that neither is supposed to be indicating a
> > > service-affecting fault.
> > 
> > Can agree the difference is subtle and even unclear. However these are
> > the original definitions from X.733 and they are being used in the industry.
> > * for minor there is a fault: (indicates the existence of a non-service affecting fault)
> > * for warning there might be one coming up: (detection of a potential or impending)
> 
> Okay; I am happy to concede that I'm not an expert here :)
> 
> > > 
> > > For the 'age-spec' choice, am I reading RFC 7950 correctly that I choose
> > > exactly one case of the choice, so I can't have both a "minute part"
> > > and an "hours part"?  If so, then I would suggest using different
> > > description text for the choices, since the current text implies that
> > > you can combine parts with different units to assemble a compound
> > > timespec.
> > 
> > Good comment:
> > OLD:
> >  Seconds part.
> > 
> > NEW:
> >  Age expressed in seconds.
> > 
> > And the same change for the other cases
> 
> Thanks
> 
> > 
> > > 
> > >         list alarm {
> > >              [...]
> > >              Entries appear in the alarm list the first time an alarm
> > >              becomes active for a given alarm-type and resource.
> > >              Entries do not get deleted when the alarm is cleared, this
> > >              is a boolean state in the alarm.
> > > 
> > > nit: this is a comma splice.
> > OLD:
> >             Entries appear in the alarm list the first time an alarm
> >             becomes active for a given alarm-type and resource.
> >             Entries do not get deleted when the alarm is cleared, this
> >             is a boolean state in the alarm.
> > NEW:    
> > 	     Entries appear in the alarm list the first time an alarm
> >             becomes active for a given alarm-type and resource.
> >             Entries do not get deleted when the alarm is cleared.
> >             Clear status is represented as a boolean flag.
> > 
> > 
> > 
> > > 
> > > Section 10
> > > 
> > > It's a bit of a stretch as far as an attack, but an attacker that can
> > > temporarily reduce max-alarm-status-changes to a small value (e.g., one)
> > > could use that to hide their tracks to some extent if they can cause the
> > > alarm to clear after they've done what they intended to do.  (They could
> > > then restore the max-alarm-status-changes value, too.)
> > Notifications will still be sent. I think this is a stretch and leave it out.
> 
> Good point about notifications, thanks.
> 
> > > It seems that the list of alarms itself may be potentially sensitive, in
> > > that it potentially gives an attacker an authoritative picture of the
> > > (broken) state of the network.
> > 
> > Ok, added the following:
> > 
> >         The list of alarms itself may be potentially sensitive from a
> >         security perspective, in that it potentially gives an attacker
> >         an authoritative picture of the (broken) state of the network.
> > 
> > 
> > 
> > > 
> > > Appendix F.1
> > > 
> > > nit: We say that X.733 alarms "also use the basic criteria of deviation from
> > > normal condition", but it's not entirely clear whether the "also"
> > > indicates that they share this behavior with some external reference
> > > point or is a synonym for "additionally”.
> > 
> > OLD:
> > 	It also use the basic criteria of deviation from
> > 	normal condition
> > NEW:
> > 	X.733 defines an alarm as a deviation from
> >        normal condition, but without the requirement
> >        that it requires corrective actions.
> > 
> > > 
> > > nit: I think the semicolon in the ISA "comment" should be a regular
> > > colon.
> > Fixed
> > > 
> > > The paragraph at the end talks about "evolution" and "moves from" -- is
> > > the table sorted chronologically?
> > 
> > Mostly, also sorted to show exactly that. I think the phrase is ok
> 
> The phrase is definitely fine -- I just didn't do the check for the
> publication dates of the various specs, but am happy to trust that they are
> sorted properly.
> 
> > > Appendix G
> > > 
> > > Are tables 4 and 5 talking about alarm rates or notification rates?
> > 
> > These are usability requirements and therefore the alarm rate rather than the notification rates.
> > This is also an effect of this module, not focusing on a list of notifications but alarm states as a result of notifications.
> 
> Okay, thanks for confirming.
> 
> -Ben
>