[CCAMP] Benjamin Kaduk's Discuss on draft-ietf-ccamp-alarm-module-08: (with DISCUSS and COMMENT)

Benjamin Kaduk via Datatracker <noreply@ietf.org> Mon, 08 April 2019 20:44 UTC

From: Benjamin Kaduk via Datatracker <noreply@ietf.org>
To: The IESG <iesg@ietf.org>
Cc: draft-ietf-ccamp-alarm-module@ietf.org, Daniele Ceccarelli <daniele.ceccarelli@ericsson.com>, ccamp-chairs@ietf.org, daniele.ceccarelli@ericsson.com, ccamp@ietf.org
Reply-To: Benjamin Kaduk <kaduk@mit.edu>
Message-ID: <155475628903.30058.8136578311950144931.idtracker@ietfa.amsl.com>
Date: Mon, 08 Apr 2019 13:44:49 -0700
Archived-At: <https://mailarchive.ietf.org/arch/msg/ccamp/ct90uZsxh_sEJ-_vUqnEP0ufoDg>
Subject: [CCAMP] Benjamin Kaduk's Discuss on draft-ietf-ccamp-alarm-module-08: (with DISCUSS and COMMENT)

Benjamin Kaduk has entered the following ballot position for
draft-ietf-ccamp-alarm-module-08: Discuss

When responding, please keep the subject line intact and reply to all
email addresses included in the To and CC lines. (Feel free to cut this
introductory paragraph, however.)


Please refer to https://www.ietf.org/iesg/statement/discuss-criteria.html
for more information about IESG DISCUSS and COMMENT positions.


The document, along with other ballot positions, can be found here:
https://datatracker.ietf.org/doc/draft-ietf-ccamp-alarm-module/



----------------------------------------------------------------------
DISCUSS:
----------------------------------------------------------------------

I think we may need to double-check that the example in Appendix C is
fully compliant with the main spec.  In particular, are the
<status-change> elements properly sorted?  (I'm less sure whether the
<last-changed> time needs to match a <status-change> or whether the
<operator-state-change> is good enough for that.)
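
A check like this can be made mechanical.  The following is a minimal
sketch, assuming each <status-change> has a <time> child holding an
RFC 3339 timestamp and that the required order is newest-first; both
points are assumptions to confirm against the YANG module.  The sample
XML is invented for illustration (namespaces omitted) and is not the
Appendix C example.

```python
from datetime import datetime
import xml.etree.ElementTree as ET

# Hypothetical alarm entry: oldest status-change first, and a
# <last-changed> time that matches no <status-change> entry.
SAMPLE = """
<alarm>
  <last-changed>2018-04-08T14:45:00+00:00</last-changed>
  <status-change><time>2018-04-08T14:20:10+00:00</time></status-change>
  <status-change><time>2018-04-08T14:43:00+00:00</time></status-change>
</alarm>
"""


def status_change_times(alarm):
    """Return the <status-change> timestamps in document order."""
    return [
        datetime.fromisoformat(sc.findtext("time"))
        for sc in alarm.findall("status-change")
    ]


def is_sorted(times, newest_first=True):
    """True if the timestamps are in the requested order."""
    return times == sorted(times, reverse=newest_first)


def last_changed_matches(alarm):
    """True if <last-changed> equals the time of some <status-change>."""
    last = datetime.fromisoformat(alarm.findtext("last-changed"))
    return last in status_change_times(alarm)


alarm = ET.fromstring(SAMPLE.strip())
times = status_change_times(alarm)
print("newest-first:", is_sorted(times, newest_first=True))
print("last-changed matches a status-change:", last_changed_matches(alarm))
```

Running the same two checks over the Appendix C instance data would
answer both halves of the question, once the intended sort direction is
confirmed.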


----------------------------------------------------------------------
COMMENT:
----------------------------------------------------------------------

Section 1.1

   o  Fault: A fault is the underlying cause of an undesired behaviour.
      There is no trivial one-to-one mapping between faults and alarms.
      One fault may result in several alarms in case the system lacks
      root-cause and correlation capabilities.  An alarm might not have
      an underlying fault as a cause, imagine a bad MOS score alarm from
      a VOIP probe and the cause being non-optimal QoS configuration.

nit: this is a comma splice

   o  Alarm Type: An alarm type identifies a possible unique alarm state
      for a resource.  Alarm types are names to identify the state like
      "link-alarm", "jitter-violation", "high-disk-utilization".

Are these only intended for human consumption?

   o  Cleared alarm: A cleared alarm is an alarm where the system/server
      considers the undesired state to be cleared.  Operators can not
      clear alarms, clearance is managed by the system.  A linkUp
      notification can be considered a clear condition for a linkDown
      state.

nit: I'd suggest "For example, " before "a linkUp notification [...]"

   o  Corrective Action: An action taken by an operator or automation
      routine in order to minimize the impact of the alarm or resolving
      the root cause.

nit: "or resolve the root cause"

Section 2

   o  Clear definition of "alarm" in order to exclude general events
      that should not be forwarded as alarm notifications.

I'm not sure I am parsing this correctly.  Is the objective to *provide*
such a clear definition?  (Similarly for the next item.)  Part of my
confusion probably stems from the dual use of the word "clear" as both a
verb and an adjective in English.

Section 3.2

   In order to allow for dynamic addition of alarm types the alarm
   module allows for further qualification of the identity based alarm
   type using a string.  A potential drawback of this is that there is a
   big risk that alarm operators will receive alarm types as a surprise,
   they do not know how to resolve the problem since a defined alarm
   procedure does not necessarily exist.  [...]

nit: this is a comma splice.

Section 3.5.1

   From a resource perspective, an alarm can for example have the
   following life-cycle: raise, change severity, change severity, clear,

Is the duplicate "change severity" intentional?

For the alarm history functionality, a given 'time' leaf is clearly the
time that a change occurred, but are the sibling leafs the state before
that change or after that change?  (It looks like the actual YANG module
descriptions (e.g., for 'status-change') indicate that this is the state
after that change, but I don't know if it makes sense to also mention
that here or not.)

Section 3.7

   (blocked/filtered).  Shelved alarms appear in a dedicated shelved
   alarm list in order not to disturb the relevant alarms.  Shelved

nit: this sentence could probably be reworded for greater clarity (it is
not clear from context what "disturb" means, or which alarms count as
"relevant alarms").

Section 4.1

   The "/alarms/control/notify-status-changes" leaf controls if
   notifications are sent for all state changes, only raise and clear,
   or only notifications more severe than a configured level.  This
   feature in combination with alarm shelving corresponds to the ITU
   Alarm Report Control functionality.

Is there a specific section reference for the ITU functionality?  (If
not, it's probably fine to leave this as-is.)
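
For readers following along, the three behaviours described in the
quoted text can be sketched as a small decision function.  This is only
an illustration: the mode names, the severity ranking, the inclusive
threshold, and the choice to always report clears in the severity-based
mode are all assumptions made here, not statements about the module.

```python
from dataclasses import dataclass
from typing import Optional

# Assumed ordering of severities, least to most severe.
SEVERITY_RANK = {"warning": 1, "minor": 2, "major": 3, "critical": 4}


@dataclass
class StatusChange:
    severity: str   # new perceived severity, or "cleared"
    is_raise: bool  # True when this change raises the alarm


def should_notify(mode: str, change: StatusChange,
                  level: Optional[str] = None) -> bool:
    """Decide whether a status change produces a notification."""
    if mode == "all-state-changes":
        return True
    raise_or_clear = change.is_raise or change.severity == "cleared"
    if mode == "raise-and-clear":
        return raise_or_clear
    if mode == "severity-level":
        if level is None:
            raise ValueError("severity-level mode needs a configured level")
        # Assumption: clears are always reported in this mode.
        if change.severity == "cleared":
            return True
        # Assumption: the configured level itself is included.
        return SEVERITY_RANK[change.severity] >= SEVERITY_RANK[level]
    raise ValueError(f"unknown mode: {mode}")


print(should_notify("raise-and-clear", StatusChange("major", False)))
print(should_notify("severity-level", StatusChange("critical", False), "major"))
```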

Section 4.2

      The system might not instrument all defined alarm type identities,
      and some alarm identities are abstract.

I just wanted to double-check: the intent is indeed to use "instrument"
here (as opposed to, say, "implement")?

Section 4.7

   /alarms/alarm-list/purge-alarms:  Delete alarms from the "alarm-list"
      according to specific criteria, for example all cleared alarms
      older than a specific date.

   /alarms/alarm-list/compress-alarms:  Compress the "status-change"
      list for the alarms.

It's a bit confusing to me to have such different language for these two
action nodes, since as far as I can tell the compress-alarms action also
allows for specific criteria (e.g., resources) to be provided to limit
the scope of the action.  (Similarly for compress-shelved-alarms.)

Section 6

     feature service-impact-analysis {
       description
         "The system supports identifying candidate impacted
          resources for an alarm, for example a link being impacted
          by an interface alarm.";

nit: it feels a bit odd to say that the link is "impacted by an alarm",
since normally the causality flow would be the other way -- the link
state changes, and then an alarm state changes.

I'm not sure I understand the distinction between 'warning' and 'minor'
severities -- it seems that neither is supposed to be indicating a
service-affecting fault.

For the 'age-spec' choice, am I reading RFC 7950 correctly that I choose
exactly one case of the choice, so I can't have both a "minute part"
and an "hours part"?  If so, then I would suggest using different
description text for the choices, since the current text implies that
you can combine parts with different units to assemble a compound
timespec.
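
To make the reading concrete, here is a minimal sketch of the constraint
as understood above: nodes from at most one case of a choice may be
present in instance data.  The leaf names in AGE_SPEC_LEAVES are
assumptions chosen for illustration, and the "mandatory" flag is
likewise an assumption about how the choice is declared.

```python
# Assumed leaf names, one per case of the 'age-spec' choice.
AGE_SPEC_LEAVES = ("seconds", "minutes", "hours", "days", "weeks")


def validate_age_spec(instance: dict, mandatory: bool = True):
    """Return the selected case name, or raise if the choice is violated."""
    present = [name for name in AGE_SPEC_LEAVES if name in instance]
    if len(present) > 1:
        raise ValueError(f"more than one case present: {present}")
    if mandatory and not present:
        raise ValueError("no case of the choice is present")
    return present[0] if present else None


def is_valid(instance: dict) -> bool:
    """Boolean wrapper around validate_age_spec."""
    try:
        validate_age_spec(instance)
        return True
    except ValueError:
        return False


# "1 hour 30 minutes" cannot be expressed as two parts under this
# reading; it has to be given in a single unit.
print(is_valid({"hours": 1, "minutes": 30}))
print(is_valid({"minutes": 90}))
```

Under this reading, a client wanting a compound age has to convert it to
a single unit first, which is why the description text matters.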

         list alarm {
              [...]
              Entries appear in the alarm list the first time an alarm
              becomes active for a given alarm-type and resource.
              Entries do not get deleted when the alarm is cleared, this
              is a boolean state in the alarm.

nit: this is a comma splice.

Section 10

It's a bit of a stretch as far as attacks go, but an attacker that can
temporarily reduce max-alarm-status-changes to a small value (e.g., one)
could use that to hide their tracks to some extent if they can cause the
alarm to clear after they've done what they intended to do.  (They could
then restore the max-alarm-status-changes value, too.)
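
The scenario can be sketched with a bounded history.  This is a toy
model, assuming the server keeps only the newest entries once the limit
is reached and prunes immediately when the limit is lowered; the class
and method names are hypothetical and the timestamps are placeholders.

```python
from collections import deque


class AlarmHistory:
    """Toy model of a status-change list bounded by a configurable limit."""

    def __init__(self, max_changes: int):
        self._changes = deque(maxlen=max_changes)

    def record(self, time: str, state: str) -> None:
        # Appending to a full deque silently drops the oldest entry.
        self._changes.append((time, state))

    def set_max(self, max_changes: int) -> None:
        # maxlen is read-only, so rebuild; only the newest entries survive.
        self._changes = deque(self._changes, maxlen=max_changes)

    def entries(self) -> list:
        return list(self._changes)


history = AlarmHistory(max_changes=32)
history.record("t1", "major")      # alarm raised before the attack
history.set_max(1)                 # attacker shrinks the limit
history.record("t2", "critical")   # state change caused by the attack
history.record("t3", "cleared")    # attacker causes the alarm to clear
history.set_max(32)                # attacker restores the limit
print(history.entries())
```

After the limit is restored, only the final "cleared" entry remains in
this model, so the history no longer shows the intermediate state.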

It seems that the list of alarms itself may be sensitive, in that it
could give an attacker an authoritative picture of the (broken) state
of the network.

Appendix F.1

nit: We say that X.733 alarms "also use the basic criteria of deviation from
normal condition", but it's not entirely clear whether the "also"
indicates that they share this behavior with some external reference
point or is a synonym for "additionally".

nit: I think the semicolon in the ISA "comment" should be a regular
colon.

The paragraph at the end talks about "evolution" and "moves from" -- is
the table sorted chronologically?

Appendix G

Are tables 4 and 5 talking about alarm rates or notification rates?