Re: [CCAMP] Benjamin Kaduk's Discuss on draft-ietf-ccamp-alarm-module-08: (with DISCUSS and COMMENT)

stefan vallin <stefan@wallan.se> Tue, 09 April 2019 12:08 UTC

Return-Path: <stefan@wallan.se>
X-Original-To: ccamp@ietfa.amsl.com
Delivered-To: ccamp@ietfa.amsl.com
Received: from localhost (localhost [127.0.0.1]) by ietfa.amsl.com (Postfix) with ESMTP id 6C6C71207AC for <ccamp@ietfa.amsl.com>; Tue, 9 Apr 2019 05:08:58 -0700 (PDT)
X-Virus-Scanned: amavisd-new at amsl.com
X-Spam-Flag: NO
X-Spam-Score: -1.9
X-Spam-Level:
X-Spam-Status: No, score=-1.9 tagged_above=-999 required=5 tests=[BAYES_00=-1.9, DKIMWL_WL_MED=-0.001, DKIM_SIGNED=0.1, DKIM_VALID=-0.1, RCVD_IN_DNSWL_NONE=-0.0001, URIBL_BLOCKED=0.001] autolearn=ham autolearn_force=no
Authentication-Results: ietfa.amsl.com (amavisd-new); dkim=pass (2048-bit key) header.d=wallan-se.20150623.gappssmtp.com
Received: from mail.ietf.org ([4.31.198.44]) by localhost (ietfa.amsl.com [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id uC03lG6Qj-RG for <ccamp@ietfa.amsl.com>; Tue, 9 Apr 2019 05:08:55 -0700 (PDT)
Received: from mail-lf1-x133.google.com (mail-lf1-x133.google.com [IPv6:2a00:1450:4864:20::133]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by ietfa.amsl.com (Postfix) with ESMTPS id B94301202CD for <ccamp@ietf.org>; Tue, 9 Apr 2019 05:08:54 -0700 (PDT)
Received: by mail-lf1-x133.google.com with SMTP id t15so8248321lfl.12 for <ccamp@ietf.org>; Tue, 09 Apr 2019 05:08:54 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=wallan-se.20150623.gappssmtp.com; s=20150623; h=mime-version:subject:from:in-reply-to:date:cc :content-transfer-encoding:message-id:references:to; bh=85QEgXcWDwYH2CVpYgxBfV9iIDQSLl1uW+1coEzkR9w=; b=t5jmAydaGQTiKPywLnzWfo7/qqtLkPCCViPHaKgBlAMUib4Q2/q96YK6+/WTj0pUFm e6c+WBKQsFYbGLUDpkVt97MOtOMIc9zMlHSP2PrDfVZaMwkQ0f77e5Kw0DQEqpjnj3cG oploNpN82Q+eTru2jgH7eV67f/zV32jUpnQ4LZU5l4RVjkUB2RcLpUdwtIFAmT8V1QbA kpN00y7IC19OdsV4XYwKfhfR3l7INppQGKpoP28o2JXVQMHQDqQhnR+avqitdNM7AuuA N2PCfLShuibkFPa/OII/sUyJYIu22N86gi0/NR4Dtl/YNCzpsM9hdggkwcK1UwVFtGM9 BdbQ==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:mime-version:subject:from:in-reply-to:date:cc :content-transfer-encoding:message-id:references:to; bh=85QEgXcWDwYH2CVpYgxBfV9iIDQSLl1uW+1coEzkR9w=; b=UPOnkokNF5LXEA7lxksB52hbrIXVvB9x39c+Bk9Iwny+5cMQqq8pGrIpRQPypJb7UB dsxXsywHHqVMbUTqkA+mIJ1TklsnccoLRfy0Ye3atotIHQ3Sd7AyCrnRE1Ny0lKKdfyT tb60q0TtsrDF/eEh50Gcgf0tup3jDDUWbmbfN4BdPPUn65VR0+qHj6gt34+1jMzlPg+N 6GSEwlmwEBm0cKOLWft1k6AhquzyQNowkOSvLmx/lzxe7cFGCqIO6kh3bPKBMTgnGiaM BlRoy8RvACPp7k/4roiDzv3XCvsP49LyjchaORuNoq11lTcFMzGI5mMT1X0TpDWVYEe9 sU/g==
X-Gm-Message-State: APjAAAXdvnChJf0wRmneHPstUeTY3s7+ceRreS9ZERgzevSpcajoUm9l 3pGxJ1xe+TjjVj13+wRlH3fdAA==
X-Google-Smtp-Source: APXvYqzrTK44x6m1ZzXklfh2c/pTuPCyypY4RY2Ym1zi1XURolEwb7APdm3mdoGh6cRzJfh8VFBKzg==
X-Received: by 2002:ac2:5221:: with SMTP id i1mr8220253lfl.64.1554811732695; Tue, 09 Apr 2019 05:08:52 -0700 (PDT)
Received: from [192.168.8.150] ([195.234.15.130]) by smtp.gmail.com with ESMTPSA id 1sm6722723ljw.56.2019.04.09.05.08.50 (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Tue, 09 Apr 2019 05:08:51 -0700 (PDT)
Content-Type: text/plain; charset="utf-8"
Mime-Version: 1.0 (Mac OS X Mail 12.0 \(3445.100.39\))
From: stefan vallin <stefan@wallan.se>
In-Reply-To: <155475628903.30058.8136578311950144931.idtracker@ietfa.amsl.com>
Date: Tue, 09 Apr 2019 14:08:46 +0200
Cc: The IESG <iesg@ietf.org>, draft-ietf-ccamp-alarm-module@ietf.org, Daniele Ceccarelli <daniele.ceccarelli@ericsson.com>, ccamp-chairs@ietf.org, ccamp@ietf.org
Content-Transfer-Encoding: quoted-printable
Message-Id: <41BF0E2E-DDDD-45B7-B1F5-BBDB00AA7BC4@wallan.se>
References: <155475628903.30058.8136578311950144931.idtracker@ietfa.amsl.com>
To: Benjamin Kaduk <kaduk@mit.edu>
X-Mailer: Apple Mail (2.3445.100.39)
Archived-At: <https://mailarchive.ietf.org/arch/msg/ccamp/YLm6YXTKeOPlwzB3mWD4LyvNxgs>
Subject: Re: [CCAMP] Benjamin Kaduk's Discuss on draft-ietf-ccamp-alarm-module-08: (with DISCUSS and COMMENT)
X-BeenThere: ccamp@ietf.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: Discussion list for the CCAMP working group <ccamp.ietf.org>
List-Unsubscribe: <https://www.ietf.org/mailman/options/ccamp>, <mailto:ccamp-request@ietf.org?subject=unsubscribe>
List-Archive: <https://mailarchive.ietf.org/arch/browse/ccamp/>
List-Post: <mailto:ccamp@ietf.org>
List-Help: <mailto:ccamp-request@ietf.org?subject=help>
List-Subscribe: <https://www.ietf.org/mailman/listinfo/ccamp>, <mailto:ccamp-request@ietf.org?subject=subscribe>
X-List-Received-Date: Tue, 09 Apr 2019 12:08:58 -0000

Hi Benjamin!
Thanks for your review!
See inline

br Stefan and Martin

> On 8 Apr 2019, at 22:44, Benjamin Kaduk via Datatracker <noreply@ietf.org> wrote:
> 
> Benjamin Kaduk has entered the following ballot position for
> draft-ietf-ccamp-alarm-module-08: Discuss
> 
> When responding, please keep the subject line intact and reply to all
> email addresses included in the To and CC lines. (Feel free to cut this
> introductory paragraph, however.)
> 
> 
> Please refer to https://www.ietf.org/iesg/statement/discuss-criteria.html
> for more information about IESG DISCUSS and COMMENT positions.
> 
> 
> The document, along with other ballot positions, can be found here:
> https://datatracker.ietf.org/doc/draft-ietf-ccamp-alarm-module/
> 
> 
> 
> ----------------------------------------------------------------------
> DISCUSS:
> ----------------------------------------------------------------------
> 
> I think we may need to double-check that the example in Appendix C is
> fully compliant with the main spec.  In particular, are the
> <status-change> elments properly sorted?  (I'm less sure whether the
> <last-changed> time needs to match a <status-change> or whether the
> <operator-state-change> is good enough for that.)

Thanks for catching this! The example was as intended but the description
clauses where wrong.

list status-change {
    description
    OLD:
            This list is ordered according to the timestamps of alarm
            state changes.  The last item corresponds to the latest
            state change.

    NEW:
         This list is ordered according to the timestamps of alarm
         status changes.  The first item corresponds to the latest
         status change.

Both <status-change> and <operator-state-change> will update
<last-changed>

leaf last-changed {
   description
   OLD:
        "A timestamp when the alarm status was last changed.  Status
        changes are changes to 'is-cleared', 'perceived-severity',
        and 'alarm-text'.";

   NEW:
        "A timestamp when the 'status-change' or 'operator-state-change' list
         was last changed.";

> 
> 
> ----------------------------------------------------------------------
> COMMENT:
> ----------------------------------------------------------------------
> 
> Section 1.1
> 
>   o  Fault: A fault is the underlying cause of an undesired behaviour.
>      There is no trivial one-to-one mapping between faults and alarms.
>      One fault may result in several alarms in case the system lacks
>      root-cause and correlation capabilities.  An alarm might not have
>      an underlying fault as a cause, imagine a bad MOS score alarm from
>      a VOIP probe and the cause being non-optimal QoS configuration.
> 
> nit: this is a comma splice
change to 
An alarm might not have an underlying fault as a cause. For example,
imagine a bad Mean Opppinion Score (MOS) alarm from a Voice Over IP
(VOIP) probe and the cause being non-optimal QoS configuration.
> 
>   o  Alarm Type: An alarm type identifies a possible unique alarm state
>      for a resource.  Alarm types are names to identify the state like
>      "link-alarm", "jitter-violation", "high-disk-utilization".
> 
> Are these only intended for human consumption?
No, these are YANG identities and intended for automation software as well

> 
>   o  Cleared alarm: A cleared alarm is an alarm where the system/server
>      considers the undesired state to be cleared.  Operators can not
>      clear alarms, clearance is managed by the system.  A linkUp
>      notification can be considered a clear condition for a linkDown
>      state.
> 
> nit: I'd suggest "For example, " before "a linkUp notification […]”
Ok, fixed

> 
>   o  Corrective Action: An action taken by an operator or automation
>      routine in order to minimize the impact of the alarm or resolving
>      the root cause.
> 
> nit: "or resolve the root cause”
Ok, fixed
> 
> Section 2
> 
>   o  Clear definition of "alarm" in order to exclude general events
>      that should not be forwarded as alarm notifications.
> 
> I'm not sure I am parsing this correctly.  Is the objective to *provide*
> such a clear definition?  (Similarly for the next item.)  Part of my
> confusion probably stems from the dual use of the word "clear" as a verb
> and a noun in English.
Good comment, suggestion below:
OLD:
    Clear definition of "alarm" in order to exclude general events
    that should not be forwarded as alarm notifications.
NEW:
    to provide a precise definition of "alarm" in order to exclude general events
    that should not be forwarded as alarm notifications.
> 
> Section 3.2
> 
>   In order to allow for dynamic addition of alarm types the alarm
>   module allows for further qualification of the identity based alarm
>   type using a string.  A potential drawback of this is that there is a
>   big risk that alarm operators will receive alarm types as a surprise,
>   they do not know how to resolve the problem since a defined alarm
>   procedure does not necessarily exist.  [...]
> 
> nit: this is a comma splice.
Correct, will change the comma to a period.

> 
> Section 3.5.1
> 
>   From a resource perspective, an alarm can for example have the
>   following life-cycle: raise, change severity, change severity, clear,
> 
> Is the duplicate "change severity" intentional?
Yes. An alarm can toggle minor, major, minor for example
> 
> For the alarm history functionality, a given 'time' leaf is clearly the
> time that a change occurred, but are the sibling leafs the state before
> that change or after that change?  (It looks like the actual YANG module
> descriptions (e.g., for 'status-change') indicate that this is the state
> after that change, but I don't know if it makes sense to also mention
> that here or not.)
As described in the YANG module it is the state after that change
> 
> Section 3.7
> 
>   (blocked/filtered).  Shelved alarms appear in a dedicated shelved
>   alarm list in order not to disturb the relevant alarms.  Shelved
> 
> nit: this sentence could probably be reworded for greater clarity (what
> does "disturb" mean, and perhaps what "relevant alarms" is may not be
> clear just from context).
Good comment!
OLD:
(blocked/filtered).  Shelved alarms appear in a dedicated shelved
 alarm list in order not to disturb the relevant alarms.  Shelved
NEW:
 (blocked/filtered).  Shelved alarms appear in a dedicated shelved
 alarm list in order to be filtered out from the alarm-list.  Shelved



> 
> Section 4.1
> 
>   The "/alarms/control/notify-status-changes" leaf controls if
>   notifications are sent for all state changes, only raise and clear,
>   or only notifications more severe than a configured level.  This
>   feature in combination with alarm shelving corresponds to the ITU
>   Alarm Report Control functionality.
> 
> Is there a specific section reference for the ITU functionality?  (If
> not, it's probably fine to leave this as-is).
Good comment, changed to:
This feature in combination with alarm shelving corresponds to the ITU
Alarm Report Control functionality, see Appendix F.2.4.

> 
> Section 4.2
> 
>      The system might not instrument all defined alarm type identities,
>      and some alarm identities are abstract.
> 
> I just wanted to double-check: the intent is indeed to use "instrument"
> here (as opposed to, say, "implement”)?

It was intentional but changed to "implement"
> 
> Section 4.7
> 
>   /alarms/alarm-list/purge-alarms:  Delete alarms from the "alarm-list"
>      according to specific criteria, for example all cleared alarms
>      older than a specific date.
> 
>   /alarms/alarm-list/compress-alarms:  Compress the "status-change"
>      list for the alarms.
> 
> It's a bit confusing to me to have such different language for these two
> action nodes, since as far as I can tell the compress-alarms action also
> allows for specific criteria (e.g., resources) to be provided to limit
> the scope of the action.  (Similarly for compress-shelved-alarms.)
Ok, I do not think we will consider changing the action names at this point

> 
> Section 6
> 
>     feature service-impact-analysis {
>       description
>         "The system supports identifying candidate impacted
>          resources for an alarm, for example a link being impacted
>          by an interface alarm.";
> 
> nit: it feels a bit odd to say that the link is "impacted by an alarm",
> since normally the causality flow would be the other way -- the link
> state changes, and then an alarm stat changes.
OLD:
       "The system supports identifying candidate impacted
        resources for an alarm, for example a link being impacted
        by an interface alarm.";
NEW:
       "The system supports identifying candidate impacted
        resources for an alarm. For example, an interface state change
        resulting in a link alarm which can refer to a link as being impacted.";
> 
> I'm not sure I understand the distinction between 'warning' and 'minor'
> severities -- it seems that neither is supposed to be indicating a
> service-affecting fault.

Can agree the difference is subtle and even unclear. However these are
the original definitions from X.733 and they are being used in the industry.
* for minor there is a fault: (indicates the existence of a non-service affecting fault)
* for warning there might be one coming up: (detection of a potential or impending)

> 
> For the 'age-spec' choice, am I reading RFC 7950 correctly that I choose
> exactly one case of the choice, so I can't have both a "minute part"
> and an "hours part"?  If so, then I would suggest using different
> description text for the choices, since the current text implies that
> you can combine parts with different units to assemble a compound
> timespec.

Good comment:
OLD:
 Seconds part.

NEW:
 Age expressed in seconds.

And the same change for the other cases


> 
>         list alarm {
>              [...]
>              Entries appear in the alarm list the first time an alarm
>              becomes active for a given alarm-type and resource.
>              Entries do not get deleted when the alarm is cleared, this
>              is a boolean state in the alarm.
> 
> nit: this is a comma splice.
OLD:
            Entries appear in the alarm list the first time an alarm
            becomes active for a given alarm-type and resource.
            Entries do not get deleted when the alarm is cleared, this
            is a boolean state in the alarm.
NEW:    
	     Entries appear in the alarm list the first time an alarm
            becomes active for a given alarm-type and resource.
            Entries do not get deleted when the alarm is cleared.
            Clear status is represented as a boolean flag.



> 
> Section 10
> 
> It's a bit of a stretch as far as an attack, but an attacker that can
> temporarily reduce max-alarm-status-changes to a small value (e.g., one)
> could use that to hide their tracks to some extent if they can cause the
> alarm to clear after they've done what they intended to do.  (They could
> then restore the max-alarm-status-changes value, too.)
Notifications will still be sent. I think this is a stretch and leave it out.
> 
> It seems that the list of alarms itself may be potentially sensitive, in
> that it potentially gives an attacker an authoritative picture of the
> (broken) state of the network.

Ok, added the following:

        The list of alarms itself may be potentially sensitive from a
        security perspective, in that it potentially gives an attacker
        an authoritative picture of the (broken) state of the network.



> 
> Appendix F.1
> 
> nit: We say that X.733 alarms "also use the basic criteria of deviation from
> normal condition", but it's not entirely clear whether the "also"
> indicates that they share this behavior with some external reference
> point or is a synonym for "additionally”.

OLD:
	It also use the basic criteria of deviation from
	normal condition
NEW:
	X.733 defines an alarm as a deviation from
       normal condition, but without the requirement
       that it requires corrective actions.

> 
> nit: I think the semicolon in the ISA "comment" should be a regular
> colon.
Fixed
> 
> The paragraph at the end talks about "evolution" and "moves from" -- is
> the table sorted chronologically?

Mostly, also sorted to show exactly that. I think the phrase is ok

> Appendix G
> 
> Are tables 4 and 5 talking about alarm rates or notification rates?

These are usability requirements and therefore the alarm rate rather than the notification rates.
This is also an effect of this module, not focusing on a list of notifications but alarm states as a result of notifications.
> 
>