[CCAMP] Benjamin Kaduk's Discuss on draft-ietf-ccamp-alarm-module-08: (with DISCUSS and COMMENT)
Benjamin Kaduk via Datatracker <noreply@ietf.org> Mon, 08 April 2019 20:44 UTC
From: Benjamin Kaduk via Datatracker <noreply@ietf.org>
To: The IESG <iesg@ietf.org>
Cc: draft-ietf-ccamp-alarm-module@ietf.org, Daniele Ceccarelli <daniele.ceccarelli@ericsson.com>, ccamp-chairs@ietf.org, daniele.ceccarelli@ericsson.com, ccamp@ietf.org
Reply-To: Benjamin Kaduk <kaduk@mit.edu>
Message-ID: <155475628903.30058.8136578311950144931.idtracker@ietfa.amsl.com>
Date: Mon, 08 Apr 2019 13:44:49 -0700
Archived-At: <https://mailarchive.ietf.org/arch/msg/ccamp/ct90uZsxh_sEJ-_vUqnEP0ufoDg>
Subject: [CCAMP] Benjamin Kaduk's Discuss on draft-ietf-ccamp-alarm-module-08: (with DISCUSS and COMMENT)
Benjamin Kaduk has entered the following ballot position for
draft-ietf-ccamp-alarm-module-08: Discuss

When responding, please keep the subject line intact and reply to all
email addresses included in the To and CC lines. (Feel free to cut this
introductory paragraph, however.)

Please refer to https://www.ietf.org/iesg/statement/discuss-criteria.html
for more information about IESG DISCUSS and COMMENT positions.

The document, along with other ballot positions, can be found here:
https://datatracker.ietf.org/doc/draft-ietf-ccamp-alarm-module/

----------------------------------------------------------------------
DISCUSS:
----------------------------------------------------------------------

I think we may need to double-check that the example in Appendix C is
fully compliant with the main spec. In particular, are the
<status-change> elements properly sorted? (I'm less sure whether the
<last-changed> time needs to match a <status-change> or whether the
<operator-state-change> is good enough for that.)

----------------------------------------------------------------------
COMMENT:
----------------------------------------------------------------------

Section 1.1

   o  Fault: A fault is the underlying cause of an undesired behaviour.
      There is no trivial one-to-one mapping between faults and alarms.
      One fault may result in several alarms in case the system lacks
      root-cause and correlation capabilities.  An alarm might not have
      an underlying fault as a cause, imagine a bad MOS score alarm from
      a VOIP probe and the cause being non-optimal QoS configuration.

nit: this is a comma splice

   o  Alarm Type: An alarm type identifies a possible unique alarm state
      for a resource.  Alarm types are names to identify the state like
      "link-alarm", "jitter-violation", "high-disk-utilization".

Are these only intended for human consumption?

   o  Cleared alarm: A cleared alarm is an alarm where the system/server
      considers the undesired state to be cleared.
      Operators can not clear alarms, clearance is managed by the
      system.  A linkUp notification can be considered a clear condition
      for a linkDown state.

nit: I'd suggest "For example, " before "a linkUp notification [...]"

   o  Corrective Action: An action taken by an operator or automation
      routine in order to minimize the impact of the alarm or resolving
      the root cause.

nit: "or resolve the root cause"

Section 2

   o  Clear definition of "alarm" in order to exclude general events
      that should not be forwarded as alarm notifications.

I'm not sure I am parsing this correctly. Is the objective to *provide*
such a clear definition? (Similarly for the next item.) Part of my
confusion probably stems from the dual use of the word "clear" as a verb
and a noun in English.

Section 3.2

   In order to allow for dynamic addition of alarm types the alarm
   module allows for further qualification of the identity based alarm
   type using a string.  A potential drawback of this is that there is a
   big risk that alarm operators will receive alarm types as a surprise,
   they do not know how to resolve the problem since a defined alarm
   procedure does not necessarily exist. [...]

nit: this is a comma splice.

Section 3.5.1

   From a resource perspective, an alarm can for example have the
   following life-cycle: raise, change severity, change severity, clear,

Is the duplicate "change severity" intentional?

For the alarm history functionality, a given 'time' leaf is clearly the
time that a change occurred, but are the sibling leafs the state before
that change or after that change? (It looks like the actual YANG module
descriptions (e.g., for 'status-change') indicate that this is the state
after that change, but I don't know if it makes sense to also mention
that here or not.)

Section 3.7

   (blocked/filtered).  Shelved alarms appear in a dedicated shelved
   alarm list in order not to disturb the relevant alarms.
   Shelved

nit: this sentence could probably be reworded for greater clarity (what
does "disturb" mean, and perhaps what "relevant alarms" is may not be
clear just from context).

Section 4.1

   The "/alarms/control/notify-status-changes" leaf controls if
   notifications are sent for all state changes, only raise and clear,
   or only notifications more severe than a configured level.  This
   feature in combination with alarm shelving corresponds to the ITU
   Alarm Report Control functionality.

Is there a specific section reference for the ITU functionality? (If
not, it's probably fine to leave this as-is).

Section 4.2

   The system might not instrument all defined alarm type identities,
   and some alarm identities are abstract.

I just wanted to double-check: the intent is indeed to use "instrument"
here (as opposed to, say, "implement")?

Section 4.7

   /alarms/alarm-list/purge-alarms: Delete alarms from the "alarm-list"
   according to specific criteria, for example all cleared alarms older
   than a specific date.

   /alarms/alarm-list/compress-alarms: Compress the "status-change" list
   for the alarms.

It's a bit confusing to me to have such different language for these two
action nodes, since as far as I can tell the compress-alarms action also
allows for specific criteria (e.g., resources) to be provided to limit
the scope of the action. (Similarly for compress-shelved-alarms.)

Section 6

     feature service-impact-analysis {
       description
         "The system supports identifying candidate impacted
          resources for an alarm, for example a link being impacted
          by an interface alarm.";

nit: it feels a bit odd to say that the link is "impacted by an alarm",
since normally the causality flow would be the other way -- the link
state changes, and then an alarm state changes.

I'm not sure I understand the distinction between 'warning' and 'minor'
severities -- it seems that neither is supposed to be indicating a
service-affecting fault.

For the 'age-spec' choice, am I reading RFC 7950 correctly that I choose
exactly one case of the choice, so I can't have both a "minute part" and
an "hours part"? If so, then I would suggest using different description
text for the choices, since the current text implies that you can
combine parts with different units to assemble a compound timespec.

       list alarm {
         [...]
            Entries appear in the alarm list the first time an
            alarm becomes active for a given alarm-type and resource.
            Entries do not get deleted when the alarm is cleared, this
            is a boolean state in the alarm.

nit: this is a comma splice.

Section 10

It's a bit of a stretch as far as an attack, but an attacker that can
temporarily reduce max-alarm-status-changes to a small value (e.g., one)
could use that to hide their tracks to some extent if they can cause the
alarm to clear after they've done what they intended to do. (They could
then restore the max-alarm-status-changes value, too.)

It seems that the list of alarms itself may be potentially sensitive, in
that it potentially gives an attacker an authoritative picture of the
(broken) state of the network.

Appendix F.1

nit: We say that X.733 alarms "also use the basic criteria of deviation
from normal condition", but it's not entirely clear whether the "also"
indicates that they share this behavior with some external reference
point or is a synonym for "additionally".

nit: I think the semicolon in the ISA "comment" should be a regular
colon.

The paragraph at the end talks about "evolution" and "moves from" -- is
the table sorted chronologically?

Appendix G

Are tables 4 and 5 talking about alarm rates or notification rates?
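
Returning to the DISCUSS point: the sort-order check on the Appendix C
example could be scripted roughly as follows. This is only a sketch
under stated assumptions. The XML fragment is made up for illustration
and is not the draft's Appendix C text, the XML namespace is omitted,
timestamps are assumed to be UTC with a trailing "Z" and no fractional
seconds, and the helper names (status_change_times, is_sorted) are
hypothetical. Whether the list is meant to be newest-first or
oldest-first is left as a parameter, since that has to be confirmed
against the 'status-change' description in the YANG module.

```python
import xml.etree.ElementTree as ET
from datetime import datetime

# Made-up alarm fragment for illustration only; NOT the Appendix C example.
# The XML namespace is deliberately omitted to keep the sketch short.
SAMPLE = """
<alarm>
  <status-change>
    <time>2018-04-08T08:39:50Z</time>
    <perceived-severity>cleared</perceived-severity>
  </status-change>
  <status-change>
    <time>2018-04-08T08:20:10Z</time>
    <perceived-severity>major</perceived-severity>
  </status-change>
</alarm>
"""


def status_change_times(alarm_xml):
    """Return the <time> of each <status-change>, in document order."""
    root = ET.fromstring(alarm_xml.strip())
    times = []
    for entry in root.findall("status-change"):
        text = entry.findtext("time")
        # Assumes UTC timestamps with a trailing "Z" and whole seconds.
        times.append(datetime.strptime(text, "%Y-%m-%dT%H:%M:%SZ"))
    return times


def is_sorted(times, newest_first=True):
    """Check that the timestamps are monotonic in the chosen direction.

    The direction is an assumption to be checked against the module text.
    """
    pairs = list(zip(times, times[1:]))
    if newest_first:
        return all(a >= b for a, b in pairs)
    return all(a <= b for a, b in pairs)


if __name__ == "__main__":
    times = status_change_times(SAMPLE)
    print("newest-first:", is_sorted(times, newest_first=True))
    print("oldest-first:", is_sorted(times, newest_first=False))
```

Running the same two helpers over the real Appendix C instance (with
namespace handling added) would show which ordering, if either, the
example currently satisfies.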