Re: [CCAMP] Benjamin Kaduk's Discuss on draft-ietf-ccamp-alarm-module-08: (with DISCUSS and COMMENT)
Martin Bjorklund <mbj@tail-f.com> Thu, 11 April 2019 15:23 UTC
Return-Path: <mbj@tail-f.com>
X-Original-To: ccamp@ietfa.amsl.com
Delivered-To: ccamp@ietfa.amsl.com
Received: from localhost (localhost [127.0.0.1]) by ietfa.amsl.com (Postfix) with ESMTP id 96D7812039E; Thu, 11 Apr 2019 08:23:06 -0700 (PDT)
X-Virus-Scanned: amavisd-new at amsl.com
X-Spam-Flag: NO
X-Spam-Score: -1.9
X-Spam-Level:
X-Spam-Status: No, score=-1.9 tagged_above=-999 required=5 tests=[BAYES_00=-1.9, SPF_PASS=-0.001, URIBL_BLOCKED=0.001] autolearn=ham autolearn_force=no
Received: from mail.ietf.org ([4.31.198.44]) by localhost (ietfa.amsl.com [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id Jr9prX0zkUIN; Thu, 11 Apr 2019 08:23:03 -0700 (PDT)
Received: from mail.tail-f.com (mail.tail-f.com [46.21.102.45]) by ietfa.amsl.com (Postfix) with ESMTP id 957F5120396; Thu, 11 Apr 2019 08:23:02 -0700 (PDT)
Received: from localhost (unknown [173.38.220.61]) by mail.tail-f.com (Postfix) with ESMTPSA id 5CC301AE0498; Thu, 11 Apr 2019 17:23:01 +0200 (CEST)
Date: Thu, 11 Apr 2019 17:23:03 +0200
Message-Id: <20190411.172303.950933990774253893.mbj@tail-f.com>
To: kaduk@mit.edu
Cc: stefan@wallan.se, iesg@ietf.org, draft-ietf-ccamp-alarm-module@ietf.org, daniele.ceccarelli@ericsson.com, ccamp-chairs@ietf.org, ccamp@ietf.org
From: Martin Bjorklund <mbj@tail-f.com>
In-Reply-To: <20190410023022.GC18549@kduck.mit.edu>
References: <155475628903.30058.8136578311950144931.idtracker@ietfa.amsl.com> <41BF0E2E-DDDD-45B7-B1F5-BBDB00AA7BC4@wallan.se> <20190410023022.GC18549@kduck.mit.edu>
X-Mailer: Mew version 6.7 on Emacs 25.2 / Mule 6.0 (HANACHIRUSATO)
Mime-Version: 1.0
Content-Type: Text/Plain; charset="utf-8"
Content-Transfer-Encoding: base64
Archived-At: <https://mailarchive.ietf.org/arch/msg/ccamp/QbJax9axNcz0cM405sBcykUWkDM>
Subject: Re: [CCAMP] Benjamin Kaduk's Discuss on draft-ietf-ccamp-alarm-module-08: (with DISCUSS and COMMENT)
X-BeenThere: ccamp@ietf.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: Discussion list for the CCAMP working group <ccamp.ietf.org>
List-Unsubscribe: <https://www.ietf.org/mailman/options/ccamp>, <mailto:ccamp-request@ietf.org?subject=unsubscribe>
List-Archive: <https://mailarchive.ietf.org/arch/browse/ccamp/>
List-Post: <mailto:ccamp@ietf.org>
List-Help: <mailto:ccamp-request@ietf.org?subject=help>
List-Subscribe: <https://www.ietf.org/mailman/listinfo/ccamp>, <mailto:ccamp-request@ietf.org?subject=subscribe>
X-List-Received-Date: Thu, 11 Apr 2019 15:23:07 -0000
Hi, We have now posted draft-ietf-ccamp-alarm-module-09. Please verify that this version addresses your DISUSS. /martin & stefan Benjamin Kaduk <kaduk@mit.edu> wrote: > On Tue, Apr 09, 2019 at 02:08:46PM +0200, stefan vallin wrote: > > Hi Benjamin! > > Thanks for your review! > > See inline > > > > br Stefan and Martin > > > > > On 8 Apr 2019, at 22:44, Benjamin Kaduk via Datatracker <noreply@ietf.org> wrote: > > > > > > Benjamin Kaduk has entered the following ballot position for > > > draft-ietf-ccamp-alarm-module-08: Discuss > > > > > > When responding, please keep the subject line intact and reply to all > > > email addresses included in the To and CC lines. (Feel free to cut this > > > introductory paragraph, however.) > > > > > > > > > Please refer to https://www.ietf.org/iesg/statement/discuss-criteria.html > > > for more information about IESG DISCUSS and COMMENT positions. > > > > > > > > > The document, along with other ballot positions, can be found here: > > > https://datatracker.ietf.org/doc/draft-ietf-ccamp-alarm-module/ > > > > > > > > > > > > ---------------------------------------------------------------------- > > > DISCUSS: > > > ---------------------------------------------------------------------- > > > > > > I think we may need to double-check that the example in Appendix C is > > > fully compliant with the main spec. In particular, are the > > > <status-change> elments properly sorted? (I'm less sure whether the > > > <last-changed> time needs to match a <status-change> or whether the > > > <operator-state-change> is good enough for that.) > > > > Thanks for catching this! The example was as intended but the description > > clauses where wrong. > > Ah, I'm happy to be ableto help! > These changes look good to bring the example and spec into line with the > intended behavior. > > > list status-change { > > description > > OLD: > > This list is ordered according to the timestamps of alarm > > state changes. The last item corresponds to the latest > > state change. > > > > NEW: > > This list is ordered according to the timestamps of alarm > > status changes. The first item corresponds to the latest > > status change. > > > > Both <status-change> and <operator-state-change> will update > > <last-changed> > > > > leaf last-changed { > > description > > OLD: > > "A timestamp when the alarm status was last changed. Status > > changes are changes to 'is-cleared', 'perceived-severity', > > and 'alarm-text'."; > > > > NEW: > > "A timestamp when the 'status-change' or 'operator-state-change' list > > was last changed."; > > > > > > > > > > > ---------------------------------------------------------------------- > > > COMMENT: > > > ---------------------------------------------------------------------- > > > > > > Section 1.1 > > > > > > o Fault: A fault is the underlying cause of an undesired behaviour. > > > There is no trivial one-to-one mapping between faults and alarms. > > > One fault may result in several alarms in case the system lacks > > > root-cause and correlation capabilities. An alarm might not have > > > an underlying fault as a cause, imagine a bad MOS score alarm from > > > a VOIP probe and the cause being non-optimal QoS configuration. > > > > > > nit: this is a comma splice > > change to > > An alarm might not have an underlying fault as a cause. For example, > > imagine a bad Mean Opppinion Score (MOS) alarm from a Voice Over IP > > (VOIP) probe and the cause being non-optimal QoS configuration. > > > > > > o Alarm Type: An alarm type identifies a possible unique alarm state > > > for a resource. Alarm types are names to identify the state like > > > "link-alarm", "jitter-violation", "high-disk-utilization". > > > > > > Are these only intended for human consumption? > > No, these are YANG identities and intended for automation software as well > > I think my follow-up question was going to be whether there was value in a > registry for them, to avoid proliferation of nearly-identical types from > different vendors. (I honestly don't know.) > > > > > > > o Cleared alarm: A cleared alarm is an alarm where the system/server > > > considers the undesired state to be cleared. Operators can not > > > clear alarms, clearance is managed by the system. A linkUp > > > notification can be considered a clear condition for a linkDown > > > state. > > > > > > nit: I'd suggest "For example, " before "a linkUp notification […]” > > Ok, fixed > > > > > > > > o Corrective Action: An action taken by an operator or automation > > > routine in order to minimize the impact of the alarm or resolving > > > the root cause. > > > > > > nit: "or resolve the root cause” > > Ok, fixed > > > > > > Section 2 > > > > > > o Clear definition of "alarm" in order to exclude general events > > > that should not be forwarded as alarm notifications. > > > > > > I'm not sure I am parsing this correctly. Is the objective to *provide* > > > such a clear definition? (Similarly for the next item.) Part of my > > > confusion probably stems from the dual use of the word "clear" as a verb > > > and a noun in English. > > Good comment, suggestion below: > > OLD: > > Clear definition of "alarm" in order to exclude general events > > that should not be forwarded as alarm notifications. > > NEW: > > to provide a precise definition of "alarm" in order to exclude general events > > that should not be forwarded as alarm notifications. > > Thanks! > > > > Section 3.2 > > > > > > In order to allow for dynamic addition of alarm types the alarm > > > module allows for further qualification of the identity based alarm > > > type using a string. A potential drawback of this is that there is a > > > big risk that alarm operators will receive alarm types as a surprise, > > > they do not know how to resolve the problem since a defined alarm > > > procedure does not necessarily exist. [...] > > > > > > nit: this is a comma splice. > > Correct, will change the comma to a period. > > > > > > > > Section 3.5.1 > > > > > > From a resource perspective, an alarm can for example have the > > > following life-cycle: raise, change severity, change severity, clear, > > > > > > Is the duplicate "change severity" intentional? > > Yes. An alarm can toggle minor, major, minor for example > > Okay. > > > > For the alarm history functionality, a given 'time' leaf is clearly the > > > time that a change occurred, but are the sibling leafs the state before > > > that change or after that change? (It looks like the actual YANG module > > > descriptions (e.g., for 'status-change') indicate that this is the state > > > after that change, but I don't know if it makes sense to also mention > > > that here or not.) > > As described in the YANG module it is the state after that change > > > > > > Section 3.7 > > > > > > (blocked/filtered). Shelved alarms appear in a dedicated shelved > > > alarm list in order not to disturb the relevant alarms. Shelved > > > > > > nit: this sentence could probably be reworded for greater clarity (what > > > does "disturb" mean, and perhaps what "relevant alarms" is may not be > > > clear just from context). > > Good comment! > > OLD: > > (blocked/filtered). Shelved alarms appear in a dedicated shelved > > alarm list in order not to disturb the relevant alarms. Shelved > > NEW: > > (blocked/filtered). Shelved alarms appear in a dedicated shelved > > alarm list in order to be filtered out from the alarm-list. Shelved > > That helps, thanks. Perhaps "in order for the main alarm-list to only > contain entries of interest" could be a further improvement? > > > > > > > > > > > Section 4.1 > > > > > > The "/alarms/control/notify-status-changes" leaf controls if > > > notifications are sent for all state changes, only raise and clear, > > > or only notifications more severe than a configured level. This > > > feature in combination with alarm shelving corresponds to the ITU > > > Alarm Report Control functionality. > > > > > > Is there a specific section reference for the ITU functionality? (If > > > not, it's probably fine to leave this as-is). > > Good comment, changed to: > > This feature in combination with alarm shelving corresponds to the ITU > > Alarm Report Control functionality, see Appendix F.2.4. > > > > > > > > Section 4.2 > > > > > > The system might not instrument all defined alarm type identities, > > > and some alarm identities are abstract. > > > > > > I just wanted to double-check: the intent is indeed to use "instrument" > > > here (as opposed to, say, "implement”)? > > > > It was intentional but changed to "implement" > > (No need to change it just on my account; I can see that "instrument" has a > valid meaning if that's the intent.) > > > > Section 4.7 > > > > > > /alarms/alarm-list/purge-alarms: Delete alarms from the "alarm-list" > > > according to specific criteria, for example all cleared alarms > > > older than a specific date. > > > > > > /alarms/alarm-list/compress-alarms: Compress the "status-change" > > > list for the alarms. > > > > > > It's a bit confusing to me to have such different language for these two > > > action nodes, since as far as I can tell the compress-alarms action also > > > allows for specific criteria (e.g., resources) to be provided to limit > > > the scope of the action. (Similarly for compress-shelved-alarms.) > > Ok, I do not think we will consider changing the action names at this point > > Sorry, I did not mean the name of the YANG leafs, but rather the text about > "according to specific criteria". > > > > > > > Section 6 > > > > > > feature service-impact-analysis { > > > description > > > "The system supports identifying candidate impacted > > > resources for an alarm, for example a link being impacted > > > by an interface alarm."; > > > > > > nit: it feels a bit odd to say that the link is "impacted by an alarm", > > > since normally the causality flow would be the other way -- the link > > > state changes, and then an alarm stat changes. > > OLD: > > "The system supports identifying candidate impacted > > resources for an alarm, for example a link being impacted > > by an interface alarm."; > > NEW: > > "The system supports identifying candidate impacted > > resources for an alarm. For example, an interface state change > > resulting in a link alarm which can refer to a link as being impacted."; > > > > > > I'm not sure I understand the distinction between 'warning' and 'minor' > > > severities -- it seems that neither is supposed to be indicating a > > > service-affecting fault. > > > > Can agree the difference is subtle and even unclear. However these are > > the original definitions from X.733 and they are being used in the industry. > > * for minor there is a fault: (indicates the existence of a non-service affecting fault) > > * for warning there might be one coming up: (detection of a potential or impending) > > Okay; I am happy to concede that I'm not an expert here :) > > > > > > > For the 'age-spec' choice, am I reading RFC 7950 correctly that I choose > > > exactly one case of the choice, so I can't have both a "minute part" > > > and an "hours part"? If so, then I would suggest using different > > > description text for the choices, since the current text implies that > > > you can combine parts with different units to assemble a compound > > > timespec. > > > > Good comment: > > OLD: > > Seconds part. > > > > NEW: > > Age expressed in seconds. > > > > And the same change for the other cases > > Thanks > > > > > > > > > list alarm { > > > [...] > > > Entries appear in the alarm list the first time an alarm > > > becomes active for a given alarm-type and resource. > > > Entries do not get deleted when the alarm is cleared, this > > > is a boolean state in the alarm. > > > > > > nit: this is a comma splice. > > OLD: > > Entries appear in the alarm list the first time an alarm > > becomes active for a given alarm-type and resource. > > Entries do not get deleted when the alarm is cleared, this > > is a boolean state in the alarm. > > NEW: > > Entries appear in the alarm list the first time an alarm > > becomes active for a given alarm-type and resource. > > Entries do not get deleted when the alarm is cleared. > > Clear status is represented as a boolean flag. > > > > > > > > > > > > Section 10 > > > > > > It's a bit of a stretch as far as an attack, but an attacker that can > > > temporarily reduce max-alarm-status-changes to a small value (e.g., one) > > > could use that to hide their tracks to some extent if they can cause the > > > alarm to clear after they've done what they intended to do. (They could > > > then restore the max-alarm-status-changes value, too.) > > Notifications will still be sent. I think this is a stretch and leave it out. > > Good point about notifications, thanks. > > > > It seems that the list of alarms itself may be potentially sensitive, in > > > that it potentially gives an attacker an authoritative picture of the > > > (broken) state of the network. > > > > Ok, added the following: > > > > The list of alarms itself may be potentially sensitive from a > > security perspective, in that it potentially gives an attacker > > an authoritative picture of the (broken) state of the network. > > > > > > > > > > > > Appendix F.1 > > > > > > nit: We say that X.733 alarms "also use the basic criteria of deviation from > > > normal condition", but it's not entirely clear whether the "also" > > > indicates that they share this behavior with some external reference > > > point or is a synonym for "additionally”. > > > > OLD: > > It also use the basic criteria of deviation from > > normal condition > > NEW: > > X.733 defines an alarm as a deviation from > > normal condition, but without the requirement > > that it requires corrective actions. > > > > > > > > nit: I think the semicolon in the ISA "comment" should be a regular > > > colon. > > Fixed > > > > > > The paragraph at the end talks about "evolution" and "moves from" -- is > > > the table sorted chronologically? > > > > Mostly, also sorted to show exactly that. I think the phrase is ok > > The phrase is definitely fine -- I just didn't do the check for the > publication dates of the various specs, but am happy to trust that they are > sorted properly. > > > > Appendix G > > > > > > Are tables 4 and 5 talking about alarm rates or notification rates? > > > > These are usability requirements and therefore the alarm rate rather than the notification rates. > > This is also an effect of this module, not focusing on a list of notifications but alarm states as a result of notifications. > > Okay, thanks for confirming. > > -Ben >
- [CCAMP] Benjamin Kaduk's Discuss on draft-ietf-cc… Benjamin Kaduk via Datatracker
- Re: [CCAMP] Benjamin Kaduk's Discuss on draft-iet… stefan vallin
- Re: [CCAMP] Benjamin Kaduk's Discuss on draft-iet… Benjamin Kaduk
- Re: [CCAMP] Benjamin Kaduk's Discuss on draft-iet… stefan vallin
- Re: [CCAMP] Benjamin Kaduk's Discuss on draft-iet… Martin Bjorklund