Re: [CCAMP] Genart last call review of draft-ietf-ccamp-alarm-module-07
Dan Romascanu <dromasca@gmail.com> Tue, 19 March 2019 18:54 UTC
From: Dan Romascanu <dromasca@gmail.com>
Date: Tue, 19 Mar 2019 20:54:08 +0200
Message-ID: <CAFgnS4WOQL3Q=ec_8efvKtMXb8RSXyaRGx+iyMs28RnWk=HGtQ@mail.gmail.com>
To: stefan vallin <stefan@wallan.se>
Cc: gen-art <gen-art@ietf.org>, draft-ietf-ccamp-alarm-module.all@ietf.org, ccamp@ietf.org, ietf <ietf@ietf.org>
Archived-At: <https://mailarchive.ietf.org/arch/msg/ccamp/IRXhWJYpjCv7H42GHRiVnB5Ewq4>
Hi Stefan,

Thank you for your answer and for addressing my concerns. I am comfortable with your proposals. If your AD agrees, I would include these in a revised version before submission to the approval of the IESG.

Regards,

Dan

On Tue, Mar 19, 2019 at 5:11 PM stefan vallin <stefan@wallan.se> wrote:

> Hi Dan!
> Thanks for your review, an honour to have RFC 3877 in the loop :)
> See inline
> br Stefan
>
> > Major issues:
> >
> > 1. The definition of Alarm is key for the whole model. It reads like this:
> > Alarm (the general concept): An alarm signifies an undesirable state in a
> > resource that requires corrective action.
> >
> > However, RFC 3877 already defined a number of concepts including:
> > Error
> > A deviation of a system from normal operation.
> >
> > Fault
> > Lasting error or warning condition.
> >
> > ....
> >
> > Alarm
> > Persistent indication of a fault.
> >
> > I believe that there is a need to show why the model defined by RFC 3877 needs
> > to be changed, and why the difference that RFC 3877 was making between a Fault
> > and an Alarm is no longer needed.
>
> Good comment, you are right, and we need to keep the distinction between fault and alarm.
> That distinction is used in X.733, 3GPP IRP and others. The general pattern is that “fault”
> refers to what is really broken, and the alarm is the manifestation of that underlying cause.
> There is not a simple 1-1 relationship between a fault and an alarm:
> * 1 fault may have many alarms due to limited root cause capabilities of the system
> * There might be no underlying fault to an alarm. Consider a non-optimal QoS configuration
>   which gives bad quality in VOIP calls. Certainly a MOS alarm from the VOIP probe, but there
>   is no “fault” as such (if you do not consider a non-optimal config as a fault)
>
> So X.733
> X.733 fault: The physical or algorithmic cause of a malfunction
> 3GPP fault: a deviation of a system from normal operation, which may result in the loss of
> operational capabilities of the element or the loss of redundancy in case of a redundant
> configuration
>
> I suggest we add the following to terminology:
> Fault: the underlying cause of an undesired behaviour
>
> If we then turn to the term “alarm”. I have added two aspects to the definition of an alarm:
>
> An alarm signifies an *undesirable state* in a resource that *requires corrective action*.
>
> Mostly based on the alarm standardization work in the process industry (see draft references).
>
> 1) Rather than “deviation from normal”, we say “undesirable”, a subtle difference.
> In IT environments it is easier to define what is normal, for example a normal load to a web server.
> Any deviation from that normal load could be an alarm.
> In networking, things are more dynamic, and deviation from normal might be the desired state.
> So the definition stresses the fact that it is an undesired state, not just deviation from normal.
>
> 2) Adding the requirement that an alarm per definition should require an action. This is a sound
> requirement that constrains what qualifies as an alarm and limits the number of alarms.
> (See for example the EEMUA and ISA182 references in the draft.) The 3GPP Alarm standard
> also added this to their definition in later revisions to address the alarm overload problem.
>
> > Also, RFC 3877 defined in Section 3 a
> > Framework and an Architecture that was consistent with X.733. This document has
> > no such section, and while acknowledging the need for a mapping to X.733 it
> > states as a goal:
> > Mapping to X.733, which is a requirement for some alarm systems. Still, keep
> > some of the X.733 concepts out of the core model in order to make the model
> > small and easy to understand
> >
> > More details about what is left out and why these are not needed would help.
>
> The alarm YANG model does not *require* the X.733 parameter definitions of, for example,
> probable-cause enum values. Today, most networking devices and management systems do not
> rely on those enumerations.
>
> Those are defined in the X733 augmentation module in order to keep the core model as
> small and useful as possible. X733 requirements come more often from telecom environments.
>
> > Minor issues:
> >
> > 1. Section 2 makes a statement that includes
> > ... While IETF has not really addressed alarm management
> >
> > This is actually not accurate. RFC 3877 addressed Alarm Management. Maybe
> > there is a need to revise that approach, but this should be done explicitly,
> > not by stating that it did not exist.
>
> Correct, bad wording.
> OLD TEXT:
> Address alarm usability requirements, see Appendix G. While IETF
> has not really addressed alarm management, telecom standards has
> addressed it purely from a protocol perspective. The process
> industry has published several relevant standards addressing
> requirements for a useful alarm interface; [EEMUA], [ISA182].
> This alarm module defines usability requirements as well as a YANG
> data model.
> SUGGESTION:
> Address alarm usability requirements, see Appendix G. While IETF
> and telecom standards have addressed alarms mostly from a
> protocol perspective, the process industry has published
> several relevant standards addressing requirements for a useful
> alarm interface; [EEMUA], [ISA182].
> This alarm module defines usability requirements as well as a YANG
> data model.
>
> > 2. Section 3.5:
> > Closing an alarm implies that the operator considers the corrective action
> > performed.
> >
> > Is this always true? The undesirable state may have been cancelled by some
> > other event than corrective action, for example the resource is no longer used,
> > or the time elapsed may have made the undesirable state irrelevant.
>
> I think it is important to keep the two perspectives in mind. An operator closing an
> alarm is only a flag from the operations team that the alarm does not need an action.
> It might be cleared or not cleared by the system.
>
> So in your first example, the alarm is probably cleared by the instrumentation,
> correlating “the other event”.
>
> If the resource is no longer used a shelf should be created.
>
> If time has passed, depends, ….
>
> > 3. In section 3.5.1:
> > Alarms are not cleared by operators, only the underlying instrumentation can
> > clear an alarm. Operators can close alarms.
> >
> > So, the document makes a distinction between clearing an alarm and closing an
> > alarm. It may be good to define the two concepts to make the distinction clear.
>
> Good point!
>
> Suggested terminology additions:
> * Cleared alarm: a cleared alarm is an alarm where the system/server considers the
>   undesired state to be cleared. Operators cannot clear alarms; clearance is managed
>   by the system. A linkUp notification can be considered a clear condition for a linkDown state.
>
> * Closed alarm: operators can close alarms irrespective of the alarm being cleared or not.
>   A closed alarm indicates that the alarm does not need attention, either because the corrective
>   action has been taken or because it can be ignored for other reasons.
>
> > 4. Appendix F.1:
> > The alarm MIB is state oriented rather than notification oriented, an alarm
> > is a "lasting condition", not a discrete notification reporting about a
> > condition state change.
>
> Good catch, will rephrase. The alarm MIB and the alarm YANG have a stateful view
> of alarms, not a notification-focused one.
>
> Suggested change:
> OLD
> RFC 3877 defines alarm referring back to "a deviation from normal operation". This is
> problematic, since this might not require an operator action. The alarm MIB is state
> oriented rather than notification oriented, an alarm is a "lasting condition", not a
> discrete notification reporting about a condition state change.
> NEW:
> RFC 3877 defines alarm referring back to "a deviation from normal operation". The Alarm YANG
> model adds the requirement that it should require a corrective action and should be undesired,
> not only a deviation from normal. The alarm MIB is state oriented in the same way as the Alarm YANG,
> it focuses on the "lasting condition", not the individual notifications.
>
> > I am not sure that I understand this comment. Alarm states are defined also in
> > this document, and Alarms as defined here are also different than 'a discrete
> > notification reporting about a condition state change'. So, what does this
> > comment really try to say?
> >
> > Nits/editorial comments:
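
The cleared/closed distinction proposed in the thread can be illustrated with a minimal Python sketch. All names here (`Alarm`, `is_cleared`, `is_closed`, `clear_by_system`, `close_by_operator`, `needs_attention`) are hypothetical and chosen for illustration; they are not taken from the draft's YANG module. The sketch only assumes what the suggested terminology states: clearance is set by the system, closing is set by an operator, and the two are independent.

```python
from dataclasses import dataclass


@dataclass
class Alarm:
    """Hypothetical alarm record; field names are illustrative only."""
    resource: str
    alarm_type: str
    is_cleared: bool = False  # managed by the system/instrumentation only
    is_closed: bool = False   # managed by an operator only

    def clear_by_system(self) -> None:
        # The system considers the undesired state gone,
        # e.g. a linkUp observed after a linkDown.
        self.is_cleared = True

    def close_by_operator(self) -> None:
        # An operator flags that the alarm needs no further attention.
        # This is allowed whether or not the alarm is cleared.
        self.is_closed = True


def needs_attention(alarm: Alarm) -> bool:
    # Per the suggested definition, a closed alarm does not need attention,
    # regardless of its cleared state.
    return not alarm.is_closed


link_down = Alarm(resource="eth0", alarm_type="linkDown")
link_down.close_by_operator()   # closed while the condition still persists
link_down.clear_by_system()     # later cleared by the instrumentation
```

Keeping the two flags separate means an alarm can be closed but not cleared (the operator decided to ignore it) or cleared but not closed (the condition is gone but nobody has acknowledged it yet).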