Re: [lmap] draft-ietf-lmap-information-model-05: Controller timeout

"Carey, Timothy (Timothy)" <timothy.carey@alcatel-lucent.com> Mon, 06 July 2015 12:29 UTC

From: "Carey, Timothy (Timothy)" <timothy.carey@alcatel-lucent.com>
To: Juergen Schoenwaelder <j.schoenwaelder@jacobs-university.de>, "Weil, Jason" <jason.weil@twcable.com>
Date: Mon, 06 Jul 2015 12:29:28 +0000
Archived-At: <http://mailarchive.ietf.org/arch/msg/lmap/sHnI959V-RHS0hTft-YDSkVzF_w>
Cc: "lmap@ietf.org" <lmap@ietf.org>

Juergen,

Yes there are 2 issues from my notes:
1) Need to describe if schedules, tasks or scheduled actions are to be disabled and how they are re-enabled.

2) Determine how to define the operational status of schedules, tasks, scheduled actions and MAs


So inline <TAC>

-----Original Message-----
From: Juergen Schoenwaelder [mailto:j.schoenwaelder@jacobs-university.de] 
Sent: Thursday, July 02, 2015 4:40 PM
To: Weil, Jason
Cc: Carey, Timothy (Timothy); lmap@ietf.org
Subject: Re: [lmap] draft-ietf-lmap-information-model-05: Controller timeout

Hi,

it seems we should separate two aspects here:

1) What exactly is the reaction to connectivity loss?

   The document currently says this:

   ma-controller-timeout:    A timer is started after each successful
                             contact with a controller.  When the timer
                             reaches the controller-timeout (measured in
                             seconds), all schedules will be disabled,
                             i.e., no new actions will be executed (and
                             hence no new tasks started).  The disabled
                             schedules will be reenabled automatically
                             once contact with a controller has been
                             established successfully again.  Note that
                             this will not affect the execution of
                             actions that are essential to establish
                             contact with the controller or that perform
                             critical housekeeping functions.

<TAC> OK, so it will be Schedules that are disabled and, by association, any actions associated with the schedule? We should state that. How does an MA know which tasks are housekeeping tasks, and which housekeeping tasks are critical - is that an option passed by the controller for the task? We should state how critical housekeeping tasks are identified.
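
For illustration only, the timeout behaviour quoted above can be sketched as follows. This is a minimal sketch, not text from the draft: the class names, the `disabled_by_timeout` marker and in particular the `critical` flag are assumptions made here, since the draft does not yet say how critical housekeeping schedules are identified.

```python
import time
from dataclasses import dataclass, field


@dataclass
class Schedule:
    name: str
    # Hypothetical flag: the draft does not define how "critical" is marked.
    critical: bool = False
    # Set only by the controller timeout, so it can be cleared automatically.
    disabled_by_timeout: bool = False


@dataclass
class MeasurementAgent:
    controller_timeout: float  # seconds, as in ma-controller-timeout
    schedules: list = field(default_factory=list)
    last_contact: float = field(default_factory=time.monotonic)

    def on_controller_contact(self, now=None):
        """Restart the timer and re-enable schedules the timeout disabled."""
        self.last_contact = time.monotonic() if now is None else now
        for schedule in self.schedules:
            schedule.disabled_by_timeout = False

    def check_timeout(self, now=None):
        """Disable all non-critical schedules once the timeout is reached."""
        now = time.monotonic() if now is None else now
        if now - self.last_contact >= self.controller_timeout:
            for schedule in self.schedules:
                if not schedule.critical:
                    schedule.disabled_by_timeout = True
```

Keeping a separate marker for "disabled by timeout" is one possible way to make sure that renewed contact re-enables only what the timeout disabled; whether the model should expose such a reason is exactly the open question here.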

   An alternative that has been mentioned was to trigger suppression
   by a timeout. This gives more fine grained control but then there
   is only a single suppression object (which may be sufficient - or
   not).

   What we have now simply says that after reaching the timeout, the
   MA stops all measurements and this is really a safety mechanism to
   prevent disconnected probes from measuring forever - consider the case
   of a company operating a controller for thousands of devices going
   out of business - you want the homeless MAs to stop at some point
   in time. For such a safety mechanism, I personally think what we
   have is sufficient.

<TAC> I am certainly fine with disabling the schedules - we just need clarity in the text and need to resolve the identification of critical housekeeping tasks. See my comment above.

2) Status reporting: Right now we have status information for tasks
   but none for schedules or actions. In fact, if multiple actions
   refer to the same task and one action provides options that make
   the task fail, you will be able to see the last failure but not how
   this links back to actions and schedules.

   I think this needs to be improved. As suggested, we should report
   the operational status of the schedules and I think we should
   report the status of actions instead of the status of configured
   tasks so that all information needed is available to understand
   what exactly fails.

<TAC> Yes. In the past, an operational status could depend on the operational status of a "higher-order" entity (i.e., be autonomously out of service). That is what I think the status of the action should be: linked to the schedule.
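
As a rough sketch of that dependency, the status of an action could be derived from its parent schedule and task. The status values below are the ones proposed earlier in this thread (quoted further down), not values from the draft; the function name and the precedence of schedule over task are assumptions made only for this example.

```python
from enum import Enum


class ScheduleStatus(Enum):
    ENABLED = "enabled"
    DISABLED_SUPPRESSED = "disabled-suppressed"
    DISABLED_OTHER = "disabled-other"


class TaskStatus(Enum):
    ENABLED = "enabled"
    DISABLED_SUPPRESSED = "disabled-suppressed"
    DISABLED_MA = "disabled-ma"
    DISABLED_OTHER = "disabled-other"


class ActionStatus(Enum):
    ENABLED = "enabled"
    DISABLED_SCHEDULE = "disabled-schedule"
    DISABLED_TASK = "disabled-task"
    DISABLED_OTHER = "disabled-other"


def action_status(schedule, task):
    """Derive an action's status from its parent schedule and task."""
    # Assumption: a disabled schedule takes precedence over a disabled task.
    if schedule is not ScheduleStatus.ENABLED:
        return ActionStatus.DISABLED_SCHEDULE
    if task is not TaskStatus.ENABLED:
        return ActionStatus.DISABLED_TASK
    return ActionStatus.ENABLED
```

With a derivation like this, a management system reading the action status could tell whether to look at the schedule or at the task for the underlying cause.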

Bottom line: I think the handling of the controller timeout is good enough but the status reporting should be improved.

/js

On Wed, May 13, 2015 at 05:19:45PM -0400, Weil, Jason wrote:
> Thanks for the reminder Tim. The one-way arrow in the framework doc 
> confirms it. OK so the Logging objects cover any Failure cases such as 
> an MA re-booting or loss of connection to the controller and status of 
> scheduled tasks. So what is being disabled is the Instruction set 
> communicated by the Controller or some subset of it.
> 
> The MA will report its status through logging attributes when it 
> re-establishes communication with the controller.  But yes I do agree 
> that the MA seems to need attributes to communicate at least enough 
> information to the controller about its status in order for the 
> controller to know if it needs to send a new or updated Instruction set.
> 
> Jason
> 
> On 5/13/15, 4:21 PM, "Carey, Timothy (Timothy)"
> <timothy.carey@alcatel-lucent.com> wrote:
> 
> >Jason,
> >
> >Suppression is an administrative action taken through an instruction 
> >by the controller. Loss of connectivity (disable schedule or task) is 
> >an autonomous operational action taken by the MA. The problem is 
> >reporting and recovery - How does a management system know, in order 
> >to troubleshoot or remediate the event, if the schedule or task 
> >(along with the resulting scheduled actions) have been disabled due 
> >to an administrative action like suppression or via an autonomous 
> >action like loss of connectivity?
> >
> >If we would choose to treat the result of the loss of connectivity as 
> >an autonomous suppression event - I am fine with that but we need the 
> >MA to be able to report that the task, schedule and resulting 
> >scheduled actions are autonomously suppressed (what I called 
> >disabled). Likewise we should report that the MA is degraded because 
> >it currently is being autonomously or administratively suppressed. 
> >The MA will at least need to reevaluate its conditions once one clears to figure out its new status.
> >
> >Come to think of it - can a system other than the controller have 
> >access to the LMAP status information?
> 
> Won’t the only source of the logging information be the last 
> controller configured on the MA?
> 
> >I guess this is all moot if the only system allowed access to the status 
> >information is the controller that the MA can't access with Loss of 
> >connectivity. In TR-069 this is possible since the controller can be 
> >different from the ACS and we can have other systems that can access 
> >the data. I'm guessing NETCONF/RESTCONF allows the same.
> >
> >BR,
> >Tim
> >-----Original Message-----
> >From: Weil, Jason [mailto:jason.weil@twcable.com]
> >Sent: Wednesday, May 13, 2015 2:54 PM
> >To: Carey, Timothy (Timothy); Juergen Schoenwaelder
> >Cc: lmap@ietf.org
> >Subject: Re: [lmap] draft-ietf-lmap-information-model-05: Controller 
> >timeout
> >
> >I am not exactly clear what the difference is between disabled and 
> >suppressed tasks or schedules. It would seem that suppression is a 
> >good fit for the case of loss of connectivity from the way it is 
> >described in the information model when the 
> >ma-controller-lost-timeout counter expires. Given the following note 
> >in 3.3 ("Note that Suppression has no effect on either Controller 
> >Tasks or Controller Schedules."), it would clearly not impact future 
> >attempts to re-establish connectivity to the Controller and start 
> >testing where it left off. This may also benefit an operator by 
> >offering finer control over which tests should be suppressed and 
> >which shouldn't in the event of loss of connectivity to the 
> >controller. For example, it may be of benefit for troubleshooting an 
> >issue to have a certain subset of tests continue running and 
> >collecting data even when connectivity to the controller is not 
> >available.
> >
> >Jason
> >
> >On 5/13/15, 8:37 AM, "Carey, Timothy (Timothy)"
> ><timothy.carey@alcatel-lucent.com> wrote:
> >
> >>Juergen,
> >>
> >>If you do not want to disable the scheduled tasks - I would actually 
> >>say the Parent tasks not the schedules are disabled. The reason is 
> >>because we are talking about wanting to disable Tasks that are not 
> >>related to the communication between the MA and Controller or proper 
> >>functioning of the MA.  The MA knows this as part of its 
> >>TaskCapability reporting (which somehow got deleted in draft 05).
> >>
> >>As to the reporting of the status I would do the following:
> >>Add an OperationalStatus attribute with the following values to the 
> >>action object (which was scheduled task): enabled, 
> >>disabled-schedule, disabled-task, disabled-other.
> >>
> >>
> >>
> >>I would also add an OperationalStatus to:
> >>1) Schedule object with the following values: enabled, 
> >>disabled-suppressed, disabled-other.
> >>2) Task object (I guess it could go in Task Status) with the following 
> >>values: enabled, disabled-suppressed, disabled-ma, disabled-other.
> >>3) Measurement Agent object with the following values: enabled, 
> >>degraded-controller-communication, degraded-other, disabled.
> >>
> >>We would always include a disabled-other to allow other conditions. 
> >>For example in TR-069 we have the capability to administratively 
> >>enable/disable the multi-instance objects by conventions - so if for 
> >>some reason these were administratively disabled - we would set the 
> >>OperationalStatus to disabled-other or just disabled for the MA.
> >>
> >>These statuses would provide a clear operational status of the 
> >>elements that can be disabled or degraded based on what is currently 
> >>documented in the framework and information model. I would suspect 
> >>as we implement protocols there might be other values we want to 
> >>support - e.g., problems with the collector-MA communication...
> >>
> >>BR,
> >>Tim
> >>-----Original Message-----
> >>From: Juergen Schoenwaelder
> >>[mailto:j.schoenwaelder@jacobs-university.de]
> >>Sent: Tuesday, May 12, 2015 9:20 AM
> >>To: Carey, Timothy (Timothy)
> >>Cc: lmap@ietf.org
> >>Subject: Re: [lmap] draft-ietf-lmap-information-model-05: Controller 
> >>timeout
> >>
> >>On Tue, May 12, 2015 at 01:04:49PM +0000, Carey, Timothy (Timothy) wrote:
> >>
> >>> So does that mean we now need to have an operational status for 
> >>> a scheduled task? I can see it now as attribute with values like:
> >>> enabled, disabled, suppressed and errored or possibly a conditions 
> >>> object?
> >>
> >>I think there are two questions here:
> >>
> >>a) If I lose connectivity, do I disable all schedules (and thus
> >>   implicitly all actions) or do I disable scheduled tasks? My idea
> >>   was to simply disable all schedules (and as such the schedules have
> >>   their state changed to disabled).
> >>
> >>   The suppression mechanism can suppress both a list of tasks and a
> >>   list of schedules and it is configurable what is being suppressed.
> >>
> >>   One option is that loss of connectivity simply causes suppression
> >>   to be activated. Another option is that loss of connectivity
> >>   suppresses all tasks that are not essential for communication with a
> >>   controller and housekeeping. Another option is that loss of
> >>   connectivity suppresses all schedules. We need to select one.
> >>
> >>b) Do we expose the internal state that is maintained? Right now, we
> >>   do not expose any operational state for schedules. But we do expose
> >>   operational state for tasks. If loss of connectivity means we
> >>   disable all tasks that are not essential for communication with a
> >>   controller and housekeeping, then exposing that information would
> >>   be a rather minor change to the information and data model.
> >>
> >>   A task state would then be enabled (default), suppressed or
> >>   disabled. Error information is already covered in the last-failed
> >>   attributes of a task. (And this is useful in situations where tasks
> >>   fail occasionally so you can get information about the last failure
> >>   even though the task is working fine right now).
> >>
> >>/js
> >>
> >>--
> >>Juergen Schoenwaelder           Jacobs University Bremen gGmbH
> >>Phone: +49 421 200 3587         Campus Ring 1 | 28759 Bremen | Germany
> >>Fax:   +49 421 200 3103         <http://www.jacobs-university.de/>
> >>
> >
> >
> >This E-mail and any of its attachments may contain Time Warner Cable 
> >proprietary information, which is privileged, confidential, or 
> >subject to copyright belonging to Time Warner Cable. This E-mail is 
> >intended solely for the use of the individual or entity to which it 
> >is addressed. If you are not the intended recipient of this E-mail, 
> >you are hereby notified that any dissemination, distribution, 
> >copying, or action taken in relation to the contents of and 
> >attachments to this E-mail is strictly prohibited and may be 
> >unlawful. If you have received this E-mail in error, please notify 
> >the sender immediately and permanently delete the original and any copy of this E-mail and any printout.
> 

-- 
Juergen Schoenwaelder           Jacobs University Bremen gGmbH
Phone: +49 421 200 3587         Campus Ring 1 | 28759 Bremen | Germany
Fax:   +49 421 200 3103         <http://www.jacobs-university.de/>