Re: [lmap] draft-ietf-lmap-information-model-05: Controller timeout

Juergen Schoenwaelder <j.schoenwaelder@jacobs-university.de> Thu, 02 July 2015 21:40 UTC

Date: Thu, 02 Jul 2015 23:40:19 +0200
From: Juergen Schoenwaelder <j.schoenwaelder@jacobs-university.de>
To: "Weil, Jason" <jason.weil@twcable.com>
Message-ID: <20150702214018.GA5740@elstar.local>
Mail-Followup-To: "Weil, Jason" <jason.weil@twcable.com>, "Carey, Timothy (Timothy)" <timothy.carey@alcatel-lucent.com>, "lmap@ietf.org" <lmap@ietf.org>
References: <9966516C6EB5FC4381E05BF80AA55F77BA2612C2@US70UWXCHMBA05.zam.alcatel-lucent.com> <20150512100724.GC26662@elstar.local> <9966516C6EB5FC4381E05BF80AA55F77BA2616F5@US70UWXCHMBA05.zam.alcatel-lucent.com> <20150512124209.GC41964@elstar.local> <9966516C6EB5FC4381E05BF80AA55F77BA261858@US70UWXCHMBA05.zam.alcatel-lucent.com> <20150512142006.GB57299@elstar.local> <9966516C6EB5FC4381E05BF80AA55F77BA2626E6@US70UWXCHMBA05.zam.alcatel-lucent.com> <D1792111.419B6%jason.weil@twcable.com> <9966516C6EB5FC4381E05BF80AA55F77BA262F9B@US70UWXCHMBA05.zam.alcatel-lucent.com> <D1792F54.41AD0%jason.weil@twcable.com>
In-Reply-To: <D1792F54.41AD0%jason.weil@twcable.com>
User-Agent: Mutt/1.4.2.3i
Archived-At: <http://mailarchive.ietf.org/arch/msg/lmap/CbB5FHgZWmnemxpr7z03kLw0U8Q>
Cc: "Carey, Timothy (Timothy)" <timothy.carey@alcatel-lucent.com>, "lmap@ietf.org" <lmap@ietf.org>
Subject: Re: [lmap] draft-ietf-lmap-information-model-05: Controller timeout

Hi,

it seems we should separate two aspects here:

1) What exactly is the reaction to connectivity loss?

   The document currently says this:

   ma-controller-timeout:    A timer is started after each successful
                             contact with a controller.  When the timer
                             reaches the controller-timeout (measured in
                             seconds), all schedules will be disabled,
                             i.e., no new actions will be executed (and
                             hence no new tasks started).  The disabled
                             schedules will be reenabled automatically
                             once contact with a controller has been
                             established successfully again.  Note that
                             this will not affect the execution of
                             actions that are essential to establish
                             contact with the controller or that perform
                             critical housekeeping functions.

   An alternative that has been mentioned is to trigger suppression
   by a timeout. This gives more fine-grained control, but then there
   is only a single suppression object (which may be sufficient - or
   not).

   What we have now simply says that after reaching the timeout, the
   MA stops all measurements. This is really a safety mechanism to
   prevent disconnected probes from measuring forever - consider the
   case of a company operating a controller for thousands of devices
   going out of business - you want the homeless MAs to stop at some
   point in time. For such a safety mechanism, I personally think
   what we have is sufficient.

2) Status reporting: Right now we have status information for tasks
   but none for schedules or actions. In fact, if multiple actions
   refer to the same task and one action provides options that make
   the task fail, you will be able to see the last failure but not
   how it links back to actions and schedules.

   I think this needs to be improved. As suggested, we should report
   the operational status of the schedules and I think we should
   report the status of actions instead of the status of configured
   tasks so that all information needed is available to understand
   what exactly fails.
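
To make 2) more concrete, here is a rough Python sketch of what
per-action status reporting could look like. It is only an illustration
under assumptions: the class names, the field names, and the state
strings are hypothetical and do not come from the information model
draft.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class ActionStatus:
    """Hypothetical status record kept per action, not per configured task."""
    schedule: str
    action: str
    task: str
    state: str = "enabled"               # assumed values: enabled, suppressed, disabled
    last_failure: Optional[str] = None   # message of the most recent failure, if any


class StatusReport:
    """Collects status per (schedule, action) so failures can be traced back."""

    def __init__(self) -> None:
        self._actions: Dict[Tuple[str, str], ActionStatus] = {}

    def register(self, schedule: str, action: str, task: str) -> None:
        self._actions[(schedule, action)] = ActionStatus(schedule, action, task)

    def record_failure(self, schedule: str, action: str, message: str) -> None:
        self._actions[(schedule, action)].last_failure = message

    def failures_for_task(self, task: str) -> List[ActionStatus]:
        # All actions referring to the task that have recorded a failure.
        return [s for s in self._actions.values()
                if s.task == task and s.last_failure is not None]


# Two actions in the same schedule refer to the same task; only one fails.
report = StatusReport()
report.register("hourly", "ping-default", "ping")
report.register("hourly", "ping-bad-options", "ping")
report.record_failure("hourly", "ping-bad-options", "invalid option")
for status in report.failures_for_task("ping"):
    print(status.schedule, status.action, status.last_failure)
```

With status keyed by schedule and action, a failure caused by one
action's options can be traced back to that action even when several
actions share the same task.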

Bottom line: I think the handling of the controller timeout is good
enough but the status reporting should be improved.
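
For reference, the ma-controller-timeout behaviour quoted in 1) can be
sketched roughly as follows. This is a minimal Python sketch under
assumptions, not text from the draft: the MeasurementAgent class, the
essential flag, and the injectable clock are hypothetical names
introduced only for illustration.

```python
import time
from typing import Callable


class MeasurementAgent:
    """Toy model of the ma-controller-timeout safety mechanism."""

    def __init__(self, controller_timeout: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.controller_timeout = controller_timeout  # seconds
        self.clock = clock
        self.last_contact = clock()  # timer starts at the last successful contact

    def contact_succeeded(self) -> None:
        # Each successful contact with a controller restarts the timer.
        self.last_contact = self.clock()

    def timed_out(self) -> bool:
        return self.clock() - self.last_contact >= self.controller_timeout

    def may_start_actions(self, essential: bool) -> bool:
        # Essential schedules (controller contact, housekeeping) are never
        # affected; all others are disabled while the timeout is in effect.
        return essential or not self.timed_out()


now = [0.0]                                   # fake clock for the example
ma = MeasurementAgent(controller_timeout=3600, clock=lambda: now[0])
now[0] = 3600.0                               # timer reaches the timeout
print(ma.may_start_actions(essential=False))  # False: schedule is disabled
print(ma.may_start_actions(essential=True))   # True: controller contact still runs
ma.contact_succeeded()                        # contact re-established
print(ma.may_start_actions(essential=False))  # True: schedule re-enabled
```

Because the disabled state is derived from the timer rather than stored
per schedule, re-enabling after a successful contact needs no extra
bookkeeping in this sketch.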

/js

On Wed, May 13, 2015 at 05:19:45PM -0400, Weil, Jason wrote:
> Thanks for the reminder Tim. The one-way arrow in the framework doc
> confirms it. OK so the Logging objects cover any Failure cases such as an
> MA re-booting or loss of connection to the controller and status of
> scheduled tasks. So what is being disabled is the Instruction set
> communicated by the Controller or some subset of it.
> 
> The MA will report its status through logging attributes when it
> re-establishes communication with the controller.  But yes I do agree that
> the MA seems to need attributes to communicate at least enough information
> to the controller about its status in order for the controller to know if
> it needs to send a new or updated Instruction set.
> 
> Jason
> 
> On 5/13/15, 4:21 PM, "Carey, Timothy (Timothy)"
> <timothy.carey@alcatel-lucent.com> wrote:
> 
> >Jason,
> >
> >Suppression is an administrative action taken through an instruction by
> >the controller. Loss of connectivity (disable schedule or task) is an
> >autonomous operational action taken by the MA. The problem is reporting
> >and recovery - How does a management system know, in order to
> >troubleshoot or remediate the event, if the schedule or task (along with
> >the resulting scheduled actions) has been disabled due to an
> >administrative action like suppression or via an autonomous action like
> >loss of connectivity?
> >
> >If we would choose to treat the result of the loss of connectivity as an
> >autonomous suppression event - I am fine with that but we need the MA to
> >be able to report that the task, schedule and resulting scheduled actions
> >are autonomously suppressed (what I called disabled). Likewise we should
> >report that the MA is degraded because it currently is being autonomously
> >or administratively suppressed. The MA will at least need to reevaluate
> >its conditions once one clears to figure out its new status.
> >
> >Come to think of it - can a system other than the controller have access
> >to the LMAP status information?
> 
> Won’t the only source of the logging information be the last controller
> configured on the MA?
> 
> >I guess this is all moot if the only system allowed access to the status
> >information is the controller that the MA can't access with Loss of
> >connectivity. In TR-069 this is possible since the controller can be
> >different from the ACS and we can have other systems that can access the
> >data. I'm guessing NETCONF/RESTCONF allows the same.
> >
> >BR,
> >Tim
> >-----Original Message-----
> >From: Weil, Jason [mailto:jason.weil@twcable.com]
> >Sent: Wednesday, May 13, 2015 2:54 PM
> >To: Carey, Timothy (Timothy); Juergen Schoenwaelder
> >Cc: lmap@ietf.org
> >Subject: Re: [lmap] draft-ietf-lmap-information-model-05: Controller
> >timeout
> >
> >I am not exactly clear what the difference is between disabled and
> >suppressed tasks or schedules. It would seem that suppression is a good
> >fit for the case of loss of connectivity from the way it is described in
> >the information model when the ma-controller-lost-timeout counter
> >expires. Given the following note in 3.3 ("Note that Suppression has no
> >effect on either Controller Tasks or Controller Schedules.") it would
> >clearly not impact future attempts to re-establish connectivity to the
> >Controller and start testing where it left off. This may also benefit an
> >operator by offering finer control over which tests should be suppressed
> >and which shouldn't in the event of loss of connectivity to the
> >controller. For example it may be of benefit for troubleshooting an issue
> >to have a certain subset of tests continue running and collecting data
> >even when connectivity to the controller is not available.
> >
> >Jason
> >
> >On 5/13/15, 8:37 AM, "Carey, Timothy (Timothy)"
> ><timothy.carey@alcatel-lucent.com> wrote:
> >
> >>Juergen,
> >>
> >>If you do not want to disable the scheduled tasks - I would actually
> >>say the Parent tasks, not the schedules, are disabled. The reason is
> >>that we are talking about wanting to disable Tasks that are not
> >>related to the communication between the MA and Controller or proper
> >>functioning of the MA.  The MA knows this as part of its
> >>TaskCapability reporting (which somehow got deleted in draft 05).
> >>
> >>As to the reporting of the status I would do the following:
> >>Add an OperationalStatus attribute with the following values to the
> >>action object (which was scheduled task): enabled, disabled-schedule,
> >>disabled-task, disabled-other.
> >>
> >>
> >>
> >>I would also add an OperationalStatus to:
> >>1) Schedule object with the following values: enabled,
> >>disabled-suppressed, disabled-other.
> >>2) Task object (I guess it could go in Task Status) with the following
> >>values: enabled, disabled-suppressed, disabled-ma, disabled-other.
> >>3) Measurement Agent object with the following values: enabled,
> >>degraded-controller-communication, degraded-other, disabled.
> >>
> >>We would always include a disabled-other to allow other conditions. For
> >>example in TR-069 we have the capability to administratively
> >>enable/disable the multi-instance objects by conventions - so if for
> >>some reason these were administratively disabled - we would set the
> >>OperationalStatus to disabled-other or just disabled for the MA.
> >>
> >>These statuses would provide a clear operational status of the elements
> >>that can be disabled or degraded based on what is currently documented
> >>in the framework and information model. I would suspect as we implement
> >>protocols there might be other values we want to support - e.g.,
> >>problems with the collector-MA communication...
> >>
> >>BR,
> >>Tim
> >>-----Original Message-----
> >>From: Juergen Schoenwaelder
> >>[mailto:j.schoenwaelder@jacobs-university.de]
> >>Sent: Tuesday, May 12, 2015 9:20 AM
> >>To: Carey, Timothy (Timothy)
> >>Cc: lmap@ietf.org
> >>Subject: Re: [lmap] draft-ietf-lmap-information-model-05: Controller
> >>timeout
> >>
> >>On Tue, May 12, 2015 at 01:04:49PM +0000, Carey, Timothy (Timothy) wrote:
> >>
> >>> So does that now mean we need to have an operational status for a
> >>> scheduled task? I can see it now as attribute with values like:
> >>> enabled, disabled, suppressed and errored or possibly a conditions
> >>> object?
> >>
> >>I think there are two questions here:
> >>
> >>a) If I lose connectivity, do I disable all schedules (and thus
> >>   implicitly all actions) or do I disable scheduled tasks? My idea
> >>   was to simply disable all schedules (and as such the schedules have
> >>   their state changed to disabled).
> >>
> >>   The suppression mechanism can suppress both a list of tasks and a
> >>   list of schedules and it is configurable what is being suppressed.
> >>
> >>   One option is that loss of connectivity simply causes suppression
> >>   to be activated. Another option is that loss of connectivity
> >>   suppresses all tasks that are not essential for communication with a
> >>   controller and housekeeping. Another option is that loss of
> >>   connectivity suppresses all schedules. We need to select one.
> >>
> >>b) Do we expose the internal state that is maintained? Right now, we
> >>   do not expose any operational state for schedules. But we do expose
> >>   operational state for tasks. If loss of connectivity means we
> >>   disable all tasks that are not essential for communication with a
> >>   controller and housekeeping, then exposing that information would
> >>   be a rather minor change to the information and data model.
> >>
> >>   A task state would then be enabled (default), suppressed or
> >>   disabled. Error information is already covered in the last-failed
> >>   attributes of a task. (And this is useful in situations where tasks
> >>   fail occasionally so you can get information about the last failure
> >>   even though the task is working fine right now).
> >>
> >>/js
> >>
> >>--
> >>Juergen Schoenwaelder           Jacobs University Bremen gGmbH
> >>Phone: +49 421 200 3587         Campus Ring 1 | 28759 Bremen | Germany
> >>Fax:   +49 421 200 3103         <http://www.jacobs-university.de/>
> >>
> >
> >
> >This E-mail and any of its attachments may contain Time Warner Cable
> >proprietary information, which is privileged, confidential, or subject to
> >copyright belonging to Time Warner Cable. This E-mail is intended solely
> >for the use of the individual or entity to which it is addressed. If you
> >are not the intended recipient of this E-mail, you are hereby notified
> >that any dissemination, distribution, copying, or action taken in
> >relation to the contents of and attachments to this E-mail is strictly
> >prohibited and may be unlawful. If you have received this E-mail in
> >error, please notify the sender immediately and permanently delete the
> >original and any copy of this E-mail and any printout.
> 
> 

-- 
Juergen Schoenwaelder           Jacobs University Bremen gGmbH
Phone: +49 421 200 3587         Campus Ring 1 | 28759 Bremen | Germany
Fax:   +49 421 200 3103         <http://www.jacobs-university.de/>