Re: [OPSAWG] Comments on Service Assurance for Intent-Based Networking Architecture (e.g. draft-claise-opsawg-service-assurance-architecture)

Benoit Claise <bclaise@cisco.com> Tue, 04 August 2020 12:39 UTC

To: Alexander Clemm <alex@futurewei.com>, "draft-claise-opsawg-service-assurance-architecture@ietf.org" <draft-claise-opsawg-service-assurance-architecture@ietf.org>
Cc: "opsawg@ietf.org" <opsawg@ietf.org>, "nmrg@irtf.org" <nmrg@irtf.org>
References: <BY5PR13MB37936AF1D0EEAA2C9C7749FEDB710@BY5PR13MB3793.namprd13.prod.outlook.com> <76ecb76b-efe8-499a-95ec-2602fa0248ef@cisco.com> <CH2PR13MB379932FF300A9903F83E8898DB4D0@CH2PR13MB3799.namprd13.prod.outlook.com>
From: Benoit Claise <bclaise@cisco.com>
Message-ID: <d5bca505-5b44-3ae2-ce3d-5371e27b6e93@cisco.com>
Date: Tue, 04 Aug 2020 14:39:00 +0200
In-Reply-To: <CH2PR13MB379932FF300A9903F83E8898DB4D0@CH2PR13MB3799.namprd13.prod.outlook.com>
Archived-At: <https://mailarchive.ietf.org/arch/msg/opsawg/_wLEwNemG5ZQDSmPbe3Yx4VqHD4>

Thanks Alex,

We'll make sure to introduce the required text in the next draft versions.

Regards, Benoit
>
> Hi Benoit,
>
> thanks for the response.  By and large we are on the same page and I 
> support this work.  As you know, I am clearly of the school that 
> believes in exception-driven management and providing actionable 
> information, not raw data.
>
> Anyway, as mentioned, there should perhaps be greater emphasis on the 
> value of maintaining a dependency graph in general, and on explaining 
> how it can complement and aid operational tasks from troubleshooting 
> to impact analysis.  It would be good to add some text on how and 
> where to instrument this effectively (not necessarily all pushed onto 
> device agents; there will also be a role for controllers etc. in 
> this).  I remain sceptical regarding the specific use case of 
> continuously maintaining a synthetically derived health score, but I 
> am looking forward to the progression of this work in further 
> iterations of the drafts.
>
> --- Alex
>
> *From:* Benoit Claise <bclaise@cisco.com>
> *Sent:* Friday, July 31, 2020 3:42 AM
> *To:* Alexander Clemm <alex@futurewei.com>; 
> draft-claise-opsawg-service-assurance-architecture@ietf.org
> *Cc:* opsawg@ietf.org; nmrg@irtf.org
> *Subject:* Re: Comments on Service Assurance for Intent-Based 
> Networking Architecture (e.g. 
> draft-claise-opsawg-service-assurance-architecture)
>
> Hi Alex,
>
> Thanks for engaging.
>
>     Hi Benoit,
>
>     I have seen your presentations on Service Assurance for
>     Intent-Based Networking Architecture and read your drafts with
>     interest (draft-claise-opsawg-service-assurance-yang-05 and
>     draft-claise-opsawg-service-assurance-architecture-03).
>     Interesting stuff on which I do have a couple of comments.
>
>     The basis for the drafts is in essence a proposal for Model-Based
>     Reasoning (MBR), in which you capture dependencies between objects
>     and make inferences by traversing the corresponding graph.  MBR
>     based on dependency graphs allows one to reason about the impact
>     and propagation of the status or health of one object on the
>     status or health of dependent objects “downstream” from it.
>     Likewise, traversing the same graph in the opposite direction
>     (from the “downstream” or dependent objects) allows one to
>     identify potential root causes for symptoms observed by those
>     objects, although this does not seem to be your main focus.
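
A minimal sketch of the two traversal directions described above. The object names and the `DEPENDS_ON` table are hypothetical and are not taken from the drafts; the sketch only assumes that dependencies can be written as "object depends on these objects".

```python
from collections import deque

# Hypothetical dependency graph: each key depends on the objects in its list.
DEPENDS_ON = {
    "service": ["tunnel"],
    "tunnel": ["if-a", "if-b"],
    "if-a": ["router-1"],
    "if-b": ["router-2"],
    "router-1": [],
    "router-2": [],
}

def reachable(graph, start):
    """Breadth-first traversal returning every node reachable from start."""
    seen, queue = set(), deque([start])
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

def invert(graph):
    """Reverse every edge so that each object points at its dependents."""
    inverted = {node: [] for node in graph}
    for node, deps in graph.items():
        for dep in deps:
            inverted.setdefault(dep, []).append(node)
    return inverted

def impacted_by(obj):
    """Impact analysis: objects whose health may be affected by obj."""
    return reachable(invert(DEPENDS_ON), obj)

def root_cause_candidates(obj):
    """Root-cause direction: objects that obj directly or indirectly depends on."""
    return reachable(DEPENDS_ON, obj)
```

In this sketch, `impacted_by("router-1")` walks towards the service, while `root_cause_candidates("tunnel")` walks towards the interfaces and routers underneath it.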
>
>     While MBR as a concept makes sense and has a long tradition in
>     network management, there are also a number of considerable issues
>     with it, and I was wondering about your perspective and mitigation
>     strategies for these.  For one, its effectiveness depends on the
>     model being “complete”.  In most cases, there are myriad
>     interdependencies which are difficult to capture comprehensively. 
>     The model is still useful for many applications as a starting
>     point, but rarely captures the full reality.  As long as users are
>     clear about that, this is not an issue.
>
> Point taken about the myriad interdependencies and graph completeness.
> As you observe, the graph is useful even if it is not complete, 
> especially when we can assure (networking) components within the 
> assurance graph.
> That way, the graph will tell us where the problem is not, which is 
> as important as telling us where the problem is or might be. This 
> obviously assumes we have complete heuristics for that component 
> assurance, which implies that the heuristics need to improve over 
> time.
>
>
>
>     However, the one thing where I have a bit of concern in your model
>     is that you use it to draw conclusions about the health of the
>     dependent objects (for example, your end-to-end service).  It
>     seems that a derived health score will be no substitute for
>     monitoring the actual health, and should not lull users into a
>     false sense of security that, as long as they monitor components
>     of a system or service, they don’t need to be concerned with
>     monitoring the system or service as a whole.  In reality I believe
>     the value (although there still is a value) is more limited than
>     that.  I believe that this should be clearly acknowledged and
>     discussed in the drafts.
>
> This is the exact reason why I wrote in the slides: "This complements 
> the end-to-end synthetic testing".
> Indeed, service assurance is usually done with end-to-end probing: 
> OWAMP/TWAMP/IP SLA with threshold-based delay, packet loss, jitter, 
> etc. When the SLA degrades, the end-to-end probing can't really tell 
> which components in the network are degrading (granted, there are 
> exceptions). The network is viewed as a black box. Combining the 
> inferred health score from the assurance graph with the end-to-end 
> probing provides the required correlation to get more of a 
> crystal-clear view of the network.
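
A rough sketch of the correlation described above, combining one end-to-end probe result with per-component inferred health scores. The function name, the 0-100 score scale, the thresholds and the result labels are all assumptions made for illustration; none of them comes from the drafts or from OWAMP/TWAMP.

```python
def localize(probe_loss_pct, loss_threshold_pct, component_health,
             health_threshold=80):
    """Combine an end-to-end probe result with inferred component health.

    component_health maps a component name to a score in 0..100
    (assumed scale; 100 means fully healthy).
    """
    sla_violated = probe_loss_pct > loss_threshold_pct
    suspects = sorted(
        name for name, score in component_health.items()
        if score < health_threshold
    )
    if sla_violated and suspects:
        # The probe sees the degradation, the graph says where to look.
        return "degraded", suspects
    if sla_violated:
        # The probe sees a problem that no assured component explains:
        # the graph or its heuristics are incomplete here.
        return "degraded-unexplained", []
    if suspects:
        # Components report trouble before the probe crosses its threshold.
        return "at-risk", suspects
    return "healthy", []
```

The "degraded-unexplained" case is where the probe alone still treats the network as a black box, which is why neither source of information replaces the other in this sketch.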
>
> Point very well taken: the "This complements the end-to-end synthetic 
> testing" concept is not mentioned in the draft. I will add it. Thanks.
>
>
>     A second set of issues concerns the intensity of maintaining the
>     graph and of continuously updating the dependencies.  In a
>     realistic system you will have many objects with even more
>     interdependencies. Maintaining derived health state can become
>     computationally very expensive, which suggests a number of
>     mitigation strategies:  for one, don’t continuously maintain this
>     but compute this only “on demand”.
>
> Yes. That's one way
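
As a sketch of the "on demand" option: the derived score is computed only when someone asks for it, and each object is visited at most once per query. The rule used here (a derived score is the minimum of an object's own score and the scores of its dependencies) is an assumption chosen for simplicity, not the aggregation defined in the drafts, and the graph is assumed to be acyclic.

```python
def health_on_demand(obj, depends_on, local_score, _cache=None):
    """Derive the health of obj lazily, only when it is queried.

    depends_on: dict mapping an object to the objects it depends on.
    local_score: dict mapping every object to its own score (0..100).
    Assumes the dependency graph has no cycles.
    """
    cache = {} if _cache is None else _cache
    if obj not in cache:
        scores = [local_score[obj]]
        for dep in depends_on.get(obj, []):
            scores.append(health_on_demand(dep, depends_on, local_score, cache))
        # Assumed rule: an object is only as healthy as its weakest dependency.
        cache[obj] = min(scores)
    return cache[obj]
```

Nothing is stored between queries, so no state has to be kept consistent as the graph changes; the cost is paid per query instead.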
>
>     Second, perhaps don’t maintain this on the server at all, at least
>     to the extent that you expect the server to be a networking
>     device.  It seems much more feasible to perform this type of
>     Model-Based Reasoning computation in an Operations Support System
>     (OSS) or application outside the network, not within the network.
>     However, it is not clear that YANG models and NETCONF/RESTCONF
>     would be applied there.  It seems to me the drafts should add
>     clarification on where those models would be expected to be
>     deployed and how they would be kept updated.  As an OSS tool, your
>     proposal makes sense, but trying to process this on networking
>     devices strikes me as very heavy, in particular given the
>     limitations as per the earlier point.  So, IMHO, you may want to
>     consider adding a corresponding section that discusses these
>     aspects in the draft, specifically the architecture draft.
>
> The architecture, with the YANG module, is actually designed to cover 
> distributed graphs.
> We can stream all metrics (whether YANG leaf, MIB variable, CLI, 
> syslog, what have you) to an OSS, sure.
> However, I believe in data aggregation, as we know that we're going 
> to quickly reach the limits of streaming capabilities.
> And I also believe in each component being responsible for its own 
> assurance, to the best of its knowledge.
> Hence the proposal to go via a SAIN agent, inside or outside a router, 
> to send the inferred health score and symptoms to the OSS.
> In the end, what do operational teams care about?
>     1. knowing that an interface, a router, or part of the network
>        works fine ... until they tell me otherwise
>     2. collecting all the metrics in a big data lake to draw the same
>        or better conclusion
> Ideally we need both, but we face two schools here. I'm more of the 
> school of providing information, as opposed to lots of data. This 
> would reduce the cost of managing networks.
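
To illustrate the "information, not data" point, here is a sketch of an agent reducing raw counter samples to a single health score plus symptoms before anything is sent to the OSS. The function name, the report layout, the error-rate limit and the scoring formula are all hypothetical; the drafts define their own YANG structure for health scores and symptoms.

```python
def summarize_interface(name, samples, error_rate_limit=0.01):
    """Reduce raw (packets, errors) samples to one small report.

    Only the returned dict would be exported; the raw samples stay local.
    """
    packets = sum(p for p, _ in samples)
    errors = sum(e for _, e in samples)
    error_rate = errors / packets if packets else 0.0

    symptoms = []
    if error_rate > error_rate_limit:
        symptoms.append(
            f"error rate {error_rate:.1%} above {error_rate_limit:.1%}"
        )

    # Assumed scoring rule: 100 with no errors, falling linearly with the rate.
    score = round(100 * (1 - error_rate))
    return {"object": name, "health_score": score, "symptoms": symptoms}
```

However many samples were collected, the OSS receives one score and a short symptom list per object, which is the aggregation argument made above.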
>
> Regards, Benoit
>
