[OPSAWG] Service Assurance for Intent-based Networking Architecture

stefan vallin <stefan@wallan.se> Wed, 20 November 2019 10:38 UTC

From: stefan vallin <stefan@wallan.se>
Date: Wed, 20 Nov 2019 11:37:17 +0100
References: <5D45EBA8-4DD3-420C-A91A-AA804D6216B4@charter.com> <BYAPR11MB258458B4AE8882195FE1845BDA4F0@BYAPR11MB2584.namprd11.prod.outlook.com>
To: "opsawg@ietf.org" <opsawg@ietf.org>
In-Reply-To: <BYAPR11MB258458B4AE8882195FE1845BDA4F0@BYAPR11MB2584.namprd11.prod.outlook.com>
Message-Id: <034B974F-C977-46F6-BFC8-F2EC2B369CAB@wallan.se>
Archived-At: <https://mailarchive.ietf.org/arch/msg/opsawg/XM1LJ1R3uTtCg-HKq2RvsQnv1Ew>
Subject: [OPSAWG] Service Assurance for Intent-based Networking Architecture
Precedence: list

Hi Benoit!
Thanks for bringing the issue of service assurance to the table. More work is needed on this topic.

Some high-level comments on the drafts:

The drafts present a YANG module for performing reasoning across a “generic” service tree.
This has existed in assurance systems for a long time: inventory-based systems as well as fault managers had modules for this, like Micromuse Impact, OpenView Service Navigator etc.
The overall idea is to feed KPIs, events, alarms etc. into the tree, and then reason upwards to do service impact analysis and downwards to do root-cause analysis.
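
To make the upward/downward reasoning concrete, here is a minimal Python sketch. It is only an illustration: the Node class, the function names and the example service tree are hypothetical and not taken from the drafts or from any of the tools mentioned. Each node carries a local health flag (as if fed from KPIs or alarms); impact is propagated upwards, and root causes are found by walking downwards to the deepest unhealthy nodes.

```python
from __future__ import annotations
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical service-tree node; 'healthy' is the locally observed state."""
    name: str
    healthy: bool = True
    children: list[Node] = field(default_factory=list)


def impacted(node: Node) -> bool:
    """Upward reasoning: a node is impacted if it or any dependency is unhealthy."""
    return (not node.healthy) or any(impacted(c) for c in node.children)


def root_causes(node: Node) -> list[str]:
    """Downward reasoning: names of the deepest unhealthy nodes below 'node'."""
    causes = [rc for c in node.children for rc in root_causes(c)]
    if causes:
        return causes
    return [node.name] if not node.healthy else []


# Example tree (all names are made up): a VPN service depending on two PEs,
# one of which depends on a failed link.
link = Node("link-ge-0/0/1", healthy=False)
pe1 = Node("pe1", children=[link])
pe2 = Node("pe2")
vpn = Node("l3vpn-acme", children=[pe1, pe2])

print(impacted(vpn))      # True
print(root_causes(vpn))   # ['link-ge-0/0/1']
```

Even this toy version shows where the difficulty lies: the code is trivial, but the tree itself (who depends on what) has to come from somewhere.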

I have some concerns based on the above:
1) The draft should be renamed. Claiming that a service tree is *the* architecture for intent-based network assurance is perhaps too ambitious. There are so many other things needed for service assurance in intent-based networking:
- how to represent service tests and service SLA monitoring as part of the intent?
- how to monitor the data-plane as such?
- how to represent closed-loop policies?
- and more
So I think the draft should be renamed to reflect the relevant, more limited scope: "A YANG data-model for representing generic service trees", or something similar.

2) Looking back at practical experience with the above-mentioned tools, it was very hard to get the approach to work (or is someone else out there smarter?).
The dependencies between services/subservices are subtle. Either the dependencies get too coarse-grained, so that everything turns red, or too fine-grained, so that there are too few hits. Who is to express this knowledge in a large multi-vendor network? The classical service impact tools had fairly advanced algorithms attached to the graph to try to capture this, not just a dependency link. It is more or less as complex as the knowledge acquisition problem for classical rule-based AI: how many dependencies do we need to express before we can do a good job in the service tree?
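
A small Python sketch of the coarse-grained problem, under stated assumptions: the two rules below are hypothetical illustrations and not the algorithms of any of the classical tools. With plain dependency links, a single failed redundant uplink turns the service red; a rule that knows how many dependencies are actually required gives a more useful answer, but someone has to supply that extra knowledge for every node.

```python
def naive_status(child_ok: list[bool]) -> str:
    """Plain dependency links: any failed dependency marks the service red."""
    return "green" if all(child_ok) else "red"


def threshold_status(child_ok: list[bool], min_required: int) -> str:
    """Redundancy-aware rule: red only if fewer than min_required dependencies work."""
    working = sum(child_ok)
    if working < min_required:
        return "red"
    return "green" if working == len(child_ok) else "degraded"


# Two redundant uplinks, one of them down; the service needs only one.
uplinks = [True, False]
print(naive_status(uplinks))          # red
print(threshold_status(uplinks, 1))   # degraded
```

The min_required parameter is exactly the kind of per-dependency knowledge that is hard to acquire and maintain at scale.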

Since it is configuration data in the model, I guess it is assumed that the orchestrator should set up all of this. But in many cases this knowledge is "hidden" in other domains, like networking protocols and vendor-specific details. Also, with networks becoming more dynamic and virtual, it is hard to see how the service structures can be statically maintained as *configuration*.

Another underlying challenge is that most network problems are configuration-related and are not detected by device instrumentation.
Non-optimal QoS config, firewall rules etc. are not detected by the firewall or router, so no alarms are raised.

I know this is hard for you to comment on; it is just a reality check.

3) Relationship to other YANG service models.
In your approach the service tree is a separate tree from the concrete service trees like the L3VPN service model.
Have you considered augmenting these concrete models with generic assurance state and dependency information instead of maintaining a separate tree? Maintaining parallel trees might result in inconsistencies in the end.

4) Relationship to the Alarm YANG module, RFC 8632
There are several opportunities to reuse definitions and concepts from RFC 8632:
- You could add alarms in your module according to the service tree, see especially Section 3.6. Root Cause, Impacted Resources
- You could use alarm-types as one kind of symptom (there are many others, like active measurements with TWAMP etc.)

Hope this helps to flesh out more details in your work.
br Stefan
