Re: [OPSAWG] Comments on Service Assurance for Intent-Based Networking Architecture (e.g. draft-claise-opsawg-service-assurance-architecture)

Benoit Claise <bclaise@cisco.com> Tue, 04 August 2020 12:39 UTC

To: Alexander Clemm <alex@futurewei.com>, "draft-claise-opsawg-service-assurance-architecture@ietf.org" <draft-claise-opsawg-service-assurance-architecture@ietf.org>
Cc: "opsawg@ietf.org" <opsawg@ietf.org>, "nmrg@irtf.org" <nmrg@irtf.org>
References: <BY5PR13MB37936AF1D0EEAA2C9C7749FEDB710@BY5PR13MB3793.namprd13.prod.outlook.com> <76ecb76b-efe8-499a-95ec-2602fa0248ef@cisco.com> <CH2PR13MB379932FF300A9903F83E8898DB4D0@CH2PR13MB3799.namprd13.prod.outlook.com>
From: Benoit Claise <bclaise@cisco.com>
Message-ID: <d5bca505-5b44-3ae2-ce3d-5371e27b6e93@cisco.com>
Date: Tue, 04 Aug 2020 14:39:00 +0200
Archived-At: <https://mailarchive.ietf.org/arch/msg/opsawg/_wLEwNemG5ZQDSmPbe3Yx4VqHD4>

Thanks Alex,

We'll make sure to introduce the required text in the next draft versions.

Regards, Benoit
>
> Hi Benoit,
>
> thanks for the response.  By and large we are on the same page and I 
> support this work.  And as you clearly know, I am of the school that 
> believes in exception-driven management and providing actionable 
> information, not raw data.
>
> Anyway, as mentioned, there should perhaps be greater emphasis on the 
> value of maintaining a dependency graph in general, and on explaining 
> how it can complement / aid operational tasks from troubleshooting to 
> impact analysis.  It would be good to add some bits on how and where 
> to instrument this effectively (not necessarily all pushed onto device 
> agents; there will also be a role for controllers etc. in this).  I 
> remain sceptical regarding the specific use case of continuously 
> maintaining a synthetically derived health score, but am looking 
> forward to the progression of this work in further iterations of the 
> drafts.
>
> --- Alex
>
> *From:* Benoit Claise <bclaise@cisco.com>
> *Sent:* Friday, July 31, 2020 3:42 AM
> *To:* Alexander Clemm <alex@futurewei.com>; 
> draft-claise-opsawg-service-assurance-architecture@ietf.org
> *Cc:* opsawg@ietf.org; nmrg@irtf.org
> *Subject:* Re: Comments on Service Assurance for Intent-Based 
> Networking Architecture (e.g. 
> draft-claise-opsawg-service-assurance-architecture)
>
> Hi Alex,
>
> Thanks for engaging.
>
>     Hi Benoit,
>
>     I have seen your presentations on Service Assurance for
>     Intent-Based Networking Architecture and read your drafts with
>     interest (draft-claise-opsawg-service-assurance-yang-05 and
>     draft-claise-opsawg-service-assurance-architecture-03).
>     Interesting stuff on which I do have a couple of comments.
>
>     The basis for the drafts is in essence a proposal for Model-Based
>     Reasoning (MBR), in which you capture dependencies between objects
>     and make inferences by traversing the corresponding graph.  MBR
>     based on dependency graphs allows one to reason about the impact
>     and propagation of the status or health of one object on the
>     status or health of dependent objects “downstream” from it.
>     Likewise, traversing the same graph in the opposite direction
>     (from the “downstream” or dependent objects) allows one to
>     identify potential root causes for symptoms observed by those
>     objects, although this seems to be not so much your focus.
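
The two traversal directions described above can be illustrated with a minimal sketch. The object names and the edge list are hypothetical, chosen only for illustration; they do not come from the drafts, and the drafts' YANG model is not used here.

```python
from collections import defaultdict, deque

# Hypothetical dependency edges: (object, the thing it depends on).
DEPENDS_ON = [
    ("l3vpn-service", "tunnel-1"),
    ("tunnel-1", "router-A"),
    ("tunnel-1", "router-B"),
    ("router-A", "interface-A1"),
    ("router-B", "interface-B1"),
]

def build(edges):
    """Index the edges in both directions."""
    deps, dependents = defaultdict(set), defaultdict(set)
    for obj, dep in edges:
        deps[obj].add(dep)          # obj depends on dep
        dependents[dep].add(obj)    # dep is depended on by obj
    return deps, dependents

def reachable(start, neighbours):
    """Breadth-first traversal: every node reachable from start."""
    seen, queue = set(), deque([start])
    while queue:
        node = queue.popleft()
        for nxt in neighbours.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

deps, dependents = build(DEPENDS_ON)

# Impact analysis: what sits "downstream" of a failing interface?
impacted = reachable("interface-A1", dependents)

# Root-cause candidates: what does a degraded service depend on?
candidates = reachable("l3vpn-service", deps)
```

The same graph serves both questions; only the direction in which the edges are followed changes.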
>
>     While MBR as a concept makes sense and has a long tradition in
>     network management, there are also a number of considerable issues
>     with it, and I was wondering about your perspective and mitigation
>     strategies for these.  For one, its effectiveness depends on the
>     model being “complete”.  In most cases, there are myriads of
>     interdependencies which are difficult to capture comprehensively. 
>     The model is still useful for many applications as a starting
>     point, but rarely captures the full reality.  As long as users are
>     clear about that, this is not an issue.
>
> Point taken about the myriads of interdependencies and graph completeness.
> As you observe, even if the graph is not complete, it is still useful, 
> especially when we can assure (networking) components within the 
> assurance graph.
> That way, the graph will tell us where the problem is not, which is 
> as important as telling us where the problem is or might be ... 
> assuming we have complete heuristics for that component assurance, 
> obviously ... which implies that the heuristics need to improve over 
> time.
>
>
>
>     However, the one thing where I have a bit of concern in your model
>     is that you use it to draw conclusions about the health of the
>     dependent objects (for example, your end-to-end service).  It
>     seems that a derived health score will be no substitute for
>     monitoring the actual health, and should not lull users into a
>     false sense of security that, as long as they monitor components
>     of a system or service, they don’t need to be concerned with
>     monitoring the system or service as a whole.  In reality I believe
>     the value (although there still is a value) is more limited than
>     that.  I believe that this should be clearly acknowledged and
>     discussed in the drafts.
>
> This is the exact reason why I wrote in the slides: "This complements 
> the end-to-end synthetic testing".
> Indeed, service assurance is usually done with end-to-end probing: 
> OWAMP/TWAMP/IP SLA with thresholds on delay, packet loss, jitter, 
> etc. When the SLA degrades, the end-to-end probing can't really tell 
> which components in the network are degrading (granted, there are 
> exceptions). The network is viewed as a black box. Combining the 
> inferred health score from the assurance graph with the end-to-end 
> probing provides the correlation required to get more of a 
> crystal-clear view of the network.
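
As a rough illustration of that correlation, the sketch below pairs one end-to-end probe result with per-component inferred health scores. The score scale (0 to 100), the `healthy_at` cut-off, the loss threshold and the component names are all assumptions made for this example; none of them are taken from the drafts.

```python
# Hypothetical inferred health scores (0-100) per component in the assurance graph.
health = {"interface-A1": 100, "router-A": 100, "tunnel-1": 40, "router-B": 95}

def correlate(probe_loss_pct, loss_threshold_pct, health, healthy_at=90):
    """Return (sla_ok, suspects, cleared) for one end-to-end probe result.

    suspects: components whose inferred health is below healthy_at.
    cleared:  components the graph reports as healthy, i.e. where the
              problem is most likely not.
    """
    sla_ok = probe_loss_pct <= loss_threshold_pct
    suspects = sorted(n for n, s in health.items() if s < healthy_at)
    cleared = sorted(n for n, s in health.items() if s >= healthy_at)
    return sla_ok, suspects, cleared

# The probe measures 2.5% packet loss against a 1.0% threshold.
sla_ok, suspects, cleared = correlate(2.5, 1.0, health)
```

The probe alone only says that the SLA is violated; the health scores narrow down where to look and, just as usefully, where not to look.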
>
> Point very well taken: the "This complements the end-to-end synthetic 
> testing" concept is not mentioned in the draft. I will add it. Thanks.
>
>
>     A second set of issues concerns the intensity of maintaining the
>     graph and of continuously updating the dependencies.  In a
>     realistic system you will have many objects with even more
>     interdependencies. Maintaining derived health state can become
>     computationally very expensive, which suggests a number of
>     mitigation strategies:  for one, don’t continuously maintain this
>     but compute this only “on demand”.
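
A minimal sketch of the "on demand" option, assuming an acyclic dependency graph and assuming (purely for illustration) that an object's derived health is the minimum of its own health and the health of everything it depends on. The drafts may combine scores differently; the function and data names are hypothetical.

```python
def health_on_demand(node, deps, own_health, _memo=None):
    """Derive the health of `node` only when asked.

    deps:       mapping node -> iterable of nodes it depends on (acyclic).
    own_health: mapping node -> locally measured score; missing means 100.
    Nothing is kept between calls, so no state has to be maintained
    continuously; the memo only avoids recomputing shared dependencies
    within a single query.
    """
    memo = {} if _memo is None else _memo
    if node not in memo:
        score = own_health.get(node, 100)
        for dep in deps.get(node, ()):
            score = min(score, health_on_demand(dep, deps, own_health, memo))
        memo[node] = score
    return memo[node]

deps = {"service": ["tunnel"], "tunnel": ["router-A", "router-B"]}
own_health = {"router-B": 60}

service_score = health_on_demand("service", deps, own_health)
```

The trade-off is latency at query time in exchange for not paying the update cost on every metric change.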
>
> Yes. That's one way.
>
>     Second, perhaps don’t maintain this on the server at all, at least
>     to the extent that you expect the server to be a networking
>     device.  It seems much more feasible to perform these types of
>     Model-Based Reasoning computations in an Operations Support System
>     or application outside the network, not within the network.
>     However, it is not clear that YANG models and Netconf/Restconf
>     would be applied there.  It seems to me the drafts should add
>     clarification on where those models would be expected to be
>     deployed and how/who would keep them updated.  As an OSS tool, your
>     proposal makes sense, but trying to process this on networking
>     devices strikes me as very heavy, in particular given the
>     limitations as per the earlier point.   So, IMHO, you may want to
>     consider adding a corresponding section that discusses these
>     aspects in the draft, specifically the architecture draft.
>
> The architecture, with the YANG module, is actually designed to cover 
> distributed graphs.
> We can stream all metrics (whether YANG leaf, MIB variable, CLI, 
> syslog, what have you) to an OSS, sure.
> However, I believe in data aggregation, as we know that we're going 
> to quickly reach the limits of streaming capabilities.
> And I also believe in each component being responsible for its own 
> assurance, to the best of its knowledge.
> Hence the proposal to go via a SAIN agent, inside or outside a router, 
> to send the inferred health score and symptoms to the OSS.
> In the end, what do operational teams care about?
>     1. knowing that an interface, a router, part of the network works 
> fine ... until they tell me otherwise
>     2. collecting all the metrics in a big data lake to draw the same 
> or better conclusion
> Ideally we need both, but we face two schools here. I'm more in the 
> school of providing information, as opposed to lots of data. This 
> would reduce the cost of managing networks.
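
The "information rather than data" idea can be sketched as an agent that reduces raw samples to a health score plus symptoms before anything is sent to the OSS. Everything here is an assumption for illustration: the sample fields, the 1% error-rate threshold, the score values and the output keys are hypothetical and are not the SAIN YANG model.

```python
def summarize(component, samples, max_error_rate=0.01):
    """Reduce raw counter samples to a health score (0-100) and symptoms.

    samples: list of dicts with hypothetical "packets" and "errors" counters.
    """
    total = sum(s["packets"] for s in samples)
    errors = sum(s["errors"] for s in samples)
    rate = errors / total if total else 0.0

    score, symptoms = 100, []
    if rate > max_error_rate:
        score = 50
        symptoms.append(f"error rate {rate:.1%} above {max_error_rate:.1%}")
    return {"component": component, "health-score": score, "symptoms": symptoms}

raw = [{"packets": 1000, "errors": 5}, {"packets": 1000, "errors": 45}]
report = summarize("interface-A1", raw)
```

However many samples the agent collects, the OSS receives one small record per component, which is where the reduction in streaming volume would come from.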
>
> Regards, Benoit
>