Re: [OPSAWG] review of draft-ietf-opsawg-service-assurance-architecture-08

Benoit Claise <benoit.claise@huawei.com> Tue, 20 September 2022 12:41 UTC

Return-Path: <benoit.claise@huawei.com>
X-Original-To: opsawg@ietfa.amsl.com
Delivered-To: opsawg@ietfa.amsl.com
Received: from localhost (localhost [127.0.0.1]) by ietfa.amsl.com (Postfix) with ESMTP id 35138C159823 for <opsawg@ietfa.amsl.com>; Tue, 20 Sep 2022 05:41:56 -0700 (PDT)
X-Virus-Scanned: amavisd-new at amsl.com
X-Spam-Flag: NO
X-Spam-Score: -4.199
X-Spam-Level:
X-Spam-Status: No, score=-4.199 tagged_above=-999 required=5 tests=[BAYES_00=-1.9, HTML_MESSAGE=0.001, NICE_REPLY_A=-0.001, RCVD_IN_DNSWL_MED=-2.3, RCVD_IN_MSPIKE_H2=-0.001, SPF_HELO_NONE=0.001, SPF_PASS=-0.001, T_HTML_ATTACH=0.01, T_SCC_BODY_TEXT_LINE=-0.01, URIBL_DBL_BLOCKED_OPENDNS=0.001, URIBL_ZEN_BLOCKED_OPENDNS=0.001] autolearn=ham autolearn_force=no
Received: from mail.ietf.org ([50.223.129.194]) by localhost (ietfa.amsl.com [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id SW_RxwbHkdiw for <opsawg@ietfa.amsl.com>; Tue, 20 Sep 2022 05:41:52 -0700 (PDT)
Received: from frasgout.his.huawei.com (frasgout.his.huawei.com [185.176.79.56]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by ietfa.amsl.com (Postfix) with ESMTPS id 37912C15949B for <opsawg@ietf.org>; Tue, 20 Sep 2022 05:41:51 -0700 (PDT)
Received: from fraeml736-chm.china.huawei.com (unknown [172.18.147.201]) by frasgout.his.huawei.com (SkyGuard) with ESMTP id 4MX1Nt2vKXz67vrS; Tue, 20 Sep 2022 20:40:46 +0800 (CST)
Received: from [10.126.175.77] (10.126.175.77) by fraeml736-chm.china.huawei.com (10.206.15.217) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256) id 15.1.2375.31; Tue, 20 Sep 2022 14:41:46 +0200
Content-Type: multipart/mixed; boundary="------------vNbB7lJgGXukJbx3ohcoZHqu"
Message-ID: <1a2b8d3a-d353-2065-e2f5-1b8c7fe32560@huawei.com>
Date: Tue, 20 Sep 2022 14:41:42 +0200
MIME-Version: 1.0
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:102.0) Gecko/20100101 Thunderbird/102.2.2
Content-Language: en-GB
To: Michael Richardson <mcr+ietf@sandelman.ca>, opsawg@ietf.org
References: <74351.1663022748@dooku>
From: Benoit Claise <benoit.claise@huawei.com>
In-Reply-To: <74351.1663022748@dooku>
X-Originating-IP: [10.126.175.77]
X-ClientProxiedBy: dggems703-chm.china.huawei.com (10.3.19.180) To fraeml736-chm.china.huawei.com (10.206.15.217)
X-CFilter-Loop: Reflected
Archived-At: <https://mailarchive.ietf.org/arch/msg/opsawg/DpRgbem2RVL99UE0MSwD7A7wZtU>
Subject: Re: [OPSAWG] review of draft-ietf-opsawg-service-assurance-architecture-08
X-BeenThere: opsawg@ietf.org
X-Mailman-Version: 2.1.39
Precedence: list
List-Id: OPSA Working Group Mail List <opsawg.ietf.org>
List-Unsubscribe: <https://www.ietf.org/mailman/options/opsawg>, <mailto:opsawg-request@ietf.org?subject=unsubscribe>
List-Archive: <https://mailarchive.ietf.org/arch/browse/opsawg/>
List-Post: <mailto:opsawg@ietf.org>
List-Help: <mailto:opsawg-request@ietf.org?subject=help>
List-Subscribe: <https://www.ietf.org/mailman/listinfo/opsawg>, <mailto:opsawg-request@ietf.org?subject=subscribe>
X-List-Received-Date: Tue, 20 Sep 2022 12:41:56 -0000

Hi Michael,

Thanks for your review.
And sorry for the delay: I was not too sure how to react to this review. 
Another review after WGLC, to be integrated into the IETF LC? A document 
shepherd review that needs to be addressed for the document to progress?

Anyway, see inline.
Attached you will find a diff with some new proposed text. Let us know 
if this addresses your concerns.

On 9/13/2022 12:45 AM, Michael Richardson wrote:
> I have read draft-ietf-opsawg-service-assurance-architecture at the request
> of a few people.  This is not part of any directorate review (that I
> remember, or that shows up in my review list).  If it's useful for me to plug
> this in somewhere, let me know.
>
> I find the document well written, and to me rather ambitious.
> That might be because my level of understanding of modern network management
> is poor.
>
> I found section 3.1.1. Circular Dependencies to be interesting, and I think
> telling.   As soon as I saw "DAG" in the previous section, I was all, "yeah, but..."
> I'm not convinced that the process described in 3.1.1 is something that a
> computer program can do, versus that it (the service and the components that
> build the service) has to designed to be cycle from from the beginning.
> It seems to me that this document either has to constrain what services can
> be built by deciding upon a canonical way to describe many things, or that
> different vendors will create interoperable models only by chance.
Typically, it's only when assurance graphs are combined that we might 
have circular dependencies. So in practice, we don't believe we are 
going to see many instances of them.
Different vendors/controllers assuring different parts of the network 
don't have the exact same dependencies, even if that would be welcome.
We believe that this circular-dependency removal can be programmed, but 
we also fully agree that good design should avoid circular dependencies 
from the start.
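To illustrate that this removal is programmable: a minimal sketch of 
breaking cycles in a combined subservice dependency graph with a 
depth-first walk, dropping any back edge found. The graph structure and 
subservice names here are hypothetical, for illustration only; which 
edge to drop in practice is a policy decision.

```python
from collections import defaultdict

def break_cycles(deps):
    """Depth-first walk over a subservice dependency graph
    (adjacency list: subservice -> list of its dependencies).
    Returns the set of edges to drop so the rest is a DAG."""
    WHITE, GREY, BLACK = 0, 1, 2
    color = defaultdict(int)
    dropped = set()

    def visit(node):
        color[node] = GREY
        for dep in deps.get(node, []):
            if color[dep] == GREY:        # back edge => cycle
                dropped.add((node, dep))  # break the cycle here
            elif color[dep] == WHITE:
                visit(dep)
        color[node] = BLACK

    for node in list(deps):
        if color[node] == WHITE:
            visit(node)
    return dropped

# Two assurance graphs combined into one with a cycle:
deps = {"tunnel": ["peer-interface"], "peer-interface": ["tunnel"]}
print(break_cycles(deps))  # one of the two edges is dropped
```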
>
> section 3.6. Handling Maintenance Windows
> seems a bit light to me.
> I think that there are three aspects which need to emphasized:
>    a) maintenance windows where components are marked in maintenance, but
>       that the service itself should continue to operate (with a lower score),
>       because some redundancy takes over.
>       A key issue here is sometimes this results in "boy-who-cried-wolf"
>       situation, where the lower score and lack of resiliency is then
>       overlooked later on.  The broken thing never gets repaired, and then
>       some other fault or maintenance causes an actual failure.
Actually, it depends on the intent.
If the intent is to have a backup link available at all times, then yes, 
the service continues to operate, with a lower score.
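That intent-dependence could be sketched roughly as follows: a redundant 
link group stays operational while at least one link is up, but if the 
intent requires a spare at all times, losing redundancy lowers the 
score. The concrete score values (100/50/0) are illustrative, not from 
the draft.

```python
def link_group_health(links, require_backup=True):
    """Health of a redundant link group: traffic still flows while
    at least one link is up, but if the intent requires a backup to
    be available at all times, lost redundancy lowers the score."""
    up = sum(1 for state in links.values() if state == "up")
    if up == 0:
        return 0
    if require_backup and up < len(links):
        return 50  # degraded: operational, but no spare capacity
    return 100

print(link_group_health({"link-1": "up", "link-2": "maintenance"}))  # 50
```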
>
>    b) components are marked for maintenance, which have service impacting
>       effects, but during which, other components fail.  To make analogy,
>       you don't care so much if your car steering system does not operate
>       while the starter motor is not operational.  But, as soon as you fix the
>       starter motor (taking hours to day), you find that you still can not
>       go.   You could have fixed both systems in parallel/currently, if only
>       you'd known.
There are two cases here.
1. You knew (from the assurance graph) that the car steering system was 
not operating when going into maintenance for the starter motor.
     In that case, you could repair both in parallel during the 
maintenance window.
2. You didn't know, and you will learn about the broken-down car 
steering system when coming back from the starter motor maintenance,
     at the time of recomputing the assurance graph and looking at 
the health of each subservice.
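Case 1 amounts to a simple pre-maintenance query over the assurance 
graph: before taking a subservice down, list every other subservice that 
is already unhealthy, so both can be repaired in the same window. The 
data structure and names below are hypothetical, for illustration only.

```python
def repairable_in_parallel(health_by_subservice, target):
    """Before starting maintenance on `target`, return the other
    subservices that are already unhealthy (health below 100), so
    they can be repaired during the same maintenance window."""
    return [name for name, health in health_by_subservice.items()
            if name != target and health < 100]

# Hypothetical health scores per subservice (100 = fully healthy):
health = {"starter-motor": 0, "steering": 20, "brakes": 100}
print(repairable_in_parallel(health, "starter-motor"))  # ['steering']
```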
>
>    c) as the example gives about an update to an device OS.  This sometimes
>       comes with unintended (or poorly documented) side effects which cause
>       other failures, or knock-on updates.  For instance, you upgrade the
>       OS and then TLS 1.1 is disabled in favour of TLS 1.2 and TLS 1.3, but
>       other components are in critical use, and have not yet been updated,
>       and only TLS 1.1 was supported.
Sure. This is similar to case 2 above.
>
> (c) is in many ways that the DAG *itself* might need to be updated.
> How do you transition from one dependancy DAG to another dependancy DAG?
> I guess that section 3.9 gets into this, but it seems rather weak.
Proposal:
1. We need to add the concept that a service depending on 
under-maintenance subservices will receive the "under maintenance" 
symptom and has to take it into account in its health computation. How? 
We don't want to go into the specifics of health aggregation in this 
specification.
2. Add some text saying that the DAG might have to be recomputed after 
a subservice comes out of maintenance.
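Point 1 could look roughly like this sketch, where the "under 
maintenance" symptom is propagated from a subservice to every service 
depending on it, directly or transitively; how each service then weighs 
the symptom in its health score is deliberately left open. All names 
here are illustrative, not taken from the draft.

```python
def propagate_maintenance(dependents, under_maintenance):
    """Walk the dependency graph upwards: every service that
    depends, directly or indirectly, on an under-maintenance
    subservice receives the 'under maintenance' symptom."""
    symptoms = {}
    stack = list(under_maintenance)
    seen = set(stack)
    while stack:
        sub = stack.pop()
        for svc in dependents.get(sub, []):
            symptoms.setdefault(svc, []).append("under maintenance")
            if svc not in seen:
                seen.add(svc)
                stack.append(svc)
    return symptoms

# dependents: subservice -> services depending on it (hypothetical)
dependents = {"interface-eth0": ["tunnel-A"], "tunnel-A": ["vpn-service"]}
print(propagate_maintenance(dependents, ["interface-eth0"]))
```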
>
> 3.8. Timing
> Starts talking about NTP, and synchronization.
> Then goes into garbage collection, and I think that maybe this transition in
> the text could be better presented.
You are right.
We propose to move the following text (which is not substantial enough 
to deserve its own section) to just before Section 3.1:

        The SAIN architecture requires time synchronization, with the Network
        Time Protocol (NTP) [RFC5905] as a candidate, between all elements:
        monitored entities, SAIN agents, the service orchestrator, the SAIN
        collector, as well as the SAIN orchestrator.  This guarantees that all
        symptoms in the system can be correlated with the right assurance
        graph version.


And rename Section 3.8 "Timing" to "Garbage Collection".
>
>
> I feel that this SAIN architecture is quite ambitious, and I'm not sure that
> there is enough here to actually create interoperable implementations.
My group created a prototype, and I know of another one.
There is also an open-source implementation (presented by Prof. Benoit 
Donnet in the past).
The interop part lies in linking the YANG modules, which we addressed 
along with the circular dependencies.

Regards, Jean and Benoit
>
>
>
> _______________________________________________
> OPSAWG mailing list
> OPSAWG@ietf.org
> https://www.ietf.org/mailman/listinfo/opsawg