Re: [Anima] ANIMA when there is a system-wide issue

Brian E Carpenter <brian.e.carpenter@gmail.com> Fri, 18 December 2020 19:57 UTC

To: "Ciavaglia, Laurent (Nokia - FR/Paris-Saclay)" <laurent.ciavaglia@nokia.com>, Anima WG <anima@ietf.org>
References: <136aa329-41a5-8b65-ef9e-fadf089696eb@gmail.com> <704b66e9-d41c-f7e9-7e4b-f2d934ec9158@gmail.com> <PR3PR07MB68265F26A2CFB818D9CFDFBCF3C30@PR3PR07MB6826.eurprd07.prod.outlook.com>
From: Brian E Carpenter <brian.e.carpenter@gmail.com>
Message-ID: <112c3eb2-9abd-4f1a-b9c8-f4bc129caab7@gmail.com>
Date: Sat, 19 Dec 2020 08:57:39 +1300
User-Agent: Mozilla/5.0 (Windows NT 10.0; WOW64; rv:60.0) Gecko/20100101 Thunderbird/60.9.1
MIME-Version: 1.0
In-Reply-To: <PR3PR07MB68265F26A2CFB818D9CFDFBCF3C30@PR3PR07MB6826.eurprd07.prod.outlook.com>
Content-Type: text/plain; charset="utf-8"
Content-Language: en-US
Content-Transfer-Encoding: quoted-printable
Archived-At: <https://mailarchive.ietf.org/arch/msg/anima/HdyZx1VEg5Yt9eJN8vdRfxK3b0M>
Subject: Re: [Anima] ANIMA when there is a system-wide issue

Laurent,

You are of course correct that simply running some ANIMA components would not avoid or necessarily repair such incidents. Life is not that easy. However, I believe that ANIMA (or other autonomic techniques) has the potential to make such incidents either rarer or more quickly repaired.

The ACP is designed to survive partitions and merges, and in the worst case to rebuild itself completely, in the absence of any traditionally configured control plane or data plane. That would allow ASAs to restart themselves if necessary, but in any case to continue their jobs even when normal routing is broken. I assume that an important class of ASAs will be those that watch and verify normal operations, and in both the incidents reported some kind of anomaly detection could have happened within a minute or two. That might not lead to immediate diagnosis of the problem, but maybe it could (for example) cause an automatic rollback of any recent configuration changes. Even an ASA without network access could do that: nothing is working, so roll back the recent ACL updates!
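To make the rollback idea concrete, here is a minimal sketch of such a watchdog in Python. It is only an illustration under stated assumptions, not an ANIMA or GRASP API: the names `ConfigChange` and `RollbackWatchdog`, the probe callables, the `undo` hook, and the window and threshold values are all hypothetical. It relies only on local state, so it could run on a node that has lost network access.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class ConfigChange:
    """One locally recorded configuration change (hypothetical structure)."""
    name: str
    applied_at: float          # seconds on any consistent clock
    undo: Callable[[], None]   # hypothetical hook that reverts the change


@dataclass
class RollbackWatchdog:
    """Rolls back recent changes once every health probe keeps failing."""
    probes: List[Callable[[], bool]]   # each returns True if its target looks healthy
    window: float = 600.0              # how far back "recent" reaches, in seconds (assumed)
    threshold: int = 3                 # consecutive all-failed checks before acting (assumed)
    changes: List[ConfigChange] = field(default_factory=list)
    failures: int = 0

    def record(self, change: ConfigChange) -> None:
        """Remember a change so that it can be undone later."""
        self.changes.append(change)

    def check(self, now: float) -> List[str]:
        """Run one check; return the names of the changes rolled back, if any."""
        # With no probes there is nothing to judge; any healthy probe resets the count.
        if not self.probes or any(probe() for probe in self.probes):
            self.failures = 0
            return []
        self.failures += 1
        if self.failures < self.threshold:
            return []
        # "Nothing is working": undo the changes inside the window, newest first.
        recent = [c for c in self.changes if now - c.applied_at <= self.window]
        rolled_back = []
        for change in sorted(recent, key=lambda c: c.applied_at, reverse=True):
            change.undo()
            self.changes.remove(change)
            rolled_back.append(change.name)
        self.failures = 0
        return rolled_back
```

A real ASA would call `check()` periodically with the current time and would need a policy for what happens if the rollback does not restore service; the sketch leaves that out.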

A slightly more abstract point is that an autonomic network will in theory need fewer configuration updates by human operators, so such problems will be less likely.

Regards
   Brian

On 18-Dec-20 21:42, Ciavaglia, Laurent (Nokia - FR/Paris-Saclay) wrote:
> Hi Brian,
> 
> Thanks for sharing interesting incidents and reflecting on the role of ANIMA technologies.
> 
> Reading the report, it seems the outage results from a series of indirectly linked events 
> leading to isolation of a portion of the GCP network.
> 
> For the incident you refer to here, how/where would you see ANIMA components helping to avoid the outage?
> Could we expect ANIMA networks to provide better/longer data plane operation in case of failed control plane/control plane functions? Beyond the pure ability to do so, there is a gain/risk trade-off in letting a DP run out of sync with its CP.
> Could we expect ANIMA components to have reacted differently and circumvented the issue, preventing the full disconnection of the GCP network portion? Or mitigated it at intermediate points?
> The incident seems to have been triggered by a legitimate/valid configuration change that nevertheless resulted in a function losing access to files it needed to operate. The config. change validation basically didn't notice there was a possible issue.
> Would such a miss have been caught by ANIMA components? (e.g. via a different validation-dependencies approach?)
> 
> Why I'm a bit doubtful here is because even if we deploy robust autonomic functions/agents, the above problem seems to originate from an issue in how the system has been configured.
> And even ASAs would still need to get some form/level of initial configuration or guidance.
> 
> 
> Best regards, 
> Laurent
> 
>> -----Original Message-----
>> From: Anima <anima-bounces@ietf.org> On Behalf Of Brian E Carpenter
>> Sent: Thursday, December 17, 2020 02:47
>> To: Anima WG <anima@ietf.org>
>> Subject: Re: [Anima] ANIMA when there is a system-wide issue
>>
>> And here's what happens when the control plane itself falls over:
>>
>> https://status.cloud.google.com/incident/zall/20011#20011006
>>
>> It seems pretty clear that Cloud needs ANIMA.
>>
>> Regards
>>    Brian
>>
>> On 01-Dec-20 11:02, Brian E Carpenter wrote:
>>> "AWS reveals it broke itself by exceeding OS thread limits"
>>>
>>> https://www.theregister.com/2020/11/30/aws_outage_explanation/
>>>
>>> Especially:
>>> "The TIFU-like post also outlines why Amazon's dashboards offered only
>>> scanty info about the incident – because they, too, depend on a service
>>> that depends on Kinesis."
>>>
>>> Perhaps there is something we should specify in ANIMA to prevent the
>>> ANIMA infrastructure falling into this sort of trap: when there is a
>>> system-wide issue (such as hitting an O/S resource limit everywhere at the
>>> same time) it also prevents the autonomic mechanisms from working.
>>>
>>> Regards
>>>    Brian Carpenter
>>>