Re: [Anima] ANIMA when there is a system-wide issue

Toerless Eckert <tte@cs.fau.de> Thu, 28 January 2021 16:04 UTC

Date: Thu, 28 Jan 2021 17:03:56 +0100
From: Toerless Eckert <tte@cs.fau.de>
To: "Ciavaglia, Laurent (Nokia - FR/Paris-Saclay)" <laurent.ciavaglia@nokia.com>
Cc: Brian E Carpenter <brian.e.carpenter@gmail.com>, Anima WG <anima@ietf.org>
Message-ID: <20210128160356.GB54347@faui48f.informatik.uni-erlangen.de>
References: <136aa329-41a5-8b65-ef9e-fadf089696eb@gmail.com> <704b66e9-d41c-f7e9-7e4b-f2d934ec9158@gmail.com> <PR3PR07MB68265F26A2CFB818D9CFDFBCF3C30@PR3PR07MB6826.eurprd07.prod.outlook.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <PR3PR07MB68265F26A2CFB818D9CFDFBCF3C30@PR3PR07MB6826.eurprd07.prod.outlook.com>
User-Agent: Mutt/1.10.1 (2018-07-13)
Archived-At: <https://mailarchive.ietf.org/arch/msg/anima/vCWdOhjoys82TzZsuimztfg-iBY>
Subject: Re: [Anima] ANIMA when there is a system-wide issue

Laurent, *:

In German: "Am Ast saegen auf dem man sitzt" (literally, "sawing at the branch you are sitting on"), aka:

https://www.fotocommunity.de/photo/am-ast-saegen-auf-dem-man-sitzt-smithi/32932119

Not sure what the best English equivalent saying is. In German
there are actually several of these idioms.

In networking, this problem is in-band management of IP networks:
you use the IP network that you are configuring to provide the connectivity
for that configuration. Epic fail. That's why data centers and
SPs still often have separate out-of-band management networks.
The ACP is the best current attempt to solve this issue without the
cost overhead of a separate out-of-band network.

The Google incident is exactly the kind of problem the ACP solves: the ACP
has no configuration, and therefore nothing that can be misconfigured.
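
To make the in-band vs. ACP difference concrete, here is a toy sketch in
Python. It is not ACP code and models nothing from the spec: the Node
class, the link sets, the three-node "noc - r1 - r2" chain and the
reachable() helper are all made up for illustration. The only assumption
it encodes is the one above: in-band management follows links that
configuration can remove, while an ACP-like channel follows physical
adjacencies that configuration cannot touch.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    # Operator-configured data-plane links; a bad change can remove entries.
    configured_links: set = field(default_factory=set)
    # Physical adjacencies; the ACP-like channel uses these, with no config.
    physical_links: set = field(default_factory=set)


def reachable(nodes, start, target, use_config):
    """Breadth-first search over configured or physical links."""
    seen, todo = {start}, [start]
    while todo:
        current = todo.pop(0)
        if current == target:
            return True
        node = nodes[current]
        links = node.configured_links if use_config else node.physical_links
        for peer in links:
            if peer not in seen:
                seen.add(peer)
                todo.append(peer)
    return False


def build_chain():
    """Hypothetical topology: noc - r1 - r2, fully configured at first."""
    nodes = {n: Node(n) for n in ("noc", "r1", "r2")}
    for a, b in (("noc", "r1"), ("r1", "r2")):
        for x, y in ((a, b), (b, a)):
            nodes[x].physical_links.add(y)
            nodes[x].configured_links.add(y)
    return nodes


nodes = build_chain()
# A misconfiguration pushed in-band removes the r1 <-> r2 link.
nodes["r1"].configured_links.discard("r2")
nodes["r2"].configured_links.discard("r1")

print(reachable(nodes, "noc", "r2", use_config=True))   # in-band: False
print(reachable(nodes, "noc", "r2", use_config=False))  # ACP-like: True
```

After the bad change, the in-band path can no longer reach r2 to repair
it, while the configuration-free path still can.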

The magic of resilience lies in correct layering. In the best
implementation designs, the whole ACP implementation on a node
runs on resources that nobody can misconfigure and that no user
traffic or control-plane configuration touches.

This is perfectly possible by combining the IETF ACP spec with appropriate
implementation techniques on the nodes. At the IEEE IRIC 2016 conference,
I had some slides showing, for example, how one could put the
ACP into the firmware of a BMC on every node. And you could segment
the NPU as well. Not rocket science. Just better engineering than what's
done today.

Cheers
    Toerless

On Fri, Dec 18, 2020 at 08:42:02AM +0000, Ciavaglia, Laurent (Nokia - FR/Paris-Saclay) wrote:
> Hi Brian,
> 
> Thanks for sharing interesting incidents and reflecting on the role of ANIMA technologies.
> 
> Reading the report, it seems the outage results from a series of indirectly linked events 
> leading to isolation of a portion of the GCP network.
> 
> For the incident you refer to here, how/where would you see ANIMA components having helped avoid the outage?
> Could we expect ANIMA networks to provide better/longer data plane operation in case of failed control plane/control plane functions? Beyond the pure ability to do so, there is a gain/risk trade-off in letting a DP run out of sync with its CP.
> Could we expect ANIMA components to have reacted differently and circumvented the issue, preventing the full disconnection of the GCP network portion? Or mitigated it at intermediate points?
> The incident seems to have been triggered by a legitimate/valid configuration change that nevertheless resulted in a function losing access to files it needed to operate. The config change validation basically didn't notice there was a possible issue.
> Would such a miss have been caught by ANIMA components? (e.g. via a different validation-dependencies approach?)
> 
> Why I'm a bit doubtful here is because even if we deploy robust autonomic functions/agents, the above problem seems to originate from an issue in how the system has been configured.
> And even ASAs would still need to get some form/level of initial configuration or guidance.
> 
> 
> Best regards, 
> Laurent
> 
> > -----Original Message-----
> > From: Anima <anima-bounces@ietf.org> On Behalf Of Brian E Carpenter
> > Sent: Thursday, December 17, 2020 02:47
> > To: Anima WG <anima@ietf.org>
> > Subject: Re: [Anima] ANIMA when there is a system-wide issue
> > 
> > And here's what happens when the control plane itself falls over:
> > 
> > https://status.cloud.google.com/incident/zall/20011#20011006
> > 
> > It seems pretty clear that Cloud needs ANIMA.
> > 
> > Regards
> >    Brian
> > 
> > On 01-Dec-20 11:02, Brian E Carpenter wrote:
> > > "AWS reveals it broke itself by exceeding OS thread limits"
> > >
> > > https://www.theregister.com/2020/11/30/aws_outage_explanation/
> > >
> > > Especially:
> > > "The TIFU-like post also outlines why Amazon's dashboards offered only
> > > scanty info about the incident - because they, too, depend on a service
> > > that depends on Kinesis."
> > >
> > > Perhaps there is something we should specify in ANIMA to prevent the
> > > ANIMA infrastructure falling into this sort of trap: when there is a
> > > system-wide issue (such as hitting an O/S resource limit everywhere at the
> > > same time) it also prevents the autonomic mechanisms from working.
> > >
> > > Regards
> > >    Brian Carpenter
> > >
> > 
> > _______________________________________________
> > Anima mailing list
> > Anima@ietf.org
> > https://www.ietf.org/mailman/listinfo/anima

-- 
---
tte@cs.fau.de