Re: [Anima] ANIMA when there is a system-wide issue

"Ciavaglia, Laurent (Nokia - FR/Paris-Saclay)" <laurent.ciavaglia@nokia.com> Fri, 18 December 2020 08:42 UTC

From: "Ciavaglia, Laurent (Nokia - FR/Paris-Saclay)" <laurent.ciavaglia@nokia.com>
To: Brian E Carpenter <brian.e.carpenter@gmail.com>, Anima WG <anima@ietf.org>
Date: Fri, 18 Dec 2020 08:42:02 +0000
Message-ID: <PR3PR07MB68265F26A2CFB818D9CFDFBCF3C30@PR3PR07MB6826.eurprd07.prod.outlook.com>
References: <136aa329-41a5-8b65-ef9e-fadf089696eb@gmail.com> <704b66e9-d41c-f7e9-7e4b-f2d934ec9158@gmail.com>
In-Reply-To: <704b66e9-d41c-f7e9-7e4b-f2d934ec9158@gmail.com>
Archived-At: <https://mailarchive.ietf.org/arch/msg/anima/8IUsl7XXUAFoDcT_DTeYKQmKf0Q>
Subject: Re: [Anima] ANIMA when there is a system-wide issue
List-Id: Autonomic Networking Integrated Model and Approach <anima.ietf.org>
List-Unsubscribe: <https://www.ietf.org/mailman/options/anima>, <mailto:anima-request@ietf.org?subject=unsubscribe>
List-Archive: <https://mailarchive.ietf.org/arch/browse/anima/>
List-Post: <mailto:anima@ietf.org>
List-Help: <mailto:anima-request@ietf.org?subject=help>
List-Subscribe: <https://www.ietf.org/mailman/listinfo/anima>, <mailto:anima-request@ietf.org?subject=subscribe>
X-List-Received-Date: Fri, 18 Dec 2020 08:42:10 -0000

Hi Brian,

Thanks for sharing interesting incidents and reflecting on the role of ANIMA technologies.

Reading the report, it seems the outage resulted from a series of indirectly linked events
that led to the isolation of a portion of the GCP network.

For the incident you refer to here, how/where do you see ANIMA components avoiding (or helping to avoid) the outage?
Could we expect ANIMA networks to provide better/longer data plane operation when the control plane, or some control plane functions, have failed? Beyond the pure ability to do so, there is a gain/risk trade-off in letting a DP run out of sync with its CP.
Could we expect ANIMA components to have reacted differently and circumvented the issue, preventing the full disconnection of that portion of the GCP network? Or to have mitigated it at intermediate points?
The incident seems to have been triggered by a legitimate/valid configuration change that nevertheless resulted in a function losing access to files it needed to operate. The config. change validation basically didn't notice there was a possible issue.
Would such a miss have been caught by ANIMA components? (e.g. via a different validation-dependencies approach?)

The reason I'm a bit doubtful here is that, even if we deploy robust autonomic functions/agents, the above problem seems to originate from an issue in how the system was configured.
And even ASAs would still need some form/level of initial configuration or guidance.


Best regards, 
Laurent

> -----Original Message-----
> From: Anima <anima-bounces@ietf.org> On Behalf Of Brian E Carpenter
> Sent: Thursday, December 17, 2020 02:47
> To: Anima WG <anima@ietf.org>
> Subject: Re: [Anima] ANIMA when there is a system-wide issue
> 
> And here's what happens when the control plane itself falls over:
> 
> https://status.cloud.google.com/incident/zall/20011#20011006
> 
> It seems pretty clear that Cloud needs ANIMA.
> 
> Regards
>    Brian
> 
> On 01-Dec-20 11:02, Brian E Carpenter wrote:
> > "AWS reveals it broke itself by exceeding OS thread limits"
> >
> > https://www.theregister.com/2020/11/30/aws_outage_explanation/
> >
> > Especially:
> > "The TIFU-like post also outlines why Amazon's dashboards offered only
> > scanty info about the incident – because they, too, depend on a service
> > that depends on Kinesis."
> >
> > Perhaps there is something we should specify in ANIMA to prevent the
> > ANIMA infrastructure falling into this sort of trap: when there is a
> > system-wide issue (such as hitting an O/S resource limit everywhere at the
> > same time) it also prevents the autonomic mechanisms from working.
> >
> > Regards
> >    Brian Carpenter
> >
> 