Re: [Lsr] Multiple failures in Dynamic Flooding

tony.li@tony.li Mon, 11 March 2019 17:41 UTC

Sender: Tony Li <tony1athome@gmail.com>
From: tony.li@tony.li
Message-Id: <10A1CA48-0D09-44FF-95ED-8D52FB867B8B@tony.li>
Content-Type: multipart/alternative; boundary="Apple-Mail=_AB2D3636-C782-4119-92FA-EA69EA66D72B"
Mime-Version: 1.0 (Mac OS X Mail 12.2 \(3445.102.3\))
Date: Mon, 11 Mar 2019 10:41:08 -0700
In-Reply-To: <5316A0AB3C851246A7CA5758973207D463B76FDD@sjceml521-mbx.china.huawei.com>
Cc: "lsr@ietf.org" <lsr@ietf.org>, "lsr-chairs@ietf.org" <lsr-chairs@ietf.org>, "lsr-ads@ietf.org" <lsr-ads@ietf.org>
To: Huaimo Chen <huaimo.chen@huawei.com>
References: <sa6lg2md2ok.fsf@chopps.org> <SN6PR11MB284553735B2351FB584BE792C17F0@SN6PR11MB2845.namprd11.prod.outlook.com> <5316A0AB3C851246A7CA5758973207D463B5858A@sjceml521-mbx.china.huawei.com> <420ed1b5-d849-99cc-bcb0-d159783e4de2@cisco.com> <5316A0AB3C851246A7CA5758973207D463B59041@sjceml521-mbx.china.huawei.com> <0B4DF2AC-8EE1-41CA-B357-98325067CA30@gmail.com> <5316A0AB3C851246A7CA5758973207D463B66FE9@sjceml521-mbx.china.huawei.com> <78A866F4-9AF0-481A-9DEC-B04DE72AFDA3@tony.li> <5316A0AB3C851246A7CA5758973207D463B76FDD@sjceml521-mbx.china.huawei.com>
X-Mailer: Apple Mail (2.3445.102.3)
Archived-At: <https://mailarchive.ietf.org/arch/msg/lsr/VHyR7YrNT4ftMrnXufhGa7mm89w>
Subject: Re: [Lsr] Multiple failures in Dynamic Flooding

Hi Huaimo,



>     In summary for multiple failures, two issues below in draft-li-lsr-dynamic-flooding are discussed:
> 1)      how to determine the current flooding topology is split; and
> 2)      how to repair/connect the flooding topology split.
> For the first issue, the discussions are still going on.
> For the second issue, repairing/connecting the flooding topology split through Hello protocol extensions does not work.  When a “backup path”/connection of multiple hops is needed to connect/repair the flooding topology split, Hello cannot go beyond one hop, and thus cannot repair the flooding topology split in this case.


You do not try to repair things remotely; they are always repaired locally.  If there are multiple failures in the flooding topology (FT) and it is partitioned, then it follows that there are multiple remaining connected components of the flooding topology.  Nodes that are adjacent to the failures will update their LSPs and flood them throughout their connected component.  If the FT is partitioned, each component will see at least two link failures, and each node in the component can detect that the FT has partitioned.  Each node is then capable of enabling temporary flooding on one or more links that traverse the partition, thereby restoring a functioning FT.  The Area Leader then recomputes and redistributes the revised FT.

To put it yet another way, repair is fully distributed.  You should like that.  :-)
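
As a rough illustration of the local repair described above, here is a minimal Python sketch.  It is not taken from the draft.  It assumes each node already holds the FT as a list of edges, plus the set of links it believes are down (from its own LSDB and local state).  The function names and the six-node ring are hypothetical.

```python
from collections import defaultdict

def components(nodes, edges):
    """Map each node to a component id, using only the given edges."""
    adj = defaultdict(set)
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    comp = {}
    for start in sorted(nodes):
        if start in comp:
            continue
        comp[start] = start          # label the component by its first node
        stack = [start]
        while stack:
            n = stack.pop()
            for m in adj[n]:
                if m not in comp:
                    comp[m] = start
                    stack.append(m)
    return comp

def temporary_flooding_links(me, local_links, nodes, ft_edges, down_links):
    """Local links of `me` that are up and lead into a different FT component."""
    down = {frozenset(link) for link in down_links}
    live_ft = [e for e in ft_edges if frozenset(e) not in down]
    comp = components(nodes, live_ft)
    return [
        (me, nbr)
        for nbr in local_links
        if frozenset((me, nbr)) not in down and comp[nbr] != comp[me]
    ]

# Hypothetical example: the FT is a ring A-B-C-D-E-F-A, and B also has a
# physical link to E that is not part of the FT.
nodes = ["A", "B", "C", "D", "E", "F"]
ring = [("A", "B"), ("B", "C"), ("C", "D"), ("D", "E"), ("E", "F"), ("F", "A")]
failed = [("A", "B"), ("D", "E")]    # two FT links fail, splitting the ring
repair = temporary_flooding_links("B", ["A", "C", "E"], nodes, ring, failed)
print(repair)
```

In this example B sees that E sits in the other component and that its own link to E is still up, so B-E is a candidate for temporary flooding.  Every node runs the same computation on its own links; nothing is signalled across multiple hops.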


> >We are not requiring it, but a system could also do a more extensive computation and compare the links between itself and the neighbor
> >by tracing the path in the FT and then confirming that each link is up in the LSDB.
>  
> It normally takes a long time such as more than ten minutes to age out and remove an LSP/LSA for the neighbor from the LSDB even though the neighbor is disconnected physically.
> How can you decide quickly in tens of milliseconds that the flooding topology is disconnected?


You do not wait for LSP/LSA removal.  You look for link changes in the LSPs that you do get, or local link changes.
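
To make that concrete, here is a small Python sketch of spotting a lost FT link by comparing two versions of a neighbor's LSP, without waiting for anything to age out.  It is only an illustration: a real LSP carries neighbor TLVs, which this models as a plain set of neighbor names, and `ft_links_lost` is a hypothetical helper, not something the draft defines.

```python
def ft_links_lost(origin, old_neighbors, new_neighbors, ft_edges):
    """FT links that `origin` stopped advertising between two LSP versions."""
    ft = {frozenset(edge) for edge in ft_edges}
    withdrawn = set(old_neighbors) - set(new_neighbors)
    return sorted(
        tuple(sorted((origin, nbr)))
        for nbr in withdrawn
        if frozenset((origin, nbr)) in ft   # ignore links outside the FT
    )

# Hypothetical example: A used to advertise B, C and F; its new LSP lists only F.
# A-B and A-F are FT links, A-C is not.
ft_edges = [("A", "B"), ("B", "C"), ("F", "A")]
lost = ft_links_lost("A", {"B", "C", "F"}, {"F"}, ft_edges)
print(lost)
```

The check runs as soon as the updated LSP arrives, so the time to notice the change is bounded by flooding within the connected component, not by LSP lifetime.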


> >As we have discussed, this is not a solution. In fact, this is more dangerous than anything else that has been proposed and
> >seems highly likely to trigger a cascade failure. You are enabling full flooding for many nodes.  In dense topologies, even
> >a radius of 3 is very high.  For example, in a LS topology, a radius of 3 is sufficient to enable full flooding throughout the
> >entire topology. If that were stable, we would not need Dynamic Flooding at all.
>  
> This full flooding is enabled only for a very short time.


All it takes is enabling it at sufficient density to create a cascade failure.  Milliseconds are sufficient for a collapse.
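
A quick way to see how far a small radius reaches in a dense topology is to count the nodes within N hops.  The Python sketch below assumes that LS means a leaf-spine fabric and models the simplest case, two tiers with every leaf attached to every spine.  The sizes and function names are made up for illustration.

```python
from collections import deque

def leaf_spine(num_spines, num_leaves):
    """Adjacency map for a two-tier fabric: every leaf connects to every spine."""
    spines = [f"s{i}" for i in range(num_spines)]
    leaves = [f"l{i}" for i in range(num_leaves)]
    adj = {s: set(leaves) for s in spines}
    adj.update({l: set(spines) for l in leaves})
    return adj

def within_radius(adj, source, radius):
    """Set of nodes at most `radius` hops from `source` (breadth-first search)."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        n = queue.popleft()
        if dist[n] == radius:
            continue                 # do not expand past the radius
        for m in adj[n]:
            if m not in dist:
                dist[m] = dist[n] + 1
                queue.append(m)
    return set(dist)

fabric = leaf_spine(num_spines=4, num_leaves=32)
for r in (1, 2, 3):
    reached = within_radius(fabric, "l0", r)
    print(r, len(reached), "of", len(fabric))
```

Under these assumptions, a radius of 3 around any single node already includes every node in the fabric, so enabling full flooding within that radius amounts to enabling it everywhere.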


> How do you get that this is more dangerous than anything else and seems highly likely to trigger a cascade failure? Can you give some explanations in details?


Again, we do not have absolute metrics on what triggers a cascade failure today.  We have several data points from several different implementations at different points in time.  We know that in the early ’90s, a full mesh of 20 neighbors running L1L2 was sufficient to trigger one.  Obviously things have changed somewhat, but even more modern implementations have had problems.  This is why the MSDC went to BGP.

As a result, we need to be very conservative about what flooding we temporarily enable.  We do not want to walk anywhere near the cliff, as the cascade failure is fatal to the network.

Tony