Re: [Lsr] Multiple failures in Dynamic Flooding

Huaimo Chen <huaimo.chen@huawei.com> Mon, 11 March 2019 17:08 UTC

From: Huaimo Chen <huaimo.chen@huawei.com>
To: "tony.li@tony.li" <tony.li@tony.li>
CC: "lsr@ietf.org" <lsr@ietf.org>, "lsr-chairs@ietf.org" <lsr-chairs@ietf.org>, "lsr-ads@ietf.org" <lsr-ads@ietf.org>
Thread-Topic: Multiple failures in Dynamic Flooding
Thread-Index: AQHU1DOh90IUANMeGEWHdn8fMqQEB6YGlyHw
Date: Mon, 11 Mar 2019 17:08:09 +0000
Message-ID: <5316A0AB3C851246A7CA5758973207D463B76FDD@sjceml521-mbx.china.huawei.com>
References: <sa6lg2md2ok.fsf@chopps.org> <SN6PR11MB284553735B2351FB584BE792C17F0@SN6PR11MB2845.namprd11.prod.outlook.com> <5316A0AB3C851246A7CA5758973207D463B5858A@sjceml521-mbx.china.huawei.com> <420ed1b5-d849-99cc-bcb0-d159783e4de2@cisco.com> <5316A0AB3C851246A7CA5758973207D463B59041@sjceml521-mbx.china.huawei.com> <0B4DF2AC-8EE1-41CA-B357-98325067CA30@gmail.com> <5316A0AB3C851246A7CA5758973207D463B66FE9@sjceml521-mbx.china.huawei.com> <78A866F4-9AF0-481A-9DEC-B04DE72AFDA3@tony.li>
In-Reply-To: <78A866F4-9AF0-481A-9DEC-B04DE72AFDA3@tony.li>
Archived-At: <https://mailarchive.ietf.org/arch/msg/lsr/b5Tig4Ze6E3mDAYohbxZQ7Nj4Ew>

Hi Tony,

    In summary, for multiple failures, the following two issues in draft-li-lsr-dynamic-flooding are under discussion:

1)      how to determine that the current flooding topology is split; and

2)      how to repair/connect the split flooding topology.

For the first issue, the discussion is still ongoing.
For the second issue, repairing/connecting the split flooding topology through Hello protocol extensions does not work.  When a “backup path”/connection of multiple hops is needed to repair the split, a Hello cannot travel beyond one hop and therefore cannot repair the flooding topology in this case.

>From: Tony Li [mailto:tony1athome@gmail.com] On Behalf Of tony.li@tony.li
>Sent: Wednesday, March 6, 2019 10:45 AM
>To: Huaimo Chen <huaimo.chen@huawei.com>
>Cc: Christian Hopps <chopps@chopps.org>; lsr@ietf.org; lsr-chairs@ietf.org; lsr-ads@ietf.org
>Subject: Multiple failures in Dynamic Flooding
>
>Hi Huaimo,
>
>>> I’m sorry that you don’t find it useful. Determining the split is trivial: when you receive an IIH,
>>> it has the system ID of the other system in it. If that other system is not currently part of the
>>> flooding topology, then it is quite clear that it is disconnected from the flooding topology.
>>> Repairing the split is done by enabling temporary flooding on the new link.

>>When an adjacency between two nodes is up, the Hello packets exchanged between them do not change the node/system IDs they carry.
>>How do you determine that the other system is not currently part of the flooding topology?

>The IIH includes the system ID.  See ISO 10589 v2, section 9.7, field “source Id”.  The local system will have
>a copy of the flooding topology and can easily see if the neighbor was present as of the last FT computation.  If not, then it should be
>added (modulo rate limiting). The local system can also examine its own LSDB.  If there is no LSP for the neighbor, then it would seem
>highly likely that there is a disconnect and the neighbor should again be added (modulo rate limiting).

>We are not requiring it, but a system could also do a more extensive computation and compare the links between itself and the neighbor
>by tracing the path in the FT and then confirming that each link is up in the LSDB.
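
The two checks described in the quoted text above can be sketched as follows. This is only an illustration under stated assumptions, not code from either draft: `LocalState`, `ft_nodes`, `lsdb_systems`, and `neighbor_needs_temporary_flooding` are hypothetical names, system IDs are modeled as plain strings, and rate limiting is left out.

```python
from dataclasses import dataclass, field


@dataclass
class LocalState:
    """Hypothetical local view held by one system (names are illustrative)."""
    # System IDs present on the flooding topology as of the last FT computation.
    ft_nodes: set = field(default_factory=set)
    # System IDs for which the local LSDB currently holds an LSP.
    lsdb_systems: set = field(default_factory=set)


def neighbor_needs_temporary_flooding(state: LocalState, iih_source_id: str) -> bool:
    """Return True if the neighbor named in a received IIH looks disconnected.

    Mirrors the two checks in the quoted text: the neighbor is missing from
    the last computed flooding topology, or the LSDB has no LSP for it.
    Rate limiting is deliberately not modeled here.
    """
    in_ft = iih_source_id in state.ft_nodes
    has_lsp = iih_source_id in state.lsdb_systems
    return (not in_ft) or (not has_lsp)


# Usage: one known neighbor, one neighbor that is in neither table.
state = LocalState(
    ft_nodes={"0000.0000.0001"},
    lsdb_systems={"0000.0000.0001"},
)
known = neighbor_needs_temporary_flooding(state, "0000.0000.0001")
unknown = neighbor_needs_temporary_flooding(state, "0000.0000.0002")
```

Both checks only read local tables, so their answer is only as fresh as those tables.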

It normally takes a long time, for example more than ten minutes, for the neighbor's LSP/LSA to age out and be removed from the LSDB, even after the neighbor has been physically disconnected.
How can you decide quickly, within tens of milliseconds, that the flooding topology is disconnected?

>>> There is an issue here that we have not yet resolved, which is the rate that new links should be
>>> temporarily added to the flooding topology.  Some believe that adding any new link is the
>>> correct thing to do as it minimizes the recovery time. Others feel that enabling too many links
>>> could cause a flooding collapse, so link addition should be highly constrained. We are still
>>> discussing this and invite the WG’s opinions.

>>The issue is resolved by the solutions in draft-cc-lsr-flooding-reduction.
>>One solution is below, where the given distance can be adjusted/configured.
>>If we want every node to flood on all its links, we set the given
>>distance to a large number. If we want the nodes within 2 hops of a failure
>>to flood on all their links, we set the given distance to 2.
>>   “In one way, when two or more failures on the current flooding
>>   topology occur almost in the same time, each of the nodes within a
>>   given distance (such as 3 hops) to a failure point, floods the link
>>   state (LS) that it receives to all the links (except for the one from
>>   which the LS is received) until a new flooding topology is built.”


>As we have discussed, this is not a solution. In fact, this is more dangerous than anything else that has been proposed and
>seems highly likely to trigger a cascade failure. You are enabling full flooding for many nodes.  In dense topologies, even
>a radius of 3 is very high.  For example, in a LS topology, a radius of 3 is sufficient to enable full flooding throughout the
>entire topology. If that were stable, we would not need Dynamic Flooding at all.
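
The radius argument in the quoted text above can be checked on a toy graph. The sketch below is an illustration only and rests on assumptions: it takes "LS topology" to mean a two-tier leaf-spine fabric in which every leaf connects to every spine, it treats the failure point as a single node, and `nodes_within` and `leaf_spine` are hypothetical helper names.

```python
from collections import deque


def nodes_within(adj, start, radius):
    """Breadth-first search: return all nodes at most `radius` hops from `start`."""
    hops = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if hops[node] == radius:
            continue  # do not expand beyond the configured distance
        for nbr in adj[node]:
            if nbr not in hops:
                hops[nbr] = hops[node] + 1
                queue.append(nbr)
    return set(hops)


def leaf_spine(num_spines, num_leaves):
    """Build a two-tier fabric where every leaf is linked to every spine."""
    spines = [f"spine{i}" for i in range(num_spines)]
    leaves = [f"leaf{i}" for i in range(num_leaves)]
    adj = {s: set(leaves) for s in spines}
    adj.update({l: set(spines) for l in leaves})
    return adj


fabric = leaf_spine(num_spines=2, num_leaves=4)
covered = nodes_within(fabric, "leaf0", radius=3)
print(f"{len(covered)} of {len(fabric)} nodes fall inside radius 3")
```

In this small fabric every node lies within the configured radius of the failure point, so every node would flood on all of its links; larger or sparser topologies would have to be checked separately.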

This full flooding is enabled only for a very short time, until a new flooding topology is built.
How do you conclude that this is more dangerous than anything else that has been proposed and is highly likely to trigger a cascade failure? Can you explain in detail?

>>Another solution is just adding minimum links temporarily on the flooding
>>topology to repair the split flooding topology until a new flooding topology
>>is built.

>Agreed.  Which links constitute the minimum?  In a general topology, with arbitrary failures that are not distributed globally,
>how do we make a distributed decision about which links to enable? This is the problem that we are trying to solve. And
>we have no oracle to tell us The Right Answer.
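
For reference, the centralized version of the "minimum links" question has a simple answer when one node has a global view. The sketch below is an illustration only, not a mechanism from either draft: `repair_links` is a hypothetical name, and it assumes the caller already knows every node, every surviving flooding-topology link, and every candidate link that is up. The distributed decision raised in the quoted text is exactly what it does not address.

```python
def repair_links(nodes, ft_links, candidate_links):
    """Pick candidate links that reconnect a split flooding topology.

    Union-find over the surviving FT links gives the connected components;
    a candidate link is kept only if it joins two different components.
    """
    parent = {n: n for n in nodes}

    def find(n):
        # Follow parent pointers to the component root, halving the path.
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a == root_b:
            return False  # already in the same component
        parent[root_a] = root_b
        return True

    for a, b in ft_links:
        union(a, b)

    added = []
    for a, b in candidate_links:
        if union(a, b):
            added.append((a, b))
    return added


# Usage: the FT is split into {A, B} and {C, D}; two off-FT links are up.
extra = repair_links(
    nodes=["A", "B", "C", "D"],
    ft_links=[("A", "B"), ("C", "D")],
    candidate_links=[("B", "C"), ("A", "D")],
)
```

With k components and enough candidate links, this selects k - 1 links, which is the fewest that can reconnect them; which k - 1 links are chosen depends on the order in which candidates are examined.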

We can discuss this after we finish discussing the first solution.

Best Regards,
Huaimo

>Regards,
>Tony