Re: [Lsr] Multiple failures in Dynamic Flooding

tony.li@tony.li Mon, 11 March 2019 17:41 UTC

Sender: Tony Li <tony1athome@gmail.com>
From: tony.li@tony.li
Date: Mon, 11 Mar 2019 10:41:08 -0700
Cc: "lsr@ietf.org" <lsr@ietf.org>, "lsr-chairs@ietf.org" <lsr-chairs@ietf.org>, "lsr-ads@ietf.org" <lsr-ads@ietf.org>
To: Huaimo Chen <huaimo.chen@huawei.com>
Archived-At: <https://mailarchive.ietf.org/arch/msg/lsr/VHyR7YrNT4ftMrnXufhGa7mm89w>

Hi Huaimo,



>     In summary, for multiple failures, the two issues below in draft-li-lsr-dynamic-flooding are being discussed:
> 1)      how to determine that the current flooding topology is split; and
> 2)      how to repair/connect the split flooding topology.
> For the first issue, the discussions are still going on.
> For the second issue, repairing/connecting the flooding topology split through Hello protocol extensions does not work.  When a “backup path”/connection of multiple hops is needed to repair the split, a Hello cannot go beyond one hop, and thus cannot repair the flooding topology split in this case.


You do not try to repair things remotely; they are always repaired locally.  If multiple failures partition the flooding topology (FT), then it follows that more than one connected component of the FT remains.  Nodes that are adjacent to the failures will update their LSPs and flood them throughout their own connected component.  If the FT has partitioned, each component will see at least two link failures, and each node in the component can detect that the FT has partitioned.  Each node is then capable of enabling temporary flooding on one or more links that traverse the partition, thereby restoring a functioning FT.  The Area Leader then recomputes and redistributes the revised FT.
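
A minimal sketch of the local repair described above, under stated assumptions. It models the FT and the physical links as plain undirected edge lists and gives the node a global view of which links are up; a real implementation would derive that view from its LSDB. The names components, temporary_flooding_links, NODES, FT and UP are hypothetical and do not come from the draft.

```python
from collections import deque

def components(nodes, edges):
    """Map each node to a component id, using BFS over undirected edges."""
    adj = {n: set() for n in nodes}
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    comp = {}
    next_id = 0
    for start in sorted(nodes):
        if start in comp:
            continue
        comp[start] = next_id
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in comp:
                    comp[v] = next_id
                    queue.append(v)
        next_id += 1
    return comp

def temporary_flooding_links(me, nodes, ft_edges, up_links):
    """Local links of `me` that are up and lead into a different FT component."""
    norm = lambda e: tuple(sorted(e))
    up = {norm(e) for e in up_links}
    # Only FT links that are still up hold the flooding topology together.
    live_ft = [e for e in ft_edges if norm(e) in up]
    comp = components(nodes, live_ft)
    result = []
    for a, b in sorted(up):
        if me in (a, b):
            other = b if a == me else a
            if comp[other] != comp[me]:
                result.append((me, other))
    return result

# Hypothetical example: the FT is a six-node ring, with two extra physical links.
NODES = ["A", "B", "C", "D", "E", "F"]
FT = [("A", "B"), ("B", "C"), ("C", "D"), ("D", "E"), ("E", "F"), ("F", "A")]
# Links B-C and E-F have failed, so the ring splits into {A, B, F} and {C, D, E}.
UP = [("A", "B"), ("C", "D"), ("D", "E"), ("F", "A"), ("A", "D"), ("B", "E")]

for node in NODES:
    print(node, temporary_flooding_links(node, NODES, FT, UP))
```

In this example only nodes with a live link into the other component (A, B, D and E) select a link for temporary flooding; F and C select nothing. Every decision is made from local adjacency plus the shared link-state view, with no multi-hop signaling.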

To put it yet another way, repair is fully distributed.  You should like that.  :-)


> >We are not requiring it, but a system could also do a more extensive computation and compare the links between itself and the neighbor
> >by tracing the path in the FT and then confirming that each link is up in the LSDB.
>  
> It normally takes a long time, such as more than ten minutes, to age out and remove the neighbor's LSP/LSA from the LSDB, even though the neighbor is physically disconnected.
> How can you decide quickly, in tens of milliseconds, that the flooding topology is disconnected?


You do not wait for LSP/LSA removal.  You look for link changes in the LSPs that you do receive, or for local link changes.
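
As a rough illustration of this point, the sketch below compares the neighbor set advertised in the previous copy of a node's LSP with the set in the newly received copy, and reports the adjacencies that disappeared and were part of the FT. It assumes an LSP can be reduced to an origin plus a set of neighbor names; real LSPs carry this information in TLVs. The function name lost_ft_links is hypothetical.

```python
def lost_ft_links(origin, old_neighbors, new_neighbors, ft_edges):
    """Return FT links that `origin` advertised before but no longer advertises."""
    ft = {frozenset(e) for e in ft_edges}
    lost = set(old_neighbors) - set(new_neighbors)
    return sorted((origin, n) for n in lost if frozenset((origin, n)) in ft)

# Hypothetical example: B's updated LSP no longer lists C as a neighbor.
ft_ring = [("A", "B"), ("B", "C"), ("C", "D"), ("D", "E"), ("E", "F"), ("F", "A")]
print(lost_ft_links("B", {"A", "C", "E"}, {"A", "E"}, ft_ring))
```

The check runs as soon as the updated LSP arrives, so detection depends on flooding delay and not on the LSP aging out of the LSDB.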


> >As we have discussed, this is not a solution. In fact, this is more dangerous than anything else that has been proposed and
> >seems highly likely to trigger a cascade failure. You are enabling full flooding for many nodes.  In dense topologies, even
> >a radius of 3 is very high.  For example, in a LS topology, a radius of 3 is sufficient to enable full flooding throughout the
> >entire topology. If that were stable, we would not need Dynamic Flooding at all.
>  
> This full flooding is enabled only for a very short time.


All it takes is enabling it at sufficient density to create a cascade failure.  Milliseconds are sufficient for a collapse.
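
To make the radius point quoted above concrete, here is a small sketch that counts how many nodes lie within a given hop radius of one node. It assumes "LS topology" means a two-tier leaf-spine fabric in which every leaf connects to every spine; the sizes (4 spines, 16 leaves) and the names leaf_spine and nodes_within are illustrative assumptions only.

```python
from collections import deque

def leaf_spine(spines, leaves):
    """Adjacency map for a two-tier fabric: every leaf connects to every spine."""
    s_names = [f"s{i}" for i in range(spines)]
    l_names = [f"l{i}" for i in range(leaves)]
    adj = {s: set(l_names) for s in s_names}
    adj.update({l: set(s_names) for l in l_names})
    return adj

def nodes_within(adj, start, radius):
    """Set of nodes reachable from `start` in at most `radius` hops (BFS)."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        if dist[u] == radius:
            continue  # do not expand beyond the radius
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return set(dist)

fabric = leaf_spine(spines=4, leaves=16)
for r in (1, 2, 3):
    print(r, len(nodes_within(fabric, "l0", r)), "of", len(fabric))
```

In such a fabric every node is at most two hops from every other node, so enabling flooding within a radius of 3 around a single node already reaches the whole topology.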


> How do you conclude that this is more dangerous than anything else and seems highly likely to trigger a cascade failure? Can you explain in detail?


Again, we do not have absolute metrics on what triggers a cascade failure today.  We have several data points from several different implementations at different points in time.  We know that in the early ‘90s, a full mesh of 20 neighbors running L1L2 was sufficient to trigger one.  Obviously things have changed somewhat, but even more modern implementations have had problems.  This is why the MSDC went to BGP.

As a result, we need to be very conservative about what flooding we temporarily enable.  We do not want to walk anywhere near the cliff, as the cascade failure is fatal to the network.

Tony
