Re: [Lsr] Multiple failures in Dynamic Flooding

Huaimo Chen <huaimo.chen@huawei.com> Mon, 11 March 2019 17:08 UTC

From: Huaimo Chen <huaimo.chen@huawei.com>
To: "tony.li@tony.li" <tony.li@tony.li>
CC: "lsr@ietf.org" <lsr@ietf.org>, "lsr-chairs@ietf.org" <lsr-chairs@ietf.org>, "lsr-ads@ietf.org" <lsr-ads@ietf.org>
Date: Mon, 11 Mar 2019 17:08:09 +0000
Archived-At: <https://mailarchive.ietf.org/arch/msg/lsr/b5Tig4Ze6E3mDAYohbxZQ7Nj4Ew>

Hi Tony,

    In summary, two issues with multiple failures in draft-li-lsr-dynamic-flooding are under discussion:

1)      how to determine that the current flooding topology is split; and

2)      how to repair (reconnect) the split flooding topology.

For the first issue, the discussion is still ongoing.
For the second issue, repairing the split through Hello protocol extensions does not work when a multi-hop “backup path” is needed to reconnect the flooding topology. A Hello cannot travel beyond one hop, so it cannot repair the split in this case.

>From: Tony Li [mailto:tony1athome@gmail.com] On Behalf Of tony.li@tony.li
>Sent: Wednesday, March 6, 2019 10:45 AM
>To: Huaimo Chen <huaimo.chen@huawei.com>
>Cc: Christian Hopps <chopps@chopps.org>; lsr@ietf.org; lsr-chairs@ietf.org; lsr-ads@ietf.org
>Subject: Multiple failures in Dynamic Flooding
>
>Hi Huaimo,
>
>>> I’m sorry that you don’t find it useful. Determining the split is trivial: when you receive an IIH,
>>> it carries the system ID of the other system. If that other system is not currently part of the
>>> flooding topology, then it is quite clear that it is disconnected from the flooding topology.
>>> Repairing the split is done by enabling temporary flooding on the new link.

>>While an adjacency between two nodes is up, the Hello packets exchanged between them do not change the node/system IDs they carry.
>>How do you determine that the other system is not currently part of the flooding topology?

>The IIH includes the system ID.  See ISO 10589 v2, section 9.7, field “source Id”.  The local system will have
>a copy of the flooding topology and can easily see if the neighbor was present as of the last FT computation.  If not, then it should be
>added (modulo rate limiting). The local system can also examine its own LSDB.  If there is no LSP for the neighbor, then it would seem
>highly likely that there is a disconnect and the neighbor should again be added (modulo rate limiting).
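
The two checks described above (neighbor absent from the last FT computation, or no LSP for the neighbor in the LSDB) could be sketched roughly as follows. This is only an illustration and is not taken from either draft: LocalState, ft_nodes, lsdb_systems and neighbor_looks_disconnected are hypothetical names, system IDs are plain strings, and rate limiting is left out.

```python
from dataclasses import dataclass, field


@dataclass
class LocalState:
    # Hypothetical container: system IDs seen in the last FT computation.
    ft_nodes: set = field(default_factory=set)
    # Hypothetical container: system IDs with at least one LSP in the LSDB.
    lsdb_systems: set = field(default_factory=set)


def neighbor_looks_disconnected(state, iih_source_id):
    """Return True if the IIH sender appears to be off the flooding topology.

    iih_source_id stands for the Source ID field of the received IIH.
    """
    not_in_ft = iih_source_id not in state.ft_nodes
    no_lsp = iih_source_id not in state.lsdb_systems
    return not_in_ft or no_lsp
```

A True result would be the trigger for enabling temporary flooding on the link, subject to whatever rate limiting the WG settles on.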

>We are not requiring it, but a system could also do a more extensive computation and compare the links between itself and the neighbor
>by tracing the path in the FT and then confirming that each link is up in the LSDB.
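
One possible reading of this optional, more extensive check is a reachability search over only those FT links that the LSDB still reports as up. The sketch below assumes that reading; the function name and the representation of links as pairs of system IDs are hypothetical and do not come from the draft.

```python
from collections import deque


def reachable_over_ft(ft_links, lsdb_up_links, local, neighbor):
    """Return True if neighbor is reachable from local using only FT links
    that the LSDB currently reports as up.

    Links are (system_id, system_id) pairs and are treated as undirected.
    """
    usable = {frozenset(l) for l in ft_links} & {frozenset(l) for l in lsdb_up_links}
    adj = {}
    for link in usable:
        if len(link) != 2:  # ignore malformed self-links
            continue
        a, b = tuple(link)
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)

    # Plain breadth-first search from the local system.
    seen = {local}
    queue = deque([local])
    while queue:
        node = queue.popleft()
        if node == neighbor:
            return True
        for nxt in adj.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False
```

This check can only be as current as the LSDB itself: a link that has failed but whose failure has not yet been flooded to the local system still counts as up.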

It normally takes a long time, often more than ten minutes, for the neighbor's LSP/LSA to age out and be removed from the LSDB, even when the neighbor is physically disconnected.
How can you decide quickly, within tens of milliseconds, that the flooding topology is disconnected?

>>> There is an issue here that we have not yet resolved, which is the rate that new links should be
>>> temporarily added to the flooding topology.  Some believe that adding any new link is the
>>> correct thing to do as it minimizes the recovery time. Others feel that enabling too many links
>>> could cause a flooding collapse, so link addition should be highly constrained. We are still
>>> discussing this and invite the WG’s opinions.

>>The issue is resolved by the solutions in draft-cc-lsr-flooding-reduction.
>>One solution is below, where the given distance can be adjusted/configured.
>>If we want every node to flood on all its links, we set the given
>>distance to a large number. If we want the nodes within 2 hops of a failure
>>to flood on all their links, we set the given distance to 2.
>>   “In one way, when two or more failures on the current flooding
>>   topology occur almost in the same time, each of the nodes within a
>>   given distance (such as 3 hops) to a failure point, floods the link
>>   state (LS) that it receives to all the links (except for the one from
>>   which the LS is received) until a new flooding topology is built.”
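
The quoted rule can be illustrated with a hop-limited breadth-first search. The sketch below is an assumption-laden illustration, not the procedure from draft-cc-lsr-flooding-reduction: it measures distance on the full topology (the quoted text does not say which graph is used), and nodes_within_distance and flooding_links are hypothetical helper names.

```python
from collections import deque


def nodes_within_distance(adj, failure_points, max_hops):
    """Return the set of nodes at most max_hops from any failure point.

    adj maps each node to an iterable of its neighbors.
    """
    dist = {n: 0 for n in failure_points}
    queue = deque(failure_points)
    while queue:
        node = queue.popleft()
        if dist[node] == max_hops:
            continue  # do not expand beyond the configured distance
        for nxt in adj.get(node, ()):
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return set(dist)


def flooding_links(adj, node, received_from):
    """Neighbors a selected node floods to: all links except the receiving one."""
    return sorted(n for n in adj.get(node, ()) if n != received_from)
```

A large max_hops selects every node, which corresponds to full flooding; max_hops = 2 limits it to nodes within 2 hops of a failure.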


>As we have discussed, this is not a solution. In fact, this is more dangerous than anything else that has been proposed and
>seems highly likely to trigger a cascade failure. You are enabling full flooding for many nodes.  In dense topologies, even
>a radius of 3 is very high.  For example, in a LS topology, a radius of 3 is sufficient to enable full flooding throughout the
>entire topology. If that were stable, we would not need Dynamic Flooding at all.
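
The radius argument can be checked on a small example. The sketch below assumes that "LS topology" means a two-tier leaf-spine fabric in which every leaf connects to every spine; the sizes (4 spines, 16 leaves) and the helper names are arbitrary choices for illustration.

```python
from collections import deque


def leaf_spine(num_spines, num_leaves):
    """Build a two-tier fabric: every leaf is connected to every spine."""
    spines = [f"spine{i}" for i in range(num_spines)]
    leaves = [f"leaf{i}" for i in range(num_leaves)]
    adj = {s: set(leaves) for s in spines}
    adj.update({l: set(spines) for l in leaves})
    return adj


def hops_from(adj, start):
    """Hop count from start to every reachable node (breadth-first search)."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in adj[node]:
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist


adj = leaf_spine(4, 16)
# Largest hop count between any two nodes in the fabric.
worst = max(max(hops_from(adj, n).values()) for n in adj)
print(worst)
```

Under these assumptions no node is more than 2 hops from any other, so a distance of 3 around any failure point selects every node in the fabric.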

This full flooding is enabled only for a very short time.
Why do you consider this more dangerous than anything else, and highly likely to trigger a cascade failure? Can you explain in detail?

>>Another solution is just adding minimum links temporarily on the flooding
>>topology to repair the split flooding topology until a new flooding topology
>>is built.

>Agreed.  Which links constitute the minimum?  In a general topology, with arbitrary failures that are not distributed globally,
>how do we make a distributed decision about which links to enable? This is the problem that we are trying to solve. And
>we have no oracle to tell us The Right Answer.

We can discuss this after the first method is discussed.

Best Regards,
Huaimo

>Regards,
>Tony
