[Lsr] Multiple failures in Dynamic Flooding

tony.li@tony.li Wed, 06 March 2019 15:45 UTC

Sender: Tony Li <tony1athome@gmail.com>
From: tony.li@tony.li
Date: Wed, 06 Mar 2019 07:45:17 -0800
In-Reply-To: <5316A0AB3C851246A7CA5758973207D463B66FE9@sjceml521-mbx.china.huawei.com>
Cc: Christian Hopps <chopps@chopps.org>, "lsr@ietf.org" <lsr@ietf.org>, "lsr-chairs@ietf.org" <lsr-chairs@ietf.org>, "lsr-ads@ietf.org" <lsr-ads@ietf.org>
To: Huaimo Chen <huaimo.chen@huawei.com>
Archived-At: <https://mailarchive.ietf.org/arch/msg/lsr/2Q6z44DOkZLTiiEydHXaGlwz6UY>

Hi Huaimo,


> > I’m sorry that you don’t find it useful. Determining the split is trivial: when you receive an IIH,
> > it has a system ID of another system in it. If that other system is not currently part of the
> > flooding topology, then it is quite clear that it is disconnected from the flooding topology.
> > Repairing the split is done by enabling temporary flooding on the new link.
>  
> When an adjacency between two nodes is up, the Hello packets exchanged between them will not change the node/system IDs in them.
> How do you determine that the other system is not currently part of the flooding topology?


The IIH includes the system ID.  See ISO 10589 v2, section 9.7, field “source Id”.  The local system will have
a copy of the flooding topology (FT) and can easily see if the neighbor was present as of the last FT computation.
If not, then it should be added (modulo rate limiting). The local system can also examine its own LSDB.  If there
is no LSP for the neighbor, then it would seem highly likely that there is a disconnect and the neighbor should
again be added (modulo rate limiting).
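
As a rough illustration of the two checks just described, here is a minimal Python sketch. The names (LocalState, needs_temporary_flooding) and the use of plain sets of system IDs are assumptions made for the example, not anything defined by the draft or by an implementation. Rate limiting is deliberately left out.

```python
from dataclasses import dataclass, field


@dataclass
class LocalState:
    """Hypothetical view of what the local system knows (illustrative only)."""
    # System IDs present in the flooding topology as of the last FT computation.
    flooding_topology: set = field(default_factory=set)
    # System IDs for which the local LSDB currently holds an LSP.
    lsdb: set = field(default_factory=set)


def needs_temporary_flooding(state, neighbor_id):
    """Return True if the source ID seen in an IIH suggests a disconnected neighbor."""
    if neighbor_id not in state.flooding_topology:
        return True   # neighbor was absent as of the last FT computation
    if neighbor_id not in state.lsdb:
        return True   # no LSP for the neighbor: a disconnect seems likely
    return False


state = LocalState(flooding_topology={"A", "B"}, lsdb={"A"})
print(needs_temporary_flooding(state, "C"))
```

A real system would apply its rate limit before acting on a True result.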

We are not requiring it, but a system could also do a more extensive computation and check the links between itself and the neighbor
by tracing the path in the FT and then confirming that each link on that path is up in the LSDB.
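
One way to approximate that more extensive check is a breadth-first search restricted to FT links that the LSDB still reports as up. This is only a sketch: the function name, the representation of links as pairs of system IDs, and the assumption that there are no self-loops are all choices made for the example.

```python
from collections import deque


def reachable_over_ft(ft_links, up_links, local_id, neighbor_id):
    """BFS from local_id using only FT links that are also up in the LSDB.

    ft_links and up_links are iterables of (system_id, system_id) pairs;
    links are treated as undirected and self-loops are assumed absent.
    """
    usable = {frozenset(l) for l in ft_links} & {frozenset(l) for l in up_links}
    adjacency = {}
    for link in usable:
        a, b = tuple(link)
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)

    seen = {local_id}
    queue = deque([local_id])
    while queue:
        node = queue.popleft()
        if node == neighbor_id:
            return True
        for nxt in adjacency.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False


ft = [("A", "B"), ("B", "C"), ("C", "D")]
up = [("A", "B"), ("B", "C"), ("A", "D")]   # C-D is down; A-D is up but not in the FT
print(reachable_over_ft(ft, up, "A", "D"))
```

If the search fails, the neighbor is not reachable over the current FT and is a candidate for temporary flooding.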


> > There is an issue here that we have not yet resolved, which is the rate that new links should be
> > temporarily added to the flooding topology.  Some believe that adding any new link is the
> > correct thing to do as it minimizes the recovery time. Others feel that enabling too many links
> > could cause a flooding collapse, so link addition should be highly constrained. We are still
> > discussing this and invite the WG’s opinions.
>  
> The issue is resolved by the solutions in draft-cc-lsr-flooding-reduction.
> One solution is below, where the given distance can be adjusted/configured.
> If we want every node to flood on all its links, we let the given
> distance to a big number. If we want the nodes within 2 hops to a failure
> to flood on all their links, we set the given distance to 2.
>    “In one way, when two or more failures on the current flooding
>    topology occur almost in the same time, each of the nodes within a
>    given distance (such as 3 hops) to a failure point, floods the link
>    state (LS) that it receives to all the links (except for the one from
>    which the LS is received) until a new flooding topology is built.”


As we have discussed, this is not a solution. In fact, this is more dangerous than anything else that has been proposed and
seems highly likely to trigger a cascade failure. You are enabling full flooding for many nodes.  In dense topologies, even
a radius of 3 is very high.  For example, in a leaf-spine (LS) topology, a radius of 3 is sufficient to enable full flooding throughout the
entire topology. If that were stable, we would not need Dynamic Flooding at all.
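
To make the radius argument concrete, the following sketch counts how many nodes fall within 3 hops of a failure point. It assumes a two-tier leaf-spine fabric in which every leaf connects to every spine; the sizes (4 spines, 32 leaves) and all function names are made up for the illustration.

```python
from collections import deque


def leaf_spine(num_spines, num_leaves):
    """Build an adjacency map for a two-tier fabric: every leaf connects to every spine."""
    adj = {}
    for s in range(num_spines):
        for l in range(num_leaves):
            spine, leaf = f"spine{s}", f"leaf{l}"
            adj.setdefault(spine, set()).add(leaf)
            adj.setdefault(leaf, set()).add(spine)
    return adj


def within_radius(adj, start, radius):
    """Return the set of nodes at most `radius` hops from `start` (BFS)."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if dist[node] == radius:
            continue  # do not expand beyond the radius
        for nxt in adj[node]:
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return set(dist)


adj = leaf_spine(num_spines=4, num_leaves=32)
covered = within_radius(adj, "leaf0", radius=3)
print(len(covered), "of", len(adj), "nodes would flood on all links")
```

In such a fabric every node is at most two hops from any other, so a radius of 3 around any failure point covers the whole graph.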


> Another solution is just adding minimum links temporarily on the flooding
> topology to repair the split flooding topology until a new flooding topology
> is built.


Agreed.  Which links constitute the minimum?  In a general topology, with arbitrary failures that are not distributed globally,
how do we make a distributed decision about which links to enable? This is the problem that we are trying to solve. And
we have no oracle to tell us The Right Answer.


> The link can be enabled for “temporary flooding” by the node without using any TLV or Hello with the TLV.


There are cases where it is far easier for the neighbor to realize that it is disconnected than for the local system to realize
that the neighbor is disconnected.  Thus, it is easier to allow one system to request temporary addition. 


> The TLV in the Hello packet just requests adding “temporary flooding” on the link. The other information is accessed by the node locally. The TLV in the Hello packet does not help in corner cases. In the case where a node is rebooted, a new link attached to a new node may apply.


If the node that rebooted has 1000 interfaces, which interfaces should be temporarily added?  Adding all of them is likely to trigger a cascade failure.  The TLV allows us to signal which ones should be enabled.


> > All adjacencies are a single hop in both IS-IS and OSPF.  Yes, Hello packets may be lost.
> > Fortunately, they are periodically transmitted, thus the next transmission will also contain the
> > TLV.  If IIHs are getting lost at a significant rate, then the adjacency will not (and should not)
> > come up.  Thus, the request for temporary flooding will propagate to the neighbor in all cases
> > that matter.
>  
> It takes too long when a Hello packet is lost. Repairing a split flooding topology needs to be fast.


Fortunately, lost hello packets are a relatively rare occurrence.  While repairing the flooding topology needs to be done expediently, attempting to do so and triggering a cascade failure of the network is counter-productive. Given this alternative, a bit of extra delay when adding a new system to the network or trying to recover from multiple failures seems wise. Rushing and making things worse does not.  The first priority must remain network stability.


> 
> It does not mean that a user/operator configures/selects an area leader. It means that a user/operator configures other things, such as indicating an algorithm or selecting the centralized mode on the area leader.


In an implementation, centralized mode and algorithm selection can be the defaults.  In fact, in our implementation, the only required configuration is to enable dynamic flooding. Everything else is automatic.



Regards,
Tony
