Re: [Lsr] Multiple failures in Dynamic Flooding

Peter Psenak <ppsenak@cisco.com> Mon, 11 March 2019 17:21 UTC

To: Huaimo Chen <huaimo.chen@huawei.com>, "tony.li@tony.li" <tony.li@tony.li>
References: <sa6lg2md2ok.fsf@chopps.org> <SN6PR11MB284553735B2351FB584BE792C17F0@SN6PR11MB2845.namprd11.prod.outlook.com> <5316A0AB3C851246A7CA5758973207D463B5858A@sjceml521-mbx.china.huawei.com> <420ed1b5-d849-99cc-bcb0-d159783e4de2@cisco.com> <5316A0AB3C851246A7CA5758973207D463B59041@sjceml521-mbx.china.huawei.com> <0B4DF2AC-8EE1-41CA-B357-98325067CA30@gmail.com> <5316A0AB3C851246A7CA5758973207D463B66FE9@sjceml521-mbx.china.huawei.com> <78A866F4-9AF0-481A-9DEC-B04DE72AFDA3@tony.li> <5316A0AB3C851246A7CA5758973207D463B76FDD@sjceml521-mbx.china.huawei.com>
Cc: "lsr@ietf.org" <lsr@ietf.org>, "lsr-ads@ietf.org" <lsr-ads@ietf.org>, "lsr-chairs@ietf.org" <lsr-chairs@ietf.org>
From: Peter Psenak <ppsenak@cisco.com>
Message-ID: <f51300a0-baa6-6ff7-9fbd-df7cbd568550@cisco.com>
Date: Mon, 11 Mar 2019 18:21:25 +0100
In-Reply-To: <5316A0AB3C851246A7CA5758973207D463B76FDD@sjceml521-mbx.china.huawei.com>
Archived-At: <https://mailarchive.ietf.org/arch/msg/lsr/455jm4S7K0Eyf_RkIOyGsO1NeSE>
Subject: Re: [Lsr] Multiple failures in Dynamic Flooding

Hi Huaimo,

On 11/03/2019 18:08 , Huaimo Chen wrote:
> Hi Tony,
>
> In summary, for multiple failures, two issues in
> draft-li-lsr-dynamic-flooding have been discussed:
>
> 1) how to determine whether the current flooding topology is split; and

There is no need to do that. The recovery mechanism will repair the
split topology if there is a way to do that.

>
> 2) how to repair/connect the flooding topology split.

6.7.11.  Recovery from Multiple Failures


    "The nodes that remain active on the edges
    of the flooding topology partitions will recognize this and will try
    to repair the flooding topology locally by enabling temporary
    flooding towards the nodes that they consider disconnected from the
    flooding topology until a new flooding topology becomes connected
    again."
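
The local repair described in this section can be illustrated with a small sketch. It is not the draft's algorithm: it assumes a node knows the current flooding topology (FT) as a list of links, learns from its LSDB which of those links are down, and treats any neighbor it cannot reach over the surviving FT links as disconnected. The function and parameter names are hypothetical.

```python
from collections import deque


def neighbors_needing_temporary_flooding(self_id, ft_links, down_links, adjacencies):
    """Return neighbors that look disconnected from our flooding-topology partition.

    Hypothetical helper, not the draft's algorithm:
      self_id      -- local system ID
      ft_links     -- (a, b) pairs forming the current flooding topology (FT)
      down_links   -- frozenset({a, b}) entries for links the LSDB reports down
      adjacencies  -- system IDs of neighbors with an Up adjacency
    """
    # Build an adjacency map from the FT links that are still usable.
    graph = {}
    for a, b in ft_links:
        if frozenset((a, b)) in down_links:
            continue
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)

    # Breadth-first search from the local node over the surviving FT links.
    reached = {self_id}
    queue = deque([self_id])
    while queue:
        node = queue.popleft()
        for nbr in graph.get(node, ()):
            if nbr not in reached:
                reached.add(nbr)
                queue.append(nbr)

    # Neighbors outside our partition get temporary flooding toward them.
    return sorted(n for n in adjacencies if n not in reached)


# FT is the chain A-B-C-D; the physical link A-D is not on the FT.
ft = [("A", "B"), ("B", "C"), ("C", "D")]
down = {frozenset(("B", "C"))}
print(neighbors_needing_temporary_flooding("A", ft, down, {"B", "D"}))  # ['D']
```

Here node A, on the edge of one partition, would enable temporary flooding only on its A-D link, and D reaches the symmetric conclusion from its side. Each node decides from local state alone, so nothing has to be sent multi-hop.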
>
> For the first issue, the discussions are still going on.
>
> For the second issue, repairing/connecting the flooding topology split
> through Hello protocol extensions does not work. When a “backup
> path”/connection of multiple hops is needed to repair the flooding
> topology split, Hello cannot go beyond one hop and thus cannot repair
> the flooding topology split in this case.

There is no need to send anything multi-hop.

thanks,
Peter

>
>> *From:* Tony Li [mailto:tony1athome@gmail.com] *On Behalf Of* tony.li@tony.li
>> *Sent:* Wednesday, March 6, 2019 10:45 AM
>> *To:* Huaimo Chen <huaimo.chen@huawei.com>
>> *Cc:* Christian Hopps <chopps@chopps.org>; lsr@ietf.org; lsr-chairs@ietf.org; lsr-ads@ietf.org
>> *Subject:* Multiple failures in Dynamic Flooding
>>
>> Hi Huaimo,
>>
>>>> I’m sorry that you don’t find it useful. Determining the split is
>>>> trivial: when you receive an IIH, it carries the system ID of the
>>>> other system. If that other system is not currently part of the
>>>> flooding topology, then it is quite clear that it is disconnected
>>>> from the flooding topology.
>>>>
>>>> Repairing the split is done by enabling temporary flooding on the new link.
>
>>> While an adjacency between two nodes is up, the node/system IDs in the
>>> Hello packets exchanged between them do not change.
>>>
>>> How do you determine that the other system is not currently part of the
>>> flooding topology?
>
>> The IIH includes the system ID. See ISO 10589 v2, section 9.7, field
>> “source Id”. The local system will have a copy of the flooding topology
>> and can easily see if the neighbor was present as of the last FT
>> computation. If not, then it should be added (modulo rate limiting).
>> The local system can also examine its own LSDB. If there is no LSP for
>> the neighbor, then it would seem highly likely that there is a
>> disconnect and the neighbor should again be added (modulo rate limiting).
>>
>> We are not requiring it, but a system could also do a more extensive
>> computation and compare the links between itself and the neighbor by
>> tracing the path in the FT and then confirming that each link is up in
>> the LSDB.
>
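
The two checks Tony describes (the neighbor's system ID from the IIH is missing from the last FT computation, or the LSDB holds no LSP for it) reduce to a simple predicate. A minimal sketch, assuming the LSDB is modeled as a dict keyed by system ID and leaving rate limiting out; the function name is hypothetical.

```python
def neighbor_needs_temporary_flooding(neighbor_id, ft_nodes, lsdb):
    """Decide whether a neighbor seen in an IIH should be (re)added to the FT.

    Hypothetical helper. neighbor_id is the IIH "source Id"; ft_nodes is the
    set of system IDs in the last flooding-topology computation; lsdb maps
    system IDs to their LSPs. Rate limiting is intentionally not modeled.
    """
    if neighbor_id not in ft_nodes:
        # Neighbor was not present as of the last FT computation.
        return True
    if neighbor_id not in lsdb:
        # No LSP for the neighbor: a disconnect is highly likely.
        return True
    return False


ft_nodes = {"R1", "R2", "R3"}
lsdb = {"R1": "lsp-R1", "R2": "lsp-R2"}
print(neighbor_needs_temporary_flooding("R3", ft_nodes, lsdb))  # True
```

The second check only fires once the neighbor's LSP has actually left the LSDB, which is why LSP aging time matters for how quickly it can react.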
>
> It normally takes a long time, such as more than ten minutes, to age out
> and remove an LSP/LSA for the neighbor from the LSDB, even though the
> neighbor is physically disconnected.
>
> How can you decide quickly, within tens of milliseconds, that the
> flooding topology is disconnected?
>
>>>> There is an issue here that we have not yet resolved, which is the rate
>>>> at which new links should be temporarily added to the flooding topology.
>>>> Some believe that adding any new link is the correct thing to do as it
>>>> minimizes the recovery time. Others feel that enabling too many links
>>>> could cause a flooding collapse, so link addition should be highly
>>>> constrained. We are still discussing this and invite the WG’s opinions.
>
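
Tony's reply qualifies link additions with "modulo rate limiting" but leaves the rate open. Purely to illustrate one way of bounding how quickly temporary links are added, here is a token bucket; the choice of a token bucket and the rate/burst parameters are assumptions, not something either draft specifies.

```python
class TemporaryLinkLimiter:
    """Token bucket bounding how fast links are temporarily added to the FT.

    Hypothetical: the token-bucket approach and the rate/burst values are
    illustrative assumptions, not taken from either draft.
    """

    def __init__(self, rate_per_sec, burst):
        self.rate = float(rate_per_sec)   # tokens added per second
        self.capacity = float(burst)      # maximum tokens (burst size)
        self.tokens = float(burst)        # start with a full bucket
        self.last = None                  # time of the previous call

    def allow(self, now):
        """Return True if one more temporary link may be enabled at time `now`."""
        if self.last is not None:
            elapsed = max(0.0, now - self.last)
            self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


limiter = TemporaryLinkLimiter(rate_per_sec=2, burst=3)
print([limiter.allow(0.0) for _ in range(4)])  # [True, True, True, False]
print(limiter.allow(0.5))                      # True
```

A small burst lets a node repair a split quickly, while the sustained rate caps how many links can be added during a storm of failures; where to set those two knobs is the open question raised here.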
>
>>> The issue is resolved by the solutions in draft-cc-lsr-flooding-reduction.
>>> One solution is below, where the given distance can be adjusted/configured.
>>> If we want every node to flood on all its links, we set the given
>>> distance to a large number. If we want the nodes within 2 hops of a
>>> failure to flood on all their links, we set the given distance to 2.
>>>
>>>    “In one way, when two or more failures on the current flooding
>>>    topology occur almost in the same time, each of the nodes within a
>>>    given distance (such as 3 hops) to a failure point, floods the link
>>>    state (LS) that it receives to all the links (except for the one from
>>>    which the LS is received) until a new flooding topology is built.”
>
>> As we have discussed, this is not a solution. In fact, this is more
>> dangerous than anything else that has been proposed and seems highly
>> likely to trigger a cascade failure. You are enabling full flooding for
>> many nodes. In dense topologies, even a radius of 3 is very high. For
>> example, in a leaf-spine (LS) topology, a radius of 3 is sufficient to
>> enable full flooding throughout the entire topology. If that were
>> stable, we would not need Dynamic Flooding at all.
>
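
To make the radius argument concrete, the sketch below computes which nodes would switch to full flooding under the distance-based scheme quoted above, using a multi-source breadth-first search from the failure points. It assumes a toy full-mesh leaf-spine fabric (every leaf connected to every spine) with a single failed spine-leaf link; the topology builder and all names are hypothetical.

```python
from collections import deque


def nodes_within_distance(graph, failure_points, max_hops):
    """Multi-source BFS: nodes at most max_hops from any failure point.

    graph maps each node to the set of its neighbors in the surviving
    topology. Under the quoted scheme, every returned node would flood
    received LS on all links (except the incoming one) until a new
    flooding topology is built.
    """
    dist = {p: 0 for p in failure_points}
    queue = deque(dist)
    while queue:
        node = queue.popleft()
        if dist[node] == max_hops:
            continue  # do not expand beyond the configured radius
        for nbr in graph.get(node, ()):
            if nbr not in dist:
                dist[nbr] = dist[node] + 1
                queue.append(nbr)
    return set(dist)


def leaf_spine(num_spines, num_leaves):
    """Hypothetical full-mesh leaf-spine fabric: each leaf links to each spine."""
    graph = {}
    for i in range(num_spines):
        for j in range(num_leaves):
            spine, leaf = f"S{i}", f"L{j}"
            graph.setdefault(spine, set()).add(leaf)
            graph.setdefault(leaf, set()).add(spine)
    return graph


# Example: fail the S0-L0 link and compute the full-flooding set.
fabric = leaf_spine(4, 32)
fabric["S0"].discard("L0")
fabric["L0"].discard("S0")
for radius in (0, 1, 3):
    print(radius, len(nodes_within_distance(fabric, ["S0", "L0"], radius)))
```

With 4 spines and 32 leaves, even a radius of 1 around the failed S0-L0 link already reaches all 36 nodes, so in this toy fabric a radius of 3 amounts to full flooding everywhere.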
>
> This full flooding is enabled only for a very short time.
>
> How do you conclude that this is more dangerous than anything else and
> highly likely to trigger a cascade failure? Can you explain in detail?
>
>>> Another solution is to temporarily add a minimum set of links to the
>>> flooding topology to repair the split until a new flooding topology is
>>> built.
>
>> Agreed. Which links constitute the minimum? In a general topology, with
>> arbitrary failures that are not distributed globally, how do we make a
>> distributed decision about which links to enable? This is the problem
>> that we are trying to solve. And we have no oracle to tell us The Right
>> Answer.
>
> We can discuss this after the first method is discussed.
>
> Best Regards,
>
> Huaimo
>
>> Regards,
>>
>> Tony
>
> _______________________________________________
> Lsr mailing list
> Lsr@ietf.org
> https://www.ietf.org/mailman/listinfo/lsr
>