[Lsr] Multiple failures in Dynamic Flooding

tony.li@tony.li Wed, 06 March 2019 15:45 UTC

Sender: Tony Li <tony1athome@gmail.com>
From: tony.li@tony.li
Message-Id: <78A866F4-9AF0-481A-9DEC-B04DE72AFDA3@tony.li>
Date: Wed, 06 Mar 2019 07:45:17 -0800
In-Reply-To: <5316A0AB3C851246A7CA5758973207D463B66FE9@sjceml521-mbx.china.huawei.com>
Cc: Christian Hopps <chopps@chopps.org>, "lsr@ietf.org" <lsr@ietf.org>, "lsr-chairs@ietf.org" <lsr-chairs@ietf.org>, "lsr-ads@ietf.org" <lsr-ads@ietf.org>
To: Huaimo Chen <huaimo.chen@huawei.com>
References: <sa6lg2md2ok.fsf@chopps.org> <SN6PR11MB284553735B2351FB584BE792C17F0@SN6PR11MB2845.namprd11.prod.outlook.com> <5316A0AB3C851246A7CA5758973207D463B5858A@sjceml521-mbx.china.huawei.com> <420ed1b5-d849-99cc-bcb0-d159783e4de2@cisco.com> <5316A0AB3C851246A7CA5758973207D463B59041@sjceml521-mbx.china.huawei.com> <0B4DF2AC-8EE1-41CA-B357-98325067CA30@gmail.com> <5316A0AB3C851246A7CA5758973207D463B66FE9@sjceml521-mbx.china.huawei.com>
Archived-At: <https://mailarchive.ietf.org/arch/msg/lsr/2Q6z44DOkZLTiiEydHXaGlwz6UY>
Subject: [Lsr] Multiple failures in Dynamic Flooding

Hi Huaimo,


> > I’m sorry that you don’t find it useful. Determining the split is trivial: when you receive an IIH,
> > it has a system ID of another system in it. If that other system is not currently part of the
> > flooding topology, then it is quite clear that it is disconnected from the flooding topology.
> > Repairing the split is done by enabling temporary flooding on the new link.
>  
> When an adjacency between two nodes is up, the Hello packets exchanged between them will not change the node/system IDs in them.
> How do you determine that other system is not currently part of the flooding topology?


The IIH includes the system ID.  See ISO 10589 v2, section 9.7, field “source Id”.  The local system will have
a copy of the flooding topology and can easily see whether the neighbor was present as of the last FT computation.  If not, then it should be
added (modulo rate limiting). The local system can also examine its own LSDB.  If there is no LSP for the neighbor, then it would seem
highly likely that there is a disconnect and the neighbor should again be added (modulo rate limiting).
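
To make the check concrete, here is a minimal sketch of the decision in Python. It is only an illustration, not text from the draft or from any implementation: the class name, the field names, and the one-second rate limit are all hypothetical, and the FT and LSDB are reduced to plain sets of system IDs.

```python
from dataclasses import dataclass
from typing import Set


@dataclass
class LocalSystem:
    # System IDs present in the flooding topology (FT) at the last computation.
    ft_members: Set[str]
    # System IDs for which the local LSDB currently holds an LSP.
    lsdb_systems: Set[str]
    # Hypothetical rate limit: minimum seconds between temporary additions.
    min_interval: float = 1.0
    last_addition: float = float("-inf")

    def should_add_temporarily(self, iih_source_id: str, now: float) -> bool:
        """Decide whether the link to the IIH sender gets temporary flooding."""
        in_ft = iih_source_id in self.ft_members
        has_lsp = iih_source_id in self.lsdb_systems
        if in_ft and has_lsp:
            return False  # neighbor appears connected; nothing to repair
        if now - self.last_addition < self.min_interval:
            return False  # disconnected, but held back by the rate limit
        self.last_addition = now
        return True


node = LocalSystem(ft_members={"A", "B"}, lsdb_systems={"A", "B"})
print(node.should_add_temporarily("C", now=0.0))
```

The rate limit here is deliberately crude. How aggressively links should be added is exactly the open question discussed further down.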

We are not requiring it, but a system could also do a more extensive computation: trace the path between itself and the neighbor
in the FT and confirm that each link on that path is up in the LSDB.
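
One possible shape for that more extensive check is a breadth-first search over the FT that only follows links the LSDB reports as up. This is a sketch under assumptions: the FT is modeled as an adjacency dict, the LSDB as a set of up links, and the function name is made up for the example.

```python
from collections import deque


def reachable_over_ft(ft, up_links, source, target):
    """Return True if target is reachable from source using only FT links
    that the LSDB reports as up."""
    seen = {source}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            return True
        for peer in ft.get(node, ()):
            # An FT link only counts if the LSDB also shows it as up.
            if peer not in seen and frozenset((node, peer)) in up_links:
                seen.add(peer)
                queue.append(peer)
    return False


ft = {"A": ["B"], "B": ["A", "C"], "C": ["B"]}
up = {frozenset(("A", "B"))}  # the B-C link has failed
print(reachable_over_ft(ft, up, "A", "C"))
```

If the search fails, the neighbor is cut off from the local system on the FT and is a candidate for temporary flooding.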


> > There is an issue here that we have not yet resolved, which is the rate that new links should be
> > temporarily added to the flooding topology.  Some believe that adding any new link is the
> > correct thing to do as it minimizes the recovery time. Others feel that enabling too many links
> > could cause a flooding collapse, so link addition should be highly constrained. We are still
> > discussing this and invite the WG’s opinions.
>  
> The issue is resolved by the solutions in draft-cc-lsr-flooding-reduction.
> One solution is below, where the given distance can be adjusted/configured.
> If we want every node to flood on all its links, we let the given
> distance to a big number. If we want the nodes within 2 hops to a failure
> to flood on all their links, we set the given distance to 2.
>    “In one way, when two or more failures on the current flooding
>    topology occur almost in the same time, each of the nodes within a
>    given distance (such as 3 hops) to a failure point, floods the link
>    state (LS) that it receives to all the links (except for the one from
>    which the LS is received) until a new flooding topology is built.”


As we have discussed, this is not a solution. In fact, this is more dangerous than anything else that has been proposed and
seems highly likely to trigger a cascade failure. You are enabling full flooding for many nodes.  In dense topologies, even
a radius of 3 is very high.  For example, in a leaf-spine topology, a radius of 3 is sufficient to enable full flooding throughout the
entire topology. If that were stable, we would not need Dynamic Flooding at all.
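
The radius claim is easy to check with a small script. The sketch below assumes a two-tier leaf-spine fabric in which every leaf connects to every spine; the sizes (4 spines, 32 leaves) and the helper names are arbitrary choices for illustration.

```python
from collections import deque


def leaf_spine(spines, leaves):
    """Build a two-tier fabric: every leaf is connected to every spine."""
    spine_ids = [f"S{i}" for i in range(spines)]
    leaf_ids = [f"L{i}" for i in range(leaves)]
    adj = {s: set(leaf_ids) for s in spine_ids}
    adj.update({l: set(spine_ids) for l in leaf_ids})
    return adj


def within_radius(adj, start, radius):
    """Return all nodes at most `radius` hops from `start` (BFS)."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if dist[node] == radius:
            continue  # do not expand beyond the radius
        for peer in adj[node]:
            if peer not in dist:
                dist[peer] = dist[node] + 1
                queue.append(peer)
    return set(dist)


adj = leaf_spine(spines=4, leaves=32)
covered = within_radius(adj, "L0", 3)
print(len(covered), "of", len(adj), "nodes flood on all links")
```

In such a fabric any two nodes are at most two hops apart, so a failure adjacent to any node puts the whole topology inside the radius.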


> Another solution is just adding minimum links temporarily on the flooding
> topology to repair the split flooding topology until a new flooding topology
> is built.


Agreed.  Which links constitute the minimum?  In a general topology, with arbitrary failures that are not distributed globally,
how do we make a distributed decision about which links to enable? This is the problem that we are trying to solve. And
we have no oracle to tell us The Right Answer.


> The link can be enabled for “temporary flooding” by the node without using any TLV or Hello with the TLV.


There are cases where it is far easier for the neighbor to realize that it is disconnected than for the local system to realize
that the neighbor is disconnected.  Thus, it is easier to allow one system to request temporary addition. 


> The TLV in Hello packet just requests for adding “temporary flooding” on the link. The other information is accessed by the node locally. The TLV in Hello packet does not help for corner case. In the case where a node is rebooted, a new link attached to a new node may apply.


If the node that rebooted has 1000 interfaces, which interfaces should be temporarily added?  Adding all of them is likely to trigger a cascade failure.  The TLV allows us to signal which ones should be enabled.


> >All adjacencies are a single hop in both IS-IS and OSPF.  Yes, Hello packets may be lost.
> >Fortunately, they are periodically transmitted, thus the next transmission will also contain the
> > TLV.  If IIH’s are getting lost at a significant rate, then the adjacency will not (and should not)
> >come up.  Thus, the request for temporary flooding will propagate to the neighbor in all cases
> >that matter.
>  
> It takes too long when Hello packet is lost. Repairing split flooding topology needs to be fast.


Fortunately, lost hello packets are a relatively rare occurrence.  While repairing the flooding topology needs to be done expediently, attempting to do so and triggering a cascade failure of the network is counter-productive. Given this alternative, a bit of extra delay when adding a new system to the network or when trying to recover from multiple failures seems wise. Rushing and making things worse does not.  The first
priority must remain network stability.
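
For a rough sense of scale, the cost of a lost hello can be estimated if we assume that each hello is lost independently with the same probability. That assumption, the loss rate, and the hello interval below are illustrative only, not measurements.

```python
def prob_undelivered(loss_rate, hellos_sent):
    """Probability that every one of `hellos_sent` hellos was lost."""
    return loss_rate ** hellos_sent


def expected_extra_delay(loss_rate, hello_interval):
    """Mean delay added by retransmission: each loss costs one hello
    interval, and the mean number of losses before the first success
    is p / (1 - p) for independent losses."""
    return hello_interval * loss_rate / (1.0 - loss_rate)


# Illustrative values: 1% loss, 10-second hello interval.
print(prob_undelivered(0.01, 3))
print(expected_extra_delay(0.01, 10.0))
```

With numbers in this range the request almost always arrives on the first or second hello, which is the sense in which the extra delay is small.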


> 
> It does not mean that a user/operator configures/selects an area leader. It means that a user/operator configures other things such as indicating an algorithm or selecting the centralized mode on the area leader.


In an implementation, centralized mode and algorithm selection can be the defaults.  In fact, in our implementation, the only required configuration is to enable dynamic flooding. Everything else is automatic.



Regards,
Tony