Re: [Lsr] Comments regarding convergence regarding draft-cc-lsr-flooding-reduction and draft-li-lsr-dynamic-flooding

Huaimo Chen <huaimo.chen@huawei.com> Fri, 01 February 2019 15:54 UTC

Return-Path: <huaimo.chen@huawei.com>
X-Original-To: lsr@ietfa.amsl.com
Delivered-To: lsr@ietfa.amsl.com
Received: from localhost (localhost [127.0.0.1]) by ietfa.amsl.com (Postfix) with ESMTP id 35D81130E76; Fri, 1 Feb 2019 07:54:11 -0800 (PST)
X-Virus-Scanned: amavisd-new at amsl.com
X-Spam-Flag: NO
X-Spam-Score: -4.2
X-Spam-Level:
X-Spam-Status: No, score=-4.2 tagged_above=-999 required=5 tests=[BAYES_00=-1.9, HTML_MESSAGE=0.001, RCVD_IN_DNSWL_MED=-2.3, SPF_PASS=-0.001] autolearn=ham autolearn_force=no
Received: from mail.ietf.org ([4.31.198.44]) by localhost (ietfa.amsl.com [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id 805vHGdiWGNg; Fri, 1 Feb 2019 07:54:06 -0800 (PST)
Received: from huawei.com (lhrrgout.huawei.com [185.176.76.210]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by ietfa.amsl.com (Postfix) with ESMTPS id 398FC130E72; Fri, 1 Feb 2019 07:54:05 -0800 (PST)
Received: from lhreml702-cah.china.huawei.com (unknown [172.18.7.107]) by Forcepoint Email with ESMTP id 6AA0CF099B1AC85C2E06; Fri, 1 Feb 2019 15:54:02 +0000 (GMT)
Received: from SJCEML701-CHM.china.huawei.com (10.208.112.40) by lhreml702-cah.china.huawei.com (10.201.108.43) with Microsoft SMTP Server (TLS) id 14.3.408.0; Fri, 1 Feb 2019 15:54:01 +0000
Received: from SJCEML521-MBX.china.huawei.com ([169.254.1.96]) by SJCEML701-CHM.china.huawei.com ([169.254.3.31]) with mapi id 14.03.0415.000; Fri, 1 Feb 2019 07:53:59 -0800
From: Huaimo Chen <huaimo.chen@huawei.com>
To: "tony.li@tony.li" <tony.li@tony.li>
CC: "Van De Velde, Gunter (Nokia - BE/Antwerp)" <gunter.van_de_velde@nokia.com>, "lsr@ietf.org" <lsr@ietf.org>, Yingzhen Qu <yingzhen.ietf@gmail.com>, Christian Hopps <chopps@chopps.org>, "draft-li-lsr-dynamic-flooding@ietf.org" <draft-li-lsr-dynamic-flooding@ietf.org>, Alvaro Retana <aretana.ietf@gmail.com>, "draft-cc-lsr-flooding-reduction@ietf.org" <draft-cc-lsr-flooding-reduction@ietf.org>
Thread-Topic: [Lsr] Comments regarding convergence regarding draft-cc-lsr-flooding-reduction and draft-li-lsr-dynamic-flooding
Thread-Index: AdSz/FEDV1gXhhlyTJuIwqwhTB9x7gDHOM9AAE9heIAAe9tFwA==
Date: Fri, 01 Feb 2019 15:53:58 +0000
Message-ID: <5316A0AB3C851246A7CA5758973207D463B3B5F7@sjceml521-mbx.china.huawei.com>
References: <DB7PR07MB5148B48096A7AD81A385DCFEE09A0@DB7PR07MB5148.eurprd07.prod.outlook.com> <5316A0AB3C851246A7CA5758973207D463B3A32F@sjceml521-mbx.china.huawei.com> <0FE21364-D23A-4CFE-8559-A9125EC01E77@tony.li>
In-Reply-To: <0FE21364-D23A-4CFE-8559-A9125EC01E77@tony.li>
Accept-Language: en-US
Content-Language: en-US
X-MS-Has-Attach:
X-MS-TNEF-Correlator:
x-originating-ip: [10.212.247.178]
Content-Type: multipart/alternative; boundary="_000_5316A0AB3C851246A7CA5758973207D463B3B5F7sjceml521mbxchi_"
MIME-Version: 1.0
X-CFilter-Loop: Reflected
Archived-At: <https://mailarchive.ietf.org/arch/msg/lsr/1rSnEjewCDnM6LaoyjTP4A6BzxI>
Subject: Re: [Lsr] Comments regarding convergence regarding draft-cc-lsr-flooding-reduction and draft-li-lsr-dynamic-flooding
X-BeenThere: lsr@ietf.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: Link State Routing Working Group <lsr.ietf.org>
List-Unsubscribe: <https://www.ietf.org/mailman/options/lsr>, <mailto:lsr-request@ietf.org?subject=unsubscribe>
List-Archive: <https://mailarchive.ietf.org/arch/browse/lsr/>
List-Post: <mailto:lsr@ietf.org>
List-Help: <mailto:lsr-request@ietf.org?subject=help>
List-Subscribe: <https://www.ietf.org/mailman/listinfo/lsr>, <mailto:lsr-request@ietf.org?subject=subscribe>
X-List-Received-Date: Fri, 01 Feb 2019 15:54:11 -0000

Hi Everyone,



Our answers/explanations to Tony’s comments/questions are in line below with prefix [HC] and in red.



Best Regards,

Huaimo



Hi all,



I wanted to address some of the points that Huaimo has raised.  I’ve taken the editorial liberty of reformatting some of Huaimo’s comments.



> 1.                  Flooding topology encoding

>

> The encoding in "draft-li-lsr-dynamic-flooding" comprises two parts. One part is the encoding of the flooding topology using the TLVs of the paths constituting the flooding topology; the second part is the encoding of the mapping from the nodes in the area to their indexes. The TLVs use the indexes of the nodes in the paths.



> There are two encodings in "draft-cc-lsr-flooding-reduction", each of

> which can be used to encode the flooding topology. One encoding is

> called "Links Encoding". The other is called "Block Encoding”.





Not distributing the encoding of the indices is an issue that needs to be addressed. Without an explicit encoding, all systems are forced to rely on the complete and area-wide synchronization of the entire LSDB. This seems like it is a significant risk. What happens if there is an expired LSP and one system has expunged it from the database but another has not? This would perturb the indices. If a new node has been added to the topology and its presence has not completed flooding, then the indices would also be perturbed.



[HC]  Area-wide synchronization is the pre-requisite for IGP to work correctly by default. When the network has some nodes having their LSDBs out of synchronization, in general, the issue on the indexes is even bigger if the mapping from the nodes to their indexes is generated by the leader of the area using its LSDB and is flooded to all the other nodes.



> The former encodes the links between a local node and its adjacent

> nodes using the compact indexes of these nodes. The whole flooding

> topology is represented by the links and nodes on it. This is like the normal topology representation in an IGP, in which the topology is represented by the links and nodes in an IGP area.  No mapping encoding and flooding is needed. Every node creates and maintains the mapping from the nodes to their indexes. It is simpler and faster for the node to have the mapping in this way.





The use of two different encodings for the topology is highly unusual.



It is not at all clear about how it should be done. Ambiguity in a specification results in differing interpretations, that leads to complexity and interoperability errors. In this specific case, it would also seem to result in redundancy.  It would be better to have a single encoding.



[HC]  At this stage, we proposed two options in our draft. At last, one single encoding will be used.





The use of the “compact index” is not clear.  At 14 bits, it is clearly not large enough to support large areas. While we are generally not exceeding 16k systems today, that may grow. It wasn’t long ago that areas were practically confined to 1k nodes.  That’s exceeded now.



[HC] How did you get 14 bits?

The idea of compact node indexes or say a variable size (or length) of node indexes, is used in our links encoding and block encoding.



For the links from a local node to its adjacent nodes, the size (or length) of the indexes of these nodes (i.e., the local node and the adjacent nodes) is set to the maximum among the sizes of these node indexes. For example, suppose that these nodes have indexes 1, 126, 245, and 511. The sizes of their indexes are 1, 7, 8, and 9 bits respectively. The maximum among 1, 7, 8, and 9 is 9 (bits). This maximum size (i.e., 9) is represented by an element, which is the 4 bits field. This element can be a byte, a word, or even a TLV if you want. After this element, these four node indexes (i.e., 1, 126, 245, and 511) are represented by four 9 bits fields, which contain 1, 126, 245, and 511 respectively.



When a block of flooding topology is encoded in our block encoding, the size (or length) of the indexes of the nodes in this block is set to the maximum among the sizes of these node indexes.



When the value of 4 bits (from 0 to 15) is directly used to indicate the size of node indexes, it is big enough to support 32k (i.e., 2^15) nodes. When the links from a local node to its adjacent nodes are encoded, the size of these node indexes is indicated by the value of this 4 bits, which is the maximum among the sizes of these node indexes.



From section 6.2.1.1 in our draft, the size of node indexes is the value of 4 bits field (called ENSI) plus 8. The range of 4 bits value is from 0 to 15. Thus the range of the size of node indexes is from 8 bits to 23 bits. Any node index from 0 to 2^23 can be represented. When (the value of the 4 bits field plus 8) is used to indicate the size of these node indexes, it is the maximum among the sizes of these node indexes or 8 in case where the maximum is less than 8.





A better way of encoding the indices is to make them variable length, depending on the number of nodes in the area (credit: Tony P.). In our draft, the explicit encoding of the mapping from system to index makes it very clear how many systems there are, and thus how many bits should be used in this encoding.  Please note that this optimization is not yet reflected in our draft.



[HC] We see that your better way is an application of our idea about “variable size (or length) of node indexes”.



Our idea can be used in the encoding in your draft in a few different ways.

In one way, a variable size (or length) of node indexes is used in a Path TLV. For all the nodes in this Path TLV, the size (or length) of the indexes of these nodes is set to the maximum among the sizes of these node indexes. The size (or length) of indexes is represented by an element, which can be a 4 bits field, a byte, a word or even a TLV.

In another way, a variable size (or length) of node indexes is used in an LSP/LSA for all the nodes contained in the Path TLVs in the LSP/LSA.

In a third way, a variable size (or length) of node indexes is used in the mapping for all the nodes in an area. The size (or length) of indexes is represented by an element. After this element, all the node indexes, including those in paths TLVs, are represented by the fields of the size indicated by the element.





> For example, one issue is that different types of messages (i.e., the messages for the mapping, and the messages for the TLVs of the paths in the flooding topology, depending on the mapping) for the flooding topology may be out of order on a receiving router. This may cause incorrect/corrupted flooding topology.





As I’ve explained, the ordering is irrelevant.



[HC] The ordering is critical. The TLVs for paths contain the indexes of the nodes in the paths. The indexes of these nodes are generated by the leader of the area, and the mapping from the nodes to the indexes is created by the leader. We can see that the TLVs (i.e., the indexes of the nodes in the TLVs for paths) depend on the mapping.



The ordering of the TLVs for the mapping from node to index is self describing. Each TLV carries the starting index for the contents of its TLV.



[HC] The ordering that you talked about is not the ordering we discussed. The former is the ordering of the TLVs in the mapping, i.e., the ordering of the same type of the TLVs for the mapping. Latter is ordering of the different types of TLVs. One type of TLVs is the TLVs for the paths containing the indexes of the nodes in the paths. The second type of TLVs is the TLVs for the mapping from nodes to indexes.



The ordering of arrival of various LSP segments is also irrelevant as the computation need only be deferred until all segments have arrived.  This is already commonly done as part of SPF scheduling. This is also a necessity for all proposals, as any computation with inconsistent segments is going to yield dubious results.





> 2.                  Fault tolerance to multiple failures

>

> An efficient method for fault tolerance to multiple failures on the

> flooding topology is very important to centralized flooding reduction.

> Without it, the network convergence will slow down significantly. In general, the convergence time will be more than doubled comparing to normal convergence when the flooding topology is split by multiple failures.





Multiple simultaneous failures are an extremely unlikely event. Routers are not designed to survive multiple concurrent hardware failures. Most topologies run at commercially viable loading levels cannot survive multiple critical link or node failures without manual intervention. What is critical is to survive a single failure efficiently. Our proposal is to compute a biconnected flooding topology. This protects against single failures, retaining a functioning flooding topology in the event of a failure. At that point, the system computes a new bi-connected flooding topology, protecting against further failures.



[HC] A network without flooding reduction deployment may survive multiple failures due to “reliable flooding” as we know that flooding is occurring on every link. After flooding reduction is deployed, if failure occurs on more than one links or/and nodes that are on the flooding topology, the network can not survive at all.



Carrying arbitrary alternate information to protect against multiple failures negates your point about encoding efficiency. Backup links are not a tractable solution.



[HC] There is no arbitrary alternate information to protect against multiple failures that is encoded or flooded in our proposal.



> The more time the network convergence takes, the more customers'

> traffic in the network gets lost in general. Thus after the centralized flooding reduction without fault tolerance to multiple failures is deployed in a network, some customers' traffic lose will be more than doubled when multiple failures splitting the flooding topology occur comparing to the same network without the flooding reduction.





If there are multiple failures, in our proposal, this can easily be detected. The flooding topology would clearly be partitioned. Systems on the edge of the partition can explicitly request temporary changes to the flooding topology, immediately restoring the connectivity of the flooding topology if that is possible.  Thus, the convergence time is going to be on the order of two LSP propagation times: one to detect the failure and another to repair it and converge.



[HC] From your description above, we can not see any details or description on how to restore the connectivity of the flooding topology.



The convergence time is the total of the times for finishing followings:

    1) the failures are flooded to the leader in the area through updated LSAs/LSPs,

    2) the leader computes a new flooding topology after receiving the LSAs/LSPs,

    3) the leader floods the new flooding topology to every other node in the area,

    4) every other node receives, decodes and installs the new flooding topology, and

5) some nodes resynchronize their LSDBs with their neighbors.



The time for the leader to compute a new flooding topology depends on the complexity of the algorithm. For computing a flooding topology from any network topology that survives any one failure, the algorithm is very complex in term of time in general.





Tree topologies are particularly vulnerable to failures since each and every link failure creates a partition of the flooding topology.



Repairing this in a distributed fashion is also not perceptibly better.  After a partition, it will require at least one LSP propagation time to detect the failure. Then execution of the algorithm, followed by another LSP propagation time before a complete SPF can be run.



[HC] Tree topologies we mentioned are just for examples.



> "draft-cc-lsr-flooding-reduction" has a light-weight method for providing fault tolerance to multiple failures on the flooding topology. This method makes sure that the network still converges fast when the multiple failures happen. Thus the failures have almost minimal impact on the network convergence in the network with the flooding reduction deployment. The customers' traffic lose is significantly reduced (or almost minimized) when the failures occur in the network with the flooding reduction.





Your proposal is to revert to almost full flooding.  You bound the flooding to three hops (maybe, this is not defined), but in a LS topology, that is the full topology.  This is not a solution.  If we could support full flooding, we would not have problem in the first place.



[HC] This is one optional solution. When multiple failures on the flooding topology happen, the nodes around the failure points revert to their normal (or full) flooding temporarily until a new flooding topology is built. This normal (or full) flooding on some nodes is only temporary for the emergency where multiple failures occur. After the new flooding topology is built on these nodes, they revert to flooding to the links on the flooding topology.

There is another solution in our draft, which is a light-weight method for providing fault tolerance to multiple failures on the flooding topology.





> "draft-li-lsr-dynamic-flooding" just mentions that the edges of split

> parts will do something after the split of the flooding topology happens. However, it needs time to determine if a split of the flooding topology occurs and to find backup paths to connect split parts. There is no description on how to determine a split or find backup paths. This is not useful for implementors or users.





It is described. Please see section 6.7.11. After a topology change, all nodes can examine the existing flooding topology and the faults that have occurred.  If it is clear that the flooding topology has partitioned, then nodes that have adjacencies that are not reachable on the flooding topology make a temporary request to their adjacencies to add to the flooding topology.  This creates a temporary repair until the next flooding topology is distributed.



[HC] From both your draft and your description above, we can not see any details or description on how to determine a node has adjacencies that are not reachable on the flooding topology, and how does the node make a temporary request to their adjacencies to add to the flooding topology.

If the TLV with R bit in IS-IS Hello or FR-bit in OSPF Hello is used to request for adding a link to flooding topology temporarily (or say enabling temporary flooding over the link), then there is a big issue. For example, if the first Hello packet containing the TLV with R bit in IS-IS (or FR-bit in OSPF) gets lost, then the request to add/enable “temporary flooding” on the link will take more than a Hello interval (typically 10 seconds in OSPF). This is too long to accept!



> In addition, the algorithm for computing a flooding topology that can survive any one link failure on the flooding topology (i.e., any one link failure on the flooding topology does not split the flooding topology) will be very complex, at least in terms of time. With a good method for fault tolerance, the algorithm is not required to compute a flooding topology that can survive any one link failure on the flooding topology.





Collapsing the network is not a 'good method'.



[HC] The above paragraph just was proposed as an optional solution, which works and reduces the complexity of the algorithm for computing a flooding topology, but is not mandatory for implementation.





> 3.                  A New Node and Link Addition

>

> Two different procedures for handling a new node and link addition to the topology are described in the two drafts ("draft-li-lsr-dynamic-flooding" and "draft-cc-lsr-flooding-reduction").

>

> The procedure in "draft-cc-lsr-flooding-reduction": After the new node (say node A) and an existing node (say node B) establishes the adjacency over the link, A and B add the link to the flooding topology temporarily until a new flooding topology is built.





If A and B are already on the flooding topology, then this is redundant and harmful.



[HC] It is impossible that a new node A and a new link are on the flooding topology already.





> The procedure in "draft-li-lsr-dynamic-flooding": A new TLV with R bit in IIH (IS-IS Hello), and FR-bit in OSPF Hello LLS Extended Options are defined. The new node (say node A) and an existing node (say node B), send each other the TLV with R bit in IS-IS Hello or FR-bit in OSPF Hello to request for adding the link to flooding topology temporarily (or say enabling temporary flooding over the link).





This only happens if the other node does not appear on the flooding topology.



[HC] The case that we talked is “A New Node and Link Addition” from the title above.

The new node is not on the flooding topology. This case applies to your procedure.



> The procedure in "draft-cc-lsr-flooding-reduction" is simpler and more efficient. We can see that the two procedures try to achieve the same result ("

> temporary flooding" on the link) in different approaches. When a new

> node a nd link is added to the topology/network, it is deterministic

> that the link needs to be added to the flooding topology temporarily

> by the two end node s of the link. There is no need for the end nodes

> to add/enable "temporary flooding" on the link through (new TLV in IIH

> or FR-Bit in OSPF Hello LLS) signaling over the link.





Automatically adding to the flooding topology risks collapsing the network, again contradicting the entire goal of reducing the flooding topology.



[HC] The results produced by the two procedures in the two drafts are the same.

The differences between the two procedures is that one procedure is simpler and more efficient than the other one.

There is no risk collapsing the network. It does not contradict the entire goal of reducing the flooding topology.

The new link is added to the flooding topology “temporarily” by the two end nodes of the link locally.

This link “temporarily” on the flooding topology is not flooded to the other nodes.

It is just for link states to be flooded “temporarily” over this link before a new flooding topology is built.





> During internal review of "draft-li-lsr-dynamic-flooding" we observed

> significant procedure issues, which are big issues for implementations

> and users.  For example in one issue, if the first Hello packet

> containing the TLV with R bit in IS-IS (or FR-bit in OSPF) gets lost,

> then to add/enable "temporary flooding" on the link will take more

> than a Hello interval (typically

> 10 seconds in OSPF). This is too long to accept!





Loss of a hello also implies that the adjacency will be delayed in coming up, so this is the same in both situations.



[HC] There is a case mentioned above, which is only for your draft. The case is under title “2. Fault tolerance to multiple failures”.





> 4.                  An Existing Link Addition

>

> The approach in "draft-cc-lsr-flooding-reduction": The end node sends

> its adjacent node over the adjacency the link states with changes it received or originated during the period of time in which the flooding topology may split.





This requires the implementation to retain state about every LSP on every link, even when it is down.



[HC] There is no such requirement in our approach.

Our approach defines a variable TimeOf1stLinkDown, which stores the time at time the first link on the flooding topology is down after a new flooding topology is built.

It uses the time information of an LSA/LSP in LSDB, the time at which a changed LSA/LSP is received or originated. If a current OSPF/IS-IS implementation does not have this time information, it can be easily added.

When adding an existing link/adjacency not on the flooding topology between two end nodes to the flooding topology after building another new topology, each end node of the link sends its adjacent node over the adjacency the changed LSA/LSPs from TimeOf1stLinkDown if there is a link down (i.e., TimeOf1stLinkDown != 0).

As we explained, this approach is more efficient.





> The approach in "draft-li-lsr-dynamic-flooding": A full LSDB

> resynchronization is requested over the existing adjacency/link between two end nodes for possible flooding topology split which may cause LSDB out of synchronization.





Resynchronization is a normal part of the protocol process at link up anyway.



[HC] This case is not about link up. It is about adding an existing link/adjacency not on the flooding topology between two end nodes to the flooding topology. The following is the original text for your reference:

“Two different approaches for adding an existing link/adjacency not on the flooding topology between two end nodes to the flooding topology are described in the two drafts (“draft-li-lsr-dynamic-flooding” and “draft-cc-lsr-flooding-reduction”).”

Even though the adjacency between two end nodes is already there and their LSDBs are the same in most of situations, your approach still requests for a full LSDB resynchronization, which consumes (or wastes) the power for processing IGP messages and the link bandwidth for transferring IGP messages. This is contradicting with the entire goal of the flooding reduction.



Regards,

Tony