Re: [Lsr] Comments regarding convergence regarding draft-cc-lsr-flooding-reduction and draft-li-lsr-dynamic-flooding

Huaimo Chen <huaimo.chen@huawei.com> Mon, 28 January 2019 15:26 UTC

From: Huaimo Chen <huaimo.chen@huawei.com>
To: "Van De Velde, Gunter (Nokia - BE/Antwerp)" <gunter.van_de_velde@nokia.com>, "lsr@ietf.org" <lsr@ietf.org>
CC: Yingzhen Qu <yingzhen.ietf@gmail.com>, Christian Hopps <chopps@chopps.org>, "draft-li-lsr-dynamic-flooding@ietf.org" <draft-li-lsr-dynamic-flooding@ietf.org>, Alvaro Retana <aretana.ietf@gmail.com>, "draft-cc-lsr-flooding-reduction@ietf.org" <draft-cc-lsr-flooding-reduction@ietf.org>
Date: Mon, 28 Jan 2019 15:26:35 +0000
Archived-At: <https://mailarchive.ietf.org/arch/msg/lsr/BbvtLhUXp_GlKfuSJgj_0Pymp24>

Hi Everyone,



The authors of "draft-cc-lsr-flooding-reduction" (our draft for short below) believe that it has multiple key advantages over the solution proposed in "draft-li-lsr-dynamic-flooding" (Tony's draft for short below).



Thank you, Gunter, for reviewing the solutions. As authors of "draft-cc-lsr-flooding-reduction", however, we feel that some of the conclusions are flawed. You wrote: "Items of interest to me were draft documentation flow, draft technological format/content and high level architectural decisions made within each proposal." I agree that these criteria are important for any IETF document that is intended to be published and implemented. Other criteria, such as minimizing customers' traffic loss when failures occur, the efficiency of the algorithms, the space used by the extended encodings, and the handling of corner cases, are at least equally important, and in some cases (see below) more significant in operation. The proposed solution must not only resolve the stated problems; it must also add the least overhead to the IGP's operation and keep customers' traffic loss minimal.

The specific concerns raised in Gunter's email are addressed inline below with the prefix [HC].





1. Flooding topology encoding



Flooding topology encoding is a key component of centralized flooding reduction (centralized mode for short).

The efficiency of the encoding is very important since the objective of the flooding reduction is to minimize the amount of link state flooding.



The encoding in "draft-li-lsr-dynamic-flooding" comprises two parts. One part is the encoding of the flooding topology using the TLVs of the paths constituting the flooding topology; the second part is the encoding of the mapping from the nodes in the area to their indexes. The TLVs use the indexes of the nodes in the paths.



There are two encodings in "draft-cc-lsr-flooding-reduction", each of which can be used to encode the flooding topology. One is called "Links Encoding"; the other is called "Block Encoding". The links encoding encodes the links between a local node and its adjacent nodes using the compact indexes of these nodes. The whole flooding topology is represented by the links and nodes on it. This is like the normal topology representation in an IGP, in which the topology is represented by the links and nodes in an IGP area. No encoding or flooding of the mapping is needed: every node creates and maintains the mapping from the nodes to their indexes by itself. It is simpler and faster for the node to obtain the mapping in this way.
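
As a rough illustration of why no mapping needs to be flooded, the sketch below shows two nodes deriving the same node-to-index mapping from their own copies of the LSDB. It assumes, purely for illustration, that indexes are assigned by sorting node IDs in ascending order; the function names (build_index_map, encode_links) and the sample system IDs are hypothetical and are not taken from either draft.

```python
def build_index_map(node_ids):
    """Assign indexes 0..n-1 to node IDs in ascending order (assumed rule)."""
    return {node_id: idx for idx, node_id in enumerate(sorted(set(node_ids)))}


def encode_links(local, neighbors, index_map):
    """Represent a local node's flooding-topology links by compact indexes."""
    return index_map[local], sorted(index_map[n] for n in neighbors)


# Two nodes hold the same set of node IDs, learned in a different order.
lsdb_view_1 = ["0000.0000.0003", "0000.0000.0001", "0000.0000.0002"]
lsdb_view_2 = ["0000.0000.0002", "0000.0000.0003", "0000.0000.0001"]

map_1 = build_index_map(lsdb_view_1)
map_2 = build_index_map(lsdb_view_2)

# Both nodes arrive at the same mapping without exchanging it.
links = encode_links("0000.0000.0001",
                     ["0000.0000.0003", "0000.0000.0002"],
                     map_1)
print(map_1 == map_2, links)
```

Because the mapping is a pure function of the set of node IDs, any two nodes with synchronized LSDBs compute identical indexes under this assumed rule.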



The block encoding uses a single structure to encode a block of a flooding topology. It starts with a local node and its adjacent nodes and can be considered as an extension to the links encoding.



Each of the encoding types in "draft-cc-lsr-flooding-reduction" offers a solution for encoding the flooding topology, which is simpler to implement from a procedural and protocol perspective, and much more efficient than the encoding method used in "draft-li-lsr-dynamic-flooding".



For example, for encoding a 63-node flooding topology that is a binary tree in IS-IS, some comparisons are listed in the table below (the table is also attached as a .pdf).

No. | Encoding in Tony's draft               | Encoding in our draft       | Comparison
----+----------------------------------------+-----------------------------+--------------
1   | 248 bytes for TLVs of paths only       | 167 bytes by links encoding | 248/167 = 1.5
    |                                        | 121 bytes by block encoding | 248/121 = 2
----+----------------------------------------+-----------------------------+--------------
2   | 382 bytes for mapping nodes to indexes | 0 bytes for mapping         |
----+----------------------------------------+-----------------------------+--------------
3   | 630 bytes for the whole flooding       | 167 bytes by links encoding | 630/167 = 3.7
    | topology (248 for TLVs of paths +      | 121 bytes by block encoding | 630/121 = 5.2
    | 382 for mapping)                       |                             |



From the table, we can see that even without encoding the mapping, the encoding in "draft-li-lsr-dynamic-flooding" uses 248 bytes for the TLVs of the paths alone. The links encoding in "draft-cc-lsr-flooding-reduction" uses 167 bytes for the same flooding topology. 248/167 = 1.5.

The block encoding in "draft-cc-lsr-flooding-reduction" uses 121 bytes for the same flooding topology. 248/121 = 2. This means that the block encoding in "draft-cc-lsr-flooding-reduction" uses about 50% of the flooding resources used by the encoding in "draft-li-lsr-dynamic-flooding". In our discussions with operators, this is a significant advantage over the method proposed in "draft-li-lsr-dynamic-flooding".



To encode the whole flooding topology, the encoding in "draft-li-lsr-dynamic-flooding" uses 630 bytes. The links encoding in "draft-cc-lsr-flooding-reduction" uses 167 bytes for the same flooding topology. 630/167 = 3.7. This means that the links encoding in "draft-cc-lsr-flooding-reduction" uses about 27% of the flooding resources used by the encoding in "draft-li-lsr-dynamic-flooding".

The block encoding in "draft-cc-lsr-flooding-reduction" uses 121 bytes for the same flooding topology. 630/121 = 5.2. This means that the block encoding in "draft-cc-lsr-flooding-reduction" uses about 19% of the flooding resources used by the encoding in "draft-li-lsr-dynamic-flooding".
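
The percentages above can be re-derived from the byte counts in the table. The short sketch below only repeats that arithmetic; the constant and function names are hypothetical, and the byte counts are the ones quoted in this message for the 63-node binary tree example.

```python
# Byte counts quoted above for the 63-node binary-tree flooding topology.
PATH_TLVS = 248   # Tony's draft: TLVs of paths only
MAPPING = 382     # Tony's draft: mapping of nodes to indexes
LINKS = 167       # our draft: links encoding
BLOCK = 121       # our draft: block encoding

total_paths_and_mapping = PATH_TLVS + MAPPING


def share(ours, theirs):
    """Size of one encoding as a whole-number percentage of the other."""
    return round(100 * ours / theirs)


for name, size in (("links", LINKS), ("block", BLOCK)):
    print(f"{name} encoding: {share(size, PATH_TLVS)}% of the path TLVs alone, "
          f"{share(size, total_paths_and_mapping)}% of paths plus mapping")
```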



In addition, there are some issues in the encoding in "draft-li-lsr-dynamic-flooding".

We raised these issues with Tony during the discussions on merging but have not yet received any solution, which is not good for operators.

For example, one issue is that the different types of messages for the flooding topology (i.e., the messages for the mapping, and the messages for the TLVs of the paths in the flooding topology, which depend on the mapping) may arrive out of order at a receiving router. This may result in an incorrect or corrupted flooding topology.





2. Fault tolerance to multiple failures



An efficient method for fault tolerance to multiple failures on the flooding topology is very important to centralized flooding reduction. Without it, network convergence will slow down significantly. In general, when the flooding topology is split by multiple failures, the convergence time will be more than double the normal convergence time.



In general, the longer network convergence takes, the more customers' traffic in the network is lost. Thus, if centralized flooding reduction without fault tolerance to multiple failures is deployed in a network, then when multiple failures split the flooding topology, some customers' traffic loss will be more than double that in the same network without flooding reduction.



Does an operator of a network want to see more traffic loss after flooding reduction is deployed in the network when multiple failures happen?



Does anyone from a vendor want to see more traffic loss after their flooding reduction solution is deployed in a network when multiple failures occur?



When multiple failures split the flooding topology, the LSDB is out of synchronization among some nodes. This is fixed only after the following steps:

    1) the failures are flooded to the leader in the area through updated LSAs/LSPs,

    2) the leader computes a new flooding topology after receiving the LSAs/LSPs,

    3) the leader floods the new flooding topology to every other node in the area,

    4) every other node receives, decodes and installs the new flooding topology, and

    5) some nodes resynchronize their LSDBs with their neighbors.
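
Whether a given set of failures splits the flooding topology, which is the condition that triggers the steps above, can be checked with a plain connectivity test. The sketch below is illustrative only: it uses a hypothetical six-node ring as the flooding topology and a breadth-first search, and it is not the procedure specified in either draft. The names components and surviving are made up for this example.

```python
from collections import deque


def components(nodes, links):
    """Return the connected parts of the flooding topology as sets of nodes."""
    adj = {n: set() for n in nodes}
    for a, b in links:
        adj[a].add(b)
        adj[b].add(a)
    seen, parts = set(), []
    for start in nodes:
        if start in seen:
            continue
        part, queue = set(), deque([start])
        seen.add(start)
        while queue:
            n = queue.popleft()
            part.add(n)
            for m in adj[n]:
                if m not in seen:
                    seen.add(m)
                    queue.append(m)
        parts.append(part)
    return parts


def surviving(links, failed):
    """Drop failed links; a link is undirected, so (A, B) equals (B, A)."""
    failed_set = {frozenset(link) for link in failed}
    return [link for link in links if frozenset(link) not in failed_set]


nodes = ["A", "B", "C", "D", "E", "F"]
# Hypothetical flooding topology: a ring A-B-C-D-E-F-A.
flooding_topology = [("A", "B"), ("B", "C"), ("C", "D"),
                     ("D", "E"), ("E", "F"), ("F", "A")]

one_failure = components(nodes, surviving(flooding_topology, [("A", "B")]))
two_failures = components(
    nodes, surviving(flooding_topology, [("A", "B"), ("D", "C")]))

print(len(one_failure), len(two_failures))
```

In this example the ring survives one link failure as a single part, while the two failures leave two parts whose LSDBs can drift apart until the steps above complete.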



Introducing a method for fault tolerance to multiple failures on the flooding topology into the flooding reduction solution may bring some complexity. There is a trade-off between the amount of traffic loss and this complexity. A solution with this complexity will have much less traffic loss than a solution without any method for fault tolerance to multiple failures on the flooding topology. In fact, the complexity is invisible to operators; it is internal to the solution.



Does an operator of a network want to see much less traffic loss when multiple failures happen after flooding reduction, with some invisible complexity, is deployed in the network?



Does anyone from a vendor want to see much less traffic loss after their flooding reduction solution is deployed in a network when multiple failures occur?



"draft-cc-lsr-flooding-reduction" has a lightweight method for providing fault tolerance to multiple failures on the flooding topology. This method makes sure that the network still converges fast when multiple failures happen. Thus, in a network with flooding reduction deployed, the failures have close to minimal impact on network convergence, and customers' traffic loss is significantly reduced (or almost minimized) when the failures occur.



"draft-li-lsr-dynamic-flooding" only mentions that the edges of the split parts will do something after the flooding topology is split. However, it takes time to determine whether the flooding topology has been split and to find backup paths that connect the split parts, and there is no description of how to determine a split or how to find backup paths. This is not useful for implementors or users.



In addition, an algorithm for computing a flooding topology that can survive any one link failure (i.e., no single link failure on the flooding topology splits it) will be very complex, at least in terms of time. With a good method for fault tolerance, the algorithm is not required to compute a flooding topology with this property.
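
To make the "survives any one link failure" property concrete, the sketch below checks it by brute force: remove each link in turn and test whether the remaining flooding topology is still connected. This only verifies the property for a given topology; it is not an algorithm for computing such a topology, and it is not taken from either draft. The function names and the seven-node sample topologies are hypothetical.

```python
def is_connected(nodes, links):
    """Depth-first search connectivity test over undirected links."""
    nodes = list(nodes)
    if not nodes:
        return True
    adj = {n: set() for n in nodes}
    for a, b in links:
        adj[a].add(b)
        adj[b].add(a)
    seen, stack = {nodes[0]}, [nodes[0]]
    while stack:
        n = stack.pop()
        for m in adj[n]:
            if m not in seen:
                seen.add(m)
                stack.append(m)
    return len(seen) == len(nodes)


def critical_links(nodes, links):
    """Links whose individual failure would split the flooding topology."""
    return [link for i, link in enumerate(links)
            if not is_connected(nodes, links[:i] + links[i + 1:])]


nodes = [1, 2, 3, 4, 5, 6, 7]
# A 7-node binary tree: every link is critical.
tree_links = [(1, 2), (1, 3), (2, 4), (2, 5), (3, 6), (3, 7)]
# A 7-node ring: no single link failure splits it.
ring_links = [(1, 2), (2, 3), (3, 4), (4, 5), (5, 6), (6, 7), (7, 1)]

tree_critical = critical_links(nodes, tree_links)
ring_critical = critical_links(nodes, ring_links)
print(len(tree_critical), len(ring_critical))
```

A tree-shaped flooding topology, such as the binary tree used in the encoding comparison, uses the fewest links but is split by any single link failure, which is where a separate fault-tolerance method matters most.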





3. A New Node and Link Addition



Two different procedures for handling a new node and link addition to the topology are described in the two drafts ("draft-li-lsr-dynamic-flooding" and "draft-cc-lsr-flooding-reduction").



The procedure in "draft-cc-lsr-flooding-reduction": After the new node (say node A) and an existing node (say node B) establish the adjacency over the link, A and B add the link to the flooding topology temporarily until a new flooding topology is built.



The procedure in "draft-li-lsr-dynamic-flooding": A new TLV with an R bit in the IIH (IS-IS Hello) and an FR-bit in the OSPF Hello LLS Extended Options are defined. The new node (say node A) and an existing node (say node B) send each other the TLV with the R bit in the IS-IS Hello, or the FR-bit in the OSPF Hello, to request that the link be added to the flooding topology temporarily (in other words, to enable temporary flooding over the link).



The procedure in "draft-cc-lsr-flooding-reduction" is simpler and more efficient. The two procedures try to achieve the same result ("temporary flooding" on the link) by different approaches. When a new node and link are added to the topology/network, it is deterministic that the two end nodes of the link need to add the link to the flooding topology temporarily. There is no need for the end nodes to add/enable "temporary flooding" on the link through signaling over the link (the new TLV in the IIH or the FR-bit in the OSPF Hello LLS).
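
A minimal sketch of the signaling-free behavior described above, as seen from one end node: a link to a newly adjacent node is flooded over temporarily until a newly computed flooding topology is installed. The class and method names are hypothetical, and the sketch ignores timers, adjacency state machines and everything else a real implementation would need.

```python
class FloodingState:
    """Per-node view of which links are currently used for flooding."""

    def __init__(self, flooding_links):
        self.computed = {frozenset(link) for link in flooding_links}
        self.temporary = set()

    def on_new_adjacency(self, local, new_neighbor):
        """A new adjacency came up: flood over it temporarily, no signaling."""
        link = frozenset((local, new_neighbor))
        if link not in self.computed:
            self.temporary.add(link)

    def on_new_flooding_topology(self, flooding_links):
        """A new flooding topology was built: temporary links are dropped."""
        self.computed = {frozenset(link) for link in flooding_links}
        self.temporary.clear()

    def floods_over(self, a, b):
        link = frozenset((a, b))
        return link in self.computed or link in self.temporary


# Node B's view: existing flooding topology has the single link B-C.
state = FloodingState([("B", "C")])
before = state.floods_over("A", "B")

state.on_new_adjacency("B", "A")          # new node A attaches to B
during = state.floods_over("A", "B")

state.on_new_flooding_topology([("B", "C"), ("A", "B")])
after = state.floods_over("A", "B")
print(before, during, after, len(state.temporary))
```

Both end nodes can run the same local rule independently, which is the sense in which the decision is deterministic and needs no per-link request.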



During our internal review of "draft-li-lsr-dynamic-flooding", we observed significant procedural issues, which are serious problems for implementations and users. For example, if the first Hello packet containing the TLV with the R bit in IS-IS (or the FR-bit in OSPF) is lost, then adding/enabling "temporary flooding" on the link will take more than a Hello interval (typically 10 seconds in OSPF). This is too long to be acceptable.





4. An Existing Link Addition



Two different approaches for adding an existing link/adjacency not on the flooding topology between two end nodes to the flooding topology are described in the two drafts ("draft-li-lsr-dynamic-flooding" and "draft-cc-lsr-flooding-reduction").



The approach in "draft-cc-lsr-flooding-reduction": Over the adjacency, the end node sends its adjacent node the changed link states that it received or originated during the period in which the flooding topology may have been split.



The approach in "draft-li-lsr-dynamic-flooding": A full LSDB resynchronization is requested over the existing adjacency/link between the two end nodes, to cover a possible flooding topology split that may have left the LSDBs out of synchronization.



The approach in "draft-cc-lsr-flooding-reduction" is more efficient. In general, it is unlikely that the LSDB is out of synchronization among some nodes when an existing link/adjacency that is not on the flooding topology is added to the flooding topology. In this case, our approach does almost nothing, i.e., there are no changed link states that the end node needs to send to its adjacent node over the existing adjacency. However, the approach in "draft-li-lsr-dynamic-flooding" always performs a full LSDB resynchronization over the existing link/adjacency between the two end nodes in this case.
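
The difference between the two approaches can be pictured as "send only what changed during the window" versus "resynchronize everything". The sketch below is a simplified illustration under assumptions of our own: each LSDB entry carries a hypothetical last-updated timestamp, and window_start marks the beginning of the period in which the flooding topology may have been split. Neither the data layout nor the function name comes from either draft.

```python
def changed_since(lsdb, window_start):
    """IDs of link states received or originated at or after window_start."""
    return sorted(ls_id for ls_id, (_seq, updated_at) in lsdb.items()
                  if updated_at >= window_start)


# Hypothetical LSDB: link-state ID -> (sequence number, last update time).
lsdb = {
    "R1.00-00": (12, 100.0),
    "R2.00-00": (7, 205.5),
    "R3.00-00": (3, 90.0),
}

# Delta approach: only entries that changed during the possible split.
delta = changed_since(lsdb, window_start=200.0)

# Common case: nothing changed during the window, so nothing is sent.
quiet = changed_since(lsdb, window_start=300.0)

# Full resynchronization covers every entry regardless of the window.
full = sorted(lsdb)

print(delta, quiet, len(full))
```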





In summary, "draft-cc-lsr-flooding-reduction" has the following key advantages over "draft-li-lsr-dynamic-flooding":

1.      It uses a fraction of the flooding resources (i.e., it is multiple times more efficient in flooding topology encoding);

2.      It provides fault tolerance to multiple failures, minimizing the impact on network convergence and thus minimizing traffic loss; and

3.      It is simpler and needs less processing time (i.e., faster and more efficient) in multiple scenarios.



In addition to the above key advantages, "draft-cc-lsr-flooding-reduction" has almost all the technical components for the whole solution including distributed flooding reduction and centralized flooding reduction.



Based on technical considerations, the authors of "draft-cc-lsr-flooding-reduction" have observed that too many issues exist with "draft-li-lsr-dynamic-flooding" and that, overall, it is a sub-optimal solution compared to "draft-cc-lsr-flooding-reduction".



We see "draft-cc-lsr-flooding-reduction" as a much better candidate for achieving a high-quality WG deliverable if people want to select one draft from "draft-cc-lsr-flooding-reduction" and "draft-li-lsr-dynamic-flooding".





There are two other options to move forward:



1. Adopt both "draft-cc-lsr-flooding-reduction" and "draft-li-lsr-dynamic-flooding" as Experimental and allow implementors and operators (the market) to decide which to deploy. If one solution is preferred, it may be converted from Experimental to Standards Track.



2. Again, try to merge "draft-cc-lsr-flooding-reduction" and "draft-li-lsr-dynamic-flooding" and develop a single solution for the WG to adopt. However, some compromise must be made by the authors of "draft-li-lsr-dynamic-flooding".



Either option is a much better way to achieve a high-quality WG deliverable than moving forward with "draft-li-lsr-dynamic-flooding" alone.



Best Regards,

Huaimo



-----Original Message-----
From: Lsr [mailto:lsr-bounces@ietf.org] On Behalf Of Van De Velde, Gunter (Nokia - BE/Antwerp)
Sent: Thursday, January 24, 2019 12:00 PM
To: lsr@ietf.org
Cc: Yingzhen Qu <yingzhen.ietf@gmail.com>; Christian Hopps <chopps@chopps.org>; draft-li-lsr-dynamic-flooding@ietf.org; Alvaro Retana <aretana.ietf@gmail.com>
Subject: [Lsr] Comments regarding convergence regarding draft-cc-lsr-flooding-reduction and draft-li-lsr-dynamic-flooding





Dear LSR WG,



The LSR chairs informed that attempts to merge drafts draft-cc-lsr-flooding-reduction and draft-li-lsr-dynamic-flooding is exposing challenges, and hence asked me as independent unbiased WG contributor to have a look at both drafts and provide suggestions on how we may progress to deliver the highest quality WG technology deliverable.



Please find following observations on both documents with my hat of "independent unbiased LSR contributor":



Both drafts have clearly written problem space description and in general a well written solution space. The problem space is real. Each draft approaches the problem space using different accents within the solution space. These accents are obvious when the solution-space is discussed in both drafts.



I found the solution space described in "draft-cc-lsr-flooding-reduction" centered around restoring flooding topology and making sure that during critical changes, flooding is still fast. The solution space provides mechanism to make sure that a reduced flooding topology has minimal impact upon convergence (it use backup paths, critical paths/node, ...). The compromise to achieve fast topology restoration is by introducing a complexity trade-off. For example, there are many moving parts from flooding mechanics, to avoid against flooding topology split. This raise to me some concern from implementor perspective.



[HC] In our draft, only about two pages describe the mechanism for fault tolerance to multiple failures, which is one of many components of the whole solution. The mechanism is independent of most other components (such as the encoding of the flooding topology). On each node, the computation of the backup paths is invoked by the component that rebuilds the flooding topology whenever the flooding topology changes. It can be implemented easily. The other components of the solution in our draft (such as those discussed above) are also simple to implement.



[HC] The mechanism significantly reduces customers' traffic loss when multiple failures split the flooding topology, even though it may bring some complexity. The trade-offs between traffic loss and complexity are discussed in detail above.



[HC] Also, the mechanism for fault tolerance is optional. The distributed flooding reduction does not need it. It is very important to the centralized mode.





The solution space discussed in "draft-li-lsr-dynamic-flooding" uses a generic vision based upon solution requirements and seems to focus around protocol efficiency/simplicity versus speed of topology convergence. From a documentation perspective I found the flow contained by "draft-li-lsr-dynamic-flooding" easier to understand. This understanding includes packet format descriptions and architectural decisions (for example known LSR technology properties/limitations). I found most of the information sufficiently well described with minimal complexity.



[HC] The efficiency/simplicity, convergence and complexity have been analyzed in greater detail at the beginning of this message for most of the components of the solutions in the two drafts ("draft-cc-lsr-flooding-reduction" and "draft-li-lsr-dynamic-flooding"). In summary,

1.      "draft-cc-lsr-flooding-reduction" is multiple times more efficient in flooding topology encoding than "draft-li-lsr-dynamic-flooding";

2.      "draft-cc-lsr-flooding-reduction" provides fault tolerance to multiple failures, which almost minimizes the impact of the failures on network convergence and thus almost minimizes customers' traffic loss; "draft-li-lsr-dynamic-flooding", however, does not have any concrete method for fault tolerance to multiple failures, so its convergence time will be more than doubled when multiple failures split the flooding topology;

3.      "draft-cc-lsr-flooding-reduction" is simpler and needs less processing time (i.e., more efficient and faster) in multiple scenarios.

We have observed that too many issues exist with "draft-li-lsr-dynamic-flooding" and that, overall, it is a sub-optimal solution compared to "draft-cc-lsr-flooding-reduction".



[HC] Regarding "'draft-li-lsr-dynamic-flooding' easier to understand", this item should not be a determining factor at this stage of the document.



In both drafts the distributed algorithm solution seems underspecified and that raise implementation concerns to me.



[HC] For distributed flooding reduction, our draft describes three algorithms for computing the flooding topology in the Appendix, and the operations related to distributed flooding reduction in section 9, "Operations on Flooding Reduction".





Conclusion:

Coming back to the original request of the LSR chairs asking feedback to progress for a quality WG deliverable. I have read both drafts and constructed an opinion. Items of interest to me were draft documentation flow, draft technological format/content and high level architectural decisions made within each proposal. Using those criteria, the balance scales towards "draft-li-lsr-dynamic-flooding" because it is using clear solution requirements, clear descriptions of encoding and their usage, clear behavioral specifications and has considered introducing minimal complexity. This means that from review perspective "draft-li-lsr-dynamic-flooding" seems best candidate to adopt with most potential for high quality WG deliverable. In addition, we could assign a shepherd to work within the WG. If needed I am happy to help shepherding the WG deliverable.



[HC] Basically, the conclusion is based on the flow and clarity of the drafts, which are important. However, the key technical advantages of a draft are at least equally important, and in general more important. These are analyzed in detail above.



This work is important for industry. The solution must work and be causing less issues as the problem we are trying to fix.

We need to progress the work on flooding reduction. We need to select the most optimal/pragmatic solution.

Obviously, additional reviews from WG contributors will help LSR WG to define and build the highest quality WG deliverable.



Kind Regards,

Gunter Van de Velde


