Re: [Roll] Border router failure detection
Konrad Iwanicki <iwanicki@mimuw.edu.pl> Wed, 21 April 2021 08:38 UTC
Return-Path: <iwanicki@mimuw.edu.pl>
X-Original-To: roll@ietfa.amsl.com
Delivered-To: roll@ietfa.amsl.com
Received: from localhost (localhost [127.0.0.1]) by ietfa.amsl.com (Postfix) with ESMTP id 64BBC3A1B6E for <roll@ietfa.amsl.com>; Wed, 21 Apr 2021 01:38:15 -0700 (PDT)
X-Virus-Scanned: amavisd-new at amsl.com
X-Spam-Flag: NO
X-Spam-Score: -1.899
X-Spam-Level:
X-Spam-Status: No, score=-1.899 tagged_above=-999 required=5 tests=[BAYES_00=-1.9, NICE_REPLY_A=-0.001, RCVD_IN_DNSWL_BLOCKED=0.001, SPF_HELO_NONE=0.001, SPF_PASS=-0.001, URIBL_BLOCKED=0.001] autolearn=unavailable autolearn_force=no
Received: from mail.ietf.org ([4.31.198.44]) by localhost (ietfa.amsl.com [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id EBMylZ_NJ5Ez for <roll@ietfa.amsl.com>; Wed, 21 Apr 2021 01:38:11 -0700 (PDT)
Received: from mail.mimuw.edu.pl (mail.mimuw.edu.pl [193.0.96.6]) (using TLSv1.2 with cipher ADH-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by ietfa.amsl.com (Postfix) with ESMTPS id 0CF653A1B6B for <roll@ietf.org>; Wed, 21 Apr 2021 01:38:06 -0700 (PDT)
Received: from localhost (localhost [127.0.0.1]) by duch.mimuw.edu.pl (Postfix) with ESMTP id 714406036AA1F; Wed, 21 Apr 2021 10:38:03 +0200 (CEST)
X-Virus-Scanned: amavisd-new at mimuw.edu.pl
Received: from duch.mimuw.edu.pl ([127.0.0.1]) by localhost (mail.mimuw.edu.pl [127.0.0.1]) (amavisd-new, port 10026) with ESMTP id xHUaJbywv20r; Wed, 21 Apr 2021 10:38:00 +0200 (CEST)
Received: from [IPv6:2001:6a0:5001:2:e859:37e0:88bf:b374] (unknown [IPv6:2001:6a0:5001:2:e859:37e0:88bf:b374]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by duch.mimuw.edu.pl (Postfix) with ESMTPSA; Wed, 21 Apr 2021 10:37:58 +0200 (CEST)
From: Konrad Iwanicki <iwanicki@mimuw.edu.pl>
To: Routing Over Low power and Lossy networks <roll@ietf.org>, "Pascal Thubert (pthubert)" <pthubert=40cisco.com@dmarc.ietf.org>
References: <CAP+sJUfcEY2DNEQV=duJdN6P8zZn0ccuei+4ra-B6TcLb5z8Kg@mail.gmail.com> <49ac5fc3-4a3c-fb87-d366-eb7e7cfd60df@mimuw.edu.pl> <18233.1583176305@localhost> <CAO0Djp3w4vWCOawQ+eegNTRzb_HRGYH6n=bdEH6iVf5ZO0AGFQ@mail.gmail.com> <f71fe153-c0d1-097e-a72e-49ece97cbd48@mimuw.edu.pl> <10272666-28c7-ab3e-9ceb-1b8f2bb6e5e5@mimuw.edu.pl> <CO1PR11MB4881A5AA0E5C5010FD2BE39ED8749@CO1PR11MB4881.namprd11.prod.outlook.com> <bc174171-4b68-40b2-d532-463709e5bea8@mimuw.edu.pl> <CO1PR11MB4881D0C985582B28AE2DE8BED84E9@CO1PR11MB4881.namprd11.prod.outlook.com>
Message-ID: <ab695952-3b11-46ad-f638-622ca770f8e1@mimuw.edu.pl>
Date: Wed, 21 Apr 2021 10:39:05 +0200
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Thunderbird/45.8.0
MIME-Version: 1.0
In-Reply-To: <CO1PR11MB4881D0C985582B28AE2DE8BED84E9@CO1PR11MB4881.namprd11.prod.outlook.com>
Content-Type: text/plain; charset="utf-8"; format="flowed"
Content-Transfer-Encoding: 7bit
Archived-At: <https://mailarchive.ietf.org/arch/msg/roll/Ks2SzkmdXeuEUgR2TDlPbfbyWZI>
Subject: Re: [Roll] Border router failure detection
X-BeenThere: roll@ietf.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: Routing Over Low power and Lossy networks <roll.ietf.org>
List-Unsubscribe: <https://www.ietf.org/mailman/options/roll>, <mailto:roll-request@ietf.org?subject=unsubscribe>
List-Archive: <https://mailarchive.ietf.org/arch/browse/roll/>
List-Post: <mailto:roll@ietf.org>
List-Help: <mailto:roll-request@ietf.org?subject=help>
List-Subscribe: <https://www.ietf.org/mailman/listinfo/roll>, <mailto:roll-request@ietf.org?subject=subscribe>
X-List-Received-Date: Wed, 21 Apr 2021 08:38:15 -0000
Hi Pascal, Thanks for reading the draft in such a great depth! A short answer to your questions: sentinel selection and other issues you mention are very much dependent on a particular deployment (i.e., network topology, application, other technologies used with RPL); RNFD thus focuses on providing mechanism for enforcing deployment-specific policies in that matter, including possibilities for dynamically changing the policies at runtime. A much longer answer follows. On 14/04/2021 16:47, Pascal Thubert (pthubert) wrote: > At an abstract level I see your proposal as a form of BFD where the network assesses the liveliness of the root. The operation is done by a number of representatives that must agree that the root is gone. The draft does a good job at addressing the Byzantine problem once we know who the generals are, but it's unclear to me how they are elected and how they converse. This intuition is correct. Just one remark to be precise: the algorithm assumes no Byzantine nodes, merely ones that may have incorrect observations about the state of the DODAG root but otherwise follow the protocol. > That's a family of questions around the issue of how we avoid false positive. Seems that the more sentinels the less of them. But the less efficient the algorithm is. I think this intuition is by and large justified. However, it need not be correct formally, especially if one considers corner cases. Let me try to build up more intuition, which will hopefully also clarify some of my answers to your detailed questions. To start with, as you pointed up above, the decision that the root is down is a collective one. It is based on observations of multiple sentinel nodes. Each observation regards the direct link between the corresponding sentinel and the root. The observations are communicated between nodes---both sentinels and acceptors---in CFRCs. A decision based on such accumulated observations can be made by any node: sentinel or acceptor. It is important to note that "multiple" here means a number above some threshold, referred to in the draft as RNFD_CONSENSUS_THRESHOLD. The threshold should be reasonably large (e.g., more than 50%). However, it should not be too large (in particular, equal to 100%) because some sentinels may be dead as well or may be too slow detecting the failure. Formally, this implies that false positives are unavoidable. For example, assume that you have N>=3 sentinels, your threshold is 51%, 1 sentinel is certain that the root is up while the other N-1 sentinels claim the opposite, and the root is actually up and communicating with the 1 sentinel. In such a case, the N-1 sentinels claiming that the root is dead will outvote the 1 sentinel claiming otherwise and a global decision will be made that the root has crashed, which---formally---is a false positive irrespective of how large N is. What will happen is that, being able to communicate with the 1 sentinel, the root will learn the decision and rapidly initiate DODAG rebuilding by generating a new DODAG Version. In the new version, the N-1 old sentinels will likely not be sentinels anymore because they will not satisfy the necessary condition (unless their connectivity to the root is restored). Are such false positives bad? A sentinel claims that the root is dead if and only if its link with the root seems to be down. If a significant number of sentinels (e.g., more than the aforementioned 50%) claim this, then something undesirable must have happened to the connectivity within the DODAG. Therefore, in practice, even if the root has not crashed, it may make a lot of sense to rebuild the DODAG to account for the connectivity changes, especially since RPL's rank assignment algorithms may preclude some nodes from assigning ranks significantly larger than their minimal ones. Is there anything to be done to prevent such rebuilding from disrupting routing? First of all, if so many links are broken to make nodes believe that the root is dead, then routing is likely already significantly disrupted. Nevertheless, one may still want to avoid all nodes emptying their parent sets, which happens when RNFD's LORS transits to GLOBALLY DOWN. In this case, the root may act proactively, as mentioned in paragraph 2, Section 5.5 of the draft: It [the root] MAY also generate a new DODAG Version if fraction value(NegativeCFRC)/value(PositiveCFRC) approaches RNFD_CONSENSUS_THRESHOLD, so as to avoid potential interruptions to routing. For example, if the fraction exceeds 50% or 80% of the aforementioned threshold. Also, additional solutions are possible to further alleviate the problem but I think we have never needed to implement them, and hence they are omitted in the draft. So let's proceed to your detailed questions. > - do the sentinels encompass all first hop nodes? The formal answer is "not necessarily", which is what the draft states: to be sentinel a node must be a first-hop node but not every first-hop node has to be sentinel. The practical answer is that if one can accommodate all such nodes as sentinels (see the next question), then why not. > - how do we compute the right number of sentinels? The formal answer is that an optimal selection of sentinels depends on a number of deployment-specific factors and may thus be extremely hard to compute. Therefore, rather than an exact formula describing the number (and the placement) of sentinels, it is better to think in terms of trade-offs, precisely as in your intuition: - on the one hand, the more sentinels and the higher the decision threshold, the more "confidence" one can feel about their decisions on the root's crashes being correct but - on the other hand, the greater the overhead in that: the larger the CFRCs (which implies longer DIOs, possibly even fragmented), the more trickle timer resets, and potentially the more load on the root during suspicion verification. Given this, ideally, to select root's neighbors acting as sentinels, one would do a thorough statistical analysis factoring in all major risks. This may also be too difficult, though. My practical, experience-based recommendation for the sentinel selection procedure is thus roughly as follows: 1. Select the CFRC size such that the RNFD Option fits a DIO without causing too much overhead (especially to avoid fragmentation). 2. Let the root's neighbors become sentinels at random and gradually adapt their behavior such that PositiveCFRCs do not fill up (as described in Section 6) of the draft. More specifically, - start with all neighbors having the possibility of becoming sentinels; - if PositiveCFRCs do fill up (without a significant increase in NegativeCFRC), decrease the probability of becoming sentinel by half and issue a new DODAG version (this can be done autonomously by the individual nodes observing filling up of PositiveCFRC); - repeat the last step unless satisfied. 3. Furthermore, if some of the root's neighbors tend to switch roles too often or have unstable links to the root, blacklist them from becoming sentinels (again, this can be done autonomously by the nodes -- they can self-blacklist). 4. Finally, if RNFD experiences too many false positives nevertheless (which at this point probably means that the root is poorly connected in general), simply disable the algorithm (never happened in our deployments). To give some absolute numbers, I would aim at ~10, at most ~20 sentinels with stable links to the root. In the end, they are just representatives of the root's neighbors' population. After going though this a few times, one will probably have a good setting of initial parameters for similar deployments and the particular technologies employed with RPL. In effect, one would have to fine-tune them only in extreme cases, like ultra-sparse or ultra-dense networks. What I am trying to convey here is that sentinel selection is a matter of managing the network rather than a responsibility of the RNFD protocol itself. The protocol strives to provide only mechanisms that enable effective management by network administrators. Equipped with such tools, it is the administrators who will come up with the best management policies, which are inherently deployment-specific. What is more, the mechanisms provided by RNFD allow for having the policy enforcement automated to a large extent. > - do the sentinels needs direct connectivity to one another? No. The sentinels alone do not even need to form a connected graph on their own (decisions are made by sentinels or acceptors), which makes the algorithm really broadly applicable (including the aforementioned possibility of random sentinel selection and robustness to some of the security threats mentioned earlier by Michael). > - how quick should the protocol react ? This also depends on the particular technologies one uses RPL with. At worst, RNFD must react as soon as routing adjacency maintenance at a root's neighbor acting as sentinel detects that the link to the root is down. More aggressive approaches can be adopted though, especially since for some neighbor unreachability detection solutions, the precise convergence criteria need not be straightforward to infer. > - I mean, if the root reboots, is that worth the hassle of breaking and forming the DODAG again? This also depends on a particular deployment, the duration of the reboot, and other technologies that RPL is used with (e.g., how quickly a link failure is detected by routing adjacency maintenance). In our settings, an immediate LBR reboot could pass undetected, unless we did not want it to. In this context, I believe that one DODAG rebuilding due to an LBR reboot probably does not hurt. However, repeated reboots may indicate some problems with the LBR that may require attention, irrespective of whether RNFD is used or not. > - do the sentinels form a connected group ? how ? if not, what happens if one group loose connectivity to the root but the other does not? Ad 1. No (see one of the previous answers). Ad 2. N/A Ad 3. To answer precisely, one would need to know the sizes of the groups with respect to the total number of sentinels and the decision threshold. In short, in a pessimistic scenario in which in addition to the crash of the root the whole network partitions into multiple groups none of which has enough sentinels to exceed the decision threshold, RNFD would not detect the crash in any of the groups; it would be detected with RPL's own mechanisms, which are much slower though (provided RPL is correctly configured). To close this e-mail, I would like to emphasize that, as mentioned in the draft, RNFD is inherently probabilistic. It can significantly improve RPL's performance in a wide range of deployment settings. However, like for any other protocol, this range is not infinite, and beyond it, no gains will be observed. Similarly, like any other protocol, it may require some maintenance, but this maintenance can be largely automated. Finally, should such a need arise, RNFD can always be disabled at runtime. Once again, thank you for your insightful questions Pascal. Best regards, -- - Konrad Iwanicki. >> -----Original Message----- >> From: Roll <roll-bounces@ietf.org> On Behalf Of Konrad Iwanicki >> Sent: jeudi 8 avril 2021 23:20 >> To: Routing Over Low power and Lossy networks <roll@ietf.org>; Pascal >> Thubert (pthubert) <pthubert=40cisco.com@dmarc.ietf.org> >> Subject: Re: [Roll] Border router failure detection >> >> Hi again Pascal, >> >> On 08/04/2021 16:16, Pascal Thubert (pthubert) wrote: >>> Hello Konrad >>> >>> About this text >>> " >>> >>> Since all DODAG paths lead to the corresponding LBR, detecting its >>> crash by a node entails dropping all parents and adopting an >> infinite >>> rank, which reflects the node's inability to reach the LBR. >> However, >>> achieving this state for all nodes is slow, can generate heavy >>> traffic, and is difficult to implement correctly [Iwanicki16] >>> [Paszkowska19] [Ciolkosz19]. >>> " >>> >>> Please note that this may be what an implementation decides to do, but >> is by far not my favorite option. >>> The alternate is to rest the "G" but and become floating root. The >> benefit is that the DODAG structure is mostly maintained and that routing >> inside may persist. >>> >>> See https://tools.ietf.org/html/rfc6550#section-18.2.4 for how to >> configure this. >> >> OK, I think I addressed this in my previous e-mail to you and Michael. >> >> (The next two e-mails will have to wait till tomorrow I am affraid.) >> >> Best, >> -- >> - Konrad Iwanicki. >> >>>> -----Original Message----- >>>> From: Roll <roll-bounces@ietf.org> On Behalf Of Konrad Iwanicki >>>> Sent: mercredi 7 avril 2021 21:37 >>>> To: Routing Over Low power and Lossy networks <roll@ietf.org> >>>> Subject: Re: [Roll] Border router failure detection >>>> >>>> Dear all, >>>> >>>> Approximately a year ago, I joined the WG and advertised an extension >>>> to RPL that I had been working on for some time with a goal of >>>> describing it in a draft. Unfortunately, just when I got the >>>> necessary pointers and a few initial comments, the pandemic broke >>>> out. The resulting surge of professional and personal obligations >>>> made it literally impossible for me to participate in and, at some >>>> point, even follow the group's activities, for which I am sorry. >>>> >>>> However, I did strive to regularly find a little time to get somewhat >>>> acquainted with IETF's procedures and write up something that may >>>> resemble a draft. I have just submitted it as: >>>> >>>> draft-iwanicki-roll-rnfd-00 >>>> >>>> and created a dedicated repository on the group's GitHub. >>>> >>>> Since I have no experience writing IETF documents, I would really >>>> appreciate your collaboration on either turning the text into a true >>>> draft or concluding that this should not be done. >>>> >>>> Please, keep mind in mind that I decided to maximally cut down the >>>> original algorithms, so that the extension is as simple as possible >>>> without losing key properties. Therefore, it is likely that if the >>>> extension as a whole or any of its parts turn out relevant or >>>> interesting, still a considerable amount of work may be necessary to >>>> have it or them described properly. Your contribution is most welcome. >>>> >>>> Thanks in advance. >>>> >>>> Best regards, >>>> -- >>>> - Konrad Iwanicki. >>>> >>>> _______________________________________________ >>>> Roll mailing list >>>> Roll@ietf.org >>>> https://www.ietf.org/mailman/listinfo/roll >>> >>> _______________________________________________ >>> Roll mailing list >>> Roll@ietf.org >>> https://www.ietf.org/mailman/listinfo/roll >>> >> >> _______________________________________________ >> Roll mailing list >> Roll@ietf.org >> https://www.ietf.org/mailman/listinfo/roll > > _______________________________________________ > Roll mailing list > Roll@ietf.org > https://www.ietf.org/mailman/listinfo/roll >
- [Roll] IETF 107: Call for presentations Ines Robles
- [Roll] Border router failure detection Konrad Iwanicki
- Re: [Roll] Border router failure detection Michael Richardson
- Re: [Roll] Border router failure detection Rahul Jadhav
- Re: [Roll] Border router failure detection Ines Robles
- Re: [Roll] IETF 107: Call for presentations Remous-Aris Koutsiamanis
- Re: [Roll] IETF 107: Call for presentations Ines Robles
- Re: [Roll] Border router failure detection Konrad Iwanicki
- Re: [Roll] IETF 107: Call for presentations Rahul Jadhav
- Re: [Roll] IETF 107: Call for presentations Ines Robles
- Re: [Roll] Border router failure detection Konrad Iwanicki
- Re: [Roll] Border router failure detection Konrad Iwanicki
- Re: [Roll] IETF 107: Call for presentations dominique.barthel
- Re: [Roll] IETF 107: Call for presentations Georgios Z. Papadopoulos
- Re: [Roll] Border router failure detection Konrad Iwanicki
- Re: [Roll] Border router failure detection Michael Richardson
- Re: [Roll] Border router failure detection Pascal Thubert (pthubert)
- Re: [Roll] Border router failure detection Pascal Thubert (pthubert)
- Re: [Roll] Border router failure detection Michael Richardson
- Re: [Roll] Border router failure detection Konrad Iwanicki
- Re: [Roll] Border router failure detection Konrad Iwanicki
- Re: [Roll] Border router failure detection Konrad Iwanicki
- Re: [Roll] Border router failure detection Konrad Iwanicki
- Re: [Roll] Border router failure detection Pascal Thubert (pthubert)
- Re: [Roll] Border router failure detection Konrad Iwanicki
- Re: [Roll] Border router failure detection Konrad Iwanicki
- Re: [Roll] Border router failure detection Konrad Iwanicki
- Re: [Roll] Border router failure detection Michael Richardson
- Re: [Roll] Border router failure detection Konrad Iwanicki
- Re: [Roll] Border router failure detection Konrad Iwanicki
- Re: [Roll] Border router failure detection Konrad Iwanicki
- Re: [Roll] Border router failure detection Michael Richardson
- Re: [Roll] Border router failure detection Konrad Iwanicki
- Re: [Roll] Border router failure detection Michael Richardson
- Re: [Roll] Border router failure detection Konrad Iwanicki
- Re: [Roll] Border router failure detection Konrad Iwanicki
- Re: [Roll] Border router failure detection Michael Richardson
- Re: [Roll] Border router failure detection Konrad Iwanicki
- Re: [Roll] Border router failure detection Michael Richardson
- Re: [Roll] Border router failure detection Konrad Iwanicki