Re: [Roll] Border router failure detection

Konrad Iwanicki <iwanicki@mimuw.edu.pl> Thu, 08 April 2021 21:17 UTC

To: Routing Over Low power and Lossy networks <roll@ietf.org>, Michael Richardson <mcr+ietf@sandelman.ca>, "Pascal Thubert (pthubert)" <pthubert=40cisco.com@dmarc.ietf.org>
References: <CAP+sJUfcEY2DNEQV=duJdN6P8zZn0ccuei+4ra-B6TcLb5z8Kg@mail.gmail.com> <49ac5fc3-4a3c-fb87-d366-eb7e7cfd60df@mimuw.edu.pl> <18233.1583176305@localhost> <CAO0Djp3w4vWCOawQ+eegNTRzb_HRGYH6n=bdEH6iVf5ZO0AGFQ@mail.gmail.com> <f71fe153-c0d1-097e-a72e-49ece97cbd48@mimuw.edu.pl> <10272666-28c7-ab3e-9ceb-1b8f2bb6e5e5@mimuw.edu.pl> <CO1PR11MB4881A5AA0E5C5010FD2BE39ED8749@CO1PR11MB4881.namprd11.prod.outlook.com> <27309.1617893979@localhost>
From: Konrad Iwanicki <iwanicki@mimuw.edu.pl>
Message-ID: <4ab61b5f-c5f2-085b-beb6-26f4a67d8ca2@mimuw.edu.pl>
Date: Thu, 8 Apr 2021 23:17:01 +0200
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.13; rv:78.0) Gecko/20100101 Thunderbird/78.9.0
MIME-Version: 1.0
In-Reply-To: <27309.1617893979@localhost>
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Language: en-US
Content-Transfer-Encoding: 7bit
Archived-At: <https://mailarchive.ietf.org/arch/msg/roll/TNAoOybherbZlKM8wKkuFDtbpIU>
Subject: Re: [Roll] Border router failure detection

Hi Pascal, Michael,

First of all, thanks for your interest, encouragement, and feedback!

I'll try to answer in the "stack mode" (from your last email to the 
first) and as much as I can tonight because my throughput this week 
again became very limited due to another lockdown.

On 08/04/2021 16:59, Michael Richardson wrote:
> 
> Pascal Thubert (pthubert) <pthubert=40cisco.com@dmarc.ietf.org> wrote:
>      > Since all DODAG paths lead to the corresponding LBR, detecting its
>      > crash by a node entails dropping all parents and adopting an infinite
>      > rank, which reflects the node's inability to reach the LBR.  However,
>      > achieving this state for all nodes is slow, can generate heavy
>      > traffic, and is difficult to implement correctly [Iwanicki16]
>      > [Paszkowska19] [Ciolkosz19].
>      > "
> 
>      > Please note that this may be what an implementation decides to do, but
>      > is by far not my favorite option.
> 
>      > The alternate is to reset the "G" bit and become floating root. The
>      > benefit is that the DODAG structure is mostly maintained and that
>      > routing inside may persist.
> 
> This works well for storing mode where one expects that all 6LR are pretty
> much equivalently capable.
> 
> For Non-Storing Mode, it could be that only the LBR (and its redundant
> backup...), are capable of accepting the DAOs, maintaining that table and
> generating the right routing headers.
> 
> Perhaps this is where the gap lies.

Floating DODAGs are really the best option to keep routing inside a
DODAG possible in the face of failures. RNFD, however, deals with an
orthogonal problem and can actually help spawn floating DODAGs faster
after a grounded DODAG's root dies, as I try to explain below.

To start with, the envisioned use of RNFD concerns "important" DODAG
roots (e.g., providing connectivity with the rest of the Internet) that
in addition have an increased risk of failures (because of, e.g., power
supply interruptions or bugs resulting from greater HW/SW complexity
compared to "normal" nodes). In practice, these will be LBRs that are
roots of grounded DODAGs, and hence they are emphasized in the text.
Nevertheless, one can also use the algorithm for any other DODAG root.
The cost is the overhead on DIOs and DISs due to the extra option and
possible Trickle timer resets.

Moreover, RNFD aims at minimal requirements with respect to RPL. In 
particular, it works even if a DODAG is completely degenerated. 
Likewise, it works for all RPL MOPs, notably even MOP 0 without any 
downward routes (which I believe is a particularly attractive feature).

Therefore, regarding floating DODAGs, irrespective of nodes' 
capabilities, a major question is: When should a node leave a grounded 
DODAG and start advertising a floating one?

More specifically, https://tools.ietf.org/html/rfc6550#section-18.2.4,
quoted by Pascal in the previous e-mail, states that this is done by a
node whose parent set becomes empty. This moment is justified because,
until then, the node can always select some node as its preferred
parent.

However, an empty parent set normally implies that the node has
already performed multiple unnecessary parent switches after a crash of
the DODAG root, as the ranks of its subsequent parents grew repeatedly
due to the lack of a root. These switches must have taken time and
generated control traffic resulting from Trickle timer resets.

What RNFD can do is reduce the time from the crash of a DODAG root to
the moment a node empties its parent set (which is done when RNFD sets
its LORS to GLOBALLY DOWN) by an order of magnitude. This means that a
node can conclude much more quickly that the root of a grounded DODAG is
dead, empty its parent set, and become the root of a floating DODAG.
Note that this is independent of whether storing or non-storing mode is
used.
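
To make this concrete, here is a minimal sketch (in Python) of how a
node might react once RNFD sets its LORS to GLOBALLY DOWN. All names
(Node, Lors, on_lors_change, can_be_floating_root) are hypothetical and
come neither from the draft nor from any RPL implementation; only the
two LORS values needed for the illustration are modeled, and the default
MinHopRankIncrease of 256 is assumed.

```python
from dataclasses import dataclass, field
from enum import Enum, auto

# Constants from RFC 6550; 256 is the default MinHopRankIncrease.
MIN_HOP_RANK_INCREASE = 256
ROOT_RANK = MIN_HOP_RANK_INCREASE
INFINITE_RANK = 0xFFFF


class Lors(Enum):
    """Only the two LORS values this sketch needs; others are omitted."""
    UP = auto()
    GLOBALLY_DOWN = auto()


@dataclass
class Node:
    """Hypothetical per-node RPL state, reduced to what the sketch uses."""
    node_id: str
    rank: int
    dodag_root: str
    grounded: bool = True                 # "G" bit advertised in DIOs
    parents: set = field(default_factory=set)
    can_be_floating_root: bool = True     # local policy, an assumption

    def on_lors_change(self, lors: Lors) -> None:
        """React to a new LORS value reported by RNFD."""
        if lors is not Lors.GLOBALLY_DOWN:
            return  # root not known to be dead: leave RPL state alone

        # The root is considered dead: drop all parents at once instead
        # of waiting for repeated parent switches to empty the set.
        self.parents.clear()

        if self.can_be_floating_root:
            # Become the root of a floating DODAG ("G" bit cleared).
            self.dodag_root = self.node_id
            self.grounded = False
            self.rank = ROOT_RANK
        else:
            # Otherwise reflect the inability to reach any root.
            self.rank = INFINITE_RANK


node = Node("n7", rank=768, dodag_root="lbr", parents={"n3", "n5"})
node.on_lors_change(Lors.GLOBALLY_DOWN)
print(node.dodag_root, node.grounded, node.rank, node.parents)
```

The point of the sketch is only the ordering: the parent set is emptied
as a direct consequence of the LORS change, so the decision to start a
floating DODAG does not have to wait for RPL's own rank increases.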

What is more, this reduction of the root's crash detection period also
reduces RPL's opportunities to switch the node's parents and, in effect,
degenerate downward routes. Therefore, with RNFD it is likely that
downward routes in the floating DODAG remain closer to what they were in
the original grounded DODAG, which is probably more important in storing
mode.

To sum up, when operating together with RPL in a deployment where nodes 
can spawn floating DODAGs, RNFD can significantly reduce the 
interruption period of downward routing as well.

What I see in the text of the draft is that indeed I did not mention the 
possibility of starting a floating DODAG when the GLOBALLY DOWN value of 
LORS is attained. Thanks a lot for these observations!

Where do you think it would be best to put this information in the text?

Best,
-- 
- Konrad Iwanicki.