Re: [Roll] Border router failure detection

Konrad Iwanicki <iwanicki@mimuw.edu.pl> Wed, 14 April 2021 12:11 UTC

From: Konrad Iwanicki <iwanicki@mimuw.edu.pl>
To: Michael Richardson <mcr+ietf@sandelman.ca>, Routing Over Low power and Lossy networks <roll@ietf.org>
References: <CAP+sJUfcEY2DNEQV=duJdN6P8zZn0ccuei+4ra-B6TcLb5z8Kg@mail.gmail.com> <49ac5fc3-4a3c-fb87-d366-eb7e7cfd60df@mimuw.edu.pl> <18233.1583176305@localhost> <CAO0Djp3w4vWCOawQ+eegNTRzb_HRGYH6n=bdEH6iVf5ZO0AGFQ@mail.gmail.com> <f71fe153-c0d1-097e-a72e-49ece97cbd48@mimuw.edu.pl> <10272666-28c7-ab3e-9ceb-1b8f2bb6e5e5@mimuw.edu.pl> <8372.1617839184@localhost>
Message-ID: <4db6ca0b-86a7-cad2-27fc-4728f01e0f06@mimuw.edu.pl>
Date: Wed, 14 Apr 2021 14:12:23 +0200
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Thunderbird/45.8.0
MIME-Version: 1.0
In-Reply-To: <8372.1617839184@localhost>
Content-Type: text/plain; charset="utf-8"; format="flowed"
Content-Transfer-Encoding: 7bit
Archived-At: <https://mailarchive.ietf.org/arch/msg/roll/dSC-MMBv99h0xmghy54cKVz8H8M>
Subject: Re: [Roll] Border router failure detection

Dear Michael,

On 08/04/2021 01:46, Michael Richardson wrote:
> You might want to add a Makefile, either via
>      https://github.com/martinthomson/i-d-template/blob/master/doc/SETUP.md
> or rip off my minimal one, such as:
>      https://github.com/roll-wg/draft-ietf-roll-enrollment-priority/blob/master/Makefile

Thanks for the suggestion! I did borrow your getver and Makefile, 
slightly adapting the latter.

> RPL is being used in storing mode in the ANIMA WG's Autonomic Control Plane.
> See: https://datatracker.ietf.org/doc/draft-ietf-anima-autonomic-control-plane/
> ( section 6.12. ).   The document is minutes away from getting an RFC#.
>
> We think that the border router, DODAG root, will be a device in the NOC.
> The NOC may get replicated into multiple locations so there could potentially
> be more than one candidate DODAG root.  Given nodes are not constrained in
> the RFC7228 sense, supporting multiple DODAGs could be done, but we
> simplified our life by not mandating (actually, at this point, forbidding)
> the RPI header, so we are lacking an instanceID.
>
> The short of it is that I'd really like nodes to be able to float
> non-grounded DODAG roots if they don't hear a DIO after a few seconds.

I have read the relevant section of the ANIMA ACP draft. I do not know 
if I got all the details, so correct me if I am wrong.

Indeed, the draft adopts RPL in storing mode with a single RPL Instance. 
It assumes that the NOC can be replicated and seems to suggest that the 
DODAG root is a node within a NOC. More specifically, though I may have 
misunderstood this, there would be a single primary DODAG root and, when 
it malfunctions, another node, possibly in a different NOC, would take 
over this role, so that at most one DODAG root is active at any time.

I did not find the details of the failover mechanism, but I presume that 
the nodes that can act as DODAG roots synchronize with each other using 
regular RPL control messages over non-LLN links (perhaps inter-NOC 
ones). DIOs can be exchanged over such links at a much higher frequency 
than over LLN links. In effect, if one of these NOC-based DODAG root 
candidates does not hear a DIO from the acting DODAG root within a few 
seconds, it may indeed take over.
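
To make my presumption concrete, here is a minimal sketch of such a 
DIO-timeout takeover. It is only my guess at the behavior, not something 
taken from the ACP draft: the class name, the method names, and the 
3-second timeout are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class RootCandidate:
    """Hypothetical NOC-based node that may take over as DODAG root."""

    dio_timeout_s: float = 3.0   # assumed value for "a few seconds"
    last_dio_s: float = 0.0      # time the last DIO from the acting root was heard
    acting_root: bool = False

    def on_dio(self, now_s: float) -> None:
        """Record a DIO heard from the acting root over the non-LLN link."""
        self.last_dio_s = now_s

    def tick(self, now_s: float) -> bool:
        """Take over if no DIO has been heard for longer than the timeout."""
        if not self.acting_root and now_s - self.last_dio_s > self.dio_timeout_s:
            self.acting_root = True
        return self.acting_root


candidate = RootCandidate()
candidate.on_dio(10.0)
print(candidate.tick(12.0))   # still within the timeout
print(candidate.tick(13.5))   # timeout exceeded, the candidate takes over
```

Note that the sketch only observes DIOs arriving over the link between 
the candidates; it says nothing about the state of the LLN side.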

In such a setting, RNFD may not have many advantages: depending on a 
number of factors, it could offer comparable performance and hence 
could complement the DIO timeout mechanism. As I explained in my 
previous e-mails, the possibility of spawning floating DODAGs is 
actually not an obstacle.

There is, however, at least one situation in which RNFD could help. 
The LLN interface of the NOC-based DODAG root may fail while the 
interface used for synchronizing with the other NOC-based candidates 
keeps working normally. In this scenario, simply exchanging DIOs between 
the DODAG root candidates would not necessarily detect the failure, in 
contrast to RNFD, which would monitor the actual LLN links.
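
The difference between the two vantage points can be shown with a toy 
model. This is a deliberately simplified sketch: the two-interface Root 
type and both check functions are hypothetical names of mine and do not 
correspond to anything defined in RPL, ACP, or the RNFD draft.

```python
from dataclasses import dataclass


@dataclass
class Root:
    """Hypothetical DODAG root with two independent interfaces."""

    backbone_up: bool   # non-LLN link used to synchronize with other candidates
    lln_up: bool        # LLN interface serving the rest of the DODAG


def alive_for_candidates(root: Root) -> bool:
    """Other root candidates only hear DIOs sent over the backbone."""
    return root.backbone_up


def alive_for_lln_neighbors(root: Root) -> bool:
    """LLN neighbors of the root judge it by the LLN links themselves."""
    return root.lln_up


# The LLN interface has failed, but the backbone interface still works.
root = Root(backbone_up=True, lln_up=False)
print(alive_for_candidates(root))     # the candidates see nothing wrong
print(alive_for_lln_neighbors(root))  # the LLN side sees a dead root
```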

All in all, I believe that to some extent RNFD can be beneficial even 
with ACP.

What is more, your remark got me thinking about virtual DODAG roots 
(as per https://tools.ietf.org/html/rfc6550#section-8.2.2.2), 
where multiple nodes, potentially NOC-based, coordinate to act 
transparently and simultaneously (not in a primary-backup fashion) as a 
single DODAG root. In such a deployment, RNFD also works but would not 
necessarily detect the crash of an individual node that is part of the 
virtual root. This is, however, expected behavior because, despite the 
crash of that node, the root as a whole continues to function. Depending 
on the deployment policies, RNFD may need to be configured with a 
different threshold, though. Consequently, I believe that virtual DODAG 
roots, which I had forgotten about, also need to be mentioned in the 
draft. Thanks for the remark!
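
To illustrate the role of the threshold, here is a rough sketch. It 
assumes, purely for illustration, that the decision reduces to comparing 
the fraction of the root's LLN neighbors that consider their link to the 
root down against a configurable threshold. The function name, the 
two-member virtual root, and the threshold values are all hypothetical, 
and the actual counters used by RNFD are not modeled.

```python
def root_considered_dead(link_down_reports: list[bool], threshold: float) -> bool:
    """Return True if the fraction of neighbors reporting the root as
    unreachable reaches the threshold (simplified stand-in for RNFD)."""
    if not link_down_reports:
        return False
    fraction = sum(link_down_reports) / len(link_down_reports)
    return fraction >= threshold


# Hypothetical virtual root made of members A and B, each serving three
# LLN neighbors. Member A crashes, so only its three neighbors report a
# dead link; the neighbors of B still reach the virtual root.
reports = [True, True, True, False, False, False]

print(root_considered_dead(reports, threshold=0.8))  # root as a whole survives
print(root_considered_dead(reports, threshold=0.5))  # a stricter policy reacts
```

With the higher threshold the crash of a single member goes unreported, 
which matches the expected behavior described above; a deployment that 
wants to react to partial failures would pick a lower value.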

As for the comments below, I will try to apply them to the draft, 
together with addressing the previous issues, and get back to you when 
this has been done.

> It seems that you might want a term for the LBR's children.
> That is, the devices at rank "1", that hear the LBR's DIOs.
>
> I think that I would move some of section 3.2 further forward in the
> document.  I think that I need a gentler introduction to CFRCs here, and I
> don't really need to know the properties, rather I need a higher-level idea
> of things.    Since section 4 goes over the operations again, I would leave
> it for that spot, and make it a section 4.1.
>
> Having gone forward and back a bit, I'm still a bit uncertain how nodes
> assign themselves a bit... oh, self() in section 4 says "random".
> Why not make this a function (hash?) of the short-IPv6 address or something?
>
> Not every media has ACK frames at the L2 to establish that there are
> failures.  It might be worth putting the Detecting and the Verifying into
> separate sections.  Aside from the ANIMA case (which is usually pure ethernet),
> there are also situations where there is an ethernet backbone connecting a
> few 6LBRs (RFC8929), and your protocol would sensibly run on both the
> wireless and the wired side of the 6LBRs.
>
> I also wonder if the RNFD could be included in DAOs (particularly storing
> mode ones) sent to the DODAG root.
> I know that probably seems senseless: why tell the root that you are
> observing it to be dying....  But, it acts as interesting telemetry about
> what the nodes are seeing, and might serve as a useful indication of imminent
> failure, or some kind of systematic long-cycle pathology.
>
> Your IANA considerations are how the document will look after IANA has
> processed it.  Prior to that point, you need to write it as a request.
>
> Something like:
>
>     IANA is requested to allocate the value TBD1 from the "RPL Control Message
>     Options" sub-registry of the "Routing Protocol for Low Power and Lossy
>     Networks (RPL)" registry.
>
> I like to include the URL of the registry in my request to be really really
> clear, and to save everyone else the time to find it.
>
> Your security considerations will want to cite RFC7416.
> In particular, 7.2.4, and section 7.3.4 and 7.3.5 might be relevant.

Best,
-- 
- Konrad Iwanicki.