Re: [Roll] Border router failure detection

Dear Rahul,

On 03.03.2020 11:45, Rahul Jadhav wrote:
> Welcome Konrad and great to hear you on ROLL,

Thank you.

> I have a few questions with regard to the problem statement before we
> dive into the solution part.
>
> Border Router (BR) crash/restart is an issue and any deployment needs to
> tackle it (IMO, in version 0.1). As I understand the aim of the paper is
> to ensure that the attached nodes should detect the BR failure asap.
> Just curious to understand how the nodes use this information in your
> deployments? What will be the use-case of this detection?

Yes, the main goal is to detect a failure of a border router asap. In 
general, the use-cases cover highly available systems. Faster detection 
of BR failures gives more possibilities when switching to fallback 
mechanisms, rebinding to a different DODAG, or spawning a new one. A few 
concrete examples:

1. Minimizing lost samples in push-based data-collection apps.

A classic app: each node periodically sends data upward to a BR, which 
then forwards them to a sink.

If one aims at perfect reliability, an end-to-end solution is required: 
the sink acknowledges each data sample, and samples that have not been 
acknowledged are buffered and retransmitted when possible. This requires 
downward-routed ACKs.

In contrast, if we assume nearly-perfect data delivery in that we 
occasionally allow individual samples to be lost, we can have a more 
efficient solution. To begin with, if implemented properly, link-layer 
mechanisms such as hop-by-hop acks, queuing, and retransmissions already 
allow for a very high end-to-end packet reception rate for many 
deployments. As I mentioned in my first e-mail, this leaves BR failures 
as the main cause of burst data loss, at least in our experience. 
Therefore, if we can detect a BR failure fast, nodes can quickly switch 
to buffering samples instead of forwarding them (few samples are lost), 
and start forwarding again only when the BR recovers. This approach has
three benefits. First, it eliminates end-to-end ACKs, which reduces
data-related traffic by up to a factor of two. Second, if the nodes just
collect data, the network's MOP can be downgraded to 0, which in
particular eliminates DAOs and DAO-ACKs, thereby further reducing the
overall traffic. Third, when operating under MOP 0, we can employ
dedicated radio activity scheduling algorithms at the link layer, which
reduces energy expenditure even more and can further improve the
end-to-end packet reception rate. Overall, with RNFD we can extend node
battery lifetime severalfold without sacrificing much end-to-end data
delivery compared to the end-to-end solution.
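
To illustrate the buffering behavior described above, here is a minimal
Python sketch. It is only an illustration under assumptions, not the
TinyOS implementation: the class SampleForwarder, the send callback, and
the on_root_state_change hook are hypothetical names, and the sketch
simply assumes that a detector such as RNFD calls the hook when it
decides that the BR has failed or recovered.

```python
from collections import deque


class SampleForwarder:
    """Forwards samples while the BR is believed alive, buffers otherwise."""

    def __init__(self, send, capacity=64):
        self.send = send                      # hypothetical callback: push one sample upward
        self.buffer = deque(maxlen=capacity)  # when full, the oldest sample is dropped
        self.root_alive = True

    def on_root_state_change(self, alive):
        """Hypothetical hook, assumed to be called by the failure detector."""
        self.root_alive = alive
        if alive:
            # BR recovered: flush buffered samples in their original order.
            while self.buffer:
                self.send(self.buffer.popleft())

    def submit(self, sample):
        if self.root_alive:
            self.send(sample)
        else:
            self.buffer.append(sample)


sent = []
node = SampleForwarder(sent.append, capacity=3)
node.submit("s1")
node.on_root_state_change(False)   # BR failure detected
for s in ("s2", "s3", "s4", "s5"):
    node.submit(s)                 # buffered; "s2" is evicted once capacity is exceeded
node.on_root_state_change(True)    # BR recovered: the buffer is flushed
node.submit("s6")
print(sent)
```

The bounded buffer is a design choice of the sketch: on a
memory-constrained node, a long outage eventually costs the oldest
samples, which fits the nearly-perfect (rather than perfect) delivery
goal.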

2. Maximizing availability in pull-based data-collection apps.

Here is the opposite approach: nodes act as servers from which other
Internet hosts pull data on demand. (The approach below applies much
more widely, though.)

In this app, and actually in general, it makes sense for a node to be a
member of as few DODAGs as possible, ideally just the best one. This is
because belonging to a DODAG incurs overhead in memory, computation,
and control traffic, not only on this node but also on its ancestors in
the DODAG. The problem is that if a BR goes down, a node may be
unreachable for a while.

One solution would thus be joining multiple DODAGs, irrespective of the
overhead. The drawback is that this overhead must always be paid, even
when the network operates correctly. If, in turn, we have fast BR
failure detection, nodes can belong to just one DODAG and quickly
switch to another one upon a BR failure. Again, this can at least halve
the overhead without sacrificing much availability (at least as far as
unavailability due to BR failures is concerned).
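
The single-DODAG policy above could be sketched as follows. This is a
hypothetical illustration, not part of RNFD or RPL: the Dodag record and
select_dodag function are made-up names, the root_alive flag is assumed
to be maintained by the failure detector, and choosing the DODAG with
the lowest rank is a deliberate simplification of a real objective
function.

```python
from dataclasses import dataclass


@dataclass
class Dodag:
    dodag_id: str
    rank: int                # this node's rank in the DODAG (assumed: lower is better)
    root_alive: bool = True  # assumed to be maintained by the failure detector


def select_dodag(candidates):
    """Return the candidate with a live root and the lowest rank, or None."""
    alive = [d for d in candidates if d.root_alive]
    return min(alive, key=lambda d: d.rank, default=None)


candidates = [Dodag("dodag-a", rank=512), Dodag("dodag-b", rank=768)]
before = select_dodag(candidates)   # picks dodag-a, the better of the two

candidates[0].root_alive = False    # failure of dodag-a's BR is detected
after = select_dodag(candidates)    # the node rebinds to dodag-b
print(before.dodag_id, "->", after.dodag_id)
```

The node keeps state for only one DODAG at a time; the faster the
root_alive flag reflects reality, the shorter the window in which the
node is unreachable.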

3. Maximizing availability in networks with actuation.

We have sensor and actuator nodes. Routing from sensors to actuators is 
normally driven by a BR. This allows, for instance, for traffic analysis 
and forwarding sensor samples to other sinks (e.g., analytical or 
visualization components) besides the relevant actuators.

If a BR goes down, however, what you want to have at the very least is 
communication between sensors and actuators, so that the system 
continues to operate. To this end, the actuators can be roots of their 
own (floating) DODAGs. Again, for efficiency one would probably prefer
this to happen only upon a BR failure, not throughout the entire
lifetime of the system. This means that being able to detect a BR
failure as fast
as possible may be crucial to ensure that actuators spawn their own 
DODAGs and get sensor readings without major disruptions.
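
As a rough illustration of this fallback, the role change of an actuator
could look like the Python sketch below. Everything in it is assumed for
the example: the Role and Actuator names and the on_root_state_change
hook are hypothetical, and the actual RPL actions (advertising or
tearing down the floating DODAG) are left out.

```python
from enum import Enum


class Role(Enum):
    MEMBER = "member of the BR-rooted DODAG"
    FLOATING_ROOT = "root of its own floating DODAG"


class Actuator:
    def __init__(self):
        self.role = Role.MEMBER
        self.transitions = 0  # counts actual role changes only

    def on_root_state_change(self, br_alive):
        """Hypothetical hook, assumed to be called by the failure detector."""
        new_role = Role.MEMBER if br_alive else Role.FLOATING_ROOT
        if new_role is not self.role:
            self.role = new_role
            self.transitions += 1
        return self.role


actuator = Actuator()
actuator.on_root_state_change(False)  # BR down: become a floating root
during_failure = actuator.role
actuator.on_root_state_change(False)  # repeated notification: no change
actuator.on_root_state_change(True)   # BR back: return to its DODAG
after_recovery = actuator.role
print(during_failure.value, "->", after_recovery.value)
```

The floating DODAG exists only for the duration of the failure, so its
overhead is not paid while the BR operates correctly.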

> In my deployments, my focus was to get the BR restarted as soon
> as possible and then ensure that the nodes below it could rejoin the
> restarted BR. This would mean that BR needs to backup some state
> information that would be used post-crash-recovery.

This is very important as well. However, in our case, we had only
limited control over when the BR was restarted. The goal of RNFD is to
offer additional ways of improving the network's performance for the
duration of such a failure. The two problems are thus orthogonal.

> On the contrary, in my scenario, it was necessary to detect 6LN/6LR
> failure without depending on periodical pings from nodes to central
> server i.e., for some reason if the node (non-BR node) is stuck, then BR
> should detect it asap (without depending on route lifetime since it
> could be very high) so that measures could be taken (for e.g., send a
> personnel to repair the smart meter). This also is non-trivial to
> handle. For BR, it is always possible to ping on the external leg and
> check availability. Also, BR unavailability means no traffic going out
> from any nodes and thus is easy to identify on the external monitoring
> system.

True, this is also not trivial, but, as I wrote above, it is an
orthogonal problem.

In general, RNFD is no silver bullet. Apart from the cases that both of 
us mentioned, there are other types of failures possible in a network 
that RNFD does not address. What it does, however, is solve a
particular problem that we found relevant in practice and that hopefully
may be relevant for other people as well.

> Also, as Michael mentioned, the working group had discussed some issues
> with respect to reboot handling (on 6LN/6LR/BR) and it has been captured
> in https://datatracker.ietf.org/doc/draft-ietf-roll-rpl-observations/
> It would be immensely helpful to get your view on those points.

Sure, I will take a look at this when I find some time.

Best regards,
-- 
- Konrad Iwanicki.

> Best,
> Rahul
>
> On Tue, 3 Mar 2020 at 00:42, Michael Richardson <mcr+ietf@sandelman.ca
> <mailto:mcr%2Bietf@sandelman.ca>> wrote:
>
>
>     Welcome!
>
>     Konrad Iwanicki <iwanicki@mimuw.edu.pl
>     <mailto:iwanicki@mimuw.edu.pl>> wrote:
>         > In a nutshell, I would like to propose an extension to RPL
>     that had been
>         > invented to significantly improve handling crashes of border
>     routers. Since I
>         > have little experience writing RFC-like drafts, I would
>     greatly appreciate
>         > any help.
>
>     Use the markdown method, and use someone's template github.
>
>         > What we observed, however, is that RPL does not efficiently
>     handle crashes of
>         > border routers [1][2]. Upon such a failure, tearing down
>     nonexistent upward
>         > routes can take a lot of time (depending on the data-plane
>     traffic) and
>         > generate considerable control traffic, which is problematic in
>     many
>         > applications.
>
>     Rahul and Pascal (and others) have had a lot of conversation about
>     how we
>     deal with the various lollipop counters.  So I am interested in what
>     your
>     border router does when it boots: how does it announce the new DIOs?
>
>         > What we did to address the problem was developing an
>     algorithm, called RNFD,
>         > in which nodes collaborate to monitor the state of a border
>     router of the
>         > DODAG they belong to [1]. Experiments with a TinyOS
>     implementation of the
>         > algorithm on two testbeds (32 nodes at 2.4GHz and 76 nodes at
>     868MHz) and in
>         > simulations show that it can outperform bare RPL: it can
>     detect a border
>         > router crash one or two orders of magnitude faster and with
>     much lower
>         > control traffic [1].
>
>     okay.
>
>         > [1] K. Iwanicki: “RNFD: Routing-Layer Detection of DODAG
>     (Root) Node Failures
>         > in Low-Power Wireless Networks,” in IPSN 2016: Proceedings of
>     the 15th
>         > ACM/IEEE International Conference on Information Processing in
>     Sensor
>         > Networks. IEEE. Vienna, Austria. April 2016. pp. 1—12. DOI:
>         > 10.1109/IPSN.2016.7460720
>
>     Unfortunately, it's behind the IEEE paywall.
>     I have given up on getting documents from the IEEE.
>     I guess you have been working on this for at least five years now.
>
>         > [2] A. Paszkowska and K. Iwanicki: “Failure Handling in RPL
>     Implementations:
>         > An Experimental Qualitative Study,” in Mission-Oriented Sensor
>     Networks and
>         > Systems: Art and Science (Habib M. Ammari ed.). Springer
>     International
>         > Publishing. Cham, Switzerland. September 2019. pp. 49—95. DOI:
>         > 10.1007/978-3-319-91146-5_3
>
>         > [3] P. Ciolkosz: “Integration of the RNFD Algorithm for Border
>     Router Failure
>         > Detection with the RPL Standard for Routing IPv6 Packets,”
>     Master's Thesis,
>         > University of Warsaw. November 2019.
>
>     --
>     Michael Richardson <mcr+IETF@sandelman.ca
>     <mailto:mcr%2BIETF@sandelman.ca>>, Sandelman Software Works
>      -= IPv6 IoT consulting =-
>
>     _______________________________________________
>     Roll mailing list
>     Roll@ietf.org <mailto:Roll@ietf.org>
>     https://www.ietf.org/mailman/listinfo/roll
>