Re: [RTG-DIR] [Roll] RtgDir Early review: draft-ietf-roll-rnfd-02

Konrad Iwanicki <iwanicki@mimuw.edu.pl> Wed, 20 March 2024 14:21 UTC

Return-Path: <iwanicki@mimuw.edu.pl>
X-Original-To: rtg-dir@ietfa.amsl.com
Delivered-To: rtg-dir@ietfa.amsl.com
Received: from localhost (localhost [127.0.0.1]) by ietfa.amsl.com (Postfix) with ESMTP id 14522C14F74E; Wed, 20 Mar 2024 07:21:33 -0700 (PDT)
X-Virus-Scanned: amavisd-new at amsl.com
X-Spam-Flag: NO
X-Spam-Score: -2.106
X-Spam-Level:
X-Spam-Status: No, score=-2.106 tagged_above=-999 required=5 tests=[BAYES_00=-1.9, DKIM_SIGNED=0.1, DKIM_VALID=-0.1, DKIM_VALID_AU=-0.1, DKIM_VALID_EF=-0.1, RCVD_IN_ZEN_BLOCKED_OPENDNS=0.001, SPF_HELO_NONE=0.001, SPF_PASS=-0.001, T_SCC_BODY_TEXT_LINE=-0.01, URIBL_BLOCKED=0.001, URIBL_DBL_BLOCKED_OPENDNS=0.001, URIBL_ZEN_BLOCKED_OPENDNS=0.001] autolearn=ham autolearn_force=no
Authentication-Results: ietfa.amsl.com (amavisd-new); dkim=pass (2048-bit key) header.d=mimuw.edu.pl
Received: from mail.ietf.org ([50.223.129.194]) by localhost (ietfa.amsl.com [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id wuNkBBXlyDFt; Wed, 20 Mar 2024 07:21:27 -0700 (PDT)
Received: from mail.mimuw.edu.pl (mail.mimuw.edu.pl [IPv6:2001:6a0:5001::4]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (2048 bits) server-digest SHA256) (No client certificate requested) by ietfa.amsl.com (Postfix) with ESMTPS id E4339C14F713; Wed, 20 Mar 2024 07:21:23 -0700 (PDT)
Received: from localhost (localhost [127.0.0.1]) by mail.mimuw.edu.pl (Postfix) with ESMTP id 0A8873000EA62; Wed, 20 Mar 2024 15:21:20 +0100 (CET)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=mimuw.edu.pl; h= content-transfer-encoding:content-type:content-type:in-reply-to :from:from:references:content-language:subject:subject :user-agent:mime-version:date:date:message-id:received:received; s=20240128; t=1710944477; x=1711549278; bh=T3Ajw8Kx2LlBcQ5D2aaA qQjyiB3mCxXL8JhSYZ+YisA=; b=q7AdSzv6huCBKiNxy9Sbu2UsG1H+GAVTx3xh 5JOYHkiOAVmCB0lc5pnLgOVNZ8n7TFNt0QbsIqNgJ+GjYwdDcQf7DqBw7dzEJHmu xNURXM40A0aJ2MPJKhWfLaAmHtJjT/himVQuKvR5StzYZDresn5DyBUacWqoikN+ U189Iz5abiUKbaXLPAhafk6WFfkM/teyE012cMwGCHxbqFQh6/fREHU3imEPR2Iu Pu9dxnCrPgLB8AdIJh0slDjqgGuHE3VPh0mHsNhPc+MRSM12QVCGlPSvxogxvX7T 8cyk5NMojfMgoy917TEODxJBcjQWY0lOb6jae1JoyGGVyBKmzg==
X-Virus-Scanned: Debian amavis at mail.mimuw.edu.pl
Received: from mail.mimuw.edu.pl ([127.0.0.1]) by localhost (mail.mimuw.edu.pl [127.0.0.1]) (amavis, port 10026) with ESMTP id M1qfhXlxDTfl; Wed, 20 Mar 2024 15:21:17 +0100 (CET)
Received: from [192.168.0.171] (unknown [213.134.167.50]) (using TLSv1.3 with cipher TLS_AES_128_GCM_SHA256 (128/128 bits) key-exchange X25519 server-signature RSA-PSS (2048 bits) server-digest SHA256) (No client certificate requested) by mail.mimuw.edu.pl (Postfix) with ESMTPSA; Wed, 20 Mar 2024 15:21:17 +0100 (CET)
Message-ID: <19cec562-e7f4-4499-9e7f-22be828b9746@mimuw.edu.pl>
Date: Wed, 20 Mar 2024 15:21:17 +0100
MIME-Version: 1.0
User-Agent: Mozilla Thunderbird
Content-Language: en-US, pl
To: Routing Over Low power and Lossy networks <roll@ietf.org>, Victoria Pritchard <pritchardv0@gmail.com>, roll-chairs@ietf.org, draft-ietf-roll-rnfd.all@ietf.org
Cc: rtg-dir@ietf.org
References: <CA+fLEh+B9Njz=BDqQw0X8ePVLu0m4u_pe43DZURmFhW=i7+zOg@mail.gmail.com>
From: Konrad Iwanicki <iwanicki@mimuw.edu.pl>
In-Reply-To: <CA+fLEh+B9Njz=BDqQw0X8ePVLu0m4u_pe43DZURmFhW=i7+zOg@mail.gmail.com>
Content-Type: text/plain; charset="UTF-8"; format="flowed"
Content-Transfer-Encoding: 8bit
Archived-At: <https://mailarchive.ietf.org/arch/msg/rtg-dir/xqZCIgTPkyMF9t8xEUt-Vuqjb5k>
Subject: Re: [RTG-DIR] [Roll] RtgDir Early review: draft-ietf-roll-rnfd-02
X-BeenThere: rtg-dir@ietf.org
X-Mailman-Version: 2.1.39
Precedence: list
List-Id: Routing Area Directorate <rtg-dir.ietf.org>
List-Unsubscribe: <https://www.ietf.org/mailman/options/rtg-dir>, <mailto:rtg-dir-request@ietf.org?subject=unsubscribe>
List-Archive: <https://mailarchive.ietf.org/arch/browse/rtg-dir/>
List-Post: <mailto:rtg-dir@ietf.org>
List-Help: <mailto:rtg-dir-request@ietf.org?subject=help>
List-Subscribe: <https://www.ietf.org/mailman/listinfo/rtg-dir>, <mailto:rtg-dir-request@ietf.org?subject=subscribe>
X-List-Received-Date: Wed, 20 Mar 2024 14:21:33 -0000

Dear Victoria,

Thank you a lot for your feedback! Please, find my replies inline. 
Whenever I talk about changes made in the draft, I mean its revised 
version, 03, which I submitted earlier today.

On 27/11/2023 22:44, Victoria Pritchard wrote:
> 3 - "each node running RNFD" - Would all nodes need to participate? I 
> suppose the benefit would be limited if not? Also, if not all nodes 
> participate, some will continue to advertise a finite rank, would this 
> cause others to think the root is there? Maybe that's no worse than 
> without RNFD... those nodes using RNFD will look at the CFRCs and the 
> LORS value, assuming enough nodes are participating.

No, not all nodes need to participate. Depending on the fraction and 
location of non-participants, the benefit need not be limited, at least 
as long as the network of participants is connected. To explain, if the 
root node does not participate, then neither does any other node (per 
Section 5.5 in the reviewed version of the draft), so there is no 
problem. Let us thus assume in the remainder of our discussion that the 
root node does run RNFD. If, in turn, a non-root node does not 
participate, then in particular it does not assume any of the two roles: 
Acceptor or Sentinel.

Let us first consider a non-participant that would normally be Acceptor. 
The main drawback in this case is that such a non-Acceptor will 
potentially not make a decision to initiate a global transition to the 
“GLOBALLY DOWN” state and appropriately modify the CFRCs. However, if 
the root is down, this decision will be made by some Acceptor node when 
it gathers CFRCs containing the same or a larger set of Sentinels 
considering the root as dead. This decision may be made even earlier 
than it would have been by the non-Acceptor. Given our assumption that 
the network of participants is connected, the decision will spread 
throughout all participants. So in this case, it no important benefits 
for the participants may be lost.

Let us thus now consider a non-participant that could be Sentinel. Such 
a non-Sentinel does not take part in detecting whether the root has 
crashed, and hence does not contribute to any of the two CFRCs. 
Therefore, provided that our previous assumptions hold and there are 
some Sentinel nodes that do monitor the root, if the root has crashed, 
then those Sentinels will hopefully detect the crash anyway and some 
node will initiate the global transition to “GLOBALLY DOWN”. In effect, 
also in this case likely no important benefits for the participants will 
be lost.

The loss of the benefits can possibly happen in a few corner cases, though.

First, suppose that there are non-Sentinels (i.e., non-participants that 
could otherwise be Sentinels) and that, while the root has NOT crashed, 
sufficiently many of its links with Sentinels are down. In such a case, 
the network will incorrectly detect the root as “GLOBALLY DOWN”. In 
contrast, if the non-Sentinels did participate in RNFD as Sentinels, and 
there were sufficiently many of them, then the false positive could 
possibly be avoided. Is such a scenario a problem? Hard to say. In any 
case, however, once the root hears at least one DIO with the RNFD option 
declaring its death, it will issue a new DODAG version, thereby 
rebuilding a DODAG in such a way that its dead links to the former 
Sentinels are not considered, which would hopefully make it more stable. 
Rebuilding the DODAG in such a case would be a good choice anyway, given 
that so many links to the root went down.

Second, if the number and location of non-participants causes the 
network of participants not to be connected, then using RNFD globally 
will not bring any benefits. However, not much can be done in this case, 
because there is no physical way to propagate the necessary information 
between the participants.

As to your second question, if a participant enters the “GLOBALLY DOWN” 
state, then no advertisement of a finite rank will cause the node to 
reconnect to the DODAG unless the advertisement comes from a newer DODAG 
Version.

> 3.2 - negativeCFRC - says that this shows "sentinels that have 
> considered or still consider the root node as dead" - given the 
> description in 5.2 that a node can go from LOCALLY DOWN to UP by 
> re-adding itself to the positive CFRC, i.e. they dont still consider it 
> down, this might be phrased better as "sentinels that consider, or have 
> previously considered, the root node as dead".

Done.

> Similarly, positiveCFRC counts the number of times sentinels have 
> considered the root node as alive. A count in the positive CFRC doesnt 
> mean its still considered alive, as there could be a matching bit in the 
> negativeCFRC. As the last paragraph says, it's the difference between 
> positive and negative which shows how many sentinels consider the root 
> as alive.

Done -- rephrased in the same way as the previous one.

> 4.2 What's the significance of having bit lengths be prime numbers?

It improves the “randomness” of bit selection if the pseudorandomness of 
the implementation of the self() function is poor.

> Why if all positive bits are 1, should all negative bits also be 1? Is 
> this to indicate "globally down" as referred to in 5.3? i.e. using all 
> the bits not only the "prime number" of bits.

Yes, I wanted the “GLOBALLY DOWN” state to be represented by a unique 
combination of CFRCs. To make this combination as robust as possible 
against buggy implementations, I decided it would be infinity (or rather 
a maximal set). This is also the reason for choosing one as the value of 
bits beyond the “prime” limit.

> self() - selects the bit uniformly at random - will multiple sentinel 
> nodes select the same bit? Especially when nearing saturation?

Yes, they may. This is a probabilistic counter.

> Why 63% for saturation? Is it related to the probability of 2 sentinels 
> choosing the same bit?

Yes. The answer below is copied from my reply to Carles' earlier e-mail.

"The threshold is much as the maximal load factor value in a hash table. 
The higher it is, the higher the probability of collisions when Sentinel 
nodes add themselves to CFRCs. The selected value of 63% bits set to one 
corresponds to the maximum likelihood estimate equal to the total number 
of available bits, that is, as if the hash table was fully loaded. In 
the reviewed draft, this single value was selected for simplicity. 
However, we did test the algorithm with lower values, and they make 
sense if one requires a lower probability of collisions. Therefore, in 
the revised version, the value is made a configuration parameter with 
the default value of 63%."

> 5.1 Based on later discussion about the CFRC length, I assume the node 
> cannot send any CFRCs until it has received a message containing them, 
> so that it knows the length of the fields?

Yes, that’s the right assumption.

> 5.1 when changing role to acceptor, the rules state but dont explain why 
> the node MUST NOT modify its values, in particular negativeCFRC. I think 
> it's because the correct setting is already applied? (e.g. if already 
> globally down or locally down, the negativeCFRC has already been updated)

Yes, precisely.

> 5.2 - paragraph that mentions "previous conditions 2-4": Why wouldn't it 
> be able to set LORS back to UP if the saturation was already over 63% 
> (condition 2)? Makes sense that saturation would mean the CFRC can't be 
> added to, but if the root is deemed to be up, why can't the node hold 
> that state locally? Does it have any impact if this remains at LOCALLY DOWN?

In theory, the node could set its state to “UP” and attempt to add one 
bit to the positive CFRC, which, given the 63% threshold, would have 
only slightly more than 1/3 chance to have any effect. In practice, 
however, if saturated(PositiveCFRC) is true, then this means that the 
DODAG root will generate a new DODAG version soon (per Paragraph 3 of 
Section 5.4), so setting the bit and changing LORS to “UP” does not 
matter much. Therefore, to avoid exceptions and simplify the draft, I 
decided to keep all conditions 2-4 in this paragraph.

> 5.2 final paragraph
> To re-add itself, it sets a new (uniformly selected) bit in the positive 
> CFRC?

Yes.

> If saturated is TRUE, does this mean the number of bits is almost used 
> up? Is 63% then a 'sensible' value, based on the fact that a number of 
> sentinels could be updating the CFRC at any one time and the saturation 
> value could end up well above 63% when all the updates are distributed?

Depends on the number of sentinels and the number of bits. Both values 
can be adjusted at runtime.

> If many sentinels have reported that the root is down, and then want to 
> report it back up, is there a risk that the saturation point will be 
> reached and they wont be able to add themselves to the positive CFRC 
> again? (Answered later - CFRC length can be extended - could consider 
> reordering these sections).

Yes, precisely. Regarding the reordering, see further.

> 5.4 The DODAG root chooses bit length for the CFRCs? Does it start with 
> CFRCs set to zero? Does the root have to advertise the number of bits 
> before any nodes can become a sentinel? How does it increase the number 
> of bits (what are the CFRC values set to when it increases?), and how do 
> the sentinels react?

3 x yes. Increasing the number of bits is done by putting larger CFRCs 
into the RNFD Option, as discussed in Section 5.6. The reactions are 
also described in the same section.

> If a node didnt add itself to the positive CFRC because the saturation 
> was true, should it then add itself if it sees the number of bits 
> increase (and the saturation has gone back below the threshold)?

That is an interesting corner case with a race. Yes, the rules in the 
draft (and specifically Sections 5.2 and 5.6 combined) entail adding the 
node to the positive CFRC and setting its state accordingly as long as 
the necessary conditions hold.

> 5.4, 5.5, 5.6, seem in the wrong order... questions I had while reading 
> the early sections were answered later. e.g. 5.4 - how do the nodes 
> react when the root changes the number of bits?

I thought about such a reordering some time ago, but ultimately decided 
against it. This was because Sections 5.1-5.3 were meant to explain the 
invariants and core operation of the protocol, while the remaining ones 
to discuss exceptions and side functionalities.

In particular, your questions above are in-depth and formulating them 
required truly understanding the core algorithm presented in Sections 
5.1-5.3. This could indicate that the current flow facilitates 
developing such an understanding, while the answers to those and similar 
questions can still be found in the draft. Reordering the sections, in 
turn, would shift the reader's initial focus from core information to 
side details whose importance need not be immediately apparent, in 
effect causing problems with understanding everything. Does it make 
sense? Of course, if you have some specific alternative ordering in 
mind, then I am open to a discussion.

> 5.6 - how does the root extend the number of bits, does it merge from 
> any heard bits already or start fresh from CFRCs with all bits set to 
> zero? If a node receives a CFRC with more bits, how does it know which 
> is its bit to set? Can it assume its previous counter is not present 
> when it adds itself again as detailed in 5.6?

According to the rules from the section, the root starts afresh with all 
bits set to either zeros or ones, depending on its LORS. However, it 
will always eventually set them to zeroes (per the rules from Section 
5.4). To clarify this, I added a new paragraph dedicated to the root at 
the end of Section 5.6.

As to the last two questions, when a Sentinel node receives a CFRC with 
more bits, then, in line with the rules, it invokes self() to select its 
bit anew among the larger set. Therefore, yes, it assumes that its 
counter is not present.

> 6. If the positive CFRC becomes saturated with little or no increase in 
> negative CFRC, wouldn't that mean the network is behaving well, and all 
> the sentinels say the root is alive? Would issuing a new DODAG version 
> disrupt things unnecesaarily? Or would it mean only the closest/fastest 
> nodes would have become sentinels, and is it then better to use the 
> probablilty mechanism to get a different spread of sentinels?
Yes, the network is behaving well. However, the saturation in such a 
case indicates that the number of Sentinels may potentially be excessive 
and we should consider limiting this number to avoid possibly increased 
traffic when later verifying suspicions about the root and potential 
false-positive detections. So, yes, generating a new DODAG Version could 
briefly disrupt things (but not necessarily since the DODAG will not 
change much). However, this would hopefully benefit things in the long 
term. Moreover, generating a new DODAG Version is not the only option: 
the root may lengthen CFRCs instead. Finally, the root is not obliged to 
generate the new version (there is only a “SHOULD” in Section 5.4) and 
can instead continue operating ignoring the aforementioned risks (if it 
makes any sense).

I hope I have answered all questions. Thank you again for them. They 
were truly insightful.

Best,
-- 
- Konrad Iwanicki.