Re: [Ntp] Handling synchronization loops

On 2/16/2022 2:39 PM, Danny Mayer wrote:
> 
> On 2/16/22 9:29 AM, Miroslav Lichvar wrote:
>> There is a new version of my NTPv5 draft:
>> https://datatracker.ietf.org/doc/draft-mlichvar-ntp-ntpv5/
>>
>> Besides some smaller improvements, it now specifies the Reference IDs
>> Request and Response extension fields using the 4096-bit Bloom filter
>> that was proposed by Daniel in his design.
>>
>> NTPv5 clients should now be able to detect a synchronization loop over
>> any number of servers, even if they are not the best selected source,
>> which in NTPv4 sets the client's reference ID.
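
To make the mechanism above concrete, here is a minimal sketch of loop detection with a filter of reference IDs. Only the 4096-bit size comes from the message. The number of hash positions, the use of SHA-256 to derive them, and all function names are assumptions for illustration and are not taken from the draft.

```python
import hashlib

FILTER_BITS = 4096  # filter size named in the message
NUM_HASHES = 10     # assumption for illustration, not taken from the draft


def bit_positions(ref_id):
    """Map a reference ID (bytes) to NUM_HASHES bit positions (hypothetical scheme)."""
    digest = hashlib.sha256(ref_id).digest()
    # Use consecutive 16-bit chunks of the digest, reduced to the filter size.
    return [int.from_bytes(digest[2 * i:2 * i + 2], "big") % FILTER_BITS
            for i in range(NUM_HASHES)]


def add_id(filt, ref_id):
    """Return the filter (held as a Python int) with ref_id added."""
    for pos in bit_positions(ref_id):
        filt |= 1 << pos
    return filt


def may_contain(filt, ref_id):
    """True if ref_id may be in the filter. False positives are possible."""
    return all((filt >> pos) & 1 for pos in bit_positions(ref_id))


def advertised_filter(own_id, source_filters):
    """Filter a host would send: its own ID plus the union of its sources' filters."""
    filt = add_id(0, own_id)
    for source in source_filters:
        filt |= source
    return filt


def in_loop(own_id, server_filter):
    """A client suspects a loop if its own ID shows up in a server's filter."""
    return may_contain(server_filter, own_id)
```

For example, if host B is synchronized to host A, then `in_loop(b"host-a", advertised_filter(b"host-b", [advertised_filter(b"host-a", [])]))` is true, so A would know not to select B.
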
>>
>> However, I'm not sure what exactly should happen when the client
>> detects a loop. As there is a delay in the distribution of the
>> reference IDs, I suspect there is an instability in the selection.
>>
>> Let's say we have 4 NTP clients in a local network configured in a
>> ring for "peering":
>>
>>          +--------+         +--------+
>>          | Host A |---------| Host B |
>>          +--------+         +--------+
>>              |                  |
>>              |                  |
>>          +--------+         +--------+
>>          | Host D |---------| Host C |
>>          +--------+         +--------+
>>
>> Each one is configured to poll some remote servers and two of its
>> local peers. We would like them to reach a stable state in the
>> selection, for example A->B->C->D, or A<-B->C->D.
>>
>> If they are started at the same time, every host will first select
>> only the remote servers as the local servers are not synchronized yet.
>> Then on each link the direction of synchronization is selected
>> randomly depending on the order of their polling. There is no loop on
>> individual links (assuming the 4096-bit filter is not split over
>> multiple messages), but a loop can form over the whole ring, e.g.:
>>
>> A->B->C->D->A
>>
>> After three polls they all see that they are in a loop. They unselect
>> the peer:
>>
>> A  B  C  D  A
>>
>> and we are back where we started.
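
The oscillation described above can be reproduced with a small simulation. This is a sketch under simplifying assumptions that are not in the message: all hosts poll in lockstep, each host always prefers the previous host in the ring over its remote servers, each host sees what its peer advertised one poll earlier, and exact sets of IDs stand in for the Bloom filter.

```python
HOSTS = ["A", "B", "C", "D"]


def simulate(rounds):
    """Return one string per poll round: 'P' = host uses its ring peer,
    'R' = host has fallen back to its remote servers."""
    n = len(HOSTS)
    # Before the first poll nobody is synchronized to a peer,
    # so every host advertises only its own ID.
    advertised = {h: frozenset([h]) for h in HOSTS}
    history = []
    for _ in range(rounds):
        new_advertised = {}
        selected = {}
        for i, host in enumerate(HOSTS):
            peer = HOSTS[(i - 1) % n]   # previous host in the ring
            seen = advertised[peer]     # IDs the peer sent on the previous poll
            if host in seen:
                # Own ID came back around the ring: loop detected, drop the peer.
                selected[host] = False
                new_advertised[host] = frozenset([host])
            else:
                selected[host] = True
                new_advertised[host] = seen | {host}
        advertised = new_advertised
        history.append("".join("P" if selected[h] else "R" for h in HOSTS))
    return history


if __name__ == "__main__":
    for number, state in enumerate(simulate(8), start=1):
        print(number, state)
```

Under these assumptions every host selects its peer on the first poll, detects the loop three polls later, unselects, and then repeats the same cycle indefinitely, because nothing breaks the symmetry.
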
>>
>> I'm not sure if there is a guarantee a stable state will eventually be
>> reached if the polling order happens to be stable. There is also an
>> additional delay when the filter is exchanged in smaller parts. A loop
>> can form even between two hosts.
>>
>> I think we might need to specify how long clients should wait
>> before selecting a server that was in a loop again. This could be a
>> random number based on their polling interval, stratum, and the number
>> of polls it takes to exchange the whole filter. There could also be a
>> requirement for servers to delay removing reference IDs from their
>> filters.
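
One possible shape for such a wait time is sketched below. The message only names the inputs (polling interval, stratum, number of polls needed to exchange the whole filter). The formula, the function name, and the parameter names are hypothetical and are not proposed anywhere in the draft.

```python
import random


def holdoff_seconds(poll_interval, stratum, filter_polls, rng=random):
    """Hypothetical hold-off before reselecting a server that was in a loop.

    poll_interval: polling interval in seconds
    stratum:       the client's stratum
    filter_polls:  polls needed to exchange the whole filter
    rng:           object with a uniform(a, b) method (defaults to the random module)
    """
    # Wait at least long enough for one complete filter exchange.
    base = poll_interval * filter_polls
    # Random stretch so hosts that detected the loop together do not all retry
    # together. Scaling by stratum is an arbitrary choice for illustration.
    return base * rng.uniform(1.0, 1.0 + stratum)
```

Passing a seeded `random.Random` instance as `rng` makes the result repeatable for testing.
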
>>
>> For reference, in NTPv3 loops cannot form, because only sources with
>> lower stratum can be selected. That is too restrictive. In NTPv4 there
>> is a single reference ID, which can detect only loops between two
>> clients and only if one of them selected the other as the "system"
>> peer. Other loops are ignored and synchronization stops only when
>> stratum reaches 16.
>>
>> Any suggestions?
>>
> What is being lost in all this is the fact that a host can respond on 
> multiple interfaces. Thus Reference IDs *cannot* be IP addresses, as is
> currently being done. A server *should* create a value that can be part
> of the packet that is sent out on ALL interfaces regardless of the type
> of packet. This ID would need to be added to the base packet. This can
> be a randomly generated ID, but it needs to stay the same for the lifetime
> of the server. This can then be used for the Reference ID sent in the
> packet. IP addresses only worked when there was only one interface and
> IP address. That's no longer a viable solution. If I'm getting NTP
> packets from a multicast NTP server, what Reference ID would you want? It
> can't be the multicast address.

I get that this is a theoretical problem.

Is it a *real* problem?

I'm not at all worried about the "I have multiple IPs/NICs" issue because:

- in the ~30 years I've been actively working with NTP, I have not
   heard of a single case where this has been an issue
- the NTP Project has already submitted a "suggested refid" proposal,
   which will also help with this

If it's clear that this is a REAL problem and not a hypothetical one, we 
can certainly look at this harder.

And given that "longer" loops mean larger stratum differences between 
the beginning and ending nodes in the loop and also increasing root 
distance, I suspect that this really isn't a problem for any "real" 
installation.

> Danny

-- 
Harlan Stenn <stenn@nwtime.org>
http://networktimefoundation.org - be a member!