Re: [Ntp] Timescales, leapseconds and smearing

Martin Burnicki <martin.burnicki@meinberg.de> Tue, 08 December 2020 17:29 UTC

To: Kurt Roeckx <kurt@roeckx.be>, ntp@ietf.org
References: <X86sVykHUqlkXP96@roeckx.be>
From: Martin Burnicki <martin.burnicki@meinberg.de>
Organization: Meinberg Funkuhren GmbH & Co. KG, Bad Pyrmont, Germany
Message-ID: <f809b97a-91e1-751d-889d-cf832625f052@meinberg.de>
Date: Tue, 08 Dec 2020 18:29:25 +0100
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:78.0) Gecko/20100101 Thunderbird/78.5.0
MIME-Version: 1.0
In-Reply-To: <X86sVykHUqlkXP96@roeckx.be>
Content-Type: text/plain; charset="utf-8"
Content-Language: en-US
Content-Transfer-Encoding: 8bit
Archived-At: <https://mailarchive.ietf.org/arch/msg/ntp/bXiNFrpZk_HYj4PL2Bcl1RzkeQg>
Subject: Re: [Ntp] Timescales, leapseconds and smearing

Kurt Roeckx wrote:
[...]
> One problem that NTP currently isn't very good at is dealing with
> the leap second.

Hm, I think the problems aren't due to the protocol but due to the
environments where NTP implementations are expected to run.

If you consider NTP as just a kind of transport, it has to rely on the
information coming in from different sources, e.g.

- leap second schedule and TAI offset from a leapsecond file

- leap second schedule and TAI offset from a GPS receiver, if the
protocol (e.g. NMEA) supports this at all

- leap second warning but no TAI offset, e.g. from DCF77

- leap second warning only very shortly before it happens, e.g. 1 hour
for DCF77, or 10 seconds (IIRC, from IRIG and only with IEEE extensions)

There are e.g. GPS receivers out there which provided faulty leap second
warnings, e.g. at the end of September. What should a stratum 1 server
do when it receives such a warning from its trusted reference clock?


On the other side of the transport are the operating systems. Most Unix
systems just step the time back by 1 second to insert a leap second.
Other operating systems (e.g. Windows, except very current versions)
don't know about leap seconds at all.
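
To illustrate the step back, here is a minimal sketch that models a
system clock across the leap second inserted at the end of 2016-12-31.
The function name and the choice of which second gets repeated are
assumptions for illustration only; real kernels differ in this detail.

```python
import calendar

# POSIX timestamp of 2017-01-01 00:00:00 UTC. A leap second (23:59:60)
# was inserted immediately before this instant.
MIDNIGHT = calendar.timegm((2017, 1, 1, 0, 0, 0))


def stepped_clock(elapsed: int) -> int:
    """Clock reading `elapsed` real seconds after 23:59:58 UTC.

    Assumption: the kernel steps back by 1 s at the moment the counter
    would otherwise reach midnight, so 23:59:59 is the repeated second.
    """
    reading = (MIDNIGHT - 2) + elapsed
    if elapsed >= 2:  # leap second has started: apply the 1 s step back
        reading -= 1
    return reading


# Readings relative to midnight for four consecutive real seconds.
readings = [stepped_clock(s) - MIDNIGHT for s in range(4)]
print(readings)  # [-2, -1, -1, 0]
```

Two different real seconds map to the same clock reading, so a
timestamp taken in that window is ambiguous without extra information.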

If a stratum 1 server receives a valid leap second warning early enough,
it can pass this down to its clients, but it doesn't know what the
clients do with it.

If a PTP slave receives a leap second warning from its grandmaster and
passes it down to its Unix kernel which runs on UTC, the same problems
will occur if the system time is stepped back by the kernel.

> The current draft proposal doesn't address the
> problems. The major problems with it are:
> - The NTP timescale is non-continuous in case of a leap second,
>   without an indication of on which scale you are. It also
>   doesn't define which second should get repeated.

As said above, it's not up to the NTP daemon to decide how a leap second
is handled. Whether the kernel just steps the time back, and whether it
repeats the last or the first second, is up to the kernel, as long as
the kernels don't provide any other API.

>   This means there
>   is a 2 second window where it's unclear what the time is.

ntpd from ntp.org sends "not synchronized" for a short interval across a
leap second. This was proposed by Miroslav some time ago, and is IMO a
very good way to get across the leap second.
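
A rough sketch of that idea, using the leap indicator (LI) values from
the NTP packet header. The function name, its parameters and the 2 s
guard interval are assumptions for illustration, not what ntpd actually
uses, and the rule for when to start announcing is simplified.

```python
# Leap indicator values as defined for the NTP packet header.
LI_NO_WARNING = 0  # no leap second pending
LI_INSERT = 1      # last minute of the day has 61 seconds
LI_DELETE = 2      # last minute of the day has 59 seconds
LI_UNSYNC = 3      # clock not synchronized


def leap_indicator(t_rel: float, leap_scheduled: bool, guard: float = 2.0) -> int:
    """LI value a server could send `t_rel` seconds relative to the leap.

    t_rel is negative before the leap second and positive after it.
    `guard` is a hypothetical half-width of the "not synchronized" window.
    """
    if not leap_scheduled:
        return LI_NO_WARNING
    if abs(t_rel) <= guard:
        return LI_UNSYNC       # ambiguous window: tell clients not to trust us
    if t_rel < 0:
        return LI_INSERT       # still announcing the upcoming insertion
    return LI_NO_WARNING       # leap second is over


print([leap_indicator(t, True) for t in (-3600, -1, 1.5, 10)])  # [1, 3, 3, 0]
```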

> - There is a need to distribute information about when a leap second
>   will happen, which can happen over NTP or some other way.

This was a feature of the Autokey protocol and extensions, even though I
have to admit that it's not really related to Autokey. Not sure if there
is a replacement in the new extension fields.

>   Experience shows that a lot of NTP servers get this wrong,
>   resulting in synchronization problems when some servers change
>   and others not. When distributed over NTP, a majority of the
>   servers need to indicate that it will happen. There is no way
>   to indicate that you don't know a leap second will happen or
>   not, making it harder to get it correct.

See my comments above. What should an NTP server do if a trusted source
provides invalid information?

IMO you can't blame NTP servers for this. A majority vote on the client
is a good way to discard such faulty announcements.
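
As a minimal sketch of such a vote (not the actual selection algorithm
of any NTP implementation; the function name and the strict-majority
rule are assumptions), a client could accept a leap announcement only
if more than half of its usable servers agree on it:

```python
from collections import Counter

LI_UNSYNC = 3  # leap indicator value meaning "not synchronized"


def leap_majority(indicators: list[int]) -> int:
    """Return the leap indicator backed by a strict majority, else 0.

    Servers reporting "not synchronized" are ignored. A tie or a lone
    faulty announcement results in 0 (no leap second).
    """
    usable = [li for li in indicators if li != LI_UNSYNC]
    if not usable:
        return 0
    value, count = Counter(usable).most_common(1)[0]
    return value if 2 * count > len(usable) else 0


print(leap_majority([1, 1, 0]))  # 1: two of three servers announce a leap
print(leap_majority([1, 0, 0]))  # 0: single faulty announcement is discarded
```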

> You can fix the first problem by moving to a scale that is
> continuous, like TAI. But I'm not sure if it's better or worse
> because of the 2nd problem, it will probably be about the same.
> In TAI it would always be clear what the time means, even if some
> servers know about the leap seconds and others not. It would avoid
> marking some servers as false tickers.

I'm not sure it is so easy. If you switch to TAI you always need a
reliable source for the TAI offset, and the TAI offset has to be
forwarded down the time synchronization chain.

If your trusted source for the TAI offset provides faulty information
(as with the leap second warning above), you will run into the same
problems down the synchronization chain.
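
A small sketch of how a wrong TAI offset propagates. TAI-UTC has been
37 s since 2017-01-01; the function name and the naive datetime
representation of TAI are assumptions made for this example.

```python
from datetime import datetime, timedelta

TAI_MINUS_UTC = 37  # seconds, valid since 2017-01-01


def utc_from_tai(tai: datetime, tai_offset: int) -> datetime:
    """Derive UTC from a TAI reading using the offset a client was given."""
    return tai - timedelta(seconds=tai_offset)


tai_now = datetime(2020, 12, 8, 17, 30, 2)     # a TAI reading
good = utc_from_tai(tai_now, TAI_MINUS_UTC)    # correct offset
bad = utc_from_tai(tai_now, 36)                # stale offset from a faulty source

print(good.time(), bad.time())  # 17:29:25 17:29:26
```

The client with the stale offset is off by exactly one second in UTC,
which is the same kind of error as a missed leap second warning.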

> The current proposed draft
> supports working in TAI and smeared NTP. I'm not sure about the
> UT1 scale in the draft but assume it's non-continuous.

If you talk about smeared leap seconds, the questions are still

- what shape is used? cosine, linear, ...

- what interval? 2 hours? 24 hours?

- is smearing half before / half after the leap second,
  or fully before the leap second?
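
These three choices can be sketched as parameters of a smear function.
This is only an illustration under stated assumptions: the names are
hypothetical, and the linear and cosine formulas are common choices,
not a description of any particular server.

```python
import math


def smear_start(leap_time: float, interval: float, placement: str) -> float:
    """Time at which smearing begins, for the two placements in question."""
    if placement == "before":      # smear ends exactly at the leap second
        return leap_time - interval
    if placement == "centered":    # half before, half after
        return leap_time - interval / 2.0
    raise ValueError(f"unknown placement: {placement}")


def smear_fraction(t: float, interval: float, shape: str = "linear") -> float:
    """Fraction of the leap second absorbed `t` seconds after smear start."""
    x = min(max(t / interval, 0.0), 1.0)   # clamp to the smear interval
    if shape == "linear":
        return x
    if shape == "cosine":
        return (1.0 - math.cos(math.pi * x)) / 2.0
    raise ValueError(f"unknown shape: {shape}")


day = 86400.0
print(smear_fraction(day / 4, day, "linear"))            # 0.25
print(round(smear_fraction(day / 4, day, "cosine"), 4))  # 0.1464
```

Two servers that disagree on any of the parameters report different
times during the smear, even though both are "correct" by their own
rules.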

> Another way to fix the non-continuous problem is to have some
> indication of on which scale you are. It needs to be able to say
> in which scale each of the timestamps is. The proposed draft has a
> TAI-UTC offset if you use the NTP scale that could be used for this,
> but it would then apply to both timestamps. If that is what we
> want to do, it needs to be more clear. But for the UT1 scale there
> is no way to indicate it.

One possible way that comes to my mind is to use an arbitrary time
(whichever), and provide a UTC or TAI offset (including fractions of a
second) in an extension field.

That would be similar to the times used in emails, where the main
timestamp is some local time, but an offset allows the associated UTC
time to be determined.
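
The email analogy can be shown with the Date header of this very
message. This only illustrates the "timestamp plus offset" idea; it
says nothing about how an NTP extension field would be encoded.

```python
from datetime import timezone
from email.utils import parsedate_to_datetime

# Local timestamp plus offset, as carried in an email Date header.
stamp = parsedate_to_datetime("Tue, 08 Dec 2020 18:29:25 +0100")

# The offset lets any receiver recover the associated UTC time.
utc = stamp.astimezone(timezone.utc)

print(stamp.utcoffset())  # 1:00:00
print(utc.isoformat())    # 2020-12-08T17:29:25+00:00
```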

> NTP could distribute information about difference between
> timescales. A leap second will change the offsets, so we already
> do this in a very limited way. TAI-UTC and UT1-UTC are mentioned in
> the proposed draft, but it depends on which timescale you're using
> which offset you can get. I'm not sure NTP is the best way to
> distribute it. But for a lot of devices NTP is the only source of a
> leap second information.

There is also tzdist, which could be used for this. IMO this could be a
good companion to NTP (or even PTP) because it can provide leap second
warnings and TAI offsets, but also time zone rules, which are also very
important for user space applications.

> The document also has a smeared NTP option. It doesn't actually
> say which time you put in the fields, the NTP time, or the smeared
> NTP time. It then has an offset to the NTP time, without being
> clear about the sign. The offset is also optional, which means you
> might not be able to combine servers that smear, that smear
> differently and that don't smear.
>
> I'm currently not sure if we should do something with smearing. We
> could for instance say that even if the server is smearing, NTP
> should always contain unsmeared time, and that smearing is an
> implementation detail. Or we could standardize how it should be
> smeared. Or like the current draft that you have smear offset.

AFAIK, the original reason to do smearing is to completely hide a leap
second from the clients.

For example, we had customers running a certified Linux system with a
known bug where the kernel locks up when it inserts a leap second. Due
to the certification, the customer was not even allowed to update the
Linux kernel, even though a kernel with a bug fix was available.

So they used a smearing server to hide the leap second and prevent the
kernel from locking up.

With this in mind, I think the best approach would be to let a server
provide smeared time, with optional information about the current
smear offset, so clients could optionally compensate for the smearing.

ntpd from ntp.org currently provides the current smear offset in a
special interpretation of the refid field, if smearing is in progress.


Martin
-- 
Martin Burnicki

Senior Software Engineer

MEINBERG Funkuhren GmbH & Co. KG
Email: martin.burnicki@meinberg.de
Phone: +49 5281 9309-414
Linkedin: https://www.linkedin.com/in/martinburnicki/

Lange Wand 9, 31812 Bad Pyrmont, Germany
Amtsgericht Hannover 17HRA 100322
Geschäftsführer/Managing Directors: Günter Meinberg, Werner Meinberg,
Andre Hartmann, Heiko Gerstung
Websites: https://www.meinberg.de  https://www.meinbergglobal.com
Training: https://www.meinberg.academy