Re: [Cellar] [cellar-wg/matroska-specification] Consider providing a facility for integer-fraction timescales (#422)

Dave Rice <dave@dericed.com> Mon, 07 September 2020 13:56 UTC

From: Dave Rice <dave@dericed.com>
Date: Mon, 07 Sep 2020 09:56:42 -0400
References: <cellar-wg/matroska-specification/issues/422@github.com>
To: Codec Encoding for LossLess Archiving and Realtime transmission <cellar@ietf.org>, cellar-wg/matroska-specification <reply+AAF2F3BLDZGLCVLNLXZDM3N5MGI5NEVBNHHCSZ7WDU@reply.github.com>
In-Reply-To: <cellar-wg/matroska-specification/issues/422@github.com>
Message-Id: <43E3FC4D-855C-4067-A58A-F2EAB3470A9B@dericed.com>
Archived-At: <https://mailarchive.ietf.org/arch/msg/cellar/ja5PChdX5M7wUsx_4yWd7IhvAbA>
Subject: Re: [Cellar] [cellar-wg/matroska-specification] Consider providing a facility for integer-fraction timescales (#422)
Precedence: list

Hi,
I see this was brought up as an issue in the GitHub repository and am cross-posting to the cellar working group.

> On Sep 7, 2020, at 12:02 AM, rcombs <notifications@github.com> wrote:
> It's pretty well-established that Matroska's poor timebase support is one of the format's worst properties. While it supports very precise timestamps (down to the nanosecond), it's very inefficient to do so (and the resulting values still aren't exact for most input rates), so muxers tend to default to 1ms timestamps, which can lead to a variety of subtle issues, especially with high-packet-rate streams (e.g. audio) and VFR video content. Muxers can choose rates that are closer to the time base of their inputs (or the packet rate of the content), but exactly how best to do so has always been unclear, and some of the possible options would lead to either worse player behavior, or timestamp drift. I'm proposing a format addition to remedy this.
> 
> The only actual normative change I propose is this: in addition to the classic nanosecond-denominator time scale, muxers could provide 2 additional integers, serving as the numerator and denominator of a time base value, which is required to round to the existing nanosecond-scaled value.
> 
> This should be paired with some advice for muxer implementations on how to make use of this feature. This depends on the properties of the input. For reference, here are some examples of the error produced by rounding a variety of common time bases to the nearest nanosecond, scaled by 3 hours (a reasonable target for the duration of a film):
> 
> nearest_ns(x) = round(x * 1,000,000,000) / 1,000,000,000
> ceil_ns(x) = ceil(x * 1,000,000,000) / 1,000,000,000
> floor_ns(x) = floor(x * 1,000,000,000) / 1,000,000,000
> nearest_error(x) = 1 - (x / nearest_ns(x))
> ceil_error(x) = 1 - (x / ceil_ns(x))
> floor_error(x) = 1 - (x / floor_ns(x))
> nearest_error_3h(x) = nearest_error(x) * 60 * 60 * 3
> ceil_error_3h(x) = ceil_error(x) * 60 * 60 * 3
> floor_error_3h(x) = floor_error(x) * 60 * 60 * 3
> e(x) = nearest_error_3h(1 / x)
> ce(x) = ceil_error_3h(1 / x)
> fe(x) = floor_error_3h(1 / x)
> 
> # Integer video frame rates
> e(24)        => 8.64e-5
> e(25)        => 0
> e(30)        => -0.0001
> e(48)        => -0.0002
> e(50)        => 0
> e(60)        => 0.0002
> e(120)       => -0.0004
> 
> # NTSC video frame rates
> e(24/1.001)  => -8.6314e-5
> e(30/1.001)  => 0.0001
> e(48/1.001)  => 0.0002
> e(60/1.001)  => -0.0002
> e(120/1.001) => 0.0004
> 
> # TrueHD frame rates
> e(44100/40)   => -0.0057
> e(48000/40)   => -0.0043
> e(88200/40)   => 0.0062
> e(96000/40)   => 0.0086
> 
> # AAC frame rates
> e(44100/960)  => -0.0002
> e(48000/960)  => 0
> e(88200/960)  => 0.0003
> e(96000/960)  => 0
> e(44100/1024) => 0.0002
> e(48000/1024) => -0.0002
> e(88200/1024) => -0.0003
> e(96000/1024) => 0.0003
> 
> # MP3 frame rates
> e(44100/1152)  => 8.4375e-6
> e(48000/1152)  => 0
> e(88200/1152)  => -0.0004
> e(96000/1152)  => 0
> 
> # Other audio frame rates
> e(44100/128)   => -0.0012
> e(48000/128)   => 0.0013
> e(88200/128)   => -0.0012
> e(96000/128)   => -0.0027
> e(44100/2880)  => -7.425e-5
> e(48000/2880)  => 2.3981e-12
> e(88200/2880)  => -7.425e-5
> e(96000/2880)  => 2.3981e-12
> 
> # GCF of common short-first audio frame sizes
> e(44100/64)   => -0.0012
> e(48000/64)   => -0.0027
> e(88200/64)   => 0.0062
> e(96000/64)   => 0.0054
> 
> # Raw audio sample rates
> e(44100)     => 0.1253
> e(48000)     => -0.1728
> e(88200)     => 0.1253
> e(96000)     => 0.3456
> 
> fe(44100)    => -0.351
> ce(48000)    => 0.3456
> fe(88200)    => -0.8273
> fe(96000)    => -0.6912
> 
> # MPEGTS time base
> e(90000)     => -0.108
> 
> ce(90000)    => 0.8639
> 
> # Common multiples
> e(30000)     => -0.108
> e(60000)     => 0.216
> e(120000)    => -0.432
> e(240000)    => 0.8639
> e(480000)    => -1.7283
> 
> ce(30000)    => 0.216
> fe(60000)    => -0.432
> ce(120000)   => 0.8639
> fe(240000)   => -1.7283
> ce(480000)   => 3.4549
> 
> As we can see, rounding common video and audio frame rates (including e.g. the least common multiple of 24 and 60 for that VFR case) produces a negligible amount of error over a reasonable duration. This means that for content where all timestamps can reasonably be expressed in integer values of those rates, there would be no significant error over common file durations, even if different streams were muxed with different time bases.
> 
> There are a few real-world time bases that would produce significant rounding error (upwards of 100ms) over the course of 3 hours when used in existing players: MPEGTS's 90000Hz, all common raw audio sample rates, and least-common-multiples between integer and NTSC video frame rates. This essentially means that mixing these rates with others would produce significant desync over a reasonable duration for static on-disk content; the same issue could occur when muxing very lengthy content (e.g. streaming).
> 
> All of these issues can be addressed in one of the following ways:
> 
> - Using a lower rate (e.g. 90,000Hz isn't usually the real content rate but instead an artifact of its previous container; expressing timestamps in samples rather than frames is usually unnecessary)
> - Choosing the highest of the input rates for all streams (e.g. 48000 is a multiple of many common frame rates, including 24/1.001)
> - Choosing a more precise common-multiple rate that may create a larger total drift, but does so equally for all streams (see the "Common multiples" section; 1/30000 is suitable for mixing 24fps and 30/1.001fps content alongside most common framed audio rates, while the later listed bases are suitable for increasingly large sets).
> - Round some tracks' nanosecond timescales in the opposite direction, creating a larger drift, but potentially one with the same sign (and thus a closer value) as the drift in other tracks (this is probably too complex and niche to have substantial use)
> - Fall back to classic rounded nanosecond-based timestamps (and don't write an integer-fraction time base at all)
> - Use the extension, resulting in significant sync drift in older players that haven't implemented the change
> 
> This last option is usually unacceptable, but may be fine for files that use codecs that become available after the change is made (and thus are unavoidably non-backwards-compatible anyway).
> 
> If combined with clear advice in the spec on how muxers SHOULD (or MAY) decide on time bases for various possible input cases, I think this extension could get actual adoption in muxers and solve one of the format's longest-standing problems.
> 
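For anyone who wants to check the figures quoted above, here is a minimal Python sketch of the same error functions. It uses exact rational arithmetic so that floating-point noise does not creep into the result. The helper names `_error_3h` and `common_rate` are my own and not part of the proposal, and `common_rate` is only one possible way to pick a shared time base for several input rates.

```python
from fractions import Fraction
from math import ceil, floor, gcd

NS = 1_000_000_000          # nanoseconds per second
THREE_HOURS = 3 * 60 * 60   # reference duration in seconds


def _error_3h(rate, rounder):
    """Drift in seconds over 3 hours when one tick of 1/rate seconds
    is rounded to a whole number of nanoseconds with `rounder`."""
    tick = 1 / Fraction(rate)                    # exact tick length in seconds
    rounded = Fraction(rounder(tick * NS), NS)   # tick after rounding to whole ns
    return float((1 - tick / rounded) * THREE_HOURS)


def e(rate):
    return _error_3h(rate, round)   # nearest nanosecond


def ce(rate):
    return _error_3h(rate, ceil)    # round up


def fe(rate):
    return _error_3h(rate, floor)   # round down


def common_rate(*rates):
    """Hypothetical helper: smallest rate that is an integer multiple of
    every input rate (LCM of numerators over GCD of denominators)."""
    num, den = 1, 0
    for r in map(Fraction, rates):
        num = num * r.numerator // gcd(num, r.numerator)
        den = gcd(den, r.denominator)
    return Fraction(num, den)


if __name__ == "__main__":
    print(e(24))                       # integer video frame rate
    print(e(Fraction(24000, 1001)))    # NTSC rate 24/1.001
    print(e(90000), ce(90000))         # MPEGTS time base
    print(common_rate(24, Fraction(30000, 1001)))
```

NTSC rates are passed as exact fractions (e.g. `Fraction(24000, 1001)` for 24/1.001) rather than as floats, which keeps the rounding step identical to the definitions in the quoted message.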
This has been discussed on the list before, though I don’t remember a clear consensus on how to address it. Steve even compiled a list of those discussions at https://mailarchive.ietf.org/arch/msg/cellar/ZpZxhG1gML9xVx_ir1Jf6_gcI8U/.

I proposed an option in https://mailarchive.ietf.org/arch/msg/cellar/mTprgjNqVbe20e6hyYxns8ZnVwY/ where one of the existing reserved bits of the Block Header (in the byte that contains the keyframe, invisible, and lacing flags) would be used as a flag for Timescale Alignment.

With this approach, new elements could be added to the track header with the numerator and denominator of a rational time scale, and if Timescale Alignment were set to true, then the nearest increment of the rational time scale would be used. Example:

> Thus if the frame rate of the track header is 120000/1001, then
> 
> If Matroska timecode is 4 and Enable TimeScale Alignment is 0, then it is at 4 / (1000000000 / TimecodeScale).
> If Matroska timecode is 4 and Enable TimeScale Alignment is 1, then it is at 0 / 120000 (nearest increment of the rational frame rate).
> 
> If Matroska timecode is 17 and Enable TimeScale Alignment is 0, then it is at 17 / (1000000000 / TimecodeScale).
> If Matroska timecode is 17 and Enable TimeScale Alignment is 1, then it is at 2002 / 120000 (nearest increment of the rational frame rate).

If a Matroska demuxer doesn’t understand the new numerator/denominator elements or the Alignment flag, then it would simply use the existing nanosecond timestamp system.
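
To make the Timescale Alignment example concrete, here is a rough Python sketch of how a demuxer might resolve a timestamp under this idea. The function name `timestamp_seconds` and its parameters are hypothetical, and the sketch assumes a TimecodeScale of 1,000,000 ns (1 ms ticks), which is what the timecodes 4 and 17 in the example imply.

```python
from fractions import Fraction

NS = 1_000_000_000  # nanoseconds per second


def timestamp_seconds(timecode, timecode_scale_ns, frame_rate=None, align=False):
    """Return the time of a block in seconds as an exact Fraction.

    timecode          -- integer Matroska timecode of the block
    timecode_scale_ns -- TimecodeScale in nanoseconds per tick
    frame_rate        -- rational rate from the (hypothetical) new track
                         header elements, e.g. Fraction(120000, 1001)
    align             -- value of the (hypothetical) Timescale Alignment flag
    """
    raw = Fraction(timecode * timecode_scale_ns, NS)
    if not align or frame_rate is None:
        # Legacy behaviour: a reader that ignores the new elements and
        # the flag just uses the nanosecond-scaled value.
        return raw
    tick = 1 / Fraction(frame_rate)       # duration of one increment
    return round(raw / tick) * tick       # snap to the nearest increment


if __name__ == "__main__":
    rate = Fraction(120000, 1001)
    for tc in (4, 17):
        print(tc,
              timestamp_seconds(tc, 1_000_000),
              timestamp_seconds(tc, 1_000_000, rate, align=True))
```

With these assumptions, timecode 4 snaps to increment 0 and timecode 17 snaps to increment 2, i.e. 2002/120000 of a second, while the unaligned values stay at 4 ms and 17 ms.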

In that thread there were other proposals; for example, Steve discussed using a float to represent a point in time.
Dave
