Re: [Cellar] AV1 seeking

Andreas Rheinhardt <andreas.rheinhardt@googlemail.com> Tue, 17 July 2018 13:24 UTC

To: Codec Encoding for LossLess Archiving and Realtime transmission <cellar@ietf.org>
References: <CAOXsMFKTNCxYcviYS0h_VYjegV3RZFvZ7AV7GhdCq=oeGmgMuQ@mail.gmail.com> <62c29889-49f5-6634-049a-a2d73315bb3c@googlemail.com> <CAOXsMFKgA2PN-SUFZdCauXmOzyU-T0cHP6nxVeNOVdC1z1uWhg@mail.gmail.com>
From: Andreas Rheinhardt <andreas.rheinhardt@googlemail.com>
Message-ID: <4ff1f1f8-3b66-e229-7884-c851dae8e2e0@googlemail.com>
Date: Tue, 17 Jul 2018 13:23:00 +0000
MIME-Version: 1.0
In-Reply-To: <CAOXsMFKgA2PN-SUFZdCauXmOzyU-T0cHP6nxVeNOVdC1z1uWhg@mail.gmail.com>
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: 7bit
Archived-At: <https://mailarchive.ietf.org/arch/msg/cellar/D7fBkjf9lwrFFRAOLLxRhg-aiS4>
Subject: Re: [Cellar] AV1 seeking
Precedence: list

Hello,

Steve Lhomme:
> Hi,
> 
> 2018-07-16 0:19 GMT+02:00 Andreas Rheinhardt
> <andreas.rheinhardt@googlemail.com>:
>>> In MP4 they have [initial_presentation_delay_minus_one] in the
>>> CodecPrivate. I did not understand it so far because it's not found in
>>> the AV1 spec. But it seems to guarantee that to read frame 'f' you
>>> need to decode X frame before that one. In our case that would be 5 to
>>> have at least a decoded.
>>>
>> You completely misunderstood this field. It is the equivalent of the
>> max_num_reorder_frames value from H.264. Let me explain it in MPEG
>> terminology (with b-frames) as you probably have way more experience
>> with this: Consider a decoder that can only decode one frame per unit of
>> time and a stream like this (left to right is decoding order; the
>> numbers are presentation order):
>> I0 P2 B1 ...
>> If one displayed the leading I frame immediately after decoding it, one
>> would not have the right frame to display at time 1, because at that
>> time only I0 and P2 have been decoded, not B1. Therefore one has to
>> decode I0 and P2 before one outputs the first frame and
>> max_num_reorder_frames would be 1. The typical b-pyramid would require
>> to decode the first three frames before the display of the first frame
>> and max_num_reorder_frames would be 2.
>> The [initial_presentation_delay_minus_one] is the AV1 analogue of this.
>> This number is not an upper bound for the amount of temporal units
>> between delayed RAP and recovery point/for the amount of frames shared
>> between a GOP. Just look at this example:
>> MPEG example
>> I0 P5 B1 B2 B3 B4 I10 B6 B7 B8 B9
>> AV1 example (I kept the MPEG-naming with P and B to make it easier
>> comparable to the above; furthermore the pointer *x denotes that a frame
>> is showable and x without * is a frame header that outputs *x via
>> show_existing_frame;  square brackets are the delimiters of temporal
>> units; I10 is the delayed RAP frame)
>> [I0] [*P5 B1] [B2] [B3] [B4] [P5] [*I10 B6] [B7] [B8] [B9] [I10] ...
>> This stream can have [initial_presentation_delay_minus_one] equal to 1,
> 
> A value of 1 means 2 frame delay. That would mean B6 needs *I10 and P5 ?
> 
No, absolutely not. This field tells you nothing at all about seeking
and in particular it does not say that "2 frame delay" implies that it
is sufficient to seek to a frame two frames before said frame and start
decoding there to get the (undamaged) desired frame. But judging from
your next comment, you have already realized this on your own. The frame
delay field is there to solve a problem that is completely orthogonal to
the seeking problem, hence it is actually wrong to discuss it under the
label "[Cellar] AV1 seeking". I will nevertheless reply to your latest
change of the current proposal in this email.
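
To make the reordering argument from the quoted I0 P2 B1 example
concrete, here is a minimal Python sketch. The function name
`reorder_depth` and the b-pyramid decoding order used below are
assumptions made for illustration only. It counts, for each frame, how
many frames precede it in decoding order but follow it in presentation
order, and takes the maximum. That is the quantity
max_num_reorder_frames bounds; it says nothing about seeking.

```python
def reorder_depth(decode_order):
    """Return the reordering depth of a stream.

    decode_order lists presentation numbers in decoding order,
    e.g. [0, 2, 1] for "I0 P2 B1".
    """
    depth = 0
    for i, pts in enumerate(decode_order):
        # Frames decoded earlier that must be displayed later than this one.
        held_back = sum(1 for earlier in decode_order[:i] if earlier > pts)
        depth = max(depth, held_back)
    return depth


print(reorder_depth([0, 2, 1]))        # I0 P2 B1 -> 1
print(reorder_depth([0, 4, 2, 1, 3]))  # assumed b-pyramid order -> 2
print(reorder_depth([0, 5, 1, 2, 3, 4, 10, 6, 7, 8, 9]))  # MPEG example -> 1
```

The last call is the MPEG example quoted above: its depth is only 1,
although I10 sits several frames away from the frames that depend on it.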

>> yet in order to seek to [I10] one has to decode the temporal unit [*I10
>> B6] (or at least the decodable keyframe in it) which is four temporal
>> units in front of [I10].
> 
> That seems more like it.
> 
> In the Frame presentation timing paragraph (E.4.7) it says:
> 
> InitialPresentationDelay =  Removal [ initial_display_delay_minus_1 ]
> + TimeToDecode [ initial_display_delay_minus_1 ]
> 
> and
> 
> PresentationTime[ 0 ] = InitialPresentationDelay
> PresentationTime[ j ] = InitialPresentationDelay + (
> frame_presentation_time[ j ] - frame_presentation_time[ 0 ] ) * DispCT
> 
> or in constant bitrate mode
> 
> PresentationTime[ 0 ] = InitialPresentationDelay
> PresentationTime[ j ] = PresentationTime[ j - 1 ] + (
> num_ticks_per_picture_minus_1 + 1 ) * DispCT
> 
> 
> And our `Block` timestamp derives directly from this
> [PresentationTime]. The delay is carried over every frame. So it does
> seem like we need to globally shift these timestamps for the Track.
> Possibly with `CodecDelay`.
> 
> Basically each frame has its timestamp on which this delay has to be
> added (can't be negative). While `CodecDelay` is a positive value that
> needs to be subtracted from the timestamps. It is available in WebM,
> because Opus needs it.
> 
> Let's assume CodecDelay = InitialPresentationDelay
> 
> The `Block` timestamps would be stored like this:
> Block[ 0 ] = 0
> Block[ 1 ] = frame_presentation_time[ 1 ] * DispCT
> Block[ 2 ] = frame_presentation_time[ 2 ] * DispCT
> ...
> 
> And on output of the demuxer we would get:
> Block[ 0 ] = -InitialPresentationDelay
> Block[ 1 ] = -InitialPresentationDelay + frame_presentation_time[ 1 ] * DispCT
> Block[ 2 ] = -InitialPresentationDelay + frame_presentation_time[ 2 ] * DispCT
> ...
> 
> `CodecDelay` cannot be used here. But it's very close to what we need.
> 
> It's not the SeekPreRoll either because the actual frame timestamps of
> each frame is affected, not just when seeking (although it may be
> needed for delayed RAP).
> 
> We also need to figure out whether the Blocks need to be stored with
> or without the delay. If that's with the delay then we don't even need
> to care about it. The frames will just not start at 0 but that's
> already the case in the original stream.
> 
> But that may just be the tricky part here, the reference point. The
> PresentationTime[ 0 ] doesn't start at 0 because some frames were
> needed to decode before getting actual usable data out of the decoder.
> But this is really the first frame to display so it would actually be
> 0 on the output of the demuxer. So it does seem exactly like what
> CodecDelay does:
> "CodecDelay is The codec-built-in delay in nanoseconds. This value
> must be subtracted from each block timestamp in order to get the
> actual timestamp."
> 
> The `Block` timestamp would be PresentationTime[ 0 ] which isn't 0.
> But on the output Block[ 0 ] should really give 0. So we would have
> something like this in the file:
> Block[ 0 ] = InitialPresentationDelay
> Block[ 1 ] = InitialPresentationDelay + frame_presentation_time[ 1 ] * DispCT
> Block[ 2 ] = InitialPresentationDelay + frame_presentation_time[ 2 ] * DispCT
> ...
> 
> And on the output of the demuxer we would have the following, when
> `CodecDelay` is [InitialPresentationDelay]:
> Block[ 0 ] = 0
> Block[ 1 ] = frame_presentation_time[ 1 ] * DispCT
> Block[ 2 ] = frame_presentation_time[ 2 ] * DispCT
> ...
> 
> It does look like what the internal AV1 display delay is trying to achieve.
> 
> The `CodecDelay` also has this note on proper muxing:
> The value SHOULD be small so the muxing of tracks with the same actual
> timestamp are in the same Cluster.
> 
> Because muxing is done based on the stored `Block` timestamp. It
> usually doesn't take in account CodecDelay (but it could/should).
> 
> So I'll update my AV1 codec mapping saying `CodecDelay` is
> [InitialPresentationDelay] and how to use it.
> 
1. Good that you thought about the initial presentation delay. I have
personally experienced how unfortunate it is that the H.264
packetization doesn't provide a way to indicate the necessary number of
reorder frames (see e.g. <https://github.com/FFMS/ffms2/issues/301>) and
I have talked to Moritz Bunkus about using the `MinCache` value to store
the number of reorder frames. (Btw: The "reference pseudo-cache system"
this element talks about seems to be undefined.)

2. Nevertheless I think that your proposal is totally wrong:

a) The semantics of `CodecDelay` are unclear: Although the current
wording only speaks about a delay that should be subtracted from the
timestamps, the muxers are actually using this value to map the Opus
preskip to a Matroska field, i.e. they are treating it as if it would
not only be a delay that should be subtracted from the timestamps, but
as if it also signaled `DiscardPadding` at the beginning. There was a
discussion in May 2016 in which everyone agreed that `CodecDelay`
included discarding samples at the beginning and the following proposal
for an alternative wording was made (see
<https://mailarchive.ietf.org/arch/msg/cellar/ATGyypffoo9DuIFGUpeRSfzTVqE>):
"CodecDelay is the duration of the codec-built-in delay in nanoseconds.
The decoded frames from the beginning of a stream should be discarded
until CodecDelay duration has passed, and CodecDelay must be subtracted
from each block timestamp in order to get the presentation timestamp.
The value should be small so the muxing of tracks with the same
presentation timestamp are in the same Cluster."
This wording was not adopted, not because the discard part was
controversial per se, but because it was unclear how exactly to convert
this ns value to a precise number of samples, given that they use
different timebases.

Notice that one thing has not been brought up during the linked
discussion: What exactly will be discarded? Is it always the first
`CodecDelay` output or is it only the output which has a negative
timestamp after the timestamp shift specified by `CodecDelay` has been
applied? This is important if the lowest timestamp is >0 (before
shifting by `CodecDelay`). In the second case one would have to
additionally use a `DiscardPadding` element at the beginning to signal
that the appropriate number of samples should be discarded. No muxer I
know of currently does this, i.e. they work as if not only the samples
that end up with a negative timestamp after the shift were discarded.
And if we keep this logic then the first frames will be decoded, but
discarded under your proposal; if we don't keep it, then some files will
have been wrongly muxed (i.e. then current muxers haven't signalled the
Opus `PreSkip` correctly at all and people that relied on this
interpretation are screwed).
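
The difference between the two readings can be shown with a small
Python sketch. Everything here is hypothetical: the function names, the
use of abstract time units instead of nanoseconds, and the assumption
that all frames have the same duration and are discarded whole. It only
illustrates that the readings agree when the lowest timestamp is 0 and
diverge when it is greater than 0.

```python
def discard_by_duration(block_ts, frame_duration, codec_delay):
    """Reading 1: always drop the first codec_delay worth of output."""
    kept, elapsed = [], 0
    for ts in block_ts:
        if elapsed < codec_delay:
            elapsed += frame_duration  # whole frames only, a simplification
            continue
        kept.append(ts - codec_delay)
    return kept


def discard_negative(block_ts, codec_delay):
    """Reading 2: drop only output whose shifted timestamp is negative."""
    return [ts - codec_delay for ts in block_ts if ts - codec_delay >= 0]


starts_at_zero = [0, 40, 80, 120]
starts_later = [40, 80, 120, 160]

# Lowest timestamp is 0: both readings keep the same frames.
print(discard_by_duration(starts_at_zero, 40, 80))  # [0, 40]
print(discard_negative(starts_at_zero, 80))         # [0, 40]

# Lowest timestamp is > 0: the readings disagree.
print(discard_by_duration(starts_later, 40, 80))    # [40, 80]
print(discard_negative(starts_later, 80))           # [0, 40, 80]
```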

b) The elementary streams of other codecs also have such an offset (see
e.g. equation (C-12) in the H.264 standard), but we don't store it for
them either. And it works.

c) The time in Annex E is obviously a dts given that it starts at zero
when decoding begins. (Here I consider the arrival of the data in a
buffer as part of the decoding process.) But Matroska uses pts. The only
thing that your proposal does is signalling the pts to dts difference.
But only for the very first picture. This is enough so that one can play
the track smoothly from the beginning; but it is not enough in general.

Let me explain: Nothing guarantees you that the pts to dts offset for
the first frame is the pts to dts offset for any further keyframes (even
when we restrict this to the keyframe RAP and not the delayed RAPs;
after all, if one discards from a CVS all OBUs in front of a keyframe
RAP, one still has a valid CVS (provided that the `Sequence Header OBU`
is still available)). That's because it may depend on
[buffer_removal_time] which needs to be signalled for every frame. Now
the obvious solution for this would be to use the highest pts to dts
offset for any RAP (or only for every keyframe RAP; I actually don't
know if one would have to use the offset between pts and dts for the
delayed RAPs, too). This has the downside that you would have to do the
muxing in two passes; after all, if you notice that you need to use an
increased value of `CodecDelay`, you would have to rewrite not only the
track header, but the timestamps written so far as well if you tried to
mux in one pass.
Here is a situation where the above scenario might be realistic:
You have a vfr track that actually consists of several sections that are
cfr (but when combined are nevertheless vfr). The beginning is (say)
50p, the rest is 25p (this really happens: if the movie is shot in 25p,
but the title and end credits are overlaid as 50p, then the result is
effectively 50p at the beginning and end and 25p in the middle; I have
several TV broadcasts like that). Then it is quite likely that the
individual 50p frames are smaller (coded size) so that they can be
transmitted faster and so one can use a lower ScheduledRemovalTiming for
the DFGs corresponding to the 50p frames. This means that the Removal
times (in decoding schedule mode) will be smaller as well. Hence the
InitialPresentationDelay for decoding from the very first frame is
smaller than the InitialPresentationDelay for decoding from one of the
later RAPs.
For this reason decoders would probably rely on the
[initial_display_delay_minus_1] values (that are luckily not stripped
away). But then there is no point in using `CodecDelay` at all as this
value can also be used at the beginning.
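
The two-pass problem described in c) can be sketched as follows. This
is a hypothetical Python illustration, not a muxer design: the function
names and the per-RAP offsets (in abstract time units) are made up. It
shows that the track-wide value would be the maximum pts to dts offset
over all RAPs, and that a one-pass muxer would have to rewrite already
written timestamps each time a later RAP raises that maximum.

```python
def codec_delay_for_track(rap_offsets):
    """Track-wide delay: the largest pts-to-dts offset of any RAP."""
    return max(rap_offsets)


def one_pass_rewrites(rap_offsets):
    """Indices of RAPs at which a one-pass muxer must raise the delay.

    Each such index means the track header and all timestamps written
    so far would have to be rewritten.
    """
    rewrites = []
    current = rap_offsets[0]
    for i, offset in enumerate(rap_offsets[1:], start=1):
        if offset > current:
            rewrites.append(i)
            current = offset
    return rewrites


# Hypothetical offsets: a 50p opening, then 25p sections with larger offsets.
offsets = [2, 2, 4, 3, 5]
print(codec_delay_for_track(offsets))  # 5
print(one_pass_rewrites(offsets))      # [2, 4]
```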

d) There are additional complications:

i) It would complicate muxers that would have to implement the
procedures described in E.4.

ii) It breaks compatibility with mp4/ISOBMFF because they only store the
initial_presentation_delay_minus_one value (storing it is recommended,
but not required) and (that's a SHOULD) they should set the
[timing_info_present] flag to zero which means that the bitstream itself
wouldn't contain an explicitly signalled presentation delay any more.
One would have to recalculate the `CodecDelay` value, but there is a
problem with this approach: Appendix E.3 knows two modes: Resource
availability mode and the decoding schedule mode. The latter needs the
buffer_removal_time information from the frame header, but they SHOULD
be stripped away during muxing into mp4, so they are likely not
available and the calculation can't be performed at all. The former has
the prerequisite that it needs to be cfr ([equal_picture_interval] equal
to 1). Given that the timing information has been discarded,
[equal_picture_interval] is generally unavailable in mp4, but we can
still check whether it is cfr. And what if it isn't? What `CodecDelay`
value does one use?

iii) It is not always possible, even when muxing from an AV1
elementary stream: A bitstream can contain
[initial_display_delay_minus_1] even when it doesn't signal explicit
timing info or when it doesn't signal the decoder_model_info or the
operating_parameters_info or if it doesn't signal the buffer removal time.
In my experience with H.264, it was enough to explicitly signal the
[max_num_reorder_frames] for correct playback; it was not necessary to
also add the SEI messages necessary for conformance testing of the
track. (This is also how x264 works by default: No unnecessary SEIs, but
the [max_num_reorder_frames] is set.) I actually wouldn't be surprised
if an x264-like AV1 encoder would use the same defaults: Signal
[initial_display_delay_minus_1], don't signal the buffer_removal_time
(too many wasted bits for no gain) and don't signal the operating
parameters.

iv) Of course the `CodecDelay` would also depend on the operating point
(if scalability is used). I have to admit this could be easily
rectified: Simply calculate the `CodecDelay` for every operating point
and use the maximum.

v) Purely Matroska-wise, there are further complications: Splitting such
a file would mean that the muxer has to use the correct `CodecDelay`
value for the second file. This is in general a different value than the
value for the first file, but the value can't be calculated because the
timing information has been discarded from the bitstream.
Alternatively appending files where two tracks that are to be appended
to one another have different `CodecDelay` values also requires changes;
if the second file has the higher `CodecDelay` value, one would have to
increase the `CodecDelay` for the whole output track to that value and
this entails shifting the timestamps of the first file, too.
There probably are more complications further down the road.
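
The appending case in v) can be illustrated with a short Python sketch.
The function `append_tracks`, its parameters and the numbers are
hypothetical, and timestamps are in abstract units. It assumes that the
stored timestamp is the presentation timestamp plus `CodecDelay` and
that the output track uses the larger of the two delays.

```python
def append_tracks(first, first_delay, second, second_delay, second_start):
    """Append `second` to `first` under a single CodecDelay value.

    first/second are stored Block timestamps; second_start is the
    presentation time at which the second file begins in the output.
    Returns the output CodecDelay and the new stored timestamps.
    """
    out_delay = max(first_delay, second_delay)
    # Undo each file's own delay, then apply the common one.
    out = [ts - first_delay + out_delay for ts in first]
    out += [ts - second_delay + second_start + out_delay for ts in second]
    return out_delay, out


delay, stored = append_tracks([10, 50, 90], 10, [30, 70], 30, 120)
print(delay)                         # 30
print(stored)                        # [30, 70, 110, 150, 190]
print([ts - delay for ts in stored]) # [0, 40, 80, 120, 160]
```

Note that every stored timestamp of the first file changed (10 became
30, and so on), which is the extra work referred to above.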

3. Here is a counterproposal: We more or less copy how mp4 does it. I.e.
we ignore InitialPresentationDelay when determining the timestamps and
we add a field that is equivalent to their
[initial_presentation_delay_minus_one].
Honestly I don't even know what exactly they mean by "display model
verification algorithm". Is it the procedure described in annex E? If
so, they can run this algorithm only if the video is cfr (resource
availability mode) or the buffer_removal_time is given in the frame
headers (decoding schedule mode), but the latter SHOULD have been
stripped away, so that leaves a cfr requirement. (Or they could have kept
the relevant parameters from the elementary stream (maybe not in the mp4
file, but somewhere else).)
Anyway, for our purposes it will be enough to use the maximum of all
[initial_display_delay_minus_1] values if said value is present for
every operating point that occurs in the file. It is one of the fields
which can't change mid-way in a CVS so it works well for one-pass muxers.
The easiest way would be probably to add a byte to the beginning of
`CodecPrivate` (whose definition would have to be changed). How about
the MSB bit being the "initial_presentation_delay_present" bit, then
three reserved bits (set to zero, may hypothetically be used to indicate
a new version of the AV1 packetization in Matroska if we wished to keep
the AV1 CodecID) and then the `initial_presentation_delay_minus_one`,
coded on 4 bits?
If someone wonders why I propose to add such a field even when the
information is in the `Sequence Header OBU` (after all, this is not
stripped away): Because it actually needn't be in the `Sequence Header
OBU`! E.g. think of an AV1 bitstream inside mp4 where the `Sequence
Header OBU` delay field doesn't exist, but where the  initial delay is
signalled on the container level. Muxing from mp4 to Matroska would
lose this information when it is not kept somewhere. Furthermore, just
as a muxer can actually count the number of reorder frames necessary for
H.264, it can probably do the equivalent with the initial presentation
delay for AV1 (this is not to say that this should be required or even a
SHOULD) and then it can put the value into this field (rewriting a field
in the track header is possible even for a one-pass muxer).
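
A minimal Python sketch of the byte layout proposed in point 3, to make
the bit positions unambiguous. The function names are hypothetical and
nothing here is an agreed format: it assumes the MSB is the "present"
bit, the next three bits are reserved and zero, and the low four bits
carry `initial_presentation_delay_minus_one`. The first helper takes
the maximum over all operating points and reports "absent" if any
operating point lacks the value.

```python
def track_delay_minus_one(per_operating_point):
    """Maximum over operating points, or None if any value is missing."""
    if not per_operating_point or any(v is None for v in per_operating_point):
        return None
    return max(per_operating_point)


def pack_delay_byte(delay_minus_one):
    """Build the proposed leading CodecPrivate byte."""
    if delay_minus_one is None:
        return 0x00  # present bit clear, reserved bits zero
    if not 0 <= delay_minus_one <= 15:
        raise ValueError("value must fit in 4 bits")
    return 0x80 | delay_minus_one


def unpack_delay_byte(byte):
    """Return the stored value, or None if the present bit is clear."""
    if byte & 0x70:
        raise ValueError("reserved bits are set")
    if not byte & 0x80:
        return None
    return byte & 0x0F


value = track_delay_minus_one([1, 3, 2])
print(hex(pack_delay_byte(value)))   # 0x83
print(unpack_delay_byte(0x83))       # 3
print(unpack_delay_byte(0x00))       # None
```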

4. How helpful is this buffer removal timing information actually? Is it
needed for streaming? I don't think so, because then the mp4 guys (one
of the main authors of the mp4 packetization works for Netflix) would
not have recommended to strip it away. But maybe conformance testing
should be mentioned as another reason not to strip the [timing_info]
away. (I would only keep the timing_info parameters for a cfr track in
order to have the real framerate stored, but I would discard the
decoder_model_info stuff.)

- Andreas Rheinhardt
