Re: [Cellar] AV1 mapping update

Hi Andreas,

Thanks for your detailed feedback.

2018-07-11 15:47 GMT+02:00 Andreas Rheinhardt
<andreas.rheinhardt@googlemail.com>:
> Steve Lhomme:
>> I updated the AV1 mapping to clean a few sentences.
>>
>>
>> https://github.com/Matroska-Org/matroska-specification/blob/av1-mappin/codec/av1.md
>> and the list of changes can be found here
>> https://github.com/Matroska-Org/matroska-specification/commits/av1-mappin/codec/av1.md
>>
> 1. Whether `DisplayWidth` and `DisplayHeight` needs to be written
> actually depends on the value of `DisplayUnit`.

True, I will mention that.

> 2. You forgot `OBU_PADDING` in the list of OBU types that mustn't be in
> the `CodecPrivate`. (Either that or your sentence that only
> `OBU_SEQUENCE_HEADER` and `OBU_METADATA` are currently allowed in the
> `CodecPrivate` should be changed.)

Indeed. The MP4 spec is a bit fuzzy on what is allowed. It's Sequence
Header and Metadata only if they apply to all samples. But anything
that is valid to put before a sync point is OK. In our case
it's better to be stricter. Remuxing from MP4 might require some
cleaning...

> 3. "They SHOULD have the [obu_has_size_field] set to 1 except for the
> last OBU in the sample, for which [obu_has_size_field] MAY be set to 0,
> in which case it is assumed to fill the remaining of the sample."
> "The OBUs in the Block MUST follow the [Low Overhead Bitstream Format
> syntax]."
> The first sentence leaves the possibility that [obu_has_size_field] is 0
> for OBUs other than the last OBU of a block (only a SHOULD). And the
> requirement in the second sentence actually makes MUST out of the SHOULD
> in the first sentence (making this part of the first sentence redundant)
> and contradicts/voids the MAY part of the first sentence. In other
> words, the two sentences should be merged to something like: "The `OBUs`
> in the block must follow the `Low Overhead Bitstream Format` (in which
> [obu_has_size_field] MUST be equal to one for every OBU) for every `OBU`
> with the possible exception of the very last `OBU` in which
> [obu_has_size_field] MAY be set to 0, in which case the `OBU` is assumed
> to consist of the remainder of the block."

Indeed there's a contradiction here. If we use MUST (it can't be a
lowercase "must") on [Low Overhead Bitstream Format] then the
[obu_has_size_field] MUST be 1.

On MP4 for the CodecPrivate the [obu_has_size_field] MUST be 1. But in
the Blocks it can be 0:

"Each OBU SHALL have the obu_has_size_field set to 1 except for the
last OBU in the sample, for which obu_has_size_field MAY be set to 0,
in which case it is assumed to fill the remaining of the sample"

I think we should mimic that. I'll rephrase it.
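
To make the rule concrete, here is a minimal sketch (Python, not from
the spec text) of how a demuxer could split a Block into OBUs under
that wording. The function names are hypothetical; the header bit
layout and the leb128 size field are how I read the AV1 bitstream
syntax.

```python
def read_leb128(data, pos):
    """Decode an unsigned leb128 value; return (value, next_pos)."""
    value = 0
    for i in range(8):  # AV1 limits leb128 to 8 bytes
        byte = data[pos + i]
        value |= (byte & 0x7F) << (7 * i)
        if not byte & 0x80:
            return value, pos + i + 1
    raise ValueError("leb128 value longer than 8 bytes")


def split_obus(sample):
    """Split a Block payload into (obu_type, payload) tuples."""
    obus = []
    pos = 0
    while pos < len(sample):
        header = sample[pos]
        obu_type = (header >> 3) & 0x0F
        has_extension = bool(header & 0x04)  # obu_extension_flag
        has_size = bool(header & 0x02)       # obu_has_size_field
        pos += 2 if has_extension else 1
        if has_size:
            size, pos = read_leb128(sample, pos)
        else:
            # No size field: the OBU is assumed to fill the rest of
            # the sample, so it is necessarily the last one.
            size = len(sample) - pos
        end = pos + size
        if end > len(sample):
            raise ValueError("OBU overruns the sample")
        obus.append((obu_type, sample[pos:end]))
        pos = end
    return obus
```

Note that a parser like this cannot detect a missing size field on an
OBU that is not the last one: it just swallows the rest of the sample.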

> 4. "ReferenceBlocks inside a BlockGroup MUST reference frames according
> to the [ref_frame_idx] values of frame that is neither a KEYFRAME nor an
> INTRA_ONLY_FRAME.": The problem with this sentence is that
> [ref_frame_idx] needn't be present. It depends upon
> [frame_refs_short_signaling] and [show_existing_frame]. If one uses a
> Block inside a Blockgroup and if [show_exsting_frame] equals one one
> should reference the block that contained the showable frame that is now
> output (and that this should be the only `ReferenceBlock` written). In
> case of [frame_refs_short_signaling] == 1 the obvious candidates for
> `ReferenceBlocks` are the blocks containing the `last_frame_idx` and
> `gold_frame_idx` that are explicitly signalled. If I am not mistaken,
> then there are also other reference frames that are not explicitly
> signalled, but computed. I don't know if we should really write a
> `ReferenceBlock` entry for every reference as the current proposal seems
> to imply. This would be quite a bit of overhead for no gain (and
> furthermore, it would complicate muxers that would have to compute the
> references that are not explicitly signalled in case that

This is how `ReferenceBlock` is supposed to be used. So a muxer that
has no idea of any codec can cut a file and keep the relevant
references. So they all have to be there. It's one of the reasons
SimpleBlock was added, to simplify things a little (and reduce
overhead).
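
As an illustration of that codec-agnostic use, here is a rough Python
sketch. It assumes each Block is identified by its timestamp and that
every ReferenceBlock value is a timestamp relative to the Block it is
attached to; the function and variable names are made up for the
example.

```python
def blocks_needed(target, reference_blocks):
    """Return the sorted timestamps of all Blocks needed to decode `target`.

    reference_blocks maps a Block timestamp to the list of its
    ReferenceBlock values (relative timestamps).
    """
    needed = set()
    pending = [target]
    while pending:
        ts = pending.pop()
        if ts in needed:
            continue
        needed.add(ts)
        for offset in reference_blocks[ts]:
            # ReferenceBlock is relative to the referencing Block.
            pending.append(ts + offset)
    return sorted(needed)
```

A cutting tool that wants to keep the Block at timestamp 80 would keep
everything returned by `blocks_needed(80, refs)`, without parsing any
AV1 data.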

> [frame_refs_short_signaling] is 1). One `ReferenceBlock` would be enough
> to distinguish keyframes from non-keyframes.
> By the way: If a temporal unit contains multiple frames with references,
> whose references should end up as `ReferenceBlocks`? Or may the muxer
> choose some?

I think a Temporal Unit can only have one visible frame. I don't
know if golden frames can have extra references. But the BlockGroup
should contain all frames needed to decode this frame, that includes
all the frames in the Block (even if not visible).

> 5. AV1 may use spatial scalability and/or temporal scalability. What do
> we make of these? They are currently not forbidden if I am not mistaken,
> but if e.g. the spatial dimensions of different layers disagree, the
> `PixelWidth` and `PixelHeight` values can't be true for all layers.
> Matroska seems to be missing some features here.

Our spec says that the Sequence Header OBU should be valid for all
frames. That can't be used for spatial scalability. We don't support
that mode for now.

It may be technically possible to add different sizes in BlockAddition.

> 6. Depending on [frame_size_override_flag] there is even the possibility
> that the size of the frames differs even without scalability (if I am
> not mistaken). Should this be allowed?

It's not restricted by our spec. But then it's up to the codec to
handle, not the container. It's used with a SWITCH frame. The MP4 spec
doesn't make any special case for that either.

(please add spacing between your paragraphs, it's hard to read this big
block of text)
> 7. Then there is another thing with keyframes and cues (for this point
> it is always presumed that the relevant sequence header OBUs are
> available regardless of whether this is done in-band or via CodecPrivate):
> a) The proposal currently does not take into account that key frames
> reset the decoder when they are output, not when they are decoded. A key
> frame needn't be immediately output; if it is (i.e. [show_frame]
> equaling 1), it is called a "key frame random access point" in section
> 7.6 of the standard and is the equivalent of an IDR frame in H.264.
> Everything's fine here. But a key frame can also be declared a
> showable_frame (but only if [show_frame] equals 0) and output later via
> the show_existing_frame mechanism. This is similar to an open GOP in
> other codecs (but in contrast to them, the block that contains the coded
> keyframe doesn't have the same timestamp (pts) as the first frame that
> can be output after a seek). The coded key frame with [show_frame] equal
> to zero is called a delayed random access point and a key frame
> dependent recovery point is a frame where a key frame with
> [showable_frame] equal to 1 is output via the show_existing_frame
> mechanism. If one starts decoding at the delayed random access point,
> all the output frames up to but not including the key frame dependent
> recovery point can depend both on the delayed random access point frame
> and on other earlier frames so that these frames can't be correctly
> decoded in general. But all the frames from the key frame dependent
> recovery point onwards can be correctly decoded if one starts decoding
> at the delayed random access point (because the decoder is reset after
> displaying the key frame). If one starts decoding at the key frame
> dependent recovery point, one doesn't have the key frame that should be
> shown via the show_existing_frames mechanism at all, so that this frame
> is simply not a real key frame.
> b) But although a key frame dependent recovery point is not a "real" key
> frame, it has the same [frame_type] as the frame that is output, i.e.
> its [frame_type] is KEY_FRAME. According to our current proposal this
> would mean that it should be treated as a keyframe in Matroska which is
> obviously wrong.

That's not how I understand it. Here's section 7.6.3:

"Informally, the requirement for decoder conformance is that decoding
can start at any key frame random access point or delayed random
access point."

And 7.6.2:

"delayed random access point is defined as being a frame:
• with frame_type equal to KEY_FRAME
• with show_frame equal to 0
• that is contained in a temporal unit that also contains a sequence header OBU"

So as long as we seek on frames of type KEY_FRAME we should be able to
seek, whether it's a visible frame or not.

But because this is a bit loose in the 7.6.3 section they add this:

"To support the different modes of operation, a conformant decoder is
required to be able to decode bitstreams consisting of:
• a temporal unit containing a delayed random access point
• immediately followed by a temporal unit containing the associated
key frame dependent recovery point"

So the invisible KEY_FRAME should be immediately followed by the
recovery point data. So effectively it will work. There's also a note
that if it's not followed immediately then what is done with the
intermediate frames is implementation dependent. I don't think we need
to care too much.
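
To spell out the distinction, a small Python sketch of the
classification as I read section 7.6.2. The delayed random access
point conditions are the ones quoted above; requiring the same
sequence header condition for a key frame random access point is my
assumption, and the function name is hypothetical.

```python
KEY_FRAME = 0  # frame_type value for key frames in AV1


def classify_rap(frame_type, show_frame, has_sequence_header):
    """Return the kind of random access point, or None if it is not one.

    has_sequence_header: True if the temporal unit containing the
    frame also contains a sequence header OBU.
    """
    if frame_type != KEY_FRAME or not has_sequence_header:
        return None
    if show_frame:
        return "key frame random access point"
    return "delayed random access point"
```

Both results would end up flagged as keyframes in Matroska; only the
first one is displayed immediately.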

> c) Marking a delayed random access point as keyframe deviates from the
> way that flag has been traditionally understood: If one starts decoding
> at this point, one doesn't get the frame that should be output for the
> temporal unit containing the delayed random access point. But I
> nevertheless think that these are the right keyframes, because they are
> the points at which random access has to begin when there aren't key
> frame random access points available; this also means that one can split
> the stream at this point and the second part will still play so that a
> muxer like mkvmerge needn't be rewritten too much.

Yes, IMO this is the correct way.

> d) A consequence of this is that a `Blockgroup` containing a delayed
> random access point mustn't contain a `ReferenceBlock` (although the
> actual frame that is output for that temporal unit very likely uses
> other reference frames than the key frame that is contained in the same
> temporal unit).

This won't happen if it's a proper random access point, i.e. it doesn't
need past frames to start decoding. If it's not then it's not a RAP
and then it can/should have ReferenceBlock.

> e) Yes, this proposal means that it is impossible to tell from Matroska
> alone (well, from the block structure that is; see f) for a way for
> which one could put this information into the Cues) whether it is a key
> frame random access point or a delayed random access point. One will
> have to decode it (or parse deeper) to know.

No, this is independent of the codec. Also Cues can target frames
that can't be seeked to directly, but that's beside the point. A
SimpleBlock marked keyframe or a BlockGroup with no ReferenceBlock can
be seeked to directly and that's why they equal Random Access Points
as defined in AV1 (and other codecs).
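
In code the container-level test is trivial. A sketch in Python, with
hypothetical function names; the 0x80 keyframe bit is the one in the
SimpleBlock flags byte.

```python
SIMPLEBLOCK_KEYFRAME = 0x80  # keyframe bit of the SimpleBlock flags byte


def simpleblock_is_rap(flags):
    """A SimpleBlock can be seeked to directly if its keyframe bit is set."""
    return bool(flags & SIMPLEBLOCK_KEYFRAME)


def blockgroup_is_rap(reference_blocks):
    """A BlockGroup can be seeked to directly if it has no ReferenceBlock."""
    return len(reference_blocks) == 0
```

Neither check looks at the codec payload, which is the point.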

> f) This also leads to problems with seeking: If one simply added a
> CuePoint for the keyframe (i.e. for the delayed random access point) and
> the user wants to seek to a point between the delayed random access
> point (inclusive) and the dependent recovery point (exclusive) and the
> player used the cues to seek to the nearest keyframe in front of the
> desired point, then decoding at the point referenced in the cues would
> not yield the desired frame (it would be either corrupted or not output
> at all). Therefore I think it is best to add a CuePoint for every key
> frame random access point and every key frame dependent recovery point.
> The CuePoint for the key frame random access point would be an ordinary
> CuePoint as usual. But the CuePoint for the key frame dependent recovery
> point wouldn't be (my favourite is iv) (and if I were allowed to play
> God it would be i))):

Cues are quite loose. It would be possible to do Cues only for frames
that are not delayed RAP and that's valid. IMO it's fine to reference
the delayed RAP. But they are RAP so it's legal to seek there.

The AV1 specs have this to say about this tricky case:

"Note: In practice, decoder implementations are expected to be able to
start decoding bitstreams from a delayed random access point when the
intermediate temporal units are still present. The decoder should
correctly produce all output frames from the next key frame or key
frame dependent recovery point onwards, while the preceding frames are
implementation defined. For example: a streaming decoder may choose to
decode and display all frames even when the reference frames are not
available (tolerating some errors in the output), a low latency
decoder may choose to decode and display all frames that are
guaranteed to be correct (e.g. an inter frame that only uses inter
prediction from the delayed random access point), a media player
decoder may choose to decode and display only frames starting from a
key frame or key frame dependent recovery point (guaranteeing smooth
playback once display starts)."

It's not up to the container to solve this.

> i) A comprehensive way of doing it is this: The CueTime would be the
> timestamp of the block containing the dependent recovery point; it would
> include a CueTrackPositions for the video track we are talking about
> that contains the right CueTrack, the CueClusterPosition containing the
> position of the dependent recovery point block and a CueReference with
> CueRefTime and CueRefCluster, both corresponding to the valus of the
> delayed random access point. This proposal has several downsides: It
> uses Cue elements that are deprecated in Matroska and not part of Webm.
> So this would require a quite nontrivial change in both projects. (Btw:
> If one does this, one should add a default value for `CueRefCluster`: It
> should be the same as `CueClusterPosition` as both blocks that we are
> talking about will probably end up in the same cluster anyway.)
> ii) One uses the CueTime of the dependent recovery point, but the
> position of the Cluster of the delayed access point (and
> `CueRelativePosition` (if used) should also point to this block).
> Pro: It only uses elements that are supported by both Matroska and Webm.
> Furthermore, the specs only say that `CueClusterPosition` should point
> to the cluster containing the "required block; they don't explicitly say
> that said block needs to have the same timestamp as `CueTime`.
> Contra: How does a demuxer know from which block onwards it should feed
> the data to the decoder? It might use the `CueRelativePosition`, but
> probably a lot of demuxers would simply read the cluster until they come
> to the block with timestamp `CueTime` (i.e. they interpret the specs so
> that the "required block" is the block with the timestamp `CueTime`) and
> then they would either deliver this to the decoder or conclude that the
> file is damaged (because the block they found is no keyframe).
> iii) The last is the same as i) with the difference that `CueRefCluster`
> is omitted. It is also incompatible with current Webm, but at least it
> has the advantage that it doesn't use any currently deprecated elements
> of Matroska. One could add a requirement that the delayed random access
> point and the dependent recovery point need to be in the same cluster
> and then omitting `CueRefCluster` is not a problem any more.
> iv) And then there is the possibility of creating a normal CuePoint for
> the dependent recovery point, writing the dependent recovery point as a
> Block in a Blockgroup with exactly one ReferenceBlock which points to
> the delayed access point block and let the demuxer seek backwards from
> the dependent recovery point to the delayed access point.
> Pro: Would only use things that are already supported by Matroska and
> Webm. It would also not be AV1 specific. The demuxer doesn't need to
> know anything about AV1, everything is signalled at the container level.
> Contra: Demuxers would have to be adapted not to expect any more that
> only keyframes are referenced in the cues. They would also have to be
> adapted to actually make use of the value of `ReferenceBlock` and seek
> backwards. This also implies more seeks, but this should be quite
> limited when one puts both the delayed random access point and the
> dependent recovery point in the same cluster -- hopefully the data is
> still cached. (Maybe one should add a SHOULD clause that says that both
> blocks should be in the same cluster.)
> g) Of course there are two easy alternative solutions:
> i) Restrict the type of AV1 that is allowed in Matroska even further so
> that all key frames are of key frame random access type. (This could
> exclude quite a lot of AV1 and therefore I recommend not doing so.)
> ii) Create cues as usual, i.e. reference every delayed random access
> point, and don't care about the fact that seeking will be partially
> broken in this case.
> h) It should be noted that exactly the same situation exists with
> periodic intra refresh in general. There was a short discussion on the
> Matroska developer mailing list in April 2011, but nothing came out of
> it. Every solution I outlined here for AV1 is also applicable for this case.
>
> Steve Lhomme:
>> Since we allow stripping the Sequence Header OBU from the stream when
>> it's equal to the CodecPrivate one, we need to add it back to the
>> bitstream for compliance. At least when seeking on keyframes. So I
>> added a section to explain that.
>>
>> IMO that's an extra feature of the CodecPrivate that it's meant to be
>> added to the bistream as-is. And in this case on startup and when
>> seeking. I wonder if we should add an element next to the CodecPrivate
>> to describe that. Because in this case it's not entirely opaque to the
>> demuxer. Or maybe it's implied by the CodecID and is up to the decoder
>> to use it how it's supposed to be (in this case detecting keyframes
>> and possibly adding back the Sequence Header OBU).
>
> 8. I think we can relax the requirements on the existence of in-band
> sequence header OBUs a bit: If a keyframe (i.e. a key random access
> point or a delayed random access point, not a dependent recovery point)
> uses the same sequence header OBU as in the CodecPrivate (including the
> same operating_parameters_info), then the sequence header OBU needn't be
> prepended to the block with the keyframe, because seeking already works
> without it provided one always adds the sequence header OBU from the
> CodecPrivate back in the bitstream on seeking. For example, consider the
> following scenario:
> One has an elementary stream that uses two different sequence header
> OBUs A and B that only differ in the operating_parameters_info. The
> first three keyframes use A, between the third and the B is contained in
> a temporal unit between the third and the fourth keyframe. Between the
> sixth and the seventh keyframe is a temporal unit containing sequence
> header A again. Then a muxer that wants to put this elementary stream
> into Matroska may put A in the CodecPrivate can strip the very first
> occurence of A away; it must leave B inside the temporal unit that it
> was in (so that a player that plays the file linearly is notified about
> the change) and has to make sure that keyframes #4 to #6 contain
> sequence header B (so that one has the correct sequence header when
> seeking to said keyframes). It mustn't strip A between the sixth and the
> seventh keyframe away (so that a player that plays the file linearly
> notices the change of sequence header), but it needn't preprend
> keyframes #7 and following with A.

That works but then we need to tell in the specs that a following
Sequence Header MUST NOT be stripped if the previous Sequence Header
was not bit-identical to the one in CodecPrivate.

IMO it's valid to output A and then B before the rest of the data in
frame #4. That should be done if it's a keyframe. But if it's not a
keyframe the demuxer/decoder doesn't know it has to prepend the
CodecPrivate there. That could be a problem. Even though it doesn't
make much sense to change Sequence Header data before a non keyframe
(RAP).

So maybe your proposal should be added. We'll gain a bit of weight but
it's safer.
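
Here is how I picture the muxer side of this, as a rough Python
sketch and nothing normative. Each temporal unit is modelled as an
(is_keyframe, sequence_header) pair, with None when the unit carries
no Sequence Header; the function name and that data model are made up
for the example.

```python
def headers_to_write(codec_private, units):
    """Decide, per temporal unit, which Sequence Header to store in-band.

    Returns one entry per unit: the header bytes to write, or None.
    """
    active = codec_private  # header in effect when playing linearly
    out = []
    for is_keyframe, header in units:
        if header is not None:
            # Strip only if both this header and the previous one are
            # bit-identical to the CodecPrivate one.
            redundant = header == codec_private and active == codec_private
            out.append(None if redundant else header)
            active = header
        elif is_keyframe and active != codec_private:
            # Seeking here would restore the CodecPrivate header, which
            # is the wrong one, so the active header must be in-band.
            out.append(active)
        else:
            out.append(None)
    return out
```

With your A/B scenario this keeps B on keyframes #4 to #6, keeps the A
between keyframes #6 and #7, and strips A everywhere else.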

> That way one can save a few bytes.
> This is consistent with an interpretation of the CodecPrivate as the
> default extradata/header (but it is not necessarily a truly global
> header). Before `CueCodecState` was deprecated it had a default value of
> 0 that mandated that one should look in the CodecPrivate for the
> CodecState upon seking. So the only thing specific to AV1 in this case
> is the "as-is" part; that one should reset the decoder to whatever
> initialization information is contained in the CodecPrivate upon seeking
> is nothing new.
>
> 9. Image that future AV1 encoders would find out that changing the
> sequence header (with a change of the CVS) enables better compression.
> What would we do in this case? Simply lift the restriction of one track
> = one CVS even when this means that some players that can't cope with
> changing coded video sequences won't know in advance whether they can
> play the files at all? I'm asking because we don't include a version
> field in the CodecPrivate (in contrast to how it is done in mp4 with avcC).

Another CodecID should be used if the restrictions/format are redefined.
It's safer than reusing one with different data and hoping the system
using the old one paid attention to the version bit that was always
the same anyway.

> Steve Lhomme:
>> Let me know what you think so we can settle this spec for good.
>>
> I think we are not even close to settling this for good.

\o/

>
> - Andreas Rheinhardt

-- 
Steve Lhomme
Matroska association Chairman