Re: [Cellar] AV1 mapping update

New update after the points from Andreas were addressed:

https://github.com/Matroska-Org/matroska-specification/blob/av1-mappin/codec/av1.md
and the list of changes can be found here
https://github.com/Matroska-Org/matroska-specification/commits/av1-mappin/codec/av1.md

2018-07-12 11:54 GMT+02:00 Steve Lhomme <slhomme@matroska.org>:
> Hi Andreas,
>
> Thanks for your detailed feedback.
>
> 2018-07-11 15:47 GMT+02:00 Andreas Rheinhardt
> <andreas.rheinhardt@googlemail.com>:
>> Steve Lhomme:
>>> I updated the AV1 mapping to clean a few sentences.
>>>
>>>
>>> https://github.com/Matroska-Org/matroska-specification/blob/av1-mappin/codec/av1.md
>>> and the list of changes can be found here
>>> https://github.com/Matroska-Org/matroska-specification/commits/av1-mappin/codec/av1.md
>>>
>> 1. Whether `DisplayWidth` and `DisplayHeight` need to be written
>> actually depends on the value of `DisplayUnit`.
>
> True, I will mention that.
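
A minimal sketch of that dependency, assuming (from my reading of the Matroska element definitions) that `DisplayWidth`/`DisplayHeight` only have a default when `DisplayUnit` is 0 (pixels). Cropping is ignored here and the function name is made up for illustration:

```python
DISPLAY_UNIT_PIXELS = 0  # Matroska DisplayUnit value meaning "pixels"


def display_size_must_be_written(display_unit, pixel_width, pixel_height,
                                 display_width, display_height):
    """Return True if DisplayWidth/DisplayHeight cannot be left to defaults.

    Assumption: the default only exists for DisplayUnit == 0 and then equals
    the pixel dimensions (cropping is deliberately ignored in this sketch).
    """
    if display_unit != DISPLAY_UNIT_PIXELS:
        # No default value for other units, so the elements have to be present.
        return True
    return (display_width, display_height) != (pixel_width, pixel_height)
```

With `DisplayUnit` 0 and matching sizes the elements can be omitted; with any other unit they always have to be written.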
>
>> 2. You forgot `OBU_PADDING` in the list of OBU types that mustn't be in
>> the `CodecPrivate`. (Either that or your sentence that only
>> `OBU_SEQUENCE_HEADER` and `OBU_METADATA` are currently allowed in the
>> `CodecPrivate` should be changed.)
>
> Indeed. The MP4 spec is a bit fuzzy on what is allowed. It's Sequence
> Header and Metadata only if they apply to all samples. But anything
> that is valid to put before a sync point is OK. But in our case
> it's better to be more strict. Remuxing from MP4 might require some
> cleaning...
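
To illustrate the strict reading (only `OBU_SEQUENCE_HEADER` and `OBU_METADATA` in the `CodecPrivate`), here is a rough sketch of the check a remuxer could run. The `obu_type` values are the ones from the AV1 bitstream spec; the helper names are hypothetical:

```python
OBU_SEQUENCE_HEADER = 1
OBU_METADATA = 5
OBU_PADDING = 15

# Strict rule discussed above: nothing else may appear in the CodecPrivate.
ALLOWED_IN_CODEC_PRIVATE = frozenset({OBU_SEQUENCE_HEADER, OBU_METADATA})


def obu_type_of(header_byte):
    """Extract the 4-bit obu_type from the first byte of an OBU header."""
    return (header_byte >> 3) & 0x0F


def codec_private_types_ok(obu_types):
    """True if every OBU type in the list is allowed in the CodecPrivate."""
    return all(t in ALLOWED_IN_CODEC_PRIVATE for t in obu_types)
```

A remuxer coming from MP4 would drop (or reject) anything for which this check fails, `OBU_PADDING` included.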
>
>> 3. "They SHOULD have the [obu_has_size_field] set to 1 except for the
>> last OBU in the sample, for which [obu_has_size_field] MAY be set to 0,
>> in which case it is assumed to fill the remaining of the sample."
>> "The OBUs in the Block MUST follow the [Low Overhead Bitstream Format
>> syntax]."
>> The first sentence leaves the possibility that [obu_has_size_field] is 0
>> for OBUs other than the last OBU of a block (only a SHOULD). And the
>> requirement in the second sentence actually makes MUST out of the SHOULD
>> in the first sentence (making this part of the first sentence redundant)
>> and contradicts/voids the MAY part of the first sentence. In other
>> words, the two sentences should be merged to something like: "The `OBUs`
>> in the block must follow the `Low Overhead Bitstream Format` (in which
>> [obu_has_size_field] MUST be equal to one for every OBU) for every `OBU`
>> with the possible exception of the very last `OBU` in which
>> [obu_has_size_field] MAY be set to 0, in which case the `OBU` is assumed
>> to consist of the remainder of the block."
>
> Indeed there's a contradiction here. If we use MUST (can't be must
> lowercase) on [Low Overhead Bitstream Format] then the
> [obu_has_size_field] MUST be 1.
>
> On MP4 for the CodecPrivate the [obu_has_size_field] MUST be 1. But in
> the Blocks it can be 0:
>
> "Each OBU SHALL have the obu_has_size_field set to 1 except for the
> last OBU in the sample, for which obu_has_size_field MAY be set to 0,
> in which case it is assumed to fill the remaining of the sample"
>
> I think we should mimic that. I'll rephrase it.
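
A sketch of what that rule means for a demuxer splitting a Block into OBUs: every OBU carries a size, except possibly the last one, which then runs to the end of the Block. The header bit layout and leb128 decoding follow the AV1 bitstream spec as I understand it; the function names are made up:

```python
def read_leb128(data, pos):
    """Decode an AV1 leb128 value starting at pos; return (value, next_pos)."""
    value = 0
    for i in range(8):
        byte = data[pos + i]
        value |= (byte & 0x7F) << (7 * i)
        if not byte & 0x80:
            return value, pos + i + 1
    raise ValueError("leb128 value longer than 8 bytes")


def split_obus(block):
    """Split a Block payload into (obu_type, raw_obu_bytes) tuples.

    An OBU with obu_has_size_field == 0 is assumed to fill the remainder of
    the Block, so it can only ever be the last one.
    """
    obus = []
    pos = 0
    while pos < len(block):
        start = pos
        header = block[pos]
        has_extension = bool(header & 0x04)  # obu_extension_flag
        has_size = bool(header & 0x02)       # obu_has_size_field
        pos += 2 if has_extension else 1
        if has_size:
            size, pos = read_leb128(block, pos)
            end = pos + size
        else:
            end = len(block)
        if pos > len(block) or end > len(block):
            raise ValueError("OBU runs past the end of the Block")
        obus.append(((header >> 3) & 0x0F, block[start:end]))
        pos = end
    return obus
```

For example, a temporal delimiter with a size field followed by a frame OBU without one is split into two OBUs, the second taking all remaining bytes.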
>
>> 4. "ReferenceBlocks inside a BlockGroup MUST reference frames according
>> to the [ref_frame_idx] values of frame that is neither a KEYFRAME nor an
>> INTRA_ONLY_FRAME.": The problem with this sentence is that
>> [ref_frame_idx] needn't be present. It depends upon
>> [frame_refs_short_signaling] and [show_existing_frame]. If one uses a
>> Block inside a Blockgroup and if [show_existing_frame] equals one, one
>> should reference the block that contained the showable frame that is now
>> output (and this should be the only `ReferenceBlock` written). In
>> case of [frame_refs_short_signaling] == 1 the obvious candidates for
>> `ReferenceBlocks` are the blocks containing the `last_frame_idx` and
>> `gold_frame_idx` that are explicitly signalled. If I am not mistaken,
>> then there are also other reference frames that are not explicitly
>> signalled, but computed. I don't know if we should really write a
>> `ReferenceBlock` entry for every reference as the current proposal seems
>> to imply. This would be quite a bit of overhead for no gain (and
>> furthermore, it would complicate muxers that would have to compute the
>> references that are not explicitly signalled in case that
>
> This is how `ReferenceBlock` is supposed to be used. So a muxer that
> has no idea of any codec can cut a file and keep the relevant
> references. So they all have to be there. It's one of the reasons
> SimpleBlock was added, to simplify things a little (and reduce
> overhead).
>
>> [frame_refs_short_signaling] is 1). One `ReferenceBlock` would be enough
>> to distinguish keyframes from non-keyframes.
>> By the way: If a temporal unit contains multiple frames with references,
>> whose references should end up as `ReferenceBlocks`? Or may the muxer
>> choose some?
>
> I think a Temporal Unit can only have one visible frame. I don't
> know if golden frames can have extra references. But the BlockGroup
> should contain all frames needed to decode this frame, that includes
> all the frames in the Block (even if not visible).
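
Roughly, this is what a codec-agnostic cutter relies on: it follows the `ReferenceBlock` entries transitively and keeps everything it reaches. A minimal sketch, where `references` is a hypothetical map from a block's timestamp to the absolute timestamps of the blocks it references:

```python
def blocks_needed(references, wanted):
    """Return the timestamps of all blocks needed to decode `wanted`.

    `references` maps a block timestamp to the timestamps it references
    (already resolved from the relative ReferenceBlock values).
    """
    needed = set()
    stack = [wanted]
    while stack:
        timestamp = stack.pop()
        if timestamp in needed:
            continue
        needed.add(timestamp)
        stack.extend(references.get(timestamp, ()))
    return needed
```

If a reference is missing from the BlockGroup, the cutter has no way to know the block is needed, which is why they all have to be listed.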
>
>> 5. AV1 may use spatial scalability and/or temporal scalability. What do
>> we make of these? They are currently not forbidden if I am not mistaken,
>> but if e.g. the spatial dimensions of different layers disagree, the
>> `PixelWidth` and `PixelHeight` values can't be true for all layers.
>> Matroska seems to be missing some features here.
>
> Our spec says that the Sequence Header OBU should be valid for all
> frames. That can't be used for spatial scalability. We don't support
> that mode for now.
>
> It may be technically possible to add different sizes in BlockAddition.
>
>> 6. Depending on [frame_size_override_flag] there is even the possibility
>> that the size of the frames differs even without scalability (if I am
>> not mistaken). Should this be allowed?
>
> It's not restricted by our spec. But then it's up to the codec to
> handle, not the container. It's used with a SWITCH frame. The MP4 spec
> doesn't make any special case for that either.
>
> (please add spacing between your paragraphs, it's hard to read this big
> block of text)
>> 7. Then there is another thing with keyframes and cues (for this point
>> it is always presumed that the relevant sequence header OBUs are
>> available regardless of whether this is done in-band or via CodecPrivate):
>> a) The proposal currently does not take into account that key frames
>> reset the decoder when they are output, not when they are decoded. A key
>> frame needn't be immediately output; if it is (i.e. [show_frame]
>> equaling 1), it is called a "key frame random access point" in section
>> 7.6 of the standard and is the equivalent of an IDR frame in H.264.
>> Everything's fine here. But a key frame can also be declared a
>> showable_frame (but only if [show_frame] equals 0) and output later via
>> the show_existing_frame mechanism. This is similar to an open GOP in
>> other codecs (but in contrast to them, the block that contains the coded
>> keyframe doesn't have the same timestamp (pts) as the first frame that
>> can be output after a seek). The coded key frame with [show_frame] equal
>> to zero is called a delayed random access point and a key frame
>> dependent recovery point is a frame where a key frame with
>> [showable_frame] equal to 1 is output via the show_existing_frame
>> mechanism. If one starts decoding at the delayed random access point,
>> all the output frames up to but not including the key frame dependent
>> recovery point can depend both on the delayed random access point frame
>> and on other earlier frames so that these frames can't be correctly
>> decoded in general. But all the frames from the key frame dependent
>> recovery point onwards can be correctly decoded if one starts decoding
>> at the delayed random access point (because the decoder is reset after
>> displaying the key frame). If one starts decoding at the key frame
>> dependent recovery point, one doesn't have the key frame that should be
>> shown via the show_existing_frame mechanism at all, so that this frame
>> is simply not a real key frame.
>> b) But although a key frame dependent recovery point is not a "real" key
>> frame, it has the same [frame_type] as the frame that is output, i.e.
>> its [frame_type] is KEY_FRAME. According to our current proposal this
>> would mean that it should be treated as a keyframe in Matroska which is
>> obviously wrong.
>
> That's not how I understand it. Here's section 7.6.3:
>
> "Informally, the requirement for decoder conformance is that decoding
> can start at any key frame random access point or delayed random
> access point."
>
> And 7.6.2:
>
> "delayed random access point is defined as being a frame:
> • with frame_type equal to KEY_FRAME
> • with show_frame equal to 0
> • that is contained in a temporal unit that also contains a sequence header OBU"
>
> So as long as we seek on frames of type KEY_FRAME we should be able to
> seek. Whether it's a visible frame or not.
>
> But because this is a bit loose in the 7.6.3 section they add this:
>
> "To support the different modes of operation, a conformant decoder is
> required to be able to decode bitstreams consisting of:
> • a temporal unit containing a delayed random access point
> • immediately followed by a temporal unit containing the associated
> key frame dependent recovery point"
>
> So the invisible KEY_FRAME should be immediately followed by the
> recovery point data. So effectively it will work. There's also a note
> that if it's not followed immediately then what is done with the
> intermediate frames is implementation dependent. I don't think we need
> to care too much.
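
The two kinds of random access point can be told apart from three values. A small sketch, using the delayed random access point definition quoted above and assuming the key frame random access point definition is the same except for `show_frame` equal to 1 (the enum and function names are hypothetical):

```python
from enum import Enum

KEY_FRAME = 0  # AV1 frame_type value for key frames


class AccessPoint(Enum):
    NONE = "none"
    KEY_FRAME_RAP = "key frame random access point"
    DELAYED_RAP = "delayed random access point"


def classify(frame_type, show_frame, tu_has_sequence_header):
    """Classify a frame using the section 7.6.2 style definitions."""
    if frame_type != KEY_FRAME or not tu_has_sequence_header:
        return AccessPoint.NONE
    if show_frame:
        return AccessPoint.KEY_FRAME_RAP
    return AccessPoint.DELAYED_RAP
```

Under the reading above, both non-`NONE` results would be marked as keyframes in Matroska.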
>
>
>> c) Marking a delayed random access point as keyframe deviates from the
>> way that flag has been traditionally understood: If one starts decoding
>> at this point, one doesn't get the frame that should be output for the
>> temporal unit containing the delayed random access point. But I
>> nevertheless think that these are the right keyframes, because they are
>> the points at which random access has to begin when there aren't key
>> frame random access points available; this also means that one can split
>> the stream at this point and the second part will still play so that a
>> muxer like mkvmerge needn't be rewritten too much.
>
> Yes, IMO this is the correct way.
>
>> d) A consequence of this is that a `Blockgroup` containing a delayed
>> random access point mustn't contain a `ReferenceBlock` (although the
>> actual frame that is output for that temporal unit very likely uses
>> other reference frames than the key frame that is contained in the same
>> temporal unit).
>
> This won't happen if it's a proper random access point, ie it doesn't
> need past frames to start decoding. If it's not then it's not a RAP
> and then it can/should have ReferenceBlock.
>
>> e) Yes, this proposal means that it is impossible to tell from Matroska
>> alone (well, from the block structure that is; see f) for a way for
>> which one could put this information into the Cues) whether it is a key
>> frame random access point or a delayed random access point. One will
>> have to decode it (or parse deeper) to know.
>
> No, this is independent of the codec. Also Cues can target frames
> that can't be seeked to directly but that's beside the point. A
> SimpleBlock marked keyframe or a BlockGroup with no ReferenceBlock can
> be seeked to directly and that's why they equal Random Access Points
> as defined in AV1 (and other codecs).
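
In other words, the container-level test needs no codec knowledge at all. A minimal sketch with a made-up `ContainerBlock` type standing in for a parsed SimpleBlock or BlockGroup:

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ContainerBlock:
    simple_block: bool                 # True for SimpleBlock, False for BlockGroup
    keyframe_flag: bool = False        # only meaningful for a SimpleBlock
    reference_blocks: List[int] = field(default_factory=list)  # BlockGroup only


def is_seekable(block):
    """True if a demuxer may start decoding at this block."""
    if block.simple_block:
        return block.keyframe_flag
    return not block.reference_blocks
```

Whether the AV1 frame inside is a key frame RAP or a delayed RAP does not change the answer.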
>
>> f) This also leads to problems with seeking: If one simply added a
>> CuePoint for the keyframe (i.e. for the delayed random access point) and
>> the user wants to seek to a point between the delayed random access
>> point (inclusive) and the dependent recovery point (exclusive) and the
>> player used the cues to seek to the nearest keyframe in front of the
>> desired point, then decoding at the point referenced in the cues would
>> not yield the desired frame (it would be either corrupted or not output
>> at all). Therefore I think it is best to add a CuePoint for every key
>> frame random access point and every key frame dependent recovery point.
>> The CuePoint for the key frame random access point would be an ordinary
>> CuePoint as usual. But the CuePoint for the key frame dependent recovery
>> point wouldn't be (my favourite is iv) (and if I were allowed to play
>> God it would be i))):
>
> Cues are quite loose. It would be possible to do Cues only for frames
> that are not delayed RAP and that's valid. IMO it's fine to reference
> the delayed RAP. But they are RAP so it's legal to seek there.
>
> The AV1 specs have this to say about this tricky case:
>
> "Note: In practice, decoder implementations are expected to be able to
> start decoding bitstreams from a delayed random access point when the
> intermediate temporal units are still present. The decoder should
> correctly produce all output frames from the next key frame or key
> frame dependent recovery point onwards, while the preceding frames are
> implementation defined. For example: a streaming decoder may choose to
> decode and display all frames even when the reference frames are not
> available (tolerating some errors in the output), a low latency
> decoder may choose to decode and display all frames that are
> guaranteed to be correct (e.g. an inter frame that only uses inter
> prediction from the delayed random access point), a media player
> decoder may choose to decode and display only frames starting from a
> key frame or key frame dependent recovery point (guaranteeing smooth
> playback once display starts)."
>
> It's not up to the container to solve this.
>
>> i) A comprehensive way of doing it is this: The CueTime would be the
>> timestamp of the block containing the dependent recovery point; it would
>> include a CueTrackPositions for the video track we are talking about
>> that contains the right CueTrack, the CueClusterPosition containing the
>> position of the dependent recovery point block and a CueReference with
>> CueRefTime and CueRefCluster, both corresponding to the values of the
>> delayed random access point. This proposal has several downsides: It
>> uses Cue elements that are deprecated in Matroska and not part of Webm.
>> So this would require a quite nontrivial change in both projects. (Btw:
>> If one does this, one should add a default value for `CueRefCluster`: It
>> should be the same as `CueClusterPosition` as both blocks that we are
>> talking about will probably end up in the same cluster anyway.)
>> ii) One uses the CueTime of the dependent recovery point, but the
>> position of the Cluster of the delayed access point (and
>> `CueRelativePosition` (if used) should also point to this block).
>> Pro: It only uses elements that are supported by both Matroska and Webm.
>> Furthermore, the specs only say that `CueClusterPosition` should point
>> to the cluster containing the "required block"; they don't explicitly say
>> that said block needs to have the same timestamp as `CueTime`.
>> Contra: How does a demuxer know from which block onwards it should feed
>> the data to the decoder? It might use the `CueRelativePosition`, but
>> probably a lot of demuxers would simply read the cluster until they come
>> to the block with timestamp `CueTime` (i.e. they interpret the specs so
>> that the "required block" is the block with the timestamp `CueTime`) and
>> then they would either deliver this to the decoder or conclude that the
>> file is damaged (because the block they found is no keyframe).
>> iii) The last is the same as i) with the difference that `CueRefCluster`
>> is omitted. It is also incompatible with current Webm, but at least it
>> has the advantage that it doesn't use any currently deprecated elements
>> of Matroska. One could add a requirement that the delayed random access
>> point and the dependent recovery point need to be in the same cluster
>> and then omitting `CueRefCluster` is not a problem any more.
>> iv) And then there is the possibility of creating a normal CuePoint for
>> the dependent recovery point, writing the dependent recovery point as a
>> Block in a Blockgroup with exactly one ReferenceBlock which points to
>> the delayed access point block and let the demuxer seek backwards from
>> the dependent recovery point to the delayed access point.
>> Pro: Would only use things that are already supported by Matroska and
>> Webm. It would also not be AV1 specific. The demuxer doesn't need to
>> know anything about AV1, everything is signalled at the container level.
>> Contra: Demuxers would have to be adapted not to expect any more that
>> only keyframes are referenced in the cues. They would also have to be
>> adapted to actually make use of the value of `ReferenceBlock` and seek
>> backwards. This also implies more seeks, but this should be quite
>> limited when one puts both the delayed random access point and the
>> dependent recovery point in the same cluster -- hopefully the data is
>> still cached. (Maybe one should add a SHOULD clause that says that both
>> blocks should be in the same cluster.)
>> g) Of course there are two easy alternative solutions:
>> i) Restrict the type of AV1 that is allowed in Matroska even further so
>> that all key frames are of key frame random access type. (This could
>> exclude quite a lot of AV1 and therefore I recommend not doing so.)
>> ii) Create cues as usual, i.e. reference every delayed random access
>> point, and don't care about the fact that seeking will be partially
>> broken in this case.
>> h) It should be noted that exactly the same situation exists with
>> periodic intra refresh in general. There was a short discussion on the
>> Matroska developer mailing list in April 2011, but nothing came out of
>> it. Every solution I outlined here for AV1 is also applicable for this case.
>>
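
For reference, option iv) above would amount to something like the following on the demuxer side. This is only a sketch of the proposal, not of any existing demuxer; `Block` and `decode_start_for_cue` are hypothetical, and `ReferenceBlock` values are taken as relative to the timestamp of the block carrying them:

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Block:
    timestamp: int
    # ReferenceBlock values, relative to this block's timestamp.
    reference_offsets: List[int] = field(default_factory=list)


def decode_start_for_cue(blocks: Dict[int, Block], cue_time: int) -> int:
    """Return the timestamp at which decoding has to start for a CuePoint."""
    block = blocks[cue_time]
    if not block.reference_offsets:
        return cue_time  # ordinary keyframe, start right here
    if len(block.reference_offsets) != 1:
        raise ValueError("cued block must carry exactly one ReferenceBlock")
    start = cue_time + block.reference_offsets[0]
    if blocks[start].reference_offsets:
        raise ValueError("referenced block is not a keyframe")
    return start
```

A CuePoint on a dependent recovery point then resolves to the delayed random access point it references, at the cost of one backwards seek.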
>> Steve Lhomme:
>>> Since we allow stripping the Sequence Header OBU from the stream when
>>> it's equal to the CodecPrivate one, we need to add it back to the
>>> bitstream for compliance. At least when seeking on keyframes. So I
>>> added a section to explain that.
>>>
>>> IMO that's an extra feature of the CodecPrivate that it's meant to be
>>> added to the bitstream as-is. And in this case on startup and when
>>> seeking. I wonder if we should add an element next to the CodecPrivate
>>> to describe that. Because in this case it's not entirely opaque to the
>>> demuxer. Or maybe it's implied by the CodecID and is up to the decoder
>>> to use it how it's supposed to be (in this case detecting keyframes
>>> and possibly adding back the Sequence Header OBU).
>>
>> 8. I think we can relax the requirements on the existence of in-band
>> sequence header OBUs a bit: If a keyframe (i.e. a key frame random access
>> point or a delayed random access point, not a dependent recovery point)
>> uses the same sequence header OBU as in the CodecPrivate (including the
>> same operating_parameters_info), then the sequence header OBU needn't be
>> prepended to the block with the keyframe, because seeking already works
>> without it provided one always adds the sequence header OBU from the
>> CodecPrivate back in the bitstream on seeking. For example, consider the
>> following scenario:
>> One has an elementary stream that uses two different sequence header
>> OBUs A and B that only differ in the operating_parameters_info. The
>> first three keyframes use A; B is contained in
>> a temporal unit between the third and the fourth keyframe. Between the
>> sixth and the seventh keyframe is a temporal unit containing sequence
>> header A again. Then a muxer that wants to put this elementary stream
>> into Matroska may put A in the CodecPrivate and can strip the very first
>> occurrence of A away; it must leave B inside the temporal unit that it
>> was in (so that a player that plays the file linearly is notified about
>> the change) and has to make sure that keyframes #4 to #6 contain
>> sequence header B (so that one has the correct sequence header when
>> seeking to said keyframes). It mustn't strip A between the sixth and the
>> seventh keyframe away (so that a player that plays the file linearly
>> notices the change of sequence header), but it needn't prepend
>> keyframes #7 and following with A.
>
> That works but then we need to say in the specs that a following
> Sequence Header MUST NOT be stripped if the previous Sequence Header
> was not bit identical to the one in CodecPrivate.
>
> IMO it's valid to output A and then B before the rest of the data in
> frame #4. That should be done if it's a keyframe. But if it's not a
> keyframe the demuxer/decoder doesn't know it has to prepend the
> CodecPrivate there. That could be a problem. Even though it doesn't
> make much sense to change Sequence Header data before a non keyframe
> (RAP).
>
> So maybe your proposal should be added. We'll gain a bit of weight but
> it's safer.
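
To make the A/B scenario concrete, here is a sketch of the muxer rule as I read the proposal: keep a Sequence Header wherever it changes, and add the active one to every keyframe while it differs from the CodecPrivate. The function name and the `(is_keyframe, header)` tuple layout are made up for illustration:

```python
def plan_inband_headers(units, codec_private):
    """Decide which Sequence Header (if any) to write in each temporal unit.

    `units` is a list of (is_keyframe, header) tuples, where header is the
    Sequence Header found in the source temporal unit or None.
    """
    active = codec_private  # what a decoder holds after init or after a seek
    plan = []
    for is_keyframe, header in units:
        changed = header is not None and header != active
        if changed:
            active = header  # linear playback must see every change
        needed_for_seek = is_keyframe and active != codec_private
        plan.append(active if changed or needed_for_seek else None)
    return plan
```

With A in the CodecPrivate, the first occurrence of A is stripped, B is kept where it appears and repeated on keyframes #4 to #6, the later switch back to A is kept, and keyframes from #7 on carry nothing.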
>
>> That way one can save a few bytes.
>> This is consistent with an interpretation of the CodecPrivate as the
>> default extradata/header (but it is not necessarily a truly global
>> header). Before `CueCodecState` was deprecated it had a default value of
>> 0 that mandated that one should look in the CodecPrivate for the
>> CodecState upon seeking. So the only thing specific to AV1 in this case
>> is the "as-is" part; that one should reset the decoder to whatever
>> initialization information is contained in the CodecPrivate upon seeking
>> is nothing new.
>>
>> 9. Imagine that future AV1 encoders would find out that changing the
>> sequence header (with a change of the CVS) enables better compression.
>> What would we do in this case? Simply lift the restriction of one track
>> = one CVS even when this means that some players that can't cope with
>> changing coded video sequences won't know in advance whether they can
>> play the files at all? I'm asking because we don't include a version
>> field in the CodecPrivate (in contrast to how it is done in mp4 with avcC).
>
> Another CodecID should be used if the restrictions/format is defined.
> It's safer than reusing one with different data and hoping the system
> using the old one paid attention to the version bit that was always
> the same anyway.
>
>> Steve Lhomme:
>>> Let me know what you think so we can settle this spec for good.
>>>
>> I think we are not even close to settling this for good.
>
> \o/
>
>>
>> - Andreas Rheinhardt
>>
>> _______________________________________________
>> Cellar mailing list
>> Cellar@ietf.org
>> https://www.ietf.org/mailman/listinfo/cellar
>
>
>
> --
> Steve Lhomme
> Matroska association Chairman

-- 
Steve Lhomme
Matroska association Chairman