Re: [Cellar] AV1 mapping update

Andreas Rheinhardt <andreas.rheinhardt@googlemail.com> Thu, 12 July 2018 18:21 UTC

To: Codec Encoding for LossLess Archiving and Realtime transmission <cellar@ietf.org>
References: <CAOXsMFKHo6RS+q8KCXKoKCiBBS9pVqs92wsLgSfXZO+DT3dStQ@mail.gmail.com> <ca0f009e-a245-fcd6-95f8-f051736c9161@googlemail.com> <CAOXsMFL5-MaHQaAOyh7jSFUpCNbSEvAWKmAHcepaF+QsQuYbHw@mail.gmail.com>
From: Andreas Rheinhardt <andreas.rheinhardt@googlemail.com>
Message-ID: <fee747da-77ca-9282-a4c3-c112fd746507@googlemail.com>
Date: Thu, 12 Jul 2018 18:20:00 +0000
MIME-Version: 1.0
In-Reply-To: <CAOXsMFL5-MaHQaAOyh7jSFUpCNbSEvAWKmAHcepaF+QsQuYbHw@mail.gmail.com>
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: 8bit
Archived-At: <https://mailarchive.ietf.org/arch/msg/cellar/je9gaXFXuDLY9ZFbkXxPmiOhVfw>
Subject: Re: [Cellar] AV1 mapping update

Hello,

Steve Lhomme:
> Hi Andreas,
> 
> Thanks for your detailed feedback.
> 
>> 1. Whether `DisplayWidth` and `DisplayHeight` need to be written
>> actually depends on the value of `DisplayUnit`.
>
> True, I will mention that.
>
The "Notes" still only cover the case of `DisplayUnit` indicating pixels.

> 2018-07-11 15:47 GMT+02:00 Andreas Rheinhardt
> <andreas.rheinhardt@googlemail.com>:
>> 3. "They SHOULD have the [obu_has_size_field] set to 1 except for the
>> last OBU in the sample, for which [obu_has_size_field] MAY be set to 0,
>> in which case it is assumed to fill the remaining of the sample."
>> "The OBUs in the Block MUST follow the [Low Overhead Bitstream Format
>> syntax]."
>> The first sentence leaves the possibility that [obu_has_size_field] is 0
>> for OBUs other than the last OBU of a block (only a SHOULD). And the
>> requirement in the second sentence actually makes MUST out of the SHOULD
>> in the first sentence (making this part of the first sentence redundant)
>> and contradicts/voids the MAY part of the first sentence. In other
>> words, the two sentences should be merged to something like: "The `OBUs`
>> in the block must follow the `Low Overhead Bitstream Format` (in which
>> [obu_has_size_field] MUST be equal to one for every OBU) for every `OBU`
>> with the possible exception of the very last `OBU` in which
>> [obu_has_size_field] MAY be set to 0, in which case the `OBU` is assumed
>> to consist of the remainder of the block."
> 
> Indeed there's a contradiction here. If we use MUST (can't be must
> lowercase) on [Low Overhead Bitstream Format] then the
> [obu_has_size_field] MUST be 1.
> 
> On MP4 for the CodecPrivate the [obu_has_size_field] MUST be 1. But in
> the Blocks it can be 0:
> 
> "Each OBU SHALL have the obu_has_size_field set to 1 except for the
> last OBU in the sample, for which obu_has_size_field MAY be set to 0,
> in which case it is assumed to fill the remaining of the sample"
> 
> I think we should mimic that. I'll rephrase it.
> 

The current version is:
"The OBUs in the Block follow the [Low Overhead Bitstream Format
syntax]. They SHOULD have the [obu_has_size_field] set to 1 except for
the last OBU in the sample, for which [obu_has_size_field] MAY be set to
0, in which case it is assumed to fill the remaining of the sample."

If one interprets the first sentence as meaning "The OBUs in the Block
MUST follow the [Low Overhead Bitstream Format syntax]", then given that
this syntax mandates [obu_has_size_field] to be equal to 1 the first
part of the second sentence is redundant (given that MUST is stronger
than SHOULD) and the second part is again in contradiction to/voided by
the first sentence because the first sentence doesn't allow
"[obu_has_size_field]" set to zero at all.
If one interprets the first sentence as not conveying a MUST, then it is
allowed (albeit strongly discouraged) to use [obu_has_size_field] equal
to 0 for an OBU that is not the last OBU in the sample. This is not what
we want, is it? How about:
"The OBUs in the `Block` MUST follow the [Low Overhead Bitstream Format
syntax] with the possible exception of the last OBU of a `Block` for
which [obu_has_size_field] MAY be set to 0, in which case it is assumed
to fill the remainder of the `Block`."
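
To illustrate why only the last OBU can go without a size field, here is
a minimal Python sketch of how a demuxer might split a `Block` into OBUs
under the wording proposed above. The function names `read_leb128` and
`split_obus` are my own and not taken from any existing library; the
header bit layout is the one from the AV1 OBU header syntax.

```python
from typing import List, Tuple


def read_leb128(data: bytes, pos: int) -> Tuple[int, int]:
    """Decode an unsigned leb128 value; return (value, bytes consumed)."""
    value = 0
    for i in range(8):  # AV1 limits leb128 values to 8 bytes
        if pos + i >= len(data):
            raise ValueError("truncated leb128")
        byte = data[pos + i]
        value |= (byte & 0x7F) << (7 * i)
        if not byte & 0x80:
            return value, i + 1
    raise ValueError("leb128 longer than 8 bytes")


def split_obus(block: bytes) -> List[Tuple[int, bytes]]:
    """Return (obu_type, payload) pairs for the OBUs in one Block."""
    obus = []
    pos = 0
    while pos < len(block):
        header = block[pos]
        obu_type = (header >> 3) & 0x0F
        has_extension = bool(header & 0x04)
        has_size = bool(header & 0x02)
        pos += 2 if has_extension else 1
        if pos > len(block):
            raise ValueError("truncated OBU header")
        if has_size:
            size, used = read_leb128(block, pos)
            pos += used
        else:
            # No size field: the OBU is assumed to fill the remainder of
            # the Block, so nothing after it can be found any more.
            size = len(block) - pos
        if pos + size > len(block):
            raise ValueError("OBU overruns the Block")
        obus.append((obu_type, block[pos:pos + size]))
        pos += size
    return obus
```

Note that an OBU without a size field swallows everything that follows
it, which is exactly why allowing it anywhere but at the end of the
`Block` would make the `Block` unparseable.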

>> 4. "ReferenceBlocks inside a BlockGroup MUST reference frames according
>> to the [ref_frame_idx] values of frame that is neither a KEYFRAME nor an
>> INTRA_ONLY_FRAME.": The problem with this sentence is that
>> [ref_frame_idx] needn't be present. It depends upon
>> [frame_refs_short_signaling] and [show_existing_frame]. If one uses a
>> Block inside a Blockgroup and if [show_existing_frame] equals one, one
>> should reference the block that contained the showable frame that is now
>> output (and that this should be the only `ReferenceBlock` written). In
>> case of [frame_refs_short_signaling] == 1 the obvious candidates for
>> `ReferenceBlocks` are the blocks containing the `last_frame_idx` and
>> `gold_frame_idx` that are explicitly signalled. If I am not mistaken,
>> then there are also other reference frames that are not explicitly
>> signalled, but computed. I don't know if we should really write a
>> `ReferenceBlock` entry for every reference as the current proposal seems
>> to imply. This would be quite a bit of overhead for no gain (and
>> furthermore, it would complicate muxers that would have to compute the
>> references that are not explicitly signalled in case that
> 
> This is how `ReferenceBlock` is supposed to be used. So a muxer that
> has no idea of any codec can cut a file and keep the relevant
> references. So they all have to be there. It's one of the reasons
> SimpleBlock was added, to simplify things a little (and reduce
> overhead).
> 
Actually a muxer can cut a file if it just knows the keyframes, the
decoding order and the display order. It doesn't need to have complete
information about reference frames. After all, one can cut files that
exclusively use `SimpleBlocks`.
(If one wants to cut the beginning away, one can cut according to the
keyframes; and at the end one can cut every block with timestamp >t_0
away if there is no block with timestamp >t_0 that precedes a block with
timestamp <=t_0 in coding/storage order.)
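
The rule for cutting the end away can be sketched in a few lines of
Python. This is only an illustration of the condition stated above; the
function name `can_cut_tail` and the representation of blocks as a plain
list of timestamps in storage order are assumptions of mine.

```python
from typing import Sequence


def can_cut_tail(timestamps: Sequence[int], t0: int) -> bool:
    """Check whether all blocks with timestamp > t0 may be cut away.

    `timestamps` lists the block timestamps in coding/storage order.
    Cutting is safe if no block with timestamp > t0 precedes a block
    with timestamp <= t0.
    """
    seen_later = False
    for ts in timestamps:
        if ts > t0:
            seen_later = True
        elif seen_later:
            # A block to keep is stored after a block to be cut away.
            return False
    return True
```

For the storage order 0, 40, 120, 80, 160 this reports that one cannot
cut at t_0 = 80 (the block with timestamp 120 precedes the block with
timestamp 80), but one can cut at t_0 = 120. No knowledge of the actual
reference structure is needed.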



>> [frame_refs_short_signaling] is 1). One `ReferenceBlock` would be enough
>> to distinguish keyframes from non-keyframes.
>> By the way: If a temporal unit contains multiple frames with references,
>> whose references should end up as `ReferenceBlocks`? Or may the muxer
>> choose some?
> 
> I think a Temporal Unit can only have one (visible frame). I don't
> know if golden frames can have extra references. But the BlockGroup
> should contain all frames needed to decode this frame, that includes
> all the frames in the Block (even if not visible).
> 
>> 5. AV1 may use spatial scalability and/or temporal scalability. What do
>> we make of these? They are currently not forbidden if I am not mistaken,
>> but if e.g. the spatial dimensions of different layers disagree, the
>> `PixelWidth` and `PixelHeight` values can't be true for all layers.
>> Matroska seems to be missing some features here.
> 
> Our spec says that the Sequence Header OBU should be valid for all
> frames. That can't be used for spatial scalability. We don't support
> that mode for now.
> 
Then this should be explicitly stated in the codec mapping. And I also
fail to see why the fact that the Sequence Header OBU should be valid
for all frames should be incompatible with spatial scalability (after
all, in my reading of the spec the various layers share the same Sequence
Header OBU).

>> 7. Then there is another thing with keyframes and cues (for this point
>> it is always presumed that the relevant sequence header OBUs are
>> available regardless of whether this is done in-band or via CodecPrivate):
>> a) The proposal currently does not take into account that key frames
>> reset the decoder when they are output, not when they are decoded. A key
>> frame needn't be immediately output; if it is (i.e. [show_frame]
>> equaling 1), it is called a "key frame random access point" in section
>> 7.6 of the standard and is the equivalent of an IDR frame in H.264.
>> Everything's fine here. But a key frame can also be declared a
>> showable_frame (but only if [show_frame] equals 0) and output later via
>> the show_existing_frame mechanism. This is similar to an open GOP in
>> other codecs (but in contrast to them, the block that contains the coded
>> keyframe doesn't have the same timestamp (pts) as the first frame that
>> can be output after a seek). The coded key frame with [show_frame] equal
>> to zero is called a delayed random access point and a key frame
>> dependent recovery point is a frame where a key frame with
>> [showable_frame] equal to 1 is output via the show_existing_frame
>> mechanism. If one starts decoding at the delayed random access point,
>> all the output frames up to but not including the key frame dependent
>> recovery point can depend both on the delayed random access point frame
>> and on other earlier frames so that these frames can't be correctly
>> decoded in general. But all the frames from the key frame dependent
>> recovery point onwards can be correctly decoded if one starts decoding
>> at the delayed random access point (because the decoder is reset after
>> displaying the key frame). If one starts decoding at the key frame
>> dependent recovery point, one doesn't have the key frame that should be
>> shown via the show_existing_frame mechanism at all, so that this frame
>> is simply not a real key frame.
>> b) But although a key frame dependent recovery point is not a "real" key
>> frame, it has the same [frame_type] as the frame that is output, i.e.
>> its [frame_type] is KEY_FRAME. According to our current proposal this
>> would mean that it should be treated as a keyframe in Matroska which is
>> obviously wrong.
> 
> That's not how I understand it. Here's section 7.6.3:
> 
> "Informally, the requirement for decoder conformance is that decoding
> can start at any key frame random access point or delayed random
> access point."
> 
> And 7.6.2:
> 
> "delayed random access point is defined as being a frame:
> • with frame_type equal to KEY_FRAME
> • with show_frame equal to 0
> • that is contained in a temporal unit that also contains a sequence header OBU"
> 
> So as long as we seek on frames of type KEY_FRAME we should be able to
> seek. Whether it's a visible frame or not.
a) According to the Uncompressed Header Syntax, the [frame_type] of a
frame that is output via the [show_existing_frame] mechanism is the
[frame_type] of the [showable] frame that is output. This implies that a
key frame dependent recovery point (KFDRP from now on) has [frame_type]
KEY_FRAME. And this means that a KFDRP is a keyframe according to the
codec mapping (actually it is only heavily implied to be a keyframe,
because the current codec mapping does not require to label any frame a
keyframe).

b) The first passage you cited says that decoding can start at key frame
random access points (KFRAP from now on) or delayed random access points
(DRAP from now on). It does not say that one can seek to frames of type
KFDRP. And these frames are of type KEY_FRAME, too.
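
To make the distinction concrete, here is a rough Python sketch of how a
muxer could tell the three cases apart once it has parsed the frame
header. It follows my reading of section 7.6 as summarised above; the
names `FrameInfo`, `classify` and `is_matroska_keyframe` are
hypothetical, and `frame_type` is meant to be the effective value (i.e.
the one copied from the shown frame when [show_existing_frame] is 1).

```python
from dataclasses import dataclass

KEY_FRAME = 0  # frame_type value for key frames in the AV1 bitstream


@dataclass
class FrameInfo:
    frame_type: int
    show_frame: bool
    show_existing_frame: bool


def classify(frame: FrameInfo, tu_has_sequence_header: bool) -> str:
    """Label a frame as KFRAP, DRAP, KFDRP or neither."""
    if frame.frame_type != KEY_FRAME:
        return "not a key frame"
    if frame.show_existing_frame:
        # Re-outputs an earlier key frame: decoding cannot start here.
        return "KFDRP"
    if not tu_has_sequence_header:
        return "key frame, but no random access point"
    return "KFRAP" if frame.show_frame else "DRAP"


def is_matroska_keyframe(frame: FrameInfo, tu_has_sequence_header: bool) -> bool:
    """Only points where decoding can start count as keyframes."""
    return classify(frame, tu_has_sequence_header) in ("KFRAP", "DRAP")
```

A check that looks at [frame_type] alone would wrongly accept the KFDRP
case, which is the point I am trying to make.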

> 
> But because this is a bit loose in the 7.6.3 section they add this:
> 
> "To support the different modes of operation, a conformant decoder is
> required to be able to decode bitstreams consisting of:
> • a temporal unit containing a delayed random access point
> • immediately followed by a temporal unit containing the associated
> key frame dependent recovery point"
> 
> So the invisible KEY_FRAME should be immediately followed by the
> recovery point data. So effectively it will work. There's also a note
> that if it's not followed immediately then what is done with the
> intermediate frames is implementation dependent. I don't think we need
> to care too much.
> 
I don't think we should assume that the DRAP block is immediately
followed by a KFDRP (and I don't see how the assumption that there are
no temporal units between DRAP and KFDRP simplifies anything for us
container guys). That's just the only place where being able to decode
is absolutely required. But just read the note (that you actually quote
below) and you will see that they expect more:

"In practice, decoder implementations are expected to be able to start
decoding bitstreams from a delayed random access point when the
intermediate temporal units are still present. The decoder should
correctly produce all output frames from the next key frame or key frame
dependent recovery point onwards, while the preceding frames are
implementation defined."

So it's reasonable to assume that there will be temporal units between
DRAP and KFDRP.

And I think we should care about such stuff and in particular make sure
that seeking works well. When there are non-KFRAP keyframes, AV1 is very
much like intra decoder refresh when it comes to seeking, and intra
decoder refresh seeking currently doesn't work well: the cues only
contain the timestamps of the frames where one should start decoding in
order to output frames, but not the timestamp of the first frame that is
undamaged after one has started decoding. This is a problem when one
wants to seek to one of the frames in between these two frames.
> 
>> c) Marking a delayed random access point as keyframe deviates from the
>> way that flag has been traditionally understood: If one starts decoding
>> at this point, one doesn't get the frame that should be output for the
>> temporal unit containing the delayed random access point. But I
>> nevertheless think that these are the right keyframes, because they are
>> the points at which random access has to begin when there aren't key
>> frame random access points available; this also means that one can split
>> the stream at this point and the second part will still play so that a
>> muxer like mkvmerge needn't be rewritten too much.
> 
> Yes, IMO this is the correct way.
> 
>> d) A consequence of this is that a `Blockgroup` containing a delayed
>> random access point mustn't contain a `ReferenceBlock` (although the
>> actual frame that is output for that temporal unit very likely uses
>> other reference frames than the key frame that is contained in the same
>> temporal unit).
> 
> This won't happen if it's a proper random access point, ie it doesn't
> need past frames to start decoding. If it's not then it's not a RAP
> and then it can/should have ReferenceBlock.
> 
The DRAP frame itself (being coded intra) won't reference other frames,
but the other frames (including the shown frame) in the temporal unit
containing said DRAP will of course reference earlier frames (the H.264
equivalent of this scenario is an open GOP and the other frames in this
case are B-frames shared between two GOPs (i.e. the B-frames that
precede the open GOP's keyframe in display order and follow it in coding
order) -- and they of course reference both the keyframe and earlier
frames, that is after all the whole point of having an open GOP). So it
does happen although it is a random access point; it does happen,
because it is just a delayed random access point.


>> e) Yes, this proposal means that it is impossible to tell from Matroska
>> alone (well, from the block structure that is; see f) for a way for
>> which one could put this information into the Cues) whether it is a key
>> frame random access point or a delayed random access point. One will
>> have to decode it (or parse deeper) to know.
> 
> No, this is independent of the codec. Also Cues can target frames
> that can't be seeked to directly but that's beside the point.
> SimpleBlock marked keyframe or BlockGroup with no ReferenceBlock can
> be seeked to directly and that's why they equal Random Access Points
> as defined in AV1 (and other codecs).
> 
Given that your reply began with "No" I presume that you believe that
you refuted me, but I don't see that you have. After all, how can one
tell from the Matroska layer alone whether a `Block` is a KFRAP or a DRAP?

(Of course I know that Cues can also target frames that can't be seeked
to directly -- after all, my proposal below is about adding Cues for
such frames.)

>> f) This also leads to problems with seeking: If one simply added a
>> CuePoint for the keyframe (i.e. for the delayed random access point) and
>> the user wants to seek to a point between the delayed random access
>> point (inclusive) and the dependent recovery point (exclusive) and the
>> player used the cues to seek to the nearest keyframe in front of the
>> desired point, then decoding at the point referenced in the cues would
>> not yield the desired frame (it would be either corrupted or not output
>> at all). Therefore I think it is best to add a CuePoint for every key
>> frame random access point and every key frame dependent recovery point.
>> The CuePoint for the key frame random access point would be an ordinary
>> CuePoint as usual. But the CuePoint for the key frame dependent recovery
>> point wouldn't be (my favourite is iv) (and if I were allowed to play
>> God it would be i))):
> 
> Cues are quite loose. It would be possible to do Cues only for frames
> that are not delayed RAP and that's valid. IMO it's fine to reference
> the delayed RAP. But they are RAP so it's legal to seek there.
> 
Cues are loose, but we can (and I think we should) add a SHOULD clause
that describes what cues muxers are expected to produce. And referencing
the delayed RAP brings the problem that the cues only contain where to
start decoding, but not from when on the output is undamaged/valid.
> The AV1 specs have this to say about this tricky case:
> 
> "Note:In practice, decoder implementations are expected to be able to
> start decoding bitstreams from a delayed random access point when the
> intermediate temporal units are still present. The decoder should
> correctly produce all output frames from the next key frame or key
> frame dependent recovery point onwards, while the preceding frames are
> implementation defined. For example: a streaming decoder may choose to
> decode and display all frames even when the reference frames are not
> available (tolerating some errors in the output), a low latency
> decoder may choose to decode and display all frames that are
> guaranteed to be correct (e.g. an inter frame that only uses inter
> prediction from the delayed random access point), a media player
> decoder may choose to decode and display only frames starting from a
> key frame or key frame dependent recovery point (guaranteeing smooth
> playback once display starts)."
> 
> It's not up to the container to solve this.
> 
It is not up to the container to decide whether the possibly damaged
frames between the DRAP and the KFDRP should be displayed, but it very
much is up to the container to make it as easy as possible to find the
data needed for random access. And so the cues should answer the
question "I want to play from point t_0 on. Where do I have to seek to?"
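
As a sketch of what answering that question could look like on the
player side: assume each seek entry carries both the time from which the
output is undamaged and the time and position where decoding has to
start. This data model (`SeekPoint` with `clean_time`, `decode_time` and
`position`) is purely hypothetical and is not how Cues are laid out
today; it only shows which two pieces of information are needed.

```python
import bisect
from typing import NamedTuple, Optional, Sequence


class SeekPoint(NamedTuple):
    clean_time: int   # first timestamp with undamaged output (KFRAP or KFDRP)
    decode_time: int  # timestamp of the block where decoding must start
    position: int     # where that block is stored


def find_seek_point(points: Sequence[SeekPoint], t0: int) -> Optional[SeekPoint]:
    """Return the latest entry whose output is valid at or before t0.

    `points` must be sorted by clean_time.
    """
    keys = [p.clean_time for p in points]
    i = bisect.bisect_right(keys, t0)
    return points[i - 1] if i else None
```

For a KFRAP both times coincide. For a DRAP/KFDRP pair `decode_time` is
the timestamp of the DRAP block and `clean_time` that of the KFDRP, so a
request for a time between the two falls back to the previous entry
instead of landing on damaged output.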

>> i) A comprehensive way of doing it is this: The CueTime would be the
>> timestamp of the block containing the dependent recovery point; it would
>> include a CueTrackPositions for the video track we are talking about
>> that contains the right CueTrack, the CueClusterPosition containing the
>> position of the dependent recovery point block and a CueReference with
>> CueRefTime and CueRefCluster, both corresponding to the values of the
>> delayed random access point. This proposal has several downsides: It
>> uses Cue elements that are deprecated in Matroska and not part of Webm.
>> So this would require a quite nontrivial change in both projects. (Btw:
>> If one does this, one should add a default value for `CueRefCluster`: It
>> should be the same as `CueClusterPosition` as both blocks that we are
>> talking about will probably end up in the same cluster anyway.)
>> ii) One uses the CueTime of the dependent recovery point, but the
>> position of the Cluster of the delayed access point (and
>> `CueRelativePosition` (if used) should also point to this block).
>> Pro: It only uses elements that are supported by both Matroska and Webm.
>> Furthermore, the specs only say that `CueClusterPosition` should point
>> to the cluster containing the "required block"; they don't explicitly say
>> that said block needs to have the same timestamp as `CueTime`.
>> Contra: How does a demuxer know from which block onwards it should feed
>> the data to the decoder? It might use the `CueRelativePosition`, but
>> probably a lot of demuxers would simply read the cluster until they come
>> to the block with timestamp `CueTime` (i.e. they interpret the specs so
>> that the "required block" is the block with the timestamp `CueTime`) and
>> then they would either deliver this to the decoder or conclude that the
>> file is damaged (because the block they found is no keyframe).
>> iii) The last is the same as i) with the difference that `CueRefCluster`
>> is omitted. It is also incompatible with current Webm, but at least it
>> has the advantage that it doesn't use any currently deprecated elements
>> of Matroska. One could add a requirement that the delayed random access
>> point and the dependent recovery point need to be in the same cluster
>> and then omitting `CueRefCluster` is not a problem any more.
>> iv) And then there is the possibility of creating a normal CuePoint for
>> the dependent recovery point, writing the dependent recovery point as a
>> Block in a Blockgroup with exactly one ReferenceBlock which points to
>> the delayed access point block and let the demuxer seek backwards from
>> the dependent recovery point to the delayed access point.
>> Pro: Would only use things that are already supported by Matroska and
>> Webm. It would also not be AV1 specific. The demuxer doesn't need to
>> know anything about AV1, everything is signalled at the container level.
>> Contra: Demuxers would have to be adapted not to expect any more that
>> only keyframes are referenced in the cues. They would also have to be
>> adapted to actually make use of the value of `ReferenceBlock` and seek
>> backwards. This also implies more seeks, but this should be quite
>> limited when one puts both the delayed random access point and the
>> dependent recovery point in the same cluster -- hopefully the data is
>> still cached. (Maybe one should add a SHOULD clause that says that both
>> blocks should be in the same cluster.)
>> g) Of course there are two easy alternative solutions:
>> i) Restrict the type of AV1 that is allowed in Matroska even further so
>> that all key frames are of key frame random access type. (This could
>> exclude quite a lot of AV1 and therefore I recommend not doing so.)
>> ii) Create cues as usual, i.e. reference every delayed random access
>> point, and don't care about the fact that seeking will be partially
>> broken in this case.
>> h) It should be noted that exactly the same situation exists with
>> periodic intra refresh in general. There was a short discussion on the
>> Matroska developer mailing list in April 2011, but nothing came out of
>> it. Every solution I outlined here for AV1 is also applicable for this case.
>>
>> Steve Lhomme:
>>> Since we allow stripping the Sequence Header OBU from the stream when
>>> it's equal to the CodecPrivate one, we need to add it back to the
>>> bitstream for compliance. At least when seeking on keyframes. So I
>>> added a section to explain that.
>>>
>>> IMO that's an extra feature of the CodecPrivate that it's meant to be
>>> added to the bitstream as-is. And in this case on startup and when
>>> seeking. I wonder if we should add an element next to the CodecPrivate
>>> to describe that. Because in this case it's not entirely opaque to the
>>> demuxer. Or maybe it's implied by the CodecID and is up to the decoder
>>> to use it how it's supposed to be (in this case detecting keyframes
>>> and possibly adding back the Sequence Header OBU).
>>
>> 8. I think we can relax the requirements on the existence of in-band
>> sequence header OBUs a bit: If a keyframe (i.e. a key frame random access
>> point or a delayed random access point, not a dependent recovery point)
>> uses the same sequence header OBU as in the CodecPrivate (including the
>> same operating_parameters_info), then the sequence header OBU needn't be
>> prepended to the block with the keyframe, because seeking already works
>> without it provided one always adds the sequence header OBU from the
>> CodecPrivate back in the bitstream on seeking. For example, consider the
>> following scenario:
>> One has an elementary stream that uses two different sequence header
>> OBUs A and B that only differ in the operating_parameters_info. The
>> first three keyframes use A; B is contained in
>> a temporal unit between the third and the fourth keyframe. Between the
>> sixth and the seventh keyframe is a temporal unit containing sequence
>> header A again. Then a muxer that wants to put this elementary stream
>> into Matroska may put A in the CodecPrivate and can strip the very first
>> occurrence of A away; it must leave B inside the temporal unit that it
>> was in (so that a player that plays the file linearly is notified about
>> the change) and has to make sure that keyframes #4 to #6 contain
>> sequence header B (so that one has the correct sequence header when
>> seeking to said keyframes). It mustn't strip A between the sixth and the
>> seventh keyframe away (so that a player that plays the file linearly
>> notices the change of sequence header), but it needn't prepend
>> keyframes #7 and following with A.
> 
> That works but then we need to tell in the specs that a following
> Sequence Header MUST NOT be stripped if the previous Sequence Header
> was not bit identical to the one in CodecPrivate.
> 
> IMO it's valid to output A and then B before the rest of the data in
> frame #4. That should be done if it's a keyframe. But if it's not a
> keyframe the demuxer/decoder doesn't know it has to prepend the
> CodecPrivate there. That could be a problem. Even though it doesn't
> make much sense to change Sequence Header data before a non keyframe
> (RAP).
Why should the demuxer ever want to prepend the CodecPrivate in front of
a non-keyframe? This doesn't make sense to me; after all, one should
only seek to keyframes and only add the CodecPrivate in front of a
keyframe if one has just seeked to said keyframe, not if one just
encounters a keyframe (after the seek one just uses whatever Sequence
Header one encounters in-band (if any)).

> 
> So maybe your proposal should be added. We'll gain a bit of weight but
> it's safer.
> 
My wording would be:
"Upon seeking to a keyframe the player/demuxer MUST prepend the `Block`
with the `Sequence Header OBU` contained in the `CodecPrivate`.
A muxer MUST make sure that the correct `Sequence Header OBU` is in
force both during linear access and also after seeking to a keyframe
`Block`. So in particular a keyframe `Block` where a `Sequence Header
OBU` that is not bit-identical to the one in the `CodecPrivate` is in
force for the decoding of the first frame contained in said `Block` MUST
contain the `Sequence Header OBU` that is in force for the decoding of
the first frame contained in said `Block` in front of the first frame
contained in said `Block`."

One could also relax this a bit and only make the correct linear access
a MUST and the rest a SHOULD. This might be useful for applications
where seeking isn't desired (although it really should be included even
for those scenarios to support resuming playback after a transmission
error).

And maybe one should add a clause like "Seeking to non-keyframes is
undefined and not recommended."

And if one wants to one can add a clause recommending to strip out
unnecessary `Sequence Header OBUs`.
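
The demuxer side of the wording proposed above is small. Here is a
Python sketch, assuming the `Sequence Header OBU` has already been
extracted from the `CodecPrivate` as raw bytes; the function name
`payload_for_decoder` and its parameters are illustrative only.

```python
def payload_for_decoder(block: bytes, sequence_header_obu: bytes,
                        is_keyframe: bool, just_seeked: bool) -> bytes:
    """Return the bytes to hand to the decoder for one Block.

    The CodecPrivate sequence header is prepended only to the first
    Block after a seek. During linear playback the Block is passed on
    unchanged and in-band sequence headers (if any) take effect.
    """
    if not just_seeked:
        return block
    if not is_keyframe:
        raise ValueError("seeking to a non-keyframe is undefined")
    # An in-band header inside the Block follows the prepended one and
    # therefore overrides it, which is what the muxer rule relies on.
    return sequence_header_obu + block
```

A keyframe that is merely encountered during linear playback is passed
through untouched; only the first `Block` after a seek gets the header.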


>> Steve Lhomme:
>>> Let me know what you think so we can settle this spec for good.
>>>
>> I think we are not even close to settling this for good.
> 
> \o/
> 
I honestly don't know what this means.

- Andreas Rheinhardt