Re: [Cellar] AV1 mapping update

Steve Lhomme <slhomme@matroska.org> Fri, 13 July 2018 07:24 UTC

In-Reply-To: <fee747da-77ca-9282-a4c3-c112fd746507@googlemail.com>
References: <CAOXsMFKHo6RS+q8KCXKoKCiBBS9pVqs92wsLgSfXZO+DT3dStQ@mail.gmail.com> <ca0f009e-a245-fcd6-95f8-f051736c9161@googlemail.com> <CAOXsMFL5-MaHQaAOyh7jSFUpCNbSEvAWKmAHcepaF+QsQuYbHw@mail.gmail.com> <fee747da-77ca-9282-a4c3-c112fd746507@googlemail.com>
From: Steve Lhomme <slhomme@matroska.org>
Date: Fri, 13 Jul 2018 09:24:18 +0200
Message-ID: <CAOXsMFJtc9pq+PphRb5kF9Mp4jyS5j3LQi6vQQmHRyTDYWyQ-A@mail.gmail.com>
To: Andreas Rheinhardt <andreas.rheinhardt@googlemail.com>
Cc: Codec Encoding for LossLess Archiving and Realtime transmission <cellar@ietf.org>
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
Archived-At: <https://mailarchive.ietf.org/arch/msg/cellar/ridigqyYchK6392Xcgd_VM6fqcQ>
Subject: Re: [Cellar] AV1 mapping update

Hi,

2018-07-12 20:20 GMT+02:00 Andreas Rheinhardt
<andreas.rheinhardt@googlemail.com>:
> Hello,
>
> Steve Lhomme:
>> Hi Andreas,
>>
>> Thanks for your detailed feedback.
>>
>>> 1. Whether `DisplayWidth` and `DisplayHeight` need to be written
>>> actually depends on the value of `DisplayUnit`.
>>
>> True, I will mention that.
>>
> The "Notes" still only cover the case of `DisplayUnit` indicating pixels.

If they are not in pixels then the values should not be taken from the
codec anyway. The general Matroska rules apply. IMO we should not
repeat here how DisplayWidth/Height have to be interpreted.

>> 2018-07-11 15:47 GMT+02:00 Andreas Rheinhardt
>> <andreas.rheinhardt@googlemail.com>:
>>> 3. "They SHOULD have the [obu_has_size_field] set to 1 except for the
>>> last OBU in the sample, for which [obu_has_size_field] MAY be set to 0,
>>> in which case it is assumed to fill the remaining of the sample."
>>> "The OBUs in the Block MUST follow the [Low Overhead Bitstream Format
>>> syntax]."
>>> The first sentence leaves the possibility that [obu_has_size_field] is 0
>>> for OBUs other than the last OBU of a block (only a SHOULD). And the
>>> requirement in the second sentence actually makes MUST out of the SHOULD
>>> in the first sentence (making this part of the first sentence redundant)
>>> and contradicts/voids the MAY part of the first sentence. In other
>>> words, the two sentences should be merged to something like: "The `OBUs`
>>> in the block must follow the `Low Overhead Bitstream Format` (in which
>>> [obu_has_size_field] MUST be equal to one for every OBU) for every `OBU`
>>> with the possible exception of the very last `OBU` in which
>>> [obu_has_size_field] MAY be set to 0, in which case the `OBU` is assumed
>>> to consist of the remainder of the block."
>>
>> Indeed there's a contradiction here. If we use MUST (can't be must
>> lowercase) on [Low Overhead Bitstream Format] then the
>> [obu_has_size_field] MUST be 1.
>>
>> On MP4 for the CodecPrivate the [obu_has_size_field] MUST be 1. But in
>> the Blocks it can be 0:
>>
>> "Each OBU SHALL have the obu_has_size_field set to 1 except for the
>> last OBU in the sample, for which obu_has_size_field MAY be set to 0,
>> in which case it is assumed to fill the remaining of the sample"
>>
>> I think we should mimic that. I'll rephrase it.
>>
>
> The current version is:
> "The OBUs in the Block follow the [Low Overhead Bitstream Format
> syntax]. They SHOULD have the [obu_has_size_field] set to 1 except for
> the last OBU in the sample, for which [obu_has_size_field] MAY be set to
> 0, in which case it is assumed to fill the remaining of the sample."
>
> If one interprets the first sentence as meaning "The OBUs in the Block
> MUST follow the [Low Overhead Bitstream Format syntax]", then given that
> this syntax mandates [obu_has_size_field] to be equal to 1 the first
> part of the second sentence is redundant (given that MUST is stronger
> than SHOULD) and the second part is again in contradiction to/voided by
> the first sentence because the first sentence doesn't allow
> "[obu_has_size_field]" set to zero at all.
> If one interprets the first sentence as not conveying a MUST, then it is
> allowed (albeit strongly discouraged) to use [obu_has_size_field] equal
> to 0 for an OBU that is not the last OBU in the sample. This is not what
> we want, is it? How about:
> "The OBUs in the `Block` MUST follow the [Low Overhead Bitstream Format
> syntax] with the possible exception of the last OBU of a `Block` for
> which [obu_has_size_field] MAY be set to 0, in which case it is assumed
> to fill the remainder of the `Block`."

MUST is wrong IMO if you add an exception. Then it should be a SHOULD,
with an explanation of the cases where it does not apply. I left the
keyword out of the first sentence on purpose because the "normative"
SHOULD is in the next sentence, and that is the one that should apply.
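
To make the intended rule concrete, here is a rough Python sketch of how a
demuxer might split a `Block` into OBUs. It is only an illustration under my
assumptions: the helper names `read_leb128` and `split_obus` are made up, and
the header bit layout is my reading of the OBU header syntax in the AV1 spec.
An OBU without a size field consumes the rest of the `Block`, so it can only
ever be the last one.

```python
def read_leb128(data, pos):
    """Read a leb128 value (at most 8 bytes); return (value, next_pos)."""
    value = 0
    for i in range(8):
        if pos + i >= len(data):
            raise ValueError("truncated leb128")
        byte = data[pos + i]
        value |= (byte & 0x7F) << (7 * i)
        if not byte & 0x80:  # high bit clear: this was the last byte
            return value, pos + i + 1
    raise ValueError("leb128 longer than 8 bytes")


def split_obus(block):
    """Split the bytes of one Block into (obu_type, payload) pairs."""
    obus = []
    pos = 0
    while pos < len(block):
        header = block[pos]
        obu_type = (header >> 3) & 0x0F       # 4-bit obu_type
        has_extension = (header >> 2) & 1     # obu_extension_flag
        has_size = (header >> 1) & 1          # obu_has_size_field
        pos += 1 + has_extension              # skip header (+ extension byte)
        if has_size:
            size, pos = read_leb128(block, pos)
        else:
            # No size field: the OBU fills the remainder of the Block.
            size = len(block) - pos
        if size < 0 or pos + size > len(block):
            raise ValueError("truncated OBU")
        obus.append((obu_type, block[pos:pos + size]))
        pos += size
    return obus
```

For example, a `Block` holding a Temporal Delimiter, a sized Frame OBU and a
final Frame OBU without a size field splits into three entries, the last one
taking whatever bytes are left.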

>>> 4. "ReferenceBlocks inside a BlockGroup MUST reference frames according
>>> to the [ref_frame_idx] values of frame that is neither a KEYFRAME nor an
>>> INTRA_ONLY_FRAME.": The problem with this sentence is that
>>> [ref_frame_idx] needn't be present. It depends upon
>>> [frame_refs_short_signaling] and [show_existing_frame]. If one uses a
>>> Block inside a Blockgroup and if [show_existing_frame] equals one, one
>>> should reference the block that contained the showable frame that is now
>>> output (and that this should be the only `ReferenceBlock` written). In
>>> case of [frame_refs_short_signaling] == 1 the obvious candidates for
>>> `ReferenceBlocks` are the blocks containing the `last_frame_idx` and
>>> `gold_frame_idx` that are explicitly signalled. If I am not mistaken,
>>> then there are also other reference frames that are not explicitly
>>> signalled, but computed. I don't know if we should really write a
>>> `ReferenceBlock` entry for every reference as the current proposal seems
>>> to imply. This would be quite a bit of overhead for no gain (and
>>> furthermore, it would complicate muxers that would have to compute the
>>> references that are not explicitly signalled in case that
>>
>> This is how `ReferenceBlock` is supposed to be used. So a muxer that
>> has no idea of any codec can cut a file and keep the relevant
>> references. So they all have to be there. It's one of the reasons
>> SimpleBlock was added, to simplify things a little (and reduce
>> overhead).
>>
> Actually a muxer can cut a file if it just knows the keyframes, the
> decoding order and the display order. It doesn't need to have complete
> information about reference frames. After all, one can cut files that
> exclusively use `SimpleBlocks`.

I'm not sure that's true for modern codecs, where the referenced frame
may be older than the previous keyframe. Also, the ReferenceBlock allows
picking only the frames necessary to render that particular frame,
regardless of the internals of the codec.
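
As an illustration of that last point, a codec-agnostic tool could collect
the blocks needed for one frame roughly like this. The data model (a dict
mapping a block timestamp to its list of `ReferenceBlock` offsets) and the
name `blocks_needed` are hypothetical; the only Matroska fact assumed is that
a `ReferenceBlock` value is a timestamp relative to the block that carries it.

```python
def blocks_needed(blocks, target):
    """Return the timestamps of all blocks required to render `target`.

    `blocks` maps a block timestamp to the ReferenceBlock offsets stored
    in its BlockGroup (an empty list means no references).
    """
    needed = set()
    stack = [target]
    while stack:
        ts = stack.pop()
        if ts in needed:
            continue  # already visited
        if ts not in blocks:
            raise KeyError(f"referenced block at {ts} is missing")
        needed.add(ts)
        for offset in blocks[ts]:
            stack.append(ts + offset)  # offsets are relative to this block
    return needed
```

With blocks at 0, 40, 80 and 120 where 40 references 0 and 80 references
both 40 and 0, rendering the frame at 80 needs exactly the blocks at 0, 40
and 80, and no knowledge of the codec was required to find that out.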

> (If one wants to cut the beginning away, one can cut according to the
> keyframes; and at the end one can cut every block with timestamp >t_0
> away if there is no block with timestamp >t_0 that precedes a block with
> timestamp <=t_0 in coding/storage order.)
>
>
>
>>> [frame_refs_short_signaling] is 1). One `ReferenceBlock` would be enough
>>> to distinguish keyframes from non-keyframes.
>>> By the way: If a temporal unit contains multiple frames with references,
>>> whose references should end up as `ReferenceBlocks`? Or may the muxer
>>> choose some?
>>
>> I think a Temporal Unit can only have one (visible) frame. I don't
>> know if golden frames can have extra references. But the BlockGroup
>> should contain all frames needed to decode this frame, that includes
>> all the frames in the Block (even if not visible).
>>
>>> 5. AV1 may use spatial scalability and/or temporal scalability. What do
>>> we make of these? They are currently not forbidden if I am not mistaken,
>>> but if e.g. the spatial dimensions of different layers disagree, the
>>> `PixelWidth` and `PixelHeight` values can't be true for all layers.
>>> Matroska seems to be missing some features here.
>>
>> Our spec says that the Sequence Header OBU should be valid for all
>> frames. That can't be used for spatial scalability. We don't support
>> that mode for now.
>>
> Then this should be explicitly stated in the codec mapping. And I also
> fail to see why the fact that the Sequence Header OBU should be valid
> for all frames should be incompatible with spatial scalability (after
> all, in my reading of the spec the various layers share the same Sequence
> Header OBU).

I didn't fully understand how the spatial scalability works. If the
same Sequence Header OBU supports it then we can support it. After all
the PixelWidth/Height use the "maximum" width/height. But internally
it may be less.

>>> 7. Then there is another thing with keyframes and cues (for this point
>>> it is always presumed that the relevant sequence header OBUs are
>>> available regardless of whether this is done in-band or via CodecPrivate):
>>> a) The proposal currently does not take into account that key frames
>>> reset the decoder when they are output, not when they are decoded. A key
>>> frame needn't be immediately output; if it is (i.e. [show_frame]
>>> equaling 1), it is called a "key frame random access point" in section
>>> 7.6 of the standard and is the equivalent of an IDR frame in H.264.
>>> Everything's fine here. But a key frame can also be declared a
>>> showable_frame (but only if [show_frame] equals 0) and output later via
>>> the show_existing_frame mechanism. This is similar to an open GOP in
>>> other codecs (but in contrast to them, the block that contains the coded
>>> keyframe doesn't have the same timestamp (pts) as the first frame that
>>> can be output after a seek). The coded key frame with [show_frame] equal
>>> to zero is called a delayed random access point and a key frame
>>> dependent recovery point is a frame where a key frame with
>>> [showable_frame] equal to 1 is output via the show_existing_frame
>>> mechanism. If one starts decoding at the delayed random access point,
>>> all the output frames up to but not including the key frame dependent
>>> recovery point can depend both on the delayed random access point frame
>>> and on other earlier frames so that these frames can't be correctly
>>> decoded in general. But all the frames from the key frame dependent
>>> recovery point onwards can be correctly decoded if one starts decoding
>>> at the delayed random access point (because the decoder is reset after
>>> displaying the key frame). If one starts decoding at the key frame
>>> dependent recovery point, one doesn't have the key frame that should be
>>> shown via the show_existing_frame mechanism at all, so that this frame
>>> is simply not a real key frame.
>>> b) But although a key frame dependent recovery point is not a "real" key
>>> frame, it has the same [frame_type] as the frame that is output, i.e.
>>> its [frame_type] is KEY_FRAME. According to our current proposal this
>>> would mean that it should be treated as a keyframe in Matroska which is
>>> obviously wrong.
>>
>> That's not how I understand it. Here's section 7.6.3:
>>
>> "Informally, the requirement for decoder conformance is that decoding
>> can start at any key frame random access point or delayed random
>> access point."
>>
>> And 7.6.2:
>>
>> "delayed random access point is defined as being a frame:
>> • with frame_type equal to KEY_FRAME
>> • with show_frame equal to 0
>> • that is contained in a temporal unit that also contains a sequence header OBU"
>>
>> So as long as we seek on frames of type KEY_FRAME we should be able to
>> seek, whether it's a visible frame or not.
>
> a) According to the Uncompressed Header Syntax, the [frame_type] of a
> frame that is output via the [show_existing_frame] mechanism is the
> [frame_type] of the [showable] frame that is output. This implies that a
> key frame dependent recovery point (KFDRP from now on) has [frame_type]
> KEY_FRAME. And this means that a KFDRP is a keyframe according to the
> codec mapping (actually it is only heavily implied to be a keyframe,
> because the current codec mapping does not require to label any frame a
> keyframe).

In "6.8.2 Uncompressed header semantics" it says:

"If obu_type is equal to OBU_FRAME, it is a requirement of bitstream
conformance that show_existing_frame is equal to 0."

So the KFDRP which has show_existing_frame=1 is not an OBU_FRAME. I
think it's in an OBU of type OBU_FRAME_HEADER. Also a `Block` is one
Temporal Unit which has at least one shown OBU frame (exactly 1 if
spatial scalability is not used). It's not entirely clear if an OBU of
type OBU_FRAME_HEADER is a frame or not. We could add the requirement
for our keyframe definition that the OBU with the keyframe flag must
be of type OBU_FRAME. Currently we say:

A `SimpleBlock` MUST only be marked as a Keyframe if the first `Frame
Header OBU` in the `Block` has a __[frame_type]__ of `KEY_FRAME`

This is actually the opposite of what we want... It should not be a
`Frame Header OBU` but a `Frame OBU`. I think the Frame Header OBU is
a frame placeholder for the show_existing_frame flag. It's missing the
Tile Group that contains the actual data. I'll fix that in my next
push.
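
Roughly, the corrected check could look like the sketch below. It takes the
OBUs of a `Block` as `(obu_type, payload)` pairs; the function name is made
up, and it assumes `reduced_still_picture_header` is 0 so that the
uncompressed header starts with show_existing_frame (1 bit) followed by
frame_type (2 bits). That is my reading of the AV1 syntax, not text from our
mapping.

```python
OBU_FRAME = 6   # obu_type of a Frame OBU (frame header plus tile group)
KEY_FRAME = 0   # frame_type value for key frames


def block_is_keyframe(obus):
    """True if the first Frame OBU in the Block has frame_type KEY_FRAME.

    Frame Header OBUs (obu_type 3) are deliberately ignored, so a Block
    that only re-shows an existing frame is never reported as a keyframe.
    """
    for obu_type, payload in obus:
        if obu_type != OBU_FRAME:
            continue
        if not payload:
            return False  # malformed: no header bits to inspect
        first = payload[0]
        show_existing_frame = first >> 7       # must be 0 in an OBU_FRAME
        frame_type = (first >> 5) & 0x3
        return show_existing_frame == 0 and frame_type == KEY_FRAME
    return False  # no Frame OBU at all
```

A Block whose only frame data is a Frame Header OBU with
show_existing_frame set is therefore not marked as a keyframe, which is the
behaviour we want for the recovery point.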

> b) The first passage you cited says that decoding can start at key frame
> random access points (KFRAP from now on) or delayed random access points
> (DRAP from now on). It does not say that one can seek to frames of type
> KFDRP. And these frames are of type KEY_FRAME, too.

I think this is exactly what Random Access Point means. You can seek
there and start decoding safely. And that's exactly when a Block must
be a keyframe in Matroska and not any other time. And for that I
prefer to use Chapter 7.6.

>> But because this is a bit loose in the 7.6.3 section they add this:
>>
>> "To support the different modes of operation, a conformant decoder is
>> required to be able to decode bitstreams consisting of:
>> • a temporal unit containing a delayed random access point
>> • immediately followed by a temporal unit containing the associated
>> key frame dependent recovery point"
>>
>> So the invisible KEY_FRAME should be immediately followed by the
>> recovery point data. So effectively it will work. There's also a note
>> that if it's not followed immediately then what is done with the
>> intermediate frames is implementation dependent. I don't think we need
>> to care too much.
>
> I don't think we should assume that the DRAP block is immediately
> followed by a KFDRP (and I don't see how the assumption that there are
> no temporal units between DRAP and KFDRP simplifies anything for us
> container guys). That's just the only place where being able to decode
> is absolutely required. But just read the note (that you actually quote
> below) and you will see that they expect more:
>
> "In practice, decoder implementations are expected to be able to start
> decoding bitstreams from a delayed random access point when the
> intermediate temporal units are still present. The decoder should
> correctly produce all output frames from the next key frame or key frame
> dependent recovery point onwards, while the preceding frames are
> implementation defined."
>
> So it's reasonable to assume that there will be temporal units between
> DRAP and KFDRP.
>
> And I think we should care about such stuff and in particular make sure
> that seeking works well (because when there are non-KFRAP keyframes, AV1
> is very much like intra decoder refresh when it comes to seeking and
> intra decoder refresh seeking currently doesn't work well because the
> cues only contain the timestamps of the frames where one should start
> decoding in order to output frames, but not the timestamp of the first
> frame that is undamaged after one has started decoding; this is a
> problem when one wants to seek to one of the frames in between these two
> frames).
>>
>>> c) Marking a delayed random access point as keyframe deviates from the
>>> way that flag has been traditionally understood: If one starts decoding
>>> at this point, one doesn't get the frame that should be output for the
>>> temporal unit containing the delayed random access point. But I
>>> nevertheless think that these are the right keyframes, because they are
>>> the points at which random access has to begin when there aren't key
>>> frame random access points available; this also means that one can split
>>> the stream at this point and the second part will still play so that a
>>> muxer like mkvmerge needn't be rewritten too much.
>>
>> Yes, IMO this is the correct way.
>>
>>> d) A consequence of this is that a `Blockgroup` containing a delayed
>>> random access point mustn't contain a `ReferenceBlock` (although the
>>> actual frame that is output for that temporal unit very likely uses
>>> other reference frames than the key frame that is contained in the same
>>> temporal unit).
>>
>> This won't happen if it's a proper random access point, ie it doesn't
>> need past frames to start decoding. If it's not then it's not a RAP
>> and then it can/should have ReferenceBlock.
>>
> The DRAP frame itself (being coded intra) won't reference other frames,
> but the other frames (including the shown frame) in the temporal unit
> containing said DRAP will of course reference earlier frames (the H.264
> equivalent of this scenario is an open gop and the other frames in this
> case are B-frames shared between two GOPs (i.e. the B-frames that
> precede the open GOP's keyframe in display order and follow it in coding
> order) -- and they of course reference both the keyframe and earlier
> frames, that is after all the whole point of having an open GOP). So it
> does happen although it is a random access point; it does happen,
> because it is just a delayed random access point.
>
>
>>> e) Yes, this proposal means that it is impossible to tell from Matroska
>>> alone (well, from the block structure that is; see f) for a way for
>>> which one could put this information into the Cues) whether it is a key
>>> frame random access point or a delayed random access point. One will
>>> have to decode it (or parse deeper) to know.
>>
>> No, this is independent of the codec. Also Cues can target frames
>> that can't be seeked to directly but that's beside the point. A
>> SimpleBlock marked keyframe or a BlockGroup with no ReferenceBlock can
>> be seeked to directly and that's why they equal Random Access Points
>> as defined in AV1 (and other codecs).
>>
> Given that your reply began with "No" I presume that you believe that
> you refuted me, but I don't see that you have. After all, how can one
> tell from the Matroska layer alone whether a `Block` is a KFRAP or a DRAP?

A RAP is exactly what we need, no matter the time. If there are
glitches because it's a DRAP it's fine according to the specs. But
that's where a proper seek should be done.

> (Of course I know that Cues can also target frames that can't be seeked
> to directly -- after all, my proposal below is about adding Cues for
> such frames.)
>
>>> f) This also leads to problems with seeking: If one simply added a
>>> CuePoint for the keyframe (i.e. for the delayed random access point) and
>>> the user wants to seek to a point between the delayed random access
>>> point (inclusive) and the dependent recovery point (exclusive) and the
>>> player used the cues to seek to the nearest keyframe in front of the
>>> desired point, then decoding at the point referenced in the cues would
>>> not yield the desired frame (it would be either corrupted or not output
>>> at all). Therefore I think it is best to add a CuePoint for every key
>>> frame random access point and every key frame dependent recovery point.
>>> The CuePoint for the key frame random access point would be an ordinary
>>> CuePoint as usual. But the CuePoint for the key frame dependent recovery
>>> point wouldn't be (my favourite is iv) (and if I were allowed to play
>>> God it would be i))):
>>
>> Cues are quite loose. It would be possible to do Cues only for frames
>> that are not delayed RAP and that's valid. IMO it's fine to reference
>> the delayed RAP. But they are RAP so it's legal to seek there.
>>
> Cues are loose, but we can (and I think we should) add a SHOULD clause
> that describes what cues muxers are expected to produce. And referencing
> the delayed RAP brings the problem that the cues only contain where to
> start decoding, but not from when on the output is undamaged/valid.
>> The AV1 specs have this to say about this tricky case:
>>
>> "Note:In practice, decoder implementations are expected to be able to
>> start decoding bitstreams from a delayed random access point when the
>> intermediate temporal units are still present. The decoder should
>> correctly produce all output frames from the next key frame or key
>> frame dependent recovery point onwards, while the preceding frames are
>> implementation defined. For example: a streaming decoder may choose to
>> decode and display all frames even when the reference frames are not
>> available (tolerating some errors in the output), a low latency
>> decoder may choose to decode and display all frames that are
>> guaranteed to be correct (e.g. an inter frame that only uses inter
>> prediction from the delayed random access point), a media player
>> decoder may choose to decode and display only frames starting from a
>> key frame or key frame dependent recovery point (guaranteeing smooth
>> playback once display starts)."
>>
>> It's not up to the container to solve this.
>>
> It is not up to the container to decide whether the possibly damaged
> frames between the DRAP and the KFDRP should be displayed, but it very
> much is up to the container to make it as easy as possible to find the
> data needed for random access. And so the cues should answer the
> question "I want to play from point t_0 on. Where do I have to seek to?"

In Matroska: to the CuePoint with no CueReference that has the closest
time before t_0.

In AV1: to the RAP with the closest time before t_0.
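
In code, the Matroska side of that lookup could be as simple as the
following sketch. The `(cue_time, has_reference, cluster_position)` tuples
and the name `pick_cue` are hypothetical stand-ins for parsed `CuePoint`
data, and I read "before" as "at or before" here.

```python
from bisect import bisect_right


def pick_cue(cues, t0):
    """Return the latest cue without a CueReference whose time is <= t0.

    `cues` is an iterable of (cue_time, has_reference, cluster_position)
    tuples. Returns None when no usable cue lies at or before t0.
    """
    usable = sorted(c for c in cues if not c[1])
    times = [c[0] for c in usable]
    i = bisect_right(times, t0)  # number of usable cues with time <= t0
    return usable[i - 1] if i else None
```

A player would then start reading at the returned cluster position and
decode forward until it reaches t_0.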

>>> i) A comprehensive way of doing it is this: The CueTime would be the
>>> timestamp of the block containing the dependent recovery point; it would
>>> include a CueTrackPositions for the video track we are talking about
>>> that contains the right CueTrack, the CueClusterPosition containing the
>>> position of the dependent recovery point block and a CueReference with
>>> CueRefTime and CueRefCluster, both corresponding to the values of the
>>> delayed random access point. This proposal has several downsides: It
>>> uses Cue elements that are deprecated in Matroska and not part of Webm.
>>> So this would require a quite nontrivial change in both projects. (Btw:
>>> If one does this, one should add a default value for `CueRefCluster`: It
>>> should be the same as `CueClusterPosition` as both blocks that we are
>>> talking about will probably end up in the same cluster anyway.)
>>> ii) One uses the CueTime of the dependent recovery point, but the
>>> position of the Cluster of the delayed access point (and
>>> `CueRelativePosition` (if used) should also point to this block).
>>> Pro: It only uses elements that are supported by both Matroska and Webm.
>>> Furthermore, the specs only say that `CueClusterPosition` should point
>>> to the cluster containing the "required block"; they don't explicitly say
>>> that said block needs to have the same timestamp as `CueTime`.
>>> Contra: How does a demuxer know from which block onwards it should feed
>>> the data to the decoder? It might use the `CueRelativePosition`, but
>>> probably a lot of demuxers would simply read the cluster until they come
>>> to the block with timestamp `CueTime` (i.e. they interpret the specs so
>>> that the "required block" is the block with the timestamp `CueTime`) and
>>> then they would either deliver this to the decoder or conclude that the
>>> file is damaged (because the block they found is no keyframe).
>>> iii) The last is the same as i) with the difference that `CueRefCluster`
>>> is omitted. It is also incompatible with current Webm, but at least it
>>> has the advantage that it doesn't use any currently deprecated elements
>>> of Matroska. One could add a requirement that the delayed random access
>>> point and the dependent recovery point need to be in the same cluster
>>> and then omitting `CueRefCluster` is not a problem any more.
>>> iv) And then there is the possibility of creating a normal CuePoint for
>>> the dependent recovery point, writing the dependent recovery point as a
>>> Block in a Blockgroup with exactly one ReferenceBlock which points to
>>> the delayed access point block and let the demuxer seek backwards from
>>> the dependent recovery point to the delayed access point.
>>> Pro: Would only use things that are already supported by Matroska and
>>> Webm. It would also not be AV1 specific. The demuxer doesn't need to
>>> know anything about AV1, everything is signalled at the container level.
>>> Contra: Demuxers would have to be adapted not to expect any more that
>>> only keyframes are referenced in the cues. They would also have to be
>>> adapted to actually make use of the value of `ReferenceBlock` and seek
>>> backwards. This also implies more seeks, but this should be quite
>>> limited when one puts both the delayed random access point and the
>>> dependent recovery point in the same cluster -- hopefully the data is
>>> still cached. (Maybe one should add a SHOULD clause that says that both
>>> blocks should be in the same cluster.)
>>> g) Of course there are two easy alternative solutions:
>>> i) Restrict the type of AV1 that is allowed in Matroska even further so
>>> that all key frames are of key frame random access type. (This could
>>> exclude quite a lot of AV1 and therefore I recommend not doing so.)
>>> ii) Create cues as usual, i.e. reference every delayed random access
>>> point, and don't care about the fact that seeking will be partially
>>> broken in this case.
>>> h) It should be noted that exactly the same situation exists with
>>> periodic intra refresh in general. There was a short discussion on the
>>> Matroska developer mailing list in April 2011, but nothing came out of
>>> it. Every solution I outlined here for AV1 is also applicable for this case.
>>>
>>> Steve Lhomme:
>>>> Since we allow stripping the Sequence Header OBU from the stream when
>>>> it's equal to the CodecPrivate one, we need to add it back to the
>>>> bitstream for compliance. At least when seeking on keyframes. So I
>>>> added a section to explain that.
>>>>
>>>> IMO that's an extra feature of the CodecPrivate that it's meant to be
>>>> added to the bitstream as-is. And in this case on startup and when
>>>> seeking. I wonder if we should add an element next to the CodecPrivate
>>>> to describe that. Because in this case it's not entirely opaque to the
>>>> demuxer. Or maybe it's implied by the CodecID and is up to the decoder
>>>> to use it how it's supposed to be (in this case detecting keyframes
>>>> and possibly adding back the Sequence Header OBU).
>>>
>>> 8. I think we can relax the requirements on the existence of in-band
>>> sequence header OBUs a bit: If a keyframe (i.e. a key frame random access
>>> point or a delayed random access point, not a dependent recovery point)
>>> uses the same sequence header OBU as in the CodecPrivate (including the
>>> same operating_parameters_info), then the sequence header OBU needn't be
>>> prepended to the block with the keyframe, because seeking already works
>>> without it provided one always adds the sequence header OBU from the
>>> CodecPrivate back in the bitstream on seeking. For example, consider the
>>> following scenario:
>>> One has an elementary stream that uses two different sequence header
>>> OBUs A and B that only differ in the operating_parameters_info. The
>>> first three keyframes use A; B is contained in
>>> a temporal unit between the third and the fourth keyframe. Between the
>>> sixth and the seventh keyframe is a temporal unit containing sequence
>>> header A again. Then a muxer that wants to put this elementary stream
>>> into Matroska may put A in the CodecPrivate and can strip the very first
>>> occurrence of A away; it must leave B inside the temporal unit that it
>>> was in (so that a player that plays the file linearly is notified about
>>> the change) and has to make sure that keyframes #4 to #6 contain
>>> sequence header B (so that one has the correct sequence header when
>>> seeking to said keyframes). It mustn't strip A between the sixth and the
>>> seventh keyframe away (so that a player that plays the file linearly
>>> notices the change of sequence header), but it needn't prepend
>>> keyframes #7 and following with A.
>>
>> That works but then we need to tell in the specs that a following
>> Sequence Header MUST NOT be stripped if the previous Sequence Header
>> was not bit identical to the one in CodecPrivate.
>>
>> IMO it's valid to output A and then B before the rest of the data in
>> frame #4. That should be done if it's a keyframe. But if it's not a
>> keyframe the demuxer/decoder doesn't know it has to prepend the
>> CodecPrivate there. That could be a problem. Even though it doesn't
>> make much sense to change Sequence Header data before a non keyframe
>> (RAP).
>
> Why should the demuxer ever want to prepend the CodecPrivate in front of
> a non-keyframe? This doesn't make sense to me; after all, one should
> only seek to keyframes and only add the CodecPrivate in front of a
> keyframe if one has just seeked to said keyframe, not if one just
> encounters a keyframe (after the seek one just uses whatever Sequence
> Header one encounters in-band (if any)).

Because they can be omitted. But given we changed the way we omit it
so that changes always go back to the "reference" one it's not needed
anymore.

The AV1 specs say a RAP must contain a Sequence Header OBU. In our
case it may be omitted but it's up to the codec/demuxer to add it back
in the stream when seeking.
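
A demuxer-side sketch of that behaviour, under the assumption that the
Sequence Header OBU bytes have already been extracted from the
`CodecPrivate` (the function and parameter names are made up for
illustration):

```python
def data_for_decoder(block, sequence_header_obu, is_keyframe, just_seeked):
    """Return the bytes to hand to the decoder for one Block.

    Right after a seek to a keyframe the Sequence Header OBU taken from
    the CodecPrivate is put in front of the Block data. During linear
    playback the Block is passed through untouched.
    """
    if just_seeked and is_keyframe:
        return sequence_header_obu + block
    return block
```

If the `Block` carries its own in-band Sequence Header OBU, it comes after
the prepended one, so the decoder ends up with the in-band one in force.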

>>
>> So maybe your proposal should be added. We'll gain a bit of weight but
>> it's safer.
>>
> My wording would be:
> "Upon seeking to a keyframe the player/demuxer MUST prepend the `Block`
> with the `Sequence Header OBU` contained in the `CodecPrivate`.
> A muxer MUST make sure that the correct `Sequence Header OBU` is in
> force both during linear access and also after seeking to a keyframe
> `Block`. So in particular a keyframe `Block` where a `Sequence Header
> OBU` that is not bit-identical to the one in the `CodecPrivate` is in
> force for the decoding of the first frame contained in said `Block` MUST
> contain the `Sequence Header OBU` that is in force for the decoding of
> the first frame contained in said `Block` in front of the first frame
> contained in said `Block`."
>
> One could also relax this a bit and only make the correct linear access
> a MUST and the rest a SHOULD. This might be useful for applications
> where seeking isn't desired (although it really should be included even
> for those scenarios to support resuming playback after a transmission
> error).

OK, I'll try to add something for seeking.

> And maybe one should add a clause like "Seeking to non-keyframes is
> undefined and not recommended."
>
> And if one wants to one can add a clause recommending to strip out
> unnecessary `Sequence Header OBUs`.
>
>
>>> Steve Lhomme:
>>>> Let me know what you think so we can settle this spec for good.
>>>>
>>> I think we are not even close to settling this for good.
>>
>> \o/
>>
> I honestly don't know what this means.

Raised arms.

> - Andreas Rheinhardt
>



-- 
Steve Lhomme
Matroska association Chairman