Re: [Cellar] AV1 mapping update

Andreas Rheinhardt <andreas.rheinhardt@googlemail.com> Wed, 11 July 2018 13:48 UTC

To: cellar@ietf.org
References: <CAOXsMFKHo6RS+q8KCXKoKCiBBS9pVqs92wsLgSfXZO+DT3dStQ@mail.gmail.com>
From: Andreas Rheinhardt <andreas.rheinhardt@googlemail.com>
Message-ID: <ca0f009e-a245-fcd6-95f8-f051736c9161@googlemail.com>
Date: Wed, 11 Jul 2018 13:47:00 +0000
MIME-Version: 1.0
In-Reply-To: <CAOXsMFKHo6RS+q8KCXKoKCiBBS9pVqs92wsLgSfXZO+DT3dStQ@mail.gmail.com>
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: 7bit
Archived-At: <https://mailarchive.ietf.org/arch/msg/cellar/nLpmtjKSL22njUKZni8F5F_HWwE>
Subject: Re: [Cellar] AV1 mapping update

Steve Lhomme:
> I updated the AV1 mapping to clean a few sentences.
> 
> 
> https://github.com/Matroska-Org/matroska-specification/blob/av1-mappin/codec/av1.md
> and the list of changes can be found here
> https://github.com/Matroska-Org/matroska-specification/commits/av1-mappin/codec/av1.md
> 
1. Whether `DisplayWidth` and `DisplayHeight` need to be written
actually depends on the value of `DisplayUnit`.

2. You forgot `OBU_PADDING` in the list of OBU types that mustn't be in
the `CodecPrivate`. (Either that or your sentence that only
`OBU_SEQUENCE_HEADER` and `OBU_METADATA` are currently allowed in the
`CodecPrivate` should be changed.)

3. "They SHOULD have the [obu_has_size_field] set to 1 except for the
last OBU in the sample, for which [obu_has_size_field] MAY be set to 0,
in which case it is assumed to fill the remaining of the sample."
"The OBUs in the Block MUST follow the [Low Overhead Bitstream Format
syntax]."
The first sentence leaves the possibility that [obu_has_size_field] is 0
for OBUs other than the last OBU of a block (only a SHOULD). And the
requirement in the second sentence actually makes MUST out of the SHOULD
in the first sentence (making this part of the first sentence redundant)
and contradicts/voids the MAY part of the first sentence. In other
words, the two sentences should be merged to something like: "The `OBUs`
in the block must follow the `Low Overhead Bitstream Format` (in which
[obu_has_size_field] MUST be equal to one for every OBU) for every `OBU`
with the possible exception of the very last `OBU` in which
[obu_has_size_field] MAY be set to 0, in which case the `OBU` is assumed
to consist of the remainder of the block."
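
To illustrate why only the last OBU can go without a size field, here is a
minimal sketch in Python of how a demuxer might split a block into OBUs.
The function names `read_leb128` and `split_obus` are made up for this
example; the header bit layout (obu_type, obu_extension_flag,
obu_has_size_field) and the leb128 coding of obu_size are taken from the
AV1 bitstream syntax. It is not a complete or validated parser.

```python
def read_leb128(data: bytes, pos: int):
    """Decode an AV1 leb128 value starting at pos; return (value, next_pos)."""
    value = 0
    for i in range(8):  # AV1 limits leb128 values to 8 bytes
        if pos + i >= len(data):
            raise ValueError("truncated leb128")
        byte = data[pos + i]
        value |= (byte & 0x7F) << (7 * i)
        if not byte & 0x80:
            return value, pos + i + 1
    raise ValueError("leb128 longer than 8 bytes")


def split_obus(block: bytes):
    """Split a block into (obu_type, payload) tuples (hypothetical helper)."""
    obus = []
    pos = 0
    while pos < len(block):
        header = block[pos]
        obu_type = (header >> 3) & 0x0F
        has_extension = bool(header & 0x04)
        has_size = bool(header & 0x02)
        pos += 2 if has_extension else 1  # skip header (+ extension byte)
        if has_size:
            size, pos = read_leb128(block, pos)
            end = pos + size
        else:
            # Nothing delimits this OBU, so it can only be the last one:
            # it is assumed to consist of the remainder of the block.
            end = len(block)
        if pos > len(block) or end > len(block):
            raise ValueError("OBU runs past the end of the block")
        obus.append((obu_type, block[pos:end]))
        pos = end
    return obus
```

With this reading, an OBU with [obu_has_size_field] equal to 0 anywhere
but at the end would silently swallow all following OBUs, which is why
the SHOULD in the current text is too weak.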

4. "ReferenceBlocks inside a BlockGroup MUST reference frames according
to the [ref_frame_idx] values of frame that is neither a KEYFRAME nor an
INTRA_ONLY_FRAME.": The problem with this sentence is that
[ref_frame_idx] needn't be present. It depends upon
[frame_refs_short_signaling] and [show_existing_frame]. If one uses a
Block inside a BlockGroup and if [show_existing_frame] equals one, one
should reference the block that contained the showable frame that is now
output (and this should be the only `ReferenceBlock` written). In
case of [frame_refs_short_signaling] == 1 the obvious candidates for
`ReferenceBlocks` are the blocks containing the `last_frame_idx` and
`gold_frame_idx` that are explicitly signalled. If I am not mistaken,
then there are also other reference frames that are not explicitly
signalled, but computed. I don't know if we should really write a
`ReferenceBlock` entry for every reference as the current proposal seems
to imply. This would be quite a bit of overhead for no gain (and
furthermore, it would complicate muxers that would have to compute the
references that are not explicitly signalled in case that
[frame_refs_short_signaling] is 1). One `ReferenceBlock` would be enough
to distinguish keyframes from non-keyframes.
By the way: If a temporal unit contains multiple frames with references,
whose references should end up as `ReferenceBlocks`? Or may the muxer
choose some?

5. AV1 may use spatial scalability and/or temporal scalability. What do
we make of these? They are currently not forbidden if I am not mistaken,
but if e.g. the spatial dimensions of different layers disagree, the
`PixelWidth` and `PixelHeight` values can't be true for all layers.
Matroska seems to be missing some features here.

6. Depending on [frame_size_override_flag] there is even the possibility
that the size of the frames differs even without scalability (if I am
not mistaken). Should this be allowed?

7. Then there is another thing with keyframes and cues (for this point
it is always presumed that the relevant sequence header OBUs are
available regardless of whether this is done in-band or via CodecPrivate):
a) The proposal currently does not take into account that key frames
reset the decoder when they are output, not when they are decoded. A key
frame needn't be immediately output; if it is (i.e. [show_frame]
equaling 1), it is called a "key frame random access point" in section
7.6 of the standard and is the equivalent of an IDR frame in H.264.
Everything's fine here. But a key frame can also be declared a
showable_frame (but only if [show_frame] equals 0) and output later via
the show_existing_frame mechanism. This is similar to an open GOP in
other codecs (but in contrast to them, the block that contains the coded
keyframe doesn't have the same timestamp (pts) as the first frame that
can be output after a seek). The coded key frame with [show_frame] equal
to zero is called a delayed random access point and a key frame
dependent recovery point is a frame where a key frame with
[showable_frame] equal to 1 is output via the show_existing_frame
mechanism. If one starts decoding at the delayed random access point,
all the output frames up to but not including the key frame dependent
recovery point can depend both on the delayed random access point frame
and on other earlier frames so that these frames can't be correctly
decoded in general. But all the frames from the key frame dependent
recovery point onwards can be correctly decoded if one starts decoding
at the delayed random access point (because the decoder is reset after
displaying the key frame). If one starts decoding at the key frame
dependent recovery point, one doesn't have the key frame that should be
shown via the show_existing_frame mechanism at all, so that this frame
is simply not a real key frame.
b) But although a key frame dependent recovery point is not a "real" key
frame, it has the same [frame_type] as the frame that is output, i.e.
its [frame_type] is KEY_FRAME. According to our current proposal this
would mean that it should be treated as a keyframe in Matroska which is
obviously wrong.
c) Marking a delayed random access point as keyframe deviates from the
way that flag has been traditionally understood: If one starts decoding
at this point, one doesn't get the frame that should be output for the
temporal unit containing the delayed random access point. But I
nevertheless think that these are the right keyframes, because they are
the points at which random access has to begin when there aren't key
frame random access points available; this also means that one can split
the stream at this point and the second part will still play so that a
muxer like mkvmerge needn't be rewritten too much.
d) A consequence of this is that a `BlockGroup` containing a delayed
random access point mustn't contain a `ReferenceBlock` (although the
actual frame that is output for that temporal unit very likely uses
other reference frames than the key frame that is contained in the same
temporal unit).
e) Yes, this proposal means that it is impossible to tell from Matroska
alone (well, from the block structure that is; see f) for a way in
which one could put this information into the Cues) whether it is a key
frame random access point or a delayed random access point. One will
have to decode it (or parse deeper) to know.
f) This also leads to problems with seeking: If one simply added a
CuePoint for the keyframe (i.e. for the delayed random access point) and
the user wants to seek to a point between the delayed random access
point (inclusive) and the dependent recovery point (exclusive) and the
player used the cues to seek to the nearest keyframe in front of the
desired point, then decoding at the point referenced in the cues would
not yield the desired frame (it would be either corrupted or not output
at all). Therefore I think it is best to add a CuePoint for every key
frame random access point and every key frame dependent recovery point.
The CuePoint for the key frame random access point would be an ordinary
CuePoint as usual. But the CuePoint for the key frame dependent recovery
point wouldn't be (my favourite is iv) (and if I were allowed to play
God it would be i))):
i) A comprehensive way of doing it is this: The CueTime would be the
timestamp of the block containing the dependent recovery point; it would
include a CueTrackPositions for the video track we are talking about
that contains the right CueTrack, the CueClusterPosition containing the
position of the dependent recovery point block and a CueReference with
CueRefTime and CueRefCluster, both corresponding to the values of the
delayed random access point. This proposal has several downsides: It
uses Cue elements that are deprecated in Matroska and not part of Webm.
So this would require a quite nontrivial change in both projects. (Btw:
If one does this, one should add a default value for `CueRefCluster`: It
should be the same as `CueClusterPosition` as both blocks that we are
talking about will probably end up in the same cluster anyway.)
ii) One uses the CueTime of the dependent recovery point, but the
position of the Cluster of the delayed access point (and
`CueRelativePosition` (if used) should also point to this block).
Pro: It only uses elements that are supported by both Matroska and Webm.
Furthermore, the specs only say that `CueClusterPosition` should point
to the cluster containing the "required block"; they don't explicitly say
that said block needs to have the same timestamp as `CueTime`.
Contra: How does a demuxer know from which block onwards it should feed
the data to the decoder? It might use the `CueRelativePosition`, but
probably a lot of demuxers would simply read the cluster until they come
to the block with timestamp `CueTime` (i.e. they interpret the specs so
that the "required block" is the block with the timestamp `CueTime`) and
then they would either deliver this to the decoder or conclude that the
file is damaged (because the block they found is no keyframe).
iii) The last is the same as i) with the difference that `CueRefCluster`
is omitted. It is also incompatible with current Webm, but at least it
has the advantage that it doesn't use any currently deprecated elements
of Matroska. One could add a requirement that the delayed random access
point and the dependent recovery point need to be in the same cluster
and then omitting `CueRefCluster` is not a problem any more.
iv) And then there is the possibility of creating a normal CuePoint for
the dependent recovery point, writing the dependent recovery point as a
Block in a BlockGroup with exactly one ReferenceBlock which points to
the delayed access point block and let the demuxer seek backwards from
the dependent recovery point to the delayed access point.
Pro: Would only use things that are already supported by Matroska and
Webm. It would also not be AV1 specific. The demuxer doesn't need to
know anything about AV1, everything is signalled at the container level.
Contra: Demuxers would have to be adapted not to expect any more that
only keyframes are referenced in the cues. They would also have to be
adapted to actually make use of the value of `ReferenceBlock` and seek
backwards. This also implies more seeks, but this should be quite
limited when one puts both the delayed random access point and the
dependent recovery point in the same cluster -- hopefully the data is
still cached. (Maybe one should add a SHOULD clause that says that both
blocks should be in the same cluster.)
g) Of course there are two easy alternative solutions:
i) Restrict the type of AV1 that is allowed in Matroska even further so
that all key frames are of key frame random access type. (This could
exclude quite a lot of AV1 and therefore I recommend not doing so.)
ii) Create cues as usual, i.e. reference every delayed random access
point, and don't care about the fact that seeking will be partially
broken in this case.
h) It should be noted that exactly the same situation exists with
periodic intra refresh in general. There was a short discussion on the
Matroska developer mailing list in April 2011, but nothing came out of
it. Every solution I outlined here for AV1 is also applicable for this case.
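
The distinctions made in 7a) to 7f) can be summarised in a short Python
sketch. Everything named here (`AccessPoint`, `FrameHeader`, `classify`,
`is_matroska_keyframe`, `wants_cue_point`) is hypothetical and only
mirrors the reasoning above; it assumes that the relevant frame header
fields have already been parsed, and it looks at one frame header at a
time rather than at a whole temporal unit.

```python
from dataclasses import dataclass
from enum import Enum, auto


class AccessPoint(Enum):
    NONE = auto()
    KEY_FRAME_RAP = auto()             # key frame random access point
    DELAYED_RAP = auto()               # delayed random access point
    DEPENDENT_RECOVERY_POINT = auto()  # key frame dependent recovery point


@dataclass
class FrameHeader:
    """Hypothetical holder for the few AV1 frame header fields used below."""
    frame_type: str                    # "KEY_FRAME", "INTER_FRAME", ...
    show_frame: bool = True
    showable_frame: bool = False
    show_existing_frame: bool = False


def classify(h: FrameHeader) -> AccessPoint:
    if h.frame_type != "KEY_FRAME":
        return AccessPoint.NONE
    if h.show_existing_frame:
        # Same frame_type as the frame that is output, but not a real key frame.
        return AccessPoint.DEPENDENT_RECOVERY_POINT
    if h.show_frame:
        return AccessPoint.KEY_FRAME_RAP
    if h.showable_frame:
        return AccessPoint.DELAYED_RAP
    # Hidden key frame that is never shown: not discussed above (assumption).
    return AccessPoint.NONE


def is_matroska_keyframe(h: FrameHeader) -> bool:
    """7c): decoding has to start here, so no ReferenceBlock (7d)."""
    return classify(h) in (AccessPoint.KEY_FRAME_RAP, AccessPoint.DELAYED_RAP)


def wants_cue_point(h: FrameHeader) -> bool:
    """7f): cue the points from which output is correct."""
    return classify(h) in (AccessPoint.KEY_FRAME_RAP,
                           AccessPoint.DEPENDENT_RECOVERY_POINT)
```

Note that the two predicates disagree for exactly the two problematic
cases: a delayed random access point is a keyframe without a CuePoint of
its own, and a dependent recovery point gets a CuePoint without being a
keyframe.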

Steve Lhomme:
> Since we allow stripping the Sequence Header OBU from the stream when
> it's equal to the CodecPrivate one, we need to add it back to the
> bitstream for compliance. At least when seeking on keyframes. So I
> added a section to explain that.
>
> IMO that's an extra feature of the CodecPrivate that it's meant to be
> added to the bitstream as-is. And in this case on startup and when
> seeking. I wonder if we should add an element next to the CodecPrivate
> to describe that. Because in this case it's not entirely opaque to the
> demuxer. Or maybe it's implied by the CodecID and is up to the decoder
> to use it how it's supposed to be (in this case detecting keyframes
> and possibly adding back the Sequence Header OBU).
8. I think we can relax the requirements on the existence of in-band
sequence header OBUs a bit: If a keyframe (i.e. a key frame random access
point or a delayed random access point, not a dependent recovery point)
uses the same sequence header OBU as in the CodecPrivate (including the
same operating_parameters_info), then the sequence header OBU needn't be
prepended to the block with the keyframe, because seeking already works
without it provided one always adds the sequence header OBU from the
CodecPrivate back in the bitstream on seeking. For example, consider the
following scenario:
One has an elementary stream that uses two different sequence header
OBUs A and B that only differ in the operating_parameters_info. The
first three keyframes use A; B is contained in a temporal unit between
the third and the fourth keyframe. Between the
sixth and the seventh keyframe is a temporal unit containing sequence
header A again. Then a muxer that wants to put this elementary stream
into Matroska may put A in the CodecPrivate and can strip the very first
occurrence of A away; it must leave B inside the temporal unit that it
was in (so that a player that plays the file linearly is notified about
the change) and has to make sure that keyframes #4 to #6 contain
sequence header B (so that one has the correct sequence header when
seeking to said keyframes). It mustn't strip A between the sixth and the
seventh keyframe away (so that a player that plays the file linearly
notices the change of sequence header), but it needn't prepend
keyframes #7 and following with A.
That way one can save a few bytes.
This is consistent with an interpretation of the CodecPrivate as the
default extradata/header (but it is not necessarily a truly global
header). Before `CueCodecState` was deprecated it had a default value of
0 that mandated that one should look in the CodecPrivate for the
CodecState upon seeking. So the only thing specific to AV1 in this case
is the "as-is" part; that one should reset the decoder to whatever
initialization information is contained in the CodecPrivate upon seeking
is nothing new.
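
The muxer behaviour in the A/B scenario could be sketched roughly like
this in Python. `TemporalUnit` and `plan_sequence_headers` are
hypothetical names; sequence headers are modelled as opaque values that
are only compared for equality, and `is_keyframe` stands for the Matroska
keyframe flag (key frame random access point or delayed random access
point). This is a sketch of the proposed rule, not of any existing muxer.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class TemporalUnit:
    is_keyframe: bool
    seq_header: Optional[str] = None  # in-band sequence header, if any


def plan_sequence_headers(units: List[TemporalUnit],
                          codec_private: str) -> List[Optional[str]]:
    """For each temporal unit, return the sequence header that has to be
    written in-band, or None if none is needed."""
    active = codec_private  # what a linearly playing decoder currently has
    plan = []
    for unit in units:
        incoming = unit.seq_header if unit.seq_header is not None else active
        changed = incoming != active
        active = incoming
        # After a seek the decoder is reset to the CodecPrivate header, so a
        # keyframe only needs its header in-band if it differs from that.
        needed_for_seek = unit.is_keyframe and active != codec_private
        plan.append(active if (changed or needed_for_seek) else None)
    return plan


# Scenario from the text: K = keyframe, P = other temporal unit.
K, P = True, False
stream = [
    TemporalUnit(K, "A"), TemporalUnit(P), TemporalUnit(K), TemporalUnit(P),
    TemporalUnit(K), TemporalUnit(P, "B"),            # switch to B
    TemporalUnit(K), TemporalUnit(P), TemporalUnit(K), TemporalUnit(P),
    TemporalUnit(K), TemporalUnit(P, "A"),            # back to A
    TemporalUnit(K), TemporalUnit(P), TemporalUnit(K),
]
plan = plan_sequence_headers(stream, codec_private="A")
```

In `plan`, the first A is dropped, B stays where it was and is repeated
on keyframes #4 to #6, the second A stays where it was, and keyframes #7
and following carry no in-band header.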

9. Imagine that future AV1 encoders would find out that changing the
sequence header (with a change of the CVS) enables better compression.
What would we do in this case? Simply lift the restriction of one track
= one CVS even when this means that some players that can't cope with
changing coded video sequences won't know in advance whether they can
play the files at all? I'm asking because we don't include a version
field in the CodecPrivate (in contrast to how it is done in mp4 with avcC).

Steve Lhomme:
> Let me know what you think so we can settle this spec for good.
>
I think we are not even close to settling this for good.


- Andreas Rheinhardt