Re: [Cellar] AV1 mapping update

Andreas Rheinhardt <andreas.rheinhardt@googlemail.com> Sun, 15 July 2018 17:28 UTC

To: cellar@ietf.org
References: <CAOXsMFKHo6RS+q8KCXKoKCiBBS9pVqs92wsLgSfXZO+DT3dStQ@mail.gmail.com> <ca0f009e-a245-fcd6-95f8-f051736c9161@googlemail.com> <CAOXsMFL5-MaHQaAOyh7jSFUpCNbSEvAWKmAHcepaF+QsQuYbHw@mail.gmail.com> <fee747da-77ca-9282-a4c3-c112fd746507@googlemail.com> <CAOXsMFJtc9pq+PphRb5kF9Mp4jyS5j3LQi6vQQmHRyTDYWyQ-A@mail.gmail.com> <b8486fa4-132b-f814-7046-91efb0a48ec6@googlemail.com> <10d56d2a-3053-6069-7805-54bb4fd1d4e0@xiph.org>
From: Andreas Rheinhardt <andreas.rheinhardt@googlemail.com>
Message-ID: <b8c8236c-5dd7-5511-6994-730ea0aaef84@googlemail.com>
Date: Sun, 15 Jul 2018 17:27:00 +0000
MIME-Version: 1.0
In-Reply-To: <10d56d2a-3053-6069-7805-54bb4fd1d4e0@xiph.org>
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: 7bit
Archived-At: <https://mailarchive.ietf.org/arch/msg/cellar/rF8_q12XdTLbW0zFQ4N3u2uXif4>
Precedence: list

Hello,

Timothy B. Terriberry:
> Andreas Rheinhardt wrote:
>> "The presentation times of AV1 samples are given by the Matroska
>> container. The [timing_info_present_flag] in the `Sequence Header OBU`
>> (in the `CodecPrivate` or in the bitstream) SHOULD be set to 0. If set
> 
> I'll ask the usual question: when is it reasonable to violate this SHOULD?
> 
Every timestamp in Matroska is a multiple of 1 ns (this number is
hard-coded), so NTSC timings can't be represented exactly at the
container level. E.g. the default duration field for 24/1.001 fps
content would indicate 41708333 ns, although the exact value is
41708333.33333333... ns. Preserving the bitstream information therefore
makes it possible to increase the certainty that 41708333 ns actually
means 24/1.001 fps.
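
To make the arithmetic concrete, here is a minimal Python sketch (not
part of the proposed mapping text; the helper name default_duration_ns
is made up for illustration). It computes the exact frame duration for
24/1.001 fps as a rational number and shows the fraction of a
nanosecond that is lost when the value is rounded to whole nanoseconds:

```python
from fractions import Fraction

NS_PER_SECOND = 1_000_000_000


def default_duration_ns(fps):
    """Return (duration rounded to whole ns, exact duration in ns).

    `fps` is a Fraction; the exact duration stays rational so the
    rounding error introduced by the 1 ns granularity is visible.
    """
    exact = NS_PER_SECOND / fps  # int / Fraction gives a Fraction
    return round(exact), exact


ntsc_film = Fraction(24000, 1001)  # 24/1.001 fps
rounded, exact = default_duration_ns(ntsc_film)
error_per_frame = exact - rounded  # what the container cannot express

print(rounded, exact, error_per_frame)
```

The rounded value alone does not say whether the source was exactly
24/1.001 fps or merely something close to it, which is the information
the bitstream timing fields would preserve.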

>> width as [max_frame_width_minus_1]+1 (similar for height). But what I'd
>> like to know is how much of AV1 will probably be excluded from Matroska
>> by these requirements? Could we contact some of the codec designers and
>> ask them about their opinions on this? (Honestly, "we" probably means
> 
> Hi, I am one of the codec designers. I think the expectation is that
> DisplayWidth and DisplayHeight will remain constant for a coded video
> sequence (and these should match [max_frame_width_minus_1] + 1 and
> [max_frame_height_minus_1] + 1). If the output dimensions of a frame
> change, it should be scaled to the display resolution before being
> displayed. Spatial scalability depends on this (at least if you ever
> intend to drop some of the layers, and if you don't then there is no
> point to using scalability). Even without spatial scalability, it is
> sometimes desirable to reduce the coded resolution to maintain quality
> while fitting within the instantaneous channel bandwidth. That mostly
> applies to live/low-latency streams, but I think an important use case
> for Matroska is to be able to capture such streams for long-term storage
> (for example, via WebRTC and the MediaStream Recording API in web
> browsers).
>
>> [operating_parameters_info]. Furthermore the dimensions of all output
>> frames MUST be equal."
> 
> As such, I disagree with this part (as an individual).
> 
a) I don't like restrictions on the content that can be put into
Matroska either. I put it in because this is (in my understanding)
currently an absolute requirement of Matroska (`PixelWidth` and
`PixelHeight` are mandatory and valid for every frame). Given that
decoders will have to support these varying output dimensions anyway
(it is a mandatory part of AV1, isn't it?), it would be possible to drop
this requirement, but for this the specifications of both Matroska and
WebM would have to be changed. I'm not sure if everyone is OK with this.

b) Just to be sure: "Normally" the size of an output frame is given by
[max_frame_height_minus_1]+1 and [max_frame_width_minus_1]+1, although
part 7.18 that deals with the output process says that the width is
given by `UpscaledWidth`. The reason for this is that for intra frames,
key frames and some inter frames the initial `FrameWidth` (directly
inferred from [max_frame_width_minus_1] or [frame_width_minus_1]) is
saved to `UpscaledWidth` and then `FrameWidth` gets overwritten with a
smaller value (that is the value that is used during much of the
decoding process, but only internally, so that we can ignore it). Is
that correct? (I have to admit I thought for a time that the output
dimensions also depend on the superres parameters and that they can be
even bigger than [max_frame_height/width_minus_1]+1.)
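
For reference, my reading of the superres computation can be sketched
as follows. This is only an illustration in Python: the constants
SUPERRES_NUM = 8 and SUPERRES_DENOM_MIN = 9 and the rounding formula are
quoted from memory of superres_params() in the AV1 specification and
should be checked against it, and the function name superres_widths is
hypothetical.

```python
# Constants as I remember them from the AV1 specification (assumption).
SUPERRES_NUM = 8
SUPERRES_DENOM_MIN = 9


def superres_widths(frame_width, use_superres, coded_denom=0):
    """Return (UpscaledWidth, FrameWidth) after the superres step.

    `frame_width` is the width signalled for the frame, i.e. derived
    from [max_frame_width_minus_1] or [frame_width_minus_1].
    """
    if use_superres:
        superres_denom = coded_denom + SUPERRES_DENOM_MIN
    else:
        superres_denom = SUPERRES_NUM  # ratio 8/8, width unchanged
    upscaled_width = frame_width  # saved first: this is the output width
    # FrameWidth is then overwritten with the (smaller) coded width.
    coded_width = (upscaled_width * SUPERRES_NUM
                   + superres_denom // 2) // superres_denom
    return upscaled_width, coded_width


print(superres_widths(1920, use_superres=False))
print(superres_widths(1920, use_superres=True, coded_denom=7))
```

Under these assumptions the denominator is never smaller than the
numerator, so the internal width can only shrink and the output width
never exceeds the signalled one.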

>> "A SimpleBlock MUST be marked as a Keyframe only if the first Frame OBU
>> in the Block has a [frame_type] of KEY_FRAME and the SimpleBlock
>> contains a Sequence Header OBU or if the Sequence Header OBU is
>> correctly omitted (see above)."
> 
> This is mathematically correct (assuming you really mean "only if" and
> not "if and only if"), but I think it could be easily misread. Perhaps
> "MUST NOT... unless..."?
> 
I agree that this should be changed and your proposal is certainly an
improvement. However, this is not the only thing in this definition that
needs to be looked at (see 4. below).
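
One way to read the "MUST NOT ... unless ..." form is as a necessary
condition on the Keyframe flag. A minimal Python sketch, assuming the
condition parses as "the first Frame OBU is a KEY_FRAME, and the
Sequence Header OBU is either present or correctly omitted" (the
function and parameter names are hypothetical and not taken from any
muxer):

```python
from enum import IntEnum


class FrameType(IntEnum):
    """AV1 frame_type values."""
    KEY_FRAME = 0
    INTER_FRAME = 1
    INTRA_ONLY_FRAME = 2
    SWITCH_FRAME = 3


def may_mark_keyframe(first_frame_type,
                      has_sequence_header,
                      sequence_header_correctly_omitted):
    """Return True if the SimpleBlock is allowed to carry the Keyframe flag.

    This is only the "unless" part of the rule: a False result means
    the flag MUST NOT be set, a True result does not force it to be set.
    """
    return (first_frame_type == FrameType.KEY_FRAME
            and (has_sequence_header or sequence_header_correctly_omitted))


print(may_mark_keyframe(FrameType.KEY_FRAME, True, False))
print(may_mark_keyframe(FrameType.INTER_FRAME, True, False))
```

Writing it this way makes the one-directional nature of the rule
explicit, which is the point the "only if" wording could obscure.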

>> Seeking to a non-RAP is undefined and not recommended.
> 
> RECOMMENDED is an RFC 2119 keyword (but "NOT RECOMMENDED" is not
> explicitly listed as one). You might want to rephrase to avoid confusion.
> 
OK: "Seeking to a non-RAP is undefined and discouraged." (We also need
to define explicitly what a RAP is.)

And because you are one of the AV1 designers, here are a few questions:

1. Section 7.5 includes the following:
"If scalability is not being used (OperatingPointIdc equal to 0), then
all frames are part of the operating point. The following constraints
must hold:

    The first frame header must have frame_type equal to KEY_FRAME and
    show_frame equal to 1.

    Each temporal unit must have exactly one shown frame.

If scalability is being used (OperatingPointIdc not equal to 0), then
only a subset of frames are part of the operating point. For each
operating point, the following constraints must hold:

    The first frame header that will be decoded must have frame_type
    equal to KEY_FRAME and show_frame equal to 1.

    Every layer that has a coded frame in a temporal unit must have
    exactly one shown frame that is the last frame of that layer in the
    temporal unit."

I am wondering whether the last two requirements apply only if
scalability is being used (OperatingPointIdc not equal to 0) or in any
case. The reason I ask is that, if scalability is not being used, there
is no restriction that the shown frame be the last frame in its temporal
unit. I don't see a reason why there should be another frame afterwards
in the same temporal unit, but is it actually allowed?
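
To illustrate the gap I mean, here is a small Python sketch of the
non-scalable constraint as quoted. The FrameHeader class and
check_temporal_unit are hypothetical helpers, and treating a frame
header with show_existing_frame equal to 1 as a "shown frame" is my
assumption:

```python
from dataclasses import dataclass


@dataclass
class FrameHeader:
    """Only the two flags relevant here; everything else is omitted."""
    show_frame: bool = False
    show_existing_frame: bool = False

    @property
    def shown(self):
        # Assumption: either flag makes this a "shown frame".
        return self.show_frame or self.show_existing_frame


def check_temporal_unit(headers):
    """Return (exactly_one_shown, shown_is_last) for one temporal unit.

    The first value is the constraint quoted for the non-scalable case;
    the second is the extra property that the quoted text only demands
    when scalability is being used.
    """
    shown_positions = [i for i, h in enumerate(headers) if h.shown]
    exactly_one_shown = len(shown_positions) == 1
    shown_is_last = (exactly_one_shown
                     and shown_positions[0] == len(headers) - 1)
    return exactly_one_shown, shown_is_last


# A shown frame followed by a hidden frame in the same temporal unit:
print(check_temporal_unit([FrameHeader(show_frame=True), FrameHeader()]))
```

A temporal unit like the one in the example satisfies the first check
but not the second, and the question is whether such a unit is legal.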

2. Would it be legal to code a frame with [show_frame] 0,
[showable_frame] 1 followed (perhaps immediately) by a frame header in
the same temporal unit that has [show_existing_frame] set to 1 and
outputs the frame just coded? If yes, is there any reason to ever use
it? I don't see any. According to the definitions, this temporal unit
contains both a delayed random access point and a key frame dependent
recovery point, and according to 7.6.3 a decoder is not required to be
able to handle such a situation for random access. But obviously it is
as good as a key frame random access point.

3. Would it be legal to have a temporal unit that contains a frame that
is not shown (but might be declared as showable) followed by a keyframe
with [show_frame] equal to 1? I'm asking because both the current
proposal and the MP4/ISOBMFF definition of sync sample require the
keyframe to be the first frame in the temporal unit.

4. You have only criticized the "MUST ... only if" wording of the
definition of keyframe SimpleBlocks. You said nothing about the
requirement that it is the first frame that needs to be of type
KEY_FRAME. The standard doesn't require this and I fail to see why it
should be needed. Moreover, in contrast to 2. and 3., there seems to be
a use case for this: Think about a situation where all reference frames
available before/at the start of the temporal unit that contains a
delayed RAP constitute a very good set of reference frames for a frame
that is to be shown between the output frame of the temporal unit that
contains the delayed RAP and the output frame of the temporal unit that
contains the dependent keyframe recovery point, but where the
combination of the dependent RAP frame + all but one of the reference
frames available at the start of the temporal unit that contains the
delayed RAP frame constitutes a worse set of reference frames
(regardless of which of the initial reference frames gets overwritten by
the delayed RAP frame). Then it may make sense to first code this frame
as a showable frame using all the reference frames and then code the
delayed RAP frame, followed by the frame that is actually output for the
temporal unit containing the delayed RAP. Is my use case sound, and
should the "first" requirement be dropped?

5. What do you think of the restriction of a Matroska track to (a subset
of) a CVS?

- Andreas Rheinhardt
