Re: [AVTCORE] Benjamin Kaduk's Discuss on draft-ietf-payload-vp9-13: (with DISCUSS and COMMENT)

Benjamin Kaduk <kaduk@mit.edu> Thu, 03 June 2021 22:29 UTC

Return-Path: <kaduk@mit.edu>
X-Original-To: avt@ietfa.amsl.com
Delivered-To: avt@ietfa.amsl.com
Received: from localhost (localhost [127.0.0.1]) by ietfa.amsl.com (Postfix) with ESMTP id B483A3A1D00; Thu, 3 Jun 2021 15:29:42 -0700 (PDT)
X-Virus-Scanned: amavisd-new at amsl.com
X-Spam-Flag: NO
X-Spam-Score: -1.894
X-Spam-Level:
X-Spam-Status: No, score=-1.894 tagged_above=-999 required=5 tests=[BAYES_00=-1.9, RCVD_IN_DNSWL_BLOCKED=0.001, RCVD_IN_MSPIKE_H4=0.001, RCVD_IN_MSPIKE_WL=0.001, SPF_HELO_NONE=0.001, SPF_NONE=0.001, URIBL_BLOCKED=0.001] autolearn=ham autolearn_force=no
Received: from mail.ietf.org ([4.31.198.44]) by localhost (ietfa.amsl.com [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id yLRz1uyBtAHX; Thu, 3 Jun 2021 15:29:40 -0700 (PDT)
Received: from outgoing.mit.edu (outgoing-auth-1.mit.edu [18.9.28.11]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by ietfa.amsl.com (Postfix) with ESMTPS id A16653A1CFE; Thu, 3 Jun 2021 15:29:39 -0700 (PDT)
Received: from kduck.mit.edu ([24.16.140.251]) (authenticated bits=56) (User authenticated as kaduk@ATHENA.MIT.EDU) by outgoing.mit.edu (8.14.7/8.12.4) with ESMTP id 153MTUrG009393 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT); Thu, 3 Jun 2021 18:29:35 -0400
Date: Thu, 3 Jun 2021 15:29:30 -0700
From: Benjamin Kaduk <kaduk@mit.edu>
To: Jonathan Lennox <jonathan.lennox@8x8.com>
Cc: The IESG <iesg@ietf.org>, draft-ietf-payload-vp9@ietf.org, avtcore-chairs@ietf.org, avt@ietf.org, bernard.aboba@gmail.com
Message-ID: <20210603222930.GN32395@kduck.mit.edu>
References: <162270312471.25253.15642596825639278144@ietfa.amsl.com> <54407DE0-2D67-4221-BB52-707393B009AE@8x8.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Disposition: inline
Content-Transfer-Encoding: 8bit
In-Reply-To: <54407DE0-2D67-4221-BB52-707393B009AE@8x8.com>
Archived-At: <https://mailarchive.ietf.org/arch/msg/avt/B32Cv_WGeqzv7hwsC0q8p8UBdCg>
Subject: Re: [AVTCORE] Benjamin Kaduk's Discuss on draft-ietf-payload-vp9-13: (with DISCUSS and COMMENT)
X-BeenThere: avt@ietf.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: Audio/Video Transport Core Maintenance <avt.ietf.org>
List-Unsubscribe: <https://www.ietf.org/mailman/options/avt>, <mailto:avt-request@ietf.org?subject=unsubscribe>
List-Archive: <https://mailarchive.ietf.org/arch/browse/avt/>
List-Post: <mailto:avt@ietf.org>
List-Help: <mailto:avt-request@ietf.org?subject=help>
List-Subscribe: <https://www.ietf.org/mailman/listinfo/avt>, <mailto:avt-request@ietf.org?subject=subscribe>
X-List-Received-Date: Thu, 03 Jun 2021 22:29:43 -0000

Hi Jonathan,

Sparse inline replies, as most of the comments/resolutions are
straightforward...

On Thu, Jun 03, 2021 at 05:14:12PM -0400, Jonathan Lennox wrote:
> 
> 
> > On Jun 3, 2021, at 2:52 AM, Benjamin Kaduk via Datatracker <noreply@ietf.org> wrote:
> > 
> > Benjamin Kaduk has entered the following ballot position for
> > draft-ietf-payload-vp9-13: Discuss
> > 
> > ----------------------------------------------------------------------
> > DISCUSS:
> > ----------------------------------------------------------------------
> > 
> > Hopefully trivial to resolve, but in Table 3 where we claim to reproduce
> > the capabilities of coding profiles, defined in section 7.2 of
> > [VP9-BITSTREAM], I do not think we did so faithfully.  In particular, in
> > the last line, our table has:
> > 
> >   +---------+-----------+-----------------+--------------------------+
> >   |    3    |  10 or 12 |       Yes       | YUV 4:2:0,4:4:0 or 4:4:4 |
> >   +---------+-----------+-----------------+--------------------------+
> > 
> > but I'm seeing 4:2:2 (not 4:2:0) in [VP9-BITSTREAM].
> 
> Indeed, Zahed also pointed this out.  Thanks! 

and did so more comprehensively than I (the best I can come up with for why
I only saw the one instance was "I was working late, and tired", which is
hardly much of an explanation).
I cleared based on the -14 (but left a few nit-level remarks, as you
probably saw).

> > ----------------------------------------------------------------------
> > COMMENT:
> > ----------------------------------------------------------------------
> > 
> > Section 3
> > 
> >   Layers are designed (and MUST be encoded) such that if any layer, and
> >   all higher layers, are removed from the bitstream along either of the
> >   two dimensions, the remaining bitstream is still correctly decodable.
> > 
> > Just to check my understanding: the "two dimensions" here are "temporal"
> > and "spatial"?  ("dimensions" can of course also refer to the x and y
> > coordinates of a image, but that doesn't seem to make sense here.)
> 
> Yes. I’ve changed the text to make this explicit.
> 
> >   and helps it understand the temporal layer structure.  Since this is
> >   signaled in each packet it makes it possible to have very flexible
> >   temporal layer hierarchies and patterns which are changing
> >   dynamically.
> > 
> > I'm not sure what type of "patterns" are being referred to here.  (The
> > word "pattern" does not appear in the VP9 spec.)
> 
> Patterns of scalability structures.  I’ve changed it to just say “scalability structures which are changing dynamically”.
> 
> >   (Note: A "Picture Group", as used in this document, is not the same
> >   thing as a the term "Group of Pictures" as it is traditionally used
> >   in video coding, i.e. to mean an independently-decoadable run of
> >   pictures beginning with a keyframe.)
> > 
> > Please give a clear definition for how "Picture Group" is used by this
> > document and follow that with the note about differing from "Group of
> > Pictures".  That said, we only seem to use the term a handful of
> > times...
> 
> I changed the previous paragraph (which introduced Picture Group) to say: 
>       In non-flexible mode, frames are encoded using a fixed, recurring pattern of dependencies;                   
>       the set of pictures that recur in this pattern is known as a Picture Group (PG).                                
>       In this mode, …
> 
> Does that clarify things?

Yes, that helps a lot!

> > Section 4.1
> > 
> >     |                               : VP9 pyld hdr  |               |
> >     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+               |
> >     |                                                               |
> >     +                                                               |
> >     :                   Bytes 2..N of VP9 payload                   :
> >     |                                                               |
> >     |                               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
> > 
> > Is the "2" in "2..N" because byte 1 is on the previous line?  since
> > there's not a horizontal boundary in this version of the figure, can we
> > just say "VP9 payload" here?
> 
> Yes, thanks.  (This was a relic of the fact that this draft was originally based on the VP8 payload spec, where the first byte of the payload is actually described in the document.).

Ah, that makes sense for how it got this way.  (I did look far enough into
[VP9-BISTREAM] to tell that there is at least sometimes more than one byte
of header for the frame, and thus that "VP9 pyld hdr" was not going to be
the 1 byte indicated here.  But Éric's point was good as well, and your
resolution seems fine to me.)

> This will also resolve Éric’s issue that “VP9 dyld hdr" isn’t defined anywhere.
> 
> > 
> > Section 4.2
> > 
> >   I:  Picture ID (PID) present.  When set to one, the OPTIONAL PID MUST
> >      be present after the mandatory first octet and specified as below.
> >      Otherwise, PID MUST NOT be present.  If the SS field was present
> >      in the stream's most recent start of a keyframe (i.e., non-
> >      flexible scalability mode is in use), then the PID MUST also be
> >      present in every packet.
> > 
> > (I assume that the "SS field was present" condition is not a route to
> > ignoring the I bit but rather a constraint on when the I bit is set.  I
> > would hope that this is sufficiently obvious to go without saying…)\
> 
> Just because it’s obvious doesn’t mean it’s not worth saying; I changed it to end “then this bit MUST be set on every packet”.
> 
> >   Picture ID (PID):  Picture ID represented in 7 or 15 bits, depending
> >      on the M bit.  This is a running index of the pictures.  The field
> >      MUST be present if the I bit is equal to one.  If M is set to
> >      zero, 7 bits carry the PID; else if M is set to one, 15 bits carry
> >      the PID in network byte order.  The sender may choose between a 7-
> >      or 15-bit index.  The PID SHOULD start on a random number, and
> >      MUST wrap after reaching the maximum ID (0x7f or 0x7fff depending
> >      on the index size chosen).  The receiver MUST NOT assume that the
> >      number of bits in PID stay the same through the session.
> > 
> > There's perhaps an edge case here where the PID goes from taking 15 bits
> > to taking 7 and then taking 15 again in the same session.  On the 7->15
> > transition, are the endpoints required to preserve the full 15-bit
> > state/phase from the previous incarnation?  That is, must the local
> > representation always have at least 15 bits and track wraparounds of the
> > 7-bit counter if used on the wire?
> 
> No, each transition is a truncation (15->7) or a zero-extension (7->15). I’ve added text to clarify this.

Thanks.  The edge cases here seem like they are just "weird", but not going
to lead to interoperability issues.

> >      P_DIFF:  The reference index (in 7 bits) specified as the relative
> >         PID from the current picture.  For example, when P_DIFF=3 on a
> >         packet containing the picture with PID 112 means that the
> >         picture refers back to the picture with PID 109.  This
> > 
> > Just to check: is a P_DIFF of zero invalid or interpreted as "subtract
> > 256"?
> 
> I hadn’t thought about this, but invalid is probably easier to implement.

I originally wrote "I assume it's invalid" but then came up with the other
plausible interpretation.  Invalid does seem easier to implement.

> >      G set to 0 or N_G set to 0 indicates that either there is only one
> >      temporal layer or no fixed inter-picture dependency information is
> >      present going forward in the bitstream.
> > 
> > These map up to each other, right -- G==0 iff only one temporal layer,
> > and N_G==0 iff no *fixed* inter-picture dependency?  The current wording
> > with "A or B indicates X or Y" is a bit ambiguous about what is implied
> > by what.
> 
> No; the distinction is whether future pictures are flexible or non-flexible mode. I’ve clarified this.  G 0 and N_G 0 are semantically the same.

Ah, the clarification helps a lot.

> > Section 5.1
> > 
> >   to the sender.  The message body (i.e., the "native RPSI bit string"
> >   in [RFC4585]) is simply the PictureID of the received frame.
> > 
> > Does it matter if the PictureID uses the 7-bit or 15-bit form?
> 
> It should be the form that was in the frame that was received.
> 
> > (Also, we spelled "Picture ID" with a space in §4.2.)
> Thanks, fixed.
> 
> > https://datatracker.ietf.org/doc/html/rfc4585#section-6.3.3.2 seems to
> > confirm that a non-multiple-of-8 bit count is fine in this field.
> > 
> >   Note: because all frames of the same picture must have the same
> >   inter-picture reference structure, there is no need for a message to
> >   specify which frame is being selected.
> > 
> > This note is a little confusing to me, since the previons discussion is
> > only about having received an entire *frame*, but sending the Picture ID
> > would seem to acknowledge the entire *picture*, possibly having not
> > received some of the component frames.
> 
> Other than the discussion of “golden” and “altref" frames, the rest of the discussion has to do with pictures - do you think that text needs clarifying for the spatial scalability case?

I don't really expect any practical consequences from leaving this text as
is (with no additional clarification), since this is just attempting to
explain why a given design choice is safe.  I think I am still a bit
confused about the situation, but am willing to just let it go.

> > 
> > Section 5.3
> > 
> >   referenced.  Therefore it's recommended for both the flexible and the
> >   non-flexible mode that, when upgrade frames are being encoded in
> > 
> > Where is "upgrade frame" defined?  (In §4.2 for the U bit we talk only
> > about the "switching up point".)
> 
> Changed to switching-up points.
> 
> > 
> > Section 8
> > 
> > Is there anything interesting to say about missing/incorrect
> > begin/end-of-frame markers (that might diverge in the RTP payload
> > descriptor from the actual encoded bitstream)?
> 
> I think Roman’s suggested text about malicious or malformed payloads should cover this case as well.

Sure, that is good text.

> > NITS
> > 
> > Section 4.2
> > 
> >   V:  Scalability structure (SS) data present.  When set to one, the
> > 
> > Is there a mnemonic for how 'V' got its name?
> 
> Not that I recall, sadly; I think we were just running short on letters by this point.
> 
> >   Z:  Not a reference frame for upper spatial layers.  If set to 1,
> >      indicates that frames with higher spatial layers SID+1 of the
> >      current and following pictures do not depend on the current
> > 
> > Something about the way this is written makes me want to read "layers"
> > as indicating "and higher layers", not just the immediate SID+1.  Maybe
> > "frame with the next higher spatial layer SID+1"?
> 
> No, the “and higher layers” reading is correct.  I’ve changed it to “SID+1 and greater”.

Oh, I had assumed it was just SID+1 since I remembered reading something
about the RTP payload only being usable for VP9 streams that make spatial
reference to the next lowest spatial stream.  I am sure you are more on top
of this than me, though!

> >      Note that for a given picture, all frames follow the same inter-
> >      picture dependency structure.  However, the frame rate of each
> >      spatial layer can be different from each other and this can be
> >      controlled with the use of the D bit described above.  The
> > 
> > "controlled" may not be quite the right word here, as in order to have
> > higher-frame rate at a given (non-base) layer, the "off" frames have to
> > use a 'D' of zero, but the converse is not necessarily true.
> 
> Is “described” better?

Yes, that is better, or maybe "indicated".

> >   In a scalable stream sent with a fixed pattern, the SS data SHOULD be
> >   included in the first packet of every key frame.  This is a packet
> >   with P bit equal to zero, SID or D bit equal to zero, and B bit equal
> > 
> > (If SID is zero, D is also zero, so from an information-theoretic (but
> > not human usability) point of view, we could just say "D bit equal to
> > zero".)
> 
> No, you have the information-theoretic description backwards; SID=0 implies D=0, but not the other way around.

(I meant to say SID=0 implies D=0, whether that's what actually came out is
debatable.)

> Actually, I think this points out an actual error; SID=1 and D=0 is a valid packet, but would not be the first packet of a key frame.  I think “SID or L bit equal to 0” would be correct, i.e. either no layer information is present, or it’s spatial layer 0.

(Looking for where we define "key frame" I discover that we mix "key frame"
and "keyframe" seemingly interchangably.)  I think I managed to convince
myself last night that a frame with SID greater than zero could still
qualify as a key frame.  I'm having a hard time recovering why I thought
that, now, though.  What you describe with SID=0 or L bit 0 makes sense
intuitively, though.

> > Setion 4.4
> > 
> >   legitimately be removed from the stream.  Thus, a frame that follows
> >   a removable frame (in full decode order) MUST be encoded with
> >   "error_resilient_mode" set to true.
> > 
> > This is the only instance of the word "removable" in the document, so
> > I'd suggest using a phrasing that does not imply a defined term, like "a
> > frame that directly follows a frame that might be removed".
> 
> Ok.
> 
> > 
> > Section 5.3
> > 
> >   Identification of a layer refresh frame can be derived from the
> >   reference IDs of each frame by backtracking the dependency chain
> >   until reaching a point where only decodable frames are being
> >   referenced.  [...]
> > 
> > This description leaves me a bit unclear on what exactly the receiver
> > concludes is a layer refresh frame.  (What dependencies could there be
> > from the frame sent in response to LRR?)
> 
> As in the example, if the LRR was a request to upgrade from {1,0} to {2,1}, the layer refresh frame can depend on {0,0} and {1,0}, because the request indicated that the receiver already has that decode state.

I think that part of my confusion is that I'm not even sure whether it's
the encoder or decoder that needs to identify the layer refresh frame or
why it's a useful thing to do.  You don't need to spend a bunch of time
educating me here, though, so I can just suggest that maybe something like
"known to be available to the decoder" instead of "decodable" would map
more directly to the example.

Thanks for all the updates and explanations; they were really helpful.

-Ben

> >   response to a LRR, those packets should contain layer indices and the
> >   reference fields so that the decoder or an MCU can make this
> >   derivation.
> > 
> > Maybe "field(s)" since only one P_DIFF is an allowed state?
> 
> Ok.
> 
> > 
> > Section 8
> > 
> >   RTP packets using the payload format defined in this specification
> >   are subject to the security considerations discussed in the RTP
> >   specification [RFC3550], and in any applicable RTP profile such as
> >   RTP/AVP [RFC3551], RTP/AVPF [RFC4585], RTP/SAVP [RFC3711], or RTP/
> >   SAVPF [RFC5124].  SAVPF [RFC5124].  However, as "Securing the RTP
> > 
> > duplicate "SAVPF [RFC5124]"
> 
> Thanks.
>