[AVTCORE] Benjamin Kaduk's Discuss on draft-ietf-payload-vp9-13: (with DISCUSS and COMMENT)

Benjamin Kaduk via Datatracker <noreply@ietf.org> Thu, 03 June 2021 06:52 UTC

Return-Path: <noreply@ietf.org>
X-Original-To: avt@ietf.org
Delivered-To: avt@ietfa.amsl.com
Received: from ietfa.amsl.com (localhost [IPv6:::1]) by ietfa.amsl.com (Postfix) with ESMTP id 330113A2CE9; Wed, 2 Jun 2021 23:52:05 -0700 (PDT)
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: 8bit
From: Benjamin Kaduk via Datatracker <noreply@ietf.org>
To: The IESG <iesg@ietf.org>
Cc: draft-ietf-payload-vp9@ietf.org, avtcore-chairs@ietf.org, avt@ietf.org, bernard.aboba@gmail.com, bernard.aboba@gmail.com
X-Test-IDTracker: no
X-IETF-IDTracker: 7.30.0
Auto-Submitted: auto-generated
Precedence: bulk
Reply-To: Benjamin Kaduk <kaduk@mit.edu>
Message-ID: <162270312471.25253.15642596825639278144@ietfa.amsl.com>
Date: Wed, 02 Jun 2021 23:52:05 -0700
Archived-At: <https://mailarchive.ietf.org/arch/msg/avt/M6LC329uufO-WattzWq6BGcJltg>
Subject: [AVTCORE] Benjamin Kaduk's Discuss on draft-ietf-payload-vp9-13: (with DISCUSS and COMMENT)
X-BeenThere: avt@ietf.org
X-Mailman-Version: 2.1.29
List-Id: Audio/Video Transport Core Maintenance <avt.ietf.org>
List-Unsubscribe: <https://www.ietf.org/mailman/options/avt>, <mailto:avt-request@ietf.org?subject=unsubscribe>
List-Archive: <https://mailarchive.ietf.org/arch/browse/avt/>
List-Post: <mailto:avt@ietf.org>
List-Help: <mailto:avt-request@ietf.org?subject=help>
List-Subscribe: <https://www.ietf.org/mailman/listinfo/avt>, <mailto:avt-request@ietf.org?subject=subscribe>
X-List-Received-Date: Thu, 03 Jun 2021 06:52:06 -0000

Benjamin Kaduk has entered the following ballot position for
draft-ietf-payload-vp9-13: Discuss

When responding, please keep the subject line intact and reply to all
email addresses included in the To and CC lines. (Feel free to cut this
introductory paragraph, however.)


Please refer to https://www.ietf.org/iesg/statement/discuss-criteria.html
for more information about DISCUSS and COMMENT positions.


The document, along with other ballot positions, can be found here:
https://datatracker.ietf.org/doc/draft-ietf-payload-vp9/



----------------------------------------------------------------------
DISCUSS:
----------------------------------------------------------------------

Hopefully trivial to resolve, but in Table 3 where we claim to reproduce
the capabilities of coding profiles, defined in section 7.2 of
[VP9-BITSTREAM], I do not think we did so faithfully.  In particular, in
the last line, our table has:

   +---------+-----------+-----------------+--------------------------+
   |    3    |  10 or 12 |       Yes       | YUV 4:2:0,4:4:0 or 4:4:4 |
   +---------+-----------+-----------------+--------------------------+

but I'm seeing 4:2:2 (not 4:2:0) in [VP9-BITSTREAM].


----------------------------------------------------------------------
COMMENT:
----------------------------------------------------------------------

Section 3

   Layers are designed (and MUST be encoded) such that if any layer, and
   all higher layers, are removed from the bitstream along either of the
   two dimensions, the remaining bitstream is still correctly decodable.

Just to check my understanding: the "two dimensions" here are "temporal"
and "spatial"?  ("dimensions" can of course also refer to the x and y
coordinates of an image, but that doesn't seem to make sense here.)
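
A minimal sketch of the temporal/spatial reading of "two dimensions",
assuming each frame carries a temporal layer index and a spatial layer
index. The Frame type, its field names, and drop_upper_layers() are
hypothetical names chosen for illustration, not taken from the draft.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Frame:
    pid: int  # picture ID (illustrative)
    tid: int  # temporal layer index
    sid: int  # spatial layer index


def drop_upper_layers(frames, max_tid, max_sid):
    """Keep only frames at or below the given temporal and spatial layers."""
    return [f for f in frames if f.tid <= max_tid and f.sid <= max_sid]


# Three pictures, two spatial layers each; the middle picture is temporal layer 1.
frames = [
    Frame(0, 0, 0), Frame(0, 0, 1),
    Frame(1, 1, 0), Frame(1, 1, 1),
    Frame(2, 0, 0), Frame(2, 0, 1),
]

base_only = drop_upper_layers(frames, max_tid=0, max_sid=0)
low_res_full_rate = drop_upper_layers(frames, max_tid=1, max_sid=0)
```

Under this reading, the quoted requirement says that every such filtered
subset must remain decodable on its own.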

   and helps it understand the temporal layer structure.  Since this is
   signaled in each packet it makes it possible to have very flexible
   temporal layer hierarchies and patterns which are changing
   dynamically.

I'm not sure what type of "patterns" are being referred to here.  (The
word "pattern" does not appear in the VP9 spec.)

   (Note: A "Picture Group", as used in this document, is not the same
   thing as a the term "Group of Pictures" as it is traditionally used
   in video coding, i.e. to mean an independently-decoadable run of
   pictures beginning with a keyframe.)

Please give a clear definition for how "Picture Group" is used by this
document and follow that with the note about differing from "Group of
Pictures".  That said, we only seem to use the term a handful of
times...

Section 4.1

     |                               : VP9 pyld hdr  |               |
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+               |
     |                                                               |
     +                                                               |
     :                   Bytes 2..N of VP9 payload                   :
     |                                                               |
     |                               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

Is the "2" in "2..N" because byte 1 is on the previous line?  Since
there's not a horizontal boundary in this version of the figure, can we
just say "VP9 payload" here?

Section 4.2

   I:  Picture ID (PID) present.  When set to one, the OPTIONAL PID MUST
      be present after the mandatory first octet and specified as below.
      Otherwise, PID MUST NOT be present.  If the SS field was present
      in the stream's most recent start of a keyframe (i.e., non-
      flexible scalability mode is in use), then the PID MUST also be
      present in every packet.

(I assume that the "SS field was present" condition is not a route to
ignoring the I bit but rather a constraint on when the I bit is set.  I
would hope that this is sufficiently obvious to go without saying...)

   Picture ID (PID):  Picture ID represented in 7 or 15 bits, depending
      on the M bit.  This is a running index of the pictures.  The field
      MUST be present if the I bit is equal to one.  If M is set to
      zero, 7 bits carry the PID; else if M is set to one, 15 bits carry
      the PID in network byte order.  The sender may choose between a 7-
      or 15-bit index.  The PID SHOULD start on a random number, and
      MUST wrap after reaching the maximum ID (0x7f or 0x7fff depending
      on the index size chosen).  The receiver MUST NOT assume that the
      number of bits in PID stay the same through the session.

There's perhaps an edge case here where the PID goes from taking 15 bits
to taking 7 and then taking 15 again in the same session.  On the 7->15
transition, are the endpoints required to preserve the full 15-bit
state/phase from the previous incarnation?  That is, must the local
representation always have at least 15 bits and track wraparounds of the
7-bit counter if used on the wire?
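
To make the edge case concrete, here is a sketch of one possible receiver
behavior, not something the draft mandates. It assumes that M is the most
significant bit of the first PID octet, and (this is exactly the open
question) that a 7-bit PID on the wire is the low 7 bits of a counter the
receiver tracks locally in 15 bits. parse_pid() and extend_pid() are
hypothetical helper names.

```python
def parse_pid(buf, offset=0):
    """Return (pid, width_bits, bytes_consumed) for the PID field at buf[offset]."""
    first = buf[offset]
    if first & 0x80:  # M bit set: 15-bit PID in network byte order
        return ((first & 0x7F) << 8) | buf[offset + 1], 15, 2
    return first & 0x7F, 7, 1  # M bit clear: 7-bit PID


def extend_pid(prev15, pid, width_bits):
    """Map a wire PID onto 15-bit local state (assumed behavior, see above)."""
    if width_bits == 15:
        return pid
    # 7-bit PID: reuse the upper bits of the local state, then pick whichever
    # candidate (one 7-bit cycle back, same cycle, one cycle ahead) is
    # closest to the previous value modulo 2**15.
    base = prev15 & ~0x7F
    candidates = [(base + pid + delta) & 0x7FFF for delta in (-0x80, 0, 0x80)]

    def distance(c):
        return min((c - prev15) & 0x7FFF, (prev15 - c) & 0x7FFF)

    return min(candidates, key=distance)
```

If the draft instead intends the 7-bit and 15-bit counters to be unrelated,
extend_pid() has no meaningful answer at the 7->15 transition, which is the
ambiguity raised above.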

      P_DIFF:  The reference index (in 7 bits) specified as the relative
         PID from the current picture.  For example, when P_DIFF=3 on a
         packet containing the picture with PID 112 means that the
         picture refers back to the picture with PID 109.  This

Just to check: is a P_DIFF of zero invalid or interpreted as "subtract
128" (the field being 7 bits wide)?
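
The quoted arithmetic can be sketched as below, assuming the subtraction
wraps at the PID width currently in use. referenced_pid() is a hypothetical
helper, and rejecting a P_DIFF of zero is only one of the two possible
readings, chosen here so the sketch does not silently pick an
interpretation.

```python
def referenced_pid(pid, p_diff, width_bits):
    """Resolve a P_DIFF back-reference to a PID, wrapping at the PID width."""
    if width_bits not in (7, 15):
        raise ValueError("PID width must be 7 or 15 bits")
    if not 1 <= p_diff <= 0x7F:
        # P_DIFF is a 7-bit field; the meaning of 0 is the open question above.
        raise ValueError("P_DIFF must be in 1..127 in this sketch")
    return (pid - p_diff) % (1 << width_bits)
```

With the draft's own example, referenced_pid(112, 3, 7) gives 109.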

      G set to 0 or N_G set to 0 indicates that either there is only one
      temporal layer or no fixed inter-picture dependency information is
      present going forward in the bitstream.

These map onto each other, right -- G==0 iff only one temporal layer,
and N_G==0 iff no *fixed* inter-picture dependency?  The current wording
with "A or B indicates X or Y" is ambiguous about which condition implies
which outcome.

Section 5.1

   to the sender.  The message body (i.e., the "native RPSI bit string"
   in [RFC4585]) is simply the PictureID of the received frame.

Does it matter if the PictureID uses the 7-bit or 15-bit form?
(Also, we spelled "Picture ID" with a space in §4.2.)
https://datatracker.ietf.org/doc/html/rfc4585#section-6.3.3.2 seems to
confirm that a non-multiple-of-8 bit count is fine in this field.
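
As an illustration of why the bit count matters for padding, here is a
sketch that packs a PictureID into the RPSI FCI layout as I understand it
from RFC 4585 (8-bit PB padding count, one zero bit, 7-bit payload type,
the native bit string, then zero padding to a 32-bit boundary). The
function name rpsi_fci() and the payload type value used in the usage
note are assumptions for illustration only.

```python
def rpsi_fci(payload_type, pid, pid_bits):
    """Pack PB, payload type, and a 7- or 15-bit PID into RPSI FCI bytes."""
    if pid_bits not in (7, 15):
        raise ValueError("PID width must be 7 or 15 bits")
    if not 0 <= pid < (1 << pid_bits):
        raise ValueError("PID does not fit in the chosen width")
    if not 0 <= payload_type <= 0x7F:
        raise ValueError("payload type is a 7-bit value")

    used_bits = 16 + pid_bits      # PB (8) + zero bit (1) + PT (7) + PID
    pad_bits = -used_bits % 32     # zero bits needed to reach a 32-bit boundary
    header = (pad_bits << 8) | payload_type
    value = ((header << pid_bits) | pid) << pad_bits
    return value.to_bytes((used_bits + pad_bits) // 8, "big")
```

For a hypothetical payload type of 98, a 7-bit PID needs 9 padding bits
and a 15-bit PID needs 1, so both forms fit in a single 32-bit word but
are not interchangeable on the wire.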

   Note: because all frames of the same picture must have the same
   inter-picture reference structure, there is no need for a message to
   specify which frame is being selected.

This note is a little confusing to me, since the previous discussion is
only about having received an entire *frame*, but sending the Picture ID
would seem to acknowledge the entire *picture*, possibly having not
received some of the component frames.

Section 5.3

   referenced.  Therefore it's recommended for both the flexible and the
   non-flexible mode that, when upgrade frames are being encoded in

Where is "upgrade frame" defined?  (In §4.2 for the U bit we talk only
about the "switching up point".)

Section 8

Is there anything interesting to say about missing/incorrect
begin/end-of-frame markers (that might diverge in the RTP payload
descriptor from the actual encoded bitstream)?

NITS

Section 4.2

   V:  Scalability structure (SS) data present.  When set to one, the

Is there a mnemonic for how 'V' got its name?

   Z:  Not a reference frame for upper spatial layers.  If set to 1,
      indicates that frames with higher spatial layers SID+1 of the
      current and following pictures do not depend on the current

Something about the way this is written makes me want to read "layers"
as indicating "and higher layers", not just the immediate SID+1.  Maybe
"frame with the next higher spatial layer SID+1"?

      Note that for a given picture, all frames follow the same inter-
      picture dependency structure.  However, the frame rate of each
      spatial layer can be different from each other and this can be
      controlled with the use of the D bit described above.  The

"controlled" may not be quite the right word here, as in order to have
a higher frame rate at a given (non-base) layer, the "off" frames have to
use a 'D' of zero, but the converse is not necessarily true.

   In a scalable stream sent with a fixed pattern, the SS data SHOULD be
   included in the first packet of every key frame.  This is a packet
   with P bit equal to zero, SID or D bit equal to zero, and B bit equal

(If SID is zero, D is also zero, so from an information-theoretic (but
not human usability) point of view, we could just say "D bit equal to
zero".)

Section 4.4

   legitimately be removed from the stream.  Thus, a frame that follows
   a removable frame (in full decode order) MUST be encoded with
   "error_resilient_mode" set to true.

This is the only instance of the word "removable" in the document, so
I'd suggest using a phrasing that does not imply a defined term, like "a
frame that directly follows a frame that might be removed".

Section 5.3

   Identification of a layer refresh frame can be derived from the
   reference IDs of each frame by backtracking the dependency chain
   until reaching a point where only decodable frames are being
   referenced.  [...]

This description leaves me a bit unclear on what exactly the receiver
concludes is a layer refresh frame.  (What dependencies could there be
from the frame sent in response to LRR?)

   response to a LRR, those packets should contain layer indices and the
   reference fields so that the decoder or an MCU can make this
   derivation.

Maybe "field(s)" since only one P_DIFF is an allowed state?

Section 8

   RTP packets using the payload format defined in this specification
   are subject to the security considerations discussed in the RTP
   specification [RFC3550], and in any applicable RTP profile such as
   RTP/AVP [RFC3551], RTP/AVPF [RFC4585], RTP/SAVP [RFC3711], or RTP/
   SAVPF [RFC5124].  SAVPF [RFC5124].  However, as "Securing the RTP

duplicate "SAVPF [RFC5124]"