Re: [AVTCORE] Benjamin Kaduk's Discuss on draft-ietf-payload-vp9-13: (with DISCUSS and COMMENT)

Jonathan Lennox <> Thu, 03 June 2021 21:14 UTC

Return-Path: <>
Received: from localhost (localhost []) by (Postfix) with ESMTP id 084C93A1A7E for <>; Thu, 3 Jun 2021 14:14:22 -0700 (PDT)
X-Virus-Scanned: amavisd-new at
X-Spam-Flag: NO
X-Spam-Score: -2.097
X-Spam-Status: No, score=-2.097 tagged_above=-999 required=5 tests=[BAYES_00=-1.9, DKIM_SIGNED=0.1, DKIM_VALID=-0.1, DKIM_VALID_AU=-0.1, DKIM_VALID_EF=-0.1, HTML_MESSAGE=0.001, RCVD_IN_DNSWL_BLOCKED=0.001, SPF_HELO_NONE=0.001, SPF_PASS=-0.001, URIBL_BLOCKED=0.001] autolearn=unavailable autolearn_force=no
Authentication-Results: (amavisd-new); dkim=pass (1024-bit key)
Received: from ([]) by localhost ( []) (amavisd-new, port 10024) with ESMTP id J-JO-nL0UFlK for <>; Thu, 3 Jun 2021 14:14:16 -0700 (PDT)
Received: from ( [IPv6:2607:f8b0:4864:20::833]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by (Postfix) with ESMTPS id A065A3A1A86 for <>; Thu, 3 Jun 2021 14:14:16 -0700 (PDT)
Received: by with SMTP id z4so2431294qts.4 for <>; Thu, 03 Jun 2021 14:14:16 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;; s=googlemail; h=from:message-id:mime-version:subject:date:in-reply-to:cc:to :references; bh=sX0iLV48mIHN+UxG2UchhAptL9B4bhTzJ80ceZyWAfM=; b=llHdZ1jY0IOMgrCLyymPp5NU95RUVxFNgfbXPo6d30D2VDgWrXouIDgpp8U54o48QV /DkRTURT8JTopFvpJfoaWfIjY+s9CmAnCKKPXYX73NrNQkgXsu78sLVViUS5H/bXXr2A egdjEA60qtdYpCdBcKfAHWql/soL7Kvco8YiI=
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;; s=20161025; h=x-gm-message-state:from:message-id:mime-version:subject:date :in-reply-to:cc:to:references; bh=sX0iLV48mIHN+UxG2UchhAptL9B4bhTzJ80ceZyWAfM=; b=Avu5iiTMFtDf4DYnRGBlmPPgtle6qA8iGDQS0kKoQ/erEmf+5g9Cz/j16Zk20Gi0Vp nwcgoTTGLa8UHV2ur5KtPCEySx/7HkEaIwvICyl/RFcgmyGVErfkpJxwUexWRbLS/ZlV WgmTkOP3s4Hxa6bmJuNsJTft0DjBBBHn9ddaaXqAHWS/1AHpD/RbTh1SPwr7m9Ge3A3O dRF+XvHjgeP/t9bXV3GqlwxEQm3+yy+Blr5XZirq7EaOeffd1wrLou0ykbatyOK8guwT dO03iuId892Rbu3QDA1AEcFhV0dmNyqR93LbWr2M5JHDKIU33B7M4m30VnYINR2TYqTt pdFw==
X-Gm-Message-State: AOAM533p9I29a2RNB5PnpEvq/EckeE01ImDvCA+0yEr34C6lLUem+C5E BQdG8WJ/e4ZILLn2wvrmDREf2Q==
X-Google-Smtp-Source: ABdhPJyGqx2/vvC0mR5HAjjdsek77ImqC047y9YLIPz2hBeNDSm34lgDivRxoQBaZTMeuB5CkZLC5g==
X-Received: by 2002:a05:622a:392:: with SMTP id j18mr1476103qtx.29.1622754854223; Thu, 03 Jun 2021 14:14:14 -0700 (PDT)
Received: from ( []) by with ESMTPSA id 64sm2619125qkj.123.2021. (version=TLS1_2 cipher=ECDHE-ECDSA-AES128-GCM-SHA256 bits=128/128); Thu, 03 Jun 2021 14:14:13 -0700 (PDT)
From: Jonathan Lennox <>
Message-Id: <>
Content-Type: multipart/alternative; boundary="Apple-Mail=_2B1C2903-161B-4461-B58D-25983DBDCA9B"
Mime-Version: 1.0 (Mac OS X Mail 14.0 \(3654.\))
Date: Thu, 3 Jun 2021 17:14:12 -0400
In-Reply-To: <>
Cc: The IESG <>,,,,
To: Benjamin Kaduk <>
References: <>
X-Mailer: Apple Mail (2.3654.
Archived-At: <>
Subject: Re: [AVTCORE] Benjamin Kaduk's Discuss on draft-ietf-payload-vp9-13: (with DISCUSS and COMMENT)
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: Audio/Video Transport Core Maintenance <>
List-Unsubscribe: <>, <>
List-Archive: <>
List-Post: <>
List-Help: <>
List-Subscribe: <>, <>
X-List-Received-Date: Thu, 03 Jun 2021 21:14:22 -0000

> On Jun 3, 2021, at 2:52 AM, Benjamin Kaduk via Datatracker <> wrote:
> Benjamin Kaduk has entered the following ballot position for
> draft-ietf-payload-vp9-13: Discuss
> ----------------------------------------------------------------------
> ----------------------------------------------------------------------
> Hopefully trivial to resolve, but in Table 3 where we claim to reproduce
> the capabilities of coding profiles, defined in section 7.2 of
> [VP9-BITSTREAM], I do not think we did so faithfully.  In particular, in
> the last line, our table has:
>   +---------+-----------+-----------------+--------------------------+
>   |    3    |  10 or 12 |       Yes       | YUV 4:2:0,4:4:0 or 4:4:4 |
>   +---------+-----------+-----------------+--------------------------+
> but I'm seeing 4:2:2 (not 4:2:0) in [VP9-BITSTREAM].

Indeed, Zahed also pointed this out.  Thanks! 

> ----------------------------------------------------------------------
> ----------------------------------------------------------------------
> Section 3
>   Layers are designed (and MUST be encoded) such that if any layer, and
>   all higher layers, are removed from the bitstream along either of the
>   two dimensions, the remaining bitstream is still correctly decodable.
> Just to check my understanding: the "two dimensions" here are "temporal"
> and "spatial"?  ("dimensions" can of course also refer to the x and y
> coordinates of a image, but that doesn't seem to make sense here.)

Yes. I’ve changed the text to make this explicit.

>   and helps it understand the temporal layer structure.  Since this is
>   signaled in each packet it makes it possible to have very flexible
>   temporal layer hierarchies and patterns which are changing
>   dynamically.
> I'm not sure what type of "patterns" are being referred to here.  (The
> word "pattern" does not appear in the VP9 spec.)

Patterns of scalability structures.  I’ve changed it to just say “scalability structures which are changing dynamically”.

>   (Note: A "Picture Group", as used in this document, is not the same
>   thing as a the term "Group of Pictures" as it is traditionally used
>   in video coding, i.e. to mean an independently-decoadable run of
>   pictures beginning with a keyframe.)
> Please give a clear definition for how "Picture Group" is used by this
> document and follow that with the note about differing from "Group of
> Pictures".  That said, we only seem to use the term a handful of
> times...

I changed the previous paragraph (which introduced Picture Group) to say: 
      In non-flexible mode, frames are encoded using a fixed, recurring pattern of dependencies;                   
      the set of pictures that recur in this pattern is known as a Picture Group (PG).                                
      In this mode, …

Does that clarify things?

> Section 4.1
>     |                               : VP9 pyld hdr  |               |
>     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+               |
>     |                                                               |
>     +                                                               |
>     :                   Bytes 2..N of VP9 payload                   :
>     |                                                               |
>     |                               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
> Is the "2" in "2..N" because byte 1 is on the previous line?  since
> there's not a horizontal boundary in this version of the figure, can we
> just say "VP9 payload" here?

Yes, thanks.  (This was a relic of the fact that this draft was originally based on the VP8 payload spec, where the first byte of the payload is actually described in the document.).

This will also resolve Éric’s issue that “VP9 dyld hdr" isn’t defined anywhere.

> Section 4.2
>   I:  Picture ID (PID) present.  When set to one, the OPTIONAL PID MUST
>      be present after the mandatory first octet and specified as below.
>      Otherwise, PID MUST NOT be present.  If the SS field was present
>      in the stream's most recent start of a keyframe (i.e., non-
>      flexible scalability mode is in use), then the PID MUST also be
>      present in every packet.
> (I assume that the "SS field was present" condition is not a route to
> ignoring the I bit but rather a constraint on when the I bit is set.  I
> would hope that this is sufficiently obvious to go without saying…)\

Just because it’s obvious doesn’t mean it’s not worth saying; I changed it to end “then this bit MUST be set on every packet”.

>   Picture ID (PID):  Picture ID represented in 7 or 15 bits, depending
>      on the M bit.  This is a running index of the pictures.  The field
>      MUST be present if the I bit is equal to one.  If M is set to
>      zero, 7 bits carry the PID; else if M is set to one, 15 bits carry
>      the PID in network byte order.  The sender may choose between a 7-
>      or 15-bit index.  The PID SHOULD start on a random number, and
>      MUST wrap after reaching the maximum ID (0x7f or 0x7fff depending
>      on the index size chosen).  The receiver MUST NOT assume that the
>      number of bits in PID stay the same through the session.
> There's perhaps an edge case here where the PID goes from taking 15 bits
> to taking 7 and then taking 15 again in the same session.  On the 7->15
> transition, are the endpoints required to preserve the full 15-bit
> state/phase from the previous incarnation?  That is, must the local
> representation always have at least 15 bits and track wraparounds of the
> 7-bit counter if used on the wire?

No, each transition is a truncation (15->7) or a zero-extension (7->15). I’ve added text to clarify this.

>      P_DIFF:  The reference index (in 7 bits) specified as the relative
>         PID from the current picture.  For example, when P_DIFF=3 on a
>         packet containing the picture with PID 112 means that the
>         picture refers back to the picture with PID 109.  This
> Just to check: is a P_DIFF of zero invalid or interpreted as "subtract
> 256"?

I hadn’t thought about this, but invalid is probably easier to implement.

>      G set to 0 or N_G set to 0 indicates that either there is only one
>      temporal layer or no fixed inter-picture dependency information is
>      present going forward in the bitstream.
> These map up to each other, right -- G==0 iff only one temporal layer,
> and N_G==0 iff no *fixed* inter-picture dependency?  The current wording
> with "A or B indicates X or Y" is a bit ambiguous about what is implied
> by what.

No; the distinction is whether future pictures are flexible or non-flexible mode. I’ve clarified this.  G 0 and N_G 0 are semantically the same.

> Section 5.1
>   to the sender.  The message body (i.e., the "native RPSI bit string"
>   in [RFC4585]) is simply the PictureID of the received frame.
> Does it matter if the PictureID uses the 7-bit or 15-bit form?

It should be the form that was in the frame that was received.

> (Also, we spelled "Picture ID" with a space in §4.2.)
Thanks, fixed.

> seems to
> confirm that a non-multiple-of-8 bit count is fine in this field.
>   Note: because all frames of the same picture must have the same
>   inter-picture reference structure, there is no need for a message to
>   specify which frame is being selected.
> This note is a little confusing to me, since the previons discussion is
> only about having received an entire *frame*, but sending the Picture ID
> would seem to acknowledge the entire *picture*, possibly having not
> received some of the component frames.

Other than the discussion of “golden” and “altref" frames, the rest of the discussion has to do with pictures - do you think that text needs clarifying for the spatial scalability case?

> Section 5.3
>   referenced.  Therefore it's recommended for both the flexible and the
>   non-flexible mode that, when upgrade frames are being encoded in
> Where is "upgrade frame" defined?  (In §4.2 for the U bit we talk only
> about the "switching up point".)

Changed to switching-up points.

> Section 8
> Is there anything interesting to say about missing/incorrect
> begin/end-of-frame markers (that might diverge in the RTP payload
> descriptor from the actual encoded bitstream)?

I think Roman’s suggested text about malicious or malformed payloads should cover this case as well.

> Section 4.2
>   V:  Scalability structure (SS) data present.  When set to one, the
> Is there a mnemonic for how 'V' got its name?

Not that I recall, sadly; I think we were just running short on letters by this point.

>   Z:  Not a reference frame for upper spatial layers.  If set to 1,
>      indicates that frames with higher spatial layers SID+1 of the
>      current and following pictures do not depend on the current
> Something about the way this is written makes me want to read "layers"
> as indicating "and higher layers", not just the immediate SID+1.  Maybe
> "frame with the next higher spatial layer SID+1"?

No, the “and higher layers” reading is correct.  I’ve changed it to “SID+1 and greater”.

>      Note that for a given picture, all frames follow the same inter-
>      picture dependency structure.  However, the frame rate of each
>      spatial layer can be different from each other and this can be
>      controlled with the use of the D bit described above.  The
> "controlled" may not be quite the right word here, as in order to have
> higher-frame rate at a given (non-base) layer, the "off" frames have to
> use a 'D' of zero, but the converse is not necessarily true.

Is “described” better?

>   In a scalable stream sent with a fixed pattern, the SS data SHOULD be
>   included in the first packet of every key frame.  This is a packet
>   with P bit equal to zero, SID or D bit equal to zero, and B bit equal
> (If SID is zero, D is also zero, so from an information-theoretic (but
> not human usability) point of view, we could just say "D bit equal to
> zero".)

No, you have the information-theoretic description backwards; SID=0 implies D=0, but not the other way around.

Actually, I think this points out an actual error; SID=1 and D=0 is a valid packet, but would not be the first packet of a key frame.  I think “SID or L bit equal to 0” would be correct, i.e. either no layer information is present, or it’s spatial layer 0.

> Setion 4.4
>   legitimately be removed from the stream.  Thus, a frame that follows
>   a removable frame (in full decode order) MUST be encoded with
>   "error_resilient_mode" set to true.
> This is the only instance of the word "removable" in the document, so
> I'd suggest using a phrasing that does not imply a defined term, like "a
> frame that directly follows a frame that might be removed".


> Section 5.3
>   Identification of a layer refresh frame can be derived from the
>   reference IDs of each frame by backtracking the dependency chain
>   until reaching a point where only decodable frames are being
>   referenced.  [...]
> This description leaves me a bit unclear on what exactly the receiver
> concludes is a layer refresh frame.  (What dependencies could there be
> from the frame sent in response to LRR?)

As in the example, if the LRR was a request to upgrade from {1,0} to {2,1}, the layer refresh frame can depend on {0,0} and {1,0}, because the request indicated that the receiver already has that decode state.

>   response to a LRR, those packets should contain layer indices and the
>   reference fields so that the decoder or an MCU can make this
>   derivation.
> Maybe "field(s)" since only one P_DIFF is an allowed state?


> Section 8
>   RTP packets using the payload format defined in this specification
>   are subject to the security considerations discussed in the RTP
>   specification [RFC3550], and in any applicable RTP profile such as
>   RTP/AVP [RFC3551], RTP/AVPF [RFC4585], RTP/SAVP [RFC3711], or RTP/
>   SAVPF [RFC5124].  SAVPF [RFC5124].  However, as "Securing the RTP
> duplicate "SAVPF [RFC5124]"