[AVTCORE] Benjamin Kaduk's Discuss on draft-ietf-payload-vp9-13: (with DISCUSS and COMMENT)
Benjamin Kaduk via Datatracker <noreply@ietf.org> Thu, 03 June 2021 06:52 UTC
From: Benjamin Kaduk via Datatracker <noreply@ietf.org>
To: The IESG <iesg@ietf.org>
Cc: draft-ietf-payload-vp9@ietf.org, avtcore-chairs@ietf.org, avt@ietf.org, bernard.aboba@gmail.com, bernard.aboba@gmail.com
Reply-To: Benjamin Kaduk <kaduk@mit.edu>
Message-ID: <162270312471.25253.15642596825639278144@ietfa.amsl.com>
Date: Wed, 02 Jun 2021 23:52:05 -0700
Archived-At: <https://mailarchive.ietf.org/arch/msg/avt/M6LC329uufO-WattzWq6BGcJltg>

Benjamin Kaduk has entered the following ballot position for
draft-ietf-payload-vp9-13: Discuss

When responding, please keep the subject line intact and reply to all
email addresses included in the To and CC lines. (Feel free to cut this
introductory paragraph, however.)

Please refer to https://www.ietf.org/iesg/statement/discuss-criteria.html
for more information about DISCUSS and COMMENT positions.

The document, along with other ballot positions, can be found here:
https://datatracker.ietf.org/doc/draft-ietf-payload-vp9/

----------------------------------------------------------------------
DISCUSS:
----------------------------------------------------------------------

Hopefully trivial to resolve, but in Table 3, where we claim to reproduce
the capabilities of coding profiles defined in section 7.2 of
[VP9-BITSTREAM], I do not think we did so faithfully. In particular, in
the last line, our table has:

   +---------+-----------+-----------------+--------------------------+
   | 3       | 10 or 12  | Yes             | YUV 4:2:0,4:4:0 or 4:4:4 |
   +---------+-----------+-----------------+--------------------------+

but I'm seeing 4:2:2 (not 4:2:0) in [VP9-BITSTREAM].

----------------------------------------------------------------------
COMMENT:
----------------------------------------------------------------------

Section 3

   Layers are designed (and MUST be encoded) such that if any layer, and
   all higher layers, are removed from the bitstream along either of the
   two dimensions, the remaining bitstream is still correctly decodable.

Just to check my understanding: the "two dimensions" here are "temporal"
and "spatial"? ("Dimensions" can of course also refer to the x and y
coordinates of an image, but that doesn't seem to make sense here.)

   and helps it understand the temporal layer structure.  Since this is
   signaled in each packet it makes it possible to have very flexible
   temporal layer hierarchies and patterns which are changing
   dynamically.

I'm not sure what type of "patterns" are being referred to here. (The
word "pattern" does not appear in the VP9 spec.)

   (Note: A "Picture Group", as used in this document, is not the same
   thing as a the term "Group of Pictures" as it is traditionally used
   in video coding, i.e. to mean an independently-decoadable run of
   pictures beginning with a keyframe.)

Please give a clear definition for how "Picture Group" is used by this
document and follow that with the note about differing from "Group of
Pictures". That said, we only seem to use the term a handful of times...

Section 4.1

   |                               : VP9 pyld hdr  |               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+               |
   |                                                               |
   +                                                               |
   :                   Bytes 2..N of VP9 payload                   :
   |                                                               |
   |                               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

Is the "2" in "2..N" because byte 1 is on the previous line? Since
there's not a horizontal boundary in this version of the figure, can we
just say "VP9 payload" here?

Section 4.2

   I:  Picture ID (PID) present.  When set to one, the OPTIONAL PID MUST
      be present after the mandatory first octet and specified as below.
      Otherwise, PID MUST NOT be present.  If the SS field was present
      in the stream's most recent start of a keyframe (i.e., non-flexible
      scalability mode is in use), then the PID MUST also be present in
      every packet.

(I assume that the "SS field was present" condition is not a route to
ignoring the I bit but rather a constraint on when the I bit is set. I
would hope that this is sufficiently obvious to go without saying...)

   Picture ID (PID):  Picture ID represented in 7 or 15 bits, depending
      on the M bit.  This is a running index of the pictures.  The field
      MUST be present if the I bit is equal to one.  If M is set to
      zero, 7 bits carry the PID; else if M is set to one, 15 bits carry
      the PID in network byte order.  The sender may choose between a
      7- or 15-bit index.  The PID SHOULD start on a random number, and
      MUST wrap after reaching the maximum ID (0x7f or 0x7fff depending
      on the index size chosen).
      The receiver MUST NOT assume that the number of bits in PID stay
      the same through the session.

There's perhaps an edge case here where the PID goes from taking 15 bits
to taking 7 and then taking 15 again in the same session. On the 7->15
transition, are the endpoints required to preserve the full 15-bit
state/phase from the previous incarnation? That is, must the local
representation always have at least 15 bits and track wraparounds of the
7-bit counter if used on the wire?

   P_DIFF:  The reference index (in 7 bits) specified as the relative
      PID from the current picture.  For example, when P_DIFF=3 on a
      packet containing the picture with PID 112 means that the picture
      refers back to the picture with PID 109.  This

Just to check: is a P_DIFF of zero invalid or interpreted as "subtract
256"?

   G set to 0 or N_G set to 0 indicates that either there is only one
   temporal layer or no fixed inter-picture dependency information is
   present going forward in the bitstream.

These map up to each other, right -- G==0 iff only one temporal layer,
and N_G==0 iff no *fixed* inter-picture dependency? The current wording
with "A or B indicates X or Y" is a bit ambiguous about what is implied
by what.

Section 5.1

   to the sender.  The message body (i.e., the "native RPSI bit string"
   in [RFC4585]) is simply the PictureID of the received frame.

Does it matter if the PictureID uses the 7-bit or 15-bit form? (Also, we
spelled "Picture ID" with a space in §4.2.)

https://datatracker.ietf.org/doc/html/rfc4585#section-6.3.3.2 seems to
confirm that a non-multiple-of-8 bit count is fine in this field.

   Note: because all frames of the same picture must have the same
   inter-picture reference structure, there is no need for a message to
   specify which frame is being selected.

This note is a little confusing to me, since the previous discussion is
only about having received an entire *frame*, but sending the Picture ID
would seem to acknowledge the entire *picture*, possibly having not
received some of the component frames.

Section 5.3

   referenced.  Therefore it's recommended for both the flexible and the
   non-flexible mode that, when upgrade frames are being encoded in

Where is "upgrade frame" defined? (In §4.2 for the U bit we talk only
about the "switching up point".)

Section 8

Is there anything interesting to say about missing/incorrect
begin/end-of-frame markers (that might diverge in the RTP payload
descriptor from the actual encoded bitstream)?

NITS

Section 4.2

   V:  Scalability structure (SS) data present.  When set to one, the

Is there a mnemonic for how 'V' got its name?

   Z:  Not a reference frame for upper spatial layers.  If set to 1,
      indicates that frames with higher spatial layers SID+1 of the
      current and following pictures do not depend on the current

Something about the way this is written makes me want to read "layers"
as indicating "and higher layers", not just the immediate SID+1. Maybe
"frame with the next higher spatial layer SID+1"?

   Note that for a given picture, all frames follow the same
   inter-picture dependency structure.  However, the frame rate of each
   spatial layer can be different from each other and this can be
   controlled with the use of the D bit described above.

The "controlled" may not be quite the right word here, as in order to
have a higher frame rate at a given (non-base) layer, the "off" frames
have to use a 'D' of zero, but the converse is not necessarily true.

   In a scalable stream sent with a fixed pattern, the SS data SHOULD be
   included in the first packet of every key frame.  This is a packet
   with P bit equal to zero, SID or D bit equal to zero, and B bit equal

(If SID is zero, D is also zero, so from an information-theoretic (but
not human usability) point of view, we could just say "D bit equal to
zero".)

Section 4.4

   legitimately be removed from the stream.  Thus, a frame that follows
   a removable frame (in full decode order) MUST be encoded with
   "error_resilient_mode" set to true.

This is the only instance of the word "removable" in the document, so
I'd suggest using a phrasing that does not imply a defined term, like "a
frame that directly follows a frame that might be removed".

Section 5.3

   Identification of a layer refresh frame can be derived from the
   reference IDs of each frame by backtracking the dependency chain
   until reaching a point where only decodable frames are being
   referenced.  [...]

This description leaves me a bit unclear on what exactly the receiver
concludes is a layer refresh frame. (What dependencies could there be
from the frame sent in response to LRR?)

   response to a LRR, those packets should contain layer indices and the
   reference fields so that the decoder or an MCU can make this
   derivation.

Maybe "field(s)" since only one P_DIFF is an allowed state?

Section 8

   RTP packets using the payload format defined in this specification
   are subject to the security considerations discussed in the RTP
   specification [RFC3550], and in any applicable RTP profile such as
   RTP/AVP [RFC3551], RTP/AVPF [RFC4585], RTP/SAVP [RFC3711], or
   RTP/SAVPF [RFC5124].  SAVPF [RFC5124].  However, as "Securing the RTP

duplicate "SAVPF [RFC5124]"
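
For reference on the Section 4.2 comments about PID width and P_DIFF, here
is a minimal sketch of how a receiver might read the Picture ID and resolve
a P_DIFF reference. It assumes the I bit is the most significant bit of the
mandatory first octet and that M is the most significant bit of the octet
that follows, as in the draft's descriptor layout. The function names are
hypothetical, and reducing modulo the width currently on the wire is an
assumption of this sketch only; it does not settle the 7->15 transition
question raised above.

```python
from typing import Optional, Tuple


def parse_picture_id(descriptor: bytes) -> Tuple[Optional[int], int, int]:
    """Return (pid, pid_bits, octets_consumed) for a VP9 payload descriptor.

    Assumption: I is bit 0x80 of the first octet, and M is bit 0x80 of the
    octet that follows it.
    """
    if not descriptor:
        raise ValueError("empty payload descriptor")
    if not descriptor[0] & 0x80:      # I bit clear: PID MUST NOT be present
        return None, 0, 1
    if len(descriptor) < 2:
        raise ValueError("I bit set but PID octet missing")
    pid = descriptor[1] & 0x7F
    if descriptor[1] & 0x80:          # M bit set: 15-bit PID, network byte order
        if len(descriptor) < 3:
            raise ValueError("M bit set but second PID octet missing")
        return (pid << 8) | descriptor[2], 15, 3
    return pid, 7, 2


def referenced_pid(pid: int, p_diff: int, pid_bits: int) -> int:
    """PID referred to by P_DIFF, wrapping at the on-the-wire PID width.

    Assumption: the wrap modulus is 2**pid_bits for the current packet.
    """
    return (pid - p_diff) % (1 << pid_bits)
```

With the draft's own example, referenced_pid(112, 3, 7) gives 109. Near the
wrap point the two widths diverge (PID 1 with P_DIFF=3 resolves to 126 in
7-bit form and 32766 in 15-bit form), which is why the width question
matters.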