Re: [AVTCORE] Comments on draft-ietf-avtcore-rtp-vvc-03(Internet mail)

"shuaiizhao(Shuai Zhao)" <> Thu, 29 October 2020 21:19 UTC

From: "shuaiizhao(Shuai Zhao)" <>
To: Jonathan Lennox <>
CC: "" <>, Bernard Aboba <>, Ye-Kui Wang <>
Thread-Topic: [AVTCORE] Comments on draft-ietf-avtcore-rtp-vvc-03(Internet mail)
Date: Thu, 29 Oct 2020 21:19:33 +0000
List-Id: Audio/Video Transport Core Maintenance <>

Thanks Chair.
Welcome onboard @Yekui.

I will be uploading a new version to address those comments momentarily.


On Oct 29, 2020, at 13:14, Jonathan Lennox <> wrote:

This is fine, I have no objection.  Welcome, Ye-Kui!

On Wed, Oct 28, 2020 at 1:06 PM shuaiizhao(Shuai Zhao) <<>> wrote:
Dear Chairs,

We thank Yekui for his detailed comments. All of the co-authors agree to these changes.

We are requesting to add YeKui as a co-author to this draft and will be providing a new version to address those comments.

Therefore, I would like to ask AVTcore chairs if we have a consensus to do so.

Shuai Zhao

From: avt <<>> on behalf of Ye-Kui Wang <<>>
Date: Tuesday, October 27, 2020 at 17:12
To: 'shuai zhao' <<>>, "<>" <<>>
Cc: 'Jonathan Lennox' <<>>
Subject: [AVTCORE] Comments on draft-ietf-avtcore-rtp-vvc-03(Internet mail)

Thanks Shuai for integrating my earlier comments!

I read some parts of draft-ietf-avtcore-rtp-vvc-03 and have the following comments and suggested changes:

1)      In Abstract, “ISO23090-3” should be replaced with “23090-3”.

2)      In paragraph 1 of Section 1, the last sentence, “[H.266] is reported to provide significant coding efficiency gains over H.265 and earlier video codec formats.” should be replaced with “VVC is reported to provide significant coding efficiency gains over HEVC [HEVC], a.k.a. H.265, and earlier video codecs.”

3)      In paragraph 2 of Section 1, “This memo specifices …” should be “This memo specifies …”.

4)      In Section 1.1 and its subsections, all instances of “[VVC]” and “[HEVC]” should be replaced with “VVC” and “HEVC”, respectively.

5)      In Section 1.1, paragraph 1, replace “ITU- T” with “ITU-T” (remove the unnecessary space after ‘-’).

6)      In Section 1.1.1, paragraph 1 in the subsection of “Motion prediction and coding”, replace “Sub- block” with “Sub-block” (remove the unnecessary space after ‘-’).

7)      In Section 1.1.1, paragraph 1 in the section of “Intra prediction and intra-coding”, in “a 6-most -probable-mode scheme”, remove the unnecessary space after “most”.

8)      In Section 1.1.2, add “(informative)” at the end of the section title.

9)      In Section 1.1.2, change the subsection title “Decoding Capability Information” to “Decoding capability information” for consistency.

10)   In Section 1.1.2, the subsection of “Video parameter set”, replace “TThe ideo parameter set (VPS)” with “The video parameter set (VPS)”.

11)   In Section 1.1.2, change the subsection title “Picture Header” to “Picture header” for consistency.

12)   In Section 1.1.2, the subsection of “Picture Header”, first sentence, change “A Picture Header” to “A picture header” for consistency.

13)   In Section 1.1.2, change the subsection title “Sub-Profiles” to “Sub-profiles” for consistency.

14)   In Section 1.1.2, change the subsection title “Constraint Fields” to “General constraint fields”.

15)   In Section 1.1.2, the subsection of “General constraint fields”, change “(more of which are flags)” to “(most of which are flags)”.

16)   In Section 1.1.2, the subsection of “Temporal scalability support”, remove the editor’s note, as the text description is good enough.

17)   In Section 1.1.2, change the subsection title “Picture reference resampling (RPR)” to “Reference picture resampling (RPR)”.

18)   In Section 1.1.2, the subsection of “Reference picture resampling (RPR)”, replace the editor’s note with the following description text for RPR:

In AVC and HEVC, the spatial resolution of pictures cannot change unless a new sequence using a new SPS starts, with an IRAP picture. VVC enables picture resolution change within a sequence at a position without encoding an IRAP picture, which is always intra-coded. This feature is sometimes referred to as reference picture resampling (RPR), as the feature needs resampling of a reference picture used for inter prediction when that reference picture has a different resolution than the current picture being decoded. RPR allows resolution change without the need of coding an IRAP picture, which causes a momentary bit rate spike in streaming or video conferencing scenarios, e.g., to cope with network condition changes.  RPR can also be used in application scenarios wherein zooming of the entire video region or some region of interest is needed.

19)   In Section 1.1.2, the subsection of “Spatial, SNR, and multiview scalability”, replace “all those forms of scalability are supported natively …” with “all those forms of scalability are supported in the first version of VVC, natively …”.

20)   In Section 1.1.2, the subsection of “Spatial, SNR, and multiview scalability”, remove the last sentence that says “Scalability support can be implemented in a single decoding "loop" and is widely considered a comparatively lightweight operation.” Most readers would understand this as the same single-loop decoding concept as in H.264/SVC, which is then incorrect. I think the key point intended to be said is the same as what is said in the subsequent description of spatial scalability (i.e., the support in VVC does not need a resampling filtering module as in the scalable extensions of AVC and HEVC, but only needs some high-level syntax changes).

21)   In Section 1.1.2, change the subsection title of “Spatial Scalability” to “Spatial scalability” for consistency.

22)   In Section 1.1.2, the subsection of “Spatial scalability”, 1st sentence, remove “in the "main" profile of VVC,” as the spatial scalability support is in a separate profile, although also in VVC version 1.

23)   In Section 1.1.2, after the subsection of “SNR scalability”, add a subsection of “Multiview scalability”, with the following description text: The first version of VVC also supports multiview scalability, wherein a multi-layer bitstream carries layers representing multiple views, and one or more of the represented views can be output at the same time.

24)   In Section 1.1.2, the subsection of “SEI Message”, change the title to be “SEI messages” for consistency, and add at the end of the last sentence “but in a companion specification [VSEI]”, and add the VSEI spec (H.274) reference to the list of references. The reference to the VSEI spec should be added as some applications that use this RTP payload format may use some of the SEI messages specified in the VSEI spec.

25)   In Section 1.1.3, change the section title to be “High-Level Picture Partitioning (informative)”, and replace the basically empty section body with the following description text:

VVC inherited the concept of tiles and wavefront parallel processing (WPP) from HEVC, with some minor to moderate differences. The basic concept of slices was kept in VVC but is designed in an essentially different form. VVC is the first video coding standard that includes subpictures as a feature; subpictures provide the same functionality as HEVC motion-constrained tile sets (MCTSs) but are designed in a different way to have better coding efficiency and to be friendlier for usage in application systems. More details of these differences are described below.

Tiles and WPP

As in HEVC, a picture in VVC can be split into tile rows and tile columns, in-picture prediction across tile boundaries is disallowed, and so on. However, the syntax for signaling of tile partitioning has been simplified, by using a unified syntax design for both the uniform and the non-uniform mode. In addition, signaling of entry point offsets for tiles in the slice header (SH) is optional in VVC while it is mandatory in HEVC. The WPP design in VVC has two differences compared to HEVC: i) The CTU row delay is reduced from two CTUs to one CTU; ii) Signaling of entry point offsets for WPP in the SH is optional in VVC while it is mandatory in HEVC.
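To make the effect of the reduced CTU row delay concrete, here is a minimal sketch of a simplified wavefront schedule. It assumes one worker per CTU row, one time step per CTU, and that the only cross-row constraint is the CTU row delay; the function name and parameters are hypothetical and are not taken from the VVC or HEVC specifications.

```python
def wavefront_finish_time(rows, cols, ctu_delay):
    """Time steps needed to decode a picture of rows x cols CTUs.

    Simplified model (an assumption, not the normative process):
    a CTU at (r, c) can start once its left neighbour is done and
    the row above has completed CTU column c + ctu_delay - 1.
    ctu_delay=2 models the HEVC-style delay, ctu_delay=1 the VVC-style one.
    """
    finish = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            left = finish[r][c - 1] if c > 0 else 0
            if r > 0:
                # Column in the row above that must be finished first.
                dep_col = min(c + ctu_delay - 1, cols - 1)
                above = finish[r - 1][dep_col]
            else:
                above = 0
            finish[r][c] = max(left, above) + 1
    return finish[-1][-1]


hevc_like = wavefront_finish_time(rows=4, cols=10, ctu_delay=2)
vvc_like = wavefront_finish_time(rows=4, cols=10, ctu_delay=1)
print(hevc_like, vvc_like)
```

Under these assumptions each additional CTU row adds one time step with a one-CTU delay instead of two, so the benefit grows with the number of rows processed in parallel.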

Slices

In VVC, the conventional slices based on CTUs (as in HEVC) or macroblocks (as in AVC) have been removed. The main reasoning behind this architectural change is as follows. The advances in video coding since 2003 (the publication year of AVC v1) have been such that slice-based error concealment has become practically impossible, due to the ever-increasing number and efficiency of in-picture and inter-picture prediction mechanisms. An error-concealed picture is the decoding result of a transmitted coded picture for which some data of the coded picture has been lost (e.g., loss of some slices), or for which a reference picture for at least some part of the coded picture is not error-free (e.g., that reference picture was itself an error-concealed picture). For example, when one of the multiple slices of a picture is lost, it may be error-concealed using an interpolation of the neighboring slices. While advanced video coding prediction mechanisms provide significantly higher coding efficiency, they also make it harder for machines to estimate the quality of an error-concealed picture, which was already a hard problem with the use of simpler prediction mechanisms. Advanced in-picture prediction mechanisms also make the coding efficiency loss caused by splitting a picture into multiple slices more significant. Furthermore, network conditions have become significantly better, while at the same time techniques for dealing with packet losses have significantly improved. As a result, very few implementations have recently used slices for maximum transmission unit size matching. Instead, substantially all applications where low-delay error resilience is required (e.g., video telephony and video conferencing) rely on system/transport-level error resilience (e.g., retransmission, forward error correction) and/or picture-based error resilience tools (feedback-based error resilience, insertion of IRAPs, scalability with higher protection level of the base layer, and so on).
Considering all the above, nowadays it is very rare that a picture that cannot be correctly decoded is passed to the decoder, and when such a rare case occurs, the system can afford to wait for an error-free picture to be decoded and available for display without resulting in frequent and long periods of picture freezing seen by end users.

Slices in VVC have two modes: rectangular slices and raster-scan slices. A rectangular slice, as indicated by its name, covers a rectangular region of the picture. Typically, a rectangular slice consists of a number of complete tiles. However, it is also possible that a rectangular slice is a subset of a tile and consists of one or more consecutive, complete CTU rows within a tile. A raster-scan slice consists of one or more complete tiles in tile raster scan order, hence the region covered by a raster-scan slice need not be rectangular, although it may happen to have the shape of a rectangle. The concept of slices in VVC is therefore strongly linked to or based on tiles instead of CTUs (as in HEVC) or macroblocks (as in AVC).
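The remark that a raster-scan slice may or may not cover a rectangle can be illustrated with a small sketch. It assumes tiles are numbered in raster order over a grid with a fixed number of tile columns; the function and parameter names are hypothetical and do not correspond to VVC syntax elements.

```python
def raster_scan_slice_is_rectangular(tile_cols, first_tile, num_tiles):
    """Return True if a run of consecutive tiles covers a rectangle.

    Tiles are indexed in raster order (left to right, top to bottom)
    over a grid that is tile_cols tiles wide. This is an illustrative
    model only, not the normative slice derivation.
    """
    last_tile = first_tile + num_tiles - 1
    first_row, first_col = divmod(first_tile, tile_cols)
    last_row, last_col = divmod(last_tile, tile_cols)
    if first_row == last_row:
        # A run within a single tile row is always a rectangle.
        return True
    # A run spanning several tile rows is a rectangle only if it
    # consists of complete tile rows.
    return first_col == 0 and last_col == tile_cols - 1


# Grid with 4 tile columns:
print(raster_scan_slice_is_rectangular(4, first_tile=1, num_tiles=2))
print(raster_scan_slice_is_rectangular(4, first_tile=2, num_tiles=4))
print(raster_scan_slice_is_rectangular(4, first_tile=4, num_tiles=8))
```

In this model the second call describes a slice that starts mid-row and wraps onto the next tile row, which is the non-rectangular case mentioned above.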

Subpictures

VVC is the first video coding standard that includes the support of subpictures as a feature. Each subpicture consists of one or more complete rectangular slices that collectively cover a rectangular region of the picture. A subpicture may be either specified to be extractable (i.e., coded independently of other subpictures of the same picture and of earlier pictures in decoding order) or not extractable. Regardless of whether a subpicture is extractable or not, the encoder can control whether in-loop filtering (including deblocking, SAO, and ALF) is applied across the subpicture boundaries individually for each subpicture.

Functionally, subpictures are similar to the motion-constrained tile sets (MCTSs) in HEVC. They both allow independent coding and extraction of a rectangular subset of a sequence of coded pictures, for use cases like viewport-dependent 360° video streaming optimization and region of interest (ROI) applications.

There are several important design differences between subpictures and MCTSs. First, the subpictures feature in VVC allows motion vectors of a coding block to point outside of the subpicture even when the subpicture is extractable, by applying sample padding at subpicture boundaries in this case, similarly as at picture boundaries. Second, additional changes were introduced for the selection and derivation of motion vectors in the merge mode and in the decoder-side motion vector refinement process of VVC. This allows higher coding efficiency compared to the non-normative motion constraints applied at the encoder side for MCTSs. Third, rewriting of SHs (and PH NAL units, when present) is not needed when extracting one or more extractable subpictures from a sequence of pictures to create a sub-bitstream that is a conforming bitstream. In sub-bitstream extractions based on HEVC MCTSs, rewriting of SHs is needed. Note that in both HEVC MCTSs extraction and VVC subpictures extraction, rewriting of SPSs and PPSs is needed. However, typically there are only a few parameter sets in a bitstream, while each picture has at least one slice, therefore rewriting of SHs can be a significant burden for application systems. Fourth, slices of different subpictures within a picture are allowed to have different NAL unit types. Fifth, VVC specifies HRD and level definitions for subpicture sequences, thus the conformance of the sub-bitstream of each extractable subpicture sequence can be ensured by encoders.
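The first difference, sample padding at subpicture boundaries, can be sketched as coordinate clamping: a reference position outside the subpicture is mapped to the nearest sample inside it, the same way positions outside a picture are usually handled. This is a minimal illustration under that assumption; the function name, the argument layout, and the use of a plain nested list for the reference picture are all hypothetical.

```python
def padded_reference_sample(ref, x, y, left, top, right, bottom):
    """Fetch a reference sample with padding at subpicture boundaries.

    ref is a 2-D list indexed as ref[y][x]. The subpicture covers
    columns left..right and rows top..bottom (inclusive). Positions
    outside that region are clamped to the nearest boundary sample,
    which replicates the boundary samples outward.
    """
    cx = min(max(x, left), right)
    cy = min(max(y, top), bottom)
    return ref[cy][cx]


# 4x4 reference picture whose sample value encodes its position (10*y + x).
ref = [[10 * y + x for x in range(4)] for y in range(4)]

# Subpicture covering columns 1..2 and rows 1..2.
print(padded_reference_sample(ref, 5, 0, left=1, top=1, right=2, bottom=2))
print(padded_reference_sample(ref, 1, 2, left=1, top=1, right=2, bottom=2))
```

Because the fetch never reads outside the subpicture, a motion vector pointing across the boundary does not create a dependency on neighbouring subpictures, which is what keeps the subpicture extractable in this simplified view.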

If needed, I can work with the authors to integrate these suggested changes, if agreed, into the next version of the draft.


From: shuai zhao <<>>
Sent: Monday, October 26, 2020 21:23
Cc: Jonathan Lennox <<>>;<>; Ye-Kui Wang <<>>
Subject: [External] Fwd: New Version Notification for draft-ietf-avtcore-rtp-vvc-03.txt

Thanks Yekui for providing valuable comments. In this version I have fixed the following:

•         Updates on VVC coding tool up to Section 1.1.3
•         Add a PRP section with editor’s notes
•         Make Section 1.1.2 naming consistent per Yekui’s comments.


Begin forwarded message:

Subject: New Version Notification for draft-ietf-avtcore-rtp-vvc-03.txt
Date: October 26, 2020 at 21:15:04 PDT
To: "Shuai Zhao" <<>>, "Yago Sanchez" <<>>, "Stephan Wenger" <<>>

A new version of I-D, draft-ietf-avtcore-rtp-vvc-03.txt
has been successfully submitted by Shuai Zhao and posted to the
IETF repository.

Name:                  draft-ietf-avtcore-rtp-vvc
Revision:              03
Title:                     RTP Payload Format for Versatile Video Coding (VVC)
Document date:               2020-10-27
Group:                  avtcore
Pages:                  44

  This memo describes an RTP payload format for the video coding
  standard ITU-T Recommendation H.266 and ISO/IEC International
  Standard ISO23090-3, both also known as Versatile Video Coding (VVC)
  and developed by the Joint Video Experts Team (JVET).  The RTP
  payload format allows for packetization of one or more Network
  Abstraction Layer (NAL) units in each RTP packet payload as well as
  fragmentation of a NAL unit into multiple RTP packets.  The payload
  format has wide applicability in videoconferencing, Internet video
  streaming, and high-bitrate entertainment-quality video, among other

Please note that it may take a couple of minutes from the time of submission
until the htmlized version and diff are available at<>.

The IETF Secretariat