Re: [payload] AD review of draft-ietf-payload-rtp-h265

"Wang, Ye-Kui" <yekuiw@qti.qualcomm.com> Mon, 16 February 2015 18:29 UTC

From: "Wang, Ye-Kui" <yekuiw@qti.qualcomm.com>
To: Richard Barnes <rlb@ipv.sx>, "payload@ietf.org" <payload@ietf.org>, "draft-ietf-payload-rtp-h265@tools.ietf.org" <draft-ietf-payload-rtp-h265@tools.ietf.org>
Thread-Topic: AD review of draft-ietf-payload-rtp-h265
Thread-Index: AQHQPv3glBFWM1XnHkO+uj7o+tOkGZzznTGg
Date: Mon, 16 Feb 2015 18:29:22 +0000
Message-ID: <dde156a02580476aaab3bdb14bf7cbaf@NALASEXR01H.na.qualcomm.com>
References: <CAL02cgSBumHNsYrGtgJAinodzVEkq8VA-_rhzpGV5Zk9khUs6g@mail.gmail.com>
In-Reply-To: <CAL02cgSBumHNsYrGtgJAinodzVEkq8VA-_rhzpGV5Zk9khUs6g@mail.gmail.com>
Accept-Language: en-US
Content-Language: en-US
Content-Type: multipart/alternative; boundary="_000_dde156a02580476aaab3bdb14bf7cbafNALASEXR01Hnaqualcommco_"
MIME-Version: 1.0
Archived-At: <http://mailarchive.ietf.org/arch/msg/payload/o5VUdtpQtBAbKMKoJpkscbcdCr4>
Subject: Re: [payload] AD review of draft-ietf-payload-rtp-h265
Precedence: list

Hi Richard,

Thanks for your review and comments.  Here is our reply (see interleaved below). This is a very long email; sorry for the word count.

From: Richard Barnes [mailto:rlb@ipv.sx]
Sent: Monday, February 02, 2015 7:35 AM
To: payload@ietf.org; draft-ietf-payload-rtp-h265@tools.ietf.org
Subject: AD review of draft-ietf-payload-rtp-h265

I have reviewed this draft in preparation for IETF LC, and have serious reservations.  I would like to resolve the below comments before a re-review and IETF LC.  I have also asked for an SDP Directorate review of Section 7.

While I agree that the system it describes is interoperable as stated,

[Authors] Thanks.  This is one of the bigger compliments spec writers can get.

there are several points of unneeded complexity which, together with readability concerns, indicate that it will be very difficult for an implementor to create a correct implementation.  In other words, in order to implement this spec, you have to get a lot of complicated things right, and the way things are laid out makes it likely you'll miss something.

[Authors] We disagree.

This payload format follows the same fundamental design, both in terms of technical design choices and in editorial style, as the payload formats for H.264 (RFC 3984, RFC 6184) and SVC (RFC 6190).  For RFC 3984 and RFC 6184, dozens, if not hundreds, of interoperable and independently developed implementations exist. Deployment includes most video conferencing systems, IPTV endpoints, and a myriad of niche products such as surveillance and military applications.  RFC 6190 is less widely deployed, but compliant and interoperable products are still shipping in the millions (Lync, Vidyo’s products, just to name two).  Somehow, the implementer community got it right, for we rarely (considering the popularity of those RFCs) receive questions, let alone complaints, about the specifications.  Also, the mentioned payload format RFCs have an OK track record in interoperability tests, such as those conducted in the IMTC.  To summarize, we believe it’s fair to characterize these RFCs as “good” specs.

(In fairness, we note that there was one aspect of the previous RFCs that folks occasionally got confused about, implemented wrongly, and complained about, namely the multiple “transmission modes”.  This shortcoming has been pointed out during the design of the H.265 payload format rather prominently, and was fixed.  We get to the details later.)

There are, of course, reasons why these older RFCs are “good”.  One reason is, in our experience (and some of us did or do implementation work of payload formats ourselves), that the RTP payload implementers, especially for NAL-unit based video coding standards, are mostly identical to the implementers of the high level part of the codec itself, or the device driver glue code when a hardware codec is in play.  By necessity, these folks need to read the video coding spec.  By aligning the style and precision of the payload format text with what one finds in the video coding specs, these folks feel right at home.  For those who come to the field “fresh”, there are tutorial papers available covering design choices and motivations—not necessary for interoperable spec development, but helpful to come up to speed.  For the H.265 codec, see for example “System Layer Integration of High Efficiency Video Coding”, Thomas Schierl, Miska M. Hannuksela, Ye-Kui Wang, Stephan Wenger, IEEE Transactions on Circuits and Systems for Video Technology 12/2012; 22(12):1871-1884.  (We can add an informative reference to that article, if you want.  Note that it is in a special issue that contains tutorials about all aspects of H.265, so you can expect that most folks implementing an H.265 product read at least parts of it.)

Another reason is the high flexibility of these older payload formats, of which, for example, the large IANA template is evidence.  The MIME template of RFC 3984 was an “uncool” 14 pages long.  Most of the parameters were used one way or the other, and industry required additional parameters that were bolted on through a second MIME registration of 16 pages, see RFC 6185.  In some industries, organizations like UCIF/IMTC “profiled” our RFCs by requiring certain parameters to be included in the session setup, and recommending or mandating values for those.  Many configurable codec (and payload format) parameters are exposed, so as to give profiling SDOs and niche product designers the flexibility they may need.  All have default values so that mainstream users don’t need to worry about them.

All this was, of course, known in the PAYLOAD working group.  Even the individual -00 draft, posted in October 2012, made it clear that we would be using RFC 6190 as a starting point (H.265 does explicitly support temporal scalability from the outset, so the logical starting point was an RFC for a NAL unit based codec that supports scalability).  This approach was signed off by the WG numerous times, explicitly or implicitly (i.e., through adoption of the individual draft as a WG draft and, of course, during WGLC).  The draft was discussed extensively: some 150 messages on the payload list and many hours of hallway discussions in the IETF and in JCT-VC (where H.265 is standardized).  Not one of the emails to the payload list questions the general design principles.  Oh, sorry: one does.  Yours.

Our initial reaction to your review (and not only ours but also that of a number of people knowledgeable in the area of RTP payload formats for video) was one of great surprise.  Your own phrase “Wait, what?” describes it rather well.  It’s not only the content of your message, it’s also the timing.  Sending us back to the drawing board with suggestions of fundamental design changes is not what we would have expected from the AD responsible for the PAYLOAD WG.

None of us expects an area director to have the full context of all of the working groups under him or her at his or her fingertips.  But we would have expected that the sitting AD responsible for a WG would, at the very minimum, check some of the background before posting as harsh a review as you did.

Now, while we will push back on your suggestion of making fundamental design changes at the 11th hour (products implementing the draft are shipping today.  Millions of them), we also found a lot of points in your review which we believe can lead to an improvement of the draft.  Of course we will implement those.

With this long introduction, let us respond to the detailed points your review contains.

More specific comments follow.

Section 1 and 3 are almost entirely unnecessary.  Section 1.1.4 should be moved to Section 4.2, and the remainder replaced with a brief description of the concepts critical for understanding the payload specification (AFAICT, mainly just NAL units, access units, and VCL/non-VCL NAL units).  The larger text could go in an appendix.

[Authors] Firstly, we disagree that the two sections are almost entirely unnecessary. Section 1.1 gives an overview of the HEVC codec, helping potential readers understand HEVC, particularly the aspects that are most relevant to the use of HEVC in applications, including RTP transport. A better understanding of these enables better designs and optimizations. It’s also quite a common request in the PAYLOAD WG to add explanatory overview text about the media codec.  Section 1.2 provides an overview of the payload format itself, which is needed anyway. Section 3.1 provides definitions that are either defined in HEVC (those in Section 3.1) or newly defined or refined in this specification; both kinds are used. How can definitions of terms be unnecessary for standard specifications?  And do you really want to have people dig out from the H.265 spec definitions such as “coded video sequence” (which is defined differently from what would be intuitive for a non-video codec person)?

Secondly, why would it be an improvement to move a lot of text into an appendix instead of having it in an introductory section? We think both ways work fine, and following the same style as in the RFCs for H.264 and SVC (meaning leaving the organization as it is now) would make more sense, as many readers of this specification will have read those RFCs earlier.

Section 3 should be trimmed to contain only terms that are required for implementing the payload specification.  For example, there are an entire suite of terms around "BLA", "BLA picture", "CRA", "CRA picture", etc., only to justify an example in Section 4.6.  Acronyms that are only mentioned once or twice (e.g., MANE) also impedes readability.

[Authors] We agree to not list the acronyms for those that are only mentioned once or twice.

Section 4.2 is the first in a series of "Wait, what?" moments.  The payload header was mentioned without this context in Section 1.1.4., so when the reader arrives here, it would be good to have the reminder that we get in Section 4.3 (in fact, that whole first paragraph seems like 4.2 material), that the first two bytes of the payload are the payload header.  Even this would be unnecessary if you move Section 1.1.4 into Section 4.2.

[Authors] Good suggestion to move the first paragraph in Section 4.3 to the beginning of Section 4.2. Thanks! To be done in the next version. However, we do not agree that it would be better to move section 1.1.4 as suggested, since section 1.1.4 summarizes the NAL unit header of HEVC, hence belonging to the HEVC overview part.  Yes, the NAL unit header format is also used as the payload header, but that is spelled out explicitly.

In Section 4.3, I'm bewildered as to why there's a need for four separate payload structures.  Why is it not sufficient to do as the VP8 payload structure does, and simply have each packet carry a sequence of one or more NAL units, with truncation possible at the beginning or end?  In addition to simplifying the protocol structure, that would avoid the ambiguity between actual NAL units and NAL-unit-like structures (as Section 6 does).

[Authors] The payload structures are the core of the specification that define the bits on the wire. They have been there, in essentially unchanged form, since the -00 individual draft, posted about three years ago.  We have never (never!) received pushback around that design choice.  Perhaps because payload implementers who care have a reasonable understanding of why they are there and how they are used.

Three of the four payload structures (single NAL unit packet, aggregation packet, and fragmentation unit) are inherited from the design of RFC 6184 and RFC 6190, and were already justified when those payload formats were designed. They are widely implemented in H.264 implementations.  As the Network Abstraction Layers of H.264 and H.265 are very similar, we expect that existing implementations can be upgraded relatively easily.

The fourth payload structure (PACI) is fixing a problem that was found during the development of RFC 6190 for SVC but could be fixed only by a dirty hack (called PACSI).  More comments about the design choices for PACI later.

The key technical reasons for the payload structure designs are as follows.

In HEVC (as in other modern video codec specs such as H.264), a sequence of NAL units does not form a self-contained decodable bitstream.  The decoder needs access to individual NAL units, which means that they need to be delimited somehow through framing.  The HEVC spec specifies a byte stream format for transporting HEVC directly over bit-pipes (such as an MPEG-2 transport stream), using “start codes”.  Start codes not only add overhead, but there is also the headache of “start code emulation prevention”, which costs both bits and cycles.  Back in 2001, the Network Abstraction Layer of H.264 was expressly designed to do away with byte stream formats and start codes for transports that have a native packet structure.  A design alternative that is costly in terms of bits is explicit length fields, and we use it in the aggregation packets, where we cannot use the packet delimiter to indicate the end of the NAL unit.

For the single NAL unit packet, which is the most common packet in encoding systems that can adjust the slice size to the MTU, the IP packet boundary identifies the NAL unit boundaries, saving bits one would otherwise need to spend for framing.  Also, NAL units are designed to be independently decodable, so the loss of a packet containing a NAL unit does not necessarily lead to catastrophic behavior of the decoder (as was the case for previous video coding standards and payload formats that did not align media coding structures with packet boundaries).
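
To illustrate the framing trade-off described above, here is a minimal sketch in Python.  It is not text from the draft: the function names are hypothetical, and it deliberately omits the aggregation packet's own payload header and any decoding order fields.  It shows only the idea that each aggregated NAL unit needs an explicit 16-bit size in front of it, whereas a single NAL unit packet needs no framing bytes at all because the packet boundary delimits the NAL unit.

```python
import struct

def frame_with_lengths(nal_units):
    """Concatenate NAL units, each preceded by a 16-bit big-endian size.

    Simplified illustration of length-prefixed framing; assumption: no
    payload header and no decoding order fields are included here.
    """
    out = bytearray()
    for nalu in nal_units:
        if len(nalu) > 0xFFFF:
            raise ValueError("NAL unit too large for a 16-bit size field")
        out += struct.pack("!H", len(nalu))
        out += nalu
    return bytes(out)

def split_by_lengths(payload):
    """Recover the individual NAL units from a length-prefixed payload."""
    units = []
    pos = 0
    while pos < len(payload):
        if pos + 2 > len(payload):
            raise ValueError("truncated size field")
        (size,) = struct.unpack_from("!H", payload, pos)
        pos += 2
        if pos + size > len(payload):
            raise ValueError("truncated NAL unit")
        units.append(payload[pos:pos + size])
        pos += size
    return units

def framing_overhead(nal_units):
    """Bytes spent on size fields: 2 per aggregated NAL unit."""
    return len(frame_with_lengths(nal_units)) - sum(len(n) for n in nal_units)
```

With two NAL units of 4 and 6 bytes, the framed payload is 14 bytes long, i.e. 4 bytes go to size fields.  Sending either unit alone in its own packet would spend none.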

We don't think it is really relevant to compare the design here to that of the VP8 payload format, but to name just one benefit of this design compared to a single payload structure with the same functionality: the separate payload structures allow adding fields only when required.  For example, in an environment where you want to encapsulate just one NAL unit into one packet, the design here provides a much simpler implementation with low overhead, without unnecessary additional fields within the payload header.

Also, we believe the difference between actual NAL units and NAL-unit-like structures has been made clear. Where is the ambiguity between the two?

PACI should be framed as an optional extension to the payload header (signaled by a flag) rather than a separate payload structure altogether.

[Authors] No, we don’t think it should.

The reasons for the design have been extensively discussed in the WG; see, for example, the PAYLOAD minutes of IETF 87 (where a strawman was presented), the mailing list discussions (between IETF 87 and IETF 88), and the AVTCORE minutes of IETF 88 (agreement on one of the options, ratified on the mailing list later).

The key technical issue that speaks against your proposal is that there is no room in the 16 bit payload header for that flag you need.  Since RFC 3984, one fundamental design principle of NAL unit based video payload formats has been that the NAL unit header (specified in the video codec groups, though with input by what one could call the “PAYLOAD” community) co-serves as the payload header, reducing the payload header overhead essentially to 0 bits for many practical packets.  Why add an octet for that one flag (and reserved flags), when there is a perfectly valid solution specified already?
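
As an illustration of why there is no spare bit, here is a minimal sketch (not text from the draft) that parses the two-octet payload header using the HEVC NAL unit header layout: a 1-bit F (forbidden_zero_bit), a 6-bit Type, a 6-bit LayerId, and a 3-bit TID (temporal ID plus 1).  The function name and the dictionary keys are chosen here for illustration only.

```python
def parse_payload_header(data):
    """Split the first two octets into the four NAL unit header fields.

    Layout assumed (HEVC NAL unit header): F(1) | Type(6) | LayerId(6) | TID(3).
    The field widths sum to 16, so every bit is already allocated.
    """
    if len(data) < 2:
        raise ValueError("payload header needs at least 2 octets")
    hdr = (data[0] << 8) | data[1]
    return {
        "F": hdr >> 15,
        "Type": (hdr >> 9) & 0x3F,
        "LayerId": (hdr >> 3) & 0x3F,
        "TID": hdr & 0x07,
    }
```

For example, the octets 0x40 0x01 parse as F=0, Type=32, LayerId=0, TID=1.  Any additional flag would have to live in a third octet, or be signaled through a reserved Type value as PACI does.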

More generally speaking, the extension mechanism of the payload format is not through the adding of bits in a (variable length) payload header, but by picking one of the NAL unit types reserved specifically for that purpose.  PACI did just that.  PACI is extensible in its own right, and future proof.

In any case, Section 4.6 through 4.9 should be subsections of 4.3.

[Authors] Sections 4.6 to 4.9 use what was discussed in Section 4.4. However, we can move Section 4.3 to be after Section 4.4, then making Sections 4.6 to 4.9 its subsections. To be done in the next version.

Section 4.5 - "Wait, what?"  This section is the first mention of Decoding Order Number, and the reader has no idea where this DON[n] value comes from.  Placing this section after 4.6 - 4.9 (as suggested above) would clear this up.

[Authors] We can also do this.

Section 4.5:
"""
   When two consecutive NAL units in the NAL unit decoding order
   have different values of AbsDon, the value of AbsDon for the
   second NAL unit in decoding order MUST be greater than the value
   of AbsDon for the first NAL unit
"""
This is redundant with the bullets above (which are already wordier than necessary).

[Authors] Good spotting. This redundancy will be removed in the next version.

Section 7 - Using the IANA template to describe these parameters is really awkward, e.g., given the way you seem to want to group them.  It would be better to pull most of this text into a separate section that is referenced by the IANA considerations.  Given how separable it is from the rest of the spec, I could even imagine section 7 being pulled into a separate document.  A 20-page IANA template is not cool.

[Authors] There has been general agreement in the AVT and PAYLOAD groups since before 2000 that IANA templates go into the payload RFCs.  It’s done for reasons.  20-page IANA templates may be “not cool”, but there is precedent for them (well: there is precedent for 16-page IANA templates) and we haven’t heard of any problems with them.

We don’t mind the work in doing that change, but we would like to do it only once.  Perhaps we should ask IANA, or wait for the SDP directorate input?

Section 7 - You seem to be allowing for negotiation on every possible basis.  Are there real-world receivers that need to indicate, say, a max-fps constraint without already being constrained by one of the other max-* factors?

[Authors] This is intended and has historically been exercised, at least to a large extent. With respect to your example, max-fps can be a limiting factor of the display, while all other max-* parameters jointly won't provide the desired constraint.

"... maximum picture rate in units of pictures per 100 seconds"
"The highest level the receiver supports is equal to the value of max-recv-level-id divided by 30."
Why the gratuitous constants?

[Authors] They certainly have their (good technical and historical) reasons, and we think it is probably not really appropriate to add all such detailed reasons; people with relevant video expertise should not feel bewildered here. The 100 is there to support non-integer picture/frame rates, such as 29.97 fps. There is no need to have more than two digits of fractional precision. The 30 is simply what is specified by the HEVC specification, which has some more complicated reasons, e.g., to enable using 8 bits to represent a sufficient number of levels while still allowing intermediate levels to be added in the future, and video market people are used to saying Level 3.1 instead of Level 31, and so on.
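
A small worked example of the two constants, as a sketch only (the function names are hypothetical and not part of the draft): a picture rate parameter expressed in pictures per 100 seconds is divided by 100 to get pictures per second, and a level-id value is divided by 30 to get the level number.

```python
from fractions import Fraction

def picture_rate_per_second(value_per_100s):
    """Convert a rate given in pictures per 100 seconds to pictures per second."""
    return Fraction(value_per_100s, 100)

def level_number(level_id):
    """Convert a level-id value to the level number by dividing by 30."""
    return Fraction(level_id, 30)

# 2997 pictures per 100 seconds is 29.97 pictures per second.
rate = picture_rate_per_second(2997)
# A level-id of 93 corresponds to Level 3.1; 150 corresponds to Level 5.
lvl_a = level_number(93)
lvl_b = level_number(150)
```

Exact fractions are used so that 29.97 and 3.1 are represented without floating-point rounding; an integer parameter on the wire avoids having to parse decimal fractions at all.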

Section 8 - Please do not re-create the registry here.  It will only cause confusion when the registry changes.

[Authors] We are not (re-)creating the registry, but just introducing which code points have been used and for what messages. This was useful when an earlier version defined a new message, to show which code point should be used for the new one. It was later agreed to move that new message to a different topic. Thus, indeed, the introduction is no longer needed. We will remove it in the next version.

EDITORIAL

There's a bunch of really convoluted writing in here.  These are examples that I've hit upon, with suggested resolutions.  I would encourage the authors to read through the document with an eye toward simplifying similar instances.

Section 4.4.
OLD: "the value of AbsDon for NAL unit n, is derived as equal to n."
NEW: "the value of AbsDon for NAL unit n is n."

[Authors] No.  Imprecision like this may be common in some parts of the IETF, but it would be a shame to change precise language to imprecise just to meet the lower standards of other docs.

With reference to your example, wording like "something is n" describes a status, while here we are describing a process. Also, if you take the entire sentence (copied below for convenience) into consideration, it is not good English to remove the comma between 'n' and 'is', either.

The original entire sentence is: If tx-mode is equal to "SRST" and sprop-max-don-diff is equal to 0, AbsDon[n], the value of AbsDon for NAL unit n, is derived as equal to n.

Section 6.
OLD: "Initial buffering lasts until condition A (...) or condition B (...) is true."
NEW: "Initial buffering lasts until either (A) ... or (B) ..."

[Authors] The suggested change is not good, as we needed to describe the process while at the same time defining what condition A and condition B are, since both are used subsequently.

Of course, we can try to make the paragraph longer, by first defining condition A and condition B and then describing the process, like:

Let condition A be …. Let condition B be …. Initial buffering lasts until condition A or condition B is true.

But that is not what you are asking for, right?

Section 6.
OLD: "condition A and condition A"
NEW: "condition A and condition B"

[Authors] Yes - this is obviously a typo - thanks for another good spotting.

Best Regards,
The authors of this draft
