Re: [Gen-art] Gen-art last call review of draft-ietf-codec-opus-12.txt (completed)

Cullen Jennings <fluffy@cisco.com> Thu, 17 May 2012 16:07 UTC

Subject: Re: [Gen-art] Gen-art last call review of draft-ietf-codec-opus-12.txt (completed)
Mime-Version: 1.0 (Apple Message framework v1084)
Content-Type: text/plain; charset="us-ascii"
From: Cullen Jennings <fluffy@cisco.com>
In-Reply-To: <1337203600.31554.2213.camel@mightyatom.folly.org.uk>
Date: Thu, 17 May 2012 10:07:36 -0600
Content-Transfer-Encoding: quoted-printable
Message-Id: <6D6B36EF-C4B4-4C65-AA07-6C6A59725291@cisco.com>
References: <1337001184.23527.1544.camel@mightyatom.folly.org.uk> <4FB2F5D0.7070701@jmvalin.ca> <1337203600.31554.2213.camel@mightyatom.folly.org.uk>
To: Elwyn Davies <elwynd@folly.org.uk>
Cc: codec@ietf.org, Jean-Marc Valin <jmvalin@jmvalin.ca>, IETF discussion <ietf@ietf.org>, draft-ietf-codec-opus.all@tools.ietf.org
Precedence: list

I think the authors just about have a -14 draft ready but I wanted to comment on one topic inline ....


On May 16, 2012, at 3:26 PM, Elwyn Davies wrote:

> Hi, Jean-Marc.
> 
> ... and thanks for the super-quick response!  You have been quite busy.
> 
> I have had a look through the new draft and I think the additions help
> considerably with comprehension for the naive (and to give new
> implementers a way in.)
> 
> I'll leave you to negotiate with the RFC Editor over the Wikipaedia
> references.  To quote the RFC Style guide
> http://www.rfc-editor.org/rfc-style-guide/rfc-style
> Section 4.8, item (x) References, last section:
>> URLs and DNS names in RFCs
>> 
>>      The use of URLs in RFCs is discouraged, because many URLs are not
>>      stable references.  Exceptions may be made for normative
>>      references in those cases where the URL is demonstrably the most
>>      stable reference available.  References to long-lived files on
>>      ietf.org and rfc-editor.org are generally acceptable.
> They are certainly convenient *as long as they remain in place and
> aren't corrupted*.
> 
> I found a couple of trivial editorial nits in the changes:
> s4.3.3 (in the added text):
>> The CELT layer, however, can adapt over a very wide range of rates,
>> and thus has a large number of codebooks sizes
> s/codebooks/codebook/
> 
> s4.3.3, para after Table 57: s?the maximums in bit/sample are
> precomputed?the maximums in bits/sample are precomputed?
> 
> Also suggest:
> s4.3: Add reference for Bark scale: Zwicker, E. (1961), "Subdivision of
> the audible frequency range into critical bands," The Journal of the
> Acoustical Society of America, 33, Feb., 1961.
> 
> A few responses in line below (agreed pieces elided): 
> 
> Regards,
> Elwyn  Davies
> ====================================================================================
> 
> On Tue, 2012-05-15 at 20:33 -0400, Jean-Marc Valin wrote:
>> Hi Elwyn,
>> 
>> Thanks for the very thorough review. We've addressed your issues and
>> submitted draft version -13. See our response to each of the issues you
>> raised (aggregated from all the authors) in the attached document.
> 
> Thanks very much to all the authors.
>> 
>> Cheers,
>> 
>> 	Jean-Marc
>> 
> 
> 
>> Elwyn Davies wrote:
>>> Major issues: 
>>> Can we accept the code as normative?  If not how do we proceed?
>> 
>> The issue with code being normative was specifically addressed in 
>> the guidelines document for this WG (RFC 6569).
>> 
> Yes.  I failed to read this although Robert Sparks did point out to me
> that the code was normative - but he didn't think he said this was
> agreed in advance (or maybe I didn't understand what he was saying).
> 
> To be honest I would like to have had the time to tie the text to the
> code but realistically that is several weeks or even months work to do
> it properly - without that I feel that I have only done half a job. I
> may decide that it interests me enough to have a superficial go next
> week but no promises!
> 
>> 
>>> Minor issues:
>>> Contrast between decoder descriptions of LP part and CELT part:  The
>>> SILK descriptions go into gory detail on the values used in lots of
>>> tables, etc., whereas the CELT part has a very limited treatment of
>> the
>>> numeric values used (assuming reliance on finding the values in the
>>> reference implementation, either explictly or implicitly).  There
>> are
>>> things to be said for both techniques.  I was wondering (while
>> reading
>>> the SILK description) if the authors have any means of automatically
>>> generating the tables from the code in the SILK part (or vice versa)
>> to
>>> avoid double maintenance. On the other hand, there are pieces of the
>>> CELT decoder description (especially in s4.3.3 where knowing numbers
>> of
>>> bands, etc.) where some actual numbers would help comprehension.
>>> 
>> 
>> We have made many changes to section 4.3 (and 4.3.3 specifically) to
>> address
>> the specific issues below. As for the tables, they are not generated
>> automatically.
> 
> I think this is addressed satisfactorily now.  There is still some
> difference but it is much reduced and not so glaring now. The addition
> of the band tables really helps.
> 
>> 
>> 
>>> s2 (and more generally):  The splitting of the signal in the
>> frequency
>>> domain into signal (components?) below and above 8kHz presumably
>>> requires that the signal is subjected to a Discrete Fourier
>> Transform to
>>> allow the signal to be split.  I think sometging is needed in s2 to
>>> explain how this is managed (or if I don't understand, to explain
>> why it
>>> isn't necessary).
>> 
>> No DFT is used. The lower band is obtained through resampling (which
>> is already described) and the higher band is obtained by not coding
>> the lower band with CELT (the text says that CELT starts at band 17 in
>> hybrid mode). The explanation was reworded to make this as clear as
>> possible at this point in the text.
> 
> [I thought I had reworded this comment in the 2nd version to talk about
> MDCT but no matter]. 
> Yes, para 5 of s2 does say that the bands are discarded.  I think it
> would useful to have a concrete statement in the new text added to s4.3
> that bands 0 to 16 are discarded in hybrid mode (thereby making the 17
> in the band boost section more obvious) [There is a comment below that
> you have added some text about band 17 in section 4.3 but I can't see
> it].
> 
> 
>> 
>>> s4.2.5, para 3:
>>>>    When switching from 20 ms to 10 ms, the 10 ms
>>>>    20 ms Opus frame, potentially leaving a hole that needs to be
>>>>    concealed from even a single packet loss.
>>> How?
>> 
>> As explained in the LBRR text, a 10 ms frame will only contain 10 ms
>> LBRR data even if the previous frame was 20 ms, so there's 10 ms
>> "missing".
> Indeed - that there would be a hole was clear.  The 'How' referred to
> how would it be concealed.  Having read further by now this may be down
> to Packet Loss Concealment - so maybe all it needs is a foward ref to
> s4.4. 
>> 
>>> s4.3:
>>> As a complete newcomer to CELT, I would have appreciated a more high
>>> level understanding of what CELT is doing at this point.  I  tried
>>> reading s4.3 without any additional input and found it very hard
>> going.
>>> Eventually I gave up and went looking for some additional input.
>> This
>>> presentation seems to have a useful view 
>>> http://www.ietf.org/proceedings/79/slides/codec-2.pdf
>>> I think that it would be extremely helpful to have a description
>> similar
>>> to this at this point in the document, even though there is some
>>> material in section 5.3 which could also be forward referenced.
>> Still
>>> the material in s5.3 does not start from the basic principles that
>> CELT
>>> is using, and since these are essentially novel, it would be very
>> good
>>> to give prospective implementers/users an understanding of what is
>> going
>>> on.  Incidentally, I found the above IETF presentation more useful
>> than
>>> http://www.celt-codec.org/presentations/misc/lca-celt.pdf
>>> Note that the SILK part seems rather less opaque.  It would also be
>>> useful to indicate numerically how many bands are involved and what
>> the
>>> number of MDCT bins are in the various bands. 
>> 
>> The intro of section 4.3 has been expanded with general information
>> about CELT
>> similar to what the codec-2.pdf slides from 79 included. 
>> 
> I think this is an excellent improvement.
>> 
>>> Nits/editorial comments:
>>> 
>>> global: bytes ->  octets
>> 
>> We believe that in the field of audio codecs, the mention of "byte"
>> without
>> further context is well understood to mean 8 bits.
> 
> True. But this is a matter of IETF style.  The style is to use octets
> where we mean 8 bit bytes. I think you now have a mixture!


This topic has generated more phone calls and discussion for me than all the rest of your review comments put together :-) 

The use of the term "byte" does not seem to have caused any confusion or issues in either the spec writing or in the implementations and it is the term generally used by the community to refer to a group of 8 bits - so folks would like to stay with that. If the IESG wants to put  a discuss on this, preferable with a pointer to the relevant section of the discuss criteria document, we can deal with that then. If using octet is the IESG's desire, the document can easily be moved to use octet and my belief is that implementors, thought finding that sort of weird , will not have implementation or interoperability problems from that change. 

Thanks, Cullen <CODEC WG Co-Chair>


> 
>> 
>>> global: The form/terminology Q<n>  (e.g., Q13, Q15, Q16) ought to be
>>> explained.
>> 
>> This was already defined in the section on notation:
>>   The notation "Q<n>", where n
>>   is an integer, denotes the number of binary digits to the right of
>>   the decimal point in a fixed-point number.
> Sorry - I missed that.
>> 
>>> s3.4: I am not keen on duplicating normative requirements in this
>> way
>>> (double maintenance issue).  It would be better to put explicit
>> numbered
>>> requirements in the sections above an reference the resulting
>> numbers
>>> here.
>> 
>> A checklist style summary is quite useful for implementors, and worth
>> the maintenance burden for a document that is (hopefully) going to be
>> published once and read many times. The list intentionally does not
>> include any RFC 2119 keywords, to avoid any conflict should there
>> (accidentally) be room to interpret the re-statement any differently
>> from the original statement. Numbering the requirements and
>> referencing the numbers is still a good idea, but it should be
>> possible to read the list without flipping back and forth to the
>> previous sections.
> 
> Good solution!
> 
>> 
>>> s4.1.1:  This is really a global point.  This section refers to
>>> entdec.c.  Presumably (since we haven't reached the code yet) and it
>> is
>>> still compressed, there is some file structure.  I don't think this
>> has
>>> been said above.  It would be good to provide a list of the file
>>> components (i.e., sectional structure of the code) at the start,
>> maybe
>>> even  giving line number positions within the decompressed code.
>> 
>> In cases where it was deemed necessary (e.g. large functions), there
>> are
>> indeed line numbers in references to the code. As for a list of files
>> we did not think it was useful because one already needs to decompress
>> the code to see the references.
> 
> OK. We'll have to live with this situation.
> 
> Having looked at the code, I think it is a considerable pity that it
> isn't Doxygen commented (or some such) throughout so that the whole
> system can be viewed as a Doxygen tree. I can smell roasted programmer
> from here... :-)
>> 
>>> s4.2.7.5.1, para 1: s/This indexes an element in a coarse codebook,
>>>    selects the PDFs for the second stage of the VQ/This indexes an
>>>    element in a coarse codebook that selects the PDFs for the
>> second stage
>>>    of the VQ/
>> 
>> The text as written is correct. The index I1 is what selects the PDFs
>> for the second stage, not the vector from the coarse codebook in
>> Tables 23 and 24. I.e., it's saying, "This does A, B, and C."
> 
> OK.  I think it might be clearer if the three things were separated out
> as a list.  Now you point it out I can read it correctly but it
> triggered minor confusion - worth turning the three things into bullet
> points.
> 
> NEW:  s4.3: Add reference for Bark scale: Zwicker, E. (1961),
> "Subdivision of the audible frequency range into critical bands," The
> Journal of the Acoustical Society of America, 33, Feb., 1961.
> 
> Generally the new intro to s4.3 helps *a lot*.
>>> s4.3.2.1: Avoid the equation picture splitting across page
>> boundaries.
>>> in the current version it is unclear what the denominator is. (use
>>> needLines processing direcrive in xml2rfc).  Same applies to the
>>> equation below Table 57 in s4.3.4.3.
>> 
>> It's not quite clear how to use needLines without undesirable
>> side-effects.
>> Hopefully this is something the RFC editor should be able to handle.
> Indeed.. but I don't know what undesirable side effects there are?
> AFAIK (and in my own usage) it just ensures there are n lines available
> on the current page at some point in the text and forces a page throw if
> not.
>> 
>> 
>>> s4.3.3: (was specified as s 4.3.2.3 whcj was wrong) Paragraph on
>> decoding band boosts:  Might be improved by using
>>> equations rather than the wordy descriptions used at present.
> 
> Any thoughts on this one
>>> 
>>> (global)s4.3.2.3, para above table 56: s/iff/if and only if/
>> 
>> Fixed.
>> 
>> 
>>> s4.3.3: <<snip>>.
>> 
>> Added an explanation of band 17 
> 
> I don't think this happened.
> 
>> 
>> 
>>> s6.1: This section is perhaps a little 'triumphalist' for the
>> reference
>>> implementation (this may of course be justified!.  The quality
>> metric is
>>> a 'relative quality metric' and presumably if someone does a
>> *really*
>>> good job of coding, it is entirely possible that the new algorithms
>>> might get better than 100 on the quality score (i.e., provide a
>> better
>>> answer than the reference implementation).
>> 
>> Conformance with the specification is defined by a faithful
>> reproduction of
>> the specified decoder behavior (see RFC 6569 s4.3). By specifying the
>> decoder, future encoders have the freedom to have improved quality
>> with the
>> confidence that they know what output a conforming decoder will
>> produce.
>> 
>> The tests in s6.1 are measuring the quality of the decoded signal
>> against the 
>> reference decoded signal, not against the original audio, so greater
>> than 100%
>> wouldn't be possible or meaningful. The test signals have been
>> specially
>> prepared to thoroughly test a decoder implementation, and they
>> sacrifice encoded
>> quality in order to rapidly exercise the corner cases.
>> 
> You might want to add this comment to the text.
> 
> As regards the 100 limit, I was sort of assuming that the quality figure
> was derived from improving on the 48dB SNR figure.  Probably a
> misreading.  AS a matter of interest, would one be able to tell from the
> tests that a putative new implementation really was 'better' in some
> sense? Or is this now almost a subjective matter that can only be
> determined by extensive listening tests?  I got the impression we may be
> converging on the diminishing returns point.
> 
> /Elwyn
> 
>> 
> 
>

Gen-art last call review of draft-ietf-codec-opus… Elwyn Davies
Re: Gen-art last call review of draft-ietf-codec-… SM
Re: Gen-art last call review of draft-ietf-codec-… Cullen Jennings
Re: Gen-art last call review of draft-ietf-codec-… Elwyn Davies
Re: Gen-art last call review of draft-ietf-codec-… Jean-Marc Valin
Re: [Gen-art] Gen-art last call review of draft-i… Cullen Jennings
Re: [Gen-art] Gen-art last call review of draft-i… Elwyn Davies
Re: [Gen-art] Gen-art last call review of draft-i… Jean-Marc Valin
Re: [Gen-art] Gen-art last call review of draft-i… Elwyn Davies