Re: [Gen-art] Gen-art last call review of draft-ietf-codec-opus-12.txt (completed)

Elwyn Davies <> Wed, 16 May 2012 21:26 UTC

Return-Path: <>
Received: from localhost (localhost []) by (Postfix) with ESMTP id 149B821F87CD; Wed, 16 May 2012 14:26:50 -0700 (PDT)
X-Virus-Scanned: amavisd-new at
X-Spam-Flag: NO
X-Spam-Score: 0.873
X-Spam-Status: No, score=0.873 tagged_above=-999 required=5 tests=[BAYES_00=-2.599, FH_HOST_EQ_D_D_D_D=0.765, FH_HOST_EQ_D_D_D_DB=0.888, HELO_EQ_IP_ADDR=1.119, J_CHICKENPOX_13=0.6, RDNS_DYNAMIC=0.1]
Received: from ([]) by localhost ( []) (amavisd-new, port 10024) with ESMTP id 2W0cFGKDAa6V; Wed, 16 May 2012 14:26:45 -0700 (PDT)
Received: from ( [IPv6:2001:8b0:0:30::51bb:1e35]) by (Postfix) with ESMTP id D109721F8799; Wed, 16 May 2012 14:26:44 -0700 (PDT)
Received: from ([] helo=[]) by with esmtp (Exim 4.72) (envelope-from <>) id 1SUlk2-0005BY-54; Wed, 16 May 2012 22:26:42 +0100
From: Elwyn Davies <>
To: Jean-Marc Valin <>
In-Reply-To: <>
References: <> <>
Content-Type: text/plain
Organization: Folly Consulting
Date: Wed, 16 May 2012 22:26:40 +0100
Message-Id: <>
Mime-Version: 1.0
X-Mailer: Evolution 2.26.3
Content-Transfer-Encoding: 7bit
Cc: General Area Review Team <>, "" <>, IETF discussion <>,
Subject: Re: [Gen-art] Gen-art last call review of draft-ietf-codec-opus-12.txt (completed)
X-Mailman-Version: 2.1.12
Precedence: list
List-Id: "GEN-ART: General Area Review Team" <>
List-Unsubscribe: <>, <>
List-Archive: <>
List-Post: <>
List-Help: <>
List-Subscribe: <>, <>
X-List-Received-Date: Wed, 16 May 2012 21:26:50 -0000

Hi, Jean-Marc.

... and thanks for the super-quick response!  You have been quite busy.

I have had a look through the new draft and I think the additions help
considerably with comprehension for the naive (and to give new
implementers a way in.)

I'll leave you to negotiate with the RFC Editor over the Wikipaedia
references.  To quote the RFC Style guide
Section 4.8, item (x) References, last section:
> URLs and DNS names in RFCs
>       The use of URLs in RFCs is discouraged, because many URLs are not
>       stable references.  Exceptions may be made for normative
>       references in those cases where the URL is demonstrably the most
>       stable reference available.  References to long-lived files on
> and are generally acceptable.
They are certainly convenient *as long as they remain in place and
aren't corrupted*.

I found a couple of trivial editorial nits in the changes:
s4.3.3 (in the added text):
> The CELT layer, however, can adapt over a very wide range of rates,
> and thus has a large number of codebooks sizes

s4.3.3, para after Table 57: s?the maximums in bit/sample are
precomputed?the maximums in bits/sample are precomputed?

Also suggest:
s4.3: Add reference for Bark scale: Zwicker, E. (1961), "Subdivision of
the audible frequency range into critical bands," The Journal of the
Acoustical Society of America, 33, Feb., 1961.

A few responses in line below (agreed pieces elided): 

Elwyn  Davies

On Tue, 2012-05-15 at 20:33 -0400, Jean-Marc Valin wrote:
> Hi Elwyn,
> Thanks for the very thorough review. We've addressed your issues and
> submitted draft version -13. See our response to each of the issues you
> raised (aggregated from all the authors) in the attached document.

Thanks very much to all the authors.
> Cheers,
> 	Jean-Marc

> Elwyn Davies wrote:
> > Major issues: 
> > Can we accept the code as normative?  If not how do we proceed?
> The issue with code being normative was specifically addressed in 
> the guidelines document for this WG (RFC 6569).
Yes.  I failed to read this although Robert Sparks did point out to me
that the code was normative - but he didn't think he said this was
agreed in advance (or maybe I didn't understand what he was saying).

To be honest I would like to have had the time to tie the text to the
code but realistically that is several weeks or even months work to do
it properly - without that I feel that I have only done half a job. I
may decide that it interests me enough to have a superficial go next
week but no promises!

> > Minor issues:
> > Contrast between decoder descriptions of LP part and CELT part:  The
> > SILK descriptions go into gory detail on the values used in lots of
> > tables, etc., whereas the CELT part has a very limited treatment of
> the
> > numeric values used (assuming reliance on finding the values in the
> > reference implementation, either explictly or implicitly).  There
> are
> > things to be said for both techniques.  I was wondering (while
> reading
> > the SILK description) if the authors have any means of automatically
> > generating the tables from the code in the SILK part (or vice versa)
> to
> > avoid double maintenance. On the other hand, there are pieces of the
> > CELT decoder description (especially in s4.3.3 where knowing numbers
> of
> > bands, etc.) where some actual numbers would help comprehension.
> > 
> We have made many changes to section 4.3 (and 4.3.3 specifically) to
> address
> the specific issues below. As for the tables, they are not generated
> automatically.

I think this is addressed satisfactorily now.  There is still some
difference but it is much reduced and not so glaring now. The addition
of the band tables really helps.

> > s2 (and more generally):  The splitting of the signal in the
> frequency
> > domain into signal (components?) below and above 8kHz presumably
> > requires that the signal is subjected to a Discrete Fourier
> Transform to
> > allow the signal to be split.  I think sometging is needed in s2 to
> > explain how this is managed (or if I don't understand, to explain
> why it
> > isn't necessary).
> No DFT is used. The lower band is obtained through resampling (which
> is already described) and the higher band is obtained by not coding
> the lower band with CELT (the text says that CELT starts at band 17 in
> hybrid mode). The explanation was reworded to make this as clear as
> possible at this point in the text.

[I thought I had reworded this comment in the 2nd version to talk about
MDCT but no matter]. 
Yes, para 5 of s2 does say that the bands are discarded.  I think it
would useful to have a concrete statement in the new text added to s4.3
that bands 0 to 16 are discarded in hybrid mode (thereby making the 17
in the band boost section more obvious) [There is a comment below that
you have added some text about band 17 in section 4.3 but I can't see

> > s4.2.5, para 3:
> >>     When switching from 20 ms to 10 ms, the 10 ms
> >>     20 ms Opus frame, potentially leaving a hole that needs to be
> >>     concealed from even a single packet loss.
> > How?
> As explained in the LBRR text, a 10 ms frame will only contain 10 ms
> LBRR data even if the previous frame was 20 ms, so there's 10 ms
> "missing".
Indeed - that there would be a hole was clear.  The 'How' referred to
how would it be concealed.  Having read further by now this may be down
to Packet Loss Concealment - so maybe all it needs is a foward ref to
> > s4.3:
> > As a complete newcomer to CELT, I would have appreciated a more high
> > level understanding of what CELT is doing at this point.  I  tried
> > reading s4.3 without any additional input and found it very hard
> going.
> > Eventually I gave up and went looking for some additional input.
> This
> > presentation seems to have a useful view 
> >
> > I think that it would be extremely helpful to have a description
> similar
> > to this at this point in the document, even though there is some
> > material in section 5.3 which could also be forward referenced.
> Still
> > the material in s5.3 does not start from the basic principles that
> > is using, and since these are essentially novel, it would be very
> good
> > to give prospective implementers/users an understanding of what is
> going
> > on.  Incidentally, I found the above IETF presentation more useful
> than
> >
> > Note that the SILK part seems rather less opaque.  It would also be
> > useful to indicate numerically how many bands are involved and what
> the
> > number of MDCT bins are in the various bands. 
> The intro of section 4.3 has been expanded with general information
> about CELT
> similar to what the codec-2.pdf slides from 79 included. 
I think this is an excellent improvement.
> > Nits/editorial comments:
> > 
> > global: bytes ->  octets
> We believe that in the field of audio codecs, the mention of "byte"
> without
> further context is well understood to mean 8 bits.

True. But this is a matter of IETF style.  The style is to use octets
where we mean 8 bit bytes. I think you now have a mixture!

> > global: The form/terminology Q<n>  (e.g., Q13, Q15, Q16) ought to be
> > explained.
> This was already defined in the section on notation:
>    The notation "Q<n>", where n
>    is an integer, denotes the number of binary digits to the right of
>    the decimal point in a fixed-point number.
Sorry - I missed that.
> > s3.4: I am not keen on duplicating normative requirements in this
> way
> > (double maintenance issue).  It would be better to put explicit
> numbered
> > requirements in the sections above an reference the resulting
> numbers
> > here.
> A checklist style summary is quite useful for implementors, and worth
> the maintenance burden for a document that is (hopefully) going to be
> published once and read many times. The list intentionally does not
> include any RFC 2119 keywords, to avoid any conflict should there
> (accidentally) be room to interpret the re-statement any differently
> from the original statement. Numbering the requirements and
> referencing the numbers is still a good idea, but it should be
> possible to read the list without flipping back and forth to the
> previous sections.

Good solution!

> > s4.1.1:  This is really a global point.  This section refers to
> > entdec.c.  Presumably (since we haven't reached the code yet) and it
> is
> > still compressed, there is some file structure.  I don't think this
> has
> > been said above.  It would be good to provide a list of the file
> > components (i.e., sectional structure of the code) at the start,
> maybe
> > even  giving line number positions within the decompressed code.
> In cases where it was deemed necessary (e.g. large functions), there
> are
> indeed line numbers in references to the code. As for a list of files
> we did not think it was useful because one already needs to decompress
> the code to see the references.

OK. We'll have to live with this situation.

Having looked at the code, I think it is a considerable pity that it
isn't Doxygen commented (or some such) throughout so that the whole
system can be viewed as a Doxygen tree. I can smell roasted programmer
from here... :-)
> > s4., para 1: s/This indexes an element in a coarse codebook,
> >     selects the PDFs for the second stage of the VQ/This indexes an
> >     element in a coarse codebook that selects the PDFs for the
> second stage
> >     of the VQ/
> The text as written is correct. The index I1 is what selects the PDFs
> for the second stage, not the vector from the coarse codebook in
> Tables 23 and 24. I.e., it's saying, "This does A, B, and C."

OK.  I think it might be clearer if the three things were separated out
as a list.  Now you point it out I can read it correctly but it
triggered minor confusion - worth turning the three things into bullet

NEW:  s4.3: Add reference for Bark scale: Zwicker, E. (1961),
"Subdivision of the audible frequency range into critical bands," The
Journal of the Acoustical Society of America, 33, Feb., 1961.

Generally the new intro to s4.3 helps *a lot*.
> > s4.3.2.1: Avoid the equation picture splitting across page
> boundaries.
> > in the current version it is unclear what the denominator is. (use
> > needLines processing direcrive in xml2rfc).  Same applies to the
> > equation below Table 57 in s4.3.4.3.
> It's not quite clear how to use needLines without undesirable
> side-effects.
> Hopefully this is something the RFC editor should be able to handle.
Indeed.. but I don't know what undesirable side effects there are?
AFAIK (and in my own usage) it just ensures there are n lines available
on the current page at some point in the text and forces a page throw if
> > s4.3.3: (was specified as s whcj was wrong) Paragraph on
> decoding band boosts:  Might be improved by using
> > equations rather than the wordy descriptions used at present.

Any thoughts on this one
> > 
> > (global)s4.3.2.3, para above table 56: s/iff/if and only if/
> Fixed.
> > s4.3.3: <<snip>>.
> Added an explanation of band 17 

I don't think this happened.

> > s6.1: This section is perhaps a little 'triumphalist' for the
> reference
> > implementation (this may of course be justified!.  The quality
> metric is
> > a 'relative quality metric' and presumably if someone does a
> *really*
> > good job of coding, it is entirely possible that the new algorithms
> > might get better than 100 on the quality score (i.e., provide a
> better
> > answer than the reference implementation).
> Conformance with the specification is defined by a faithful
> reproduction of
> the specified decoder behavior (see RFC 6569 s4.3). By specifying the
> decoder, future encoders have the freedom to have improved quality
> with the
> confidence that they know what output a conforming decoder will
> produce.
> The tests in s6.1 are measuring the quality of the decoded signal
> against the 
> reference decoded signal, not against the original audio, so greater
> than 100%
> wouldn't be possible or meaningful. The test signals have been
> specially
> prepared to thoroughly test a decoder implementation, and they
> sacrifice encoded
> quality in order to rapidly exercise the corner cases.
You might want to add this comment to the text.

As regards the 100 limit, I was sort of assuming that the quality figure
was derived from improving on the 48dB SNR figure.  Probably a
misreading.  AS a matter of interest, would one be able to tell from the
tests that a putative new implementation really was 'better' in some
sense? Or is this now almost a subjective matter that can only be
determined by extensive listening tests?  I got the impression we may be
converging on the diminishing returns point.