On 12-05-16 05:26 PM, Elwyn Davies wrote:
>> The CELT layer, however, can adapt over a very wide range of rates,
>> and thus has a large number of codebooks sizes
> s/codebooks/codebook/

Fixed.

> s4.3.3, para after Table 57: s?the maximums in bit/sample are
> precomputed?the maximums in bits/sample are precomputed?

Fixed.

> Also suggest:
> s4.3: Add reference for Bark scale: Zwicker, E. (1961), "Subdivision of
> the audible frequency range into critical bands," The Journal of the
> Acoustical Society of America, 33, Feb., 1961.

Done.

>> No DFT is used. The lower band is obtained through resampling (which
>> is already described) and the higher band is obtained by not coding
>> the lower band with CELT (the text says that CELT starts at band 17 in
>> hybrid mode). The explanation was reworded to make this as clear as
>> possible at this point in the text.
> 
> [I thought I had reworded this comment in the 2nd version to talk about
> MDCT but no matter]. 
> Yes, para 5 of s2 does say that the bands are discarded.  I think it
> would be useful to have a concrete statement in the new text added to s4.3
> that bands 0 to 16 are discarded in hybrid mode (thereby making the 17
> in the band boost section more obvious) [There is a comment below that
> you have added some text about band 17 in section 4.3 but I can't see
> it].

Sorry, we started working on the revision as soon as you sent the first part
of the review, and then we just copied the new parts (we didn't notice that
some of the earlier comments had changed).

Also, the reason you didn't see the new explanations about band 17 in s4.3 is
that we moved them to s2 para 5, to help make the explanation of the signal
splitting clearer, but forgot to update the response to indicate that (sorry
about that). However, it probably does make sense to explain it in both
places, so the sentence,
  "In hybrid mode, the first 17 bands (up to 8 kHz) are not coded."
has been added to s4.3 as well.

>> As explained in the LBRR text, a 10 ms frame will only contain 10 ms
>> LBRR data even if the previous frame was 20 ms, so there's 10 ms
>> "missing".
> Indeed - that there would be a hole was clear.  The 'How' referred to
> how would it be concealed.  Having read further by now this may be down
> to Packet Loss Concealment - so maybe all it needs is a forward ref to
> s4.4. 

Reference added.

>> We believe that in the field of audio codecs, the mention of "byte"
>> without further context is well understood to mean 8 bits.
> 
> True. But this is a matter of IETF style.  The style is to use octets
> where we mean 8 bit bytes. I think you now have a mixture!
> 
>>

Indeed, there's a bit of inconsistency here. Considering Cullen's email,
the document now uses "byte" consistently.

>>> s4.2.7.5.1, para 1: s/This indexes an element in a coarse codebook,
>>>     selects the PDFs for the second stage of the VQ/This indexes an
>>>     element in a coarse codebook that selects the PDFs for the
>>>     second stage of the VQ/
>>
>> The text as written is correct. The index I1 is what selects the PDFs
>> for the second stage, not the vector from the coarse codebook in
>> Tables 23 and 24. I.e., it's saying, "This does A, B, and C."
> 
> OK.  I think it might be clearer if the three things were separated out
> as a list.  Now you point it out I can read it correctly but it
> triggered minor confusion - worth turning the three things into bullet
> points.

This is not a bad idea. I agree it helps make things clearer. Done.

> NEW:  s4.3: Add reference for Bark scale: Zwicker, E. (1961),
> "Subdivision of the audible frequency range into critical bands," The
> Journal of the Acoustical Society of America, 33, Feb., 1961.

Done (as stated above).

>>> s4.3.3: (was specified as s 4.3.2.3 which was wrong) Paragraph on
>>> decoding band boosts:  Might be improved by using
>>> equations rather than the wordy descriptions used at present.
> 
> Any thoughts on this one

Oops, that one slipped through. While most of the text is actually
describing an algorithm rather than an equation, it was possible to simplify
the part about the quanta with an equation. The text now reads:
"For each band from the coding start (0 normally, but 17 in Hybrid mode)
to the coding end (which changes depending on the signaled bandwidth), the
boost quanta in units of 1/8 bit is calculated as:
quanta = min(8*N, max(48, N))."
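
For readers who prefer code, here is a minimal sketch of that sentence. It
is an illustration only, not the reference implementation: the function
names are made up for this email, N is assumed to be the number of MDCT
bins in the band, and the coding end is assumed to be exclusive and is
passed in by the caller because it depends on the signaled bandwidth.

```python
def boost_quanta(n: int) -> int:
    """Boost quanta in units of 1/8 bit.

    Assumption: n is the N of the quoted text, taken here to be the
    number of MDCT bins in the band.
    """
    return min(8 * n, max(48, n))


def coded_bands(hybrid: bool, coding_end: int) -> range:
    """Bands visited when decoding band boosts.

    Coding starts at band 0 normally and at band 17 in Hybrid mode.
    Assumption: coding_end is exclusive and supplied by the caller.
    """
    coding_start = 17 if hybrid else 0
    return range(coding_start, coding_end)


if __name__ == "__main__":
    # Illustrative band sizes only; they are not taken from the draft.
    for n in (1, 6, 16, 100):
        print(n, boost_quanta(n))
```

For narrow bands the 8*N term wins (the quanta is 8 per bin), for
mid-sized bands the quanta is pinned at 48, and for wide bands it grows
as N.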

>>> s4.3.3: <<snip>>.
>>
>> Added an explanation of band 17 
> 
> I don't think this happened.

See above -- operator error. It's fixed now.

>> The tests in s6.1 are measuring the quality of the decoded signal
>> against the reference decoded signal, not against the original audio,
>> so greater than 100% wouldn't be possible or meaningful. The test
>> signals have been specially prepared to thoroughly test a decoder
>> implementation, and they sacrifice encoded quality in order to rapidly
>> exercise the corner cases.
>>
> You might want to add this comment to the text.

Added the comment about 100 being the max. As for the other part, the test
vector section already states that:
"These test vectors were created specifically to exercise all aspects of the
decoder and therefore the audio quality of the decoded output is
significantly lower than what Opus can achieve in normal operation."

> As regards the 100 limit, I was sort of assuming that the quality figure
> was derived from improving on the 48dB SNR figure.  Probably a
> misreading.  As a matter of interest, would one be able to tell from the
> tests that a putative new implementation really was 'better' in some
> sense? Or is this now almost a subjective matter that can only be
> determined by extensive listening tests?  I got the impression we may be
> converging on the diminishing returns point.

You can't have a "better" decoder because the reference implementation is
*by definition* the best decoder possible. From there, the encoder can
be improved to optimize the quality of a bitstream to be decoded by that
reference decoder. The encoder included with the reference is mature enough
that improvements usually need to be validated with human listening tests;
objective quality measurements alone aren't reliable enough to distinguish
'different' from 'better', unless the change is very significant.