Elwyn Davies wrote: > Major issues: > Can we accept the code as normative? If not how do we proceed? The issue with code being normative was specifically addressed in the guidelines document for this WG (RFC 6569). > Minor issues: > Contrast between decoder descriptions of LP part and CELT part: The > SILK descriptions go into gory detail on the values used in lots of > tables, etc., whereas the CELT part has a very limited treatment of the > numeric values used (assuming reliance on finding the values in the > reference implementation, either explictly or implicitly). There are > things to be said for both techniques. I was wondering (while reading > the SILK description) if the authors have any means of automatically > generating the tables from the code in the SILK part (or vice versa) to > avoid double maintenance. On the other hand, there are pieces of the > CELT decoder description (especially in s4.3.3 where knowing numbers of > bands, etc.) where some actual numbers would help comprehension. > We have made many changes to section 4.3 (and 4.3.3 specifically) to address the specific issues below. As for the tables, they are not generated automatically. > s2 (and more generally): The splitting of the signal in the frequency > domain into signal (components?) below and above 8kHz presumably > requires that the signal is subjected to a Discrete Fourier Transform to > allow the signal to be split. I think sometging is needed in s2 to > explain how this is managed (or if I don't understand, to explain why it > isn't necessary). No DFT is used. The lower band is obtained through resampling (which is already described) and the higher band is obtained by not coding the lower band with CELT (the text says that CELT starts at band 17 in hybrid mode). The explanation was reworded to make this as clear as possible at this point in the text. > s4.1: Despite having found a copy of the range coding paper and tried to > read it, I found the whole business of range coding opaque. Given that > this is (I think) fairly novel, some additional less opaque description > could be very helpful to people trying to understand what is going on. > In particular knowledge of how entropy coding works is pretty much a > prerequisite. Added a link to the Wikipedia article on range coding. > s4.2.5, para 3: >> When switching from 20 ms to 10 ms, the 10 ms >> 20 ms Opus frame, potentially leaving a hole that needs to be >> concealed from even a single packet loss. > How? As explained in the LBRR text, a 10 ms frame will only contain 10 ms LBRR data even if the previous frame was 20 ms, so there's 10 ms "missing". > s4.2.5, para 4: >> In order to properly produce LBRR frames under all conditions, an >> encoder might need to buffer up to 60 ms of audio and re-encode it >> during these transitions. However, the reference implementation opts >> to disable LBRR frames at the transition point for simplicity. >> > Should this be phrased in RFC 2119 requiremenmts language? The first > part sounds like a SHOULD with the second part being the get out , but > its not entirely clear what the consequences are other then simplicity. There really is no SHOULD there. We're just describing possible strategies for encoding. In general, we have avoided 2119 language as much as possible for the encoder. However, the sentence, "Since transitions are relatively infrequent in normal usage, this does not have a significant impact on packet loss robustness," was added to alleviate any concerns about the consequences of this decision. > ====================================================================== > MINOR ISSUES ABOVE submitted as part 1 of review > ====================================================================== > general/s11.2: Several references [Hadamard], [Viterbi}, etc., are to > Wikipaedia pages. Whilst these are convenient (and only illustrative) > they are not guaranteed to be very stable. Better (i.e., more stable) > references are desirable. We believe that the links themselves should be quite stable. We could also link specific revisions in Wikipedia, but the quality of the content is more likely to get better than worse. Considering that these references are informational, we believe that Wikipedia is a good source. > s1, para 2: >> The decoder contains a great deal of integer and fixed-point >> arithmetic which must be performed exactly, including all rounding >> considerations,... > Is this a MUST? There are instances in the text which might contradict > this (e.g., para 1 of s4.2.7.5 which has (capitalized) SHOULDs). This sentence did not mean to imply that _all_ such arithmetic must be performed exactly, only that there exists a good deal of arithmetic for which this is true (for example, to achieve the final range check MUST from Section 6). Since it isn't clear at this point exactly which arithmetic is being referred to, we can't use RFC 2119 strength here. Replaced "must be" with "needs to be" to avoid the impression of an RFC 2119 requirement. > s4.3: > As a complete newcomer to CELT, I would have appreciated a more high > level understanding of what CELT is doing at this point. I tried > reading s4.3 without any additional input and found it very hard going. > Eventually I gave up and went looking for some additional input. This > presentation seems to have a useful view > http://www.ietf.org/proceedings/79/slides/codec-2.pdf > I think that it would be extremely helpful to have a description similar > to this at this point in the document, even though there is some > material in section 5.3 which could also be forward referenced. Still > the material in s5.3 does not start from the basic principles that CELT > is using, and since these are essentially novel, it would be very good > to give prospective implementers/users an understanding of what is going > on. Incidentally, I found the above IETF presentation more useful than > http://www.celt-codec.org/presentations/misc/lca-celt.pdf > Note that the SILK part seems rather less opaque. It would also be > useful to indicate numerically how many bands are involved and what the > number of MDCT bins are in the various bands. The intro of section 4.3 has been expanded with general information about CELT similar to what the codec-2.pdf slides from 79 included. > s4.3, last para: Can the 'out of range error' occur in the LP decoder? (if not why not?) Added the following text: "Such out of range errors cannot occur in the SILK layer." > ====================================================================== > > Nits/editorial comments: > > global: bytes -> octets We believe that in the field of audio codecs, the mention of "byte" without further context is well understood to mean 8 bits. > global: The form/terminology Q (e.g., Q13, Q15, Q16) ought to be > explained. This was already defined in the section on notation: The notation "Q", where n is an integer, denotes the number of binary digits to the right of the decimal point in a fixed-point number. > s1: Expand CELP Done. > s1.1.6: Need to define floor(). Done. Also defined abs(), ceil(), and round(). > s2: ? Expand SILK SILK is not an acronym. > s2: Reference for Vorbis. Linked to http://xiph.org/vorbis/ > s2.1.8: Expand AAC (and MP3). Expanded MP3: MPEG 1, Layer 3 AAC: Advanced Audio Coding > s3: References for Ogg, Matroska, maybe RTP. Done. > s3: Fig 1, Table 2 and intervening text: Presumably SILK only (Table 2) > etc correspond to MODES 1-3 in Figure 1. This needs to be consistent. Changed LP to SILK and MDCT to CELT. > s3.2.1: Make it clear which of the octets is len[0]/len[1]. To be > precise it might be better to say len0/len1 are the values of the two > length octets (in whichever order you intend). The form len[0] could be > misinterpreted as a function 'length of 0'. Replaced len[0] and len[1] with first_byte and second_byte, respectively. > s3.2.5: better s/figure below/Figure 5/ Fixed for all cases of "figure below". > s3.2.5: >> In the CBR case, the compressed length of each frame in bytes is >> equal to the number of remaining bytes in the packet after >> subtracting the (optional) padding, (N-2-P), divided by M. This >> number MUST be a non-negative integer multiple of M. > 'This number' is not the compressed length of each frame that is the > subject of the first sentence, but the number of remaining octets - this > needs rewording. Fixed by defining R=N-2-P, e.g., In the CBR case, let R=N-2-P be the number of bytes remaining in the packet after subtracting the (optional) padding. Then the compressed length of each frame in bytes is equal to R/M. The value R MUST be a non-negative integer multiple of M. The compressed data for all M frames follows, each of size R/M bytes, as illustrated in Figure 6. > s3.2.5: >> The number of header bytes (TOC byte, frame >> count byte, padding length bytes, and frame length bytes), plus the >> length of the first M-1 frames themselves, plus the length of the >> padding MUST be no larger than N, the total size of the packet. > Surely this is a non sequitur? This might be better phrased as 'The > total size of a well formed packet MUST be at least...' Note that the text mentions the "length of the first M-1 frames", so the calculation does not include the size of the last frame. That being said, we also clarified by using "signaled length" instead of "length". > s3.3: The example diagrams ought to have figure numbers. Fixed. > s3.4: I am not keen on duplicating normative requirements in this way > (double maintenance issue). It would be better to put explicit numbered > requirements in the sections above an reference the resulting numbers > here. A checklist style summary is quite useful for implementors, and worth the maintenance burden for a document that is (hopefully) going to be published once and read many times. The list intentionally does not include any RFC 2119 keywords, to avoid any conflict should there (accidentally) be room to interpret the re-statement any differently from the original statement. Numbering the requirements and referencing the numbers is still a good idea, but it should be possible to read the list without flipping back and forth to the previous sections. > s4.1: >> The decoder initializes rng to 128 and initializes val to >> 127 minus the top 7 bits of the first input octet. > How are the 'top seven bits' to be interpreted here? e.g. as the bottom > seven bits of a 8 bit integer field? an 8 bit integer with the lowest > bit zeroed out? Tried for something more explicit: Let b0 be the first input octet (or zero if there are no octets in this Opus frame). The decoder initializes rng to 128 and initializes val to (127 - (b0>>1)), where (b0>>1) is the top 7 bits of the first input octet. It saves the remaining bit, (b0&1), for use in the renormalization procedure described in ... > s4.1.1: This is really a global point. This section refers to > entdec.c. Presumably (since we haven't reached the code yet) and it is > still compressed, there is some file structure. I don't think this has > been said above. It would be good to provide a list of the file > components (i.e., sectional structure of the code) at the start, maybe > even giving line number positions within the decompressed code. In cases where it was deemed necessary (e.g. large functions), there are indeed line numbers in references to the code. As for a list of files we did not think it was useful because one already needs to decompress the code to see the references. > s4.1.1.1: >> Then it reads the next octet of the >> payload and combines it with the left-over bit buffered from the >> previous octet to form the 8-bit value sym. It takes the left-over >> bit as the high bit (bit 7) of sym, and the top 7 bits of the octet >> it just read as the other 7 bits of sym. > This is not well phrased. Better > Then it reads the next octet of the payload [packet? payload hasn't > really been used before] and combines the left over bit from the "Packet" is inaccurate, as the next byte in the packet may come from a different Opus frame. Replaced with "Opus frame" instead. > previous octet (see Section 4.1 for starting this process) as the high > bit (bit 7)| of 'sym' and the top 7 bits of the octet as the other 7 > bits of sym, leaving the remaining bit for the next iteration. This may be trying to do too much in a single sentence. Perhaps: Then it reads the next octet of the Opus frame and forms an 8-bit value sym, using the left-over bit buffered from the previous octet as the high bit and the top 7 bits of the octet just read as the other 7 bits of sym. The remaining bit in the octet just read is buffered for use in the next iteration. If no more input octets remain, it uses zero bits instead. See for the initialization used to process the first octet. Also broke out range-decoder-init into its own section to make it easier to reference. > s4.1.5.2: > Should r_Q15 = rng>> (l-16) be r_Q15 = rng>> (lg-16)? There doesn't > seem to be an 'l' defined. Fixed. Sorry; this is a leftover from changing "l" to "lg" in response to one of the AD's review comments. > s4.2.1: Expand LTP earlier. It would also be useful to expand LPC again. Done (both). > s4.2.2: acronym VAD is not expanded until the beginning of s4.2.3. Fixed. > s4.2.7: acronym LSF needs to be expanded on first use. Fixed. > s4.2.7.1: Explain briefly why Table 7 has values for indices 0 to 15 > when wi0/1 are in range 0 to 14. Added: Although wi0 and wi1 only have 15 possible values, Table 7 contains 16 entries to allow interpolation between entry wi0 and (wi0 + 1) (and likewise for wi1). > s4.2.7.4, para below Table 12: >> These 6 bits are combined to form a gain index between 0 and 63. > s/gain index/gain_index/ as this variable is used subsequently. Used: These 6 bits are combined to form a value, gain_index, between 0 and 63. Also updated the "delta gain index" for subframes without an independent gain, below. > s4.2.7.4: The use of log_gain seems slightly confusing when combined > with gain_index. One at least is presumably log scaled. Maybe a bit > more explanation is needed. The gain_index happens to map directly to log_gain for subframes with an independent gain (with the exception of the clamping), but this is not true for subframes where the gain is coded relative to the previous subframe. We are open to suggestions that would make the need for this distinction clearer. Also replaced "previous gain index" with "previous_log_gain", as it is the latter that actually matters. > ====================================================================== > COMMENTS ABOVE submitted as part 1 of review > ====================================================================== > s4.2.7.2, last para: >> In that case, if this >> flag is zero (indicating that there should be a side channel), then >> Packet Loss Concealment (PLC, see Section 4.4) SHOULD be invoked to >> recover a side channel signal. > What are the consequences (or what actions need to be taken) if it is > not invoked? Added: Otherwise, the stereo image will collapse. > s4.2.7.5, para 1: >> These represent the interleaved zeros on the >> unit circle between 0 and pi (hence "normalized") in the standard >> decomposition of the LPC filter into a symmetric part and an anti- >> symmetric part (P and Q in Section 4.2.7.5.6). > 'on the unit circle between 0 and pi' might be clearer as 'on the upper > half of the unit circle' or 'on the half of the unit circle in the > positive imaginary area of the complex plane'. > 'standard decomposition'? Needs a reference. Reworded as: These represent the interleaved zeros on the upper half of the unit circle (between 0 and pi, hence "normalized") in the standard decomposition [line-spectral-pairs] of the LPC filter into a symmetric part and an anti-symmetric part (P and Q in Section 4.2.7.5.6). Added the reference [line-spectral-pairs] pointing to http://en.wikipedia.org/wiki/Line_spectral_pairs > s4.2.7.5, para 1: A reference for the use of LSF in LPC would be useful. The above reference serves this function as well. > s4.2.7.5.x: There is inconsistent use of stage 1/stage 2 vs > stage-1/stage-2. Please be consusistent "Stage 1" is correct when referring to the noun by itself, or as an object of a verb ("stage 1 decoding"). "Stage-1" is correct when modifying another noun ("stage-1 index"). However, we changed the titles of some of the tables around to allow them to use the modifier form more consistently. > s4.2.7.5, para 1: >> Because of non-linear >> effects in the decoding process, an implementation SHOULD match the >> fixed-point arithmetic described in this section exactly. An encoder >> SHOULD also use the same process. > - Does this contradict the 'must' in s1, para 2? No, for the reasons described above. > - What are the consequences of ignoring the SHOULD? How bad would they > get? Might it become unstable and how would one know? They would not become unstable. However, changing the exact arithmetic may lead to a decoder output that is not close enough to that of the reference decoder to pass the conformance test. > s4.2.7.5.1, para 1: s/This indexes an element in a coarse codebook, > selects the PDFs for the second stage of the VQ/This indexes an > element in a coarse codebook that selects the PDFs for the second stage > of the VQ/ The text as written is correct. The index I1 is what selects the PDFs for the second stage, not the vector from the coarse codebook in Tables 23 and 24. I.e., it's saying, "This does A, B, and C." > s4.2.7.5.3, last para: >> However, nothing in >> either the reconstruction process or the quantization process in the >> encoder thus far guarantees that the coefficients are monotonically >> increasing and separated well enough to ensure a stable filter. > A reference that indicates why this requirement is needed would be desirable. > (and also for s4.2.5.7.8). Added a reference to: P. Kabal and R. P. Ramachandran, "The Computation of Line Spectral Frequencies Using Chebyshev Polynomials", IEEE Trans. Acoustics, Speech, Signal Processing, vol. 34, no. 6, pp. 1419-1426, Dec. 1986. > s4.2.7.5.4 and Table 25: Are the values in Table 25 NDeltaMin or NDelatMin_Q15? > The equations after Table 25 use both NDeltaMin and NDeltaMin_Q15. Is this correct? > In particular the first two equations deliver _Q15 values but use raw NDeltaMin. Good catch. All uses of NDeltaMin should have been NDeltaMin_Q15. > s4.1.1/s4.2.7.1 and other places: The term 'exact integer division' is > used in various places. My understanding was that this phrase implied > that it was known that the dividend was an exact multiple of the divisor > by some out-of-band means. This doesn't seem to be the case generally > in Opus (e.g,, where both n/5 and n%5 are needed - clearly this doesn't > anticipate n%5 == 0 every time!) So what does 'exact integer division' > imply? A definition may be needed. The "exact" here was merely meant in contrast to inexact floating-point division. Removed "exact", leaving only "integer division" (which, like all other operators, uses C conventions, as stated in Notation and Conventions). Hopefully this should avoid people trying to read too much into the phrase. > s4.3, last para: s/described in the figure above./described in Table 55 above./ Fixed. > s4.3.1: >> The "transient" flag encoded in the bitstream has a probability of 1/8. > This statement appears out of the blue apparently. Some more > explanation of what the transient flag actually implies and why we > should be so sure about its PDF would help. Added a bit of context and moved the probability 1/8 to later in the text. > s4.3.2.1: Arguably a reference is needed for the z-transform. Reference added. > s4.3.2.1: Avoid the equation picture splitting across page boundaries. > in the current version it is unclear what the denominator is. (use > needLines processing direcrive in xml2rfc). Same applies to the > equation below Table 57 in s4.3.4.3. It's not quite clear how to use needLines without undesirable side-effects. Hopefully this is something the RFC editor should be able to handle. > 4.3.2.1, after the equations: >> The >> prediction is clamped internally so that fixed point implementations >> with limited dynamic range do not suffer desynchronization. > As a person with limited skills in the srt, I have no idea what > desynchronization implies here. Clarified to "always remain in the same state as floating point implementations". > 4.3.2.1, ibid: >> We >> approximate the ideal probability distribution of the prediction >> error using a Laplace distribution with separate parameters for each >> frame size in intra- and inter-frame modes. > I suspect this sentence belongs before the equation described the z-transform. > Where are the values of the parameters for the inter-frame mode defined > (the intra-frame ones are in the text)? The probability model is for the difference between the energy and its prediction. Added a reference to the table where the pdf parameters are located in the code. > s4.3.2.3: Paragraph on decoding band boosts: Might be improved by using > equations rather than the wordy descriptions used at present. > > (global)s4.3.2.3, para above table 56: s/iff/if and only if/ Fixed. > s4.3.2.3: LOG2_FRAC_TABLE is missing. Text now says that the table is in rate.c > s4.3.3: It would be helpful to explain either here, or at the outset of > s4.3 overall, how the concept of energy bands and MDCT bins applies to > the CELT part of the codec, and just how many bands and bins are used. > Some of this is contained in s5.3.2, but the magic number 17 appears > later in 4.3.3 which is presumably something to do with the point in the > frequency domain that CELT takes over from LP in the hybrid mode. It > would make the very complex section 4.3.3 rather easier to understand > with this extra information - I have to say I struggled! On reflection, > I think an example of what bits are allocated to a band and how thay rae > subsequently used would be quite helpful - Without going to delve into > the code I am really not clear that I understand just what bits are > allocated and what they then encode and I have read the text quite a few > times now. Added an explanation of band 17 at the beginning of 4.3. There is also some text about the number of bins per band. > s4.3.3: Be consistent between 'tone to noise' and 'tone-to-noise'. The text was rephrased in terms of signal-to-noise. > s4.3.3: >> The band-energy normalized structure of Opus MDCT mode ensures that a >> constant bit allocation for the shape content of a band will result >> in a roughly constant tone to noise ratio, which provides for fairly >> consistent perceptual performance. The effectiveness of this >> approach is the result of two factors: that the band energy, which is >> understood to be perceptually important on its own, is always >> preserved regardless of the shape precision, and because the constant >> tone-to-noise ratio implies a constant intra-band noise to masking >> ratio. Intra-band masking is the strongest of the perceptual masking >> effects. This structure means that the ideal allocation is more >> consistent from frame to frame than it is for other codecs without an >> equivalent structure. > This paragraph contains a number of interesting assertions: Is there a > reference where one could see them justified (it may be that this is the > result of original research in the Opus team). The particular importance of energy preservation was substantively a discovery of the Vorbis team during the later tuning of that codec. Vorbis encoders include a separate step called 'noise normalization' which attempts to restore lost energy. In CELT this energy preservation was made implicit in the design of the codec. The text here is intended to explain the motivation behind the particular band-energy normalized structure of this part of the codec in the hopes that a higher-level understanding would help make the component steps make a little more sense. This is "original research", but there is now a link to an older paper describing this research: [Valin2010] Valin, JM., Terriberry, T., Montgomery, C., and G. Maxwell, "A High-Quality Speech and Audio Codec With Less Than 10 ms delay", IEEE Trans. on Audio, Speech and Language Processing, Vol. 18, No. 1, pp. 58-67 2010. > s4.3.3, paragraphs after the bullet points: The concepts of 'shape' and > 'shape encoding' is introcuced here without explicit definition. Are we > talking about the shape windowing used in FFT/MDCT here? This should be > made clear. The concept of shape is now discussed at the beginning of s4.3. > s4.3.3, 5th para after bullet points: s/In the reference the maximums/In > the reference implementation the maximums/ Fixed. > s4.3.3, 6th para after the bullet points: A table of bands per mode and > number of MDCT bins covered would be helpful here in order to get a > feeling for the scale of the problem. Also the cache_caps50 table in > the code contains the magic number 168. Where does this come from? The layout of the band is now included at the beginning of 4.3. Also, section 4.3.3 now includes an explanation for the size of cache_caps50 (21 bands, 4 values of LM, mono+stereo). > s4.3.3, 6th para after the bullet points: >> set LM to the shift value for the frame size (e.g. 0 for 120, 1 for >> 240, 3 for 480), > Where do these frame sizes get specified? And what is the total set of > frame sizes? The text says 'e.g.' (which incidentally should be 'e.g.,') > implying that this is not the complete set. The calculation of LM is now given earlier in section 4.3.3 > s4.3.3, 6th para after the bullet points: Need to define 'truncating > integer division' to go with 'exact integer division'. Replaced with "integer division" > s4.3.3, 7th para after the bullet points: >> The band boosts are represented by a series of binary symbols which >> are coded with very low probability. > How many, at least, and what values? Are these range encoded? I don't > see them in the table above or with a PDF specified. The text has been clarified to say that this is entropy coded. Also, the probabilities that correspond to 6 bit and 2 bits (1/64 and 1/4, respectively) are given to emphasize the fact that the number of bits is directly related to the probabilities. > s4.3.3, 7th para after the bullet points: >> and every time >> a band is boosted the initial cost is reduced (down to a minimum of >> two). > Would that be a value of two or two bits? Two bits (fixed in the draft). > s4.3.3: Paragraph on decoding band boosts: Might be improved by using > equations rather than the wordy descriptions used at present. > > (global)s4.3.3, para above table 56: s/iff/if and only if/ Fixed. > s4.3.3, 2nd para after Table 56: >> For stereo frames, bits are reserved for intensity stereo and for >> dual stereo. Intensity stereo requires ilog2(end-start) bits. > The terms 'intenmsity stereo' and 'dual stereo' don't appear to have > been defined. Added a definition of intensity, dual, and mid-side stereo to the beginning of section 4.3. > s4.3.3: LOG2_FRAC_TABLE is missing. Text now says that the table is in rate.c > 4.3.4.3, last para: >> If the decoded vector represents more than one time block, then the >> following process is applied separately on each time block. > Should this sentence come before the previous paragraph? There isn't > really a 'following process' in this section and I don't think it menas > the process in s4.3.4.4? Indeed. Replaced "the following process" with "this spreading process". > s4.3.4.3, last sentence: >> This extra rotation is applied in an interleaved manner with a stride >> equal to round(sqrt(N/nb_blocks)) > I think this needs some more explanation for the uninitiated. Although "stride" is a relatively common term, we added this to the sentence: "i.e. it is applied independently for each set of sample S_k = {stride*n + k}, n=0..N/stride-1." > s4.3.4.4: >> Multiple levels of splitting may be >> applied up to a frame size dependent limit. > What this limit is does not appear to be defined. Added the limit of LM+1 splits. > s4.3.5: The 'collapse' phenomenon is not fully defined, and it would be > useful to mention why it happens. Also s/min/minimum/. Added a short description of the anti-collapse feature and fixed the s/min/minimum/ > s4.3.6: s/Just like/Just as/ Fixed. > s4.3.7, last para: I think 'power complementarity' requires further > explanation or a reference. Added reference to: John P. Princen and Alan B. Bradley, "Analysis/synthesis filter bank design based on time domain aliasing cancellation," IEEE Trans. Acoust. Speech Sig. Proc. ASSP-34 (5), 1153-1161 (1986) > s4.5, para 3: s/To avoid or reduces glitches during these/To avoid or > reduce glitches during these/ Fixed. > s4.5.1.1, para 1: s/For for SILK-only/For SILK-only/ Fixed. > s4.5.1.4, para 2: s/redundant frame is as-is,/redundant frame as-is,/ Fixed. > s5, figure 16: The Optional High-pass Filter box has two spurious '+' > symbols on the vertical sides. Fixed. > s5, last para: A reference for the Auto Regressive Moving Average > (ARMA) filter would be useful. This is a simple issue and a reference to the entire theory would have made the issue more complicated. The parenthesis now states "i.e. with poles and zeros" to clarify this. > s5.2.3.4.2.1, title: s/Burgs method/Burg's Method/ Fixed. > s5.2.3.5, para 2: Expand 'R/D performance' (or probably specify it as > abbreviation for rate-distortion in para 1). Done. > s5.3.5, para 1 below equation: Is E an abbreviation for 'extra'? Oops. s/extra/E/ in the text below. > s5.3.6: The abbreviation RD for rate-distortion is defined here (see > comment on s5.2.3.5). Fixed. > s6.1: This section is perhaps a little 'triumphalist' for the reference > implementation (this may of course be justified!. The quality metric is > a 'relative quality metric' and presumably if someone does a *really* > good job of coding, it is entirely possible that the new algorithms > might get better than 100 on the quality score (i.e., provide a better > answer than the reference implementation). Conformance with the specification is defined by a faithful reproduction of the specified decoder behavior (see RFC 6569 s4.3). By specifying the decoder, future encoders have the freedom to have improved quality with the confidence that they know what output a conforming decoder will produce. The tests in s6.1 are measuring the quality of the decoded signal against the reference decoded signal, not against the original audio, so greater than 100% wouldn't be possible or meaningful. The test signals have been specially prepared to thoroughly test a decoder implementation, and they sacrifice encoded quality in order to rapidly exercise the corner cases. > s6.2: Just wondering, but is non-standard frame size the only option > offered by Opus Custom? If not, probably more text is needed here. Are > there any major changes to the algorithms implied by the use of Opus > Custom? Opus-custom permits more flexibility for frame-sizes at a cost of limiting the mode switching, requiring more dynamic memory, etc. The algorithm itself remains exactly the same, though some things like the band count/layout change to accommodate the different frame sizes while still approximating the same Bark structure. > s7. Referencing SECGUIDE (RFC 3552) seems inappropriate since it occurs > in such a security considerations section. Just omit it. Done. > s11.2: A number of references are to Wikipaedia pages. While these were > useful to me in refreshing or initializing my knowledge, they are not > usually considered adequately stable for use in RFCs. I fear you may > have to provide more stable references. In the cases where we use Wikipedia citations, they are intended as informative general references for introductory subject matter, and we are not relying on any specific details or wording that would be affected by further revisions. While the alternatives are often dense academic papers that may not be well suited to a general codec implementing audience.