[Cellar] My notes on https://datatracker.ietf.org/doc/draft-ietf-cellar-flac/
Spencer Dawkins at IETF <spencerdawkins.ietf@gmail.com> Mon, 27 February 2023 02:42 UTC
Return-Path: <spencerdawkins.ietf@gmail.com>
X-Original-To: cellar@ietfa.amsl.com
Delivered-To: cellar@ietfa.amsl.com
Received: from localhost (localhost [127.0.0.1]) by ietfa.amsl.com (Postfix) with ESMTP id 4D536C151B16 for <cellar@ietfa.amsl.com>; Sun, 26 Feb 2023 18:42:55 -0800 (PST)
X-Virus-Scanned: amavisd-new at amsl.com
X-Spam-Flag: NO
X-Spam-Score: -0.086
X-Spam-Level:
X-Spam-Status: No, score=-0.086 tagged_above=-999 required=5 tests=[BAYES_00=-1.9, DKIM_SIGNED=0.1, DKIM_VALID=-0.1, DKIM_VALID_AU=-0.1, DKIM_VALID_EF=-0.1, FREEMAIL_FROM=0.001, HTML_MESSAGE=0.001, RCVD_IN_DNSWL_NONE=-0.0001, RCVD_IN_ZEN_BLOCKED_OPENDNS=0.001, SPF_HELO_NONE=0.001, SPF_PASS=-0.001, T_KAM_HTML_FONT_INVALID=0.01, URIBL_DBL_BLOCKED_OPENDNS=0.001, URIBL_ZEN_BLOCKED_OPENDNS=0.001, URI_DOTEDU=1.999] autolearn=no autolearn_force=no
Authentication-Results: ietfa.amsl.com (amavisd-new); dkim=pass (2048-bit key) header.d=gmail.com
Received: from mail.ietf.org ([50.223.129.194]) by localhost (ietfa.amsl.com [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id nCTuS9bl1iQl for <cellar@ietfa.amsl.com>; Sun, 26 Feb 2023 18:42:50 -0800 (PST)
Received: from mail-pj1-x102a.google.com (mail-pj1-x102a.google.com [IPv6:2607:f8b0:4864:20::102a]) (using TLSv1.3 with cipher TLS_AES_128_GCM_SHA256 (128/128 bits) key-exchange X25519 server-signature RSA-PSS (2048 bits) server-digest SHA256) (No client certificate requested) by ietfa.amsl.com (Postfix) with ESMTPS id 9E458C14F721 for <cellar@ietf.org>; Sun, 26 Feb 2023 18:42:50 -0800 (PST)
Received: by mail-pj1-x102a.google.com with SMTP id x20-20020a17090a8a9400b00233ba727724so10458217pjn.1 for <cellar@ietf.org>; Sun, 26 Feb 2023 18:42:50 -0800 (PST)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20210112; h=to:subject:message-id:date:from:mime-version:from:to:cc:subject :date:message-id:reply-to; bh=BuBezTGja85O5Ig9o6l/a7jlXQh3HKdcdWMMjzsNoSQ=; b=Fu+wmoqeuXsH6G8IgWXx5GMlT8XJw+mANE5b44IoreAJVTQ14RGqAlQKZJfnxlPT/C n4kWnPYz9OB1AZKPn9m9rJ4HMDywSVujDfQ8bGCw/l9boboTjAhOKQ+/1s0L9/xPm7yk QZXa+eXZpTJDbJFN5cYo3lR9tdckSiSZSglQuCIf/C8nuImiLF2mVsee9nV22vGpMk+c a2+gKtPdEZb1BuS038n4LWZPQQzlKjqo8lhaHL/PBd5azJ5Ulyah1WPQamsutXQyUxjN vwBhnDtXfCW4GCXBMeX1/4cbFyQ1HkdwF0XW0pMiY07de/XN5r2/MFWogLFEbxVet5sV o9yw==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=to:subject:message-id:date:from:mime-version:x-gm-message-state :from:to:cc:subject:date:message-id:reply-to; bh=BuBezTGja85O5Ig9o6l/a7jlXQh3HKdcdWMMjzsNoSQ=; b=TV3OMKQJCdBNwjChKLpyprp3vPFDZwUYRNgjVGrAOhlmbr9Ql6cjxptF7IrQsHmzNw aehKW5hJ4yeVTUlLQik1vCiVU0qVyRlimuw/KfedPI7Vr8TTzMKt3ffGMtnr0IVu8CXQ 3Ap5rpJNz61LotW9ITyIAeYlDRt6Z1+0WjNQFUg3IY4voQMYxrgp2zIkiuOwtfiMP9tI d1H2q/tFG2YzX8YGCUwPDg7mkhW+T8rS5Da8ylCnKrcnLXLjoe4LJEEZPG2lvn9uSbC+ uYY8pRTEbVBNTGi6Ty08N75TJIzfHM+0/9MBkQ2Vna2Qvo6Uk9jQnJ0xSE3youF/J/pd zbsg==
X-Gm-Message-State: AO0yUKXCXwKukorWhLiDOAxZ3vxWi2Z53sxlv/x29nmEObqTSfiDq7lG yYXKQJiVHLiNy8aETM5sX5wcv8xerXwaxvh1G+GjBn8Pc5S5Kg==
X-Google-Smtp-Source: AK7set98PxnbcTDs3NqTccSoXTqHcn9qR2Kk9qalIs/l3u0NAPUaBpWY6+fQ8qqLCPR+Az+rmY2Lcm/ObUk6xjZ+ir0=
X-Received: by 2002:a17:90a:dd8d:b0:230:3b84:9169 with SMTP id l13-20020a17090add8d00b002303b849169mr4108174pjv.2.1677465768953; Sun, 26 Feb 2023 18:42:48 -0800 (PST)
MIME-Version: 1.0
From: Spencer Dawkins at IETF <spencerdawkins.ietf@gmail.com>
Date: Sun, 26 Feb 2023 20:42:21 -0600
Message-ID: <CAKKJt-fp4Xf5uoM9TOC6T1azCgCD0AGA-LbH4X03w0ot1CAPvA@mail.gmail.com>
To: Codec Encoding for LossLess Archiving and Realtime transmission <cellar@ietf.org>
Content-Type: multipart/alternative; boundary="000000000000480d7705f5a56f73"
Archived-At: <https://mailarchive.ietf.org/arch/msg/cellar/gYqVZiws52oA38zSrqw5yXEj_TA>
Subject: [Cellar] My notes on https://datatracker.ietf.org/doc/draft-ietf-cellar-flac/
X-BeenThere: cellar@ietf.org
X-Mailman-Version: 2.1.39
Precedence: list
List-Id: Codec Encoding for LossLess Archiving and Realtime transmission <cellar.ietf.org>
List-Unsubscribe: <https://www.ietf.org/mailman/options/cellar>, <mailto:cellar-request@ietf.org?subject=unsubscribe>
List-Archive: <https://mailarchive.ietf.org/arch/browse/cellar/>
List-Post: <mailto:cellar@ietf.org>
List-Help: <mailto:cellar-request@ietf.org?subject=help>
List-Subscribe: <https://www.ietf.org/mailman/listinfo/cellar>, <mailto:cellar-request@ietf.org?subject=subscribe>
X-List-Received-Date: Mon, 27 Feb 2023 02:42:55 -0000
Dear Cellar, I'm still working my way through https://datatracker.ietf.org/doc/draft-ietf-cellar-flac/. I still need to read the Appendices, but I wanted to pass along my notes so far. In general, I like the document. I do have questions, especially about the use of BCP14 terminology, and some suggestions for readability, but this should get discussion started. I expect to have the reviews of the appendices done by tomorrow at close of business, my time. Best, Spencer ----------- I have a couple of comments on the abstract. Abstract This document defines the Free Lossless Audio Codec (FLAC) format. FLAC is designed to reduce the amount of computer storage space needed to store digital audio signals without needing to remove information in doing so (i.e. lossless). I might say, "without losing information in doing so (i.e. lossless)." FLAC is free in the sense that its specification is open, its reference implementation is open- source and it is not encumbered by any known patent. This ^^^^ might all be true, but I'd suggest removing at least "and it is not encumbered by any known patent". The IETF tries not to make statements about whether a technology is encumbered - the first question a reviewer would ask, is "any patent known to whom?", and it would get worse from there. Compared to other lossless (audio) coding formats, FLAC is a format with low complexity and can be coded to and from with little computing resources. Decoding of FLAC has seen many independent implementations on many different platforms, and both encoding and decoding can be implemented without needing floating-point arithmetic. 1. Introduction The coding methods provided by the FLAC format work best on PCM audio signals of which the samples have a signed representation and are centered around zero. Audio signals in which samples have an unsigned representation must be transformed to a signed representation as described in this document in order to achieve reasonable compression. The FLAC format is not suited to compress audio that is not PCM. Given ^^^^, do you need to say this? Is there a meaningful difference between "not suited for" and "cannone be compressed"? Pulse-density modulated audio, e.g. DSD, cannot be compressed by FLAC. 2. Notation and Conventions In this next paragraph, I'd suggest replacing "may be" with "are". "May" tends to be used in describing protocol operations in the IETF, and these are statements of fact. Values expressed as u(n) represent unsigned big-endian integer using n bits. Values expressed as s(n) represent signed big-endian integer using n bits, signed two's complement. n may be expressed as an equation using * (multiplication), / (division), + (addition), or - (subtraction). An inclusive range of the number of bits expressed may be represented with an ellipsis, such as u(m...n). The name of a value followed by an asterisk * indicates zero or more occurrences of the value. The name of a value followed by a plus sign + indicates one or more occurrences of the value. 3. Acknowledgments This section is very nice. We would ordinarily put an acknowledgement section at the end of a document, so, in this case, it would be after IANA Considerations. The RFC Editor folks may ask you to move the URLs in this section to Informative References, and use reference tags to link the people to the work they did. FLAC owes much to the many people who have advanced the audio compression field so freely. For instance: * A. J. Robinson (https://web.archive.org/web/20160315141134/ http://mi.eng.cam.ac.uk/~ajr/) for his work on Shorten; his paper ([robinson-tr156]) is a good starting point on some of the basic methods used by FLAC. FLAC trivially extends and improves the fixed predictors, LPC coefficient quantization, and Rice coding used in Shorten. * S. W. Golomb (https://web.archive.org/web/20040215005354/http://csi.usc.edu/ faculty/golomb.html) and Robert F. Rice; their universal codes are used by FLAC's entropy coder. * N. Levinson and J. Durbin; the reference encoder uses an algorithm developed and refined by them for determining the LPC coefficients from the autocorrelation coefficients. * And of course, Claude Shannon (https://en.wikipedia.org/wiki/ Claude_Shannon) The FLAC format, the FLAC reference implementation and this document were originally developed by Josh Coalson. While many others have contributed since, this original effort is deeply appreciated. 4. Definitions * *Subframe*: An encoded subblock. All subframes within a frame code for the same number of samples. A subframe MAY correspond to I'd suggest "might correspond". a subblock, else it corresponds to either the addition or subtraction of two subblocks, see section on interchannel decorrelation (#interchannel-decorrelation). * *Linear predictor*: a predictor using linear prediction (https://en.wikipedia.org/wiki/Linear_prediction) This is also I'd create a reference for this URL, and include it as Informative. called *linear predictive coding (LPC)*. With a linear predictor each prediction is a linear combination of past samples, hence the name. A linear predictor has a causal discrete-time finite impulse response (https://en.wikipedia.org/wiki/ Finite_impulse_response). I'd create a reference for this URL, and include it as Informative. * *Rice code*: A variable-length code (https://en.wikipedia.org/wiki/Variable-length_code) which I'd create a reference for this URL, and include it as Informative. compresses data by making use of the observation that, after using an effective predictor, most residual samples are closer to zero than the original samples, while still allowing for a small part of the samples to be much larger. 5. Conceptual overview Similar to many audio coders, a FLAC file is encoded following the I'd suggest "many other audio coders" (other than FLAC, right?) steps below. On decoding a FLAC file, these steps are undone in reverse order, i.e. from bottom to top. * Blocking (see section on Blocking (#blocking)). The input is split up into many contiguous blocks. With FLAC, the blocks MAY vary in size. The optimal size of the block is usually affected So, what you say is true, but I'd suggest saying "Because FLAC blocks can vary in size, a FLAC decoder MUST be able to decode variable-sized blocks", so that the weak permission of MAY for an encoder is reflected in a strong requirement of MUST for the decoder. I'm assuming this is mandatory for decoders, unless it's OK for decoders to fail to decode conformant FLAC blocks. (if you want to say "MAY vary in size", I won't argue 🙂) by many factors, including the sample rate, spectral characteristics over time, etc. However, as finding the optimal block size arrangement is a rather complex problem, the FLAC format allows for a constant block size throughout a stream as well. * Prediction (see section on Prediction (#prediction)). To remove redundancy in a signal, a predictor is stored for each subblock or its transformation as formed in the previous step. A predictor consists of a simple mathematical description that can be used, as the name implies, to predict a certain sample from the samples that preceded it. As this prediction is rarely exact, the error of this prediction is passed to the next stage. The predictor of each subblock is completely independent from other subblocks. Since the methods of prediction are known to both the encoder and decoder, only the parameters of the predictor need be included in the compressed stream. In case no usable predictor can be found for a certain subblock, the signal is stored instead of compressed Meta-comment - the phrase "in case" appears about 20 times in this document. I think this should be "if", throughout. This may be super clear to those skilled in the art, but I wonder if it would be clearer if it said, "the uncompressed signal is stored and the next stage is skipped" and the next stage is skipped. * Residual Coding (See section on Residual Coding (#residual- coding)). As the predictor does not describe the signal exactly, the difference between the original signal and the predicted signal (called the error or residual signal) MUST be coded Is this "MUST" a statement of reality ("if you don't code the error losslessly, the signal cannot be decoded without losing information"), or an actual requirement for the implementation? If the latter, perhaps "MUST be coded losslessly in order to recover the original signal without loss". losslessly. If the predictor is effective, the residual signal will require fewer bits per sample than the original signal. FLAC uses Rice coding, a subset of Golomb coding, with either 4-bit or 5-bit parameters to code the residual signal. 5.1. Blocking The size used for blocking the audio data has a direct effect on the compression ratio. If the block size is too small, the resulting large number of frames mean that excess bits will be wasted on frame headers. If the block size is too large, the characteristics of the signal may vary so much that the encoder will be unable to find a good predictor. In order to simplify encoder/decoder design, FLAC imposes a minimum block size of 16 samples, and a maximum block size of 65535 samples. This range covers the optimal size for all of the audio data FLAC supports. Is there a reference that could be included for the last statement here? ^ 5.3. Prediction A FLAC encoder is free in selecting which method is used to model the input, with the following exceptions: * When the samples that need to be stored do not all have the same value (i.e. the signal is not constant), a constant subframe MUST NOT be used, My experience has been that if you're saying "this is a bad idea", it's better to explain the consequences ("using a constant subframe when the input signal is not constant, will result in very compact and inaccurate decoded streams"). If you're saying "MUST NOT be used", a decoder MUST detect that you're violating a MUST requirement, and the decoder's behavior would be undefined. Which are you saying? * When an encoder is unable to find a fixed or linear predictor of which all residual samples are representable in 32-bit signed integers as stated in section coded residual (#coded-residual), a verbatim subframe MUST be used. I have a similar comment here: are you saying "this is what a wise encoder would do", or are you saying "the decoder MUST detect that the encoder is violating a MUST protocol requirement and <do something, perhaps just discarding the subframe"? For more information on fixed and linear predictors, see [HPL-1999-144] and [robinson-tr156]. 5.4. Residual Coding In case a subframe uses a predictor to approximate the audio signal, a residual needs to be stored to 'correct' the approximation to the I'd suggest "is stored". exact value. When an effective predictor is used, the average numerical value of the residual samples is smaller than that of the samples before prediction. While having smaller values on average, it is possible a few 'outlier' residual samples are much larger than any of the original samples. Sometimes these outliers even exceed the range the bit depth of the original audio offers. Quite often the optimal Rice parameter varies over the course of a subframe. To accommodate this, the residual can be split up into partitions, where each partition has its own Rice parameter. To keep overhead and complexity low, the number of partitions used in a subframe is limited to powers of two. I'd suggest "MUST be a power of two". 6. Format principles FLAC has no format version information, but it does contain reserved space in several places. Future versions of the format MAY use this reserved space safely without breaking the format of older streams. Older decoders MAY choose to abort decoding or skip data encoded with newer methods. Apart from reserved patterns, the format specifies I'd suggest "data encoded using methods they do not recognize". invalid patterns in certain places, meaning that the patterns MUST NOT appear in any valid bitstream, in any prior, present, or future versions of the format. These invalid patterns are usually used to make the synchronization mechanism more robust. I need some help here, please. I searched the document for occurrences of the word "invalid", and found three or four occurrences in tables that said something like "this value is invalid", and one or two more that said "these values are invalid", in text, not in tables. Are those occurrences the patterns that you're referencing in "the patterns MUST NOT appear in any valid bitstream, in any prior, present, or future versions of the format"? In any event, it would be helpful to the reader if you had a table that said "these are patterns are invalid", and if it makes sense to include a forward reference to someplace in the document that explains where each pattern might occur, that would also be helpful. A separate comment - I'd suggest "in any future version of the format". You can't have a MUST requirement that controls what's happened in the past. Is this document describing the present version of the format? All numbers used in a FLAC bitstream MUST be integers, there are no floating-point representations. All numbers MUST be big-endian coded, except the field length used in Vorbis comments, which MUST be little-endian coded. All numbers MUST be unsigned except linear predictor coefficients, the linear prediction shift and numbers which directly represent samples, which MUST be signed. None of these restrictions apply to application metadata blocks or to Vorbis comment field contents. These ^^^^ MUSTs aren't really MUSTs. They are "MUST, except when". Does it make sense to have a decoder check for them, and take action when they are violated? Or are these statements of fact ("All numbers will be big-endian encoded, except the field length used in Vorbis comments, which will be little-endian coded.") Also - this is the first occurrence of "Vorbis" in the document. Could you add a reference for this? 7. Format lay-out Since a decoder MAY start decoding in the middle of a stream, there MUST be a method to determine the start of a frame. A 15-bit sync These ^^^^ are almost certainly statements of fact. I'd suggest something like "In order to allow a decoder to start decoding in the middle of a stream, a 15-bit sync code begins each frame". code begins each frame. The sync code will not appear anywhere else in the frame header. However, since it MAY appear in the subframes, the decoder has two other ways of ensuring a correct sync. The first is to check that the rest of the frame header contains no invalid data. Even this is not foolproof since valid header patterns can still occur within the subframes. The decoder's final check is to generate an 8-bit CRC of the frame header and compare this to the CRC stored at the end of the frame header. Also, since a decoder MAY start decoding at an arbitrary frame in the stream, each frame header MUST contain some basic information about the stream because the decoder MAY not have access to the STREAMINFO Same suggestion as before - these are statements of fact. I'd suggest "In order to allow a decoder to start decoding at an arbitrary frame in the stream, each frame header contains some basic information about the stream, in the event that the decoder does not have access to the STREAMINFO metadata block at the start of the stream." metadata block at the start of the stream. This information includes sample rate, bits per sample, number of channels, etc. Since the frame header is pure overhead, it has a direct effect on the compression ratio. To keep the frame header as small as possible, FLAC uses lookup tables for the most commonly used values for frame properties. When a certain property has a value that is not covered by the lookup table, the decoder is directed to find the value of that property (for example the sample rate) at the end of the frame header or in the streaminfo metadata block. In case a frame header refers to the streaminfo metadata block, the file is not 'streamable', see section format subset (#format-subset) for details. In this way, the file is streamable and the frame header size small for all of the most common forms of audio data. 8. Format subset * The sample rate bits (#sample-rate-bits) in the frame header MUST be 0b0001-0b1110, i.e. the frame header MUST NOT refer to the streaminfo metadata block to find the sample rate. * The bits depth bits (#bit-depth-bits) in the frame header MUST be 0b001-0b111, i.e. the frame header MUST NOT refer to the streaminfo metadata block to find the bit depth. On these two bullets ^^^^, is "to find" the best wording? I wonder if these bullets should say "to describe" instead. 9.2. Streaminfo The minimum block size is excluding the last block of a FLAC file, which may be smaller. If the minimum block size is equal to the maximum block size, the file contains a fixed block size stream. Note that in case of a stream with a variable block size, the actual maximum block size MAY be smaller than the maximum block size listed in the streaminfo block, and the actual smallest block size excluding the last block MAY be larger than the minimum block size listed in the streaminfo block. This is because the encoder has to write these fields before receiving any input audio data, and cannot know beforehand what block sizes it will use, only between what bounds these will be chosen. ^^^^ This explanation seems important enough to make it clearer. If I understand correctly, these aren't "minimum and maximum block sizes", they're "minimum and maximum block size bounds". Would it be useful, and not excessively painful, to make that terminology change? The MD5 signature is made by performing an MD5 transformation on the samples of all channels interleaved, represented in signed, little- endian form. This interleaving is on a per-sample basis, so for a stereo file this means first the first sample of the first channel, then the first sample of the second channel, then the second sample of the first channel etc. Before performing the MD5 transformation, This text ^^^^ is accurate, but I can't understand it without reading it out loud, with appropriate pauses. Would a picture of the samples being interleaved before the MD5 transformation help? all samples must be byte-aligned. So, in case the bit depth is not a whole number of bytes, additional zero bits are inserted at the most- significant position until each sample representation is a whole number of bytes. 9.4. Application The application metadata block is for use by third-party applications. The only mandatory field is a 32-bit identifier, much like a FourCC but not restricted to ASCII characters. An ID registry "FourCC" isn't defined in the document, but is there a reference that you could include? is being maintained at https://xiph.org/flac/id.html (https://xiph.org/flac/id.html) 9.6. Vorbis comment Note that the Vorbis comment as used in Vorbis allows for on the order of 2^64 bytes of data whereas the FLAC metadata block is limited to 2^24 bytes. Given the stated purpose of Vorbis comments, i.e. human-readable textual information, this limit is unlikely to be My suggestion would be "the FLAC metadata limit is unlikely to be restrictive". restrictive. Also note that the 32-bit field lengths are coded little-endian, as opposed to the usual big-endian coding of fixed- length integers in the rest of the FLAC format. 9.6.2. Channel mask Besides fields containing information about the work itself, one field is defined for technical reasons, of which the field name is WAVEFORMATEXTENSIBLE_CHANNEL_MASK. This field contains information on which channels the file contains. Use of this field is RECOMMENDED in case these differ from the channels defined in the section channels bits (#channels-bits). I have a question here - if this field doesn't match the section channel bits, which one is assumed to be correct? I would have guessed that if the channel mask is present, it is assumed to be correct, but I am guessing. 9.7. Cuesheet If the media catalog number is less than 128 bytes long, it SHOULD be right-padded with NUL characters. For CD-DA, this is a thirteen digit number, followed by 115 NUL bytes. Why is this ^^^^ a SHOULD? Is this for legacy support, or support of incorrect implementations, or something else? 9.7.1. Cuesheet track A track number of 0 is not allowed to avoid conflicting with the CD- I would suggest "is not allowed, because the CD-DA spec reserves this for the lead-in". DA spec, which reserves this for the lead-in. For CD-DA the number MUST be 1-99, or 170 for the lead-out; for non-CD-DA, the track number MUST be 255 for the lead-out. It is RECOMMENDED to start with track 1 and increase sequentially. Track numbers MUST be unique within a cuesheet. 9.8. Picture The following table contains all defined picture types. Values other than those listed in the table are reserved and SHOULD NOT be used. If values other than those listed in the table are used, what if anything should the decoder do? Is there a MUST requirement that a decoder must check for the SHOULD NOT values, and handle them appropriately, or simply not crash? There MAY only be one each of picture type 1 and 2 in a file. In general practice, many FLAC playback devices and software display the contents of a picture metadata block with picture type 3 (front cover) during playback, if present. 10.1.2. Sample rate bits The next 4 bits, referred to as the sample rate bits, contain the sample rate of the audio according to the following table. In case the sample rate bits code for an uncommon sample rate, this is stored after the uncommon block size or after the coded number in case no Nit: I'd suggest "if no uncommon block size was used". I see other occurrences of "in case" in the document, so I'll ask that you look at those occurrences as well. uncommon block size was used. See section uncommon sample rate (#uncommon-sample-rate). 10.1.3. Channels bits The next 4 bits (the first 4 bits of the fourth byte of each frame), "(the first 4 bits of the fourth byte of each frame)" seems very helpful. Would it make sense to add this guidance at the beginning of each of the sections that begin with "the next 4 bits"? referred to as the channels bits, code for both the number of channels of the audio as well as any stereo decorrelation used according to the following table. 10.1.4. Bit depth bits The next 3 bits code for the bit depth of the audio according to the following table. Is this saying "The next 3 bits encode the bit depth"? 10.2.2. Wasted bits per sample The wasted bits-per-sample flag in a subframe header is set to 1 if a certain number of least-significant bits of all samples in the I'm confused - what is the "certain number"? Is this (I'm guessing) a number that might vary between audio file types? current subframe are zero. If this is the case, the number of wasted bits-per-sample (k) minus 1 follows the flag in an unary encoding. For example, if k is 3, 0b001 follows. If k = 0, the wasted bits- per-sample flag is 0 and no unary coded k follows. In case k is not equal to 0, samples are coded ignoring k least- significant bits. For example, if the preceding frame header specified a sample size of 16 bits per sample and k is 3, samples in the subframe are coded as 13 bits per sample. A decoder MUST add k least-significant zero bits by shifting left (padding) after decoding a subframe sample. In case the frame has left/side, right/side or mid/side stereo, padding MUST happen to a sample before it is used to Is this "a decoder MUST perform padding before constructing a left or right sample"? reconstruct a left or right sample. 10.2.5. Fixed predictor subframe As a predictor makes use of samples preceding the sample that is predicted, it can only be used when enough such samples are known. I'd suggest "enough samples" here. As each subframe in FLAC is coded completely independent, the first I'd suggest "coded completely independently". few samples in each subframe cannot be predicted. Therefore, a number of so-called warm-up samples equal to the predictor order is stored. These are stored unencoded, bypassing the predictor and residual coding stage. See section on Constant subframe (#constant- subframe) on how samples are stored unencoded. The table below defines how a fixed predictor subframe appears in the bitstream. To encode a signal with a fixed predictor, each sample has the corresponding prediction subtracted and sent to the residual coder. To decode a signal with a fixed predictor, first the residual has to be decoded, after which for each sample the prediction can be added. I'd suggest "the residual is decoded, and then the prediction can be added for each sample". I have the same comment about Section 10.2.6 below. This means that decoding MUST be a sequential process within a subframe, as for each sample, enough fully decoded previous samples are needed to calculate the prediction. 10.2.7.3. Residual sample value limit All residual samples values MUST be representable in the range offered by a 32-bit integer, signed one's complement. Equivalently, all residual sample values MUST fall in the range offered by a 32-bit integer signed two's complement excluding the most negative possible value of that range. This means residual sample values MUST NOT have an absolute value equal to or larger then 2 to the power 31. A FLAC I'd suggest "equal to, or larger than, 2 to the power 31". You might also say "larger than 2 ** 31" here, I think. encoder MUST make sure of this. In case a FLAC encoder is, for a certain subframe, unable to find a suitable predictor of which all residual samples fall within said range, it MUST default to writing a verbatim subframe. The appendix numerical considerations (#numerical-considerations) explains in which circumstances residual samples are already implicitly representable in said range and thus an additional check is not needed. 10.3. Frame footer Following the last subframe is the frame footer. If the last subframe is not byte aligned (i.e. the bits required to store all subframes put together are not divisible by 8), zero bits are added until byte alignment is reached. Following this is a 16-bit CRC, ^^^^ I think this is "Following the last subframe is the frame footer. If the last subframe is not byte aligned (i.e. the number of bits required to store all subframes put together is not divisible by 8), zero bits are added until byte alignment is reached." initialized with 0, with polynomial x^16 + x^15 + x^2 + x^0. This CRC covers the whole frame excluding the 16-bit CRC, including the sync code. 11.1. Ogg mapping In case an audio stream is encoded where audio properties (sample rate, number of channels or bit depth) change at some point in the stream, this should be dealt with by finishing encoding of the current Ogg stream and starting a new Ogg stream concatenated to the previous one. This is called chaining in Ogg. See the Ogg specification for details. I know you provided a reference for Ogg on first use, but suggest repeating it here, for the less careful readers. 11.3. ISO Base Media File Format (MP4) mapping The full encapsulation definition of FLAC audio in MP4 files was deemed too extensive to include in this document. A definition document can be found at https://github.com/xiph/flac/blob/master/doc/isoflac.txt (https://github.com/xiph/flac/blob/master/doc/isoflac.txt) This This might be clearer if it said "The definitions document is summarized here." document is summarized here. 13. Security Considerations Implementations of the FLAC codec need to take appropriate security considerations into account. Those related to denial of service are outlined in Section 2.1 of [RFC4732]. It is extremely important for the decoder to be robust against malicious payloads. Malicious payloads MUST NOT cause the decoder to overrun its allocated memory or to take an excessive amount of resources to decode. An overrun in I understand the use of the word "malicious" here, but would it be correct to say "payloads that do not conform to this specification MUST NOT cause ..."? It seems that the problem is that the payload might cause unsuspicious implementations to misbehave, from a security perspective, not that the payload is **intended** to cause unsuspicious implementations to misbehave. allocated memory could lead to arbitrary code execution by an attacker. The same applies to the encoder, even though problems in encoders are typically rarer. Malicious audio streams MUST NOT cause the encoder to misbehave because this would allow an attacker to attack transcoding gateways. As with all compression algorithms, both encoding and decoding can produce an output much larger than the input. In case of decoding, the most extreme possible case of this is a frame with eight constant subframes of block size 65535 and coding for 32-bit PCM. This frame is only 49 byte in size, but codes for more than 2 megabyte of uncompressed PCM data. For encoding, it is possible to have an even larger size increase, although such behavior is generally considered faulty. This happens if the encoder chooses a rice parameter that does not fit with the residual that has to be encoded. In such a case, very long unary coded symbols can appear, in the most extreme case more than 4 gigabyte per sample. It is RECOMMENDED decoder and "RECOMMENDED" is a BCP14 synonym for "SHOULD". Are there any reasons that decoder and encoder implementations ought not take these precautions? If not, I'd suggest MUST, but this might be better as "Decoder and encoder implementors are advised to take precautions". encoder implementations take precautions to prevent excessive resource utilization in such cases. When metadata is handled, it is RECOMMENDED to either thoroughly test handling of extreme cases or impose reasonable limits beyond the Again, is there a reason why implementors ought not to do these things? If not, I'd suggest MUST, or alternatively "implementors are advised", as in the previous comment. limits of this specification document. For example, a single Vorbis comment metadata block can contain millions of valid fields. It is unlikely such a limit is ever reached except in a potentially malicious file. Likewise the media type and description of a picture metadata block can be millions of characters long, despite there being no reasonable use of such contents. One possible use case for very long character strings is in lyrics, which can be stored in Vorbis comment metadata block fields. Various kinds of metadata blocks contain length fields or fields counts. While reading a block following these lengths or counts, a decoder MUST make sure higher-level lengths or counts (most importantly the length field of the metadata block itself) are not exceeded. As some of these length fields code string lengths, memory for which must be allocated, it is RECOMMENDED to first verify that Again, is there a reason implementations ought not to do this? If not, I'd suggest MUST, or alternatively, "implementors are advised". block is valid before allocating memory based on its contents. Seeking in a FLAC stream that is not in a container relies on the coded number in frame headers and optionally a seektable metadata block. It is RECOMMENDED to employ thorough sanity checks on whether Again, same comment about RECOMMENDED here. a found coded number or seekpoint is at all possible. Without these checks, seeking might get stuck in an infinite loop when numbers in frames are non-consecutive or otherwise invalid, which could be used in denial of service attacks. It is RECOMMENDED to employ fuzz testing combined with different Again, same comment about RECOMMENDED here. sanitizers on FLAC decoders to find security problems. To improve efficiency of this process, a decoder SHOULD, on decoding, ignore the Again, same comment about SHOULD here. results of CRC checks during fuzz testing. See the FLAC decoder testbench (https://wiki.hydrogenaud.io/ index.php?title=FLAC_decoder_testbench) for a non-exhaustive list of FLAC files with extreme configurations which lead to crashes or reboots on some known implementations. None of the content carried in FLAC is intended to be executable. Is this a requirement that decoders MUST NOT execute it? 14. IANA Considerations 14.1. Media type registration The following information serves as the registration form for the "audio/flac" media type. This media type is applicable for FLAC audio packaged in its native container. FLAC audio packaged in another container will take on the media type of its container, for example audio/ogg when packaged in an Ogg container or video/mp4 when packaged in a MP4 container alongside a video track. Type name: audio Subtype name: flac Required parameters: none Optional parameters: none Encoding considerations: as per this document Security considerations: see section 12 Interoperability considerations: no known concerns Published specification: THISRFC Applications that use this media type: ffmpeg, apache, firefox Fragment identifier considerations: none Additional information: Deprecated alias names for this type: audio/x-flac Magic number(s): fLaC File extension(s): flac Macintosh file type code(s): none Person & email address to contact for further information: IETF CELLAR WG Intended usage: COMMON Restrictions on usage: N/A Author: IETF CELLAR WG Change controller: IESG Provisional registration? (standards tree only): NO 15. Normative References [I-D.ietf-cellar-matroska] Lhomme, S., Bunkus, M., and D. Rice, "Matroska Media Container Format Specifications", Work in Progress, Internet-Draft, draft-ietf-cellar-matroska-14, 1 October 2022, <https://www.ietf.org/archive/id/draft-ietf-cellar- matroska-14.txt>. [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate Requirement Levels", BCP 14, RFC 2119, DOI 10.17487/RFC2119, March 1997, <https://www.rfc-editor.org/info/rfc2119>. [RFC4732] Handley, M., Ed., Rescorla, E., Ed., and IAB, "Internet Denial-of-Service Considerations", RFC 4732, DOI 10.17487/RFC4732, December 2006, <https://www.rfc-editor.org/info/rfc4732>. [RFC8174] Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC 2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174, May 2017, <https://www.rfc-editor.org/info/rfc8174>. 16. Informative References [HPL-1999-144] Hans, M. and RW. Schafer, "Lossless Compression of Digital Audio", DOI 10.1109/79.939834, November 1999, <https://www.hpl.hp.com/techreports/1999/HPL- 1999-144.pdf>. [IEC.60908.1999] International Electrotechnical Commission, "Audio recording - Compact disc digital audio system", IEC International standard 60908 second edition, 1999. [ISRC-handbook] "International Standard Recording Code (ISRC) Handbook, 4th edition", 2021, <https://www.ifpi.org/isrc_handbook/>. [RFC3533] Pfeiffer, S., "The Ogg Encapsulation Format Version 0", RFC 3533, DOI 10.17487/RFC3533, May 2003, <https://www.rfc-editor.org/info/rfc3533>. [RFC5334] Goncalves, I., Pfeiffer, S., and C. Montgomery, "Ogg Media Types", RFC 5334, DOI 10.17487/RFC5334, September 2008, <https://www.rfc-editor.org/info/rfc5334>. [RFC6716] Valin, JM., Vos, K., and T. Terriberry, "Definition of the Opus Audio Codec", RFC 6716, DOI 10.17487/RFC6716, September 2012, <https://www.rfc-editor.org/info/rfc6716>. [RFC7942] Sheffer, Y. and A. Farrel, "Improving Awareness of Running Code: The Implementation Status Section", BCP 205, RFC 7942, DOI 10.17487/RFC7942, July 2016, <https://www.rfc-editor.org/info/rfc7942>. [robinson-tr156] Robinson, T., "SHORTEN: Simple lossless and near-lossless waveform compression", December 1994, <https://mi.eng.cam.ac.uk/reports/abstracts/ robinson_tr156.html>.
- [Cellar] My notes on https://datatracker.ietf.org… Spencer Dawkins at IETF
- Re: [Cellar] My notes on https://datatracker.ietf… Spencer Dawkins at IETF
- Re: [Cellar] My notes on https://datatracker.ietf… Martijn van Beurden
- Re: [Cellar] My notes on https://datatracker.ietf… Spencer Dawkins at IETF
- Re: [Cellar] My notes on https://datatracker.ietf… Martijn van Beurden
- Re: [Cellar] My notes on https://datatracker.ietf… Spencer Dawkins at IETF
- Re: [Cellar] My notes on https://datatracker.ietf… Martijn van Beurden
- Re: [Cellar] My notes on https://datatracker.ietf… Spencer Dawkins at IETF
- Re: [Cellar] My notes on https://datatracker.ietf… Martijn van Beurden
- Re: [Cellar] My notes on https://datatracker.ietf… Martijn van Beurden