[Cellar] My notes on https://datatracker.ietf.org/doc/draft-ietf-cellar-flac/

Spencer Dawkins at IETF <spencerdawkins.ietf@gmail.com> Mon, 27 February 2023 02:42 UTC

Return-Path: <spencerdawkins.ietf@gmail.com>
X-Original-To: cellar@ietfa.amsl.com
Delivered-To: cellar@ietfa.amsl.com
Received: from localhost (localhost [127.0.0.1]) by ietfa.amsl.com (Postfix) with ESMTP id 4D536C151B16 for <cellar@ietfa.amsl.com>; Sun, 26 Feb 2023 18:42:55 -0800 (PST)
X-Virus-Scanned: amavisd-new at amsl.com
X-Spam-Flag: NO
X-Spam-Score: -0.086
X-Spam-Level:
X-Spam-Status: No, score=-0.086 tagged_above=-999 required=5 tests=[BAYES_00=-1.9, DKIM_SIGNED=0.1, DKIM_VALID=-0.1, DKIM_VALID_AU=-0.1, DKIM_VALID_EF=-0.1, FREEMAIL_FROM=0.001, HTML_MESSAGE=0.001, RCVD_IN_DNSWL_NONE=-0.0001, RCVD_IN_ZEN_BLOCKED_OPENDNS=0.001, SPF_HELO_NONE=0.001, SPF_PASS=-0.001, T_KAM_HTML_FONT_INVALID=0.01, URIBL_DBL_BLOCKED_OPENDNS=0.001, URIBL_ZEN_BLOCKED_OPENDNS=0.001, URI_DOTEDU=1.999] autolearn=no autolearn_force=no
Authentication-Results: ietfa.amsl.com (amavisd-new); dkim=pass (2048-bit key) header.d=gmail.com
Received: from mail.ietf.org ([50.223.129.194]) by localhost (ietfa.amsl.com [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id nCTuS9bl1iQl for <cellar@ietfa.amsl.com>; Sun, 26 Feb 2023 18:42:50 -0800 (PST)
Received: from mail-pj1-x102a.google.com (mail-pj1-x102a.google.com [IPv6:2607:f8b0:4864:20::102a]) (using TLSv1.3 with cipher TLS_AES_128_GCM_SHA256 (128/128 bits) key-exchange X25519 server-signature RSA-PSS (2048 bits) server-digest SHA256) (No client certificate requested) by ietfa.amsl.com (Postfix) with ESMTPS id 9E458C14F721 for <cellar@ietf.org>; Sun, 26 Feb 2023 18:42:50 -0800 (PST)
Received: by mail-pj1-x102a.google.com with SMTP id x20-20020a17090a8a9400b00233ba727724so10458217pjn.1 for <cellar@ietf.org>; Sun, 26 Feb 2023 18:42:50 -0800 (PST)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20210112; h=to:subject:message-id:date:from:mime-version:from:to:cc:subject :date:message-id:reply-to; bh=BuBezTGja85O5Ig9o6l/a7jlXQh3HKdcdWMMjzsNoSQ=; b=Fu+wmoqeuXsH6G8IgWXx5GMlT8XJw+mANE5b44IoreAJVTQ14RGqAlQKZJfnxlPT/C n4kWnPYz9OB1AZKPn9m9rJ4HMDywSVujDfQ8bGCw/l9boboTjAhOKQ+/1s0L9/xPm7yk QZXa+eXZpTJDbJFN5cYo3lR9tdckSiSZSglQuCIf/C8nuImiLF2mVsee9nV22vGpMk+c a2+gKtPdEZb1BuS038n4LWZPQQzlKjqo8lhaHL/PBd5azJ5Ulyah1WPQamsutXQyUxjN vwBhnDtXfCW4GCXBMeX1/4cbFyQ1HkdwF0XW0pMiY07de/XN5r2/MFWogLFEbxVet5sV o9yw==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=to:subject:message-id:date:from:mime-version:x-gm-message-state :from:to:cc:subject:date:message-id:reply-to; bh=BuBezTGja85O5Ig9o6l/a7jlXQh3HKdcdWMMjzsNoSQ=; b=TV3OMKQJCdBNwjChKLpyprp3vPFDZwUYRNgjVGrAOhlmbr9Ql6cjxptF7IrQsHmzNw aehKW5hJ4yeVTUlLQik1vCiVU0qVyRlimuw/KfedPI7Vr8TTzMKt3ffGMtnr0IVu8CXQ 3Ap5rpJNz61LotW9ITyIAeYlDRt6Z1+0WjNQFUg3IY4voQMYxrgp2zIkiuOwtfiMP9tI d1H2q/tFG2YzX8YGCUwPDg7mkhW+T8rS5Da8ylCnKrcnLXLjoe4LJEEZPG2lvn9uSbC+ uYY8pRTEbVBNTGi6Ty08N75TJIzfHM+0/9MBkQ2Vna2Qvo6Uk9jQnJ0xSE3youF/J/pd zbsg==
X-Gm-Message-State: AO0yUKXCXwKukorWhLiDOAxZ3vxWi2Z53sxlv/x29nmEObqTSfiDq7lG yYXKQJiVHLiNy8aETM5sX5wcv8xerXwaxvh1G+GjBn8Pc5S5Kg==
X-Google-Smtp-Source: AK7set98PxnbcTDs3NqTccSoXTqHcn9qR2Kk9qalIs/l3u0NAPUaBpWY6+fQ8qqLCPR+Az+rmY2Lcm/ObUk6xjZ+ir0=
X-Received: by 2002:a17:90a:dd8d:b0:230:3b84:9169 with SMTP id l13-20020a17090add8d00b002303b849169mr4108174pjv.2.1677465768953; Sun, 26 Feb 2023 18:42:48 -0800 (PST)
MIME-Version: 1.0
From: Spencer Dawkins at IETF <spencerdawkins.ietf@gmail.com>
Date: Sun, 26 Feb 2023 20:42:21 -0600
Message-ID: <CAKKJt-fp4Xf5uoM9TOC6T1azCgCD0AGA-LbH4X03w0ot1CAPvA@mail.gmail.com>
To: Codec Encoding for LossLess Archiving and Realtime transmission <cellar@ietf.org>
Content-Type: multipart/alternative; boundary="000000000000480d7705f5a56f73"
Archived-At: <https://mailarchive.ietf.org/arch/msg/cellar/gYqVZiws52oA38zSrqw5yXEj_TA>
Subject: [Cellar] My notes on https://datatracker.ietf.org/doc/draft-ietf-cellar-flac/
X-BeenThere: cellar@ietf.org
X-Mailman-Version: 2.1.39
Precedence: list
List-Id: Codec Encoding for LossLess Archiving and Realtime transmission <cellar.ietf.org>
List-Unsubscribe: <https://www.ietf.org/mailman/options/cellar>, <mailto:cellar-request@ietf.org?subject=unsubscribe>
List-Archive: <https://mailarchive.ietf.org/arch/browse/cellar/>
List-Post: <mailto:cellar@ietf.org>
List-Help: <mailto:cellar-request@ietf.org?subject=help>
List-Subscribe: <https://www.ietf.org/mailman/listinfo/cellar>, <mailto:cellar-request@ietf.org?subject=subscribe>
X-List-Received-Date: Mon, 27 Feb 2023 02:42:55 -0000

Dear Cellar,

I'm still working my way through
https://datatracker.ietf.org/doc/draft-ietf-cellar-flac/. I still need to
read the Appendices, but I wanted to pass along my notes so far.

In general, I like the document. I do have questions, especially about the
use of BCP14 terminology, and some suggestions for readability, but this
should get discussion started.

I expect to have the reviews of the appendices done by tomorrow at close of
business, my time.

Best,

Spencer
-----------

I have a couple of comments on the abstract.

Abstract

   This document defines the Free Lossless Audio Codec (FLAC) format.

   FLAC is designed to reduce the amount of computer storage space

   needed to store digital audio signals without needing to remove

   information in doing so (i.e. lossless).

I might say, "without losing information in doing so (i.e. lossless)."

   FLAC is free in the sense

   that its specification is open, its reference implementation is open-

   source and it is not encumbered by any known patent.

This ^^^^ might all be true, but I'd suggest removing at least "and it is
not encumbered by any known patent". The IETF tries not to make statements
about whether a technology is encumbered - the first question a reviewer
would ask, is "any patent known to whom?", and it would get worse from
there.

   Compared to

   other lossless (audio) coding formats, FLAC is a format with low

   complexity and can be coded to and from with little computing

   resources.  Decoding of FLAC has seen many independent

   implementations on many different platforms, and both encoding and

   decoding can be implemented without needing floating-point

   arithmetic.

1.  Introduction

   The coding methods provided by the FLAC format work best on PCM audio

   signals of which the samples have a signed representation and are

   centered around zero.  Audio signals in which samples have an

   unsigned representation must be transformed to a signed

   representation as described in this document in order to achieve

   reasonable compression.  The FLAC format is not suited to compress

   audio that is not PCM.

Given ^^^^, do you need to say this? Is there a meaningful difference
between "not suited for" and "cannone be compressed"?

   Pulse-density modulated audio, e.g.  DSD,

   cannot be compressed by FLAC.

2.  Notation and Conventions

In this next paragraph, I'd suggest replacing "may be" with "are". "May"
tends to be used in describing protocol operations in the IETF, and these
are statements of fact.

   Values expressed as u(n) represent unsigned big-endian integer using

   n bits.  Values expressed as s(n) represent signed big-endian integer

   using n bits, signed two's complement. n may be expressed as an

   equation using * (multiplication), / (division), + (addition), or -

   (subtraction).  An inclusive range of the number of bits expressed

   may be represented with an ellipsis, such as u(m...n).  The name of a

   value followed by an asterisk * indicates zero or more occurrences of

   the value.  The name of a value followed by a plus sign + indicates

   one or more occurrences of the value.

3.  Acknowledgments

This section is very nice. We would ordinarily put an acknowledgement
section at the end of a document, so, in this case, it would be after IANA
Considerations.

The RFC Editor folks may ask you to move the URLs in this section to
Informative References, and use reference tags to link the people to the
work they did.

   FLAC owes much to the many people who have advanced the audio

   compression field so freely.  For instance:

   *  A.  J.  Robinson (https://web.archive.org/web/20160315141134/

      http://mi.eng.cam.ac.uk/~ajr/) for his work on Shorten; his paper

      ([robinson-tr156]) is a good starting point on some of the basic

      methods used by FLAC.  FLAC trivially extends and improves the

      fixed predictors, LPC coefficient quantization, and Rice coding

      used in Shorten.

   *  S.  W.  Golomb

      (https://web.archive.org/web/20040215005354/http://csi.usc.edu/

      faculty/golomb.html) and Robert F.  Rice; their universal codes

      are used by FLAC's entropy coder.

   *  N.  Levinson and J.  Durbin; the reference encoder uses an

      algorithm developed and refined by them for determining the LPC

      coefficients from the autocorrelation coefficients.

   *  And of course, Claude Shannon (https://en.wikipedia.org/wiki/

      Claude_Shannon)

   The FLAC format, the FLAC reference implementation and this document

   were originally developed by Josh Coalson.  While many others have

   contributed since, this original effort is deeply appreciated.

4.  Definitions

   *  *Subframe*: An encoded subblock.  All subframes within a frame

      code for the same number of samples.  A subframe MAY correspond to

I'd suggest "might correspond".

      a subblock, else it corresponds to either the addition or

      subtraction of two subblocks, see section on interchannel

      decorrelation (#interchannel-decorrelation).

   *  *Linear predictor*: a predictor using linear prediction

      (https://en.wikipedia.org/wiki/Linear_prediction)  This is also

I'd create a reference for this URL, and include it as Informative.

      called *linear predictive coding (LPC)*. With a linear predictor

      each prediction is a linear combination of past samples, hence the

      name.  A linear predictor has a causal discrete-time finite

      impulse response (https://en.wikipedia.org/wiki/

      Finite_impulse_response).

I'd create a reference for this URL, and include it as Informative.

   *  *Rice code*: A variable-length code

      (https://en.wikipedia.org/wiki/Variable-length_code) which

I'd create a reference for this URL, and include it as Informative.

      compresses data by making use of the observation that, after using

      an effective predictor, most residual samples are closer to zero

      than the original samples, while still allowing for a small part

      of the samples to be much larger.

5.  Conceptual overview

   Similar to many audio coders, a FLAC file is encoded following the

I'd suggest "many other audio coders" (other than FLAC, right?)

   steps below.  On decoding a FLAC file, these steps are undone in

   reverse order, i.e. from bottom to top.

   *  Blocking (see section on Blocking (#blocking)).  The input is

      split up into many contiguous blocks.  With FLAC, the blocks MAY

      vary in size.  The optimal size of the block is usually affected

So, what you say is true, but I'd suggest saying "Because FLAC blocks can
vary in size, a FLAC decoder MUST be able to decode variable-sized blocks",
so that the weak permission of MAY for an encoder is reflected in a strong
requirement of MUST for the decoder. I'm assuming this is mandatory for
decoders, unless it's OK for decoders to fail to decode conformant FLAC
blocks.

(if you want to say "MAY vary in size", I won't argue 🙂)

      by many factors, including the sample rate, spectral

      characteristics over time, etc.  However, as finding the optimal

      block size arrangement is a rather complex problem, the FLAC

      format allows for a constant block size throughout a stream as

      well.

   *  Prediction (see section on Prediction (#prediction)).  To remove

      redundancy in a signal, a predictor is stored for each subblock or

      its transformation as formed in the previous step.  A predictor

      consists of a simple mathematical description that can be used, as

      the name implies, to predict a certain sample from the samples

      that preceded it.  As this prediction is rarely exact, the error

      of this prediction is passed to the next stage.  The predictor of

      each subblock is completely independent from other subblocks.

      Since the methods of prediction are known to both the encoder and

      decoder, only the parameters of the predictor need be included in

      the compressed stream.  In case no usable predictor can be found

      for a certain subblock, the signal is stored instead of compressed

Meta-comment - the phrase "in case" appears about 20 times in this
document. I think this should be "if", throughout.

This may be super clear to those skilled in the art, but I wonder if it
would be clearer if it said, "the uncompressed signal is stored and the
next stage is skipped"

      and the next stage is skipped.

   *  Residual Coding (See section on Residual Coding (#residual-

      coding)).  As the predictor does not describe the signal exactly,

      the difference between the original signal and the predicted

      signal (called the error or residual signal) MUST be coded

Is this "MUST" a statement of reality ("if you don't code the error
losslessly, the signal cannot be decoded without losing information"), or
an actual requirement for the implementation?

If the latter, perhaps "MUST be coded losslessly in order to recover the
original signal without loss".

      losslessly.  If the predictor is effective, the residual signal

      will require fewer bits per sample than the original signal.  FLAC

      uses Rice coding, a subset of Golomb coding, with either 4-bit or

      5-bit parameters to code the residual signal.

5.1.  Blocking

   The size used for blocking the audio data has a direct effect on the

   compression ratio.  If the block size is too small, the resulting

   large number of frames mean that excess bits will be wasted on frame

   headers.  If the block size is too large, the characteristics of the

   signal may vary so much that the encoder will be unable to find a

   good predictor.  In order to simplify encoder/decoder design, FLAC

   imposes a minimum block size of 16 samples, and a maximum block size

   of 65535 samples.  This range covers the optimal size for all of the

   audio data FLAC supports.

Is there a reference that could be included for the last statement here? ^

5.3.  Prediction

   A FLAC encoder is free in selecting which method is used to model the

   input, with the following exceptions:

   *  When the samples that need to be stored do not all have the same

      value (i.e. the signal is not constant), a constant subframe MUST

      NOT be used,

My experience has been that if you're saying "this is a bad idea", it's
better to explain the consequences ("using a constant subframe when the
input signal is not constant, will result in very compact and inaccurate
decoded streams"). If you're saying "MUST NOT be used", a decoder MUST
detect that you're violating a MUST requirement, and the decoder's behavior
would be undefined.

Which are you saying?

   *  When an encoder is unable to find a fixed or linear predictor of

      which all residual samples are representable in 32-bit signed

      integers as stated in section coded residual (#coded-residual), a

      verbatim subframe MUST be used.

I have a similar comment here: are you saying "this is what a wise encoder
would do", or are you saying "the decoder MUST detect that the encoder is
violating a MUST protocol requirement and <do something, perhaps just
discarding the subframe"?

   For more information on fixed and linear predictors, see

   [HPL-1999-144] and [robinson-tr156].

5.4.  Residual Coding

   In case a subframe uses a predictor to approximate the audio signal,

   a residual needs to be stored to 'correct' the approximation to the

I'd suggest "is stored".

   exact value.  When an effective predictor is used, the average

   numerical value of the residual samples is smaller than that of the

   samples before prediction.  While having smaller values on average,

   it is possible a few 'outlier' residual samples are much larger than

   any of the original samples.  Sometimes these outliers even exceed

   the range the bit depth of the original audio offers.

   Quite often the optimal Rice parameter varies over the course of a

   subframe.  To accommodate this, the residual can be split up into

   partitions, where each partition has its own Rice parameter.  To keep

   overhead and complexity low, the number of partitions used in a

   subframe is limited to powers of two.

I'd suggest "MUST be a power of two".

6.  Format principles

   FLAC has no format version information, but it does contain reserved

   space in several places.  Future versions of the format MAY use this

   reserved space safely without breaking the format of older streams.

   Older decoders MAY choose to abort decoding or skip data encoded with

   newer methods.  Apart from reserved patterns, the format specifies

I'd suggest "data encoded using methods they do not recognize".

   invalid patterns in certain places, meaning that the patterns MUST

   NOT appear in any valid bitstream, in any prior, present, or future

   versions of the format.  These invalid patterns are usually used to

   make the synchronization mechanism more robust.

I need some help here, please. I searched the document for occurrences of
the word "invalid", and found three or four occurrences in tables that said
something like "this value is invalid", and one or two more that said
"these values are invalid", in text, not in tables.

Are those occurrences the patterns that you're referencing in "the patterns
MUST NOT appear in any valid bitstream, in any prior, present, or future
versions of the format"?

In any event, it would be helpful to the reader if you had a table that
said "these are patterns are invalid", and if it makes sense to include a
forward reference to someplace in the document that explains where each
pattern might occur, that would also be helpful.

A separate comment - I'd suggest "in any future version of the format". You
can't have a MUST requirement that controls what's happened in the past. Is
this document describing the present version of the format?

   All numbers used in a FLAC bitstream MUST be integers, there are no

   floating-point representations.  All numbers MUST be big-endian

   coded, except the field length used in Vorbis comments, which MUST be

   little-endian coded.  All numbers MUST be unsigned except linear

   predictor coefficients, the linear prediction shift and numbers which

   directly represent samples, which MUST be signed.  None of these

   restrictions apply to application metadata blocks or to Vorbis

   comment field contents.

These ^^^^ MUSTs aren't really MUSTs. They are "MUST, except when". Does it
make sense to have a decoder check for them, and take action when they are
violated? Or are these statements of fact ("All numbers will be big-endian
encoded, except the field length used in Vorbis comments, which will be
little-endian coded.")

Also - this is the first occurrence of "Vorbis" in the document. Could you
add a reference for this?

7.  Format lay-out

   Since a decoder MAY start decoding in the middle of a stream, there

   MUST be a method to determine the start of a frame.  A 15-bit sync

These ^^^^ are almost certainly statements of fact. I'd suggest something
like "In order to allow a decoder to start decoding in the middle of a
stream, a 15-bit sync code begins each frame".

   code begins each frame.  The sync code will not appear anywhere else

   in the frame header.  However, since it MAY appear in the subframes,

   the decoder has two other ways of ensuring a correct sync.  The first

   is to check that the rest of the frame header contains no invalid

   data.  Even this is not foolproof since valid header patterns can

   still occur within the subframes.  The decoder's final check is to

   generate an 8-bit CRC of the frame header and compare this to the CRC

   stored at the end of the frame header.

   Also, since a decoder MAY start decoding at an arbitrary frame in the

   stream, each frame header MUST contain some basic information about

   the stream because the decoder MAY not have access to the STREAMINFO

Same suggestion as before - these are statements of fact. I'd suggest "In
order to allow a decoder to start decoding at an arbitrary frame in the
stream, each frame header contains some basic information about the stream,
in the event that the decoder does not have access to the STREAMINFO
metadata block at the start of the stream."

   metadata block at the start of the stream.  This information includes

   sample rate, bits per sample, number of channels, etc.  Since the

   frame header is pure overhead, it has a direct effect on the

   compression ratio.  To keep the frame header as small as possible,

   FLAC uses lookup tables for the most commonly used values for frame

   properties.  When a certain property has a value that is not covered

   by the lookup table, the decoder is directed to find the value of

   that property (for example the sample rate) at the end of the frame

   header or in the streaminfo metadata block.  In case a frame header

   refers to the streaminfo metadata block, the file is not

   'streamable', see section format subset (#format-subset) for details.

   In this way, the file is streamable and the frame header size small

   for all of the most common forms of audio data.

8.  Format subset

   *  The sample rate bits (#sample-rate-bits) in the frame header MUST

      be 0b0001-0b1110, i.e. the frame header MUST NOT refer to the

      streaminfo metadata block to find the sample rate.

   *  The bits depth bits (#bit-depth-bits) in the frame header MUST be

      0b001-0b111, i.e. the frame header MUST NOT refer to the

      streaminfo metadata block to find the bit depth.

On these two bullets ^^^^, is "to find" the best wording? I wonder if these
bullets should say "to describe" instead.

9.2.  Streaminfo

   The minimum block size is excluding the last block of a FLAC file,

   which may be smaller.  If the minimum block size is equal to the

   maximum block size, the file contains a fixed block size stream.

   Note that in case of a stream with a variable block size, the actual

   maximum block size MAY be smaller than the maximum block size listed

   in the streaminfo block, and the actual smallest block size excluding

   the last block MAY be larger than the minimum block size listed in

   the streaminfo block.  This is because the encoder has to write these

   fields before receiving any input audio data, and cannot know

   beforehand what block sizes it will use, only between what bounds

   these will be chosen.

^^^^ This explanation seems important enough to make it clearer. If I
understand correctly, these aren't "minimum and maximum block sizes",
they're "minimum and maximum block size bounds". Would it be useful, and
not excessively painful, to make that terminology change?

   The MD5 signature is made by performing an MD5 transformation on the

   samples of all channels interleaved, represented in signed, little-

   endian form.  This interleaving is on a per-sample basis, so for a

   stereo file this means first the first sample of the first channel,

   then the first sample of the second channel, then the second sample

   of the first channel etc.  Before performing the MD5 transformation,

This text ^^^^ is accurate, but I can't understand it without reading it
out loud, with appropriate pauses. Would a picture of the samples being
interleaved before the MD5 transformation help?

   all samples must be byte-aligned.  So, in case the bit depth is not a

   whole number of bytes, additional zero bits are inserted at the most-

   significant position until each sample representation is a whole

   number of bytes.

9.4.  Application

   The application metadata block is for use by third-party

   applications.  The only mandatory field is a 32-bit identifier, much

   like a FourCC but not restricted to ASCII characters.  An ID registry

"FourCC" isn't defined in the document, but is there a reference that you
could include?

   is being maintained at https://xiph.org/flac/id.html

   (https://xiph.org/flac/id.html)

9.6.  Vorbis comment

   Note that the Vorbis comment as used in Vorbis allows for on the

   order of 2^64 bytes of data whereas the FLAC metadata block is

   limited to 2^24 bytes.  Given the stated purpose of Vorbis comments,

   i.e. human-readable textual information, this limit is unlikely to be

My suggestion would be "the FLAC metadata limit is unlikely to be
restrictive".

   restrictive.  Also note that the 32-bit field lengths are coded

   little-endian, as opposed to the usual big-endian coding of fixed-

   length integers in the rest of the FLAC format.

9.6.2.  Channel mask

   Besides fields containing information about the work itself, one

   field is defined for technical reasons, of which the field name is

   WAVEFORMATEXTENSIBLE_CHANNEL_MASK.  This field contains information

   on which channels the file contains.  Use of this field is

   RECOMMENDED in case these differ from the channels defined in the

   section channels bits (#channels-bits).

I have a question here - if this field doesn't match the section channel
bits, which one is assumed to be correct? I would have guessed that if the
channel mask is present, it is assumed to be correct, but I am guessing.

9.7.  Cuesheet

   If the media catalog number is less than 128 bytes long, it SHOULD be

   right-padded with NUL characters.  For CD-DA, this is a thirteen

   digit number, followed by 115 NUL bytes.

Why is this ^^^^ a SHOULD? Is this for legacy support, or support of
incorrect implementations, or something else?

9.7.1.  Cuesheet track

   A track number of 0 is not allowed to avoid conflicting with the CD-

I would suggest "is not allowed, because the CD-DA spec reserves this for
the lead-in".

   DA spec, which reserves this for the lead-in.  For CD-DA the number

   MUST be 1-99, or 170 for the lead-out; for non-CD-DA, the track

   number MUST be 255 for the lead-out.  It is RECOMMENDED to start with

   track 1 and increase sequentially.  Track numbers MUST be unique

   within a cuesheet.

9.8.  Picture

   The following table contains all defined picture types.  Values other

   than those listed in the table are reserved and SHOULD NOT be used.

If values other than those listed in the table are used, what if anything
should the decoder do? Is there a MUST requirement that a decoder must
check for the SHOULD NOT values, and handle them appropriately, or simply
not crash?

   There MAY only be one each of picture type 1 and 2 in a file.  In

   general practice, many FLAC playback devices and software display the

   contents of a picture metadata block with picture type 3 (front

   cover) during playback, if present.

10.1.2.  Sample rate bits

   The next 4 bits, referred to as the sample rate bits, contain the

   sample rate of the audio according to the following table.  In case

   the sample rate bits code for an uncommon sample rate, this is stored

   after the uncommon block size or after the coded number in case no

Nit: I'd suggest "if no uncommon block size was used". I see other
occurrences of "in case" in the document, so I'll ask that you look at
those occurrences as well.

   uncommon block size was used.  See section uncommon sample rate

   (#uncommon-sample-rate).

10.1.3.  Channels bits

   The next 4 bits (the first 4 bits of the fourth byte of each frame),

"(the first 4 bits of the fourth byte of each frame)" seems very helpful.
Would it make sense to add this guidance at the beginning of each of the
sections that begin with "the next 4 bits"?

   referred to as the channels bits, code for both the number of

   channels of the audio as well as any stereo decorrelation used

   according to the following table.

10.1.4.  Bit depth bits

   The next 3 bits code for the bit depth of the audio according to the

   following table.

Is this saying "The next 3 bits encode the bit depth"?

10.2.2.  Wasted bits per sample

   The wasted bits-per-sample flag in a subframe header is set to 1 if a

   certain number of least-significant bits of all samples in the

I'm confused - what is the "certain number"? Is this (I'm guessing) a
number that might vary between audio file types?

   current subframe are zero.  If this is the case, the number of wasted

   bits-per-sample (k) minus 1 follows the flag in an unary encoding.

   For example, if k is 3, 0b001 follows.  If k = 0, the wasted bits-

   per-sample flag is 0 and no unary coded k follows.

   In case k is not equal to 0, samples are coded ignoring k least-

   significant bits.  For example, if the preceding frame header

   specified a sample size of 16 bits per sample and k is 3, samples in

   the subframe are coded as 13 bits per sample.  A decoder MUST add k

   least-significant zero bits by shifting left (padding) after decoding

   a subframe sample.  In case the frame has left/side, right/side or

   mid/side stereo, padding MUST happen to a sample before it is used to

Is this "a decoder MUST perform padding before constructing a left or right
sample"?

   reconstruct a left or right sample.

10.2.5.  Fixed predictor subframe

   As a predictor makes use of samples preceding the sample that is

   predicted, it can only be used when enough such samples are known.

I'd suggest "enough samples" here.

   As each subframe in FLAC is coded completely independent, the first

I'd suggest "coded completely independently".

   few samples in each subframe cannot be predicted.  Therefore, a

   number of so-called warm-up samples equal to the predictor order is

   stored.  These are stored unencoded, bypassing the predictor and

   residual coding stage.  See section on Constant subframe (#constant-

   subframe) on how samples are stored unencoded.  The table below

   defines how a fixed predictor subframe appears in the bitstream.

   To encode a signal with a fixed predictor, each sample has the

   corresponding prediction subtracted and sent to the residual coder.

   To decode a signal with a fixed predictor, first the residual has to

   be decoded, after which for each sample the prediction can be added.

I'd suggest "the residual is decoded, and then the prediction can be added
for each sample".

I have the same comment about Section 10.2.6 below.

   This means that decoding MUST be a sequential process within a

   subframe, as for each sample, enough fully decoded previous samples

   are needed to calculate the prediction.

10.2.7.3.  Residual sample value limit

   All residual samples values MUST be representable in the range

   offered by a 32-bit integer, signed one's complement.  Equivalently,

   all residual sample values MUST fall in the range offered by a 32-bit

   integer signed two's complement excluding the most negative possible

   value of that range.  This means residual sample values MUST NOT have

   an absolute value equal to or larger then 2 to the power 31.  A FLAC

I'd suggest "equal to, or larger than, 2 to the power 31". You might also
say "larger than 2 ** 31" here, I think.

   encoder MUST make sure of this.  In case a FLAC encoder is, for a

   certain subframe, unable to find a suitable predictor of which all

   residual samples fall within said range, it MUST default to writing a

   verbatim subframe.  The appendix numerical considerations

   (#numerical-considerations) explains in which circumstances residual

   samples are already implicitly representable in said range and thus

   an additional check is not needed.

10.3.  Frame footer

   Following the last subframe is the frame footer.  If the last

   subframe is not byte aligned (i.e. the bits required to store all

   subframes put together are not divisible by 8), zero bits are added

   until byte alignment is reached.  Following this is a 16-bit CRC,

^^^^ I think this is "Following the last subframe is the frame footer.  If
the last subframe is not byte aligned (i.e. the number of bits required to
store all subframes put together is not divisible by 8), zero bits are
added until byte alignment is reached."

   initialized with 0, with polynomial x^16 + x^15 + x^2 + x^0.  This

   CRC covers the whole frame excluding the 16-bit CRC, including the

   sync code.

11.1.  Ogg mapping

   In case an audio stream is encoded where audio properties (sample

   rate, number of channels or bit depth) change at some point in the

   stream, this should be dealt with by finishing encoding of the

   current Ogg stream and starting a new Ogg stream concatenated to the

   previous one.  This is called chaining in Ogg. See the Ogg

   specification for details.

I know you provided a reference for Ogg on first use, but suggest repeating
it here, for the less careful readers.

11.3.  ISO Base Media File Format (MP4) mapping

   The full encapsulation definition of FLAC audio in MP4 files was

   deemed too extensive to include in this document.  A definition

   document can be found at

   https://github.com/xiph/flac/blob/master/doc/isoflac.txt

   (https://github.com/xiph/flac/blob/master/doc/isoflac.txt) This

This might be clearer if it said "The definitions document is summarized
here."

   document is summarized here.

13.  Security Considerations

   Implementations of the FLAC codec need to take appropriate security

   considerations into account.  Those related to denial of service are

   outlined in Section 2.1 of [RFC4732].  It is extremely important for

   the decoder to be robust against malicious payloads.  Malicious

   payloads MUST NOT cause the decoder to overrun its allocated memory

   or to take an excessive amount of resources to decode.  An overrun in

I understand the use of the word "malicious" here, but would it be correct
to say "payloads that do not conform to this specification MUST NOT cause
..."? It seems that the problem is that the payload might cause
unsuspicious implementations to misbehave, from a security perspective, not
that the payload is **intended** to cause unsuspicious implementations to
misbehave.

   allocated memory could lead to arbitrary code execution by an

   attacker.  The same applies to the encoder, even though problems in

   encoders are typically rarer.  Malicious audio streams MUST NOT cause

   the encoder to misbehave because this would allow an attacker to

   attack transcoding gateways.

   As with all compression algorithms, both encoding and decoding can

   produce an output much larger than the input.  In case of decoding,

   the most extreme possible case of this is a frame with eight constant

   subframes of block size 65535 and coding for 32-bit PCM.  This frame

   is only 49 byte in size, but codes for more than 2 megabyte of

   uncompressed PCM data.  For encoding, it is possible to have an even

   larger size increase, although such behavior is generally considered

   faulty.  This happens if the encoder chooses a rice parameter that

   does not fit with the residual that has to be encoded.  In such a

   case, very long unary coded symbols can appear, in the most extreme

   case more than 4 gigabyte per sample.  It is RECOMMENDED decoder and

"RECOMMENDED" is a BCP14 synonym for "SHOULD". Are there any reasons that
decoder and encoder implementations ought not take these precautions? If
not, I'd suggest MUST, but this might be better as "Decoder and encoder
implementors are advised to take precautions".

   encoder implementations take precautions to prevent excessive

   resource utilization in such cases.

   When metadata is handled, it is RECOMMENDED to either thoroughly test

   handling of extreme cases or impose reasonable limits beyond the

Again, is there a reason why implementors ought not to do these things? If
not, I'd suggest MUST, or alternatively "implementors are advised", as in
the previous comment.

   limits of this specification document.  For example, a single Vorbis

   comment metadata block can contain millions of valid fields.  It is

   unlikely such a limit is ever reached except in a potentially

   malicious file.  Likewise the media type and description of a picture

   metadata block can be millions of characters long, despite there

   being no reasonable use of such contents.  One possible use case for

   very long character strings is in lyrics, which can be stored in

   Vorbis comment metadata block fields.

   Various kinds of metadata blocks contain length fields or fields

   counts.  While reading a block following these lengths or counts, a

   decoder MUST make sure higher-level lengths or counts (most

   importantly the length field of the metadata block itself) are not

   exceeded.  As some of these length fields code string lengths, memory

   for which must be allocated, it is RECOMMENDED to first verify that

Again, is there a reason implementations ought not to do this? If not, I'd
suggest MUST, or alternatively, "implementors are advised".

   block is valid before allocating memory based on its contents.

   Seeking in a FLAC stream that is not in a container relies on the

   coded number in frame headers and optionally a seektable metadata

   block.  It is RECOMMENDED to employ thorough sanity checks on whether

Again, same comment about RECOMMENDED here.

   a found coded number or seekpoint is at all possible.  Without these

   checks, seeking might get stuck in an infinite loop when numbers in

   frames are non-consecutive or otherwise invalid, which could be used

   in denial of service attacks.

   It is RECOMMENDED to employ fuzz testing combined with different

Again, same comment about RECOMMENDED here.

   sanitizers on FLAC decoders to find security problems.  To improve

   efficiency of this process, a decoder SHOULD, on decoding, ignore the

Again, same comment about SHOULD here.

   results of CRC checks during fuzz testing.

   See the FLAC decoder testbench (https://wiki.hydrogenaud.io/

   index.php?title=FLAC_decoder_testbench) for a non-exhaustive list of

   FLAC files with extreme configurations which lead to crashes or

   reboots on some known implementations.

   None of the content carried in FLAC is intended to be executable.

Is this a requirement that decoders MUST NOT execute it?

14.  IANA Considerations

14.1.  Media type registration

   The following information serves as the registration form for the

   "audio/flac" media type.  This media type is applicable for FLAC

   audio packaged in its native container.  FLAC audio packaged in

   another container will take on the media type of its container, for

   example audio/ogg when packaged in an Ogg container or video/mp4 when

   packaged in a MP4 container alongside a video track.

   Type name: audio

   Subtype name: flac

   Required parameters: none

   Optional parameters: none

   Encoding considerations: as per this document

   Security considerations: see section 12

   Interoperability considerations: no known concerns

   Published specification: THISRFC

   Applications that use this media type: ffmpeg, apache, firefox

   Fragment identifier considerations: none

   Additional information:

     Deprecated alias names for this type: audio/x-flac

     Magic number(s): fLaC

     File extension(s): flac

     Macintosh file type code(s): none

   Person & email address to contact for further information: IETF CELLAR WG

   Intended usage: COMMON

   Restrictions on usage: N/A

   Author: IETF CELLAR WG

   Change controller: IESG

   Provisional registration? (standards tree only): NO

15.  Normative References

   [I-D.ietf-cellar-matroska]

              Lhomme, S., Bunkus, M., and D. Rice, "Matroska Media

              Container Format Specifications", Work in Progress,

              Internet-Draft, draft-ietf-cellar-matroska-14, 1 October

              2022, <https://www.ietf.org/archive/id/draft-ietf-cellar-

              matroska-14.txt>.

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate

              Requirement Levels", BCP 14, RFC 2119,

              DOI 10.17487/RFC2119, March 1997,

              <https://www.rfc-editor.org/info/rfc2119>.

   [RFC4732]  Handley, M., Ed., Rescorla, E., Ed., and IAB, "Internet

              Denial-of-Service Considerations", RFC 4732,

              DOI 10.17487/RFC4732, December 2006,

              <https://www.rfc-editor.org/info/rfc4732>.

   [RFC8174]  Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC

              2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174,

              May 2017, <https://www.rfc-editor.org/info/rfc8174>.

16.  Informative References

   [HPL-1999-144]

              Hans, M. and RW. Schafer, "Lossless Compression of Digital

              Audio", DOI 10.1109/79.939834, November 1999,

              <https://www.hpl.hp.com/techreports/1999/HPL-

              1999-144.pdf>.

   [IEC.60908.1999]

              International Electrotechnical Commission, "Audio

              recording - Compact disc digital audio system",

              IEC International standard 60908 second edition, 1999.

   [ISRC-handbook]

              "International Standard Recording Code (ISRC) Handbook,

              4th edition", 2021, <https://www.ifpi.org/isrc_handbook/>.

   [RFC3533]  Pfeiffer, S., "The Ogg Encapsulation Format Version 0",

              RFC 3533, DOI 10.17487/RFC3533, May 2003,

              <https://www.rfc-editor.org/info/rfc3533>.

   [RFC5334]  Goncalves, I., Pfeiffer, S., and C. Montgomery, "Ogg Media

              Types", RFC 5334, DOI 10.17487/RFC5334, September 2008,

              <https://www.rfc-editor.org/info/rfc5334>.

   [RFC6716]  Valin, JM., Vos, K., and T. Terriberry, "Definition of the

              Opus Audio Codec", RFC 6716, DOI 10.17487/RFC6716,

              September 2012, <https://www.rfc-editor.org/info/rfc6716>.

   [RFC7942]  Sheffer, Y. and A. Farrel, "Improving Awareness of Running

              Code: The Implementation Status Section", BCP 205,

              RFC 7942, DOI 10.17487/RFC7942, July 2016,

              <https://www.rfc-editor.org/info/rfc7942>.

   [robinson-tr156]

              Robinson, T., "SHORTEN: Simple lossless and near-lossless

              waveform compression", December 1994,

              <https://mi.eng.cam.ac.uk/reports/abstracts/

              robinson_tr156.html>.