[Cellar] John Scudder's Discuss on draft-ietf-cellar-matroska-16: (with DISCUSS and COMMENT)

John Scudder via Datatracker <noreply@ietf.org> Mon, 22 May 2023 16:18 UTC

Return-Path: <noreply@ietf.org>
X-Original-To: cellar@ietf.org
Delivered-To: cellar@ietfa.amsl.com
Received: from ietfa.amsl.com (localhost [IPv6:::1]) by ietfa.amsl.com (Postfix) with ESMTP id BE9A2C1516E0; Mon, 22 May 2023 09:18:23 -0700 (PDT)
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: 8bit
From: John Scudder via Datatracker <noreply@ietf.org>
To: The IESG <iesg@ietf.org>
Cc: draft-ietf-cellar-matroska@ietf.org, cellar-chairs@ietf.org, cellar@ietf.org, mcr+ietf@sandelman.ca, mcr+ietf@sandelman.ca
X-Test-IDTracker: no
X-IETF-IDTracker: 10.4.0
Auto-Submitted: auto-generated
Precedence: bulk
Reply-To: John Scudder <jgs@juniper.net>
Message-ID: <168477230376.6870.18325949540737076935@ietfa.amsl.com>
Date: Mon, 22 May 2023 09:18:23 -0700
Archived-At: <https://mailarchive.ietf.org/arch/msg/cellar/bkQJAATSjyh0DOz--TCzJhmx39I>
Subject: [Cellar] John Scudder's Discuss on draft-ietf-cellar-matroska-16: (with DISCUSS and COMMENT)
X-BeenThere: cellar@ietf.org
X-Mailman-Version: 2.1.39
List-Id: Codec Encoding for LossLess Archiving and Realtime transmission <cellar.ietf.org>
List-Unsubscribe: <https://www.ietf.org/mailman/options/cellar>, <mailto:cellar-request@ietf.org?subject=unsubscribe>
List-Archive: <https://mailarchive.ietf.org/arch/browse/cellar/>
List-Post: <mailto:cellar@ietf.org>
List-Help: <mailto:cellar-request@ietf.org?subject=help>
List-Subscribe: <https://www.ietf.org/mailman/listinfo/cellar>, <mailto:cellar-request@ietf.org?subject=subscribe>
X-List-Received-Date: Mon, 22 May 2023 16:18:23 -0000

John Scudder has entered the following ballot position for
draft-ietf-cellar-matroska-16: Discuss

When responding, please keep the subject line intact and reply to all
email addresses included in the To and CC lines. (Feel free to cut this
introductory paragraph, however.)


Please refer to https://www.ietf.org/about/groups/iesg/statements/handling-ballot-positions/ 
for more information about how to handle DISCUSS and COMMENT positions.


The document, along with other ballot positions, can be found here:
https://datatracker.ietf.org/doc/draft-ietf-cellar-matroska/



----------------------------------------------------------------------
DISCUSS:
----------------------------------------------------------------------

# John Scudder, RTG AD, comments for draft-ietf-cellar-matroska-16
CC @jgscudder

## DISCUSS

(This is another, and I hope the final, update to my ballot. I add one more
DISCUSS point, directly following, and I add COMMENTs and NITs, starting with
Section 15.1. I apologize for the episodic nature of this review, and hope it
was helpful to get the early part.)

### MatroskaTags reference

I notice that the [MatroskaTags] document is referenced in several places. At
first glance, it seems like this is a document that "must be read to understand
or implement the technology in the new RFC, or whose technology must be present
for the technology in the new RFC to work" [1], in which case it should be
Normative.

I can see why the WG would be hesitant to make the reference Normative, since
the document seems to be less mature and might hold up RFC publication for some
time while it also progresses. But please be sure that it really does fit the
criteria of "only [providing] additional information", and that the current
spec can stand alone without it.

Because I don't have the subject area expertise to make this call with
confidence, I don't intend to hold a DISCUSS on this point past the telechat,
this is primarily here to ensure that the question is considered by others.

[1]
https://www.ietf.org/about/groups/iesg/statements/normative-informative-references/

--

(This is an update to my previous ballot, to add one more question. The
original ballot (below) was entirely questions for the authors (though of
course, anyone is welcome to engage). The new question is primarily
process-related and I'm adding it mostly to flag the issue for the sponsoring
AD, and to make sure we cover it during IESG discussion, although again, anyone
is welcome to engage.)

### Conflict with RFC 8794, potentially resolved by errata

Primarily for the AD/IESG: I happened to notice that there are 9 errata open
against RFC 8794, the EMBL spec, which is a normative reference for this doc.
All of them are in "reported" state so have no normative force right now. Two
of them (ID 7189 and ID 7191) are added specifically to bring Matroska's usage
into compliance (or rather, to redefine "compliance" to include what Matroska
does).

It's not clear to me that these are proper uses of the errata process, but it's
not my WG or area. In any case, it seems like this is a hint that there's a
potential problem. Whether the problem should be fixed by verifying the errata,
or some other way, I don't know.

Primarily for the authors:

(Update: thanks for your replies to this part, your proposed solutions sound
good to me.)

Thanks for this document. It's well outside my comfort zone (I'm a routing
protocols guy) and I appreciate the effort that's gone into making what is
clearly a major piece of work accessible to someone like me.

Although I haven't completed my review, as I've gone through it I found two
sections that seemed worth having a DISCUSSion about. Since the Section 10
DISCUSSion might take a while, I figured I should raise it now rather than wait
to finish my entire review. So, I may add some additional comments later.

### Section 6.8

To quote §6.8 in its entirety:

   The Tags Element is most subject to changes after the file was
   originally created.  For easier editing, the Tags Element SHOULD be
   placed at the end of the Segment Element, even after the Attachments
   Element.  On the other hand, it is inconvenient to have to seek in
   the Segment for tags, especially for network streams.  So it's better
   if the Tags Element is found early in the stream.  When editing the
   Tags Element, the original Tags Element at the beginning can be
   overwritten with a Void Element and a new Tags Element written at the
   end of the Segment Element.  The file size will only marginally
   change.

I've read this several times and I just can't figure out how to reconcile "the
Tags Element SHOULD be placed at the end of the Segment Element" with "it's
better if the Tags Element is found early in the stream". Have I missed some
subtlety in the distinction between "the Segment Element" and "the stream"? Or
are you setting up a dichotomy on purpose? Or something else?

My guess is you're purposefully setting up a dichotomy (the Tags Element both
should go early, and late) and then proposing a solution for it (put it early,
but if it gets revised, Void it out and put the replacement late). Is that it?
If so, I'd recommend rewriting to be clearer about what you're doing and not
presenting the reader with a contradiction. Absent a significant rewrite, at a
minimum the SHOULD has to go.

### Section 10

It seems as though there are significant unstated assumptions in the
descriptions throughout this section and its subsections. I'll go into further
detail on specifics below. Since the Block is such a fundamental structure for
this document, it seems worth putting in some extra effort to make sure its
description is clear.

In routing protocol specifications we tend to use quaint ASCII protocol packet
diagrams with bit rulers along the top. Your use of variable-length fields
makes this approach difficult, but I regretted not having some kind of
graphical representation to reference, which might otherwise have helped. $0.02.

### Throughout, "bits" representation

In various places, you have language like "The bits 5-6 of the Block Header
flags are set to 11" and similar. This language seems like it needs to be
fixed, because the simplest plain English reading of the sentence (IMO anyway)
is "bit 5 is set to decimal 11, and so is bit 6", which doesn't make sense. The
most straightforward fix would be to write "... is set to 0b11" to indicate
that you're writing the value in binary.

### Tables 36 and following, semantics

The semantics of the tables are not clear to me, and I don't see that they're
defined anywhere:

Offset: What is this? All the examples are a small hex value and a plus sign.
The plus sign generally means "or more" colloquially, although if this were a
regex it would mean something about repetition. In any case, I don't know WHAT
it means here. By the way, why are you showing these in hex and not decimal
representation?

Player: This says "MUST" sometimes and is a hyphen other times. What does this
mean? That a compliant Matroska Player MUST implement support for certain
things (the MUSTs) and... something something I can't guess, for other things
(the hyphens)?

### Section 10.1, Track Number

The first line of Table 36 description is "Track Number (Track Entry). It is
coded in EBML like form (1 octet if the value is < 0x80, 2 if < 0x4000, etc.)
(most significant bits set to increase the range)."

First, a suggestion: if you're saying "encoded as an EBML VINT" (which I think
is what you mean) then say so, with xref, since that's a well-defined format
with a well-defined reference. Second, a concern: since you're using a
variable-length encoding here, I don't get how you can provide a specific
offset for the later fields; Timestamp's location is presumably "after Track
Number". Maybe that's what the "0x01+" notation means? It's telling me that
Timestamp won't be any earlier than byte 1, but it might be later, depending on
how Track Number is coded? But if the + means "or later" I don't see how Track
Number gets 0x00+ -- what would make it start later than byte 0?

### Tables 40 and following, semantics

The various lacings have tables with a column headed "block octets" or "block
octet". Given that the header is variable-length (because of the Track Number
encoding) I don't see how these block octets can be known a priori. I guess
perhaps the answer is "this is just an example, in a case where the Track
Number happens to be encoded in a single byte, so the header occupies octets
0-3". If that is the case, I suggest stating it explicitly, perhaps in the
introductory text of §10.4.

### 10.4.2 Xiph lacing, description of encoding

After looking at the example (thank you for providing this) and checking RFC
3533, I see what the coding is, but I think the description needs some work.
Here's what I understand your intent to be, with some comments:

"Lacing Head on 1 Octet". I'm not able to make sense of these words, on their
own. I infer from the examples that what you mean is, the first octet of the
block data is a single unsigned integer that provides the number of frames in
the lace minus one, meaning a lace can contain between 1 and 256 frames. Is
that right?

Then that's followed by a series of integers, described as "Lacing size of each
frame except the last one". These indicate the sizes of the respective lace
frames other than the final one (which is inferred). If the number of frames in
the lace is N, there will be N-1 integers in the list. The integers are
variable-sized, and the encoding of an integer is a repetition of zero or more
octets with value 255, followed by a single octet with a value less than 255.
The frame size value is the sum of the octet values.

If my paragraph above is a correct description, I think your paragraph that
begins "The lacing size is split into 255 values" needs work. To begin with,
the plain English reading of "is split into 255 values" is "it's split into 255
distinct things", which is not what you mean.

(Update: thanks for explaining why this point was mistaken.  By the way, what
happens if N=1? I appreciate that it would be silly to use anything other than
the "no lacing" option in this case, but is it forbidden to encode a single
frame in one of the other lacings? Given there's no way to properly do the
encoding, it seems like it might be worth saying explicitly that this is a MUST
NOT.)

### 10.4.3 EBML lacing, encodings

The previous point about "on 1 Octet" applies.

In this case, I have no problem with the integer coding definition, it's clear.
But the example gave me difficulty in several ways.

In the second line of Table 43, you have

        Block Octets    Value           Description

        5-6             0x43 0x20       Size of the first frame
                                        (800 = 0x320 + 0x4000)

My first attempt at understanding how eight hundred can be equal to 0x320 +
0x4000 was a fail, because of course using the normal understanding of the "="
and "+" symbols, it can't be. My second attempt was to interpret this as saying
that you're using the VINT coding, and the "=" is just (confusingly) being used
to mean the colloquial English, i.e. you're saying "800, encoded in VINT as 0x4
for the VINT_WIDTH and VINT_MARKER, meaning it's two octets wide, followed by
0x0320 for the next 14 bits... and 0x320 is decimal 800."

Looking at the third line,

        Block Octets    Value           Description

        7-8             0x5e 0xd3       Size of the second frame
                                        (500 - 800 = -300 = -0x12c + 0x1fff +
                                        0x4000)

Once again I think the descriptive text is confusing: the size of the second
frame isn't -300, though I do get that you store -300 to let the size of the
second frame be computed, as 800 - 300, to give 500, which is indeed the size.

Second, "-0x12c" isn't any standard notation I'm aware of. I get that you're
using it to mean "negative 300". Then 0x1fff is the hex value for
(2^((7*2)-1)-1), and the addition of 0x4000 is the value for VINT_WIDTH and
VINT_MARKER for a 2-octet encoding. At a minimum, I would rewrite this as
"0x1fff + 0x4000 - 0x012c" to eliminate the negation sign and produce an
expression that would parse, but read on.

It seems to me that readability would benefit from dividing the examples up --
a subsection on VINT/S_VINT, and a subsection on the actual Block encoding. For
example, the VINT part could go something like this, inserted directly after
Table 42.

   In our example, we will need to be able to represent 800, the size of
   the first frame, as a VINT. This is done using a 2 octet wide VINT as
   0x4000 + 0x0320 -- the 0x4000 is the VINT_WIDTH and VINT_MARKER, and
   the 0x0320 is the hexadecimal representation of decimal 800. The
   resulting VINT representation is 0x4320. We will also need to be able
   to represent -300, the offset required for the size of the second
   frame, as a signed VINT. This is done as 0x4000 + 0x1fff - 0x012c. In
   this expression, 0x4000 is again the VINT_WIDTH and VINT_MARKER,
   0x1fff is the hexadecimal representation of (2^((7*n)-1) - 1) for n
   of 2, and 0x012c is the hexadecimal representation of 300. The
   resulting signed VINT representation is 0x5ed3.

Or instead of talking about 300, you could also talk about "500 - 800", making
the expression "0x4000 + 0x1fff + (0x01f4 - 0x0320)", and/or you could
represent each value in the most convenient radix for clarity, as in "0x4000 +
0x1fff + (500 - 800)".

Then maybe you'd rewrite the descriptions in Table 43 something like

        Block Octets    Value           Description

        5-6             0x43 0x20       Size of the first frame, 800,
                                        encoded as a VINT

        7-8             0x5e 0xd3       Offset between size of first
                                        frame and size of second frame,
                                        encoded as a signed VINT. To
                                        recover the size of the second
                                        frame, we sum this value with
                                        the preceding frame, so,
                                        (-300) + 800 = 500.

### 10.5, misuse of RFC 2119 keywords

   When a frame in a BlockGroup is not a RAP, all references SHOULD be
   listed as a ReferenceBlock, at least some of them,

Which is it? All? Or some of them? The RFC 2119 SHOULD is confused about what
it needs to do here. Even if you remove the keyword (for example rewriting as
"should"), if I were implementing this I'd have a hard time knowing what you're
asking me to do.

                                                      even if not
   accurate, or one ReferenceBlock with the value "0" corresponding to a
   self or unknown reference.  The lack of ReferenceBlock would mean
   such a frame is a RAP and seeking on that frame that actually depends
   on other frames MAY create bogus output or even crash.

The RFC 2119 MAY is wrong in this context, you seem to be using it to express
"might". As written, a strict reading would mean you're saying "it is
explicitly OK, though optional, if implementations decide to create bogus
output or crash in this case". You can fix this by replacing the MAY with
"might" or "could".


----------------------------------------------------------------------
COMMENT:
----------------------------------------------------------------------

## COMMENT

### Section 7

"A reading application supporting Matroska version V MUST NOT refuse to read an
application with DocReadTypeVersion equal to or lower than V even if
DocTypeVersion is greater than V."

Should the second "application" be "file", making the text "... refuse to read
a file with..."?

### Section 9

"For video sequences signaled as progressive, it is twice the value of
DefaultDecodedFieldDuration."

What is "it"?

### Section 15.1, RFC 2119 keywords in an example

The first paragraph ends with an example, that contains MAY and SHOULD.
Generally, examples are non-normative, so probably it would be better for these
to be regular lower-case words and not RFC 2119 keywords. I have flagged some
other cases where you do this with examples, but I probably haven't gotten all
of them.

By the way, I very much appreciated all the examples used in the document, they
helped a lot with my understanding.

### Section 17.1, pathological examples

The way hard linking is defined, the obvious usage is like a classic
doubly-linked list... but you also allow for inference of forward and backward
links if absent. I guess this is for historical reasons and to be robust in the
face of weird pathologies, not because these patterns are desirable. Table 45
shows the classic doubly-linked list usage, and Tables 46-48 show what I would
describe as pathological (but still permitted) cases.

Maybe I am mistaken and all of the four examples shown are equally desirable
and acceptable, in which case please disregard. But if indeed one is the
preferred encoding style, then this might be a case for a SHOULD (for that
style) and MAY (for the others, already done in your text).

Also, this paragraph doesn't seem quite right:

   For each node of the chain of Segments of a Linked Segment at least
   one Segment MUST reference the other Segment of the node.

Specifically, I think "the other Segment of the node" should be something like
"another Segment within the chain"?

### Section 18.8, "sum" or "union"?

                            The Cues Elements corresponding to such a
   virtual track SHOULD be the sum of the Cues Elements for each of the
   tracks it's composed of (when the Cues are defined per track).

Should "sum" be "union"? I didn't see an obvious way to (arithmetically) sum
Cues Elements.

### Section 18.10, MAY instead of may?

I don't often suggest *adding* a 2119 keyword, but in the last paragraph,
"Matroska Readers may" might be a good place to use the RFC 2119 MAY.

### Section 19, examples don't need 2119 keywords

In the first paragraph, again we have 2119 keywords with examples. IMHO these
aren't needed.

### Section 20.2.2, wrong xref?

Is the xref to 5.1.7.1.4.14 correct? Maybe it should be to 5.1.7.4.3 or
5.1.7.4.4?

### Section 20.4

I would have benefited from a pointer to the [matroskatags] reference here:

   Each level can have different meanings for audio and video.  The
   ORIGINAL_MEDIA_TYPE tag can be used to specify a string for
   ChapterPhysicalEquiv = 60.

In Table 57, is the list of values (10..70) the only and immutable set of
possible values? Might this be a candidate for having an IANA registry?

### Section 21.1 refs for JPEG/PNG

"Only JPEG and PNG image formats SHOULD be used for cover art pictures." It
seems as though these should have references?

### Section 21.1, use of MAY; bullet list

Just checking: "If a selected subtitle track has some AttachmentLink elements,
the player MAY use only these fonts." As I unpack this using the RFC 2119
definition, this means that it's OK for a player to restrict its font rendering
of a subtitle track to only those found in the AttachmentLink element. That
seems reasonable, but I'm checking because if this were a plain English "may"
it would instead come out meaning that a player is not allowed to use any fonts
other than those in the AttachmentLink elements. It might be well to reword the
sentence to make the intended meaning a little more explicit, for example, if
the first reading is the intended one it could be something like "... the
player MAY restrict its font rendering to use only these fonts".

Also, this looks like a broken bullet list:

   A Matroska player SHOULD handle the official font media types from
   [RFC8081] when the system can handle the type: * font/sfnt: Generic
   SFNT Font Type, * font/ttf: TTF Font Type, * font/otf: OpenType
   Layout (OTF) Font Type, * font/collection: Collection Font Type, *
   font/woff: WOFF 1.0, * font/woff2: WOFF 2.0.

and so does this:

   There may also be some font attachments with the application/octet-
   stream media type.  In that case the Matroska player MAY try to guess
   the font type by checking the file extension of the
   AttachedFile\FileName string.  Common file extensions for fonts are:
   * .ttf for Truetype fonts, equivalent to font/ttf, * .otf for
   OpenType Layout fonts, equivalent to font/otf, * .ttc for Collection
   fonts, equivalent to font/collection The file extension check MUST be
   case insensitive.

### Section 23.1, "the following guidelines"

What are the guidelines being referred to by "The following guidelines, when
followed, help reduce the number of seeking operations"? I don't see any in
§23.1 itself. Maybe this is supposed to be an xref to Section 25?

### Section 23.2, TCP

In "Livestreaming of Matroska over HTTP (or any other plain protocol based on
TCP) is possible", I suppose it's not actually necessary to restrict
applicability to only TCP. QUIC is the elephant in the room, but presumably,
Matroska doesn't really care what it's being streamed over as long as it gets
reliable in-order delivery. Also, "plain protocol" isn't a well-defined term
AFAIK.

### Section 25.1, lower of?

Perhaps add "the lower of" in "store no more than 5 seconds or 5 megabytes" to
make it "store no more than the lower of 5 seconds or 5 megabytes"? Or "the
greater of" if that's what you mean, but as written it's underspecified.

### Section 27.1, advice for DE

Since you've used the "Specification Required" policy, RFC 8126 §4.5 says

   The required documentation and review criteria, giving clear guidance
   to the designated expert, should be provided when defining the
   registry.

Also, I notice you're using the word "reclaimed" instead of "deprecated" in
this section. I think your intention is clear, but why not use "deprecated"
instead? AFAIK it's the usual term for this, and indeed you use it in the title
of Annex A.

## NITS

### Section 4.4

"In some situation, a Cluster Element MAY" -> "In some situations, a Cluster
Element MAY" (make "situations" plural).

### Section 5.1.5

"This Element SHOULD be set when the Segment is not transmitted as a live
stream (see #livestreaming)." Looks like #livestreaming is a broken xref.

### Section 17.2

The last paragraph ends in what looks like a broken bullet list.

   There are 2 ways to use a chapter link: * Linked-Duration linking, *
   Linked-Edition linking

## Notes

This review is in the ["IETF Comments" Markdown format][ICMF], You can use the
[`ietf-comments` tool][ICT] to automatically convert this review into
individual GitHub issues.

[ICMF]: https://github.com/mnot/ietf-comments/blob/main/format.md
[ICT]: https://github.com/mnot/ietf-comments