Re: [Cellar] John Scudder's Discuss on draft-ietf-cellar-matroska-16: (with DISCUSS and COMMENT)

Hi,

Thanks for the detailed review. Comments below.

> On 20 May 2023, at 21:49, John Scudder via Datatracker <noreply@ietf.org> wrote:
> 
> John Scudder has entered the following ballot position for
> draft-ietf-cellar-matroska-16: Discuss
> 
> When responding, please keep the subject line intact and reply to all
> email addresses included in the To and CC lines. (Feel free to cut this
> introductory paragraph, however.)
> 
> 
> Please refer to https://www.ietf.org/about/groups/iesg/statements/handling-ballot-positions/ 
> for more information about how to handle DISCUSS and COMMENT positions.
> 
> 
> The document, along with other ballot positions, can be found here:
> https://datatracker.ietf.org/doc/draft-ietf-cellar-matroska/
> 
> 
> 
> ----------------------------------------------------------------------
> DISCUSS:
> ----------------------------------------------------------------------
> 
> # John Scudder, RTG AD, comments for draft-ietf-cellar-matroska-16
> CC @jgscudder
> 
> ## DISCUSS
> 
> (This is an update to my previous ballot, to add one more question. The
> original ballot (below) was entirely questions for the authors (though of
> course, anyone is welcome to engage). The new question is primarily
> process-related and I'm adding it mostly to flag the issue for the sponsoring
> AD, and to make sure we cover it during IESG discussion, although again, anyone
> is welcome to engage.)
> 
> Primarily for the AD/IESG: I happened to notice that there are 9 errata open
> against RFC 8794, the EMBL spec, which is a normative reference for this doc.
> All of them are in "reported" state so have no normative force right now. Two
> of them (ID 7189 and ID 7191) are added specifically to bring Matroska's usage
> into compliance (or rather, to redefine "compliance" to include what Matroska
> does).
> 
> It's not clear to me that these are proper uses of the errata process, but it's
> not my WG or area. In any case, it seems like this is a hint that there's a
> potential problem. Whether the problem should be fixed by verifying the errata,
> or some other way, I don't know.

I’ll let Michael and Spencer chime on that as I’m less familiar with the details.
We do have a plan to make a new EBML document, so making this more formal is possible.

However if we can mark some the errata in a different way to make them normative, we should do it.

> 
> Primarily for the authors:
> 
> Thanks for this document. It's well outside my comfort zone (I'm a routing
> protocols guy) and I appreciate the effort that's gone into making what is
> clearly a major piece of work accessible to someone like me.
> 
> Although I haven't completed my review, as I've gone through it I found two
> sections that seemed worth having a DISCUSSion about. Since the Section 10
> DISCUSSion might take a while, I figured I should raise it now rather than wait
> to finish my entire review. So, I may add some additional comments later.
> 
> ### Section 6.8
> 
> To quote §6.8 in its entirety:
> 
>   The Tags Element is most subject to changes after the file was
>   originally created.  For easier editing, the Tags Element SHOULD be
>   placed at the end of the Segment Element, even after the Attachments
>   Element.  On the other hand, it is inconvenient to have to seek in
>   the Segment for tags, especially for network streams.  So it's better
>   if the Tags Element is found early in the stream.  When editing the
>   Tags Element, the original Tags Element at the beginning can be
>   overwritten with a Void Element and a new Tags Element written at the
>   end of the Segment Element.  The file size will only marginally
>   change.
> 
> I've read this several times and I just can't figure out how to reconcile "the
> Tags Element SHOULD be placed at the end of the Segment Element" with "it's
> better if the Tags Element is found early in the stream". Have I missed some
> subtlety in the distinction between "the Segment Element" and "the stream"? Or
> are you setting up a dichotomy on purpose? Or something else?
> 
> My guess is you're purposefully setting up a dichotomy (the Tags Element both
> should go early, and late) and then proposing a solution for it (put it early,
> but if it gets revised, Void it out and put the replacement late). Is that it?
> If so, I'd recommend rewriting to be clearer about what you're doing and not
> presenting the reader with a contradiction. Absent a significant rewrite, at a
> minimum the SHOULD has to go.

I agree. The first SHOULD is normative although the text makes it “equally” OK to put at the beginning or at the end of the file. The reasons for both are explained. Maybe we can add that by default the Tags should go at the beginning (non normative) and when edited, they can be put at the end by voiding the original Tags.

> 
> ### Section 10
> 
> It seems as though there are significant unstated assumptions in the
> descriptions throughout this section and its subsections. I'll go into further
> detail on specifics below. Since the Block is such a fundamental structure for
> this document, it seems worth putting in some extra effort to make sure its
> description is clear.
> 
> In routing protocol specifications we tend to use quaint ASCII protocol packet
> diagrams with bit rulers along the top. Your use of variable-length fields
> makes this approach difficult, but I regretted not having some kind of
> graphical representation to reference, which might otherwise have helped. $0.02.

I can give it a try. After all there aren’t many Track Number sizes possible. In most cases it will be coded on one byte.

> 
> ### Throughout, "bits" representation
> 
> In various places, you have language like "The bits 5-6 of the Block Header
> flags are set to 11" and similar. This language seems like it needs to be
> fixed, because the simplest plain English reading of the sentence (IMO anyway)
> is "bit 5 is set to decimal 11, and so is bit 6", which doesn't make sense. The
> most straightforward fix would be to write "... is set to 0b11" to indicate
> that you're writing the value in binary.
> 

You’re right.

> ### Tables 36 and following, semantics
> 
> The semantics of the tables are not clear to me, and I don't see that they're
> defined anywhere:
> 
> Offset: What is this? All the examples are a small hex value and a plus sign.
> The plus sign generally means "or more" colloquially, although if this were a
> regex it would mean something about repetition. In any case, I don't know WHAT
> it means here. By the way, why are you showing these in hex and not decimal
> representation?
> 
> Player: This says "MUST" sometimes and is a hyphen other times. What does this
> mean? That a compliant Matroska Player MUST implement support for certain
> things (the MUSTs) and... something something I can't guess, for other things
> (the hyphens)?

This was ported from our all website where all the elements were defined that way. The + expressing a varying length. We’ve gone a different way in the specs but the format remains there. Switching to a table as you mentioned should make this much clearer.

The MUST in these tables indicate the reader cannot ignore these bits. But in the end section 10.4 explains this much better. So we don’t need it there.

> 
> ### Section 10.1, Track Number
> 
> The first line of Table 36 description is "Track Number (Track Entry). It is
> coded in EBML like form (1 octet if the value is < 0x80, 2 if < 0x4000, etc.)
> (most significant bits set to increase the range)."
> 
> First, a suggestion: if you're saying "encoded as an EBML VINT" (which I think
> is what you mean) then say so, with xref, since that's a well-defined format

Indeed.

> with a well-defined reference. Second, a concern: since you're using a
> variable-length encoding here, I don't get how you can provide a specific
> offset for the later fields; Timestamp's location is presumably "after Track
> Number". Maybe that's what the "0x01+" notation means? It's telling me that
> Timestamp won't be any earlier than byte 1, but it might be later, depending on
> how Track Number is coded? But if the + means "or later" I don't see how Track
> Number gets 0x00+ -- what would make it start later than byte 0?

That is what the + is for. Without the rest of the document using this notation, it’s definitely unclear.

> ### Tables 40 and following, semantics
> 
> The various lacings have tables with a column headed "block octets" or "block
> octet". Given that the header is variable-length (because of the Track Number
> encoding) I don't see how these block octets can be known a priori. I guess
> perhaps the answer is "this is just an example, in a case where the Track
> Number happens to be encoded in a single byte, so the header occupies octets
> 0-3". If that is the case, I suggest stating it explicitly, perhaps in the
> introductory text of §10.4.

OK

> ### 10.4.2 Xiph lacing, description of encoding
> 
> After looking at the example (thank you for providing this) and checking RFC
> 3533, I see what the coding is, but I think the description needs some work.
> Here's what I understand your intent to be, with some comments:
> 
> "Lacing Head on 1 Octet". I'm not able to make sense of these words, on their
> own. I infer from the examples that what you mean is, the first octet of the
> block data is a single unsigned integer that provides the number of frames in
> the lace minus one, meaning a lace can contain between 1 and 256 frames. Is
> that right?

This is correct.

> Then that's followed by a series of integers, described as "Lacing size of each
> frame except the last one". These indicate the sizes of the respective lace
> frames other than the final one (which is inferred). If the number of frames in
> the lace is N, there will be N-1 integers in the list. The integers are
> variable-sized, and the encoding of an integer is a repetition of zero or more
> octets with value 255, followed by a single octet with a value less than 255.
> The frame size value is the sum of the octet values.
> 
> If my paragraph above is a correct description, I think your paragraph that
> begins "The lacing size is split into 255 values" needs work. To begin with,
> the plain English reading of "is split into 255 values" is "it's split into 255
> distinct things", which is not what you mean.

Indeed. It’s split into a succession of the value 255 and then the remaining size.

> By the way, what happens if N=1? I appreciate that it would be silly to use
> anything other than the "no lacing" option in this case, but is it forbidden to
> encode a single frame in one of the other lacings? Given there's no way to
> properly do the encoding, it seems like it might be worth saying explicitly
> that this is a MUST NOT.

In this case, this text applies "The size of the last frame is deduced from the size remaining in the Block after the other frames.”. The "Number of frames minus 1” would be zero. And the size of the only frame would be deduced from the Block Size and the end of that “number of frames minus 1” position in the buffer.

> 
> ### 10.4.3 EBML lacing, encodings
> 
> The previous point about "on 1 Octet" applies.
> 
> In this case, I have no problem with the integer coding definition, it's clear.
> But the example gave me difficulty in several ways.
> 
> In the second line of Table 43, you have
> 
>        Block Octets    Value           Description
> 
>        5-6             0x43 0x20       Size of the first frame
>                                        (800 = 0x320 + 0x4000)
> 
> My first attempt at understanding how eight hundred can be equal to 0x320 +
> 0x4000 was a fail, because of course using the normal understanding of the "="
> and "+" symbols, it can't be. My second attempt was to interpret this as saying
> that you're using the VINT coding, and the "=" is just (confusingly) being used
> to mean the colloquial English, i.e. you're saying "800, encoded in VINT as 0x4
> for the VINT_WIDTH and VINT_MARKER, meaning it's two octets wide, followed by
> 0x0320 for the next 14 bits... and 0x320 is decimal 800.

You are correct. It should mentioned the VINT coding. It was written long before the VINT concept was added and has not been revised since then.

> Looking at the third line,
> 
>        Block Octets    Value           Description
> 
>        7-8             0x5e 0xd3       Size of the second frame
>                                        (500 - 800 = -300 = -0x12c + 0x1fff +
>                                        0x4000)
> 
> Once again I think the descriptive text is confusing: the size of the second
> frame isn't -300, though I do get that you store -300 to let the size of the
> second frame be computed, as 800 - 300, to give 500, which is indeed the size.
> 
> Second, "-0x12c" isn't any standard notation I'm aware of. I get that you're
> using it to mean "negative 300". Then 0x1fff is the hex value for
> (2^((7*2)-1)-1), and the addition of 0x4000 is the value for VINT_WIDTH and
> VINT_MARKER for a 2-octet encoding. At a minimum, I would rewrite this as
> "0x1fff + 0x4000 - 0x012c" to eliminate the negation sign and produce an
> expression that would parse, but read on.

OK

> It seems to me that readability would benefit from dividing the examples up --
> a subsection on VINT/S_VINT, and a subsection on the actual Block encoding. For
> example, the VINT part could go something like this, inserted directly after
> Table 42.
> 
>   In our example, we will need to be able to represent 800, the size of
>   the first frame, as a VINT. This is done using a 2 octet wide VINT as
>   0x4000 + 0x0320 -- the 0x4000 is the VINT_WIDTH and VINT_MARKER, and
>   the 0x0320 is the hexadecimal representation of decimal 800. The
>   resulting VINT representation is 0x4320. We will also need to be able
>   to represent -300, the offset required for the size of the second
>   frame, as a signed VINT. This is done as 0x4000 + 0x1fff - 0x012c. In
>   this expression, 0x4000 is again the VINT_WIDTH and VINT_MARKER,
>   0x1fff is the hexadecimal representation of (2^((7*n)-1) - 1) for n
>   of 2, and 0x012c is the hexadecimal representation of 300. The
>   resulting signed VINT representation is 0x5ed3.
> 
> Or instead of talking about 300, you could also talk about "500 - 800", making
> the expression "0x4000 + 0x1fff + (0x01f4 - 0x0320)", and/or you could
> represent each value in the most convenient radix for clarity, as in "0x4000 +
> 0x1fff + (500 - 800)".
> 
> Then maybe you'd rewrite the descriptions in Table 43 something like
> 
>        Block Octets    Value           Description
> 
>        5-6             0x43 0x20       Size of the first frame, 800,
>                                        encoded as a VINT
> 
>        7-8             0x5e 0xd3       Offset between size of first
>                                        frame and size of second frame,
>                                        encoded as a signed VINT. To
>                                        recover the size of the second
>                                        frame, we sum this value with
>                                        the preceding frame, so,
>                                        (-300) + 800 = 500

OK, I’ll give this a try.

> ### 10.5, misuse of RFC 2119 keywords
> 
>   When a frame in a BlockGroup is not a RAP, all references SHOULD be
>   listed as a ReferenceBlock, at least some of them,
> 
> Which is it? All? Or some of them? The RFC 2119 SHOULD is confused about what
> it needs to do here. Even if you remove the keyword (for example rewriting as
> "should"), if I were implementing this I'd have a hard time knowing what you're
> asking me to do.

This is one tricky part we had to deal with. Because of slight differences between Matroska and WebM, we can up with this complicated description. In the end each case described is valid, or read the other way: every case not listed is invalid. Maybe we can even turn this as a  “MUST be either:
* all references listed as a ReferenceBlock
*some references listed as a ReferenceBlock, even if not accurate
* one ReferenceBlock with the value “0”, corresponding to a self or unknown reference"

> 
>                                                      even if not
>   accurate, or one ReferenceBlock with the value "0" corresponding to a
>   self or unknown reference.  The lack of ReferenceBlock would mean
>   such a frame is a RAP and seeking on that frame that actually depends
>   on other frames MAY create bogus output or even crash.
> 
> The RFC 2119 MAY is wrong in this context, you seem to be using it to express
> "might". As written, a strict reading would mean you're saying "it is
> explicitly OK, though optional, if implementations decide to create bogus
> output or crash in this case". You can fix this by replacing the MAY with
> "might" or "could".

Indeed, it’s not a normative “may”.

> 
> 
> ----------------------------------------------------------------------
> COMMENT:
> ----------------------------------------------------------------------
> 
> ## COMMENT
> 
> ### Section 7
> 
> "A reading application supporting Matroska version V MUST NOT refuse to read an
> application with DocReadTypeVersion equal to or lower than V even if
> DocTypeVersion is greater than V."
> 
> Should the second "application" be "file", making the text "... refuse to read
> a file with..."?

Indeed.

> 
> ### Section 9
> 
> "For video sequences signaled as progressive, it is twice the value of
> DefaultDecodedFieldDuration."
> 
> What is "it"?

It should be in the previous paragraph as it’s pretty much the same sentence as for interlaced content. In progressive content, we consider it has two fields per video sequence and DefaultDecodedFieldDuration corresponds to 2 fields. The text needs some rewording.

> ## NITS
> 
> ### Section 4.4
> 
> "In some situation, a Cluster Element MAY" -> "In some situations, a Cluster
> Element MAY" (make "situations" plural).
> 
> ### Section 5.1.5
> 
> "This Element SHOULD be set when the Segment is not transmitted as a live
> stream (see #livestreaming)." Looks like #livestreaming is a broken xref.
> 
> ## Notes
> 
> This review is in the ["IETF Comments" Markdown format][ICMF], You can use the
> [`ietf-comments` tool][ICT] to automatically convert this review into
> individual GitHub issues.
> 
> [ICMF]: https://github.com/mnot/ietf-comments/blob/main/format.md
> [ICT]: https://github.com/mnot/ietf-comments
> 

Interesting, although in our case the document is generated from various files, one of them generated from XML. I’m not sure this will work well in this case.

> 
> _______________________________________________
> Cellar mailing list
> Cellar@ietf.org
> https://www.ietf.org/mailman/listinfo/cellar