Re: [Cellar] EBML Error Handling

Dave Rice <dave@dericed.com> Mon, 12 February 2018 16:32 UTC

From: Dave Rice <dave@dericed.com>
Message-Id: <498A8E37-D9C4-4A0E-B7E6-C7EB185FC79A@dericed.com>
Content-Type: multipart/alternative; boundary="Apple-Mail=_2427076D-DAB8-41F5-A1F3-0F160A12C248"
Mime-Version: 1.0 (Mac OS X Mail 11.1 \(3445.4.7\))
Date: Mon, 12 Feb 2018 11:31:50 -0500
In-Reply-To: <CAOXsMFJgw-A=8OZptNFg1F4c4XnNhmSfwN4nknLSTjVVWK7hfw@mail.gmail.com>
Cc: Codec Encoding for LossLess Archiving and Realtime transmission <cellar@ietf.org>
To: Steve Lhomme <slhomme@matroska.org>
References: <CAOXsMFJCdt+XVmP5tYQ=LfNdwFmZCfgM-HCtep27P+dji6K4gQ@mail.gmail.com> <CAOXsMFJgw-A=8OZptNFg1F4c4XnNhmSfwN4nknLSTjVVWK7hfw@mail.gmail.com>
Archived-At: <https://mailarchive.ietf.org/arch/msg/cellar/Fh8zRyntgv4gDVxTajQ86pVZBuw>
Subject: Re: [Cellar] EBML Error Handling
Precedence: list


> On Nov 26, 2017, at 11:36 AM, Steve Lhomme <slhomme@matroska.org> wrote:
> 
> 2017-11-26 16:13 GMT+01:00 Steve Lhomme <slhomme@matroska.org>:
>> Hi,
>> 
>> The last remaining issue to fix before we can issue the last call on
>> the EBML specs is more text to explain how error handling is done. It
>> covers many possible cases.
>> 
>> Here are the cases I can think of (some mentioned here [1] and here [2]):
> 
> Here's what libebml does and how we originally wanted to handle those
> cases. Let me know if it could be improved or if it's just wrong.
> 
>> - multiple unique elements
> 
> libebml doesn't actually enforce reading unique elements only once. But
> there is a helper to read the first element of a given EBML ID. So by
> default the first value encountered is the one that readers would use.
> Other parsers (like the one in VLC) usually work differently: they
> parse an EBML level and, on finding a known value that should be unique
> (for example video width), they set that value. If the element is
> encountered many times the value is overwritten, so the last value is
> the one actually used. I don't have any preference. I'm curious to know
> what mkvmerge does. We should also check what ffmpeg does.
> 
> Given the streaming nature of EBML/Matroska it would make sense that
> the first value encountered is used ASAP. But on the other hand a
> unique value usually sets a state that must be interpreted when the
> whole master element is read. So it may not matter that much. But
> there is the case of the ClusterTimecode. It does matter for
> streaming, it even has to be early in the Cluster. For this element
> subsequent values should be ignored (unless they're equal, which makes
> no difference). That would favor the 'first found sets the value'
> approach.
> 
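The two policies described above (first value wins, last value wins) can be sketched in a few lines of Python. This is only an illustration: `resolve_unique` and its `policy` argument are made-up names, not libebml or VLC API, and the children are assumed to be already parsed into (ID, value) pairs.

```python
from typing import Iterable, Optional, Tuple

PIXEL_WIDTH = 0xB0   # Matroska PixelWidth, used here only as an example
PIXEL_HEIGHT = 0xBA  # Matroska PixelHeight


def resolve_unique(children: Iterable[Tuple[int, int]],
                   element_id: int,
                   policy: str = "first") -> Optional[int]:
    """Pick the value kept for an element that should appear only once.

    policy="first": stop at the first occurrence (streaming friendly).
    policy="last":  every later occurrence overwrites the earlier one.
    """
    chosen = None
    for cid, value in children:
        if cid != element_id:
            continue
        if policy == "first":
            return value      # later duplicates are never looked at
        chosen = value        # "last": keep overwriting
    return chosen


# A Video master that (incorrectly) carries PixelWidth twice.
children = [(PIXEL_WIDTH, 1920), (PIXEL_HEIGHT, 1080), (PIXEL_WIDTH, 1280)]
first_value = resolve_unique(children, PIXEL_WIDTH, "first")
last_value = resolve_unique(children, PIXEL_WIDTH, "last")
```

With the duplicated PixelWidth above, the two policies give different answers for the same bits, which is the interoperability concern raised in this thread.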
>> - CRC element misplaced (it should always be first)
> 
> IMO it should be considered as a false positive detection. In other
> words if the parser reads [BF] as an ID but it's not the first
> element, the data is probably incorrect and so should be handled as
> such (see further in the mail). There's no way a correct writer would
> misplace the CRC element.
> 
>> - ClusterTimecode misplaced (it should always be first or after the CRC)
> 
> Should not go in the EBML specs. But that's a case that needs to be
> considered (see above).
> 
>> - CRC mismatch with the content
> 
> libebml doesn't enforce any behavior. It just tells you if the CRC-32
> is valid when you ask. In the case of VLC (using libebml and
> libmatroska), it doesn't care. A player in general will try to play
> content even if it's partly bogus/damaged. So it's more a concern for
> data analysis/recovery. So we may not need to define any policy at
> all. An invalid CRC-32 value just means there's something wrong, but
> you don't know where.
> 
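A minimal sketch of the "tell you if the CRC-32 is valid when you ask" behavior, in Python. It assumes the CRC-32 payload is 4 octets stored little-endian, that it covers the data of the parent Master Element that follows the CRC-32 Element, and that `zlib.crc32` computes the same IEEE CRC-32 the EBML specification refers to. `crc32_matches` is a hypothetical helper, not a libebml call.

```python
import zlib


def crc32_matches(stored: bytes, covered: bytes) -> bool:
    """Compare a stored CRC-32 payload against the data it is meant to cover.

    stored:  the 4-octet payload of the CRC-32 Element (assumed little-endian).
    covered: the octets of the parent that follow the CRC-32 Element (assumption).
    """
    if len(stored) != 4:
        return False
    expected = zlib.crc32(covered) & 0xFFFFFFFF
    return int.from_bytes(stored, "little") == expected


# Build a payload, then damage one octet of it.
covered = bytes([0xE7, 0x81, 0x00, 0xA3, 0x82, 0x12, 0x34])
stored = (zlib.crc32(covered) & 0xFFFFFFFF).to_bytes(4, "little")
damaged = bytes([covered[0] ^ 0x01]) + covered[1:]

intact_ok = crc32_matches(stored, covered)
damaged_ok = crc32_matches(stored, damaged)
```

As noted above, a mismatch only reports that something is wrong inside the covered range. It does not say where, so the caller still has to decide whether to play, skip or report the content.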
>> - invalid EBML ID
> 
> libebml just skips data until it finds a proper ID. That is an ID that
> is valid at the semantic level. The library also has a mode that
> allows reading bogus data that seem to make sense (dummy reading).
> That's useful for forward compatibility with files. It's the preferred
> approach in VLC (it's optional and was the other way around in the
> past). So here we may not define a specific policy either. Forward
> compatibility is something that is very important so it should
> probably be the default way of handling things. Using dummy elements
> also allows remuxing a file and keeping the original data even if you
> don't understand them. It's not foolproof though if dummy data depend
> on data or metadata (like position) that may be altered during
> remuxing.
> 
> libebml has a twist though. Even if it finds a dummy element, it also
> checks that the length is legit in the context (ie not bigger than the
> parent). If it isn't then the dummy ID is considered invalid and it
> starts parsing the next octets. Meaning there's a "hole" in the EBML
> data.
> 
>> - invalid EBML length
> 
> That's the case described above. libebml doesn't allow elements
> outside of the parent boundaries. That helps with false positives when
> encountering unknown element IDs. If such an error is detected, it
> starts parsing the next octets, resulting in a "hole" in the EBML
> data.
> 
> By next octet I mean the first octet after the EBML ID+Length fields.
> So if you have [AA][BB][CC] octets, it will start parsing the next ID
> using the [BB] octet and further. The hole is the [AA] octet.
> 
> Just like with the CRC-32 a player will not care much about those
> errors. It's more important for data analysis and possibly detecting
> the position of bogus data.
> 
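The dummy/hole behavior described above can be sketched as a small Python scanner. This is an illustration under stated assumptions, not a port of libebml: all function names are invented, IDs are limited to 4 octets, unknown-size elements are not handled, and resynchronisation follows the [AA][BB][CC] example (one octet becomes a hole and parsing restarts at the next one).

```python
from typing import Iterator, Optional, Set, Tuple

MAX_ID_OCTETS = 4  # assumed EBMLMaxIDLength


def vint_length(first_octet: int) -> int:
    """Octet count announced by the leading zero bits (0 if the octet is 0x00)."""
    return 9 - first_octet.bit_length() if first_octet else 0


def read_id(data: bytes, pos: int, end: int) -> Optional[Tuple[int, int]]:
    """Return (ID including its marker bit, octets used), or None if unreadable."""
    if pos >= end:
        return None
    n = vint_length(data[pos])
    if n == 0 or n > MAX_ID_OCTETS or pos + n > end:
        return None
    return int.from_bytes(data[pos:pos + n], "big"), n


def read_size(data: bytes, pos: int, end: int) -> Optional[Tuple[int, int]]:
    """Return (size with the marker bit stripped, octets used), or None."""
    if pos >= end:
        return None
    n = vint_length(data[pos])
    if n == 0 or pos + n > end:
        return None
    raw = int.from_bytes(data[pos:pos + n], "big")
    return raw & ((1 << (7 * n)) - 1), n


def scan(data: bytes, start: int, end: int,
         known_ids: Set[int]) -> Iterator[Tuple[str, int, int]]:
    """Yield (kind, offset, octets); kind is "element", "dummy" or "hole"."""
    pos = start
    while pos < end:
        total = None
        parsed_id = read_id(data, pos, end)
        if parsed_id is not None:
            parsed_size = read_size(data, pos + parsed_id[1], end)
            if parsed_size is not None:
                total = parsed_id[1] + parsed_size[1] + parsed_size[0]
        if total is None or pos + total > end:
            yield ("hole", pos, 1)  # does not fit in the parent: skip one octet
            pos += 1
            continue
        kind = "element" if parsed_id[0] in known_ids else "dummy"
        yield (kind, pos, total)
        pos += total


# [AA] claims a 48-octet payload that cannot fit in a 4-octet parent.
with_hole = list(scan(bytes([0xAA, 0xB0, 0x81, 0x05]), 0, 4, {0xB0}))
# An unknown ID whose length fits the parent is kept as a dummy element.
with_dummy = list(scan(bytes([0xAB, 0x82, 0x00, 0x00]), 0, 4, set()))
```

In the first buffer the [AA] octet becomes a one-octet hole and [B0] is then read as a regular element. Two parsers that follow the same resynchronisation rule would report the same holes for the same bits.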
>> - Master element with junk data (most likely unidentified or bogus IDs)
> 
> Again, we live in a world where bogus files will exist, even
> unintentionally. Players (at least the sane ones) will try to extract
> as much valid data from the source and not care about the rest.
> Regular users will not care if something may be wrong somewhere in the
> file. So the question is more for data analysis/recovery.
> 
> As seen above, there might be dummy data that looks totally right but
> is unknown. And there might be "holes" for data that really didn't
> make sense the EBML way. Data analysis programs may want to treat them
> differently. But I'm not sure defining a policy makes much sense. What
> is more important is that the same bogus data is handled the same way
> between parsers. As mentioned during No Time To Wait 2, the same bits
> should produce the same output. So we should probably agree on how
> these 2 kinds of data errors are generated rather than what to do with
> them.
> 
>> - reserved EBML ID values
> 
> For libebml it falls in the category of dummy data.
> The specs currently say "Any Element ID with the VINT_DATA component
> set as all zero values or all one values MUST be ignored and MUST NOT
> be considered an error in the EBML Document". It doesn't say what to
> do with [80] as an ID for example. IMO it should be considered an
> invalid ID (see above) and so skip to the next octet (EBML "hole").
> 
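The bit test that the quoted spec sentence describes (VINT_DATA all zeros or all ones) can be written as a short Python check. The helper names are invented for illustration, and the sketch only classifies an ID. It does not settle what "ignored" should mean for the octets that follow, which is the open question here.

```python
from typing import Tuple


def split_id(element_id: int) -> Tuple[int, int]:
    """Return (VINT_DATA value, VINT_DATA bit count) for an ID given with its marker.

    An n-octet ID has its marker at bit 7*n, so its bit length is 7*n + 1.
    """
    n, rem = divmod(element_id.bit_length() - 1, 7)
    if n < 1 or rem != 0:
        raise ValueError("not a well-formed Element ID: %#x" % element_id)
    bits = 7 * n
    return element_id & ((1 << bits) - 1), bits


def has_reserved_vint_data(element_id: int) -> bool:
    """True if VINT_DATA is all zero bits or all one bits."""
    data, bits = split_id(element_id)
    return data == 0 or data == (1 << bits) - 1


checks = {
    0x80: has_reserved_vint_data(0x80),      # VINT_DATA = 0000000
    0xFF: has_reserved_vint_data(0xFF),      # VINT_DATA = 1111111
    0xBF: has_reserved_vint_data(0xBF),      # CRC-32 ID, ordinary value
    0x7FFF: has_reserved_vint_data(0x7FFF),  # two octets, all ones
}
```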
>> - elements not found anymore
> 
> libebml doesn't do any version check on the semantics. This may not be
> good. Finding old IDs may be a sign of damaged data rather than a bogus
> writer. On the other hand mkvalidator does such a check and it's common
> to find mismatches, especially with an ever-evolving format like WebM. I
> think it's better to read these elements and let the upper level
> reader deal with the version mismatch.
> 
> In the case of deprecated values there's a big chance they're elements
> that are supposedly never used, so the data can be handled as valid but
> unused by the upper layer.
> 
>> or not yet for the given DocType
> 
> This is related to the Doctype 0/Experimental field/DTD discussion.
> libebml currently reads them as valid "dummy" elements (unless you
> disable dummy mode). As seen above the data could be remuxed as is
> even when not knowing what they do. I don't think they should be
> handled as errors. But it may change depending on what we define for
> experimental IDs. Especially how we handle current existing files, if
> they are considered as strict (therefore all unknown IDs are junk
> data, not dummy elements).
> 
>> - IDs with the inefficient size
> 
> libebml predates the decision of enforcing the most efficient size for
> IDs. So it allows that.
> Now the specs say "The VINT_DATA component of the Element ID MUST be
> encoded at the shortest valid length". So when an ID with an
> improper size is encountered, it must be considered as junk data. I
> don't think any files exist with IDs stored as such (except maybe very
> early Matroska files). So it should be safe to go that way.
> 
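A possible "shortest valid length" check for IDs, as a Python sketch. It assumes the all-zeros and all-ones VINT_DATA values stay reserved at every length, so an ID of n octets is minimal only when its VINT_DATA is larger than anything a non-reserved ID of n-1 octets can carry. The helper names are hypothetical.

```python
from typing import Tuple


def split_id(element_id: int) -> Tuple[int, int]:
    """Return (VINT_DATA value, octet count) for an ID given with its marker bit."""
    n, rem = divmod(element_id.bit_length() - 1, 7)
    if n < 1 or rem != 0:
        raise ValueError("not a well-formed Element ID: %#x" % element_id)
    return element_id & ((1 << (7 * n)) - 1), n


def is_shortest_id(element_id: int) -> bool:
    """True if the ID could not have been written with fewer octets."""
    data, n = split_id(element_id)
    if n == 1:
        return True
    # Largest VINT_DATA usable by a shorter ID (all ones is reserved).
    largest_shorter = (1 << (7 * (n - 1))) - 2
    return data > largest_shorter


padded = is_shortest_id(0x4001)      # VINT_DATA 1 would fit in [81]
boundary = is_shortest_id(0x407F)    # VINT_DATA 127 needs two octets
ebml_header = is_shortest_id(0x1A45DFA3)
```

Under the rule quoted above, a reader would treat [40][01] as junk data rather than as an alternate spelling of [81].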
>> - huge EBML Master size that can't hold in memory
> 
> "640 KB ought to be enough for everyone" is famous for being right at
> the time and wrong now. So it would be hard to set a hard limit. And
> if we don't it would violate the "same bits produce the same output"
> rule. Also the amount of memory available really depends on the
> particular machine the program runs on at a given time. Also Unknown
> Size elements are not meant to be held in memory anyway but to
> minimize latency (and backward writing when it's not possible).
> Not sure what to do here.
> 
>> - Elements <NON Master> with no default value with a size 0
> 
> libebml gives 0 for an integer with no size. Should it be considered a
> bug in the writing app or bogus data? Not sure what to do here. If we
> decide to read it as the element it is (ie not as junk data producing
> a hole) it is in an unreadable state. It has no value that can be
> read. And it cannot be used without a value.
> 
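The zero-length integer case can be made concrete with a small Python sketch. `read_uint` and its `zero_length_as_zero` switch are invented for illustration. The sketch assumes that an element which does declare a default falls back to it when empty, and it shows both candidate behaviors for an element without one: reading 0 (what is described for libebml above) or reporting "no value".

```python
from typing import Optional


def read_uint(payload: bytes,
              default: Optional[int] = None,
              zero_length_as_zero: bool = True) -> Optional[int]:
    """Decode an unsigned integer payload (big-endian, 0 to 8 octets)."""
    if len(payload) > 8:
        raise ValueError("unsigned integers are at most 8 octets")
    if len(payload) == 0:
        if default is not None:
            return default                       # assumption: empty means default
        return 0 if zero_length_as_zero else None
    return int.from_bytes(payload, "big")


lenient = read_uint(b"")                              # 0
strict = read_uint(b"", zero_length_as_zero=False)    # no usable value
with_default = read_uint(b"", default=1)
normal = read_uint(b"\x01\x00")
```

The lenient result is indistinguishable from an element that really stores 0, which is why the choice between the two behaviors matters for the "same bits, same output" goal.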
>> - EBML element not in the right place according to the semantic
> 
> libebml would produce a dummy element or a hole depending on the
> reading mode. IMO it's junk data. If a writer writes this it cannot
> expect the data to be handled properly.
> The philosophy we use here will define how strict the writer output
> needs to be. If we say it's OK then it's kinda OK to have such bugs in
> writers.
> 
>> These are all reading/parsing errors. What the reader is supposed to
>> do will also define the possibilities to create safe files for a
>> writer (see [2]).
>> 
>> I will reply to this email with what libebml does and how it was
>> originally designed to work. I think it's better we agree on the rules
>> for each case before doing patches for the specs.
> 
> So this is a long reply to a complex topic. We have 2 current outcomes
> (dummy and hole) in libebml. Maybe you can think of different ways to
> handle things.
> 
>> [1] https://github.com/Matroska-Org/ebml-specification/issues/48
>> [2] https://github.com/Matroska-Org/ebml-specification/issues/92

A minor start, but I began a section of “Considerations for Reading EBML” that can contain these topics. See https://github.com/Matroska-Org/ebml-specification/pull/177. If this approach works, we can start to incorporate the other topics you’re listing.

Kind Regards,
Dave Rice
