Re: [Cellar] EBML Error Handling

Dave Rice <dave@dericed.com> Mon, 12 February 2018 16:32 UTC

From: Dave Rice <dave@dericed.com>
Date: Mon, 12 Feb 2018 11:31:50 -0500
In-Reply-To: <CAOXsMFJgw-A=8OZptNFg1F4c4XnNhmSfwN4nknLSTjVVWK7hfw@mail.gmail.com>
Cc: Codec Encoding for LossLess Archiving and Realtime transmission <cellar@ietf.org>
To: Steve Lhomme <slhomme@matroska.org>
Archived-At: <https://mailarchive.ietf.org/arch/msg/cellar/Fh8zRyntgv4gDVxTajQ86pVZBuw>
Subject: Re: [Cellar] EBML Error Handling


> On Nov 26, 2017, at 11:36 AM, Steve Lhomme <slhomme@matroska.org> wrote:
> 
> 2017-11-26 16:13 GMT+01:00 Steve Lhomme <slhomme@matroska.org>:
>> Hi,
>> 
>> The last remaining issue to fix before we can issue the last call on
>> the EBML specs is adding more text to explain how error handling is
>> done. It covers many possible cases.
>> 
>> Here are the cases I can think of (some mentioned here [1] and here [2]):
> 
> Here's what libebml does and how we originally wanted to handle those
> cases. Let me know if it could be improved or if it's just wrong.
> 
>> - multiple unique elements
> 
> libebml doesn't actually enforce reading unique elements only once. But
> there is a helper to read the first element of one EBML ID. So by
> default the first value encountered is the one that readers would use.
> Different parsers (like the one in VLC) usually work differently: they
> parse an EBML level and, on finding a known value that should be unique
> (for example video width), they set that value. If the element is
> encountered many times the value is overwritten. So the last value is
> actually used. I don't have any preference. I'm curious to know what
> mkvmerge does. We should also check what ffmpeg does.
> 
> Given the streaming nature of EBML/Matroska it would make sense that
> the first value encountered is used ASAP. But on the other hand a
> unique value usually sets a state that must be interpreted when the
> whole master element is read. So it may not matter that much. But
> there is the case of the ClusterTimecode. It does matter for
> streaming, it even has to be early in the Cluster. For this element
> subsequent values should be ignored (unless they're equal, which makes
> no difference). That would favor the 'first found sets the value'
> approach.
> 
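
To make the two policies above concrete, here is a minimal sketch and not what libebml or VLC actually do internally. The function name `resolve_unique` and the (ID, value) pair representation are assumptions made for illustration; 0xE7 is the ClusterTimecode ID and the values are made up.

```python
from typing import Iterable, Optional, Tuple

def resolve_unique(elements: Iterable[Tuple[int, int]],
                   target_id: int,
                   policy: str = "first") -> Optional[int]:
    """Pick the value of a supposedly unique element under a given policy."""
    chosen = None
    for element_id, value in elements:
        if element_id != target_id:
            continue
        if policy == "first":
            return value   # usable immediately, later duplicates are ignored
        chosen = value     # "last": every duplicate overwrites the previous one
    return chosen

# A Cluster with a duplicated ClusterTimecode (0xE7) around a SimpleBlock (0xA3).
cluster = [(0xE7, 1000), (0xA3, 0), (0xE7, 2000)]
first = resolve_unique(cluster, 0xE7, "first")
last = resolve_unique(cluster, 0xE7, "last")
```

With "first" a streaming reader can act on the value before the rest of the Cluster arrives; with "last" it has to wait for the end of the master element.
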
>> - CRC element misplaced (it should always be first)
> 
> IMO it should be considered as a false positive detection. In other
> words if the parser reads [BF] as an ID but it's not the first
> element, the data is probably incorrect and so should be handled as
> such (see further in the mail). There's no way a correct writer would
> misplace the CRC element.
> 
>> - ClusterTimecode misplaced (it should always be first or after the CRC)
> 
> Should not go in the EBML specs. But that's a case that needs to be
> considered (see above).
> 
>> - CRC mismatch with the content
> 
> libebml doesn't enforce any behavior. It just tells you if the CRC-32
> is valid when you ask. In the case of VLC (using libebml and
> libmatroska), it doesn't care. A player in general will try to play
> content even if it's partly bogus/damaged. So it's more a concern for
> data analysis/recovery. So we may not need to define any policy at
> all. An invalid CRC-32 value just means there's something wrong, but
> you don't know where.
> 
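
As a rough illustration of "it just tells you if the CRC-32 is valid when you ask", here is a sketch with a hypothetical helper `check_master_crc`. It assumes the EBML CRC-32 is the common IEEE CRC-32 (as computed by Python's `zlib.crc32`) stored little-endian, and that the CRC-32 element is written as [BF][84] followed by 4 octets at the very start of the master's payload. Longer size encodings are not handled.

```python
import zlib
from typing import Optional

def check_master_crc(payload: bytes) -> Optional[bool]:
    """True/False if the payload starts with a CRC-32 element, None if it has none."""
    if len(payload) < 6 or payload[0] != 0xBF or payload[1] != 0x84:
        return None  # no CRC-32 element in first position, nothing to verify
    stored = int.from_bytes(payload[2:6], "little")
    # The CRC covers the rest of the master's payload, not the CRC element itself.
    return zlib.crc32(payload[6:]) == stored

# Made-up payload: one ClusterTimecode element with value 0.
body = b"\xe7\x81\x00"
good = b"\xbf\x84" + zlib.crc32(body).to_bytes(4, "little") + body
damaged = good[:-1] + b"\x01"
```

The result only says that something in the payload changed, not where, which matches the point above.
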
>> - invalid EBML ID
> 
> libebml just skips data until it finds a proper ID. That is an ID that
> is valid at the semantic level. The library also has a mode that
> allows reading bogus data that seem to make sense (dummy reading).
> That's useful for forward compatibility with files. It's the preferred
> approach in VLC (it's optional and was the other way around in the
> past). So here we may not define a specific policy either. Forward
> compatibility is something that is very important so it should
> probably be the default way of handling things. Using dummy elements
> also allows remuxing a file and keeping the original data even if you
> don't understand them. It's not foolproof though if dummy data depend
> on data or metadata (like position) that may be altered during
> remuxing.
> 
> libebml has a twist though. Even if it finds a dummy element, it also
> checks that the length is legit in the context (i.e. not bigger than the
> parent). If it isn't, then the dummy ID is considered invalid and it
> starts parsing the next octets. Meaning there's a "hole" in the EBML
> data.
> 
>> - invalid EBML length
> 
> That's the case described above. libebml doesn't allow elements
> outside of the parent boundaries. That helps with false positives when
> encountering unknown element IDs. If such an error is detected, it
> starts parsing the next octets, resulting in a "hole" in the EBML
> data.
> 
> By next octet I mean the first octet after the EBML ID+Length fields.
> So if you have [AA][BB][CC] octets, it will start parsing the next ID
> using the [BB] octet and further. The hole is the [AA] octet.
> 
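
The resynchronisation described above can be sketched as follows. This is a simplified model and not libebml's code: `read_vint` and `scan` are hypothetical names, there is no semantic check on IDs, and unknown-size elements are not handled. An element whose end would fall outside the parent is rejected, one octet is recorded as a "hole", and parsing restarts at the following octet.

```python
from typing import List, Optional, Tuple

def read_vint(buf: bytes, pos: int, keep_marker: bool) -> Optional[Tuple[int, int]]:
    """Decode a variable-size integer at pos; return (value, length) or None."""
    if pos >= len(buf) or buf[pos] == 0:
        return None                          # nothing to read, or wider than 8 octets
    length = 9 - buf[pos].bit_length()       # number of leading zero bits + 1
    if pos + length > len(buf):
        return None
    value = int.from_bytes(buf[pos:pos + length], "big")
    if not keep_marker:
        value &= (1 << (7 * length)) - 1     # sizes drop the VINT_MARKER bit
    return value, length

def scan(parent: bytes) -> List[Tuple[str, int, int]]:
    """Split a parent's payload into ("element"|"hole", start, end) spans."""
    spans: List[Tuple[str, int, int]] = []
    pos = 0
    while pos < len(parent):
        eid = read_vint(parent, pos, keep_marker=True)
        size = read_vint(parent, pos + eid[1], keep_marker=False) if eid else None
        if eid and size:
            end = pos + eid[1] + size[1] + size[0]
            if end <= len(parent):           # element stays inside the parent
                spans.append(("element", pos, end))
                pos = end
                continue
        # Invalid ID or length: mark one octet as a hole and retry at the next one.
        if spans and spans[-1][0] == "hole" and spans[-1][2] == pos:
            spans[-1] = ("hole", spans[-1][1], pos + 1)
        else:
            spans.append(("hole", pos, pos + 1))
        pos += 1
    return spans

# [AA] claims a 59-octet element, which cannot fit; [BB][81][07] is a valid element.
spans = scan(bytes([0xAA, 0xBB, 0x81, 0x07]))
```

Here the [AA] octet ends up as the hole and parsing resumes with [BB] as the next ID, as in the [AA][BB][CC] example.
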
> Just like with the CRC-32 a player will not care much about those
> errors. It's more important for data analysis and possibly detecting
> the position of bogus data.
> 
>> - Master element with junk data (most likely unidentified or bogus IDs)
> 
> Again, we live in a world where bogus files will exist, even
> unintentionally. Players (at least the sane ones) will try to extract
> as much valid data from the source and not care about the rest.
> Regular users will not care if something may be wrong somewhere in the
> file. So the question is more for data analysis/recovery.
> 
> As seen above, there might be dummy data, which looks totally right but
> is unknown. And there might be "holes" for data that really didn't
> make sense the EBML way. Data analysis programs may want to treat them
> differently. But I'm not sure defining a policy makes much sense. What
> is more important is that the same bogus data is handled the same way
> between parsers. As mentioned during No Time To Wait 2, the same bits
> should produce the same output. So we should probably agree on how
> these 2 kinds of data errors are generated rather than what to do with
> them.
> 
>> - reserved EBML ID values
> 
> For libebml it falls in the category of dummy data.
> The specs currently say "Any Element ID with the VINT_DATA component
> set as all zero values or all one values MUST be ignored and MUST NOT
> be considered an error in the EBML Document". It doesn't say what to
> do with [80] as an ID for example. IMO it should be considered an
> invalid ID (see above) and so skip to the next octet (EBML "hole").
> 
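
For reference, the all-zeros / all-ones test quoted above could look like this. It is only a sketch: `classify_id` and its return labels are made up for illustration, and the 4-octet limit assumes the default EBMLMaxIDLength.

```python
def classify_id(raw: bytes) -> str:
    """Classify the VINT_DATA of an encoded Element ID."""
    if not raw or raw[0] == 0:
        return "invalid"
    length = 9 - raw[0].bit_length()         # octets announced by the first octet
    if length > 4 or len(raw) != length:
        return "invalid"                     # assumes EBMLMaxIDLength of 4
    data_bits = 7 * length
    data = int.from_bytes(raw, "big") & ((1 << data_bits) - 1)
    if data == 0:
        return "all-zeros"
    if data == (1 << data_bits) - 1:
        return "all-ones"
    return "ordinary"

kinds = [classify_id(b) for b in (b"\x80", b"\xff", b"\xbf", b"\x1a\x45\xdf\xa3")]
```

So [80] falls in the "all-zeros" bucket. Whether that bucket means "ignore" or "invalid ID, skip one octet" is exactly the open question.
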
>> - elements not found anymore
> 
> libebml doesn't do any version check on the semantic. This may not be
> good. Finding old IDs may be a sign of damaged data rather than a bogus
> writer. On the other hand mkvalidator does such a check and it's common
> to find mismatches, especially with an ever-evolving format like WebM. I
> think it's better to read these elements and let the upper level
> reader deal with the version mismatch.
> 
> In the case of deprecated values there's a big chance they're elements
> that are supposedly never used, so the data can be handled as valid but
> unused by the upper layer.
> 
>> or not yet for the given DocType
> 
> This is related to the Doctype 0/Experimental field/DTD discussion.
> libebml currently reads them as valid "dummy" elements (unless you
> disable dummy mode). As seen above the data could be remuxed as is
> even when not knowing what they do. I don't think they should be
> handled as errors. But it may change depending on what we define for
> experimental IDs. Especially how we handle current existing files, if
> they are considered as strict (therefore all unknown IDs are junk
> data, not dummy elements).
> 
>> - IDs with the inefficient size
> 
> libebml predates the decision of enforcing the most efficient size for
> IDs. So it allows that.
> Now the specs say "The VINT_DATA component of the Element ID MUST be
> encoded at the shortest valid length". So when an ID with an
> improper size is encountered, it must be considered as junk data. I
> don't think any files exist with IDs stored as such (or very early
> Matroska files). So it should be safe to go that way.
> 
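
A possible check for the "shortest valid length" rule, as a sketch only. `is_shortest_id` is a hypothetical name, and the code assumes that an all-ones VINT_DATA is not a usable shorter encoding, so a value like 127 legitimately needs two octets.

```python
def is_shortest_id(raw: bytes) -> bool:
    """True if no shorter valid encoding exists for this Element ID's VINT_DATA."""
    if not raw or raw[0] == 0:
        return False
    length = 9 - raw[0].bit_length()
    if len(raw) != length:
        return False
    data = int.from_bytes(raw, "big") & ((1 << (7 * length)) - 1)
    if length == 1:
        return True
    # A shorter form exists only if data fits in 7*(length-1) bits
    # without being the reserved all-ones pattern at that width.
    return data >= (1 << (7 * (length - 1))) - 1

checks = {
    "BF": is_shortest_id(b"\xbf"),         # 1 octet, nothing shorter
    "403F": is_shortest_id(b"\x40\x3f"),   # same value as BF, padded to 2 octets
    "407F": is_shortest_id(b"\x40\x7f"),   # 127 would be all ones in 1 octet
}
```

Under these assumptions [40][3F] would be junk data while [40][7F] stays valid.
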
>> - huge EBML Master size that can't hold in memory
> 
> "640 KB ought to be enough for everyone" is famous for being right at
> the time and wrong now. So it would be hard to set a hard limit. And
> if we don't it would violate the "same bits produce the same data"
> rule. Also the amount of memory available really depends on the
> particular machine the program runs on at a given time. Also Unknown
> Size elements are not meant to be held in memory anyway but to
> minimize latency (and backward writing when it's not possible).
> Not sure what to do here.
> 
>> - Elements <NON Master> with no default value with a size 0
> 
> libebml gives 0 for an integer with no size. Should it be considered a
> bug in the writing app or bogus data ? Not sure what to do here. If we
> decide to read it as the element it is (ie not as junk data producing
> a hole) it is in an unreadable state. It has no value that can be
> read. And it cannot be used without a value.
> 
>> - EBML element not in the right place according to the semantic
> 
> libebml would produce a dummy element or a hole depending on the
> reading mode. IMO it's junk data. If a writer writes this it cannot
> expect the data to be handled properly.
> The philosophy we use here will define how strict the
> writer output needs to be. If we say it's OK then it's kinda OK to
> have such bugs in writers.
> 
>> These are all reading/parsing errors. What the reader is
>> supposed to do will also define the possibilities to create safe files
>> for a writer (see [2]).
>> 
>> I will reply to this email with what libebml does and how it was
>> originally designed to work. I think it's better we agree on the rules
>> for each case before doing patches for the specs.
> 
> So this is a long reply to a complex topic. We have 2 current outcomes
> (dummy and hole) in libebml. Maybe you can think of different ways to
> handle things.
> 
>> [1] https://github.com/Matroska-Org/ebml-specification/issues/48
>> [2] https://github.com/Matroska-Org/ebml-specification/issues/92

A minor start, but I began a section on “Considerations for Reading EBML” that can contain these topics. See https://github.com/Matroska-Org/ebml-specification/pull/177. If this approach works, we can start to incorporate the other topics you’re listing.

Kind Regards,
Dave Rice