Re: [Cellar] EBML Error Handling
Dave Rice <dave@dericed.com> Mon, 12 February 2018 16:32 UTC
From: Dave Rice <dave@dericed.com>
Message-Id: <498A8E37-D9C4-4A0E-B7E6-C7EB185FC79A@dericed.com>
Date: Mon, 12 Feb 2018 11:31:50 -0500
In-Reply-To: <CAOXsMFJgw-A=8OZptNFg1F4c4XnNhmSfwN4nknLSTjVVWK7hfw@mail.gmail.com>
Cc: Codec Encoding for LossLess Archiving and Realtime transmission <cellar@ietf.org>
To: Steve Lhomme <slhomme@matroska.org>
References: <CAOXsMFJCdt+XVmP5tYQ=LfNdwFmZCfgM-HCtep27P+dji6K4gQ@mail.gmail.com> <CAOXsMFJgw-A=8OZptNFg1F4c4XnNhmSfwN4nknLSTjVVWK7hfw@mail.gmail.com>
Archived-At: <https://mailarchive.ietf.org/arch/msg/cellar/Fh8zRyntgv4gDVxTajQ86pVZBuw>
Subject: Re: [Cellar] EBML Error Handling
> On Nov 26, 2017, at 11:36 AM, Steve Lhomme <slhomme@matroska.org> wrote:
>
> 2017-11-26 16:13 GMT+01:00 Steve Lhomme <slhomme@matroska.org>:
>> Hi,
>>
>> The last remaining issue to fix before we can issue the last call on
>> the EBML specs is more text to explain how error handling is done. It
>> covers many possible cases.
>>
>> Here are the cases I can think of (some mentioned here [1] and here [2]):
>
> Here's what libebml does and how we originally wanted to handle those
> cases. Let me know if it could be improved or if it's just wrong.
>
>> - multiple unique elements
>
> libebml doesn't actually enforce reading unique elements only once. But
> there is a helper to read the first element of one EBML ID. So by
> default the first value encountered is the one that readers would use.
> Different parsers (like the one in VLC) usually work differently: they
> parse an EBML level and, on finding a known value that should be unique
> (for example video width), they set that value. If the element is
> encountered many times the value is overwritten. So the last value is
> actually used. I don't have any preference. I'm curious to know what
> mkvmerge does. We should also check what ffmpeg does.
>
> Given the streaming nature of EBML/Matroska it would make sense that
> the first value encountered is used ASAP. But on the other hand a
> unique value usually sets a state that must be interpreted when the
> whole master element is read. So it may not matter that much. But
> there is the case of the ClusterTimecode. It does matter for
> streaming, it even has to be early in the Cluster. For this element
> subsequent values should be ignored (unless they're equal, which makes
> no difference). That would favor the 'first found sets the value'
> approach.
>
>> - CRC element misplaced (it should always be first)
>
> IMO it should be considered as a false positive detection.
> In other words, if the parser reads [BF] as an ID but it's not the first
> element, the data is probably incorrect and so should be handled as
> such (see further in the mail). There's no way a correct writer would
> misplace the CRC element.
>
>> - ClusterTimecode misplaced (it should always be first or after the CRC)
>
> Should not go in the EBML specs. But that's a case that needs to be
> considered (see above).
>
>> - CRC mismatch with the content
>
> libebml doesn't enforce any behavior. It just tells you if the CRC-32
> is valid when you ask. In the case of VLC (using libebml and
> libmatroska), it doesn't care. A player in general will try to play
> content even if it's partly bogus/damaged. So it's more a concern for
> data analysis/recovery. So we may not need to define any policy at
> all. An invalid CRC-32 value just means there's something wrong, but
> you don't know where.
>
>> - invalid EBML ID
>
> libebml just skips data until it finds a proper ID. That is an ID that
> is valid at the semantic level. The library also has a mode that
> allows reading bogus data that seems to make sense (dummy reading).
> That's useful for forward compatibility with files. It's the preferred
> approach in VLC (it's optional and was the other way around in the
> past). So here we may not define a specific policy either. Forward
> compatibility is something that is very important so it should
> probably be the default way of handling things. Using dummy elements
> also allows remuxing a file and keeping the original data even if you
> don't understand it. It's not foolproof though if dummy data depends
> on data or metadata (like position) that may be altered during
> remuxing.
>
> libebml has a twist though. Even if it finds a dummy element, it also
> checks that the length is legit in the context (i.e. not bigger than the
> parent). If it isn't, then the dummy ID is considered invalid and it
> starts parsing the next octets. Meaning there's a "hole" in the EBML
> data.
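
To make the dummy/hole distinction concrete, here is a minimal Python sketch of a reader that labels each span of a parent's payload as "known", "dummy" or "hole". It is not libebml's code. The names (read_vint, read_header, scan), the known_ids set, and the choice to resynchronise one octet at a time after a failed parent-bounds check are all assumptions made for illustration, and Unknown-Size elements are deliberately not handled.

```python
def read_vint(buf, pos, keep_marker):
    """Return (value, length) for the VINT at buf[pos], or None if unusable."""
    if pos >= len(buf) or buf[pos] == 0:
        return None  # truncated, or longer than 8 octets (out of scope here)
    length = 9 - buf[pos].bit_length()  # leading zero bits + 1
    if pos + length > len(buf):
        return None
    value = int.from_bytes(buf[pos:pos + length], "big")
    if not keep_marker:
        value &= (1 << (7 * length)) - 1  # strip the VINT marker bit
    return value, length


def read_header(buf, pos):
    """Return (element_id, data_size, header_length), or None."""
    id_vint = read_vint(buf, pos, keep_marker=True)  # IDs keep their marker
    if id_vint is None:
        return None
    elem_id, id_len = id_vint
    size_vint = read_vint(buf, pos + id_len, keep_marker=False)
    if size_vint is None:
        return None
    size, size_len = size_vint
    return elem_id, size, id_len + size_len


def scan(buf, parent_end, known_ids):
    """Label spans of buf[0:parent_end] as 'known', 'dummy' or 'hole'."""
    spans = []
    pos = 0
    while pos < parent_end:
        header = read_header(buf, pos)
        if header is not None:
            elem_id, size, hdr_len = header
            end = pos + hdr_len + size
            if end <= parent_end:  # element must fit inside its parent
                kind = "known" if elem_id in known_ids else "dummy"
                spans.append((kind, pos, end))
                pos = end
                continue
        # Assumed resync policy: skip one octet and try again, merging
        # consecutive skipped octets into a single hole.
        if spans and spans[-1][0] == "hole" and spans[-1][2] == pos:
            spans[-1] = ("hole", spans[-1][1], pos + 1)
        else:
            spans.append(("hole", pos, pos + 1))
        pos += 1
    return spans
```

With such a scan, two parsers agreeing on the same resync rule would report the same holes for the same bits, which is the property argued for in this thread.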
>
>> - invalid EBML length
>
> That's the case described above. libebml doesn't allow elements
> outside of the parent boundaries. That helps with false positives when
> encountering unknown element IDs. If such an error is detected, it
> starts parsing the next octets, resulting in a "hole" in the EBML
> data.
>
> By next octet I mean the first octet after the EBML ID+Length fields.
> So if you have [AA][BB][CC] octets, it will start parsing the next ID
> using the [BB] octet and further. The hole is the [AA] octet.
>
> Just like with the CRC-32 a player will not care much about those
> errors. It's more important for data analysis and possibly detecting
> the position of bogus data.
>
>> - Master element with junk data (most likely unidentified or bogus IDs)
>
> Again, we live in a world where bogus files will exist, even
> unintentionally. Players (at least the sane ones) will try to extract
> as much valid data from the source and not care about the rest.
> Regular users will not care if something may be wrong somewhere in the
> file. So the question is more for data analysis/recovery.
>
> As seen above, there might be dummy data, which looks totally right but
> is unknown. And there might be "holes" for data that really didn't
> make sense the EBML way. Data analysis programs may want to treat them
> differently. But I'm not sure defining a policy makes much sense. What
> is more important is that the same bogus data is handled the same way
> between parsers. As mentioned during No Time To Wait 2, the same bits
> should produce the same output. So we should probably agree on how
> these 2 kinds of data errors are generated rather than what to do with
> them.
>
>> - reserved EBML ID values
>
> For libebml it falls in the category of dummy data.
> The specs currently say "Any Element ID with the VINT_DATA component
> set as all zero values or all one values MUST be ignored and MUST NOT
> be considered an error in the EBML Document".
> It doesn't say what to
> do with [80] as an ID for example. IMO it should be considered an
> invalid ID (see above) and so skip to the next octet (EBML "hole").
>
>> - elements not found anymore
>
> libebml doesn't do any version check on the semantic. This may not be
> good. Finding old IDs may be a sign of damaged data rather than a bogus
> writer. On the other hand mkvalidator does such a check and it's common
> to find mismatches. Especially with an ever evolving format like WebM. I
> think it's better to read these elements and let the upper level
> reader deal with the version mismatch.
>
> In the case of deprecated values there's a big chance they are elements
> that are supposedly never used, so the data can be handled as valid but
> unused by the upper layer.
>
>> or not yet for the given DocType
>
> This is related to the Doctype 0/Experimental field/DTD discussion.
> libebml currently reads them as valid "dummy" elements (unless you
> disable dummy mode). As seen above the data could be remuxed as is
> even when not knowing what they do. I don't think they should be
> handled as errors. But it may change depending on how we define
> experimental IDs. Especially how we handle current existing files, if
> they are considered as strict (therefore all unknown IDs are junk
> data, not dummy elements).
>
>> - IDs with the inefficient size
>
> libebml predates the decision of enforcing the most efficient size for
> IDs. So it allows that.
> Now the specs say "The VINT_DATA component of the Element ID MUST be
> encoded at the shortest valid length". So when data with an
> improper size is encountered, it must be considered as junk data. I
> don't think any files exist with IDs stored as such (or very early
> Matroska files). So it should be safe to go that way.
>
>> - huge EBML Master size that can't hold in memory
>
> "640 KB ought to be enough for everyone" is famous for being right at
> the time and wrong now. So it would be hard to set a hard limit.
> And if we don't it would violate the "same bits produce the same data"
> rule. Also the amount of memory available really depends on the
> particular machine the program runs on at a given time. Also Unknown
> Size elements are not meant to be held in memory anyway but to
> minimize latency (and backward writing when it's not possible).
> Not sure what to do here.
>
>> - Elements <NON Master> with no default value with a size 0
>
> libebml gives 0 for an integer with no size. Should it be considered a
> bug in the writing app or bogus data? Not sure what to do here. If we
> decide to read it as the element it is (i.e. not as junk data producing
> a hole) it is in an unreadable state. It has no value that can be
> read. And it cannot be used without a value.
>
>> - EBML element not in the right place according to the semantic
>
> libebml would produce a dummy element or a hole depending on the
> reading mode. IMO it's junk data. If a writer writes this it cannot
> expect the data to be handled properly.
> The philosophy we use here will define how strict the
> writer output needs to be. If we say it's OK then it's kinda OK to
> have such bugs in writers.
>
>> These are all reading/parsing errors. What the reader is
>> supposed to do will also define the possibilities to create safe files
>> for a writer (see [2]).
>>
>> I will reply to this email with what libebml does and how it was
>> originally designed to work. I think it's better we agree on the rules
>> for each case before doing patches for the specs.
>
> So this is a long reply to a complex topic. We have 2 current outcomes
> (dummy and hole) in libebml. Maybe you can think of different ways to
> handle things.
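
The two Element ID rules quoted above (reserved all-zero or all-one VINT_DATA, and shortest valid length) can be checked mechanically. Below is a minimal Python sketch, assuming the candidate ID has already been cut out of the stream as a byte string. The function name check_element_id and its four labels are hypothetical, and whether a "reserved" or "not-shortest" ID should then become a dummy element or a hole is exactly the open policy question of this thread, so the sketch only classifies.

```python
def check_element_id(octets: bytes) -> str:
    """Classify a candidate Element ID.

    Returns one of 'ok', 'reserved', 'not-shortest' or 'malformed'.
    """
    if not octets or octets[0] == 0:
        return "malformed"  # empty, or no VINT marker in the first octet
    length = 9 - octets[0].bit_length()  # leading zero bits + 1
    if len(octets) != length:
        return "malformed"  # octet count disagrees with the VINT marker
    data_bits = 7 * length
    all_ones = (1 << data_bits) - 1
    data = int.from_bytes(octets, "big") & all_ones  # VINT_DATA only
    if data == 0 or data == all_ones:
        return "reserved"  # e.g. [80] or [FF]
    # A value fits a shorter encoding if it is representable in
    # 7 * (length - 1) bits without being all ones at that length.
    if length > 1 and data < (1 << (7 * (length - 1))) - 1:
        return "not-shortest"
    return "ok"
```

For example, [80] comes out as "reserved" here, [BF] as "ok", and [40][3F] as "not-shortest", while [40][7F] is "ok" because the one-octet form of that value would be all ones.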
>
>> [1] https://github.com/Matroska-Org/ebml-specification/issues/48
>> [2] https://github.com/Matroska-Org/ebml-specification/issues/92

A minor start, but I began a section of “Considerations for Reading EBML” that can contain these topics. See https://github.com/Matroska-Org/ebml-specification/pull/177. If this approach works, we can start to incorporate the other topics you’re listing.

Kind Regards,
Dave Rice