Re: [Cellar] On the multiplicity of Info elements

Dave Rice <dave@dericed.com> Wed, 06 January 2016 07:06 UTC

From: Dave Rice <dave@dericed.com>
In-Reply-To: <CAOXsMFKJJhzU-3CYqguDePY42T+Vvhx9ytAfvoM6xyqaZY+N4g@mail.gmail.com>
Date: Wed, 06 Jan 2016 02:06:05 -0500
Message-Id: <FCC4DC05-44CD-4C2B-8C59-8E3E5B494DC0@dericed.com>
References: <CAHUoETLC4dQQ7=TOuTXZ3aDjKCCJgz2s-8Gb33MoSAP3hgRQiQ@mail.gmail.com> <BEA72D66-EA3D-4CF0-987D-836E95287F39@dericed.com> <20151230091811.GA19636@bunkus.org> <CAOXsMFLCbe-W=h+tQpdRa8Nh0jz=xdbZTXEmoXsgQTbA=4OPCQ@mail.gmail.com> <C0E5EBA2-2A56-46F9-A049-629EFB11F280@dericed.com> <CAOXsMF+gc0d2LEisfHm0jnjDGQKcYquEMBt7FnZ_uuSNF=C0iw@mail.gmail.com> <568AC10F.9030303@gmx.de> <CAOXsMFKJJhzU-3CYqguDePY42T+Vvhx9ytAfvoM6xyqaZY+N4g@mail.gmail.com>
To: Steve Lhomme <slhomme@matroska.org>
Archived-At: <http://mailarchive.ietf.org/arch/msg/cellar/WDJ05JXGKFwdx9Wd_QRlNycwZ6A>
Cc: cellar@ietf.org, "Sebastian G. <bastik>" <bastik.public.mailinglist@gmx.de>
Subject: Re: [Cellar] On the multiplicity of Info elements
List-Id: Codec Encoding for LossLess Archiving and Realtime transmission <cellar.ietf.org>

> On Jan 5, 2016, at 3:20 AM, Steve Lhomme <slhomme@matroska.org> wrote:
> 
> 2016-01-04 19:59 GMT+01:00 Sebastian G. <bastik>
> <bastik.public.mailinglist@gmx.de>:
>> 04.01.2016, 08:11 Steve Lhomme:
>>> 2016-01-03 16:53 GMT+01:00 Dave Rice <dave@dericed.com>:
>>>> 
>>>>> On Jan 2, 2016, at 3:47 AM, Steve Lhomme <slhomme@matroska.org>
>>>>> wrote:
>>>>> 
>>>>> 2015-12-30 10:18 GMT+01:00 Moritz Bunkus <moritz@bunkus.org>:
>>>>>> Hey,
>>>>>> 
>>>>>> I only remember the discussion around Tracks being multiple,
>>>>>> not particularly for the other ones. Our intent way back when
>>>>>> was to allow muxers to write multiple instances of _the same
>>>>>> information_ in different places in order to make the file more
>>>>>> resilient against damage or incomplete downloads with protocols
>>>>>> like BitTorrent.
>>>>> 
>>>>> Yes, that's the idea for the Track Info as it's vital to the
>>>>> usability of the file, as well as the Segment Info. I'm not sure
>>>>> it's used in practice though. Since the goal of CELLAR is
>>>>> archiving solutions it may still make sense.
>>>> 
>>>> Perhaps to declare that an Element may be repeated but must be
>>>> repeated identically should be a new EBML Element Attributes, so
>>>> there can be a distinction between the repeatability Segment/Info
>>>> and the repeatability of SimpleBlock.
>>> 
>>> That might be good. After all, not all elements make sense as repeated
>>> ones. For example in Matroska you don't want a Cluster (timestamped
>>> data) to be repeated.
>>> 
>>>>>> The same reasoning could be applied to Info. Both elements are
>>>>>> absolutely crucial to playback; the other level 1 elements save
>>>>>> for the clusters simply aren’t.
>>>> 
>>>> But what should happen when the read finds differences in
>>>> repeated-but-should-be-identical elements?
>>> 
>>> Good question. Maybe repeated elements should have a CRC ? If a CRC
>>> is wrong (or not found) the parser could look for a copy.
>> 
>> I like the CRC idea for repeated elements, but it still does not define
>> how players should behave if they encounter two elements, even with
>> valid CRCs, not matching each other.
> 
> Also what about repeated elements that are not master elements. You'd
> have no way of telling which is the best version. So repeated should
> probably be master elements. Maybe CRC should be mandatory too (not
> sure if real life files already follow this rule). That's the only way
> a parser would be able to tell which version is correct, as far as I
> can see.
> 
>> There would have to be a recommendation. "Use always the first
>> occurrence of an element." or "If an element occurs repeated and its
>> values differ, the last occurrence is the one that should be used."
> 
> Not necessarily. Bogus data could be on the first one. The goal is not
> to write a version of the element and then an updated version in the
> file. It has to be the same data. If you want to clear the first
> version, you should use the Void EBML element. When the file is
> originally written, repeated values should be exactly the same. That
> means elements that have a file offset relative to a child element
> would not be equal. Luckily in Matroska offset positions are always
> relative to the Segment, so a Level-0 element.
> 
> http://www.matroska.org/technical/specs/notes.html#Position_References
> 
>> Obviously tools that create such files are violating the specifications
>> since it should not be allowed to create repeated elements with
>> differing values. On the other hand, it should be hard to break playback.
>> I prefer uniform behavior among players.
>> 
>>>> For a scenario of two differing Info Elements, VLC and FFmpeg use
>>>> different Info Elements. Which use is correct? Since the use of
>>>> repeated-identical elements is resilience a deviation between the
>>>> two could be expected, so we should suggest how the reader should
>>>> respond.
>>> 
>>> It was designed for recovery tools. It may not be good to change
>>> players for such cases. It would make them more complex. (unless an
>>> elegant/easy solution is found).
>> 
>> For differences due to transmission errors a CRC for repeated elements
>> seems a good solution.
>> 
>> Players have to do something with repeated elements. I don't know what
>> they do, but there should be a recommended way they should handle such
>> cases. If a player breaks, that is OK, as long as the file was violating
>> the specs. A player should behave in an expected way.
> 
> IMO what makes sense from a player point of view is to read an
> element. If there's a CRC, it's broken and the semantic says the
> element can be repeated, then it should look for a valid version.
> Otherwise it shouldn't have to wonder which element to use if it
> encounters another one. IMO repeated elements only make sense if
> there's a CRC (whichever form it may take).
> 
> Another rule for repeatable elements: the element MUST be unique at
> that level (not multiple).
> 
>>>>>>> SeekHead, Info, Cluster, Tracks, and Tags are multiple.
>>>>>> 
>>>>>> SeekHead and Cluster must be multiple. SeekHead in order to
>>>>>> allow moving a SeekHead to the end of the file while still
>>>>>> referencing it from the start (so that normal players will
>>>>>> still find it quickly). Cluster for obvious reasons.
>>>>>> 
>>>>>>> And Cues, Attachments, and Chapters are non-multiple.
>>>>>> 
>>>>>> I have no idea why Tags is multiple and these three aren't.
>>>>>> 
>>>>>> To me the following would make sense:
>>>>>> 
>>>>>> - Info, Tracks – multiple but only if each instance contains
>>>>>> the same information
>>>>>> 
>>>>>> - SeekHead, Cluster – multiple without restrictions
>>>>>> 
>>>>>> - Attachments, Chapters, Cues, Tags – single
>>>> 
>>>> I can understand Attachments and Tags being multiple as it could
>>>> allow attachments or tags to be added to a file without having to
>>>> re-write too many bytes.
>>> 
>>> Yes. But then there should be a way for the player to know about
>>> these beforehand. Good players scan Matroska files beforehand anyway
>>> (unless it's live streaming).
>>> 
>> 
>> I agree with having a mechanism for players to know about them beforehand.

Here is a draft for an EBML Schema Attribute to be used in the definition of EBML Elements for what I’m referring to as Identically-Recurring Elements.

====
A boolean to express if the EBML Element may occur within its Parent Element more than once, provided that each recurrence within that Parent Element MUST be identical both in storage and semantics. Such Elements are referred to as Identically-Recurring Elements. In this case, identical copies of an EBML Element are permitted to be stored multiple times within the same Parent Element in order to increase data resilience and optimize the use of EBML in transmission. Identically-Recurring Elements SHOULD include a CRC-32 Element as a Child Element. If a Parent Element contains more than one copy of an Identically-Recurring Element which includes a CRC-32 Child Element then the first instance of the Identically-Recurring Element with a valid CRC-32 value should be used for interpretation. If a Parent Element contains more than one copy of an Identically-Recurring Element which does not contain a CRC-32 Child Element then the first instance of the Identically-Recurring Element should be used for interpretation. If the `identical` attribute is not expressed for that Element then that Element is considered to not have a requirement for identical expression within the same Parent Element. The `identical` attribute is only valid if the Element is not set to `multiple`, otherwise the `identical` attribute shall be ignored.
====

The text may be seen in context with the other attributes at https://github.com/MediaArea/ebml-specification/blob/d108c14d1f1c748d1f3f50e58d4057208325f892/specification.markdown#ebml-schema-element-attributes
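
To make the selection rule in the draft concrete, here is a rough Python sketch of how a reader might pick which copy to interpret. It is only an illustration under assumptions: `ElementCopy`, `crc_is_valid`, and `select_copy` are hypothetical names, the copies are assumed to be already parsed into a payload plus an optional stored CRC value, and `zlib.crc32` stands in for the CRC-32 check. The draft does not say what to do when copies carry a CRC-32 but none validates, so the sketch returns nothing in that case.

```python
import zlib
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class ElementCopy:
    """One stored copy of an Identically-Recurring Element (hypothetical model)."""
    payload: bytes             # Element Data, excluding the CRC-32 Child Element
    stored_crc: Optional[int]  # value read from the CRC-32 Child Element, if present


def crc_is_valid(copy: ElementCopy) -> bool:
    """True only if the copy carries a CRC-32 and it matches the payload."""
    return copy.stored_crc is not None and zlib.crc32(copy.payload) == copy.stored_crc


def select_copy(copies: Sequence[ElementCopy]) -> Optional[ElementCopy]:
    """Pick the copy to interpret, following the draft wording."""
    if not copies:
        return None
    if any(c.stored_crc is not None for c in copies):
        # Copies with a CRC-32: use the first one whose CRC-32 is valid.
        for c in copies:
            if crc_is_valid(c):
                return c
        # Assumption: the draft is silent on this case, so report no usable copy.
        return None
    # No CRC-32 anywhere: use the first instance.
    return copies[0]
```

With a damaged first copy and an intact second copy, `select_copy` skips the first and returns the second; with no CRC-32 at all it simply returns the first copy.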

Some pending questions:
- Should Identically-Recurring Elements recommend or mandate inclusion of a CRC-32? I suggest recommending rather than mandating, for backward compatibility.
- Is ‘Identically-Recurring Elements’ a decent short name for these types of Elements?
- Should the ‘identical’ attribute apply to only Master-elements? I think it could be open. I could see scenarios to place the CRC-32 both at the beginning and end of the Parent Element.
- I’m thinking that ‘multiple’ means the Element may recur within its Parent Element and means that there are multiple semantics. I suggest that ‘identical’ Elements not be ‘multiple’ as an ‘identical’ Element recurs but only with a single and non-multiple semantic meaning. OK?
- If there are multiple copies of identical Elements, I wrote in the draft that if there’s a CRC then the first copy with a valid CRC be used, else the first copy be used. Potentially we could say the first valid copy be used, but then we need to say fully what valid means.
- I agree that an identical Element should not be used to change the semantics over time.

Btw while working on this I found a mkv file from Lavf54.29.104 where a Track Entry element contained two different copies of DefaultDuration. They only differed by 1, but is the first one valid and the second one invalid?
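
For what it's worth, a check for this kind of deviation could be as simple as comparing the stored bytes of each copy. The sketch below is an assumption-laden illustration, not something from any existing parser: `copies_are_identical` is a hypothetical helper, and the two duration values in the usage example are made up rather than taken from the file I found.

```python
from typing import Sequence


def copies_are_identical(copies: Sequence[bytes]) -> bool:
    """True if every stored copy is byte-for-byte equal to the first one."""
    return all(c == copies[0] for c in copies[1:])


# Made-up values that differ by 1, stored as big-endian unsigned integers.
first = (41708333).to_bytes(4, "big")
second = (41708334).to_bytes(4, "big")
differs = not copies_are_identical([first, second])
```

This only answers whether the copies match in storage; it says nothing about which one is the valid one, which is the open question.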

Dave Rice