Re: [Cellar] EBML Extensions

Jerome Martinez <jerome@mediaarea.net> Sun, 15 April 2018 15:44 UTC

On 15/04/2018 15:30, Steve Lhomme wrote:
> Following the discussion on RAWcooked elements we need a way to define
> how to create extensions to a DocType that are not meant to be merged
> in the main DocType specifications.
>
> The XML Schema we use is already quite good for external files as we
> can define the exact path an element can be found, independently of
> the main specs. We even have fields in the header to tell which
> DocType and which version of the DocType it applies to.
>
> What we don't have is these information when reading an EBML file. We
> know the DocType and the version but we don't know of any extensions
> that are also valid in this file.
>
> If we had a DTD/Schema embedded in every EBML file that would be where
> we put these data. But we don't. And there may be simpler way to do
> it, in the same philosophy as EBML/Matroska: only write information
> that's not obvious.
>
> We can take the RAWcooked elements as an example for what we need:
> https://github.com/Matroska-Org/matroska-specification/pull/223/files
>
> This should be a separate XML file that has the same DocType and
> version as the current Matroska one.
> https://github.com/Matroska-Org/matroska-specification/blob/master/ebml_matroska.xml
> We may also include a human readable name for the extension like
> extension="RAWcooked".
>
> The 3 added elements and that's it. That would be enough for the
> extension definition.
>
> In the file/stream we should tell if this extension is used, when it's
> used. The logical way to do that is to put this information in the
> EBML header. We will also need the IDs, the path where they belong and
> at least if it's a master element or not (to allow children elements).
> The path actually contains a lot of information like the maximum of
> occurrences or if it's a recursive element and even how the recursion
> works.
>
> The goal is not to interpret the values but only to tell what is valid
> and not valid we may not need to know the exact type of the value and
> the allowed ranges. But that could be added as well.
>
> The path would probably be in a binary format. Especially as the
> elements name are not known.
>
> So in the case of RAWcooked we would have something like that stored
> in the EBML header:
> EBML
>    EBMLExtensions
>        EBMLExtensions
>          ExtensionName: RAWcooked
>          ExtensionElement
>            ExtensionElementPath: 0*1(\0x18538067\0x1F43B675\0xA0\0x207262)
>            ExtensionElementHasChildren: 0 (may be the type otherwise)
>          ExtensionElement
>            ExtensionElementPath: 0*(\0x18538067\0x207273)
>            ExtensionElementHasChildren: 0
>          ExtensionElement
>            ExtensionElementPath: 0*1(\0x18538067\0x1654AE6B\0xAE\0x207274)
>            ExtensionElementHasChildren: 0
>
>
> We could add extra elements like a URL to know more about that
> extension, a human readable name for each element, etc. The URL may
> also replace the ExtensionName.
>
> The separators between IDs may be kept as we need a way to wrap things
> in this way: 1*(\Segment\Chapters\EditionEntry(1*(\ChapterAtom)))
>
> Opinions ?

While I like the idea of defining some "schema" in the EBML header, I am
wondering what the actual use is: for a conformance checker, it helps
to treat unknown elements as "expected", but is that useful? Shouldn't
a conformance checker just skip unknown elements, on the assumption that
they were defined after the conformance checker was built?

Additionally, I understand you are talking about third-party extensions, but
don't we have the same issue with official new Matroska elements? When
we freeze Matroska v4, how should new elements be treated if the
DocTypeVersion is still 4? In other words, does a DocTypeVersion of 4
mandate a precise list of elements, or is it "open"? IMO, having thought
more and more about it, it should be "open", and a conformance checker
should just ignore unknown elements or list them in an information area.
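
A minimal sketch of that "open" behaviour, assuming a checker that already
has a flat list of element IDs read from a file. The table of known IDs is
a tiny illustrative subset, and check_ids is a hypothetical helper, not
part of any existing tool:

```python
# Illustrative subset of known Matroska element IDs (taken from the paths
# quoted above); a real checker would load the full schema.
KNOWN_IDS = {
    0x18538067: "Segment",
    0x1F43B675: "Cluster",
    0xA0: "BlockGroup",
}

def check_ids(element_ids):
    """Split IDs into recognised element names and informational notes.

    Unknown IDs are never reported as errors: they are only listed in an
    information area, on the assumption that they were defined later.
    """
    recognised, info = [], []
    for eid in element_ids:
        if eid in KNOWN_IDS:
            recognised.append(KNOWN_IDS[eid])
        else:
            info.append(f"unknown element 0x{eid:X} ignored")
    return recognised, info

recognised, info = check_ids([0x18538067, 0x1F43B675, 0x207262])
print(recognised)
print(info)
```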

For third-party extensions, I see that the proposal does not resolve the
issue of collisions. Collisions may be rare with long IDs, but people may
want to use shorter ones, and then a collision would not be so rare. It
also does not indicate which IDs to use: if we (the specification writers)
don't know which element IDs third parties have used, how can we avoid
assigning an element ID that someone is already using without having told
us?

Worse, let's imagine this at the EBML level: we want EBML to be used in
more places, but what would happen if spec author X defines an element
with ID Y in their spec (nothing forbids them from doing so), and the EBML
authors later decide to add a global element with the same ID Y because
they were not aware of author X and that spec?
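
To make the collision scenario concrete, here is a small sketch that
compares two ID registries. Both registries and all element names in them
are hypothetical, chosen only for illustration; nothing here reflects real
assignments:

```python
def find_collisions(registry_a, registry_b):
    """Return IDs present in both registries with different meanings."""
    shared = registry_a.keys() & registry_b.keys()
    return {
        eid: (registry_a[eid], registry_b[eid])
        for eid in shared
        if registry_a[eid] != registry_b[eid]
    }

# Hypothetical: author X picked an ID privately...
author_x_spec = {0x207262: "AuthorXElement"}
# ...and the EBML authors, unaware of it, later assign the same ID.
ebml_globals = {0xEC: "Void", 0x207262: "NewGlobalElement"}

for eid, (a, b) in find_collisions(author_x_spec, ebml_globals).items():
    print(f"0x{eid:X} is both {a!r} and {b!r}")
```

Such a check only works when both registries are visible to the same
party, which is exactly what is missing today.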

So I think we have a more global issue: we have no reserved values
anywhere (neither in EBML nor in Matroska), and it may be important to
treat that as an issue to resolve before flagging EBML as "final version
1". Checking e.g. 1-byte elements, I see that we already have 78 (76
Matroska + 2 global EBML) 1-byte elements defined, out of 128
possibilities. Not many remain for future EBML global 1-byte
elements...
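
The arithmetic behind those numbers can be sketched as follows. The 76 and
2 are the counts stated above and are not re-verified here; the total is
the raw number of bit patterns (7 free bits per octet), and the EBML
specification additionally disallows a few patterns (for example all data
bits set to 1), so the usable count is slightly lower:

```python
# Counts as stated in this message (April 2018).
DEFINED_ONE_BYTE = {"Matroska": 76, "EBML global": 2}

def id_values_for_length(length):
    """Raw bit patterns for an Element ID of `length` octets.

    Each octet contributes 7 data bits once the length marker is
    accounted for, so a 1-octet ID has 2**7 patterns (0x80..0xFF).
    """
    return 2 ** (7 * length)

used = sum(DEFINED_ONE_BYTE.values())
total = id_values_for_length(1)
remaining = total - used
print(f"{used} of {total} 1-byte IDs used, {remaining} left")
```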

I suggest that we first reserve (for all classes?) some EBML IDs for
future global elements, before we can no longer add any new global element
because Matroska has used all of them.

Independently (this is subject to debate, as some people don't want
private elements), we could reserve a range of element IDs for private
content.
Even though I talked about "uuid" stuff, I don't like that idea so much
because it increases the size (16 bytes per UUID, in addition to the EBML
ID size). If I build on your EBMLExtensions idea, I suggest that we:
- Reserve some global elements (3-byte minimum?) for private content
- In the EBML header, map each such global element to something well
defined (I don't like UUIDs because they are cryptic when someone comes
across a new value; inverted DNS or something similar to a tag name may be
better)
The idea is based on the MXF "Primer": all 2-byte element IDs are "user
defined" and the "Primer" part of the MXF file maps them to their
16-byte "Universal Label", so the private content uses only 2 bytes per
element ID.
Based on your proposal, it would mean something like:
- We reserve 0x234500 to 0x2345FF (arbitrary values) in EBML for private
content.
- In the EBML header:
EBML
   EBMLExtensions
       ExtensionElement
         ExtensionValue: 0x234562
         ExtensionMeaning: "RAWcooked/RawcookedBlockGroup"
       ExtensionElement
         ExtensionValue: 0x234572
         ExtensionMeaning: "RAWcooked/RawcookedTrackEntry"
       ExtensionElement
         ExtensionValue: 0x234573
         ExtensionMeaning: "RAWcooked/RawcookedSegment"
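
A reader could resolve such private IDs roughly as sketched below. This
assumes the ExtensionElement entries have already been parsed into
(value, meaning) pairs; build_primer and resolve are hypothetical helper
names, and the reserved range is the arbitrary one used above:

```python
# Arbitrary reserved range from the example above: 0x234500..0x2345FF.
PRIVATE_RANGE = range(0x234500, 0x234600)

def build_primer(extension_elements):
    """Build the ID -> meaning map declared in the EBML header."""
    primer = {}
    for value, meaning in extension_elements:
        if value not in PRIVATE_RANGE:
            raise ValueError(f"0x{value:X} is outside the private range")
        if value in primer:
            raise ValueError(f"0x{value:X} is mapped twice")
        primer[value] = meaning
    return primer

def resolve(element_id, primer):
    """Return the declared meaning of a private ID, or None.

    None is returned both for IDs outside the private range and for
    private IDs that the header did not declare.
    """
    if element_id in PRIVATE_RANGE:
        return primer.get(element_id)
    return None

primer = build_primer([
    (0x234562, "RAWcooked/RawcookedBlockGroup"),
    (0x234572, "RAWcooked/RawcookedTrackEntry"),
    (0x234573, "RAWcooked/RawcookedSegment"),
])
print(resolve(0x234572, primer))
```

Because the mapping is per file, two files may use the same private ID for
different meanings without colliding.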

This is just an idea for a potential solution to the issues I see with
your proposal and to issues with EBML in general; I have no strong opinion
about how to handle them. Does anyone have other ideas?