Re: [Cbor] Record proposal

Carsten Bormann <cabo@tzi.org> Thu, 29 July 2021 15:05 UTC

Content-Type: text/plain; charset="utf-8"
Mime-Version: 1.0 (Mac OS X Mail 13.4 \(3608.120.23.2.7\))
From: Carsten Bormann <cabo@tzi.org>
In-Reply-To: <CAEs2a6stR_0=eGP0Vx6z_9x7mf1pWJpRE8toTaSSmdWOQ+zsFA@mail.gmail.com>
Date: Thu, 29 Jul 2021 17:04:50 +0200
Cc: Christian Amsüss <christian@amsuess.com>, "cbor@ietf.org" <cbor@ietf.org>
X-Mao-Original-Outgoing-Id: 649263890.476133-67f4da9d71af706a845f43a351f59e68
Content-Transfer-Encoding: quoted-printable
Message-Id: <C8000846-EEB7-4E2A-B909-298E479CFD62@tzi.org>
References: <8421F43D-E9ED-444F-A915-415F3AE59FA0@tzi.org> <YJJ+oJZ5YF/c14sv@hephaistos.amsuess.com> <41C02CBE-E7EC-4E61-889B-779EE561C632@tzi.org> <31190CB3-2EE9-4B92-BBC6-C29F71A11162@hxcore.ol> <CAEs2a6stR_0=eGP0Vx6z_9x7mf1pWJpRE8toTaSSmdWOQ+zsFA@mail.gmail.com>
To: Kris Zyp <kriszyp@gmail.com>
X-Mailer: Apple Mail (2.3608.120.23.2.7)
Archived-At: <https://mailarchive.ietf.org/arch/msg/cbor/eKWkBV9Kt0dBcnUlzL_1kXO-8Zc>
Subject: Re: [Cbor] Record proposal

Hi Kris,

timely question — the CBOR meeting at IETF111 is tomorrow.

My plan was to look at a corpus of SenML (RFC 8428) data to see whether your proposal would turn out to be useful there.  SenML is a spec that allows building “packs” of records, which in principle can be all different, but in reality fall into a small number of structural categories (e.g., you might have water flow, water temperature, and water pressure coming in from separate sensors, so you cannot readily combine them, but each sensor stream has records that always look the same).

Of course, time ran out before this meeting, so I didn’t manage to do that yet.

The SenML example differs from the CSV (RFC 4180/7111) example that immediately comes to mind: SenML records fall into different groups, while in a CSV table all entries have the same structure (usually indicated in the first line of CSV’s text representation).
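
To make the contrast concrete, here is a minimal Python sketch that groups records by their key structure. The sample records and the `structure_of` and `group_by_structure` helpers are illustrative assumptions; they are not taken from RFC 8428 or from any real corpus.

```python
from collections import defaultdict


def structure_of(record):
    """Return a record's structure: its keys, in order (hypothetical helper)."""
    return tuple(record.keys())


def group_by_structure(records):
    """Map each distinct key tuple to the list of records that share it."""
    groups = defaultdict(list)
    for record in records:
        groups[structure_of(record)].append(record)
    return dict(groups)


# Made-up SenML-like pack: three sensor streams interleaved in one array.
senml_like = [
    {"n": "flow", "u": "l/s", "v": 1.2},
    {"n": "temp", "u": "Cel", "v": 18.5, "t": 1},
    {"n": "flow", "u": "l/s", "v": 1.3},
    {"n": "pressure", "v": 2.1},
    {"n": "temp", "u": "Cel", "v": 18.6, "t": 2},
]

# Made-up CSV-like table: every row has the same columns.
csv_like = [{"name": "one", "value": 1}, {"name": "two", "value": 2}]

senml_groups = group_by_structure(senml_like)
csv_groups = group_by_structure(csv_like)
```

In this toy data the SenML-like pack yields several structural groups, while the CSV-like table collapses into a single one.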

I would be most happy if we could come up with something that fits SenML and CSV applications, but doesn’t create some of the processing model problems that might come up.  Having to run linear processing means that one cannot extract subsequences from an array, which is a bit like the effect SenML base fields have — most SenML packs have base fields only in the first record.  A hierarchical model (tag applies to its content) would mean at least subtrees are extractable, but there is going to be some redundancy between multiple record arrays — redundancy that cbor-packed could pack away.
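
A rough Python sketch of the two processing models may help here. The first function resolves a SenML-style base name ("bn") linearly, so a subsequence taken without the first record loses its context. The second expands self-contained groups in which the structure travels with its content. The pack contents, the `urn:example:` base name, and the (keys, rows) group layout are assumptions made up for illustration, not a proposed encoding.

```python
def resolve(pack):
    """Linear model: a base name ("bn") applies to all following records."""
    base = ""
    resolved = []
    for record in pack:
        base = record.get("bn", base)  # a new base name replaces the old one
        resolved.append({"n": base + record.get("n", ""), "v": record["v"]})
    return resolved


pack = [
    {"bn": "urn:example:pump1:", "n": "flow", "v": 1.2},
    {"n": "temp", "v": 18.5},
    {"n": "pressure", "v": 2.1},
]

full = resolve(pack)
tail = resolve(pack[1:])  # subsequence extracted without the first record


def expand_group(group):
    """Hierarchical model: each (keys, rows) group carries its own structure."""
    keys, rows = group
    return [dict(zip(keys, row)) for row in rows]


groups = [
    (["n", "v"], [["flow", 1.2], ["flow", 1.3]]),
    (["n", "v"], [["temp", 18.5]]),  # keys repeated here: the redundancy
]

second = expand_group(groups[1])  # works without looking at groups[0]
```

The extracted `tail` resolves to bare names because the base name lived in the record that was cut off, whereas `groups[1]` can be expanded on its own at the cost of repeating the key list.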

I’m not sure we will have much time for discussion tomorrow (1 hour meeting and 10 documents to look at :-), but we should continue on the list and create a proposal for one of the next CBOR interims (I’m actually going to be on vacation on 2021-08-11, but the one on 2021-08-25 would suit me).

Grüße, Carsten


> On 2021-07-29, at 02:39, Kris Zyp <kriszyp@gmail.com> wrote:
> 
> I just wanted to check on the status of this, and if this registry
> entry is still being considered?
> Thank you,
> Kris
> 
> On Wed, May 5, 2021 at 9:11 AM <kriszyp@gmail.com> wrote:
>> 
>> To try to clarify a little bit about my proposal: my intent with the record proposal was not primarily to minimize the size of the encodings. I think that is a nice auxiliary benefit, but it is not the primary purpose of the proposed tag. The primary purpose is to assign semantics indicating that a sequence of values should be understood as a structured record. I think this is complementary to tag 259, which indicates the opposite: a set of name-value pairs to be interpreted as a map structure. And I think this type of semantic application to a primitive data structure is the intent of tags, if I understand correctly.
>> 
>> And the benefit of being able to use this type of tag/semantics is that encoders can more explicitly align encodings with language structures and classes. This can have a much more significant impact on performance than on space. Serializers can “stream” the serialization of arrays or other sequences, which may be of indefinite length, without a priori knowledge of homogeneity, and still reuse structural information for any objects/elements that have the same structure, in a way that is easily optimized and can scale to arrays and other structures that may not be scannable ahead of time. For example, an encoder can keep a simple cache of class types that it has serialized, recognize that it has already serialized the record structure in a previous entity, and reference that rather than reserializing the structure. This type of class/type cache can be significantly faster than doing cache lookups for every primitive value. Likewise, on deserialization, this provides an opportunity for a decoder to use a record reference to find and allocate/initialize exactly the correct class or structure, and then decode values directly into that structure.
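
The structure cache described above can be sketched roughly as follows. This is a minimal Python illustration, not the wire format of the proposal: the `("define", ...)` and `("ref", ...)` tuples, the `RecordEncoder` and `RecordDecoder` classes, and the `encode_stream` helper are all hypothetical stand-ins for whatever tags an actual encoding would use.

```python
class RecordEncoder:
    """Emits a structure once, then refers to it by a small integer id."""

    def __init__(self):
        self._ids = {}  # key tuple -> structure id

    def encode(self, obj):
        keys = tuple(obj)
        values = list(obj.values())
        if keys in self._ids:
            # Structure already sent: only a reference plus the values.
            return ("ref", self._ids[keys], values)
        sid = len(self._ids)
        self._ids[keys] = sid
        return ("define", sid, list(keys), values)


class RecordDecoder:
    """Rebuilds objects from the define/ref items, in stream order."""

    def __init__(self):
        self._structures = {}  # structure id -> list of keys

    def decode(self, item):
        if item[0] == "define":
            _, sid, keys, values = item
            self._structures[sid] = keys
        else:
            _, sid, values = item
            keys = self._structures[sid]
        return dict(zip(keys, values))


def encode_stream(objects):
    """Encode lazily, one object at a time, without scanning ahead."""
    encoder = RecordEncoder()
    for obj in objects:
        yield encoder.encode(obj)


source = [
    {"name": "one", "value": 1},
    {"name": "two", "value": 2},
    {"id": 7},
    {"name": "three", "value": 3},
]
items = list(encode_stream(iter(source)))

decoder = RecordDecoder()
decoded = [decoder.decode(item) for item in items]
```

Note that the encoder performs one dictionary lookup per object rather than one per value, and that the input does not need to be homogeneous or of known length.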
>> 
>> My intent in suggesting this as a tag proposal is that it can hopefully be a rather unobtrusive addition to the tag registry, keeping aligned with the intent and purpose of tag semantics, without trying to alter the direction of CBOR itself at all.
>> 
>> 
>> 
>> Thanks,
>> 
>> Kris
>> 
>> From: Carsten Bormann
>> Sent: May 5, 2021 6:50 AM
>> To: Christian Amsüss
>> Cc: cbor@ietf.org; Kris Zyp
>> Subject: Re: [Cbor] Record proposal
>> 
>> 
>> 
>> On 2021-05-05, at 13:16, Christian Amsüss <christian@amsuess.com> wrote:
>> 
>>> (And as much as I dislike being "the person to whom everything looks
>>> like a nail", I'll probably ask about whether this fits in the general
>>> model of packed CBOR, with the first entity setting up a single table
>>> entry, and then the entries expanding a [] to a {}).
>> 
>> 
>> 
>> There are two potential aspects to the proposed tag:
>> 
>> * a more compact representation (which is all that cbor-packed is about)
>> * semantic indication that a specific kind of record is being used
>> 
>> 
>> 
>> Proposed Tag 105 currently does not have a place for further semantic indications, but one could be added.
>> 
>> 
>> 
>> By the way, cbor-packed turns the example I gave in the referenced email into
>> 
>> 
>> 
>> 51([["value", "name"], [], [],
>>   [{simple(1): "one", simple(0): 1},
>>    {simple(1): "two", simple(0): 2}, {simple(1): "three", simple(0): 3}]])
>> 
>> 
>> 
>> Encoding-wise, the last array looks like this:
>> 
>> 
>> 
>>      83                  # array(3)
>>         a2               # map(2)
>>            e1            # primitive(1)
>>            63            # text(3)
>>               6f6e65     # "one"
>>            e0            # primitive(0)
>>            01            # unsigned(1)
>>         a2               # map(2)
>>            e1            # primitive(1)
>>            63            # text(3)
>>               74776f     # "two"
>>            e0            # primitive(0)
>>            02            # unsigned(2)
>>         a2               # map(2)
>>            e1            # primitive(1)
>>            65            # text(5)
>>               7468726565 # "three"
>>            e0            # primitive(0)
>>            03            # unsigned(3)
>> 
>> 
>> 
>> So the overhead here is one map head and two simple values per row.
>> 
>> (Of course, that assumes that one-byte simple values are still available in the greater context this is in.)
>> 
>> 
>> 
>> Even with a form of circumfix compression (e.g., mapping tables with parameters [1]), this is hard to beat encoding-wise.
>> 
>> The record proposal as is takes four bytes per row (1+2 tag, 1 array).
>> 
>> This can be optimized significantly further only by amortizing the tag over more than one row, as my “CSV style” does, but that requires homogeneity.
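
The byte counts above can be checked with a small hand-rolled Python sketch. It encodes only the last array of the packed example, using a deliberately tiny CBOR head encoder that handles arguments below 24 (all this example needs). The helper names are made up for illustration, and the four-byte figure for the record proposal is simply the breakdown stated in this message, not something derived independently.

```python
def head(major, n):
    """Encode a one-byte CBOR head; only arguments below 24 are supported."""
    assert 0 <= n < 24
    return bytes([(major << 5) | n])


def text(s):
    """Encode a short text string (major type 3)."""
    data = s.encode("utf-8")
    return head(3, len(data)) + data


def bare_row(name, value):
    """Just the payload: the name and the value, with no framing."""
    return text(name) + head(0, value)


def packed_row(name, value):
    """map(2) {simple(1): name, simple(0): value}, as in the dump above."""
    return head(5, 2) + head(7, 1) + text(name) + head(7, 0) + head(0, value)


rows = [("one", 1), ("two", 2), ("three", 3)]

# array(3) followed by the three packed maps.
encoded = head(4, len(rows)) + b"".join(packed_row(n, v) for n, v in rows)

# Per-row overhead of the packed form: one map head plus two simple values.
packed_overhead = [len(packed_row(n, v)) - len(bare_row(n, v)) for n, v in rows]

# Per-row overhead of the record proposal as stated in the message:
# 1+2 bytes for the tag, 1 byte for the array head.
record_overhead = 1 + 2 + 1
```

Comparing `encoded.hex()` against the dump, and `packed_overhead` against `record_overhead`, reproduces the one-byte-per-row difference discussed here.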
>> 
>> 
>> 
>> Grüße, Carsten
>> 
>> 
>> 
>> [1]: https://datatracker.ietf.org/doc/draft-bormann-lpwan-cbor-template/
>> 
>> 
>> 
>> 
> 