Re: [Cbor] Maps: Need for additional compactness?

Emile Cormier <emile.cormier.jr@gmail.com> Wed, 16 June 2021 00:15 UTC

Return-Path: <emile.cormier.jr@gmail.com>
X-Original-To: cbor@ietfa.amsl.com
Delivered-To: cbor@ietfa.amsl.com
Received: from localhost (localhost [127.0.0.1]) by ietfa.amsl.com (Postfix) with ESMTP id AC99F3A4369 for <cbor@ietfa.amsl.com>; Tue, 15 Jun 2021 17:15:23 -0700 (PDT)
X-Virus-Scanned: amavisd-new at amsl.com
X-Spam-Flag: NO
X-Spam-Score: -2.096
X-Spam-Level:
X-Spam-Status: No, score=-2.096 tagged_above=-999 required=5 tests=[BAYES_00=-1.9, DKIM_SIGNED=0.1, DKIM_VALID=-0.1, DKIM_VALID_AU=-0.1, DKIM_VALID_EF=-0.1, FREEMAIL_FROM=0.001, HTML_MESSAGE=0.001, RCVD_IN_DNSWL_BLOCKED=0.001, SPF_HELO_NONE=0.001, SPF_PASS=-0.001, URIBL_BLOCKED=0.001] autolearn=ham autolearn_force=no
Authentication-Results: ietfa.amsl.com (amavisd-new); dkim=pass (2048-bit key) header.d=gmail.com
Received: from mail.ietf.org ([4.31.198.44]) by localhost (ietfa.amsl.com [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id tRVbTSd4F34O for <cbor@ietfa.amsl.com>; Tue, 15 Jun 2021 17:15:18 -0700 (PDT)
Received: from mail-yb1-xb34.google.com (mail-yb1-xb34.google.com [IPv6:2607:f8b0:4864:20::b34]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by ietfa.amsl.com (Postfix) with ESMTPS id 9C82A3A4366 for <cbor@ietf.org>; Tue, 15 Jun 2021 17:15:18 -0700 (PDT)
Received: by mail-yb1-xb34.google.com with SMTP id g142so375734ybf.9 for <cbor@ietf.org>; Tue, 15 Jun 2021 17:15:18 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=mime-version:references:in-reply-to:from:date:message-id:subject:to :cc; bh=AgPREowy89WD8tanf1+ixiTEgLa4lku6Gvvn17t+nOE=; b=p3mZNaVtR5peSqOn9/Le1A9z74CVAnWHWtmGyA5QyJYvTi9sbYCrLuY0nMQLUtAwzI Db0JKWoHXtOzKNblTAiAMSJjD0w42wkPW4WFYKaWxm70z+8mBDylWBp9r57D+CAsGdz4 TxC30c/f02oIcNFHOOg6hIVqifkFsmqUUbuqH/065BujEak6CAQApV7PE4HqSh8zwBnu PQdRTTaqPlo3G0totIyzJEJ6GUmwbEGZ+lRBBLMfIDpWY62TfMwR9jspdP+mZa1/97OG rCSdXu4E8MYjCR0e799zgSM3T4dAqmxoX9VDw5gmffNVLJndY2UvQAhiX8KEYv15Emkp bMpg==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:mime-version:references:in-reply-to:from:date :message-id:subject:to:cc; bh=AgPREowy89WD8tanf1+ixiTEgLa4lku6Gvvn17t+nOE=; b=oyoxjTOczLsDHLK46s5vuL3ALPJlfr/qIAJ2JuUUWBSHiL3Atebv74TfpqXXK9RBcx RR5rtJFO8eXPNcjh+xROEcCpUBWVfcRWgpPp1q6hZEORauCf47Yhv0NaRGnVC5waM56u OrwE24Zz3UbGTUvmRFZrcjzwAlHR1b1B1j093fybA0KP8UCRacpfnnsM/646S0rQP3SE o3+CBJy+WbbaQlANQsGP7WRQ+hGnAkzqODL6WS9ppV9Q5YF+Pepop+bbPdJtJ1ePJ053 mg4DOJvSvytvdxI/BrHgpX1N7iu1LdgokZRS+eYks6EFekNNTU4C61YchsYgWIcWPXFF g6IQ==
X-Gm-Message-State: AOAM531gskziWov1wUcpyCrN6RJGiUPQKudUc64WwxZMDw+oOAvA3xt6 /180r7BzJT6s8EATu3wu9CGZgO9+HOhbVToPvsI=
X-Google-Smtp-Source: ABdhPJwczEqJeZH09z+QubDK/g3fLYzU7HvvIkDl/cqnRRXNvpud8P5uXiycWuXB3fOIlVxGGAawfH46y7ggtX7wq60=
X-Received: by 2002:a25:c245:: with SMTP id s66mr2350624ybf.507.1623802517126; Tue, 15 Jun 2021 17:15:17 -0700 (PDT)
MIME-Version: 1.0
References: <YMDmJcrm9OwrKDk9@hephaistos.amsuess.com> <CAM70yxAPW4xFqP5tXx=s5HCm-m6tYyi+XsLTxkk58o=P6NRRCw@mail.gmail.com> <53BC1116-A603-49E7-A3A5-10113B494A63@tzi.org> <CAM70yxBthNVKYfNxq=MmLpKSHc_rO3hvWuGr5sfHeMPnki7_1Q@mail.gmail.com>
In-Reply-To: <CAM70yxBthNVKYfNxq=MmLpKSHc_rO3hvWuGr5sfHeMPnki7_1Q@mail.gmail.com>
From: Emile Cormier <emile.cormier.jr@gmail.com>
Date: Tue, 15 Jun 2021 21:15:06 -0300
Message-ID: <CAM70yxCP9s4T8=+qCQLU_Lv4+jMMooh+GQ4xvU2NYN48MnYT_g@mail.gmail.com>
To: Carsten Bormann <cabo@tzi.org>
Cc: =?UTF-8?Q?Christian_Ams=C3=BCss?= <christian@amsuess.com>, Kio Smallwood <kio@mothers-arms.co.uk>, cbor@ietf.org
Content-Type: multipart/alternative; boundary="0000000000003800e605c4d6fcdd"
Archived-At: <https://mailarchive.ietf.org/arch/msg/cbor/xNmo0FM8jIRtG5__NztJAXjUlOQ>
Subject: Re: [Cbor] Maps: Need for additional compactness?
X-BeenThere: cbor@ietf.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: "Concise Binary Object Representation \(CBOR\)" <cbor.ietf.org>
List-Unsubscribe: <https://www.ietf.org/mailman/options/cbor>, <mailto:cbor-request@ietf.org?subject=unsubscribe>
List-Archive: <https://mailarchive.ietf.org/arch/browse/cbor/>
List-Post: <mailto:cbor@ietf.org>
List-Help: <mailto:cbor-request@ietf.org?subject=help>
List-Subscribe: <https://www.ietf.org/mailman/listinfo/cbor>, <mailto:cbor-request@ietf.org?subject=subscribe>
X-List-Received-Date: Wed, 16 Jun 2021 00:15:24 -0000

A quick search reveals that the MList encoding seems to be favored by
various JSON libraries when serializing multimaps. This allows the use of
JSON objects when string keys are used:

{"aaa", [1, 2], "bbb", [3]}

where 1 and 2 are distinct values under the duplicate "aaa" JSON key.

If CBOR were to adopt the same MList convention, then transcoding from CBOR
multimaps to JSON would be trivial. The overhead of one byte per multimap
key seems like an acceptable price to pay for better simplicity and
compatibility with JSON.

In the case of non-string keys I would personally prefer transcoding a CBOR
multimap to a JSON array of pairs:

[[100, [101, 102], [200, [201]]

where 101 and 102 are distinct values under the duplicate 100 key.

Before deciding anything, perhaps it would be a good idea to survey how
existing JSON libraries deal with maps that are not traditional JS objects
with unique string keys. This would minimize the "impedance mismatch"
between CBOR and JSON.

Cheers,
Emile Cormier

On Tue, Jun 15, 2021 at 6:00 PM Emile Cormier <emile.cormier.jr@gmail.com>
wrote:

> Hi Carsten,
>
> Sorry for the late reply. Your message was flagged as spam for some reason
> and I just found it recently.
>
> Thanks for clarifying the OEMLists; they make sense to me now. My example
> with the superfluous brackets was taken directly from your slides, so you
> may want to consider fixing those slides if you plan on reusing them.
>
> I'm not sure which would be easiest to decode between Dittolists and
> OEMLists. With Dittolists, a SAX-like parser either has to "remember" the
> last key (which is not trivial if the key is of arbitrary type), or it has
> to look ahead after the value following the key to see if there's a Ditto
> special value.
>
> What vexes me for both the Dittolists and OEMLists is how I'm supposed to
> transcode them into JSON in a multi-encoding environment, such as a WAMP
> router (https://wamp-proto.org/). Consider the case where a JSON peer
> would expect a multimap to be in the [[“aaa”, 1], [“aaa”, 2], [“bbb”, 3]]
> form with repeated keys. My transcoder would therefore have to memorize
> each inbound CBOR multimap key in case it needs to be repeated in the
> output JSON array. I would have to store the keys in some sort of DOM
> tree-like structure, because keys can be arrays or maps. This would ruin
> the simplicity of my current CBOR-to-JSON transcoder.
>
> On the other hand, if I assume that a JSON peer understands OEMLists, then
> I can simply transcode each sub-entity of a CBOR OEMList as-is without
> having to memorize anything.
>
> I'll do some research on how multimaps are expected to be serialized in
> the Javascript world. I'm guessing that there's really no consensus.
>
> Cheers,
> Emile Cormier
>
> On Sat, Jun 12, 2021 at 12:25 PM Carsten Bormann <cabo@tzi.org> wrote:
>
>> Hi Emile,
>>
>> > The MList encoding becomes a pessimization if there are few duplicate
>> keys and a large number of key-value pairs. It might not be so bad to
>> assume that keys tend to be generally short - they are often things like
>> database IDs.
>> >
>> > The OMList encoding not allowing array values seems like a showstopper
>> to me.
>>
>> Yes, both MList and OMList are showing intermediate steps; I think only
>> OEMList would be worth defining tags for.
>>
>> > For OEMList, I'm not clear on this part:
>> >
>> > So [[“aaa”, 1], [“aaa”, 2], [“bbb”, 3], [“ccc”, [4]]] becomes [“aaa”,
>> [1, 2], “bbb”, 3, [“ccc”, [[4]]]]
>>
>> There are some spurious brackets here.  I think this should be:
>>
>> [“aaa”, [1, 2], “bbb”, 3, “ccc”, [[4]]]
>>
>> Basically the idea is:
>>
>> Packing:
>> — For keys with only one value:
>>   — if that value is not an array, use the value
>>   — if that value *is* an array, enclose it into an array (what you call
>> “singleton”).
>> — for keys with more than one value:
>>   — enclose all the values into a single array
>> Unpacking:
>> — for values that are not an array: use that as the single value for the
>> key.
>> — for values that are an array: use each element of that array as values
>> for the key.
>>
>> > Is the resulting singleton array [“ccc”, [[4]]] used as an indicator so
>> that the decoder interprets [[4]] as the array [4] to be associated with
>> the key "ccc”?
>>
>> No.  Any outer array is always removed when unpacking, regardless of how
>> many elements are in there.
>>
>> > If so, how is the decoder supposed to distinguish this from an array
>> key in heterogeneous maps? This problem is more obvious if you change the
>> example to this:
>> >
>> > [[“aaa”, 1], [“aaa”, 2], [“bbb”, [3]], [“ccc”, 4]] becomes [“aaa”, [1,
>> 2], [“bbb”, [[3]]], “ccc”, 4]
>> >
>> > and a decoder interprets this as:
>> > Key,     Value
>> > "aaa",    1
>> > "aaa",    2
>> > [“bbb”, [[3]]],    "ccc"
>> > 4,    <missing value>
>>
>> Yes, doesn’t work with those superfluous brackets.  See above.
>>
>> > It seems to me the only way to avoid the possible pessimization of
>> MList, while also avoiding ambiguities with keys/values that can themselves
>> be arrays/maps, is to introduce a new CBOR special value (or "abuse" CBOR
>> break) to indicate the "shared key" special case. This has the disadvantage
>> that it can't be directly transcoded to JSON.
>>
>> I think OEMLists work well as described above.
>>
>> > 1. For something like OEMList, when encoding a multimap, an encoder
>> needs to look ahead to see if there are multiple values associated with the
>> same key.
>>
>> Yes.  I haven’t talked about Dittolists yet.
>>
>> [[“aaa”, 1], [“aaa”, 2], [“bbb”, 3], [“ccc”, [4]]]
>> becomes
>> [“aaa”, 1, *ditto*, 2, “bbb”, 3, “ccc”, [4]]
>>
>> *ditto* might be a simple value we allocate specifically for this (one
>> byte), e.g., simple(16).
>>
>> The gain of a dittolist then is the (n-1)*(sizeof(key)-1) for a key used
>> n times.
>> The gain of an OEMList is (n-1)*sizeof(key)-1, but then there is a loss
>> of one byte for each array value that comes with a key that is used only
>> once.
>> Which of these is better depends on the use case, but we are talking
>> about adding or removing single bytes for certain cases.
>>
>> Dittolists are definitely easier to stream, and I also think easier to
>> implement (no special cases for array values).
>>
>> In summary, to me it seems we might want to add OEMLists, Dittolists,
>> both, or none of them.
>>
>> Grüße, Carsten
>>
>>