Re: [Cbor] Maps: Need for additional compactness?

Emile Cormier <emile.cormier.jr@gmail.com> Tue, 15 June 2021 21:00 UTC

Return-Path: <emile.cormier.jr@gmail.com>
X-Original-To: cbor@ietfa.amsl.com
Delivered-To: cbor@ietfa.amsl.com
Received: from localhost (localhost [127.0.0.1]) by ietfa.amsl.com (Postfix) with ESMTP id 07C4B3A3DB9 for <cbor@ietfa.amsl.com>; Tue, 15 Jun 2021 14:00:54 -0700 (PDT)
X-Virus-Scanned: amavisd-new at amsl.com
X-Spam-Flag: NO
X-Spam-Score: -2.096
X-Spam-Level:
X-Spam-Status: No, score=-2.096 tagged_above=-999 required=5 tests=[BAYES_00=-1.9, DKIM_SIGNED=0.1, DKIM_VALID=-0.1, DKIM_VALID_AU=-0.1, DKIM_VALID_EF=-0.1, FREEMAIL_FROM=0.001, HTML_MESSAGE=0.001, RCVD_IN_DNSWL_BLOCKED=0.001, SPF_HELO_NONE=0.001, SPF_PASS=-0.001, URIBL_BLOCKED=0.001] autolearn=ham autolearn_force=no
Authentication-Results: ietfa.amsl.com (amavisd-new); dkim=pass (2048-bit key) header.d=gmail.com
Received: from mail.ietf.org ([4.31.198.44]) by localhost (ietfa.amsl.com [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id 88ihect780Pc for <cbor@ietfa.amsl.com>; Tue, 15 Jun 2021 14:00:49 -0700 (PDT)
Received: from mail-yb1-xb2f.google.com (mail-yb1-xb2f.google.com [IPv6:2607:f8b0:4864:20::b2f]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by ietfa.amsl.com (Postfix) with ESMTPS id 1E1BE3A3E41 for <cbor@ietf.org>; Tue, 15 Jun 2021 14:00:38 -0700 (PDT)
Received: by mail-yb1-xb2f.google.com with SMTP id e10so22486713ybb.7 for <cbor@ietf.org>; Tue, 15 Jun 2021 14:00:38 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=mime-version:references:in-reply-to:from:date:message-id:subject:to :cc; bh=ujk/aTm8R7Wv7bNJA7SYRhN4pvaVHDjX6rSAfIZuyKo=; b=n5/6z/m0bsbyZu07DN2URq/rwquXRj1czRGvSe6RaKzEDbDrwBCRSDlY7JpPjKKSCO HONcTMM/U+j5gXxqFwciov93kC6h3H2a6iXQJ27HkVBJj9Hoj8xbfh3qWsn5MQOyGGYP 3/Gx6/CURN3Bd51ymMJ+H1kn7rVKjFjRbiVUv1tKiW8lA6IxVcYwtDJUWHyekVZ3jaXi yzbtqnWmG44vmb7WEOyhini4ulUDCazOOwJIP1a3Y/ped0Y48Rx1mJhMORRHyvQfgBzN Jo9UtVw841Z8rA3GVS+VZt0lYFGxrpdorkCjGLJCrYy09alxtzHzcMzTsm+0Ha5dcdmE os5A==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:mime-version:references:in-reply-to:from:date :message-id:subject:to:cc; bh=ujk/aTm8R7Wv7bNJA7SYRhN4pvaVHDjX6rSAfIZuyKo=; b=BD3FfQCjwKsWIbU7PX8zWEsa9VYoPjTISemHiETn+Qd5N5rapeQ2rjhyUFldcyXunt Dz8MvxEXXeOohzemQIH1z+YZ+BFuSDBytPklkfx2CTcnZMqBMeBAxW9HpWUca1zIejx0 93ny9+achzIDE82RMsdu2uXbw5HemuXtmQfnlIVOeF9R2GieT7FmC5vqCVzk8qoyc/Bz np5wJTAk5EIx6b4ociNJtELDHyWh2B3Zm0iNUDlaMAOz5KecIRQ6CblUNEzS1bDmCpA+ Nq/Jt2roq0VuVLYyB1NhGH78tQ4jS6dZmVCyQRWwAXj23Bt1VMC1jpE9nOo6BAvwkQ6e GpCQ==
X-Gm-Message-State: AOAM533TVV8Hz3gyj/x6R5VZm3f3cUGi51xZqDjLnjLvZpi5XgdSBxKl VCkmWUR7q0gLZlnB3SlkOPmvsKerz28P4sbDJW/NZaozp482uQ==
X-Google-Smtp-Source: ABdhPJxGrnrQwY2ravBIS7ZXJJvv3dROjDKNEWKRWii5ZEBWvTmNcYxBPK+2pHcNeyORqeRxZAZMk3Tdw/azgOCDV5w=
X-Received: by 2002:a25:f812:: with SMTP id u18mr1473727ybd.402.1623790837591; Tue, 15 Jun 2021 14:00:37 -0700 (PDT)
MIME-Version: 1.0
References: <YMDmJcrm9OwrKDk9@hephaistos.amsuess.com> <CAM70yxAPW4xFqP5tXx=s5HCm-m6tYyi+XsLTxkk58o=P6NRRCw@mail.gmail.com> <53BC1116-A603-49E7-A3A5-10113B494A63@tzi.org>
In-Reply-To: <53BC1116-A603-49E7-A3A5-10113B494A63@tzi.org>
From: Emile Cormier <emile.cormier.jr@gmail.com>
Date: Tue, 15 Jun 2021 18:00:26 -0300
Message-ID: <CAM70yxBthNVKYfNxq=MmLpKSHc_rO3hvWuGr5sfHeMPnki7_1Q@mail.gmail.com>
To: Carsten Bormann <cabo@tzi.org>
Cc: Christian Amsüss <christian@amsuess.com>, Kio Smallwood <kio@mothers-arms.co.uk>, cbor@ietf.org
Content-Type: multipart/alternative; boundary="0000000000001071bb05c4d444b0"
Archived-At: <https://mailarchive.ietf.org/arch/msg/cbor/V3-HFmwhrN6iG8hnXg09yJuJHQ0>
Subject: Re: [Cbor] Maps: Need for additional compactness?
X-BeenThere: cbor@ietf.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: "Concise Binary Object Representation \(CBOR\)" <cbor.ietf.org>
List-Unsubscribe: <https://www.ietf.org/mailman/options/cbor>, <mailto:cbor-request@ietf.org?subject=unsubscribe>
List-Archive: <https://mailarchive.ietf.org/arch/browse/cbor/>
List-Post: <mailto:cbor@ietf.org>
List-Help: <mailto:cbor-request@ietf.org?subject=help>
List-Subscribe: <https://www.ietf.org/mailman/listinfo/cbor>, <mailto:cbor-request@ietf.org?subject=subscribe>
X-List-Received-Date: Tue, 15 Jun 2021 21:00:54 -0000

Hi Carsten,

Sorry for the late reply. Your message was flagged as spam for some reason
and I just found it recently.

Thanks for clarifying the OEMLists; they make sense to me now. My example
with the superfluous brackets was taken directly from your slides, so you
may want to consider fixing those slides if you plan on reusing them.

I'm not sure which would be easiest to decode between Dittolists and
OEMLists. With Dittolists, a SAX-like parser either has to "remember" the
last key (which is not trivial if the key is of arbitrary type), or it has
to look ahead after the value following the key to see if there's a Ditto
special value.

What vexes me for both the Dittolists and OEMLists is how I'm supposed to
transcode them into JSON in a multi-encoding environment, such as a WAMP
router (https://wamp-proto.org/) Consider the case where a JSON peer would
expect a multimap to be in the [[“aaa”, 1], [“aaa”, 2], [“bbb”, 3]] form
with repeated keys. My transcoder would therefore have to memorize each
inbound CBOR multimap key in case it needs to be repeated in the output
JSON array. I would have to store the keys in some sort of DOM tree-like
structure, because keys can be arrays or maps. This would ruin the
simplicity of my current CBOR-to-JSON transcoder.

On the other hand, if I assume that a JSON peer understands OEMLists, then
I can simply transcode each sub-entity of a CBOR OEMList as-is without
having to memorize anything.

I'll do some research on how multimaps are expected to be serialized in the
Javascript world. I'm guessing that there's really no consensus.

Cheers,
Emile Cormier

On Sat, Jun 12, 2021 at 12:25 PM Carsten Bormann <cabo@tzi.org> wrote:

> Hi Emile,
>
> > The MList encoding becomes a pessimization if there are few duplicate
> keys and a large number of key-value pairs. It might not be so bad to
> assume that keys tend to be generally short - they are often things like
> database IDs.
> >
> > The OMList encoding not allowing array values seems like a showstopper
> to me.
>
> Yes, both MList and OMList are showing intermediate steps; I think only
> OEMList would be worth defining tags for.
>
> > For OEMList, I'm not clear on this part:
> >
> > So [[“aaa”, 1], [“aaa”, 2], [“bbb”, 3], [“ccc”, [4]]] becomes [“aaa”,
> [1, 2], “bbb”, 3, [“ccc”, [[4]]]]
>
> There are some spurious brackets here.  I think this should be:
>
> [“aaa”, [1, 2], “bbb”, 3, “ccc”, [[4]]]
>
> Basically the idea is:
>
> Packing:
> — For keys with only one value:
>   — if that value is not an array, use the value
>   — if that value *is* an array, enclose it into an array (what you call
> “singleton”).
> — for keys with more than one value:
>   — enclose all the values into a single array
> Unpacking:
> — for values that are not an array: use that as the single value for the
> key.
> — for values that are an array: use each element of that array as values
> for the key.
>
> > Is the resulting singleton array [“ccc”, [[4]]] used as an indicator so
> that the decoder interprets [[4]] as the array [4] to be associated with
> the key "ccc”?
>
> No.  Any outer array is always removed when unpacking, regardless of how
> many elements are in there.
>
> > If so, how is the decoder supposed to distinguish this from an array key
> in heterogeneous maps? This problem is more obvious if you change the
> example to this:
> >
> > [[“aaa”, 1], [“aaa”, 2], [“bbb”, [3]], [“ccc”, 4]] becomes [“aaa”, [1,
> 2], [“bbb”, [[3]]], “ccc”, 4]
> >
> > and a decoder interprets this as:
> > Key,     Value
> > "aaa",    1
> > "aaa",    2
> > [“bbb”, [[3]]],    "ccc"
> > 4,    <missing value>
>
> Yes, doesn’t work with those superfluous brackets.  See above.
>
> > It seems to me the only way to avoid the possible pessimization of
> MList, while also avoiding ambiguities with keys/values that can themselves
> be arrays/maps, is to introduce a new CBOR special value (or "abuse" CBOR
> break) to indicate the "shared key" special case. This has the disadvantage
> that it can't be directly transcoded to JSON.
>
> I think OEMLists work well as described above.
>
> > 1. For something like OEMList, when encoding a multimap, an encoder
> needs to look ahead to see if there are multiple values associated with the
> same key.
>
> Yes.  I haven’t talked about Dittolists yet.
>
> [[“aaa”, 1], [“aaa”, 2], [“bbb”, 3], [“ccc”, [4]]]
> becomes
> [“aaa”, 1, *ditto*, 2, “bbb”, 3, “ccc”, [4]]
>
> *ditto* might be a simple value we allocate specifically for this (one
> byte), e.g., simple(16).
>
> The gain of a dittolist then is the (n-1)*(sizeof(key)-1) for a key used n
> times.
> The gain of an OEMList is (n-1)*sizeof(key)-1, but then there is a loss of
> one byte for each array value that comes with a key that is used only once.
> Which of these is better depends on the use case, but we are talking about
> adding or removing single bytes for certain cases.
>
> Dittolists are definitely easier to stream, and I also think easier to
> implement (no special cases for array values).
>
> In summary, to me it seems we might want to add OEMLists, Dittolists,
> both, or none of them.
>
> Grüße, Carsten
>
>