Re: [Cbor] Maps: Need for additional compactness?

Carsten Bormann <cabo@tzi.org> Sat, 12 June 2021 15:25 UTC

Return-Path: <cabo@tzi.org>
X-Original-To: cbor@ietfa.amsl.com
Delivered-To: cbor@ietfa.amsl.com
Received: from localhost (localhost [127.0.0.1]) by ietfa.amsl.com (Postfix) with ESMTP id 4AEE73A16FE for <cbor@ietfa.amsl.com>; Sat, 12 Jun 2021 08:25:59 -0700 (PDT)
X-Virus-Scanned: amavisd-new at amsl.com
X-Spam-Flag: NO
X-Spam-Score: -1.897
X-Spam-Level:
X-Spam-Status: No, score=-1.897 tagged_above=-999 required=5 tests=[BAYES_00=-1.9, RCVD_IN_DNSWL_BLOCKED=0.001, SPF_FAIL=0.001, SPF_HELO_NONE=0.001] autolearn=no autolearn_force=no
Received: from mail.ietf.org ([4.31.198.44]) by localhost (ietfa.amsl.com [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id JZ1M4vQnNbL7 for <cbor@ietfa.amsl.com>; Sat, 12 Jun 2021 08:25:56 -0700 (PDT)
Received: from gabriel-2.zfn.uni-bremen.de (gabriel-2.zfn.uni-bremen.de [IPv6:2001:638:708:32::19]) (using TLSv1.2 with cipher ADH-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by ietfa.amsl.com (Postfix) with ESMTPS id 0AAEB3A16FF for <cbor@ietf.org>; Sat, 12 Jun 2021 08:25:55 -0700 (PDT)
Received: from [192.168.217.118] (p548dcc89.dip0.t-ipconnect.de [84.141.204.137]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by gabriel-2.zfn.uni-bremen.de (Postfix) with ESMTPSA id 4G2M400b6cz2xH5; Sat, 12 Jun 2021 17:25:52 +0200 (CEST)
Content-Type: text/plain; charset="utf-8"
Mime-Version: 1.0 (Mac OS X Mail 13.4 \(3608.120.23.2.7\))
From: Carsten Bormann <cabo@tzi.org>
In-Reply-To: <CAM70yxAPW4xFqP5tXx=s5HCm-m6tYyi+XsLTxkk58o=P6NRRCw@mail.gmail.com>
Date: Sat, 12 Jun 2021 17:25:51 +0200
Cc: Christian Amsüss <christian@amsuess.com>, Kio Smallwood <kio@mothers-arms.co.uk>, cbor@ietf.org
X-Mao-Original-Outgoing-Id: 645204351.685257-5c7a58a866ba4f7afaef446f321735ba
Content-Transfer-Encoding: quoted-printable
Message-Id: <53BC1116-A603-49E7-A3A5-10113B494A63@tzi.org>
References: <YMDmJcrm9OwrKDk9@hephaistos.amsuess.com> <CAM70yxAPW4xFqP5tXx=s5HCm-m6tYyi+XsLTxkk58o=P6NRRCw@mail.gmail.com>
To: Emile Cormier <emile.cormier.jr@gmail.com>
X-Mailer: Apple Mail (2.3608.120.23.2.7)
Archived-At: <https://mailarchive.ietf.org/arch/msg/cbor/c7aTlt56a-naz50VJjU_tzY4N5c>
Subject: Re: [Cbor] Maps: Need for additional compactness?
X-BeenThere: cbor@ietf.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: "Concise Binary Object Representation \(CBOR\)" <cbor.ietf.org>
List-Unsubscribe: <https://www.ietf.org/mailman/options/cbor>, <mailto:cbor-request@ietf.org?subject=unsubscribe>
List-Archive: <https://mailarchive.ietf.org/arch/browse/cbor/>
List-Post: <mailto:cbor@ietf.org>
List-Help: <mailto:cbor-request@ietf.org?subject=help>
List-Subscribe: <https://www.ietf.org/mailman/listinfo/cbor>, <mailto:cbor-request@ietf.org?subject=subscribe>
X-List-Received-Date: Sat, 12 Jun 2021 15:25:59 -0000

Hi Emile,

> The MList encoding becomes a pessimization if there are few duplicate keys and a large number of key-value pairs. It might not be so bad to assume that keys tend to be generally short - they are often things like database IDs.
> 
> The OMList encoding not allowing array values seems like a showstopper to me.

Yes, both MList and OMList are showing intermediate steps; I think only OEMList would be worth defining tags for.

> For OEMList, I'm not clear on this part:
> 
> So [[“aaa”, 1], [“aaa”, 2], [“bbb”, 3], [“ccc”, [4]]] becomes [“aaa”, [1, 2], “bbb”, 3, [“ccc”, [[4]]]]

There are some spurious brackets here.  I think this should be:

[“aaa”, [1, 2], “bbb”, 3, “ccc”, [[4]]]

Basically the idea is:

Packing:
— For keys with only one value:
  — if that value is not an array, use the value
  — if that value *is* an array, enclose it into an array (what you call “singleton”).
— for keys with more than one value:
  — enclose all the values into a single array
Unpacking:
— for values that are not an array: use that as the single value for the key.
— for values that are an array: use each element of that array as values for the key.

> Is the resulting singleton array [“ccc”, [[4]]] used as an indicator so that the decoder interprets [[4]] as the array [4] to be associated with the key "ccc”?

No.  Any outer array is always removed when unpacking, regardless of how many elements are in there.

> If so, how is the decoder supposed to distinguish this from an array key in heterogeneous maps? This problem is more obvious if you change the example to this:
> 
> [[“aaa”, 1], [“aaa”, 2], [“bbb”, [3]], [“ccc”, 4]] becomes [“aaa”, [1, 2], [“bbb”, [[3]]], “ccc”, 4]
> 
> and a decoder interprets this as:
> Key,     Value
> "aaa",    1
> "aaa",    2
> [“bbb”, [[3]]],    "ccc"
> 4,    <missing value>

Yes, doesn’t work with those superfluous brackets.  See above.

> It seems to me the only way to avoid the possible pessimization of MList, while also avoiding ambiguities with keys/values that can themselves be arrays/maps, is to introduce a new CBOR special value (or "abuse" CBOR break) to indicate the "shared key" special case. This has the disadvantage that it can't be directly transcoded to JSON.

I think OEMLists work well as described above.

> 1. For something like OEMList, when encoding a multimap, an encoder needs to look ahead to see if there are multiple values associated with the same key.

Yes.  I haven’t talked about Dittolists yet.

[[“aaa”, 1], [“aaa”, 2], [“bbb”, 3], [“ccc”, [4]]] 
becomes
[“aaa”, 1, *ditto*, 2, “bbb”, 3, “ccc”, [4]]

*ditto* might be a simple value we allocate specifically for this (one byte), e.g., simple(16).

The gain of a dittolist then is the (n-1)*(sizeof(key)-1) for a key used n times.
The gain of an OEMList is (n-1)*sizeof(key)-1, but then there is a loss of one byte for each array value that comes with a key that is used only once.
Which of these is better depends on the use case, but we are talking about adding or removing single bytes for certain cases.

Dittolists are definitely easier to stream, and I also think easier to implement (no special cases for array values).

In summary, to me it seems we might want to add OEMLists, Dittolists, both, or none of them.

Grüße, Carsten