[Cbor] Re: Early allocation for packed CBOR (Re: Reminder: CBOR WG Virtual Meeting on 2024-12-11)

Christopher Allen <christophera@lifewithalacrity.com> Wed, 11 December 2024 17:07 UTC

Return-Path: <christophera@lifewithalacrity.com>
X-Original-To: cbor@ietfa.amsl.com
Delivered-To: cbor@ietfa.amsl.com
Received: from localhost (localhost [127.0.0.1]) by ietfa.amsl.com (Postfix) with ESMTP id 2A09BC15199D for <cbor@ietfa.amsl.com>; Wed, 11 Dec 2024 09:07:27 -0800 (PST)
X-Virus-Scanned: amavisd-new at amsl.com
X-Spam-Flag: NO
X-Spam-Score: -1.906
X-Spam-Level:
X-Spam-Status: No, score=-1.906 tagged_above=-999 required=5 tests=[BAYES_00=-1.9, DKIM_SIGNED=0.1, DKIM_VALID=-0.1, HTML_MESSAGE=0.001, RCVD_IN_DNSWL_NONE=-0.0001, RCVD_IN_ZEN_BLOCKED_OPENDNS=0.001, SPF_HELO_NONE=0.001, SPF_PASS=-0.001, T_SCC_BODY_TEXT_LINE=-0.01, URIBL_DBL_BLOCKED_OPENDNS=0.001, URIBL_ZEN_BLOCKED_OPENDNS=0.001] autolearn=ham autolearn_force=no
Authentication-Results: ietfa.amsl.com (amavisd-new); dkim=pass (2048-bit key) header.d=lifewithalacrity-com.20230601.gappssmtp.com
Received: from mail.ietf.org ([50.223.129.194]) by localhost (ietfa.amsl.com [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id A3AOkqwyu7Ty for <cbor@ietfa.amsl.com>; Wed, 11 Dec 2024 09:07:22 -0800 (PST)
Received: from mail-ed1-x52d.google.com (mail-ed1-x52d.google.com [IPv6:2a00:1450:4864:20::52d]) (using TLSv1.3 with cipher TLS_AES_128_GCM_SHA256 (128/128 bits) key-exchange X25519 server-signature ECDSA (P-256) server-digest SHA256) (No client certificate requested) by ietfa.amsl.com (Postfix) with ESMTPS id D9974C14F6F7 for <cbor@ietf.org>; Wed, 11 Dec 2024 09:07:21 -0800 (PST)
Received: by mail-ed1-x52d.google.com with SMTP id 4fb4d7f45d1cf-5d3d14336f0so8703622a12.3 for <cbor@ietf.org>; Wed, 11 Dec 2024 09:07:21 -0800 (PST)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=lifewithalacrity-com.20230601.gappssmtp.com; s=20230601; t=1733936840; x=1734541640; darn=ietf.org; h=cc:to:subject:message-id:date:from:in-reply-to:references :mime-version:from:to:cc:subject:date:message-id:reply-to; bh=auXjmteeXF5FE6FZVH8yOtMYPIZEhQuUGmGjrzfPNPE=; b=vfvJ8YZtI8H0ftrY0LcHkGaDvzpPhnnuTBY8ARiZVFooYKdMc/6AA9Bt6/hioTXUdH od85G3XzHhd9C/uFEdr4SYqa4JGybR48u9W0Fb/hJhrXoSKMXrqgEehD8vquB+VwJN9K cA+ptDRtz1HfqATydJjHEnNyH4wEgIkIsm3oILYeaekxtV7RkcxTN6vjB/SaCIaPBAxS dp95FA9WYCsshKJ/Op/6v8RTBfqEffqfs9mHfDTko33+k60tkOGdbBfNrZ3hZ75tmZ+J 48docTADTix8i2dMOcUuBZWnjEFGNMctjosDP0L+XHS1IbZXAvlHeKpQoUt8LILJQwTS gUpg==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1733936840; x=1734541640; h=cc:to:subject:message-id:date:from:in-reply-to:references :mime-version:x-gm-message-state:from:to:cc:subject:date:message-id :reply-to; bh=auXjmteeXF5FE6FZVH8yOtMYPIZEhQuUGmGjrzfPNPE=; b=fPUjnLUgQTbdB3LkIg4BaVlyRDH6otRJWcnaEq+MVRt5P2MOdsVrz49g+a5hoV/vo1 KNfRQ0yz5lAJXfcWh7jIeQwALK5d4mzuarJfWauwPj52VR48TtN/8YsIBeDrF6O98nbv aBb6qMQTa2mHqmfLnn/dQIihj6RS+Ek/5TBD45n/+0XljfagZLud6wpP6E9UxrvgzGC1 ogVdHtEUByzLn8h1kIkqFdIfnEZnUdWMZ3od/bY37DAjOyksXxecaXT2M5xvzH4ZdGlb MnR+mv/OIS4An3AytOMc++WNlej6Qd+EahnHYY9+PFcoWUc5edV9USmGv9szmPHu2UvA UeSg==
X-Forwarded-Encrypted: i=1; AJvYcCV7CSf/Xob71v1tow0cmCyprRqYX8W7+EdWnOUQ1NF2XH4Mr4hwJ4C1CR6g2BVzLwLcxQ0p@ietf.org
X-Gm-Message-State: AOJu0YzAEmrNhcDMiH8pq46n3ynJRS22OOTHgUzZgXnZD5E1HIR7peH9 kk+2N/T1x3TVXjHY6FTQEjsheI4nXX7U/x/m0veKOEpQZ1UZa433tsrNipRl2WCNyIPWXC0dlQd v0T//J49RWOwuqJ/OwPPa0KFitz8HzuHvqPwl
X-Gm-Gg: ASbGnctr9DX8gniVBDlepBiJaRVUc1UapgOx5X3hrD0BKdh2D3Rh7sC2V2dd0ivMcuk W+b7x1wbDcExDMezmiqFrSBOw8NW8qqsb5iT6ItYGZAql4yeQaa9U68Zr8St8XVaM3Evybw==
X-Google-Smtp-Source: AGHT+IGFkICkVhaLCXxnwj5GH+s15tQjdNa0nUNOTbfoQHLNE3E7A+6mN0HaTZyflG6+b2tUnIL2UNEHDUSFx+sJWp0=
X-Received: by 2002:a05:6402:1e90:b0:5d0:81dc:f20e with SMTP id 4fb4d7f45d1cf-5d4330abb13mr3848805a12.17.1733936839856; Wed, 11 Dec 2024 09:07:19 -0800 (PST)
MIME-Version: 1.0
References: <CALaySJKDFscUBGw4CPspXJvUTkXywVHc_FrmhO3ybBWTrwjGXw@mail.gmail.com> <CALaySJJ8-M9x8irtmF2pfDE3GRXU1am9n2a3XeDcmPT+kww+KA@mail.gmail.com> <CALaySJKTQT_9CC-wVVd+fY1NYJ73M8CP22hn=rWrFeTJSJDEsA@mail.gmail.com> <CALaySJKG3oagg6ffLTx8LgvLvnjHHA2DMGgY74E0q=rReAc4PA@mail.gmail.com> <CALaySJLtUR1=G_WH4H+zoJ5LCrHjBgEf1oW104zDtFQighY+gg@mail.gmail.com> <CALaySJLnKxU9m3BNPq4XayrSrorRBG2vuBz1AF-CsEBoSZe7Xg@mail.gmail.com> <CALaySJKaz7C=GN5E=saiDY4KxL+9xCfM0ocZuMStEQ96FnQ4KA@mail.gmail.com> <CALaySJJEXkey9vLAp8VqDXmPsWpxiWN9jjtVnGio1nMQ4K+mDQ@mail.gmail.com> <CALaySJJfc+tET4Vm5UQjHPK5mf61O0iR-1i6=X32CYtWxZLWTQ@mail.gmail.com> <CALaySJKdrk7aPzhT=kbE1B8pq1EBw74nmx_peSJMAoHsG5jyVQ@mail.gmail.com> <CALaySJ+fWX4zEnE5v-Q9R6eCv=kSJjnc-fsXL5PGPgac1GJAcA@mail.gmail.com> <B807C9D3-39A4-4024-BC1D-85DD84EA1735@tzi.org> <DFE56705-CCDD-4172-B577-C873E3DB4898@tzi.org> <5FEA5C07-4A39-4B58-B2AE-F261D111FCE6@cursive.net> <D0618F67-4868-4745-A526-F73DF1A98E1B@tzi.org> <2A875D49-DD88-42D9-969D-0841A6B41F95@tzi.org> <PH7PR02MB92920676E8817271E547F82FB73E2@PH7PR02MB9292.namprd02.prod.outlook.com> <29696.1733935516@obiwan.sandelman.ca>
In-Reply-To: <29696.1733935516@obiwan.sandelman.ca>
From: Christopher Allen <christophera@lifewithalacrity.com>
Date: Wed, 11 Dec 2024 09:07:08 -0800
Message-ID: <CAAse2dEYQ-Kp_g5FTfqVBY3UQTd82McpwgHeZHmwbCkDh+v_SA@mail.gmail.com>
To: Michael Richardson <mcr+ietf@sandelman.ca>
Content-Type: multipart/alternative; boundary="00000000000067386b062901a16d"
Message-ID-Hash: Z67OSVV6GOXPF4N6TABVECQ66TCTPZIT
X-Message-ID-Hash: Z67OSVV6GOXPF4N6TABVECQ66TCTPZIT
X-MailFrom: christophera@lifewithalacrity.com
X-Mailman-Rule-Misses: dmarc-mitigation; no-senders; approved; emergency; loop; banned-address; member-moderation; header-match-cbor.ietf.org-0; nonmember-moderation; administrivia; implicit-dest; max-recipients; max-size; news-moderation; no-subject; digests; suspicious-header
CC: Michael Jones <michael_b_jones@hotmail.com>, CBOR <cbor@ietf.org>, Manu Sporny <msporny@digitalbazaar.com>
X-Mailman-Version: 3.3.9rc6
Precedence: list
Subject: [Cbor] Re: Early allocation for packed CBOR (Re: Reminder: CBOR WG Virtual Meeting on 2024-12-11)
List-Id: "Concise Binary Object Representation (CBOR)" <cbor.ietf.org>
Archived-At: <https://mailarchive.ietf.org/arch/msg/cbor/bK4yZ1t4eSf7Xo1dYgNCtcg6Qog>
List-Archive: <https://mailarchive.ietf.org/arch/browse/cbor>
List-Help: <mailto:cbor-request@ietf.org?subject=help>
List-Owner: <mailto:cbor-owner@ietf.org>
List-Post: <mailto:cbor@ietf.org>
List-Subscribe: <mailto:cbor-join@ietf.org>
List-Unsubscribe: <mailto:cbor-leave@ietf.org>

(sorry if this is a duplicate, IETF mailing lists seem to reject mails sent
via google)

I've been uncomfortable with the packed CBOR proposal, but I have not
spoken up as I'm not an expert on compression in constrained environments.
Gordian Envelope uses a single CBOR tag #40003 for RFC 1951 DEFLATE. I
recognize that DEFLATE is challenging in a constrained environment, but
this very large allocation of tags feels like a hack, as opposed to another
single tag for a specific compression for all items inside it.

There are also other approaches, like that used by W3C CBOR-LD
https://json-ld.github.io/cbor-ld-spec/, which uses the available semantic
"@context" for very large amounts of compression of larger data. They claim
"By utilizing semantic compression schemes, compression ratios in excess of
60% better than generalized compression schemes are possible." and I've
seen some slides that detail this (perhaps Manu Sporny could share here?).
As I understand it, CBOR-LD is currently in production. I'm not proposing
CBOR-LD as an alternative to packed CBOR, but it does demonstrate there
might be other approaches.

Reviewing previous emails, I also never saw any responses to Kang Seonghoon
<public+ietf-cbor@mearie.org> email "[Cbor] On the value of packed CBOR"
(copy below), who seems to raise some other legitimate questions.

I'd also like to know who is actually planning implementing packed CBOR, on
which specific constrained environments. Otherwise, packed CBOR feels like
an academic exercise.

-- Christopher Allen

---------- Forwarded message ---------
From: Kang Seonghoon <public+ietf-cbor@mearie.org>
Date: Mon, May 27, 2024 at 6:42 AM
Subject: [Cbor] On the value of packed CBOR
To: <cbor@ietf.org>


Hello, I have been following CBOR and its tag ecosystem for many years
but only noticed the existence of the packed CBOR proposal
[I-D.ietf-cbor-packed] which made me very uneasy about its bad
precedent and potential impact. It took a large wall of text to
elaborate my initial thought, and while I did have sourced most
arguments from the list archive, yet still I may well have missed some
points along the way. At the very least though I hope my post to be a
survey of the past and my own opinions about this format.

(A rendered version of this writing is also available at
<https://hackmd.io/HVdVu80SShSBmj7r9B10Wg>, in the case gmail screws
the formatting.)



## Weak rationale for packing over others

The Packed CBOR proposal ("the Proposal" henceforth), as of the latest
published draft, lists the following statements in the first section:

1. [CBOR] is a data format whose design goals include the possibility
of extremely small code size, fairly small message size, and
extensibility without the need for version negotiation.

2. CBOR data items, in particular when generated from legacy data
models, often allow considerable gains in compactness when applying
data compression.

3. While traditional data compression techniques such as DEFLATE can
work well for CBOR encoded data items, their disadvantage is that the
receiver needs to decompress the compressed form to make use of the
data.

4. Packed CBOR [is] a simple transformation of a CBOR data item into
another CBOR data item that is almost as easy to consume as the
original CBOR data item.

The statement 1 can be regarded as CBOR's own value propositions and
simply repeats the abstract of [RFC8949]. The statement 2 then
challenges one of them, namely the compactness ("fairly small message
size"). Given the statement 1, it is reasonable to expect that there
was no good way to achieve more compact encoding without compromising
some design goals, or such encoding should have been adopted instead.
The statement 3, however, is all about the additional preprocessing
step. The Proposal quickly hand-waves why that is relevant to design
goals, and just presents the supposed solution in the statement 4.

It should be noted that this kind of questionable motivation is not
limited to the Proposal but quite common in the world of technical
standards. In many cases though additional informative materials do
often give many hints, and this case is no different: the Proposal
mentions that the planned use for the W3C Thing Description (TD)
format [WOT-TD] was an initial driving force for the format, which
leads to the 2017 presentation by the same author [Bormann2017].



## Vague notion of processability

[Bormann2017] is a set of slides presented in the W3C WoT WG meeting
in 2017, when it was not yet decided which serialization format should
be used for the W3C TD format. The author compares existing
compression algorithms and concludes in favor of (the earlier version
of) the Proposal. SMILE [SMILE] and EXI [exi] were also considered in
addition to JSON and CBOR, and my understanding is that TD eventually
adopted both CBOR and EXI as optional encodings without giving any
more preference. The data set cited by [Bormann2017] is harder to
access by now, but still in public [WOT-TD-PLUGFEST-EXAMPLE].

As the Proposal correctly points out, the JSON vocabulary used by TD
is indeed verbose and significantly enlarges converted CBOR data.
Implicitly assuming that TD cannot be changed, it gives more explicit
statements about the best serialization format to use for TD, which I
will paraphrase a bit for convenience (and any errors and omissions
are mine):

3a. There is a trade-off between compactness and usefulness for
processing ("processability").

3b. Compactness has no relation to semantics. It implies some sort of
processing as compression requires "view of window".

3c. Processability is about keeping data in a processable form. It
includes no escaping and no piecing [sic] together.

3d. Structural sharing maintains processability but still saves ~1/3,
while prefix sharing saves even more but reduces processability.
Static dictionary will help too.

It is clear that this notion of processability is what's behind the
original statement 3. Unfortunately this is also where the whole
justification falls apart, because you can process the data in many
more ways. It can be even argued that CBOR doesn't maintain
processability too, as CBOR records the *number* of nested items and
thus has to maintain a stack to skip to a particular nested item.

Alternatively, we can define the "processability" to only include a
particular set of operations. The statement 3c implies and 3d seems to
confirm that it should include a space-constrained decoding, but even
that is quite vague. For example,

- Can we assume we have enough memory to buffer the whole packed CBOR
data item? If we can't, how many past bytes can we keep during the
decoding process?

- Can we use a disproportionately large amount of ROM as opposed to RAM?

- Should we tailor the encoding for JSON-like data items, or something else?

The Proposal answers none of them directly, but it does mention that
DEFLATE [RFC1951] is not suitable for the supposed constrained
environment. DEFLATE requires only ~32 KiB of sliding context and two
in-memory Huffman trees, but there are only 286 + 32 = 318 distinct
codes so the sliding context will dominate the memory usage. Therefore
the inability to use DEFLATE does suggest the severe lack of working
memory and would leave little space to buffer CBOR data items as
well---everything has to be streamed and quickly discarded after any
processing.

[Bormann2017] also supports this hypothesis, because it also compares
the packed CBOR with LZ4 and DEFLATE. LZ4 is an LZ77-only codec
without any entropy coding and very simple and efficient to decompress
as a result. LZ4 decompression reaches 4 GB/s in modern systems
[lzbench] and a similar result is expected in less powerful systems
due to its simple nature. The decompression *time* therefore cannot be
any deciding factor, and the space overhead seems to be the only
reason to make the packed CBOR.

In my opinion, it is impossible to design a good packing format
without having concrete and justifiable answers to virtually every
single of them, yet they are still left completely unanswered to this
day and inferred assumptions don't match the actual design. The
Proposal should define and present concrete processing models to
accurately shape the format design.



## Is the packed CBOR actually processable?

Due to the vague notion of processability, the Proposal never fully
justified that the proposed scheme would be actually more efficiently
processed than the ordinary CBOR compressed with traditional means,
with any reasonable definition of efficiency. And if we assume a
memory-constrained environment as inferred before, it is evident that
the proposed design is not really suitable.

By design, packing tables have to be kept in memory all the time.
There is no benefit to process them further because items in the same
packing tables are mutually independent, so the minimum required space
is close to the size of encoded tables themselves. This implies the
practical limit in the packing table size: if the target memory size
is 32 KiB, these tables cannot hold more than 2^15 items even when
each item is one byte long. And yet the Proposal supports ~2^28
references---presumably just in case. (By the way, a much larger
sliding window in modern LZ77 variants is largely for more
opportunistic matches, and yet it has to be constrained to be
practical. Zstandard [RFC8878, Section 3.1.1.1.1.2] for example has
the minimum suggested window size of 8 MB.)

Each data item in the packed CBOR will have to be eventually consumed
in two ways: matched as a part of structure, or used by applications.
The latter is hard to qualify, but the former requires an equality
function that operates on CBOR data items, like `equal(simple(0),
"foo")`. Shared items are actually well-behaved because it is just a
single lookup if you have an additional mapping from item indices to
in-memory offsets (otherwise you need to skip prior items every time),
but argument items are much more complex. For example, imagine the
following where literals `foo`, `extension:bar` and `extension:baz`
act as structural keys:

    113([["foo", "extension:ba"],
         {simple(0): 42, 225("r"): true, 225("z"): "quux"})

Even with an assumption that keys are in a certain order, the decoder
has to compute `equal(225("r"), "extension:bar")` for example---which
would involve a lookup, partial match and recursion. Not only this
logic would be complicated with the full concatenation function, but
recursion also takes an additional working memory. This design might
be possible on its own, but doesn't seem suitable for
memory-constrained decoders.



## Compression without breaking structure

There is another way to interpret [Bormann2017] in terms of
compression: traditional compression algorithms are not good at
structured formats like CBOR, so a dedicated compression format is
required and can remain efficient even without breaking its structure
and making it less processable. This interpretation indeed doesn't
require the precise notion of processability and more compact
compression alone can be a merit to proceed. This possibility led to a
proposal to use the packed CBOR for DNS messages
[I-D.lenders-dns-cbor], for example.

Unfortunately the statement 3b is yet another evidence that the
Proposal was fundamentally misguided: it is a well-known result of
information theory that compactness is in fact *intimately* related to
semantics. Any new information is a new opportunity to eliminate
redundancy if you can spend enough resources. It's only a matter of
how much resource we are willing to spend for given input.

Executable compression is a good example where exploiting structure
improves compression. Many control flow instructions in CPU
architectures use relative addresses to reduce the operand size. But
this means that multiple opcodes with the same target address will
have different encoded operands. For this reason, many compression
algorithms try to convert relative addresses to absolute addresses
before the actual compression [BCJ]. As long as it can be undone, it
doesn't even have to be correct because the CPU will never execute a
preprocessed executable. A more sophisticated algorithm would
recognize different parts of instructions so that it can regroup and
transform them separately, as demonstrated by Fabian Giesen in his
surprisingly approachable talk on the subject [Giesen2006, p. 41].

Still, the observation that compressed data doesn't always retain the
original structure is largely correct. In fact the erasure of such
structure is expected and highly desirable, because the only
information-theoretically incompressible data is a uniform random
sequence. Stock compression algorithms are largely out if we have to
retain some structure, but there are still many possibilities to
explore, some of which I'll lightly elaborate:

**Individual literal compression.** It was already noted that string
literals take up much space in many CBOR data items. Therefore
compressing those literals individually should have been the foremost
candidate. Barcode standards like QR Code [ISO18004], Data Matrix
[ISO16022] and Aztec Code [ISO24778] have utilized multiple character
sets and escape codes to minimize the number of bits without prior
knowledge about inputs, which CBOR can equally leverage.

For example, consider a subset of text strings where all characters
are in U+0021..007F. We can design two 6-bit codes where 0 is an
escape code, 1--31 map to U+0021..003F for both and 32--63 map to
U+0040..005F for the "upper" code and U+0060..007F for the "lower"
code. The active code is "lower" at the beginning, the escape code
will toggle the state for the next code. If the next code is also the
escape code, the current state is retained until the next escape code.
This encoding can be put into a byte string---assuming no other byte
string exists---to yield at most 25% more compact encoding without any
context. Additional information and assumption, like that many short
strings are in English, would allow for more sophisticated methods
like [SMAZ2].

**Different contexts.** LZ77 famously uses a fixed-size sliding
window, but it is hardly the only possible form of prior contexts that
algorithms can exploit. High-end media codecs have already leveraged
hundreds or thousands of different contexts for decades, and
regrouping different parts of data is in fact an implicit way to
introduce additional contexts. Alternatively one can *reduce* the
context in order to achieve something not possible in stock
algorithms, at the expense of inferior compression though.

For example, one can apply compression only to text strings and keep
all structural bytes intact, making necessary structural
transformations ourselves. The context for text strings therefore
doesn't contain any structural bytes which confuse most compressors.
An implicit version of this context would be moving all text contents
into a separate stream, like turning `["foo", {"example": "blah"}]`
into the structural `[/43/, {/47/: /44/}]` and the literal
`fooexampleblah` where `/xx/` denotes a raw byte sequence for the
remaining part of text string. Such context or equivalent
preprocessing can retain some processability even after the
compression.

**Use a different structure.** As suggested before, CBOR is not fully
processable if we are very strict about the definition. Therefore it
is reasonable to expect that any packing scheme for CBOR would be less
processable than the plain CBOR, and the question becomes which one to
sacrifice. CBOR structure itself is already quite simple, so it seems
unwise to reuse CBOR for packed representations instead of making a
new variant that is similarly simple but more suitable for
compression.

Consider the previous example. Is the structural `[/43/, {/47/:
/44/}]` actually a valid CBOR data item? It isn't, and we *have* to
use something like `["\x03", {"\x07": "\x04"}]` to make it a valid
CBOR without making it indistinguishable from, say, `[3, {7: 4}]`. But
the structure-only data item is not meaningful to ordinary CBOR
decoders anyway, so all these overheads paid for CBOR compatibility
would be useless.



## Inefficient tag allocation

The Proposal allocates 2^28 + 2^26 + 7 tags, which is easily the
largest allocation in a single proposal to date. Not only does this
set a bad precedent for similar proposals, but its use of tags is
inefficient.

My personal threshold for the "reasonable" allocation size is
proportional to a square root of the number of available tags, which
suggests that you shouldn't allocate more than 256 1+2 tags or 65,536
1+4 tags at once (I believe this was roughly the case before the
Proposal, where the largest allocation was due to [RFC9277]). This way
many applications requiring a considerable number of tags can coexist
for a long time, while the proposed allocation of more than 2^28
consecutive 1+4 tags would be possible only 10 more times. The fact
that they are all consecutive and conveniently placed makes the
Proposal very sensitive to other allocations as well, especially for
1+4 tags: what if someone registers the tag 2,000,000,000 today?

I believe this large amount of allocation was considered necessary
because CBOR doesn't have any other efficient way to attach arbitrary
bytes to data items. Indeed, 1+1 tags do seem necessary under this
requirement because there are only so many other 2-byte prefixes. But
longer tags are less justified, because a rather compact encoding is
possible with only short tags. This is even true for the subset that
includes only first 2^15 references, as I have demonstrated earlier.
(By the way, the lower bound of 27647 in the Proposal is incorrect; it
should be 28671 - 1023 + 8 = 27656 instead.)

I'll present one possible alternative encoding here as a concrete
example. It requires that tag 6 is no longer used for the first
argument reference, so that `6(..)` can be used for other
possibilities. We will make use of 0x40 1-byte data items:

    Pattern            Count  Diagnostics
    --------------  --------  ----------------------------------
    00..1F              0x18  0..23
    20..37              0x18  -1..-24
    E0..EF              0x10  simple(0)..simple(15)

These byte ranges can be then combined into 2-, 3-, 4- and 5-byte
prefixes that can be attached to the next data item as follows:

    Pattern            Count  Diagnostics
    --------------  --------  ----------------------------------
    C6C6                 0x1  6(6(/rest/))
    C6D5..C6D7           0x3  6(21(/rest/))..6(23(/rest/))
    C6A1xx              0x40  6({xx: /rest/})
    C682xx              0x40  6([xx, /rest/])
    C683xxxx          0x1000  6([xx, xx, /rest/])
    C684xxxxxx       0x40000  6([xx, xx, xx, /rest/])

These prefixes can be then mapped to actual semantics. It is possible
to encode much more prefixes if desired; I used tags 21--23 for
example because they are reasonably inert during the decoding, and had
more than 0x100000 5-byte prefixes in a more complex design. More 1+1
tags can be added if only four 2-byte prefixes are considered
insufficient. But the number of such additional tags can be minimized
so that 1+2 or 1+4 tags didn't seem to be required after all.

Yet still, I can't strongly argue that this scheme is a way to go
because such use of tags is fundamentally flawed. These encodings have
to be well-formed for no other reason, and thus can't make better use
of limited bits. Even a large number of allocation in the Proposal
can't avoid 10.4--12.6 additional bits per tag, which are used solely
to indicate it's a tag.



## Concatenation that doesn't always concatenate

The Proposal defines the concatenation function in
[I-D.ietf-cbor-packed, Section 2.4], which is already significantly
complex to describe. It can be alternatively described as a more
functional manner like the following:

    update({..., k(i-1): v(i-1), ki: vi, k(i+1): v(i+1)...}, ki, v') =
      {..., k(i-1): v(i-1),         k(i+1): v(i+1)...}  if v' = undefined
      {..., k(i-1): v(i-1), ki: v', k(i+1): v(i+1)...}  otherwise
    update({k1: v1, ..., kn: vn}, k', v') =
      {k1: v1, ..., kn: vn, k': v'}                     if k' != ki for all
i

    u8("xxx") = h'xxxx'  where h'xxxx' is a UTF-8 encoding of "xxx"
    u8(h'xxxx') = "xxx"  if h'xxxx' was a valid UTF-8 sequence

    concat([x1, ..., xn], [y1, ..., ym]) = [x1, ..., xn, y1, ..., ym])

    concat({k1: v1, ..., kn: vn}, {}) = {k1: v1, ..., kn: vn}

    concat({k1: v1, ..., kn: vn}, {k': v', /rest/...}) =
      concat(update({k1: v1, ..., kn: vn}, k', v'), {/rest/...})

    concat("xxx", "yyy") = "xxxyyy"
    concat("xxx", h'yyyy') = concat("xxx", u8(h'yyyy'))
    concat(h'xxxx', "yyy") = concat(u8(h'xxxx'), "yyy")
    concat(h'xxxx', h'yyyy') = h'xxxxyyyy'

    concat("xxx", [y1, ..., ym])   = join("xxx", [y1, ..., ym])
    concat(h'xxxx', [y1, ..., ym]) = join(h'xxxx', [y1, ..., ym])
    concat([x1, ..., xn], "yyy")   = join("yyy", [x1, ..., xn])
    concat([x1, ..., xn], h'yyyy') = join(h'yyyy', [x1, ..., xn])

    Other missing cases are not valid.

This form also more clearly shows where the Proposal is lacking:

- What happens when `k' = ki = kj` for some distinct i and j? Note
that well-formed but invalid CBOR data items may have duplicate keys.

- Can `undefined` be inserted for the `concat(map, map)` case if the
key is not found? (I assumed so above.)

- Are tags relevant or not in this pattern matching? (I assumed it's
relevant.)

- Is there a guarantee that this function is intended to be
associative? (This may not be true depending on how duplicate keys are
handled.)

And then there are function tags. They are, annoyingly, described in
entirely different sections and even more underspecified. The
following is my attempt to rigorously define them:

    join("xxx", [])           = ""
    join(h'xxxx', [])         = h''
    join([/ignored/ ...], []) = []
    join({/ignored/ ...}, []) = {}
    join(x, [y1, ..., ym])    =
      concatb(concatb(...(concatb(y1, x), ...), x), ym)  if m >= 1

    concatb("xxx", b'yyyy') = concat(u8("xxx"), b'yyyy') unconditionally
    concatb(b'xxxx', "yyy") = concat(b'xxxx', u8("yyy")) unconditionally
    concatb(x, y)           = concat(x, y)               otherwise

    concat(106(x), y) = concat(x, 105(y)) = join(x, y)

    concat(114([x1, ..., xn]), [y1, ..., ym]) =
      update(...(update({}, x1, y1), ...), xm, ym)  if n >= m

Like, this formulation also reveals many issues:

- Both the basic definition and tag 114 depend on the same `undefined`
handling for map values, which should be unified into a single
definition. (I have introduced the `update` helper function above.)

- Why is it the case that `concat("AB", h'4344') = h'41424344'` but
`concat("AB", [h'', h'4344']) = join("AB", [h'', h'4344']) = "ABCD"`?
(I have defined `concatb` just to reflect this discrepancy.)

- "Joining works by sequentially applying the concatenation function",
but which direction is it? (Relevant if `concat` turns out to be
non-associative. I assumed left-to-right.)

- Is `concat(106(x), 105(y))` valid?

- In fact, is the distinction between tag 105 "ijoin" and 106 "join"
even necessary?

As it stands, these definitions are so ambiguous that they should be
completely rewritten. And I don't see why function tags have to depend
on the concatenation function; they will work just fine as standalone
tags, like `106(["a", ["b", "c"]])` meaning `"bac"` *anywhere*. The
concatenation function is trying to do too many things instead and yet
rarely good at anything.



## CBOR may not be suited for packed representations

I see the desire to keep the base format unified for the ordinary and
packed CBOR, but all evidence suggests that this is detrimental to any
packing scheme. Moreover the Proposal devotes a whole section
[I-D.ietf-cbor-packed, Section 5] to define the tag equivalence when
some tags act as a "stand in" for other data items, but a separate
base format makes it trivial; if the unpacking results in some tag,
that's just valid as any tag in non-packed CBOR data items.

An attempt to devise a format-specific compression algorithm is not
new, but such attempts are rarely successful without a clear design
goal. When I first learned of the packed CBOR, my brain immediately
drew a parallel to the Standard Compression Scheme for Unicode (SCSU)
[UTS6]. SCSU was originally designed by Reuters in 1996 and maintained
a dynamic window and switched from/to UTF-16 to adapt to different
input scripts. It was considered important enough that it became the
Unicode Technical Standard (other UTSes include the collation
algorithm, regular expression and emoji), but it got quickly obsoleted
within years due to the ever-increasing computing power and the actual
usage was apparently scarce.

Amusingly though, SCSU's dynamic window concept was later reused by
Punycode [RFC3492] when internationalized domain names had to fit in
the existing host name labels and thus had to be greatly compressed.
In the end, SCSU's design was effective but applied inadequately to an
inadequate problem. As it stands now, the packed CBOR resembles SCSU
much more than Punycode.



## References

[BCJ] Wikipedia contributors, "BCJ (algorithm)", Wikipedia, The Free
Encyclopedia, <
https://en.wikipedia.org/w/index.php?title=BCJ_(algorithm)&oldid=1218326963
>.

[Bormann2017] Bormann, C., "Serializing Thing Descriptions", W3C Web
of Things IG/WG joint F2F meeting, Tech Day 1 minutes (draft), July
2017, <https://www.w3.org/2017/07/wot-f2f/slides/w3c-serialization.pdf>.

[EXI] Schneider, J., Ed., Kamiya, T., Ed., Peintner, D., Ed.,
Kyusakov, R., Ed., "Efficient XML Interchange (EXI) Format 1.0 (Second
Edition)", W3C Recommendation, February 2014,
<https://www.w3.org/TR/exi/>.

[Giesen2006] Giesen, F., "Working With Compression", Breakpoint 2006,
April 2006, <
https://www.farbrausch.de/~fg/seminars/workcompression_download.pdf>.

[I-D.ietf-cbor-packed] Bormann, C. and Gütschow, M., "Packed CBOR",
Internet-Draft draft-ietf-cbor-packed-12, March 2024,
<https://www.ietf.org/archive/id/draft-ietf-cbor-packed-12.html>.

[I-D.lenders-dns-cbor] Lenders, M. S., Bormann, C., Schmidt, T. C.,
Wählisch, M., "A Concise Binary Object Representation (CBOR) of DNS
Messages", Internet-Draft draft-lenders-dns-cbor-07, May 2024,
<https://www.ietf.org/archive/id/draft-lenders-dns-cbor-07.html>.

[ISO16022] ISO/IEC, "Information technology - Automatic identification
and data capture techniques - Data Matrix bar code symbology
specification", ISO/IEC 16022:2024, May 2024,
<https://www.iso.org/standard/80926.html>.

[ISO18004] ISO/IEC, "Information technology - Automatic identification
and data capture techniques - QR Code bar code symbology
specification", ISO/IEC 18004:2015, February 2015,
<https://www.iso.org/standard/62021.html>.

[ISO24778] ISO/IEC, "Information technology - Automatic identification
and data capture techniques - Aztec Code bar code symbology
specification", ISO/IEC 24778:2024, April 2024,
<https://www.iso.org/standard/82441.html>.

[LZBENCH] Skibinski, P., "lzbench v1.8.1", September 2020,
<
https://github.com/inikep/lzbench/tree/1f16f2987969f5da63b64efaf0c19f2859ea4480
>.

[RFC1951] Deutsch, P., "DEFLATE Compressed Data Format Specification
version 1.3", RFC 1951, DOI 10.17487/RFC1951, May 1996,
<https://www.rfc-editor.org/info/rfc1951>.

[RFC3492] Costello, A., "Punycode: A Bootstring encoding of Unicode
for Internationalized Domain Names in Applications (IDNA)", RFC 3492,
DOI 10.17487/RFC3492, March 2003,
<https://www.rfc-editor.org/info/rfc3492>.

[RFC8878] Costello, A., "Punycode: A Bootstring encoding of Unicode
for Internationalized Domain Names in Applications (IDNA)", RFC 3492,
DOI 10.17487/RFC3492, March 2003,
<https://www.rfc-editor.org/info/rfc3492>.

[RFC8949] Bormann, C. and P. Hoffman, "Concise Binary Object
Representation (CBOR)", STD 94, RFC 8949, DOI 10.17487/RFC8949,
December 2020, <https://www.rfc-editor.org/info/rfc8949>.

[RFC9277] Richardson, M. and C. Bormann, "On Stable Storage for Items
in Concise Binary Object Representation (CBOR)", RFC 9277, DOI
10.17487/RFC9277, August 2022,
<https://www.rfc-editor.org/info/rfc9277>.

[SMAZ2] Sanfilippo, S., "SMAZ2: compression for very short messages
for LoRa and embedded devices", January 2024,
<
https://github.com/antirez/smaz2/tree/71ecc43e9f4f992a3d160494ab780e81efb743da
>.

[SMILE] FasterXML, LLC., "Smile Format Specification, version 1.0.5",
January 2022, <
https://github.com/FasterXML/smile-format-specification/blob/8fcf388ae410827ae492bc92aee03ddbb49037ed/smile-specification.md
>.

[UTS6] Wolf, M., et al., "Unicode Technical Standard #6: A Standard
Compression Scheme for Unicode", May 2005,
<https://www.unicode.org/reports/tr6/tr6-4.html>.

[WOT-TD] Kaebisch, S., Ed., McCool, M., Ed., Korkan, E., Ed., "Web of
Things (WoT) Thing Description 1.1", W3C Recommendation, December
2023, <https://www.w3.org/TR/wot-thing-description/>.

[WOT-TD-PLUGFEST-EXAMPLE]
<
https://github.com/w3c/wot-thing-description/blob/db8abb3655afc7f149db7976ba4e79149619f537/test-bed/data/plugfest/2017-05-osaka/MyLED_f.jsonld
>



-- 
-- Kang Seonghoon | https://mearie.org/
-- Opinions expressed in this email do not necessarily represent the
views of my employer.
--

_______________________________________________
CBOR mailing list -- cbor@ietf.org
To unsubscribe send an email to cbor-leave@ietf.org