[Cbor] Re: Soliciting unresolved points around dCBOR

Christian Amsüss <christian@amsuess.com> Mon, 17 June 2024 12:01 UTC

Date: Mon, 17 Jun 2024 14:01:34 +0200
From: Christian Amsüss <christian@amsuess.com>
To: cbor@ietf.org
Message-ID: <ZnAlntTjsdqobA7h@hephaistos.amsuess.com>
References: <Zm7eekcpgBmJQ5jv@hephaistos.amsuess.com>
Subject: [Cbor] Re: Soliciting unresolved points around dCBOR
List-Id: "Concise Binary Object Representation (CBOR)" <cbor.ietf.org>
Archived-At: <https://mailarchive.ietf.org/arch/msg/cbor/AVOZfhgeU7LuwTWWIPw1HmZaJ7Q>

Hello dCBOR authors,

(Wolf: Yes that thread was started to get my own hat-off items into
context; took a little longer to write).

As a whole, dCBOR looks to me like something one can do with CBOR, but
nothing I would prefer to endorse. I've held back with comments
previously as there were other ongoing discussions, but as they don't go
anywhere in particular, I'm listing here the particular aspects I think
are problematic (all about the numeric reduction), along with more
editorial comments.

Note that I'm curious to look into Gordian for the expression of graphs
(not really the elision topic: in the constrained space I'm working in,
the nonces required to do effective redaction outweigh the data
transported, to the point where it's easier to produce a signed excerpt
than a revealable data structure). The numeric reduction applied for it
is what so far has kept me from digging into it deeper to evaluate
whether it is applicable (because I'm unlikely to use it with numeric
reduction in place) -- I'll have a look now anyway because understanding
tag 201 will need that.

# [major] Reducing the usefulness of the model

(Sorry for the long text, but I think the context is necessary to get
the point.)

When there are similar data types, environments can make choices in
whether to distinguish them or not. For example:

* floats may or may not be unified with integers

  (CBOR distinguishes them, which inconveniences JavaScript users)

* integers of different length may be unified

  (CBOR does not distinguish them, which inconveniences those who
  attempt to decode every possible CBOR value into the i65 data type no
  language has)

* ASCII characters may be unified with u8, particularly in arrays

  (CBOR distinguishes between byte strings and arrays that consist only
  of u8, which inconveniences Rust users)

* Byte strings may be unified with text strings

  (CBOR distinguishes between them, which would have inconvenienced
  users of ancient Python versions)

All the CBOR distinctions mentioned above are in the basic generic data
model of CBOR. Extended generic data models are created when an
application opts in to the use of some special values or tags. This can
shift CBOR's value space from those extension points into dedicated
types:

* true and false introduce booleans from the simple value space
  (distinct from the integers, to the minor inconvenience of Python
  users to date when they are surprised by isinstance(True, int))

* bignums (tag 2/3) extend the integer space (to the convenience of
  Python users; nobody is really inconvenienced because it's not worse
  than the i65)
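
A minimal sketch of the boolean point, in Python. Because `bool` is a
subclass of `int` there, an encoder that dispatches on native types has
to test for booleans first, or `True` is silently unified with 1. The
function names are hypothetical and only serve the illustration.

```python
def classify(item):
    """Return the CBOR data-model type an encoder should pick for item."""
    # bool must be tested before int: in Python, bool is a subclass of int.
    if isinstance(item, bool):
        return "simple"
    if isinstance(item, int):
        return "integer"
    if isinstance(item, float):
        return "float"
    raise TypeError(f"unsupported type: {type(item).__name__}")


def classify_naive(item):
    """Deliberately wrong ordering: int is tested first, so True becomes 1."""
    if isinstance(item, int):
        return "integer"
    if isinstance(item, bool):
        return "simple"  # never reached for True/False
    if isinstance(item, float):
        return "float"
    raise TypeError(f"unsupported type: {type(item).__name__}")
```

With the naive ordering, the distinction CBOR makes between `true` and 1
is lost before the data ever reaches the wire.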

Generalizing, trouble arises when all of the following apply:

* the language is dynamically typed enough to have a "native"
  representation of arbitrary CBOR (or has mechanisms to emulate this,
  such as Rust's traits)
* that native representation's choices of where to distinguish are not
  1:1 aligned (AFAICT, they never are)
* the user chooses to pass the data to the serializer without an
  application specific schema or application specific encoding rules
  that explain the intention

The trouble comes in two directions:

(a) the native model is stricter than the extended generic data model:

    Encoding is trivial. At decoding time, the application decides at
    use time which of the versions it uses.

    (This is what dCBOR entails for languages that distinguish floats
    and integers.)

(b) the native model is more lax than the extended generic data model:

    Decoding is trivial. At encoding time, the application needs to pass
    extra data on to the encoder about which choice to make.

    (This is what regular CBOR users face in JavaScript, spreadsheets
    and other languages with a weak float-int distinction.)

While for implementations it may appear to be more convenient to
implement (a) because the API boundary the data traverses is defined
in the extended generic data model (eg. if the item is used to do a
string repetition, values with non-zero fractional parts are
rejected in the application), (b) places the onus of deciding on the
sender of the data, which generally has more information available on
the data itself than the receiver.

As for how to implement (b), it may appear as if this requires
extending the encoder API, but that is not necessarily so: If an
encoder has all the parts needed to generate generic CBOR except some
missing distinction, it is always possible for the encoder to send tags:
The software defines tags that the user of the encoding API can wrap
their data items in to express the encoding intent -- say, 1234567(1.0)
to encode as an integer and 7654321(1.0) to encode as a float. As those
tags are reserved for the encoder, the encoder recognizes them and
performs the right encoding without emitting the tags. By comparison,
if an application indeed needs to make a distinction, it can also use
tags with a more weakly typed CBOR application profile -- but then those
tags need to be sent on the wire. (For example, a strongly typed
application built on dCBOR may use 765(1) to express that it really
means a float, but knows that the serializer will turn the float to an
integer before it arrives at the receiver).
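
The encoder-internal tag idea can be sketched as follows, in Python. This
is an illustration under assumptions, not an existing library's API: the
`Tag` class, the `encode` function and its helpers are hypothetical, the
two reserved tag numbers are the ones from the example above, and floats
are always written as binary64 for brevity.

```python
import struct
from dataclasses import dataclass

# Encoder-internal tag numbers from the example in the text (hypothetical).
AS_INT = 1234567
AS_FLOAT = 7654321


@dataclass
class Tag:
    number: int
    value: object


def _head(major: int, arg: int) -> bytes:
    """Encode the initial byte plus the shortest argument form."""
    if arg < 24:
        return bytes([(major << 5) | arg])
    for ai, fmt in ((24, ">B"), (25, ">H"), (26, ">I"), (27, ">Q")):
        if arg < 1 << (8 * struct.calcsize(fmt)):
            return bytes([(major << 5) | ai]) + struct.pack(fmt, arg)
    raise ValueError("argument does not fit in 64 bits")


def _encode_int(n: int) -> bytes:
    return _head(0, n) if n >= 0 else _head(1, -1 - n)


def _encode_float(x: float) -> bytes:
    # Always binary64 here; a real encoder would pick the shortest form.
    return b"\xfb" + struct.pack(">d", x)


def encode(item) -> bytes:
    if isinstance(item, Tag):
        if item.number == AS_INT:
            # Reserved tag: encode as integer, do not emit the tag itself.
            if item.value != int(item.value):
                raise ValueError("value has a non-zero fractional part")
            return _encode_int(int(item.value))
        if item.number == AS_FLOAT:
            # Reserved tag: encode as float, do not emit the tag itself.
            return _encode_float(float(item.value))
        # Any other tag goes on the wire as usual.
        return _head(6, item.number) + encode(item.value)
    if isinstance(item, bool):
        return b"\xf5" if item else b"\xf4"
    if isinstance(item, int):
        return _encode_int(item)
    if isinstance(item, float):
        return _encode_float(item)
    raise TypeError(f"unsupported type: {type(item).__name__}")
```

Here `encode(Tag(AS_INT, 1.0))` produces the single byte `01`, while
`encode(Tag(765, 1))` keeps the tag on the wire (`d9 02 fd 01`), which is
the extra cost the application-level workaround carries.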


The choices CBOR made are useful ones because it stays expressive and
compact, and because choices can be made at the data producer side,
where the information is.

Identifying more and more items with each other is a valid choice an
application can make, down to "everything is a string". That may be a
suitable choice for some application, but it is nothing I'd recommend a
high-level tool (such as even Gordian, let alone dCBOR which presents
itself as useful beyond Gordian) should do. For example, YANG (also a
high-level tool) did that, and now there is considerable effort needed
to make it usable in concise representations once more [yangcbor].


# [major] We're not in actual maths

Rational numbers are a true compatible superset of the integer numbers
mathematically, and thus an actual extension: All operations defined on
integers, when applied in the extension set, behave identically. Being
an extension is a useful property when types are unified, as it allows
applications that operate on a subset to retain their mindset.

However, we're not working with mathematical integers but with bounded
integers, and we're not working with mathematical rationals but floats.
Properties such as (a + b) + c = a + (b + c) hold for bounded integers
as long as either result is defined, a property which is lost when
extending to floats (for example, with a = 2^63 and b = c = 2^10).

This and related properties make float not a good extension for
integers.
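
The associativity example can be checked directly. The sketch below
assumes Python, whose `int` is unbounded (standing in for bounded
integers while results stay below 2^64) and whose `float` is IEEE 754
binary64 on common platforms. At 2^63 the spacing between adjacent
binary64 values is 2^11, so adding 2^10 is a tie that rounds back to
2^63.

```python
a, b, c = 2**63, 2**10, 2**10

# Integers: both groupings agree, and the result still fits in 64 bits.
int_left = (a + b) + c
int_right = a + (b + c)

# binary64 floats: each 2^10 added on its own is rounded away,
# whereas 2^10 + 2^10 = 2^11 is exactly one spacing step at 2^63.
fa, fb, fc = float(a), float(b), float(c)
float_left = (fa + fb) + fc
float_right = fa + (fb + fc)

print(int_left == int_right)
print(float_left == float_right)
```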

For comparison, the unification CBOR chose in its basic generic model
(unifying 8-bit up to 64-bit integers) has this property, as has the
extension it offers in tags 2 and 3.


# [minor] Incompatibility with existing models

There is one extended data model provided in CBOR already that extends
the integers: Applications that opt in to the use of tags 2 and 3 and
accept the extended data model that comes with it have an indefinite
range of integers available, of which everything in [-2^64, 2^64) is
encoded as major type 0/1.

This extension also picks a different route than original CBOR
by unifying the non-big integers with a different type, thus forcing
applications to use at most one of them. (Cf. RDF semantics entailment
regimes: this roughly correlates to creating non-monotonic extensions
that are thus incompatible [rdfextensions]).


# [minor] Small weird details

* The float-to-int conversion only affects half the negative integers.
  The values -(2.0^64) (float) and -(2^64) (integer) can both be
  expressed in dCBOR independently. (So is it now so important to not
  have any items with identical values between float and int or not?)

  This may be a remnant of earlier editing stages.

* Simple values other than the known ones are excluded, but tags are not
  excluded.

  Simple values and tags are very similar in that they are the
  extension points of CBOR, differing in their number (256 v. 2^64) and
  whether or not they carry data. Other than that, they are pretty much
  the same. Why rule out the one and not the other?
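
To illustrate the first point in the list above (the negative-integer
asymmetry), here is a minimal Python sketch. The reduction range
[-2^63, 2^64-1] is an assumption made for this illustration, not a quote
from the dCBOR draft; the range of major types 0 and 1 is the one from
RFC 8949. The function and constant names are hypothetical.

```python
# Assumed numeric-reduction range (an assumption, see the lead-in).
REDUCE_MIN = -(2**63)
REDUCE_MAX = 2**64 - 1

# Range of integers encodable in CBOR major types 0 and 1 (RFC 8949).
CBOR_INT_MIN = -(2**64)
CBOR_INT_MAX = 2**64 - 1


def numeric_reduce(x: float):
    """Return an int if x is integral and inside the assumed range, else x."""
    if x.is_integer() and REDUCE_MIN <= x <= REDUCE_MAX:
        return int(x)
    return x


reduced = numeric_reduce(-(2.0**63))    # inside the range: becomes an int
unreduced = numeric_reduce(-(2.0**64))  # outside the range: stays a float
```

Under that assumption, the float -(2.0^64) survives reduction while the
integer -(2^64) is still encodable as major type 1, so the same
mathematical value has two distinct representations.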

# [editorial] Mix of model and serialization

RFC8949 describes the basic generic data model, and many extended data
models described by tags. CDE defines a subset of the allowed
serializations of CBOR, even a canonical representation. The application
profiles CDE describes are subsets of the basic generic data model (by
ruling out some construction or unifying them, thereby ruling out
choices), creating an extended generic data model, but AIU they do not
interact with the serialization any more than just by referencing that
there is this serialization to be used.

dCBOR talks about encoders and decoders, which from the context
perform not only the conversion between the extended data model and the
basic data model, but also do serialization in the same step (hinted at eg. by
"MUST validate that encoded CBOR conforms to the requirements of [CDE]"
-- as CDE does not touch the data model but only operates on the
encoding).

The text in CDE admits that the distinction between the CBOR processing
and the Application Profile is a conceptual one, and may be combined --
but AIU that admission is for implementations, not for specifications.
As it is, dCBOR restates a lot of what is entailed by "we use this model
together with CDE" in normative text. I'm fine with such a document
cautioning against duplicate map keys (as they could otherwise be
produced by numeric reduction), but that is a note to implementers;
specification-wise it follows from applying the numeric reduction.

The way I see it, the application profile would describe a piece of CDDL
of what is valid, in our case like this:

```cddl
AnyDCBOR = [* AnyDCBOR ] / {* AnyDCBOR => AnyDCBOR }
    / int
    / bstr / tstr
    / float .within FloatNotInt
    / #6(AnyDCBOR)
    / simpleDCBOR

simpleDCBOR = bool / null

FloatNotInt = ... ; it's a mouthful to write down, but having seen UTF-8
                  ; characterized in ABNF I have little doubt it is
                  ; possible
```

... along with rules (in prose while CDDL 2.0 is on its way) on how to
map Any to AnyDCBOR if the item is not already in the allowed set.

From that, encoding in binary is then just a matter of a single
normative reference to CDE.

This mix between CDE and application profile has been addressed in the
text of tag 201, but is prevalent in the rest of the text.




I hope this helps enhance the document, or (fingers crossed) even spins
off discussion that leads to Gordian not requiring anything beyond CDE.

Christian (as individual)

[rdfextensions]: https://www.w3.org/TR/rdf12-semantics/#extensions
[yangcbor]: https://www.ietf.org/archive/id/draft-bormann-cbor-yang-standin-00.html

-- 
To use raw power is to make yourself infinitely vulnerable to greater powers.
  -- Bene Gesserit axiom