[Cbor] Re: Soliciting unresolved points around dCBOR
Christian Amsüss <christian@amsuess.com> Mon, 17 June 2024 12:01 UTC
Archived-At: <https://mailarchive.ietf.org/arch/msg/cbor/AVOZfhgeU7LuwTWWIPw1HmZaJ7Q>
Hello dCBOR authors,
(Wolf: Yes that thread was started to get my own hat-off items into
context; took a little longer to write).
As a whole, dCBOR looks to me like something one can do with CBOR, but
nothing I would prefer to endorse. I've held back with comments
previously as there were other ongoing discussions, but as they don't go
anywhere in particular, I'm listing here the particular aspects I think
are problematic (all about the numeric reduction), along with more
editorial comments.
Note that I'm curious to look into Gordian for the expression of graphs
(not really the elision topic: in the constrained space I'm working in,
the nonces required to do effective redaction outweigh the data
transported, to the point where it's easier to produce a signed excerpt
than a revealable data structure). The numeric reduction applied for it
is what so far has kept me from digging into it deeper to evaluate
whether it is applicable (because I'm unlikely to use it with numeric
reduction in place) -- I'll have a look now anyway because understanding
tag 201 will need that.
# [major] Reducing the usefulness of the model
(Sorry for the long text, but I think the context is necessary to get
the point.)
When there are similar data types, environments can make choices in
whether to distinguish them or not. For example:
* floats may or may not be unified with integers
(CBOR distinguishes them, which inconveniences JavaScript users)
* integers of different length may be unified
(CBOR does not distinguish them, which inconveniences those who
attempt to decode every possible CBOR value into the i65 data type no
language has)
* ASCII characters may be unified with u8, particularly in arrays
(CBOR distinguishes between byte strings and arrays that consist only
of u8, which inconveniences Rust users)
* Byte strings may be unified with text strings
(CBOR distinguishes between them, which would have inconvenienced
users of ancient Python versions)
All the CBOR distinctions mentioned above are in the basic generic data
model of CBOR. Extended generic data models are created when an
application opts in to the use of some special values or tags. This can
shift CBOR's value space from those extension points into dedicated
types:
* true and false introduce booleans from the simple value space
(distinct from the integers, to the minor inconvenience of Python
users to date when they are surprised by isinstance(True, int))
* bignums (tag 2/3) extend the integer space (to the convenience of
Python users; nobody is really inconvenienced because it's not worse
than the i65)
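
To illustrate the boolean point in Python: a generic encoder that dispatches on native types has to be careful about the order of its checks, because bool is a subclass of int. The following is only a sketch; the function names are made up for this example and do not come from any CBOR library.

```python
def classify_naive(value):
    """Naive dispatch: checks int first, so booleans are swallowed."""
    if isinstance(value, int):  # True and False land here, bool subclasses int
        return "integer"
    if isinstance(value, bool):
        return "boolean"
    return "other"


def classify(value):
    """Dispatch that keeps the CBOR distinction: check bool before int."""
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, int):
        return "integer"
    return "other"


print(isinstance(True, int))   # True
print(classify_naive(True))    # integer
print(classify(True))          # boolean
```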
Generalizing, trouble arises when all of the following apply:
* the language is dynamically typed enough to have a "native"
representation of arbitrary CBOR (or has mechanisms to emulate this,
such as Rust's traits)
* that native representation's choices of where to distinguish are not
1:1 aligned with CBOR's (AFAICT, they never are)
* the user chooses to pass the data to the serializer without an
application specific schema or application specific encoding rules
that explain the intention
The trouble comes in two directions:
(a) the native model is stricter than the extended generic data model:
Encoding is trivial. At decoding time, the application decides at
use time which of the versions it uses.
(This is what dCBOR entails for languages that distinguish floats
and integers.)
(b) the native model is more lax than the extended generic data model:
Decoding is trivial. At encoding time, the application needs to pass
extra data on to the encoder about which choice to make.
(This is what regular CBOR users face in JavaScript, spreadsheets
and other languages with a weak float-int distinction.)
While for implementations it may appear to be more convenient to
implement (a) because the API boundary the data traverses is defined
in the extended generic data model (e.g. if the item is used to do a
string repetition, values with non-zero fractional parts are
rejected in the application), (b) places the onus of deciding on the
sender of the data, which generally has more information available on
the data itself than the receiver.
As for how to implement (b), it may appear as if this requires
extending the encoder API, but that is not necessarily so: If an
encoder has all the parts needed to generate generic CBOR except some
missing distinction, it is always possible for the encoder to send tags:
The software defines tags that the user of the encoding API can wrap
their data items in to express the encoding intent -- say, 1234567(1.0)
to encode as an integer and 7654321(1.0) to encode as a float. As those
tags are reserved for the encoder, the encoder recognizes them and
performs the right encoding without emitting the tags. By comparison,
if an application indeed needs to make a distinction, it can also use
tags with a more weakly typed CBOR application profile -- but then those
tags need to be sent on the wire. (For example, a strongly typed
application built on dCBOR may use 765(1) to express that it really
means a float, but knows that the serializer will turn the float to an
integer before it arrives at the receiver).
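
The encoder-reserved tags described above can be sketched as follows. This is a minimal illustration, not an existing encoder's API: the `Tagged` wrapper, the function names and the constant names are hypothetical, the tag numbers are the ones from the example in the text, and floats are always written as binary64 for brevity (a real encoder would pick the shortest float form).

```python
import struct
from dataclasses import dataclass

# Hypothetical tags reserved for the encoder API; they never go on the wire.
AS_INT = 1234567
AS_FLOAT = 7654321


@dataclass
class Tagged:
    tag: int
    value: float


def encode_head(major: int, arg: int) -> bytes:
    """Encode a CBOR head using the shortest argument form."""
    if arg < 24:
        return bytes([(major << 5) | arg])
    for info, fmt, limit in ((24, ">B", 1 << 8), (25, ">H", 1 << 16),
                             (26, ">I", 1 << 32), (27, ">Q", 1 << 64)):
        if arg < limit:
            return bytes([(major << 5) | info]) + struct.pack(fmt, arg)
    raise ValueError("argument does not fit in 64 bits")


def encode_number(item: Tagged) -> bytes:
    """Consume the API-level tag and emit only the number itself."""
    if item.tag == AS_INT:
        if not float(item.value).is_integer():
            raise ValueError("value has a fractional part")
        n = int(item.value)
        # Major type 0 for non-negative, major type 1 (argument -1 - n) otherwise.
        return encode_head(0, n) if n >= 0 else encode_head(1, -1 - n)
    if item.tag == AS_FLOAT:
        # Simplification: always binary64 (initial byte 0xfb).
        return b"\xfb" + struct.pack(">d", item.value)
    raise ValueError("not an encoder-reserved tag")


print(encode_number(Tagged(AS_INT, 1.0)).hex())    # 01
print(encode_number(Tagged(AS_FLOAT, 1.0)).hex())  # fb3ff0000000000000
```

In both cases the sender makes the choice and the tag itself is never serialized, which is the difference to the 765(1) example, where the tag has to travel on the wire.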
The choices CBOR made are useful ones because it stays expressive and
compact, and because choices can be made at the data producer side
where the information is.
Identifying more and more items with each other is a valid choice an
application can make, down to "everything is a string". That may be a
suitable choice for some application, but it is nothing I'd recommend a
high-level tool (such as even Gordian, let alone dCBOR which presents
itself as useful beyond Gordian) should do. For example, YANG (also a
high-level tool) did that, and now there is considerable effort needed
to make it usable in concise representations once more [yangcbor].
# [major] We're not in actual maths
Mathematically, the rational numbers are a true compatible superset of
the integers, and thus an actual extension: All operations defined on
integers, when applied in the extension set, behave identically. Being
an extension is a useful property when types are unified, as it allows
applications that operate on a subset to retain their mindset.
However, we're not working with mathematical integers but with bound
integers, and we're not working with mathematical rationals but floats.
Properties such as (a + b) + c = a + (b + c) hold for bound integers as
long as either result is defined, a property which is lost extending to
floats. (In this example, for a = 2^63, b = c = 2^10).
This and related properties make float not a good extension for
integers.
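
The associativity example can be checked quickly in Python. This is a sketch under the assumption that Python's float is an IEEE 754 binary64 with round-to-nearest-even, which holds on common platforms; Python's integers are unbounded, so the integer side only shows that the result still fits into 64 bits.

```python
a, b, c = 2**63, 2**10, 2**10

# Integers: both groupings agree, and the sum still fits an unsigned 64-bit value.
int_left = (a + b) + c
int_right = a + (b + c)
assert int_left == int_right < 2**64

# Floats: adjacent binary64 values near 2^63 are 2^11 apart, so adding 2^10
# is a tie that rounds back to 2^63, whereas adding 2^11 in one go is exact.
fa, fb, fc = float(a), float(b), float(c)
float_left = (fa + fb) + fc
float_right = fa + (fb + fc)

print(float_left == float_right)            # False
print(int(float_right) - int(float_left))   # 2048
```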
For comparison, the unification CBOR chose in its basic generic model
(unifying 8-bit up to 64-bit integers) has this property, as has the
extension it offers in tags 2 and 3.
# [minor] Incompatibility with existing models
There is one extended data model provided in CBOR already that extends
the integers: Applications that opt in to the use of tags 2 and 3 and
accept the extended data model that comes with it have an indefinite
range of integers available, of which everything in [-2^64, 2^64) is
encoded as major type 0 or 1.
This extension also picks a different route than original CBOR
by unifying the non-big integers with a different type, thus forcing
applications to use at most one of them. (Cf. RDF semantics entailment
regimes: this roughly correlates to creating non-monotonic extensions
that are thus incompatible [rdfextensions]).
# [minor] Small weird details
* The float-to-int conversion only affects half the negative integers. The
values -(2.0^64) and -(2^64) can both be expressed in dCBOR
independently. (So is it now so important to not have any items with
identical values between float and int or not?)
This may be a remnant of earlier editing stages.
* Simple values other than the known ones are excluded, but tags are not
excluded.
Simple values and tags are very similar in that they are the
extension points of CBOR, differing in their number (256 v. 2^64) and
whether or not they carry data. Other than that, they are pretty much
the same. Why rule out the one and not the other?
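
To make the first point above concrete, here is a sketch of a numeric reduction as I read it. The function name is hypothetical, and the range [-2^63, 2^64) is my assumption about where dCBOR converts integral floats to integers; it is not quoted from the document.

```python
def reduce_number(x: float):
    """Assumed numeric reduction: integral floats in [-2^63, 2^64) become ints."""
    if x.is_integer() and -2.0**63 <= x < 2.0**64:
        return int(x)
    return x


# Inside the assumed range the float collapses into the integer.
print(type(reduce_number(-2.0**63)).__name__)  # int

# Below it the float survives, although the integer -(2^64) is still
# encodable as major type 1, so both items can exist side by side.
print(type(reduce_number(-2.0**64)).__name__)  # float
```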
# [editorial] Mix of model and serialization
RFC8949 describes the basic generic data model, and many extended data
models described by tags. CDE defines a subset of the allowed
serializations of CBOR, even a canonical representation. The application
profiles CDE describes are subsets of the basic generic data model (by
ruling out some construction or unifying them, thereby ruling out
choices), creating an extended generic data model, but AIU they do not
interact with the serialization any more than just by referencing that
there is this serialization to be used.
dCBOR talks about encoders and decoders, which from the context
perform the conversion between the extended data model and the basic
data model, but also do serialization in the same step (hinted at e.g. by
"MUST validate that encoded CBOR conforms to the requirements of [CDE]"
-- as CDE does not touch the data model but only operates on the
encoding).
The text in CDE admits that the distinction between the CBOR processing
and the Application Profile is a conceptual one, and may be combined --
but AIU that admission is for implementations, not for specifications.
As it is, dCBOR restates a lot of what is entailed by "we use this model
together with CDE" in normative text. I'm fine with such a document
cautioning against duplicate map keys (as they could otherwise be
produced by numeric reduction), but that is a note to implementers;
specification-wise it follows from applying the numeric reduction.
The way I see it, the application profile would describe a piece of CDDL
of what is valid, in our case like this:
```cddl
AnyDCBOR = [* AnyDCBOR ] / {* AnyDCBOR => AnyDCBOR }
/ int
/ bstr / tstr
/ float .within FloatNotInt
/ #6(AnyDCBOR)
/ simpleDCBOR
simpleDCBOR = bool / null
FloatNotInt = ... ; it's a mouthful to write down, but having seen UTF-8
                  ; characterized in ABNF I have little doubt it is
                  ; possible
```
... along with rules (in prose while CDDL 2.0 is on its way) on how to
map Any to AnyDCBOR if the item is not already in the allowed set.
From that, encoding in binary is then just a matter of a single
normative reference to CDE.
This mix between CDE and application profile has been addressed in the
text of tag 201, but is prevalent in the rest of the text.
I hope this helps enhance the document, or (fingers crossed) even spins
off discussion that leads to Gordian not requiring anything beyond CDE.
Christian (as individual)
[rdfextensions]: https://www.w3.org/TR/rdf12-semantics/#extensions
[yangcbor]: https://www.ietf.org/archive/id/draft-bormann-cbor-yang-standin-00.html
--
To use raw power is to make yourself infinitely vulnerable to greater powers.
-- Bene Gesserit axiom