[Cbor] Re: I-D Action: draft-mcnally-deterministic-cbor-10.html

Wolf McNally <wolf@wolfmcnally.com> Wed, 31 July 2024 08:17 UTC

From: Wolf McNally <wolf@wolfmcnally.com>
In-Reply-To: <0a1f0006-5237-4498-8250-fdfdd9aa685a@gmail.com>
Date: Wed, 31 Jul 2024 01:17:43 -0700
Content-Transfer-Encoding: quoted-printable
Message-Id: <0E80EFF6-9FB7-4AF3-A122-DD23FF26746E@wolfmcnally.com>
References: <5504ABFA-33AA-4265-9006-834DD263B9D8@island-resort.com> <4118A3AE-E1F1-4311-B5D8-B2A09AD822F4@wolfmcnally.com> <cc786eb5-bd5c-404d-a2b2-716c64b4acc7@gmail.com> <E21D4EEA-2C6E-4444-BB98-27D22A63F6C9@tzi.org> <0a1f0006-5237-4498-8250-fdfdd9aa685a@gmail.com>
To: Anders Rundgren <anders.rundgren.net@gmail.com>
CC: Carsten Bormann <cabo@tzi.org>, CBOR <cbor@ietf.org>, Christopher Allen <christophera@lifewithalacrity.com>, Shannon Appelcline <shannon.appelcline@gmail.com>
Subject: [Cbor] Re: I-D Action: draft-mcnally-deterministic-cbor-10.html
List-Id: "Concise Binary Object Representation (CBOR)" <cbor.ietf.org>
Archived-At: <https://mailarchive.ietf.org/arch/msg/cbor/MpnZX94Zsv4bk7rWhx485UhjRU4>

Anders,

> On Jul 30, 2024, at 11:52 PM, Anders Rundgren <anders.rundgren.net@gmail.com> wrote:
> 
> The only thing that remains clear is that anybody wanting to understand the rationale for numeric reduction would have to wade through literally hundreds of pretty voluminous messages, typically mixing things that are actually unrelated, like "you can build dCBOR on top of CDE".

It’s really not that hard. Today I had GPT-4o explain it to me from first principles. If it can make sense of this, I think you can.


Me: According to proffered schemes for JSON canonicalization, what is the canonical form of the numeric value `2.0`?

GPT: In JSON canonicalization schemes, the canonical form of the numeric value `2.0` is typically represented as `2`. This representation removes any unnecessary decimal points or trailing zeros, adhering to the principle of representing numbers in their simplest form. Thus, `2.0` becomes `2` in the canonicalized JSON.
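
To make the `2.0` to `2` rule concrete, here is a minimal Python sketch. It covers only the integral-value case discussed above and is not a full implementation of any canonicalization scheme; the function name `canonical_number` and the `1e21` cutoff (the point where ECMAScript-style number serialization switches to exponent notation) are assumptions chosen for illustration.

```python
import json


def canonical_number(text: str) -> str:
    """Return a reduced textual form for a single JSON number (sketch only).

    Assumption: integral floats below 1e21 in magnitude are written without
    a fraction, in the spirit of ECMAScript-style number serialization.
    Other values are passed through Python's own serializer, which a real
    canonicalizer would not do.
    """
    value = json.loads(text)
    if isinstance(value, float) and value.is_integer() and abs(value) < 1e21:
        # 2.0 -> "2", 1e3 -> "1000", -0.0 -> "0"
        return str(int(value))
    return json.dumps(value)


if __name__ == "__main__":
    for sample in ("2.0", "2", "1e3", "2.5"):
        print(sample, "->", canonical_number(sample))
```

With this sketch, `2.0` and `2` both serialize to the same text, while `2.5` is left alone.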


Me: Why does JSON canonicalization uphold the idea of `2.0` being the same as `2`? 

GPT: JSON canonicalization treats `2.0` as equivalent to `2` because it aims to standardize data representation in a way that eliminates ambiguity and redundancy. The key reasons include:

1. **Data Consistency**: By removing unnecessary decimal points and zeros, canonicalization ensures that equivalent values are represented consistently, reducing the risk of mismatches or errors when comparing data.

2. **Simplicity**: The principle of representing numbers in their simplest form means that there is no distinction between `2` and `2.0` in the JSON format. This simplifies processing and parsing, as the same value is always represented the same way.

3. **Interoperability**: JSON is widely used for data interchange, and canonicalization helps ensure that different systems interpret and process the data consistently. By defining a single representation for numbers, it facilitates interoperability across various platforms and programming languages.

4. **Avoiding Semantic Ambiguity**: In many programming languages and systems, there is no meaningful difference between `2` and `2.0`—both represent the same numeric value. Canonicalization reflects this by treating them as identical, avoiding any potential confusion over their meaning.

These reasons contribute to the decision to represent `2.0` as `2` in JSON canonicalization, promoting clarity and uniformity in data representation.


Me: Setting aside the `number` type in JavaScript, and just considering JSON as a method for data interchange among heterogeneous systems, how would you respond to the argument that `2.0` and `2` should be kept distinct because one represents a floating point value and the other represents an integer value?

GPT: While the argument that `2.0` (floating point) and `2` (integer) should be distinct has technical merit in some contexts, it overlooks the primary purpose of JSON as a data interchange format designed for simplicity and broad compatibility among heterogeneous systems. Here's why maintaining the distinction may not be necessary or beneficial in the context of JSON:

1. **Language and System Agnosticism**: JSON is designed to be a lightweight, text-based format that can be easily parsed and generated by a wide range of programming languages and systems. Many of these languages do not differentiate between `2` and `2.0` when parsing JSON, treating both as the same numeric value. This standardization minimizes compatibility issues and simplifies data interchange.

2. **Data Normalization**: By normalizing numbers to their simplest form (e.g., converting `2.0` to `2`), JSON canonicalization helps ensure consistent data representation. This is crucial for tasks like digital signatures and data comparison, where any difference in representation, even if semantically insignificant, could lead to false mismatches or verification failures.

3. **Pragmatic Considerations**: In practical terms, the distinction between integers and floating-point numbers is often more relevant within specific computational contexts rather than at the level of data interchange. JSON’s purpose is to facilitate the exchange of data, not to enforce specific numerical types. Systems receiving the data can interpret it according to their own needs and constraints.

4. **Precision and Accuracy**: For most use cases in JSON, the precision and accuracy differences between integers and floating-point representations are not significant enough to warrant maintaining a distinction.

In summary, while there is a theoretical basis for distinguishing `2.0` from `2`, JSON prioritizes simplicity and universal accessibility. The distinction between integer and floating-point numbers is not preserved in JSON canonicalization to maintain broad compatibility and avoid unnecessary complexity in data interchange.
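
The point about digital signatures and false mismatches can be illustrated with a short Python sketch. The helper names `same_value` and `same_digest` and the sample document are hypothetical; SHA-256 stands in for whatever hash a signature scheme would actually use.

```python
import hashlib
import json


def same_value(a: str, b: str) -> bool:
    """True if two JSON texts parse to equal values."""
    return json.loads(a) == json.loads(b)


def same_digest(a: str, b: str) -> bool:
    """True if two JSON texts hash identically as raw UTF-8 bytes."""
    digest_a = hashlib.sha256(a.encode("utf-8")).hexdigest()
    digest_b = hashlib.sha256(b.encode("utf-8")).hexdigest()
    return digest_a == digest_b


if __name__ == "__main__":
    with_fraction = '{"amount": 2.0}'
    without_fraction = '{"amount": 2}'
    # Equal once parsed, yet different on the wire, so a signature computed
    # over one text would not verify against the other.
    print(same_value(with_fraction, without_fraction))
    print(same_digest(with_fraction, without_fraction))
```

The two texts compare equal as parsed data but not as bytes, which is the mismatch a single canonical form is meant to remove.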


Me: Are you familiar with CBOR/CDE and dCBOR in particular? Search if you’re not sure.

GPT: *Searched 6 sites* CBOR/CDE (Concise Binary Object Representation/Common Deterministic Encoding) and dCBOR (Deterministic CBOR) are formats designed for efficient and deterministic data serialization. The primary goal of dCBOR is to ensure that semantically equivalent data items produce identical byte streams, which is crucial for applications that rely on data integrity verification, such as cryptographic systems.

Specifically, dCBOR requires that floating point values without a fractional component (like `2.0`) must be encoded as integers (`2`) to reduce redundancy and ensure uniformity across different systems.

The distinction between integers and floating point numbers is thus collapsed in dCBOR when a floating point value can be exactly represented as an integer. This decision is based on several considerations:

1. **Deterministic Encoding**: Ensuring consistent representation simplifies the verification process, as identical data must result in identical encoding.

2. **Efficiency**: Using the reduced form generally reduces data size and enhances transmission efficiency.

3. **Practicality**: Most applications do not distinguish between `2.0` and `2` in terms of data processing. This approach simplifies implementations across platforms and ensures compatibility.

By adhering to these principles, dCBOR ensures a robust and consistent encoding mechanism suitable for environments requiring precise data integrity, such as digital signatures and secure communications.
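
As a rough illustration of numeric reduction at the byte level, the following Python sketch encodes a single number to CBOR. It is a simplified reading of the rule described above, not the dCBOR draft's reference algorithm: the function names are hypothetical, the reduction range of -2^63 through 2^64-1 is my understanding of the draft and should be checked against it, and Python integers passed in are assumed to fit CBOR's 64-bit argument.

```python
import math
import struct


def encode_head(major: int, n: int) -> bytes:
    """Shortest-form CBOR head for a major type and unsigned argument."""
    if n < 24:
        return bytes([(major << 5) | n])
    if n < 1 << 8:
        return bytes([(major << 5) | 24, n])
    if n < 1 << 16:
        return bytes([(major << 5) | 25]) + n.to_bytes(2, "big")
    if n < 1 << 32:
        return bytes([(major << 5) | 26]) + n.to_bytes(4, "big")
    return bytes([(major << 5) | 27]) + n.to_bytes(8, "big")


def encode_int(n: int) -> bytes:
    """Major type 0 for non-negative integers, major type 1 for negative."""
    return encode_head(0, n) if n >= 0 else encode_head(1, -1 - n)


def encode_float_shortest(x: float) -> bytes:
    """Use the shortest of half, single, double that round-trips exactly."""
    for fmt, prefix in ((">e", 0xF9), (">f", 0xFA)):
        try:
            packed = struct.pack(fmt, x)
        except (OverflowError, struct.error):
            continue  # value does not fit this width
        if struct.unpack(fmt, packed)[0] == x:
            return bytes([prefix]) + packed
    return b"\xfb" + struct.pack(">d", x)


def encode_number(x) -> bytes:
    """Encode a number, reducing integral floats to integers (sketch)."""
    if isinstance(x, int):
        return encode_int(x)
    if math.isnan(x):
        return b"\xf9\x7e\x00"  # single canonical NaN
    if math.isfinite(x) and x.is_integer() and -(2**63) <= x < 2**64:
        return encode_int(int(x))  # numeric reduction: 2.0 becomes 2
    return encode_float_shortest(x)


if __name__ == "__main__":
    for sample in (2, 2.0, 2.5, -1.0, 1.1):
        print(repr(sample), encode_number(sample).hex())
```

Under these assumptions `2` and `2.0` both encode to the single byte `02`, while `2.5` stays a half-precision float (`f94100`).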


Me: Would you mount a defense of dCBOR’s numeric reduction similar to what you used for JSON canonicalization of numeric values?

GPT: Yes, a defense of dCBOR's numeric reduction can be mounted on similar grounds as the justification for JSON canonicalization of numeric values. Here are the key points:

1. **Deterministic Encoding**: The primary goal of dCBOR's numeric reduction is to ensure that data with the same semantic value is represented identically in binary form. This deterministic encoding is critical for systems that require data integrity checks, such as digital signatures or cryptographic operations, where even a slight difference in encoding would result in different hash values, breaking verification processes.

2. **Simplicity and Efficiency**: Numeric reduction can reduce data size in many cases, making data transmission and storage more efficient. By encoding values like `2.0` as `2`, dCBOR eliminates unnecessary data, which simplifies the decoding process and reduces the possibility of errors due to inconsistent representations. This also aligns with the practical consideration that, in most applications, the distinction between `2` and `2.0` is irrelevant.

3. **Consistency Across Platforms**: Different systems and languages may handle floating-point and integer types differently. By standardizing on a single representation for values that can be expressed both as integers and floating-point numbers, dCBOR ensures consistency across heterogeneous systems. This uniformity is particularly important in distributed systems and environments where data needs to be reliably compared or combined from multiple sources.

4. **Avoiding Ambiguity**: The distinction between `2` and `2.0` might suggest a difference in data type (integer vs. floating-point), but in the context of serialization and data interchange, this distinction often lacks practical significance. By reducing numbers to their simplest form, dCBOR avoids ambiguity and ensures that all parties interpret the data the same way.

Overall, dCBOR's approach to numeric reduction enhances interoperability, data integrity, and efficiency, making it a robust choice for applications where determinism and consistency are paramount. This rationale closely mirrors the reasoning behind JSON canonicalization, where simplicity and universal understanding are prioritized over maintaining distinctions that may not be meaningful in the context of data interchange.


Anders, if you have any follow-up questions for GPT I’d be happy to forward them for you.

~ Wolf