[Cbor] Re: I-D Action: draft-mcnally-deterministic-cbor-10.html

Anders Rundgren <anders.rundgren.net@gmail.com> Wed, 31 July 2024 14:25 UTC

Message-ID: <d0507b0e-fb33-46c2-8517-211a4e7f4924@gmail.com>
Date: Wed, 31 Jul 2024 16:25:02 +0200
MIME-Version: 1.0
User-Agent: Mozilla Thunderbird
To: Wolf McNally <wolf@wolfmcnally.com>
References: <5504ABFA-33AA-4265-9006-834DD263B9D8@island-resort.com> <4118A3AE-E1F1-4311-B5D8-B2A09AD822F4@wolfmcnally.com> <cc786eb5-bd5c-404d-a2b2-716c64b4acc7@gmail.com> <E21D4EEA-2C6E-4444-BB98-27D22A63F6C9@tzi.org> <0a1f0006-5237-4498-8250-fdfdd9aa685a@gmail.com> <0E80EFF6-9FB7-4AF3-A122-DD23FF26746E@wolfmcnally.com>
Content-Language: en-US
From: Anders Rundgren <anders.rundgren.net@gmail.com>
In-Reply-To: <0E80EFF6-9FB7-4AF3-A122-DD23FF26746E@wolfmcnally.com>
Content-Type: text/plain; charset="UTF-8"; format="flowed"
Content-Transfer-Encoding: 8bit
CC: Carsten Bormann <cabo@tzi.org>, CBOR <cbor@ietf.org>, Christopher Allen <christophera@lifewithalacrity.com>, Shannon Appelcline <shannon.appelcline@gmail.com>
Subject: [Cbor] Re: I-D Action: draft-mcnally-deterministic-cbor-10.html
List-Id: "Concise Binary Object Representation (CBOR)" <cbor.ietf.org>
Archived-At: <https://mailarchive.ietf.org/arch/msg/cbor/BUxhDwPH-kHJ6WRXyKQBuKqvzxc>
List-Archive: <https://mailarchive.ietf.org/arch/browse/cbor>
List-Help: <mailto:cbor-request@ietf.org?subject=help>
List-Owner: <mailto:cbor-owner@ietf.org>
List-Post: <mailto:cbor@ietf.org>
List-Subscribe: <mailto:cbor-join@ietf.org>
List-Unsubscribe: <mailto:cbor-leave@ietf.org>

On 2024-07-31 10:17, Wolf McNally wrote:
> Anders,
> 
>> On Jul 30, 2024, at 11:52 PM, Anders Rundgren <anders.rundgren.net@gmail.com> wrote:
>>
>> The only thing that remains clear is that anybody wanting to understand the rationale for numeric reduction would have to wade through literally hundreds of pretty voluminous messages, typically mixing in things that are actually unrelated, like "you can build dCBOR on top of CDE".
> 
> It’s really not that hard. Today I had GPT-4o explain it to me from first principles. If it can make sense of this, I think you can.

Hi Wolf,

What can I say? A major reason for my strong aversion to numeric reduction (canonicalization) is my experience with JSON.

Most RFCs claiming to be based on JSON are in practice based on the crippled version known as I-JSON (RFC 7493).  This looks like a specification mismatch.

Officially, many IETFers consider JSON canonicalization an abomination. In practice the very same IETFers use it anyway, but without mentioning the offending C-word:
https://www.rfc-editor.org/rfc/rfc7638.html#section-3.3
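
For readers who have not looked at that section, here is a minimal sketch of the kind of processing it describes: keep only the required key members, order them lexicographically, drop all whitespace, and hash the result. The function names and the toy key are illustrative assumptions, not text from the RFC, and the sketch assumes ASCII member values.

```python
import base64
import hashlib
import json

# Required members per key type, as listed in RFC 7638, Section 3.2.
REQUIRED_MEMBERS = {
    "EC": ("crv", "kty", "x", "y"),
    "RSA": ("e", "kty", "n"),
    "oct": ("k", "kty"),
}


def thumbprint_input(jwk: dict) -> bytes:
    """Build the hash input: required members only, sorted, no whitespace."""
    names = REQUIRED_MEMBERS[jwk["kty"]]
    reduced = {name: jwk[name] for name in names}
    text = json.dumps(reduced, sort_keys=True, separators=(",", ":"))
    return text.encode("utf-8")


def jwk_thumbprint(jwk: dict) -> str:
    """SHA-256 over the hash input, base64url-encoded without padding."""
    digest = hashlib.sha256(thumbprint_input(jwk)).digest()
    return base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")


# Toy key with made-up coordinate values, for illustration only.
toy_key = {"kty": "EC", "kid": "ignored", "y": "BB", "x": "AA", "crv": "P-256"}
print(thumbprint_input(toy_key).decode("utf-8"))
print(jwk_thumbprint(toy_key))
```

Member order and optional members in the input key do not affect the result, which is exactly the property a canonical form is meant to provide.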

Been there, done that, never going back: https://github.com/cyberphone/cbor-everywhere

CONSISTENCY

It is true that treating 2 and 2.0 as equivalent is an example of consistent handling of numbers.

However, always treating integers as integers and floating point values as floating point values is also an example of consistency.
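
The two policies can be put side by side in a small sketch. This is not the normative algorithm of any draft: the function names are hypothetical, only integers and floats are handled, and NaN is left out. Both variants use the shortest CBOR float encoding that round-trips. They differ only in whether an integral float is turned into an integer first.

```python
import math
import struct


def encode_int(n: int) -> bytes:
    """Shortest-form CBOR integer (major type 0 or 1)."""
    major = 0 if n >= 0 else 1
    arg = n if n >= 0 else -1 - n
    if arg < 24:
        return bytes([(major << 5) | arg])
    for info, fmt, limit in ((24, ">B", 1 << 8), (25, ">H", 1 << 16),
                             (26, ">I", 1 << 32), (27, ">Q", 1 << 64)):
        if arg < limit:
            return bytes([(major << 5) | info]) + struct.pack(fmt, arg)
    raise ValueError("integer outside the 64-bit CBOR range")


def encode_float(x: float) -> bytes:
    """Shortest of half/single/double that preserves the value exactly."""
    if math.isnan(x):
        raise ValueError("NaN handling is out of scope for this sketch")
    for initial, fmt in ((0xF9, ">e"), (0xFA, ">f")):
        try:
            packed = struct.pack(fmt, x)
        except OverflowError:
            continue  # too large for this width, try the next one
        if struct.unpack(fmt, packed)[0] == x:
            return bytes([initial]) + packed
    return b"\xfb" + struct.pack(">d", x)


def encode_type_preserving(value) -> bytes:
    """Integers stay integers, floats stay floats."""
    return encode_float(value) if isinstance(value, float) else encode_int(value)


def encode_reduced(value) -> bytes:
    """Numeric reduction: integral floats in range become integers."""
    if (isinstance(value, float) and value.is_integer()
            and -(1 << 63) <= value <= (1 << 64) - 1):
        return encode_int(int(value))
    return encode_type_preserving(value)


for v in (2, 2.0, 2.5):
    print(repr(v), encode_type_preserving(v).hex(), encode_reduced(v).hex())
```

Each function is deterministic on its own. Under `encode_reduced`, `2` and `2.0` collapse to the same byte, while under `encode_type_preserving` they remain distinguishable on the wire.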


Anyway, it doesn't really matter what I think since you have decided to go with numeric reduction no matter what.


SCOPE CONSIDERATIONS

Finally, the scope of a solution may also impact choices.  My so-called "Enterprise Application" profile is designed for the market consisting of 90%+ of all software developers, while dCBOR (according to Laurence) targets a small niche.

Regards,
Anders


> 
> 
> Me: According to proffered schemes for JSON canonicalization, what is the canonical form of the numeric value `2.0`?
> 
> GPT: In JSON canonicalization schemes, the canonical form of the numeric value `2.0` is typically represented as `2`. This representation removes any unnecessary decimal points or trailing zeros, adhering to the principle of representing numbers in their simplest form. Thus, `2.0` becomes `2` in the canonicalized JSON.
> 
> 
> Me: Why does JSON canonicalization uphold the idea of `2.0` being the same as `2`?
> 
> GPT: JSON canonicalization treats `2.0` as equivalent to `2` because it aims to standardize data representation in a way that eliminates ambiguity and redundancy. The key reasons include:
> 
> 1. **Data Consistency**: By removing unnecessary decimal points and zeros, canonicalization ensures that equivalent values are represented consistently, reducing the risk of mismatches or errors when comparing data.
> 
> 2. **Simplicity**: The principle of representing numbers in their simplest form means that there is no distinction between `2` and `2.0` in the JSON format. This simplifies processing and parsing, as the same value is always represented the same way.
> 
> 3. **Interoperability**: JSON is widely used for data interchange, and canonicalization helps ensure that different systems interpret and process the data consistently. By defining a single representation for numbers, it facilitates interoperability across various platforms and programming languages.
> 
> 4. **Avoiding Semantic Ambiguity**: In many programming languages and systems, there is no meaningful difference between `2` and `2.0`—both represent the same numeric value. Canonicalization reflects this by treating them as identical, avoiding any potential confusion over their meaning.
> 
> These reasons contribute to the decision to represent `2.0` as `2` in JSON canonicalization, promoting clarity and uniformity in data representation.
> 
> 
> Me: Setting aside the `number` type in JavaScript, and just considering JSON as a method for data interchange among heterogeneous systems, how would you respond to the argument that `2.0` and `2` should be kept distinct because one represents a floating point value and the other represents an integer value?
> 
> GPT: While the argument that `2.0` (floating point) and `2` (integer) should be distinct has technical merit in some contexts, it overlooks the primary purpose of JSON as a data interchange format designed for simplicity and broad compatibility among heterogeneous systems. Here's why maintaining the distinction may not be necessary or beneficial in the context of JSON:
> 
> 1. **Language and System Agnosticism**: JSON is designed to be a lightweight, text-based format that can be easily parsed and generated by a wide range of programming languages and systems. Many of these languages do not differentiate between `2` and `2.0` when parsing JSON, treating both as the same numeric value. This standardization minimizes compatibility issues and simplifies data interchange.
> 
> 2. **Data Normalization**: By normalizing numbers to their simplest form (e.g., converting `2.0` to `2`), JSON canonicalization helps ensure consistent data representation. This is crucial for tasks like digital signatures and data comparison, where any difference in representation, even if semantically insignificant, could lead to false mismatches or verification failures.
> 
> 3. **Pragmatic Considerations**: In practical terms, the distinction between integers and floating-point numbers is often more relevant within specific computational contexts rather than at the level of data interchange. JSON’s purpose is to facilitate the exchange of data, not to enforce specific numerical types. Systems receiving the data can interpret it according to their own needs and constraints.
> 
> 4. **Precision and Accuracy**: For most use cases in JSON, the precision and accuracy differences between integers and floating-point representations are not significant enough to warrant maintaining a distinction.
> 
> In summary, while there is a theoretical basis for distinguishing `2.0` from `2`, JSON prioritizes simplicity and universal accessibility. The distinction between integer and floating-point numbers is not preserved in JSON canonicalization to maintain broad compatibility and avoid unnecessary complexity in data interchange.
> 
> 
> Me: Are you familiar with CBOR/CDE and dCBOR in particular? Search if you’re not sure.
> 
> GPT: *Searched 6 sites* CBOR/CDE (Concise Binary Object Representation/Common Deterministic Encoding) and dCBOR (Deterministic CBOR) are formats designed for efficient and deterministic data serialization. The primary goal of dCBOR is to ensure that semantically equivalent data items produce identical byte streams, which is crucial for applications that rely on data integrity verification, such as cryptographic systems.
> 
> Specifically, dCBOR requires that floating point values without a fractional component (like `2.0`) must be encoded as integers (`2`) to reduce redundancy and ensure uniformity across different systems.
> 
> The distinction between integers and floating point numbers is thus collapsed in dCBOR when a floating point value can be exactly represented as an integer. This decision is based on several considerations:
> 
> 1. **Deterministic Encoding**: Ensuring consistent representation simplifies the verification process, as identical data must result in identical encoding.
> 
> 2. **Efficiency**: Using the reduced form generally reduces data size and enhances transmission efficiency.
> 
> 3. **Practicality**: Most applications do not distinguish between `2.0` and `2` in terms of data processing. This approach simplifies implementations across platforms and ensures compatibility.
> 
> By adhering to these principles, dCBOR ensures a robust and consistent encoding mechanism suitable for environments requiring precise data integrity, such as digital signatures and secure communications.
> 
> 
> Me: Would you mount a defense of dCBOR’s numeric reduction similar to what you used for JSON canonicalization of numeric values?
> 
> GPT: Yes, a defense of dCBOR's numeric reduction can be mounted on similar grounds as the justification for JSON canonicalization of numeric values. Here are the key points:
> 
> 1. **Deterministic Encoding**: The primary goal of dCBOR's numeric reduction is to ensure that data with the same semantic value is represented identically in binary form. This deterministic encoding is critical for systems that require data integrity checks, such as digital signatures or cryptographic operations, where even a slight difference in encoding would result in different hash values, breaking verification processes.
> 
> 2. **Simplicity and Efficiency**: Numeric reduction can reduce data size in many cases, making data transmission and storage more efficient. By encoding values like `2.0` as `2`, dCBOR eliminates unnecessary data, which simplifies the decoding process and reduces the possibility of errors due to inconsistent representations. This also aligns with the practical consideration that, in most applications, the distinction between `2` and `2.0` is irrelevant.
> 
> 3. **Consistency Across Platforms**: Different systems and languages may handle floating-point and integer types differently. By standardizing on a single representation for values that can be expressed both as integers and floating-point numbers, dCBOR ensures consistency across heterogeneous systems. This uniformity is particularly important in distributed systems and environments where data needs to be reliably compared or combined from multiple sources.
> 
> 4. **Avoiding Ambiguity**: The distinction between `2` and `2.0` might suggest a difference in data type (integer vs. floating-point), but in the context of serialization and data interchange, this distinction often lacks practical significance. By reducing numbers to their simplest form, dCBOR avoids ambiguity and ensures that all parties interpret the data the same way.
> 
> Overall, dCBOR's approach to numeric reduction enhances interoperability, data integrity, and efficiency, making it a robust choice for applications where determinism and consistency are paramount. This rationale closely mirrors the reasoning behind JSON canonicalization, where simplicity and universal understanding are prioritized over maintaining distinctions that may not be meaningful in the context of data interchange.
> 
> 
> Anders, if you have any followup questions for GPT I’d be happy to forward them for you.
> 
> ~ Wolf
>