[Cbor] Re: dCBOR: Normalization of Strings

Carsten Bormann <cabo@tzi.org> Mon, 29 July 2024 15:24 UTC

From: Carsten Bormann <cabo@tzi.org>
In-Reply-To: <F04DCC66-CECA-4FFD-9AF1-C58F983A8EF1@cursive.net>
Date: Mon, 29 Jul 2024 17:24:49 +0200
Message-Id: <6CF47BC0-D918-43A8-826D-457AF6FFFB3B@tzi.org>
References: <E8325093-F005-4A56-8AFB-9C1637E19EA4@wolfmcnally.com> <F04DCC66-CECA-4FFD-9AF1-C58F983A8EF1@cursive.net>
To: Joe Hildebrand <hildjj@cursive.net>
CC: Wolf McNally <wolf@wolfmcnally.com>, CBOR <cbor@ietf.org>, Christopher Allen <christophera@lifewithalacrity.com>, Shannon Appelcline <shannon.appelcline@gmail.com>
Subject: [Cbor] Re: dCBOR: Normalization of Strings
List-Id: "Concise Binary Object Representation (CBOR)" <cbor.ietf.org>
Archived-At: <https://mailarchive.ietf.org/arch/msg/cbor/mdwlJllyq5wE6W00nNwbYenwq34>

Hi Joe,

thank you for this interesting information.

I would expect that, in most cases, text in applications is already available in NFC form.  Only where that is not the case would normalization actually be required; a high-performance application could rely on its data at rest already being in NFC.
Similarly, the receiving application likely needs to ingest the information in NFC form, so there is no way around generating the NFC for ingestion.
(In the environments dCBOR was designed for, normalization always occurs in the process of deriving signing inputs from data at rest, so it is not just map keys.)
This could pretty much nullify the advantages of NFD that you describe below.

Grüße, Carsten


> On 2024-07-29, at 16:08, Joe Hildebrand <hildjj@cursive.net> wrote:
> 
> 1) NFC always requires memory allocation even for checking correctness.  NFD is much better for this sort of thing, and now that we have real fonts and text renderers that can do composition, it's almost always better for transmission.
> 
> 2) Normalization might only need to be applied to map keys.  It's a LOT more expensive than you expect, particularly if you choose NFC.
> 
> 
> Explanation of 1) for those who need it:
> 
> The "C" in "NFC" stands for "composition".  It requires strings like "ä" to be stored as U+00E4 (LATIN SMALL LETTER A WITH DIAERESIS) rather than U+0061 (LATIN SMALL LETTER A) followed by U+0308 (COMBINING DIAERESIS).  But since there might be multiple combining characters in play, each of which must be in the correct order, the NFC algorithm is to first break U+00E4 into U+0061 U+0308, then sort all of the combining characters, then recombine everything that is possible.  There's no way to just iterate through a string and determine that it is correct without breaking things apart and recombining them.
> 
> NFD does the "break apart" and "order" steps of NFC, but does not do the recombination.  Therefore, when validating, you can iterate through a string and check whether anything can be broken apart or is an out-of-order combining character, without having to modify the string or allocate memory.  This turns out to be a MAJOR performance win in practice.  For this win to show up, you often need a validate() routine instead of doing "foo".normalize('NFD') === "foo", but it's at least possible.
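
The single-pass NFD check described above could look roughly like the following. This is a minimal sketch in Python rather than the JavaScript of the inline example; the name `is_nfd` is hypothetical, and it assumes that Python's `unicodedata.decomposition` does not report the algorithmic Hangul decompositions, which is why those are tested by range.

```python
import unicodedata


def is_nfd(text: str) -> bool:
    """Return True if text is already in NFD, scanning once without building a new string."""
    last_ccc = 0
    for ch in text:
        cp = ord(ch)
        # Precomposed Hangul syllables decompose algorithmically, so they are never NFD.
        if 0xAC00 <= cp <= 0xD7A3:
            return False
        decomp = unicodedata.decomposition(ch)
        # Canonical mappings carry no "<tag>" prefix; compatibility mappings do
        # and are irrelevant for NFD.
        if decomp and not decomp.startswith("<"):
            return False  # something "can be broken apart"
        ccc = unicodedata.combining(ch)
        # Within a run of combining marks, combining classes must not decrease.
        if ccc != 0 and ccc < last_ccc:
            return False  # out-of-order combining character
        last_ccc = ccc
    return True
```

The loop keeps only one integer of state, which is the property the argument relies on; whether a given runtime exposes the required per-character data cheaply is a separate question.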
> 
> — 
> Joe Hildebrand
> 
>> On Jul 29, 2024, at 12:39 AM, Wolf McNally <wolf@wolfmcnally.com> wrote:
>> 
>> All,
>> 
>> I’d like your comment on a proposed addition to dCBOR [1]:
>> 
>> • Encoders MUST only serialize text strings that are in Unicode Normalization Form C (NFC). [2]
>> • Decoders MUST require that deserialized text strings are in NFC.
>> 
>> This rule is comparable to dCBOR’s mandate for numerical reduction, as it applies to one of the most commonly used data types, and closes a frequently-discussed [3] determinism loophole regarding Unicode strings having more than one equivalent way to encode the same semantics. For example:
>> 
>> 'e\u0301'  # 'e' followed by combining acute accent
>> '\u00E9'   # 'é' as a single character (the normalized form of this character)
>> 
>> These strings are canonically equal but encoded differently, breaking determinism if encoded by different agents without the proposed rule.
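
To make the determinism problem concrete, here is a small sketch using Python's standard `unicodedata` module (the variable names are illustrative only). It shows that the two spellings produce different UTF-8 bytes before normalization and identical bytes after NFC.

```python
import unicodedata

decomposed = "e\u0301"   # 'e' followed by U+0301 COMBINING ACUTE ACCENT
precomposed = "\u00e9"   # U+00E9 LATIN SMALL LETTER E WITH ACUTE

# Canonically equivalent, yet the serialized bytes differ.
different_bytes = decomposed.encode("utf-8") != precomposed.encode("utf-8")

# After NFC normalization, both agents would serialize the same bytes.
nfc_a = unicodedata.normalize("NFC", decomposed)
nfc_b = unicodedata.normalize("NFC", precomposed)
same_after_nfc = nfc_a.encode("utf-8") == nfc_b.encode("utf-8")
```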
>> 
>> I have verified that there are readily-available ways to convert strings to NFC in C/C++ (using ICU or Boost libraries), Swift (using Foundation’s `precomposedStringWithCanonicalMapping`), Rust (using the `unicode-normalization` crate), JavaScript (using `String.prototype.normalize`), and PHP (using the `Normalizer` class). Several of these languages also have streamlined APIs for efficiently detecting whether a string is in NFC, which is useful for decoders. Those without such an API can simply compare a string being deserialized to a re-normalized copy.
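
As one possible shape for the two rules, the sketch below uses Python's `unicodedata` module. The function names `to_nfc` and `require_nfc` are hypothetical and not part of any dCBOR library; `unicodedata.is_normalized` exists from Python 3.8 on, and the fallback branch is the "compare to a re-normalized copy" approach mentioned above.

```python
import unicodedata


def to_nfc(text: str) -> str:
    """Hypothetical encoder-side hook: normalize before serializing."""
    return unicodedata.normalize("NFC", text)


def require_nfc(text: str) -> str:
    """Hypothetical decoder-side hook: reject text that is not already NFC."""
    if hasattr(unicodedata, "is_normalized"):  # Python 3.8+
        ok = unicodedata.is_normalized("NFC", text)
    else:
        # Fallback: compare against a re-normalized copy.
        ok = unicodedata.normalize("NFC", text) == text
    if not ok:
        raise ValueError("text string is not in NFC")
    return text
```

Note that the decoder rejects rather than repairs: silently normalizing on input would hide the fact that the sender produced a non-deterministic encoding.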
>> 
>> I think it’s important to note that text strings in ASCII and Latin-1 are already necessarily in NFC, and the normalization rules are designed to be efficient in such common cases.
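
The ASCII and Latin-1 observation suggests a cheap pre-check before any real normalization work. The following is only a sketch under that assumption, with a hypothetical helper name; it also confirms the claim per code point for the Unicode data shipped with the running Python.

```python
import unicodedata

# Every code point in ASCII and Latin-1 (U+0000..U+00FF) is left unchanged by NFC.
latin1_stable = all(
    unicodedata.normalize("NFC", chr(cp)) == chr(cp) for cp in range(0x100)
)


def is_latin1_only(text: str) -> bool:
    """Hypothetical fast path: strings limited to U+0000..U+00FF need no NFC work."""
    return all(ord(ch) <= 0xFF for ch in text)
```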
>> 
>> [1] dCBOR: A Deterministic CBOR Application Profile
>> https://datatracker.ietf.org/doc/draft-mcnally-deterministic-cbor/
>> 
>> [2] Unicode Standard Annex #15: Unicode Normalization Forms
>> https://unicode.org/reports/tr15/
>> 
>> [3] Discussion on Unicode normalization as it relates to canonical JSON.
>> https://esdiscuss.org/topic/json-canonicalize#content-3
>> 
>> ~ Wolf
>> _______________________________________________
>> CBOR mailing list -- cbor@ietf.org
>> To unsubscribe send an email to cbor-leave@ietf.org