[Cbor] Re: dCBOR: Normalization of Strings

Wolf McNally <wolf@wolfmcnally.com> Mon, 29 July 2024 21:53 UTC

Joe,

Thanks for your input.

My main problem with NFD is that it breaks precomposed Latin-1 characters with diacritics, like `è` (U+00E8), into two code points that together require *three* bytes in UTF-8, versus two bytes (0xC3 0xA8) for the precomposed form:

• e (U+0065) encoded as 0x65 in UTF-8.
• ̀ (U+0300) encoded as 0xCC 0x80 in UTF-8.

This means considerably more work and storage for very common use cases.
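
To make the size difference concrete, here is a minimal Python sketch using only the standard `unicodedata` module. It is an illustration of the point above, not anything taken from the dCBOR draft, and the variable names are my own:

```python
import unicodedata

s = "\u00e8"  # 'è' as a single precomposed code point (already NFC)

nfc = unicodedata.normalize("NFC", s)  # stays as U+00E8
nfd = unicodedata.normalize("NFD", s)  # becomes U+0065 U+0300

nfc_bytes = nfc.encode("utf-8")
nfd_bytes = nfd.encode("utf-8")

print(len(nfc), nfc_bytes.hex())  # 1 c3a8
print(len(nfd), nfd_bytes.hex())  # 2 65cc80
```

So for this character NFD costs one extra code point and one extra byte on the wire.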

As I stated, Latin-1 (and by inclusion ASCII) is already in NFC by definition. So an NFC converter can first check whether the string is entirely Latin-1 and, if so, skip any further transformation: no decomposition-then-recomposition steps are needed in these common cases.
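
A rough sketch of that fast path in Python, assuming Python 3.8+ for `unicodedata.is_normalized`. The function names `is_nfc` and `to_nfc` are hypothetical, chosen for illustration only, and real normalization libraries may implement their quick checks differently:

```python
import unicodedata


def is_nfc(text: str) -> bool:
    """Hypothetical decoder-side check: is `text` already in NFC?"""
    # Fast path: text made up only of Latin-1 code points (U+0000..U+00FF)
    # contains no combining marks and nothing NFC would change.
    if all(ord(ch) <= 0xFF for ch in text):
        return True
    # General case: defer to the standard library's check.
    return unicodedata.is_normalized("NFC", text)


def to_nfc(text: str) -> str:
    """Hypothetical encoder-side helper: return `text` in NFC."""
    # Only pay for normalization when the string is not already NFC.
    return text if is_nfc(text) else unicodedata.normalize("NFC", text)
```

For example, `is_nfc("caf\u00e9")` takes the fast path and returns True, while `is_nfc("cafe\u0301")` contains U+0301 and falls through to the general check, which reports that the string is not in NFC.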

~ Wolf

> On Jul 29, 2024, at 7:08 AM, Joe Hildebrand <hildjj@cursive.net> wrote:
> 
> 1) NFC always requires memory allocation even for checking correctness.  NFD is much better for this sort of thing, and now that we have real fonts and text renderers that can do composition, it's almost always better for transmission.
> 
> 2) Normalization might only need to be applied to map keys.  It's a LOT more expensive than you expect, particularly if you choose NFC.
> 
> 
> Explanation of 1) for those that need it:
> 
> The "C" in "NFC" stands for "composition".  It requires strings like "ä" to be stored as U+00E4 (LATIN SMALL LETTER A WITH DIAERESIS) rather than U+0061 (LATIN SMALL LETTER A) followed by U+0308 (COMBINING DIAERESIS).  But since there might be multiple combining characters in play, each of which must be in the correct order, the NFC algorithm is to first break U+00E4 into U+0061 U+0308, then sort all of the combining characters, then recombine everything that is possible.  There's no way to just iterate through a string and determine that it is correct without breaking things apart and recombining them.
> 
> NFD does the "break apart" and "order" steps of NFC, but does not do the recombination.  Therefore, you can iterate through a string, see if anything can be broken apart or is an out of order combining character without having to modify the string or allocate memory when validating.  This turns out to be a MAJOR performance win in practice.  For this win to show up, you often need a validate() routine instead of doing "foo".normalize('NFD') === "foo", but it's at least possible.
> 
> — 
> Joe Hildebrand
> 
>> On Jul 29, 2024, at 12:39 AM, Wolf McNally <wolf@wolfmcnally.com> wrote:
>> 
>> All,
>> 
>> I’d like your comment on a proposed addition to dCBOR [1]:
>> 
>> • Encoders MUST only serialize text strings that are in Unicode Normalization Form C (NFC). [2]
>> • Decoders MUST require that deserialized text strings are in NFC.
>> 
>> This rule is comparable to dCBOR’s mandate for numerical reduction, as it applies to one of the most commonly used data types, and closes a frequently-discussed [3] determinism loophole regarding Unicode strings having more than one equivalent way to encode the same semantics. For example:
>> 
>> 'e\u0301'  # 'e' followed by combining acute accent
>> '\u00E9'   # 'é' as a single character (the normalized form of this character)
>> 
>> These strings are canonically equal but encoded differently, breaking determinism if encoded by different agents without the proposed rule.
>> 
>> I have verified that there are readily-available ways to convert strings to NFC in C/C++ (using ICU or Boost libraries), Swift (using Foundation’s `precomposedStringWithCanonicalMapping`), Rust (using the `unicode-normalization` crate), JavaScript (using `String.prototype.normalize`), and PHP (using the `Normalizer` class). Several of these languages also have streamlined APIs for efficiently detecting whether a string is in NFC, which is useful for decoders. Those without such an API can simply compare a string being deserialized to a re-normalized copy.
>> 
>> I think it’s important to note that text strings in ASCII and Latin-1 are already necessarily in NFC, and the normalization rules are designed to be efficient in such common cases.
>> 
>> [1] dCBOR: A Deterministic CBOR Application Profile
>> https://datatracker.ietf.org/doc/draft-mcnally-deterministic-cbor/
>> 
>> [2] Unicode Standard Annex #15: Unicode Normalization Forms
>> https://unicode.org/reports/tr15/
>> 
>> [3] Discussion on Unicode normalization as it relates to canonical JSON.
>> https://esdiscuss.org/topic/json-canonicalize#content-3
>> 
>> ~ Wolf