[Cbor] Re: dCBOR: Normalization of Strings

Joe Hildebrand <hildjj@cursive.net> Mon, 29 July 2024 22:20 UTC

From: Joe Hildebrand <hildjj@cursive.net>
In-Reply-To: <8A2595D5-B9EC-4C5E-B83A-D10DE2D4B4DD@wolfmcnally.com>
Date: Mon, 29 Jul 2024 18:19:54 -0400
Content-Transfer-Encoding: quoted-printable
Message-Id: <D9570449-9EB8-49D9-95A3-62343790882E@cursive.net>
References: <E8325093-F005-4A56-8AFB-9C1637E19EA4@wolfmcnally.com> <F04DCC66-CECA-4FFD-9AF1-C58F983A8EF1@cursive.net> <D6A3A142-0999-4D0B-9CBC-A698BC384DD4@wolfmcnally.com> <8A2595D5-B9EC-4C5E-B83A-D10DE2D4B4DD@wolfmcnally.com>
To: Wolf McNally <wolf@wolfmcnally.com>
CC: CBOR <cbor@ietf.org>, Carsten Bormann <cabo@tzi.org>, Christopher Allen <christophera@lifewithalacrity.com>, Shannon Appelcline <shannon.appelcline@gmail.com>
Subject: [Cbor] Re: dCBOR: Normalization of Strings
List-Id: "Concise Binary Object Representation (CBOR)" <cbor.ietf.org>
Archived-At: <https://mailarchive.ietf.org/arch/msg/cbor/8W1bex3MCcnFwTMiSjv-1JKSVI8>

That would be a reasonable choice as well, depending on your use case.  Would you expect the decoder to fail decoding if it received a non-Latin-1 character?  Or would that be some other layer's job?

— 
Joe Hildebrand

> On Jul 29, 2024, at 6:03 PM, Wolf McNally <wolf@wolfmcnally.com> wrote:
> 
> Joe,
> 
> I would also add that if you’re concerned about operating dCBOR in a highly constrained environment, your coder-decoder could further be constrained to only handle Latin-1 characters, in which case it would still output valid dCBOR, and would accept many common input values. Depending on your use case, you could decide whether to simply accept all non-Latin-1 Unicode without testing it for NFC conformance, or reject any non-Latin-1 characters as unsupported.
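
The two options described above (accept non-Latin-1 text unchecked, or reject it) could be sketched in Python roughly as follows. The names `NonLatin1Policy`, `is_latin1`, and `check_text` are hypothetical and belong to no dCBOR library; this is only an illustration of the choice, not a definitive decoder design.

```python
from enum import Enum


class NonLatin1Policy(Enum):
    """Hypothetical policy switch for a constrained decoder."""
    ACCEPT_UNCHECKED = "accept"  # pass non-Latin-1 text through without an NFC test
    REJECT = "reject"            # treat non-Latin-1 text as unsupported


def is_latin1(text: str) -> bool:
    """True if every code point is in the range U+0000..U+00FF."""
    return all(ord(ch) <= 0xFF for ch in text)


def check_text(text: str, policy: NonLatin1Policy) -> str:
    """Hypothetical decoder hook: apply the chosen policy to a decoded string."""
    if is_latin1(text):
        return text  # Latin-1 text is already in NFC, nothing more to check
    if policy is NonLatin1Policy.REJECT:
        raise ValueError("non-Latin-1 text is not supported by this decoder")
    return text  # accepted as-is; NFC conformance is NOT verified here
```

Whether the rejection belongs in the decoder or in a higher layer is exactly the open question; the sketch simply places it in one function so that either answer is a one-line change.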
> 
> ~ Wolf
> 
>> On Jul 29, 2024, at 2:53 PM, Wolf McNally <wolf@wolfmcnally.com> wrote:
>> 
>> Joe,
>> 
>> Thanks for your input.
>> 
>> My main problem with NFD is that it breaks existing Latin-1 characters with diacritics like `è` (0xe8) into two code points, requiring *three* bytes to encode as UTF-8:
>> 
>> • e (U+0065) encoded as 0x65 in UTF-8.
>> • ̀ (U+0300) encoded as 0xCC 0x80 in UTF-8.
>> 
>> This means considerably more work and storage for very common use cases.
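
The size difference can be checked with Python's standard `unicodedata` module. This is just a worked example of the byte counts mentioned above, using `è` as the sample character.

```python
import unicodedata

composed = "\u00e8"  # 'è' as a single code point (already NFC)
decomposed = unicodedata.normalize("NFD", composed)  # 'e' + U+0300

nfc_bytes = composed.encode("utf-8")
nfd_bytes = decomposed.encode("utf-8")

print(len(nfc_bytes), nfc_bytes.hex())  # 2 c3a8
print(len(nfd_bytes), nfd_bytes.hex())  # 3 65cc80
```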
>> 
>> As I stated, Latin-1 (and by inclusion ASCII) is already in NFC by definition. So converting such a string to NFC would first detect whether it is already in Latin-1, and if so trivially avoid any further transformation: no decomposition then recomposition steps would be needed in these common cases.
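
A minimal sketch of that fast path in Python, assuming the hypothetical helper name `to_nfc`: if every code point is at most U+00FF the string is returned untouched, and only otherwise is the full normalization run.

```python
import unicodedata


def to_nfc(text: str) -> str:
    """Return text in NFC, skipping normalization for pure Latin-1 input."""
    # Fast path: ASCII and Latin-1 code points are never changed by NFC,
    # so a string made only of them is already normalized.
    if all(ord(ch) <= 0xFF for ch in text):
        return text
    # Slow path: a full decompose / reorder / recompose pass.
    return unicodedata.normalize("NFC", text)
```

The scan itself needs no allocation; in a lower-level language it would be a single pass over the decoded code points.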
>> 
>> ~ Wolf
>> 
>>> On Jul 29, 2024, at 7:08 AM, Joe Hildebrand <hildjj@cursive.net> wrote:
>>> 
>>> 1) NFC always requires memory allocation even for checking correctness.  NFD is much better for this sort of thing, and now that we have real fonts and text renderers that can do composition, it's almost always better for transmission.
>>> 
>>> 2) Normalization might only need to be applied to map keys.  It's a LOT more expensive than you expect, particularly if you choose NFC.
>>> 
>>> 
>>> Explanation of 1) for those that need it:
>>> 
>>> The "C" in "NFC" stands for "composition".  It requires strings like "ä" to be stored as U+00E4 (LATIN SMALL LETTER A WITH DIAERESIS) rather than U+0061 (LATIN SMALL LETTER A) followed by U+0308 (COMBINING DIAERESIS).  But since there might be multiple combining characters in play, each of which must be in the correct order, the NFC algorithm is to first break U+00E4 into U+0061 U+0308, then sort all of the combining characters, then recombine everything that is possible.  There's no way to just iterate through a string and determine that it is correct without breaking things apart and recombining them.
>>> 
>>> NFD does the "break apart" and "order" steps of NFC, but does not do the recombination.  Therefore, you can iterate through a string and check whether anything can be broken apart or whether a combining character is out of order, without having to modify the string or allocate memory when validating.  This turns out to be a MAJOR performance win in practice.  For this win to show up, you often need a validate() routine instead of doing "foo".normalize('NFD') === "foo", but it's at least possible.
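
A validate-style routine of the kind described above might look like the following Python sketch. The function name `is_nfd` is hypothetical. It relies on `unicodedata.decomposition` and `unicodedata.combining` from the standard library, and it handles precomposed Hangul syllables with an explicit range check because they decompose algorithmically. Python is used only for illustration; the point is that the loop never builds a second string.

```python
import unicodedata

# Precomposed Hangul syllables decompose algorithmically, so they are
# checked by range rather than by a table lookup.
HANGUL_FIRST, HANGUL_LAST = 0xAC00, 0xD7A3


def is_nfd(text: str) -> bool:
    """Check NFD conformance in one pass without building a normalized copy."""
    last_ccc = 0
    for ch in text:
        # "Break apart" check: anything with a canonical decomposition fails.
        if HANGUL_FIRST <= ord(ch) <= HANGUL_LAST:
            return False
        decomp = unicodedata.decomposition(ch)
        if decomp and not decomp.startswith("<"):  # "<...>" marks compatibility mappings
            return False
        # "Order" check: combining classes must not decrease within a run.
        ccc = unicodedata.combining(ch)
        if ccc != 0 and last_ccc > ccc:
            return False
        last_ccc = ccc
    return True
```

For example, `is_nfd("e\u0300")` holds while `is_nfd("\u00e8")` does not, which matches comparing each string against `unicodedata.normalize("NFD", ...)`.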
>>> 
>>> — 
>>> Joe Hildebrand
>>> 
>>>> On Jul 29, 2024, at 12:39 AM, Wolf McNally <wolf@wolfmcnally.com> wrote:
>>>> 
>>>> All,
>>>> 
>>>> I’d like your comment on a proposed addition to dCBOR [1]:
>>>> 
>>>> • Encoders MUST only serialize text strings that are in Unicode Normalization Form C (NFC). [2]
>>>> • Decoders MUST require that deserialized text strings are in NFC.
>>>> 
>>>> This rule is comparable to dCBOR’s mandate for numerical reduction, as it applies to one of the most commonly used data types, and closes a frequently-discussed [3] determinism loophole regarding Unicode strings having more than one equivalent way to encode the same semantics. For example:
>>>> 
>>>> 'e\u0301'  # 'e' followed by combining acute accent
>>>> '\u00E9'   # 'é' as a single character (the normalized form of this character)
>>>> 
>>>> These strings are canonically equal but encoded differently, breaking determinism if encoded by different agents without the proposed rule.
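
The difference between the two example strings can be made concrete with Python's standard `unicodedata` module: they compare unequal and serialize to different bytes, yet become identical once both are normalized to NFC.

```python
import unicodedata

a = "e\u0301"  # 'e' followed by combining acute accent
b = "\u00e9"   # 'é' as a single code point

print(a == b)                                            # False
print(a.encode("utf-8").hex(), b.encode("utf-8").hex())  # 65cc81 c3a9
print(unicodedata.normalize("NFC", a) == b)              # True
```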
>>>> 
>>>> I have verified that there are readily-available ways to convert strings to NFC in C/C++ (using ICU or Boost libraries), Swift (using Foundation’s `precomposedStringWithCanonicalMapping`), Rust (using the `unicode-normalization` crate), JavaScript (using `String.prototype.normalize`), and PHP (using the `Normalizer` class). Several of these languages also have streamlined APIs for efficiently detecting whether a string is in NFC, which is useful for decoders. Those without such an API can simply compare a string being deserialized to a re-normalized copy.
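
As an illustration of the decoder-side check in one more language, here is a small Python sketch. The function name `require_nfc` is hypothetical. It uses `unicodedata.is_normalized` where available (Python 3.8 and later) and otherwise falls back to comparing against a re-normalized copy, as suggested above.

```python
import unicodedata


def require_nfc(text: str) -> str:
    """Return text unchanged if it is in NFC, otherwise raise ValueError."""
    is_normalized = getattr(unicodedata, "is_normalized", None)  # Python 3.8+
    if is_normalized is not None:
        ok = is_normalized("NFC", text)
    else:
        # Fallback: compare with a re-normalized copy (allocates a new string).
        ok = unicodedata.normalize("NFC", text) == text
    if not ok:
        raise ValueError("text string is not in NFC")
    return text
```

An encoder would instead call `unicodedata.normalize("NFC", text)` before serializing, so that the decoder-side check never fails on its output.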
>>>> 
>>>> I think it’s important to note that text strings in ASCII and Latin-1 are already necessarily in NFC, and the normalization rules are designed to be efficient in such common cases.
>>>> 
>>>> [1] dCBOR: A Deterministic CBOR Application Profile
>>>> https://datatracker.ietf.org/doc/draft-mcnally-deterministic-cbor/
>>>> 
>>>> [2] Unicode Standard Annex #15: Unicode Normalization Forms
>>>> https://unicode.org/reports/tr15/
>>>> 
>>>> [3] Discussion on Unicode normalization as it relates to canonical JSON.
>>>> https://esdiscuss.org/topic/json-canonicalize#content-3
>>>> 
>>>> ~ Wolf