[Cbor] Re: dCBOR: Normalization of Strings
Joe Hildebrand <hildjj@cursive.net> Mon, 29 July 2024 16:00 UTC
From: Joe Hildebrand <hildjj@cursive.net>
In-Reply-To: <6CF47BC0-D918-43A8-826D-457AF6FFFB3B@tzi.org>
Date: Mon, 29 Jul 2024 12:00:42 -0400
To: Carsten Bormann <cabo@tzi.org>
CC: Wolf McNally <wolf@wolfmcnally.com>, CBOR <cbor@ietf.org>, Christopher Allen <christophera@lifewithalacrity.com>, Shannon Appelcline <shannon.appelcline@gmail.com>
Subject: [Cbor] Re: dCBOR: Normalization of Strings
List-Id: "Concise Binary Object Representation (CBOR)" <cbor.ietf.org>
Archived-At: <https://mailarchive.ietf.org/arch/msg/cbor/7frjJkyKalLQTuYyng35kilwcbI>
Why would text in your application be in NFC form already?
Note that you have to at least check that whatever strings needed to be normalized are in the correct form before trusting that the deterministic encoding was done correctly, which means every step in the process that relies on determinism is going to be doing this expensive check.
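A minimal sketch of what such a per-step check could look like, assuming Python's standard `unicodedata` module. The function name `require_nfc` is hypothetical and not taken from any dCBOR implementation.

```python
import unicodedata

def require_nfc(text: str) -> str:
    """Hypothetical decoder-side guard: reject text that is not in NFC."""
    # is_normalized() exists since Python 3.8. On older versions the
    # portable fallback is to normalize and compare:
    #   unicodedata.normalize("NFC", text) == text
    if not unicodedata.is_normalized("NFC", text):
        raise ValueError("text string is not in NFC")
    return text
```

Every consumer that relies on determinism would have to run a guard like this on each text string it receives, which is where the cost adds up.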
I'd need to know more about the particular application to have an opinion on whether all of the text needs to be normalized. User-entered text that doesn't need to be compared against strings entered somewhere else probably doesn't need normalization, even for deterministic encoding.
—
Joe Hildebrand
> On Jul 29, 2024, at 11:24 AM, Carsten Bormann <cabo@tzi.org> wrote:
>
> Hi Joe,
>
> thank you for this interesting information.
>
> I would expect that, in most cases, text in applications is already available in NFC form. Only where that is not the case, normalization would actually be required; a high-performance application could rely on data at rest to be NFC.
> Similarly, the receiving application likely needs to ingest the information in NFC form, so there is no way around generating the NFC for ingestion.
> (In the environments dCBOR was designed for, Normalization always occurs in the process of deriving signing inputs from data at rest, so it is not just map keys.)
> This could pretty much nullify the advantages of NFD that you describe below.
>
> Grüße, Carsten
>
>
>> On 2024-07-29, at 16:08, Joe Hildebrand <hildjj@cursive.net> wrote:
>>
>> 1) NFC always requires memory allocation even for checking correctness. NFD is much better for this sort of thing, and now that we have real fonts and text renderers that can do composition, it's almost always better for transmission.
>>
>> 2) Normalization might only need to be applied to map keys. It's a LOT more expensive than you expect, particularly if you choose NFC.
>>
>>
>> Explanation of 1) for those that need it:
>>
>> The "C" in "NFC" stands for "composition". It requires strings like "ä" to be stored as U+00E4 (LATIN SMALL LETTER A WITH DIAERESIS) rather than U+0061 (LATIN SMALL LETTER A) followed by U+0308 (COMBINING DIAERESIS). But since there might be multiple combining characters in play, each of which must be in the correct order, the NFC algorithm is to first break U+00E4 into U+0061 U+0308, then sort all of the combining characters, then recombine everything that is possible. There's no way to just iterate through a string and determine that it is correct without breaking things apart and recombining them.
>>
>> NFD does the "break apart" and "order" steps of NFC, but does not do the recombination. Therefore, when validating, you can iterate through a string and check whether anything can be broken apart or is an out-of-order combining character, without having to modify the string or allocate memory. This turns out to be a MAJOR performance win in practice. For this win to show up, you often need a validate() routine instead of doing "foo".normalize('NFD') === "foo", but it's at least possible.
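The validate() idea described above can be sketched roughly as follows, assuming Python's standard `unicodedata` module. The name `is_nfd` is hypothetical. Python allocates small objects while iterating, so this only illustrates the single-pass logic (no normalized copy is built), not the allocation-free performance a lower-level language could achieve.

```python
import unicodedata

# Hangul syllables decompose algorithmically and have no entry in
# unicodedata.decomposition(), so they are detected by code point range.
HANGUL_FIRST, HANGUL_LAST = 0xAC00, 0xD7A3

def is_nfd(text: str) -> bool:
    """Return True if text is already in NFD, without building a normalized copy."""
    last_ccc = 0
    for ch in text:
        if HANGUL_FIRST <= ord(ch) <= HANGUL_LAST:
            return False  # precomposed Hangul syllable
        decomp = unicodedata.decomposition(ch)
        # Canonical mappings have no "<tag>" prefix; compatibility ones do.
        if decomp and not decomp.startswith("<"):
            return False  # character can still be broken apart
        ccc = unicodedata.combining(ch)
        # Combining marks (ccc > 0) must be in non-decreasing class order.
        if ccc != 0 and ccc < last_ccc:
            return False
        last_ccc = ccc
    return True
```

For example, `is_nfd("a\u0308")` is true, while `is_nfd("\u00e4")` is false because U+00E4 can still be broken apart.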
>>
>> —
>> Joe Hildebrand
>>
>>> On Jul 29, 2024, at 12:39 AM, Wolf McNally <wolf@wolfmcnally.com> wrote:
>>>
>>> All,
>>>
>>> I’d like your comment on a proposed addition to dCBOR [1]:
>>>
>>> • Encoders MUST only serialize text strings that are in Unicode Normalization Form C (NFC). [2]
>>> • Decoders MUST require that deserialized text strings are in NFC.
>>>
>>> This rule is comparable to dCBOR’s mandate for numerical reduction, as it applies to one of the most commonly used data types, and closes a frequently-discussed [3] determinism loophole regarding Unicode strings having more than one equivalent way to encode the same semantics. For example:
>>>
>>> 'e\u0301' # 'e' followed by combining acute accent
>>> '\u00E9' # 'é' as a single character (the normalized form of this character)
>>>
>>> These strings are canonically equal but encoded differently, breaking determinism if encoded by different agents without the proposed rule.
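The effect of the two example strings can be illustrated with a short sketch, assuming Python's standard `unicodedata` module. The variable names are illustrative only.

```python
import unicodedata

decomposed = "e\u0301"   # 'e' followed by COMBINING ACUTE ACCENT
precomposed = "\u00e9"   # LATIN SMALL LETTER E WITH ACUTE

# Different code point sequences, hence different UTF-8 bytes on the wire.
assert decomposed != precomposed
assert decomposed.encode("utf-8") == b"e\xcc\x81"
assert precomposed.encode("utf-8") == b"\xc3\xa9"

# After NFC normalization, both agents would serialize the same bytes.
nfc_a = unicodedata.normalize("NFC", decomposed)
nfc_b = unicodedata.normalize("NFC", precomposed)
assert nfc_a == nfc_b == precomposed
```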
>>>
>>> I have verified that there are readily-available ways to convert strings to NFC in C/C++ (using ICU or Boost libraries), Swift (using Foundation’s `precomposedStringWithCanonicalMapping`), Rust (using the `unicode-normalization` crate), JavaScript (using `String.prototype.normalize`), and PHP (using the `Normalizer` class). Several of these languages also have streamlined APIs for efficiently detecting whether a string is in NFC, which is useful for decoders. Those without such an API can simply compare a string being deserialized to a re-normalized copy.
>>>
>>> I think it’s important to note that text strings in ASCII and Latin-1 are already necessarily in NFC, and the normalization rules are designed to be efficient in such common cases.
>>>
>>> [1] dCBOR: A Deterministic CBOR Application Profile
>>> https://datatracker.ietf.org/doc/draft-mcnally-deterministic-cbor/
>>>
>>> [2] Unicode Standard Annex #15: Unicode Normalization Forms
>>> https://unicode.org/reports/tr15/
>>>
>>> [3] Discussion on Unicode normalization as it relates to canonical JSON.
>>> https://esdiscuss.org/topic/json-canonicalize#content-3
>>>
>>> ~ Wolf
>>> _______________________________________________
>>> CBOR mailing list -- cbor@ietf.org
>>> To unsubscribe send an email to cbor-leave@ietf.org