[Cbor] Re: dCBOR: Normalization of Strings
Joe Hildebrand <hildjj@cursive.net> Mon, 29 July 2024 14:08 UTC
From: Joe Hildebrand <hildjj@cursive.net>
In-Reply-To: <E8325093-F005-4A56-8AFB-9C1637E19EA4@wolfmcnally.com>
Date: Mon, 29 Jul 2024 10:08:02 -0400
To: Wolf McNally <wolf@wolfmcnally.com>
CC: CBOR <cbor@ietf.org>, Carsten Bormann <cabo@tzi.org>, Christopher Allen <christophera@lifewithalacrity.com>, Shannon Appelcline <shannon.appelcline@gmail.com>
Archived-At: <https://mailarchive.ietf.org/arch/msg/cbor/WmPNcmixEZ_MVGw971A2I0uLZjQ>
1) NFC always requires memory allocation even for checking correctness. NFD is much better for this sort of thing, and now that we have real fonts and text renderers that can do composition, it's almost always better for transmission.
2) Normalization might only need to be applied to map keys. It's a LOT more expensive than you expect, particularly if you choose NFC.
Explanation of 1) for those who need it:
The "C" in "NFC" stands for "composition". It requires strings like "ä" to be stored as U+00E4 (LATIN SMALL LETTER A WITH DIAERESIS) rather than U+0061 (LATIN SMALL LETTER A) followed by U+0308 (COMBINING DIAERESIS). But since there might be multiple combining characters in play, each of which must be in the correct order, the NFC algorithm is to first break U+00E4 into U+0061 U+0308, then sort all of the combining characters, then recombine everything that is possible. There's no way to just iterate through a string and determine that it is correct without breaking things apart and recombining them.
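To make that concrete, here is a small sketch in Python using the standard `unicodedata` module. Python is picked only for illustration, and the helper names `nfc` and `code_points` are made up for this example. It shows that U+00E4 followed by U+0323 (COMBINING DOT BELOW) starts with a precomposed character and still is not in NFC, because the dot below has to be sorted ahead of the diaeresis before anything is recombined.

```python
import unicodedata


def nfc(text):
    """Hypothetical helper: normalize text to NFC via the standard library."""
    return unicodedata.normalize("NFC", text)


def code_points(text):
    """Render a string as a list of U+XXXX labels for inspection."""
    return ["U+%04X" % ord(ch) for ch in text]


# The simple case: base letter plus one combining mark composes to U+00E4.
assert nfc("a\u0308") == "\u00e4"

# Three canonically equivalent spellings of "a" + diaeresis + dot below.
variants = [
    "\u00e4\u0323",   # precomposed a-diaeresis, then dot below
    "a\u0308\u0323",  # decomposed, marks in non-canonical order
    "a\u0323\u0308",  # decomposed, marks in canonical order
]

# All three normalize to one and the same NFC string.
results = {nfc(v) for v in variants}
assert len(results) == 1

# The first variant looks "composed" but is changed by NFC normalization.
assert nfc(variants[0]) != variants[0]

for v in variants:
    print(code_points(v), "->", code_points(nfc(v)))
```

The first variant is the interesting one: a scan that only asks "is each character already composed?" would accept it, which is why an NFC check ends up doing the decompose, reorder, recompose work.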
NFD does the "break apart" and "order" steps of NFC, but does not do the recombination. Therefore, you can iterate through a string and check whether any character can be broken apart or any combining character is out of order, without modifying the string or allocating memory. This turns out to be a MAJOR performance win in practice. To get this win, you usually need a dedicated validate() routine rather than a check like "foo".normalize('NFD') === "foo", but it's at least possible.
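A minimal sketch of such a validate() routine, again in Python with the standard `unicodedata` module. The name `is_nfd` and the constants are assumptions made for this example, not part of any library. Python allocates small objects behind the scenes, so this only shows the shape of the algorithm: a single pass that carries one integer of state and never builds a normalized copy.

```python
import unicodedata

# Precomposed Hangul syllables decompose algorithmically, so they have
# no entry in the per-character decomposition table.
HANGUL_FIRST = 0xAC00
HANGUL_LAST = 0xD7A3


def is_nfd(text):
    """Return True if text is already in NFD, using a single forward scan."""
    prev_ccc = 0  # canonical combining class of the previous character
    for ch in text:
        # A canonical mapping means the character can be broken apart.
        # Compatibility mappings start with a "<tag>" and are ignored by NFD.
        mapping = unicodedata.decomposition(ch)
        if mapping and not mapping.startswith("<"):
            return False
        if HANGUL_FIRST <= ord(ch) <= HANGUL_LAST:
            return False
        # Combining marks must appear in non-decreasing combining-class
        # order; a class of 0 (a starter) resets the run.
        ccc = unicodedata.combining(ch)
        if ccc != 0 and ccc < prev_ccc:
            return False
        prev_ccc = ccc
    return True


assert is_nfd("a\u0308")             # already decomposed
assert not is_nfd("\u00e4")          # can be broken apart
assert not is_nfd("a\u0308\u0323")   # combining marks out of order
assert is_nfd("a\u0323\u0308")       # same marks, canonical order
```

A production version would use a generated lookup table instead of per-character library calls, but the control flow would stay the same.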
—
Joe Hildebrand
> On Jul 29, 2024, at 12:39 AM, Wolf McNally <wolf@wolfmcnally.com> wrote:
>
> All,
>
> I’d like your comment on a proposed addition to dCBOR [1]:
>
> • Encoders MUST only serialize text strings that are in Unicode Normalization Form C (NFC). [2]
> • Decoders MUST require that deserialized text strings are in NFC.
>
> This rule is comparable to dCBOR’s mandate for numerical reduction, as it applies to one of the most commonly used data types, and closes a frequently-discussed [3] determinism loophole regarding Unicode strings having more than one equivalent way to encode the same semantics. For example:
>
> 'e\u0301' # 'e' followed by combining acute accent
> '\u00E9' # 'é' as a single character (the normalized form of this character)
>
> These strings are canonically equal but encoded differently, breaking determinism if encoded by different agents without the proposed rule.
>
> I have verified that there are readily-available ways to convert strings to NFC in C/C++ (using ICU or Boost libraries), Swift (using Foundation’s `precomposedStringWithCanonicalMapping`), Rust (using the `unicode-normalization` crate), JavaScript (using `String.prototype.normalize`), and PHP (using the `Normalizer` class). Several of these languages also have streamlined APIs for efficiently detecting whether a string is in NFC, which is useful for decoders. Those without such an API can simply compare a string being deserialized to a re-normalized copy.
>
> I think it’s important to note that text strings in ASCII and Latin-1 are already necessarily in NFC, and the normalization rules are designed to be efficient in such common cases.
>
> [1] dCBOR: A Deterministic CBOR Application Profile
> https://datatracker.ietf.org/doc/draft-mcnally-deterministic-cbor/
>
> [2] Unicode Standard Annex #15: Unicode Normalization Forms
> https://unicode.org/reports/tr15/
>
> [3] Discussion on Unicode normalization as it relates to canonical JSON.
> https://esdiscuss.org/topic/json-canonicalize#content-3
>
> ~ Wolf