[Cbor] dCBOR: Normalization of Strings
Wolf McNally <wolf@wolfmcnally.com> Mon, 29 July 2024 04:39 UTC
From: Wolf McNally <wolf@wolfmcnally.com>
Message-Id: <E8325093-F005-4A56-8AFB-9C1637E19EA4@wolfmcnally.com>
Date: Sun, 28 Jul 2024 21:39:24 -0700
To: CBOR <cbor@ietf.org>, Carsten Bormann <cabo@tzi.org>
CC: Christopher Allen <christophera@lifewithalacrity.com>, Shannon Appelcline <shannon.appelcline@gmail.com>
Subject: [Cbor] dCBOR: Normalization of Strings
List-Id: "Concise Binary Object Representation (CBOR)" <cbor.ietf.org>
Archived-At: <https://mailarchive.ietf.org/arch/msg/cbor/kGGwaMHPsZpohgC3vk-IwH8e7iU>

All,

I’d like your comments on a proposed addition to dCBOR [1]:

• Encoders MUST only serialize text strings that are in Unicode Normalization Form C (NFC). [2]
• Decoders MUST require that deserialized text strings are in NFC.

This rule is comparable to dCBOR’s mandate for numerical reduction: it applies to one of the most commonly used data types, and it closes a frequently discussed [3] determinism loophole, namely that Unicode strings can have more than one equivalent way to encode the same semantics. For example:

    'e\u0301'   # 'e' followed by combining acute accent
    '\u00E9'    # 'é' as a single character (the normalized form of this character)

These strings are canonically equivalent but encoded differently, which breaks determinism if they are encoded by different agents without the proposed rule.

I have verified that there are readily available ways to convert strings to NFC in C/C++ (using the ICU or Boost libraries), Swift (using Foundation’s `precomposedStringWithCanonicalMapping`), Rust (using the `unicode-normalization` crate), JavaScript (using `String.prototype.normalize`), and PHP (using the `Normalizer` class). Several of these languages also have streamlined APIs for efficiently detecting whether a string is already in NFC, which is useful for decoders. Those without such an API can simply compare the string being deserialized to a re-normalized copy.

I think it’s important to note that text strings in ASCII and Latin-1 are already necessarily in NFC, and the normalization rules are designed to be efficient in such common cases.

[1] dCBOR: A Deterministic CBOR Application Profile
    https://datatracker.ietf.org/doc/draft-mcnally-deterministic-cbor/
[2] Unicode Standard Annex #15: Unicode Normalization Forms
    https://unicode.org/reports/tr15/
[3] Discussion on Unicode normalization as it relates to canonical JSON
    https://esdiscuss.org/topic/json-canonicalize#content-3

~ Wolf
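
As an illustration only, the two proposed rules could be sketched as follows in Python, a language not covered in the list above. This is a minimal sketch and not part of the dCBOR draft or any dCBOR implementation: the function names `to_nfc` and `require_nfc` are hypothetical, and the decoder-side check uses the "compare against a re-normalized copy" approach described in the message. Python 3.8 and later also provide `unicodedata.is_normalized`, which could replace the comparison.

```python
import unicodedata


def to_nfc(text: str) -> str:
    """Hypothetical encoder-side step: normalize to NFC before serializing."""
    return unicodedata.normalize("NFC", text)


def require_nfc(text: str) -> str:
    """Hypothetical decoder-side step: reject text that is not already in NFC.

    Compares the string to a re-normalized copy, as suggested for
    environments that lack a dedicated "is normalized" API.
    """
    if unicodedata.normalize("NFC", text) != text:
        raise ValueError("text string is not in Unicode Normalization Form C")
    return text


decomposed = "e\u0301"  # 'e' followed by combining acute accent
composed = "\u00e9"     # precomposed 'é'

# After normalization both spellings serialize to the same UTF-8 bytes.
same_bytes = to_nfc(decomposed).encode("utf-8") == composed.encode("utf-8")
```

Under this sketch an encoder would call `to_nfc` on every text string before writing it, while a decoder would call `require_nfc` and treat the `ValueError` as a decoding failure instead of silently repairing the input.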