Return-Path: <wolf@wolfmcnally.com>
X-Original-To: cbor@ietfa.amsl.com
Delivered-To: cbor@ietfa.amsl.com
Received: from localhost (localhost [127.0.0.1])
	by ietfa.amsl.com (Postfix) with ESMTP id BCB00C1840DE
	for <cbor@ietfa.amsl.com>; Mon, 29 Jul 2024 16:02:16 -0700 (PDT)
X-Virus-Scanned: amavisd-new at amsl.com
X-Spam-Flag: NO
X-Spam-Score: -1.904
X-Spam-Level: 
X-Spam-Status: No, score=-1.904 tagged_above=-999 required=5
	tests=[BAYES_00=-1.9, DKIM_SIGNED=0.1, DKIM_VALID=-0.1,
	HTML_MESSAGE=0.001, RCVD_IN_DNSWL_NONE=-0.0001,
	RCVD_IN_ZEN_BLOCKED_OPENDNS=0.001, SPF_HELO_NONE=0.001,
	SPF_NONE=0.001, T_SCC_BODY_TEXT_LINE=-0.01,
	URIBL_DBL_BLOCKED_OPENDNS=0.001, URIBL_ZEN_BLOCKED_OPENDNS=0.001]
	autolearn=ham autolearn_force=no
Authentication-Results: ietfa.amsl.com (amavisd-new); dkim=pass (2048-bit key)
	header.d=wolfmcnally-com.20230601.gappssmtp.com
Received: from mail.ietf.org ([50.223.129.194])
	by localhost (ietfa.amsl.com [127.0.0.1]) (amavisd-new, port 10024)
	with ESMTP id rxZBgt3dtSc5 for <cbor@ietfa.amsl.com>;
	Mon, 29 Jul 2024 16:02:12 -0700 (PDT)
Received: from mail-pl1-x62d.google.com (mail-pl1-x62d.google.com
 [IPv6:2607:f8b0:4864:20::62d])
	(using TLSv1.3 with cipher TLS_AES_128_GCM_SHA256 (128/128 bits)
	 key-exchange X25519 server-signature ECDSA (P-256) server-digest SHA256)
	(No client certificate requested)
	by ietfa.amsl.com (Postfix) with ESMTPS id D6DBCC151087
	for <cbor@ietf.org>; Mon, 29 Jul 2024 16:02:12 -0700 (PDT)
Received: by mail-pl1-x62d.google.com with SMTP id
 d9443c01a7336-1fc569440e1so33969665ad.3
        for <cbor@ietf.org>; Mon, 29 Jul 2024 16:02:12 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=wolfmcnally-com.20230601.gappssmtp.com; s=20230601; t=1722294132;
 x=1722898932; darn=ietf.org;
        h=references:to:cc:in-reply-to:date:subject:mime-version:message-id
         :from:from:to:cc:subject:date:message-id:reply-to;
        bh=dJiVKlFxTst5qEDkGkZLmkdfhqOpxPRkszTvlJZLuKM=;
        b=be9/jEimmvDD+eq4FKKpR9tyfI4aONySJFvKc06tq7nzhhSu3iH8Va3fTkq2PMW5FU
         MSAqkdjdPefAt6YojRGyyQA9jGTOu7Wdv7mcSBojrNxS4zBz/0lbclBz5OdxjAN9nH56
         WN+4pnsTjUESQUUGYrujAPzjXia/3vaiGWzczpXhr9E0+TvgTp3fTz/evaH5IIubCus+
         m6WnR7WG3QmysWqMOItE6uu5sJKWcahx4l4a+AGsly3WwhDuW0L/x3xZi/vAn+eJkr6g
         WEnpo72Mo0MwQOOGmOwmoWfcEyGBH/dG9D7C6/M1Ssy70In0RPwTn6G7vEVVrbqaq91x
         1V6g==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=1e100.net; s=20230601; t=1722294132; x=1722898932;
        h=references:to:cc:in-reply-to:date:subject:mime-version:message-id
         :from:x-gm-message-state:from:to:cc:subject:date:message-id:reply-to;
        bh=dJiVKlFxTst5qEDkGkZLmkdfhqOpxPRkszTvlJZLuKM=;
        b=ils+4ET8n+dghiTsJi4X2lkL37YjwZYOasQ1cwKJK1xhqKOJkAJeOZ3Jkmk4WpEi1P
         84oSINaYcHOmZeh4/xi/4oPTaB+3TlttAYEb+OWuuB3gMNVSIclbTVUfU0PtrhDHY+J9
         f2a2tZmEGtswC1koVUML+JrRZBW5bug38xaqUEU/gpbTxaDXTq0Fxd8iYTqPoPiDAR7O
         zIDAvUrXGwJCrDglHnbVve36kOcxCQ0xXXxQiYviuYmAwjsMbfBvBWWlaKV4n7lUJpze
         yzr6oS9117tDoetPiHrfz0XXdVWIYaKj8aOrJ3UsCvtkCs3in7nw0FeCb7jxwNc2vBVP
         e7fw==
X-Gm-Message-State: AOJu0Yz3wi2n04uWNHHMzECRLaPZUK3o/fKjD8HJ07uis6KDr2WMH1nO
	2eZPBs5OB/+ehm1dUiILcU73zn2iWMAUnEZpKW93/DQskXE9LYrfLbIvy6g1AWI=
X-Google-Smtp-Source: 
 AGHT+IHxeN6w6R8+ANW/vkAU9JxBcWsLDGdKJVykE3PWJGQMCQDl83033JOniBZmYBXBBtdUkVZL9g==
X-Received: by 2002:a17:902:e54a:b0:1fb:7c7f:6447 with SMTP id
 d9443c01a7336-1ff0482b7f7mr112996285ad.25.1722294131549;
        Mon, 29 Jul 2024 16:02:11 -0700 (PDT)
Received: from smtpclient.apple ([192.145.119.154])
        by smtp.gmail.com with ESMTPSA id
 d9443c01a7336-1fed7ff067esm88111625ad.295.2024.07.29.16.02.10
        (version=TLS1_2 cipher=ECDHE-ECDSA-AES128-GCM-SHA256 bits=128/128);
        Mon, 29 Jul 2024 16:02:11 -0700 (PDT)
From: Wolf McNally <wolf@wolfmcnally.com>
Message-Id: <EDA0C1BC-C39A-4C4F-AAEA-D52FBF53A420@wolfmcnally.com>
Content-Type: multipart/alternative;
	boundary="Apple-Mail=_A6324986-5438-4E27-9DDE-C5E267347265"
Mime-Version: 1.0 (Mac OS X Mail 16.0 \(3774.600.62\))
Date: Mon, 29 Jul 2024 16:01:58 -0700
In-Reply-To: <B342E173-0F31-42A3-B980-123ACDA74200@cursive.net>
To: Joe Hildebrand <hildjj@cursive.net>
References: <E8325093-F005-4A56-8AFB-9C1637E19EA4@wolfmcnally.com>
 <F04DCC66-CECA-4FFD-9AF1-C58F983A8EF1@cursive.net>
 <D6A3A142-0999-4D0B-9CBC-A698BC384DD4@wolfmcnally.com>
 <B342E173-0F31-42A3-B980-123ACDA74200@cursive.net>
X-Mailer: Apple Mail (2.3774.600.62)
Message-ID-Hash: I44QKUYIBDD7F7CXJSISXVBIF4FALVOJ
X-Message-ID-Hash: I44QKUYIBDD7F7CXJSISXVBIF4FALVOJ
X-MailFrom: wolf@wolfmcnally.com
X-Mailman-Rule-Misses: dmarc-mitigation; no-senders; approved; emergency;
 loop; banned-address; member-moderation; header-match-cbor.ietf.org-0;
 nonmember-moderation; administrivia; implicit-dest; max-recipients; max-size;
 news-moderation; no-subject; digests; suspicious-header
CC: CBOR <cbor@ietf.org>, Carsten Bormann <cabo@tzi.org>,
 Christopher Allen <christophera@lifewithalacrity.com>,
 Shannon Appelcline <shannon.appelcline@gmail.com>
X-Mailman-Version: 3.3.9rc4
Precedence: list
Subject: =?utf-8?q?=5BCbor=5D_Re=3A_dCBOR=3A_Normalization_of_Strings?=
List-Id: "Concise Binary Object Representation (CBOR)" <cbor.ietf.org>
Archived-At: 
 <https://mailarchive.ietf.org/arch/msg/cbor/81fskPl60lhUo-GUPZ4ZxdAhsWc>
List-Archive: <https://mailarchive.ietf.org/arch/browse/cbor>
List-Help: <mailto:cbor-request@ietf.org?subject=help>
List-Owner: <mailto:cbor-owner@ietf.org>
List-Post: <mailto:cbor@ietf.org>
List-Subscribe: <mailto:cbor-join@ietf.org>
List-Unsubscribe: <mailto:cbor-leave@ietf.org>


--Apple-Mail=_A6324986-5438-4E27-9DDE-C5E267347265
Content-Transfer-Encoding: quoted-printable
Content-Type: text/plain;
	charset=utf-8

Joe,

> On Jul 29, 2024, at 3:18=E2=80=AFPM, Joe Hildebrand =
<hildjj@cursive.net> wrote:
>=20
> I think that's true, assuming we're right that Latin-1 doesn't include =
any combining characters.

It definitely does not.

The Unicode Standard Annex #15 =E2=80=9CUnicode Normalization Forms=E2=80=9D=
 also states:

> Note: Text exclusively containing ASCII characters (U+0000..U+007F) is =
left unaffected by all of the Normalization Forms. This is particularly =
important for programming languages. (See Unicode Standard Annex #31, =
"Unicode Identifier and Pattern Syntax" [UAX31].) Text exclusively =
containing Latin-1 characters (U+0000..U+00FF) is left unaffected by =
NFC. This is effectively the same as saying that all Latin-1 text is =
*already* normalized to NFC.
https://unicode.org/reports/tr15/

> Whether taking two passes on non-Latin-1 strings is a net win probably =
also depends on your corpus.

I don=E2=80=99t think you need two passes, and I highly doubt NFC =
converters do. A single pass suffices: for each character, ask =
=E2=80=9Cis it Latin-1?=E2=80=9D If so, copy it to the output. Only if =
you discover a non-Latin-1 character do you run the decomposition and =
recomposition steps. To avoid allocating in the trivial case, run the =
test without copying; only when you find a non-Latin-1 character do you =
copy the result so far to the output and proceed with the more complex =
case. If you never find one, return the original input.
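
A minimal sketch of that fast path in Python, assuming the input is =
already a decoded `str`. The name `nfc_latin1_fast_path` is made up for =
illustration, and the slow path simply hands the whole string to the =
standard `unicodedata.normalize` rather than resuming mid-string:

```python
import unicodedata

def nfc_latin1_fast_path(text):
    # Hypothetical helper: scan once and fall back to full NFC
    # normalization at the first non-Latin-1 code point.
    for char in text:
        if ord(char) > 255:
            # Slow path. Normalize the whole string, because a
            # combining mark found here may compose with the
            # Latin-1 character just before it.
            return unicodedata.normalize('NFC', text)
    # Fast path: all Latin-1, so already NFC. Nothing is copied.
    return text
```

Renormalizing from the start keeps the sketch short; a tuned converter =
could reuse most of the already-scanned prefix instead.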

> That would be a reasonable choice as well, depending on your use case. =
 Would you expect the decoder to fail decoding if it received a =
non-Latin-1 character?  Or would that be some other layer's job?

As it is the dCBOR decoder=E2=80=99s job to enforce dCBOR compliance, a =
*fully* compliant decoder would enforce the NFC rule completely. Failing =
that, a less-than-fully compliant decoder SHOULD do what it can to =
enforce the rules, but SHOULD also document that it is not fully =
compliant. For a decoder that accepts only Latin-1 text, checking that =
every code point p satisfies 0 <=3D p <=3D 255 is sufficient:

```python
def isAllLatin1(data):
    # `data` holds the raw bytes of a CBOR text string.
    try:
        # Decoding first also rejects malformed UTF-8.
        return all(ord(char) < 256 for char in data.decode('utf-8'))
    except UnicodeDecodeError:
        return False
```

This check would never admit a non-NFC string, but it would also reject =
every NFC string that contains a character outside Latin-1.
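
For comparison, a sketch of what the full rule could look like in =
Python, assuming Python 3.8 or later for `unicodedata.is_normalized`. =
The name `is_nfc_utf8` is hypothetical:

```python
import unicodedata

def is_nfc_utf8(data):
    # Hypothetical full check: valid UTF-8 and already in NFC.
    try:
        return unicodedata.is_normalized('NFC', data.decode('utf-8'))
    except UnicodeDecodeError:
        return False
```

This accepts NFC text outside Latin-1 that the range check above would =
reject, at the cost of carrying the Unicode normalization tables.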

If your environment were so constrained that you needed to drop =
compliance with this rule entirely, then you could just validate the =
string as UTF-8 and accept it.

```python
def is_valid_utf8(data):
    try:
        data.decode('utf-8')
        return True
    except UnicodeDecodeError:
        return False
```

If you needed to drop even that level of compliance, then you'd =
basically just treat the string like a blob and hope for the best.

~ Wolf=

--Apple-Mail=_A6324986-5438-4E27-9DDE-C5E267347265--

