[Cbor] Re: dCBOR: Normalization of Strings
Wolf McNally <wolf@wolfmcnally.com> Mon, 29 July 2024 23:02 UTC
From: Wolf McNally <wolf@wolfmcnally.com>
Message-Id: <EDA0C1BC-C39A-4C4F-AAEA-D52FBF53A420@wolfmcnally.com>
Date: Mon, 29 Jul 2024 16:01:58 -0700
In-Reply-To: <B342E173-0F31-42A3-B980-123ACDA74200@cursive.net>
To: Joe Hildebrand <hildjj@cursive.net>
References: <E8325093-F005-4A56-8AFB-9C1637E19EA4@wolfmcnally.com> <F04DCC66-CECA-4FFD-9AF1-C58F983A8EF1@cursive.net> <D6A3A142-0999-4D0B-9CBC-A698BC384DD4@wolfmcnally.com> <B342E173-0F31-42A3-B980-123ACDA74200@cursive.net>
CC: CBOR <cbor@ietf.org>, Carsten Bormann <cabo@tzi.org>, Christopher Allen <christophera@lifewithalacrity.com>, Shannon Appelcline <shannon.appelcline@gmail.com>
Subject: [Cbor] Re: dCBOR: Normalization of Strings
List-Id: "Concise Binary Object Representation (CBOR)" <cbor.ietf.org>
Archived-At: <https://mailarchive.ietf.org/arch/msg/cbor/81fskPl60lhUo-GUPZ4ZxdAhsWc>

Joe,

> On Jul 29, 2024, at 3:18 PM, Joe Hildebrand <hildjj@cursive.net> wrote:
>
> I think that's true, assuming we're right that Latin-1 doesn't include any combining characters.

It definitely does not. The Unicode Standard Annex #15 “Unicode Normalization Forms” also states:

> Note: Text exclusively containing ASCII characters (U+0000..U+007F) is left unaffected by all of the Normalization Forms. This is particularly important for programming languages. (See Unicode Standard Annex #31, "Unicode Identifier and Pattern Syntax" [UAX31].) Text exclusively containing Latin-1 characters (U+0000..U+00FF) is left unaffected by NFC.

This is effectively the same as saying that all Latin-1 text is *already* normalized to NFC.

https://unicode.org/reports/tr15/

> Whether taking two passes on non-Latin-1 strings is a net win probably also depends on your corpus.

I don’t think you need two passes, and I highly doubt NFC converters do. It only requires a single pass: for each character ask, “is it Latin-1”. If so, copy it to the output. IFF you discover a non-Latin-1 character, then do the decomposition and recomposition steps.

If you want to avoid allocation for the trivial case, then do this pass with the test, and IFF you find a non-Latin-1 character then copy the result so far to the output and proceed with the more complex case. If you never find a non-Latin-1 character, return the original input.

> That would be a reasonable choice as well, depending on your use case. Would you expect the decoder to fail decoding if it received a non-Latin-1 character? Or would that be some other layer's job?

As it is the dCBOR decoder’s job to enforce dCBOR compliance, a *fully* compliant decoder would fully follow the rule. Failing that, a less-than-fully compliant decoder SHOULD do what it can to enforce the rules, but SHOULD also document that it is not fully compliant.

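As a rough illustration of the single-pass fast path and the fully compliant check described above, here is a minimal sketch using Python's standard `unicodedata` module. The function names `is_latin1`, `to_nfc`, and `is_nfc` are hypothetical, chosen for this example only. The sketch also simplifies one step: when a non-Latin-1 character turns up, it normalizes the whole string rather than resuming at that position, because a combining mark can compose with the Latin-1 character immediately before it.

```python
import unicodedata


def is_latin1(text: str) -> bool:
    """Single pass with early exit: True if every code point is <= U+00FF."""
    return all(ord(ch) <= 0xFF for ch in text)


def to_nfc(text: str) -> str:
    """Encoder side: return text in NFC, skipping work for all-Latin-1 input."""
    if is_latin1(text):
        # Latin-1-only text is already NFC (per UAX #15), so no copy is made.
        return text
    # Assumption of this sketch: normalize the full string instead of
    # resuming mid-string, so "e" + U+0301 still composes correctly.
    return unicodedata.normalize("NFC", text)


def is_nfc(text: str) -> bool:
    """Decoder side: accept only strings that are already in NFC."""
    # unicodedata.is_normalized requires Python 3.8 or later.
    return is_latin1(text) or unicodedata.is_normalized("NFC", text)
```

A decoder aiming for full compliance would reject any text string for which `is_nfc` returns `False`; an encoder would pass strings through `to_nfc` before serializing them.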
In the case of Latin-1, ensuring the code points are in 0 <= p <= 255 is sufficient:

```python
def isAllLatin1(utf8_string: bytes) -> bool:
    """Return True if the bytes are valid UTF-8 with only code points 0..255."""
    try:
        # Decode first: malformed UTF-8 raises UnicodeDecodeError here.
        text = utf8_string.decode('utf-8')
    except UnicodeDecodeError:
        return False
    # Every code point must fall in the Latin-1 range U+0000..U+00FF.
    return all(ord(char) <= 255 for char in text)
```

This check would never admit a non-NFC string, but it would also reject valid NFC strings that contain non-Latin-1 characters.

If your environment were so constrained that you needed to drop compliance for this rule entirely, then you could just validate the string as UTF-8 and then accept it.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode('utf-8')
        return True
    except UnicodeDecodeError:
        return False
```

If you needed to drop even that level of compliance, then you'd basically just treat the string like a blob and hope for the best.

~ Wolf