[Cbor] Re: dCBOR: Normalization of Strings

Joe Hildebrand <hildjj@cursive.net> Mon, 29 July 2024 22:18 UTC

From: Joe Hildebrand <hildjj@cursive.net>
In-Reply-To: <D6A3A142-0999-4D0B-9CBC-A698BC384DD4@wolfmcnally.com>
Date: Mon, 29 Jul 2024 18:18:11 -0400
Content-Transfer-Encoding: quoted-printable
Message-Id: <B342E173-0F31-42A3-B980-123ACDA74200@cursive.net>
References: <E8325093-F005-4A56-8AFB-9C1637E19EA4@wolfmcnally.com> <F04DCC66-CECA-4FFD-9AF1-C58F983A8EF1@cursive.net> <D6A3A142-0999-4D0B-9CBC-A698BC384DD4@wolfmcnally.com>
To: Wolf McNally <wolf@wolfmcnally.com>
CC: CBOR <cbor@ietf.org>, Carsten Bormann <cabo@tzi.org>, Christopher Allen <christophera@lifewithalacrity.com>, Shannon Appelcline <shannon.appelcline@gmail.com>
List-Id: "Concise Binary Object Representation (CBOR)" <cbor.ietf.org>
Archived-At: <https://mailarchive.ietf.org/arch/msg/cbor/OPzf08nJJMrm6PdBHoCnqLayw3A>

> On Jul 29, 2024, at 5:53 PM, Wolf McNally <wolf@wolfmcnally.com> wrote:
> 
> Joe,
> 
> Thanks for your input.
> 
> My main problem with NFD is that it breaks existing Latin-1 characters with diacritics like `è` (0xe8) into two code points, requiring *three* bytes to encode as UTF-8:

Yes.

> • e (U+0065) encoded as 0x65 in UTF-8.
> • ̀ (U+0300) encoded as 0xCC 0x80 in UTF-8.
> 
> This means considerably more work and storage for very common use cases.

I think in most cases, you're unlikely to notice the difference, but of course it depends on your corpus.
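
For illustration, here is a minimal Python sketch of the size difference under discussion. It assumes only the standard-library `unicodedata` module; the helper name `utf8_len` is hypothetical and not from any CBOR or dCBOR library.

```python
import unicodedata

def utf8_len(text: str, form: str) -> int:
    """Return the UTF-8 byte length of `text` after normalizing to `form`."""
    return len(unicodedata.normalize(form, text).encode("utf-8"))

sample = "\u00e8"  # è, LATIN SMALL LETTER E WITH GRAVE

nfc_len = utf8_len(sample, "NFC")  # single code point U+00E8 -> bytes C3 A8
nfd_len = utf8_len(sample, "NFD")  # U+0065 U+0300 -> bytes 65 CC 80

print(nfc_len, nfd_len)  # 2 3
```

Running the same helper over a representative sample of real strings would be one way to estimate how much the extra byte matters for a given corpus.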

> As I stated, Latin-1 (and by inclusion ASCII) is already in NFC by definition. So converting such a string to NFC would first detect whether it is already in Latin-1, and if so trivially avoid any further transformation: no decomposition then recomposition steps would be needed in these common cases.

I think that's true, assuming we're right that Latin-1 doesn't include any combining characters.  Whether taking two passes on non-Latin-1 strings is a net win probably also depends on your corpus.
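
A rough sketch of the fast path described above, in Python. The function name `to_nfc` is hypothetical, and the sketch relies on the assumption stated in this thread that a string containing only code points up to U+00FF is already in NFC (the first combining mark, U+0300, lies above that range).

```python
import unicodedata

def to_nfc(text: str) -> str:
    """Normalize to NFC, skipping the work for Latin-1-only strings."""
    # First pass: if every code point is <= U+00FF, the string is treated
    # as already normalized and returned unchanged.
    if all(ord(ch) <= 0xFF for ch in text):
        return text
    # Second pass: only strings with code points above Latin-1 are
    # handed to the full normalizer.
    return unicodedata.normalize("NFC", text)

print(to_nfc("caf\u00e9") == "caf\u00e9")   # Latin-1 only: returned as-is
print(to_nfc("cafe\u0301") == "caf\u00e9")  # e + combining acute is composed
```

Strings that fail the first check are scanned twice, which is the cost that depends on the corpus.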

Note: I don't really care which NF you pick or know whether you need to normalize for your use case.  I'm mostly pushing back on the IETF *always* recommending NFC.  The usual argument used to be that some font renderers couldn't handle decomposed characters -- that argument is no longer valid.

In one XMPP server I worked on, NFKC processing of user IDs was *the* bottleneck, even after it had been optimized several times. The K part didn't make much difference in performance, but was of course destructive of information.  If we had it to do over, I would have argued hard for NFD.
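
To make the "destructive of information" point concrete, here is a small Python sketch using the standard-library `unicodedata` module. The sample string is an arbitrary choice for illustration, not data from the XMPP server mentioned above.

```python
import unicodedata

# "ﬁ" ligature (U+FB01) followed by superscript two (U+00B2)
original = "\ufb01\u00b2"

nfkc = unicodedata.normalize("NFKC", original)  # compatibility mapping applied
nfc = unicodedata.normalize("NFC", original)    # canonical forms only

print(nfkc == "fi2")    # True: ligature and superscript are folded away
print(nfc == original)  # True: NFC leaves both characters intact
```

Once the string has been folded to `fi2`, nothing in the result indicates that it originally contained a ligature or a superscript.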

-- 
Joe Hildebrand