Re: [I18ndir] [art] Fwd: New Version Notification for draft-bray-unichars-06.txt

Rob Sayre <sayrer@gmail.com> Sun, 01 October 2023 16:28 UTC

MIME-Version: 1.0
References: <169566019635.41806.9804796677919971070@ietfa.amsl.com> <CAHBU6is-wU2NLXNWL56nSJ4=nKvDzGv_Aw4qJN6N2O8CuM4-yw@mail.gmail.com> <SYBPR01MB59814B3448F5754AAEDA1740E5C7A@SYBPR01MB5981.ausprd01.prod.outlook.com> <CAHBU6iueqtd5T1T-ciYUMWvmo8XqBQqO5LkWbdRaoXQzPYSQOQ@mail.gmail.com> <SYBPR01MB59819A9F0BDD785F74EB2855E5C7A@SYBPR01MB5981.ausprd01.prod.outlook.com>
In-Reply-To: <SYBPR01MB59819A9F0BDD785F74EB2855E5C7A@SYBPR01MB5981.ausprd01.prod.outlook.com>
From: Rob Sayre <sayrer@gmail.com>
Date: Sun, 01 Oct 2023 09:27:51 -0700
Message-ID: <CAChr6SwLcEX3Oox-CMCui+p8LQQFJBf+kG8p9WNpD8HzgXsm9Q@mail.gmail.com>
To: "Manger, James" <James.H.Manger=40team.telstra.com@dmarc.ietf.org>
Cc: Tim Bray <tbray@textuality.com>, "i18ndir@ietf.org" <i18ndir@ietf.org>, ART Area <art@ietf.org>
Content-Type: multipart/alternative; boundary="0000000000004f821f0606aa242f"
Archived-At: <https://mailarchive.ietf.org/arch/msg/i18ndir/oii82awE0lIF5vm3rG5jybWUmAM>
Subject: Re: [I18ndir] [art] Fwd: New Version Notification for draft-bray-unichars-06.txt

On Sat, Sep 30, 2023 at 6:53 PM Manger, James <James.H.Manger=40team.telstra.com@dmarc.ietf.org> wrote:

>
>
> > the explanation of the problem works better if you start from code points
>
>
>
> I disagree. Problems include:
>
>    1. Arbitrary code unit sequences can be ill-formed
>    2. BOM
>    3. Noncharacters
>    4. Legacy controls
>    5. Tag chars and other deprecated format chars
>    6. NFC vs NFKC; canonical chars
>    7. Private use chars
>    8. BIDI
>    9. …
>    10. Escaping
>
>
>
> #1 is about encodings. #2 is about encodings and is a char. #3-7 are about
> char subsets. #10 is about a higher level.
>
> Code points don’t help explain any of these.
>

Well, the grammar describes code points, and the document needs to explain
the unassigned ones as well as the surrogate code points.

Some of the topics you raise here are things that people writing Unicode
libraries need to know about, but they are fairly irrelevant to people who
merely use Unicode. Most people will be calling something like
std::str::from_utf8 [0] if they're not using some higher-level library for
JSON or CBOR etc. At the limit, people might ask themselves "what's the ICU
function for this?", but that question is beyond the scope of this document.
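
To make that concrete, here is a minimal sketch of what a typical caller gets
from std::str::from_utf8. The helper name check_utf8 is made up for
illustration. It shows the standard library rejecting ill-formed byte
sequences, including the bytes that would encode a surrogate, while still
accepting a noncharacter such as U+FFFF, so it settles the encoding question
but not the subset question.

```rust
/// Hypothetical helper (not part of any library): returns the text if
/// `bytes` is well-formed UTF-8, or a short description of where decoding
/// stopped.
pub fn check_utf8(bytes: &[u8]) -> Result<&str, String> {
    std::str::from_utf8(bytes).map_err(|e| {
        // `valid_up_to` is the length of the longest well-formed prefix.
        format!("ill-formed UTF-8 after {} valid bytes", e.valid_up_to())
    })
}
```

Calling check_utf8(&[0xED, 0xA0, 0x80]) (the bytes a naive encoder would emit
for U+D800) yields an Err, while check_utf8(&[0xEF, 0xBF, 0xBF]) (the
noncharacter U+FFFF) yields Ok.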

Normalization actually does come up in application code sometimes. For
example, tweets need to be NFC to count characters correctly [1]. But the
person writing that code is not going to learn much from this document (I
do find the subsets useful and concise, though).
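
As a small illustration of why normalization affects counting, the sketch
below counts Unicode scalar values in the composed and decomposed spellings
of "é". The function name scalar_count is hypothetical, and the Rust standard
library does not normalize, so real code would first run the input through a
normalization library (the unicode-normalization crate is one option, named
here as an assumption about what an application might choose).

```rust
/// Hypothetical helper: counts Unicode scalar values, which is what a naive
/// "character count" sees. It performs no normalization.
pub fn scalar_count(s: &str) -> usize {
    s.chars().count()
}
```

With this, scalar_count("\u{00E9}") is 1 (precomposed é) and
scalar_count("e\u{0301}") is 2 (e followed by a combining acute accent), even
though both render the same. Normalizing to NFC before counting makes the two
inputs agree.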

Unpaired surrogate code points and their escaped form need to be covered,
since the standard behavior of billions of web browsers is to smuggle them
in well-formed UTF-8 via escape sequences. [2]
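
A rough sketch of the smuggling problem, under the assumption that the
payload is a JSON-style \uXXXX escape: the escaped text is plain ASCII and so
passes any UTF-8 check, but the code point it names is a surrogate that
cannot become a Rust char or a valid UTF-16 string on its own. Both helper
names below are made up for illustration, and parse_u_escape handles only a
single six-character escape, not a full JSON string.

```rust
/// True if `unit` falls in the surrogate range U+D800..=U+DFFF.
pub fn is_surrogate(unit: u16) -> bool {
    (0xD800..=0xDFFF).contains(&unit)
}

/// Hypothetical helper: parses exactly one `\uXXXX` escape (backslash, 'u',
/// four hex digits) and returns the 16-bit value it names.
pub fn parse_u_escape(escape: &str) -> Option<u16> {
    let hex = escape.strip_prefix("\\u")?;
    // Require exactly four hex digits; this also rules out a leading '+',
    // which `from_str_radix` would otherwise accept.
    if hex.len() != 4 || !hex.bytes().all(|b| b.is_ascii_hexdigit()) {
        return None;
    }
    u16::from_str_radix(hex, 16).ok()
}
```

For the input "\\ud800" (six ASCII characters), std::str::from_utf8 succeeds
on the raw bytes, parse_u_escape returns Some(0xD800), is_surrogate is true
for that value, and char::from_u32(0xD800) returns None. A consumer has to
decide what to do at that point, which is why the escaped form needs
coverage.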

thanks,
Rob

[0] https://doc.rust-lang.org/std/str/fn.from_utf8.html
[1] https://github.com/sayrer/twitter-text/blob/cxx/rust/twitter-text/src/extractor.rs#L323C63-L323C63
[2] https://github.com/tc39/proposal-well-formed-stringify