Re: [I18ndir] [art] Fwd: New Version Notification for draft-bray-unichars-06.txt

Tim Bray <tbray@textuality.com> Sun, 01 October 2023 17:31 UTC

References: <169566019635.41806.9804796677919971070@ietfa.amsl.com> <CAHBU6is-wU2NLXNWL56nSJ4=nKvDzGv_Aw4qJN6N2O8CuM4-yw@mail.gmail.com> <SYBPR01MB59814B3448F5754AAEDA1740E5C7A@SYBPR01MB5981.ausprd01.prod.outlook.com> <CAHBU6iueqtd5T1T-ciYUMWvmo8XqBQqO5LkWbdRaoXQzPYSQOQ@mail.gmail.com> <SYBPR01MB59819A9F0BDD785F74EB2855E5C7A@SYBPR01MB5981.ausprd01.prod.outlook.com>
In-Reply-To: <SYBPR01MB59819A9F0BDD785F74EB2855E5C7A@SYBPR01MB5981.ausprd01.prod.outlook.com>
From: Tim Bray <tbray@textuality.com>
Date: Sun, 01 Oct 2023 17:31:18 +0000
Message-ID: <CAHBU6iu_PUdWXk52UfnoYo7-e0s+tWfiWqy5i+QrrvgJhYOenQ@mail.gmail.com>
To: "Manger, James" <James.H.Manger@team.telstra.com>
Cc: "i18ndir@ietf.org" <i18ndir@ietf.org>, ART Area <art@ietf.org>
Archived-At: <https://mailarchive.ietf.org/arch/msg/i18ndir/LtBYThahEt9L9Cv9uPHzxKmbx00>
Subject: Re: [I18ndir] [art] Fwd: New Version Notification for draft-bray-unichars-06.txt

On Sep 30, 2023 at 6:53:28 PM, "Manger, James" <
James.H.Manger@team.telstra.com> wrote:

> Why is U+D800 so much more crucial than U+20FFFF or an ill-formed UTF-8
> byte sequence such as 0xC0 0x80?
>
>
>
> There are lots of byte sequences that are ill-formed UTF-8. Some
> correspond to non-minimal-length encodings (eg C0 80 ~ U+0000); some to
>

Because surrogates are more of a problem in the wild.  My belief is that
this is mostly a consequence of JSON’s \u escapes and Java’s “char”
primitive (OMG, I just checked, and Oracle’s official Java tutorial still
says in 2023 that “The char data type is a single 16-bit Unicode
character.”), plus the JavaScript/browser behavior Rob Sayre describes.
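
To make the JSON point concrete, here is a minimal sketch in Python (chosen
only for illustration; it is not taken from the draft). It shows a \u escape
smuggling a lone surrogate through an all-ASCII JSON text, and the resulting
string then failing to encode as UTF-8. The names wire_text and
can_encode_utf8 are hypothetical.

```python
import json

# An all-ASCII JSON text whose \u escape names a lone high surrogate.
# (wire_text is a made-up example payload.)
wire_text = '{"name": "\\ud800"}'

# Python's json module accepts the escape without complaint and hands back
# a str containing the code point U+D800.
parsed = json.loads(wire_text)
name = parsed["name"]


def can_encode_utf8(s: str) -> bool:
    """Return True if s can be encoded as UTF-8 with the strict handler,
    which is the case only when s contains no surrogate code points."""
    try:
        s.encode("utf-8")
        return True
    except UnicodeEncodeError:
        return False


print(hex(ord(name)))                 # the surrogate survived parsing
print(can_encode_utf8(name))          # the lone surrogate is not encodable
print(can_encode_utf8("\U0001F600"))  # a real supplementary character is
```

The parser is happy, and the failure only shows up later, when something
tries to write the string out as UTF-8.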

> the explanation of the problem works better if you start from code points
>
>
>
> I disagree. Problems include:
>

I think that both I and the authors of the Unicode standard disagree with
you. The narrative starts with code points and builds up from there.
Unfortunately, this argument can’t be settled by providing technical
evidence; it’s a matter of pedagogical technique.


>    1. Arbitrary code unit sequences can be ill-formed
>
>  > motivate why scalars exist
>
>
>
> Explaining the 1,081,344 size and the U+D800-U+DFFF gap would be
> interesting.
>

Yes! That history is a mystery to me.  Also the nonchars in the
Arabic-extended region.

> I don’t think the discussion about scalars and UTF-8/16/32 is
> appropriate. UTF-16 and UTF-32 shouldn’t be used in protocols, the Internet
> has converged on UTF-8
>
>
>
> Argh!
>
> Sure, only use UTF-8 in protocols. But there is also memory & APIs where
> 16-bit code units (hence UTF-16) are still important.
>

You are correct. But this document is designed to help protocol designers
in the IETF, who really have no business straying outside the boundaries of
UTF-8.

> If we are only interested in UTF-8 why are surrogates ever mentioned?
>

Because they are forbidden in theory, occur in practice, and break software.
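
A small sketch of "forbidden in theory, occur in practice", again in Python
and again only as an illustration under stated assumptions: the helper name
is_well_formed_utf8 and the sample table are hypothetical. A strict UTF-8
decoder rejects both the overlong C0 80 and the encoded surrogate ED A0 80,
yet a lenient encoder (Python's "surrogatepass" error handler) will emit
exactly those surrogate bytes.

```python
def is_well_formed_utf8(data: bytes) -> bool:
    """Return True if data decodes under Python's strict UTF-8 decoder."""
    try:
        data.decode("utf-8")  # error handling is "strict" by default
        return True
    except UnicodeDecodeError:
        return False


# Hypothetical sample inputs, labelled by what they are meant to show.
samples = {
    "overlong encoding of U+0000 (C0 80)": bytes([0xC0, 0x80]),
    "encoded surrogate U+D800 (ED A0 80)": bytes([0xED, 0xA0, 0x80]),
    "ordinary text": "héllo".encode("utf-8"),
}

for label, data in samples.items():
    print(label, is_well_formed_utf8(data))

# In practice: a lenient encoder writes the surrogate out anyway, producing
# bytes that a strict decoder on the receiving side will refuse.
leaked = "\ud800".encode("utf-8", "surrogatepass")
print(leaked.hex(), is_well_formed_utf8(leaked))
```

So the sender and receiver disagree about whether the same bytes are
acceptable, which is the kind of breakage at issue here.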

