Re: [I18ndir] [art] Fwd: New Version Notification for draft-bray-unichars-06.txt

Tim Bray <tbray@textuality.com> Sat, 30 September 2023 21:46 UTC

From: Tim Bray <tbray@textuality.com>
Date: Sat, 30 Sep 2023 21:46:22 +0000
Message-ID: <CAHBU6iueqtd5T1T-ciYUMWvmo8XqBQqO5LkWbdRaoXQzPYSQOQ@mail.gmail.com>
To: "Manger, James" <James.H.Manger@team.telstra.com>
Cc: "i18ndir@ietf.org" <i18ndir@ietf.org>, ART Area <art@ietf.org>
Archived-At: <https://mailarchive.ietf.org/arch/msg/i18ndir/QbFOT7W3lFKARYQAFcgmtM0uRHQ>
Subject: Re: [I18ndir] [art] Fwd: New Version Notification for draft-bray-unichars-06.txt

This is interesting, thank you James.

James proposes a restructure that cleverly removes mentions of code points
and surrogates, on the basis that scalars are more important than code
points.

He also points out an important error: the I-JSON repertoire is *not* scalars,
it’s scalars minus noncharacters. I had entirely forgotten this.  Should we
introduce an I-JSON subset for people who might want to use control codes?
I personally think using control codes is a bad idea; do others disagree?
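
A minimal Python sketch of the distinction between scalars and "scalars minus
noncharacters". The function names are hypothetical and chosen only for
illustration; the ranges are the Unicode definitions of surrogates
(U+D800 to U+DFFF) and noncharacters (U+FDD0 to U+FDEF, plus the last two
code points of every plane).

```python
def is_scalar(cp: int) -> bool:
    """True for any code point that is not a surrogate."""
    return 0 <= cp <= 0xD7FF or 0xE000 <= cp <= 0x10FFFF


def is_noncharacter(cp: int) -> bool:
    """True for the 66 code points Unicode designates as noncharacters."""
    if 0xFDD0 <= cp <= 0xFDEF:
        return True
    # U+FFFE and U+FFFF, and the same two positions in planes 1 through 16.
    return 0 <= cp <= 0x10FFFF and (cp & 0xFFFE) == 0xFFFE


def in_scalars_minus_noncharacters(cp: int) -> bool:
    """Hypothetical name for the repertoire described above."""
    return is_scalar(cp) and not is_noncharacter(cp)


if __name__ == "__main__":
    for cp in (0x0000, 0x0041, 0xD800, 0xFDD0, 0xFFFD, 0xFFFE, 0x10FFFF):
        print(f"U+{cp:04X}", is_scalar(cp), in_scalars_minus_noncharacters(cp))
```

Note that this repertoire still contains the C0 and C1 control codes; only
surrogates and noncharacters are excluded.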

While I agree that scalars are more important than code points, I have a
problem with some of this, which is not technical.  I think the most common
scenario in which this document would be used (should it progress) would
be someone who is working on an Internet-Draft with text fields and gets
told “you need to specify your character repertoire, go look at
[Unichars]”.

So the document exists at least in part to educate people about what’s
lurking in Unicode.  For that reason, I don’t want to have a spec in the
style of “just do this, don’t worry about why, trust us”. I think the
explanation of the problem works better if you start from code points and
explicitly caution against surrogates. Also, it’s hard to motivate why
scalars exist without introducing surrogates.
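
To illustrate why surrogates motivate the notion of a scalar, here is a small
Python sketch. The helper names are hypothetical. It shows that the surrogate
range is reserved so that UTF-16 can encode code points above U+FFFF as a pair
of 16-bit units, and that a surrogate on its own cannot be encoded as UTF-8.

```python
def utf16_code_units(cp: int) -> list[int]:
    """Return the UTF-16 code units that encode one Unicode scalar."""
    if 0xD800 <= cp <= 0xDFFF or not 0 <= cp <= 0x10FFFF:
        raise ValueError(f"U+{cp:04X} is not a Unicode scalar")
    if cp < 0x10000:
        return [cp]  # fits in a single 16-bit unit
    offset = cp - 0x10000  # 20 bits, split across two surrogates
    return [0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)]


def encodes_as_utf8(cp: int) -> bool:
    """True if Python's strict UTF-8 encoder accepts this code point."""
    try:
        chr(cp).encode("utf-8")
        return True
    except UnicodeEncodeError:
        return False


if __name__ == "__main__":
    print([f"{u:04X}" for u in utf16_code_units(0x1F600)])
    print(encodes_as_utf8(0xD800), encodes_as_utf8(0xE000))
```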

I don’t think the discussion about scalars and UTF-8/16/32 is appropriate.
UTF-16 and UTF-32 shouldn’t be used in protocols: the Internet has
converged on UTF-8, RFC 2277 mandates it, and this document should assume
that characters on the wire are UTF-8.
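
Even under a UTF-8-only assumption, a receiver has to decide what to do with
bytes that are not well-formed UTF-8. The sketch below is one possible
approach, not a recommendation from the draft: a hypothetical `decode_field`
helper that either signals an error or substitutes U+FFFD, the two options
named in James's suggested text quoted below. It relies on the behaviour of
Python's built-in UTF-8 codec.

```python
def decode_field(raw: bytes, *, replace: bool = False) -> str:
    """Decode a UTF-8 field; reject ill-formed input unless replace=True.

    With replace=True, ill-formed byte sequences become U+FFFD instead of
    raising UnicodeDecodeError. They are never silently dropped.
    """
    return raw.decode("utf-8", errors="replace" if replace else "strict")


if __name__ == "__main__":
    samples = [
        "caf\u00e9".encode("utf-8"),  # well-formed
        b"\xc0\x80",                  # overlong encoding, ill-formed
        b"\xed\xa0\x80",              # UTF-8-style encoding of a surrogate
    ]
    for raw in samples:
        try:
            print(raw, "->", repr(decode_field(raw)))
        except UnicodeDecodeError as err:
            print(raw, "-> rejected:", err.reason)
```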

What do people think about the BOM (U+FEFF)?  Officially, it’s ZERO WIDTH
NO-BREAK SPACE.  It probably doesn’t add value to a UTF-8 field but it also
probably won’t break anything.  I can see people using it to pad out
fixed-length fields?  If we remove it from “Unicode Assignables” then we
probably have to include an explanation about the BOM functionality.
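
For reference, a short Python sketch of how U+FEFF behaves in UTF-8. The
`strip_leading_bom` helper is hypothetical; `utf-8` and `utf-8-sig` are
Python's standard codec names. The plain codec keeps a leading U+FEFF as an
ordinary character, while `utf-8-sig` treats it as an encoding signature and
removes it.

```python
import unicodedata

BOM = "\ufeff"


def strip_leading_bom(text: str) -> str:
    """Drop a single leading U+FEFF, leaving any later occurrences alone."""
    return text[1:] if text.startswith(BOM) else text


if __name__ == "__main__":
    print(unicodedata.name(BOM))            # the character's formal name
    print(BOM.encode("utf-8"))              # its three-byte UTF-8 form
    raw = b"\xef\xbb\xbfabc"
    print(repr(raw.decode("utf-8")))        # U+FEFF survives as a character
    print(repr(raw.decode("utf-8-sig")))    # U+FEFF removed as a signature
    print(repr(strip_leading_bom(raw.decode("utf-8"))))
```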



On Sep 30, 2023 at 2:15:01 AM, "Manger, James" <
James.H.Manger@team.telstra.com> wrote:

> Comments on draft-bray-unichars-06
> <https://datatracker.ietf.org/doc/html/draft-bray-unichars>.
>
>
>
> §2 “Characters and Code Points” should start with scalars; they are far
> more important than code points. Suggested replacement text.
>
> [UNICODE <http://www.unicode.org/versions/latest/>] defines the 1,112,064
> integers in the ranges 0 to D7FF₁₆ and E000₁₆ to 10FFFF₁₆ as "Unicode
> scalars". Every character is assigned to one scalar. As of Unicode 15.1
> (2023), 149,813 characters have been assigned, leaving 962,251 scalars
> available for assignment in future versions.
>
> unicode-scalar = %x0-D7FF / %xE000-10FFFF
>
> Scalars are the complete set of values that can be uniquely represented in
> all 3 Unicode encoding forms – UTF-8, UTF-16, and UTF-32 – which use 8-bit,
> 16-bit and 32-bit code units respectively. So scalars are the repertoire
> that works for representing characters in memory, storage, and in network
> protocols regardless of choices to use 8, 16 or 32-bit words.
>
> I’d rename the section to be “Characters and scalars”.
>
> Drop §2.1 “Transformation formats”.
>
> §2.2 “Problematic Code Point Types” could be renamed “Problematic
> characters”.
>
> BOM should be considered a problematic character (and excluded from
> unicode-assignables) as it is used as an encoding-layer signal.
>
> Surrogates are handled at the encoding layer, not the character layer, so
> drop §2.2.1 “Surrogates”. I suggest a new §2.3 “Ill-formed encodings” to
> replace §2.2.1 and most of §3.
>
> 2.3. Ill-formed encodings
>
> A sequence of 8-bit, 16-bit or 32-bit code units representing scalars is a
> well-formed UTF-8, UTF-16, or UTF-32 encoding respectively. However, there
> are other code unit sequences in each of these 3 encodings that don’t map
> to scalars (e.g. C0₁₆ 80₁₆ in UTF-8; D800₁₆ in UTF-16; 20FFFF₁₆ in UTF-32).
> Such sequences are called ill-formed. They can exist in practice. Reasonable
> options when interpreting such code unit sequences are signalling an error
> or treating them as "�" (U+FFFD, REPLACEMENT CHARACTER). Silently ignoring
> ill-formed code unit sequences is a known security risk.
>
> Drop §3 “Dealing With Problematic Code Points”.
>
> Typo: \U0089 should be \u0089.
>
> Typo: RFC19413 should be RFC9413.
>
> I’d define unicode-scalar in §2 so we don’t need §4.1 “Unicode Scalars”.
> §4 can say:
>
> Specifications can refer to these by the names “Unicode scalars” (section
> 2), “XML Characters”, and “Unicode Assignables”.
>
> I-JSON can’t be used as an example using unicode-scalar as it explicitly
> excludes noncharacters; and the difference between the repertoire for the
> JSON vs the repertoire for the logical string that can be represented by a
> JSON string is not explained.
>
> i-json-value-repertoire = %x9 / %xA / %xD / %x20-D7FF / %xE000-FDCF /
> %xFDF0-FFFD / %x10000-1FFFD / %x20000-2FFFD / … / %x100000-10FFFD
>
> i-json-logical-string-repertoire = %x0-D7FF / %xE000-FDCF / %xFDF0-FFFD /
> %x10000-1FFFD / %x20000-2FFFD / … / %x100000-10FFFD
>
> --
>
> James Manger
>
> From: art <art-bounces@ietf.org> on behalf of Tim Bray
> <tbray@textuality.com>
> Date: Tuesday, 26 September 2023 at 2:51 am
> To: i18ndir@ietf.org <i18ndir@ietf.org>, ART Area <art@ietf.org>
> Subject: [art] Fwd: New Version Notification for
> draft-bray-unichars-06.txt
>
> What’s new and different here.
>
>
>
>    1. Locked down definition of “problematic”
>    2. Locked down definition of “character repertoire”
>    3. Changed “Useful Assignables” to “Unicode Assignables” (checked with
>    Asmus first)
>
>
>
> A new version of Internet-Draft draft-bray-unichars-06.txt has been
> successfully submitted by Paul Hoffman and posted to the
> IETF repository.
>
> Name:     draft-bray-unichars
> Revision: 06
> Title:    Unicode Character Repertoire Subsets
> Date:     2023-09-25
> Group:    Individual Submission
> Pages:    10
> URL:      https://www.ietf.org/archive/id/draft-bray-unichars-06.txt
> Status:   https://datatracker.ietf.org/doc/draft-bray-unichars/
> HTML:     https://www.ietf.org/archive/id/draft-bray-unichars-06.html
> HTMLized: https://datatracker.ietf.org/doc/html/draft-bray-unichars
> Diff:     https://author-tools.ietf.org/iddiff?url2=draft-bray-unichars-06
>
> Abstract:
>
>   This document discusses specifying subsets of the Unicode character
>   repertoire for use in protocols and data formats.
>
>
>
> The IETF Secretariat
>
>