Re: [I18ndir] draft-bray-unichars-01

Tim Bray <tbray@textuality.com> Wed, 30 August 2023 00:05 UTC

References: <CAHBU6isuZ1fgAjv14JRCiWaq-cmE69iEGajQkDDNA4CzfTKoxQ@mail.gmail.com> <122f70b8-62f8-cd24-a0e1-c3e0052b37e8@ix.netcom.com>
In-Reply-To: <122f70b8-62f8-cd24-a0e1-c3e0052b37e8@ix.netcom.com>
From: Tim Bray <tbray@textuality.com>
Date: Tue, 29 Aug 2023 19:05:13 -0500
Message-ID: <CAHBU6isB7u7wqaJOsuae3O8m9vi3P9Z5c8H1OhH4EiXUP9wJsA@mail.gmail.com>
To: Asmus Freytag <asmusf@ix.netcom.com>
Cc: "i18ndir@ietf.org" <i18ndir@ietf.org>
Archived-At: <https://mailarchive.ietf.org/arch/msg/i18ndir/kaTgNUlQhmCHekSCwZh-CCumuwQ>
Subject: Re: [I18ndir] draft-bray-unichars-01

Asmus, these are great! Thanks. If this document goes forward I think it
will include almost all of what you suggest.

My favorite input is (7): neither co-author is in love with “Basic”, and a
better suggestion would be greeted with joy.

On Aug 29, 2023 at 12:10:42 PM, Asmus Freytag <asmusf@ix.netcom.com> wrote:

> Comments on the draft (in no particular order or priority).
>
> (1)
> "Since the C0 controls include zero and the 32 smallest integers, they are
> likely to occur in data as a result of programming errors."
>
> In the way the Unicode standard refers to characters "the control codes"
> are the characters, so "the C0 control codes include" would be read that
> way. What you are referring to here are the integer values, corresponding
> to a code point. Suggest rewording because "zero" as a digit is not a
> control code and while it's possible to figure out what must be meant, it's
> a needless ambiguity.
>
> How about "Since the code points for C0 controls include the 32 smallest
> integers including zero..."
>
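
A rough illustration of the distinction drawn in (1), not taken from the draft: the C0 controls occupy the integer values 0 through 31, which is why zero-initialized memory decodes as control characters, while the digit "0" sits at a different code point. The variable names are only illustrative.

```python
import unicodedata

# Code points of the C0 controls: the 32 smallest integers, including zero.
C0_CODE_POINTS = range(0x00, 0x20)

# Every one of them has the Unicode general category "Cc" (control).
all_controls = all(unicodedata.category(chr(cp)) == "Cc" for cp in C0_CODE_POINTS)

# A zero-filled buffer (a common result of a programming error) is valid
# UTF-8 and decodes to a run of U+0000 control characters.
zeroed = bytes(4)
decoded = zeroed.decode("utf-8")

# The digit zero is an ordinary graphic character at U+0030, not a control.
digit_zero_cp = ord("0")
```

Here `decoded` consists of four NUL characters, and `digit_zero_cp` falls outside `C0_CODE_POINTS`, which is the ambiguity the rewording avoids.
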
> (2)
> I recommend thinking not of "problematic code points" but of "problematic
> code point types" as per this definition from the Unicode Glossary.
>
> *Code Point Type <https://www.unicode.org/glossary/#code_point_type>*.
> Any of the seven fundamental classes of code points in the standard:
> Graphic, Format, Control, Private-Use, Surrogate, Noncharacter, Reserved.
> (See definition D10a in Section 3.4, Characters and Encoding
> <https://www.unicode.org/versions/latest/ch03.pdf#G2212>.)
>
> This would let you also address the problem of private-use code points in
> interchange (even if you conclude they might be useful enough not to
> restrict them by default, they suffer from the same lack of consensus
> interpretation as the controls).
>
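
A minimal sketch of how the seven code point types in (2) could be derived from the general category, assuming the usual mapping in the Unicode Standard (Cc is Control; Cf, Zl and Zp are Format; Co is Private-Use; Cs is Surrogate; Cn is either Noncharacter or Reserved). The function name `code_point_type` is hypothetical, and the Graphic/Reserved split depends on the Unicode version bundled with the Python interpreter.

```python
import unicodedata

def code_point_type(cp: int) -> str:
    """Classify a code point into one of the seven Unicode code point types."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("not a Unicode code point")
    # Noncharacters: U+FDD0..U+FDEF plus the last two code points of each plane.
    if 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE:
        return "Noncharacter"
    category = unicodedata.category(chr(cp))
    if category == "Cc":
        return "Control"
    if category in ("Cf", "Zl", "Zp"):
        return "Format"
    if category == "Co":
        return "Private-Use"
    if category == "Cs":
        return "Surrogate"
    if category == "Cn":
        # Unassigned in this interpreter's Unicode data, and not a noncharacter.
        return "Reserved"
    # Letters, marks, numbers, punctuation, symbols and space separators.
    return "Graphic"
```

For example, `code_point_type(0xE000)` gives "Private-Use" and `code_point_type(0xD800)` gives "Surrogate", so a repertoire can be described by type rather than by listing individual code points.
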
> (3)
>
> "reserved for internal use" should be "reserved for internal use by
> applications" (as opposed to internal use by the standard).
>
>
> (4)
> This subset has the advantage of excluding surrogates, which can never add
> any value and have the potential to cause problems.
>
> should be reworded a bit to:
>
> "This subset has the advantage of excluding surrogates, which are not
> assigned to any characters, and thus can never add any value.
> They have the potential to cause problems; for example, it is not possible
> to represent them individually in UTF-8."
>
> Rationale for this suggestion is to be slightly more specific, so the
> reader comes away with the conclusion that the "can never add any value" is
> based on well-founded reasons and not editorial opinion by the writers.
>
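
The UTF-8 claim in (4) can be checked directly. This is a small sketch rather than anything from the draft, and the helper name `utf8_encodable` is made up for illustration. Python strings may hold a lone surrogate, but the strict UTF-8 codec refuses to encode one.

```python
def utf8_encodable(cp: int) -> bool:
    """Return True if the single code point can be encoded as strict UTF-8."""
    try:
        chr(cp).encode("utf-8")
        return True
    except UnicodeEncodeError:
        # Raised for surrogate code points U+D800..U+DFFF.
        return False

# Surrogates only have meaning as pairs inside UTF-16, where two 16-bit
# code units together stand for one supplementary code point.
utf16_units = "\U0001F600".encode("utf-16-be").hex()
```

With this, `utf8_encodable(0xD800)` is False while `utf8_encodable(0x1F600)` is True, and `utf16_units` shows U+1F600 stored as the pair D83D DE00.
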
> (5)
> The ABNF in section 4.1: the comments are confusing.
>
> (6)
> I would like to refer you to table 2-3 in the Unicode Standard. The "Basic
> Unicode Characters" that you propagate consist of what Unicode considers
> "Assigned" code points plus "Reserved", with the modification that you are
> subtracting the "useless controls".
>
> Reserved code points are sometimes referred to as "assignable code points"
> (for example on the bottom of page 30 in Unicode 15.0.0). That makes that
> subset the combination of "assigned" plus "assignable" code points. (Which
> then, modulo "useless" controls, corresponds to the bulk of the basic set.)
>
> I would suggest explicitly relating your definition to those terms.
>
> (6)
> The definition "useful controls" is currently buried in the text and
> there's not even a header to locate the definition. Because this (or the
> complementary definition of "useless" ones) is the only value added piece
> over "assigned + assignable", I suggest elevating the definition.
>
> I'm not going to quibble about the very opinionated naming of the concept.
>
> (7)
> The term "Basic" is an interesting choice, because the set is anything but
> "basic": it includes all code points that can be maximally assigned,
> except for 61 C0 / C1 Controls. The only part of the set that is "basic" is
> in fact the subset of control codes.
>
> What you have defined is the "maximally useful set of Unicode code points
> for data interchange, absent a protocol defining specific control code
> semantics".
>
>
> Hope you find some of these comments useful,
> A./
>
> On 8/29/2023 9:37 AM, Tim Bray wrote:
>
> Hello I18ndir (anyone still here?). Paul Hoffman and I just submitted
> draft-bray-unichars-01, and our AD Francesca Palombini suggested we notify
> this list: https://datatracker.ietf.org/doc/draft-bray-unichars/
>
> This draft fell out of a conversation originally provoked by this errata
> report: https://www.rfc-editor.org/errata/eid7603
>
> It revealed a distressing lack of consensus about Unicode characters and
> code points and character repertoires. I feel personally bad because I am
> the editor of a couple of RFCs that are open to criticism on this front.
>
> So, this tries to say “here’s how an RFC should specify which Unicode
> characters it supports”. We think this would be useful to multiple groups
> almost immediately, including especially those who don’t realize the area
> can be problematic.
>
> Anyhow, the purpose of this note is to ask your advice on how to take this
> forward. We think this works well as an individual submission. We don’t
> *think* it needs a working group, but that’s not our call. The draft
> doesn’t express any opinions about best practices; it just points out
> several alternative character repertoires, provides ABNF, and discusses
> their trade-offs. None of them are wrong.