Re: [I18ndir] draft-bray-unichars-01

Asmus Freytag <asmusf@ix.netcom.com> Tue, 29 August 2023 19:10 UTC

To: Tim Bray <tbray@textuality.com>, "i18ndir@ietf.org" <i18ndir@ietf.org>
References: <CAHBU6isuZ1fgAjv14JRCiWaq-cmE69iEGajQkDDNA4CzfTKoxQ@mail.gmail.com>
From: Asmus Freytag <asmusf@ix.netcom.com>
In-Reply-To: <CAHBU6isuZ1fgAjv14JRCiWaq-cmE69iEGajQkDDNA4CzfTKoxQ@mail.gmail.com>
Archived-At: <https://mailarchive.ietf.org/arch/msg/i18ndir/6QEjobPR5lz_-3NhpUdyfpeo2w0>
Subject: Re: [I18ndir] draft-bray-unichars-01

Comments on the draft (in no particular order or priority).

(1)
"Since the C0 controls include zero and the 32 smallest integers, they 
are likely to occur in data as a result of programming errors."

The Unicode Standard uses "control codes" to refer to the characters
themselves, so "the C0 controls include" would be read as a statement
about characters. What you are referring to here are the integer
values, that is, the code points. I suggest rewording, because "zero"
as a digit is not a control code; while it's possible to figure out
what must be meant, it's a needless ambiguity.

How about "Since the code points for C0 controls include the 32 smallest 
integers including zero..."
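
To make the distinction concrete, here is a minimal Python sketch (the
names C0_CODE_POINTS and is_c0_control are mine, for illustration only).
It shows that the C0 range covers the integer zero as a code point,
while the digit "0" sits at U+0030 and is not a control:

```python
# Code points of the C0 controls: the integers 0 through 31 (U+0000..U+001F).
C0_CODE_POINTS = range(0x00, 0x20)


def is_c0_control(ch: str) -> bool:
    """Return True if the single character ch has a code point in the C0 range."""
    return ord(ch) in C0_CODE_POINTS


print(len(C0_CODE_POINTS))    # 32
print(is_c0_control("\x00"))  # True: U+0000, the code point with value zero
print(is_c0_control("0"))     # False: U+0030 DIGIT ZERO is a graphic character
```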

(2)
I recommend thinking not of "problematic code points" but of 
"problematic code point types" as per this definition from the Unicode 
Glossary.

    Code Point Type
    <https://www.unicode.org/glossary/#code_point_type>. Any of the
    seven fundamental classes of code points in the standard: Graphic,
    Format, Control, Private-Use, Surrogate, Noncharacter, Reserved.
    (See definition D10a in Section 3.4, Characters and Encoding
    <https://www.unicode.org/versions/latest/ch03.pdf#G2212>.)

This would also let you address the problem of private-use code points
in interchange (even if you conclude they might be useful enough not to
restrict by default, they suffer from the same lack of consensus
interpretation as the controls).
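
As a rough illustration of the seven types, the sketch below maps a
code point to its type using the General_Category values from Python's
unicodedata module (Cc, Cf/Zl/Zp, Co, Cs, Cn) plus the fixed
noncharacter ranges. The function names are hypothetical, and the
Reserved result depends on the Unicode version bundled with the Python
build in use:

```python
import unicodedata


def is_noncharacter(cp: int) -> bool:
    """U+FDD0..U+FDEF plus the last two code points of each of the 17 planes."""
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def code_point_type(cp: int) -> str:
    """Classify a code point into one of the seven code point types."""
    if is_noncharacter(cp):
        return "Noncharacter"
    gc = unicodedata.category(chr(cp))
    if gc == "Cc":
        return "Control"
    if gc in ("Cf", "Zl", "Zp"):
        return "Format"
    if gc == "Co":
        return "Private-Use"
    if gc == "Cs":
        return "Surrogate"
    if gc == "Cn":
        # Unassigned and not a noncharacter: reserved for future assignment.
        return "Reserved"
    return "Graphic"


for cp in (0x41, 0x00, 0xAD, 0xE000, 0xD800, 0xFFFE):
    print(f"U+{cp:04X}", code_point_type(cp))
```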

(3)

"reserved for internal use" should be "reserved for internal use by
applications" (as opposed to internal use by the standard).


(4)
"This subset has the advantage of excluding surrogates, which can never
add any value and have the potential to cause problems."

should be reworded a bit to:

"This subset has the advantage of excluding surrogates, which are not 
assigned to any characters, and thus can never add any value.
They have the potential to cause problems, for example it is not 
possible to represent them individually in UTF-8."

Rationale for this suggestion is to be slightly more specific, so the 
reader comes away with the conclusion that the "can never add any value" 
is based on well-founded reasons and not editorial opinion by the writers.
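
The UTF-8 point is easy to demonstrate. A minimal Python sketch
(variable names are illustrative): a lone surrogate cannot be encoded
in UTF-8, while a supplementary character uses a surrogate pair only in
UTF-16 and is a plain four-byte sequence in UTF-8.

```python
# A lone surrogate code point cannot be serialized as UTF-8.
lone_surrogate = "\ud800"
try:
    lone_surrogate.encode("utf-8")
    encodable = True
except UnicodeEncodeError:
    encodable = False
print(encodable)  # False

# Surrogates exist only as a UTF-16 mechanism for supplementary characters.
ch = "\U0001F600"
print(ch.encode("utf-16-be").hex())  # d83dde00 (a surrogate pair)
print(ch.encode("utf-8").hex())      # f09f9880 (no surrogates involved)
```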

(5)
The ABNF in section 4.1: the comments are confusing.

(6)
I would like to refer you to Table 2-3 in the Unicode Standard. The
"Basic Unicode Characters" that you propose consist of what Unicode
considers "Assigned" code points plus "Reserved", with the modification
that you are subtracting the "useless controls".

Reserved code points are sometimes referred to as "assignable code 
points" (for example on the bottom of page 30 in Unicode 15.0.0). That 
makes that subset the combination of "assigned" plus "assignable" code 
points. (Which then, modulo "useless" controls, corresponds to the bulk 
of the basic set.)

I would suggest explicitly relating your definition to those terms.
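
Expressed in those terms, membership in the set could be sketched as
follows (Python). This is my reading, not the draft's ABNF: the
function names are hypothetical, and the choice of tab, LF and CR as
the "useful" controls is an assumption on my part. Only the C0 and C1
ranges are treated as controls here; the handling of U+007F DELETE is
left open.

```python
# ASSUMPTION: the "useful" controls are tab, LF and CR.
USEFUL_CONTROLS = {0x09, 0x0A, 0x0D}


def is_surrogate(cp: int) -> bool:
    return 0xD800 <= cp <= 0xDFFF


def is_noncharacter(cp: int) -> bool:
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def is_c0_or_c1(cp: int) -> bool:
    return cp <= 0x1F or 0x80 <= cp <= 0x9F


def in_basic_set(cp: int) -> bool:
    """Assigned + assignable (reserved) code points, minus "useless" controls."""
    if not 0 <= cp <= 0x10FFFF:
        return False
    if is_surrogate(cp) or is_noncharacter(cp):
        return False  # these can never be assigned to characters
    if is_c0_or_c1(cp) and cp not in USEFUL_CONTROLS:
        return False
    return True


print(in_basic_set(0x0A))     # True: LF, assumed useful
print(in_basic_set(0x00))     # False: C0 control
print(in_basic_set(0xD800))   # False: surrogate
print(in_basic_set(0x40000))  # True: reserved, hence assignable
```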

(7)
The definition of "useful controls" is currently buried in the text, and
there's not even a header to locate it. Because this (or the
complementary definition of the "useless" ones) is the only value-added
piece over "assigned + assignable", I suggest elevating the definition.

I'm not going to quibble about the very opinionated naming of the concept.

(8)
The term "Basic" is an interesting choice, because the set is anything
but "basic": it includes every code point that is or can ever be
assigned, except for 61 C0 / C1 controls. The only part of the set that
is "basic" is in fact the subset of control codes.

What you have defined is the "maximally useful set of Unicode code 
points for data interchange, absent a protocol defining specific control 
code semantics".
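
For a sense of scale, here is the arithmetic as a short Python sketch.
The figure of 3 retained controls (tab, LF, CR) is my assumption; it is
what makes the count of excluded C0 / C1 controls come out at 61.

```python
TOTAL_CODE_POINTS = 0x110000       # U+0000..U+10FFFF
SURROGATES = 0xDFFF - 0xD800 + 1   # 2048
NONCHARACTERS = 32 + 2 * 17        # U+FDD0..U+FDEF plus 2 per plane = 66
C0_AND_C1_CONTROLS = 32 + 32       # U+0000..U+001F and U+0080..U+009F
USEFUL_CONTROLS = 3                # ASSUMPTION: tab, LF, CR

excluded_controls = C0_AND_C1_CONTROLS - USEFUL_CONTROLS
remaining = TOTAL_CODE_POINTS - SURROGATES - NONCHARACTERS - excluded_controls

print(excluded_controls)  # 61
print(remaining)          # 1111937 of 1114112 code points
```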


Hope you find some of these comments useful,
A.

On 8/29/2023 9:37 AM, Tim Bray wrote:
> Hello I18ndir (anyone still here?), Paul Hoffman and I just submitted 
> draft-bray-unichars-01 - our AD Francesca Palombini suggested we 
> notify this list: https://datatracker.ietf.org/doc/draft-bray-unichars/
>
> This draft fell out of a conversation originally provoked by this 
> errata report: https://www.rfc-editor.org/errata/eid7603
>
> It revealed a distressing lack of consensus about Unicode characters 
> and code points and character repertoires. I feel personally bad 
> because I am the editor of a couple of RFCs that are open to criticism 
> on this front.
>
> So, this tries to say “here’s how an RFC should specify which Unicode 
> characters it supports”.  We think this would be useful to multiple 
> groups almost immediately, including especially those who don’t 
> realize the area can be problematic.
>
> Anyhow, the purpose of this note is to ask your advice on how to take 
> this forward. We think this works well as an individual submission.   
> We don’t *think* it needs a working group, but that’s not our call. 
> The draft doesn’t express any opinions about best practices, it just 
> points out several alternative character repertoires, provides ABNF, 
> and discusses their trade-offs.  None of them are wrong.
>