Re: [I18ndir] draft-bray-unichars-01

Asmus Freytag <asmusf@ix.netcom.com> Wed, 30 August 2023 05:19 UTC

To: Tim Bray <tbray@textuality.com>
Cc: "i18ndir@ietf.org" <i18ndir@ietf.org>
Archived-At: <https://mailarchive.ietf.org/arch/msg/i18ndir/1gGKrL3saOaQh9zLwupMYXx13cQ>

On 8/29/2023 7:50 PM, Tim Bray wrote:
>
>
> On Aug 29, 2023 at 5:22:03 PM, Asmus Freytag <asmusf@ix.netcom.com> wrote:
>> On 8/29/2023 5:05 PM, Tim Bray wrote:
>>> Asmus, these are great! Thanks. If this document goes forward I 
>>> think it will include almost all of what you suggest.
>> I would be thrilled. I think it's useful to move this forward (with 
>> changes).
>>>
>>> My favorite input is (7) - Neither co-author is in love with 
>>> “Basic”, a better suggestion would be greeted with joy.
>>
>> "Minimally restrictive" describes what you are aiming for. You are 
>> restricting everything that has no business being interchanged and no 
>> more.
>>
> Standing alone, that makes sense. But compared to the other options in 
> the draft, such as those used by JSON & XML, we are 
> /more/ restrictive. So there also could be a case for “maximally 
> restrictive”. Maybe just “Restricted Code Points”?  Hmm, not obvious.

Just intended to get your thinking started. The full descriptive name 
would be:

"Assigned plus Assignable code points minus 'useless' control codes".

Alternatively:

"Complete present and future Graphic, Format and Private-Use characters, 
together with three essential control codes".

None of those trip off the tongue, but again, it's useful to stare at 
them a bit. Perhaps that might trigger something.
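
Both descriptive names are phrased in terms of Unicode code point types. A minimal Python sketch of how a code point maps to one of the seven types, assuming the General_Category grouping of Table 2-3 in the Unicode Standard. The function names are hypothetical, and which code points count as Reserved depends on the Unicode version bundled with the interpreter (`unicodedata.unidata_version`):

```python
import unicodedata


def is_noncharacter(cp: int) -> bool:
    """True for the 66 noncharacters: U+FDD0..U+FDEF plus the last
    two code points of every plane (U+xxFFFE and U+xxFFFF)."""
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def code_point_type(cp: int) -> str:
    """Map a code point to one of the seven code point types.

    Hypothetical helper: the grouping follows General_Category, so
    the Reserved answer reflects this interpreter's Unicode version.
    """
    gc = unicodedata.category(chr(cp))
    if gc == "Cc":
        return "Control"
    if gc in ("Cf", "Zl", "Zp"):
        return "Format"
    if gc == "Co":
        return "Private-Use"
    if gc == "Cs":
        return "Surrogate"
    if gc == "Cn":
        # Cn covers both noncharacters and unassigned (reserved) code points.
        return "Noncharacter" if is_noncharacter(cp) else "Reserved"
    # Everything else (L, M, N, P, S, Zs) is a graphic character.
    return "Graphic"


if __name__ == "__main__":
    for cp in (0x41, 0x200D, 0x0A, 0xE000, 0xD800, 0xFFFF, 0x40000):
        print(f"U+{cp:04X}: {code_point_type(cp)}")
```

Read this way, "present and future Graphic and Format" corresponds to the Graphic, Format and Reserved types taken together.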


Now, the collection that you describe is the one that is useful for text 
representation for interchange. You exclude the control codes, which are 
(except for the three "essential" ones) really about protocol support 
and not the text itself, and you exclude all the stuff that shouldn't be 
interchanged (noncharacters, surrogates).

"Default Unicode Subset for interchanging Text".  (DUSFIT).

You see where I'm going, using Default instead of Basic, echoing similar 
subsets for "Default Identifiers". Both are geared to a specific purpose 
and both carry in the name that they might be further customized (e.g. a 
non-default text subset might exclude Private use, to ensure that all 
characters have a shared semantic).
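
A rough Python sketch of such a default subset and one stricter customization, written from the description in this message rather than from the draft's ABNF. All names are hypothetical. Two assumptions: the three "essential" controls are taken to be tab, line feed and carriage return (not named here), and DEL (U+007F) is excluded along with the C0 and C1 controls:

```python
# Assumption: the three essential controls are TAB, LF and CR.
ESSENTIAL_CONTROLS = frozenset({0x09, 0x0A, 0x0D})


def is_control(cp: int) -> bool:
    """C0 controls, DEL, and C1 controls (DEL included by assumption)."""
    return cp <= 0x1F or 0x7F <= cp <= 0x9F


def is_surrogate(cp: int) -> bool:
    return 0xD800 <= cp <= 0xDFFF


def is_noncharacter(cp: int) -> bool:
    """U+FDD0..U+FDEF plus the last two code points of every plane."""
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def is_private_use(cp: int) -> bool:
    """BMP private-use area plus planes 15 and 16 (minus noncharacters)."""
    return (0xE000 <= cp <= 0xF8FF
            or 0xF0000 <= cp <= 0xFFFFD
            or 0x100000 <= cp <= 0x10FFFD)


def in_text_subset(cp: int, allow_private_use: bool = True) -> bool:
    """Default subset: everything except surrogates, noncharacters and
    non-essential controls. allow_private_use=False gives the stricter,
    non-default variant that also drops private-use code points."""
    if not 0 <= cp <= 0x10FFFF:
        return False
    if is_surrogate(cp) or is_noncharacter(cp):
        return False
    if is_control(cp) and cp not in ESSENTIAL_CONTROLS:
        return False
    if is_private_use(cp) and not allow_private_use:
        return False
    return True


def first_disallowed(text: str, allow_private_use: bool = True):
    """Index of the first character outside the subset, or None."""
    for i, ch in enumerate(text):
        if not in_text_subset(ord(ch), allow_private_use):
            return i
    return None
```

Fixed ranges are used here instead of character properties so that the answer does not change between Unicode versions; a stricter profile is then just a different argument value.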

It might be possible to change the initials a bit to make them 
pronounceable.

What do you think?

A./

>> Something like this then sets up the idea that there are other 
>> scenarios where the likely direction is "more restrictive", except 
>> perhaps some protocols that do assign meaning to certain controls and 
>> therefore rely on them -- we all think that these should be the 
>> exception; in other words, if you picked something that indicates an 
>> endpoint (like "minimally") the nice feature would be that most other 
>> specifications would tend to subset.
>>
>> For example, by restricting private use.
>>
>> A./
>>
>> PS: you might think about
>>
>> (8)
>>
>> Suggest a standard methodology for specifying subsets other than the 
>> "XML" or "minimally restrictive" ones / or at least discuss some 
>> concepts around this area, or point to specs in and out of IETF that 
>> create limited subsets for various purposes.
>>
>> There's always a tension between defining data formats that don't impose 
>> practical restrictions on the text content and situations where 
>> general text is not what you need (but pure ASCII strings don't cut 
>> it either).
>>
>>
>>>
>>> On Aug 29, 2023 at 12:10:42 PM, Asmus Freytag <asmusf@ix.netcom.com> 
>>> wrote:
>>>> Comments on the draft (in no particular order or priority).
>>>>
>>>> (1)
>>>> "Since the C0 controls include zero and the 32 smallest integers, 
>>>> they are likely to occur in data as a result of programming errors."
>>>>
>>>> In the way the Unicode standard refers to characters "the control 
>>>> codes" are the characters, so "the C0 control codes include" would 
>>>> be read that way. What you are referring to here are the integer 
>>>> values, corresponding to a code point. Suggest rewording because 
>>>> "zero" as a digit is not a control code and while it's possible to 
>>>> figure out what must be meant, it's a needless ambiguity.
>>>>
>>>> How about "Since the code points for C0 controls include the 32 
>>>> smallest integers including zero..."
>>>>
>>>> (2)
>>>> I recommend thinking not of "problematic code points" but of 
>>>> "problematic code point types" as per this definition from the 
>>>> Unicode Glossary.
>>>>
>>>>     /Code Point Type
>>>>     <https://www.unicode.org/glossary/#code_point_type>/. Any of
>>>>     the seven fundamental classes of code points in the standard:
>>>>     Graphic, Format, Control, Private-Use, Surrogate, Noncharacter,
>>>>     Reserved. (See definition D10a in Section 3.4, Characters and
>>>>     Encoding <https://www.unicode.org/versions/latest/ch03.pdf#G2212>.)
>>>>
>>>> This would also let you address the private-use problem in 
>>>> interchange (even if you conclude they might be useful enough not 
>>>> to restrict them by default, they suffer from the same lack of 
>>>> consensus interpretation as the controls).
>>>>
>>>> (3)
>>>>
>>>> "reserved for internal use" should be "reserved for internal use by 
>>>> applications" (as opposed to internal use by the standard).
>>>>
>>>>
>>>> (4)
>>>> This subset has the advantage of excluding surrogates, which can 
>>>> never add any value and have the potential to cause problems.
>>>>
>>>> should be reworded a bit to:
>>>>
>>>> "This subset has the advantage of excluding surrogates, which are 
>>>> not assigned to any characters, and thus can never add any value.
>>>> They have the potential to cause problems; for example, it is not 
>>>> possible to represent them individually in UTF-8."
>>>>
>>>> Rationale for this suggestion is to be slightly more specific, so 
>>>> the reader comes away with the conclusion that the "can never add 
>>>> any value" is based on well-founded reasons and not editorial 
>>>> opinion by the writers.
>>>>
>>>> (5)
>>>> The ABNF in section 4.1: the comments are confusing.
>>>>
>>>> (6)
>>>> I would like to refer you to table 2-3 in the Unicode Standard. The 
>>>> "Basic Unicode Characters" that you propose consist of what 
>>>> Unicode considers "Assigned" code points plus "Reserved", with the 
>>>> modification that you are subtracting the "useless controls".
>>>>
>>>> Reserved code points are sometimes referred to as "assignable code 
>>>> points" (for example on the bottom of page 30 in Unicode 15.0.0). 
>>>> That makes that subset the combination of "assigned" plus 
>>>> "assignable" code points. (Which then, modulo "useless" controls, 
>>>> corresponds to the bulk of the basic set.)
>>>>
>>>> I would suggest to explicitly relate your definition to those terms.
>>>>
>>>> (6)
>>>> The definition "useful controls" is currently buried in the text 
>>>> and there's not even a header to locate the definition. Because 
>>>> this (or the complementary definition of "useless" ones) is the 
>>>> only value added piece over "assigned + assignable", I suggest 
>>>> elevating the definition.
>>>>
>>>> I'm not going to quibble about the very opinionated naming of the 
>>>> concept.
>>>>
>>>> (7)
>>>> The term "Basic" is an interesting choice. Because the set is 
>>>> anything but "basic" -- it includes all code points that can be 
>>>> maximally assigned, except for 61 C0 / C1 Controls. The only part 
>>>> of the set that is "basic" is in fact the subset of control codes.
>>>>
>>>> What you have defined is the "maximally useful set of Unicode code 
>>>> points for data interchange, absent a protocol defining specific 
>>>> control code semantics".
>>>>
>>>>
>>>> Hope you find some of these comments useful,
>>>> A./
>>>>
>>>> On 8/29/2023 9:37 AM, Tim Bray wrote:
>>>>> Hello I18ndir (anyone still here?), Paul Hoffman and I just 
>>>>> submitted draft-bray-unichars-01 - our AD Francesca Palombini 
>>>>> suggested we notify this list: 
>>>>> https://datatracker.ietf.org/doc/draft-bray-unichars/
>>>>>
>>>>> This draft fell out of a conversation originally provoked by this 
>>>>> errata report: https://www.rfc-editor.org/errata/eid7603
>>>>>
>>>>> It revealed a distressing lack of consensus about Unicode 
>>>>> characters and code points and character repertoires. I feel 
>>>>> personally bad because I am the editor of a couple of RFCs that 
>>>>> are open to criticism on this front.
>>>>>
>>>>> So, this tries to say “here’s how an RFC should specify which 
>>>>> Unicode characters it supports”.  We think this would be useful to 
>>>>> multiple groups almost immediately, including especially those who 
>>>>> don’t realize the area can be problematic.
>>>>>
>>>>> Anyhow, the purpose of this note is to ask your advice on how to 
>>>>> take this forward. We think this works well as an individual 
>>>>> submission.   We don’t *think* it needs a working group, but 
>>>>> that’s not our call. The draft doesn’t express any opinions about 
>>>>> best practices, it just points out several alternative character 
>>>>> repertoires, provides ABNF, and discusses their trade-offs.  None 
>>>>> of them are wrong.
>>>>>
>>>>
>>
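
Comment (4) in the quoted review says that surrogates cannot be represented individually in UTF-8. A small Python sketch of that behaviour, assuming Python's default strict UTF-8 codec; the helper name is hypothetical:

```python
def utf8_or_none(s: str):
    """Encode s as UTF-8, returning None if the strict codec refuses it."""
    try:
        return s.encode("utf-8")
    except UnicodeEncodeError:
        return None


# A lone surrogate code point has no UTF-8 encoding.
lone_surrogate = "\ud800"

# A supplementary character is a single code point; surrogates appear
# only inside its UTF-16 encoding, never in its UTF-8 encoding.
supplementary = "\U0001F600"

if __name__ == "__main__":
    print(utf8_or_none(lone_surrogate))            # None
    print(utf8_or_none(supplementary).hex())       # f09f9880
    print(supplementary.encode("utf-16-be").hex()) # d83dde00
```

The surrogate pair D83D DE00 exists only as UTF-16 code units; as a code point sequence the same text is just U+1F600, which is why excluding surrogate code points loses nothing.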