Re: [I18ndir] draft-bray-unichars-01

Asmus Freytag <asmusf@ix.netcom.com> Wed, 30 August 2023 00:22 UTC

Date: Tue, 29 Aug 2023 17:22:03 -0700
To: Tim Bray <tbray@textuality.com>
Cc: "i18ndir@ietf.org" <i18ndir@ietf.org>
Archived-At: <https://mailarchive.ietf.org/arch/msg/i18ndir/c2lWAnfkrYga3rv3CknfEf8KihA>

On 8/29/2023 5:05 PM, Tim Bray wrote:
> Asmus, these are great! Thanks. If this document goes forward I think 
> it will include almost all of what you suggest.
I would be thrilled. I think it's useful to move this forward (with 
changes).
>
> My favorite input is (7) - Neither co-author is in love with “Basic”, 
> a better suggestion would be greeted with joy.

"Minimally restrictive" describes what you are aiming for. You are 
restricting everything that has no business being interchanged and no more.

Something like this then sets up the idea that other scenarios will most 
likely go in the "more restrictive" direction. The exception would be 
protocols that assign meaning to certain controls and therefore rely on 
them, and we all think these should be rare. In other words, if you pick 
a name that indicates an endpoint (like "minimally"), the nice feature is 
that most other specifications would tend to be subsets of it.

For example, by restricting private use.
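To make the "endpoint plus subsets" idea concrete, here is a rough sketch in Python. The code point ranges are the standard Unicode ones, but the function names and the choice of tab, LF and CR as the controls worth keeping are assumptions made purely for illustration; they are not taken from the draft.

```python
# Sketch only. Ranges are standard Unicode ranges; the set of "useful"
# controls below is an assumption for illustration, not the draft's text.
USEFUL_CONTROLS = {0x09, 0x0A, 0x0D}  # hypothetical choice: tab, LF, CR


def is_surrogate(cp: int) -> bool:
    return 0xD800 <= cp <= 0xDFFF


def is_noncharacter(cp: int) -> bool:
    # U+FDD0..U+FDEF plus the last two code points of every plane.
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def is_c0_c1_control(cp: int) -> bool:
    return 0x00 <= cp <= 0x1F or 0x80 <= cp <= 0x9F


def is_private_use(cp: int) -> bool:
    return (0xE000 <= cp <= 0xF8FF
            or 0xF0000 <= cp <= 0xFFFFD
            or 0x100000 <= cp <= 0x10FFFD)


def minimally_restrictive(cp: int) -> bool:
    """The endpoint: drop only what has no business being interchanged."""
    if not 0 <= cp <= 0x10FFFF:
        return False
    if is_surrogate(cp) or is_noncharacter(cp):
        return False
    if is_c0_c1_control(cp) and cp not in USEFUL_CONTROLS:
        return False
    return True


def no_private_use(cp: int) -> bool:
    """A stricter profile, defined as the endpoint minus private use."""
    return minimally_restrictive(cp) and not is_private_use(cp)


for cp in (0x41, 0x0A, 0x07, 0xE000, 0xD800, 0xFFFE):
    print(f"U+{cp:04X}", minimally_restrictive(cp), no_private_use(cp))
```

Because the stricter profile is written as "endpoint and not X", it is a subset by construction, which is the property a name like "minimally restrictive" is meant to suggest.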

A./

PS: you might think about

(8)

Suggest a standard methodology for specifying subsets other than the 
"XML" or "minimally restrictive" ones, or at least discuss some 
concepts around this area, or point to specs in and out of the IETF that 
create limited subsets for various purposes.

There's always a tension between defining data formats that don't impose 
practical restrictions on the text content and situations where general 
text is not what you need (but pure ASCII strings don't cut it either).
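One possible shape for such a methodology, sketched in Python: classify each code point into one of the seven code point types from the Unicode Glossary, then state a subset as a set of allowed types. The mapping from General_Category to type follows Table 2-3 of the Unicode Standard as I read it; the names `code_point_type`, `conforms` and `LABEL_PROFILE` are hypothetical, and the results for unassigned code points depend on the Unicode version your Python build ships.

```python
import unicodedata


def code_point_type(cp: int) -> str:
    """Return one of the seven code point types for a code point."""
    # Noncharacters have General_Category Cn, so test their ranges first.
    if 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE:
        return "Noncharacter"
    gc = unicodedata.category(chr(cp))
    if gc == "Cc":
        return "Control"
    if gc in ("Cf", "Zl", "Zp"):
        return "Format"
    if gc == "Co":
        return "Private-Use"
    if gc == "Cs":
        return "Surrogate"
    if gc == "Cn":
        return "Reserved"  # unassigned in this build's Unicode version
    return "Graphic"       # letters, marks, numbers, punctuation, symbols, Zs


def conforms(text: str, allowed_types: frozenset) -> bool:
    """True if every code point in text has one of the allowed types."""
    return all(code_point_type(ord(ch)) in allowed_types for ch in text)


# Hypothetical profile for short labels: no controls, no private use,
# no reserved code points.
LABEL_PROFILE = frozenset({"Graphic", "Format"})

print(conforms("na\u00efve", LABEL_PROFILE))
print(conforms("bell\x07", LABEL_PROFILE))
```

A specification could then name its subset by listing types (and, where needed, a short list of individual exceptions) instead of spelling out raw ranges.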


>
> On Aug 29, 2023 at 12:10:42 PM, Asmus Freytag <asmusf@ix.netcom.com> 
> wrote:
>> Comments on the draft (in no particular order or priority).
>>
>> (1)
>> "Since the C0 controls include zero and the 32 smallest integers, 
>> they are likely to occur in data as a result of programming errors."
>>
>> In the way the Unicode standard refers to characters "the control 
>> codes" are the characters, so "the C0 control codes include" would be 
>> read that way. What you are referring to here are the integer values, 
>> corresponding to a code point. Suggest rewording because "zero" as a 
>> digit is not a control code and while it's possible to figure out 
>> what must be meant, it's a needless ambiguity.
>>
>> How about "Since the code points for C0 controls include the 32 
>> smallest integers including zero..."
>>
>> (2)
>> I recommend thinking not of "problematic code points" but of 
>> "problematic code point types" as per this definition from the 
>> Unicode Glossary.
>>
>>     Code Point Type
>>     <https://www.unicode.org/glossary/#code_point_type>. Any of the
>>     seven fundamental classes of code points in the standard:
>>     Graphic, Format, Control, Private-Use, Surrogate, Noncharacter,
>>     Reserved. (See definition D10a in Section 3.4, Characters and
>>     Encoding <https://www.unicode.org/versions/latest/ch03.pdf#G2212>.)
>>
>> This would also let you address the problems with private-use code 
>> points in interchange (even if you conclude they might be useful enough 
>> not to restrict them by default, they suffer from the same lack of 
>> consensus interpretation as the controls).
>>
>> (3)
>>
>> "reserved for internal use" should be "reserved for internal use by 
>> applications" (as opposed to internal use by the standard).
>>
>>
>> (4)
>> "This subset has the advantage of excluding surrogates, which can 
>> never add any value and have the potential to cause problems."
>>
>> should be reworded a bit to:
>>
>> "This subset has the advantage of excluding surrogates, which are not 
>> assigned to any characters, and thus can never add any value.
>> They have the potential to cause problems; for example, it is not 
>> possible to represent them individually in UTF-8."
>>
>> The rationale for this suggestion is to be slightly more specific, so 
>> the reader comes away with the conclusion that "can never add any 
>> value" is based on well-founded reasons and not on editorial opinion 
>> by the writers.
>>
>> (5)
>> The ABNF in section 4.1: the comments are confusing.
>>
>> (6)
>> I would like to refer you to table 2-3 in the Unicode Standard. The 
>> "Basic Unicode Characters" that you propagate consist of what Unicode 
>> considers "Assigned" code points plus "Reserved", with the 
>> modification that you are subtracting the "useless controls".
>>
>> Reserved code points are sometimes referred to as "assignable code 
>> points" (for example on the bottom of page 30 in Unicode 15.0.0). 
>> That makes that subset the combination of "assigned" plus 
>> "assignable" code points. (Which then, modulo "useless" controls, 
>> corresponds to the bulk of the basic set.)
>>
>> I would suggest explicitly relating your definition to those terms.
>>
>> (6)
>> The definition of "useful controls" is currently buried in the text 
>> and there's not even a header to locate it. Because this (or the 
>> complementary definition of "useless" ones) is the only piece that 
>> adds value over "assigned + assignable", I suggest elevating the 
>> definition.
>>
>> I'm not going to quibble about the very opinionated naming of the 
>> concept.
>>
>> (7)
>> The term "Basic" is an interesting choice. Because the set is 
>> anything but "basic" -- it includes all code points that can be 
>> maximally assigned, except for 61 C0 / C1 Controls. The only part of 
>> the set that is "basic" is in fact the subset of control codes.
>>
>> What you have defined is the "maximally useful set of Unicode code 
>> points for data interchange, absent a protocol defining specific 
>> control code semantics".
>>
>>
>> Hope you find some of these comments useful,
>> A./
>>
>> On 8/29/2023 9:37 AM, Tim Bray wrote:
>>> Hello I18ndir (anyone still here?), Paul Hoffman and I just 
>>> submitted draft-bray-unichars-01 - our AD Francesca Palombini 
>>> suggested we notify this list: 
>>> https://datatracker.ietf.org/doc/draft-bray-unichars/
>>>
>>> This draft fell out of a conversation originally provoked by this 
>>> errata report: https://www.rfc-editor.org/errata/eid7603
>>>
>>> It revealed a distressing lack of consensus about Unicode characters 
>>> and code points and character repertoires. I feel personally bad 
>>> because I am the editor of a couple of RFCs that are open to 
>>> criticism on this front.
>>>
>>> So, this tries to say “here’s how an RFC should specify which 
>>> Unicode characters it supports”.  We think this would be useful to 
>>> multiple groups almost immediately, including especially those who 
>>> don’t realize the area can be problematic.
>>>
>>> Anyhow, the purpose of this note is to ask your advice on how to 
>>> take this forward. We think this works well as an individual 
>>> submission. We don’t *think* it needs a working group, but that’s 
>>> not our call. The draft doesn’t express any opinions about best 
>>> practices, it just points out several alternative character 
>>> repertoires, provides ABNF, and discusses their trade-offs.  None of 
>>> them are wrong.
>>>
>>