Re: [I18ndir] [art] Just uploaded draft-bray-unichars-03

Asmus Freytag <asmusf@ix.netcom.com> Fri, 08 September 2023 21:01 UTC

To: i18ndir@ietf.org
References: <CAHBU6is50TkpDsqXTp6WxdVSgE66j3gGHZ60ey2jFYbefaHFJw@mail.gmail.com> <CAChr6SwDNujzq6+T6CXPko3jju9EiL6kmQCgNs4Ly7QAALujqg@mail.gmail.com>
From: Asmus Freytag <asmusf@ix.netcom.com>
In-Reply-To: <CAChr6SwDNujzq6+T6CXPko3jju9EiL6kmQCgNs4Ly7QAALujqg@mail.gmail.com>
Archived-At: <https://mailarchive.ietf.org/arch/msg/i18ndir/8mhypkdutRAw-tgU-g7EoKcffKA>
Subject: Re: [I18ndir] [art] Just uploaded draft-bray-unichars-03

On 9/8/2023 1:16 PM, Rob Sayre wrote:
> On Fri, Sep 8, 2023 at 12:38 PM Tim Bray <tbray@textuality.com> wrote:
>
>     See https://www.ietf.org/archive/id/draft-bray-unichars-03.html
>
>     A bunch of minor corrections and improvements, thanks to everyone
>     for that, especially James Manger for noticing that the ABNF was
>     entirely wrong in one place.
>
>     The word “useless” has been replaced by “legacy”.
>
I like.

On re-reading, I note that I keep stumbling over this statement:

"While the inclusion of unassigned code points in text data is 
undesirable, it is difficult to specify that it should be avoided, 
because unassigned code points regularly become assigned as new 
characters are added to Unicode."

Small, closed character sets are a luxury of the past. They are 
unrealistic, because they assume a static universe. A comprehensive 
character set like the Unicode Standard must be able to grow. Not only 
because it intends to be universal, with many writing systems not yet 
cataloged, but also because existing writing systems change. Western 
Europe alone, despite its well-established use of the Latin script, has 
seen two new characters, the € and the capital sharp s. When you have a 
closed set, your only way to react to such changes is to switch 
character sets.

A minimal fix to the text might be to change "is undesirable" to "may 
have its drawbacks".

Now, there are cases where it is necessary to restrict the use of 
Unicode characters to the subset that represents a known version. This 
is usually the case if data has to be normalized (and must stay 
normalized, something that cannot be guaranteed for unassigned code 
points). For an example, see IDNA2008.

Which leads to a suggestion for a fifth set (really a series), which 
would be "versioned-assigned". It would include all code points assigned 
as of a given version, minus the "legacy controls". That is a nice 
repertoire to use whenever you need to normalize your data and guarantee 
that their normalization does not change. (As it turns out, you can 
migrate your repertoire, over time, to later versions, as long as data 
that is normalized is free of unassigned code points relative to the 
version of the normalization tables.)
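
To make the "versioned-assigned" idea concrete, here is a minimal Python 
sketch. It is only an illustration under stated assumptions: the function 
names are hypothetical, the Unicode version is whatever the interpreter's 
unicodedata module ships (unicodedata.unidata_version), "legacy controls" 
is assumed to mean the C0 controls other than TAB, LF and CR, plus DEL and 
the C1 controls, and private-use code points are treated as assigned.

```python
import unicodedata

# Assumption: the only controls worth keeping are TAB, LF and CR.
USEFUL_CONTROLS = {0x09, 0x0A, 0x0D}


def is_legacy_control(cp: int) -> bool:
    """C0 controls (except TAB/LF/CR), DEL, and the C1 controls."""
    if cp in USEFUL_CONTROLS:
        return False
    return cp <= 0x1F or 0x7F <= cp <= 0x9F


def in_versioned_assigned(cp: int) -> bool:
    """True if cp is assigned in unicodedata.unidata_version and is
    not a legacy control."""
    if is_legacy_control(cp):
        return False
    # Cn covers unassigned code points and noncharacters; Cs is surrogates.
    return unicodedata.category(chr(cp)) not in ("Cn", "Cs")


def outside_repertoire(text: str) -> list[int]:
    """Code points in text that fall outside the repertoire."""
    return [ord(ch) for ch in text if not in_versioned_assigned(ord(ch))]


def is_stably_normalized(text: str, form: str = "NFC") -> bool:
    """Normalized now, and free of code points whose normalization
    behavior could be defined by a later version."""
    return not outside_repertoire(text) and unicodedata.is_normalized(form, text)


if __name__ == "__main__":
    print("Unicode version:", unicodedata.unidata_version)
    print(is_stably_normalized("caf\u00e9"))
```

The point of combining both checks in is_stably_normalized is the one made 
above: a string can only be promised to stay normalized if every code 
point in it is assigned relative to the version of the normalization tables.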


>
>     I think the feedback was pretty clear that the draft needed to be
>     more opinionated; just because we document the existence of the
>     default JSON repertoire (“all the code points”) doesn’t mean that
>     anyone should use it in the present or future. So, introduced a
>     new section “Refining Character Repertoires” to highlight those
>     issues and offer a suggestion.
>
>
> This one is tougher and correct. Fully in favor.
>
> I would change
>
> "These numbers are used to represent the characters in computer memory 
> and storage systems and, in specifications, to specify the allowed 
> repertoires of Unicode characters."
>
> to
>
> "These numbers are used to represent the allowed repertoires of 
> Unicode characters."

This change makes it sound like that is their only use. For that reason, 
I like the original formulation better.

I can see where it could be a little confusing in the other direction, 
because the original formulation might imply that internal 
representation is directly in terms of those numbers, whereas it usually 
is in terms of a transformation format, whether UTF-8 or UTF-32. Now, we 
don't want to delve deeply into a discussion of transformation formats, 
code units and the like, so perhaps something along these lines:

"These numbers underlie the representation of characters in computer 
memory, storage systems and data transmission. In specifications they 
are used..."

To me, that better reflects the central nature of this concept, but 
removes the implication that they are used "as is" in data representation.
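
As a small illustration of the distinction (a sketch, not proposed draft 
text), here is the same code point in three encoding forms, in Python. 
The choice of the euro sign and the variable names are mine.

```python
ch = "\u20ac"  # EURO SIGN

# The code point is an abstract number ...
code_point = ord(ch)
print(f"U+{code_point:04X}")

# ... while what is stored or transmitted depends on the encoding form.
utf8 = ch.encode("utf-8")
utf16 = ch.encode("utf-16-be")
utf32 = ch.encode("utf-32-be")

for name, data in (("UTF-8", utf8), ("UTF-16BE", utf16), ("UTF-32BE", utf32)):
    print(f"{name:9} {data.hex(' ')}")
```

Only the UTF-32 bytes spell out the number 0x20AC directly; the UTF-8 form 
is three bytes that have to be decoded to recover it.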

>
> Other commenters have said the "useful" term is not that great. I 
> agree, but I can't think of anything better. In particular, I thought 
> "no, people really do use NULL". Maybe "text-characters"? IDK, up to 
> the editors.

There are two types of data formats: those that can contain NULL and 
those that cannot.

NULL, where used, always has a special meaning. The days when it could 
be inserted into a datastream with "null effect" are long over, even 
though that was among the original use cases.

Most people don't use VT or FF (or NEL) but there are legacy data 
formats that do.

Perhaps the best thing would be to acknowledge that there may be a need 
to be more permissive, if and when lossless conversion from legacy data 
is needed. A useful discussion would contain a recommendation on what 
spec authors should do.

The section on "Refining Character Repertoires" currently covers only one 
side of the coin: it is limited to a discussion of how to be more 
restrictive. It could benefit from discussing the other side: what about 
citing something like "useful assignables" *plus* selected controls, 
instead of either allowing all controls or using one of the wider 
repertoires that drags in all the non-characters?

I think that would be a useful extension of the work.
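
A rough Python sketch of what "a base repertoire plus selected controls" 
could look like. Everything here is hypothetical: the helper names, the 
assumption that the base repertoire excludes surrogates, noncharacters and 
all controls other than TAB, LF and CR, and the choice of VT, FF and NEL as 
the extra controls a legacy format might need.

```python
from typing import Callable, Iterable


def is_surrogate(cp: int) -> bool:
    return 0xD800 <= cp <= 0xDFFF


def is_noncharacter(cp: int) -> bool:
    # U+FDD0..U+FDEF, plus the last two code points of every plane.
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def is_control(cp: int) -> bool:
    # C0 controls, DEL, and C1 controls.
    return cp <= 0x1F or 0x7F <= cp <= 0x9F


def make_repertoire(allowed_controls: Iterable[int]) -> Callable[[int], bool]:
    """Build a predicate: no surrogates, no noncharacters, and only the
    explicitly listed controls."""
    allowed = frozenset(allowed_controls)

    def permits(cp: int) -> bool:
        if not 0 <= cp <= 0x10FFFF:
            return False
        if is_surrogate(cp) or is_noncharacter(cp):
            return False
        if is_control(cp):
            return cp in allowed
        return True

    return permits


# Restrictive default: TAB, LF, CR only.
strict = make_repertoire({0x09, 0x0A, 0x0D})

# More permissive variant for lossless conversion of legacy data:
# the same base, plus VT, FF and NEL.
legacy_friendly = make_repertoire({0x09, 0x0A, 0x0D, 0x0B, 0x0C, 0x85})
```

A spec could then name the base repertoire and list its extra controls 
explicitly, without pulling in NULL or the noncharacters along the way.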


>
> thanks,
> Rob
>
>