Re: [I18ndir] draft-bray-unichars-01

Tim Bray <tbray@textuality.com> Wed, 30 August 2023 02:50 UTC

From: Tim Bray <tbray@textuality.com>
Date: Tue, 29 Aug 2023 21:50:10 -0500
Message-ID: <CAHBU6isnxUvx7wFz=Z8QjgFWncekPUpfQ8gK=pjF5C+vh=EkPQ@mail.gmail.com>
To: Asmus Freytag <asmusf@ix.netcom.com>
Cc: "i18ndir@ietf.org" <i18ndir@ietf.org>
Content-Type: multipart/alternative; boundary="00000000000074baa806041afc24"
Archived-At: <https://mailarchive.ietf.org/arch/msg/i18ndir/5mZtov5ZK_8k24ZaWrsytqjta0s>
Subject: Re: [I18ndir] draft-bray-unichars-01

On Aug 29, 2023 at 5:22:03 PM, Asmus Freytag <asmusf@ix.netcom.com> wrote:

> On 8/29/2023 5:05 PM, Tim Bray wrote:
>
> Asmus, these are great! Thanks. If this document goes forward I think it
> will include almost all of what you suggest.
>
> I would be thrilled. I think it's useful to move this forward (with
> changes).
>
>
> My favorite input is (7) - Neither co-author is in love with “Basic”; a
> better suggestion would be greeted with joy.
>
> "Minimally restrictive" describes what you are aiming for. You are
> restricting everything that has no business being interchanged and no more.
>
Standing alone, that makes sense. But compared to the other options in the
draft, such as those used by JSON & XML, we are *more* restrictive.  So
there also could be a case for “maximally restrictive”.  Maybe just
“Restricted Code Points”?  Hmm, not obvious.
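
To make the comparison concrete, here is a minimal sketch of the XML 1.0 `Char` production as a Python predicate. The ranges are transcribed from the XML 1.0 specification; the function and variable names are illustrative only, and nothing here is taken from the draft's own ABNF. It shows that XML admits DEL, the C1 range, private-use code points and some noncharacters, which is the sense in which a set that drops most C0/C1 controls is *more* restrictive.

```python
# XML 1.0 "Char" production, written as inclusive code point ranges.
XML_CHAR_RANGES = [
    (0x9, 0x9), (0xA, 0xA), (0xD, 0xD),   # tab, line feed, carriage return
    (0x20, 0xD7FF),                       # stops just before the surrogates
    (0xE000, 0xFFFD),                     # skips U+FFFE and U+FFFF
    (0x10000, 0x10FFFF),
]


def is_xml_char(cp: int) -> bool:
    """Return True if the code point matches the XML 1.0 Char production."""
    return any(lo <= cp <= hi for lo, hi in XML_CHAR_RANGES)


# Code points that XML 1.0 accepts although they are controls,
# private-use code points, or noncharacters.
samples = {
    "DEL": 0x7F,
    "C1 control": 0x85,
    "private use": 0xE000,
    "noncharacter": 0xFDD0,
}
for label, cp in samples.items():
    print(f"U+{cp:04X} ({label}): {is_xml_char(cp)}")
```

JSON is looser still: its grammar permits `\u` escapes for any 16-bit value, including unpaired surrogates.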

> Something like this then sets up the idea that there are other scenarios
> where the likely direction is "more restrictive", except perhaps some
> protocols that do assign meaning to certain controls and therefore rely on
> them -- we all think that these should be the exception; in other words, if
> you picked something that indicates an endpoint (like "minimally") the nice
> feature would be that most other specifications would tend to subset.
>
> For example, by restricting private use.
>
> A./
>
> PS: you might think about
>
> (8)
>
> Suggest a standard methodology for specifying subsets other than the "XML"
> or "minimally restrictive" ones, or at least discuss some concepts around
> this area, or point to specs in and out of IETF that create limited
> subsets for various purposes.
>
> There's always a tension between defining data formats that don't impose
> practical restrictions on the text content and situations where general text
> is not what you need (but pure ASCII strings don't cut it either).
>
>
>
> On Aug 29, 2023 at 12:10:42 PM, Asmus Freytag <asmusf@ix.netcom.com>
> wrote:
>
>> Comments on the draft (in no particular order or priority).
>>
>> (1)
>> "Since the C0 controls include zero and the 32 smallest integers, they
>> are likely to occur in data as a result of programming errors."
>>
>> In the way the Unicode standard refers to characters "the control codes"
>> are the characters, so "the C0 control codes include" would be read that
>> way. What you are referring to here are the integer values, corresponding
>> to a code point. Suggest rewording because "zero" as a digit is not a
>> control code and while it's possible to figure out what must be meant, it's
>> a needless ambiguity.
>>
>> How about "Since the code points for C0 controls include the 32 smallest
>> integers including zero..."
>>
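A small illustration of the point in (1), as a sketch rather than anything from the draft: the C0 controls sit at code points 0 through 31, so zero-filled memory and stray small integers decode straight into them. The variable names are illustrative.

```python
import unicodedata

# C0 controls occupy code points 0 through 31 (U+0000..U+001F).
C0 = range(0x00, 0x20)

# A zero-filled buffer is valid UTF-8 and decodes to four U+0000 characters.
zeroed = bytes(4).decode("utf-8")

# Small integers mistakenly passed through chr() also land in the C0 range.
stray = [chr(n) for n in (0, 1, 7, 31)]

for ch in list(zeroed) + stray:
    assert ord(ch) in C0
    assert unicodedata.category(ch) == "Cc"  # general category "Control"

# The digit zero is a different thing entirely: it is U+0030.
print(f"{len(C0)} C0 code points; the digit '0' is U+{ord('0'):04X}")
```
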
>> (2)
>> I recommend thinking not of "problematic code points" but of "problematic
>> code point types" as per this definition from the Unicode Glossary.
>>
>> *Code Point Type <https://www.unicode.org/glossary/#code_point_type>*.
>> Any of the seven fundamental classes of code points in the standard:
>> Graphic, Format, Control, Private-Use, Surrogate, Noncharacter, Reserved.
>> (See definition D10a in Section 3.4, Characters and Encoding
>> <https://www.unicode.org/versions/latest/ch03.pdf#G2212>.)
>>
>> This would also let you address the problems that private-use code points
>> cause in interchange (even if you conclude they might be useful enough not
>> to restrict them by default, they suffer from the same lack of consensus
>> interpretation as the controls).
>>
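For readers who want to see the seven Code Point Types in action, here is a rough sketch that derives them from the General Category values exposed by Python's `unicodedata` module. The mapping follows Table 2-3 of the Unicode Standard as I understand it (Zl and Zp count as Format; Cn splits into Noncharacter and Reserved). The function name is illustrative, and the Reserved result depends on the Unicode version bundled with the Python build.

```python
import unicodedata


def code_point_type(cp: int) -> str:
    """Classify a code point into one of the seven Code Point Types."""
    # Noncharacters: U+FDD0..U+FDEF plus the last two code points of each plane.
    if 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE:
        return "Noncharacter"
    cat = unicodedata.category(chr(cp))
    if cat == "Cc":
        return "Control"
    if cat == "Cs":
        return "Surrogate"
    if cat == "Co":
        return "Private-Use"
    if cat in ("Cf", "Zl", "Zp"):
        return "Format"
    if cat == "Cn":
        return "Reserved"  # unassigned in this Python's Unicode database
    return "Graphic"       # letters, marks, numbers, punctuation, symbols, Zs


for cp in (0x41, 0x09, 0x200B, 0xE000, 0xD800, 0xFDD0, 0x0378):
    print(f"U+{cp:04X}: {code_point_type(cp)}")
```
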
>> (3)
>>
>> "reserved for internal use" should be "reserved for internal use by
>> applications" (as opposed to internal use by the standard).
>>
>>
>> (4)
>> This subset has the advantage of excluding surrogates, which can never
>> add any value and have the potential to cause problems.
>>
>> should be reworded a bit to:
>>
>> "This subset has the advantage of excluding surrogates, which are not
>> assigned to any characters, and thus can never add any value.
>> They have the potential to cause problems; for example, it is not possible
>> to represent them individually in UTF-8."
>>
>> Rationale for this suggestion is to be slightly more specific, so the
>> reader comes away with the conclusion that the "can never add any value" is
>> based on well-founded reasons and not editorial opinion by the writers.
>>
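The UTF-8 claim in (4) is easy to check. The sketch below assumes CPython's strict `utf-8` codec, which refuses lone surrogates in both directions; the helper name `utf8_encodable` is made up for the example.

```python
def utf8_encodable(text: str) -> bool:
    """Return True if the string can be encoded as strict UTF-8."""
    try:
        text.encode("utf-8")
        return True
    except UnicodeEncodeError:
        return False


lone = "\ud800"           # a single high surrogate, legal inside a Python str
astral = "\U0001F600"     # an ordinary code point outside the BMP

print(utf8_encodable(lone))     # the lone surrogate is rejected
print(utf8_encodable(astral))   # the non-BMP code point encodes fine

# In UTF-16 the same code point is carried by a surrogate *pair*, which is
# the only job surrogates have; on their own they denote no character.
print(astral.encode("utf-16-be").hex())

# The byte sequence a naive encoder would emit for U+D800 is rejected too.
try:
    bytes.fromhex("eda080").decode("utf-8")
except UnicodeDecodeError:
    print("ED A0 80 is not valid UTF-8")
```
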
>> (5)
>> The ABNF in section 4.1: the comments are confusing.
>>
>> (6)
>> I would like to refer you to Table 2-3 in the Unicode Standard. The
>> "Basic Unicode Characters" that you propagate consist of what Unicode
>> considers "Assigned" code points plus "Reserved", with the modification
>> that you are subtracting the "useless controls".
>>
>> Reserved code points are sometimes referred to as "assignable code
>> points" (for example on the bottom of page 30 in Unicode 15.0.0). That
>> makes that subset the combination of "assigned" plus "assignable" code
>> points. (Which then, modulo "useless" controls, corresponds to the bulk of
>> the basic set.)
>>
>> I would suggest explicitly relating your definition to those terms.
>>
>> (6)
>> The definition of "useful controls" is currently buried in the text and
>> there's not even a header to locate the definition. Because this (or the
>> complementary definition of "useless" ones) is the only value-added piece
>> over "assigned + assignable", I suggest elevating the definition.
>>
>> I'm not going to quibble about the very opinionated naming of the
>> concept.
>>
>> (7)
>> The term "Basic" is an interesting choice, because the set is anything
>> but "basic" -- it includes all code points that can be maximally assigned,
>> except for 61 C0 / C1 Controls. The only part of the set that is "basic" is
>> in fact the subset of control codes.
>>
>> What you have defined is the "maximally useful set of Unicode code points
>> for data interchange, absent a protocol defining specific control code
>> semantics".
>>
>>
>> Hope you find some of these comments useful,
>> A./
>>
>> On 8/29/2023 9:37 AM, Tim Bray wrote:
>>
>> Hello I18ndir (anyone still here?), Paul Hoffman and I just submitted
>> draft-bray-unichars-01 - our AD Francesca Palombini suggested we notify
>> this list: https://datatracker.ietf.org/doc/draft-bray-unichars/
>>
>> This draft fell out of a conversation originally provoked by this errata
>> report: https://www.rfc-editor.org/errata/eid7603
>>
>> It revealed a distressing lack of consensus about Unicode characters and
>> code points and character repertoires. I feel personally bad because I am
>> the editor of a couple of RFCs that are open to criticism on this front.
>>
>> So, this tries to say “here’s how an RFC should specify which Unicode
>> characters it supports”.  We think this would be useful to multiple groups
>> almost immediately, including especially those who don’t realize the area
>> can be problematic.
>>
>> Anyhow, the purpose of this note is to ask your advice on how to take
>> this forward. We think this works well as an individual submission.   We
>> don’t *think* it needs a working group, but that’s not our call. The draft
>> doesn’t express any opinions about best practices, it just points out
>> several alternative character repertoires, provides ABNF, and discusses
>> their trade-offs.  None of them are wrong.
>>
>>
>>
>