Re: [I18ndir] [art] New Version Notification for draft-bray-unichars-04.txt

Carsten Bormann <cabo@tzi.org> Mon, 18 September 2023 14:18 UTC

From: Carsten Bormann <cabo@tzi.org>
In-Reply-To: <SY4PR01MB5980D8DDE229D1C57AEDFB55E5FBA@SY4PR01MB5980.ausprd01.prod.outlook.com>
Date: Mon, 18 Sep 2023 16:18:23 +0200
Cc: Tim Bray <tbray@textuality.com>, ART Area <art@ietf.org>, "i18ndir@ietf.org" <i18ndir@ietf.org>
Message-Id: <E8456DD7-AA9C-4E58-A2C1-FDE6BFC26CFA@tzi.org>
References: <169479938668.18742.9199862891950651366@ietfa.amsl.com> <CAHBU6ivzUV947N+n7AoYkCFT3ZfaLobCQ4fBXw3dvkqTT=LBAw@mail.gmail.com> <SY4PR01MB5980D8DDE229D1C57AEDFB55E5FBA@SY4PR01MB5980.ausprd01.prod.outlook.com>
To: "Manger, James" <James.H.Manger=40team.telstra.com@dmarc.ietf.org>
Archived-At: <https://mailarchive.ietf.org/arch/msg/i18ndir/b9qsm9NpbWsHKtu96hQSc9htSmk>
Subject: Re: [I18ndir] [art] New Version Notification for draft-bray-unichars-04.txt

+1

> On 2023-09-18, at 16:05, Manger, James <James.H.Manger=40team.telstra.com@dmarc.ietf.org> wrote:
> 
> I don’t think draft-bray-unichars-04 should advance in the IETF.
>  
> Defining unicode-scalar-values, xml-chars, and useful-assignables gives 3 helpful “subsets of Unicode characters” that can be used in protocols and data formats.
> Defining unicode-code-points as though it is similar is a category error, however.
>  
> Sections 3.1 and 3.2 deliberately make %x0-10FFFF and %x0-D7FF / %xE000-10FFFF look like similar repertoires, which is too misleading.
> UTF-8, UTF-16 and UTF-32 are only defined for the latter. There may be “obvious” extensions of UTF-8 (WTF-8) and UTF-32 that can cover %x0-10FFFF, but they are simply not widely supported in modern software, so those extensions are of no use in an IETF standard. And the “obvious” extension to UTF-16 gives something that no longer works as you expect from a repertoire.
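
A minimal sketch of the point above, assuming CPython's built-in codecs as one example of "modern software" (the helper name `encodable` is made up for this illustration and is not from the draft): the strict UTF-8, UTF-16 and UTF-32 encoders accept a scalar value but refuse a lone surrogate, and covering %x0-10FFFF takes an explicit opt-in that strict decoders then reject.

```python
def encodable(code_point, encoding):
    """Return True if the strict codec can encode this single code point."""
    try:
        chr(code_point).encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

LONE_SURROGATE = 0xD834   # in %x0-10FFFF, but not a Unicode scalar value
SCALAR_VALUE = 0x1D11E    # in %x0-D7FF / %xE000-10FFFF

# For each encoding: (lone surrogate encodable?, scalar value encodable?)
results = {
    enc: (encodable(LONE_SURROGATE, enc), encodable(SCALAR_VALUE, enc))
    for enc in ("utf-8", "utf-16", "utf-32")
}

# The "obvious" extension exists only as a non-default error handler.
generalized = chr(LONE_SURROGATE).encode("utf-8", "surrogatepass")

# A strict UTF-8 decoder does not accept those bytes.
try:
    generalized.decode("utf-8")
    strict_decoder_accepts = True
except UnicodeDecodeError:
    strict_decoder_accepts = False
```

Other runtimes may behave differently; the sketch only shows that the two ranges are not interchangeable for at least one widely used implementation.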
>  
> An implementation that accepts surrogates cannot distinguish a {high-surrogate, low-surrogate} pair from a non-BMP character. ECMA-404 is clear on this when it says, “whether a processor of JSON texts interprets such a surrogate pair (“\uD834\uDD1E”) as a single code point (U+1D11E) or as an explicit surrogate pair is a semantic decision that is determined by the specific processor”. That is totally unexpected from seeing %x0-10FFFF as a seemingly simple repertoire.
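
To make the ECMA-404 quote concrete, here is a small sketch using Python's standard `json` module as one example processor (an assumption for illustration; other JSON libraries may choose differently). This processor makes the "single code point" decision, so the escaped surrogate pair and the literal non-BMP character arrive as the same string, while a lone escaped surrogate is still accepted.

```python
import json

# JSON text with an escaped surrogate pair, exactly as in the ECMA-404 example.
from_escaped_pair = json.loads(r'"\uD834\uDD1E"')

# JSON text containing the character U+1D11E directly.
from_literal_char = json.loads('"\U0001D11E"')

# This processor treats the pair as one code point, so the inputs are
# indistinguishable after parsing.
same_string = from_escaped_pair == from_literal_char
pair_length = len(from_escaped_pair)

# A lone escaped surrogate is still accepted and yields a string that is
# not a sequence of Unicode scalar values.
lone = json.loads(r'"\uD834"')
lone_is_surrogate = 0xD800 <= ord(lone) <= 0xDFFF
```

After parsing, nothing in the resulting value records which of the two input forms was used.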
>  
> It makes sense for a spec to define:
>   unicode-scalar-value = %x0-D7FF / %xE000-10FFFF
>   string = *unicode-scalar-value
>  
> It does not make sense for a spec to define:
>   unicode-code-point = %x0-10FFFF
>   string = *unicode-code-point
> because, for instance, %xD834 %xDD1E and %x1D11E are separate values of that ABNF grammar but will not be treated that way by implementations. In implementations they will be indistinguishable strings. An internal 16-bit format will store the same two 16-bit words for both. Only 1 form can come out.
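
The collision described above can be sketched as follows. The two functions are hypothetical helpers written for this illustration (they are not from the draft or from any library); they implement the "obvious" extension of UTF-16 that passes surrogate code points through as 16-bit units.

```python
def to_utf16_units(code_points):
    """Map code points in %x0-10FFFF to 16-bit units, passing surrogates through."""
    units = []
    for cp in code_points:
        if cp < 0x10000:
            units.append(cp)                      # BMP, including surrogates
        else:
            v = cp - 0x10000
            units.append(0xD800 | (v >> 10))      # high surrogate
            units.append(0xDC00 | (v & 0x3FF))    # low surrogate
    return units

def from_utf16_units(units):
    """Map 16-bit units back to code points, combining surrogate pairs."""
    out, i = [], 0
    while i < len(units):
        u = units[i]
        is_pair = (
            0xD800 <= u <= 0xDBFF
            and i + 1 < len(units)
            and 0xDC00 <= units[i + 1] <= 0xDFFF
        )
        if is_pair:
            out.append(0x10000 + ((u - 0xD800) << 10) + (units[i + 1] - 0xDC00))
            i += 2
        else:
            out.append(u)                         # includes unpaired surrogates
            i += 1
    return out

as_two_code_points = to_utf16_units([0xD834, 0xDD1E])
as_one_code_point = to_utf16_units([0x1D11E])
round_trip = from_utf16_units(as_two_code_points)
```

Both inputs are distinct strings under `*unicode-code-point`, yet they produce the same two 16-bit words, and decoding can only return one of them.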
>  
> For understandable reasons, JSON supports both *(%x0-D7FF / %xE000-10FFFF) and *(%x0-FFFF) (arbitrary 16-bit data) as models for the logical strings it can represent. An implementation can pick either. They don’t exactly overlap. There is probably a complicated ABNF that can cover both involving *(%x0-D7FF / %xE000-10FFFF)  and unpaired-surrogate, but it would be non-trivial (and not that practical). And that ABNF for the logical strings JSON can represent is different from the ABNF for JSON text itself, which excludes controls other than whitespace and has escape sequences. *(%x0-10FFFF) – as implied by 3.1 – doesn’t match any concept here.
>  
> The Unicode spec (chapter 2, General Structure) covers abstract characters, code points, codespace, code units, encoding forms, encoding schemes, etc. draft-bray-unichars-04 is shorter and simpler, but its simplifications aren’t quite precise enough to add clarity.
>  
> P.S. The ABNF productions in draft-bray-unichars-04 should be singular, not plural, e.g. unicode-scalar-value.
>  
> --
> James Manger