Re: [I18ndir] [art] New Version Notification for draft-bray-unichars-06.txt

Carsten Bormann <cabo@tzi.org> Sat, 07 October 2023 01:14 UTC

From: Carsten Bormann <cabo@tzi.org>
In-Reply-To: <SY4PR01MB5980D4B50C9E4E4A1AF92CC4E5C8A@SY4PR01MB5980.ausprd01.prod.outlook.com>
Date: Sat, 07 Oct 2023 03:14:24 +0200
Cc: Tim Bray <tbray@textuality.com>, "i18ndir@ietf.org" <i18ndir@ietf.org>, ART Area <art@ietf.org>
Message-Id: <9276000B-6106-47EA-A97A-B37E93CB82B6@tzi.org>
References: <169566019635.41806.9804796677919971070@ietfa.amsl.com> <CAHBU6is-wU2NLXNWL56nSJ4=nKvDzGv_Aw4qJN6N2O8CuM4-yw@mail.gmail.com> <SYBPR01MB59814B3448F5754AAEDA1740E5C7A@SYBPR01MB5981.ausprd01.prod.outlook.com> <CAHBU6iueqtd5T1T-ciYUMWvmo8XqBQqO5LkWbdRaoXQzPYSQOQ@mail.gmail.com> <SY4PR01MB5980D009F1623E3694B871B7E5C5A@SY4PR01MB5980.ausprd01.prod.outlook.com> <CAChr6SzMXqmEJvwQ0Vb0+CfchBn2kMueQJ-2Th1=4Oct8b9t6A@mail.gmail.com> <E1464943-EB11-4FA4-B933-4F138C6C34A0@tzi.org> <CAHBU6itgC07j0P5DcACDyHSjEOG6=j5kWE=eYF8E0NA3mm_b5A@mail.gmail.com> <SY4PR01MB59803C733B6B6A1C9D4E04F4E5C5A@SY4PR01MB5980.ausprd01.prod.outlook.com> <CAHBU6iuEbKOri56HiTB+HcsPKOpXJArFpbkVnf68=5i8FMWPUg@mail.gmail.com> <SY4PR01MB5980D4B50C9E4E4A1AF92CC4E5C8A@SY4PR01MB5980.ausprd01.prod.outlook.com>
To: "Manger, James" <James.H.Manger=40team.telstra.com@dmarc.ietf.org>
X-Mailer: Apple Mail (2.3731.700.6)
Archived-At: <https://mailarchive.ietf.org/arch/msg/i18ndir/Ih4AT0Z_2tU77nHbN20wiRsgQE8>
Subject: Re: [I18ndir] [art] New Version Notification for draft-bray-unichars-06.txt

+1

For all intents and purposes, “Unicode character” should be understood (and, possibly, defined for use in a document) as “Unicode scalar value”.
There are very few protocols that need finer detail about their repertoire.
Obviously, the 1d and 2d subsets (see modern-network-unicode) make a lot of sense for 1d and 2d applications.

That is really all that most designers of Unicode-based protocols need to know.
Adding some knowledge about normalization may sometimes be useful (see Appendix C).
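
As a minimal sketch of the two points above, in Python: a Unicode scalar value is any code point from U+0000 to U+10FFFF outside the surrogate range U+D800 to U+DFFF, and NFC normalization is available through the standard `unicodedata` module. The helper names `is_scalar_value` and `all_scalar_values` are hypothetical, chosen for illustration, and are not taken from any draft.

```python
import unicodedata


def is_scalar_value(cp: int) -> bool:
    """True for code points U+0000..U+10FFFF, excluding surrogates U+D800..U+DFFF."""
    return 0 <= cp <= 0x10FFFF and not (0xD800 <= cp <= 0xDFFF)


def all_scalar_values(s: str) -> bool:
    """A Python str can hold lone surrogates; check that this one holds none."""
    return all(is_scalar_value(ord(ch)) for ch in s)


# Normalization: "u" followed by U+0308 COMBINING DIAERESIS composes under NFC
# to the single scalar value U+00FC.
decomposed = "u\u0308"
composed = unicodedata.normalize("NFC", decomposed)
# composed == "\u00fc"
```

A protocol that defines "character" this way would reject a string such as `"a\ud800"`, for which `all_scalar_values` returns `False`.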

Grüße, Carsten


> On 7. Oct 2023, at 03:09, Manger, James <James.H.Manger=40team.telstra.com@dmarc.ietf.org> wrote:
> 
> 
> General
> On Oct 2, 2023 at 5:20:59 PM, "Manger, James" <James.H.Manger@team.telstra.com> wrote:
> 
> draft-bray-unichars <https://datatracker.ietf.org/doc/html/draft-bray-unichars> §3 “Dealing with problematic code points” suggests “replacing problematic code points with "�" (U+FFFD, REPLACEMENT CHARACTER)” (or signalling an error, but I’ll only talk about the replacement option in this email).
> An ill-formed sequence of code units needs to be replaced. It is far less obvious to me that “problematic” scalars should be replaced. Even for noncharacters Unicode provides a good FAQ <https://www.unicode.org/faq/private_use.html#nonchar9> and corrigendum #9 “Clarification about noncharacters” <https://www.unicode.org/versions/corrigendum9.html> that suggests passing them along (treating them like unassigned scalars) is often the best policy (because the internal/interchange boundary is blurry).
> > OK, that’s worth a reference.
> So §4.3 defining unicode-assignable that excludes noncharacters is fine -- when to be lenient on receiving a supposed unicode-assignable value is less obvious.
> But §3 looks dodgy.
> > Would a note that it might be reasonable to accept nonchars, referencing that corrigendum, de-dodgify it in your view?
> 
> If it might be reasonable to accept nonchars, it presumably might be reasonable to accept controls or any scalar. To de-dodgify, the text should not conflate ill-formed code units with scalars.
>  
> “Virtuous intolerance” [RFC9413 <https://www.rfc-editor.org/rfc/rfc9413.html#name-virtuous-intolerance>] with respect to UTF-8/16/32 is clear and widely implemented: signal an error; or replace with U+FFFD (or U+003F). Presumably this is why JavaScript is changing JSON.stringify to always escape an unpaired surrogate (not just accepting escaped-unpaired-surrogates in JSON.parse).
>  
> “Virtuous intolerance” with respect to xml-character or unicode-assignable is less clear to me. Maybe it is left to future specs that refer to these repertoires? Or maybe this doc can pick “rules for consistent handling of aberrant conditions”. That means this doc doesn’t merely name some repertoires but adds handling rules. Sounds feasible; could be controversial. Can we pick “always signal an error”? Or do we need to offer “or replace scalars-not-in-the-repertoire with U+FFFD”? It’s just that I’m not sure any systems do the latter.
>  
> In any case, I’d like to see any such “virtuous intolerance” rules for this doc’s repertoires described separately from Unicode’s existing “virtuous intolerance” rules for UTF-8/16/32.
>  
> --
> James Manger
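
The separation asked for in the quoted message can be sketched in Python as two distinct layers. The first layer handles ill-formed UTF-8 with the decoder's built-in `strict` and `replace` error handlers. The second layer checks already-decoded scalar values against a repertoire and, as one possible policy, always signals an error. The names `decode_strict`, `decode_replacing`, `is_noncharacter`, and `check_repertoire` are hypothetical, and the predicate used here only excludes noncharacters; it is an assumption for illustration, not the draft's definition of unicode-assignable.

```python
from typing import Callable


def decode_strict(data: bytes) -> str:
    """Layer 1, option A: signal an error on ill-formed UTF-8."""
    return data.decode("utf-8", errors="strict")  # raises UnicodeDecodeError


def decode_replacing(data: bytes) -> str:
    """Layer 1, option B: replace ill-formed sequences with U+FFFD."""
    return data.decode("utf-8", errors="replace")


def is_noncharacter(cp: int) -> bool:
    """U+FDD0..U+FDEF, plus the last two code points of every plane."""
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def check_repertoire(s: str, allowed: Callable[[int], bool]) -> str:
    """Layer 2: reject well-formed text containing scalars outside the repertoire."""
    for i, ch in enumerate(s):
        if not allowed(ord(ch)):
            raise ValueError(f"U+{ord(ch):04X} at index {i} is outside the repertoire")
    return s


def no_noncharacters(cp: int) -> bool:
    """Assumed repertoire predicate for this sketch only."""
    return not is_noncharacter(cp)
```

Keeping the layers apart means a receiver can choose leniency for noncharacters (for example by passing a more permissive predicate) without touching how ill-formed code unit sequences are handled.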