Re: [I18ndir] [art] Just uploaded draft-bray-unichars-03

"Manger, James" <James.H.Manger@team.telstra.com> Sun, 10 September 2023 13:51 UTC

From: "Manger, James" <James.H.Manger@team.telstra.com>
To: Tim Bray <tbray@textuality.com>, "i18ndir@ietf.org" <i18ndir@ietf.org>, ART Area <art@ietf.org>
Thread-Topic: [art] Just uploaded draft-bray-unichars-03
Thread-Index: AQHZ4owMGc58/jC4xEOLxa8BxMKqeLAT9RRN
Date: Sun, 10 Sep 2023 13:51:33 +0000
Message-ID: <ME3PR01MB59730B45D9339180AF00E941E5F3A@ME3PR01MB5973.ausprd01.prod.outlook.com>
References: <CAHBU6is50TkpDsqXTp6WxdVSgE66j3gGHZ60ey2jFYbefaHFJw@mail.gmail.com>
In-Reply-To: <CAHBU6is50TkpDsqXTp6WxdVSgE66j3gGHZ60ey2jFYbefaHFJw@mail.gmail.com>
Archived-At: <https://mailarchive.ietf.org/arch/msg/i18ndir/Ma8Rla45xX09jP-x14kebH8NB2Q>
Subject: Re: [I18ndir] [art] Just uploaded draft-bray-unichars-03

Comments on draft-bray-unichars-03<https://www.ietf.org/archive/id/draft-bray-unichars-03.html>

Section 3.1. Unicode Code Points
The default repertoire of CBOR is unicode-scalar-values, not unicode-code-points. RFC8949 CBOR states that its string type “major type 3” is “a text string encoded as UTF-8”. That (since it is UTF-8) can’t include surrogates. It also states that “characters in this type are never escaped” so a JSON "\uDEAD" escape cannot be used to sneak in a surrogate. RFC8949 does use the phrase “Unicode code point” but appends “(scalar value)” at one point.

The default repertoire of JSON is not unicode-code-points since JSON excludes controls except tab, newline and carriage return. Given this spec distinguishes useful-assignables from unicode-scalar-values it should distinguish JSON’s actual subset from unicode-code-points if it is going to mention JSON.

3.1 needs to state explicitly that the unicode-code-points repertoire cannot be encoded in full in well-formed UTF-8 (or UTF-16 or UTF-32), because it includes surrogates. It can only be used via higher-level escape sequences in protocols that offer them (such as JSON). This is mentioned in 2.2.1 (“it is impossible to represent a surrogate in well-formed UTF-8”), but it also needs to be in 3.1. Otherwise, 3.1 and 3.2 appear as two similar choices, which hides their huge difference.
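
A minimal Python sketch of that difference, for illustration only. Python is just a convenient choice here, and the helper names `is_scalar_value_string` and `encodable` are hypothetical, not taken from the draft. It shows that a string holding a surrogate is rejected by Python's strict encoders for all three UTF forms, and that the three bytes a naive encoder would emit for U+DEAD are rejected by a strict UTF-8 decoder:

```python
def is_scalar_value_string(s: str) -> bool:
    """Return True if every code point in s is a Unicode scalar value.

    Hypothetical helper: scalar values are all code points except the
    surrogates U+D800..U+DFFF.
    """
    return not any(0xD800 <= ord(ch) <= 0xDFFF for ch in s)


def encodable(s: str, encoding: str) -> bool:
    """Return True if Python's strict encoder accepts s in this encoding."""
    try:
        s.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


lone = "\udead"  # an unpaired surrogate, representable in memory only

print(is_scalar_value_string(lone))  # False
print([encodable(lone, e) for e in ("utf-8", "utf-16", "utf-32")])
# [False, False, False]

# ED BA AD is what a naive 3-byte UTF-8 encoding of U+DEAD would look like.
try:
    b"\xed\xba\xad".decode("utf-8")
    decoded = True
except UnicodeDecodeError:
    decoded = False
print(decoded)  # False
```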

Section 2.2.3. Noncharacters
This spec highlights noncharacters for exclusion. However, Unicode explicitly warns against that: Corrigendum #9 Clarification About Noncharacters<https://www.unicode.org/versions/corrigendum9.html> says “the real intent of noncharacters is that they are permanently prohibited from being assigned standard, interchangeable meanings, rather than that they are prohibited from occurring in Unicode strings which happen to be interchanged”.
So an IETF spec is never going to define a string that needs a noncharacter; but it’s never going to define a string that needs a private-use character either. If a spec defines an element that can hold any string, should that allow private-use characters but exclude noncharacters and non-useful controls? I’m not sure. That still leaves a lot of junk (e.g. BOM).
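
To make the categories in question concrete, here is a small Python sketch. The function names are hypothetical and chosen for illustration; the ranges are the 66 noncharacters and the three private-use ranges as defined by Unicode. Note that the BOM, U+FEFF, falls in neither category:

```python
def is_noncharacter(cp: int) -> bool:
    """Return True if code point cp is one of Unicode's 66 noncharacters."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("not a Unicode code point")
    # U+FDD0..U+FDEF, plus the last two code points of each of the 17 planes.
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def is_private_use(cp: int) -> bool:
    """Return True if cp lies in one of the three private-use ranges."""
    return (0xE000 <= cp <= 0xF8FF
            or 0xF0000 <= cp <= 0xFFFFD
            or 0x100000 <= cp <= 0x10FFFD)


print(is_noncharacter(0x7FFFF))  # True
print(is_noncharacter(0xFEFF), is_private_use(0xFEFF))  # False False
print(sum(is_noncharacter(cp) for cp in range(0x110000)))  # 66
```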

Section 5. Refining Character Repertoires
"\u7FFFF" is NOT a JSON escape for U+7FFFF; it is a JSON escape for U+7FFF followed by an F character (as a few others have pointed out).
A proper JSON escape for U+7FFFF is "\uD9BF\uDFFF".
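
The arithmetic can be checked with a short Python sketch. The helper `json_escape` is a hypothetical name used only for illustration; the parsing is done by Python's standard `json` module, which is one parser among many:

```python
import json


def json_escape(cp: int) -> str:
    """Return the JSON escape for a scalar value (a surrogate pair above U+FFFF)."""
    if 0xD800 <= cp <= 0xDFFF or not 0 <= cp <= 0x10FFFF:
        raise ValueError("not a Unicode scalar value")
    if cp <= 0xFFFF:
        return "\\u%04X" % cp
    v = cp - 0x10000                # 20-bit offset into the supplementary planes
    high = 0xD800 + (v >> 10)       # top 10 bits
    low = 0xDC00 + (v & 0x3FF)      # bottom 10 bits
    return "\\u%04X\\u%04X" % (high, low)


print(json_escape(0x7FFFF))  # \uD9BF\uDFFF

wrong = json.loads('"\\u7FFFF"')
right = json.loads('"' + json_escape(0x7FFFF) + '"')
print([hex(ord(ch)) for ch in wrong])  # ['0x7fff', '0x46']
print([hex(ord(ch)) for ch in right])  # ['0x7ffff']
```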

One form of escaping appearing in a character repertoire spec is jarring.

I agree that “many libraries will silently parse” "\uDEAD", but I’m not sure how many “generate an ill-formed UTF-8 string”. In Java, for instance, "\uDEAD".getBytes("UTF-8") returns a single byte 0x3F “?” – it’s valid UTF-8, just no longer a 1-to-1 representation of the in-memory unpaired surrogate.
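
Python gives a comparable picture, sketched below under the assumption that the standard `json` module and the built-in codecs are representative of what one library does (other libraries may well differ). The parser silently accepts the escape; the strict UTF-8 encoder then refuses the result, the "replace" error handler substitutes "?" much like the Java call above, and only the opt-in "surrogatepass" handler emits ill-formed bytes:

```python
import json

parsed = json.loads('"\\uDEAD"')  # accepted without complaint
print(hex(ord(parsed)))  # 0xdead

# Default (strict) encoding refuses the unpaired surrogate.
try:
    parsed.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False
print(strict_ok)  # False

# Lossy but well-formed output, similar to the Java behaviour described above.
replaced = parsed.encode("utf-8", errors="replace")
print(replaced)  # b'?'

# Ill-formed UTF-8 is produced only when explicitly requested.
passed = parsed.encode("utf-8", errors="surrogatepass")
print(passed)  # b'\xed\xba\xad'
```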

--
James Manger





From: art <art-bounces@ietf.org> on behalf of Tim Bray <tbray@textuality.com>
Date: Saturday, 9 September 2023 at 5:38 am
To: i18ndir@ietf.org <i18ndir@ietf.org>, ART Area <art@ietf.org>
Subject: [art] Just uploaded draft-bray-unichars-03
See https://www.ietf.org/archive/id/draft-bray-unichars-03.html

A bunch of minor corrections and improvements, thanks to everyone for that, especially James Manger for noticing that the ABNF was entirely wrong in one place.

The word “useless” has been replaced by “legacy”.

I think the feedback was pretty clear that the draft needed to be more opinionated; just because we document the existence of the default JSON repertoire (“all the code points”) doesn’t mean that anyone should use it in the present or future. So, introduced a new section “Refining Character Repertoires” to highlight those issues and offer a suggestion.