Re: [I18ndir] [art] Fwd: New Version Notification for draft-bray-unichars-04.txt

"Manger, James" <James.H.Manger@team.telstra.com> Tue, 19 September 2023 12:59 UTC

From: "Manger, James" <James.H.Manger@team.telstra.com>
To: Rob Sayre <sayrer@gmail.com>, Asmus Freytag <asmusf@ix.netcom.com>
CC: Tim Bray <tbray@textuality.com>, ART Area <art@ietf.org>, "i18ndir@ietf.org" <i18ndir@ietf.org>
Date: Tue, 19 Sep 2023 12:59:01 +0000
Message-ID: <SY4PR01MB5980320C8286458EA28EFF32E5FAA@SY4PR01MB5980.ausprd01.prod.outlook.com>
References: <169479938668.18742.9199862891950651366@ietfa.amsl.com> <CAHBU6ivzUV947N+n7AoYkCFT3ZfaLobCQ4fBXw3dvkqTT=LBAw@mail.gmail.com> <SY4PR01MB5980D8DDE229D1C57AEDFB55E5FBA@SY4PR01MB5980.ausprd01.prod.outlook.com> <CAChr6SzRa8F+OrELa8N3rAMLmxdvr-g5c0i_9ESnWnwZY-iA4A@mail.gmail.com> <CAChr6Sy05spOW9nsy36kYr8Ob6OYS7vCgrEVPhhWs9Pe4LkpNA@mail.gmail.com> <2e6c2d13-9fc9-d320-3803-2b9a4df3b042@ix.netcom.com> <CAChr6Swr5tS2-wW8dZ0A4J7_Jd+RoHZNJkzhNfcVTi84oDvOPA@mail.gmail.com>
In-Reply-To: <CAChr6Swr5tS2-wW8dZ0A4J7_Jd+RoHZNJkzhNfcVTi84oDvOPA@mail.gmail.com>
Archived-At: <https://mailarchive.ietf.org/arch/msg/i18ndir/HsJUFHsLbwV0n4fch1Rzwwiej10>
Subject: Re: [I18ndir] [art] Fwd: New Version Notification for draft-bray-unichars-04.txt

TR17 and Unicode are careful about the distinction between scalar values, code points, and code units. Scalar values and code units are far and away the most important. draft-bray-unichars-04 tries to simplify, but by emphasising code points it adds confusion and inaccuracy.

Plenty of systems will parse the JSON "\uDEAD" to and from an internal string representation, but that JSON is 8 ASCII characters (including the quotes) – there is nothing ill-formed there. It’s far less clear to me whether many systems will parse "<ill-formed code unit sequence for U+DEAD>" from five 8-bit WTF-8 code units, three WTF-16 code units, or three WTF-32 code units (counting the quotes in each case). And it’s unclear to me whether ECMA-404 expects anything to parse those.
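
A minimal sketch of that distinction, assuming Python's standard json module as one example of such a system (Python is only my choice for illustration; the names text, value, round_trip and encodable are made up for this sketch). The escaped JSON text is plain ASCII and parses to a one-code-point string holding a lone surrogate, yet that string has no well-formed UTF-8 encoding:

```python
import json

# The JSON text itself: 8 ASCII characters, quotes included.
text = '"\\uDEAD"'
assert len(text) == 8 and text.isascii()

# Python's json module accepts the escape and yields a lone surrogate.
value = json.loads(text)
assert len(value) == 1 and ord(value) == 0xDEAD

# Serialising again produces an ASCII escape, not raw ill-formed bytes.
round_trip = json.dumps(value)

# The parsed string cannot be encoded as well-formed UTF-8.
try:
    value.encode("utf-8")
    encodable = True
except UnicodeEncodeError:
    encodable = False
```

Other parsers may reject the escape or substitute U+FFFD; this shows one behaviour, not a required one.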

Let’s consider the ABNF
  str = *(%x0-10FFFF)
and two values
  str1 = %x1D11E
  str2 = %xD834 %xDD1E
Both str1 and str2 are valid str values. They are different (1 symbol vs 2). Any ABNF tool will say they are different. If internally my system uses 32-bit code units these will be different: arrays of 1 and 2 code units respectively. But somehow if my system uses 16-bit code units internally then str1 and str2 are identical – indistinguishable. If my system uses 8-bit code units str1 will be four bytes F0 9D 84 9E, str2 will be … uhm … maybe 6 bytes ED A0 B4  ED B4 9E. Somehow this super-simple ABNF *(%x0-10FFFF) has unleashed weird system-dependent behaviour. The weirdness doesn’t come from the ABNF; it’s from somewhere else.
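
The two values above can be reproduced in a short sketch. It assumes Python, whose str type is a sequence of code points, and uses the "surrogatepass" error handler so that the codecs emit surrogates instead of raising; that handler is a Python convenience, not something the ABNF or any spec in this thread defines:

```python
# The two ABNF values as Python strings (sequences of code points).
str1 = "\U0001D11E"        # %x1D11E
str2 = "\ud834\udd1e"      # %xD834 %xDD1E

# As code points (or 32-bit code units) they differ: 1 item vs 2.
assert len(str1) == 1 and len(str2) == 2 and str1 != str2
u32_1 = str1.encode("utf-32-be", "surrogatepass")
u32_2 = str2.encode("utf-32-be", "surrogatepass")

# As 16-bit code units they become indistinguishable.
u16_1 = str1.encode("utf-16-be", "surrogatepass")
u16_2 = str2.encode("utf-16-be", "surrogatepass")

# As 8-bit code units they differ again: 4 bytes vs 6 bytes.
u8_1 = str1.encode("utf-8", "surrogatepass")
u8_2 = str2.encode("utf-8", "surrogatepass")

print(u32_1.hex(), u32_2.hex())
print(u16_1.hex(), u16_2.hex())
print(u8_1.hex(), u8_2.hex())
```

Without "surrogatepass" the three encodes of str2 raise UnicodeEncodeError, which is the strict, well-formed-only behaviour.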

JSON is not as weird as *(%x0-10FFFF). You can interpret JSON specs to imply an unpaired surrogate can only appear as an escape sequence. So that’s a higher-layer concern that doesn’t impact your UTF-8/16/32 encode/decode part at all. ECMA-404 doesn’t mention lone surrogates at all. And even if you go for maximum 16-bit backward compatibility, it is far more easily described as “accept ill-formed UTF-16 if you can; send ill-formed UTF-16 with lone surrogates escaped if you must”.
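
As a rough sketch of the "send with lone surrogates escaped" half of that rule, assuming the internal string is a list of 16-bit code units: the function name json_text_from_units is hypothetical, and this is an illustration in Python, not code from any JSON spec or library. Well-formed surrogate pairs become ordinary characters, and only unpaired surrogates are written as \uXXXX escapes, so the resulting JSON text is always encodable as well-formed UTF-8:

```python
def json_text_from_units(units):
    """Serialise 16-bit code units as a JSON string (hypothetical helper).

    Surrogate pairs are combined into one character; lone surrogates,
    controls, '"' and '\\' are escaped; everything else is copied as-is.
    """
    out = ['"']
    i = 0
    while i < len(units):
        u = units[i]
        is_high = 0xD800 <= u <= 0xDBFF
        has_low = i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF
        if is_high and has_low:
            # Well-formed pair: emit the supplementary character itself.
            cp = 0x10000 + ((u - 0xD800) << 10) + (units[i + 1] - 0xDC00)
            out.append(chr(cp))
            i += 2
            continue
        if 0xD800 <= u <= 0xDFFF or u < 0x20:
            # Lone surrogate or control character: escape it.
            out.append("\\u%04X" % u)
        elif u in (0x22, 0x5C):
            # Quotation mark or reverse solidus.
            out.append("\\" + chr(u))
        else:
            out.append(chr(u))
        i += 1
    out.append('"')
    return "".join(out)


paired = json_text_from_units([0xD834, 0xDD1E])   # one character, U+1D11E
lone = json_text_from_units([0x61, 0xDEAD])       # 'a' then an escape
wire_bytes = lone.encode("utf-8")                 # strict encode succeeds
```

The design point is that the ill-formedness lives only in the escape layer; the UTF-8 encoder underneath never sees a surrogate.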

That’s why I really dislike *(%x0-10FFFF) being used as a shorthand for how JSON can support Unicode plus arbitrary 16-bit data (i.e. ill-formed UTF-16). JSON specs muddle through by defining JSON text rather than the logical strings a text can represent; by having an escaping mechanism; by being a bit loose; by leaving semantic decisions to the specific processor; and with paragraphs on interop.

If draft-bray-unichars defines *(%x0-10FFFF) as a standalone item labelled a character repertoire or subset of Unicode characters, it needs oodles more text to explain the weirdness above. Treating surrogates on a par with control codes and noncharacters as “problematic” code points does not come close.



>>>Perhaps this document should reference <https://unicode.org/reports/tr17/#Strings> (note authors), which covers similar territory.

>>Thanks for noticing.

>>The need for transient states that are discoverable (that is, not fully encapsulated) is a big reason why many specs are not tighter.

>>However, there are points in a protocol where strings aren't in a transient processing state, and here the full restrictions should apply (and be specified).
> Yes. The problem here is that JSON can transmit stuff resembling these: "For example, strings in Java, C#, or ECMAScript are Unicode 16-bit strings, but are not necessarily well-formed UTF-16 sequences." I also mentioned it because it says "A string data type is simply a sequence of code units.", which matches ECMA-404 pretty well.
>
> Here, the distinction between "string" and UTF-8/UTF-16/UTF-32 is clearly drawn. To use James' example:
>
> ---
>
> It does not make sense for a spec to define:
>  unicode-code-point = %x0-10FFFF
>  string = *unicode-code-point
>
> ---
>
> It seems to me that TR17 defines "string" this way. Which is not to say that I recommend sending these things over the internet, just that it can happen. I think the draft does a decent job discouraging this one, but I guess it will have to be yet clearer.

No. TR17 says a Unicode 16-bit string is *(%x0-FFFF). Quite different from *(%x0-10FFFF).
TR17 does look good.

--
James Manger


