Re: [I18ndir] [art] Just uploaded draft-bray-unichars-03

"Manger, James" <James.H.Manger@team.telstra.com> Mon, 11 September 2023 02:26 UTC

From: "Manger, James" <James.H.Manger@team.telstra.com>
To: Tim Bray <tbray@textuality.com>, Asmus Freytag <asmusf@ix.netcom.com>
CC: "i18ndir@ietf.org" <i18ndir@ietf.org>, ART Area <art@ietf.org>
Date: Mon, 11 Sep 2023 02:26:26 +0000
Message-ID: <ME3PR01MB5973C8061732F354E5C7F242E5F2A@ME3PR01MB5973.ausprd01.prod.outlook.com>
References: <CAHBU6is50TkpDsqXTp6WxdVSgE66j3gGHZ60ey2jFYbefaHFJw@mail.gmail.com> <ME3PR01MB59730B45D9339180AF00E941E5F3A@ME3PR01MB5973.ausprd01.prod.outlook.com> <CAHBU6ivc4W3KyYtbK2H7PQUa8C4+g=73nSTgBK+xLXnzH7V6GA@mail.gmail.com>
In-Reply-To: <CAHBU6ivc4W3KyYtbK2H7PQUa8C4+g=73nSTgBK+xLXnzH7V6GA@mail.gmail.com>
Archived-At: <https://mailarchive.ietf.org/arch/msg/i18ndir/PAONaNQp8DWhfAfGpRZWg0puRKo>

> Also, I bet that if I had a JSON text {"example": "\uDEAD"} and fed it to JSON-to-CBOR converters, a lot of them would emit CBOR containing ill-formed UTF-8.

CBORMapper from the Jackson library doesn’t: it throws instead. (I don’t have much experience with CBOR converters, but Jackson is a very popular Java JSON library.)

com.fasterxml.jackson.core.JsonGenerationException: Invalid surrogate pair, starts with invalid high surrogate (0xDEAD), not in valid range [0xD800, 0xDBFF]
    at app//com.fasterxml.jackson.core.JsonGenerator._reportError(JsonGenerator.java:2849)
    at app//com.fasterxml.jackson.dataformat.cbor.CBORGenerator._invalidSurrogateStart(CBORGenerator.java:1723)
    at app//com.fasterxml.jackson.dataformat.cbor.CBORGenerator._encode2(CBORGenerator.java:1703)
    at app//com.fasterxml.jackson.dataformat.cbor.CBORGenerator._encode(CBORGenerator.java:1660)
    at app//com.fasterxml.jackson.dataformat.cbor.CBORGenerator._writeString(CBORGenerator.java:1451)
    at app//com.fasterxml.jackson.dataformat.cbor.CBORGenerator.writeString(CBORGenerator.java:877)
    at app//com.fasterxml.jackson.databind.ser.std.StringSerializer.serialize(StringSerializer.java:41)
    at app//com.fasterxml.jackson.databind.ser.DefaultSerializerProvider._serialize(DefaultSerializerProvider.java:479)
    at app//com.fasterxml.jackson.databind.ser.DefaultSerializerProvider.serializeValue(DefaultSerializerProvider.java:318)
    at app//com.fasterxml.jackson.databind.ObjectMapper._writeValueAndClose(ObjectMapper.java:4719)
    at app//com.fasterxml.jackson.databind.ObjectMapper.writeValueAsBytes(ObjectMapper.java:3987)

> BTW this assertion that “UTF-8 can’t include surrogates”, which has been made repeatedly, needs to be taken with a grain of salt. The UTF-8 procedures for converting between code points and byte sequence work perfectly well for surrogates and a whole lot of software out there will silently convert both ways. The UTF-8 in question is in fact not well-formed nor does it conform to the definition of UTF-8, but it exists in the wild and it can’t really be defined as “non-existent”.

A whole lot of software does NOT silently accept ill-formed UTF-8|16|32.
So even if some software does, and even though ill-formed UTF-8 exists, unicode-code-points is still totally unsuitable as a “character repertoire”.
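As a rough illustration of software that does not silently accept it, here is a minimal sketch using only the JDK's own UTF-8 decoder in strict (REPORT) mode. The class and method names are made up for this example. The bytes ED BA AD are what the UTF-8 bit-packing procedure produces when applied to U+DEAD.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, named for this example only.
public class StrictUtf8Check {
    // Returns true only if the JDK's strict UTF-8 decoder accepts the bytes.
    static boolean isWellFormedUtf8(byte[] bytes) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Byte sequence obtained by running the UTF-8 packing rules on U+DEAD.
        byte[] surrogateBytes = {(byte) 0xED, (byte) 0xBA, (byte) 0xAD};
        // Well-formed two-byte encoding of U+00E9.
        byte[] ordinaryBytes = {(byte) 0xC3, (byte) 0xA9};
        System.out.println("ED BA AD accepted: " + isWellFormedUtf8(surrogateBytes));
        System.out.println("C3 A9 accepted:    " + isWellFormedUtf8(ordinaryBytes));
    }
}
```

The lenient path (new String(bytes, UTF_8)) substitutes a replacement character instead of failing, so which behaviour an application sees depends on the API it calls.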

>> The default repertoire of JSON is not unicode-code-points since JSON excludes controls except tab, newline and carriage return. Given this spec distinguishes useful-assignables from unicode-scalar-values, it should distinguish JSON’s actual subset from unicode-code-points if it is going to mention JSON.

> You’re thinking of XML? https://datatracker.ietf.org/doc/html/rfc8259#section-7 says that C0 controls must be expressed in \u notation but they’re allowed.

JSON can describe a string containing U+0000 (using an escape sequence of 6 ASCII characters), but the JSON text itself is not allowed to contain a raw U+0000. So the character repertoire for the JSON protocol (the ABNF for the JSON data format) does not allow U+0000. json.org has '0020' . '10FFFF'; ECMA-404 says "any code point except " or \ or control character"; RFC 8259 has %x20-21 / %x23-5B / %x5D-10FFFF. None of these are unicode-code-points.
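To make the RFC 8259 range concrete, here is a small sketch that transcribes the "unescaped" production quoted above into a predicate. The class and method names are invented for illustration, and it only covers characters appearing raw inside a JSON string, not the whitespace allowed between tokens.

```java
// Hypothetical helper class, named for this example only.
public class JsonUnescaped {
    // Direct transcription of RFC 8259 section 7:
    //   unescaped = %x20-21 / %x23-5B / %x5D-10FFFF
    static boolean isUnescaped(int codePoint) {
        return (codePoint >= 0x20 && codePoint <= 0x21)
                || (codePoint >= 0x23 && codePoint <= 0x5B)
                || (codePoint >= 0x5D && codePoint <= 0x10FFFF);
    }

    public static void main(String[] args) {
        System.out.println("U+0000: " + isUnescaped(0x0000)); // C0 control, must be escaped
        System.out.println("U+001F: " + isUnescaped(0x001F)); // C0 control, must be escaped
        System.out.println("U+0022: " + isUnescaped(0x0022)); // quotation mark, must be escaped
        System.out.println("U+005C: " + isUnescaped(0x005C)); // reverse solidus, must be escaped
        System.out.println("U+0041: " + isUnescaped(0x0041)); // plain letter, allowed raw
    }
}
```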

>> I agree that “many libraries will silently parse” "\uDEAD", but I’m not sure how many “generate an ill-formed UTF-8 string”. In Java, for instance, "\uDEAD".getBytes("UTF-8") returns a single byte 0x3F "?" – it’s valid UTF-8, just no longer a 1-to-1 representation of the in-memory unpaired surrogate.

> Wow. I had no idea. As with many aspects of Java+Unicode, this feels deeply wrong. It should either round-trip or throw a damn exception. Anyhow, that ship sailed a long time ago.  I think we should include the Java example to illustrate another way that surrogates can lead to breakage.

Java does detect that UTF-8-encoding a lone surrogate is an error: CharsetEncoder.encode throws a MalformedInputException by default. But Java helpfully offers 3 ways to handle malformed-input and unmappable-character errors: CodingErrorAction.IGNORE|REPLACE|REPORT. The REPLACE option drops the erroneous input and appends a replacement value in its place – which is what String.getBytes(Charset) is defined to do.
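The three policies can be compared side by side with a short sketch like the following. It uses only java.nio.charset; the class name and the tryEncode helper are hypothetical, and returning null on failure is a simplification for the demo.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class, named for this example only.
public class LoneSurrogateEncoding {
    // Encodes s as UTF-8 under the given error policy.
    // Returns null if the encoder reports an error (simplification for the demo).
    static byte[] tryEncode(String s, CodingErrorAction action) {
        CharsetEncoder encoder = StandardCharsets.UTF_8.newEncoder()
                .onMalformedInput(action)
                .onUnmappableCharacter(action);
        try {
            ByteBuffer buffer = encoder.encode(CharBuffer.wrap(s));
            byte[] out = new byte[buffer.remaining()];
            buffer.get(out);
            return out;
        } catch (CharacterCodingException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        String input = "a\uDEADb"; // lone low surrogate between two ASCII letters
        System.out.println("REPORT:   " + Arrays.toString(tryEncode(input, CodingErrorAction.REPORT)));
        System.out.println("REPLACE:  " + Arrays.toString(tryEncode(input, CodingErrorAction.REPLACE)));
        System.out.println("IGNORE:   " + Arrays.toString(tryEncode(input, CodingErrorAction.IGNORE)));
        System.out.println("getBytes: " + Arrays.toString(input.getBytes(StandardCharsets.UTF_8)));
    }
}
```

With REPORT the helper returns null because the encoder throws; REPLACE and getBytes both substitute the encoder's replacement byte 0x3F for the surrogate, and IGNORE drops it.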

You can (with at least some libraries) round-trip JSON with lone surrogates in Java – but the output has to be JSON (with the lone surrogate again represented with an escape).
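A minimal sketch of the escaping step such a round-trip relies on, written against the plain JDK rather than any particular JSON library. The names are hypothetical, and the method only handles surrogates: a real serializer would also escape quotes, backslashes and controls.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, named for this example only.
public class SurrogateEscaper {
    // Replaces each unpaired surrogate with a six-character JSON escape,
    // leaving well-formed surrogate pairs and all other characters untouched.
    static String escapeLoneSurrogates(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (Character.isHighSurrogate(c) && i + 1 < s.length()
                    && Character.isLowSurrogate(s.charAt(i + 1))) {
                sb.append(c).append(s.charAt(++i)); // valid pair, copy both units
            } else if (Character.isSurrogate(c)) {
                sb.append(String.format("\\u%04X", (int) c)); // lone surrogate, escape it
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String escaped = escapeLoneSurrogates("a\uDEADb");
        System.out.println(escaped);
        // The escaped form is pure ASCII, so encoding it loses nothing.
        System.out.println(escaped.getBytes(StandardCharsets.UTF_8).length);
    }
}
```

Because the lone surrogate travels as ASCII escape characters, the serialized bytes stay well-formed UTF-8, and a parser that tolerates such escapes can rebuild the original in-memory string.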

