Re: [art] draft-bray-unichars

"Martin J. Dürst" <duerst@it.aoyama.ac.jp> Thu, 31 August 2023 09:02 UTC

Message-ID: <f99dbf36-70d5-4724-2e13-fd17b0dcacdc@it.aoyama.ac.jp>
Date: Thu, 31 Aug 2023 18:02:00 +0900
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:102.0) Gecko/20100101 Thunderbird/102.14.0
Content-Language: en-US
To: Carsten Bormann <cabo@tzi.org>, Tim Bray <tbray@textuality.com>
Cc: art@ietf.org
References: <CAHBU6iuDwquhacp1r7qREfaA1CGLR5LjqdasMdOQUQim6NeJsw@mail.gmail.com> <D870487D-0398-4C91-A1F3-69F1C5E6D036@tzi.org>
From: "Martin J. Dürst" <duerst@it.aoyama.ac.jp>
Organization: Aoyama Gakuin University
In-Reply-To: <D870487D-0398-4C91-A1F3-69F1C5E6D036@tzi.org>
Archived-At: <https://mailarchive.ietf.org/arch/msg/art/vLQkPUEcSFAlxPNs-A6A1tsEdkA>
Subject: Re: [art] draft-bray-unichars

And here are the important references for Carsten's backgrounder:

RFC 3629, UTF-8, a transformation format of ISO 10646
(https://www.rfc-editor.org/rfc/rfc3629.html).
[This RFC is extremely clear that lone surrogates, overlong encodings, 
and the like are not allowed.]
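
As a small illustration of that strictness (a sketch that relies on Python's built-in strict UTF-8 decoder, not on anything quoted from the RFC; the helper name `is_valid_utf8` is made up for this example), the byte sequences for a lone surrogate and for an overlong encoding are both rejected, while a properly encoded character is accepted:

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes under Python's strict UTF-8 codec."""
    try:
        data.decode("utf-8", errors="strict")
        return True
    except UnicodeDecodeError:
        return False


# U+D800 (a high surrogate) forced through the three-byte UTF-8 bit pattern.
LONE_SURROGATE = b"\xed\xa0\x80"
# "/" (U+002F) spread over two bytes instead of one: an overlong form.
OVERLONG_SLASH = b"\xc0\xaf"
# U+00FC in its regular two-byte form.
VALID_U_UMLAUT = b"\xc3\xbc"

if __name__ == "__main__":
    for sample in (LONE_SURROGATE, OVERLONG_SLASH, VALID_U_UMLAUT):
        print(sample, is_valid_utf8(sample))
```

Other decoders may report these errors differently (for example by substituting U+FFFD), so this only shows one conforming behavior.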

RFC 2277, IETF Policy on Character Sets and Languages
(https://www.rfc-editor.org/rfc/rfc2277.html)

If draft-bray-unichars has any future, the above two references should 
definitely be included and discussed.

Regards,   Martin.

On 2023-08-30 03:35, Carsten Bormann wrote:
> Hi Tim,
> 
> it is certainly useful to write a backgrounder on how to use Unicode in today’s network protocols.
> 
> Actually, I started writing such a document [1], and it seems I’ll need to pick up where I left off before the pandemic.
> 
> [1]: https://datatracker.ietf.org/doc/html/draft-bormann-dispatch-modern-network-unicode-02
> 
> (I received some very good feedback at the time that I can use to create the next revision of this document.)
> 
> The document [2] being announced here has a slightly different background: It seems to have been motivated by the discussion of an errata report that is trying to change RFC 8259 [3] and was discussed at length in [4].
> 
> [2]: https://datatracker.ietf.org/doc/draft-bray-unichars/
> [3]: https://www.rfc-editor.org/errata/eid7603
> [4]: https://mailarchive.ietf.org/arch/msg/json/Hkks1atRTycjGi0Hh2NWhdef8W8
> 
> The change requested was:
> 
> Original Text
> -------------
>    A string is a sequence of zero or more Unicode characters [UNICODE].
> 
> Corrected Text
> --------------
>    A string is a sequence of zero or more Unicode code points [UNICODE].
> 
> Even if this may not be obvious at first glance, this would have been a rather significant change of an approved document, so there was a lot of discussion.
> 
> ## Backgrounder
> 
> The IETF took a decision in the late 1990s favoring Unicode, with UTF-8 as its interchange format.  That decision has been upheld in the IETF for almost a quarter of a century now.
> 
> One problem with the introduction of Unicode, and with replacing what was in the marketplace before, was that Unicode was initially based on 16-bit characters (UCS-2).  When it became clear that this wouldn’t be enough, a number of environments had already picked up UCS-2 and built platforms around it.  The extension to roughly 21 bits that Unicode then underwent was realized on these platforms by switching to UTF-16, a “Unicode transformation format” based on 16-bit code units that reserves certain code points (“surrogates”) for use in pairs to represent characters that don’t fit into 16 bits.
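
To make the pairing concrete, here is a minimal sketch of the surrogate arithmetic UTF-16 uses for code points above U+FFFF. The function names are hypothetical and chosen only for this illustration; the constants (0x10000, 0xD800, 0xDC00) are the standard ones.

```python
from __future__ import annotations


def to_surrogate_pair(code_point: int) -> tuple[int, int]:
    """Split a code point above U+FFFF into a (high, low) surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only U+10000..U+10FFFF are encoded as surrogate pairs")
    offset = code_point - 0x10000       # leaves a 20-bit value
    high = 0xD800 + (offset >> 10)      # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)     # bottom 10 bits
    return high, low


def from_surrogate_pair(high: int, low: int) -> int:
    """Recombine a well-formed surrogate pair into the original code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a high surrogate followed by a low surrogate")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


if __name__ == "__main__":
    high, low = to_surrogate_pair(0x1F600)
    print(f"U+1F600 -> {high:04X} {low:04X}")
```

Neither half of such a pair stands for a character on its own, which is why a lone surrogate carries no meaning.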
> 
> The UCS-2 based character models of the legacy 16-bit platforms in many cases could not be repaired to fully embrace UTF-16 right away; for example, only much later did ECMAScript introduce the “u” (Unicode) flag for regular expressions so that they actually match Unicode characters.  So, on these platforms, UTF-16 is transported in a UCS-2 character model, and sometimes orphaned surrogates turn up instead of Unicode characters as “code points” in interfaces that are not meant to leak these implementation limitations to the outside world.
> 
> UTF-8 of course doesn’t support encoding surrogates (UTF-8 is careful to allow a single representation only for each Unicode character, and surrogate pairs would violate that, while isolated surrogates don’t mean anything in Unicode), so IETF protocols typically do not have to consider these problems of specific platforms.
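
The encoding side can be sketched in the same way. Python strings happen to be able to hold an orphaned surrogate, much like the leaky interfaces described above, but the strict UTF-8 encoder refuses to serialize it, whereas a supplementary character gets exactly one four-byte form. The helper name `utf8_or_none` is an assumption for this example only.

```python
from typing import Optional


def utf8_or_none(text: str) -> Optional[bytes]:
    """Encode text as UTF-8, or return None if it contains a surrogate."""
    try:
        return text.encode("utf-8")
    except UnicodeEncodeError:
        return None


# A character outside the 16-bit range: a surrogate pair in UTF-16,
# but a single four-byte sequence in UTF-8.
GRINNING_FACE = "\U0001F600"
# Half of that pair on its own: not a Unicode character at all.
ORPHANED_HIGH = "\ud83d"

if __name__ == "__main__":
    print(utf8_or_none(GRINNING_FACE))
    print(utf8_or_none(ORPHANED_HIGH))
```

A protocol that receives text through such an interface therefore has to decide what to do with the orphan (reject, replace) before it can emit UTF-8 at all.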
> 
> ## The current discussion
> 
> The IETF-wide consensus to use Unicode and UTF-8 as designed has held for almost a quarter of a century.  Now, for some reason, there is some mood to open this up without need.
> 
> I am not going to repeat the content of RFC 9413 [5], which discusses the harm from protocols being “flexible”.  But it is good that this has been written up, because it shows that effort is often required to avoid protocols turning into what I call “soup”.
> 
> [5]: https://www.rfc-editor.org/rfc/rfc9413.html
> 
>> So, this tries to say “here’s how an RFC should specify which Unicode characters it supports”.
> 
> Replacing Unicode with “Unicode plus some leakage from legacy UCS-2 platforms” MUST NOT be a “choice” that is open to a protocol designer.  True, in some cases there may be no alternative to integrating a widely used protocol that gets this wrong in some way, but promulgating this as a choice that every protocol designer can make on a whim is deeply wrong.
> 
> I would like to help make sure that we don’t make mistakes that would create the appearance that IETF protocols are now free to fall back to enabling the use of surrogates in place of characters (except where they are meant to be used: in pairs in UTF-16, which, however, we normally do not use).
> 
> Grüße, Carsten
> 
> 
> PS.:
> https://unicode.org/glossary/
> points to
> https://www.unicode.org/versions/Unicode15.0.0/ch03.pdf
> for the definition of an (abstract) character.
> Page forward to page 88, Definition 7 (D7), and do read it.
> Unfortunately, the whole document really is required reading for discussing the fine points people will bring up.
> Terms such as “Unicode scalar value”, “noncharacter”, etc. come up, and it is important to understand the meaning of these terms in Unicode-based protocols.
> 
> 