Re: [I18ndir] [art] Fwd: New Version Notification for draft-bray-unichars-06.txt

"Martin J. Dürst" <duerst@it.aoyama.ac.jp> Mon, 09 October 2023 09:49 UTC

To: Tim Bray <tbray@textuality.com>, "Manger, James" <James.H.Manger@team.telstra.com>
Cc: "i18ndir@ietf.org" <i18ndir@ietf.org>, ART Area <art@ietf.org>
From: "Martin J. Dürst" <duerst@it.aoyama.ac.jp>
Organization: Aoyama Gakuin University
Archived-At: <https://mailarchive.ietf.org/arch/msg/i18ndir/vv1yt_c-nXkjJijLoc-TvevQRGc>

Hello Tim, others,

On 2023-10-02 02:31, Tim Bray wrote:
> On Sep 30, 2023 at 6:53:28 PM, "Manger, James" <
> James.H.Manger@team.telstra.com> wrote:

>> Explaining the 1,081,344 size and the U+D800-U+DFFF gap would be
>> interesting.
>>
> 
> Yes! That history is mystery to me.

Well, at the start, Unicode was pure 16-bit only, and ISO 10646 was 
officially all 32 bits. They had different encoding principles and 
character allocations initially (that was Unicode 1.0 and some ISO 10646 
draft), but users told both sides that having two different universal 
character standards wasn't really what they wanted (one of the few 
examples where https://xkcd.com/927/ didn't hold :-).

So the encoding principles (e.g. precomposed (ISO) vs. decomposed
(Unicode)), character repertoires, and code point assignments got
aligned (*), but the 16-bit/32-bit difference stayed.

(*) Well, for precomposed vs. decomposed, it was a compromise (let's 
tolerate both), with NFD to define the correspondence and later NFC for 
a more practical "default representation".

The fact that Unicode tried to keep things within 16 bits was actually 
in many ways beneficial; it forced careful allocation of characters 
where an open 32-bit space could have led to much more wasteful
allocations. (Korean Hangul being the most notable exception; with 
something as human as human writing and its encoding, there are always 
exceptions.)

At that point, implementations started to show up, e.g. Windows NT and 
Java and the like, and they used Unicode, i.e. "16-bit characters".

A bit later, it became clear that a 16-bit space was not enough. But it 
also became clear that a 32-bit space was way too much. So people were 
looking for ways to carry around a space somewhat wider than 16 bits, 
but still encodable in 16-bit code units. The end of the 16-bit space
had already been taken (among other things by the special meanings for
U+FFFF and U+FFFE), but some contiguous spaces were still available.

I assume that after playing around with various ideas (but I have never 
heard about these), people came up with the current solution: Reserve 
two contiguous blocks of 16-bit code units (now called the high and low 
surrogates) of 1024 code units each to encode a total of 2^20 additional 
code points and call the result UTF-16.
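
To make the arithmetic concrete, here is a minimal Python sketch of the
mapping between a supplementary code point and its surrogate pair. The
function names are made up for illustration; the block boundaries
(U+D800..U+DBFF for high surrogates, U+DC00..U+DFFF for low surrogates)
are the ones UTF-16 uses.

```python
HIGH_START = 0xD800  # first high (lead) surrogate, block ends at 0xDBFF
LOW_START = 0xDC00   # first low (trail) surrogate, block ends at 0xDFFF


def to_surrogates(cp):
    """Split a code point above U+FFFF into a (high, low) surrogate pair."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("only U+10000..U+10FFFF need surrogates")
    offset = cp - 0x10000          # 20-bit value, 0 .. 2**20 - 1
    high = HIGH_START + (offset >> 10)    # top 10 bits pick the high surrogate
    low = LOW_START + (offset & 0x3FF)    # bottom 10 bits pick the low surrogate
    return high, low


def from_surrogates(high, low):
    """Recombine a (high, low) surrogate pair into a code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - HIGH_START) << 10) + (low - LOW_START)


# 1024 high surrogates times 1024 low surrogates give 2**20 combinations.
assert 1024 * 1024 == 2**20

high, low = to_surrogates(0x1F600)
print(hex(high), hex(low))
```

Each block contributes 10 bits, which is why two blocks of 1024 code
units are enough for 2^20 additional code points.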



> Also the nonchars in the Arabic-extended region.

Do you mean U+FDD0..U+FDEF in the Arabic Presentation Forms-A block? If
not, please say which ones you mean.

As the description in the code chart says, these are for 
process-internal use. Let's say you want to encode some internal 
formatting information in an application, or want to implement a clever 
text searching algorithm that only works with some special code points 
that are not used for actual characters. Then you can use these, if you 
make sure they never leak.
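
As a rough sketch of what "never leak" could look like in practice, the
Python below uses one noncharacter as a process-internal marker and
strips all noncharacters before text leaves the application. The names
(INTERNAL_MARK, is_noncharacter, strip_noncharacters) are hypothetical;
the set of noncharacters is U+FDD0..U+FDEF plus the last two code points
of each of the 17 planes.

```python
def is_noncharacter(cp):
    """True for the 66 Unicode noncharacters."""
    # U+FDD0..U+FDEF, or a code point whose low 16 bits are FFFE or FFFF.
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


# Hypothetical internal marker; any noncharacter would do.
INTERNAL_MARK = "\uFDD0"


def strip_noncharacters(text):
    """Drop noncharacters so internal markers never leave the process."""
    return "".join(ch for ch in text if not is_noncharacter(ord(ch)))


internal = "foo" + INTERNAL_MARK + "bar"   # marker used only in memory
external = strip_noncharacters(internal)   # what gets written or sent
print(external)
```

Whether to strip, replace, or reject such code points at the boundary is
a design choice for the application; this sketch simply strips them.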

I'm not sure exactly why these were introduced, but my guess is that
they were added together with the Object Replacement Character (U+FFFC),
which was an example of such an "application-internal" use. As far as I
remember, that use was discovered when Microsoft objected to encoding
something at that position. (To be fair, Microsoft was not the only
implementer who used the trick of hanging in-line images and the like
off a special character.)

For more details, please see the Unicode/L2 documents linked from 
https://en.wikipedia.org/wiki/Arabic_Presentation_Forms-A.

Regards,    Martin.