Re: [EAI] UTF32

"Martin J. Dürst" <duerst@it.aoyama.ac.jp> Fri, 24 April 2015 10:15 UTC

Message-ID: <553A17B6.5090803@it.aoyama.ac.jp>
Date: Fri, 24 Apr 2015 19:15:18 +0900
From: "\"Martin J. Dürst\"" <duerst@it.aoyama.ac.jp>
Organization: Aoyama Gakuin University
User-Agent: Mozilla/5.0 (Windows NT 6.3; WOW64; rv:31.0) Gecko/20100101 Thunderbird/31.6.0
MIME-Version: 1.0
To: John C Klensin <klensin@jck.com>, Oleksandr Tsaruk <tsaruk@i.ua>
References: <3D9223A5-135E-4F43-B814-EB7BE51D207C@linkedin.com> <01PKTYIGGNDC0000AQ@mauve.mrochek.com> <E1YkXtF-0002DH-0s@st06.mi6.kiev.ua> <ED0FFB5B08EDBB19172476F4@JcK-HP8200.jck.com> <E1YkuAt-0001Yk-0v@st05.mi6.kiev.ua> <B522DEBAE28592BD6029B7D2@JcK-HP8200.jck.com>
In-Reply-To: <B522DEBAE28592BD6029B7D2@JcK-HP8200.jck.com>
Content-Type: text/plain; charset="utf-8"; format="flowed"
Content-Transfer-Encoding: 7bit
Archived-At: <http://mailarchive.ietf.org/arch/msg/ima/rjBBFfJge18uQAi1pNN-0F0SOk4>
Cc: ima@ietf.org
Subject: Re: [EAI] UTF32

On 2015/04/22 22:21, John C Klensin wrote:

> Now you may be thinking about something else.  The Unicode code
> space ranges only from 0 to 0x10FFFF.   A 32bit code space would
> be 0 to 0xFFFFFFFF.  If one believed that the Unicode code space
> were too small and that a full 32 bit space were needed, that
> would be a different matter entirely.  The Unicode folks are
> convinced that more than the current space will never be needed
> to represent all symbols of interest but, at one stage, they
> believed that a 16bit code space would be enough.  I haven't
> studied the mechanisms in some years, but, if a larger code
> space were needed, extensions would be needed to both the UTF-8
> and UTF-16 encoding models to accommodate it while anything that
> used a 32 bit space directly would presumably be fairly
> transparent, at least until one got into various Unicode tables
> and algorithms that assume the smaller code space.   But, again,
> that has little to do with the difference between UTF-16 and
> UTF-32 except as a side effect.

More specifically, at one point, Unicode thought that a 16-bit space
might be enough (for the time being, at least), while on the other hand
ISO/IEC JTC1/SC2/WG2, the ones responsible for ISO 10646, thought that
an architecture with a full 31 bits would be better (the 32nd bit was
always reserved because nobody wanted to repeat the 8-bit "signed char"
vs. "unsigned char" mess). UCS-4 and UTF-8 were both designed to
encompass this 31-bit code space. According to its original design,
UTF-8 needed up to 6 bytes for a character, in a very straightforward
way. (There is still code out there with traces from this period.)
UCS-2 was of course limited to 16 bits.
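
As a rough illustration of how straightforward the original six-byte
scheme was, here is a minimal Python sketch of an encoder for the full
31-bit space. The function name encode_utf8_original is made up for
this example, and the sketch deliberately skips the checks a real
encoder would need (it does not reject surrogates, for instance).

```python
def encode_utf8_original(cp: int) -> bytes:
    """Encode a code point using the original UTF-8 design (up to 6 bytes).

    Illustrative only: covers the 31-bit UCS-4 space and performs no
    surrogate or noncharacter checks.
    """
    if not 0 <= cp <= 0x7FFFFFFF:
        raise ValueError("outside the 31-bit code space")
    if cp < 0x80:
        return bytes([cp])  # ASCII is encoded as a single byte
    # (exclusive upper bound, total length in bytes, lead-byte marker)
    for limit, length, marker in (
        (0x800, 2, 0xC0),
        (0x10000, 3, 0xE0),
        (0x200000, 4, 0xF0),
        (0x4000000, 5, 0xF8),
        (0x80000000, 6, 0xFC),
    ):
        if cp < limit:
            break
    # Each continuation byte carries 6 payload bits, lowest bits last.
    tail = []
    for _ in range(length - 1):
        tail.append(0x80 | (cp & 0x3F))
        cp >>= 6
    # Whatever bits remain go into the lead byte.
    return bytes([marker | cp] + tail[::-1])


if __name__ == "__main__":
    for cp in (0x41, 0x20AC, 0x10FFFF, 0x200000, 0x7FFFFFFF):
        print(hex(cp), encode_utf8_original(cp).hex(" "))
```

For code points up to 0x10FFFF (surrogates aside) the result agrees
with today's UTF-8; the 5- and 6-byte forms only appear above 0x1FFFFF.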

As time passed, it became clear to both sides that 16 bits wasn't
enough but 31 bits was overkill. The introduction of UTF-16 created an 
upper limit of 0x10FFFF in one of the Unicode encoding forms. Therefore 
both Unicode and SC2/WG2 agreed on this overall limit. UTF-32 was 
introduced as a version of UCS-4 with an explicit upper codepoint limit 
of 0x10FFFF, and the definition of UTF-8 was changed to only go up to 
four bytes.
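
The 0x10FFFF limit falls directly out of the UTF-16 surrogate
arithmetic, which the following small Python sketch tries to show. The
helper name decode_surrogate_pair is invented for this illustration;
the last lines rely on Python's own chr() and str.encode(), which
enforce the same limit.

```python
def decode_surrogate_pair(high: int, low: int) -> int:
    """Combine a UTF-16 high/low surrogate pair into one code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    # 10 payload bits from each surrogate, offset past the 16-bit range.
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


# The largest possible pair gives the overall upper limit.
MAX_CODE_POINT = decode_surrogate_pair(0xDBFF, 0xDFFF)

if __name__ == "__main__":
    print(hex(MAX_CODE_POINT))
    # With that limit, UTF-8 never needs more than four bytes.
    print(len(chr(MAX_CODE_POINT).encode("utf-8")))
    try:
        chr(MAX_CODE_POINT + 1)
    except ValueError as exc:
        print("rejected:", exc)
```

Two times 10 bits on top of the 16-bit range yields 16 supplementary
planes of 65536 code points each, hence the odd-looking limit.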

In the case of an extraterrestrial invasion by a culture with millions 
of characters or a sudden excessive emoji binge, the limits for UTF-32
and UTF-8 could be changed again, and some further kludge could be 
introduced in UTF-16. But the chance that we'll get there is very low, 
at least for the moment.

Regards,   Martin.