Re: [EAI] UTF32

"Martin J. Dürst" <duerst@it.aoyama.ac.jp> Fri, 24 April 2015 10:15 UTC

Message-ID: <553A17B6.5090803@it.aoyama.ac.jp>
Date: Fri, 24 Apr 2015 19:15:18 +0900
From: "Martin J. Dürst" <duerst@it.aoyama.ac.jp>
Organization: Aoyama Gakuin University
To: John C Klensin <klensin@jck.com>, Oleksandr Tsaruk <tsaruk@i.ua>
Archived-At: <http://mailarchive.ietf.org/arch/msg/ima/rjBBFfJge18uQAi1pNN-0F0SOk4>
Cc: ima@ietf.org

On 2015/04/22 22:21, John C Klensin wrote:

> Now you may be thinking about something else.  The Unicode code
> space ranges only from 0 to 0x10FFFF.   A 32bit code space would
> be 0 to 0xFFFFFFFF.  If one believed that the Unicode code space
> were too small and that a full 32 bit space were needed, that
> would be a different matter entirely.  The Unicode folks are
> convinced that more than the current space will never be needed
> to represent all symbols of interest but, at one stage, they
> believed that a 16bit code space would be enough.  I haven't
> studied the mechanisms in some years, but, if a larger code
> space were needed, extensions would be needed to both the UTF-8
> and UTF-16 encoding models to accommodate it while anything that
> used a 32 bit space directly would presumably be fairly
> transparent, at least until one got into various Unicode tables
> and algorithms that assume the smaller code space.   But, again,
> that has little to do with the difference between UTF-16 and
> UTF-32 except as a side effect.

More specifically, at one point Unicode thought that a 16-bit space
might be enough (for the time being, at least), while
ISO/IEC JTC1/SC2/WG2, the group responsible for ISO 10646, thought that
an architecture with a full 31 bits would be better (the 32nd bit was
always reserved because nobody wanted to repeat the 8-bit "signed char"
vs. "unsigned char" mess). UCS-4 and UTF-8 were both designed to
encompass this 31-bit code space. According to its original design,
UTF-8 needed up to 6 bytes for a character, in a very straightforward
way (there is still code out there with traces from this period).
UCS-2 was of course limited to 16 bits.
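
To illustrate how straightforward the original 6-byte scheme was, here is a
minimal sketch in Python. The function name encode_utf8_original is made up
for this example, and the byte layout is my reading of the original 31-bit
design, not code taken from any particular implementation. It does not
reject surrogates or values above 0x10FFFF, which current UTF-8 forbids.

```python
def encode_utf8_original(cp: int) -> bytes:
    """Encode a code point with the original UTF-8 layout (1 to 6 bytes, 31 bits).

    Hypothetical helper for illustration only.
    """
    if not 0 <= cp <= 0x7FFFFFFF:
        raise ValueError("code point outside the 31-bit range")
    if cp < 0x80:
        return bytes([cp])  # ASCII is encoded as a single byte
    # (exclusive upper limit, total number of bytes, lead-byte marker)
    for limit, length, lead in (
        (0x800, 2, 0xC0),
        (0x10000, 3, 0xE0),
        (0x200000, 4, 0xF0),
        (0x4000000, 5, 0xF8),
        (0x80000000, 6, 0xFC),
    ):
        if cp < limit:
            break
    out = bytearray(length)
    # Fill trailing bytes from the right, 6 payload bits each (10xxxxxx).
    for i in range(length - 1, 0, -1):
        out[i] = 0x80 | (cp & 0x3F)
        cp >>= 6
    # The remaining high bits go into the lead byte.
    out[0] = lead | cp
    return bytes(out)
```

For code points up to 0x10FFFF (surrogates aside) this produces the same
bytes as today's UTF-8; only the 5-byte and 6-byte forms, and 4-byte forms
above 0x10FFFF, are no longer valid.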

As time passed, it became clearer on both sides that 16 bits wasn't
enough but 31 bits was overkill. The introduction of UTF-16 created an
upper limit of 0x10FFFF in one of the Unicode encoding forms. Therefore
both Unicode and SC2/WG2 agreed on this overall limit. UTF-32 was
introduced as a version of UCS-4 with an explicit upper codepoint limit
of 0x10FFFF, and the definition of UTF-8 was changed to only go up to
four bytes.
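
The 0x10FFFF limit falls directly out of the surrogate-pair arithmetic in
UTF-16. A small Python sketch (the function names are hypothetical, chosen
for this example only) shows that the largest possible pair maps to exactly
that value:

```python
def decode_surrogate_pair(high: int, low: int) -> int:
    """Combine a UTF-16 high/low surrogate pair into a code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    # Each surrogate carries 10 payload bits; the pair is offset by 0x10000.
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


def encode_surrogate_pair(cp: int) -> tuple[int, int]:
    """Split a supplementary code point into a high/low surrogate pair."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("code point does not need a surrogate pair")
    offset = cp - 0x10000  # 20 bits at most
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)


# The largest pair, DBFF DFFF, is the top of the code space.
highest = decode_surrogate_pair(0xDBFF, 0xDFFF)
```

Two surrogates give 20 payload bits, so the pairs cover 0x10000 through
0x10000 + 0xFFFFF = 0x10FFFF, and nothing beyond that can be expressed
without a further extension.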

In the case of an extraterrestrial invasion by a culture with millions
of characters or a sudden excessive emoji binge, the limits for UTF-32
and UTF-8 could be changed again, and some further kludge could be
introduced in UTF-16. But the chance that we'll get there is very low,
at least for the moment.

Regards,   Martin.
