Re: [EAI] UTF32

John C Klensin <klensin@jck.com> Wed, 22 April 2015 13:21 UTC

Date: Wed, 22 Apr 2015 09:21:49 -0400
From: John C Klensin <klensin@jck.com>
To: Oleksandr Tsaruk <tsaruk@i.ua>
Message-ID: <B522DEBAE28592BD6029B7D2@JcK-HP8200.jck.com>
In-Reply-To: <E1YkuAt-0001Yk-0v@st05.mi6.kiev.ua>
References: <3D9223A5-135E-4F43-B814-EB7BE51D207C@linkedin.com> <01PKTYIGGNDC0000AQ@mauve.mrochek.com> <E1YkXtF-0002DH-0s@st06.mi6.kiev.ua> <ED0FFB5B08EDBB19172476F4@JcK-HP8200.jck.com> <E1YkuAt-0001Yk-0v@st05.mi6.kiev.ua>
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: quoted-printable
Content-Disposition: inline
Archived-At: <http://mailarchive.ietf.org/arch/msg/ima/oJBwKdmWTqnap3JCfmmSWdlLFMw>
Cc: ima@ietf.org
Subject: Re: [EAI] UTF32
Precedence: list

--On Wednesday, April 22, 2015 15:54 +0300 Oleksandr Tsaruk
<tsaruk@i.ua> wrote:

> Dear John
> 
> My suggestion of UTF-32 was about the capacity of the table,
> which should handle all known and future symbols and
> hieroglyphs; for today's needs, UTF-16, with about 1 million
> cells, could hold all the basic symbols of known languages on Earth.

Oleksandr,

As I suspected, there is probably a misunderstanding somewhere.
UTF-32 and UTF-16 are just different encodings for Unicode and
can represent exactly the same information.   UTF-16 represents
code points above U+FFFF by the use of so-called "surrogates".
I, personally, dislike surrogates and consider them a kludge,
but their availability as part of the UTF-16 system means that
UTF-16 can represent anything that UTF-32 can... and so can
UTF-8.
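
A minimal Python 3 sketch of that equivalence. The sample string
and the helper name encode_all are assumptions chosen for
illustration only; the codec names are the ones Python's standard
library provides. It encodes the same code points three ways,
checks that each decodes back to the identical text, and shows the
surrogate pair that UTF-16 uses for a code point above U+FFFF.

```python
# Illustrative only: the sample string and helper name are assumptions.
# Latin A, Cyrillic Zhe, a CJK ideograph (all in the BMP), and U+1F600 (above U+FFFF).
SAMPLE = "A\u0416\u4e2d\U0001F600"

def encode_all(text):
    """Return the text encoded as UTF-8, UTF-16, and UTF-32 (big-endian, no BOM)."""
    return {name: text.encode(name) for name in ("utf-8", "utf-16-be", "utf-32-be")}

encoded = encode_all(SAMPLE)

# Every encoding decodes back to the identical sequence of code points.
round_trips = {name: data.decode(name) == SAMPLE for name, data in encoded.items()}

# In UTF-16, U+1F600 becomes a surrogate pair: two 16-bit units.
utf16_units = "\U0001F600".encode("utf-16-be").hex()

for name, data in encoded.items():
    print(f"{name:10} {len(data):2d} octets, round trip ok: {round_trips[name]}")
print("U+1F600 as UTF-16 code units:", utf16_units)
```

Only the octet counts differ between the three forms; the decoded
text is the same in every case.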

Now you may be thinking about something else.  The Unicode code
space ranges only from 0 to 0x10FFFF.  A 32-bit code space would
be 0 to 0xFFFFFFFF.  If one believed that the Unicode code space
were too small and that a full 32-bit space were needed, that
would be a different matter entirely.  The Unicode folks are
convinced that more than the current space will never be needed
to represent all symbols of interest but, at one stage, they
believed that a 16-bit code space would be enough.  I haven't
studied the mechanisms in some years, but, if a larger code
space were needed, extensions would be needed to both the UTF-8
and UTF-16 encoding models to accommodate it, while anything that
used a 32-bit space directly would presumably be fairly
transparent, at least until one got into various Unicode tables
and algorithms that assume the smaller code space.   But, again,
that has little to do with the difference between UTF-16 and
UTF-32 except as a side effect.
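
The 0x10FFFF ceiling mentioned above is what the UTF-16 surrogate
mechanism can reach. A small Python 3 sketch of that arithmetic
follows. The constant and function names are assumptions made up
for this illustration; the surrogate ranges themselves are the
standard ones (D800-DBFF leading, DC00-DFFF trailing).

```python
# Illustrative arithmetic; the names below are assumptions, not standard identifiers.
HIGH_FIRST, HIGH_LAST = 0xD800, 0xDBFF   # leading (high) surrogates
LOW_FIRST, LOW_LAST = 0xDC00, 0xDFFF     # trailing (low) surrogates

def pair_to_code_point(high, low):
    """Combine one UTF-16 surrogate pair into the code point it represents."""
    return 0x10000 + ((high - HIGH_FIRST) << 10) + (low - LOW_FIRST)

# The largest pair gives the top of the Unicode code space.
largest = pair_to_code_point(HIGH_LAST, LOW_LAST)

def beyond_code_space_rejected():
    """Python itself refuses code points past the end of the code space."""
    try:
        chr(0x110000)
    except ValueError:
        return True
    return False

print(hex(largest))
print("chr(0x110000) rejected:", beyond_code_space_rejected())
print("fraction of a 32-bit space in use:", (largest + 1) / 2**32)
```

A 32-bit container has room for far larger values, but, as the
last check suggests, software built around the current code space
would not accept them without changes.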

 best,
    john

> 21.04.2015 17:38, John C Klensin <klensin@jck.com>
>> --On Tuesday, April 21, 2015 16:07 +0300 Oleksandr Tsaruk
>> <tsaruk@i.ua> wrote:
>> 
>> > Is it possible to reconsider (in a very long run) EAI WG
>> > general approach to:
>> > 
>> > "This working group's previous experimental efforts
>> > investigated the use of UTF-32 as a general approach to
>> > email internationalization." In that case, could the
>> > email/domain internationalization problem be solved in
>> > the original scripts?
>> 
>> In principle, yes. In practice, given the increasing (and
>> increasingly universal) use of UTF-8 on the wire, probably
>> not.
>> 
>> The more important question is what you think going to UTF-32
>> would accomplish. It is fully isomorphic with UTF-8 -- there
>> is no information that can be represented one way and not the
>> other. UTF-32 is much less compact than UTF-8 for "western"
>> alphabetic scripts (including Cyrillic), less compact for any
>> BMP code point, and never more compact (UTF-8 never needs more
>> bytes per code point than UTF-32). UTF-32 does not get
>> involved with the "surrogate"
>> mess, but neither does UTF-8. Neither helps at all with the
>> various normalization or comparison problems. The only
>> advantage I can think of at the moment is that UTF-32 permits
>> getting a count of the number of code points present by
>> counting octets and dividing by four while UTF-8 (and UTF-16)
>> require some calculations. However, one rarely cares about
>> number of code points as compared to, e.g., number of "print
>> positions" or "characters" and, given combining sequences and
>> non-spacing characters and marks, getting from a code point
>> count to print position information cannot be done without
>> considerable knowledge of the code points involved (and, for
>> some scripts, rendering procedures).
>> 
>> So, can you explain what you think a move to UTF-32, even if
>> it were possible, would accomplish?
>> 
>> john
> 
> 
> Sincerely yours, 
> Oleksandr Tsaruk, Ph.D.
> 
> 
