Re: [EAI] UTF32

John C Klensin <klensin@jck.com> Tue, 21 April 2015 14:38 UTC

Date: Tue, 21 Apr 2015 10:38:13 -0400
From: John C Klensin <klensin@jck.com>
To: Oleksandr Tsaruk <tsaruk@i.ua>, ima@ietf.org
Message-ID: <ED0FFB5B08EDBB19172476F4@JcK-HP8200.jck.com>
In-Reply-To: <E1YkXtF-0002DH-0s@st06.mi6.kiev.ua>
References: <3D9223A5-135E-4F43-B814-EB7BE51D207C@linkedin.com> <01PKTYIGGNDC0000AQ@mauve.mrochek.com> <E1YkXtF-0002DH-0s@st06.mi6.kiev.ua>
MIME-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit
Content-Disposition: inline
Archived-At: <http://mailarchive.ietf.org/arch/msg/ima/u_feDyxxhvECKwxe7FhrcY9-1Ig>
Cc: cyrillicgp@icann.org
Subject: Re: [EAI] UTF32
Precedence: list

--On Tuesday, April 21, 2015 16:07 +0300 Oleksandr Tsaruk
<tsaruk@i.ua> wrote:

> Is it possible to reconsider (in a very long run) EAI WG
> general approach to:
> 
> "This working group's previous experimental efforts
> investigated the use of UTF-32 as a general approach to email
> internationalization." In such a case, could the email/domain
> internationalization problem be solved in original
> scripts?

In principle, yes.  In practice, given the increasing (and
increasingly universal) use of UTF-8 on the wire, probably not.

The more important question is what you think going to UTF-32
would accomplish.  It is fully isomorphic with UTF-8 -- there is
no information that can be represented one way and not the
other.  It is much less compact than UTF-8 for "western"
alphabetic scripts (including Cyrillic), less compact for any
BMP code point, and never more compact (UTF-8 never needs more
than four octets per code point).  UTF-32 does not get involved
with the "surrogate" mess,
but neither does UTF-8.  Neither helps at all with the various
normalization or comparison problems.   The only advantage I can
think of at the moment is that UTF-32 permits getting a count of
the number of code points present by counting octets and
dividing by four while UTF-8 (and UTF-16) require some
calculations.  However, one rarely cares about number of code
points as compared to, e.g., number of "print positions" or
"characters" and, given combining sequences and non-spacing
characters and marks, getting from a code point count to print
position information cannot be done without considerable
knowledge of the code points involved (and, for some scripts,
rendering procedures).
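
To make the size and counting points concrete, here is a minimal
Python sketch.  The function names and sample strings are mine,
chosen only for illustration, and the "print positions" estimate is
deliberately crude: it skips code points with a non-zero canonical
combining class and knows nothing about grapheme clusters or
rendering.

```python
import unicodedata


def encoded_sizes(text):
    """Return (code points, UTF-8 octets, UTF-32 octets) for text."""
    utf8 = text.encode("utf-8")
    # Explicit byte order, so Python does not prepend a byte order mark.
    utf32 = text.encode("utf-32-be")
    return len(text), len(utf8), len(utf32)


def count_code_points_utf32(data):
    """UTF-32: every code point is exactly four octets."""
    return len(data) // 4


def count_code_points_utf8(data):
    """UTF-8: count every octet that is not a continuation octet (10xxxxxx)."""
    return sum(1 for octet in data if (octet & 0xC0) != 0x80)


def approx_print_positions(text):
    """Rough estimate only: ignore code points that are combining marks."""
    return sum(1 for ch in text if unicodedata.combining(ch) == 0)


# Illustrative samples, written as escapes so the source stays ASCII.
samples = {
    "ascii": "mail",
    "cyrillic": "\u043f\u043e\u0448\u0442\u0430",
    "cjk (BMP)": "\u96fb\u5b50",
    "supplementary": "\U0001F600",
    "combining": "e\u0301",
}

for name, text in samples.items():
    points, utf8_len, utf32_len = encoded_sizes(text)
    print(name, points, utf8_len, utf32_len, approx_print_positions(text))
```

For the Cyrillic sample the UTF-8 form is half the size of the
UTF-32 form, and for the last sample the code point count is two
even though a reader would see a single character.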

So, can you explain what you think a move to UTF-32, even if it
were possible, would accomplish?

    john
