Re: [EAI] UTF32

John C Klensin <klensin@jck.com> Tue, 21 April 2015 14:38 UTC

Date: Tue, 21 Apr 2015 10:38:13 -0400
From: John C Klensin <klensin@jck.com>
To: Oleksandr Tsaruk <tsaruk@i.ua>, ima@ietf.org
Message-ID: <ED0FFB5B08EDBB19172476F4@JcK-HP8200.jck.com>
In-Reply-To: <E1YkXtF-0002DH-0s@st06.mi6.kiev.ua>
References: <3D9223A5-135E-4F43-B814-EB7BE51D207C@linkedin.com> <01PKTYIGGNDC0000AQ@mauve.mrochek.com> <E1YkXtF-0002DH-0s@st06.mi6.kiev.ua>
Archived-At: <http://mailarchive.ietf.org/arch/msg/ima/u_feDyxxhvECKwxe7FhrcY9-1Ig>
Cc: cyrillicgp@icann.org
Subject: Re: [EAI] UTF32


--On Tuesday, April 21, 2015 16:07 +0300 Oleksandr Tsaruk
<tsaruk@i.ua> wrote:

> Is it possible to reconsider (in a very long run) EAI WG
> general approach to:
> 
> "This working group's previous experimental efforts
> investigated the use of UTF-32 as a general approach to email
> internationalization." In such case email/domaine
> internationalization problem could be solved in original
> scripts? 

In principle, yes.  In practice, given the increasing (and
increasingly universal) use of UTF-8 on the wire, probably not.

The more important question is what you think going to UTF-32
would accomplish.  It is fully isomorphic with UTF-8 -- there is
no information that can be represented one way and not the
other.  UTF-32 is much less compact than UTF-8 for "western"
alphabetic scripts (including Cyrillic), less compact for any
BMP code point, and UTF-8 is never worse (in terms of more bytes
per code point).  UTF-32 does not get involved with the
"surrogate" mess,
but neither does UTF-8.  Neither helps at all with the various
normalization or comparison problems.   The only advantage I can
think of at the moment is that UTF-32 permits getting a count of
the number of code points present by counting octets and
dividing by four while UTF-8 (and UTF-16) require some
calculations.  However, one rarely cares about number of code
points as compared to, e.g., number of "print positions" or
"characters" and, given combining sequences and non-spacing
characters and marks, getting from a code point count to print
position information cannot be done without considerable
knowledge of the code points involved (and, for some scripts,
rendering procedures).
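
To make the size and counting comparison above concrete, here is a
minimal Python sketch.  It is an illustration only, not part of the
original message: the helper names (sizes, count_utf8, count_utf32)
and the sample strings are assumptions chosen for the example, and
the little-endian codecs are used so that no byte order mark is
added to the octet counts.

```python
import unicodedata


def sizes(text):
    """Return (code points, UTF-8 octets, UTF-16 octets, UTF-32 octets)."""
    return (
        len(text),
        len(text.encode("utf-8")),
        len(text.encode("utf-16-le")),
        len(text.encode("utf-32-le")),
    )


def count_utf32(data):
    """Code point count for UTF-32: octets divided by four."""
    return len(data) // 4


def count_utf8(data):
    """Code point count for UTF-8: skip continuation octets (10xxxxxx)."""
    return sum(1 for octet in data if (octet & 0xC0) != 0x80)


# Hypothetical sample strings: ASCII, Cyrillic, CJK (BMP), and one
# supplementary-plane code point.
samples = ["mail", "почта", "郵便", "\U0001F600"]

for text in samples:
    points, u8, u16, u32 = sizes(text)
    print(f"{text!r}: {points} code points, "
          f"UTF-8 {u8}, UTF-16 {u16}, UTF-32 {u32} octets")
    # Both forms carry the same information and give the same count.
    assert text.encode("utf-8").decode("utf-8") == text
    assert text.encode("utf-32-le").decode("utf-32-le") == text
    assert count_utf8(text.encode("utf-8")) == points
    assert count_utf32(text.encode("utf-32-le")) == points

# A code point count is not a "character" count: "e" followed by
# U+0301 COMBINING ACUTE ACCENT is two code points in either encoding.
combined = "e\u0301"
non_combining = sum(1 for ch in combined if unicodedata.combining(ch) == 0)
print(len(combined), "code points,", non_combining, "non-combining")
```

Counting non-combining code points is only a rough stand-in for
print positions.  It ignores non-spacing characters that are not
combining marks and anything that depends on rendering, which is the
point made above: neither encoding supplies that knowledge.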

So, can you explain what you think a move to UTF-32, even if it
were possible, would accomplish?

    john