Re: [EAI] Unicode vs. UTF-8 / Encoding vs. Representation

John C Klensin <klensin@jck.com> Tue, 28 December 2010 18:23 UTC

Date: Tue, 28 Dec 2010 13:25:35 -0500
From: John C Klensin <klensin@jck.com>
To: Yangwoo Ko <newcat@icu.ac.kr>
Cc: Andrew Sullivan <ajs@shinkuro.com>, dcrocker@bbiw.net, ima@ietf.org

--On Tuesday, December 28, 2010 3:23 PM +0900 Yangwoo Ko
<newcat@icu.ac.kr> wrote:

> On Tue, Dec 28, 2010 at 1:17 PM, John C Klensin
> <klensin@jck.com> wrote:
>>>  If you have
>>> alternative labeling, that would be great to consider.
>> 
>> I hope someone in the WG will have a good suggestion.   If
>> not, we may be back to "ASCII" versus "non-ASCII" as the
>> disjoint pair, "Unicode"  to describe the international CCS
>> (inclusive of ASCII), and "UTF-8" to describe the particular
>> encoding that is used for this protocol and recommended by
>> RFC 2277 (a recommendation that is reinforced by
>> draft-iab-idn-encoding and other more recent work).
> 
> I hope so too. But I don't expect that we are going to have a
> nice answer to this soon enough. If it still significantly
> matters, will it be useful to list up the cases of possible
> confusions when current terminologies are used? With the list,
> we may be able to add some notes to one of our documents
> (possibly framework doc)  to address the known possible
> confusions.

If we try to push this definitional issue very far, it is far
more complicated than has been suggested so far.  For the work
of the EAI WG, I suggest that there are two practical
possibilities:

	(1) The current terminology is not precisely correct
	given other, Standard, usage but, with small
	adjustments, can be used in the documents with some
	explicit local definitions.
	
	(2) In the tradition of John Tukey and others, we invent
	entirely new language that is precisely-defined locally.
	As Andrew points out, that is precisely what IDNA2008
	ended up having to do after struggling with a variation
	on this problem in the DNS space.

Despite Andrew's misgivings, the definitional problem here is
not nearly as hard as that for IDNA.  I think either of the
above would be possible without significantly delaying the WG.
However, the first would be a lot safer in terms of the risk of
introducing new errors as terminology is redefined and then
applied (a problem we certainly got into with IDNA).

Dave suggested in a later note...

>...
> The change in our reference is to stop using ASCII to refer to
> an encoding.  For Internet ASCII, the formal definition
> usually is "7-bit ASCII in an 8-bit field, with the high bit
> off", but "7-bit" should suffice...

Unfortunately, this doesn't work either if one aspires to the
kind of precision for which you seem to be asking.  First of
all, the ANSI definition of ASCII contains both a character
repertoire and an encoding and is, if I recall, fairly explicit
about that.   ANSI also didn't define anything about ASCII in
data fields longer than 7 bits, so the encoding itself has two
levels: (i) the ASCII repertoire specified as a seven-bit
encoding and (ii) the embedding of [intrinsically 7-bit] ASCII
into an 8-bit field.
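
To make the two levels concrete, here is a minimal sketch in
Python.  The function names are invented for this illustration
and do not come from any standard or EAI document.

```python
def ascii_code(ch: str) -> int:
    """Level (i): the 7-bit code value of a character in the ASCII repertoire."""
    value = ord(ch)
    if value > 0x7F:
        raise ValueError(f"{ch!r} is not in the ASCII repertoire")
    return value


def embed_high_bit_zero(code: int) -> int:
    """Level (ii): place a 7-bit value in an octet with the high bit off."""
    if not 0 <= code <= 0x7F:
        raise ValueError("not a 7-bit value")
    return code  # the eighth (high) bit is simply left as zero


def is_high_bit_zero_ascii(octets: bytes) -> bool:
    """True if every octet could be ASCII embedded with the high bit off."""
    return all(b & 0x80 == 0 for b in octets)


embedded = bytes(embed_high_bit_zero(ascii_code(c)) for c in "EAI")
```

The check in is_high_bit_zero_ascii() says nothing about other
possible embeddings; it only tests the "high bit off" convention.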

While there has never been such a thing as "8-bit ASCII" from a
standards point of view, that term, or "ASCII-8", has often
been used both to describe embedding of ASCII into octets _and_
as the name of an extended, ASCII-based, character repertoire
and encoding that is essentially Latin-1 or ISO 8859-1.

The ASCII repertoire is a proper subset of the Unicode
repertoire, and (i) the ASCII code points have the same integer
values as the corresponding Unicode code points if the left
octet is stripped from the latter and (ii) when Unicode is
encoded in UTF-8, the octets corresponding to the ASCII
repertoire have the same values as the ASCII code points
embedded in octets with a leading zero bit.  But there are other
CCSs of which ASCII (both the repertoire and the leading-zero
embedding) is a proper subset, including the entire ISO 8859
family.   All of the character repertoires of the 8859 standards
are subsets of the Unicode repertoire, but only 8859-1 bears an
obvious relationship to the Unicode code point assignments (and,
in UTF-8, the top half of the 8859-1 repertoire turns into two
octets and the relationship to the original encodings disappears
entirely).
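
These relationships can be checked directly.  The following is
an illustrative sketch only, using Python's built-in codecs; the
sample strings are arbitrary choices made for this example.

```python
ascii_text = "EAI"
latin1_text = "\u00e9"    # e with acute accent, top half of ISO 8859-1
cyrillic_text = "\u0416"  # CYRILLIC CAPITAL LETTER ZHE, in ISO 8859-5

# ASCII code points equal the corresponding Unicode code points.
assert list(ascii_text.encode("ascii")) == [ord(c) for c in ascii_text]

# UTF-8 octets for the ASCII repertoire are ASCII with a leading zero bit.
utf8_ascii = ascii_text.encode("utf-8")
assert utf8_ascii == ascii_text.encode("ascii")
assert all(b < 0x80 for b in utf8_ascii)

# ASCII is also a subset of other CCSs, e.g. members of the 8859 family.
assert ascii_text.encode("iso-8859-1") == utf8_ascii
assert ascii_text.encode("iso-8859-5") == utf8_ascii

# In 8859-1 the octet value equals the Unicode code point ...
latin1_octets = latin1_text.encode("iso-8859-1")
assert latin1_octets == bytes([ord(latin1_text)])

# ... but in UTF-8 the same character becomes two octets.
utf8_octets = latin1_text.encode("utf-8")
assert len(utf8_octets) == 2

# For 8859-5 the octet value does not match the code point at all.
cyrillic_octet = cyrillic_text.encode("iso-8859-5")[0]
assert cyrillic_octet != ord(cyrillic_text)
```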

We have usually used "ASCII [embedded] in an 8-bit field, with
the high bit off" on the Internet (indeed, that is part of the
problem that RFC 20 tried to solve), but we have had ASCII
floating around in 7-bit form (not stored in octets) and there
have been two other octet embeddings that put the spare bit
somewhere in the middle (one of them as a zero and one as a
parity bit).  There are also 7-bit encodings of other things
floating around, including the encoding described in RFC 2152
(and its predecessor 1642) and the 1642 modification described
in IMAP4rev1 [RFC 2060], a modification not incorporated in the
1642->2152 update.
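
As a rough illustration of a 7-bit encoding that is not ASCII,
the sketch below uses Python's built-in "utf-7" codec, which I
assume follows RFC 2152.  The modified IMAP form is not shown,
and the sample character is an arbitrary choice.

```python
text = "\u00e9"  # a non-ASCII character (e with acute accent)

# Every octet of the UTF-7 form has the high bit off ...
utf7_octets = text.encode("utf-7")
assert all(b < 0x80 for b in utf7_octets)

# ... and it round-trips when decoded as UTF-7 ...
assert utf7_octets.decode("utf-7") == text

# ... but reading the same octets as ASCII yields different text.
as_ascii = utf7_octets.decode("ascii")
assert as_ascii != text
```

So a test for "all octets below 0x80" cannot, on its own,
distinguish ASCII from other 7-bit encodings.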

So, no, "7-bit" doesn't do it either.

Dave's note suggests that a general terminology solution is
needed, a view with which I agree.  But, just as the EAI WG has
been urged to avoid doing work that extends beyond the narrow
scope of what is specifically needed for email address and header
internationalization, an effort to sort out character set
repertoire, definition, encoding, and embedding issues in a
general and appropriate way should not be imposed on the EAI WG.
It might be appropriate to do so by reopening and revising RFC
3536 (which I note uses some terminology that is now considered
obsolete enough to have caused loud protests when used in other
documents) or by initiating a separate effort, but it would be,
IMO, unreasonable and unfair to delay EAI progress for that work
to be initiated and completed.  As Andrew said (and the above
explanation should reinforce), "introducing and rigorously
defining new terms is hard".

    john