Re: [EAI] Unicode vs. UTF-8 / Encoding vs. Representation
John C Klensin <klensin@jck.com> Tue, 28 December 2010 18:23 UTC
Date: Tue, 28 Dec 2010 13:25:35 -0500
From: John C Klensin <klensin@jck.com>
To: Yangwoo Ko <newcat@icu.ac.kr>
Cc: Andrew Sullivan <ajs@shinkuro.com>, dcrocker@bbiw.net, ima@ietf.org

--On Tuesday, December 28, 2010 3:23 PM +0900 Yangwoo Ko <newcat@icu.ac.kr> wrote:

> On Tue, Dec 28, 2010 at 1:17 PM, John C Klensin
> <klensin@jck.com> wrote:
>>> If you have
>>> alternative labeling, that would be great to consider.
>>
>> I hope someone in the WG will have a good suggestion. If
>> not, we may be back to "ASCII" versus "non-ASCII" as the
>> disjoint pair, "Unicode" to describe the international CCS
>> (inclusive of ASCII), and "UTF-8" to describe the particular
>> encoding that is used for this protocol and recommended by
>> RFC 2277 (a recommendation that is reinforced by
>> draft-iab-idn-encoding and other more recent work).
>
> I hope so too. But I don't expect that we are going to have a
> nice answer to this soon enough. If it still significantly
> matters, will it be useful to list up the cases of possible
> confusions when current terminologies are used? With the list,
> we may be able to add some notes to one of our documents
> (possibly framework doc) to address the known possible
> confusions.

If we try to push this definitional issue very far, it is far
more complicated than has been suggested so far. For the work
of the EAI WG, I suggest that there are two practical
possibilities:

(1) The current terminology is not precisely correct given
other, Standard, usage but, with small adjustments, can be used
in the documents with some explicit local definitions.

(2) In the tradition of John Tukey and others, we invent
entirely new language that is precisely defined locally. As
Andrew points out, that is precisely what IDNA2008 ended up
having to do after struggling with a variation on this problem
in the DNS space.

Despite Andrew's misgivings, the definitional problem here is
not nearly as hard as that for IDNA. I think either of the
above would be possible without significantly delaying the WG.
However, the first would be a lot safer in terms of the risk of
introducing new errors as terminology is redefined and then
applied (a problem we certainly got into with IDNA).

Dave suggested in a later note...

>...
> The change in our reference is to stop using ASCII to refer to
> an encoding. For Internet ASCII, the formal definition
> usually is "7-bit ASCII in an 8-bit field, with the high bit
> off", but "7-bit" should suffice...

Unfortunately, this doesn't work either if one aspires to the
kind of precision for which you seem to be asking.

First of all, the ANSI definition of ASCII contains both a
character repertoire and an encoding and is, if I recall,
fairly explicit about that. ANSI also didn't define anything
about ASCII in data fields longer than 7 bits, so the encoding
itself has two levels: (i) the ASCII repertoire specified as a
seven-bit encoding and (ii) the embedding of [intrinsically
7-bit] ASCII into an 8-bit field.

While there has never been such a thing as "8-bit ASCII" from a
standards point of view, that term, or "ASCII-8", has often
been used both to describe the embedding of ASCII into octets
_and_ as the name of an extended, ASCII-based, character
repertoire and encoding that is essentially Latin-1 or ISO
8859-1.

The ASCII repertoire is a proper subset of the Unicode
repertoire, and (i) the ASCII code points have the same integer
values as the corresponding Unicode code points if the left
octet is stripped from the latter and (ii) when Unicode is
encoded in UTF-8, the octets corresponding to the ASCII
repertoire have the same values as the ASCII code points
embedded in octets with a leading zero bit. However, there are
other CCSs of which ASCII (both the repertoire and the
leading-zero embedding) is a proper subset -- including the
entire ISO 8859 family.
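
The relationships in the preceding paragraph can be checked with a
short sketch. It assumes Python 3 and its standard codecs; the helper
names embed_7bit and check_ascii_relationships are hypothetical,
introduced only for illustration, and the three ISO 8859 parts tested
are an arbitrary sample rather than the whole family.

```python
def embed_7bit(code_point: int) -> bytes:
    """Embed a 7-bit ASCII code point in one octet with the high bit off."""
    if not 0 <= code_point <= 0x7F:
        raise ValueError("not in the 7-bit ASCII range")
    return bytes([code_point])


def check_ascii_relationships() -> bool:
    """Return True if every ASCII code point satisfies the properties below."""
    for cp in range(0x80):
        ch = chr(cp)
        octet = embed_7bit(cp)
        # (i) The ASCII and Unicode code points share the same integer value.
        if ord(ch) != cp:
            return False
        # (ii) UTF-8 yields exactly the leading-zero octet embedding.
        if ch.encode("utf-8") != octet:
            return False
        # Other CCSs contain the same repertoire and embedding
        # (an illustrative sample of the ISO 8859 family).
        for codec in ("iso-8859-1", "iso-8859-5", "iso-8859-15"):
            if ch.encode(codec) != octet:
                return False
    return True


# Outside the ASCII range the octet-level agreement ends: U+00E9 is one
# octet in ISO 8859-1 but two octets in UTF-8.
e_acute = "\u00e9"
latin1_octets = e_acute.encode("iso-8859-1")
utf8_octets = e_acute.encode("utf-8")

if __name__ == "__main__":
    print(check_ascii_relationships())
    print(latin1_octets.hex(), utf8_octets.hex())
```

Because the low 128 octet values agree across all of these encodings,
a string of octets with the high bit off does not by itself say which
CCS was intended.
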
All of the character repertoires of the 8859 standards are
subsets of the Unicode repertoire, but only 8859-1 bears an
obvious relationship to the Unicode code point assignments
(and, in UTF-8, the top half of the 8859-1 repertoire turns
into two octets and the relationship to the original encodings
disappears entirely).

We have usually used "ASCII [embedded] in an 8-bit field, with
the high bit off" on the Internet (indeed, that is part of the
problem that RFC 20 tried to solve), but we have had ASCII
floating around in 7-bit form (not stored in octets) and there
have been two other octet embeddings that put the spare bit
somewhere in the middle (one of them as a zero and one as a
parity bit). There are also 7-bit encodings of other things
floating around, including the encoding described in RFC 2152
(and its predecessor 1642) and the 1642 modification described
in IMAP4rev1 [RFC 2060], a modification not incorporated in the
1642->2152 update.

So, no, "7-bit" doesn't do it either.

Dave's note suggests that a general terminology solution is
needed, a view with which I agree. But, just as the EAI WG has
been urged to avoid doing work that extends beyond the narrow
scope of what is specifically needed for email address and
header internationalization, an effort to sort out character
set repertoire, definition, encoding, and embedding issues in a
general and appropriate way should not be imposed on the EAI
WG. It might be appropriate to do so by reopening and revising
RFC 3536 (which I note uses some terminology that is now
considered obsolete enough to have caused loud protests when
used in other documents) or by initiating a separate effort,
but it would be, IMO, unreasonable and unfair to delay EAI
progress for that work to be initiated and completed.

As Andrew said (and the above explanation should reinforce),
"introducing and rigorously defining new terms is hard".

    john