Re: [EAI] Unicode vs. UTF-8 / Encoding vs. Representation

Andrew Sullivan <ajs@crankycanuck.ca> Tue, 28 December 2010 12:21 UTC

Date: Tue, 28 Dec 2010 07:23:25 -0500
From: Andrew Sullivan <ajs@crankycanuck.ca>
To: ima@ietf.org

On Mon, Dec 27, 2010 at 11:17:32PM -0500, John C Klensin wrote:

> I hope someone in the WG will have a good suggestion.   If not,
> we may be back to "ASCII" versus "non-ASCII" as the disjoint
> pair, "Unicode"  to describe the international CCS (inclusive of
> ASCII), and "UTF-8" to describe the particular encoding that is
> used for this protocol and recommended by RFC 2277 (a
> recommendation that is reinforced by draft-iab-idn-encoding and
> other more recent work).
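
To make the quoted distinction concrete, here is a minimal Python sketch
(not taken from any draft): a character's Unicode code point is its
identity in the coded character set, while UTF-8 is just one byte
encoding of that code point. The helper name `describe` is made up for
illustration.

```python
def describe(ch: str) -> tuple[int, bytes]:
    """Return (Unicode code point, UTF-8 bytes) for a single character."""
    if len(ch) != 1:
        raise ValueError("expected a single character")
    return ord(ch), ch.encode("utf-8")


# ASCII character: the code point and the single UTF-8 byte coincide.
print(describe("a"))        # (97, b'a')

# Non-ASCII character: one code point (U+00E9), two UTF-8 bytes.
print(describe("\u00e9"))   # (233, b'\xc3\xa9')
```

The same code point could be carried by a different encoding (UTF-16,
say) without changing which character it is, which is why "Unicode" and
"UTF-8" are not interchangeable terms.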

It seems to me that one could take a cue from the IDNA work, which
differentiated A-label, U-label, LDH-label, and so on.

I haven't thought about this too carefully, have barely had time to
follow the discussion, and won't have time to think about it much more
for the next 3 weeks.  But perhaps something like "A-set" for the
ASCII-only characters, "U-set" for all Unicode characters, and (ugly)
"X-set" for the set of characters that are in Unicode eXcluding the
ASCII subset?  I know that introducing and rigorously defining new
terms is hard (to return to the IDNA analogy, it took a _lot_ of time
to work out those precise definitions), but it seems that confusion
may continue otherwise.
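
As a rough sketch of how the three proposed sets relate, assuming the
A-set is exactly the code points U+0000 through U+007F and the X-set is
everything else in Unicode, the split could be written in Python as
below. The function names `classify` and `is_a_set_only` are
hypothetical, and nothing here is a rigorous definition of the terms.

```python
def classify(ch: str) -> str:
    """Label one character as 'A-set' (ASCII) or 'X-set' (Unicode minus ASCII).

    Every character is in the U-set by construction, so only the
    disjoint A/X split is reported. Assumption: the A-set is the
    code point range U+0000..U+007F.
    """
    if len(ch) != 1:
        raise ValueError("expected a single character")
    return "A-set" if ord(ch) < 0x80 else "X-set"


def is_a_set_only(s: str) -> bool:
    """True if every character of s falls in the A-set."""
    return all(classify(c) == "A-set" for c in s)


print(classify("a"))               # A-set
print(classify("\u00fc"))          # X-set
print(is_a_set_only("user"))       # True
print(is_a_set_only("us\u00e9r"))  # False
```

Note that this works on code points only; it says nothing about how the
characters are encoded on the wire, which is the separate "UTF-8"
question.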

Feel free to point and laugh.  This might just be a bad idea.

A

-- 
Andrew Sullivan
ajs@crankycanuck.ca