[EAI] new terms Re: Unicode vs. UTF-8 / Encoding vs. Representation

"Jiankang YAO" <yaojk@cnnic.cn> Wed, 29 December 2010 08:59 UTC


----- Original Message ----- 
From: "Andrew Sullivan" <ajs@crankycanuck.ca>
To: <ima@ietf.org>
Sent: Tuesday, December 28, 2010 8:23 PM
Subject: Re: [EAI] Unicode vs. UTF-8 / Encoding vs. Representation


> On Mon, Dec 27, 2010 at 11:17:32PM -0500, John C Klensin wrote:
> 
>> I hope someone in the WG will have a good suggestion.   If not,
>> we may be back to "ASCII" versus "non-ASCII" as the disjoint
>> pair, "Unicode"  to describe the international CCS (inclusive of
>> ASCII), and "UTF-8" to describe the particular encoding that is
>> used for this protocol and recommended by RFC 2277 (a
>> recommendation that is reinforced by draft-iab-idn-encoding and
>> other more recent work).
> 
> It seems to me that one could take a cue from the IDNA work, which
> differentiated A-label, U-label, LDH-label, and so on.
> 

Yes, I think we can borrow the ideas from RFC 5890, which defines:

 A "U-label" is an IDNA-valid string of Unicode characters, in
      Normalization Form C (NFC) and including at least one non-ASCII
      character, expressed in a standard Unicode Encoding Form (such as
      UTF-8). 


I suggest defining the following terms:

U-character:
 A "U-character" is a non-ASCII Unicode character, in
      Normalization Form C (NFC), expressed in the UTF-8
      Unicode Encoding Form.

U-string:
 A "U-string" is a string of Unicode characters, in
      Normalization Form C (NFC) and including at least one non-ASCII
      character, expressed in the UTF-8 Unicode Encoding Form.

A-character:
 An "A-character" is an ASCII character, expressed in the UTF-8
      Unicode Encoding Form.

A-string:
 An "A-string" is a string of Unicode characters, consisting of ASCII
      characters only, expressed in the UTF-8 Unicode Encoding Form.
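
To make the proposed distinction concrete, here is a rough Python sketch of how a receiver might classify an octet string under these definitions. The function name classify() and the return values are hypothetical, chosen only for illustration. The definitions do not say how to treat an empty string or a non-ASCII string that is not in NFC, so the sketch assumes both fall into neither category.

```python
import unicodedata
from typing import Optional


def classify(data: bytes) -> Optional[str]:
    """Return "A-string", "U-string", or None for a UTF-8 octet string.

    Hypothetical helper: the name and return values are illustrative only.
    """
    try:
        # Strict decoding rejects octet sequences that are not valid UTF-8.
        text = data.decode("utf-8")
    except UnicodeDecodeError:
        return None

    if not text:
        # Assumption: the proposed definitions do not cover the empty string.
        return None

    # ASCII characters are the code points U+0000..U+007F.
    if all(ord(ch) < 0x80 for ch in text):
        return "A-string"

    # At least one non-ASCII character is present, so the string
    # must also be in Normalization Form C to count as a U-string.
    if unicodedata.normalize("NFC", text) == text:
        return "U-string"

    # Assumption: non-ASCII but not NFC matches neither definition.
    return None
```

For example, classify(b"user") gives "A-string", while the UTF-8 encoding of "caf\u00e9" (precomposed e-acute) gives "U-string". The same word written as "cafe\u0301" (with a combining acute accent) is not in NFC, so under the assumptions above it matches neither term.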


Are the definitions above clear?


Another question: should we define these new terms in rfc5336bis or in rfc4952bis?


Thanks.

Jiankang Yao