Re: Encoding of small characters in draft-klensin-unicode-escapes-04

John C Klensin <> Fri, 05 October 2007 15:30 UTC

Date: Fri, 05 Oct 2007 11:30:23 -0400
From: John C Klensin <>
To: Stephane Bortzmeyer <>,
Subject: Re: Encoding of small characters in draft-klensin-unicode-escapes-04
Message-ID: <A312938795F54DFA7BF86174@p3.JCK.COM>
In-Reply-To: <>
References: <3A8797AD0BB8B1EF4FAA7DE8@p3.JCK.COM> <>

--On Friday, 05 October, 2007 16:24 +0200 Stephane Bortzmeyer
<> wrote:

> On Thu, Oct 04, 2007 at 05:10:43PM -0400,
>  John C Klensin <> wrote 
>  a message of 49 lines which said:
>> It now recommends either \u'NNNN..' 
> [Small syntax detail, do not spend too much time on it.]
> Section 5.1 describes the content of the "Backslash-U with
> Delimiters" form as "4*6HEXDIG". Why not "2*6HEXDIG"? It would
> be more compact for the first characters and would be more
> consistent with the other forms such as the "XML and HTML" one.
> My personal taste is that \u'20' is better than \u'0020'.

First, I want to be clear that I don't feel strongly about any
of this and will do what the community wants.  However...

One could use the two-digit form, but I think it is trouble.
The Unicode folks, whose guidance I'm trying to follow except
when it seems to violate good Internet or applications sense,
never use fewer than four hex digits.  And we have the problem
with Perl (noted in that section now) that two-digit values are
sometimes interpreted as ASCII and sometimes as "local one-octet
character encoding, whatever that is".  So the four-digit form
helps emphasize that this is, in fact, Unicode and avoids any
risk of confusion with local one-octet (or smaller) character
sets, decimal encoding of octets, etc.
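
As a rough illustration of the difference (a sketch only, not text
from the draft), a decoder that accepts the "4*6HEXDIG" form might
look like the following.  The function name, the regular expression,
and the choice to leave unrecognized sequences untouched are all
assumptions made for the example.

```python
import re

# Assumed reading of the "Backslash-U with Delimiters" form: \u'NNNN..'
# with 4*6HEXDIG between the quotes. Names here are illustrative only.
ESCAPE_4_TO_6 = re.compile(r"\\u'([0-9A-Fa-f]{4,6})'")

def decode_escapes(text):
    """Replace each delimited backslash-u escape with its code point."""
    def replace(match):
        value = int(match.group(1), 16)
        # Six hex digits can exceed the Unicode range, so check it.
        if value > 0x10FFFF:
            raise ValueError("not a Unicode code point: %X" % value)
        return chr(value)
    return ESCAPE_4_TO_6.sub(replace, text)

print(repr(decode_escapes(r"a\u'0020'b")))  # four digits: recognized
print(repr(decode_escapes(r"a\u'20'b")))    # two digits: not recognized
```

Under these assumptions a two-digit value is simply not an escape, so
it can never be mistaken for an octet in some local character set.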

> Section 6.2 contains a note which seems related (the risk that
> small numbers are thought to represent octets, in the current
> locale) but I do not think it is a serious risk since section
> 2 clearly states that we encode Unicode code points, not
> octets.

And we all know that people often read specs in a sloppy or
incomplete fashion.  So I suggest that requiring four digits
(or more) is more robust than permitting two.  Whether the
advantages of being able to write two digits outweigh that...
well, I don't think so, but, if others strongly disagree, I'll
change it.
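
On the producing side, the four-digit minimum costs very little.  A
minimal sketch of an encoder, assuming upper-case hex output and a
hypothetical function name (neither is taken from the draft):

```python
def encode_code_point(code_point):
    """Format one code point in the delimited backslash-u form."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("not a Unicode code point: %r" % (code_point,))
    # %04X pads to at least four hex digits and grows to five or six
    # as needed, which matches the assumed 4*6HEXDIG rule.
    return "\\u'%04X'" % code_point

print(encode_code_point(0x20))     # small value, padded to four digits
print(encode_code_point(0x1F600))  # larger value, five digits
```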