Re: draft-klensin-net-utf8-06

"Frank Ellermann" <> Mon, 15 October 2007 08:39 UTC

From: "Frank Ellermann" <>
Subject: Re: draft-klensin-net-utf8-06
Date: Mon, 15 Oct 2007 10:17:33 +0200
Message-ID: <fev7so$bt0$>
References: <93F25E18AB3DA3EB0599F092@p3.JCK.COM>
List-Id: general discussion of application-layer protocols <>

John C Klensin wrote:

> less ready than the unicode-escapes one, partially because of
> the addition of a lot of new or revised text, but I hope we 
> are converging

Looking only at the diff, yes.  You've added a recommendation
to avoid private use code points, because they "do not have
standard definitions or normalization interpretations".

I think the latter is incorrect: by decree they are already
normalized.  Maybe what you want to say is that they can't have
canonical or compatibility decompositions.  IMO the obvious
"no standard definition" alone is clearer and good enough to
justify your recommendation.
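
A quick way to check the point about private use code points is
to ask a Unicode character database.  The sketch below uses
Python's standard unicodedata module; the helper name
normalization_report is made up for illustration, and the
results reflect whatever Unicode version the local Python
ships, not anything the draft specifies.

```python
import unicodedata

FORMS = ("NFC", "NFD", "NFKC", "NFKD")

def normalization_report(ch):
    """Return (general category, decomposition mapping, stable under all forms).

    Hypothetical helper for illustration only.
    """
    stable = all(unicodedata.normalize(form, ch) == ch for form in FORMS)
    return unicodedata.category(ch), unicodedata.decomposition(ch), stable

# U+E000 is the first code point of the BMP private use area.
print(normalization_report("\ue000"))
# For contrast, U+00E9 (e with acute) has a canonical decomposition.
print(normalization_report("\u00e9"))
```

The private use code point reports category Co, an empty
decomposition mapping, and no change under any of the four
normalization forms.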

Not directly related to your draft, in 5.2 you write:
| The latter is important because an unassigned code point 
| always normalizes to itself.

Something's really odd with this in the Unicode standard:

For some unassigned code points it's "almost obvious" that
they'll never end up in any NFX, unless they also get the
non-character property.  What I have in mind are "obvious
gaps" in dingbats, mathematical symbols, and other blocks,
where the abstract character corresponding to the given
unassigned code point was already encoded elsewhere.  

One of the earlier examples is u+2073.  In theory they
could say "whatever we'll do with that code point, it
will be either a non-character, or have a canonical
decomposition, or stay as is (unassigned) forever".
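
The u+2073 gap can be inspected the same way.  This is only a
sketch with Python's unicodedata module: the describe helper is
a made-up name, and the output depends on the Unicode version
bundled with the interpreter, so it shows the current state of
the code point rather than any promise about its future.

```python
import unicodedata

def describe(cp):
    """Summarize what the local Unicode database knows about a code point.

    Hypothetical helper for illustration only.
    """
    ch = chr(cp)
    return {
        "code_point": f"U+{cp:04X}",
        "category": unicodedata.category(ch),
        "name": unicodedata.name(ch, None),  # None if the code point has no name
        "decomposition": unicodedata.decomposition(ch),
        "nfkc_is_identity": unicodedata.normalize("NFKC", ch) == ch,
    }

# SUPERSCRIPT THREE lives at U+00B3; U+2073 is the hole before U+2074.
for cp in (0x00B3, 0x2073, 0x2074):
    print(describe(cp))
```

Here u+2073 shows up as category Cn with no name and no
decomposition, so it normalizes to itself, while its assigned
neighbours carry <super> compatibility decompositions.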

Back to the draft, s/have been be tied/have been tied/
in 5.2.

> I've added some words about both HT and FF.  One may need to
> look in the appendices to find all of them.

Fine now.  Over the weekend I read a related chapter
in the Unicode standard, and I think you have to 
mention LS u+2028 and maybe also PS u+2029 somewhere.

Admittedly banning LS might upset some folks.  IMO it's similar
to the NEL case, though LS might be even more obscure.

OTOH I think that you shouldn't discuss IND u+0084;
it's dead.  TUS 5.0 says "formerly known as INDEX".
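
To illustrate why LS, PS and NEL matter and IND does not, the
sketch below checks how one existing implementation treats
them.  It uses Python's unicodedata module and str.splitlines();
the splits_lines helper and the CANDIDATES table are made-up
names, and Python's behaviour is just one data point, not what
the draft should require.

```python
import unicodedata

# Labels follow the abbreviations used in this thread.
CANDIDATES = {
    "IND": "\x84",
    "NEL": "\x85",
    "LS": "\u2028",
    "PS": "\u2029",
}

def splits_lines(ch):
    """True if Python's str.splitlines() treats ch as a line boundary.

    Hypothetical helper for illustration only.
    """
    return len(("a" + ch + "b").splitlines()) == 2

for label, ch in CANDIDATES.items():
    print(label, f"U+{ord(ch):04X}", unicodedata.category(ch), splits_lines(ch))
```

In this implementation NEL, LS and PS all end a line, while
IND is an ordinary C1 control with no line-breaking effect.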