Re: draft-klensin-net-utf8-06

John C Klensin <> Mon, 15 October 2007 17:49 UTC

Date: Mon, 15 Oct 2007 13:49:34 -0400
From: John C Klensin <>
To: Frank Ellermann <>,
Subject: Re: draft-klensin-net-utf8-06
Message-ID: <090481616F66605395036545@p3.JCK.COM>
In-Reply-To: <fev7so$bt0$>
References: <93F25E18AB3DA3EB0599F092@p3.JCK.COM> <fev7so$bt0$>
X-Mailer: Mulberry/4.0.8 (Win32)
MIME-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit
Content-Disposition: inline

--On Monday, 15 October, 2007 10:17 +0200 Frank Ellermann
<> wrote:

> John C Klensin wrote:
>> less ready than the unicode-escapes one, partially because of
>> the addition of a lot of new or revised text, but I hope we 
>> are converging
> Looking only at the diff, yes.  You've added a recommendation
> to avoid private use code points, because they "do not have
> standard definitions or normalization interpretations".
> I think the latter is incorrect, they're by decree normalized.
> Maybe what you want to say is that they can't have canonical
> or compatible decompositions.  IMO the obvious "no standard
> definition" alone is clearer and good enough to justify your
> recommendation.

Frank, in both this and the notes below, we are, I think,
getting dangerously close to providing an interpretation and
profile of Unicode.   The parts of that needed to get
line-endings nailed down and to make the NFC recommendation
clear are necessary but, when we start going into detail on how
individual code points or blocks are interpreted, we are getting
into territory that is very uncomfortable for me and, I believe,
should be uncomfortable for the IETF.
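
For readers who want to check the quoted point about private use code points, here is a minimal sketch using Python's standard unicodedata module. The helper name describe() and the sample code points (U+E000 from the private use area, U+00E9 for contrast) are assumptions chosen for illustration; nothing here comes from the draft. It only shows what one library reports: a private use code point has no name and no decomposition mapping, and normalization leaves it unchanged.

```python
import unicodedata

def describe(ch):
    """Summarize what the bundled Unicode database says about one character.

    Hypothetical helper for illustration only.
    """
    return {
        "code_point": f"U+{ord(ch):04X}",
        "name": unicodedata.name(ch, None),          # None if no name is defined
        "category": unicodedata.category(ch),        # "Co" means private use
        "decomposition": unicodedata.decomposition(ch),  # "" if no mapping
        "nfc_stable": unicodedata.normalize("NFC", ch) == ch,
        "nfkc_stable": unicodedata.normalize("NFKC", ch) == ch,
    }

private_use = describe("\uE000")  # first code point of the BMP private use area
composed = describe("\u00E9")     # LATIN SMALL LETTER E WITH ACUTE, for contrast

for info in (private_use, composed):
    print(info)
```

The contrast case does carry a canonical decomposition mapping, which is the property a private use code point lacks.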

> Not directly related to our draft, in 5.2 you write:
> | The latter is important because an unassigned code point 
> | always normalizes to itself.
> Something's really odd with this in the Unicode standard:
> For some unassigned code points it's "almost obvious" that
> they'll never end up in any NFX, unless they also get the
> non-character property.  What I have in mind are "obvious
> gaps" in dingbats, mathematical symbols, and other blocks,
> where the abstract character corresponding to the given
> unassigned code point was already encoded elsewhere.  

The set of issues involved here is the reason that the "Stable
NF..." discussions started in the wake of some of the
pre-RFC4690 discussions a year or more ago.  I think we should
continue to try very hard to avoid having it turn into an IETF

> One of the earlier examples is u+2073.  In theory they
> could say "whatever we'll do with that code point, it
> will be either a non-character, or have a canonical
> decomposition, or stay as is (unassigned) forever".

They could say lots of things in theory and perhaps in practice.
But they are also in the position in which, being human, it is
possible for them to make mistakes.  And, when they make a
mistake of sufficient importance, they reserve the right to
change things.  Becoming more dependent than absolutely
necessary on changes, or certain types of changes, being
forbidden, impresses me as unwise... and probably unfair to both
the IETF and Unicode communities.
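
To make the u+2073 example concrete, the sketch below asks Python's unicodedata module how it treats that code point today. The function name normalizes_to_itself() is an assumption for illustration, and the answer depends on the Unicode database version bundled with the interpreter (printed first), which is exactly the kind of dependency on future changes discussed above. U+00B3 (SUPERSCRIPT THREE) is included as the already-encoded character that the gap at u+2073 corresponds to.

```python
import unicodedata

FORMS = ("NFC", "NFD", "NFKC", "NFKD")

def normalizes_to_itself(ch):
    """True if all four normalization forms leave the character unchanged.

    Hypothetical helper for illustration only.
    """
    return all(unicodedata.normalize(form, ch) == ch for form in FORMS)

gap = "\u2073"                # unassigned gap in the superscripts block
superscript_three = "\u00B3"  # SUPERSCRIPT THREE, encoded in Latin-1 Supplement

print("Unicode database version:", unicodedata.unidata_version)
print("category of U+2073:", unicodedata.category(gap))  # "Cn" means unassigned
print("U+2073 normalizes to itself:", normalizes_to_itself(gap))
print("NFKC of U+00B3:", unicodedata.normalize("NFKC", superscript_three))
```

If a later Unicode version assigns u+2073 and gives it a decomposition, the same script run against that version would report a different result, with no change to the code.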

> Back to the draft, s/have been be tied/have been tied/
> in 5.2. 

Fixed in working draft for -07.   Unless more changes come up, I
won't post that draft except as part of a Last Call process --
too much clutter from too many drafts.

>> I've added some words about both HT and FF.  One may need to
>> look in the appendices to find all of them.
> Fine now.  Over the weekend I read a related chapter
> in the Unicode standard, and I think you have to 
> mention LS u+2028 and maybe also PS u+2029 somewhere.

Sorry.  This gets into the picking and choosing of individual
code points that we have tried to avoid, for the reasons above
and otherwise.  Note that the prohibition on IND, etc., is part
of a general prohibition on C1 controls -- a clearly-defined
range.  The comments on those characters are just that, comments
(the text was bad on that -- applying a SHOULD NOT to those
characters and then a MUST NOT to the entire block.  That has
been fixed).
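
The difference between a range rule and picking individual code points can be sketched as follows. This is an illustration only, not text or an algorithm from the draft: the function name find_c1_controls() is hypothetical, and the range U+0080 through U+009F is the standard C1 control block. A single range test catches IND (u+0084) and NEL (u+0085) without naming either, while LS (u+2028) and PS (u+2029) fall outside the range and would need to be listed one by one.

```python
import unicodedata

# The C1 control block, a contiguous and clearly-defined range.
C1_FIRST, C1_LAST = 0x0080, 0x009F

def find_c1_controls(text):
    """Return (index, "U+XXXX") pairs for every C1 control in text.

    Hypothetical helper for illustration only.
    """
    return [
        (i, f"U+{ord(ch):04X}")
        for i, ch in enumerate(text)
        if C1_FIRST <= ord(ch) <= C1_LAST
    ]

with_c1 = find_c1_controls("a\u0084b\u0085")  # contains IND and NEL
with_ls = find_c1_controls("line\u2028sep")   # contains LS only

print(with_c1)
print(with_ls)
# LS and PS are separators (categories Zl and Zp), not control characters.
print(unicodedata.category("\u2028"), unicodedata.category("\u2029"))
```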

> Admittedly banning LS might upset some folks.  IMO it's
> similar to the NEL case, LS might be even more obscure.
> OTOH I think that you shouldn't discuss IND u+0084,
> it's dead.  TUS 5.0 says "formerly known as INDEX".

But it was taken into Unicode, if I recall, from ISO 8859, the
associated supplemental controls standard, and the various other
flavors of "Latin-NN".  It went into those documents, if my
memory is correct, precisely to correct ambiguities in the
interpretation of LF and CR (LS presumably arose when even that
wasn't sufficient in some environments).   I think mentioning it
because of the relationship to the pre-Unicode C1 controls might
be helpful to some people and is harmless at worst.  But I'd be
happy to hear from others about this if people think it is
important enough to delay a last call still longer.