draft-klensin-unicode-escapes-01 (was: New Draft)

John C Klensin <john-ietf@jck.com> Fri, 02 February 2007 21:37 UTC

Date: Fri, 02 Feb 2007 16:37:01 -0500
From: John C Klensin <john-ietf@jck.com>
To: "Clive D.W. Feather" <clive@demon.net>, Frank Ellermann <nobody@xyzzy.claranet.de>
Subject: draft-klensin-unicode-escapes-01 (was: New Draft)
Cc: discuss@apps.ietf.org

In order to reduce the number of threads consisting of
incremental micro-messages, I'm going to try an omnibus
response....

--On Friday, 02 February, 2007 18:47 +0000 "Clive D.W. Feather"
<clive@demon.net> wrote:

> Frank Ellermann said:
>> The I-D should IMO adopt and cite [Charmod] C042 up to C048
>> verbatim.
> 
> C042 would require &#x1234; rather than allowing us to invent
> \u'1234'.
> 
>> A few other conformance criteria in [Charmod] might be also
>> interesting: http://www.w3.org/TR/charmod/#C070  Don't
>> exclude arbitrary code points
>> http://www.w3.org/TR/charmod/#C077  Don't allow anything
>> above U+10FFFF http://www.w3.org/TR/charmod/#C078  Don't
>> (ab)use surrogates http://www.w3.org/TR/charmod/#C079  Don't
>> (ab)use non-characters
> 
> Those are worth including, I think.

Folks, someday, someone will write a document called "using
Unicode on the Internet".  To some extent, the now-suspended
draft-klensin-net-utf8 is a step in that direction (Mike and I
will get back to it, but, right now, I've got some other things
on my plate).  

But _this_ document is about escaping characters, or really code
points, not about what characters should be permitted or how
they should be used.  Or one of you might take some of this
energy and divert it into an RFC-ish version of "CharMod".   But
for this, just taking the above list as examples:

  C070: Irrelevant.  If it is in range, this works.
  C077: Needs to be fixed somewhere else.  This is just about a
        syntax for escapes.
  C078: Already prohibited.
  C079: Precisely the reason why one might want escapes is to be
        able to deal with non-characters in a sensible way.  Whether
        they should be used or not depends on the relevant protocol.
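
To make the four points above concrete, here is a minimal Python
sketch of a decoder for the \u'NNNN' form mentioned earlier in this
thread.  The function name, the pattern, and the four-to-six digit
count are assumptions made for illustration; none of them is taken
from the draft.  The sketch only shows where each criterion would
bite in an escape decoder, not how any protocol must behave.

```python
import re

# Hypothetical pattern for the \u'NNNN' form discussed in this thread.
# The four-to-six hexadecimal digit count is an assumption.
ESCAPE = re.compile(r"\\u'([0-9A-Fa-f]{4,6})'")


def decode_escapes(text):
    """Replace each escape in text with the code point it names."""

    def repl(match):
        cp = int(match.group(1), 16)
        if cp > 0x10FFFF:
            # C077: nothing above U+10FFFF.
            raise ValueError("out of range: %X" % cp)
        if 0xD800 <= cp <= 0xDFFF:
            # C078: surrogate code points are not accepted.
            raise ValueError("surrogate: U+%04X" % cp)
        # C070/C079: any other in-range code point, including a
        # noncharacter such as U+FFFE, is passed through.  Whether it
        # is acceptable is left to the protocol using the escape.
        return chr(cp)

    return ESCAPE.sub(repl, text)
```

For example, decode_escapes(r"caf\u'00E9'") returns a four-character
string ending in U+00E9, while an escape naming U+D800 raises
ValueError.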



--On Friday, 02 February, 2007 14:05 +0100 Frank Ellermann
<nobody@xyzzy.claranet.de> wrote:

> Yes, never ever mention that HTML exists, it's horrible.  The
> [Charmod] bible requires no (SGML) nonsense in
> http://www.w3.org/TR/charmod/#C044

Sorry, but, IMO, for a document like this which is merely giving
examples (at this point), "very widely deployed and used" trumps
"horrible".
 
>...
> Probably the I-D should mention that one famous exception from
> its rule to avoid encoded UTF-8 is the URL form of IRIs.

Why?  I don't see that (and a few other cases) as "exceptions"
(famous or not), but as mistakes from which we should learn and,
I hope, have learned.   The document is reasonably clear that it
is not a proposal to retrofit any existing protocol.



--On Friday, 02 February, 2007 11:38 +0000 "Clive D.W. Feather"
<clive@demon.net> wrote:

> John C Klensin said:
>> I've just submitted draft-klensin-unicode-escapes-01.txt and
>> assume it will show up in the posting directory today or
>> tomorrow.  
> 
> Some comments for you.

Thanks

> * In 1.1, rather than saying that Unicode occupies "two or
> more octets", wouldn't it be better to say "21 bits - rather
> than the 7 bits of ASCII -"?

No, because of net-ascii and some other issues, probably not.
Again, the important issue here is that this document is about
escapes, so you are picking nits that belong elsewhere (see
rant at the end).

> * Somewhere in the last two paragraphs of 1.1 you should be
> talking about mini-languages (e.g. Cosmogol) as well as
> protocols and UIs.

Why?  In principle I could talk about Japanese business cards
and all sorts of other things too.  But that is introductory
material to help the reader understand the context.  Trying to
create an exhaustive list would add nothing to the document
except more pages and making it harder to read.

> * In 3, you're inconsistent between "U+NNN[N[N]]" and
> "NNN...". Indeed, shouldn't the former actually be
> "U+[[N]N]NNNN"? (Note both the order and the number of Ns.) I
> would suggest that better wording might be:

Partially fixed in -02.   U+NNNN[N[N]] versus U+[[N]N]NNNN is a
matter of taste.  Clearly, in that pseudo-notation, one "N" is
as good as another.   See rant at end.
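
As a rough illustration of why the bracket placement does not matter,
both U+NNNN[N[N]] and U+[[N]N]NNNN describe the same set of strings:
"U+" followed by four, five, or six hexadecimal digits.  A minimal
Python sketch, with a hypothetical function name and an assumed
restriction to upper-case digits:

```python
import re

# "U+" followed by four to six hexadecimal digits.  Upper-case only is
# an assumption made here for simplicity.
U_PLUS = re.compile(r"U\+([0-9A-F]{4,6})")


def parse_u_plus(ref):
    """Return the integer code point named by a U+ reference."""
    match = U_PLUS.fullmatch(ref)
    if match is None:
        raise ValueError("not a U+ reference: %r" % (ref,))
    return int(match.group(1), 16)
```

Note that this checks only the digit count; it does not check that
the value is at most 10FFFF.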

>     ... U+NN syntax for code point references specified in the
>     Unicode Standard, where NN is between four and six
>     hexadecimal digits.

I agree with another comment -- too easy to misunderstand.

> * In 4, second bullet, "string terminators" should be "string
> delimiters".

I was deliberately trying to be general.  If one has a form like
H'nnnn', one is clearly talking about "string delimiters".
However, if one has, e.g., &#xNNNN;, it is clear that ";" is a
string terminator, but whether there is a starting delimiter at
all depends on how one defines things in metalanguage or words.
E.g., using BNF (_not_ ABNF), one could reasonably have
   <XML-like-Unicode-escape-string> ::=  <type-introducer>
                                         <value> <escape-terminator>
   <type-introducer> ::= "&#x"
   <value> ::= ....
   <escape-terminator> ::= ";"
or you could construct it in other ways.  Matter of taste, see
rant at the bottom.
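
One way to see the same decomposition in running code is the
following Python sketch.  It splits an XML-like escape into an
introducer, a value, and a terminator.  The value production is left
open in the BNF above, so treating it as one or more hexadecimal
digits is an assumption, and the names are hypothetical.

```python
import re

# Introducer "&#x", a value, and the terminator ";".
# Treating the value as one or more hexadecimal digits is an assumption.
XML_LIKE = re.compile(
    r"(?P<introducer>&#x)(?P<value>[0-9A-Fa-f]+)(?P<terminator>;)"
)


def split_escape(escape):
    """Return the (introducer, value, terminator) parts of an escape."""
    match = XML_LIKE.fullmatch(escape)
    if match is None:
        raise ValueError("not an XML-like escape: %r" % (escape,))
    return (
        match.group("introducer"),
        match.group("value"),
        match.group("terminator"),
    )
```

Here ";" is unambiguously a terminator, while whether "&#x" counts as
a starting delimiter or as a type introducer is only a matter of how
the groups are named.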

> * In 5.2, you've said "generally considered ugly and awkward"
> but I'm not aware of anyone else who's made that complaint.

I could trundle out several others, but it is probably more
efficient to write it off to editor's privilege.   See rant at end.

> * In 6 you need to copy in all the security stuff from
> Unicode; the stuff that says that you must use shortest-form
> UTF-8 (so not using %xC1.A1 for 'A') because of the problems
> of filters and firewalls not spotting longer forms.

Absolutely not.  Again, this document is about escapes for
strings of Unicode characters, not about the general use and
appropriateness of Unicode.   And shortest-form UTF-8 is
especially irrelevant because this document is an attempt to
prohibit escapes for UTF-8 entirely where that is still
possible.  The business about different, almost-equivalent,
forms of UTF-8 arguably should go into Section 1 or 2.1 as
further evidence that escaped UTF-8 is a bad idea, but I think
the point has been made.
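
For readers who have not run into the almost-equivalent forms, here
is a small Python sketch of the problem.  The helper names are
hypothetical.  The first function decodes a two-octet sequence without
the shortest-form check, as a careless decoder might; the second uses
Python's built-in UTF-8 decoder, which rejects such sequences.

```python
def naive_decode_two_octets(b0, b1):
    """Decode a two-octet UTF-8 sequence with no shortest-form check."""
    # Take the low five bits of the lead octet and the low six bits of
    # the continuation octet, as the UTF-8 bit layout prescribes.
    return chr(((b0 & 0x1F) << 6) | (b1 & 0x3F))


def strict_decode(octets):
    """Decode with Python's built-in decoder, which rejects overlong forms."""
    return bytes(octets).decode("utf-8")


# The octets C0 AF are an overlong spelling of "/" (U+002F).
overlong_slash = naive_decode_two_octets(0xC0, 0xAF)

try:
    strict_decode([0xC0, 0xAF])
    strict_rejects_overlong = False
except UnicodeDecodeError:
    strict_rejects_overlong = True
```

A filter that compares octets sees 2F and C0 AF as different strings,
while the naive decoder behind it turns both into "/".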

---------------
<rant>
There is a lot of work to be done in the internationalization
area in the IETF.  Whether one likes what RFC 2277 says or not,
it has become fairly clear that it is not as close to the last
word on the subject as many of us assumed it would be when it
was written.   The suggestions about CharMod, the
shortest-form string issues that caused us to need to revise the
UTF-8 specs, and the recent work on comparators and comparator
registries, are just the beginning of a very long list.

Speaking personally, I'd be really pleased if I were doing a
much smaller percentage of the document-writing in this area.
But, if I'm going to do it, then my editorial judgment and
preferences are going to prevail... at least until the document
gets far enough along that I have to start arm-wrestling with
the RFC Editor and _their_ judgment and preferences.  I really
appreciate comments about how to make a document more clear but
an argument about, e.g., U+NNNN[N[N]] versus U+[[N]N]NNNN is
only about taste and is hence a waste of everyone's time.

If you don't like my writing style -- and many people don't --
please take on these efforts yourselves and let me complain
about your style (or not) some of the time.  Sniping and
nit-picking are easy and may be fun, but they tend to block,
rather than contribute to, progress.   If you are not going to
pick up some of the writing work, unless your purpose is to
introduce delays and lay down obstacles -- which I assume it is
not -- can we please concentrate on substantive issues and
document changes that would have a clear positive impact on
clarity?
</rant>

thanks,
     john