Re: Next step

Frank Ellermann <nobody@xyzzy.claranet.de> Thu, 25 January 2007 01:18 UTC

To: discuss@apps.ietf.org
From: Frank Ellermann <nobody@xyzzy.claranet.de>
Subject: Re: Next step
Date: Thu, 25 Jan 2007 01:05:35 +0100
Organization: <URL:http://purl.net/xyzzy>
Lines: 94
Message-ID: <45B7F44F.4675@xyzzy.claranet.de>
References: <B1930392E9C03720F9E495F8@p3.JCK.COM>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit

John C Klensin wrote:

> actually more addressed to me personally than on the list.

Hi, I hope the issue with the list "eating" my mails is now
fixed, so I'm trying again (minus typos, but keeping the ABNF
issue because "8 vs. 6 digits" might still be unclear):

~~~ I-D ~~~
[U+NNNN]
                               This document proposes that a specific
   variation on the latter SHOULD be used in protocols unless other
   considerations apply and explains that choice.

I disagree with that proposal, more below.

-  BMP-form =  "\u" Hex-quad
-  Full-form =  "\U" 2*2 Hex-quad
+  BMP-form =  %x5C.75 Hex-quad         ; starting with lower case "\u"
+  Full-form = %x5C.55 2*2 Hex-quad     ; starting with upper case "\U"

You already fixed something there, resulting in either four or six
digits, but for a case-sensitive u vs. U you can't use "u" or "U"
in ABNF, because quoted strings are case-insensitive.  Sometimes this
ABNF feature is annoying.  Looking at your fix, is that still either
four or *_eight_* digits instead of six?
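
A minimal sketch of the digit-count question, assuming Hex-quad means
exactly four hexadecimal digits (the rule itself is not quoted here).
The regular expressions are only a stand-in for the ABNF; the names
HEX_QUAD, BMP_FORM, FULL_FORM and digits_accepted are made up for
illustration:

```python
import re

# Assumption: Hex-quad is four hexadecimal digits, as the name suggests.
HEX_QUAD = r"[0-9A-Fa-f]{4}"

# %x5C.75 is a backslash followed by lower case "u";
# %x5C.55 is a backslash followed by upper case "U".
# Python regexes are case-sensitive by default, like the %x notation.
BMP_FORM = re.compile(r"\\u" + HEX_QUAD)

# "2*2 Hex-quad" means exactly two Hex-quads, i.e. eight digits.
FULL_FORM = re.compile(r"\\U(?:" + HEX_QUAD + r"){2}")


def digits_accepted(pattern, prefix):
    """Return the digit counts (1..10) for which prefix + digits matches."""
    return [n for n in range(1, 11) if pattern.fullmatch(prefix + "0" * n)]


print(digits_accepted(BMP_FORM, "\\u"))   # digit counts accepted after \u
print(digits_accepted(FULL_FORM, "\\U"))  # digit counts accepted after \U
```

Read this way, the Full-form rule as written accepts eight digits,
not six, and a six-digit form would need a different repetition.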

   (e.g., in RFCs) although the U+NNNN form MAY be used when Unicode
   character encoding is clearly expected.

It SHOULD be used for the purpose of talking _about_ Unicode code
points in the prose of Internet drafts and RFCs, but it SHOULD NOT be
used to encode only the non-ASCII characters in Unicode strings.  It
MAY be used to encode complete, obviously delimited Unicode strings.

-                This specification recommends that, in the absence of
-  compelling reasons to do otherwise, the Unicode code point forms be
-  used rather than the UTF-8 ones.  There are several reasons for this,
-  including:
+                This specification recommends that, in the absence of
+  compelling reasons to do otherwise, the Unicode code point forms
+  SHOULD be used rather than the UTF-8 ones.  There are several
+  reasons for this, including:

Adding a SHOULD to 3.1, otherwise folks won't believe it.

   o  Perl uses the form \x(NNN...).  The advantage of this form is that
      there are explicit delimiters

Indeed, hitting the important C044 point in [CharMod].

   o  Java uses the form \uNNNN, but can represent characters outside
      Plane 0 (i.e., above U+FFFF) only by the use of surrogate pairs.

This is one of the reasons why anything with \u or \U is a
non-starter: there are too many incompatible conventions in use.

                                                   Codings that depend
      on surrogates SHOULD NOT be used.

Strong ACK.
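
For readers who have not met surrogate pairs, here is a small sketch
of what the Java-style \uNNNN convention has to do for a code point
above U+FFFF.  The arithmetic is the standard UTF-16 one; the function
names are hypothetical and this is not code from the draft:

```python
def to_surrogate_pair(code_point):
    """Split a code point above U+FFFF into UTF-16 high and low surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("surrogate pairs only cover U+10000..U+10FFFF")
    offset = code_point - 0x10000          # 20 significant bits remain
    high = 0xD800 + (offset >> 10)         # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits
    return high, low


def java_style_escape(code_point):
    """Render one code point with four-digit \\uNNNN escapes only."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("not a Unicode code point")
    if code_point <= 0xFFFF:
        return "\\u%04X" % code_point
    high, low = to_surrogate_pair(code_point)
    return "\\u%04X\\u%04X" % (high, low)


# U+1D11E (MUSICAL SYMBOL G CLEF) needs two escapes in this convention.
print(java_style_escape(0x1D11E))
```

A reader has to know that the two escapes form one character, which
is exactly the kind of hidden convention the draft wants to avoid.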

   o  HTML and XML use the form &#xNNNN;.  Like the Perl form, this form
      has a clear terminator, reducing ambiguity.  However, it is
      generally considered ugly and awkward outside of its native HTML,
      XML, and similar contexts.

IMO it is THE encoding.  It's also trivial to convert files using this
technique into XML.  For the RFC 4646 language subtag registry I use
a simple gawk script.
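
The gawk script mentioned above is not shown here.  As a rough
illustration of the technique only, a Python sketch (hypothetical
function names, no claim to match that script) that replaces every
non-ASCII character with a hex NCR and converts it back:

```python
import re

NCR = re.compile(r"&#x([0-9A-Fa-f]+);")


def to_hex_ncr(text):
    """Replace each non-ASCII character with &#xNNNN;, leave ASCII alone.

    Limitation of this sketch: a literal "&" in the input is not escaped.
    """
    return "".join(ch if ord(ch) < 0x80 else "&#x%X;" % ord(ch)
                   for ch in text)


def from_hex_ncr(text):
    """Replace each hex NCR with the character it names."""
    return NCR.sub(lambda m: chr(int(m.group(1), 16)), text)


sample = "Z" + chr(0xFC) + "rich " + chr(0x1D11E)
encoded = to_hex_ncr(sample)
print(encoded)
print(from_hex_ncr(encoded) == sample)
```

The ";" terminator means the decoder never has to guess how many
digits belong to the escape, whatever follows it.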

   There is one significant disadvantage of the recommended form.  The

No, there are more: folks will assume that it's a convention they know
or a variant of U+NNNN[N[N]] with an arbitrary number of leading 0s.
Nobody will use \U012345 when they can hope to get away with \U12345.
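
To make the risk concrete, a sketch comparing a strict reader with a
lenient one.  Assumptions: the full form is meant to carry exactly six
digits, and a lenient implementation greedily accepts one to six.
Both patterns and the decode helper are hypothetical:

```python
import re

STRICT = re.compile(r"\\U([0-9A-Fa-f]{6})")     # exactly six digits
LENIENT = re.compile(r"\\U([0-9A-Fa-f]{1,6})")  # greedy, one to six digits


def decode(text, pattern):
    """Replace escapes matched by pattern with the characters they name."""
    def replace(match):
        code_point = int(match.group(1), 16)
        # Leave out-of-range values untouched instead of raising.
        if code_point > 0x10FFFF:
            return match.group(0)
        return chr(code_point)
    return pattern.sub(replace, text)


full = "\\U0000E90 items"   # U+00E9 followed by "0 items"
short = "\\UE90 items"      # same intent, leading zeros dropped

print(repr(decode(full, STRICT)))    # U+00E9, then "0 items"
print(repr(decode(short, STRICT)))   # escape is not recognized at all
print(repr(decode(short, LENIENT)))  # reads U+0E90, the "0" is swallowed
```

Without a terminator the shortened form is either rejected or
silently decoded as a different character, depending on the reader.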

   should not introduce any security issues that are not present as a

My objections are also security considerations: folks will screw up
with this encoding, and that could cause havoc.

6.1.  Normative References

There should be a normative reference to the [CharMod] bible, and
especially to its conformance criteria C042 through C048, starting at
http://www.w3.org/TR/charmod/#C042

In theory your proposal is compatible with C044, but in practice I
fear that it won't work as you expect.  I could live with e.g.
"authors SHOULD either pick hex. NCRs as in XML or" (your proposal),
but in fact I think that the XML notation is much better.

Frank