Re: FWD: I-D ACTION:draft-klensin-unicode-escapes-00.txt

John C Klensin <john-ietf@jck.com> Fri, 19 January 2007 17:40 UTC

Date: Fri, 19 Jan 2007 12:40:46 -0500
From: John C Klensin <john-ietf@jck.com>
To: Keith Moore <moore@cs.utk.edu>
Subject: Re: FWD: I-D ACTION:draft-klensin-unicode-escapes-00.txt
Message-ID: <E77D46A3FD71DD741ED1BE85@p3.JCK.COM>
In-Reply-To: <45B0F363.6020102@cs.utk.edu>
References: <891E235E7A867F0DB506C90A@p3.JCK.COM> <45B0F363.6020102@cs.utk.edu>
Cc: discuss@apps.ietf.org


--On Friday, 19 January, 2007 11:35 -0500 Keith Moore
<moore@cs.utk.edu> wrote:

> based on a quick scan, it mostly looks fine to me, with a few
> caveats:
> 
> - it should be clear that this is for newly-designed protocols
> only.  it shouldn't be interpreted as a request to change
> existing protocols (including deployed and nonstandard
> protocols being standardized by IETF), as this would generally
> break backward compatibility by changing the meaning of '\'

That was intended to be clear already.  If it is not
sufficiently so, suggested text, or at least a place to put it,
would be welcome.

More generally, I'm not completely wild about the \u and \U
conventions, as you can probably deduce from the text.  It
seemed like the best choice on balance, but this is an area in
which, were a consensus to emerge around something else, I would
enthusiastically agree.

> - it should be clear that this is for occasional use of
> non-ASCII characters within a protocol field that is
> constrained to contain only ASCII characters (or a subset),
> rather than a recommendation for how to represent non-ASCII
> characters in a protocol field that is capable of carrying,
> say, UTF-8.
> 
> if the text already makes this clear, I apologize for not
> reading the document more carefully.

I don't know if it is clear enough or not.   At some level, if
you didn't conclude that it was clear on reading the draft, then
that is evidence that it isn't clear enough... but I don't know
how carefully you read it.  Certainly I agree with both points
and would welcome suggestions as to how they can be clarified if
they are not sufficiently clear.

> it might also be worthwhile to recommend this notation for use
> in describing character literals in RFCs, in such a way that
> it could be referenced by such RFCs.

Yeah.  The document does address this, but (deliberately) more
or less in passing.   Again, I'd welcome specific community
input and suggestions about this.  I've looked at several RFCs
and U+NNNN seems to be the preferred format for character
literals and, more commonly, for identifying the code point
associated with a named character.  It is also, fwiw, the one I
prefer for that purpose.  But it is fairly poor for inline use
in a protocol.  The authoritative definition and reference for
that form is the "Code Points" section of "Appendix A:
Notational Conventions" of Unicode 5.0 (the reference to the
book is in the I-D).  I don't believe that we add value by
repeating that definition in an RFC, but could be easily talked
out of that position.

From your followup note...

> one more caveat: protocol specifications need to specify this
> notation explicitly (either directly or by reference to the
> published RFC) if they are going to use it. conversely, this
> notation SHOULD NOT (maybe MUST NOT) be used unless it is part
> of the protocol specification.

Please suggest text for specifying those rules.  I constructed
this more as advice to protocol designers and, to a lesser
extent, to document authors, than as a base for notational
definitions to be included by reference.  That could be
changed, but I'd welcome textual suggestions.

> one problem I see with introducing any new notation for
> characters is that it does create normalization issues by
> introducing additional ways to say the same thing.  e.g. after
> introducing this notation, "A" could also be expressed as
> \u0041 or \U00000041.  and it then becomes necessary to manage
> this conversion when copying fields from one protocol that
> supports the new notation (or in which it is benign) to
> another protocol that does not support the notation.

We are, unfortunately, already _deep_ into that problem.  Part
of what motivated this draft was to prevent it from getting
worse.  And, using the example above, the conversion between
\u0041 and \U00000041 is obvious.  You can even do it by eye,
and the relationship to %41 is equally clear.  However, consider
"A" with a Diaeresis (U+00C4 -- note the use of the "code
point" notation here).  We then have \u00C4 and \U000000C4,
which are easily mapped visually to each other and to the code
point form, which can easily be looked up in a table if you
don't know what it represents.  But it is also, if I have done
the calculation correctly, %C3%84 and that form (used in URIs
and IRIs) is seriously non-intuitive and certainly can't be
converted visually.
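
To make the comparison above concrete, here is a minimal Python
sketch that prints the notations discussed for a single character.
The function name escape_forms and the dictionary keys are
illustrative assumptions, not anything defined by the draft; the
percent form is built by hand from the UTF-8 bytes so that even
plain ASCII letters are escaped.

```python
def escape_forms(ch: str) -> dict:
    """Return several ASCII notations for one character.

    Hypothetical helper for illustration only; it assumes `ch` is a
    single non-surrogate code point.
    """
    cp = ord(ch)
    forms = {
        # Code point notation, as used in running text: U+NNNN
        "code_point": f"U+{cp:04X}",
        # Eight-hex-digit escape, usable for any code point
        "upper_U": f"\\U{cp:08X}",
        # Percent-escaped UTF-8, one %XX per encoded byte
        "percent_utf8": "".join(f"%{b:02X}" for b in ch.encode("utf-8")),
    }
    if cp <= 0xFFFF:
        # Four-hex-digit escape only covers the Basic Multilingual Plane
        forms["lower_u"] = f"\\u{cp:04X}"
    return forms


if __name__ == "__main__":
    for ch in ("A", "\u00C4"):
        print(escape_forms(ch))
```

Running it on "A" and on U+00C4 shows the contrast: the \u and \U
forms carry the code point digits directly, while the percent form
carries the UTF-8 bytes, which have to be decoded before the code
point can be recognized.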

> as an example of a potential source of problems, I'd hate to
> see this notation end up in X.509 certs.

To which I'd have to say "what would you like to see?".  My
personal answer is that I'd prefer to see those certs contain
UTF-8 encoding with escapes used only to discuss or present the
certs in contexts in which the UTF-8 and native characters are
not available (and maybe even where they are, for explanation
and clarity).
But I don't think this spec has much impact on that problem.  If
it should, I'd welcome text suggestions.
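
As an illustration of "UTF-8 in the data, escapes only for
presentation", the following Python sketch renders a string in
ASCII using \u and \U escapes and converts it back.  The names
to_ascii and from_ascii are hypothetical, and escaping a literal
backslash as \u005C is an assumption made here so that the round
trip is unambiguous; the draft may specify something different.

```python
import re

# Matches \uXXXX (4 hex digits) or \UXXXXXXXX (8 hex digits)
_ESCAPE = re.compile(r"\\u([0-9A-Fa-f]{4})|\\U([0-9A-Fa-f]{8})")


def to_ascii(text: str) -> str:
    """Render text in ASCII, escaping non-ASCII characters.

    Assumption: a literal backslash is itself escaped as \\u005C so
    that from_ascii() can reverse the conversion unambiguously.
    """
    out = []
    for ch in text:
        cp = ord(ch)
        if ch == "\\":
            out.append("\\u005C")
        elif cp < 0x80:
            out.append(ch)
        elif cp <= 0xFFFF:
            out.append(f"\\u{cp:04X}")
        else:
            out.append(f"\\U{cp:08X}")
    return "".join(out)


def from_ascii(text: str) -> str:
    """Replace \\uXXXX and \\UXXXXXXXX escapes with their characters.

    Raises ValueError if an escape names a value above U+10FFFF.
    """
    def _replace(match: re.Match) -> str:
        return chr(int(match.group(1) or match.group(2), 16))

    return _ESCAPE.sub(_replace, text)
```

Note that from_ascii accepts both escape lengths for the same
character, which is exactly the normalization issue raised earlier:
\u0041, \U00000041 and a plain "A" all decode to the same string.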

General observation: I generated this document because it became
clear that someone needed to do something.  The alternative was
that we continue to drift toward an "every protocol makes its own
choices, ignoring all others" situation, with many of them
favoring escaped UTF-8 (which I consider to be nearly the worst
possible choice for most circumstances... for reasons that
should be clear even from the trivial examples above).  But I
have no particularly strong commitment to any particular
recommendation as long as we establish a recommendation.   From
this point forward, I'm more or less acting as secretary/editor
for whatever comments and consensus emerge rather than trying to
advocate a particular solution or specification.   One
implication of that is that if anyone believes that additional
text, clarification, or specification belongs in the document, I
would really appreciate very specific textual suggestions.

     john