Re: FWD: I-D ACTION:draft-klensin-unicode-escapes-00.txt

"Clive D.W. Feather" <clive@demon.net> Mon, 22 January 2007 09:02 UTC

Received: from [127.0.0.1] (helo=stiedprmman1.va.neustar.com) by megatron.ietf.org with esmtp (Exim 4.43) id 1H8v4X-0005fz-Cr; Mon, 22 Jan 2007 04:02:37 -0500
Received: from [10.91.34.44] (helo=ietf-mx.ietf.org) by megatron.ietf.org with esmtp (Exim 4.43) id 1H8v4V-0005fD-UO for discuss@apps.ietf.org; Mon, 22 Jan 2007 04:02:35 -0500
Received: from anchor-internal-1.mail.demon.net ([195.173.56.100]) by ietf-mx.ietf.org with esmtp (Exim 4.43) id 1H8v4U-0000Yj-2p for discuss@apps.ietf.org; Mon, 22 Jan 2007 04:02:35 -0500
Received: from finch-staff-1.server.demon.net (finch-staff-1.server.demon.net [193.195.224.1]) by anchor-internal-1.mail.demon.net with ESMTP id l0M92UmJ010234; Mon, 22 Jan 2007 09:02:30 GMT
Received: from clive by finch-staff-1.server.demon.net with local (Exim 3.36 #1) id 1H8v4Q-000JIz-00; Mon, 22 Jan 2007 09:02:30 +0000
Date: Mon, 22 Jan 2007 09:02:30 +0000
From: "Clive D.W. Feather" <clive@demon.net>
To: John C Klensin <john-ietf@jck.com>
Subject: Re: FWD: I-D ACTION:draft-klensin-unicode-escapes-00.txt
Message-ID: <20070122090230.GJ60599@finch-staff-1.thus.net>
References: <891E235E7A867F0DB506C90A@p3.JCK.COM> <45B0F363.6020102@cs.utk.edu> <E77D46A3FD71DD741ED1BE85@p3.JCK.COM>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <E77D46A3FD71DD741ED1BE85@p3.JCK.COM>
User-Agent: Mutt/1.5.3i
X-Spam-Score: 0.0 (/)
X-Scan-Signature: 73734d43604d52d23b3eba644a169745
Cc: discuss@apps.ietf.org, Keith Moore <moore@cs.utk.edu>
X-BeenThere: discuss@apps.ietf.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: general discussion of application-layer protocols <discuss.apps.ietf.org>
List-Unsubscribe: <https://www1.ietf.org/mailman/listinfo/discuss>, <mailto:discuss-request@apps.ietf.org?subject=unsubscribe>
List-Post: <mailto:discuss@apps.ietf.org>
List-Help: <mailto:discuss-request@apps.ietf.org?subject=help>
List-Subscribe: <https://www1.ietf.org/mailman/listinfo/discuss>, <mailto:discuss-request@apps.ietf.org?subject=subscribe>
Errors-To: discuss-bounces@apps.ietf.org

John C Klensin said:
>> - it should be clear that this is for newly-designed protocols
>> only.  it shouldn't be interpreted as a request to change
>> existing protocols (including deployed and nonstandard
>> protocols being standardized by IETF), as this would generally
>> break backward compatibility by changing the meaning of '\'
> 
> That was intended to be clear already.  If it is not
> sufficiently so, suggested text, or at least a place to put it,
> would be welcome.

How about adding "new" before "protocols" in the middle paragraph of 1.1
and the abstract?

>> - it should be clear that this is for occasional use of
>> non-ASCII characters within a protocol field that is
>> constrained to contain only ASCII characters (or a subset),
>> rather than a recommendation for how to represent non-ASCII
>> characters in a protocol field that is capable of carrying,
>> say, UTF-8.

> I don't know if it is clear enough or not.   At some level, if
> you didn't conclude that it was clear on reading the draft, then
> that is evidence that it isn't clear enough... but I don't know
> how carefully you read it.

I don't think it would hurt to add something in 1.1. I'm not sure how to
word it, but something like "Some protocols already accept native UTF-8 or
some other encoding of Unicode, and this recommendation does not apply to
such protocols."

> I've looked at several RFCs
> and U+NNNN seems to be the preferred format for character
> literals and, more commonly, for identifying the code point
> associated with a named character.  It is also, fwiw, the one I
> prefer for that purpose.  But it is fairly poor for inline use
> in a protocol.  The authoritative definition and reference for
> that form is the "Code Points" section of "Appendix A:
> Notational Conventions" of Unicode 5.0 (the reference to the
> book is the I-D).

I don't have that book. The online version 4.1 suggests the notation
<U+0061, U+0300>, which can be abbreviated to <0061, 0300>. This would
still need some kind of introductory indicator (like \u) to show that it's
a Unicode escape.
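
To make the sequence notation concrete, here is a minimal sketch of a reader
for it. The function name parse_sequence and the 4-to-6 hex digit limit are
assumptions for illustration only; they come from neither the draft nor the
Unicode book.

```python
import re

# One code point, with or without the "U+" prefix (assumed 4 to 6 hex digits).
_CODE_POINT = re.compile(r"(?:U\+)?([0-9A-Fa-f]{4,6})")


def parse_sequence(text):
    """Turn '<U+0061, U+0300>' or '<0061, 0300>' into a Python string."""
    text = text.strip()
    if not (text.startswith("<") and text.endswith(">")):
        raise ValueError("sequence must be wrapped in angle brackets")
    chars = []
    for item in text[1:-1].split(","):
        match = _CODE_POINT.fullmatch(item.strip())
        if match is None:
            raise ValueError(f"not a code point: {item.strip()!r}")
        value = int(match.group(1), 16)
        if value > 0x10FFFF:
            raise ValueError(f"beyond the Unicode range: {item.strip()!r}")
        chars.append(chr(value))
    return "".join(chars)


# Both spellings name the same two code points: "a" plus COMBINING GRAVE ACCENT.
assert parse_sequence("<U+0061, U+0300>") == parse_sequence("<0061, 0300>")
```

The angle brackets are what make this readable in running text; inside a
protocol field they would still need an introducer, as noted above.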

>> one more caveat: protocol specifications need to specify this
>> notation explicitly (either directly or by reference to the
>> published RFC) if they are going to use it. conversely, this
>> notation SHOULD NOT (maybe MUST NOT) be used unless it is part
>> of the protocol specification.
> Please suggest text for specifying those rules.  I constructed
> this rather more as advice to protocol designers and, to a
> lesser extent, to document authors, rather than a base for
> notational definitions to be included by reference.  That could
> be changed, but I'd welcome textual suggestions.

"This specification is a recommendation to protocol designers and document
authors. A protocol or other specification MUST NOT be interpreted as
using it unless it explicitly copies this syntax or refers to this RFC
as normative."

> But it is also, if I have done
> the calculation correctly, %C3%83 and that form (used in URIs
> and IRIs) is seriously non-intuitive and certainly can't be
> converted visually.

I certainly agree that escaping the individual octets of a UTF-8 sequence
is the wrong thing to do.

Oh: you should explicitly forbid the use of surrogates to encode characters
above U+FFFF.
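
As a rough sketch of what that rule could look like in a decoder: the escape
syntax below (\uXXXX plus an 8-digit \UXXXXXXXX form for characters above
U+FFFF) is assumed purely for illustration, and decode_escapes is a
hypothetical name. The point is only that a value in D800..DFFF is rejected
rather than being paired up.

```python
import re

# Assumed syntax: \uXXXX (4 hex digits) or \UXXXXXXXX (8 hex digits).
_ESCAPE = re.compile(r"\\u([0-9A-Fa-f]{4})|\\U([0-9A-Fa-f]{8})")


def decode_escapes(text):
    """Replace escapes with characters, refusing surrogate code points."""

    def replace(match):
        value = int(match.group(1) or match.group(2), 16)
        if 0xD800 <= value <= 0xDFFF:
            raise ValueError(f"surrogate U+{value:04X} is not allowed")
        if value > 0x10FFFF:
            raise ValueError(f"U+{value:X} is beyond the Unicode range")
        return chr(value)

    return _ESCAPE.sub(replace, text)


# U+1D11E is written directly, never as the pair D834 DD1E.
clef = decode_escapes(r"\U0001D11E")
assert len(clef) == 1

try:
    decode_escapes(r"\uD834\uDD1E")
except ValueError as error:
    print("rejected:", error)
```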

> But I
> have no particularly strong commitment to any particular
> recommendation as long as we establish a recommendation.

(1) I agree that anything is better than nothing.

(2) While \uXXXX is better than encoded UTF-8, it's far worse than
something explicitly delimited.
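
One way to see the difference, under the assumption that an escape may carry
4 to 6 hex digits so that it can reach characters above U+FFFF: without a
delimiter, a hex digit that follows the escape gets swallowed. Both syntaxes
below are hypothetical illustrations, not text from the draft, and the sketch
does no range checking.

```python
import re

# Hypothetical undelimited form: \u followed by 4 to 6 hex digits.
UNDELIMITED = re.compile(r"\\u([0-9A-Fa-f]{4,6})")
# Hypothetical delimited form: \u'...' with 4 to 6 hex digits inside quotes.
DELIMITED = re.compile(r"\\u'([0-9A-Fa-f]{4,6})'")


def decode(pattern, text):
    """Replace every escape matched by pattern with its character."""
    return pattern.sub(lambda m: chr(int(m.group(1), 16)), text)


# Intended content in both cases: U+00E9 followed by the digit "1".
undelimited = decode(UNDELIMITED, r"\u00e91")
delimited = decode(DELIMITED, r"\u'00e9'1")

assert undelimited == "\u0e91"      # the "1" was read as a fifth hex digit
assert delimited == "\u00e9" + "1"  # the quotes mark where the escape ends
```

A fixed four-digit \uXXXX avoids that particular trap, but only by giving up
a direct way to write code points above U+FFFF.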

-- 
Clive D.W. Feather  | Work:  <clive@demon.net>   | Tel:    +44 20 8495 6138
Internet Expert     | Home:  <clive@davros.org>  | Fax:    +44 870 051 9937
Demon Internet      | WWW: http://www.davros.org | Mobile: +44 7973 377646
THUS plc            |                            |