draft-klensin-unicode-escapes-01 (was: New Draft)
John C Klensin <john-ietf@jck.com> Fri, 02 February 2007 21:37 UTC
Date: Fri, 02 Feb 2007 16:37:01 -0500
From: John C Klensin <john-ietf@jck.com>
To: "Clive D.W. Feather" <clive@demon.net>, Frank Ellermann <nobody@xyzzy.claranet.de>
Subject: draft-klensin-unicode-escapes-01 (was: New Draft)
Message-ID: <B7F8733D73E8CC7227785A69@p3.JCK.COM>
In-Reply-To: <20070202184727.GG68544@finch-staff-1.thus.net>
References: <875A124D75A8B481E176CF06@p3.JCK.COM> <uppsr2hs59srbd7eufbcul5a1ekl7i09nl@hive.bjoern.hoehrmann.de> <EF59DA6FD89C4F19750C68C3@p3.JCK.COM> <20070202114658.GX7742@finch-staff-1.thus.net> <45C3371E.330F@xyzzy.claranet.de> <20070202184727.GG68544@finch-staff-1.thus.net>
Cc: discuss@apps.ietf.org

In order to reduce the number of threads consisting of
incremental micro-messages, I'm going to try an omnibus
response....

--On Friday, 02 February, 2007 18:47 +0000 "Clive D.W. Feather"
<clive@demon.net> wrote:

> Frank Ellermann said:
>> The I-D should IMO adopt and cite [Charmod] C042 up to C048
>> verbatim.
>
> C042 would require &#x1234; rather than allowing us to invent
> \u'1234'.
>
>> A few other conformance criteria in [Charmod] might be also
>> interesting:
>> http://www.w3.org/TR/charmod/#C070 Don't exclude arbitrary code points
>> http://www.w3.org/TR/charmod/#C077 Don't allow anything above U+10FFFF
>> http://www.w3.org/TR/charmod/#C078 Don't (ab)use surrogates
>> http://www.w3.org/TR/charmod/#C079 Don't (ab)use non-characters
>
> Those are worth including, I think.

Folks, someday, someone will write a document called "using
Unicode on the Internet". To some extent, the now-suspended
draft-klensin-net-utf8 is a step in that direction (Mike and I
will get back to it, but, right now, I've got some other things
on my plate). But _this_ document is about escaping characters,
or really code points, not about what characters should be
permitted or how they should be used. Or one of you might take
some of this energy and divert it into an RFC-ish version of
"CharMod". But for this, just taking the above list as examples:

   C070: Irrelevant. If it is in range, this works.
   C077: Needs to be fixed somewhere else. This is just about a
         syntax for escapes.
   C078: Already prohibited.
   C079: Precisely the reason why one might want escapes is to
         be able to deal with non-characters in a sensible way.
         Whether they should be used or not depends on the
         relevant protocol.

--On Friday, 02 February, 2007 14:05 +0100 Frank Ellermann
<nobody@xyzzy.claranet.de> wrote:

> Yes, never ever mention that HTML exists, it's horrible.
> The [Charmod] bible requires no (SGML) nonsense in
> http://www.w3.org/TR/charmod/#C044

Sorry, but, IMO, for a document like this which is merely giving
examples (at this point), "very widely deployed and used" trumps
"horrible".

>...
> Probably the I-D should mention that one famous exception from
> its rule to avoid encoded UTF-8 is the URL form of IRIs.

Why? I don't see that (and a few other cases) as "exceptions"
(famous or not), but as mistakes from which we should learn and,
I hope, have learned. The document is reasonably clear that it
is not a proposal to retrofit any existing protocol.

--On Friday, 02 February, 2007 11:38 +0000 "Clive D.W. Feather"
<clive@demon.net> wrote:

> John C Klensin said:
>> I've just submitted draft-klensin-unicode-escapes-01.txt and
>> assume it will show up in the posting directory today or
>> tomorrow.
>
> Some comments for you.

Thanks

> * In 1.1, rather than saying that Unicode occupies "two or
> more octets", wouldn't it be better to say "21 bits - rather
> than the 7 bits of ASCII -"?

No, because of net-ascii and some other issues, probably not.
Again, because the important issue here is that this stuff is
about escapes, you are picking nits that belong elsewhere (see
rant at the end).

> * Somewhere in the last two paragraphs of 1.1 you should be
> talking about mini-languages (e.g. Cosmogol) as well as
> protocols and UIs.

Why? In principle I could talk about Japanese business cards and
all sorts of other things too. But that is introductory material
to help the reader understand the context. Trying to create an
exhaustive list would add nothing to the document except more
pages and making it harder to read.

> * In 3, you're inconsistent between "U+NNN[N[N]]" and
> "NNN...". Indeed, shouldn't the former actually be
> "U+[[N]N]NNNN"? (Note both the order and the number of Ns.) I
> would suggest that better wording might be:

Partially fixed in -02. U+NNNN[N[N]] versus U+[[N]N]NNNN is a
matter of taste.
Clearly, in that pseudo-notation, one "N" is as good as another.
See rant at end.

> ... U+NN syntax for code point references specified in the
> Unicode Standard, where NN is between four and six
> hexadecimal digits.

I agree with another comment -- too easy to misunderstand.

> * In 4, second bullet, "string terminators" should be "string
> delimiters".

I was deliberately trying to be general. If one has a form like
H'nnnn', one is clearly talking about "string delimiters".
However, if one has, e.g., &#xNNNN;, it is clear that ";" is a
string terminator, but whether there is a starting delimiter at
all depends on how one defines things in metalanguage or words.
E.g., using BNF (_not_ ABNF), one could reasonably have

   <XML-like-Unicode-escape-string> ::=
         <type-introducer> <value> <terminator>
   <type-introducer> ::= "&#x"
   <value> ::= ....
   <terminator> ::= ";"

or you could construct it in other ways. Matter of taste, see
rant at the bottom.

> * In 5.2, you've said "generally considered ugly and awkward"
> but I'm not aware of anyone else who's made that complaint.

I could trundle out several others, but it is probably more
efficient to write it off to editor's privilege. See rant at end.

> * In 6 you need to copy in all the security stuff from
> Unicode; the stuff that says that you must use shortest-form
> UTF-8 (so not using %xC1.A1 for 'A') because of the problems
> of filters and firewalls not spotting longer forms.

Absolutely not. Again, this document is about escapes for
strings of Unicode characters, not about the general use and
appropriateness of Unicode. And shortest-form UTF-8 is
especially irrelevant because this document is an attempt to
prohibit escapes for UTF-8 entirely where that is still
possible. The business about different, almost-equivalent, forms
of UTF-8 arguably should go into Section 1 or 2.1 as further
evidence that escaped UTF-8 is a bad idea, but I think the point
has been made.
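
To make the delimiter-versus-terminator distinction above
concrete, here is a minimal sketch in Python that recognizes one
delimited form, \u'NNNN', and one terminated form, &#xNNNN;. It
is an illustration only and is not taken from the draft: the
names DELIMITED, TERMINATED and decode_escape are hypothetical,
the digit counts are assumptions, and the range check stands in
for whatever the consuming protocol decides. Surrogates are
rejected and non-characters are deliberately let through, in
line with the C078 and C079 remarks earlier in this message.

```python
import re

# Hypothetical patterns for illustration; neither comes from the draft.
# Delimited form: \u'NNNN', with the hex digits enclosed in quotes.
DELIMITED = re.compile(r"\\u'([0-9A-Fa-f]{4,6})'")
# Terminated form: &#xNNNN;, where ";" ends the escape and "&#x"
# can be read either as an opening delimiter or as a type introducer.
TERMINATED = re.compile(r"&#x([0-9A-Fa-f]{1,6});")


def decode_escape(text):
    """Return the code point named by a single escape, or raise ValueError."""
    match = DELIMITED.fullmatch(text) or TERMINATED.fullmatch(text)
    if match is None:
        raise ValueError("not a recognized escape: %r" % (text,))
    code_point = int(match.group(1), 16)
    # Assumption: the consuming protocol caps values at U+10FFFF.
    if code_point > 0x10FFFF:
        raise ValueError("value is beyond U+10FFFF")
    # Surrogate code points are rejected.
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogate code point")
    # Non-characters such as U+FFFE are deliberately not rejected here.
    return code_point
```

Both decode_escape("&#x1234;") and decode_escape("\\u'1234'")
return 0x1234; the two syntaxes differ only in how the end of
the hex digits is marked.
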

---------------

<rant>

There is a lot of work to be done in the internationalization
area in the IETF. Whether one likes what RFC 2277 says or not,
it has become fairly clear that it is not as close to the last
word on the subject as many of us assumed it would be when it
was written. The suggestions about CharMod, the shortest-form
string issues that caused us to need to revise the UTF-8 specs,
and the recent work on comparators and comparator registries,
are just the beginning of a very long list.

Speaking personally, I'd be really pleased if I were doing a
much smaller percentage of the document-writing in this area.
But, if I'm going to do it, then my editorial judgment and
preferences are going to prevail... at least until the document
gets far enough along that I have to start arm-wrestling with
the RFC Editor and _their_ judgment and preferences. I really
appreciate comments about how to make a document more clear but
an argument about, e.g., U+NNNN[N[N]] versus U+[[N]N]NNNN is
only about taste and is hence a waste of everyone's time.

If you don't like my writing style -- and many people don't --
please take on these efforts yourselves and let me complain
about your style (or not) some of the time. Sniping and
nit-picking is easy and may be fun, but it tends to block,
rather than contribute to, progress. If you are not going to
pick up some of the writing work, unless your purpose is to
introduce delays and lay down obstacles -- which I assume it is
not -- can we please concentrate on substantive issues and
document changes that would have a clear positive impact on
clarity?

</rant>

thanks,
   john