Re: draft-klensin-unicode-escapes-01 (was: New Draft)
"Clive D.W. Feather" <clive@demon.net> Wed, 07 February 2007 17:50 UTC
Date: Wed, 07 Feb 2007 17:49:41 +0000
From: "Clive D.W. Feather" <clive@demon.net>
To: John C Klensin <john-ietf@jck.com>
Subject: Re: draft-klensin-unicode-escapes-01 (was: New Draft)
Cc: Frank Ellermann <nobody@xyzzy.claranet.de>, discuss@apps.ietf.org
John C Klensin said:
>>> A few other conformance criteria in [Charmod] might also be
>>> interesting: http://www.w3.org/TR/charmod/#C070 Don't
>>> exclude arbitrary code points
[...]
> But _this_ document is about escaping characters, or really code
> points, not about what characters should be permitted or how
> they should be used.

Agreed.

> C070: irrelevant. If it is in range, this works.

Okay.

> C077: needs to be fixed somewhere else. This is just about a
> syntax for escapes

Mumble. Anything above U+10FFFF is, by definition, not Unicode (okay,
not Unicode today). But I'm not going to fight this one any more.

However, it does bring something else to mind: whatever escape
mechanism is chosen, should there be a limit on the length of the hex
string? Should an implementation be expected to correctly handle:

    \u'000000000000000000000000000000000000000000000000000000000000000001234'
or
    \u'000000000000000000000008000000000000000000000000000000000000000001234'
? What is "correct" in the latter case?

> C078: Already prohibited

Okay.

> C079: Precisely the reason why one might want escapes is to be
> able to deal with non-characters in a sensible way.

Very true, and a point I should have realized.

>> Yes, never ever mention that HTML exists, it's horrible. The
>> [Charmod] bible requires no (SGML) nonsense in
>> http://www.w3.org/TR/charmod/#C044

> Sorry, but, IMO, for a document like this which is merely giving
> examples (at this point), "very widely deployed and used" trumps
> "horrible".

Um, I'm not sure what Frank was thinking, but I would say:
- it's fine to mention HTML;
- we should *not* mention the construct "<B>&aring</B>" (note the
  missing semicolon), which SGML allows and HTML says SHOULD NOT be
  used.

>>...
>> Probably the I-D should mention that one famous exception from
>> its rule to avoid encoded UTF-8 is the URL form of IRIs.

> Why?
> I don't see that (and a few other cases) as "exceptions"
> (famous or not), but as mistakes from which we should learn and,
> I hope, have learned. The document is reasonably clear that it
> is not a proposal to retrofit any existing protocol.

As someone else asked, what about new protocols based on IRIs?

>> * In 1.1, rather than saying that Unicode occupies "two or
>> more octets", wouldn't it be better to say "21 bits - rather
>> than the 7 bits of ASCII -"?

> No, because of net-ascii and some other issues, probably not.

"net-ascii"? The only reference I can find is in RFC 1350, where it
appears to mean "ASCII with top bit clear, plus some codes from
RFC 764 which have the top bit set".

My point is that Unicode is not octet-based at all - encodings like
UTF-8 are, but Unicode isn't. It's a numbering of characters from 0 to
0x10FFFF just like ASCII is a numbering from 0 to 127. Neither is
octet-based.

> Again, because the important issue here is that this stuff is
> about escapes, you are picking nits that belong elsewhere (see
> rant at the end).

Nits have to be picked at some point, and sometimes earlier is better
than later.

>> * Somewhere in the last two paragraphs of 1.1 you should be
>> talking about mini-languages (e.g. Cosmogol) as well as
>> protocols and UIs.

> Why? In principle I could talk about Japanese business cards
> and all sorts of other things too.

Because they are something that RFCs often use or contain, and for
which this is highly relevant. Unlike Japanese business cards.

> But that is introductory
> material to help the reader understand the context. Trying to
> create an exhaustive list would add nothing to the document
> except more pages and making it harder to read.

I'm not suggesting exhaustive; I'm suggesting that mini-languages are a
relevant target for this specification.

>> * In 3, you're inconsistent between "U+NNN[N[N]]" and
>> "NNN...". Indeed, shouldn't the former actually be
>> "U+[[N]N]NNNN"?
>> (Note both the order and the number of Ns.) I
>> would suggest that better wording might be:

> Partially fixed in -02. U+NNNN[N[N]] versus U+[[N]N]NNNN is a
> matter of taste. Clearly, in that pseudo-notation, one "N" is
> as good as another.

I don't agree, but ....

>> * In 4, second bullet, "string terminators" should be "string
>> delimiters".

> I was deliberately trying to be general. If one has a form like
> H'nnnn', one is clearly talking about "string delimiters".
> However, if one has, e.g., &#xNNNN;, it is clear that ";" is a
> string terminator,

No, it's a delimiter. A string terminator marks the end of the
*string* - in C, for example, it's the terminating " in the source
code or the zero byte at run-time. A delimiter is something that
separates one part of the string from another.

> but whether there is a starting delimiter at
> all depends on how one defines things in metalanguage or words.

Delimiters don't have to be in pairs.

>> * In 6 you need to copy in all the security stuff from
>> Unicode; the stuff that says that you must use shortest-form
>> UTF-8 (so not using %xC1.A1 for 'a') because of the problems
>> of filters and firewalls not spotting longer forms.

> Absolutely not. Again, this document is about escapes for
> strings of Unicode characters, not about the general use and
> appropriateness of Unicode. And shortest-form UTF-8 is
> especially irrelevant because this document is an attempt to
> prohibit escapes for UTF-8 entirely where that is still
> possible.

You've completely missed my point. The security section needs wording
along the lines of:

    An escape mechanism such as the one specified in this document can
    allow characters to be represented in more than one way. Where
    software interprets the escaped form, there is a risk that security
    checks are done at the wrong point. For example, a security system
    might prohibit the substring "/../" within certain strings.
    If so, an attacker could attempt to avoid the test by sending
    "/\u'002E'\u'002E'/" instead. If the security check is made before
    interpretation of escaped characters, the attack will be
    successful.

> Speaking personally, I'd be really pleased if I were doing a
> much smaller percentage of the document-writing in this area.
> But, if I'm going to do it, then my editorial judgment and
> preferences are going to prevail... at least until the document
> gets far enough along that I have to start arm-wrestling with
> the RFC Editor and _their_ judgment and preferences.

Or until Last Call shows that you're in a very small minority.

> I really
> appreciate comments about how to make a document more clear but
> an argument about, e.g., U+NNNN[N[N]] versus U+[[N]N]NNNN is
> only about taste and is hence a waste of everyone's time.

No, it isn't. It's showing the *semantics* of the omission.

> If you don't like my writing style -- and many people don't --
> please take on these efforts yourselves and let me complain
> about your style (or not) some of the time.

Oh no. The last time I accepted that challenge, I ended up authoring a
125 page RFC.

> Sniping and
> nit-picking is easy and may be fun, but it tends to block,
> rather than contribute to, progress.

Excuse me, I'm not sniping (at least, not deliberately; if I'm giving
that impression, I apologise). Yes, I'm nit-picking sometimes, but
that's something that should be done if the resulting document is to be
clear, precise, and useful. Note, for example, that I'm not nit-picking
about what I consider to be wrong spelling and grammar where I'm aware
that it's a dialectal variant.

> If you are not going to
> pick up some of the writing work,

Where I have text to offer - be it 3 words or 3 pages - be assured
that I will. See above, for example. But sometimes it's a lot easier
to make the general point and let the author apply it.
> unless your purpose is to
> introduce delays and lay down obstacles -- which I assume it is
> not--

It is not.

> can we please concentrate on substantive issues and
> document changes that would have a clear positive impact on
> clarity?

If I didn't think my suggestions had a positive impact, I wouldn't make
them. Equally, sometimes I am asking questions or trying to get an idea
of what people think. Perhaps *I*'m in the minority.

> I'm willing
> to be convinced but note that every time I put something into a
> document that is not strictly necessary I get attacked for
> excessive length, etc.

Not by me. Examples and explanations are often a good thing. I could
have made RFC 3977 about 40 to 50 pages shorter if I'd taken that
approach.

Frank wrote:
> Yes, the matter of 21 vs. 31 bits was recently discussed on the
> Unicode list in conjunction with a (hypothetical) "UTF-21", maybe
> Clive had that discussion in mind.

No. I just object to people introducing octets (or, even worse,
"bytes") when they're unnecessary and - to my mind - confuse the issue.
I note, for example, that the entire POSIX process didn't understand
the difference between the two until I pointed it out, and pointed out
some of the wording changes needed to deal with it. At which point they
decided that POSIX would have to be limited to implementations where
they are the same.

-- 
Clive D.W. Feather  | Work:  <clive@demon.net>   | Tel:    +44 20 8495 6138
Internet Expert     | Home:  <clive@davros.org>  | Fax:    +44 870 051 9937
Demon Internet      | WWW: http://www.davros.org | Mobile: +44 7973 377646
THUS plc            |                            |
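
The ordering problem in the suggested security wording above can be
sketched in Python. This is only an illustration, not anything taken
from the draft: the \u'HEX' syntax is borrowed from the examples in
this message, while the function names (decode_escapes,
check_then_decode, decode_then_check), the acceptance of any number of
hex digits, and the rejection of values above U+10FFFF are all
assumptions made for the sketch.

```python
import re

# Hypothetical escape form \u'HEX', borrowed from the examples in this thread.
ESCAPE = re.compile(r"\\u'([0-9A-Fa-f]+)'")
MAX_CODE_POINT = 0x10FFFF


def decode_escapes(text):
    """Replace each escape with the character whose code point it names."""
    def replace(match):
        value = int(match.group(1), 16)  # any number of hex digits is accepted
        if value > MAX_CODE_POINT:
            # Assumption: out-of-range values are rejected, not truncated.
            raise ValueError("escape out of Unicode range: " + match.group(0))
        return chr(value)
    return ESCAPE.sub(replace, text)


def is_forbidden(text):
    """The example security rule: the substring "/../" is prohibited."""
    return "/../" in text


def check_then_decode(text):
    """Vulnerable order: the check only ever sees the escaped form."""
    if is_forbidden(text):
        raise PermissionError("forbidden substring")
    return decode_escapes(text)


def decode_then_check(text):
    """Safer order: the check sees the same string the consumer will see."""
    decoded = decode_escapes(text)
    if is_forbidden(decoded):
        raise PermissionError("forbidden substring")
    return decoded
```

With the attack string "/\u'002E'\u'002E'/", check_then_decode passes
"/../" through, while decode_then_check raises PermissionError. Under
the same assumptions the decoder accepts the first long hex string
quoted earlier (leading zeros only) and rejects the second; whether
that is the "correct" behaviour is exactly the open question.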