Re: draft-klensin-unicode-escapes-01 (was: New Draft)
John C Klensin <john-ietf@jck.com> Wed, 07 February 2007 21:25 UTC
Date: Wed, 07 Feb 2007 16:25:42 -0500
From: John C Klensin <john-ietf@jck.com>
To: "Clive D.W. Feather" <clive@demon.net>
Subject: Re: draft-klensin-unicode-escapes-01 (was: New Draft)
Message-ID: <CABD1699B87DF7916AE364E7@p3.JCK.COM>
In-Reply-To: <20070207174941.GA64818@finch-staff-1.thus.net>
References: <875A124D75A8B481E176CF06@p3.JCK.COM>
<uppsr2hs59srbd7eufbcul5a1ekl7i09nl@hive.bjoern.hoehrmann.de>
<EF59DA6FD89C4F19750C68C3@p3.JCK.COM>
<20070202114658.GX7742@finch-staff-1.thus.net>
<45C3371E.330F@xyzzy.claranet.de>
<20070202184727.GG68544@finch-staff-1.thus.net>
<B7F8733D73E8CC7227785A69@p3.JCK.COM>
<20070207174941.GA64818@finch-staff-1.thus.net>
X-Mailer: Mulberry/4.0.7 (Win32)
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Content-Disposition: inline
Cc: Frank Ellermann <nobody@xyzzy.claranet.de>, discuss@apps.ietf.org

--On Wednesday, 07 February, 2007 17:49 +0000 "Clive D.W. Feather" <clive@demon.net> wrote:

> However, it does bring something else to mind: whatever escape mechanism is chosen, should there be a limit on the length of the hex string? Should an implementation be expected to correctly handle:
>
>     \u'00000000000000000000000000000000000000000000000000000000000
>     0000001234' or
>
>     \u'00000000000000000000000800000000000000000000000000000000000
>     0000001234' ?
>
> What is "correct" in the latter case?

I have assumed that compatibility with existing practice and good sense suggests that the string should be at least four and no more than eight hex digits in length. Obviously, I didn't write that down. This is probably a strong argument for including ABNF for all of the even slightly-recommended forms (see previous note). It is not fixed in -02, which has been in the posting queue since Monday morning, but I'll try to get it in -03.

>>> Yes, never ever mention that HTML exists, it's horrible. The [Charmod] bible requires no (SGML) nonsense in http://www.w3.org/TR/charmod/#C044
>> Sorry, but, IMO, for a document like this which is merely giving examples (at this point), "very widely deployed and used" trumps "horrible".
>
> Um, I'm not sure what Frank was thinking, but I would say:
> - it's fine to mention HTML;
> - we should *not* mention the construct "<B>&aring</B>" (note the missing semicolon), which SGML allows and HTML says SHOULD NOT be used.

It is not mentioned. Even &aring; is not mentioned. That is a whole different sort of abstraction than what the document talks about. Perhaps it should be mentioned; if you think so, text would be helpful.

>>> ...
>>> Probably the I-D should mention that one famous exception from its rule to avoid encoded UTF-8 is the URL form of IRIs.
>> Why? I don't see that (and a few other cases) as "exceptions" (famous or not), but as mistakes from which we should learn and, I hope, have learned.
>> The document is reasonably clear that it is not a proposal to retrofit any existing protocol.
>
> As someone else asked, what about new protocols based on IRIs?

Words here won't help, IMO. Put differently, this is not the right place to put the words. If we can get reasonable consensus on some variation of this and a few other things wrapped up, I think the right approach is to revisit IRI itself and see if we want to make changes, narrow scope, insert warnings, or whatever. And, in the meantime, I trust that these conversations have raised the issues with the right group to prevent new work from going forward without at least their being considered. I just don't know what else to suggest.

I have personally never been happy about aspects of the IRI spec but, up through and after the time that it was in Last Call, most of my concerns were just intuitions that there were things in it that would get us in trouble in the long term -- intuitions I wasn't able to explain in terms of, e.g., concrete examples. I also have concerns that IRIs addressed a problem that didn't really need solving and ignored the important ones. Today, I could probably do somewhat better at explanation and examples and others reading this could probably do better than I can. But I'm still not ready to try to assemble either "IRIbis" or "IRIs considered harmful". If someone else is, he or she should go to it.

>>> * In 1.1, rather than saying that Unicode occupies "two or more octets", wouldn't it be better to say "21 bits - rather than the 7 bits of ASCII -"?
>> No, because of net-ascii and some other issues, probably not.
>
> "net-ascii"? The only reference I can find is in RFC 1350, where it appears to mean "ASCII with top bit clear, plus some codes from RFC 764 which have the top bit set".

Long discussion. But "ASCII with top bit clear", etc., implies 7-bit ASCII in an 8-bit unit. If one were to transmit ASCII in 7-bit units, there is no top bit to be clear or set.
Some systems, of course, actually did that, even though ASCII on the network was defined in octet terms.

> My point is that Unicode is not octet-based at all - encodings like UTF-8 are, but Unicode isn't. It's a numbering of characters from 0 to 0x10FFFF just like ASCII is a numbering from 0 to 127. Neither are octet based.

On the other hand, UCS-4 was certainly octet based, and UTF-32 is now presented as new terminology for UCS-4. So the situation isn't as clear, or as pure, as your comment above suggests. A bit more on this below.

>> Again, because the important issue here is that this stuff is about escapes, you are picking nits that belong elsewhere (see rant at the end).
>
> Nits have to be picked at some point, and sometimes earlier is better than later.

Ok. It is just wearing me out on a document I somewhat accidentally volunteered to assemble but don't feel strongly committed to. My problem, not yours, unless I give up and you do care about it. Co-authors would be welcome at this point.

>>> * Somewhere in the last two paragraphs of 1.1 you should be talking about mini-languages (e.g. Cosmogol) as well as protocols and UIs.
>> Why? In principle I could talk about Japanese business cards and all sorts of other things too.
>
> Because they are something that RFCs often use or contain, and for which this is highly relevant. Unlike Japanese business cards.

I still don't see what the stopping rule is. And, while I may not be looking in the right places, I don't see Cosmogol (for example) playing the role in IETF protocol definitions that, e.g., ABNF or ASN.1 do.

>> But that is introductory material to help the reader understand the context. Trying to create an exhaustive list would add nothing to the document except more pages and making it harder to read.
>
> I'm not suggesting exhaustive; I'm suggesting that mini-languages are a relevant target for this specification.

If you suggest text that explains why they are relevant and what the impact is, and others agree that it is important enough to justify the added length, I'll happily drop that text in.

>>> * In 3, you're inconsistent between "U+NNN[N[N]]" and "NNN...". Indeed, shouldn't the former actually be "U+[[N]N]NNNN"? (Note both the order and the number of Ns.) I would suggest that better wording might be:
>> Partially fixed in -02. U+NNNN[N[N]] versus U+[[N]N]NNNN is a matter of taste. Clearly, in that pseudo-notation, one "N" is as good as another.
>
> I don't agree, but ....

Maybe this is connected to our slightly different views of octets/bytes (and perhaps "nibbles") above. Or maybe not. And, fwiw, if there were a requirement along the lines of "no leading zeros in strings unless they are needed to make the string at least four digits long", then I would certainly want to write U+[[1]M]NNNN with N in the range 0..F and M in the range 1..F or something like that. But, at least so far, we don't have that rule and, so far, I'm happy to leave deciding on whether or not it is needed to those specifying particular escapes for particular protocols.

>>> * In 4, second bullet, "string terminators" should be "string delimiters".
>> I was deliberately trying to be general. If one has a form like H'nnnn', one is clearly talking about "string delimiters". However, if one has, e.g., &#xNNNN;, it is clear that ";" is a string terminator,
>
> No, it's a delimiter. A string terminator marks the end of the *string* - in C, for example, it's the terminating " in the source code or the zero byte at run-time. A delimiter is something that separates one part of the string from another.

Sigh. I think that "terminator" is still right. If I read your definition, whether it is correct or not depends on the definition of strings and substrings.
And part of that definitional question goes back to discussions long ago that I'll discuss only over appropriate beverages (coming to Prague?). However, I don't think this is worth shedding blood over so, unless someone else objects, -03 will use "delimiter" throughout.

>>> * In 6 you need to copy in all the security stuff from Unicode; the stuff that says that you must use shortest-form UTF-8 (so not using %xC1.A1 for 'A') because of the problems of filters and firewalls not spotting longer forms.
>> Absolutely not. Again, this document is about escapes for strings of Unicode characters, not about the general use and appropriateness of Unicode. And shortest-form UTF-8 is especially irrelevant because this document is an attempt to prohibit escapes for UTF-8 entirely where that is still possible.
>
> You've completely missed my point.
>
> The security section needs wording along the lines of:
>
>     An escape mechanism such as the one specified in this document can allow characters to be represented in more than one way. Where software interprets the escaped form, there is a risk that security checks are done at the wrong point.

While I do not believe this is necessary or the right place to put these sorts of warnings (IMO, they belong in something like 2277bis or a "safely using Unicode" doc), it is at worst harmless, so I'm invoking the "not willing to shed blood" principle. Text included in -03 (with an added comment about checks for minimal or normalized forms).

>     For example, a security system might prohibit the substring "/../" within certain strings. If so, an attacker could attempt to avoid the test by sending "/\u'002E'\u'002E'/" instead. If the security check is made before interpretation of escaped characters, the attack will be successful.

See above.

>> Speaking personally, I'd be really pleased if I were doing a much smaller percentage of the document-writing in this area.
>> But, if I'm going to do it, then my editorial judgment and preferences are going to prevail... at least until the document gets far enough along that I have to start arm-wrestling with the RFC Editor and _their_ judgment and preferences.
>
> Or until Last Call shows that you're in a very small minority.

Which, in the case of _purely_ editorial preferences, involves some probability of the author saying "I don't need this; find someone else to finish the document or let it drop".

>> I really appreciate comments about how to make a document more clear but an argument about, e.g., U+NNNN[N[N]] versus U+[[N]N]NNNN is only about taste and is hence a waste of everyone's time.
>
> No, it isn't. It's showing the *semantics* of the omission.

Not without more words or a more complete definition, IMO. See the comments above about U+[[1]M]NNNN. Do you think that is worth belaboring? Do others?

As I think about it, our difference in view may rest in my viewing U+NNNN[N[N]] pretty much as a name for a concept that I assumed everyone reading the document was likely to already understand. I could, equally comfortably, have just used U+NNNN or U+NNN... and put in some words, once, about actual lengths. In fact, that may have been where the original use of U+NNN came from. You are viewing it, I guess, as a piece of semi-formal metanotation with associated semantics, etc. If I had seen it that way, I would probably have gone all the way to ABNF on the theory that U+[[N]N]NNNN does little more than give a strong hint.

No commitment yet, but, if no one but Clive and I have strong opinions about this, I am likely to invoke the "no bloodshed" rule and change this too... but I'd really like to hear from others about whether to go to U+[[N]N]NNNN or something along the lines of U+[[1]M]NNNN or to giving this critter a name and writing ABNF, thereby ridding ourselves of trying to communicate a lot of information in a 12-character notational form.
If we do go to ABNF, I also need guidance as to whether to write a leading zero rule into it (preferably in the form of the ABNF that people would prefer).

>> If you don't like my writing style -- and many people don't -- please take on these efforts yourselves and let me complain about your style (or not) some of the time.
>
> Oh no. The last time I accepted that challenge, I ended up authoring a 125 page RFC.

And here I assumed that 2821 would never qualify for a brevity award in the messaging space. Time to go add 50 pages to 2821bis :-)

>> Sniping and nit-picking is easy and may be fun, but it tends to block, rather than contribute to, progress.
>
> Excuse me, I'm not sniping (at least, not deliberately; if I'm giving that impression, I apologise). Yes, I'm nit-picking sometimes, but that's something that should be done if the resulting document is to be clear, precise, and useful.
>
> Note, for example, that I'm not nit-picking about what I consider to be wrong spelling and grammar where I'm aware that it's a dialectic variant.

Ok. For various historical reasons, I often actually prefer what I assume is your dialect to my native one. The RFC Editor generally doesn't. If you are so inclined, we should have an offline discussion and then I'll send you the XML and let you have at it -- part of what is going on here is that this document is a very marginal activity for me right now and the time needed to insert trivial and low-priority changes feels burdensome. But, if you felt inclined to make a stylistic editing pass, we should chat to find out whether we can agree on criteria and limits.

>> If you are not going to pick up some of the writing work,
>
> Where I have text to offer - be it 3 words or 3 pages - be assured that I will. See above, for example. But sometimes it's a lot easier to make the general point and let the author apply it.
>...

Ok.
As long as we can both understand that there are limits here, even if we don't precisely agree about where they are, keep those cards and letters coming.

Just for planning purposes, I'll probably get -03, containing the changes discussed above and whatever else comes up (including the results of corrections or arguments about things I've put into -02, which was finally announced about five minutes ago), out in the next week or two. But, if more versions are needed after that, I will have to put the thing aside until post-Prague.

> Not by me. Examples and explanations are often a good thing. I could have made RFC 3977 about 40 to 50 pages shorter if I'd taken that approach.

See snide comment about its length relative to that of 2821 above. We probably mostly agree, but several people in the community don't... or are willing to insist on their favorite example and then complain about length vis-a-vis everyone else's.

> No. I just object to people introducing octets (or, even worse, "bytes") when they're unnecessary and - to my mind - confuse the issue.

And most of us who first encountered the Internet or ARPANET, or computer systems generally, at a time when "characters" could come in 5, 6, 7, 8, 9, or 12 bit units (and maybe some others), with or without padding, are sensitive to that issue too. I just don't know whether this document is the right place to make that point, partially because of the UCS-4 problem mentioned above (and the corresponding original definition of ISO 10646 as a 32-bit character set) and partially because I have no real confidence that the 21-bit limit will last (any more than seven or eight bit limits on characters or eight or 32 bit limits on addresses have lasted). So, in practical terms, I'm strongly inclined to say that, if we have a variable-length delimited string, it can be up to eight hex digits long and folks need to be prepared to parse that much although individual protocols adopting escapes can impose a leading zero rule.
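
A minimal sketch of what that parsing rule might look like, in Python. It assumes the \u'...' delimited form used in the examples quoted in this thread and the four-to-eight hex digit bounds discussed above. The function name, the `forbid_extra_leading_zeros` flag, and the `max_code_point` default are illustrative assumptions, not anything the draft specifies.

```python
import re

# Assumed escape form: \u'HEX' with 4 to 8 hex digits between the delimiters.
_ESCAPE = re.compile(r"\\u'([0-9A-Fa-f]{4,8})'")


def parse_escape(text, forbid_extra_leading_zeros=False, max_code_point=0x10FFFF):
    """Return the code point named by one \\u'...' escape, or raise ValueError."""
    match = _ESCAPE.fullmatch(text)
    if match is None:
        # Covers too few digits, too many digits, and missing delimiters.
        raise ValueError("expected \\u'HEX' with 4 to 8 hex digits")
    digits = match.group(1)
    # Optional per-protocol rule (hypothetical flag): zeros are allowed only
    # as padding up to the four-digit minimum.
    if forbid_extra_leading_zeros and len(digits) > 4 and digits[0] == "0":
        raise ValueError("leading zeros beyond four digits are not allowed")
    value = int(digits, 16)
    # The ceiling is a parameter because the 21-bit limit may not last.
    if value > max_code_point:
        raise ValueError("value is outside the permitted range")
    return value


print(hex(parse_escape("\\u'002E'")))
```

Under these assumptions, an over-long string of the kind raised at the top of this message is rejected outright instead of being interpreted, and the leading zero rule stays a per-protocol choice.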

> I note, for example, that the entire POSIX process didn't understand the difference between the two until I pointed it out, and pointed out some of the wording changes needed to deal with it. At which point they decided that POSIX would have to be limited to implementations where they are the same.

Sigh. Unfortunately, not the only thing wrong in the POSIX process. We should have a drink sometime and swap stories.

best,
   john