Re: draft-klensin-unicode-escapes-01 (was: New Draft)

"Clive D.W. Feather" <clive@demon.net> Wed, 07 February 2007 17:50 UTC

Date: Wed, 07 Feb 2007 17:49:41 +0000
From: "Clive D.W. Feather" <clive@demon.net>
To: John C Klensin <john-ietf@jck.com>
Subject: Re: draft-klensin-unicode-escapes-01 (was: New Draft)
Cc: Frank Ellermann <nobody@xyzzy.claranet.de>, discuss@apps.ietf.org

John C Klensin said:
>>> A few other conformance criteria in [Charmod] might be also
>>> interesting: http://www.w3.org/TR/charmod/#C070  Don't
>>> exclude arbitrary code points
[...]

> But _this_ document is about escaping characters, or really code
> points, not about what characters should be permitted or how
> they should be used.

Agreed.

>   C070: irrelevant.   if it is in range, this works.

Okay.

>   C077: needs to be fixed somewhere else.  This is just about a
> syntax for escapes

Mumble. Anything above U+10FFFF is, by definition, not Unicode (okay, not
Unicode today). But I'm not going to fight this one any more.

However, it does bring something else to mind: whatever escape mechanism is
chosen, should there be a limit on the length of the hex string? Should an
implementation be expected to correctly handle:

  \u'000000000000000000000000000000000000000000000000000000000000000001234'
or
  \u'000000000000000000000008000000000000000000000000000000000000000001234'
?

What is "correct" in the latter case?
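To make the question concrete, here is a minimal sketch of one possible
policy, not something the draft specifies: ignore leading zeros however many
there are, and reject any value above U+10FFFF. The names decode_escape,
ESCAPE and MAX_CODE_POINT are hypothetical, and the \u'...' syntax is simply
the one used in the examples above.

```python
import re

MAX_CODE_POINT = 0x10FFFF  # highest code point Unicode defines today

# Hypothetical pattern for the \u'...' form used in the examples above.
ESCAPE = re.compile(r"\\u'([0-9A-Fa-f]+)'")


def decode_escape(text):
    """Return the code point named by a single \\u'...' escape.

    Assumed policy: leading zeros are stripped first, so the length of the
    hex string alone never causes a failure; only the value is checked.
    """
    match = ESCAPE.fullmatch(text)
    if match is None:
        raise ValueError("not a \\u'...' escape")
    digits = match.group(1).lstrip("0") or "0"
    if len(digits) > 6:
        # More than six significant hex digits cannot fit in 0..0x10FFFF.
        raise ValueError("value out of Unicode range")
    value = int(digits, 16)
    if value > MAX_CODE_POINT:
        raise ValueError("value out of Unicode range")
    return value
```

Under this policy the first string above decodes to U+1234 and the second is
rejected. Other answers are defensible (for instance, capping the total
number of digits); the point is only that the specification ought to pick
one.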

>   C078: Already prohibited

Okay.

>   C079: Precisely the reason why one might want escapes is to be
> able to deal with non-characters in a sensible way.

Very true, and a point I should have realized.

>> Yes, never ever mention that HTML exists, it's horrible.  The
>> [Charmod] bible requires no (SGML) nonsense in
>> http://www.w3.org/TR/charmod/#C044
> Sorry, but, IMO, for a document like this which is merely giving
> examples (at this point), "very widely deployed and used" trumps
> "horrible".

Um, I'm not sure what Frank was thinking, but I would say:
- it's fine to mention HTML;
- we should *not* mention the construct "<B>&aring</B>" (note the missing
  semicolon), which SGML allows and HTML says SHOULD NOT be used.

>>...
>> Probably the I-D should mention that one famous exception from
>> its rule to avoid encoded UTF-8 is the URL form of IRIs.
> Why?  I don't see that (and a few other cases) as "exceptions"
> (famous or not), but as mistakes from which we should learn and,
> I hope, have learned.   The document is reasonably clear that it
> is not a proposal to retrofit any existing protocol.

As someone else asked, what about new protocols based on IRIs?

>> * In 1.1, rather than saying that Unicode occupies "two or
>> more octets", wouldn't it be better to say "21 bits - rather
>> than the 7 bits of ASCII -"?
> No, because of net-ascii and some other issues, probably not.

"net-ascii"? The only reference I can find is in RFC 1350, where it appears
to mean "ASCII with top bit clear, plus some codes from RFC 764 which have
the top bit set".

My point is that Unicode is not octet-based at all - encodings like UTF-8
are, but Unicode isn't. It's a numbering of characters from 0 to 0x10FFFF
just like ASCII is a numbering from 0 to 127. Neither is octet-based.
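
A small sketch of the distinction, using only Python built-ins; the helper
name octet_lengths is made up for illustration. The code point is a single
number, and octets appear only once an encoding is chosen.

```python
def octet_lengths(ch):
    """Map a few encoding names to the number of octets each uses for ch."""
    return {name: len(ch.encode(name))
            for name in ("utf-8", "utf-16-be", "utf-32-be")}


ch = "\u00e5"                 # LATIN SMALL LETTER A WITH RING ABOVE
code_point = ord(ch)          # one number, independent of any encoding
sizes = octet_lengths(ch)     # octet counts differ from encoding to encoding

# The two numbering ranges, expressed in bits rather than octets.
unicode_bits = (0x10FFFF).bit_length()   # 21
ascii_bits = (127).bit_length()          # 7
```

The same code point, 0xE5, takes two octets in UTF-8 and UTF-16 and four in
UTF-32, which is why describing Unicode itself as "two or more octets" mixes
up the numbering with its encodings.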

> Again, because the important issue here is that this stuff is
> about escapes, you are picking nits that belong elsewhere (see
> rant at the end).

Nits have to be picked at some point, and sometimes earlier is better than
later.

>> * Somewhere in the last two paragraphs of 1.1 you should be
>> talking about mini-languages (e.g. Cosmogol) as well as
>> protocols and UIs.
> Why?  In principle I could talk about Japanese business cards
> and all sorts of other things too.

Because they are something that RFCs often use or contain, and for which
this is highly relevant. Unlike Japanese business cards.

> But that is introductory
> material to help the reader understand the context.  Trying to
> create an exhaustive list would add nothing to the document
> except more pages and making it harder to read.

I'm not suggesting exhaustive; I'm suggesting that mini-languages are a
relevant target for this specification.

>> * In 3, you're inconsistent between "U+NNN[N[N]]" and
>> "NNN...". Indeed, shouldn't the former actually be
>> "U+[[N]N]NNNN"? (Note both the order and the number of Ns.) I
>> would suggest that better wording might be:
> Partially fixed in -02.   U+NNNN[N[N]] versus U+[[N]N]NNNN is a
> matter of taste.  Clearly, in that pseudo-notation, one "N" is
> as good as another.

I don't agree, but ....

>> * In 4, second bullet, "string terminators" should be "string
>> delimiters".
> I was deliberately trying to be general.  If one has a form like
> H'nnnn', one is clearly talking about "string delimiters".
> However, if one has, e.g., &#xNNNN;, it is clear that ";" is a
> string terminator,

No, it's a delimiter. A string terminator marks the end of the *string* -
in C, for example, it's the terminating " in the source code or the zero
byte at run-time. A delimiter is something that separates one part of the
string from another.

> but whether there is a starting delimiter at
> all depends on how one defines things in metalanguage or words.

Delimiters don't have to be in pairs.

>> * In 6 you need to copy in all the security stuff from
>> Unicode; the stuff that says that you must use shortest-form
>> UTF-8 (so not using %xC1.A1 for 'a') because of the problems
>> of filters and firewalls not spotting longer forms.
> Absolutely not.  Again, this document is about escapes for
> strings of Unicode characters, not about the general use and
> appropriateness of Unicode.   And shortest-form UTF-8 is
> especially irrelevant because this document is an attempt to
> prohibit escapes for UTF-8 entirely where that is still
> possible.

You've completely missed my point.

The security section needs wording along the lines of:

    An escape mechanism such as the one specified in this document can
    allow characters to be represented in more than one way. Where
    software interprets the escaped form, there is a risk that security
    checks are done at the wrong point.

    For example, a security system might prohibit the substring
    "/../" within certain strings. If so, an attacker could attempt to
    avoid the test by sending "/\u'002E'\u'002E'/" instead. If the
    security check is made before interpretation of escaped characters,
    the attack will be successful.
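
The ordering problem can be sketched as follows. This is an illustration
under assumptions, not text for the draft: unescape, is_safe_naive and
is_safe are hypothetical names, and the escape syntax is the \u'...' form
from the example above.

```python
import re

# Hypothetical pattern for the \u'...' form, four to six hex digits.
ESCAPE = re.compile(r"\\u'([0-9A-Fa-f]{4,6})'")


def unescape(text):
    """Replace each \\u'...' escape with the character it names.

    chr() raises ValueError for values above 0x10FFFF.
    """
    return ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), text)


def is_safe_naive(path):
    # Check made before interpretation: the escaped form slips through.
    return "/../" not in path


def is_safe(path):
    # Interpret first, then check the text the consumer will actually see.
    return "/../" not in unescape(path)


attack = "/\\u'002E'\\u'002E'/"
print(is_safe_naive(attack))  # True: the filter is fooled
print(is_safe(attack))        # False: the traversal is caught
```

The design point is that the check has to run on the same form of the string
that the consuming code will act on.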

> Speaking personally, I'd be really pleased if I were doing a
> much smaller percentage of the document-writing in this area.
> But, if I'm going to do it, then my editorial judgment and
> preferences are going to prevail... at least until the document
> gets far enough along that I have to start arm-wrestling with
> the RFC Editor and _their_ judgment and preferences.

Or until Last Call shows that you're in a very small minority.

> I really
> appreciate comments about how to make a document more clear but
> an argument about, e.g., U+NNNN[N[N]] versus U+[[N]N]NNNN is
> only about taste and is hence a waste of everyone's time.

No, it isn't. It's showing the *semantics* of the omission.
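
To illustrate what the bracket placement conveys: the U+ notation uses at
least four hex digits, and any digits left out are leading ones. A minimal
sketch, with the function name u_plus invented for the example:

```python
def u_plus(code_point):
    """Format a code point in U+ notation: four to six hex digits.

    Values below 0x1000 are padded with leading zeros to four digits;
    larger values simply use as many digits as they need.
    """
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("not a Unicode code point")
    return f"U+{code_point:04X}"


print(u_plus(0xE5))      # U+00E5
print(u_plus(0x1F600))   # U+1F600
print(u_plus(0x10FFFF))  # U+10FFFF
```

The fifth and sixth digits, when present, appear on the left, which is what
"U+[[N]N]NNNN" says and "U+NNNN[N[N]]" does not.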

> If you don't like my writing style -- and many people don't --
> please take on these efforts yourselves and let me complain
> about your style (or not) some of the time.

Oh no. The last time I accepted that challenge, I ended up authoring a
125-page RFC.

> Sniping and
> nit-picking is easy and may be fun, but it tends to block,
> rather than contribute to, progress.

Excuse me, I'm not sniping (at least, not deliberately; if I'm giving that
impression, I apologise). Yes, I'm nit-picking sometimes, but that's
something that should be done if the resulting document is to be clear,
precise, and useful.

Note, for example, that I'm not nit-picking about what I consider to be
wrong spelling and grammar where I'm aware that it's a dialectal variant.

> If you are not going to
> pick up some of the writing work,

Where I have text to offer - be it 3 words or 3 pages - be assured that I
will. See above, for example. But sometimes it's a lot easier to make the
general point and let the author apply it.

> unless your purpose is to
> introduce delays and lay down obstacles -- which I assume it is
> not--

It is not.

> can we please concentrate on substantive issues and
> document changes that would have a clear positive impact on
> clarity?

If I didn't think my suggestions had a positive impact, I wouldn't make
them. Equally, sometimes I am asking questions or trying to get an idea of
what people think. Perhaps *I*'m in the minority.

> I'm willing
> to be convinced but note that every time I put something into a
> document that is not strictly necessary I get attacked for
> excessive length, etc.

Not by me. Examples and explanations are often a good thing. I could have
made RFC 3977 about 40 to 50 pages shorter if I'd taken that approach.

Frank wrote:
> Yes, the matter of 21 vs. 31 bits was recently discussed on the
> Unicode list in conjunction with a (hypothetical) "UTF-21", maybe
> Clive had that discussion in mind.

No. I just object to people introducing octets (or, even worse, "bytes")
when they're unnecessary and - to my mind - confuse the issue.

I note, for example, that the entire POSIX process didn't understand the
difference between the two until I pointed it out, and pointed out some of
the wording changes needed to deal with it. At which point they decided
that POSIX would have to be limited to implementations where they are the
same.

-- 
Clive D.W. Feather  | Work:  <clive@demon.net>   | Tel:    +44 20 8495 6138
Internet Expert     | Home:  <clive@davros.org>  | Fax:    +44 870 051 9937
Demon Internet      | WWW: http://www.davros.org | Mobile: +44 7973 377646
THUS plc            |                            |