Re: draft-klensin-unicode-escapes-01 (was: New Draft)

"Clive D.W. Feather" <clive@demon.net> Fri, 09 February 2007 14:24 UTC

Received: from [127.0.0.1] (helo=stiedprmman1.va.neustar.com) by megatron.ietf.org with esmtp (Exim 4.43) id 1HFWgD-00014Z-HE; Fri, 09 Feb 2007 09:24:49 -0500
Received: from [10.91.34.44] (helo=ietf-mx.ietf.org) by megatron.ietf.org with esmtp (Exim 4.43) id 1HFWgD-00014U-4F for discuss@apps.ietf.org; Fri, 09 Feb 2007 09:24:49 -0500
Received: from anchor-internal-1.mail.demon.net ([195.173.56.100]) by ietf-mx.ietf.org with esmtp (Exim 4.43) id 1HFWgA-0000n1-8E for discuss@apps.ietf.org; Fri, 09 Feb 2007 09:24:49 -0500
Received: from finch-staff-1.server.demon.net (finch-staff-1.server.demon.net [193.195.224.1]) by anchor-internal-1.mail.demon.net with ESMTP id l19EOhAc009068; Fri, 9 Feb 2007 14:24:44 GMT
Received: from clive by finch-staff-1.server.demon.net with local (Exim 3.36 #1) id 1HFWfh-000AaJ-00; Fri, 09 Feb 2007 14:24:17 +0000
Date: Fri, 09 Feb 2007 14:24:17 +0000
From: "Clive D.W. Feather" <clive@demon.net>
To: John C Klensin <john-ietf@jck.com>
Subject: Re: draft-klensin-unicode-escapes-01 (was: New Draft)
Message-ID: <20070209142417.GK18441@finch-staff-1.thus.net>
References: <875A124D75A8B481E176CF06@p3.JCK.COM> <uppsr2hs59srbd7eufbcul5a1ekl7i09nl@hive.bjoern.hoehrmann.de> <EF59DA6FD89C4F19750C68C3@p3.JCK.COM> <20070202114658.GX7742@finch-staff-1.thus.net> <45C3371E.330F@xyzzy.claranet.de> <20070202184727.GG68544@finch-staff-1.thus.net> <B7F8733D73E8CC7227785A69@p3.JCK.COM> <20070207174941.GA64818@finch-staff-1.thus.net> <CABD1699B87DF7916AE364E7@p3.JCK.COM>
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Disposition: inline
In-Reply-To: <CABD1699B87DF7916AE364E7@p3.JCK.COM>
User-Agent: Mutt/1.5.3i
X-Spam-Score: 0.0 (/)
X-Scan-Signature: e367d58950869b6582535ddf5a673488
Cc: Frank Ellermann <nobody@xyzzy.claranet.de>, discuss@apps.ietf.org
X-BeenThere: discuss@apps.ietf.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: general discussion of application-layer protocols <discuss.apps.ietf.org>
List-Unsubscribe: <https://www1.ietf.org/mailman/listinfo/discuss>, <mailto:discuss-request@apps.ietf.org?subject=unsubscribe>
List-Post: <mailto:discuss@apps.ietf.org>
List-Help: <mailto:discuss-request@apps.ietf.org?subject=help>
List-Subscribe: <https://www1.ietf.org/mailman/listinfo/discuss>, <mailto:discuss-request@apps.ietf.org?subject=subscribe>
Errors-To: discuss-bounces@apps.ietf.org

John C Klensin said:
>> However, it does bring something else to mind: whatever escape
>> mechanism is chosen, should there be a limit on the length of
>> the hex string?

> I have assumed that compatibility with existing practice and
> good sense suggests that the string should be at least four and
> no more than eight hex digits in length.  Obviously, I didn't
> write that down.  This is probably a strong argument for
> including ABNF for all of the even slightly-recommended forms
> (see previous note).   It is not fixed in -02, which has been in
> the posting queue since Monday morning, but I'll try to get it
> in -03.

Okay.

>> Um, I'm not sure what Frank was thinking, but I would say:
>> - it's fine to mention HTML;
>> - we should *not* mention the construct "<B>&aring</B>" (note
>> the missing semicolon), which SGML allows and HTML says
>> SHOULD NOT be used.
> It is not mentioned.   Even &aring; is not mentioned.  That is a
> whole different sort of abstraction than what the document talks
> about.

Sorry, I shouldn't have mentioned &aring; in my example. I see that -02
talks about the missing semicolon, so I'm happy that this be closed.


>> My point is that Unicode is not octet-based at all - encodings
>> like UTF-8 are, but Unicode isn't. It's a numbering of
>> characters from 0 to 0x10FFFF just like ASCII is a numbering
>> from 0 to 127. Neither are octet based.
> On the other hand, UCS-4 was certainly octet based, and UTF-32
> is now presented as new terminology for UCS-4.   So the
> situation isn't as clear, or as pure, as your comment above
> suggests.

Not quite, no.
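
To make the distinction concrete, here is a minimal Python sketch (not taken
from the draft; the sample character and the helper name u_plus are arbitrary
choices for illustration). The code point is a plain integer, and octets only
appear once an encoding form has been chosen:

```python
# A code point is just a number; octets appear only after choosing an encoding.
ch = "\u00e5"  # LATIN SMALL LETTER A WITH RING ABOVE, chosen only as an example

code_point = ord(ch)                    # an integer with no inherent width
utf8_octets = ch.encode("utf-8")        # two octets in UTF-8
utf32_octets = ch.encode("utf-32-be")   # four octets in big-endian UTF-32


def u_plus(cp: int) -> str:
    """Format a code point as U+NNNN, padded to at least four hex digits."""
    return f"U+{cp:04X}"
```

The same integer yields a different number of octets in each encoding form,
which is why the numbering itself is better described without reference to
octets.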

>> Nits have to be picked at some point, and sometimes earlier is
>> better than later.
> Ok.   It is just wearing me out on a document I somewhat
> accidentally volunteered to assemble but don't feel strongly
> committed to.   My problem, not yours, unless I give up and you
> do care about it.

Sorry, I didn't mean to stick a huge load on you by my comments.

> Co-authors would be welcome at this point.

Not at the moment, I'm afraid. If things change at work, then perhaps.

>>>> * Somewhere in the last two paragraphs of 1.1 you should be
>>>> talking about mini-languages (e.g. Cosmogol) as well as
>>>> protocols and UIs.

> I still don't see what the stopping rule is.  And, while I may
> not be looking in the right places, I don't see Cosmogol (for
> example) playing the role in IETF protocol definitions that,
> e.g., ABNF or ASN.1 do.

Well, ABNF is another mini-language that might benefit from a way to handle
Unicode.

> If you suggest text that explains why they are relevant and what
> the impact is, and others agree that it is important enough to
> justify the added length, I'll happily drop that text in.

In -02, change the last paragraph of 1.1 to:

    In addition to the protocol contexts addressed in this specification,
    escapes to represent Unicode characters could also be useful in formal
    languages (such as ABNF and Cosmogol) and in presentations to users
    (i.e. user interfaces). The formats specified in, and the reasoning of,
    this document may be applicable to these contexts as well, but this is
    not a proposal to standardise them.

That would satisfy me.

> And, fwiw, if there were a requirement along the lines of "no
> leading zeros in strings unless they are needed to make the
> string at least four digits long", then I would certainly want
> to write
>    U+[[1]M]NNNN
> with N in the range 0..F and
> M in the range 1..F
> or something like that.

Something like that, yes, since as written you've ruled out U+10xxxx.

That's also, perhaps, going too far in putting the detail in the text.
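
A quick way to check the point about U+10xxxx is to turn the quoted notation
into a regular expression. This is only a sketch of one literal reading
(assumptions: square brackets mean "optional", hex digits are uppercase, and
the names LITERAL and matches_literal are invented here):

```python
import re

# Literal reading of U+[[1]M]NNNN with N in 0..F and M in 1..F.
# Assumption: brackets mean "optional" and only uppercase hex is used.
LITERAL = re.compile(r"U\+(?:1?[1-9A-F])?[0-9A-F]{4}")


def matches_literal(text: str) -> bool:
    """Return True if text matches the literal reading of the notation."""
    return LITERAL.fullmatch(text) is not None
```

Under this reading U+00E5 and U+1F600 match, but U+100000 and U+10FFFF do
not, because the digit after the leading "1" would have to be an M, and M
excludes zero.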

>>>> * In 4, second bullet, "string terminators" should be "string
>>>> delimiters".
[...]
> Sigh.  I think that "terminator" is still right.  If I read your
> definition, whether it is correct or not depends on the
> definition of strings and substrings. And part of that
> definitional question goes back to discussions long ago that
> I'll discuss only over appropriate beverages (coming to Prague?).

Regrettably not - I can't justify the time. But if you're ever in London or
Cambridge ....

>> The security section needs wording along the lines of:
>> 
>>     An escape mechanism such as the one specified in this
>> document can allow characters to be represented in more
>> than one way. Where software interprets the escaped form,
>> there is a risk that security checks are done at the wrong
>> point.
> 
> While I do not believe this is necessary or the right place to
> put these sorts of warnings (IMO, they belong in something like
> 2277bis or a "safely using Unicode" doc), it is at worst
> harmless, so I'm invoking the "not willing to shed blood"
> principle.

Thanks.

I can agree with "not necessary"; I just think it's a good idea (see other
discussions on minimalism).
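
The risk described in the proposed wording can be shown with a small Python
sketch. Everything here is hypothetical: the \uNNNN escape form, the toy
is_safe check, and the sample input are illustrations only, not forms or
rules taken from the draft.

```python
import re

# Hypothetical escape form \uNNNN (exactly four hex digits), for illustration.
ESCAPE = re.compile(r"\\u([0-9A-Fa-f]{4})")


def decode_escapes(text: str) -> str:
    """Replace each \\uNNNN escape with the character it names."""
    return ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), text)


def is_safe(text: str) -> bool:
    """Toy security check: forbid the path separator."""
    return "/" not in text


raw = r"..\u002Fetc"  # the "/" is present only in escaped form

checked_too_early = is_safe(raw)                     # check before decoding
checked_after_decode = is_safe(decode_escapes(raw))  # check after decoding
```

The check passes on the escaped form and fails on the decoded form, so the
check has to be made at or after the point where the escape is interpreted.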

>>> But, if I'm going to do it, then my editorial judgment and
>>> preferences are going to prevail...
>> Or until Last Call shows that you're in a very small minority.
> Which, in the case of _purely_ editorial preferences, involves
> some probability of the author saying "I don't need this; find
> someone else to finish the document or let it drop".

True.

> As I think about it, our difference in view may rest in my
> viewing U+NNNN[N[N]] pretty much as a name for a concept that I
> assumed everyone reading the document was likely to already
> understand.  I could, equally comfortably, have just used U+NNNN
> or U+NNN... and put in some words, once, about actual lengths.
> In fact, that may have been where the original use of U+NNN came
> from.   You are viewing it, I guess, as a piece of semi-formal
> metanotation with associated semantics, etc.

Or, more precisely, I'm concerned that others may read it that way. As you
say, you and I know what it's supposed to mean.

[My day job involves doing this sort of analysis on draft legislation.
You'd be amazed just how badly people can misinterpret even the slightest
ambiguity.]

> No commitment yet, but, if no one but Clive and I have strong
> opinions about this, I am likely to invoke the "no bloodshed"
> rule and change this too... but I'd really like to hear from
> others about whether to go to
>    U+[[N]N]NNNN
> or something along the lines of 
>    U+[[1]M]NNNN
> or to giving this critter a name and writing ABNF, thereby
> ridding ourselves of trying to communicate a lot of information
> in a 12-character notational form.

I think that the ABNF approach is probably a good one.

> If we do go to ABNF, I also
> need guidance as to whether to write a leading zero rule into it
> (preferably in the form of the ABNF that people would prefer).

    unicode-notation = "U+" code-point
    code-point = bmp-code-point / extended-code-point
    bmp-code-point = 4hex-digit
    extended-code-point = (non-zero-hex-digit / "10") 4hex-digit
    hex-digit = "0" / non-zero-hex-digit
    non-zero-hex-digit = "1" / "2" / "3" / "4" / "5" / "6" / "7" / "8" /
        "9" / "A" / "B" / "C" / "D" / "E" / "F"
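
The grammar above can be exercised with a small Python sketch. The regular
expression is a hand translation of the ABNF, and the function name
parse_unicode_notation is invented for this example. Since ABNF quoted
strings are case-insensitive, the sketch accepts lowercase letters as well.

```python
import re
from typing import Optional

# Hand translation of the ABNF:
#   extended-code-point = (non-zero-hex-digit / "10") 4hex-digit
#   bmp-code-point      = 4hex-digit
# ABNF string literals are case-insensitive, hence re.IGNORECASE.
_NOTATION = re.compile(
    r"U\+(?:(?P<ext>(?:10|[1-9A-F])[0-9A-F]{4})|(?P<bmp>[0-9A-F]{4}))",
    re.IGNORECASE,
)


def parse_unicode_notation(text: str) -> Optional[int]:
    """Return the code point named by text, or None if it does not match."""
    m = _NOTATION.fullmatch(text)
    if m is None:
        return None
    return int(m.group("ext") or m.group("bmp"), 16)
```

With this translation, U+00E5, U+1F600 and U+10FFFF are accepted, while
U+E5 (too short), U+0FFFF (leading zero beyond four digits) and U+110000
(out of range) are rejected. The grammar alone therefore caps the value at
10FFFF without a separate range check.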

>> Note, for example, that I'm not nit-picking about what I
>> consider to be wrong spelling and grammar where I'm aware that
>> it's a dialectal variant.
> Ok.  For various historical reasons, I often actually prefer
> what I assume is your dialect to my native one.  The RFC Editor
> generally doesn't.  If you are so inclined, we should have an
> offline discussion and then I'll send you the XML and let you
> have at it

No thanks. I actually view this as being very much an author's privilege.
Throughout 3977 I used *my* choices as to correct spelling and grammar, and
fought the RFC Editor when necessary. You're the author, so these are your
choices IMAO.

>> No. I just object to people introducing octets (or, even
>> worse, "bytes") when they're unnecessary and - to my mind -
>> confuse the issue.
> And most of us who first encountered the Internet or ARPANET, or
> computer systems generally, at a time when "characters" could
> come in 5, 6, 7, 8, 9, or 12 bit units (and maybe some others),

[5 and a half in my case: I learned my programming on a system with 39 bit
words, and a character encoding with 5 bits with two shift states - there
were two separate packing systems, one of 7x5 with explicit shifts, the
other of 6x6 with shift bits on each character.]

> partially because I have no real
> confidence that the 21-bit limit will last

Point.

> So, in practical terms, I'm strongly
> inclined to say that, if we have a variable-length delimited
> string, it can be up to eight hex digits long and folks need to
> be prepared to parse that much although individual protocols
> adopting escapes can impose a leading zero rule.

Okay.
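
As a rough Python sketch of what "be prepared to parse that much" might look
like: the function name, the strict_leading_zeros flag, and the
max_code_point parameter are all assumptions made for this example. The
upper bound is a parameter rather than a constant so that a protocol could
change it if the 21-bit limit ever moved.

```python
import re
from typing import Optional

# Four to eight hex digits, as discussed above.
_HEX = re.compile(r"[0-9A-Fa-f]{4,8}")


def parse_hex_digits(
    digits: str,
    strict_leading_zeros: bool = False,
    max_code_point: int = 0x10FFFF,
) -> Optional[int]:
    """Return the value of a 4-to-8-digit hex string, or None if rejected."""
    if _HEX.fullmatch(digits) is None:
        return None
    # Optional per-protocol rule: leading zeros only as padding to four digits.
    if strict_leading_zeros and len(digits) > 4 and digits[0] == "0":
        return None
    value = int(digits, 16)
    return value if value <= max_code_point else None
```

By default "000000E5" parses to the same value as "00E5"; with
strict_leading_zeros=True only the four-digit form is accepted.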

> Sigh.  Unfortunately, not the only thing wrong in the POSIX
> process.  We should have a drink sometime and swap stories.

See above.

-- 
Clive D.W. Feather  | Work:  <clive@demon.net>   | Tel:    +44 20 8495 6138
Internet Expert     | Home:  <clive@davros.org>  | Fax:    +44 870 051 9937
Demon Internet      | WWW: http://www.davros.org | Mobile: +44 7973 377646
THUS plc            |                            |