Re: draft-klensin-unicode-escapes-01

John C Klensin <john-ietf@jck.com> Sat, 03 February 2007 17:07 UTC

Date: Sat, 03 Feb 2007 12:06:56 -0500
From: John C Klensin <john-ietf@jck.com>
To: Frank Ellermann <nobody@xyzzy.claranet.de>, discuss@apps.ietf.org
Subject: Re: draft-klensin-unicode-escapes-01
Message-ID: <AF334D6BB0BFF3037B0DE609@p3.JCK.COM>


--On Saturday, 03 February, 2007 03:03 +0100 Frank Ellermann
<nobody@xyzzy.claranet.de> wrote:

> John C Klensin wrote:
> 
>> Sorry, but, IMO, for a document like this which is merely
>> giving examples (at this point), "very widely deployed and
>> used" trumps "horrible".
> 
> When I mentioned hex. NCRs I meant XML, not SGML and its many
> ways to save keystrokes.

And I was commenting only on the suggestion that all reference
to HTML be removed, however horrible it is.  XML is, as far as I
know, already covered.  If it isn't covered adequately, I don't
understand your suggestion.

>>> Probably the I-D should mention that one famous exception
>>> from its rule to avoid encoded UTF-8 is the URL form of IRIs.
> 
>> Why?  I don't see that (and a few other cases) as "exceptions"
>> (famous or not), but as mistakes from which we should learn
>> and, I hope, have learned.
> 
> The overall RFC 2277 rule is IMO "if you don't know what it is
> or can't say what it is, it's UTF-8, and if that theory fails
> it's UNKNOWN-8BIT".  And unlike RFC 2231 an URL can't say what
> it is.

I think I've understood both of those things.  I just haven't
seen the justification or requirement to start exploring
existing protocols in this document, famous or not.  I'm willing
to be convinced but note that every time I put something into a
document that is not strictly necessary I get attacked for
excessive length, etc.

>>> wouldn't it be better to say "21 bits - rather than the 7
>>> bits of ASCII -"?
> 
>> No, because of net-ascii and some other issues, probably not.
>> Again, because the important issue here is that this stuff is
>> about escapes, you are picking nits that belong elsewhere
> 
> That nit is closely related to escape mechanisms providing for
> 31 or 32 bits, and attempts to get rid of leading zeros in
> these mechanisms, which could fail without explicit delimiters.

Sure.  But we routinely express ASCII in terms of octets.  We
don't use the "7-bit" or "21-bit" language very often.  And
wiring it into protocols and conventions just gets us into
trouble when other standards bodies change their minds and
decide that however many bits they thought was certainly enough,
wasn't.  Consider the evolution from "6 bits is enough" (with
BCD and other code sets) to "7 bits is enough" (with ASCII and
ISO 646), to "8 bits is enough" (with 8859 (including the
unfortunately-named ASCII-8), EBCDIC, and others), to "16 bits
is certainly enough" (with Unicode 1.0), to "we will never need
more than 21" (current Unicode).  Of course, the IETF and its
predecessors have made the same errors, starting with the
conviction that we would never need more than 255 network nodes,
to 32 bits with IPv4, to 128 bits with IPv6, to some suspicions
that, depending on allocation policies, the latter may turn out
to not be enough in a world with sensor networks, multiple
connectivity paths, and many, many small IP-connected devices.

I don't think I disagree with your point -- it is certainly
factual -- but I don't yet see the need to open this topic up in
this document (see the comment about length, etc., above).  I
have taken \uNNNN and \UNNNNNNNN out of their special status.
I've inserted an explicit discussion about why explicit
delimiters (or terminators) are desirable.  And, with the
working draft for -02, I've added a bullet to Section 4 noting
that one cannot tell, just by looking, whether \uNNNN is the
short, BMP-only member of the \uNNNN and \UNNNNNNNN pair, a
short form of \uNNNNN or \uNNNNNN, or an octet encoding for
UTF-16, which is why none of those "\u" forms (without
delimiters) is attractive.  What else do you suggest and why?
Text please.
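
To make the ambiguity concrete, here is a minimal Python sketch.
The three reader functions and the brace-delimited \u{...} syntax
are illustrative assumptions, not taken from the draft.  The point
is only that two plausible readers disagree about the same
undelimited input, while a delimited form leaves nothing to guess.

```python
import re

def decode_fixed4(text):
    # Reader A (assumed): "\u" is always followed by exactly four hex digits.
    return re.sub(r"\\u([0-9A-Fa-f]{4})",
                  lambda m: chr(int(m.group(1), 16)), text)

def decode_greedy(text):
    # Reader B (assumed): "\u" takes four to six hex digits, as many as it finds.
    # chr() raises ValueError if the value is above 0x10FFFF.
    return re.sub(r"\\u([0-9A-Fa-f]{4,6})",
                  lambda m: chr(int(m.group(1), 16)), text)

def decode_delimited(text):
    # Hypothetical delimited form \u{...}: the braces mark where the digits end.
    return re.sub(r"\\u\{([0-9A-Fa-f]{1,6})\}",
                  lambda m: chr(int(m.group(1), 16)), text)

# Raw string: a literal backslash, "u", then six characters that are all hex digits.
sample = r"\u00E9AB"

print(ascii(decode_fixed4(sample)))             # U+00E9 followed by "AB"
print(ascii(decode_greedy(sample)))             # the single code point U+E9AB
print(ascii(decode_delimited(r"\u{E9}AB")))     # U+00E9 followed by "AB"
```

Reader A sees U+00E9 followed by the letters "AB"; reader B sees
the single code point U+E9AB.  Nothing in the input says which was
intended, whereas the delimited spelling has only one reading.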

>> U+NNNN[N[N]] versus U+[[N]N]NNNN is a matter of taste.
 
> If the idea is to reflect appendix A of Unicode 5, it talks
> about stripping leading zeros until four digits are left.

Which defines the contents and semantics of the NNNN... string,
not the syntax.  I can argue this either way (as I presume you
can).  I just find the second form harder to read.
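
As a rough illustration of the distinction between content and
syntax, the Python sketch below formats a code point with leading
zeros kept only up to four digits, then checks that the two
grammars accept the same strings.  The function name and the two
regular expressions are assumptions made for this example.

```python
import re

def u_plus(code_point):
    # Upper-case hex, zero-padded to at least four digits, so leading
    # zeros are dropped only until four digits remain.
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("outside the Unicode code space")
    return f"U+{code_point:04X}"

# U+NNNN[N[N]]: four digits, then up to two optional trailing digits.
TRAILING_OPTIONAL = re.compile(r"U\+[0-9A-F]{4}([0-9A-F]([0-9A-F])?)?")

# U+[[N]N]NNNN: up to two optional leading digits, then four digits.
LEADING_OPTIONAL = re.compile(r"U\+(([0-9A-F])?[0-9A-F])?[0-9A-F]{4}")

for cp in (0x41, 0x1F600, 0x10FFFF):
    s = u_plus(cp)
    print(s, bool(TRAILING_OPTIONAL.fullmatch(s)),
          bool(LEADING_OPTIONAL.fullmatch(s)))
```

Both patterns accept exactly the strings with four to six hex
digits after "U+", so the choice between the two ways of writing
the grammar affects readability only, not what is matched.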

>  In
> table A.1 it has U+HHHH vs. U-HHHHHHHH, saying that this is
> the same as \uHHHH vs. \UHHHHHHHH.  I haven't seen U-HHHHHHHH
> before, and I won't miss it if you don't want to talk about it
> in the draft.

Thanks.  I don't intend to talk about it unless someone makes a
very persuasive argument.

>> my editorial judgment and preferences are going to prevail...
>> at least until the document gets far enough along that I have
>> to start arm-wrestling with the RFC Editor and _their_
>> judgment and preferences.
> 
> Oops, sorry, I thought the intended status was BCP, "an attempt
> to prohibit escapes for UTF-8 entirely".

Well, we clearly can't do that without making the spec
retroactive on existing protocols, especially the famous IRI
situation.  But, yes, the target is a standards track document
of some flavor... see below.
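
For readers unfamiliar with the IRI case mentioned above, a small
Python sketch of the difference between escaping the UTF-8 octets
and naming the code point.  The variable names are illustrative,
and the U+ spelling is used here only as a stand-in for "some
code-point-based notation", not as the draft's recommended escape.

```python
from urllib.parse import quote, unquote

ch = "\u00e9"   # LATIN SMALL LETTER E WITH ACUTE

# URL form of an IRI: each octet of the UTF-8 encoding is percent-escaped.
octet_escape = quote(ch)

# Code-point-based notation: the escape names the character itself,
# independent of any particular encoding.
code_point_ref = f"U+{ord(ch):04X}"

print(octet_escape)            # %C3%A9
print(code_point_ref)          # U+00E9
print(unquote(octet_escape) == ch)
```

The percent-escaped form exposes two octets for one character, so
a reader has to know the encoding is UTF-8 to recover it; the
code-point form carries no such dependency.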

>> If you don't like my writing style -- and many people don't --
>> please take on these efforts yourselves and let me complain
>> about your style (or not) some of the time.
> 
> I like it, your drafts are almost always very interesting.  I
> was really surprised when I stumbled over RFC 2345 some days
> ago, it supports UTF-8 for whois.  So that wasn't a DeNIC
> invention after all, and it's clearly older than RFC 3912
> chapter 4.

For whatever it is worth, I've become convinced, as I have
delved further into the history of telnet-based protocols, that
it was a mistake.  And I have no reason to believe that DENIC
looked at that document rather than inventing the idea
independently.  If they did draw inspiration from it, I think
I'd be suitably chagrined :-(

> But "interesting" sometimes includes "controversial", and then
> the intended status is relevant.  As you say it makes no sense
> to nit-pick your personal preferences.

In this case, and probably more generally, the comment was
strictly about editorial preferences, not substantive issues of
protocol design or recommendations.    See below.

>> nit-picking is easy and may be fun, but it tends to block,
>> rather than contribute to, progress.
> 
> I don't read 99% of all drafts, and I don't try to contribute
> to 99.9%.  The remaining 0.1% somehow attracted my attention,
> often because I like them, the opposite is also possible.  No
> idea how easy it generally is, but this article alone took me
> about three hours, because I tried to find out what "net
> ascii" really is, why your "net utf-8" mentions an RFC that's
> no longer available, where the Unicode 5 notational conventions
> are, etc.

My apologies.  And I'm flattered that you thought it worth the
time.  I'm typically willing to answer questions about comments
of mine that seem obscure if that is more efficient for you.
 
>> unless your purpose is to introduce delays and lay down
>> obstacles
> 
> Escaping that with <rant> and adding a qualifier doesn't make
> it better.  If you don't want a discussion about the draft
> it's okay, as you say we're free to submit our own drafts
> about this and / or related topics.

Frank, I welcome, and want, comments and suggestions.  The fact
that I completely revamped the document and what it suggested
between -00 and -01 in response to comments critical of the
recommendation about \u and \U should be evidence of that.  On
this document, virtually every substantive suggestion or comment
has resulted in some changes to the text.  I count "this
sentence is incomprehensible and needs to be fixed" as
substantive in that regard.  As with open source programs and
programming, I believe that having multiple critical eyes on one
of these things makes it better, and that certainly has been the
case here.  

What I've objected to, here and elsewhere, has been what seemed
to be high-frequency nit-picking about editorial style and
preferences.  From my point of view, that is irritating and it
does little or nothing to advance the work.  More important,
especially when I'm working on multiple documents in parallel, I
often start feeling like a Bear of Very Little Brain (apologies
if that reference is not obvious to you): while long words don't
bother me, I start getting scared that I will miss something
substantive and important among flurries of editorial quibbling.

      john