Re: draft-klensin-net-utf8-06

"Tim Bray" <tbray@textuality.com> Mon, 22 October 2007 10:24 UTC

Message-ID: <517bf110710220323l493c61ccrcc2d72ee3801f60a@mail.gmail.com>
Date: Mon, 22 Oct 2007 03:23:21 -0700
From: "Tim Bray" <tbray@textuality.com>
To: "John C Klensin" <john-ietf@jck.com>
Subject: Re: draft-klensin-net-utf8-06
In-Reply-To: <93F25E18AB3DA3EB0599F092@p3.JCK.COM>
MIME-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 7bit
Content-Disposition: inline
References: <93F25E18AB3DA3EB0599F092@p3.JCK.COM>
Cc: discuss@apps.ietf.org
Reply-To: tbray@textuality.com

My apologies for arriving late to the party.  I got on a
trans-Pacific flight with the draft in a browser window, but
unfortunately not much else of the conversation, so this may be
duplicative.

This review of the draft raises, with one exception, issues of style
and explication, and only one of these seems really serious to me: the
term "code point" needs to be introduced early in the document to
facilitate its use later; you really can't talk about Unicode
intelligently without flinging the term around.

I have one material point: standardizing on CRLF, rather than LF, for
line-ending, seems questionable:
- Lots of payload formats have their own rules about line ends; it's
not clear to me that coupling line-ending rules to character-encoding
rules is natural or beneficial.
- XML, which constitutes a noticeable proportion of Internet protocol
payloads, allows CRLF but requires parsers to convert it to LF before
passing it on to the software.  So this choice is going to waste (an
admittedly small amount of) space and require extra data processing.
A receiver cannot pass a buffer-full of text to the receiving program
without first performing the CRLF->LF mapping, which necessitates an
additional copy of each byte; that cost seems to me worth worrying
about.
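
To make the cost concrete, here is a minimal Python sketch of the
mapping an XML-style receiver has to perform on a CRLF-delimited
payload.  The function name normalize_line_ends and the sample payload
are made up for illustration; the lone-CR case follows the XML 1.0
end-of-line rules and is not something the draft itself specifies.

```python
def normalize_line_ends(data: bytes) -> bytes:
    """Map CRLF, and any remaining lone CR, to LF.

    Each replace() call builds a new bytes object, so the payload is
    copied rather than handed on as-is.
    """
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")


# Hypothetical payload as it would arrive under a CRLF-only rule.
wire = b"<doc>\r\n<p>hello</p>\r\n</doc>\r\n"
normalized = normalize_line_ends(wire)

# One byte saved per line once the CRs are gone.
bytes_saved = len(wire) - len(normalized)
```

With LF-only line ends on the wire, the receiver could skip this step
and pass the buffer through untouched.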

Now to style points:

Section 1.1

"Historically, the Internet has been largely ASCII-based"

I suggest "Historically, Internet protocols have been largely
ASCII-based."  The actual payload of the Internet has contained a high
volume of non-English (and thus necessarily non-ASCII) text for some
decades now.

For reasons of brevity and clarity, I'd simply remove the following,
which bulks up the paragraph and doesn't really add much value: "and
no longer need to deal with per-script standards for character sets
(e.g., one standard for each of Arabic, Cyrillic, Devanagari, etc., or
even standards keyed to languages that are usually considered to share
a script, such as French, German, or Swedish). "

Here's a proposed slight redraft of the text introducing the
encodings, which provides a bit of motivation, usefully introduces the
term "code point", and is more precise:

Unicode identifies each character by an integer, called its "code
point", in the range 0-0x10ffff.  These integers can be encoded into
byte sequences for transmission in at least three standard and
generally-recognized encoding forms, all of which are completely
defined in The Unicode Standard and the documents cited below:

- UTF-8 [RFC3629] defines a variable-length encoding that may be
applied uniformly to all code points.
- UTF-16 [RFC2781] encodes the range of Unicode characters whose code
points are less than 65536 straightforwardly as 16-bit integers, and
provides a "surrogate" mechanism for encoding larger code points in 32
bits.
- UTF-32 (also known as UCS-4) simply encodes each code point as a
32-bit integer.
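
As an illustration of the three encoding forms, the following Python
sketch encodes a few code points and reports how many bytes each form
needs.  The helper name encoded_lengths and the sample characters are
chosen here for illustration only; the big-endian codec variants are
used so that no byte order mark is counted.

```python
def encoded_lengths(ch: str) -> tuple[int, int, int]:
    """Return the byte lengths of one character in UTF-8, UTF-16, UTF-32."""
    return (
        len(ch.encode("utf-8")),
        len(ch.encode("utf-16-be")),  # explicit byte order, so no BOM is added
        len(ch.encode("utf-32-be")),
    )


# Code points from the ASCII range, Latin-1, the rest of the BMP, and
# beyond 0xFFFF (where UTF-16 needs a surrogate pair).
samples = ["A", "\u00e9", "\u20ac", "\U0001d11e"]

for ch in samples:
    utf8_len, utf16_len, utf32_len = encoded_lengths(ch)
    print(f"U+{ord(ch):04X}: utf-8={utf8_len} utf-16={utf16_len} utf-32={utf32_len}")
```

UTF-8 grows from one to four bytes across these samples, UTF-16 uses
two bytes until the code point exceeds 0xFFFF, and UTF-32 is always
four.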

Section 2.

"Characters must be coded"... would "encoded" be more consistent with
usage elsewhere?

"  2.  Line-endings MUST be indicated by the sequence Carriage-Return
       (CR, U+000D) followed by Line-Feed (LF, U+000A), often known just
       as CRLF. "

See above.

"  5.  As suggested in Section 6 of RFC 3629, the Byte Order Mark
       ("BOM") signature MUST NOT appear at the beginning of these text
       strings."

It might be worth adding a note something along these lines: "The BOM
is useful in establishing the endian-ness of UTF-16 and UTF-32
encodings, but serves no useful purpose in the context of UTF-8."
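
A short Python sketch of that difference, using only the standard
codecs module; the sample string is arbitrary.  It shows a decoder
using the BOM to pick the byte order for UTF-16, whereas in UTF-8 the
same signature carries no byte-order information and simply turns up
as a leading U+FEFF unless the receiver strips it.

```python
import codecs

text = "hi"

# UTF-16: the same text in either byte order, each prefixed with its BOM.
utf16_le = codecs.BOM_UTF16_LE + text.encode("utf-16-le")
utf16_be = codecs.BOM_UTF16_BE + text.encode("utf-16-be")

# The generic "utf-16" decoder reads the BOM to choose the byte order,
# then drops it from the result.
from_le = utf16_le.decode("utf-16")
from_be = utf16_be.decode("utf-16")

# UTF-8 has a single byte order, so the signature tells the decoder nothing.
utf8_with_bom = codecs.BOM_UTF8 + text.encode("utf-8")
plain_decode = utf8_with_bom.decode("utf-8")         # keeps U+FEFF as a character
stripped_decode = utf8_with_bom.decode("utf-8-sig")  # receiver must opt in to strip it
```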

"Systems conforming to this specification MUST NOT transmit any string
   containing any code point"

- This is the first appearance of the term "code point", which needs
defining.  The edit I suggested above for section 1 would fix this.
- Also, this sentence (which is important) seems to sit uneasily in
this section about NFC; I would hoist it and promote it to near the
top of section 2.

Section 6.

"The same security issues that apply to UTF-8, as
   discussed in [RFC3629], could be argued to apply to this standard form,"

That last clause is clunky; how about: "It could be argued that the
same issues that apply to UTF-8, as discussed in [RFC3629], apply to
this specification."

Generally: I'd ruthlessly whack about 80% of the history and suchlike
explication.  Future consumers of this document, of which I predict
there will be many, will need to consult and then cite the meat:
"Unicode characters, UTF-8 encoding, NFC normalization upstream, be
careful about Unicode versioning", and will benefit from some
explanation of the technical advantages of the choices.  The rest is
of questionable value.  -Tim