Re: draft-klensin-net-utf8-06
"Tim Bray" <tbray@textuality.com> Mon, 22 October 2007 10:24 UTC
Message-ID: <517bf110710220323l493c61ccrcc2d72ee3801f60a@mail.gmail.com>
Date: Mon, 22 Oct 2007 03:23:21 -0700
From: Tim Bray <tbray@textuality.com>
To: John C Klensin <john-ietf@jck.com>
Subject: Re: draft-klensin-net-utf8-06
Cc: discuss@apps.ietf.org

My apologies for arriving late to the party. I got on a trans-Pacific flight with the draft in a browser window, but unfortunately not much else of the conversation, so this may be duplicative.

This review of the draft raises, with one exception, issues of style and explication, and only one of these seems really serious to me: the term "code point" needs to be introduced early in the document to facilitate its use later; you really can't talk about Unicode intelligently without flinging the term around.

I have one material point: standardizing on CRLF, rather than LF, for line-ending, seems questionable:

- Lots of payload formats have their own rules about line ends; it's not clear to me that coupling line-ending rules to character-encoding rules is natural or beneficial.
- XML, which constitutes a noticeable proportion of Internet protocol payloads, allows CRLF but requires parsers to convert it to LF before passing it on to the software.

So this choice is going to waste (an admittedly small amount of) space and require extra data processing. The cost of not being able to pass a buffer-full of text to the receiving program without performing the CRLF->LF mapping, and thus necessitating an additional copy of each byte, seems to me worth worrying about.

Now to style points:

Section 1.1

"Historically, the Internet has been largely ASCII-based"

I suggest "Historically, Internet protocols have been largely ASCII-based." The actual payload of the Internet has contained a high volume of non-English (and thus necessarily non-ASCII) text for some decades now.

For reasons of brevity and clarity, I'd simply remove the following, which bulks up the paragraph and doesn't really add much value:

"and no longer need to deal with per-script standards for character sets (e.g., one standard for each of Arabic, Cyrillic, Devanagari, etc., or even standards keyed to languages that are usually considered to share a script, such as French, German, or Swedish)."

Here's a proposed slight redraft of the text introducing the encodings, which provides a bit of motivation, usefully introduces the term "code point", and is more precise:

Unicode identifies each character by an integer, called its "code point", in the range 0-0x10ffff. These integers can be encoded into byte sequences for transmission in at least three standard and generally-recognized encoding forms, all of which are completely defined in The Unicode Standard and the documents cited below:

- UTF-8 [RFC3629] defines a variable-length encoding that may be applied uniformly to all code points.
- UTF-16 [RFC2781] encodes the range of Unicode characters whose code points are less than 65536 straightforwardly as 16-bit integers, and provides a "surrogate" mechanism for encoding larger code points in 32 bits.
- UTF-32 (also known as UCS-4) simply encodes each code point as a 32-bit integer.

Section 2.

"Characters must be coded"... would "encoded" be more consistent with usage elsewhere?

"2. Line-endings MUST be indicated by the sequence Carriage-Return (CR, U+000D) followed by Line-Feed (LF, U+000A), often known just as CRLF."

See above.

"5. As suggested in Section 6 of RFC 3629, the Byte Order Mark ("BOM") signature MUST NOT appear at the beginning of these text strings."

It might be worth adding a note something along these lines: "The BOM is useful in establishing the endian-ness of UTF-16 and UTF-32 encodings, but serves no useful purpose in the context of UTF-8."

"Systems conforming to this specification MUST NOT transmit any string containing any code point"

- This is the first appearance of the term "code point", which needs defining. The edit I suggested above for section 1 would fix this.
- Also, this sentence (which is important) seems to sit uneasily in this section about NFC; I would hoist it and promote it to near the top of section 2.

Section 6.

"The same security issues that apply to UTF-8, as discussed in [RFC3629], could be argued to apply to this standard form,"

That last clause is klunky, how about: "It could be argued that the same issues that apply to UTF-8, as discussed in [RFC3629], apply to this specification."

Generally: I'd ruthlessly whack about 80% of the history and suchlike explication. Future consumers of this document, of which I predict there will be many, will need to consult and then cite the meat: "Unicode characters, UTF-8 encoding, NFC normalization upstream, be careful about Unicode versioning", and will benefit from some explanation of the technical advantages of the choices. The rest is of questionable value.

-Tim
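
Editorial illustration (not part of the original message): the three encoding forms described in the proposed redraft can be made concrete with a small Python sketch. The helper names `encoded_lengths` and `surrogate_pair` are hypothetical, chosen only for this example. The sketch relies on Python's standard codecs and is meant as a rough demonstration under those assumptions, not as anything the draft specifies.

```python
def encoded_lengths(code_point: int) -> dict:
    """Return how many bytes each encoding form needs for one code point."""
    ch = chr(code_point)
    return {
        "utf-8": len(ch.encode("utf-8")),
        # The "-be" codecs fix the byte order, so no BOM is prepended.
        "utf-16": len(ch.encode("utf-16-be")),
        "utf-32": len(ch.encode("utf-32-be")),
    }


def surrogate_pair(code_point: int) -> tuple:
    """Split a code point above 0xFFFF into its UTF-16 surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above 0xFFFF use surrogates")
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits of the offset
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits of the offset
    return high, low


if __name__ == "__main__":
    for cp in (0x41, 0xE9, 0x20AC, 0x1D11E):
        print(f"U+{cp:04X}", encoded_lengths(cp))
    high, low = surrogate_pair(0x1D11E)
    print(f"U+1D11E as surrogates: {high:04X} {low:04X}")
    # Python's plain "utf-8" codec emits no BOM; "utf-8-sig" adds one.
    print("A".encode("utf-8"), "A".encode("utf-8-sig"))
```

UTF-8 grows from one to four bytes as the code point gets larger, UTF-16 uses two bytes below 65536 and four bytes (a surrogate pair) above, and UTF-32 is always four bytes. The last line shows that a UTF-8 BOM is three extra bytes that carry no byte-order information.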