Comments on Unicode Format for Network Interchange
"Markus Scherer" <markus.icu@gmail.com> Mon, 23 April 2007 17:48 UTC
Dear Mr. Klensin and Mr. Padlipsky et al.,
I have reviewed and discussed your draft-klensin-net-utf8-03 with some
colleagues. We welcome the standardization on UTF-8 as the default
internet charset.
We would like to make the following suggestions
(each starting with *** and ending with *** *** among quotes from the
internet-draft):
[...]
2. Net-Unicode
2.1. Definition
The Network Unicode (Net-Unicode) format is defined as follows:
1. Characters MUST be coded in UTF-8 as defined in [RFC3629].
2. Line-endings MUST be indicated by the sequence Carriage-Return
(U+000D) followed by Line-Feed (U+000A).
*** Suggested change:
2. Line-endings MUST be indicated by the sequence Carriage-Return
(U+000D) followed by Line-Feed (U+000A), or by a single
Carriage-Return (U+000D), or by a single Line-Feed (U+000A).
Justification: We believe that single CR and LF are common because of
implementation practice on a variety of platforms, and that it is both
unrealistic and unnecessary to try to legislate them away.
Applications already commonly handle all of CR, LF and CR+LF, and some
support even more characters according to the Unicode Newline
Guidelines.
*** ***
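A minimal sketch of what a recipient that accepts all three line-ending forms might do. The function names `split_lines` and `to_crlf` are hypothetical and come neither from the draft nor from any library; only the three sequences named in the suggested change are handled, not the additional characters covered by the Unicode Newline Guidelines.

```python
import re

# CR+LF is listed first so that it is matched as one line ending
# rather than as a lone CR followed by a lone LF.
_LINE_END = re.compile(r"\r\n|\r|\n")


def split_lines(text: str) -> list[str]:
    """Split text on CR+LF, a single CR, or a single LF."""
    return _LINE_END.split(text)


def to_crlf(text: str) -> str:
    """Rewrite every accepted line ending as CR+LF."""
    return _LINE_END.sub("\r\n", text)


if __name__ == "__main__":
    mixed = "one\r\ntwo\rthree\nfour"
    print(split_lines(mixed))
    print(repr(to_crlf(mixed)))
```

A sender that wants to emit only CR+LF can still call `to_crlf` before transmission, while a recipient stays tolerant of the other two forms.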
3. Before transmission, all character sequences MUST be normalized
according to Unicode method "NFC" (see Section 3).
*** Suggested change:
3. Before transmission, all character sequences SHOULD be normalized
according to Unicode method "NFC" (see Section 3).
Justification: With the MUST language in the draft, we see the following issues:
* The draft later says that recipients should not just assume
that incoming text is normalized. Therefore, recipients must
already be prepared to at least check for normalization.
-> We believe that the MUST is not useful.
* The normalization requirement is the reason for the Unicode versioning
and stability discussion below which complicates this internet-draft
considerably.
-> We believe that the MUST is not necessary.
* The normalization stability restricts this specification to Unicode versions
3.2 and above (see section 4).
-> We believe that this is too restrictive.
Unicode applications normally handle text from Unicode 2.0 and above.
* We believe that the MUST is unenforceable.
Moreover, if recipients must check, it doesn't make
any difference whether it is enforced.
(With this change, much of the following text of the internet-draft
can be simplified significantly. In particular, the discussions of
unassigned/unknown characters, stabilized forms, etc. can and should
be dropped.)
*** ***
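The argument that recipients must be prepared to check normalization anyway can be sketched as follows. This is an illustration only, using Python's standard `unicodedata` module; the names `is_nfc` and `receive` are hypothetical, and the result depends on the Unicode version shipped with the Python interpreter.

```python
import unicodedata


def is_nfc(text: str) -> bool:
    """Return True if text is already in Normalization Form C."""
    # Comparing against the normalized form works on any Python 3 version.
    return unicodedata.normalize("NFC", text) == text


def receive(text: str) -> str:
    """Accept incoming text and normalize it only when it is not NFC."""
    return text if is_nfc(text) else unicodedata.normalize("NFC", text)


if __name__ == "__main__":
    decomposed = "e\u0301"  # "e" followed by COMBINING ACUTE ACCENT
    print(is_nfc(decomposed))
    print(receive(decomposed) == "\u00e9")
```

With a recipient like this, unnormalized input is repaired rather than rejected, which is why a SHOULD on the sender side loses little in practice.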
4. As suggested in Section 6 of RFC 3629, the Byte Order Mark
("BOM") signature MUST NOT appear at the beginning of these text
strings.
*** Suggested change:
4. The UTF-8 signature byte sequence (EF BB BF, UTF-8 encoding of U+FEFF,
sometimes called Byte Order Mark ("BOM")),
when it appears at the beginning of the text, SHOULD be deleted
by the recipient.
If a Word Joiner is needed in the text, U+2060 WORD JOINER SHOULD be used
instead of U+FEFF ZERO WIDTH NO-BREAK SPACE.
Justification: We believe that the draft text is unnecessarily strong,
and at the same time not sufficiently specific for implementers.
*** ***
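The recipient-side behavior in the suggested text might look like the sketch below. The name `strip_signature` is hypothetical; the function removes the byte sequence EF BB BF only when it appears at the very beginning of the data and leaves any later U+FEFF untouched.

```python
# UTF-8 encoding of U+FEFF, as given in the suggested text.
UTF8_SIGNATURE = b"\xef\xbb\xbf"


def strip_signature(data: bytes) -> bytes:
    """Delete a leading UTF-8 signature, if present."""
    if data.startswith(UTF8_SIGNATURE):
        return data[len(UTF8_SIGNATURE):]
    return data


if __name__ == "__main__":
    print(strip_signature(b"\xef\xbb\xbfhello"))
    print(strip_signature(b"hello"))
```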
[...]
2.2. The ASCII NVT Definition
[...]
1. The "defined but not required" codes -- BEL, BS, HT, VT, FF --
and the undefined control codes ("C0") SHOULD NOT be used unless
required by exceptional circumstances.
*** Suggested change:
1. Control codes from both the "C0" (U+0000..U+001F, U+007F)
and "C1" (U+0080..U+009F) ranges,
with the exception of HT (09), LF (0A) and CR (0D),
SHOULD NOT be used unless required by exceptional circumstances.
Justification: The sets of C0 and C1 control codes that should and
should not be used should be defined explicitly, and with code point
values. Only HT, LF and CR are very widely used.
*** ***
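Stating the ranges with code point values makes the rule easy to check mechanically. The following sketch uses exactly the ranges and exceptions from the suggested text; the function names are hypothetical.

```python
# HT, LF and CR are the only control codes exempt from the SHOULD NOT.
ALLOWED_CONTROLS = {0x09, 0x0A, 0x0D}


def is_discouraged_control(cp: int) -> bool:
    """True for C0 (U+0000..U+001F, U+007F) and C1 (U+0080..U+009F)
    control codes other than HT, LF and CR."""
    in_c0 = 0x00 <= cp <= 0x1F or cp == 0x7F
    in_c1 = 0x80 <= cp <= 0x9F
    return (in_c0 or in_c1) and cp not in ALLOWED_CONTROLS


def find_discouraged_controls(text: str) -> list[tuple[int, int]]:
    """Return (index, code point) pairs for each discouraged control code."""
    return [
        (i, ord(ch))
        for i, ch in enumerate(text)
        if is_discouraged_control(ord(ch))
    ]


if __name__ == "__main__":
    # BEL (U+0007) and NEL (U+0085) are reported; HT, CR and LF are not.
    print(find_discouraged_controls("a\tb\x07c\r\n\x85"))
```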
2. CR MUST NOT appear except when immediately followed by either NUL
or LF, with the latter (CR LF) designating the "new line"
function. Because page layout is better done in other ways and
to avoid other types of confusion, CR NUL SHOULD preferably be
avoided.
3. LF CR SHOULD NOT appear except as a side-effect of multiple CR LF
sequences (e.g., CR LF CR LF).
*** Suggested change:
Remove points 2. and 3.
Justification: The other suggested changes permit CR and LF.
*** ***
[...]
4. Versions of Unicode
In retrospect, one of the advantages of ASCII [X3.4-1978] when it was
chosen was that the code space was full when the Standard was first
published. There was no practical way to add characters or change
code point assignments without being obviously incompatible. Unicode
does not have that property: there are large blocks of space reserved
for future expansion and new versions, with new characters and code
point assignments, appear at regular intervals.
While there are some security issues if people deliberately try to
trick the system (see Section 6), Unicode version changes should not
have a significant impact on the text stream specification of this
document for the following reasons:
o The transformation between Unicode code table positions and the
corresponding UTF-8 code is algorithmic; it does not depend on
whether a code point has been assigned or not.
o The normalization specified here, NFC (see Section 3), performs a
very limited set of mappings, much more limited than those of the
more extensive NFKC used in, e.g., nameprep [RFC3491].
*** Suggested change:
Drop this second bullet and the following paragraph.
Justification: They become unnecessary once NFC is changed from MUST to SHOULD.
*** ***
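The first bullet quoted above, that the UTF-8 transformation is algorithmic and independent of code point assignment, can be illustrated with a hand-written encoder. This is a sketch following the bit layout of RFC 3629; the name `utf8_encode` is hypothetical, and U+0378 is used in the example only as a code point that has no character assigned in the Unicode versions known at the time of writing.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one Unicode scalar value as UTF-8 using arithmetic only."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not a Unicode scalar value")
    if cp < 0x80:
        return bytes([cp])
    if cp < 0x800:
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:
        return bytes([
            0xE0 | (cp >> 12),
            0x80 | ((cp >> 6) & 0x3F),
            0x80 | (cp & 0x3F),
        ])
    return bytes([
        0xF0 | (cp >> 18),
        0x80 | ((cp >> 12) & 0x3F),
        0x80 | ((cp >> 6) & 0x3F),
        0x80 | (cp & 0x3F),
    ])


if __name__ == "__main__":
    # No lookup of character properties is involved at any point.
    print(utf8_encode(0x0378) == chr(0x0378).encode("utf-8"))
```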
The NFC tables may be updated over time as new characters are added,
but the Unicode Consortium has guaranteed the stability of all NFC
strings. That is, if a string does not contain any unassigned
characters, and it is normalized according to NFC, it will always be
normalized according to all future versions of the Unicode Standard.
The stability of the Net-Unicode format is thus guaranteed when any
implementation that converts text into Net-Unicode format does not
permit unassigned characters.
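The condition in the quoted paragraph, that a converter must not permit unassigned characters, ties the check to a particular Unicode version. A minimal sketch, assuming Python's `unicodedata` module as the character database (the name `has_unassigned` is hypothetical):

```python
import unicodedata


def has_unassigned(text: str) -> bool:
    """True if any code point has General_Category Cn (unassigned)
    in the Unicode version bundled with this Python interpreter."""
    return any(unicodedata.category(ch) == "Cn" for ch in text)


if __name__ == "__main__":
    # The answer can change between interpreter versions, because
    # newer Unicode versions assign previously unassigned code points.
    print(unicodedata.unidata_version)
    print(has_unassigned("abc"))
```

Two implementations built against different Unicode versions can disagree on this check for the same string, which is the versioning complication that the suggested changes aim to avoid.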
Were Unicode to be changed in a way that violated these assumptions,
i.e., that either invalidated the string order of RFC 3629 or that
changed the stability of NFC as stated above, this specification
would not apply. Put differently, this specification applies only to
versions of Unicode starting with version 3.2 and extending to, but
not including, any version for which no changes are made in either
the UTF-8 definition or to NFC stability.
*** Suggested change:
Modify the paragraph above, removing references to NFC.
Justification: As a result, this specification will then apply to
versions of Unicode starting with version 2.0.
*** ***
[...]
5.2. The Unicode Applicability Dilemma
[...]
*** Suggested change:
Add an item for a fifth way to get around the problem: Strongly
encourage use of normalization form NFC in interchanged text, but do
not require it.
Justification: This is the alternative discussed here.
*** ***
9.1. Normative References
*** Suggested change:
Please add a reference for [RFC3629], "UTF-8, a transformation
format of ISO 10646".
Justification: Missing reference.
*** ***
Best regards,
Markus Scherer
Google Software Internationalization
ICU Project Developer