Comments on Unicode Format for Network Interchange

"Markus Scherer" <markus.icu@gmail.com> Mon, 23 April 2007 17:48 UTC

Message-ID: <6bb028490704231048s41deaf57q33ddb21fd0e76f17@mail.gmail.com>
Date: Mon, 23 Apr 2007 10:48:29 -0700
From: Markus Scherer <markus.icu@gmail.com>
To: discuss@apps.ietf.org
Subject: Comments on Unicode Format for Network Interchange
MIME-Version: 1.0
Content-Type: text/plain; charset="ISO-8859-1"; format="flowed"
Content-Transfer-Encoding: 7bit
Content-Disposition: inline
Precedence: list
Errors-To: discuss-bounces@apps.ietf.org

Dear Mr. Klensin and Mr. Padlipsky et al.,

I have reviewed and discussed your draft-klensin-net-utf8-03 with some
colleagues. We welcome the standardization on UTF-8 as the default
internet charset.

We would like to make the following suggestions
(each starting with *** and ending with *** *** among quotes from the
internet-draft):

[...]

2.  Net-Unicode

2.1.  Definition

   The Network Unicode (Net-Unicode) format is defined as follows:

   1.  Characters MUST be coded in UTF-8 as defined in [RFC3629].

   2.  Line-endings MUST be indicated by the sequence Carriage-Return
       (U+000D) followed by Line-Feed (U+000A).

*** Suggested change:
   2.  Line-endings MUST be indicated by the sequence Carriage-Return
       (U+000D) followed by Line-Feed (U+000A), or by a single
       Carriage-Return (U+000D), or by a single Line-Feed (U+000A).

Justification: We believe that single CR and LF are common because of
implementation practice on a variety of platforms, and that it is both
unrealistic and unnecessary to try to legislate them away.
Applications already commonly handle all of CR, LF and CR+LF, and some
support even more characters according to the Unicode Newline
Guidelines.
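
As a minimal sketch of the tolerant handling described above (Python is
assumed here, and the names split_lines and to_crlf are hypothetical,
not taken from the draft), a recipient could treat CR LF, a single CR,
and a single LF alike:

```python
import re

# CR LF is listed first so the pair counts as one line ending,
# followed by a lone CR or a lone LF.
_LINE_END = re.compile(r"\r\n|\r|\n")


def split_lines(text: str) -> list:
    """Split text on CR LF, a single CR, or a single LF."""
    return _LINE_END.split(text)


def to_crlf(text: str) -> str:
    """Rewrite every accepted line ending as CR LF, e.g. before sending."""
    return _LINE_END.sub("\r\n", text)
```

A sender that still wants to emit only CR LF could call to_crlf before
transmission, while recipients stay liberal in what they accept.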
*** ***

   3.  Before transmission, all character sequences MUST be normalized
       according to Unicode method "NFC" (see Section 3).

*** Suggested change:
   3.  Before transmission, all character sequences SHOULD be normalized
       according to Unicode method "NFC" (see Section 3).

Justification: With the MUST language in the draft, we see the following issues:
* The draft later says that recipients should not just assume
  that incoming text is normalized. Therefore, recipients must
  already be prepared to at least check for normalization.
  -> We believe that the MUST is not useful.
* The normalization requirement is the reason for the Unicode versioning
  and stability discussion below which complicates this internet-draft
  considerably.
  -> We believe that the MUST is not necessary.
* The normalization stability restricts this specification to Unicode versions
  3.2 and above (see section 4).
  -> We believe that this is too restrictive.
     Unicode applications normally handle text from Unicode 2.0 and above.
* We believe that the MUST is unenforceable.
  Moreover, if recipients must check, it doesn't make
  any difference whether it is enforced.

(With this change, much of the following text of the internet-draft
can be simplified significantly. In particular, the discussions of
unassigned/unknown characters, stabilized forms, etc. can and should
be dropped.)
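
The recipient-side check mentioned in the first point can be sketched as
follows. This is only an illustration, assuming Python 3.8 or later and
its standard unicodedata module; ensure_nfc is a hypothetical name, and
the result depends on the Unicode version bundled with the runtime:

```python
import unicodedata


def ensure_nfc(text: str) -> str:
    """Return text in NFC, normalizing only when the quick check fails."""
    if unicodedata.is_normalized("NFC", text):
        # Already in NFC: nothing to do, regardless of what the sender claimed.
        return text
    return unicodedata.normalize("NFC", text)
```

Because the recipient performs this check in any case, the outcome is
the same whether the sender was bound by MUST or by SHOULD.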
*** ***

   4.  As suggested in Section 6 of RFC 3629, the Byte Order Mark
       ("BOM") signature MUST NOT appear at the beginning of these text
       strings.

*** Suggested change:
   4.  The UTF-8 signature byte sequence (EF BB BF, the UTF-8 encoding
       of U+FEFF, sometimes called Byte Order Mark ("BOM")), when it
       appears at the beginning of the text, SHOULD be deleted by the
       recipient.  If a Word Joiner is needed in the text, U+2060 WORD
       JOINER SHOULD be used instead of U+FEFF ZERO WIDTH NO-BREAK
       SPACE.

Justification: We believe that the draft text is unnecessarily strong,
and at the same time not sufficiently specific for implementers.
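
For implementers, the suggested recipient behavior could look roughly
like this. It is a sketch in Python operating on the raw bytes before
decoding; strip_utf8_signature is a hypothetical name:

```python
# EF BB BF is the UTF-8 encoding of U+FEFF.
UTF8_SIGNATURE = b"\xef\xbb\xbf"


def strip_utf8_signature(data: bytes) -> bytes:
    """Delete the UTF-8 signature only when it starts the text."""
    if data.startswith(UTF8_SIGNATURE):
        return data[len(UTF8_SIGNATURE):]
    # An EF BB BF sequence elsewhere in the text is left untouched.
    return data
```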
*** ***

[...]

2.2.  The ASCII NVT Definition

   [...]

   1.  The "defined but not required" codes -- BEL, BS, HT, VT, FF --
       and the undefined control codes ("C0") SHOULD NOT be used unless
       required by exceptional circumstances.

*** Suggested change:
   1.  Control codes from both the "C0" (U+0000..U+001F, U+007F)
       and "C1" (U+0080..U+009F) ranges,
       with the exception of HT (09), LF (0A) and CR (0D),
       SHOULD NOT be used unless required by exceptional circumstances.

Justification: The draft should state explicitly, with code point
values, which C0 and C1 control codes are and are not to be used.
Only HT, LF and CR are very widely used.
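
The ranges in the suggested text translate directly into a check such as
the following sketch (Python; the function names are hypothetical and
the code only reports the characters, it does not reject them):

```python
# HT, LF and CR are the only control codes exempted by the suggested text.
ALLOWED_CONTROLS = {0x09, 0x0A, 0x0D}


def is_discouraged_control(cp: int) -> bool:
    """True for C0 (U+0000..U+001F, U+007F) and C1 (U+0080..U+009F)
    code points other than HT, LF and CR."""
    in_c0 = cp <= 0x1F or cp == 0x7F
    in_c1 = 0x80 <= cp <= 0x9F
    return (in_c0 or in_c1) and cp not in ALLOWED_CONTROLS


def find_discouraged_controls(text: str) -> list:
    """Return (index, code point) pairs for each discouraged control code."""
    return [(i, ord(ch)) for i, ch in enumerate(text)
            if is_discouraged_control(ord(ch))]
```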
*** ***

   2.  CR MUST NOT appear except when immediately followed by either NUL
       or LF, with the latter (CR LF) designating the "new line"
       function.  Because page layout is better done in other ways and
       to avoid other types of confusion, CR NUL SHOULD preferably be
       avoided.

   3.  LF CR SHOULD NOT appear except as a side-effect of multiple CR LF
       sequences (e.g., CR LF CR LF).

*** Suggested change:
Remove points 2. and 3.

Justification: The other suggested changes permit CR and LF.
*** ***

 [...]

4.  Versions of Unicode

   In retrospect, one of the advantages of ASCII [X3.4-1978] when it was
   chosen was that the code space was full when the Standard was first
   published.  There was no practical way to add characters or change
   code point assignments without being obviously incompatible.  Unicode
   does not have that property: there are large blocks of space reserved
   for future expansion and new versions, with new characters and code
   point assignments, appear at regular intervals.

   While there are some security issues if people deliberately try to
   trick the system (see Section 6), Unicode version changes should not
   have a significant impact on the text stream specification of this
   document for the following reasons:

   o  The transformation between Unicode code table positions and the
      corresponding UTF-8 code is algorithmic; it does not depend on
      whether a code point has been assigned or not.

   o  The normalization specified here, NFC (see Section 3), performs a
      very limited set of mappings, much more limited than those of the
      more extensive NFKC used in, e.g., nameprep [RFC3491].

*** Suggested change:
Drop this second bullet and the following paragraph.

Justification: They become unnecessary once NFC is changed from MUST to
SHOULD.
*** ***

   The NFC tables may be updated over time as new characters are added,
   but the Unicode Consortium has guaranteed the stability of all NFC
   strings.  That is, if a string does not contain any unassigned
   characters, and it is normalized according to NFC, it will always be
   normalized according to all future versions of the Unicode Standard.
   The stability of the Net-Unicode format is thus guaranteed when any
   implementation that converts text into Net-Unicode format does not
   permit unassigned characters.

   Were Unicode to be changed in a way that violated these assumptions,
   i.e., that either invalidated the string order of RFC 3629 or that
   that changed the stability of NFC as stated above, this specification
   would not apply.  Put differently, this specification applies only to
   versions of Unicode starting with version 3.2 and extending to, but
   not including, any version for which no changes are made in either
   the UTF-8 definition or to NFC stability.

*** Suggested change:
Modify the paragraph above, removing references to NFC.

Justification: As a result, this specification will then apply to
versions of Unicode starting with version 2.0.
*** ***

[...]

5.2.  The Unicode Applicability Dilemma

[...]

*** Suggested change:
Add an item for a fifth way to get around the problem: strongly
encourage use of normalization form NFC in interchanged text, but do
not require it.

Justification: This is the alternative discussed here.
*** ***

9.1.  Normative References

*** Suggested change:
Please add a reference for [RFC3629], "UTF-8, a transformation format
of ISO 10646".

Justification: Missing reference.
*** ***

Best regards,
Markus Scherer
Google Software Internationalization
ICU Project Developer
