draft-klensin-net-utf8-05.txt (was: Re: Form feed in Net-UTF8?)

Martin Duerst <duerst@it.aoyama.ac.jp> Tue, 09 October 2007 09:44 UTC

Date: Tue, 09 Oct 2007 18:17:39 +0900
To: John C Klensin <john-ietf@jck.com>
From: Martin Duerst <duerst@it.aoyama.ac.jp>
Subject: draft-klensin-net-utf8-05.txt (was: Re: Form feed in Net-UTF8?)
Cc: discuss@apps.ietf.org, Mark Davis <mark.davis@icu-project.org>

At 17:17 07/10/06, John C Klensin wrote:

>Martin, since I already shipped -05 and, thanks to the new
>automated system, it has been posted, please see how you like
>the treatment there.   I think we agree on the principles, so
>would welcome suggestions for further tuning.

I think the treatment of FF looks reasonable. What I'm missing
is any mention of HT (horizontal tab). HT is often part of white
space, e.g. in formats such as XML, and heavily used in programming
languages, etc. It may not be part of NVT ASCII; in that case, maybe
a short note would help. (I later found HT mentioned in the Appendix;
a pointer to that would probably be enough.)

As for the overall document, here are a few more comments:

- 1.1, in my view, is still too much history. If the idea is just to
  establish the need for the new format, the details for UTF-8,
  UTF-16, and UTF-32, as one example, are not needed there.

- In section 3, I don't see any MUST for NFC, although from reading
  the text, it very much seems like that was intended (and I'd agree
  that's the way to go). Looking around, I found this in Section 2.
  Putting it in Section 2 is okay, but then there should be some
  pointer in Section 3. Also, if the definition is in Section 2,
  adding more requirements in other sections isn't very helpful.
  It would be better to either include the "must not transmit
  unassigned codepoints" to Section 2, or to change point 4 in
  section 2 to say "normalized according to details described in
  section 3".

  BTW, Section 2 should not say "Unicode method "NFC""; this would
  be the first spec I see that calls NFC a 'method'.

- Section 3 is written as if NFC were the only normalization form.
  E.g. "The section above requires that all Net-Unicode strings be
  transmitted in normalized form." or "The Unicode Consortium
  specifies a normalization method, known as NFC [NFC]...".
  This document doesn't need to go into details about other forms,
  but it shouldn't keep the informed or half-informed reader wondering
  "so, what about the other forms", or "any reason why they choose
  NFC?" or so.
  As the reason for NFC, I'd suggest some text like the following:
  "Of all normalization forms defined by Unicode, NFC is closest
  to actual use in practice and, in general, requires the least
  work when converting from a non-Unicode encoding."
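
To illustrate the point about the other forms, here is a minimal sketch
using Python's standard unicodedata module. The helper name all_forms is
made up for this example; it is not from the draft or from Unicode.

```python
import unicodedata

def all_forms(s):
    """Return the four Unicode normalization forms of s, keyed by name."""
    return {form: unicodedata.normalize(form, s)
            for form in ("NFC", "NFD", "NFKC", "NFKD")}

# Text converted from a legacy encoding such as ISO 8859-1 arrives
# precomposed, so NFC leaves it unchanged.
legacy = b"caf\xe9".decode("latin-1")
forms = all_forms(legacy)
print(forms["NFC"] == legacy)            # NFC: same string as the input
print([hex(ord(c)) for c in forms["NFD"]])  # NFD: 'e' followed by U+0301
```

The compatibility forms (NFKC, NFKD) additionally fold characters such as
the "fi" ligature U+FB01 into plain letters, which changes the text more
than most protocols want.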

- "Systems conforming to this specification MUST NOT transmit any string
  containing any code point that is unassigned in the version of
  Unicode and NFC on which they are dependent.": The 'and' here is
  somewhat problematic. First, the statement only works for
  'version of Unicode', because NFC as such doesn't come with a list
  of assigned codepoints. The wording from
  http://unicode.org/standard/stability_policy.html#Normalization
  is better: "... contains only characters from a given version of the Unicode, and it
  is put into a normalized form in accordance with that version of Unicode".
  Second, the 'and' means that we are fine if at least one of the
  conditions is met, i.e. I can use a character from Unicode 5.0
  with a normalization implementation from Unicode 3.0 and I'm still
  okay with the above text.
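
A rough sketch of what "both conditions, against the same version" could
look like, assuming Python's unicodedata module (which uses one Unicode
data version for both checks). The function name violations is
hypothetical, and treating general category Cn as "unassigned" is a
simplification, since Cn also covers noncharacters.

```python
import unicodedata

def violations(s):
    """List the reasons s fails the 'assigned and NFC' requirement."""
    problems = []
    # General category 'Cn' covers unassigned code points (and
    # noncharacters) in the Unicode version this Python build ships with.
    if any(unicodedata.category(ch) == "Cn" for ch in s):
        problems.append("unassigned code point")
    # Normalization is computed from the same Unicode data as the check
    # above, so both conditions refer to one version.
    if unicodedata.normalize("NFC", s) != s:
        problems.append("not in NFC")
    return problems

print("Unicode data version:", unicodedata.unidata_version)
print(violations("cafe\u0301"))  # decomposed input is reported as not NFC
```

A string is acceptable only when the returned list is empty, i.e. when
both conditions hold, not just one of them.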

- Section 4 starts with: "In retrospect, one of the advantages of ASCII".
  This is again too much focus on history. This fact can be mentioned
  e.g. at the end of the first paragraph, but the start of the section
  should concentrate on the problem at hand. Otherwise, this document
  already *reads* outdated now; just imagine how it will read in
  five or ten years.

- Section 4 nails down normalization at Unicode version 3.2. Also,
  it says that an update may be necessary for any changes to normalization.
  In fact, after Unicode version 3.2, there are already corrigenda to
  normalization, and it would be good if this document said clearly
  how they should be dealt with. I would propose the following:

  Corrigendum #4: Five CJK Canonical Mapping Errors
  (see http://www.unicode.org/versions/corrigendum4.html):
  This fixes five obvious mapping errors. Obviously, the new mappings
  should be used, because the old mappings lead to data corruption.
  No explicit labeling of the new mappings is necessary.

  Corrigendum #5: Normalization Idempotency
  (see http://www.unicode.org/versions/corrigendum5.html):
  This fixes a logical inconsistency in the textual description of
  the normalization algorithm. No reasonably occurring text is affected
  (by an extremely generous definition of reasonably occurring).
  No explicit labeling of the new mappings is necessary, because
  data isn't actually affected.

  I also think that the discussion on normalization versions should be
  in the normalization section.

- Section 4 says that non-assigned codepoints should not be used.
  Does that include or exclude codepoints in the private use areas?
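
The distinction matters because the Unicode character data itself keeps
the two apart: private-use code points have general category Co, while
unassigned ones have Cn. A small sketch, assuming Python's unicodedata
module (the function name assignment_status is invented for this
example):

```python
import unicodedata

def assignment_status(ch):
    """Classify one character by its Unicode general category."""
    cat = unicodedata.category(ch)
    if cat == "Cn":
        return "unassigned"      # reserved code points and noncharacters
    if cat == "Co":
        return "private-use"     # e.g. U+E000..U+F8FF, planes 15 and 16
    if cat == "Cs":
        return "surrogate"
    return "assigned"

for ch in ("A", "\ue000", "\uffff"):
    print(hex(ord(ch)), assignment_status(ch))
```

A check that only rejects Cn would therefore let private-use code points
through, so the text should say which behavior is wanted.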

- Section 5.1 says:
  "During the development of this specification, there was some
   confusion about where it would be useful given that, e.g., MIME and
   HTTP have their own rules about UTF-8 character types."
  This reads as if there were special provisions in MIME or HTTP about
  UTF-8. Such a misunderstanding should be avoided. I'm not sure
  I understand what you are referring to, but if you mean line-ending
  conventions, then MIME is with you, and for HTTP, it's not an issue
  of UTF-8, but more general. If you are referring to the fact that
  MIME/HTTP don't require NFC normalization, then don't blame your spec;
  NFC wasn't around yet when MIME and HTTP were specified.

- Later in section 5.1, it says:
  "In particular, if this proposal is approved, or even appears to be
   getting significant traction,": Obviously, such text will look
  weird in an eventual RFC, and should be reworded, e.g. to something
  like a simple applicability statement:
  This specification is intended for use by other specifications that
  have not yet defined how to use Unicode. Examples could be Telnet...
  or FTP... This specification is not intended for use with specifications
  that already allow the use of UTF-8, such as MIME or HTTP.

- I remember a discussion about the fact that Unicode in theory can
  contain a base character followed by an unlimited number of combining
  characters, and that in a protocol context, checking such combinations
  may not be feasible (you in theory need a buffer of unlimited length).
  There was the idea to restrict this format to a certain number of
  combining characters; where has this idea gone? From a protocol
  implementer's viewpoint, I think it is still necessary. It is actually
  defined at
  http://www.unicode.org/reports/tr15/tr15-26.html#Stream_Safe_Text_Format
  but I saw some problems there, too, so I'm pinging the Unicode guys.
  (The main problem is that this form seems to always end up being
   based on NFKC, which would be counter-productive.)
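
To make the buffering concern concrete, here is a simplified sketch of a
bounded check, assuming Python's unicodedata module. It counts runs of
characters with a non-zero canonical combining class in the string as
given; it does not implement the decomposition-based definition in
UAX #15. The limit of 30 is the value I understand Stream-Safe Text
Format to use, and the function names are made up for this example.

```python
import unicodedata

MAX_NONSTARTERS = 30  # assumed limit, as in Stream-Safe Text Format

def longest_nonstarter_run(s):
    """Length of the longest run of characters with combining class != 0."""
    longest = current = 0
    for ch in s:
        if unicodedata.combining(ch) != 0:
            current += 1
            longest = max(longest, current)
        else:
            current = 0  # a starter ends the run
    return longest

def within_limit(s, limit=MAX_NONSTARTERS):
    """True if no run of combining characters exceeds the limit."""
    return longest_nonstarter_run(s) <= limit

print(within_limit("a" + "\u0301" * 31))  # one combining mark too many
```

With such a bound, a receiver only needs a fixed-size buffer to
normalize or verify incoming text.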

- I suggest a different title for 5.2. The current title sounds too much
  like FUD. The title should make clear that it is about design decisions,
  and maybe it actually belongs in an appendix (it's not normative,
  as far as I understand). To reduce any potential remaining FUD, it
  should also be pointed out that although the current stability rules
  do not rule out that a new script with both precomposed and decomposed
  representation is encoded, the chance that this happens is pretty
  slim, because the precomposed/decomposed dichotomy mainly came from
  legacy encodings. Also, unless you add a new base letter or a new
  combining character, creating a new precomposed/decomposed pair
  for an existing script is impossible, and one can fairly assume that
  virtually all base letters AND combining characters are already encoded,
  and that for new ones, no precomposed encodings will be created, again
  for the same reasons as above.

- 5.2 mentions "IETF Standard UTF-8" without a definition or a reference.
  Not sure what is intended, so please clarify.

- Sentences such as these:
  "While not specifically a security issue, the requirement in NVT, and
   hence here, that, except as "newline" (CR LF), the CR character never
   appear alone but only when followed by ASCII NUL (an octet with all
   bits zero) may be problematic for some programming languages, and
   hence a trap for the unwary, unless caution is used."
  should be reworded. If the RFC Editor does even just a halfway job,
  this sentence won't remain as it is, and it's better to fix it now
  than leave that to the RFC Editor (at least in my experience).
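
For what it's worth, the rule that the quoted sentence tries to state is
simple enough to express as a check. A minimal sketch in Python, with a
hypothetical function name, that reports every CR octet not followed by
LF or NUL:

```python
def bare_cr_offsets(data):
    """Return offsets of CR octets not followed by LF (newline) or NUL."""
    offsets = []
    for i, octet in enumerate(data):
        if octet == 0x0D:
            # A CR at the very end of the data has no following octet.
            nxt = data[i + 1] if i + 1 < len(data) else None
            if nxt not in (0x0A, 0x00):
                offsets.append(i)
    return offsets

print(bare_cr_offsets(b"line one\r\nover\r\x00strike\rbad"))
```

The trap for programming languages is the CR NUL case: code that treats
NUL as a string terminator, as C-style strings do, will silently cut the
text off at that point.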

I haven't read the Appendix in detail, but I don't think it affects
the above comments.

Regards,   Martin.




#-#-#  Martin J. Du"rst, Assoc. Professor, Aoyama Gakuin University
#-#-#  http://www.sw.it.aoyama.ac.jp       mailto:duerst@it.aoyama.ac.jp