Re: Comments on draft-klensin-net-utf8-06

John C Klensin <> Tue, 16 October 2007 15:51 UTC

Date: Tue, 16 Oct 2007 11:51:27 -0400
From: John C Klensin <>
To: "Marcos Sanz/Denic" <>,
Subject: Re: Comments on draft-klensin-net-utf8-06
List-Id: general discussion of application-layer protocols <>

--On Tuesday, 16 October, 2007 16:45 +0200 "Marcos Sanz/Denic"
<> wrote:

> To my eyes the document is in good shape, but it still leaves
> one of the  issues at stake partially open:
> Section 2, bullet 2 says "CR SHOULD NOT appear except when
> followed by  LF". The last paragraph of the section 2 says "CR
> MUST NOT appear unless  it is immediately followed by LF (...)
> or NUL". To me the first of these  statements is much less
> restrictive than the second.

Marcos (and others who have been trapped by this),

While I would welcome suggestions about other text and ways to
organize this, the two statements are perfectly consistent with
each other, not a more restrictive/less restrictive pair.

It won't fit with the existing text and layout in this form, but
what is being said is:

   if there is a CR, 
       it MUST be followed by either LF or NUL
       however, NUL SHOULD be avoided too, so 
       CR LF is the only recommended context in which CR 
          SHOULD be used.
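
For anyone who finds code easier to read than nested conditionals in
prose, the rule above can be sketched as a small checker. This is only
an illustration of the logic as stated in this message, not text from
the draft; the function name and the three result labels are made up
for the example.

```python
def check_cr_usage(data: bytes) -> str:
    """Classify the use of CR in a byte string (hypothetical helper).

    "violation":   some CR is not followed by LF or NUL (the MUST NOT case)
    "discouraged": every CR is followed by LF or NUL, but CR NUL occurs
    "ok":          no CR at all, or every CR is part of CR LF
    """
    CR, LF, NUL = 0x0D, 0x0A, 0x00
    result = "ok"
    for i, byte in enumerate(data):
        if byte != CR:
            continue
        # A CR at the very end of the data has nothing following it.
        nxt = data[i + 1] if i + 1 < len(data) else None
        if nxt == LF:
            continue
        if nxt == NUL:
            result = "discouraged"
        else:
            return "violation"
    return result
```

With this sketch, b"a\r\nb" is classified as "ok", b"a\r\x00b" as
"discouraged", and b"a\rb" as "violation".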

> Nitpicking:
> * Section 1.1: s/variable length/variable-length/ for
> coherence with other  text appearances

fixed in working draft.

> * Section 3: s/to convert all canonically equivalent sequences
> a single  unique form/to convert all canonically equivalent
> sequences into a single  unique form/

s/into/to/, but fixed.

> * Section 4: The definition of the normalization stability is
> misleading,  actually even wrong.
>   Old text: 
>  That is, if a string does not contain any unassigned
>  characters, and it is normalized according to NFC, it will
> always be  normalized according to all future versions of the
> Unicode Standard.
>   Suggested text:
>  That is, if a string does not contain any unassigned
>  characters for a given version of Unicode, and it is
> normalized according  to
>  the definition of NFC in that version, it will always result
> in the same  normalized string according to all future
> versions of the Unicode  Standard.

The text that was used was supplied by Mark Davis after my first
attempt didn't come out right.   I'm happy to put anything in
there that people agree to be correct, but please sort this out
with him and/or other UTC members.
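
To make the precondition under discussion concrete, here is a rough
sketch of the two tests involved: "contains no unassigned characters"
and "is already in NFC". It relies on Python's standard unicodedata
module, so both tests are evaluated against whatever Unicode version
that module ships with (unicodedata.unidata_version). The helper name
is made up for illustration, and the sketch does not settle which
wording of the stability definition is correct.

```python
import unicodedata


def meets_stability_precondition(s: str) -> bool:
    """Hypothetical helper: True if s has no unassigned code points
    (general category Cn) and is already normalized according to NFC,
    both judged by the Unicode version bundled with this Python."""
    has_unassigned = any(unicodedata.category(ch) == "Cn" for ch in s)
    is_nfc = unicodedata.normalize("NFC", s) == s
    return not has_unassigned and is_nfc


composed = unicodedata.normalize("NFC", "e\u0301")  # e + combining acute
```

Here "e\u0301" is not in NFC, so the helper returns False for it, while
its normalized form (the single code point U+00E9) passes.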

> * Section 4: "the string order of RFC 3629". It's not very
> clear to me  what is meant with this. Byte order? Sorting
> order?

3629 specifies a byte order (in section 4).  It does not address
or mention sort order except to note (in the introduction) that
UTF-8 preserves it and that sort order based on code point
sequence is likely to be fairly useless.

I _think_ I would welcome text to clarify this but please note
that it is not likely to be possible to use this spec without
understanding and following 3629 (that is what the normative
reference is all about).  So I am loath to cover things that
are well-covered in 3629 lest more confusion be created.
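
As an aside on the sort-order remark, the property that RFC 3629 notes
(byte-wise ordering of UTF-8 matches code point ordering) is easy to
check on a handful of strings. The sample strings below are arbitrary
choices for illustration; this is a spot check, not a proof.

```python
# Arbitrary sample covering 1-, 2-, 3- and 4-byte UTF-8 sequences.
words = ["z", "\u00e9", "a", "\u4e2d", "\U0001F600", "Z"]

# Order by sequence of code points.
by_code_point = sorted(words, key=lambda s: [ord(c) for c in s])

# Order by the raw bytes of the UTF-8 encoding.
by_utf8_bytes = sorted(words, key=lambda s: s.encode("utf-8"))

same_order = by_code_point == by_utf8_bytes
```

For this sample the two orderings agree, which also shows why such an
order is of limited use for people: "Z" sorts before "a", and accented
letters sort after all of unaccented ASCII.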

> * Section 4: I would drop the last paragraph, since it is a
> repetition of  what is exhaustively explained in section 5.2.
> I got a parsing error at  the last sentence of that paragraph
> anyway.

Hmm.  It parses for me.   But I agree about the redundancy,
except for that last sentence, which makes a normative assertion
about this specification that does not appear in Section 5.
That last sentence could be restated, less formally, as:

	If one encounters a UTF-8 string in a protocol, and its
	syntax and properties are not specifically defined, then
	it is reasonable to assume that it conforms to this
	specification.

That assumption might, of course, be wrong, which is one reason
it is important to be careful about having string-receivers
assume that the string-transmitter normalized it.

I would appreciate input from others about what to do about this.

> * Section 5.2: s/[RFC3454])/[RFC3454]),/

It is correct as written ("..stored in normalized form if...")
although the long parenthetical note is a little obnoxious.
Suggestions welcome.

> * Section 5.2, bullet 4: "This process has been discussed in
> the Unicode  Consortium under the name 'Stable NFC'". That
> might very well be but the  only hits I get when googling for
> that string are this very draft and some  contributions of the
> draft author to some mailing lists. So I cast doubt  on the
> utility of introducing this new term here which is a pointer
> to  nowhere. Is this again referring to the normalization
> stability policy of  unicode?

No.   I will recheck the terminology.

I'm going to hold the document for a few days before re-posting
in the hope of getting comments from others.

thanks and regards,