Re: Comments on draft-klensin-net-utf8-06

Marcos Sanz/Denic <sanz@denic.de> Thu, 18 October 2007 08:45 UTC

In-Reply-To: <1CEEB76FCFC0070A7B2BDEAE@[10.1.0.164]>
To: John C Klensin <john-ietf@jck.com>
Subject: Re: Comments on draft-klensin-net-utf8-06
MIME-Version: 1.0
From: Marcos Sanz/Denic <sanz@denic.de>
Message-ID: <OF823CA755.B4F0DAF7-ONC1257378.002D1284-C1257378.0030055F@notes.denic.de>
Date: Thu, 18 Oct 2007 10:44:30 +0200
Content-Type: text/plain; charset="US-ASCII"
Cc: discuss@apps.ietf.org
Precedence: list
Errors-To: discuss-bounces@apps.ietf.org

John,

> While I would welcome suggestions about other text and ways to
> organize this,

What about adding a forward-pointing note directly under bullet 2:

OLD TEXT:

  CR SHOULD NOT appear except when followed by LF.

SUGGESTED TEXT:

  CR SHOULD NOT appear except when followed by LF. The only other allowed 
appearance is in the combination CR NUL, which is not recommended (see 
note at the end of this section).
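
To illustrate how an implementor might read the suggested text, here is a 
minimal sketch, not taken from the draft. The function name 
bare_cr_positions is hypothetical, and it assumes the input is a raw byte 
sequence in which CR is legitimate only before LF or NUL:

```python
def bare_cr_positions(data: bytes) -> list[int]:
    """Return the offsets of CR bytes that are followed by neither LF nor NUL.

    Hypothetical helper for illustration only; it reports, it does not repair.
    """
    CR, LF, NUL = 0x0D, 0x0A, 0x00
    bad = []
    for i, byte in enumerate(data):
        if byte != CR:
            continue
        # A CR at the very end of the data has no successor and is flagged too.
        nxt = data[i + 1] if i + 1 < len(data) else None
        if nxt not in (LF, NUL):
            bad.append(i)
    return bad
```

A checker like this only flags bare CRs. Treating CR NUL as "allowed but 
not recommended" would need a second, softer warning level.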



There is a similarly contradictory situation in bullet 3 (less restrictive 
at the beginning and suddenly more restrictive at the end). The first 
sentence reads

  [...] control characters (U+0000 to U+001F and U+007F to U+009F) SHOULD 
generally be avoided

vs the last sentence

  the so-called "C1 Controls" (U+0080 through U+009F) MUST NOT appear

This is a nightmare for any implementor. The double negation doesn't 
help clarity either. What about changing 

OLD TEXT:

  control characters (U+0000 to U+001F and U+007F to U+009F) SHOULD 
generally be avoided.

SUGGESTED TEXT:

  control characters (U+0000 to U+001F and U+007F) SHOULD NOT be used and 
the so-called "C1 controls" (U+0080 to U+009F) MUST NOT be used.

Then you can drop the last sentence of the bullet.
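
With the two ranges separated like this, the rule maps directly onto a 
simple classification. The following is only a sketch under that 
reading. The name control_status is hypothetical, and it deliberately 
ignores the exceptions the draft makes elsewhere (for example CR LF as a 
line ending):

```python
def control_status(ch: str) -> str:
    """Classify one character under the suggested wording of bullet 3.

    Illustrative only: exceptions such as CR LF line endings are not modeled.
    """
    cp = ord(ch)
    if 0x80 <= cp <= 0x9F:
        return "MUST NOT"    # C1 controls, U+0080 to U+009F
    if cp <= 0x1F or cp == 0x7F:
        return "SHOULD NOT"  # C0 controls and DEL
    return "ok"
```

For example, control_status("\x07") gives "SHOULD NOT" and 
control_status("\x85") gives "MUST NOT".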

> >   Suggested text:
> > 
> >  That is, if a string does not contain any unassigned
> >  characters for a given version of Unicode, and it is
> > normalized according  to
> >  the definition of NFC in that version, it will always result
> > in the same  normalized string according to all future
> > versions of the Unicode  Standard.
> 
> The text that was used was supplied by Mark Davis after my first
> attempt didn't come out right.

He'll certainly know better.
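
For readers who want to see the precondition of the stability statement 
in concrete terms, here is a rough sketch using Python's standard 
unicodedata module. The function name is hypothetical. It checks against 
whatever Unicode version the running Python ships, and it approximates 
"unassigned" by general category Cn, which also covers noncharacters:

```python
import unicodedata


def is_stable_nfc(s: str) -> bool:
    """True if s has no Cn code points and is already in NFC.

    Sketch only: "Cn" is used as a stand-in for "unassigned in this
    Unicode version", which is an approximation.
    """
    has_unassigned = any(unicodedata.category(c) == "Cn" for c in s)
    return not has_unassigned and unicodedata.normalize("NFC", s) == s
```

A string that passes this check under one Unicode version is the kind of 
string the stability guarantee is meant to cover in later versions.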

> > * Section 4: "the string order of RFC 3629". It's not very
> > clear to me  what is meant with this. Byte order? Sorting
> > order?
> 
> 3629 specifies a byte order (in section 4).  It does not address
> or mention sort order except to note (in the introduction) that
> UTF-8 preserves it and that sort order based on code point
> sequence is likely to be fairly useless.
> 
> I _think_ I would welcome text to clarify this

I support Frank's suggestion, which I'll copy here again for clarity:

-| Were Unicode to be changed in a way that violated these
-| assumptions, i.e., that either invalidated the string order
-| of RFC 3629 or that that changed the stability of NFC as
-| stated above, this specification would not apply.

+| Were Unicode to be changed in a way that violated these
+| assumptions, i.e., that changed the stability of NFC as
+| stated above, this specification would not apply.

And again, UTF-8 as specified in STD 63 is stable.

> So I am loathe to cover things that
> are well-covered in 3629 lest more confusion be created.

We fully agree. I only think that the reference in the old text is not 
necessary.

> > * Section 4: I would drop the last paragraph, since it is a
> > repetition of  what is exhaustively explained in section 5.2.
> > I got a parsing error at  the last sentence of that paragraph
> > anyway.
> 
> Hmm.  It parses for me.   But I agree about the redundancy,

Ok, so we drop it. Now about the last sentence:

> except for that last sentence, which makes a normative assertion
> about this specification that does not appear in Section 5.
> That last sentence could be restated, less formally, as:
> 
>    If one encounters a UTF-8 string in a protocol, and its
>    syntax and properties are not specifically defined, then
>    it is reasonable to assume that it conforms to this
>    specification.

The old formulation mentioned "unidentified UTF-8 strings"; the new 
formulation mentions a UTF-8 string with syntax and properties "not 
specifically defined". I am sure you have something in mind, but it still 
doesn't come across. And if you are aiming at a normative assertion, then 
normative language should be used, not something vague like "it is 
reasonable to assume".

Thanks and best regards,
Marcos
