Re: Comments on draft-klensin-net-utf8-06

John C Klensin <john-ietf@jck.com> Fri, 04 January 2008 21:18 UTC

Date: Fri, 04 Jan 2008 16:18:40 -0500
From: John C Klensin <john-ietf@jck.com>
To: Frank Ellermann <nobody@xyzzy.claranet.de>
Subject: Re: Comments on draft-klensin-net-utf8-06
Message-ID: <00C50AA9B75A816BAB602CBD@p3.JCK.COM>
In-Reply-To: <ff49pf$rfn$1@ger.gmane.org>
References: <OF037DA1CA.695DAFC1-ONC1257376.004E5008-C1257376.00511560@notes.denic.de> <1CEEB76FCFC0070A7B2BDEAE@[10.1.0.164]> <ff49pf$rfn$1@ger.gmane.org>
MIME-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit
Content-Disposition: inline
Cc: discuss@apps.ietf.org
Precedence: list
Errors-To: discuss-bounces@apps.ietf.org

--On Wednesday, 17 October, 2007 08:21 +0200 Frank Ellermann
<nobody@xyzzy.claranet.de> wrote:

> John C Klensin wrote:
> 
>>> * Section 4: "the string order of RFC 3629". It's not very
>>> clear to me  what is meant with this. Byte order? Sorting
>>> order?
>  
>> 3629 specifies a byte order (in section 4).  It does not
>> address or mention sort order except to note (in the
>> introduction) that UTF-8 preserves it and that sort order
>> based on code point sequence is likely to be fairly useless.
>  
>> I _think_ I would welcome text to clarify this
> 
> Please simplify the remark and remove one "that":
> 
> -| Were Unicode to be changed in a way that violated these
> -| assumptions, i.e., that either invalidated the string order
> -| of RFC 3629 or that that changed the stability of NFC as
> -| stated above, this specification would not apply.
> 
> +| Were Unicode to be changed in a way that violated these
> +| assumptions, i.e., that changed the stability of NFC as
> +| stated above, this specification would not apply.
> 
> UTF-8 as specified in STD 63 is stable.  

Sure it is.  But were the definition of normal UTF-8 string
order _in the Unicode spec_ to change (however unlikely that
might be), we would be in deep trouble, since many (most?)
implementers depend on UTF-8 libraries, not RFC 3629-specific
ones.  Those added words have no effect as long as these basic
definitions (NFC or byte order) don't change, but they do help
to explain where the dependencies lie.
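
For readers who want to see the ordering property that RFC 3629
describes in its introduction, here is a minimal sketch, assuming
Python 3.  The function name and the sample strings are
hypothetical and chosen only for illustration.  It checks that
sorting strings by their UTF-8 bytes gives the same result as
sorting them by code point sequence.

```python
def sorts_identically(strings):
    """Return True if UTF-8 byte order and code point order agree.

    Python compares str values code point by code point and bytes
    values byte by byte, so the two sorts below exercise exactly the
    two orderings in question.
    """
    by_code_point = sorted(strings)
    by_utf8_bytes = sorted(strings, key=lambda s: s.encode("utf-8"))
    return by_code_point == by_utf8_bytes


# Illustrative samples covering 1-, 2-, 3- and 4-byte encodings.
samples = ["z", "\u00e9", "\u20ac", "\uffff", "\U00010000", "a\u0301", "ab", "a", ""]

if __name__ == "__main__":
    print(sorts_identically(samples))
```

A check like this only demonstrates the property for the chosen
samples; it says nothing about what a given UTF-8 library would do
if the underlying definitions were ever to change.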

>> So I am loath to cover things that are well-covered in 
>> 3629 lest more confusion be created.
> 
> Yes, just don't mention the "byte-value lexicographic sorting
> order of UTF-8 strings", it's covered in STD 63, and besides
> not very interesting.

That is one of several reasons I left it out.

>>> * Section 4: I would drop the last paragraph, since it is a
>>> repetition of  what is exhaustively explained in section 5.2.
>>> I got a parsing error at  the last sentence of that paragraph
>>> anyway.
> 
> Indeed, that paragraph is unnecessary.  I also can't parse its
> first sentence.
> 
>> That last sentence could be restated, less formally, as:
>  
>> If one encounters a UTF-8 string in a protocol, and its
>> syntax and properties are not specifically defined, then
>> it is reasonable to assume that it conforms to this
>> specification.
> 
> I still don't understand this.  What is a UTF-8 string with
> "unspecified syntax"?  STD 63 specifies the syntax of UTF-8,
> anything not following this syntax is invalid.

I've rewritten that paragraph a bit, but unless (contrary to the
letter and spirit of RFC 3629) we require BOM indications in
every UTF-8 string, there will be strings floating around that
look like UTF-8 but that may or may not follow the rules of 3629
(e.g., minimal forms) or this spec (e.g., 3629 plus NFC).  In
that regard, protocols that don't reference this spec for a
definition or set of requirements are out of luck -- this spec
can't say anything about them (that is an advantage for some
purposes).  If the relevant protocol _does_ reference this spec,
then we have a robustness principle issue, specifically whether
the receiving system can assume that an incoming string conforms
(and generate an error message or behave in unexpected ways if
it does not) or whether it is required to somehow make it conform,
regardless of what nonsense (even nonsense that doesn't conform
to 3629) it might be sent.  This sentence is intended to point
in the former direction.  I don't think spelling it out much
more explicitly in the spec is desirable, although I could be
talked into that if someone supplied text.
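
The two receiver behaviours contrasted above can be sketched as
follows, assuming Python 3 and its standard unicodedata module.
The function names are hypothetical, and the checks cover only
the two rules named in this message (well-formed UTF-8 per RFC
3629, including minimal forms, and NFC), not everything the draft
requires.

```python
import unicodedata


def accept_strict(raw: bytes) -> str:
    """Assume conformance; report an error if the input does not conform."""
    try:
        # Python's strict decoder rejects ill-formed sequences,
        # including non-minimal (overlong) forms such as C0 AF.
        text = raw.decode("utf-8", errors="strict")
    except UnicodeDecodeError as exc:
        raise ValueError(f"not well-formed UTF-8: {exc}") from exc
    if unicodedata.normalize("NFC", text) != text:
        raise ValueError("well-formed UTF-8, but not in NFC")
    return text


def accept_lenient(raw: bytes) -> str:
    """Try to make the input conform, whatever was sent."""
    # Ill-formed bytes become U+FFFD, then the result is normalized.
    text = raw.decode("utf-8", errors="replace")
    return unicodedata.normalize("NFC", text)


if __name__ == "__main__":
    decomposed = "cafe\u0301".encode("utf-8")  # 'e' + combining acute
    print(accept_lenient(decomposed) == "caf\u00e9")
    try:
        accept_strict(decomposed)
    except ValueError as err:
        print("rejected:", err)
```

accept_strict corresponds to the "former direction" mentioned
above: the receiver is entitled to expect conforming input and may
simply fail otherwise.  accept_lenient silently alters what was
sent, which is the behaviour the sentence is meant to avoid
requiring.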

> The net-utf8 I-D doesn't specify any default properties, what
> is an assumption that "unspecified properties" conform to
> net-utf8 supposed to do ?  
> 
> If you're talking about unassigned code points please say so.
> In that case it's covered in 5.2, and you can simply delete
> the last paragraph of section 4.

I hope the above more or less answers your question.  Yes,
unassigned code points are on that list, but they aren't the
only thing.
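
For the unassigned-code-point case specifically, a minimal sketch,
assuming Python 3 and a hypothetical helper name.  It relies on
the general category "Cn" reported by the standard unicodedata
module, so the answer depends on the Unicode version bundled with
the interpreter (see unicodedata.unidata_version), and "Cn" also
covers noncharacters.

```python
import unicodedata


def unassigned_code_points(text: str) -> list[str]:
    """List code points in text that this Unicode database marks as 'Cn'."""
    return [
        f"U+{ord(ch):04X}"
        for ch in text
        if unicodedata.category(ch) == "Cn"
    ]


if __name__ == "__main__":
    print(unicodedata.unidata_version)
    print(unassigned_code_points("abc"))
```

Because the set of assigned code points grows with each Unicode
release, a string that trips this check on one system may pass on
another with a newer database.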

>> I'm going to hold the document for a few days before 
>> re-posting in the hope of getting comments from others.
> 
> Please update the [NFC] reference, s/March 2005/2006-10-12/
> for the version belonging to TUS 5.0.

Done

thanks,
   john
