Re: Comments on draft-klensin-net-utf8-06
John C Klensin <john-ietf@jck.com> Fri, 04 January 2008 21:18 UTC
Date: Fri, 04 Jan 2008 16:18:40 -0500
From: John C Klensin <john-ietf@jck.com>
To: Frank Ellermann <nobody@xyzzy.claranet.de>
Subject: Re: Comments on draft-klensin-net-utf8-06
Message-ID: <00C50AA9B75A816BAB602CBD@p3.JCK.COM>
In-Reply-To: <ff49pf$rfn$1@ger.gmane.org>
References: <OF037DA1CA.695DAFC1-ONC1257376.004E5008-C1257376.00511560@notes.denic.de> <1CEEB76FCFC0070A7B2BDEAE@[10.1.0.164]> <ff49pf$rfn$1@ger.gmane.org>
Cc: discuss@apps.ietf.org

--On Wednesday, 17 October, 2007 08:21 +0200 Frank Ellermann
<nobody@xyzzy.claranet.de> wrote:

> John C Klensin wrote:
>
>>> * Section 4: "the string order of RFC 3629". It's not very
>>> clear to me what is meant with this. Byte order? Sorting
>>> order?
>
>> 3629 specifies a byte order (in section 4). It does not
>> address or mention sort order except to note (in the
>> introduction) that UTF-8 preserves it and that sort order
>> based on code point sequence is likely to be fairly useless.
>
>> I _think_ I would welcome text to clarify this
>
> Please simplify the remark and remove one "that":
>
> -| Were Unicode to be changed in a way that violated these
> -| assumptions, i.e., that either invalidated the string order
> -| of RFC 3629 or that that changed the stability of NFC as
> -| stated above, this specification would not apply.
>
> +| Were Unicode to be changed in a way that violated these
> +| assumptions, i.e., that changed the stability of NFC as
> +| stated above, this specification would not apply.
>
> UTF-8 as specified in STD 63 is stable.

Sure it is. But, were the definition of normal UTF-8 string
order _in the Unicode spec_ to change (however unlikely that
might be), we would be in deep trouble, since many (most?)
implementers depend on UTF-8 libraries, not RFC 3629-specific
ones. Those added words have no effect as long as these basic
definitions (NFC or byte order) don't change, but they do help
to explain where the dependencies lie.

>> So I am loathe to cover things that are well-covered in
>> 3629 lest more confusion be created.
>
> Yes, just don't mention the "byte-value lexicographic sorting
> order of UTF-8 strings", it's covered in STD 63, and besides
> not very interesting.

That is one of several reasons I left it out.

>>> * Section 4: I would drop the last paragraph, since it is a
>>> repetition of what is exhaustively explained in section 5.2.
>>> I got a parsing error at the last sentence of that paragraph
>>> anyway.
>
> Indeed, that paragraph is unnecessary. I also can't parse its
> first sentence.
>
>> That last sentence could be restated, less formally, as:
>
>> If one encounters a UTF-8 string in a protocol, and its
>> syntax and properties are not specifically defined, then
>> it is reasonable to assume that it conforms to this
>> specification.
>
> I still don't understand this. What is an UTF-8 string with
> "unspecified syntax" ? STD 63 specifies the syntax of UTF-8,
> anything not following this syntax is invalid.

I've rewritten that paragraph a bit, but unless (contrary to the
letter and spirit of RFC 3629) we require BOM indications in
every UTF-8 string, there will be strings floating around that
look like UTF-8 but that may or may not follow the rules of 3629
(e.g., minimal forms) or this spec (e.g., 3629 plus NFC).

In that regard, protocols that don't reference this spec for a
definition/set of requirements are out of luck -- this spec
can't say anything about them (that is an advantage for some
purposes). If the relevant protocol _does_ reference this spec,
then we have a robustness principle issue, specifically whether
the receiving system can assume that an incoming string conforms
(and generate an error message or behave in unexpected ways if
it does not) or if it is required to somehow make it conform,
regardless of what nonsense (even nonsense that doesn't conform
to 3629) it might be sent. This sentence is intended to point
in the former direction. I don't think spelling it out much
more explicitly in the spec is desirable, although I could be
talked into that if someone supplied text.

> The net-utf8 I-D doesn't specify any default properties, what
> is an assumption that "unspecified properties" conform to
> net-utf8 supposed to do ?
>
> If you're talking about unassigned code points please say so.
> In that case it's covered in 5.2, and you can simply delete
> the last paragraph of section 4.

I hope the above more or less answers your question. Yes,
unassigned code points are on that list, but they aren't the
only thing.

>> I'm going to hold the document for a few days before
>> re-posting in the hope of getting comments from others.
>
> Please update the [NFC] reference, s/March 2005/2006-10-12/
> for the version belonging to TUS 5.0.

Done.

thanks,
   john
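
As an illustration of the receiving-side check discussed above, here is
a minimal sketch in Python of a receiver that rejects a non-conforming
string rather than trying to repair it. It tests only the two
properties named in the message: well-formed UTF-8 per RFC 3629
(including minimal forms) and NFC. The function name conforms() is
hypothetical and does not come from the draft, and any further
requirements the draft places on strings are not checked here.

```python
import unicodedata


def conforms(raw: bytes) -> bool:
    """Return True if raw is well-formed UTF-8 and its text is in NFC.

    Hypothetical helper for illustration only; it covers just the two
    properties mentioned in the message, not the whole specification.
    """
    try:
        # Python's strict UTF-8 decoder rejects non-minimal (overlong)
        # forms and other ill-formed sequences.
        text = raw.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    # A string is in NFC exactly when normalizing it changes nothing.
    return unicodedata.normalize("NFC", text) == text


if __name__ == "__main__":
    samples = {
        "precomposed e-acute": b"caf\xc3\xa9",
        "e + combining acute": b"cafe\xcc\x81",
        "overlong encoding": b"\xc0\xaf",
    }
    for label, raw in samples.items():
        print(label, conforms(raw))
```

A receiver taking the "former direction" would report an error when
conforms() returns False, instead of normalizing the input itself.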