Re: Comments on Unicode Format for Network Interchange

"Markus Scherer" <markus.icu@gmail.com> Mon, 18 June 2007 17:34 UTC

Message-ID: <6bb028490706181034r78352061kda89f149d05620a2@mail.gmail.com>
Date: Mon, 18 Jun 2007 10:34:42 -0700
From: "Markus Scherer" <markus.icu@gmail.com>
To: "Frank Ellermann" <nobody@xyzzy.claranet.de>
Subject: Re: Comments on Unicode Format for Network Interchange
In-Reply-To: <462E9074.14BD@xyzzy.claranet.de>
MIME-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Content-Disposition: inline
References: <6bb028490704231048s41deaf57q33ddb21fd0e76f17@mail.gmail.com> <462E9074.14BD@xyzzy.claranet.de>
Cc: discuss@apps.ietf.org, Mark Davis <mark.davis@icu-project.org>

Dear Frank,

Sorry for the very late reply.

I would like to point out that my main interest in the Net-Unicode
internet-draft and in sending comments is with the discussion of
Unicode normalization and versioning. I happily note that you didn't
disagree with my suggestions in that area, except for one suggested
text deletion (where I now mostly agree with you).

As for the issues on line endings and signatures:

On 4/24/07, Frank Ellermann <nobody@xyzzy.claranet.de> wrote:
> Markus Scherer wrote:
>
> > *** Suggested change:
> >    2.  Line-endings MUST be indicated by the sequence Carriage-Return
> >        (U+000D) followed by Line-Feed (U+000A), or by a single
> >        Carriage-Return (U+000D), or by a single Line-Feed (U+000A).
>
> -1F
>
> > Justification: We believe that single CR and LF are common because of
> > implementation practice on a variety of platforms, and that it is both
> > unrealistic and unnecessary to try to legislate them away.
>
> No, it causes havoc.
>
> > Applications already commonly handle all of CR, LF and CR+LF, and some
> > support even more characters according to the Unicode Newline
> > Guidelines.
>
> The draft isn't about arbitrary text or XML (where you'd also need NEL),
> it's about telnet.  It tries to extend ALPHA and DIGIT as used in some
> syntax constructs for text in Internet protocols, it doesn't try to
> introduce a new concept of "line" in these protocols.

I didn't understand from the internet-draft that it was only geared
towards telnet.

If it is indeed only about telnet, then the end-of-line definition
is of course fine, and I apologize for causing confusion.

If it is not intended just for telnet, then I believe allowing several
already-common forms of line endings is simply pragmatic.
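
For illustration only, here is a minimal Python sketch of what accepting
all three forms could look like on the receiving side: every CR+LF, lone
CR, or lone LF is rewritten as CR+LF. The helper name to_crlf is
hypothetical and does not come from the draft.

```python
import re

# Hypothetical helper, not taken from the draft.
# Matches CR+LF, a lone CR, or a lone LF. CR+LF is listed first so that
# an existing pair is treated as one line ending, not two.
_LINE_END = re.compile(r"\r\n|\r|\n")


def to_crlf(text: str) -> str:
    """Rewrite every accepted line ending as CR+LF."""
    return _LINE_END.sub("\r\n", text)
```

Because CR+LF is matched before the single characters, text that already
uses CR+LF passes through unchanged.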

> > *** Suggested change:
> >    4. The UTF-8 signature byte sequence (EF BB BF, UTF-8 encoding of
> >       U+FEFF, sometimes called Byte Order Mark ("BOM")), when it
> >       appears at the beginning of the text, SHOULD be deleted by the
> >       recipient.
>
> I don't think that works.  The draft isn't about local text or XML
> files, it's about Internet protocols, especially telnet, over the wire.

Fair enough, and I am not suggesting that Unicode signatures should be
used there. My suggestion was intended to clarify the specification
for when a Unicode signature is in fact included even when it is not
legal, not recommended, or simply not necessary. I find it easier, in
practice, to specify how to handle a situation than to just require
that it not occur.
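
A rough sketch of that handling, assuming the recipient works on the raw
bytes before decoding. The function name is made up for illustration; only
the byte sequence EF BB BF comes from the suggested text.

```python
# UTF-8 encoding of U+FEFF, as given in the suggested text.
UTF8_SIGNATURE = b"\xef\xbb\xbf"


def strip_utf8_signature(data: bytes) -> bytes:
    """Drop one UTF-8 signature if it appears at the very beginning."""
    if data.startswith(UTF8_SIGNATURE):
        return data[len(UTF8_SIGNATURE):]
    # Anywhere else the sequence is left untouched.
    return data
```

Only a leading signature is removed; the same bytes later in the text are
kept as they are.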

> >       If a Word Joiner is needed in the text, U+2060 WORD JOINER SHOULD
> >       be used instead of U+FEFF ZERO WIDTH NO-BREAK SPACE.
>
> Already covered by STD 63 (RFC 3629).

Ok, thanks.

> > *** Suggested change:
> >    1.  Control codes from both the "C0" (U+0000..U+001F, U+007F)
> >        and "C1" (U+0080..U+009F) ranges,
> >        with the exception of HT (09), LF (0A) and CR (0D),
> >        SHOULD NOT be used unless required by exceptional circumstances.
>
> > Justification: The sets of C0 and C1 control codes that should and
> > should not be used should be defined explicitly, and with code point
> > values. Only HT, LF and CR are very widely used.
>
> Makes sense, but HT can have surprising effects if it's "expanded" into
> one or more spaces, that would need a "security consideration".  Does

Possible, although HT is rarely expanded to spaces in low-level processing.
The most important point here is that the set of control codes be
specified very explicitly.

> DEL really belong to the C0 set ?  Maybe avoiding these old terms is
> clearer for readers today.

I don't actually know if DEL is formally a C0 control code.
I would be perfectly happy with specifying the control codes via
Unicode code point ranges and not using the terms "C0" and "C1".
(Although for charset old-timers these terms might be useful.)
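
As a sketch of how such an explicit specification might be checked in
code, using only the code point values from the suggested text (the names
below are hypothetical):

```python
# HT, LF and CR are the exceptions named in the suggested text.
ALLOWED_CONTROLS = {0x09, 0x0A, 0x0D}


def is_discouraged_control(cp: int) -> bool:
    """True for U+0000..U+001F and U+007F..U+009F, except HT, LF and CR."""
    in_control_range = 0x00 <= cp <= 0x1F or 0x7F <= cp <= 0x9F
    return in_control_range and cp not in ALLOWED_CONTROLS
```

U+007F and U+0080..U+009F are adjacent, so they are tested as a single
range here, which sidesteps the question of which named set DEL belongs to.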

> > *** Suggested change:
> > Remove points 2. and 3.
>
> See above.  Try to edit a plain text file using LF as line-end with the
> tool styling itself as "editor" on a Windows box, and you'll see what I
> mean.  IIRC there were some hot debates why the IETF ftp server sends
> text files with (only) LF, in theory breaking the ABNF in these files.
> This is a rathole, please just accept it as some IETF oddity.  We're
> not forced to use CRLF in local files if we hate it.

Again, the importance of the line ending encoding is way below that of
the normalization and versioning discussion. Here I was simply
suggesting that the draft allow what is already common practice - if
not for telnet, then for many other text formats.

> > *** Suggested change:
> > Drop this second bullet and the following paragraph.
>
> No, folks need to know that Unicode is a moving target to some degree,
> that only small and different subsets are supported by most devices,
> and that it's horribly complex in comparison with ASCII or many legacy
> charsets.  The advantage is obvious, some disadvantages are not.

I agree with you that the bullet ("The normalization
specified here, NFC (see Section 3), performs...") and the following
paragraph ("The NFC tables may be updated over time") should not be
deleted, at least not entirely. However, the last sentence of that
paragraph, which reads

  The stability of the Net-Unicode format is thus guaranteed when any
  implementation that converts text into Net-Unicode format does not
  permit unassigned characters.

should be deleted, because with a SHOULD for normalization the
stability of the Net-Unicode format does not depend on normalization
stability any more.
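
For readers who want to see what the quoted sentence describes, here is a
minimal Python sketch of a converter that refuses unassigned characters
and then applies NFC. It is an assumption-laden illustration, not
something the draft specifies: the function name is hypothetical, and
"unassigned" means whatever the Unicode version bundled with the Python
build in use says.

```python
import unicodedata


def to_nfc_rejecting_unassigned(text: str) -> str:
    """Reject code points of category Cn, then normalize to NFC."""
    for ch in text:
        # Cn covers code points unassigned in this Python's Unicode
        # database (it also covers noncharacters).
        if unicodedata.category(ch) == "Cn":
            raise ValueError(f"unassigned code point U+{ord(ch):04X}")
    return unicodedata.normalize("NFC", text)
```

The result of the check depends on the Unicode version of the
implementation, which is the versioning issue under discussion.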

> > Suggested change: Please add a reference for [RFC3629] UTF-8, a
> > transformation format of ISO 10646
>
> There is a RFC 3629 (STD 63) reference, it's in the first part with
> the normative references.

Thanks, and sorry for the oversight.

Best regards,
markus