Re: Comments on Unicode Format for Network Interchange

Frank Ellermann <nobody@xyzzy.claranet.de> Tue, 24 April 2007 23:55 UTC

To: discuss@apps.ietf.org
From: Frank Ellermann <nobody@xyzzy.claranet.de>
Subject: Re: Comments on Unicode Format for Network Interchange
Date: Wed, 25 Apr 2007 01:19:16 +0200
Organization: <URL:http://purl.net/xyzzy>
Lines: 99
Message-ID: <462E9074.14BD@xyzzy.claranet.de>
References: <6bb028490704231048s41deaf57q33ddb21fd0e76f17@mail.gmail.com>

Markus Scherer wrote:

> *** Suggested change:
>    2.  Line-endings MUST be indicated by the sequence Carriage-Return
>        (U+000D) followed by Line-Feed (U+000A), or by a single
>        Carriage-Return (U+000D), or by a single Line-Feed (U+000A).

-1F

> Justification: We believe that single CR and LF are common because of
> implementation practice on a variety of platforms, and that it is both
> unrealistic and unnecessary to try to legislate them away.

No, it causes havoc.

> Applications already commonly handle all of CR, LF and CR+LF, and some
> support even more characters according to the Unicode Newline
> Guidelines.

The draft isn't about arbitrary text or XML (where you'd also need NEL);
it's about telnet.  It tries to extend ALPHA and DIGIT as used in some
syntax constructs for text in Internet protocols; it doesn't try to
introduce a new concept of "line" in these protocols.
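To make the difference concrete, here is a minimal Python sketch of a
check for the strict rule (CR LF only), as opposed to the suggested
"CR, LF, or CR+LF" rule.  The function name and the sample input are
made up for illustration; nothing here is taken from the draft.

```python
import re
from typing import List

# Under the strict rule a line ending is exactly CR LF. Match a CR that
# is not followed by LF, or an LF that is not preceded by CR.
_BARE_CR_OR_LF = re.compile(rb"\r(?!\n)|(?<!\r)\n")

def find_bare_line_endings(data: bytes) -> List[int]:
    """Return the byte offsets of every bare CR or bare LF in data."""
    return [m.start() for m in _BARE_CR_OR_LF.finditer(data)]

if __name__ == "__main__":
    sample = b"ok\r\nbare lf\nbare cr\rend\r\n"
    print(find_bare_line_endings(sample))  # prints [11, 19]
```

A receiver that accepts all three forms has to decide whether CR
followed by LF is one line end or two, which is where the ambiguity
starts.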

> *** Suggested change:
>    4. The UTF-8 signature byte sequence (EF BB BF, UTF-8 encoding of
>       U+FEFF, sometimes called Byte Order Mark ("BOM")), when it
>       appears at the beginning of the text, SHOULD be deleted by the
>       recipient.

I don't think that works.  The draft isn't about local text or XML
files; it's about Internet protocols, especially telnet, over the wire.
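For reference, this is roughly what "delete the signature at the
beginning of the text" means for a receiver that already holds a
complete byte string.  It is only a sketch with a hypothetical function
name; on a stream such as telnet the receiver would first have to
decide where "the beginning of the text" is, which the code does not
address.

```python
import codecs

def strip_utf8_signature(data: bytes) -> bytes:
    """Drop a leading EF BB BF; leave any later U+FEFF untouched."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data

if __name__ == "__main__":
    print(strip_utf8_signature(b"\xef\xbb\xbfabc"))   # prints b'abc'
    print(strip_utf8_signature(b"a\xef\xbb\xbfb"))    # unchanged
```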

>       If a Word Joiner is needed in the text, U+2060 WORD JOINER SHOULD
>       be used instead of U+FEFF ZERO WIDTH NO-BREAK SPACE.

Already covered by STD 63 (RFC 3629).

> *** Suggested change:
>    1.  Control codes from both the "C0" (U+0000..U+001F, U+007F)
>        and "C1" (U+0080..U+009F) ranges,
>        with the exception of HT (09), LF (0A) and CR (0D),
>        SHOULD NOT be used unless required by exceptional circumstances.

> Justification: The sets of C0 and C1 control codes that should and
> should not be used should be defined explicitly, and with code point
> values. Only HT, LF and CR are very widely used.

Makes sense, but HT can have surprising effects if it's "expanded" into
one or more spaces; that would need a "security consideration".  Does
DEL really belong to the C0 set?  Maybe avoiding these old terms is
clearer for readers today.
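A small Python sketch of both points, with made-up function names: the
first function flags the code points named in the suggested text
(U+0000..U+001F, U+007F, U+0080..U+009F) except HT, LF and CR, and the
last lines show how HT expansion changes the data.  Treating DEL
together with the other controls simply follows the quoted suggestion;
it is not a claim about which set DEL belongs to.

```python
from typing import List, Tuple

ALLOWED = {0x09, 0x0A, 0x0D}  # HT, LF, CR

def is_discouraged_control(cp: int) -> bool:
    """True for the control code points listed in the suggested text."""
    in_ranges = cp <= 0x1F or cp == 0x7F or 0x80 <= cp <= 0x9F
    return in_ranges and cp not in ALLOWED

def discouraged_controls(text: str) -> List[Tuple[int, str]]:
    """Return (index, 'U+XXXX') for each discouraged control in text."""
    return [(i, "U+%04X" % ord(ch))
            for i, ch in enumerate(text)
            if is_discouraged_control(ord(ch))]

if __name__ == "__main__":
    print(discouraged_controls("a\x00b\tc\x7f\u0085"))
    # One HT becomes seven spaces with 8-column tab stops, so a string
    # comparison after expansion no longer sees the original data.
    print("a\tb".expandtabs(8) == "a" + " " * 7 + "b")  # prints True
```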

> *** Suggested change:
> Remove points 2. and 3.

See above.  Try to edit a plain text file using LF as line-end with the
tool styling itself as "editor" on a Windows box, and you'll see what I
mean.  IIRC there were some hot debates about why the IETF ftp server
sends text files with (only) LF, in theory breaking the ABNF in these
files.  This is a rathole; please just accept it as some IETF oddity.
We're not forced to use CRLF in local files if we hate it.

> *** Suggested change:
> Drop this second bullet and the following paragraph.

No, folks need to know that Unicode is a moving target to some degree,
that only small and different subsets are supported by most devices,
and that it's horribly complex in comparison with ASCII or many legacy
charsets.  The advantage is obvious; some disadvantages are not.
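The "moving target" part can be seen from any implementation that ships
its own copy of the Unicode character database.  A minimal Python
sketch, with a hypothetical helper name: the answer for a given code
point depends on the database version of the interpreter, so two peers
can disagree about the same text.  U+0378 is used below only as an
example of a code point that is unassigned in current versions.

```python
import unicodedata

def is_assigned(ch: str) -> bool:
    """True unless the local database marks ch as unassigned (Cn)."""
    return unicodedata.category(ch) != "Cn"

if __name__ == "__main__":
    # The version differs between interpreter releases.
    print("Unicode database version:", unicodedata.unidata_version)
    print(is_assigned("A"))
    print(is_assigned("\u0378"))
```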

> ***
> Suggested change: Please add a reference for [RFC3629] UTF-8, a
> transformation format of ISO 10646

There is an RFC 3629 (STD 63) reference; it's in the first part with
the normative references.

Frank
--
<reference
    anchor='UTR36'
    target='http://www.unicode.org/reports/tr36'>
<front>
    <title>Unicode Security Considerations</title>
    <author initials='M.' surname='Davis'
            fullname='Mark Davis'>
            <organization />
    </author>
    <author initials='M.' surname='Suignard'
            fullname='Michel Suignard'>
            <organization />
    </author>
    <date month='August' day='11' year='2006' />
</front>
<seriesInfo name='Unicode Technical Reports' value='#36' />
</reference>