Re: Comments on Unicode Format for Network Interchange

Frank Ellermann <nobody@xyzzy.claranet.de> Mon, 18 June 2007 21:23 UTC

To: discuss@apps.ietf.org
From: Frank Ellermann <nobody@xyzzy.claranet.de>
Subject: Re: Comments on Unicode Format for Network Interchange
Date: Mon, 18 Jun 2007 23:18:14 +0200
Organization: <URL:http://purl.net/xyzzy>
Lines: 99
Message-ID: <4676F696.321@xyzzy.claranet.de>
References: <6bb028490704231048s41deaf57q33ddb21fd0e76f17@mail.gmail.com> <462E9074.14BD@xyzzy.claranet.de> <6bb028490706181034r78352061kda89f149d05620a2@mail.gmail.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
X-Mailer: Mozilla 3.0 (OS/2; U)

Markus Scherer wrote:
 
> Sorry for the very late reply.

No problem, I still have that "ping" dated 2007-03-29 sitting in
my "Todos" folder, waiting for better times when nobody is trying
to update the core e-mail RFCs, or for similar emergencies... ;-)

> I didn't understand from the internet-draft that it was only
> geared towards telnet.

> If it is indeed only about telnet, then the end-of-line definition
> is of course fine, and apologies for causing confusion.

The "only" might be misleading: a bunch of other protocols used
to be based on telnet, and inherited some of its features more or
less clearly.  I guess John has to explain in a new version of
this I-D what exactly he's talking about (a long string of
"updates RFCs x, y, z" in the header), and where he only proposes
to update other protocols (like "whois") in the spirit of his I-D.

Maybe 2821bis (SMTP) is an interesting example: at the moment it
still "allows" control characters in places where I'd prefer to
get rid of them (as recommended in the "net-Unicode" I-D).

Of course 2821bis will insist on CR LF for line ends.  Just an
example, it's the same issue for many IETF protocols.  A bad
case, IMO, is FTP, where "a" (= ASCII text, with line ends
converted to whatever passes as the "local" convention) is the
default.

The "local" convention supported on my box amounts to "anything
but a bare CR works as expected".  And besides, I rarely use FTP
for text, and I get garbage when I forget the "b" (binary).
Fortunately I was never forced to grab a text file from an EBCDIC
FTP server. :-)

> If it is not intended just for telnet, then I believe
> allowing several already-common forms of line endings is
> simply pragmatic.

It would cause havoc for some scripts I use (POP3, HTTP, SMTP,
ident, whois, simple stuff).  The SMTP and whois scripts fix
a bare LF in the body to CR LF on the fly, but they'd break
miserably if somebody decided that a bare ASCII CR is a line end.
Let alone Latin-* NEL, or weirder scenarios.
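
A minimal sketch of that kind of on-the-fly fix, in Python.  The
name fix_bare_lf is hypothetical and the code is not taken from the
scripts mentioned above; it only assumes that a bare LF becomes
CR LF, that existing CR LF pairs stay intact, and that a bare CR is
deliberately not treated as a line end.

```python
import re

# An LF that is not already preceded by a CR.
_BARE_LF = re.compile(rb"(?<!\r)\n")


def fix_bare_lf(data: bytes) -> bytes:
    """Turn each bare LF into CR LF; leave CR LF and bare CR untouched."""
    return _BARE_LF.sub(b"\r\n", data)
```

Applied chunk by chunk, a CR LF pair split across two chunks would
need extra care; the sketch assumes the whole body is in one buffer.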

So it's not "only" telnet, but it still is about something
you can completely ignore, unless you need to post a NetNews
article with telnet to the NNTP port, or similar stunts.

 [BOM]
> My suggestion was intended to clarify the specification for
> when a Unicode signature is in fact included even when it is
> not legal, not recommended, or simply not necessary. I find
> it easier, in practice, to specify how to handle a situation
> than to just require that it not occur.

Okay, admittedly I don't like it if I get a mail where the
body starts with a BOM, because my obsolete MUA treats UTF-8
as windows-1252 while claiming that it's Latin-1.  Something on
the sender's side could have silently removed this BOM.
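
As a rough illustration of such a sender-side fix, here is a small
Python sketch.  The function name strip_leading_bom is made up for
this example, and the sketch assumes the body is already UTF-8
bytes.  It removes one leading UTF-8 signature (EF BB BF) and
touches nothing else.

```python
import codecs


def strip_leading_bom(body: bytes) -> bytes:
    """Drop a single leading UTF-8 signature (EF BB BF), if present."""
    if body.startswith(codecs.BOM_UTF8):
        return body[len(codecs.BOM_UTF8):]
    return body
```

Read as windows-1252, those three octets show up as three stray
characters (U+00EF, U+00BB, U+00BF) in front of the text, which is
the effect described above.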

But the I-D is more general, it has no concept of "body" or
SDU.  Maybe it should mention the BOM issue in an example (?)

For telnet it's IMO pointless: that's lines, or rather control
sequences plus text defining the content of a screen, so the I-D
can't say that a BOM should be silently removed when it would
end up at the beginning of a line.  That's ignoring BiDi, and
besides, the draft deprecates all control characters, even HT,
allowing only CR LF, in this order.

Is it your idea to eliminate BOM like the I-D eliminates SUB
(= EOF on some platforms)?  If that's your point, I think it's
better solved in RFC 3629, which is already at STD.

Otherwise the I-D could arguably also talk about non-characters,
surrogates, overlong UTF-8, more-than-21-bits UTF-8, etc., all
addressed in RFC 3629.
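
To make these cases concrete, a small Python sketch follows.  The
helper name is_valid_utf8 and the sample constants are made up for
illustration.  It leans on Python's built-in strict UTF-8 decoder,
which rejects overlong forms, encoded surrogates, and five-byte
sequences.  A non-character such as U+FFFE is well-formed UTF-8
and decodes without error, so it would need a separate check.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if the strict UTF-8 decoder accepts the octets."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# Illustrative inputs (names are hypothetical).
OVERLONG_SLASH = b"\xc0\xaf"               # two-octet form of U+002F
SURROGATE = b"\xed\xa0\x80"                # UTF-8-style form of U+D800
BEYOND_21_BITS = b"\xf8\x88\x80\x80\x80"   # five-octet sequence
NONCHARACTER = b"\xef\xbf\xbe"             # U+FFFE, well-formed UTF-8
```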

> the last sentence of that paragraph, which reads
 
>   The stability of the Net-Unicode format is thus guaranteed
>   when any implementation that converts text into Net-Unicode
>   format does not permit unassigned characters.
 
> should be deleted, because with a SHOULD for normalization the
> stability of the Net-Unicode format does not depend on 
> normalization stability any more.

I can't judge it; my last attempt to implement something in this
direction ended, again, with giving up.  IOW the SHOULD is wishful
thinking as far as my scripts are concerned.  Maybe I'll try it
again with "level one" later; actually I was only curious what
SASLPREP really does in addition to its NFC part (it uses
case-sensitive NFKC plus tons of "prohibited", "mapped to
nothing", and "mapped to space" rules).
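
For the NFC versus NFKC difference only, a short Python sketch
using the standard unicodedata module.  This is not a SASLPREP
implementation, and the helper names nfc and nfkc are made up
here.  It just shows that NFC composes combining sequences, that
NFKC additionally folds compatibility characters such as the "fi"
ligature, and that neither form changes case.

```python
import unicodedata


def nfc(text: str) -> str:
    """Canonical composition (NFC)."""
    return unicodedata.normalize("NFC", text)


def nfkc(text: str) -> str:
    """Compatibility composition (NFKC); does not fold case."""
    return unicodedata.normalize("NFKC", text)


# "e" followed by U+0301 COMBINING ACUTE ACCENT.
DECOMPOSED_E_ACUTE = "e\u0301"
# U+FB01 LATIN SMALL LIGATURE FI, a compatibility character.
FI_LIGATURE = "\ufb01"
```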

Frank