[Ltru] Re: UTF-8

Frank Ellermann <nobody@xyzzy.claranet.de> Sun, 17 September 2006 10:13 UTC


Martin Duerst wrote:

>> I've no clue what FTP clients would do with something 
>> like that.

> Just save the file. If it's transferred as text, things
> may go wrong for an EBCDIC system. If transferred as
> binary, it will still be UTF-8.

The old FTP server on my box needed the option "-cp none"
for "raw mode"; otherwise it would try to "get it right",
with weird results (cp 858 to cp 1252 or vice versa; my
number of UTF-8 plain text files is still *_zero_*).

>> John's "net-UTF-8" draft isn't ready, and I had some
>> troubles to understand his last published version.
 
> That isn't very relevant, except that it suggests
> normalization to NFC, which I assume we would do anyway,
> and which is already an issue in the case of NCRs.

Let's please just use whatever the source standards use,
garbage in garbage out (drawing the line at "invalid").

Some weeks ago, in an unrelated (SASLprep) attempt to check
how bad this is, my conclusion was "bad enough" after I
found 418 combining characters.

The last thing we need in a 4646bis is a fixed Unicode
version (or convoluted prose explaining why that's
unnecessary).

>> for RFC 4646 there were proposals to restrict the 
>> registry to Latn.
> Bad idea. Any average browser installation has way 
> wider coverage.

Or way less, and almost certainly not "all Latn"; the next
least common denominator after Latin-1 could be MES-1.
Fighting with obscure glyph lists is another issue to
be solved by the source standards, not 4646bis.

 [UTF-8 Bokm&#xE5;l mangled to Latin-1]
> Well, and then interpreted as Shift-JIS in the case of
> my mailer :-(.

Folks will discuss registry entries on the review list,
with any MUA they have.  With NCRs as abstract objects
they'd be at least sure _what_ they discuss.
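
To illustrate the point about NCRs surviving any MUA: a minimal
sketch, assuming Python 3 and only hexadecimal references of the
&#x...; form. The function names decode_ncrs and encode_ncrs are
made up for this example and are not part of RFC 4646 or any
registry tooling.

```python
import re

# Hexadecimal numeric character references such as &#xE5;
NCR_HEX = re.compile(r"&#x([0-9A-Fa-f]{1,6});")

def decode_ncrs(text: str) -> str:
    """Replace each hexadecimal NCR with the code point it names."""
    def repl(match: re.Match) -> str:
        value = int(match.group(1), 16)
        # Leave references beyond the Unicode range untouched.
        return chr(value) if value <= 0x10FFFF else match.group(0)
    return NCR_HEX.sub(repl, text)

def encode_ncrs(text: str) -> str:
    """Replace every non-ASCII character with a hexadecimal NCR."""
    return "".join(c if ord(c) < 128 else f"&#x{ord(c):X};" for c in text)

ascii_form = "Norwegian Bokm&#xE5;l"      # safe in a 7bit us-ascii mail
decoded = decode_ncrs(ascii_form)         # contains U+00E5
round_trip = encode_ncrs(decoded)         # back to the ASCII form
```

The ASCII form is what everybody on the review list would see,
whatever charset their mailer guesses.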

>> for the bulk update, the fattest I-D in the history of
>> the Internet using B64, they'd shoot us.
> We could define a process experiment. As it is only for
> an I-D, not for an RFC, that should work.

Yes, and as long as the AD is okay with it we might get
away without an RFC 3933 process experiment for this
detail, after a bounce from the secretary ("too long" or
"non-ASCII") until they believe it.

> I have seen non-ASCII stuff in I-Ds, but checks may be
> a bit more strict nowadays.

I'm not sure that it was always UTF-8; in one case where
it was windows-1252 the I-D wasn't published.  If they're
ready to accept any non-ASCII it ought to include UTF-8,
otherwise they'll have to fix it anyway sooner or later.

>> We could try QP, and silently send them Doug's original
>> data (as mail) in addition to the I-D.
> If 'them' is IANA, that might work.  If 'them' is the
> Internet-Drafts Editor, that'd not make sense.

IANA.  I'd guess that the "I-D-editor" is in essence a
mailbot - with a human operator trying to fix some wild
and wonderful submissions where that mailbot isn't sure.
  
QP UTF-8 would only be a trick to get the draft published.
IANA could take the decoded attached file, or roll their
own QP I-D to registry decoder; it's not rocket science.
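
A QP to registry decoder could look roughly like this. It is
only a sketch, assuming Python 3 and its standard quopri module;
the function names and the sample record line are invented for
illustration, and nothing here is an agreed IANA procedure.

```python
import quopri

def encode_qp_utf8(text: str) -> bytes:
    """Encode text as UTF-8, then as quoted-printable (pure ASCII)."""
    return quopri.encodestring(text.encode("utf-8"))

def decode_qp_utf8(qp_bytes: bytes) -> str:
    """Undo quoted-printable, then decode the result as UTF-8.

    Raises UnicodeDecodeError if the decoded bytes are not valid
    UTF-8, which draws the line at "invalid".
    """
    return quopri.decodestring(qp_bytes).decode("utf-8")

sample = "Description: Norwegian Bokm\u00e5l\n"
wire = encode_qp_utf8(sample)        # what would go into the I-D
restored = decode_qp_utf8(wire)      # what the registry would hold
```

The decoded result can be compared byte by byte with the original
data sent separately as mail.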

>> keep the &amp; convention, that could be confusing.
> Yes. I'd agree that '&#x' sequences might need escaping,
> but I don't expect any of these. Simple & would be okay
> unescaped, it would trigger an error in strictly implemented
> RFC 4646 software, the same way as UTF-8, and would not
> cause any interpretation problems.

Assuming that we go that way (I still don't like it) we
could "forbid" &#x strings for backwards compatibility,
where "forbid" means s/&#x//g.  Or s/&#x/u+/g, because it
is probably precisely the same problem in the sources.
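
The two substitutions above, written out as a small sketch in
Python 3. The function names are hypothetical; the behaviour is
just the literal global replacement that the s///g notation
describes.

```python
def strip_ncr_prefix(text: str) -> str:
    """Equivalent of s/&#x//g: drop every literal '&#x'."""
    return text.replace("&#x", "")

def rewrite_ncr_prefix(text: str) -> str:
    """Equivalent of s/&#x/u+/g: turn every literal '&#x' into 'u+'."""
    return text.replace("&#x", "u+")

source_text = "Bokm&#xE5;l"
stripped = strip_ncr_prefix(source_text)     # "BokmE5;l"
rewritten = rewrite_ncr_prefix(source_text)  # "Bokmu+E5;l"
```

Either way no '&#x' sequence is left that older software could
take for an NCR.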

To justify such manipulations in 4646bis the WG should 
adopt the record-jar I-D as work item (informational RFC).

I could contribute a 212-line LTRU record-jar to XML
converter as an example (in the long version it allows
checking all subtag references with the W3C validator).

>> we'd have to decide if a signature is required, another
>> change of the record-jar ABNF.

> Yes. No signature, please.

I could in theory note UTF-8 as an "extended attribute"
of the file.  At some point I have to start that anyway;
my local "if it's not ASCII it's cp 858" rule does not
help for published files, nobody knows what cp 858 is.

With or without signature, UTF-8 will fail miserably in
some configurations.  With a signature you could say that
it's not the fault of the registry.  Without signature, I
don't see where a misinterpretation as UTF-16 fails; some
"line too long" error?

Maybe a dummy encoding="UTF-8" XML wrapper could help, one
element with a fixed xml:space="preserve" attribute (?)
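
The signature question can be tried out with a few lines of
Python 3. This is a sketch under assumptions: the sample record
is made up, it is pure ASCII with an even number of bytes, and
the misreading is forced as little-endian UTF-16. Other input
may well raise a decoding error instead.

```python
import codecs

def has_utf8_signature(data: bytes) -> bool:
    """True if the data starts with the UTF-8 signature EF BB BF."""
    return data.startswith(codecs.BOM_UTF8)

record = "Type: language\nSubtag: nb\n"   # 26 ASCII characters
plain = record.encode("utf-8")            # no signature
signed = codecs.BOM_UTF8 + plain          # with signature

# The "utf-8-sig" codec strips the signature when it is present.
clean = signed.decode("utf-8-sig")

# Misreading the unsigned bytes as UTF-16: every pair of ASCII
# bytes becomes one unrelated character and no line feed survives,
# so a consumer sees one long line rather than a decoding error.
misread = plain.decode("utf-16-le")
```

For this sample the misread text has 13 characters and no "\n",
which fits the guess of a "line too long" symptom.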

Frank


