[Ltru] Re: UTF-8
Frank Ellermann <nobody@xyzzy.claranet.de> Sun, 17 September 2006 10:13 UTC
Received: from [127.0.0.1] (helo=stiedprmman1.va.neustar.com) by megatron.ietf.org with esmtp (Exim 4.43) id 1GOtec-0007LG-VG; Sun, 17 Sep 2006 06:13:38 -0400
Received: from [10.91.34.44] (helo=ietf-mx.ietf.org) by megatron.ietf.org with esmtp (Exim 4.43) id 1GOteb-0007KJ-Jq for ltru@lists.ietf.org; Sun, 17 Sep 2006 06:13:37 -0400
Received: from main.gmane.org ([80.91.229.2] helo=ciao.gmane.org) by ietf-mx.ietf.org with esmtp (Exim 4.43) id 1GOteX-0005iQ-V4 for ltru@lists.ietf.org; Sun, 17 Sep 2006 06:13:37 -0400
Received: from list by ciao.gmane.org with local (Exim 4.43) id 1GOteC-0003ah-0F for ltru@lists.ietf.org; Sun, 17 Sep 2006 12:13:12 +0200
Received: from pd9fba9c1.dip0.t-ipconnect.de ([217.251.169.193]) by main.gmane.org with esmtp (Gmexim 0.1 (Debian)) id 1AlnuQ-0007hv-00 for <ltru@lists.ietf.org>; Sun, 17 Sep 2006 12:13:11 +0200
Received: from nobody by pd9fba9c1.dip0.t-ipconnect.de with local (Gmexim 0.1 (Debian)) id 1AlnuQ-0007hv-00 for <ltru@lists.ietf.org>; Sun, 17 Sep 2006 12:13:11 +0200
X-Injected-Via-Gmane: http://gmane.org/
To: ltru@lists.ietf.org
From: Frank Ellermann <nobody@xyzzy.claranet.de>
Date: Sun, 17 Sep 2006 12:06:51 +0200
Organization: <URL:http://purl.net/xyzzy>
Lines: 120
Message-ID: <450D1E3B.7064@xyzzy.claranet.de>
References: <789E617C880666438EDEE30C2A3E8D10EEFC@mailsrvnt05.enet.sharplabs.com> <450B2B75.2F36@xyzzy.claranet.de> <6.0.0.20.2.20060916114849.081056e0@localhost> <450BD347.9EA@xyzzy.claranet.de> <6.0.0.20.2.20060917154808.08a12880@localhost>
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit
X-Complaints-To: usenet@sea.gmane.org
X-Gmane-NNTP-Posting-Host: pd9fba9c1.dip0.t-ipconnect.de
X-Mailer: Mozilla 3.0 (OS/2; U)
X-Spam-Score: 0.0 (/)
X-Scan-Signature: b132cb3ed2d4be2017585bf6859e1ede
Cc:
Subject: [Ltru] Re: UTF-8
X-BeenThere: ltru@ietf.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Language Tag Registry Update working group discussion list <ltru.ietf.org>
List-Unsubscribe: <https://www1.ietf.org/mailman/listinfo/ltru>, <mailto:ltru-request@ietf.org?subject=unsubscribe>
List-Archive: <http://www1.ietf.org/pipermail/ltru>
List-Post: <mailto:ltru@ietf.org>
List-Help: <mailto:ltru-request@ietf.org?subject=help>
List-Subscribe: <https://www1.ietf.org/mailman/listinfo/ltru>, <mailto:ltru-request@ietf.org?subject=subscribe>
Errors-To: ltru-bounces@ietf.org
Martin Duerst wrote:

>> I've no clue what FTP clients would do with something
>> like that.

> Just save the file. If it's transferred as text, things
> may go wrong for an EBCDIC system. If transferred as
> binary, it will still be UTF-8.

The old FTP server on my box needed an option "-cp none" for
"raw mode", otherwise it would try to "get it right" with weird
results (cp 858 to cp 1252 or vice versa, my number of UTF-8
plain text files is still *_zero_*)

>> John's "net-UTF-8" draft isn't ready, and I had some
>> troubles to understand his last published version.

> That isn't very relevant, except that it suggests
> normalization to NFC, which I assume we would do anyway,
> and which is already an issue in the case of NCRs.

Let's please just use whatever the source standards use,
garbage in garbage out (drawing the line at "invalid").

In an unrelated (SASLprep) attempt to check how bad this is
some weeks ago my conclusion was "bad enough" after I found
418 combining characters. The last thing we need in a 4646bis
is a fixed Unicode version (or convoluted prose why that's
unnecessary).

>> for RFC 4646 there were proposals to restrict the
>> registry to Latn.

> Bad idea. Any average browser installation has way
> wider coverage.

Or way less, almost certainly not "all Latn", the next least
common denominator after Latin-1 could be MES-1. Fighting
with obscure glyph lists is another issue to be solved by the
source standards, not 4646bis.

[UTF-8 Bokmål mangled to Latin-1]

> Well, and then interpreted as Shift-JIS in the case of
> my mailer :-(.

Folks will discuss registry entries on the review list, with
any MUA they have. With NCRs as abstract objects they'd be
at least sure _what_ they discuss.

>> for the bulk update, the fattest I-D in the history of
>> the Internet using B64, they'd shoot us.

> We could define a process experiment. As it is only for
> an I-D, not for an RFC, that should work.
Yes, and as long as the AD is okay with it we might get away
without 3933 for this detail - after a bounce from the
secretary ("too long" or "non-ASCII") until they believe it.

> I have seen non-ASCII stuff in I-Ds, but checks may be
> a bit more strict nowadays.

I'm not sure that it was always UTF-8, in one case where it
was windows-1252 the I-D wasn't published. If they're ready
to accept any non-ASCII it ought to include UTF-8, otherwise
they've to fix it anyway sooner or later.

>> We could try QP, and silently send them Doug's original
>> data (as mail) in addition to the I-D.

> If 'them' is IANA, that might work. If 'them' is the
> Internet-Drafts Editor, that'd not make sense.

IANA. I'd guess that the "I-D-editor" is in essence a mailbot
- with a human operator trying to fix some wild and wonderful
submissions where that mailbot isn't sure.

QP UTF-8 would be only a trick to get the draft published.
IANA could take the decoded attached file - or roll their own
QP I-D to registry decoder, it's no rocket science.

>> keep the & convention, that could be confusing.

> Yes. I'd agree that '&#x' sequences might need escaping,
> but I don't expect any of these. Simple & would be okay
> unescaped, it would trigger an error in strictly implemented
> RFC 4646 software, the same way as UTF-8, and would not
> cause any interpretation problems.

Assuming that we go that way (I still don't like it) we could
"forbid" &#x strings for backwards compatibility. With "forbid"
s/&#x//g. Or "forbid" s/&#x/u+/g, because it probably is
precisely the same problem in the sources.

To justify such manipulations in 4646bis the WG should adopt
the record-jar I-D as work item (informational RFC). I could
contribute 212 lines LTRU record-jar to XML as an example (in
the long version allowing to check all subtag references with
the W3C validator).

>> we'd have to decide if a signature is required, another
>> change of the record-jar ABNF.

> Yes. No signature, please.
I could in theory note UTF-8 as an "extended attribute" of the
file. At some point I've to start that anyway, my local "if
it's not ASCII it's cp 858" rule does not help for published
files, nobody knows what cp 858 is.

With or without signature, UTF-8 will fail miserably in some
constellations. With a signature you could say that it's not
the fault of the registry. Without signature, I don't see
where a misinterpretation as UTF-16 fails, some "line too long"
error ?

Maybe a dummy encoding="UTF-8" XML wrapper could help, one
element with a fixed xml:space="preserve" attribute (?)

Frank

_______________________________________________
Ltru mailing list
Ltru@ietf.org
https://www1.ietf.org/mailman/listinfo/ltru
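
The signature question above can be illustrated with a small Python sketch. It is not part of the original message and not a proposal for the registry format. The function name `decode_registry_bytes` and the sample record lines are hypothetical, made up for illustration only. The sketch shows three things: a strict UTF-8 decode that reports whether a signature (BOM) was present, the Latin-1 mangling of "Bokmål" mentioned earlier in the thread, and an even-length ASCII line decoding as UTF-16 without any error.

```python
import codecs
from typing import Tuple


def decode_registry_bytes(data: bytes) -> Tuple[str, bool]:
    """Decode bytes as strict UTF-8 and report whether a signature was found.

    Hypothetical helper: the name and return shape are assumptions,
    not anything defined by RFC 4646 or the record-jar draft.
    """
    has_signature = data.startswith(codecs.BOM_UTF8)
    if has_signature:
        # Drop the three signature bytes EF BB BF before decoding.
        data = data[len(codecs.BOM_UTF8):]
    # errors="strict" raises UnicodeDecodeError on invalid UTF-8.
    return data.decode("utf-8", errors="strict"), has_signature


# Made-up sample record line containing one non-ASCII character.
sample = "Description: Norwegian Bokmål\n"
plain = sample.encode("utf-8")
signed = codecs.BOM_UTF8 + plain

text, flag = decode_registry_bytes(signed)

# The same bytes read as Latin-1: every byte maps to some character,
# so the wrong guess produces garbage instead of an error.
mangled = plain.decode("latin-1")

# Even-length pure ASCII read as UTF-16 (little endian): each byte pair
# becomes one code unit below U+8000, never a surrogate, so this also
# decodes without an error.
as_utf16 = b"Subtag: nb\r\n".decode("utf-16-le")

print(repr(text), flag)
print(repr(mangled))
print(len(as_utf16), "characters, no decode error")
```

In this sketch only the strict UTF-8 decode can fail loudly. The Latin-1 and UTF-16 misreadings succeed silently, which matches the concern that a missing signature leaves a wrong guess undetected.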