[Ltru] Re: [OT] Re: UTF-8

Frank Ellermann <nobody@xyzzy.claranet.de> Fri, 15 September 2006 11:07 UTC

To: ltru@lists.ietf.org
From: Frank Ellermann <nobody@xyzzy.claranet.de>
Date: Fri, 15 Sep 2006 13:03:35 +0200
Organization: <URL:http://purl.net/xyzzy>
Lines: 64
Message-ID: <450A8887.EAB@xyzzy.claranet.de>
References: <E1GNzAK-0005WV-Uf@megatron.ietf.org> <007501c6d890$b96ff410$6401a8c0@DGBP7M81>
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit
Subject: [Ltru] Re: [OT] Re: UTF-8
X-BeenThere: ltru@ietf.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Language Tag Registry Update working group discussion list <ltru.ietf.org>

Doug Ewell wrote:

> I'm not sure why you chose 0x86 as your sequence introducer

The 6 in 0x86 is the number of trailing octets: 91 90 9F 9F 9F 9F
(for John's example U+10FFFF).
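
A rough sketch of that layout in Python, inferred from this one
example only: it assumes the introducer is 0x80 plus the number of
trailing octets, and that each trailing octet is 0x90 plus one hex
digit of the code point.  The name utf4_escape is hypothetical, and
how UTF-4 treats code points below U+0100 is not covered here.

```python
def utf4_escape(code_point: int) -> bytes:
    """Encode one code point as introducer + one octet per hex digit.

    Assumption (backed only by the U+10FFFF example in this thread):
    introducer = 0x80 + digit count, trailing octet = 0x90 + digit.
    """
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("not a Unicode code point")
    # Split the code point into its hex digits, most significant first.
    digits = [int(d, 16) for d in format(code_point, "X")]
    return bytes([0x80 + len(digits)] + [0x90 + d for d in digits])


print(utf4_escape(0x10FFFF).hex().upper())  # 8691909F9F9F9F
```

Every octet in the result is in the 0x80 to 0x9F range, so the
hex digits can be read off directly in an editor with a hex view.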

> you could make each sequence 1 byte shorter by marking the
> lead or trail byte specially

Yes, but then only 2 octets (0x80 and 0x81) would never occur
(instead of 11), and a lost 9x byte wouldn't cause an error.
UTF-8 now has 13 "impossible" octets, and similar features.

> "Terminal jockeys" such as Frank da Cruz (inventor of the
> Kermit protocol) would argue that C1 controls are also part
> of "Latin-1" and hence this format is still not
> Latin-1-friendly (a complaint that also used to be brought
> against UTF-8).

Yes, you can only use UTF-8 or UTF-4 with legacy applications
for windows-1252, not with applications needing the C1 control
codes as is.  He could use UTF-1 (it protects C0, SP, DEL, C1)
or UTF-7.  For my local purposes UTF-4 is fine, because my text
editor supports hex.  I'm less good at doing modulo 64 for UTF-8
in my head, so I need extra macros to decode / encode it.
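
The "modulo 64" step can be sketched in Python as follows.  This
is an illustration, not one of the editor macros mentioned above;
the name utf8_by_hand is hypothetical.  Each trailing octet of
UTF-8 carries 6 bits, so the code point is divided by 64
repeatedly and the remainders become the trailing octets.

```python
def utf8_by_hand(code_point: int) -> bytes:
    """Encode one code point as UTF-8 using repeated division by 64."""
    if not 0 <= code_point <= 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("not a Unicode scalar value")
    if code_point < 0x80:
        return bytes([code_point])      # ASCII is stored as is
    if code_point < 0x800:
        lead, trail_count = 0xC0, 1
    elif code_point < 0x10000:
        lead, trail_count = 0xE0, 2
    else:
        lead, trail_count = 0xF0, 3
    trail = []
    for _ in range(trail_count):
        # The remainder modulo 64 is the payload of one trailing octet.
        code_point, low6 = divmod(code_point, 64)
        trail.append(0x80 + low6)
    # What is left of the code point goes into the lead octet.
    return bytes([lead + code_point] + trail[::-1])


print(utf8_by_hand(0x10FFFF).hex().upper())  # F48FBFBF
```

Compare this with hex-digit splitting: a division by 16 is visible
in the hex notation itself, a division by 64 is not.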

>>  we can't use this for the registry, and we also won't try
>> BOCU-1.  But maybe IANA should offer a gzip-ped version.

> BOCU-1 text tends to include a lot of bytes in the C1 range
> (0x80 through 0x9F) and might not travel through e-mail very
> well.

It's as good or bad as UTF-8 (or in theory UTF-4) for that
purpose: with 8BITMIME or news you need no CTE, otherwise
you need B64 or QP.  If e-mail has a problem with 8 bits, the
problem isn't limited to C1; it could be anything with the high
bit set, likely a parity bit issue.
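
To illustrate the B64 / QP point, here is a small sketch using
Python's standard base64 and quopri modules.  The sample text is
an arbitrary choice for this example; any 8-bit payload (UTF-8,
BOCU-1, or other) is handled the same way by both encodings.

```python
import base64
import quopri

# Arbitrary sample: UTF-8 text containing octets above 0x7F.
payload = "Grüße".encode("utf-8")

b64 = base64.b64encode(payload)     # Content-Transfer-Encoding: base64
qp = quopri.encodestring(payload)   # Content-Transfer-Encoding: quoted-printable

for label, encoded in (("B64", b64), ("QP", qp)):
    # After encoding, every octet fits in 7 bits.
    seven_bit = all(octet < 0x80 for octet in encoded)
    print(label, encoded.decode("ascii"), "7-bit clean:", seven_bit)
```

Both forms survive a 7-bit path, at the cost of extra size, and
both decode back to the original octets.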

> I still haven't received a clear answer about how patent-
> encumbered BOCU-1 might be.

IBM's statement in UTS #40 is "royalty-free".  I didn't ask
them for a US license.  In the EU and some other parts of the
world it's AFAIK (and IANAL) a complete waste of time and money
to patent algorithms.  Patenting modulo 243 arithmetic is an
odd idea.  For UTF-4 (= modulo 16 with 64 lines of CharMapML) it
would be ridiculous (but one of these 64 lines is a copyright
notice, just in case).

> I like the gzip idea.

lstreg6.txt        82218 (2006-08-04)
lstreg6.txt.gz     11788
lstreg6.xml       104627 (see my reply to Debbie)
lstreg6.xml.gz     12141

That matches your observations in UniCompress.  For the 4646bis
registry a gzip-ped version makes sense for some folks (for me
it's less relevant; the V.90 bottleneck has its own compression).
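
The ratios behind those numbers, as a short Python sketch.  The
sizes are the ones quoted above; gzip_ratio is a hypothetical
helper for measuring other files, and its result depends on the
gzip implementation and compression level used.

```python
import gzip


def gzip_ratio(data: bytes) -> float:
    """Return compressed size divided by original size."""
    return len(gzip.compress(data)) / len(data)


# (plain size, gzip-ped size) in bytes, as quoted in this message.
sizes = {
    "lstreg6.txt": (82218, 11788),
    "lstreg6.xml": (104627, 12141),
}
for name, (plain, packed) in sizes.items():
    print(f"{name}: {packed / plain:.1%} of original")
# lstreg6.txt: 14.3% of original
# lstreg6.xml: 11.6% of original
```

The XML form is larger uncompressed, but its repetitive markup
compresses well, so the two .gz files end up close in size.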

Frank


