[Ltru] [OT] Re: UTF-8

"Doug Ewell" <dewell@adelphia.net> Fri, 15 September 2006 06:37 UTC

Received: from [127.0.0.1] (helo=stiedprmman1.va.neustar.com) by megatron.ietf.org with esmtp (Exim 4.43) id 1GO7Kc-0007VZ-MH; Fri, 15 Sep 2006 02:37:46 -0400
Received: from [10.91.34.44] (helo=ietf-mx.ietf.org) by megatron.ietf.org with esmtp (Exim 4.43) id 1GO7Kb-0007VU-HM for ltru@ietf.org; Fri, 15 Sep 2006 02:37:45 -0400
Received: from mta13.adelphia.net ([68.168.78.44]) by ietf-mx.ietf.org with esmtp (Exim 4.43) id 1GO7Ka-00018K-8E for ltru@ietf.org; Fri, 15 Sep 2006 02:37:45 -0400
Received: from DGBP7M81 ([68.67.66.131]) by mta10.adelphia.net (InterMail vM.6.01.05.02 201-2131-123-102-20050715) with SMTP id <20060915063232.UCHE27224.mta10.adelphia.net@DGBP7M81>; Fri, 15 Sep 2006 02:32:32 -0400
Message-ID: <007501c6d890$b96ff410$6401a8c0@DGBP7M81>
From: Doug Ewell <dewell@adelphia.net>
To: LTRU Working Group <ltru@ietf.org>
References: <E1GNzAK-0005WV-Uf@megatron.ietf.org>
Date: Thu, 14 Sep 2006 23:32:31 -0700
MIME-Version: 1.0
Content-Type: text/plain; format="flowed"; charset="utf-8"; reply-type="original"
Content-Transfer-Encoding: 7bit
X-Priority: 3
X-MSMail-Priority: Normal
X-Mailer: Microsoft Outlook Express 6.00.2900.2869
X-MimeOLE: Produced By Microsoft MimeOLE V6.00.2900.2962
X-Spam-Score: 0.0 (/)
X-Scan-Signature: 9ed51c9d1356100bce94f1ae4ec616a9
Cc: Frank Ellermann <nobody@xyzzy.claranet.de>
Subject: [Ltru] [OT] Re: UTF-8
X-BeenThere: ltru@ietf.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Language Tag Registry Update working group discussion list <ltru.ietf.org>
List-Unsubscribe: <https://www1.ietf.org/mailman/listinfo/ltru>, <mailto:ltru-request@ietf.org?subject=unsubscribe>
List-Archive: <http://www1.ietf.org/pipermail/ltru>
List-Post: <mailto:ltru@ietf.org>
List-Help: <mailto:ltru-request@ietf.org?subject=help>
List-Subscribe: <https://www1.ietf.org/mailman/listinfo/ltru>, <mailto:ltru-request@ietf.org?subject=subscribe>
Errors-To: ltru-bounces@ietf.org

Frank Ellermann <nobody at xyzzy dot claranet dot de> wrote:

>> up to 11: [ & # x 1 0 F F F F ; ], although in practice nothing more 
>> than 10 will ever be required.
>
> My "7" was for a Latin-1 friendly UTF-4, not escape sequences:
>
> Hex. 86 91 90 9F 9F 9F 9F (lead byte + 6 hex. digits), in that form a 
> legacy text viewer or editor could display 191 visible Latin-1 
> characters as is, instead of "only" 95 ASCII for UTF-8.

First, you said UTF-8, not this.  Second, highly non-standard, but a fun 
thought experiment of the kind I used to indulge in.

I'm not sure why you chose 0x86 as your sequence introducer (U+0086 
START OF SELECTED AREA, or in Windows-1252, U+2020 DAGGER).  If all C1 
control bytes are available, you could make each sequence 1 byte shorter 
by marking the lead or trail byte specially:

81 90 9F 9F 9F 9F   <- lead byte 8x, all others 9x
81 80 8F 8F 8F 9F   <- trail byte 9x, all others 8x
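
For anyone who wants to play with the two layouts above, here is a rough 
Python sketch.  It assumes one hex digit per byte and always six digits 
per code point, which is only my reading of the U+10FFFF example; the 
function names are made up for illustration and none of this is a 
defined encoding.

```python
def _hex_digits(cp: int) -> list[int]:
    """Split a code point into six hex digits (assumed fixed width)."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("code point out of range")
    return [int(d, 16) for d in f"{cp:06X}"]


def encode_lead_marked(cp: int) -> bytes:
    """Lead byte is 0x8X, every following byte is 0x9X."""
    digits = _hex_digits(cp)
    return bytes([0x80 | digits[0]] + [0x90 | d for d in digits[1:]])


def encode_trail_marked(cp: int) -> bytes:
    """Trail byte is 0x9X, every preceding byte is 0x8X."""
    digits = _hex_digits(cp)
    return bytes([0x80 | d for d in digits[:-1]] + [0x90 | digits[-1]])


def decode(seq: bytes) -> int:
    """Rebuild the code point from the low nibble of each byte.

    Works for either layout, since only the high nibble differs.
    """
    cp = 0
    for b in seq:
        cp = (cp << 4) | (b & 0x0F)
    return cp


if __name__ == "__main__":
    print(encode_lead_marked(0x10FFFF).hex(" ").upper())
    print(encode_trail_marked(0x10FFFF).hex(" ").upper())
```

Either way every byte of a sequence falls in 0x80 through 0x9F, so the 
marked byte is what lets a reader resynchronize in the middle of a 
stream.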

"Terminal jockeys" such as Frank da Cruz (inventor of the Kermit 
protocol) would argue that C1 controls are also part of "Latin-1" and 
hence this format is still not Latin-1-friendly (a complaint that also 
used to be brought against UTF-8).

> Of course we can't use this for the registry, and we also won't try 
> BOCU-1.  But maybe IANA should offer a gzip-ped version.

BOCU-1 text tends to include a lot of bytes in the C1 range (0x80 
through 0x9F) and might not travel through e-mail very well.  And I 
still haven't received a clear answer about how patent-encumbered BOCU-1 
might be.

I like the gzip idea.

--
Doug Ewell
Fullerton, California, USA
http://users.adelphia.net/~dewell/
RFC 4645  *  UTN #14

