[Ltru] Re: [OT] Re: UTF-8

Doug Ewell wrote:

> I'm not sure why you chose 0x86 as your sequence introducer

The 6 in 0x86 is the number of trailing octets: 91909F9F9F9F
(for John's example U+10FFFF).
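
For readers who want to see the layout, here is a rough sketch of
the encoding as it can be read off this thread.  Only the U+10FFFF
example and the 8n / 9x octet roles come from the discussion; the
single-octet pass-through for ASCII and U+00A0..U+00FF is my
assumption about what "Latin-1-friendly" means here, and the name
utf4_encode_sketch is made up for illustration.

```python
def utf4_encode_sketch(cp: int) -> bytes:
    """Encode one code point; an illustrative sketch, not a spec."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("code point out of range")
    # ASSUMPTION: ASCII and U+00A0..U+00FF travel as single octets.
    if cp < 0x80 or 0xA0 <= cp <= 0xFF:
        return bytes([cp])
    digits = format(cp, "X")        # hex digits, most significant first
    lead = 0x80 + len(digits)       # 0x82..0x86: count of trailing octets
    trail = [0x90 + int(d, 16) for d in digits]   # one 9x octet per digit
    return bytes([lead] + trail)


print(utf4_encode_sketch(0x10FFFF).hex().upper())
```

Under these assumptions only 82..86 are used as introducers, so
80, 81 and 87..8F (11 octets) never occur.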

> you could make each sequence 1 byte shorter by marking the
> lead or trail byte specially

Yes, but then only 2 octets (80 and 81) would never occur
(instead of 11), and a lost 9x byte wouldn't cause an error.
UTF-8 now has 13 "impossible" octets (C0, C1, F5..FF), and
similar features.
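
The count of 13 for UTF-8 can be checked by brute force.  This is
only a quick sketch: it encodes every Unicode scalar value with
Python's built-in codec and lists the octets that never show up.
The function name is made up for illustration.

```python
def octets_never_in_utf8() -> list[int]:
    """Return the octet values that no valid UTF-8 sequence contains."""
    seen = set()
    for cp in range(0x110000):
        if 0xD800 <= cp <= 0xDFFF:
            continue                # surrogates are not encodable
        seen.update(chr(cp).encode("utf-8"))
    return sorted(set(range(256)) - seen)


missing = octets_never_in_utf8()
print(len(missing), " ".join(f"{b:02X}" for b in missing))
```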

> "Terminal jockeys" such as Frank da Cruz (inventor of the
> Kermit protocol) would argue that C1 controls are also part
> of "Latin-1" and hence this format is still not
> Latin-1-friendly (a complaint that also used to be brought
> against UTF-8).

Yes, you can only use UTF-8 or UTF-4 with legacy applications
for windows-1252, not with applications needing the C1 control
codes as is.  He could use UTF-1 (it protects C0, SP, DEL, C1)
or UTF-7.  For my local purposes UTF-4 is fine; my text editor
supports hex.  I'm not as good at doing modulo 64 for UTF-8 in
my head, so I need extra macros to decode / encode it.
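
To illustrate the modulo 64 part: each UTF-8 trail octet carries
six bits, so encoding by hand means repeated division by 64.  A
minimal sketch follows (the function name is hypothetical, and
this is not one of the macros mentioned above):

```python
def utf8_encode_by_hand(cp: int) -> bytes:
    """Encode one code point to UTF-8 using divmod by 64."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not a Unicode scalar value")
    if cp < 0x80:
        return bytes([cp])          # ASCII is a single octet
    n_trail = 1 if cp < 0x800 else 2 if cp < 0x10000 else 3
    trail = []
    rest = cp
    for _ in range(n_trail):
        rest, low6 = divmod(rest, 64)   # peel off the low six bits
        trail.append(0x80 | low6)       # trail octets are 80..BF
    lead_marker = {1: 0xC0, 2: 0xE0, 3: 0xF0}[n_trail]
    return bytes([lead_marker | rest] + trail[::-1])


print(utf8_encode_by_hand(0x20AC).hex().upper())
```

Doing the same with hex digits instead is simply division by 16,
which a hex editor shows directly.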

>>  we can't use this for the registry, and we also won't try
>> BOCU-1.  But maybe IANA should offer a gzip-ped version.

> BOCU-1 text tends to include a lot of bytes in the C1 range
> (0x80 through 0x9F) and might not travel through e-mail very
> well.

It's as good or bad as UTF-8 (or in theory UTF-4) for that
purpose: with 8BITMIME or news you need no CTE, otherwise
you need B64 or QP.  If e-mail has a problem with 8 bits, the
problem isn't limited to C1; it could be anything, most likely
a parity bit.
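
As a small illustration of the two transfer encodings, here is a
sketch using Python's standard base64 and quopri modules on some
UTF-8 octets.  The sample string is arbitrary; the point is only
that both results are pure 7-bit and decode back to the original.

```python
import base64
import quopri

# Arbitrary sample text with non-ASCII characters, as UTF-8 octets.
raw = "Gr\u00fc\u00dfe, \u20ac 5".encode("utf-8")

b64 = base64.b64encode(raw)      # Base64 ("B64")
qp = quopri.encodestring(raw)    # quoted-printable ("QP")

print(b64.decode("ascii"))
print(qp.decode("ascii"))
```

Any 8-bit encoding (UTF-8, BOCU-1, or the UTF-4 idea) would be
wrapped the same way when the transport is not 8-bit clean.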

> I still haven't received a clear answer about how patent-
> encumbered BOCU-1 might be.

IBM's statement in UTS #40 is "royalty-free".  I didn't ask
them for a US license.  In the EU and some other parts of the
world it's AFAIK (and IANAL) a complete waste of time and money
to patent algorithms.  Patenting modulo 243 arithmetic is an
odd idea.  For UTF-4 (= modulo 16 with 64 lines of CharMapML)
it would be ridiculous (but one of these 64 lines is a
copyright notice, just in case).

> I like the gzip idea.

lstreg6.txt        82218 (2006-08-04)
lstreg6.txt.gz     11788
lstreg6.xml       104627 (see my reply to Debbie)
lstreg6.xml.gz     12141

That matches your observations in UniCompress.  For the 4646bis
registry it makes sense for some folks (for me it's less
relevant, the V.90 bottleneck has its own compression).
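
For anybody who wants to repeat the measurement, a minimal sketch
with Python's gzip module.  The sizes are the ones listed above;
the helper names are made up, and gzip_size would have to be fed
the actual file contents, which this snippet does not include.

```python
import gzip


def gzip_size(data: bytes) -> int:
    """Size in octets of data after gzip compression."""
    return len(gzip.compress(data))


def ratio(plain: int, packed: int) -> float:
    """Compressed size as a fraction of the original size."""
    return packed / plain


# Sizes quoted in this message: (plain, gzip-ped)
sizes = {"lstreg6.txt": (82218, 11788), "lstreg6.xml": (104627, 12141)}
for name, (plain, packed) in sizes.items():
    print(f"{name}: {ratio(plain, packed):.1%} of the original size")
```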

Frank

_______________________________________________
Ltru mailing list
Ltru@ietf.org
https://www1.ietf.org/mailman/listinfo/ltru