Re: Proposals for 10646/Unicode in MIME

Masataka Ohta <mohta@necom830.cc.titech.ac.jp> Tue, 21 December 1993 15:00 UTC

Received: from ietf.nri.reston.va.us by IETF.CNRI.Reston.VA.US id aa02466; 21 Dec 93 10:00 EST
Received: from CNRI.RESTON.VA.US by IETF.CNRI.Reston.VA.US id aa02455; 21 Dec 93 10:00 EST
Received: from ietf.cnri.reston.va.us by CNRI.Reston.VA.US id ac03871; 21 Dec 93 9:59 EST
Received: from dimacs.rutgers.edu by IETF.CNRI.Reston.VA.US id aa00729; 21 Dec 93 6:03 EST
Received: by dimacs.rutgers.edu (5.59/SMI4.0/RU1.5/3.08) id AA14388; Tue, 21 Dec 93 01:43:27 EST
Received: from necom830.cc.titech.ac.jp by dimacs.rutgers.edu (5.59/SMI4.0/RU1.5/3.08) id AA14384; Tue, 21 Dec 93 01:43:19 EST
Received: by necom830.cc.titech.ac.jp (5.65+/necom-mx-rg); Tue, 21 Dec 93 15:34:23 +0900
Sender: ietf-archive-request@IETF.CNRI.Reston.VA.US
From: Masataka Ohta <mohta@necom830.cc.titech.ac.jp>
Return-Path: <mohta@necom830.cc.titech.ac.jp>
Message-Id: <9312210634.AA00144@necom830.cc.titech.ac.jp>
Subject: Re: Proposals for 10646/Unicode in MIME
To: rhys@cs.uq.oz.au, ietf-charsets@innosoft.com, apccirn-i18n@nic.nm.kr
Date: Tue, 21 Dec 1993 15:34:21 -0000
Cc: dcrocker@mordor.stanford.edu, David_Goldsmith@taligent.com, ietf-822@dimacs.rutgers.edu, unicored@unicode.org
Reply-To: ietf-charsets@innosoft.com, unicored@unicode.org, apccirn-i18n@nic.nm.kr
In-Reply-To: <9312202223.AA28439@client>; from "rhys@cs.uq.oz.au" at Dec 21, 93 8:23 am
X-Mailer: ELM [version 2.3 PL11]

Note: As the issue is on text encoding in general, Reply-To: is not
directed to ietf-822.

> I note here that Masataka's proposal for ISO-2022-JP-2 demonstrates what
> we've been arguing all along: it is not enough to just have a character
> encoding.

Recently I avoid using the word "character" as much as possible and
use the phrase "text encoding" instead, because the concept of "character"
beyond ASCII cannot be well defined. Various units of text encoding
are necessary for different purposes.

Thus, I think names such as MIME charset and the ietf-charsets ML
are no good.

> There also needs to be some form of markup to distinguish
> different usages of the same character encoding.  ISO-2022-JP-2 uses
> escape sequences to do markup, whereas a UNICODE version of text/enriched
> would use <...> tags.

ISO-2022-JP-2 does not do any markup. It is for plain text.

It is finite state. It has no nesting.

I don't think anything with nested structure is plain text.

It is and its successors will be as stateless as practically possible
with ISO 2022.

That is, at the beginning of a line, the state can be assumed to be unique.
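
The per-line property described above can be sketched as follows. This is
only an illustration, assuming Python's built-in "iso2022_jp_2" codec as a
stand-in for the proposal; the sample string is hypothetical. It checks that
the encoded form is 7 bit and that each line decodes correctly on its own,
without any state carried over from the previous line.

```python
# Sketch: every encoded line starts in the same (ASCII) state, so a line
# can be decoded without seeing the lines before it.
# Assumption: Python's "iso2022_jp_2" codec stands in for ISO-2022-JP-2.

# Hypothetical sample: Japanese, plain ASCII, then Korean mixed with Japanese.
text = (
    "\u65e5\u672c\u8a9e text\n"                       # "Nihongo" + ASCII
    "plain ASCII\n"
    "\ud55c\uad6d\uc5b4 and \u65e5\u672c\u8a9e\n"    # "Hangugeo" + "Nihongo"
)

encoded = text.encode("iso2022_jp_2")

# The encoding uses 7-bit bytes only.
seven_bit = all(b < 0x80 for b in encoded)

# Decode each line independently, with a fresh decoder state per line.
decoded_lines = [line.decode("iso2022_jp_2") for line in encoded.split(b"\n")]
round_trip = "\n".join(decoded_lines)

print(seven_bit, round_trip == text)
```

Splitting on the newline byte is safe here because the bytes of a
multi-byte character in this encoding are all printable 7-bit values.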

> The main difference I can see is that ISO-2022-JP-2
> requires the use of markup, even when the whole message is in the same
> language, but UNICODE can get away without markup for 99% of messages,

It is a meaningless difference.

Whether it is 1% or 100%, you need the same amount of codings, fonts,
settings of config.sys and such, anyway.

> letting local conventions set the default language.

That is one very important difference.

Unlike UNICODE, ISO-2022-JP-2 is intended to be used in an internationalized
environment. It needs no local conventions. BTW, MIME charsets also cannot
depend on local conventions.

> I still fail to see why Masataka objects to UNICODE since his own proposal has
> to jump through the same markup hoops. The only advantage of ISO-2022-JP-2
> that I can see is that it will work on existing terminals without special
> software in some communities.

Then, you can see nothing.

ISO-2022-JP-2 is the product of long and extensive
localization/internationalization experience in the Japanese computing
community with ISO-2022-JP, EUC, SJIS and such.

First of all, ISO-2022-JP-2 can interoperate with ASCII.

Next, it is 7 bit.

Thus, it can interoperate with any ASCII compatible text encoding such
as EUC (both UJIS and EUC-KR) and SJIS.

More importantly, it can interoperate with the future ultimate ASCII
compatible 8 bit encoding. Of course, UNICODE is NOT the future.

We do know that having two or more uninteroperable encodings such
as EUC and SJIS or ASCII and 16bit-UNICODE is a real pain.
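
The ASCII interoperability argument above can be checked with a small
sketch. It assumes Python's built-in codecs as stand-ins for the encodings
named ("utf_16_be" stands in for 16bit-UNICODE), and the sample string is
hypothetical. Pure ASCII text is byte-for-byte identical under the ASCII
compatible encodings, but not under the 16 bit one.

```python
# Sketch: ASCII text keeps its bytes under ASCII compatible encodings,
# but not under a 16 bit encoding.
# Assumption: Python codec names stand in for the encodings in the text.

sample = "Hello, plain text."
ascii_bytes = sample.encode("ascii")

ascii_compatible = ["iso2022_jp_2", "euc_jp", "euc_kr", "shift_jis"]

# For each encoding, does the ASCII sample encode to the same bytes?
same = {name: sample.encode(name) == ascii_bytes for name in ascii_compatible}

# 16 bit form: every character becomes two bytes.
wide = sample.encode("utf_16_be")

print(same)
print(wide == ascii_bytes, len(ascii_bytes), len(wide))
```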

> A specious argument at best, since the rest
> of the world does need special software to view ISO-2022-JP-2 anyway.

ISO-2022-JP-2 is, and ISO-2022-INT-1 will be, designed to aid those
who immediately need localization.

I don't think it will be a long term solution.

Both ISO 2022 and ISO 10646/UNICODE have a unified syntax for mixing
the multilingual characters of the world. ISO 2022 is much better for
us because it lets us separate C/J/K characters.

On the other hand, both ISO 2022 and ISO 10646/UNICODE lack a unified
semantics for mixing the multilingual characters of the world. ISO 10646/UNICODE
inherits the policy of ISO 2022 to treat characters in different languages
differently. Thus, it is impossible to write a unified text processing
library or application of meaningfully rich functionality.

Thus, for the time being, our solution must be 7 bit ISO 2022.

As a long term solution, I have designed ICODE/IUTF, which has, besides
ASCII compatibility, several useful semantic properties for, as far
as I know, all the characters in the world. With a large enough encoding
space (though not impractically large), real, semantic unification
is possible.

> UNICODE has the advantage that if a message gets corrupted and the markup
> is lost, there is still a reasonable character that can be displayed, which
> is close enough not to cause the sky to fall in on the reader.  Such corruption
> could easily happen when a message is quoted.  What happens with ISO-2022-JP-2?

Misquoting is an issue which MUST be solved by fixing faulty MTAs and
other faulty transports. Providing workarounds will only delay the real
solution.

Instead, the real state corruption problem arises in an interactive
environment where individual programs output their own text streams
simultaneously.

With ISO-2022-JP-2, unlike text/enriched, the state is reset at the
beginning of the next line.

> People have tried time and again to add markup to UNICODE to satisfy Masataka
> (e.g. language tags), but it just doesn't seem to satisfy him. *sigh*

Strange.

I have had *ABSOLUTELY* *NO* interest in text/enriched from the beginning.

I and most of the people in the world want to process our natural
languages as plain text in an internationalized environment.

We already have a lot of experience using our languages as plain text.

You can't force us to give up plain text.

							Masataka Ohta

PS

For more information on ICODE, on why ISO 10646/UNICODE is no good, and
on how it can be improved, see:

	"Character Encoding Method for Internationalized Plain
	Text Processing", Proceedings of 8th International Joint
	Workshop on Computer Communications, Masataka OHTA,
	Dec. 1993.

An electronic copy is available from me.