RE: Unicode progress

John C Klensin <KLENSIN@infoods.unu.edu> Mon, 25 October 1993 09:53 UTC

Received: from ietf.nri.reston.va.us by IETF.CNRI.Reston.VA.US id aa25935; 25 Oct 93 5:53 EDT
Received: from CNRI.RESTON.VA.US by IETF.CNRI.Reston.VA.US id aa25927; 25 Oct 93 5:53 EDT
Received: from ucdavis.ucdavis.edu by CNRI.Reston.VA.US id aa25659; 25 Oct 93 5:53 EDT
Received: by ucdavis.ucdavis.edu (4.1/UCD2.05) id AA03030; Mon, 25 Oct 93 02:30:04 PDT
X-Orig-Sender: ietf-wnils-request@ucdavis.edu
Received: from INFOODS.MIT.EDU by ucdavis.ucdavis.edu (4.1/UCD2.05) id AA02897; Mon, 25 Oct 93 02:24:05 PDT
Received: from INFOODS.UNU.EDU by INFOODS.UNU.EDU (PMDF V4.2-13 #2603) id <01H4HX7BG52O000OYJ@INFOODS.UNU.EDU>; Mon, 25 Oct 1993 05:25:14 EDT
Date: Mon, 25 Oct 1993 05:25:13 -0400
Sender: ietf-archive-request@IETF.CNRI.Reston.VA.US
From: John C Klensin <KLENSIN@infoods.unu.edu>
Subject: RE: Unicode progress
In-Reply-To: <9310231840.AA12431@blacks.jpl.nasa.gov>
To: dank@blacks.jpl.nasa.gov
Cc: mohta@necom830.cc.titech.ac.jp, ietf-wnils@ucdavis.edu, ietf-charsets@innosoft.com
Message-Id: <751541113.180475.KLENSIN@INFOODS.UNU.EDU>
X-Envelope-To: ietf-wnils@UCDAVIS.EDU
Mime-Version: 1.0
Content-Type: TEXT/PLAIN; CHARSET="US-ASCII"
Content-Transfer-Encoding: 7bit
Mail-System-Version: <MultiNet-MM(330)+TOPSLIB(156)+PMDF(4.2)@INFOODS.UNU.EDU>

>It seems to me that English and Greek characters need separate code points
>because their visual appearance is significantly different, not because
>they are from different languages.

Actually, Dan, a lot of other issues aside, you have hit on one of the
critical issues here.   Ohta-san has responded on this, but let me try a
bit of a generalization.

There are two issues that might usefully be thought of as separate:

(1) "visual appearance is significantly different" is largely in the eye
of the beholder.  Is the Latin lower-case "a" the same as, or
"significantly different" from, Greek lower-case alpha?  Be careful about
the answer, because it may be different in different fonts, and
typography is not supposed to be an issue here.

(2) To the degree that there are *any* letter-symbols that we can agree
are not "significantly different" in Greek and Latin character sets
(let's stick with alpha and look at its upper-case form as Ohta-san
did), one can then, starting from a traditionally ASCII-based world,
choose between "ASCII characters with Greek supplement" and
"separate contiguous code points for basic Latin and Greek characters".
The former creates a smaller number of total codes because, e.g., Greek
upper-case alpha does not get assigned a code point separate from Latin 
upper-case A.  The latter preserves some collating integrity, some
useful relationships between, e.g. upper case and lower case character
sets, and maybe has some cultural merit (which moves dangerously close
to "because they are different languages").  But the latter yields much
larger total character sets, because similar symbols are assigned to
separate code points under some set of rules.
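
The trade-off in (2) can be sketched roughly in Python.  The UNIFY
table below is hypothetical and covers only three letters; it is not
taken from any standard.  The code point values are the ones Unicode
assigns to these Greek capitals.

```python
# Hypothetical "unified" table: Greek capitals that resemble Latin ones
# are folded onto the Latin code point (illustrative subset only).
UNIFY = {"\u0391": "A", "\u0392": "B", "\u0395": "E"}

def unify(text):
    """Map each character through the hypothetical unified table."""
    return "".join(UNIFY.get(ch, ch) for ch in text)

# Alpha, Beta, Gamma, Delta, Epsilon, in Greek alphabetical order.
greek = "\u0391\u0392\u0393\u0394\u0395"
sample = greek + "ABE"

# Sort by raw code value under each scheme.
separate_sorted = sorted(greek)
unified_sorted = sorted(unify(greek))

# Count the distinct codes each scheme needs for the same sample.
separate_codes = len(set(sample))
unified_codes = len(set(unify(sample)))

print([f"U+{ord(c):04X}" for c in separate_sorted])
print([f"U+{ord(c):04X}" for c in unified_sorted])
print(separate_codes, unified_codes)
```

With separate code points, sorting by code value keeps Greek
alphabetical order.  The unified scheme needs fewer distinct codes for
the same sample, but Epsilon now sorts ahead of Gamma.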

The "ASCII with supplemental Greek characters" approach is known in the
character set community as "unification".   One of the several
objections to IS 10646 and UNICODE in the Asian character set community
is that North American and European-dominated committees and design
teams were a lot more willing to "unify" characters deriving from
Chinese ("Han") characters than they were to unify characters deriving
from, e.g., Greek or North Semitic.

A few observations on your summary...

The ISO Universal Character Set (sic) standard is 10646, not 16046.

There is no UCS-3, only UCS-2 (16 bit, equivalent to UNICODE in code
points, but possibly with slightly different semantics and conformance
rules) and UCS-4 (32 bit).
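
As a minimal illustration of the two fixed-width forms, the sketch
below packs one code point into 16 and 32 bits.  The function names and
the big-endian byte order are assumptions made for the example, not
something stated in this message.

```python
import struct

def ucs2(code_point):
    """Pack one code point into a single 16-bit unit (big-endian)."""
    if not 0 <= code_point <= 0xFFFF:
        raise ValueError("code point does not fit in 16 bits")
    return struct.pack(">H", code_point)

def ucs4(code_point):
    """Pack one code point into a single 32-bit unit (big-endian)."""
    if not 0 <= code_point <= 0x7FFFFFFF:
        raise ValueError("code point does not fit in 31 bits")
    return struct.pack(">I", code_point)

alpha = 0x0391  # Greek upper-case alpha
print(ucs2(alpha).hex())  # 0391
print(ucs4(alpha).hex())  # 00000391
```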


There is actually a body of objections to UTF-2, based on:

(1) For email purposes, and other situations with 7-bit constraints,
UTF-2, by using an 8-bit form, requires double encoding.  There are
direct encodings of 16 or 32 bits to 7 bits that save time and maybe
space.

(2) The variable-length nature of UTF-2 is optimal for ASCII and code
points "low" in the 10646 sequence.  It is pretty bad for the "upper
end" of the BMP (UNICODE, UCS-2), and could get really pathological if
the "high end" code positions of 10646 were used.  So, to a certain
extent, choosing it requires assuming that those higher code positions
will never be used, or that the communities that will use them are never
going to be important to the Internet.  A straight 32-bit coding,
possibly supplemented by conventional compression, does not have that
problem.
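
Both objections can be sketched in Python, under two assumptions.
First, that UTF-2 corresponds to the scheme later standardized as
UTF-8, in its original form of up to six bytes for 31-bit values; the
range limits below come from that form.  Second, that Python's "utf-7"
codec is a fair stand-in for a direct 16-bit-to-7-bit encoding; it is
only one such encoding and may not be the one meant here.

```python
import base64

def utf2_length(code_point):
    """Bytes needed for one code point under the six-byte scheme."""
    limits = [0x7F, 0x7FF, 0xFFFF, 0x1FFFFF, 0x3FFFFFF, 0x7FFFFFFF]
    for n, limit in enumerate(limits, start=1):
        if code_point <= limit:
            return n
    raise ValueError("code point does not fit in 31 bits")

# Objection (2): cost grows with the code position; 32-bit is always 4.
for cp in (0x41, 0x391, 0x4E2D, 0xFFFD, 0x10000, 0x7FFFFFFF):
    print(f"{cp:#010x}: variable={utf2_length(cp)} fixed32=4")

# Objection (1): an 8-bit form needs a second pass for 7-bit transport.
text = "\u4E2D" * 30                    # 30 copies of one Han character
eight_bit = text.encode("utf-8")        # first encoding, 8-bit bytes
double = base64.b64encode(eight_bit)    # second encoding, 7-bit safe
direct = text.encode("utf-7")           # single pass, already 7-bit safe

print(len(eight_bit), len(double), len(direct))
```

In this sketch the upper end of the BMP costs three bytes per character
against two for a 16-bit form, and the top of the 31-bit range costs
six against four.  For the Han sample, the doubly encoded form is
longer than the direct 7-bit one.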

--john