RE: Unicode progress

Borka Jerman-Blazic <jerman-blazic@ijs.si> Mon, 25 October 1993 12:42 UTC

Received: from ietf.nri.reston.va.us by IETF.CNRI.Reston.VA.US id aa27810; 25 Oct 93 8:42 EDT
Received: from CNRI.RESTON.VA.US by IETF.CNRI.Reston.VA.US id aa27802; 25 Oct 93 8:42 EDT
Received: from ucdavis.ucdavis.edu by CNRI.Reston.VA.US id aa01988; 25 Oct 93 8:42 EDT
Received: by ucdavis.ucdavis.edu (4.1/UCD2.05) id AA07233; Mon, 25 Oct 93 05:26:21 PDT
X-Orig-Sender: ietf-wnils-request@ucdavis.edu
Received: from kanin.arnes.si by ucdavis.ucdavis.edu (4.1/UCD2.05) id AA06453; Mon, 25 Oct 93 05:09:44 PDT
X400-Received: by mta kanin.arnes.si in /PRMD=ac/ADMD=mail/C=si/; Relayed; Mon, 25 Oct 1993 13:10:33 +0100
X400-Received: by /PRMD=ac/ADMD=mail/C=si/; Relayed; Mon, 25 Oct 1993 12:09:18 +0100
Date: Mon, 25 Oct 1993 12:09:18 +0100
X400-Originator: jerman-blazic@ijs.si
X400-Recipients: non-disclosure:;
X400-Mts-Identifier: [/PRMD=ac/ADMD=mail/C=si/;931025130918]
X400-Content-Type: P2-1984 (2)
Content-Identifier: 211
Conversion: Prohibited
Sender: ietf-archive-request@IETF.CNRI.Reston.VA.US
From: Borka Jerman-Blazic <jerman-blazic@ijs.si>
Message-Id: <211*/S=jerman-blazic/O=ijs/PRMD=ac/ADMD=mail/C=si/@MHS>
To: ietf-charsets <ietf-charsets@innosoft.com>
Cc: ietf-wnils <ietf-wnils@ucdavis.edu>
In-Reply-To: <9310231840.AA12431@blacks.jpl.nasa.go>
Subject: RE: Unicode progress

=====================================================================
>>It seems to me that English and Greek characters need separate code points
>>because their visual appearance is significantly different, not because
>>they are from different languages.

>Actually, Dan, a lot of other issues aside, you have hit on one of the
>critical issues here.   Ohta-san has responded on this, but let me try a
>bit of a generalization.

I would say that they have different code points because they belong to
different scripts! The same applies to Cyrillic. You cannot mix both
scripts in some text-related operations (I have in mind ordering),
because they have set elements with exactly the same "shape", i.e.
A, K, P, C etc., but different names (meaning a different interpretation
and a different pronunciation), because they belong to different
scripts. You can order all Latin characters from ISO 10646
in one collation string for many different languages (it is
difficult because the ordering rules
differ from language to language; some work on this has already been
done), but you cannot mix them with
Greek or Cyrillic. I am not an expert in ideograms, but I guess that the
problem they have with different "shapes" of the same ideogram (which
can somehow be related to the problem of glyphs and characters in
our Western understanding) is that they belong to the same script,
called ideographic. Of course, that does not mean that they have to
be coded as they are now in 10646, but somehow they belong together.
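
To make the point about same "shape", different script concrete, here is a
minimal sketch. It assumes Python and its standard unicodedata module, which
reflects today's Unicode database rather than any 1993 draft; the helper
names script_of and describe are only illustrative.

```python
import unicodedata

# Capital letters that look alike in many fonts but belong to different scripts.
LOOKALIKES = ["\u0041", "\u0391", "\u0410"]  # Latin A, Greek Alpha, Cyrillic A


def script_of(ch):
    """Rough script label: the first word of the character's Unicode name."""
    return unicodedata.name(ch).split()[0]


def describe(ch):
    """Code point and character name, e.g. 'U+0041 LATIN CAPITAL LETTER A'."""
    return "U+%04X %s" % (ord(ch), unicodedata.name(ch))


for ch in LOOKALIKES:
    print(describe(ch))

# A plain code-point sort never interleaves the scripts, because each
# script occupies its own range of code points.
mixed = ["\u0391", "B", "\u0410", "A", "\u0392"]
by_code_point = sorted(mixed)
print([script_of(c) for c in by_code_point])
```

This only shows that the three look-alikes are distinct characters with
distinct names; real collation within one script still needs
language-specific rules, as noted above.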

>There are two issues that might usefully be thought of as separate:

>(1) "visual appearance is significantly different" is largely in the eye
>of the beholder.  Is the Latin lower-case "a" the same, or
>"significantly different" from Greek lower-case alpha?  Be careful about
>the answer, because it may be different in different fonts, and
>typography is supposed to not be an issue here.

I agree completely. Glyphs and typography are not the issue here.
Characters in one coded character set are supposed to be unique, i.e.
one character is coded only once in one character set table.

>(2) To the degree that there are *any* letter-symbols that we can agree
>are not "significantly different" in Greek and Latin character sets
>(let's stick with alpha and look at its upper-case form as Ohta-san
>did), one then can make a choice between--starting from a traditionally
>ASCII-based world-- "ASCII characters with Greek supplement" and
>"separate contiguous code points for basic Latin and Greek characters".
>The former creates a smaller number of total codes because, e.g., Greek
>upper-case alpha does not get assigned a code point separate from Latin 
>upper-case A.  The latter preserves some collating integrity, some
>useful relationships between, e.g. upper case and lower case character
>sets, and maybe has some cultural merit (which moves dangerously close
>to "because they are different languages").  But the latter yields much
>larger total character sets, because similar symbols are assigned to
>separate code points under some set of rules.

That issue has been discussed for so many years that today it will be
difficult to change the general principle adopted by many bodies, i.e. sets
of characters are coded and the members of these sets are supposed to be
unique within the set itself. The problem could perhaps be better addressed
if we speak about scripts and not languages.
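
As a small illustration of the upper case / lower case relationships
mentioned in the quoted point (2), the sketch below checks the case pairs for
the look-alike letter A in three scripts. It assumes Python's built-in
str.lower() and today's Unicode code points; the names PAIRS and case_offset
are hypothetical and not taken from any standard.

```python
# Upper/lower pairs for the look-alike "A" in three scripts.
PAIRS = {
    "Latin": ("\u0041", "\u0061"),
    "Greek": ("\u0391", "\u03B1"),
    "Cyrillic": ("\u0410", "\u0430"),
}


def case_offset(upper, lower):
    """Distance in code points between a capital and its small letter."""
    return ord(lower) - ord(upper)


for script, (upper, lower) in PAIRS.items():
    # Each capital maps to exactly one small letter of its own script.
    assert upper.lower() == lower
    print(script, hex(case_offset(upper, lower)))
```

If the three capitals shared one code point, that single character would
need three different lower-case partners, which is one way of seeing why
separate code points per script keep such operations simple.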

>The "ASCII with supplemental Greek characters" approach is known in the
>character set community as "unification".   One of the several
>objections to IS 10646 and UNICODE in the Asian character set community
>is that North American and European-dominated committees and design
>teams were a lot more willing to "unify" characters deriving from
>Chinese ("Han") characters than they were to unify characters deriving
>from, e.g., Greek or North Semitic.

This issue has already been discussed. Why Chinese Han was chosen for
unification I do not know, but at the SC2 meeting in Rennes it was presented
as a consensus of the three national bodies, i.e. China, Japan and Korea.
However, I know that a new proposal will soon be discussed at the
Washington meeting of SC2 WG2 (next week) which will allow allocation of
additional blocks in some part of the BMP (i.e. use of the reserved
allocations for a sort of announcement and then invocation of the blocks
from the second plane). Please do not discuss this further, because it is
not official and I do not have the original document!

Borka

P.S. But we agreed on what the problems to be solved over the Internet are,
did we??