RE: Unicode progress
John C Klensin <KLENSIN@infoods.unu.edu> Mon, 25 October 1993 09:53 UTC
Received: from ietf.nri.reston.va.us by IETF.CNRI.Reston.VA.US id aa25935; 25 Oct 93 5:53 EDT
Received: from CNRI.RESTON.VA.US by IETF.CNRI.Reston.VA.US id aa25927; 25 Oct 93 5:53 EDT
Received: from ucdavis.ucdavis.edu by CNRI.Reston.VA.US id aa25659; 25 Oct 93 5:53 EDT
Received: by ucdavis.ucdavis.edu (4.1/UCD2.05) id AA03030; Mon, 25 Oct 93 02:30:04 PDT
X-Orig-Sender: ietf-wnils-request@ucdavis.edu
Received: from INFOODS.MIT.EDU by ucdavis.ucdavis.edu (4.1/UCD2.05) id AA02897; Mon, 25 Oct 93 02:24:05 PDT
Received: from INFOODS.UNU.EDU by INFOODS.UNU.EDU (PMDF V4.2-13 #2603) id <01H4HX7BG52O000OYJ@INFOODS.UNU.EDU>; Mon, 25 Oct 1993 05:25:14 EDT
Date: Mon, 25 Oct 1993 05:25:13 -0400
Sender: ietf-archive-request@IETF.CNRI.Reston.VA.US
From: John C Klensin <KLENSIN@infoods.unu.edu>
Subject: RE: Unicode progress
In-Reply-To: <9310231840.AA12431@blacks.jpl.nasa.gov>
To: dank@blacks.jpl.nasa.gov
Cc: mohta@necom830.cc.titech.ac.jp, ietf-wnils@ucdavis.edu, ietf-charsets@innosoft.com
Message-Id: <751541113.180475.KLENSIN@INFOODS.UNU.EDU>
X-Envelope-To: ietf-wnils@UCDAVIS.EDU
Mime-Version: 1.0
Content-Type: TEXT/PLAIN; CHARSET="US-ASCII"
Content-Transfer-Encoding: 7bit
Mail-System-Version: <MultiNet-MM(330)+TOPSLIB(156)+PMDF(4.2)@INFOODS.UNU.EDU>
>It seems to me that English and Greek characters need separate code points
>because their visual appearance is significantly different, not because
>they are from different languages.

Actually, Dan, a lot of other issues aside, you have hit on one of the critical issues here. Ohta-san has responded on this, but let me try a bit of a generalization. There are two issues that might usefully be thought of as separate:

(1) "visual appearance is significantly different" is largely in the eye of the beholder. Is the Latin lower-case "a" the same, or "significantly different" from Greek lower-case alpha? Be careful about the answer, because it may be different in different fonts, and typography is supposed to not be an issue here.

(2) To the degree that there are *any* letter-symbols that we can agree are not "significantly different" in Greek and Latin character sets (let's stick with alpha and look at its upper-case form as Ohta-san did), one then can make a choice between--starting from a traditionally ASCII-based world-- "ASCII characters with Greek supplement" and "separate contiguous code points for basic Latin and Greek characters". The former creates a smaller number of total codes because, e.g., Greek upper-case alpha does not get assigned a code point separate from Latin upper-case A. The latter preserves some collating integrity, some useful relationships between, e.g. upper case and lower case character sets, and maybe has some cultural merit (which moves dangerously close to "because they are different languages"). But the latter yields much larger total character sets, because similar symbols are assigned to separate code points under some set of rules.

The "ASCII with supplemental Greek characters" approach is known in the character set community as "unification".
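
To make the trade-off in (2) concrete, here is a rough Python sketch. It relies on the fact that UNICODE/10646 gives Latin "A" (U+0041) and Greek capital alpha (U+0391) separate code points. The table of "look-alike" Greek capitals and the helper name unified_lowercase_candidates are illustrative assumptions, not anything taken from a standard; as point (1) says, which letters count as look-alikes is itself debatable.

```python
import unicodedata

# The 26 basic Latin capitals and the 24 Greek capitals (U+03A2 is unassigned).
LATIN_UPPER = [chr(c) for c in range(0x41, 0x5B)]
GREEK_UPPER = [chr(c) for c in range(0x391, 0x3AA) if c != 0x3A2]

# ASSUMPTION: a hand-picked, illustrative set of Greek capitals that a
# "unifying" design might fold into Latin letters. Not from any standard.
LOOKALIKE_GREEK = {
    0x391: "A", 0x392: "B", 0x395: "E", 0x396: "Z", 0x397: "H",
    0x399: "I", 0x39A: "K", 0x39C: "M", 0x39D: "N", 0x39F: "O",
    0x3A1: "P", 0x3A4: "T", 0x3A5: "Y", 0x3A7: "X",
}

# Separate contiguous blocks cost more code points than a unified repertoire.
separate_total = len(LATIN_UPPER) + len(GREEK_UPPER)
unified_total = separate_total - len(LOOKALIKE_GREEK)

# With separate blocks, upper and lower case sit a fixed distance apart
# in both scripts, just as they do in ASCII.
CASE_OFFSET = 0x20
latin_pairs_ok = all(chr(ord(c) + CASE_OFFSET) == c.lower() for c in LATIN_UPPER)
greek_pairs_ok = all(
    unicodedata.name(chr(ord(c) + CASE_OFFSET)).startswith("GREEK SMALL LETTER")
    for c in GREEK_UPPER
)


def unified_lowercase_candidates(latin_capital):
    """Lower-case forms a unified capital could stand for (hypothetical scheme)."""
    result = [latin_capital.lower()]
    for greek_cp, latin in LOOKALIKE_GREEK.items():
        if latin == latin_capital:
            result.append(chr(greek_cp + CASE_OFFSET))
    return result


print(separate_total, unified_total)      # 50 36
print(unified_lowercase_candidates("A"))  # ['a', 'α']
```

Under these assumptions the unified repertoire saves 14 capitals, but a unified "A" no longer has a single lower-case partner, which is the loss of case relationships mentioned above.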
One of the several objections to IS 10646 and UNICODE in the Asian character set community is that North American and European-dominated committees and design teams were a lot more willing to "unify" characters deriving from Chinese ("Han") characters than they were to unify characters deriving from, e.g., Greek or North Semitic.

A few observations on your summary...

The ISO Universal Character Set (sic) standard is 10646, not 16046.

There is no UCS-3, only UCS-2 (16 bit, equivalent to UNICODE in code points, but possibly with slightly different semantics and conformance rules) and UCS-4 (32 bit).

There is actually a community of objections to UTF-2. They are based on:

(1) For email purposes, and other situations with 7-bit constraints, UTF-2, by using an 8-bit form, requires double encoding. There are direct encodings of 16 or 32 bits to 7 bits that save time and maybe space.

(2) The variable-length nature of UTF-2 is optimal for ASCII and code points "low" in the 10646 sequence. It is pretty bad for the "upper end" of the BMP (UNICODE, UCS-2), and could get really pathological if the "high end" code positions of 10646 were used. So, to a certain extent, choosing it requires assuming that those higher code positions will never be used, or that the communities that will use them are never going to be important to the Internet. A straight 32-bit coding, possibly supplemented by conventional compression, does not have that problem.

    --john
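
The two UTF-2 objections can be put into numbers with a small Python sketch. It assumes that "UTF-2" means the 1-to-6-byte FSS-UTF layout covering 31 bits (the form later standardized as UTF-8), and it uses base64 purely as a stand-in for whatever 7-bit-safe transport encoding is applied; the message does not name one. The function names utf2_length, utf2_encode, and report are hypothetical.

```python
import base64

# Exclusive upper bounds for the 1..6 byte forms of the 31-bit layout
# (ASSUMPTION: UTF-2 == FSS-UTF as later described for UTF-8).
_LIMITS = (0x80, 0x800, 0x10000, 0x200000, 0x4000000, 0x80000000)


def utf2_length(cp):
    """Number of bytes one code point occupies in this layout."""
    if cp < 0:
        raise ValueError("negative code point")
    for nbytes, limit in enumerate(_LIMITS, start=1):
        if cp < limit:
            return nbytes
    raise ValueError("code point needs more than 31 bits")


def utf2_encode(cp):
    """Encode a single code point; unlike Python's codec, allows up to 31 bits."""
    n = utf2_length(cp)
    if n == 1:
        return bytes([cp])
    # Trailing bytes carry 6 payload bits each, marked 10xxxxxx.
    tail = [0x80 | ((cp >> (6 * i)) & 0x3F) for i in range(n - 2, -1, -1)]
    # Lead byte: n one-bits, a zero bit, then the remaining high payload bits.
    lead = ((0xFF << (8 - n)) & 0xFF) | (cp >> (6 * (n - 1)))
    return bytes([lead] + tail)


def report(cp, count=300):
    """Sizes for `count` repetitions of `cp`, raw and after base64."""
    utf2 = utf2_encode(cp) * count
    ucs4 = cp.to_bytes(4, "big") * count
    sizes = {
        "utf2_bytes": len(utf2),
        "ucs4_bytes": len(ucs4),
        "utf2_base64": len(base64.b64encode(utf2)),
        "ucs4_base64": len(base64.b64encode(ucs4)),
    }
    if cp <= 0xFFFF:  # UCS-2 only covers the BMP
        ucs2 = cp.to_bytes(2, "big") * count
        sizes["ucs2_bytes"] = len(ucs2)
        sizes["ucs2_base64"] = len(base64.b64encode(ucs2))
    return sizes


for cp in (0x41, 0xFF21, 0x7FFFFFFF):
    print(hex(cp), utf2_length(cp), report(cp))
```

With these assumptions, 300 ASCII characters stay at 300 bytes and need no second encoding at all. 300 characters from the upper end of the BMP take 900 bytes in UTF-2 (1200 characters after base64) against 600 bytes in UCS-2 (800 after base64), and 300 characters at the top of the 31-bit space take 1800 bytes in UTF-2 against 1200 in a straight 32-bit coding.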