[idn] Re: Unicode categories
Erik van der Poel <erik@vanderpoel.org> Sat, 12 March 2005 09:21 UTC
Message-ID: <4232B2FD.1080104@vanderpoel.org>
Date: Sat, 12 Mar 2005 01:14:37 -0800
From: Erik van der Poel <erik@vanderpoel.org>
User-Agent: Mozilla Thunderbird 1.0 (X11/20041206)
X-Accept-Language: en-us, en
MIME-Version: 1.0
To: John C Klensin <klensin@jck.com>, idn@ops.ietf.org
CC: Kenneth Whistler <kenw@sybase.com>
Subject: [idn] Re: Unicode categories
References: <421B8484.3070802@vanderpoel.org> <20050223072837.GA21463~@nicemice.net> <D872CCF059514053ECF8A198@scan.jck.com> <421D8411.9030006@vanderpoel.org> <p06210208be4390618c81@[192.168.0.101]> <421E0D0C.2000309@vanderpoel.org> <p06210202be43c3888991@[192.168.0.101]> <E07CE813AD23B2D95DA0C740@scan.jck.com> <421E30F2.1040408@vanderpoel.org> <0E7F74C71945B923C52211F3@scan.jck.com> <421EA0C9.1010500@vanderpoel.org> <00a401c51af3$7863aae0$030aa8c0@DEWELL> <A574CA1BE87BFDA3C2A1AC0E@scan.jck.com> <42322CE2.4040509@vanderpoel.org>
In-Reply-To: <42322CE2.4040509@vanderpoel.org>
Content-Type: text/plain; charset="ISO-8859-1"; format="flowed"
Content-Transfer-Encoding: 7bit
All,

Please do not draw any conclusions from the raw Unicode category
stability data that I sent earlier. Ken Whistler, a Technical Director
at the Unicode Consortium, was so kind to provide further information to
put the data into their proper perspective. See below.

Sorry about that,

Erik

-------------------------------------------
Date: Fri, 11 Mar 2005 18:23:51 -0800 (PST)
From: Kenneth Whistler <kenw@sybase.com>
Subject: Re: UCD stability
To: erik@vanderpoel.org
Cc: unicode@unicode.org, kenw@sybase.com

Erik,

If you are going to do things like pass these raw calculations
along to the IDN list, ostensibly as some measure of stability
of the UCD data, then you should take into consideration another
metric. The raw number of characters changing is less reflective
of stability than considering how many *decisions* to change
a property (of one or more characters) were taken.

I intersperse some notes to Andrew West's calculated numbers
below, to help put this in context.

> Andrew C. West wrote:
> > According to my calculations, the number of characters which changed their
> > General Category from one version of Unicode to the next is :
> >
> > 1.1.5 -> 2.0.14 = 474 (1.384%)

Many, many, changes, since 1.1.5 was developed in house, without
general public review, and since 2.0.14 (the data version
corresponding to Unicode 2.0) was the first public release
of the data files.

> > 2.0.14 -> 2.1.2 = 1 (0.0025%)

1 decision

> > 2.1.2 -> 2.1.5 = 16 (0.0410%)

2 decisions: addition of Pi/Pf subcategories, and 1 fix for
8 Tibetan characters

> > 2.1.5 -> 2.1.8 = 18 (0.0462%)

1 decision: changes to converge identifier definitions

> > 2.1.8 -> 2.1.9 = 3 (0.0077%)

2 decisions: fix for Greek numeral signs; fix for halfwidth
forms light vertical

> > 2.1.9 -> 3.0.0 = 85 (0.2182%)

I'd have to dig further for this, but these were likely mostly
changes involved in nailing down normalization for Unicode 3.0.

> > 3.0.0 -> 3.0.1 = 0 (0%)
> > 3.0.1 -> 3.1.0 = 3 (0.0061%)

1 decision: 3 Runic golden numbers

> > 3.1.0 -> 3.2.0 = 7 (0.0074%)

5 decisions: 2 fixes for Khmer signs, 1 for Tamil aytham,
1 for Arabic end of ayah (architectural), 1 for the
3 Mongolian free variation selectors

> > 3.2.0 -> 4.0.0 = 16 (0.0168%)

2 decisions: 1 fix for 12 modifier letters, 1 fix for
decimal digit alignment

> > 4.0.0 -> 4.0.1 = 1 (0.0010%)

1 decision: fix for ZWSP

> > 4.0.1 -> 4.1.0 = 12 (0.0124%)

3 decisions: 1 fix for Ethiopic digits, 1 for 2 Katakana
middle dots, 1 for Yi syllable wu

> >
> > I don't know what this tells you about the stability of the UCD data though.

The significant point of instability in General Category
assignments was in establishing Unicode 2.0 data files (now
more than 8 years in the past).

There was a significant hiccup for Unicode 3.0, at the point
when it became clear that normalization stability was going
to be a major issue, and when the data was culled for
consistency under canonical and compatibility equivalence.

Since that time, the UTC has been very conservative, indeed,
in approving any General Category change for an existing
character. The types of changes have been limited to:

  A. Clarification regarding obscure characters for which
     insufficient information was available earlier.

  B. Establishment of further data consistency constraints
     (this impacted some numeric categories, and also explains
     the change for the Katakana middle dot)

  C. Implementation issues with a few format characters (ZWSP,
     Arabic end of ayah, Mongolian free variation selectors)

Since the publication of Unicode 3.0 in 2000, the only
significantly common-use characters that had any General
Category change were:

  U+0B83 TAMIL SIGN VISARGA (=aytham, Tamil data)
  U+200B ZERO WIDTH SPACE (mostly relevant to Thai data)
  U+30FB KATAKANA MIDDLE DOT (Japanese)

Of those 3, only U+30FB would exist in any commonly interchanged
character set other than Unicode, and *that* change was merely
to change a punctuation subclass (gc=Pc --> gc=Po) -- and was
additionally a *reversion* to the General Category assignment
that U+30FB had in 2.1.5 and earlier.

--Ken
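
[Illustrative note, not part of the original message.] For readers who want to reproduce raw per-version counts like the ones quoted above, the following is a minimal sketch of one way to do it, not the method Andrew West actually used. It assumes the semicolon-separated UnicodeData.txt layout (field 0 is the code point in hex, field 2 is the General Category). The function names and the two sample excerpts are hypothetical: the sample lines are trimmed to the first three fields and hand-written for illustration, with only the U+30FB change (Pc to Po) taken from the message above. The sketch does not expand "First"/"Last" range entries, and it counts only characters present in both versions.

```python
def parse_general_categories(unicode_data_text):
    """Map code point -> General Category from UnicodeData.txt-style text."""
    categories = {}
    for line in unicode_data_text.splitlines():
        line = line.strip()
        if not line:
            continue
        fields = line.split(";")
        # Field 0: code point (hex). Field 2: General Category.
        categories[int(fields[0], 16)] = fields[2]
    return categories


def category_changes(old_text, new_text):
    """Return {code point: (old gc, new gc)} for characters whose gc differs.

    Characters that exist in only one of the two versions are ignored,
    since a newly added character has not "changed" category.
    """
    old = parse_general_categories(old_text)
    new = parse_general_categories(new_text)
    return {
        cp: (old[cp], new[cp])
        for cp in old.keys() & new.keys()
        if old[cp] != new[cp]
    }


# Hypothetical, trimmed excerpts (remaining fields left empty on purpose).
OLD_SAMPLE = """\
0041;LATIN CAPITAL LETTER A;Lu;;;;;;;;;;;;
30FB;KATAKANA MIDDLE DOT;Pc;;;;;;;;;;;;
"""
NEW_SAMPLE = """\
0041;LATIN CAPITAL LETTER A;Lu;;;;;;;;;;;;
30FB;KATAKANA MIDDLE DOT;Po;;;;;;;;;;;;
A015;YI SYLLABLE WU;Lm;;;;;;;;;;;;
"""

changes = category_changes(OLD_SAMPLE, NEW_SAMPLE)
for cp, (before, after) in sorted(changes.items()):
    print(f"U+{cp:04X}: {before} -> {after}")
```

A count produced this way measures characters, not decisions. As Ken's notes point out, one decision can touch many characters at once, so the raw count alone overstates how often category assignments were revisited.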