[idn] Re: Unicode categories

Erik van der Poel <erik@vanderpoel.org> Sat, 12 March 2005 09:21 UTC

Message-ID: <4232B2FD.1080104@vanderpoel.org>
Date: Sat, 12 Mar 2005 01:14:37 -0800
From: Erik van der Poel <erik@vanderpoel.org>
User-Agent: Mozilla Thunderbird 1.0 (X11/20041206)
MIME-Version: 1.0
To: John C Klensin <klensin@jck.com>, idn@ops.ietf.org
CC: Kenneth Whistler <kenw@sybase.com>
Subject: [idn] Re: Unicode categories
References: <421B8484.3070802@vanderpoel.org> <20050223072837.GA21463~@nicemice.net> <D872CCF059514053ECF8A198@scan.jck.com> <421D8411.9030006@vanderpoel.org> <p06210208be4390618c81@[192.168.0.101]> <421E0D0C.2000309@vanderpoel.org> <p06210202be43c3888991@[192.168.0.101]> <E07CE813AD23B2D95DA0C740@scan.jck.com> <421E30F2.1040408@vanderpoel.org> <0E7F74C71945B923C52211F3@scan.jck.com> <421EA0C9.1010500@vanderpoel.org> <00a401c51af3$7863aae0$030aa8c0@DEWELL> <A574CA1BE87BFDA3C2A1AC0E@scan.jck.com> <42322CE2.4040509@vanderpoel.org>
In-Reply-To: <42322CE2.4040509@vanderpoel.org>
Content-Type: text/plain; charset="ISO-8859-1"; format="flowed"
Content-Transfer-Encoding: 7bit
Sender: owner-idn@ops.ietf.org
Precedence: bulk

All,

Please do not draw any conclusions from the raw Unicode category
stability data that I sent earlier. Ken Whistler, a Technical Director
at the Unicode Consortium, was kind enough to provide further
information that puts the data into proper perspective. See below.

Sorry about that,

Erik

-------------------------------------------

Date: Fri, 11 Mar 2005 18:23:51 -0800 (PST)
From: Kenneth Whistler <kenw@sybase.com>
Subject: Re: UCD stability
To: erik@vanderpoel.org
Cc: unicode@unicode.org, kenw@sybase.com

Erik,

If you are going to do things like pass these raw calculations
along to the IDN list, ostensibly as some measure of stability
of the UCD data, then you should take into consideration another
metric.

The raw number of characters changing is less reflective of
stability than the number of *decisions* that were taken to
change a property (of one or more characters).

I intersperse some notes to Andrew West's calculated numbers
below, to help put this in context.

 > Andrew C. West wrote:
 > > According to my calculations, the number of characters which changed
 > > their General Category from one version of Unicode to the next is:
 > >
 > > 1.1.5 -> 2.0.14 = 474 (1.384%)

Many, many changes, since 1.1.5 was developed in-house,
without general public review, and since 2.0.14 (the
data version corresponding to Unicode 2.0) was the first
public release of the data files.

 > > 2.0.14 -> 2.1.2 = 1 (0.0025%)

1 decision

 > > 2.1.2 -> 2.1.5 = 16 (0.0410%)

2 decisions: addition of Pi/Pf subcategories, and 1 fix for 8 Tibetan
characters

 > > 2.1.5 -> 2.1.8 = 18 (0.0462%)

1 decision: changes to converge identifier definitions

 > > 2.1.8 -> 2.1.9 = 3 (0.0077%)

2 decisions: fix for Greek numeral signs; fix for halfwidth forms light
vertical

 > > 2.1.9 -> 3.0.0 = 85 (0.2182%)

I'd have to dig further for this, but these were likely mostly
changes involved in nailing down normalization for Unicode 3.0.

 > > 3.0.0 -> 3.0.1 = 0 (0%)
 > > 3.0.1 -> 3.1.0 = 3 (0.0061%)

1 decision: 3 Runic golden numbers

 > > 3.1.0 -> 3.2.0 = 7 (0.0074%)

5 decisions: 2 fixes for Khmer signs, 1 for Tamil aytham, 1 for
Arabic end of ayah (architectural), 1 for the 3 Mongolian free
variation selectors

 > > 3.2.0 -> 4.0.0 = 16 (0.0168%)

2 decisions: 1 fix for 12 modifier letters, 1 fix for decimal digit
alignment

 > > 4.0.0 -> 4.0.1 = 1 (0.0010%)

1 decision: fix for ZWSP

 > > 4.0.1 -> 4.1.0 = 12 (0.0124%)

3 decisions: 1 fix for Ethiopic digits, 1 for 2 Katakana middle dots,
1 for Yi syllable wu

 > >
 > > I don't know what this tells you about the stability of the UCD
 > > data though.

The significant point of instability in General Category
assignments was in establishing Unicode 2.0 data files
(now more than 8 years in the past).

There was a significant hiccup for Unicode 3.0, at the point
when it became clear that normalization stability was going
to be a major issue, and when the data was culled for
consistency under canonical and compatibility equivalence.

Since that time, the UTC has been very conservative, indeed,
in approving any General Category change for an existing
character. The types of changes have been limited to:

   A. Clarification regarding obscure characters for which
      insufficient information was available earlier.

   B. Establishment of further data consistency constraints
      (this impacted some numeric categories, and also
      explains the change for the Katakana middle dot).

   C. Implementation issues with a few format characters
      (ZWSP, Arabic end of ayah, Mongolian free variation selectors).

Since the publication of Unicode 3.0 in 2000, the only
characters in significantly common use that had any General
Category change were:

    U+0B83 TAMIL SIGN VISARGA (=aytham, Tamil data)
    U+200B ZERO WIDTH SPACE  (mostly relevant to Thai data)
    U+30FB KATAKANA MIDDLE DOT  (Japanese)

Of those 3, only U+30FB would exist in any commonly
interchanged character set other than Unicode, and
*that* change was merely to change a punctuation subclass
(gc=Pc --> gc=Po) -- and was additionally a *reversion* to
the General Category assignment that U+30FB had in 2.1.5
and earlier.

--Ken
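
For readers who want to reproduce the raw per-version counts discussed
above, here is a minimal Python sketch of one way to do it. It is not
the script Andrew West used, which is not shown in this thread. It
assumes input in the UnicodeData.txt layout (semicolon-separated, code
point in field 0, General Category in field 2). The function names and
the abbreviated sample lines are hypothetical and purely illustrative;
only the U+30FB change (Pc to Po) is taken from the message above. Range
entries (the "First"/"Last" line pairs) are treated as ordinary lines,
which is a simplification.

```python
# Sketch only: the sample lines are abbreviated, illustrative stand-ins for
# two versions of UnicodeData.txt (real lines carry 15 fields, not 3).
OLD_SAMPLE = """\
0041;LATIN CAPITAL LETTER A;Lu
30FB;KATAKANA MIDDLE DOT;Pc
"""

NEW_SAMPLE = """\
0041;LATIN CAPITAL LETTER A;Lu
0237;LATIN SMALL LETTER DOTLESS J;Ll
30FB;KATAKANA MIDDLE DOT;Po
"""


def parse_general_categories(text):
    """Map code point -> General Category from UnicodeData.txt-style lines."""
    categories = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split(";")
        categories[int(fields[0], 16)] = fields[2]
    return categories


def changed_categories(old, new):
    """Return {code point: (old gc, new gc)} for characters present in both
    versions whose General Category differs. Newly added characters are
    ignored, since they have no earlier category to change from."""
    return {
        cp: (old[cp], new[cp])
        for cp in sorted(old.keys() & new.keys())
        if old[cp] != new[cp]
    }


if __name__ == "__main__":
    changes = changed_categories(
        parse_general_categories(OLD_SAMPLE),
        parse_general_categories(NEW_SAMPLE),
    )
    for cp, (before, after) in changes.items():
        print(f"U+{cp:04X}: {before} -> {after}")
    print(len(changes), "character(s) changed")
```

Note that this yields only the raw character count. The number of
*decisions* behind those changes cannot be derived from the data files
alone; it has to come from the UTC's records, as in Ken's notes above.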
