[idn] character tables

Erik van der Poel <erik@vanderpoel.org> Mon, 28 February 2005 02:18 UTC

Received: from psg.com (mailnull@psg.com [147.28.0.62]) by ietf.org (8.9.1a/8.9.1a) with ESMTP id VAA27094 for <idn-archive@lists.ietf.org>; Sun, 27 Feb 2005 21:18:32 -0500 (EST)
Received: from majordom by psg.com with local (Exim 4.44 (FreeBSD)) id 1D5aRe-0007vK-3F for idn-data@psg.com; Mon, 28 Feb 2005 02:15:38 +0000
Received: from [207.115.63.102] (helo=pimout3-ext.prodigy.net) by psg.com with esmtp (Exim 4.44 (FreeBSD)) id 1D5aRa-0007u5-Ou for idn@ops.ietf.org; Mon, 28 Feb 2005 02:15:35 +0000
Received: from [10.1.1.2] (adsl-64-174-147-206.dsl.sntc01.pacbell.net [64.174.147.206]) by pimout3-ext.prodigy.net (8.12.10 milter /8.12.10) with ESMTP id j1S2FSpY401160; Sun, 27 Feb 2005 21:15:32 -0500
Message-ID: <42227EBF.9040703@vanderpoel.org>
Date: Sun, 27 Feb 2005 18:15:27 -0800
From: Erik van der Poel <erik@vanderpoel.org>
User-Agent: Mozilla Thunderbird 1.0 (X11/20041206)
X-Accept-Language: en-us, en
MIME-Version: 1.0
To: John C Klensin <klensin@jck.com>
CC: idn@ops.ietf.org
Subject: [idn] character tables
References: <421B8484.3070802@vanderpoel.org> <20050223072837.GA21463~@nicemice.net> <D872CCF059514053ECF8A198@scan.jck.com> <421D8411.9030006@vanderpoel.org> <p06210208be4390618c81@[192.168.0.101]> <421E0D0C.2000309@vanderpoel.org> <p06210202be43c3888991@[192.168.0.101]> <E07CE813AD23B2D95DA0C740@scan.jck.com> <421E30F2.1040408@vanderpoel.org> <0E7F74C71945B923C52211F3@scan.jck.com> <421EA0C9.1010500@vanderpoel.org> <00a401c51af3$7863aae0$030aa8c0@DEWELL> <A574CA1BE87BFDA3C2A1AC0E@scan.jck.com> <421FA55B.9000308@vanderpoel.org> <421FCBD7.8000805@vanderpoel.org>
In-Reply-To: <421FCBD7.8000805@vanderpoel.org>
Content-Type: text/plain; charset="ISO-8859-1"; format="flowed"
Content-Transfer-Encoding: 7bit
X-Spam-Checker-Version: SpamAssassin 3.0.1 (2004-10-22) on psg.com
X-Spam-Status: No, score=-2.6 required=5.0 tests=BAYES_00 autolearn=ham version=3.0.1
Sender: owner-idn@ops.ietf.org
Precedence: bulk

> However, one avenue that might be worth exploring some more is to check 
> each registry's character table (for those that have one) and see what 
> the Unicode category is for each character. The Japanese Katakana middle 
> dot U+30FB has the category "Pc" which means "punctuation, connector" 
> and LDH's hyphen U+002D has the category "Pd" which means "punctuation, 
> dash".
> 
> http://www.unicode.org/Public/UNIDATA/UnicodeData.txt
> http://www.unicode.org/Public/UNIDATA/UCD.html#General_Category_Values
> 
> If it turns out that all or most of the registries that have tables are 
> using characters with only a small number of Unicode categories, then we 
> may wish to consider moving IDNA to that set of categories (disallowing 
> all others). This would keep the registries happy while keeping *some* 
> of the phishy characters out of DNS.

Even if we do not end up prohibiting a larger number of characters in 
nameprep-bis, it might still be a good idea to have the results of the 
investigation proposed above, since these Unicode character categories 
could then be entered into the guidelines for the registries.
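
A minimal sketch of how such a per-table tally might be done, assuming a table has already been read into a list of integer code points. The sample table and the function name category_counts are hypothetical and do not come from any registry.

```python
import unicodedata
from collections import Counter

def category_counts(code_points, ucd=unicodedata):
    """Tally Unicode general categories for an iterable of integer code points.

    `ucd` is any object with a category() method, e.g. the unicodedata
    module itself or unicodedata.ucd_3_2_0.
    """
    return Counter(ucd.category(chr(cp)) for cp in code_points)

# Hypothetical table contents, for illustration only:
# hyphen-minus, two Latin letters, one digit, one Katakana letter.
sample_table = [0x002D, 0x0061, 0x0062, 0x0030, 0x30A2]

counts = category_counts(sample_table)
for category, n in sorted(counts.items()):
    print(category, n)
```

Passing unicodedata.ucd_3_2_0 as the ucd argument would use the Unicode 3.2 data that the current nameprep profile is based on. Categories for some characters may differ between Unicode versions, so the version used should be recorded with the results.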

So, these two sub-projects (nameprep-bis and registry table 
investigation) could proceed in parallel. I think it would be good to 
divide and conquer, since one person cannot do all of this. Perhaps we 
could invite volunteers to work on sub-projects?

As I indicate at nameprep.org, I found some character tables at the IANA 
site, but I found even more at the GNU libidn site. One of the first 
things to do is to agree on a single machine-readable format, since the 
tables do not all seem to use the same format yet. We would then also 
need the latest official tables from the registries themselves (instead 
of possibly out-of-date IANA tables and possibly embellished, unofficial 
GNU libidn tables).

http://nameprep.org/#related-work
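
To make the idea of a single machine-readable format concrete, here is a rough sketch of a parser for one possible format. The format itself is an assumption made up for illustration (one code point or one inclusive range per line, in hexadecimal, with "#" starting a comment). It is not the format used by IANA, GNU libidn, or any registry.

```python
import re

# Hypothetical line format: "U+XXXX" or "U+XXXX..U+YYYY"; the "U+" is optional.
_LINE = re.compile(
    r"^(?:U\+)?([0-9A-Fa-f]{4,6})(?:\.\.(?:U\+)?([0-9A-Fa-f]{4,6}))?$"
)

def parse_table(text):
    """Return the sorted list of code points listed in a table.

    Blank lines and anything after '#' are ignored. Unrecognized lines
    raise ValueError, so a malformed table is not accepted silently.
    """
    code_points = set()
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        match = _LINE.match(line)
        if match is None:
            raise ValueError(f"unrecognized table line: {raw!r}")
        first = int(match.group(1), 16)
        last = int(match.group(2), 16) if match.group(2) else first
        if last < first:
            raise ValueError(f"descending range: {raw!r}")
        code_points.update(range(first, last + 1))
    return sorted(code_points)

# Made-up example table, not taken from any registry.
example = """\
# hypothetical registry table
U+002D          # hyphen-minus
U+0030..U+0039  # digits
0061..007A      # lowercase a-z
"""

table = parse_table(example)
print(len(table), hex(table[0]), hex(table[-1]))
```

Once every table is reduced to a plain list of code points like this, the tables can be compared with each other and checked against the Unicode data without per-registry special cases.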

Erik