[idn] new I-D: Safely Encoding of likeness information into ACE label version 0.2
"Soobok Lee" <lsb@postel.co.kr> Sun, 29 July 2001 05:52 UTC
Message-ID: <004701c1170f$1eb34d20$ed1bd9d2@postel.co.kr>
From: Soobok Lee <lsb@postel.co.kr>
To: idn@ops.ietf.org
Cc: lsb@postel.co.kr
Subject: [idn] new I-D: Safely Encoding of likeness information into ACE label version 0.2
Date: Sat, 28 Jul 2001 11:43:44 +0900
Hi,

I am posting this new I-D to the mailing list for review before 14 August. The draft is still somewhat half-baked, but it helps to solve the problem of look-alike characters, at some cost. I will repost a revised version next week. Criticism and support are both welcome for further discussion.

Regards,
Soobok Lee, lsb@postel.co.kr

=========================================================================

Internet Draft
Soobok Lee
draft-lsb-lookalike-00.txt
Postel Services, Inc
28 Jul, 2001
Expires in six months

Safely Encoding of likeness information into ACE label
version 0.2

Status of this Memo

This document is an Internet-Draft and is in full conformance with all provisions of Section 10 of RFC2026.

Distribution of this document is unlimited. Please send comments to the author, lsb@postel.co.kr.

Abstract

For a Unicode character that has one or more look-alike characters, we define both its look-alike normalization form and a likeness index, and we suggest a new ACE prefixing rule in which likeness indices are encoded as a sequence of mixed-case Latin letters. The likeness index is used to restore the pre-normalization form of the ACE-decoded label and, at the same time, does not affect case-insensitive label comparison in existing applications and DNS servers.

Contents

Overview
ACE comparison
Maximum length of ACE label
Implication of restoring pre-normalization forms
More work to be done
Security considerations
References
Author

Overview

The Unicode Standard contains many sets of look-alike characters that are not yet documented in detail. Introducing the Unicode character set into IDN, the primary Internet identifier, requires rigorous work in this area in the near future for security reasons.
ACE algorithms such as [DUDE02] and [AMCACEZ] preserve the case information of the original IDN label by switching base32 Latin digits to their uppercase forms. This does not affect case-insensitive label comparison in applications and DNS servers.

This draft builds on this case-preserving yet case-insensitive nature of the IDNA architecture and extends it to fulfill the need to map look-alike letters into a unified one, while retaining the information about the pre-unification letter for later rendering of ACE-decoded labels for end users.

Let's assume that a future version of the Unicode Standard provides a look-alike normalization NFLA. For certain Unicode code points a1, a2, a3 that satisfy

    NFLA(a1) == NFLA(a2) == NFLA(a3) == a1

let's define a function LA_SET(uc) so that:

    LA_SET(a1) = {a1, a2, a3}

In this case the size of LA_SET(a1) is 3, which is called the 'current minimal likeness size'. We should also define a 'maximal likeness size' for each Unicode code point. For that, we define a function LikeSize(uc) so that

    LikeSize(a1) = 4, LikeSize(a2) = 4, LikeSize(a3) = 4.

It is wise to choose 4 instead of 3 to make room for future additions to the Unicode repertoire of scripts that may contain new look-alike characters. We need to look carefully into proposed or approved new scripts that will be added to the Unicode Standard in the near future.

We can express a1, a2, a3 as these pairs:

    a1 = (a1,0)   a2 = (a1,1)   a3 = (a1,2)

The second value of each pair is called the 'likeness index', for which we define a function so that

    LikeIndex(a1) = 0, LikeIndex(a2) = 1, LikeIndex(a3) = 2.

Let's define a function

    LA_TUPLE(uc) = ( NFLA(uc), LikeIndex(uc), LikeSize(uc) ).

Let's define a restoring function LA_CHAR(uc,i) so that:

    LA_CHAR(a1,0) = a1, LA_CHAR(a1,1) = a2, LA_CHAR(a1,2) = a3.
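The definitions above can be sketched in Python as a small lookup table. This is only an illustration: NFLA does not exist in any Unicode version, and the table contents below (which characters belong to each set, and the sizes 4 and 8) are hypothetical values chosen to agree with the Cyrillic example used later in this draft. Treating a character with no look-alike as having likeness index 0 and likeness size 1 is likewise an assumption.

```python
# Hypothetical look-alike sets, representative (NFLA form) first, each with
# its maximal likeness size. Neither the members nor the sizes come from any
# standard; they are chosen only for illustration.
LA_SETS = [
    (["a", "\u0430"], 4),            # Latin a, Cyrillic a
    (["o", "\u03bf", "\u043e"], 8),  # Latin o, Greek omicron, Cyrillic o
]

# Map each character to (NFLA(uc), LikeIndex(uc), LikeSize(uc)).
_TUPLES = {}
for members, size in LA_SETS:
    assert len(members) <= size, "maximal size must cover the current set"
    for index, ch in enumerate(members):
        _TUPLES[ch] = (members[0], index, size)


def la_tuple(uc):
    """LA_TUPLE(uc); a character with no look-alike maps to (uc, 0, 1) (assumption)."""
    return _TUPLES.get(uc, (uc, 0, 1))


def la_set(uc):
    """LA_SET(uc): every character that normalizes to the same representative."""
    rep = la_tuple(uc)[0]
    return [ch for ch, (r, _, _) in _TUPLES.items() if r == rep] or [uc]


def la_char(rep, i):
    """LA_CHAR(rep, i): restore the pre-normalization character."""
    if i == 0:
        return rep
    for ch, (r, index, _) in _TUPLES.items():
        if r == rep and index == i:
            return ch
    raise ValueError("no look-alike with index %d for %r" % (i, rep))
```

With this table, la_tuple applied to Cyrillic 'a' gives (Latin 'a', 1, 4), and la_char applied to (Latin 'a', 1) gives Cyrillic 'a' back. Keeping the maximal size separate from the current set length leaves room for later additions, as the draft recommends.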
The main idea of this draft is to encode, in ASCII, the likeness index and likeness size of each ambiguous character of an IDN label as a sequence of uppercase and lowercase Latin letters inserted after the ACE prefix.

For example, for the Cyrillic IDN label <cyrillic a><cyrillic zhe><cyrillic o>, the ACE label without look-alike normalization is

    dq--{<cyrillic a>}{<cyrillic zhe>}{<cyrillic o>}

Only Cyrillic 'a' and 'o' have Latin look-alikes, so we can make a new ACE label WITH look-alike normalization:

    dq--aAcCc--{<latin a>}{<cyrillic zhe>}{<latin o>}

In "aAcCc", an uppercase letter denotes bit '1' and a lowercase letter denotes bit '0'. Each run of uppercase and lowercase forms of one letter is a bitstring that denotes a likeness index. The letter itself (a-z) denotes the offset of the corresponding Unicode character in the IDN label's character sequence.

In the example above, "aA" denotes a likeness index of 1 and a likeness size of 4, and "cCc" denotes a likeness index of 2 and a likeness size of 8. "aA" supplements <latin a> to form <cyrillic a>, and "cCc" supplements <latin o> to form <cyrillic o>. <cyrillic zhe> has no letters in this sequence, since it is assumed to have no look-alike.

2^(the number of repetitions of a letter) equals the likeness size, which shall be large enough that it will not need to change in future versions of the Unicode look-alike normalization.

If the offset in a long IDN label goes beyond the 26 letters available, we can insert the unused digit '9' to mark a milestone from which 26 is added to the offset and the letters begin again with 'a'.

If applications and DNS servers do not case-fold the ACE label, this look-alike information is retained throughout processing, and the pre-normalization native-script label can be restored. The restoring process is as follows:

First, we ACE-decode the label part of our new ACE label and get a sequence of Unicode code points, LABEL.
Second, from the likeness-indices part of our new ACE label, we construct an array of pairs (i, offset), each holding a likeness index and the offset of the target character in the decoded label LABEL.

Third, for each pair (i, offset), we replace the target code point in the decoded label LABEL with the restored original character LA_CHAR(LABEL[offset], i).

ACE comparison

Suppose we have three IDN labels:

    IDN1 = <cyrillic a><cyrillic zhe><cyrillic o>
    IDN2 = <latin a><cyrillic zhe><cyrillic o>
    IDN3 = <latin a><cyrillic zhe><latin o>

    ACE(IDN1) = dq--aAcCc--{<latin a>}{<cyrillic zhe>}{<latin o>}
    ACE(IDN2) = dq--aacCc--{<latin a>}{<cyrillic zhe>}{<latin o>}
    ACE(IDN3) = dq--aaccc--{<latin a>}{<cyrillic zhe>}{<latin o>}

IDN1, IDN2 and IDN3 are equivalent modulo look-alike normalization. Their three ACE labels have differently cased likeness indices, but they are regarded as the same domain in case-insensitive comparison in applications and DNS servers.

If an IDN has no ambiguous characters, we can omit the likeness part and its '--' in some ACEs:

    IDN4 = <cyrillic zhe><cyrillic zhe>
    ACE(IDN4) = dq--{<cyrillic zhe>}{<cyrillic zhe>}

If an IDN is look-alike normalized into an all-Latin LDH domain, it should be registered not as an IDN but as an LDH domain; in this case we cannot provide likeness preservation.

Maximum length of ACE label

This new scheme adds some overhead to the ACE label: the additional "--" and the encoded likeness information. If we assume the average likeness size to be 2 (1 bit):

    overhead = 2 + 1 * (number of ambiguous characters in a label)

Since most Han/Hangeul characters have no look-alike characters, overall ACE label efficiency for Han/Hangeul would not be affected. Latin, Greek, Cyrillic, Katakana and many Indian scripts have many look-alike characters. Efficient ACE encoding of IDN labels is required for this scheme.
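The prefix encoding, the restoring process, the case-insensitive comparison and the overhead estimate described so far can be sketched together in Python. This is a rough sketch under stated assumptions, not a reference implementation: the LIKENESS table is hypothetical, bits are written most significant first (which is what makes "aA" read as 1), offsets are counted from 0, and the ACE encoding of the label part itself ([DUDE02] or [AMCACEZ]) is left out, so the functions work on the normalized code points directly.

```python
# Hypothetical data: character -> (representative, likeness index, likeness size).
LIKENESS = {
    "a": ("a", 0, 4), "\u0430": ("a", 1, 4),        # Latin a, Cyrillic a
    "o": ("o", 0, 8), "\u03bf": ("o", 1, 8),        # Latin o, Greek omicron
    "\u043e": ("o", 2, 8),                          # Cyrillic o
}
RESTORE = {(rep, idx): ch for ch, (rep, idx, _) in LIKENESS.items()}


def encode_likeness(label):
    """Return (likeness letters, look-alike-normalized label)."""
    prefix, normalized, block = [], [], 0
    for offset, ch in enumerate(label):
        rep, index, size = LIKENESS.get(ch, (ch, 0, 1))
        normalized.append(rep)
        nbits = size.bit_length() - 1       # size is assumed to be a power of two
        if nbits == 0:
            continue                        # unambiguous character: nothing emitted
        while block < offset // 26:         # each '9' adds 26 to the offset base
            prefix.append("9")
            block += 1
        letter = chr(ord("a") + offset % 26)
        bits = format(index, "0%db" % nbits)            # most significant bit first
        prefix.extend(letter.upper() if b == "1" else letter for b in bits)
    return "".join(prefix), "".join(normalized)


def restore(prefix, normalized):
    """Rebuild the pre-normalization label from the likeness letters."""
    chars, base, i = list(normalized), 0, 0
    while i < len(prefix):
        if prefix[i] == "9":
            base += 26
            i += 1
            continue
        letter, index = prefix[i].lower(), 0
        while i < len(prefix) and prefix[i].lower() == letter:
            index = index * 2 + (1 if prefix[i].isupper() else 0)
            i += 1
        offset = base + ord(letter) - ord("a")
        # An unassigned index leaves the normalized character in place (assumption).
        chars[offset] = RESTORE.get((chars[offset], index), chars[offset])
    return "".join(chars)


def same_domain(label1, label2):
    """Case-insensitive comparison of the likeness letters plus normalized label."""
    p1, n1 = encode_likeness(label1)
    p2, n2 = encode_likeness(label2)
    return (p1.lower(), n1) == (p2.lower(), n2)


def overhead(label):
    """Extra ACE characters: the additional '--' plus one letter per likeness bit."""
    return 2 + len(encode_likeness(label)[0])
```

For IDN1 this produces the letters "aAcCc" and an overhead of 7 characters. If an intermediate system case-folds the label to "aaccc", restore returns the all-index-0 form, that is, the normalized label, so the comparison still succeeds even though the original script information is lost.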
Implication of restoring pre-normalization forms

This new ACE prefixing scheme retains the look-alike information, so we can restore the original native-script labels as they were before look-alike normalization, even when they contain characters that look alike across scripts.

For example, Katakana 'ka' (U+30AB) and the Chinese character 'power' (U+529B) look the same. We can assign likeness indices 0 and 1 to 'ka' and 'power', respectively. If we normalized 'ka' into 'power' without the encoded case information, we could no longer restore 'ka', and we could not avoid font-rendering problems and conflicts of interest between the countries concerned. With the encoded likeness information, however, we can restore 'ka' and have no such problem.

More work to be done

Some sequences of characters look similar to a single character or to other sequences of characters. Most of these sequences are normalized and unified by KC normalization in NAMEPREP, but some visual similarities are still not completely eliminated. This subject needs further elaboration.

Security considerations

This scheme suggests that ACE labels be prefixed with additional look-alike information encoded in sequences of cased letters; it does not introduce any security hole into IDN.

References

[UNICODE] The Unicode Consortium, "The Unicode Standard", http://www.unicode.org/unicode/standard/standard.html.

[IDNA] Patrik Faltstrom, Paul Hoffman, "Internationalizing Host Names In Applications (IDNA)", draft-ietf-idn-idna-03.

[NAMEPREP03] Paul Hoffman, Marc Blanchet, "Preparation of Internationalized Host Names", July 19, 2001, draft-ietf-idn-nameprep-05.

[DUDE02] Mark Welter, Brian Spolarich, Adam Costello, "DUDE: Differential Unicode Domain Encoding", 2001-May-31, draft-ietf-idn-dude-02.

[AMCACEZ] Adam Costello, "AMC-ACE-z version 0.2.1", 2001-May-31, draft-ietf-idn-amc-ace-z-00, latest version at http://www.cs.berkeley.edu/~amc/charset/amc-ace-z.gz

Author

Soobok Lee <lsb@postel.co.kr>
Postel Services, Inc.
http://www.postel.co.kr
Tel: +82-11-9774-2737