[idn] new I-D: Safely Encoding of likeness information into ACE label version 0.2

"Soobok Lee" <lsb@postel.co.kr> Sun, 29 July 2001 05:52 UTC

Received: from psg.com (exim@psg.com [147.28.0.62]) by ietf.org (8.9.1a/8.9.1a) with SMTP id BAA14239 for <idn-archive@lists.ietf.org>; Sun, 29 Jul 2001 01:52:54 -0400 (EDT)
Received: from lserv by psg.com with local (Exim 3.31 #1) id 15QJzL-000GX4-00 for idn-data@psg.com; Fri, 27 Jul 2001 19:37:59 -0700
Received: from bora.postel.to ([164.124.123.206] helo=bora.lsb.org ident=root) by psg.com with esmtp (Exim 3.31 #1) id 15QJzK-000GUx-00 for idn@ops.ietf.org; Fri, 27 Jul 2001 19:37:58 -0700
Received: from soobok ([210.217.27.237] (may be forged)) by bora.lsb.org (8.9.3/8.8.7) with SMTP id LAA20071; Sat, 28 Jul 2001 11:37:46 +0900
Message-ID: <004701c1170f$1eb34d20$ed1bd9d2@postel.co.kr>
From: Soobok Lee <lsb@postel.co.kr>
To: idn@ops.ietf.org
Cc: lsb@postel.co.kr
Subject: [idn] new I-D: Safely Encoding of likeness information into ACE label version 0.2
Date: Sat, 28 Jul 2001 11:43:44 +0900
MIME-Version: 1.0
Content-Type: text/plain; charset="ks_c_5601-1987"
Content-Transfer-Encoding: 7bit
X-Priority: 3
X-MSMail-Priority: Normal
X-Mailer: Microsoft Outlook Express 5.50.4522.1200
X-MimeOLE: Produced By Microsoft MimeOLE V5.50.4522.1200
Sender: owner-idn@ops.ietf.org
Precedence: bulk

Hi,
I am posting this new I-D to the mailing list for review before
14 August. The I-D is still somewhat half-baked, but it helps to solve
the problem of look-alike characters, at some cost.
I will repost a revised version next week.

Criticism and support are both welcome for further discussion.

Regards,

Soobok Lee, lsb@postel.co.kr

=========================================================================
Internet Draft                                              Soobok Lee
draft-lsb-lookalike-00.txt                        Postel Services, Inc
28 Jul, 2001
Expires in six months


        Safely Encoding of likeness information into ACE label
                        version 0.2


Status of this Memo

    This document is an Internet-Draft and is in full conformance with
    all provisions of Section 10 of RFC2026.

    Distribution of this document is unlimited.  Please send comments to
    the author lsb@postel.co.kr


Abstract

  For a Unicode character that has one or more look-alike characters,
  we define both its look-alike normalization form and a likeness index, and
  suggest a new ACE prefixing rule for likeness indices, which are encoded as
  a sequence of mixed-case Latin letters. The likeness index is used to
  restore the pre-normalization form of the ACE-decoded label and at the same
  time does not affect case-insensitive label comparison
  in existing applications and DNS servers.


Contents

 Overview
 ACE comparison
 Maximum length of ACE label
 Implication of restoring pre-normalization forms
 More works to be done
 Security considerations
 References
 Author


Overview

 The Unicode Standard contains many sets of look-alike characters that are
 not yet documented in detail. Introducing the Unicode character set into
 IDN, the primary Internet identifier, requires rigorous work in this area
 in the near future for security reasons.

 ACE algorithms such as [DUDE02] and [AMCACEZ] preserve the case information
 of the original IDN label by changing base-32 Latin digits to their
 uppercase forms. This does not affect case-insensitive label comparison
 operations in applications and DNS servers.

 This draft builds on the case-preserving and case-insensitive nature
 of the IDNA architecture and extends it to fulfill the need to map
 look-alike letters to a unified one while retaining the information
 about the pre-unification letter, so that ACE-decoded labels can later
 be rendered for end users.

  Let's assume that a future version of the Unicode Standard defines a
  look-alike normalization, NFLA.

  For Unicode code points a1, a2, a3 that satisfy
     NFLA(a1) == NFLA(a2) == NFLA(a3) == a1,
  let's define a function LA_SET(uc) so that:
     LA_SET(a1) = {a1,a2,a3}

  In this case the size of LA_SET(a1) is 3, which is called the
  'current minimal likeness size'.

  We should also define a 'maximal likeness size' for each Unicode code
  point. For that, we define a function LikeSize(uc) so that
     LikeSize(a1) = 4, LikeSize(a2) = 4, LikeSize(a3) = 4.

  It is wise to choose 4 instead of 3 to leave room for future additions
  to the Unicode repertoire of scripts that may contain new look-alike
  characters. We need to look carefully at new scripts that are proposed
  or approved for addition to the Unicode Standard in the near future.
   

  We can express a1, a2, a3 as these pairs:
   a1=(a1,0)
   a2=(a1,1)
   a3=(a1,2)
  The second value of each pair is called the 'likeness index',
  for which we define a function so that
    LikeIndex(a1) =  0,
    LikeIndex(a2) =  1,
    LikeIndex(a3) =  2.

  Let's define a function
    LA_TUPLE(uc) = ( NFLA(uc), LikeIndex(uc), LikeSize(uc) ).

  Let's define a restoring function LA_CHAR(uc,i)  so that:
    LA_CHAR(a1,0)=a1,
    LA_CHAR(a1,1)=a2,
    LA_CHAR(a1,2)=a3.
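
 The definitions above can be sketched in Python. This is only an
 illustration: NFLA is not defined by any Unicode standard, and the table
 contents (which characters are grouped, their order, and the reserved
 sizes 4 and 8) are assumptions made for the example, as are the
 lowercase function names.

```python
# Hypothetical look-alike table (assumption; the draft defines no real data).
# representative -> (members ordered by likeness index, maximal likeness size)
LOOKALIKE_TABLE = {
    "a": (["a", "\u0430"], 4),            # Latin a, Cyrillic a
    "o": (["o", "\u03bf", "\u043e"], 8),  # Latin o, Greek omicron, Cyrillic o
}

# Reverse map: code point -> (representative, likeness index)
_MEMBER = {
    ch: (rep, idx)
    for rep, (members, _size) in LOOKALIKE_TABLE.items()
    for idx, ch in enumerate(members)
}


def nfla(uc):
    """Look-alike normalization: map a code point to its representative."""
    return _MEMBER.get(uc, (uc, 0))[0]


def la_set(uc):
    """Set of currently known look-alikes that share uc's representative."""
    rep = nfla(uc)
    return set(LOOKALIKE_TABLE[rep][0]) if rep in LOOKALIKE_TABLE else {uc}


def like_index(uc):
    """Likeness index of uc within its look-alike set (0 if it has none)."""
    return _MEMBER.get(uc, (uc, 0))[1]


def like_size(uc):
    """Maximal (reserved) likeness size; 1 means no look-alikes at all."""
    rep = nfla(uc)
    return LOOKALIKE_TABLE[rep][1] if rep in LOOKALIKE_TABLE else 1


def la_tuple(uc):
    return (nfla(uc), like_index(uc), like_size(uc))


def la_char(rep, i):
    """Restore the original character from (representative, likeness index)."""
    return LOOKALIKE_TABLE[rep][0][i] if rep in LOOKALIKE_TABLE else rep
```

 With this table, la_tuple applied to Cyrillic 'a' (U+0430) gives
 ('a', 1, 4), and la_char('a', 1) gives back U+0430.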


 The main idea of this draft is to encode, in ASCII, the likeness index
 and likeness size of each ambiguous character of an IDN label as a
 sequence of uppercase and lowercase Latin letters inserted after the
 ACE prefix.

 For example, for a Cyrillic IDN label <cyrillic a><cyrillic zhe><cyrillic o>,
 the ACE label without look-alike normalization is

    dq--{<cyrillic a>}{<cyrillic zhe>}{<cyrillic o>}

 Only Cyrillic 'a' and 'o' have Latin look-alikes, so we can make a
 new ACE label WITH look-alike normalization:

    dq--aAcCc--{<latin a>}{<cyrillic zhe>}{<latin o>}

 The uppercase letters in "aAcCc" denote bit '1' and the lowercase
 letters denote bit '0'. Each run of uppercase and lowercase forms of
 one letter forms a bit string that denotes a likeness index.
 The letter itself (a-z) denotes the offset of the corresponding Unicode
 character in the IDN label's character sequence.

 In the example above, "aA" denotes a likeness index of 1 and a likeness
 size of 4. "cCc" denotes a likeness index of 2 and a likeness size of 8.
 "aA" supplements <latin a> to form <cyrillic a> and
 "cCc" supplements <latin o> to form <cyrillic o>.
 <cyrillic zhe> has no likeness information since it is assumed to have no
 look-alike.

 2^(the number of repeated letters) is equal to the likeness size, which
 shall be large enough that it need not change in future versions of the
 Unicode look-alike normalization.
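
 A minimal Python sketch of this letter encoding follows. The function
 name encode_likeness is hypothetical, and the most-significant-bit-first
 order is an assumption inferred from the example ("aA" is index 1);
 the draft does not state the bit order explicitly.

```python
import string


def encode_likeness(offset, index, size):
    """Encode one ambiguous character as a run of cased letters.

    offset: position of the character in the label (0 -> 'a', 1 -> 'b', ...)
    index:  likeness index of the original character
    size:   likeness size; assumed to be a power of two, at least 2
    """
    if not 0 <= offset < 26:
        raise ValueError("offsets of 26 or more are outside this sketch")
    nbits = size.bit_length() - 1
    if size < 2 or (1 << nbits) != size:
        raise ValueError("likeness size must be a power of two, at least 2")
    if not 0 <= index < size:
        raise ValueError("likeness index must be smaller than likeness size")
    letter = string.ascii_lowercase[offset]
    # Assumption: most significant bit first, uppercase = 1, lowercase = 0.
    bits = format(index, "0{}b".format(nbits))
    return "".join(letter.upper() if b == "1" else letter for b in bits)
```

 Under these assumptions, encode_likeness(0, 1, 4) yields "aA" and
 encode_likeness(2, 2, 8) yields "cCc", matching the example label.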

 If the offset index in a long IDN label needs to go beyond the 26 letters,
 we can insert the unused digit '9' to mark a milestone from which
 26 is added to the offset and the alphabet begins with 'a' again.
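
 One possible reading of the '9' milestone rule is sketched below. It
 assumes that entries are written in ascending offset order and that each
 '9' adds a further 26 to the base offset, so two milestones reach
 offsets 52 and up; the draft does not spell out either point. The
 function name encode_entries is hypothetical.

```python
import string


def encode_entries(entries):
    """Encode (offset, bits) pairs, inserting '9' milestones as needed.

    entries: iterable of (offset, bits), where bits is a string such as
             "01" holding the likeness index of the character at offset.
    """
    out = []
    base = 0
    for offset, bits in sorted(entries):
        # Assumption: each '9' shifts the base offset by another 26.
        while offset - base >= 26:
            out.append("9")
            base += 26
        letter = string.ascii_lowercase[offset - base]
        out.append("".join(letter.upper() if b == "1" else letter
                           for b in bits))
    return "".join(out)
```

 For instance, an index bit '1' at offset 27 following "aA" at offset 0
 would be written "aA9B" under this reading.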

 If applications and DNS servers do not case-fold the ACE label, this
 look-alike information is retained throughout the process and can be
 used to restore the pre-normalization native-script label.

 The restoring process is as follows:

  First,
   we ACE-decode the label part of our new ACE label and get
   a sequence of unicode points LABEL.

  Second,
   from the likeness indices part of our new ACE label, we construct
   an array of pairs, each holding a likeness index and the offset of
   the target character in the decoded label LABEL.

  Third,
   for each pair (i,offset), we replace the target
   code point in the decoded label LABEL with the restored original
   character LA_CHAR(LABEL[offset],i).
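
 The second and third steps can be sketched in Python as follows. The
 first step (ACE decoding) is out of scope here, so the decoded label is
 passed in directly. The MEMBERS table, the function names, the
 most-significant-bit-first order and the handling of '9' are all
 assumptions made for illustration.

```python
# Hypothetical table: representative -> members ordered by likeness index.
MEMBERS = {
    "a": ["a", "\u0430"],            # Latin a, Cyrillic a
    "o": ["o", "\u03bf", "\u043e"],  # Latin o, Greek omicron, Cyrillic o
}


def parse_likeness(section):
    """Step 2: turn a string such as 'aAcCc' into (index, offset) pairs."""
    pairs = []
    base = 0
    i = 0
    while i < len(section):
        ch = section[i]
        if ch == "9":            # milestone: later offsets start 26 higher
            base += 26
            i += 1
            continue
        letter = ch.lower()
        if not "a" <= letter <= "z":
            raise ValueError("unexpected character in likeness section")
        index = 0
        # Consume the whole run of this letter; uppercase is bit 1.
        while i < len(section) and section[i].lower() == letter:
            index = (index << 1) | int(section[i].isupper())
            i += 1
        pairs.append((index, base + ord(letter) - ord("a")))
    return pairs


def restore(label, section):
    """Step 3: replace each target code point by LA_CHAR(LABEL[offset], i)."""
    chars = list(label)
    for index, offset in parse_likeness(section):
        chars[offset] = MEMBERS[chars[offset]][index]
    return "".join(chars)
```

 Given the decoded label <latin a><cyrillic zhe><latin o> and the
 likeness section "aAcCc", restore returns
 <cyrillic a><cyrillic zhe><cyrillic o> with this table.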


ACE comparison

  We have three IDN labels like these:

    IDN1 =  <cyrillic a><cyrillic zhe><cyrillic o>
    IDN2 =  <latin a><cyrillic zhe><cyrillic o>
    IDN3 =  <latin a><cyrillic zhe><latin o>

    ACE(IDN1)  =  dq--aAcCc--{<latin a>}{<cyrillic zhe>}{<latin o>}
    ACE(IDN2)  =  dq--aacCc--{<latin a>}{<cyrillic zhe>}{<latin o>}
    ACE(IDN3)  =  dq--aaccc--{<latin a>}{<cyrillic zhe>}{<latin o>}

  IDN1, IDN2 and IDN3 are equivalent modulo look-alike normalization.
  Their three ACE labels have differently cased likeness indices,
  but they are regarded as the same domain in case-insensitive comparison
  in applications and DNS servers.
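
 The comparison can be illustrated with a short Python sketch. The string
 "payload" is a hypothetical stand-in for the ACE encoding of the
 normalized label, which is identical for all three names; the helper
 dns_equal is likewise an assumption, modelling ASCII case-insensitive
 matching.

```python
def dns_equal(label1, label2):
    """ASCII case-insensitive comparison of two LDH labels."""
    return label1.lower() == label2.lower()


PAYLOAD = "payload"  # hypothetical placeholder for the encoded label part

ace1 = "dq--aAcCc--" + PAYLOAD
ace2 = "dq--aacCc--" + PAYLOAD
ace3 = "dq--aaccc--" + PAYLOAD

# The labels differ byte for byte, yet compare equal without regard to case.
all_distinct = len({ace1, ace2, ace3}) == 3
all_equivalent = dns_equal(ace1, ace2) and dns_equal(ace2, ace3)
```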

  If an IDN has no ambiguous characters, we can omit the likeness indices
  and the additional '--' from its ACE label.

   IDN4 =  <cyrillic zhe><cyrillic zhe>
   ACE(IDN4)  =  dq--{<cyrillic zhe>}{<cyrillic zhe>}

  If an IDN is look-alike normalized into an all-Latin LDH label,
  it should be registered not as an IDN but as an LDH domain, and in
  this case we cannot provide likeness preservation.



Maximum length of ACE label

 This new scheme adds some overhead to the ACE label:
   the additional "--" and the encoded likeness information.

 If we assume the average likeness size to be 2 (1 bit):
   overhead  = 2 + 1 * (number of ambiguous characters in a label)
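
 The same estimate can be generalized to arbitrary likeness sizes, as in
 this small Python sketch. The function name is hypothetical, sizes are
 assumed to be powers of two, and any '9' milestones for long labels are
 not counted.

```python
def likeness_overhead(sizes):
    """Extra ASCII characters added to an ACE label by likeness encoding.

    sizes: the likeness size of each ambiguous character in the label.
    """
    if not sizes:
        return 0  # no ambiguous characters: the likeness part is omitted
    # 2 for the extra "--", plus log2(size) letters per ambiguous character.
    return 2 + sum(size.bit_length() - 1 for size in sizes)
```

 For the label "dq--aAcCc--..." with likeness sizes 4 and 8, this gives
 2 + 2 + 3 = 7, the length of "aAcCc--". With three ambiguous characters
 of size 2 it gives 2 + 1 * 3 = 5, in line with the formula above.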

 Since most Han/Hangeul letters have no look-alike characters,
 overall ACE label efficiency for Han/Hangeul would not be affected.

 Latin, Greek, Cyrillic, Katakana and many Indian scripts have
 many look-alike characters.

 This scheme therefore requires efficient ACE encoding of the IDN label.



Implication of restoring pre-normalization forms

 This new ACE prefixing scheme retains the look-alike information,
 so that we can restore the original native-script labels before
 look-alike normalization even when they contain  look-alike characters
 across scripts.


 For example, Katakana 'ka' (U+30AB) and the Chinese character 'power'
 (U+529B) look the same. We can assign likeness index 0 and 1 to 'ka'
 and 'power', respectively.

 If we normalized 'ka' into 'power' without encoded likeness information,
 we could no longer restore 'ka'. We could not avoid font-rendering
 problems and conflicts of interest between the countries concerned.

 But with this encoded likeness information, we can restore 'ka'
 and we have no such problem.


More works to be done

 Some sequences of characters look similar to a single character or
 to other sequences of characters.

 Most of these sequences are normalized and unified by KC
 normalization in NAMEPREP. But some visual similarities
 are still not completely eliminated.

 This subject needs further elaboration.



Security considerations

 This scheme suggests that ACE labels be prefixed with additional
 look-alike information encoded in sequences of cased letters,
 and does not introduce any security hole into IDN.


References

    [UNICODE] The Unicode Consortium, "The Unicode Standard",
    http://www.unicode.org/unicode/standard/standard.html.

    [IDNA] Patrik Faltstrom, Paul Hoffman, "Internationalizing Host
    Names In Applications (IDNA)", draft-ietf-idn-idna-03.

    [NAMEPREP03] Paul Hoffman, Marc Blanchet, "Preparation
    of Internationalized Host Names", July 19, 2001
    draft-ietf-idn-nameprep-05.

    [DUDE02] Mark Welter, Brian Spolarich, Adam Costello,
    "DUDE: Differential Unicode Domain Encoding", 2001-May-31,
    draft-ietf-idn-dude-02.

    [AMCACEZ] Adam Costello, "AMC-ACE-z version 0.2.1",
    2001-May-31, draft-ietf-idn-amc-ace-z-00, latest version at
    http://www.cs.berkeley.edu/~amc/charset/amc-ace-z.gz


Author

    Soobok Lee <lsb@postel.co.kr>
    Postel Services, Inc.
    http://www.postel.co.kr
    Tel: +82-11-9774-2737