[idn] a way toward homograph resolution ? (was "improving WG operation")

"JFC (Jefsey) Morfin" <jefsey@jefsey.com> Wed, 11 May 2005 05:44 UTC

Message-Id: <6.2.1.2.2.20050511050500.045cf140@mail.jefsey.com>
Date: Wed, 11 May 2005 06:08:18 +0200
To: ietf@ietf.org
From: "JFC (Jefsey) Morfin" <jefsey@jefsey.com>
Subject: [idn] a way toward homograph resolution ? (was "improving WG operation")
Cc: idn@ops.ietf.org, "Hallam-Baker, Phillip" <pbaker@verisign.com>
In-Reply-To: <001601c555d3$453fd9c0$7f1afea9@oemcomputer>
References: <198A730C2044DE4A96749D13E167AD37250259@MOU1WNEXMB04.vcorp.ad.vrsn.com> <6.2.1.2.2.20050511021431.048f8060@mail.jefsey.com> <001601c555d3$453fd9c0$7f1afea9@oemcomputer>

On 04:43 11/05/2005, Randy Presuhn said:
>From: "JFC (Jefsey) Morfin" <jefsey@jefsey.com>
> > To: "Hallam-Baker, Phillip" <pbaker@verisign.com>
> > Cc: <ietf@ietf.org>
> > Sent: Tuesday, May 10, 2005 5:29 PM
> > Subject: RE: improving WG operation
>...
> > They do not not only delete. I suggest you just come to the WG-ltru where
> > they have decided to document RFC 2277 charsets into RFC 3066 langtags. So
> > you can enjoy charset conflicts, something you never though about, I
> > presume. You cannot stop progress.
>...
>
>I guess Jefsey is upset because the WG rejected his proposal
>to expand our scope to include charsets.  The ltru WG is most
>emphatically *not* confusing charsets with language tags.

I am not upset :-). On the contrary, I find it extremely interesting that some 
people were able to rename charsets "scripts" in order to insert charsets 
into language descriptions while claiming they don't (cf. above). Obviously 
they are unhappy when I expose the trick. Anyway, the result is great fun: 
people will be prevented from accessing a page they know how to read if they 
do not know the language.


This cacologic, however, might be a good way to solve the IDN homograph issue 
and the phishing problem.

If we revert from those famous "scripts" to what they are, i.e. Unicode 
partitions, hence stable and well-documented charsets 
(http://www.unicode.org/Public/4.1.0/ucd/Scripts.txt), browsers can use them 
to expose the homographs in IDNs that do not belong to the page charset, and 
kill the risk of phishing.

This only calls for the browser to extract the charset, I mean the script 
name, from the langtag, fetch this file, read the list of code points in the 
charset/associated with the script, and display the URL accordingly, 
indicating the characters which are not part of the script/charset. This 
relieves the ccTLD/TLD Manager of responsibilities he cannot fulfil at the 
third level and below.
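
The steps above can be sketched roughly as follows, in Python. This is only an illustration under assumptions: the function names (parse_scripts, flag_foreign, mark) are hypothetical, the embedded table is a tiny hand-copied excerpt in the Scripts.txt line format rather than the full file, and treating "Common" characters (digits, hyphen) as always acceptable is my own choice, not something any browser is known to do.

```python
import re

# Illustrative excerpt in the Scripts.txt line format (NOT the full file).
SCRIPTS_EXCERPT = """\
002D          ; Common # Pd       HYPHEN-MINUS
0030..0039    ; Common # Nd  [10] DIGIT ZERO..DIGIT NINE
0041..005A    ; Latin # L&  [26] LATIN CAPITAL LETTER A..LATIN CAPITAL LETTER Z
0061..007A    ; Latin # L&  [26] LATIN SMALL LETTER A..LATIN SMALL LETTER Z
0400..0481    ; Cyrillic # L& [130] CYRILLIC CAPITAL LETTER IE WITH GRAVE..CYRILLIC SMALL LETTER KOPPA
"""

# "XXXX ; Script" or "XXXX..YYYY ; Script", followed by an optional comment.
LINE_RE = re.compile(r"^([0-9A-F]{4,6})(?:\.\.([0-9A-F]{4,6}))?\s*;\s*(\w+)")


def parse_scripts(text):
    """Return a list of (first_code_point, last_code_point, script_name)."""
    ranges = []
    for line in text.splitlines():
        m = LINE_RE.match(line)
        if m:  # comment and blank lines simply do not match
            start = int(m.group(1), 16)
            end = int(m.group(2) or m.group(1), 16)
            ranges.append((start, end, m.group(3)))
    return ranges


def script_of(ch, ranges):
    """Look up the script of one character; 'Unknown' if not in the table."""
    cp = ord(ch)
    for start, end, name in ranges:
        if start <= cp <= end:
            return name
    return "Unknown"


def flag_foreign(label, expected, ranges):
    """List (position, character, script) for characters outside `expected`."""
    flagged = []
    for i, ch in enumerate(label):
        name = script_of(ch, ranges)
        if name not in (expected, "Common"):  # assumption: Common is harmless
            flagged.append((i, ch, name))
    return flagged


def mark(label, expected, ranges):
    """Render the label with foreign characters wrapped in brackets."""
    bad = {i for i, _, _ in flag_foreign(label, expected, ranges)}
    return "".join(f"[{ch}]" if i in bad else ch for i, ch in enumerate(label))


if __name__ == "__main__":
    table = parse_scripts(SCRIPTS_EXCERPT)
    # Second letter is CYRILLIC SMALL LETTER A (U+0430), not Latin "a".
    print(mark("p\u0430ypal", "Latin", table))
```

A linear scan over the ranges is enough for a sketch; with the full file one would presumably sort the ranges and use a binary search.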

There are however still (minor) points to address:
- there are some minor disparities between the "script" name in the 
langtag and the script name in the Scripts.txt file; these should be reduced 
over time. I suppose that if this is a major issue, there will be help.
- the Scripts.txt file is currently hosted on the Unicode site. Even with 
caching (92 K), it will be fetched every time people start their 
browser. This may therefore represent several billion accesses a day.
- the WG-ltru only really wants to address XML issues related to old XML 
libraries. Some coordination with other WGs or interests could be fruitful. 
They plan to extend the language tags registry to scripts and to register 
them. I suppose other WGs could benefit from this (all those involved in one 
way or another with internationalisation and languages).
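
To illustrate the first point (the name disparity), here is a small Python sketch. It assumes the script appears in the langtag as a four-letter subtag (e.g. "Latn"), as in the WG-ltru drafts, while Scripts.txt uses long names (e.g. "Latin"). The function names and the mapping table are hypothetical and deliberately incomplete; a real table would have to come from the registry.

```python
# Hypothetical, incomplete mapping: langtag script subtag -> Scripts.txt name.
# Note that two subtags (Hans, Hant) share one Scripts.txt name (Han).
SUBTAG_TO_SCRIPT = {
    "Latn": "Latin",
    "Cyrl": "Cyrillic",
    "Grek": "Greek",
    "Arab": "Arabic",
    "Hans": "Han",
    "Hant": "Han",
}


def script_subtag(langtag):
    """Return the four-letter script subtag of a langtag, or None.

    Heuristic: the first four-letter alphabetic subtag after the primary
    language subtag, stopping at any single-character subtag (extension
    or private use).
    """
    for i, sub in enumerate(langtag.split("-")):
        if len(sub) == 1:
            break
        if i > 0 and len(sub) == 4 and sub.isascii() and sub.isalpha():
            return sub.title()  # normalise case: "latn" -> "Latn"
    return None


def scripts_txt_name(langtag):
    """Map a langtag to the script name used in Scripts.txt, or None."""
    return SUBTAG_TO_SCRIPT.get(script_subtag(langtag))


if __name__ == "__main__":
    for tag in ("sr-Latn-CS", "zh-hant-TW", "en-US"):
        print(tag, "->", scripts_txt_name(tag))
```

A tag without a script subtag (such as "en-US") gives None here, so a browser following this approach would still need some default script per language.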

jfc