[idn] punctuation

Erik van der Poel <erik@vanderpoel.org> Thu, 24 February 2005 07:48 UTC

Message-ID: <421D8411.9030006@vanderpoel.org>
Date: Wed, 23 Feb 2005 23:36:49 -0800
From: Erik van der Poel <erik@vanderpoel.org>
To: IETF idn working group <idn@ops.ietf.org>
Subject: [idn] punctuation
References: <421B8484.3070802@vanderpoel.org> <20050223072837.GA21463~@nicemice.net> <D872CCF059514053ECF8A198@scan.jck.com>
In-Reply-To: <D872CCF059514053ECF8A198@scan.jck.com>

Adam M. Costello wrote:
 > I imagine you'd want all the characters that could immediately follow
 > the host name in a URI, so add "?" and "#" to that list.
 >
 > But how well do average users know URI syntax anyway?  What would they
 > think of:
 >
 > http://foo.com&bar.baz.xx
 > http://foo.com~bar.baz.xx
 > http://foo.com|bar.baz.xx
 >
 > Maybe we either need to ban all punctuation (as in my proposal about
 > internationalized host names), or always make the boundaries of the
 > domain name apparent to the user (using color or highlighting or
 > underlining or something).

I started to write down all the delimiters that could appear in DNS, 
URIs and email, and then I realized that this problem is not just about 
the homographs of the *legal* delimiters used in these contexts. No, it 
is about whatever *looks like* a legal delimiter to the average user, 
because the phishers don't have to stick to the (homographs of the) 
legal delimiters. Then I went back in the archives, and of course, Adam 
has already pointed this out.

The implications of this are actually quite profound. Since there are so 
many characters in Unicode, and since many of them are unfamiliar to 
the average user, a lot of them might look like punctuation to that user.

As Adam also points out in another email, it's too bad that domain names 
are usually displayed in "little-endian" order. If they were displayed 
in the opposite (big-endian) order, the 3rd example above would become:

http://xx.baz.com|bar.foo

Notice how the "com" and "foo" are now separated. The "real" (unspoofed) 
URI would look like this:

http://com.foo

If users were actually used to seeing it this way, they might notice the 
spoof above more easily. But they aren't used to seeing it this way, and 
it would be pretty difficult to change this convention now. It's too late.
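
A minimal sketch of the big-endian reordering described above, in Python. 
The function name to_big_endian is made up for illustration, and the 
sketch just splits on "." with no validation, so a spoofing character 
like "|" stays inside whatever label it was in.

```python
def to_big_endian(host: str) -> str:
    """Reverse the order of the dot-separated labels in a host name.

    Illustrative only: no validation or IDN processing is done.
    """
    labels = host.split(".")
    return ".".join(reversed(labels))


if __name__ == "__main__":
    # The spoofed name and the real name from the examples above.
    print(to_big_endian("foo.com|bar.baz.xx"))
    print(to_big_endian("foo.com"))
```

In the reversed form the top-level label comes first, which is why the 
spoof no longer starts with the familiar name.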

Back to punctuation: Banning all punctuation would not be enough. We 
would have to ban anything that might look like punctuation to the user. 
That would mean banning a huge swath of Unicode, which is probably not 
in the best interests of various communities around the world. Besides, 
different people will have different ideas about what looks like 
punctuation. So it might be hard to decide which huge swath of Unicode 
to ban.

So maybe it's better to consider Adam's alternative idea: make the 
boundaries of the domain name apparent (using color or whatever). Over 
time, the users will get used to seeing domain names this way, and then 
they will be able to spot domain name spoofs more easily too.

But even if we were to color the whole domain name:

foo.com|bar.baz.xx

The user might still think that this site is somehow related to foo.com 
and therefore safe (as was also pointed out). So you'd have to display 
the "unusual" characters like '|' differently. Or something. Sigh. Seems 
hopeless.
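
One way to read "display the unusual characters differently" is sketched 
below in Python. Everything in it is an assumption made for the sake of 
the example: the name mark_unusual, the use of square brackets in place 
of color or highlighting, and above all the choice of "usual" characters 
(ASCII letters, digits, hyphen and dot). A real IDN display would need a 
far larger allowed set, and deciding on that set is exactly the hard part.

```python
import string

# Assumed "usual" characters: ASCII letters, digits, hyphen and the dot
# that separates labels. Anything else gets flagged.
USUAL = set(string.ascii_letters + string.digits + "-.")


def mark_unusual(host: str, open_mark: str = "[", close_mark: str = "]") -> str:
    """Wrap every character outside USUAL in visible markers."""
    return "".join(
        ch if ch in USUAL else f"{open_mark}{ch}{close_mark}"
        for ch in host
    )


if __name__ == "__main__":
    print(mark_unusual("foo.com|bar.baz.xx"))
```

Note that this does nothing for a name built only from letters, digits 
and hyphens, so it cannot catch every spoof.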

Are the phishers going to have a field day with IDN, or what?

But is this problem really limited to IDN? What about the following 
legal ASCII DNS name:

foo.com--secure-user-services-and-products.tech-mecca.biz
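
To see which part of that name actually identifies the site, here is a 
rough Python sketch that keeps only the last two labels. That rule is an 
assumption that happens to fit names under .biz or .com; it is wrong for 
suffixes such as co.uk, where a real implementation would need a list of 
registry suffixes. The function name naive_site is hypothetical.

```python
def naive_site(host: str) -> str:
    """Return the last two labels of a host name.

    Assumption: the registry suffix is a single label (e.g. "biz").
    This is NOT true in general (e.g. "co.uk").
    """
    labels = host.rstrip(".").split(".")
    return ".".join(labels[-2:])


if __name__ == "__main__":
    name = "foo.com--secure-user-services-and-products.tech-mecca.biz"
    print(naive_site(name))
```

The "foo.com" at the front is only the start of a longer label and has no 
bearing on who controls the name.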

Does this mean that we should try to switch left-to-right readers (most 
of the world) over to big-endian domain names? Please tell me I'm 
overreacting!

Erik