Re: [idn] process

"Adam M. Costello" <idn.amc+0@nicemice.net.RemoveThisWord.cnri.reston.va.us> Sat, 26 February 2005 08:23 UTC

Date: Sat, 26 Feb 2005 08:19:13 +0000
From: "Adam M. Costello" <idn.amc+0@nicemice.net.RemoveThisWord.cnri.reston.va.us>
To: idn@ops.ietf.org
Subject: Re: [idn] process
Message-ID: <20050226081913.GD14956~@nicemice.net>
Reply-To: IETF idn working group <idn@ops.ietf.org>
References: <D872CCF059514053ECF8A198@scan.jck.com> <421D8411.9030006@vanderpoel.org> <p06210208be4390618c81@[192.168.0.101]> <421E0D0C.2000309@vanderpoel.org> <p06210202be43c3888991@[192.168.0.101]> <E07CE813AD23B2D95DA0C740@scan.jck.com> <421E30F2.1040408@vanderpoel.org> <0E7F74C71945B923C52211F3@scan.jck.com> <421EA0C9.1010500@vanderpoel.org> <00a401c51af3$7863aae0$030aa8c0@DEWELL>
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Disposition: inline
In-Reply-To: <421FCBD7.8000805@vanderpoel.org> <421FA55B.9000308@vanderpoel.org> <A574CA1BE87BFDA3C2A1AC0E@scan.jck.com> <00a401c51af3$7863aae0$030aa8c0@DEWELL>

Doug Ewell <dewell@adelphia.net> wrote:

> Is it really possible that we spent a year and a half, two years on
> putting together an IDN architecture, and during all that time nobody
> ever gave the slightest thought to the possibility of someone using
> IDNs for spoofing purposes,

No, it was thought about, and it was decided that the IDNA protocol was
not the place to address those issues; that they should be addressed in
registries and user interfaces.

IDNA could have addressed the easier portion of the problem by prohibiting
punctuation and symbols (for a while I was arguing for that), but
it still would have left the harder part of the problem (dealing with
script mixtures and homographs among letters) for the registries and
user interfaces to deal with, so why not let them deal with the easier
part too?

(Of course, one could then ask why that argument doesn't apply to all
the invisible characters that IDNA does prohibit.  I have no good answer
at the moment.  Maybe invisibility was the only disqualifying attribute
that everyone could agree on.)

John C Klensin <klensin@jck.com> wrote:

> I hope that those who wrote the IDNA specs will agree with the
> statement of those principles I'm about to make, or at least that they
> are close... they may not.
>
> (1) To the extent possible, we should accommodate all Unicode
> characters, excluding as little as possible.

That (or something very similar) was a principle that went into the
IDNA spec.  I personally was inclined to define both internationalized
domain names and internationalized host names, where the former would
be completely general (allowing *all* Unicode characters, even the
invisible ones), and the latter would be much narrower (excluding most
punctuation and symbols).  This would be an analogy to traditional
domain names (which allow all ASCII characters, even control characters)
and traditional host names (which allow only the ASCII letters, digits,
and one punctuation mark, the hyphen-minus).
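
To make the traditional distinction concrete, here is a minimal sketch of the two label checks. The function names are made up for illustration, and the leading/trailing hyphen restriction and the 63-character limit are the usual host name and DNS label rules, not something stated in this message.

```python
import re

# Traditional host name label: ASCII letters, digits, and hyphen-minus (LDH).
# Assumption: no leading or trailing hyphen, and at most 63 characters.
LDH_LABEL = re.compile(r"[A-Za-z0-9](?:[A-Za-z0-9-]{0,61}[A-Za-z0-9])?")


def is_host_label(label: str) -> bool:
    """True if the label satisfies the narrow host name (LDH) rule."""
    return LDH_LABEL.fullmatch(label) is not None


def is_domain_label(label: str) -> bool:
    """True if the label satisfies the general domain name rule as described
    above: any ASCII characters, control characters included."""
    return 1 <= len(label) <= 63 and all(ord(ch) < 128 for ch in label)


if __name__ == "__main__":
    for candidate in ["example", "ex_ample", "-example", "bell\x07"]:
        print(repr(candidate), is_host_label(candidate), is_domain_label(candidate))
```

The IDN analogue I had in mind would have had the same shape: one very permissive check and one much narrower check.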

On the other hand, there was an argument that the traditional
distinction between domain names and host names was the source of
endless confusion and debate, and was a mistake that should not be
repeated with IDNs.  I have some sympathy for that argument.

In any case, we ended up with just one set of non-ASCII characters for
IDNs, between the two extremes: only invisible characters are excluded.
(I think there's one exception--a visible space character that is also
excluded).

> (2) When code points had been identified by UTC as the same as, or
> equivalent to, others, we tended to map them together, rather than
> picking one and prohibiting the others.

This was more than a tendency; it was strictly followed.

> This has caused more problems than most of us expected, with people
> being surprised when they register or query using one character and
> the result that comes back uses another.

I think this happens only for the case-folding mappings.  The
normalization mappings should not surprise anyone.

> It also creates a near-homograph problem that we haven't "discovered"
> in the last couple of weeks: If we have character X mapping to
> character Y, but X looks vaguely like Z, then there may be no Y-Z
> homograph, but there may be an X-Z one.

True.  And again, I think it's just the case-folding mappings that do
this, not the normalization mappings.
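
A small sketch of the X, Y, Z situation, using Python's built-in `str.casefold()` and NFKC normalization as stand-ins for the Nameprep mapping steps. That substitution is an assumption made for illustration: the real Nameprep tables are tied to a specific Unicode version and differ in detail. The choice of Greek capital beta as X is also just an example.

```python
import unicodedata


def fold_only(label: str) -> str:
    """Stand-in for the case-folding part of the mapping (assumption)."""
    return label.casefold()


def normalize_only(label: str) -> str:
    """Stand-in for the normalization part of the mapping (NFKC)."""
    return unicodedata.normalize("NFKC", label)


X = "\u0392"       # GREEK CAPITAL LETTER BETA, looks much like Latin "B"
Y = fold_only(X)   # GREEK SMALL LETTER BETA, does not look like "B"
Z = "B"            # LATIN CAPITAL LETTER B

if __name__ == "__main__":
    print("X:", unicodedata.name(X))
    print("Y:", unicodedata.name(Y))
    print("Z:", unicodedata.name(Z))
    # Normalization alone leaves X untouched, so the X-to-Y surprise
    # here comes from case folding, not from normalization.
    print("normalization changes X:", normalize_only(X) != X)
```

What a user types (X) resembles Z, while what is registered and returned (Y) does not, so a check that compares only Y against Z would miss the resemblance.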

> Curiously, if we followed existing precedents, we could even move
> IDNA from Proposed to Draft and change the tables to eliminate many
> mappings and characters: no change to the algorithm, just elimination
> of some features that didn't work in practice.

If we want to place further restrictions on the set of characters used
in IDNs, I think it would be pretty rude of us to simply add them to
the set of prohibited characters in Nameprep.  What about the guy who
registered <not_equal>.com?  What if people had already bookmarked that
site, and created links to it?  Are we just going to break those links?

A less rude approach would be to recommend that domain labels containing
certain characters not be displayed.  Their ACE forms could still be
displayed, and they could still be looked up.  The domain holder in this
example could register a new displayable domain name, and could put an
HTTP redirector at the old site, and existing bookmarks and links would
continue to work.
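
As a rough sketch of what such a display recommendation could look like in a user interface, the function below decodes an ACE label for display only when it contains none of the discouraged characters, and otherwise shows the ACE form unchanged. The name `display_form` and the one-element `DO_NOT_DISPLAY` set are hypothetical, and the sketch uses Python's `punycode` codec directly rather than a full ToUnicode (no Nameprep, no round-trip check).

```python
ACE_PREFIX = "xn--"

# Hypothetical "do not display" set; U+2260 NOT EQUAL TO is only an example.
DO_NOT_DISPLAY = frozenset({"\u2260"})


def display_form(ace_label: str) -> str:
    """Return the Unicode form for display, or the ACE form unchanged if the
    label cannot be decoded or contains a discouraged character."""
    if not ace_label.lower().startswith(ACE_PREFIX):
        return ace_label
    try:
        decoded = ace_label[len(ACE_PREFIX):].encode("ascii").decode("punycode")
    except UnicodeError:
        return ace_label  # undecodable: show the literal ASCII string
    if any(ch in DO_NOT_DISPLAY for ch in decoded):
        return ace_label  # still resolvable, just not shown in Unicode
    return decoded


if __name__ == "__main__":
    discouraged = ACE_PREFIX + "a\u2260b".encode("punycode").decode("ascii")
    print(display_form(discouraged))
    print(display_form("xn--bcher-kva"))
```

Nothing about lookup changes in this approach; only the choice of what to show the user does.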

Erik van der Poel <erik@vanderpoel.org> wrote:

> I believe it would be difficult to reach consensus on a relatively
> narrow extension of the LDH rule.  Just for starters, the hyphen used
> to separate names and other strings in the Western world is not used
> in Japan for Katakana, because Katakana uses a middle dot (U+30FB) to
> separate 2 Katakana strings.  In fact, this character is allowed in
> .jp.

But notice how seldom the hyphen-minus is actually used in domain
names.  People prefer to just run words together, even in languages that
customarily use word breaks.  Maybe the analogous characters in other
scripts (like the katakana middle dot) would likewise be very seldom
used in practice (especially in Japan where the lack of word breaks is
the norm), and would not be missed if they were deprecated.

> It may be possible to "tune" the tables, but nowhere in your email do
> I find any reference to the ACE prefix.  I think that we should also
> figure out exactly which types of changes would absolutely require a
> new ACE prefix,

Coming up with the necessary and sufficient conditions will be tricky,
but now that you've got me thinking about it, I think I can supply
one sufficient condition:  If the only changes you make are to add
characters to the prohibited table, I don't think you need to change the
ACE prefix.  This would cause some valid IDN labels under the old spec
to become invalid under the new spec, and would cause some valid ACE
labels under the old spec to become bogo-ACE labels under the new spec.
(The bogo-ACE phenomenon already exists: there are labels that begin
with the ACE prefix but don't validate during ToUnicode and therefore
display as literal ASCII strings.)  It would not cause anything to
encode or decode to something different than it used to.
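
The claim can be sketched in code. The following is a deliberately simplified model, not the real ToASCII/ToUnicode: it skips Nameprep mapping, normalization, and length checks, and the function names and the example "new" prohibited set are assumptions for illustration. It uses Python's built-in `punycode` codec for the encoding step.

```python
ACE_PREFIX = "xn--"


def to_ascii(label: str, prohibited: frozenset) -> str:
    """Simplified ToASCII: reject prohibited code points, then encode."""
    if label.isascii():
        return label
    if any(ch in prohibited for ch in label):
        raise ValueError("label contains a prohibited character")
    return ACE_PREFIX + label.encode("punycode").decode("ascii")


def to_unicode(label: str, prohibited: frozenset) -> str:
    """Simplified ToUnicode: on any failure, return the input unchanged,
    so a bogo-ACE label is displayed as a literal ASCII string."""
    if not label.lower().startswith(ACE_PREFIX):
        return label
    try:
        decoded = label[len(ACE_PREFIX):].encode("ascii").decode("punycode")
        # The decoded form must re-encode to the same ACE label.
        if to_ascii(decoded, prohibited).lower() != label.lower():
            return label
    except ValueError:  # UnicodeError is a subclass of ValueError
        return label
    return decoded


OLD_PROHIBITED = frozenset()            # stand-in for the old table
NEW_PROHIBITED = frozenset({"\u2260"})  # hypothetical addition: NOT EQUAL TO

if __name__ == "__main__":
    ace = to_ascii("a\u2260b", OLD_PROHIBITED)
    print(to_unicode(ace, OLD_PROHIBITED))  # decodes under the old table
    print(to_unicode(ace, NEW_PROHIBITED))  # bogo-ACE under the new table
```

In this model, a label that is valid under both tables encodes and decodes identically under both, and a label that becomes invalid simply stops decoding. No ACE string ever maps to a different Unicode string than before, which is why the prefix would not need to change.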

But I don't advocate making such a change (see my argument above about
rudeness).

AMC