Re: [idn] process

John C Klensin <klensin@jck.com> Fri, 25 February 2005 17:34 UTC

Received: from psg.com (mailnull@psg.com [147.28.0.62]) by ietf.org (8.9.1a/8.9.1a) with ESMTP id MAA11107 for <idn-archive@lists.ietf.org>; Fri, 25 Feb 2005 12:34:09 -0500 (EST)
Received: from majordom by psg.com with local (Exim 4.44 (FreeBSD)) id 1D4jI7-0002pf-5Y for idn-data@psg.com; Fri, 25 Feb 2005 17:30:15 +0000
Received: from [209.187.148.211] (helo=bs.jck.com) by psg.com with esmtp (Exim 4.44 (FreeBSD)) id 1D4jI4-0002pM-4c for idn@ops.ietf.org; Fri, 25 Feb 2005 17:30:12 +0000
Received: from [209.187.148.215] (helo=scan.jck.com) by bs.jck.com with esmtp (Exim 4.34) id 1D4jI2-0002sx-Us; Fri, 25 Feb 2005 12:30:11 -0500
Date: Fri, 25 Feb 2005 12:29:58 -0500
From: John C Klensin <klensin@jck.com>
To: Doug Ewell <dewell@adelphia.net>, idn@ops.ietf.org
cc: Erik van der Poel <erik@vanderpoel.org>
Subject: Re: [idn] process
Message-ID: <A574CA1BE87BFDA3C2A1AC0E@scan.jck.com>
In-Reply-To: <00a401c51af3$7863aae0$030aa8c0@DEWELL>
References: <421B8484.3070802@vanderpoel.org> <20050223072837.GA21463~@nicemice.net> <D872CCF059514053ECF8A198@scan.jck.com> <421D8411.9030006@vanderpoel.org> <p06210208be4390618c81@[192.168.0.101]> <421E0D0C.2000309@vanderpoel.org> <p06210202be43c3888991@[192.168.0.101]> <E07CE813AD23B2D95DA0C740@scan.jck.com> <421E30F2.1040408@vanderpoel.org> <0E7F74C71945B923C52211F3@scan.jck.com> <421EA0C9.1010500@vanderpoel.org> <00a401c51af3$7863aae0$030aa8c0@DEWELL>
X-Mailer: Mulberry/3.1.6 (Win32)
MIME-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit
Content-Disposition: inline
X-Spam-Checker-Version: SpamAssassin 3.0.1 (2004-10-22) on psg.com
X-Spam-Status: No, score=-2.6 required=5.0 tests=AWL,BAYES_00 autolearn=ham version=3.0.1
Sender: owner-idn@ops.ietf.org
Precedence: bulk


--On Thursday, 24 February, 2005 20:35 -0800 Doug Ewell
<dewell@adelphia.net> wrote:

> Erik van der Poel <erik at vanderpoel dot org> wrote:
> 
>> 1. Is this the right time to start working on Internet Drafts
>> leading up to new version(s) of the IDNA RFC(s)? If not, when?
>> ...
> 
> I don't know about anyone else, but something seems badly
> wrong here.
> 
> Is it really possible that we spent a year and a half, two
> years on putting together an IDN architecture, and during all
> that time nobody ever gave the slightest thought to the
> possibility of someone using IDNs for spoofing purposes, and
> now that one or two well-publicized spoofing examples have
> appeared, we are ready to start all over again with a new and
> probably incompatible version of the architecture?

I certainly hope not.  And certainly we knew about these issues.
Even the potential problem with symbols and box-drawing
characters was identified, although not in the lurid detail of
some of the recent examples.  There was discussion around the WG
about what to do about those issues, and how completely to
describe them.  I think the consensus at that time was to not
write a lot of these issues up in detail for fear of
discouraging IDNA implementations.  That consensus was, IMO,
reached in a WG in which many of the participants were, for
various reasons, just anxious to get finished and not paying
much attention to the finer details.

While I advocated at least one radically different architecture
while the IDN WG work was going on, I think, personally, that
looking at a new and incompatible architecture would be pretty
close to insane.  As I tried to explain in my remarks on Erik's
proposal that we reverse the presentation order of domain names,
I just don't think it is possible to go there.  And, even if we
wanted to, there is no reason to believe that any other
architecture would work better: these homograph problems are the
inevitable consequence of the relationships among the scripts
themselves: it is unlikely that even dumping Unicode and
switching to something else would help very much.  And there
isn't any "something else".

However, the decision to adopt two philosophical principles, and
one strong assumption, went into the design of IDNA and its
supporting tables.  The assumption has not turned out to be
completely valid and we may need to look harder at the
implications of its failure.  I suggest that either or both of
the philosophical principles could be reviewed and, if
necessary, changed in the light of experience and that neither
change would be fatal to IDNs or IDNA, or even especially
disruptive if done fairly soon.

This is _not_ a suggestion that those changes should be made,
only that it would be plausible for us to review the decisions
and reach some conclusions about whether they are still
appropriate in the light of experience.

I hope that those who wrote the IDNA specs will agree with the
statement of those principles I'm about to make, or at least
that they are close... they may not.

(1) To the extent possible, we should accommodate all Unicode
characters, excluding as little as possible.  This position was
reinforced by the view that, at the time, the Unicode
classifications of characters were considered a little soft and
a general conviction that the IETF should not be making
character-by-character decisions.   A counter-principle, now if
not then, is that we should permit a relatively narrow extension
of the "letter-digit-hyphen" rule, i.e., permitting only
letters (in any alphabet or script), perhaps local digits, and
the hyphen, but no other punctuation, symbols, drawing
characters, or other non-letter characters.  Adam has argued for
that revised principle recently; several people argued for it
when IDNA was being produced.  We could probably still impose
it, and, in any event, it would not require a change in the
basic architecture (see below).
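
To make the counter-principle concrete, here is a minimal
sketch of an extended "letter-digit-hyphen" check.  It is an
illustration only, not anything taken from the IDNA specs: the
function name is_extended_ldh and the allow_local_digits flag
are hypothetical, and it assumes that Unicode general categories
(L* for letters, Nd for decimal digits) are an acceptable
stand-in for "letter" and "digit".

```python
import unicodedata


def is_extended_ldh(label, allow_local_digits=True):
    """Return True if label holds only letters, digits, and hyphens.

    Hypothetical helper for illustration. "Letter" means any Unicode
    general category starting with "L"; "digit" means category "Nd".
    """
    if not label:
        return False
    for ch in label:
        if ch == "-":
            continue
        category = unicodedata.category(ch)
        if category.startswith("L"):
            # Letters from any alphabet or script are accepted.
            continue
        if category == "Nd" and (allow_local_digits or ch in "0123456789"):
            # Non-ASCII ("local") digits are accepted only when the flag is set.
            continue
        # Punctuation, symbols, box-drawing characters, and so on are rejected.
        # Note: combining marks (M*) are rejected too; a real rule would
        # probably need to allow them for scripts that depend on them.
        return False
    return True
```

With this sketch, a label such as "example-1" passes, while one
containing a slash or a box-drawing diagonal (U+2571) does not.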

(2) When code points had been identified by UTC as the same as,
or equivalent to, others, we tended to map them together, rather
than picking one and prohibiting the others.   This has caused
more problems than most of us expected, with people being
surprised when they register or query using one character and
the result that comes back uses another.  It also creates a
near-homograph problem that we haven't "discovered" in the last
couple of weeks: If we have character X mapping to character Y,
but X looks vaguely like Z, then there may be no Y-Z homograph,
but there may be an X-Z one.  That could make display decisions,
etc., quite critical and, unless applications got it entirely
right, we might end up with a new family of attacks.  Again,
that decision could be reviewed.  Perhaps there are groups of
characters that should be prohibited from being included in a
lookup or registration operation, not just mapped to something
more reasonable.  And, again, this would be a tuning of tables,
not a change in the basic architecture.
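
The difference between "map" and "prohibit" can be sketched as
follows.  This is a rough illustration under stated assumptions:
apply_policy is a hypothetical name, and plain NFKC normalization
is used only as a stand-in for the mapping step (the real
Nameprep profile also does case folding and has its own tables).

```python
import unicodedata


def apply_policy(label, policy="map"):
    """Handle a label under a "map" or a "prohibit" policy.

    Hypothetical helper. NFKC stands in for the IDNA mapping step.
    """
    mapped = unicodedata.normalize("NFKC", label)
    if policy == "map":
        # The caller may get back different characters than were typed.
        return mapped
    if policy == "prohibit":
        # Refuse any label that the mapping step would have altered.
        if mapped != label:
            raise ValueError("label contains characters that would be mapped")
        return label
    raise ValueError("unknown policy: " + repr(policy))
```

Under "map", a label beginning with FULLWIDTH LATIN SMALL LETTER
A (U+FF41) comes back beginning with ASCII "a", which is the
kind of surprise described above.  Under "prohibit", the same
label is rejected, so character X never reaches the display
stage where it might be confused with Z.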

The assumption I referred to above was that ICANN would take a
strong role in determining which characters were really
appropriate for registration and under what circumstances, that
they would institute and enforce appropriate rules, and that
everyone relevant would pay attention to whatever they said.
Every element of that assumption has turned out to be false:
they haven't taken that role; their guidelines are weak,
ambiguous, and at least partially wrong; and some registries
have just ignored the rules that do exist without any penalty.
If there is a problem, either we are going to need to solve it,
or we are going to risk different solutions in different
applications that, taken together, compromise interoperability.

> Is this sending the kind of "stability" message that was
> considered so important two or three years ago?

It is sending the "get it right and get it interoperable"
message that is supposed to dominate IETF decision-making,
especially with Proposed Standards.

> Is there even enough solid information to begin writing
> anything, or just a general feeling that Something Needs To Be
> Done?

I think it is time for us to ask the questions that are
suggested above, and to ask them explicitly.   If doing so
produces the answer that it is time to make changes --table
changes, not architectural changes-- I think we should do so.
Perhaps we could combine that table review process with an
upgrade to Unicode 4.x, which would accommodate several scripts
we can't handle today.

Could this be done compatibly?  Not quite.  For starters, we
would have to address more squarely the question that the first
principle identified above bypassed: does someone have the
_right_ to register a particular sequence of Unicode characters?
If the answer is that, because I can draw out a symbol that
represents my business, or my religion, or my location, then I
have the "right" to register it, then we are in trouble: someone
out there will organize the Church of the Holy Right-Slash and
prohibiting it will discriminate against that religion,
especially if left-slashes and vertical bars are permitted.  If
we can get past "right to register", we need to look at the
experience of the browser implementers who have already
concluded that, registered or not, they really don't want to
recognize or process domain names containing such characters.
And then we need to present the transition problem of
eliminating any such domains that may exist to ICANN and say
"you were unable or ineffective at preventing these problems
from occurring, so, as a prize, you get to figure out how to
retire those names that are now prohibited by the updated
standard".

Curiously, if we followed existing precedents, we could even
move IDNA from Proposed to Draft and change the tables to
eliminate many mappings and characters: no change to the
algorithm, just elimination of some features that didn't work in
practice.  That is not a proposal, just an observation :-)

     john