Re: [idn] Re: stability

"Martin v. Löwis" <martin@v.loewis.de> Tue, 15 March 2005 20:36 UTC

Received: from psg.com (mailnull@psg.com [147.28.0.62]) by ietf.org (8.9.1a/8.9.1a) with ESMTP id PAA16470 for <idn-archive@lists.ietf.org>; Tue, 15 Mar 2005 15:36:54 -0500 (EST)
Received: from majordom by psg.com with local (Exim 4.44 (FreeBSD)) id 1DBIcs-0000ap-T3 for idn-data@psg.com; Tue, 15 Mar 2005 20:26:50 +0000
Received: from [80.67.18.15] (helo=smtprelay03.ispgateway.de) by psg.com with esmtp (Exim 4.44 (FreeBSD)) id 1DBIcr-0000aD-0u for idn@ops.ietf.org; Tue, 15 Mar 2005 20:26:49 +0000
Received: (qmail 20883 invoked from network); 15 Mar 2005 20:26:51 -0000
Received: from unknown (HELO [80.185.154.213]) (544451@[80.185.154.213]) (envelope-sender <martin@v.loewis.de>) by smtprelay03.ispgateway.de (qmail-ldap-1.03) with AES256-SHA encrypted SMTP for <erik@vanderpoel.org>; 15 Mar 2005 20:26:51 -0000
Message-ID: <4237450A.9010901@v.loewis.de>
Date: Tue, 15 Mar 2005 21:26:50 +0100
From: "Martin v. Löwis" <martin@v.loewis.de>
User-Agent: Debian Thunderbird 1.0 (X11/20050116)
X-Accept-Language: en-us, en
MIME-Version: 1.0
To: Erik van der Poel <erik@vanderpoel.org>
CC: Simon Josefsson <jas@extundo.com>, Mark Davis <mark.davis@jtcsv.com>, idn@ops.ietf.org
Subject: Re: [idn] Re: stability
References: <421B8484.3070802@vanderpoel.org> <20050223072837.GA21463~@nicemice.net> <D872CCF059514053ECF8A198@scan.jck.com> <421D8411.9030006@vanderpoel.org> <p06210208be4390618c81@[192.168.0.101]> <421E0D0C.2000309@vanderpoel.org> <p06210202be43c3888991@[192.168.0.101]> <E07CE813AD23B2D95DA0C740@scan.jck.com> <421E30F2.1040408@vanderpoel.org> <0E7F74C71945B923C52211F3@scan.jck.com> <421EA0C9.1010500@vanderpoel.org> <00a401c51af3$7863aae0$030aa8c0@DEWELL> <A574CA1BE87BFDA3C2A1AC0E@scan.jck.com> <42322CE2.4040509@vanderpoel.org> <4232B2FD.1080104@vanderpoel.org> <4232BA56.5090001@vanderpoel.org> <iluk6odazwb.fsf@latte.josefsson.org> <00e801c528a8$99ad37d0$72703009@sanjose.ibm.com> <ilull8qb5n5.fsf@latte.josefsson.org> <42367B63.6080300@vanderpoel.org>
In-Reply-To: <42367B63.6080300@vanderpoel.org>
X-Enigmail-Version: 0.90.0.0
X-Enigmail-Supports: pgp-inline, pgp-mime
Content-Type: text/plain; charset="ISO-8859-1"; format="flowed"
Content-Transfer-Encoding: 7bit
X-Spam-Checker-Version: SpamAssassin 3.0.1 (2004-10-22) on psg.com
X-Spam-Status: No, score=-2.6 required=5.0 tests=AWL,BAYES_00 autolearn=ham version=3.0.1
Sender: owner-idn@ops.ietf.org
Precedence: bulk

Erik van der Poel wrote:
> I read UAX #15 and PRI #29. It's quite unfortunate that such a mistake 
> was made in the spec, and that several implementations have implemented 
> that mistake so faithfully.

It's also quite understandable. It is not at all obvious that the
correction is necessary; even now that I have read it, and even though
I have implemented the algorithm myself (for Python), I found it very
difficult to understand the issue. Here is the problem:

In NFD, combining characters are sorted according to their combining
class, in increasing order. So you always have

starter small_combiner_A large_combiner_B next_starter ...
(with A <= B)
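
As a minimal sketch of this reordering, the following uses Python's
standard unicodedata module. The sample characters are chosen purely for
illustration and do not come from the original discussion.

```python
import unicodedata

# Illustrative input: "a" + COMBINING ACUTE ACCENT (class 230)
# + COMBINING DOT BELOW (class 220), deliberately in the "wrong" order.
text = "a\u0301\u0323"

# NFD sorts the combining marks by increasing combining class.
nfd = unicodedata.normalize("NFD", text)
classes = [unicodedata.combining(ch) for ch in nfd]

print([f"U+{ord(ch):04X}" for ch in nfd])  # ['U+0061', 'U+0323', 'U+0301']
print(classes)                             # [0, 220, 230]
```

The starter has class 0, and the marks that follow it appear in
non-decreasing class order, which is the "A <= B" condition above.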

The old text says that a combiner is blocked if it has the same
combining class, so

starter combiner_A other_combiner_B (with A==B; if starter
cannot be combined with combiner_A, then combiner_A blocks
combiner_B)

Now, the correction says that you should consider also the case

starter combiner_A combiner_B; with A > B ?!

How can that be? NFD should have sorted them so that combiner_B
comes *before* combiner_A, so it would not be blocked. Think about it.

The answer is this: This is *only* possible if combiner_B is a
starter, i.e. B==0. But if so, why could you possibly combine it
with the starter? Can you ever combine two starters? Think about it.

The answer is yes: for Hangul Jamo. They all have combining class 0,
yet they can be combined. There are also a few other characters which
have combining class 0 and still can be combined. However, it is not
at all obvious.
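
A short sketch of the Hangul case, again with Python's unicodedata
module. The two Jamo are picked as an illustration; the variable names
are hypothetical.

```python
import unicodedata

# Two conjoining Jamo (illustrative choice):
leading_consonant = "\u1100"  # HANGUL CHOSEONG KIYEOK
vowel = "\u1161"              # HANGUL JUNGSEONG A

# Both have combining class 0, i.e. both are starters ...
jamo_classes = [unicodedata.combining(ch) for ch in (leading_consonant, vowel)]

# ... yet NFC composes them into a single precomposed syllable.
syllable = unicodedata.normalize("NFC", leading_consonant + vowel)

print(jamo_classes)               # [0, 0]
print(f"U+{ord(syllable):04X}")   # U+AC00
```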

For the specific case of Python, it turns out that I special-cased
Hangul composition, so it won't apply the standard algorithm (of
looking for blockers); this means that all the Hangul examples in
PRI #29 apparently work "correctly" with Python. However, for the
non-Hangul cases, it is possible to produce the "bad" behaviour with
Python 2.4.
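
The difference between the two readings can be sketched as a small
predicate. This is a paraphrase of the old and corrected blocking rules
as I understand them, not a normalization implementation; the function
name is_blocked and its corrected flag are hypothetical, and the sample
sequence (Oriya vowel sign E, combining grave accent, Oriya vowel sign
AA) is only an illustration of a non-Hangul case.

```python
import unicodedata

def is_blocked(seq, c_index, corrected):
    """Return True if seq[c_index] is blocked from the starter seq[0].

    Old rule (paraphrased): some character B in between is a starter
    or has the SAME combining class as C.
    Corrected rule (paraphrased): B is a starter or has the same OR
    HIGHER combining class than C.
    """
    ccc_c = unicodedata.combining(seq[c_index])
    for b in seq[1:c_index]:
        ccc_b = unicodedata.combining(b)
        if ccc_b == 0:
            return True
        if corrected and ccc_b >= ccc_c:
            return True
        if not corrected and ccc_b == ccc_c:
            return True
    return False

# U+0B47 and U+0B3E both have class 0 and compose to U+0B4B when adjacent;
# here a class-230 mark (U+0300) sits between them.
seq = "\u0b47\u0300\u0b3e"
old_reading = is_blocked(seq, 2, corrected=False)
new_reading = is_blocked(seq, 2, corrected=True)

print(old_reading)  # False: under the old text, composition is allowed
print(new_reading)  # True: under the correction, composition is blocked
```

Under the old reading the last character is not blocked, so a literal
implementation would compose across the intervening accent; under the
corrected reading the sequence is left alone.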


> I feel that we are still at the very beginning of the adoption of the 
> particular Unicodes affected by this mistake. Most of them are for South 
> Asian languages. Hangul is much further along, but not the particular 
> Unicodes that are affected here (i.e. the Jamo).

It's not that easy. When you use the old algorithm, you get normal
Hangul syllables, which would be allowed in IDNA. It's only that the
sequence *before* the normalization should not be allowed.

> More importantly, this 
> mistake only affects highly unusual, malformed data. I think that if 
> IDNA decides not to follow Unicode's recommendation now or in the next 
> couple of years, 10 or 20 years from now we would look back in time and 
> regret this decision. 

I don't think so. "We" could still change the decision in 20 years, and
not a single registration would be affected. The sequences causing the
behaviour change are *really* unusual - I don't know if software can
visually render them in a meaningful way, and I guess a native speaker
would just consider them mojibake. So it is unlikely that anybody would
try to use them as input to IDNA in the next 20 years in a reasonable
application.

> It is interesting that, in this case, Unicode seems to have implemented 
> first and written the spec later, which is the way the IETF is supposed 
> to do things too. It's just unfortunate that the Unicode spec was 
> transcribed incorrectly from the implementation(s). On the other hand, 
> IDNA seems to have done it in the opposite order. First, the spec was 
> written, and now that we have deployed some implementations, we are 
> finding serious problems with punctuation marks and symbols.

That's why IDNA is still a Proposed Standard Protocol (not even
a Draft Standard Protocol); see STD 1. It will advance to Draft
Standard if two independent and interoperable implementations
from different code bases have been developed, and sufficient
successful operational experience has been gained; see BCP 9.

It is also *not* the case that it was specified first and only
implemented afterwards. All along the process, people have been
implementing bits and pieces of it, test beds have been run, and so on.
You might not have been around, but some people still remember.

Regards,
Martin