Re: [idn] Re: stability

"Martin v. Löwis" <martin@v.loewis.de> Tue, 15 March 2005 20:36 UTC

Message-ID: <4237450A.9010901@v.loewis.de>
Date: Tue, 15 Mar 2005 21:26:50 +0100
From: "\"Martin v. Löwis\"" <martin@v.loewis.de>
User-Agent: Debian Thunderbird 1.0 (X11/20050116)
MIME-Version: 1.0
To: Erik van der Poel <erik@vanderpoel.org>
CC: Simon Josefsson <jas@extundo.com>, Mark Davis <mark.davis@jtcsv.com>, idn@ops.ietf.org
Subject: Re: [idn] Re: stability
References: <421B8484.3070802@vanderpoel.org> <20050223072837.GA21463~@nicemice.net> <D872CCF059514053ECF8A198@scan.jck.com> <421D8411.9030006@vanderpoel.org> <p06210208be4390618c81@[192.168.0.101]> <421E0D0C.2000309@vanderpoel.org> <p06210202be43c3888991@[192.168.0.101]> <E07CE813AD23B2D95DA0C740@scan.jck.com> <421E30F2.1040408@vanderpoel.org> <0E7F74C71945B923C52211F3@scan.jck.com> <421EA0C9.1010500@vanderpoel.org> <00a401c51af3$7863aae0$030aa8c0@DEWELL> <A574CA1BE87BFDA3C2A1AC0E@scan.jck.com> <42322CE2.4040509@vanderpoel.org> <4232B2FD.1080104@vanderpoel.org> <4232BA56.5090001@vanderpoel.org> <iluk6odazwb.fsf@latte.josefsson.org> <00e801c528a8$99ad37d0$72703009@sanjose.ibm.com> <ilull8qb5n5.fsf@latte.josefsson.org> <42367B63.6080300@vanderpoel.org>
In-Reply-To: <42367B63.6080300@vanderpoel.org>
Content-Type: text/plain; charset="ISO-8859-1"; format="flowed"
Content-Transfer-Encoding: 7bit
Sender: owner-idn@ops.ietf.org
Precedence: bulk

Erik van der Poel wrote:
> I read UAX #15 and PRI #29. It's quite unfortunate that such a mistake 
> was made in the spec, and that several implementations have implemented 
> that mistake so faithfully.

It's also quite understandable. It is not at all obvious that the
correction is necessary; even now that I have read it, and even though
I have implemented the algorithm myself (for Python), I found it very
difficult to understand the issue. Here is the problem:

In NFD, combining characters are sorted according to their combining
class, in increasing order. So you always have

starter small_combiner_A large_combiner_B next_starter ...
(with A <= B)
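
A minimal sketch of this ordering, assuming Python 3 and its standard
unicodedata module (the sample string is an illustration, not taken
from the thread): the two combining marks are entered in decreasing
class order, and NFD puts them back in increasing order.

```python
import unicodedata

# Typed order: "a", COMBINING ACUTE ACCENT (class 230),
# then COMBINING DOT BELOW (class 220).
s = "a\u0301\u0323"
nfd = unicodedata.normalize("NFD", s)

# Combining class of every character in the normalized string.
classes = [unicodedata.combining(ch) for ch in nfd]
print([f"U+{ord(ch):04X}" for ch in nfd], classes)
```

After normalization the mark with the smaller class comes first, so the
classes following the starter are never decreasing.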

The old text says that a combiner is blocked if it has the same
combining class, so

starter combiner_A other_combiner_B (with A==B; if starter
cannot be combined with combiner_A, then combiner_A blocks
combiner_B)

Now, the correction says that you should consider also the case

starter combiner_A combiner_B; with A > B ?!

How can that be? NFD should have sorted them so that combiner_B
comes *before* combiner_A, so it would not be blocked. Think about it.

The answer is this: This is *only* possible if combiner_B is a
starter, i.e. B==0. But if so, why could you possibly combine it
with the starter? Can you ever combine two starters? Think about it.

The answer is yes: for Hangul Jamo. They all have combining class 0,
yet they can be combined. There are also a few other characters which
have combining class 0 and still can be combined. However, it is not
at all obvious.
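
To make this concrete, here is a small sketch, again assuming Python 3's
unicodedata module. The Bengali pair is an illustrative choice for the
"few other characters" and is not named in the original message.

```python
import unicodedata

pairs = {
    "hangul": "\u1100\u1161",   # CHOSEONG KIYEOK + JUNGSEONG A
    "bengali": "\u09c7\u09be",  # VOWEL SIGN E + VOWEL SIGN AA
}
for name, seq in pairs.items():
    # Both members of each pair are starters (combining class 0) ...
    classes = [unicodedata.combining(ch) for ch in seq]
    # ... and NFC still merges them into a single code point.
    composed = unicodedata.normalize("NFC", seq)
    print(name, classes, [f"U+{ord(ch):04X}" for ch in composed])
```

In both cases every input character has class 0, yet the pair composes
to one precomposed character.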

For the specific case of Python, it turns out that I special-cased
Hangul composition, so it won't apply the standard algorithm (of
looking for blockers); this means that all the Hangul examples in
PRI #29 apparently work "correctly" with Python. However, for the
non-Hangul cases, it is possible to produce the "bad" behaviour with
Python 2.4.
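
The two wordings of the blocking rule can be compared directly. The
sketch below is not the code of any real normalizer: blocked_old and
blocked_new are hypothetical helper names, and the definitions are a
paraphrase of the old and corrected text as described above. The two
sample sequences are believed to be among those listed in PRI #29. It
assumes Python 3's unicodedata module.

```python
import unicodedata

ccc = unicodedata.combining  # canonical combining class of one character

def blocked_old(seq, s, c):
    """Old wording: some B between S and C is a starter,
    or has the same combining class as C."""
    return any(ccc(b) == 0 or ccc(b) == ccc(seq[c]) for b in seq[s + 1:c])

def blocked_new(seq, s, c):
    """Corrected wording: some B between S and C is a starter,
    or has the same or a higher combining class than C."""
    return any(ccc(b) == 0 or ccc(b) >= ccc(seq[c]) for b in seq[s + 1:c])

examples = {
    "oriya": "\u0b47\u0300\u0b3e",   # VOWEL SIGN E, COMBINING GRAVE, VOWEL SIGN AA
    "hangul": "\u1100\u0300\u1161",  # CHOSEONG KIYEOK, COMBINING GRAVE, JUNGSEONG A
}
for name, seq in examples.items():
    # Is the third character blocked from the first one?
    nfc = unicodedata.normalize("NFC", seq)
    print(name, blocked_old(seq, 0, 2), blocked_new(seq, 0, 2), nfc == seq)
```

Under the old wording the trailing class-0 character is not blocked by
the accent in between, so it could compose with the leading starter;
under the corrected wording it is blocked. A current interpreter is
expected to follow the corrected rule and leave both sequences
unchanged under NFC, unlike the Python 2.4 behaviour described above.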

> I feel that we are still at the very beginning of the adoption of the 
> particular Unicodes affected by this mistake. Most of them are for South 
> Asian languages. Hangul is much further along, but not the particular 
> Unicodes that are affected here (i.e. the Jamo).

It's not that easy. When you use the old algorithm, you get normal
Hangul syllables, which would be allowed in IDNA. It's only that the
sequence *before* the normalization should not be allowed.

> More importantly, this 
> mistake only affects highly unusual, malformed data. I think that if 
> IDNA decides not to follow Unicode's recommendation now or in the next 
> couple of years, 10 or 20 years from now we would look back in time and 
> regret this decision. 

I don't think so. "We" could still change the decision in 20 years, and
not a single registration would be affected. The sequences causing the
behaviour change are *really* unusual - I don't know if software can
visually render them in a meaningful way, and I guess a native speaker
would just consider them mojibake. So it is unlikely that anybody would
try to use them as input to IDNA in the next 20 years in a reasonable
application.

> It is interesting that, in this case, Unicode seems to have implemented 
> first and written the spec later, which is the way the IETF is supposed 
> to do things too. It's just unfortunate that the Unicode spec was 
> transcribed incorrectly from the implementation(s). On the other hand, 
> IDNA seems to have done it in the opposite order. First, the spec was 
> written, and now that we have deployed some implementations, we are 
> finding serious problems with punctuation marks and symbols.

That's why IDNA is still a Proposed Standard Protocol (not even
a Draft Standard Protocol); see STD 1. It will advance to Draft
Standard if two independent and interoperable implementations
from different code bases have been developed, and sufficient
successful operational experience has been gained; see BCP 9.

It is also *not* the case that it was specified first and only
implemented afterwards.
All along the process, people have been implementing bits and pieces of
it, test beds have been run, and so on. You might not have been around,
but some people still remember.

Regards,
Martin
