Re: [idn] space-like unicode char

Soobok Lee <lsb@lsb.org> Fri, 08 April 2005 07:11 UTC

Received: from psg.com (mailnull@psg.com [147.28.0.62]) by ietf.org (8.9.1a/8.9.1a) with ESMTP id DAA12263 for <idn-archive@lists.ietf.org>; Fri, 8 Apr 2005 03:11:39 -0400 (EDT)
Received: from majordom by psg.com with local (Exim 4.44 (FreeBSD)) id 1DJnYH-0000Fq-Tb for idn-data@psg.com; Fri, 08 Apr 2005 07:05:13 +0000
Received: from [211.196.150.53] (helo=postel5.postel.co.kr) by psg.com with esmtp (Exim 4.44 (FreeBSD)) id 1DJnYF-0000FO-De for idn@ops.ietf.org; Fri, 08 Apr 2005 07:05:11 +0000
Received: from [10.1.1.21] ([61.73.48.22]) by postel5.postel.co.kr (8.13.0.PreAlpha4/8.13.0.PreAlpha4) with ESMTP id j38758JR024364; Fri, 8 Apr 2005 16:05:08 +0900
Message-ID: <42562D22.3090609@lsb.org>
Date: Fri, 08 Apr 2005 16:05:06 +0900
From: Soobok Lee <lsb@lsb.org>
User-Agent: Mozilla Thunderbird 1.0 (Windows/20041206)
X-Accept-Language: en-us, en
MIME-Version: 1.0
To: Erik van der Poel <erik@vanderpoel.org>
CC: idn@ops.ietf.org
Subject: Re: [idn] space-like unicode char
References: <42181FD5.3070608@lsb.org> <4255E488.8010302@vanderpoel.org>
In-Reply-To: <4255E488.8010302@vanderpoel.org>
Content-Type: text/plain; charset="EUC-KR"
Content-Transfer-Encoding: 7bit
X-Spam-Checker-Version: SpamAssassin 3.0.1 (2004-10-22) on psg.com
X-Spam-Status: No, score=-2.6 required=5.0 tests=BAYES_00 autolearn=ham version=3.0.1
Sender: owner-idn@ops.ietf.org
Precedence: bulk

Erik van der Poel wrote:

> Soobok Lee wrote:
>
>> U+1160 is a space-like char and even stringprep/nameprep does not
>> filter it out because the char is not for punctuational purpose.
>
>
> U+1160 is HANGUL JUNGSEONG FILLER and it is used to transform
> nonstandard syllables into standard ones (Unicode 3.0 section 3.11
> (RFC 3454 refers to Unicode 3.2.0)). However, this transformation is
> one of the additional transformations not considered part of Unicode
> normalization (3.2.0's UAX #15 Annex 10). 

Exactly. U+1160 is not "touched" by Unicode normalization (NFC).
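
A quick way to check this, as a minimal sketch: it assumes Python's standard `unicodedata` module, whose `ucd_3_2_0` object exposes the Unicode 3.2.0 data that RFC 3454 refers to. The helper name `unchanged_by` is made up for illustration.

```python
import unicodedata

FILLER = "\u1160"  # HANGUL JUNGSEONG FILLER

# Unicode 3.2.0 database, the version RFC 3454 is based on.
UCD = unicodedata.ucd_3_2_0


def unchanged_by(form: str, text: str) -> bool:
    """Return True if normalizing `text` to `form` leaves it as-is."""
    return UCD.normalize(form, text) == text


for form in ("NFC", "NFKC"):
    # Check the filler alone and the filler followed by a Latin letter.
    print(form, unchanged_by(form, FILLER), unchanged_by(form, FILLER + "a"))
```

NFKC is included because it is the form Stringprep actually applies; the filler passes through both forms untouched.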

> So this character is not generated by Stringprep/Nameprep. However, it
> is not prohibited either, so it may occur in the input to (and output
> from) Stringprep/Nameprep.

Yes, it may occur.
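
To illustrate, here is a small sketch using Python's standard `stringprep` module, which implements the RFC 3454 tables. The dictionary `NAMEPREP_PROHIBITED` and the helper `prohibited_tables` are names invented for this example; the list of tables is the set Nameprep (RFC 3491) prohibits.

```python
import stringprep

FILLER = "\u1160"  # HANGUL JUNGSEONG FILLER

# RFC 3454 prohibition tables used by Nameprep, keyed by table name.
NAMEPREP_PROHIBITED = {
    "C.1.2": stringprep.in_table_c12,
    "C.2.2": stringprep.in_table_c22,
    "C.3": stringprep.in_table_c3,
    "C.4": stringprep.in_table_c4,
    "C.5": stringprep.in_table_c5,
    "C.6": stringprep.in_table_c6,
    "C.7": stringprep.in_table_c7,
    "C.8": stringprep.in_table_c8,
    "C.9": stringprep.in_table_c9,
}


def prohibited_tables(ch: str) -> list:
    """Return the names of the prohibition tables that contain `ch`."""
    return [name for name, test in NAMEPREP_PROHIBITED.items() if test(ch)]


print(prohibited_tables(FILLER))       # tables that prohibit U+1160
print(prohibited_tables("\u3000"))     # IDEOGRAPHIC SPACE, for comparison
print(stringprep.in_table_b1(FILLER))  # is it "mapped to nothing" (B.1)?
```

U+1160 appears in none of these tables and is not mapped to nothing either, whereas a real space character such as U+3000 is caught by C.1.2.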

> I read some of the sections on Hangul in the Unicode book and Web
> site, but I did not see any rules regarding repeated occurrences of
> U+1160 (as you had in your example, not quoted above). I also did not
> see any rules about what to do when a filler is not followed by a
> Hangul jamo. It would be nice to have these rules in Unicode or in
> Stringprep.

The U+1160 problem was raised 3.5 years ago (you can search the huge
idn list archive for the keywords "1160" or "filler"), along with
some additional Hangul jamo problems. I submitted a draft (you may
find it at www.i-d-n.net) to filter out these invalid character
sequences, but the draft was discarded. Someone argued that such
filtering * complicates * the stringprep algorithms with
context-sensitive filtering/prohibiting, and that the problem is up to
UTC/NFC, not the IETF. Of course, I couldn't accept that.

Anyway, we can't go back to December 2002 without giving up
stringprep's backward-compatibility promise.


>
> I tried U+1160 followed by a Latin character in MSIE with i-Nav and in
> Firefox with IDN turned on, and it was displayed as a wide space. It
> is unfortunate that both implementations chose to display it as a
> space instead of deleting it.

Yes. Plugins M U S T filter U+1160 out of validated labels returned by
ToUnicode(), whether or not IDNA requires that.
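
A minimal sketch of such a display-side filter follows. It is not part of IDNA or of any existing plugin; the function names are hypothetical, and it assumes the label has already been validated and converted by ToUnicode().

```python
JUNGSEONG_FILLER = "\u1160"  # HANGUL JUNGSEONG FILLER


def contains_jungseong_filler(label: str) -> bool:
    """Return True if the label contains U+1160 anywhere."""
    return JUNGSEONG_FILLER in label


def strip_jungseong_filler(label: str) -> str:
    """Remove every U+1160 from a label before it is displayed.

    Assumption: `label` is the output of ToUnicode(); this only affects
    what is shown to the user, not what is sent on the wire.
    """
    return label.replace(JUNGSEONG_FILLER, "")


label = "pay" + JUNGSEONG_FILLER + "pal"
if contains_jungseong_filler(label):
    print(repr(strip_jungseong_filler(label)))
```

A plugin could equally refuse to display such a label at all; deleting the filler is just the simpler of the two options.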

Soobok