CJK Incompatibilities (was: Re: Question about the agenda)

Kenneth Whistler <kenw@sybase.com> Sat, 21 March 2009 00:49 UTC

Date: Fri, 20 Mar 2009 17:49:29 -0700
From: Kenneth Whistler <kenw@sybase.com>
Subject: CJK Incompatibilities (was: Re: Question about the agenda)
To: phoffman@imc.org
Cc: idna-update@alvestrand.no, kenw@sybase.com

> >Perhaps we are simply reflecting a different
> >interpretation of "conclusions"?
> 
> Not really. The abstract of the JET draft says "[IDNA2008] will 
> cause incompatibilities for Chinese, Japanese and Korean (CJK) 
> scripts and languages." Section 3 of that draft gives a good 
> list of incompatibilities, none of which were listed in your 
> document. It does not seem fair to ask the WG "complete discussions,
> if necessary on IDNA2008 implications" while purposely ignoring
> some of the implications that have been brought to the WG's 
> attention, particularly those from major registries with a 
> lot of IDNA experience who spent the time to write them down 
> in an Internet Draft.

The incompatibilities noted in draft-jet-idnabis-cjk-localmapping-01
are a small subset of the incompatibilities noted and
discussed in:

http://www.unicode.org/reports/tr46/tr46-1.html

which we (the UTC), although not being a major registry,
have also spent the time to write down and bring to
the WG's attention.

To wit:

jet-idnabis-cjk-localmapping-01

3.1 Label separators

This deals with the well-known problem of the processing
conventions for U+3002 IDEOGRAPHIC FULL STOP and the
halfwidth and fullwidth versions as equivalent to "."
for label separators.

That is also accounted for in D-UTR #46.
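
For concreteness, here is a minimal sketch of that separator
handling. The three code points are the ones IDNA2003 treats as
equivalent to "."; the function name split_labels is my own
invention for illustration and is not taken from either draft.

```python
# Alternative label separators that IDNA2003 treats as equivalent to ".".
SEPARATORS = {
    "\u3002": ".",  # IDEOGRAPHIC FULL STOP
    "\uFF0E": ".",  # FULLWIDTH FULL STOP
    "\uFF61": ".",  # HALFWIDTH IDEOGRAPHIC FULL STOP
}
_SEPARATOR_TABLE = str.maketrans(SEPARATORS)


def split_labels(domain):
    """Hypothetical helper: map the alternative separators to "." and
    split the domain name into its labels."""
    return domain.translate(_SEPARATOR_TABLE).split(".")


if __name__ == "__main__":
    print(split_labels("example\u3002jp"))
```

Note that this happens before any per-label processing, which is
why it sits outside the definition of the labels themselves.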

3.2 Compatibility characters

That deals with the fullwidth letters and digits and
the halfwidth katakana. Those are mapped in IDNA2003.
They are simply DISALLOWED and not mapped in IDNA2008.

The preprocessing mapping in D-UTR #46 accounts for those.

Either IDNA2008 lets casing and NFKC mapping back into
the protocol to eliminate this kind of incompatibility
(which is widespread now -- hence the perceived need
for "local mappings" such as that described in the
JET draft), or

IDNA2008 stands as is, without case and NFKC mapping,
in which case D-UTR #46 will likely turn into the
de facto standard for preprocessing to maintain maximal
compatibility with existing IDNA2003 processing. That
would also eliminate the need for a CJK-specific local
mapping for this particular issue.
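
As a rough illustration of what case plus NFKC mapping does to
the characters in question, consider the sketch below. It is an
approximation only: it is neither the full IDNA2003 nameprep
profile nor the D-UTR #46 mapping table, and the name premap is
hypothetical.

```python
import unicodedata


def premap(label):
    """Hypothetical preprocessing step: lowercase, then apply NFKC.

    This only approximates the case and compatibility mapping
    discussed above; it is not a complete implementation of any
    specification.
    """
    return unicodedata.normalize("NFKC", label.lower())


if __name__ == "__main__":
    # Fullwidth "ABC1" and halfwidth katakana KA + NA.
    for raw in ("\uFF21\uFF22\uFF23\uFF11", "\uFF76\uFF85"):
        mapped = premap(raw)
        print([f"U+{ord(c):04X}" for c in raw], "->",
              [f"U+{ord(c):04X}" for c in mapped])
```

The fullwidth letters and digits come out as their ASCII
counterparts, and the halfwidth katakana come out as the ordinary
katakana, which is the behavior registrants already rely on.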

3.3 Exceptions

U+3005 IDEOGRAPHIC ITERATION MARK (there is a code point error
                                   in the JET draft)
                                   
U+30FB KATAKANA MIDDLE DOT (there is a name error in the JET draft)

Those are CONTEXTO in the current tables document for IDNA,
rather than PVALID, so there are potential incompatibilities
where they might be valid in an IDNA2003 label that would
be disallowed under IDNA2008 A.10 and A.12 CONTEXTO rules
for the two characters, respectively.

I have no idea how those two ended up getting CONTEXTO
designations in the tables document -- I must have been
snoozing when that happened. U+3005 should just get
derived as PVALID by regular category derivation. It is
no more contextually constrained than several other
iteration marks that are PVALID in the table, such
as U+309D HIRAGANA ITERATION MARK. So that is simply
a mistake and an overabundance of misplaced caution for
the tables document. U+3005 --> PVALID and that problem
goes away.
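
Anybody can check the category claim against the Unicode
Character Database. A quick sketch using Python's unicodedata
module follows; the results depend on the UCD version bundled
with the interpreter, and the helper name is just for
illustration.

```python
import unicodedata


def general_category(cp):
    """Return the General_Category of a code point, per the UCD
    version bundled with this Python interpreter."""
    return unicodedata.category(chr(cp))


if __name__ == "__main__":
    # U+3005 and two other iteration marks, for comparison.
    for cp in (0x3005, 0x309D, 0x30FD):
        print(f"U+{cp:04X} {unicodedata.name(chr(cp))}: "
              f"{general_category(cp)}")
```

All of them report the same modifier-letter category, so there
is nothing in the property data that singles out U+3005.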

U+30FB KATAKANA MIDDLE DOT needed to have an exception
for the derivation, since it is General_Category=Po
in the Unicode Character Database. But, in my opinion,
the right answer here is to specify that it is simply
PVALID, and to give up on the overspecification of
exactly where it has to occur in a label, which is
causing the incompatibility that the JET draft notes.
If the tables document is changed this way, then this
unnecessary incompatibility also goes away.
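
To show the shape of what I am suggesting, here is a much
simplified sketch of a derivation with an exceptions table. It is
not the algorithm in the tables document: the category set is an
assumption that covers only the letter-and-digit case, the
exception entry reflects my proposal rather than the current
text, and the names are hypothetical.

```python
import unicodedata

# Simplified assumption: categories that derive as PVALID.
LETTER_DIGIT = {"Ll", "Lo", "Lm", "Mn", "Mc", "Nd"}

# Hypothetical exceptions table reflecting the proposal above:
# U+30FB is punctuation (Po), so it needs an explicit entry.
EXCEPTIONS = {0x30FB: "PVALID"}


def derive(cp):
    """Return a derived property value for one code point, using
    the simplified rules above (exceptions are checked first)."""
    if cp in EXCEPTIONS:
        return EXCEPTIONS[cp]
    if unicodedata.category(chr(cp)) in LETTER_DIGIT:
        return "PVALID"
    return "DISALLOWED"


if __name__ == "__main__":
    for cp in (0x3005, 0x30FB, 0x3002):
        print(f"U+{cp:04X}: {derive(cp)}")
```

With that shape, U+3005 needs no entry at all, and U+30FB needs
exactly one entry with no contextual rule attached.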

At that point, there is only the very generic
issue of mapping left (one part of which is the
treatment of label separators, which is technically
outside the context of the definition of the labels
themselves, anyway).

--Ken