Re: [iucg] Unicode 7.0.0, (combining) Hamza Above, and normalization for comparison
John C Klensin <klensin@jck.com> Thu, 07 August 2014 16:09 UTC
Date: Thu, 07 Aug 2014 12:09:40 -0400
From: John C Klensin <klensin@jck.com>
To: JFC Morfin <jefsey@jefsey.com>,
Mark Davis ☕️ <mark@macchiato.com>
Message-ID: <AA3E83F7AA61539EC4FFC502@JcK-HP8200.jck.com>
In-Reply-To: <20140806124932.E9AA77C3B37@mork.alvestrand.no>
References: <C0D401D76B8D1BA472604BB4@JCK-EEE10>
<CAJ2xs_F9+6_+Fz-xFdSGBUV82qmMa33Y8+F9mjinMKx9=YoKcA@mail.gmail.com>
<CAJ2xs_H_Gy9b_A5LZj0o9rFffbvbnVGLv+22CD7NhmZhLXE6Rg@mail.gmail.com>
<219A83FB-B0C4-4B58-93A9-84A976B9147E@frobbit.se>
<20140806124932.E9AA77C3B37@mork.alvestrand.no>
Archived-At: http://mailarchive.ietf.org/arch/msg/iucg/LKJQa2LKz7bcVl8ogXMND9v3F_A
Cc: Marc Blanchet <Marc.Blanchet@viagenie.ca>,
IDNA update work <idna-update@alvestrand.no>, iucg@ietf.org,
gerard lang <gerard_lang@orange.fr>
Subject: Re: [iucg] Unicode 7.0.0, (combining) Hamza Above,
and normalization for comparison

--On Wednesday, August 06, 2014 14:01 +0200 JFC Morfin <jefsey@jefsey.com> wrote:

> At 07:03 06/08/2014, Patrik Fältström wrote:
>> To be honest, I do not think it matters where it is discussed.
>
> I suggest we keep it discussed here. The reason why is the
> ICANN response to the plaintiffs in the .ir, etc. case. "the
> DNS provides a human interface to the internet protocol
> addressing system". This seems to be a good definition to
> commonly sustain as it is technically true, easy to
> understand, and makes a clear distinction between the human
> and the non-human issues.
>...

Jefsey,

I am not sure I understand what you are talking about but, if I do, it is about an almost completely different topic. The actual issue here is extremely narrow and not associated with subjective (or computable via a distance function) visual confusability at all. The issue also applies to a very small (compared to the total Unicode repertoire) number of characters. It also involves the relationship between characters/code points within a single script, while most conversations about confusability have related to characters (more specifically, code point sequences) in different scripts.

In the hope that we can at least all be talking about the same thing, let me try to summarize the issue so that Mark and I can at least agree about the summary. At the risk of being harsh, while I think more informed discussion of the issues would be helpful, getting to an informed discussion requires some serious effort to understand the Unicode Standard, its construction, and its specific treatment of Arabic. The explanation below may be helpful to those who have at least a large fraction of that understanding; those who don't have that much would be, IMO, well-advised to do some serious reading before trying to participate.

In some situations, the Hamza combining character is used with a base character as a pronunciation indicator.
I'm told that, for the "BEH" base character and a few others, the most common such use is when Arabic (language) words are written in a Perso-Arabic context or similar writing system environments. Hamza is less often used this way for writing contemporary Arabic language in an Arabic language context, but that usage has changed over time. That usage as a pronunciation indicator has been supported in Unicode for years and years by a combining sequence using the base character and Hamza Above (a combining character).

In the lead-up to Unicode 7.0.0, the Unicode Consortium apparently got a request to include some characters that are needed for a North African language that is sometimes written in Arabic script. One of those characters looks just like the BEH WITH HAMZA ABOVE combining sequence (and Unicode even decided to give it that name), but it is really a conceptually separate character.

I think there is no disagreement up to that point, including about the abstract form of the combining sequence looking "just like" the newly-assigned character. Again, this is within the same script and there are no issues about what is confusingly similar and what is not.

There are several (I think equivalent) versions of where the disagreement sets in, but let me try what seems today to be the clearest. Section 2.2 of The Unicode Standard seems to be quite clear that coding in the standard is independent of language and similar considerations (see, in particular, the subsection titled "Unification"). Some of us who read that believe that a new code point for this letter should not be assigned at all and that, if it is, it should be subject to other rules that decompose it back into the combining sequence. However, section 2.2 makes clear that there are multiple considerations or "design principles" (it lists ten of them), that it may not be possible to apply them all and get a consistent result in a particular case, and that it is necessary to strike a balance.
Presumably on the basis of that balance, and with the precedent of at least three other characters in the Arabic script that are also distinct characters for some languages but combining sequences (if used at all) for others, a new code point, U+08A1, was assigned without a decomposition back to the existing combining sequence.

Mark (and others associated with the decision) cite the language issues, the precedents in the other Arabic characters, and so on. Those of us who aren't enthused about the new character at all, but whose real concern is two code point sequences that yield identical characters once language identification or considerations are removed, are very concerned that normalization doesn't create an equality relationship between them. IMO, that can't be "resolved" or consensus reached because the criteria for making the decision are different and lead to different results.

There are still parts of the criteria that Unicode is applying that confuse me (with frequent examples about use of Latin script among European languages and even some Latin characters being used to denote rather different phonemes in Hanyu Pinyin than they denote in Western Europe as examples of the confusion).

I'd like to be wrong. But, if I'm not, then we have a mechanism in IDNA for dealing with newly-added Unicode code points that are problematic for IDNA. It seems to me that there is little question that this new character and at least its three predecessors are problematic for IDNA (Mark may disagree, but I don't think he has said that yet). The question then becomes whether the damage that would be done by just accepting the Unicode decision and allowing U+08A1 to be PVALID would be greater or less than the potential damage from excluding it or writing special rules, rules that might, in some ways, parallel the discussion of Hamza in the "Arabic" section of The Unicode Standard.
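
For readers who want to see the missing equality relationship concretely, here is a minimal Python sketch using only the standard `unicodedata` module. It compares the BEH + HAMZA ABOVE combining sequence (U+0628 U+0654) with U+08A1 under NFC. The contrast case, ALEF WITH HAMZA ABOVE (U+0623), is my own choice of illustration and is not taken from the discussion above; it is a precomposed Arabic letter that does have a canonical decomposition. The `EXTRA_FOLDS` table and `comparison_key` function are purely hypothetical: they show one shape a "special rule" could take and are not part of IDNA, of Unicode normalization, or of anyone's proposal.

```python
import unicodedata

BEH = "\u0628"          # ARABIC LETTER BEH
HAMZA_ABOVE = "\u0654"  # ARABIC HAMZA ABOVE (combining character)
BEH_HAMZA = "\u08A1"    # code point assigned in Unicode 7.0.0

# Contrast case chosen for illustration only (not from the thread):
ALEF = "\u0627"         # ARABIC LETTER ALEF
ALEF_HAMZA = "\u0623"   # ARABIC LETTER ALEF WITH HAMZA ABOVE


def nfc(text: str) -> str:
    """Return the NFC form of text."""
    return unicodedata.normalize("NFC", text)


def canonically_equal(a: str, b: str) -> bool:
    """True if a and b are identical after NFC normalization."""
    return nfc(a) == nfc(b)


# HYPOTHETICAL extra mapping, applied before normalization. This is an
# assumption made for the sketch, not an existing IDNA or Unicode rule.
EXTRA_FOLDS = {BEH_HAMZA: BEH + HAMZA_ABOVE}


def comparison_key(label: str) -> str:
    """Fold the listed code points to their sequences, then apply NFC."""
    folded = "".join(EXTRA_FOLDS.get(ch, ch) for ch in label)
    return nfc(folded)


if __name__ == "__main__":
    # Precomposed ALEF WITH HAMZA ABOVE composes and decomposes canonically.
    print(canonically_equal(ALEF + HAMZA_ABOVE, ALEF_HAMZA))
    # U+08A1 has no decomposition, so normalization keeps the two apart.
    print(canonically_equal(BEH + HAMZA_ABOVE, BEH_HAMZA))
    # With the hypothetical fold, the two spellings share one key.
    print(comparison_key(BEH + HAMZA_ABOVE) == comparison_key(BEH_HAMZA))
```

The fold is applied per code point before NFC, so it does nothing about sequences in which other combining marks sit between the base character and the Hamza; a real rule would have to decide how to treat those.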

Again, more general discussions about confusable characters, especially between scripts, are relevant, but not to this thread and discussion and maybe not in/to this WG or the IETF. As an aside, if you dig back through the literature in optical (or equivalent) character recognition in the pre-Kurzweil period, you will find quite a bit about abstract character properties and distance functions that might explain why the latter never worked very well even with a character repertoire limited to a single language. Some things have clearly changed in more than a half-century, but a lot has not.

best,
john