[iucg] Non-Unicode interfaces to IDNs (was: Re: Unicode 7.0.0, (combining) Hamza Above, and normalization for comparison)
John C Klensin <klensin@jck.com> Thu, 07 August 2014 20:29 UTC
Date: Thu, 07 Aug 2014 16:29:11 -0400
From: John C Klensin <klensin@jck.com>
To: Jefsey <jefsey@jefsey.com>,
Mark Davis ☕️ <mark@macchiato.com>
Archived-At: http://mailarchive.ietf.org/arch/msg/iucg/rFSVxcFIFfUGqnyDErLmjFpG-to
Cc: Marc Blanchet <Marc.Blanchet@viagenie.ca>,
IDNA update work <idna-update@alvestrand.no>, iucg@ietf.org,
gerard lang <gerard_lang@orange.fr>

--On Thursday, August 07, 2014 20:07 +0200 Jefsey <jefsey@jefsey.com> wrote:

> At 18:09 07/08/2014, John C Klensin wrote:
>> Jefsey,
>>
>> I am not sure I understand what you are talking about but, if
>> I do, it is about an almost completely different topic.
>
> John,
>
> as you may remember I am not interested in Unicode per se. I
> am interested in the open pragmatic use of the digisphere
> through whatever is available and people want to use. This
> calls for external (fringe to fringe) innovations not to be
> constrained by any internal (end to end) MUST. As far as I
> understand, the issue being raised concerns orthotypography
> in a specific language?

Actually, perhaps the opposite, because one view of the problem is a desire to keep the IDN uses of Unicode language-independent.

> This has been ruled out of the IDNA2008 scope. Because it was
> ruled out of the Unicode scope.

Yes. And that makes it out of scope for this list, not just the Unicode 7.0.0 topic thread.

For whatever it is worth, the Internet is rapidly moving in a direction in which there will be two kinds of Web browsers: those that work (more or less exclusively) with Unicode encoded in UTF-8 and those that don't work with contemporary tools and facilities. That set of issues is obviously much broader than IDNs. If you feel a need to support or rely on non-Unicode interfaces, you should probably be standing in front of that juggernaut, not trying to fine-tune IDNA edge cases.

> Now, what may have to be clarified is what a user calls
> confusable. For users, "confusables" are different codepoint
> sequences that look the same (whatever the reason). If the
> Hamza added sequences are not creating internet use confusion,
> we are not concerned. If they are, we are concerned. However,
> we MUST not decide for others: we only are to give them the
> possibility to decide by themselves.

Then you are discussing yet another problem.
People who survey the written forms of languages indicate that the language in which the particular character in question is written (variously known as Fula, Fulan, Peul, and other names) is almost always written in this century in Latin script. I haven't even tried to do the research, but a guess from my recollections about the history of the area is that Latin script started to predominate about the time of the beginning of French presence in Northern Africa. The Unicode discussion (and various online articles that may or may not be independent) say that the language is still written in Arabic script by "Islamists" (a category whose definition one can guess at, but whose meaning in this context one cannot be certain of).

The first question is whether that community is likely to be registering and using domain names that use this particular character. I have no way to guess the answer to that question, but want to stress that this character is a non-issue for anyone who reads and writes the Fula language exclusively in Latin characters (and may be less of an issue for those who don't read or write, and maybe haven't heard of, that language).

The second part is more complicated. Even in the Unicode context as of Version 6.3 and earlier, one can create something that looks just like this character by using BEH and Combining Hamza Above. That is clearly inconvenient and annoying, in the same sense that it would be inconvenient and annoying to have to figure out a way to enter U+0063 and then U+0327 any time you wanted to write "ç" (U+00E7) (I'm assuming a French-relevant example will work better for you than the usual Swedish one).

That brings us to the crux of this rather subtle problem.
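
As a minimal illustration (not part of the original message), the two pairs of code point sequences mentioned above can be compared with Python's standard unicodedata module. The helper names code_points() and same_after_nfc() are hypothetical, chosen only for this sketch, and NFC is assumed as the normalization form. The sketch reports what the normalizer does; it takes no position on what it should do.

```python
import unicodedata

# Two spellings of c-with-cedilla: precomposed, and base letter plus
# combining cedilla.
C_CEDILLA = "\u00E7"
C_PLUS_CEDILLA = "\u0063\u0327"

# Two spellings of BEH with hamza above: the atomic U+08A1, and BEH
# (U+0628) followed by COMBINING HAMZA ABOVE (U+0654).
BEH_HAMZA = "\u08A1"
BEH_PLUS_HAMZA = "\u0628\u0654"


def code_points(text):
    """Return the string as a list of U+XXXX labels (hypothetical helper)."""
    return ["U+%04X" % ord(ch) for ch in text]


def same_after_nfc(a, b):
    """Compare two strings after normalizing both to NFC (hypothetical helper)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


if __name__ == "__main__":
    pairs = [(C_CEDILLA, C_PLUS_CEDILLA), (BEH_HAMZA, BEH_PLUS_HAMZA)]
    for a, b in pairs:
        print(code_points(a), code_points(b), same_after_nfc(a, b))
```

Because U+08A1 has no canonical decomposition, the cedilla pair compares equal after NFC while the BEH pair does not, even though each pair is intended to look identical when rendered.
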
If you (or the users you are concerned about) think of "ç" as a special and unique letter, rather than as a decorated "c", then, following the Fula example that leads to U+08A1, that letter should not compare equal, even after normalization, to the U+0063 U+0327 combination and, unless the latter is meaningful in some language that does not consider the combination a letter, the combination should not be allowed at all... any more than we would expect "w" to compare equal to the sequence "vv" or "uu".

But, if both U+00E7 and the combining sequence are allowed (as is the case today) and both are normalized using the same method, the results will compare equal... not because they look alike, but because they are the same character. Moreover, the IDNA requirement that all A-labels be in NFC form (which doesn't affect your ordinary text) effectively makes it impossible to incorporate the combining sequence into the DNS, so there is no potential for two representations of the same character that do not compare equal.

However, because of the distinction that was explained earlier, and perhaps because Arabic script is subject to different rules in practice than Latin script, the intent (or at least the effect of letting the existing rules work without introducing an exception) is that the DNS be able to accommodate both ARABIC BEH WITH HAMZA ABOVE as a single, atomic character coded as U+08A1 and ARABIC BEH with HAMZA ABOVE as a two-character combining sequence, with identical appearance and without the two comparing equal after normalization. _That_ is the present issue.

Now, in a world in which one were not using Unicode but instead used, e.g., national language code pages, both the French and Fula cases above are entirely non-issues.
In ISO 8859-1 and its descendants, "ç" appears as a single character, there is no combining cedilla, and it makes no difference whether that ISO 8859-1 character is mapped to Unicode by using the single code point or the combining sequence as long as one is consistent about it. Similarly, if one created a seven- or eight-bit code page for Fula written in extended Arabic script, the combining Hamza Above would probably not be included, the single code point probably would, and it would make no difference how the character was mapped to Unicode as long as one was consistent about it.

The only problem with the above paragraph is that it is questionable whether continued use of ISO 8859-1 is viable today, much less whether there would be international acceptance of new code pages. Whether it is viable today or not, it is clear that it is getting rapidly less plausible no matter how much you (or others) might wish for it.

> A digital name supports a semantic address through group of
> visual signs. Whatever the underlying code, version, etc. For
> the time being the IETF has chosen a single underlying
> typographic code to support the digital names' signs.

This may be a vocabulary issue, but I would associate the term "typographic code" with the combination of some sort of reference to a character repertoire (a coded character set, encoding, and code point would be one such reference, but definitely not the only possibility) with a reference to a type style or type family or member thereof (or, less precisely, a "font"). The IETF has _never_ made such a choice.

> That
> code does not consider orthotypography (i.e. semantic
> constraints that are language dependent). So the IETF has
> chosen that its end to end protocol are not concerned by
> orthotypographic issues. Tomorrow we can choose an additional
> code to Unicode: we need the use of these two codes to be
> transparent to our own use, independently from any
> orthotypographic issue. This is the metric of our choice.

I think that, in practice, the window on your making and using that choice is closing rapidly if it has not closed already. See above. The only constraint existing IETF work is going to impose on you is that there be an orderly and consistent mapping between whatever you decide to use and the Unicode repertoire. If there is not, you will find that you are introducing your own form of confusion, one that will be very hard for your users to understand. However, if you want to create or adopt a language-specific or even typography-specific coding, don't let me discourage you from trying, as long as you don't expect most people on this list (or even me) to discuss it with you.

> 1. TLD Managers must be able to use their own add-ons to
> support or not orthotypographic aspects in their zone. How do
> you know if you will not create conflicts today with Hamza, or
> in "equivalent" other cases?

Such TLD managers would face at least two problems. One would be getting equivalent plug-ins into all of the browsers that potential users of the TLD (or URIs that include it) might use or reference. Otherwise, they would expose users to wildly inconsistent behavior, which is generally not a good idea. Reducing the chance of that sort of confusion is one of many things driving the "UTF-8 only" movement.

The second is that the TLD manager has very little control over the TLD environment. If the plug-ins merely change the appearance of characters to make them a little more attractive, that might be a non-problem. But, if they create distinctions that don't exist elsewhere or map some characters into others, there is a lot of potential for very bad things to happen. In particular, the propagation of such plug-ins would create a wonderful opportunity for people with malicious intent because, at least conceptually, any plug-in that can alter the form of a domain name or URI can alter it all the way to something completely different.

> 2. If you consider Hamza you must consider French majuscules.
> Or am I wrong?

You are mostly wrong because the issue with French majuscules is that, if you had a CCS that distinguished them from conventional upper and lower case letters, information would be lost in mapping that CCS to Unicode, causing the "Consistency" condition mentioned above to fail. There is no loss of information in this Hamza case because the mappings to and from Unicode are trivial (because the discussion is entirely about Unicode) and, as discussed above, any specialized code page would behave in a consistent and predictable way.

Interestingly --and part of the challenge the Unicode Consortium faces-- if normalization brought U+08A1 together with the combining sequence but the "separate character" argument continued to hold, the normalization process would lose the information about whether the original was that "separate character" or the combining sequence. But that situation is much like the counterfactual c-with-cedilla case mentioned above, not anything to do with majuscules.

And I suspect this list has now had enough of this discussion.

best,
   john