Re: [I18n-discuss] Comments on "troublesome-characters" from Arabic script
John C Klensin <john-ietf@jck.com> Sun, 30 July 2017 15:47 UTC
Date: Sun, 30 Jul 2017 11:46:57 -0400
From: John C Klensin <john-ietf@jck.com>
To: "Abdulaziz H. Al-Zoman" <azoman@citc.gov.sa>, 'Andrew Sullivan' <ajs@anvilwalrusden.com>, Raed Al-Fayez <rfayez@citc.gov.sa>
cc: "'i18n-discuss@iab.org'" <i18n-discuss@iab.org>
Message-ID: <58624950E7E9BB71733FF3B3@PSB>
In-Reply-To: <EDEC5B615F83D44981FA2D0DCA997167013170FA78@ry0mail1.citc.gov.sa>
References: <043D7B5CFC1AB8469108EA7BB5F68BB00124076934@ry0mail1.citc.gov.sa> <EDEC5B615F83D44981FA2D0DCA9971670131709366@ry0mail1.citc.gov.sa> <20170727211320.dkano7pdmjxoj62h@mx4.yitter.info> <EDEC5B615F83D44981FA2D0DCA997167013170FA78@ry0mail1.citc.gov.sa>
Archived-At: <https://mailarchive.ietf.org/arch/msg/i18n-discuss/wuMv74tQaj7PyckewWUi28zRE_E>
Subject: Re: [I18n-discuss] Comments on "troublesome-characters" from Arabic script
Andrew, Dr. Al-Zoman,

Let me suggest what may be a middle ground (a variation of one I've already suggested off-list to Andrew and Asmus) to see if it works for you.

I think the key here is an expansion of things Andrew has already said, which is that this particular draft should not (in IETF-speak, MUST NOT) be interpreted out of context with other documents. Partially for that reason, it may need additional text to explain those connections. FWIW, my hypothesis is that Arabic script is not the only one that would be adversely affected if "troublesome" were interpreted as "bad character" or "don't allow this". We were just lucky that you and your colleagues spotted the problems early, mentioned them, and were able to clearly explain the issue.

Would you agree with the following:

(1) The use of combining characters with Arabic script is problematic, to the point of requiring special attention and knowledge whether they should be prohibited or not. AFAICT, combining characters are not allowed by the analysis and recommendations of RFC 5564 and, by adding more precombined forms, Unicode has decided to not encourage their use.

(2) Arabic script as defined by Unicode is unique in having two sets of code points for digits within the same script. The observation that European digits are often used with the Arabic language in some places further complicates that problem.

(3) At least for the part of the world's population that does not use them every day, scripts that are normally written with the characters connected (a category that is certainly not limited to Arabic) are more of a problem than scripts written with more obvious character boundaries. This does not reflect badly on those scripts in any way, but it does make it harder for those who are ignorant about the scripts to look characters up in tables, transcribe written forms into computer input, etc.

(4) Again, at least for those who do not use them every day, scripts that are normally written right to left, especially those that also use numerals that are written left to right (i.e., with the highest-valued digit on the left), are more difficult than left to right ones. While this is not a significant problem with the scripts themselves, it can, unless care is exercised by people who understand what is going on, result in strange behavior in some circumstances. Fully-qualified domain names where some labels start or end with digits have proven to be particularly difficult examples.

Now, if we agree on that much, I suggest that the document should probably say the following (in different words) and say it very clearly and strongly:

(i) Arabic script (because of the first two characteristics above) and _any_ script with either or both of the other two characteristics make the requirements that a registry understand the script(s) it allows, that it carefully design a plan for working with each of those scripts, and that it consider itself accountable for the results, even more important than those requirements usually are. Code points that are, or are not, listed by this document may be useful as input to such registry understanding and policy development efforts, but are not a substitute for them. Again, those principles apply to all registries and all scripts, but the ones identified by (1)-(4) above require extra attention (so, by the way, do registries allowing more than one of Greek, Latin, or Cyrillic scripts and several other combinations -- perhaps the document should include a section of "troublesome" combinations of scripts).

(ii) The rule of numeral homogeneity (RFC 5564, Section 2.3.1), expanded to also prohibit mixing with "Eastern" Arabic-Indic digits, is far more important and useful than trying to call out any of those digit code points as problematic.

(iii) Combining marks are extremely problematic with Arabic script. If combining marks are prohibited by the registry, then code points that are listed in the table as problematic because of possible combining sequences become non-problematic. While, if I understand RFC 5564 and your comments correctly, simply prohibiting all of them is reasonable for the Arabic language, I note that the latest version of the ASIWG recommendations that I can find (June 2008, http://arabic-domains.org/docs/jun2010/IDNA200x_Arabic_Script_Disallowed_Codepoints.pdf) [1] allows a large number of combining characters.

(iv) If recommendations for domain name use (or identifier use more generally) have been developed by groups reflecting a broad range of expertise, including experts on the relevant linguistics and writing systems (not just, e.g., name-marketing), for a particular script or language and those recommendations are very specific about what they cover and are applied carefully, they are most likely to be a better guide to what is appropriate and/or troublesome than a list, including this "troublesome" one, developed for general use. For the Arabic language written in Arabic script and a domain registry that does not allow labels based on other languages written in Arabic script, that implies that RFC 5564 is a better guide than this document, no matter what this one says. By contrast, if one had a registry that was exclusively registering labels based on, e.g., Persian writing conventions, this document (perhaps in combination with the generic recommendations for Arabic script in the ASIWG work) would be more appropriate than trying to apply 5564 (if specific and carefully-developed recommendations for Persian existed, that would, of course, be better yet).

Given an explanation of that type, the vast majority of the Arabic code points could either be removed from the Internet Draft or turned into cross-references to the explanation or more authoritative documents.

Would that be a reasonable direction to take?
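
To make (ii) and (iii) concrete, here is a minimal sketch of how a registry might test a candidate label mechanically. It is an illustration only, not anything specified by RFC 5564 or the draft: the function names are hypothetical, "combining mark" is approximated by Unicode general category Mn (non-spacing mark), and the three digit ranges are my reading of "European", Arabic-Indic, and "Eastern" (Extended) Arabic-Indic digits.

```python
import unicodedata

# Decimal digit sets that may appear in labels using Arabic script.
# The set names are labels chosen for this sketch only.
DIGIT_SETS = {
    "european": range(0x0030, 0x003A),               # U+0030..U+0039
    "arabic-indic": range(0x0660, 0x066A),           # U+0660..U+0669
    "extended-arabic-indic": range(0x06F0, 0x06FA),  # U+06F0..U+06F9
}


def digit_sets_used(label):
    """Return the names of the digit sets that occur in the label."""
    used = set()
    for ch in label:
        for name, code_points in DIGIT_SETS.items():
            if ord(ch) in code_points:
                used.add(name)
    return used


def has_combining_mark(label):
    """True if any character is a non-spacing mark (category Mn)."""
    return any(unicodedata.category(ch) == "Mn" for ch in label)


def check_label(label):
    """Return a list of policy problems; an empty list means none found."""
    problems = []
    if len(digit_sets_used(label)) > 1:
        problems.append("mixed digit sets")
    if has_combining_mark(label):
        problems.append("combining mark")
    return problems
```

A check like this tests whole labels against a policy rather than flagging individual code points, which is the point of (ii) and (iii): under such a policy a basic letter never needs to be listed as troublesome on its own. It is not a substitute for the registry understanding the script.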
best regards,
   john

[1] Were the ASIWG recommendations ever completed and formally approved by ESCWA and the relevant countries? The table mentioned above is the latest I can find but it is dated even earlier than the 2009 meeting I attended, and the asiwg.org site to which several references point appears to now be in Chinese script. An up-to-date reference, if it exists, would be appreciated.

--On Sunday, July 30, 2017 08:37 +0000 "Abdulaziz H. Al-Zoman" <azoman@citc.gov.sa> wrote:

> Good day Andrew,
>
> Thanks for your follow-up email to my feedback.
>
>
> I do understand the goal of the internet draft as you've
> stated: "to create the conditions for guidance
> for operators and applications, so that
> it is possible for (for instance) my
> user agent to work globally, even when
> some parts of the global linguistic
> environment is foreign to me and when
> there are no in-protocol clues about
> the language I'm facing."
>
> but I'm worried by the inclusion of "basic" code points (22
> out of 28 Arabic language alphabet) to the repository instead
> of only the problematic code points (non-spacing marks). This
> would make the Arabic language unusable (or troublesome) to
> operators and developers. While, I would like to see some
> encouragement to support Arabic IDNs rather than shun them
> away from it.
>
> Therefore, I would suggest that the registry includes only
> the problematic sequence of code points that incorporate
> non-spacing marks but not the basic (and essential)
> characters.
>
> So I would suggest that the repository table looks like the
> following table (for example) where: Column 1 represent the
> problematic (sequence of) code points and Column 2 contains
> the Reasons and Comments. The Column 1 should NOT contain a
> basic code point alone without non-spacing mark causing the
> problem.
>
> Column 1             | Column 2
> ---------------------+------------------
> 062F, 065C           | Identical in appearance to ...
> ---------------------+------------------
> 062F, 06EC           | Identical in appearance to ...
> ---------------------+------------------
> 0631, 06EC           | Identical in appearance to ...
> ---------------------+------------------
> 0633, 06DB           | Identical in appearance to ...
>
>
> Yours,
> Abdulaziz Al-Zoman
>
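
The sequence-based table proposed in the quoted message could be represented roughly as follows. This is a sketch under assumptions: the names `PROBLEM_SEQUENCES` and `find_problem_sequences` are hypothetical, only the four example rows from the message are included, and the reason strings are left elided exactly as they appear there.

```python
# Keys are (base code point, non-spacing mark) pairs taken from the example
# table; the reasons are elided in the source and are kept that way here.
PROBLEM_SEQUENCES = {
    (0x062F, 0x065C): "Identical in appearance to ...",
    (0x062F, 0x06EC): "Identical in appearance to ...",
    (0x0631, 0x06EC): "Identical in appearance to ...",
    (0x0633, 0x06DB): "Identical in appearance to ...",
}


def find_problem_sequences(label):
    """Return (index, pair) for each listed two-code-point sequence found.

    A basic code point alone never matches; only a base character that is
    immediately followed by the listed mark does.
    """
    code_points = [ord(ch) for ch in label]
    hits = []
    for i in range(len(code_points) - 1):
        pair = (code_points[i], code_points[i + 1])
        if pair in PROBLEM_SEQUENCES:
            hits.append((i, pair))
    return hits
```

Keying the table on sequences rather than single code points matches the request that Column 1 never contain a basic character by itself.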