Re: [I18n-discuss] Comments on "troublesome-characters" from Arabic script

John C Klensin <john-ietf@jck.com> Sun, 30 July 2017 15:47 UTC

Date: Sun, 30 Jul 2017 11:46:57 -0400
From: John C Klensin <john-ietf@jck.com>
To: "Abdulaziz H. Al-Zoman" <azoman@citc.gov.sa>, 'Andrew Sullivan' <ajs@anvilwalrusden.com>, Raed Al-Fayez <rfayez@citc.gov.sa>
cc: "'i18n-discuss@iab.org'" <i18n-discuss@iab.org>
Message-ID: <58624950E7E9BB71733FF3B3@PSB>
In-Reply-To: <EDEC5B615F83D44981FA2D0DCA997167013170FA78@ry0mail1.citc.gov.sa>
References: <043D7B5CFC1AB8469108EA7BB5F68BB00124076934@ry0mail1.citc.gov.sa> <EDEC5B615F83D44981FA2D0DCA9971670131709366@ry0mail1.citc.gov.sa> <20170727211320.dkano7pdmjxoj62h@mx4.yitter.info> <EDEC5B615F83D44981FA2D0DCA997167013170FA78@ry0mail1.citc.gov.sa>
Archived-At: <https://mailarchive.ietf.org/arch/msg/i18n-discuss/wuMv74tQaj7PyckewWUi28zRE_E>

Andrew, Dr. Al-Zoman,

Let me suggest what may be a middle ground (a variation of one
I've already suggested off-list to Andrew and Asmus) to see if
it works for you.   I think the key here is an expansion of
things Andrew has already said, which is that this particular
draft should not (in IETF-speak, MUST NOT) be interpreted out of
context with other documents.  Partially for that reason, it may
need additional text to explain those connections.

FWIW, my hypothesis is that Arabic script is not the only one
that would be adversely affected if "troublesome" were
interpreted as "bad character" or "don't allow this".  We were
just lucky that you and your colleagues spotted the problems
early, mentioned them, and were able to clearly explain the issue.

Would you agree with the following:

(1) The use of combining characters with Arabic script is
problematic, to the point of requiring special attention and
knowledge whether they should be prohibited or not.  AFAICT,
combining characters are not allowed by the analysis and
recommendations of RFC 5564 and, by adding more precombined
forms, Unicode has decided to not encourage their use.

(2) Arabic script as defined by Unicode is unique in having two
sets of code points for digits within the same script.  The
observation that European digits are often used with the Arabic
language in some places further complicates that problem.

(3) At least for the part of the world's population that does
not use them every day, scripts that are normally written with
the characters connected (a category that is certainly not
limited to Arabic) are more of a problem than scripts written
with more obvious character boundaries.  This does not reflect
badly on those scripts in any way, but it does make it harder
for those who are ignorant about the scripts to look characters
up in tables, transcribe written forms into computer input, etc.

(4) Again, at least for those who do not use them every day,
scripts that are normally written right to left, especially
those that also use numerals that are written left to right
(i.e., with the highest-valued digit on the left), are more
difficult than left to right ones.  While this is not a
significant problem with the scripts themselves, it can, unless
care is exercised by people who understand what is going on,
result in strange behavior in some circumstances.
Fully-qualified domain names where some labels start or end with
digits have proven to be particularly difficult examples. 
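
To make points (2) and (4) concrete, here is a minimal Python
sketch, offered only as an illustration and not taken from RFC
5564 or the draft.  It reports which digit sets a label uses and
the Unicode bidirectional class of its first and last
characters.  The function names digit_sets_used and
edge_bidi_classes are hypothetical; the code point ranges are
the ASCII digits (U+0030..U+0039), the Arabic-Indic digits
(U+0660..U+0669), and the Extended Arabic-Indic digits
(U+06F0..U+06F9).

```python
import unicodedata

# Code point ranges for the three digit sets (illustrative names).
DIGIT_SETS = {
    "european": range(0x0030, 0x003A),
    "arabic-indic": range(0x0660, 0x066A),
    "extended-arabic-indic": range(0x06F0, 0x06FA),
}


def digit_sets_used(label):
    """Return the names of the digit sets that appear in the label."""
    found = set()
    for ch in label:
        for name, code_points in DIGIT_SETS.items():
            if ord(ch) in code_points:
                found.add(name)
    return found


def edge_bidi_classes(label):
    """Return the bidi classes of the first and last characters.

    Assumes a non-empty label.  Classes such as "AL" (Arabic
    letter), "EN" (European number), and "AN" (Arabic number)
    come from the Unicode Character Database.
    """
    return (
        unicodedata.bidirectional(label[0]),
        unicodedata.bidirectional(label[-1]),
    )
```

For example, edge_bidi_classes("\u0633\u0661") shows a label
that starts with an Arabic letter and ends with an Arabic-Indic
digit, which is the kind of label a registry would want to look
at closely before allowing it in a fully-qualified domain name.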

Now, if we agree on that much, I suggest that the document
should probably say the following (in different words) and say
it very clearly and strongly:

(i) Arabic script (because of the first two characteristics
above) and _any_ script with either or both of the other two
characteristics make the requirements that a registry understand
the script(s) it allows, that it carefully design a plan for
working with each of those scripts, and that it consider itself
accountable for the results, even more important than those
requirements usually are.  Code points that are, or are not,
listed by this document may be useful as input to such registry
understanding and policy development efforts, but are not a
substitute for them.  Again, those principles apply to all
registries and all scripts, but the ones identified by (1)-(4)
above require extra attention (so, by the way, do registries
allowing more than one of Greek, Latin, or Cyrillic scripts and
several other combinations -- perhaps the document should
include a section of "troublesome" combinations of scripts).

(ii) The rule of numeral homogeneity (RFC 5564, Section 2.3.1),
expanded to also prohibit mixing with "Eastern" Arabic-Indic
digits, is far more important and useful than trying to call out
any of those digit code points as problematic.  

(iii) Combining marks are extremely problematic with Arabic
script.  If combining marks are prohibited by the registry, then
code points that are listed in the table as problematic because
of possible combining sequences become non-problematic.  While,
if I understand RFC 5564 and your comments correctly, simply
prohibiting all of them is reasonable for the Arabic language, I
note that the latest version of the ASIWG recommendations that I
can find (June 2008,
http://arabic-domains.org/docs/jun2010/IDNA200x_Arabic_Script_Disallowed_Codepoints.pdf)
[1] allows a large number of combining characters.

(iv) If recommendations for domain name use (or identifier use
more generally) have been developed by groups reflecting a broad
range of expertise, including experts on the relevant
linguistics and writing systems (not just, e.g.,
name-marketing), for a particular script or language and those
recommendations are very specific about what they cover and are
applied carefully, they are most likely to be a better guide to
what is appropriate and/or troublesome than a list, including
this "troublesome" one, developed for general use.  For the
Arabic language written in Arabic script and a domain registry
that does not allow labels based on other languages written in
Arabic script, that implies that RFC 5564 is a better guide than
this document, no matter what this one says.  By contrast, if
one had a registry that was exclusively registering labels based
on, e.g., Persian writing conventions, this document (perhaps in
combination with the generic recommendations for Arabic script
in the ASIWG work) would be more appropriate than trying to
apply 5564 (if specific and carefully-developed recommendations
for Persian existed, that would, of course, be better yet).  
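
As a rough illustration of how a registry might apply (ii) and
(iii) mechanically, the following Python sketch flags labels
that mix digit sets or contain combining marks.  It is an
assumption-laden example, not an implementation of RFC 5564 or
of the draft: the function name check_label, the
allow_combining switch, and the choice to treat every code point
in Unicode general categories Mn, Mc, and Me as a combining mark
are all hypothetical simplifications.

```python
import unicodedata

# Digit sets covered by the homogeneity check (illustrative names).
DIGIT_SETS = {
    "european": range(0x0030, 0x003A),
    "arabic-indic": range(0x0660, 0x066A),
    "extended-arabic-indic": range(0x06F0, 0x06FA),
}


def check_label(label, allow_combining=False):
    """Return a list of problem descriptions; empty means none found."""
    problems = []
    digit_sets = set()
    for ch in label:
        cp = ord(ch)
        for name, code_points in DIGIT_SETS.items():
            if cp in code_points:
                digit_sets.add(name)
        # Assumption: general category M* stands in for "combining mark".
        if not allow_combining and unicodedata.category(ch).startswith("M"):
            problems.append("combining mark U+%04X" % cp)
    # Numeral homogeneity: at most one digit set per label.
    if len(digit_sets) > 1:
        problems.append("mixed digit sets: " + ", ".join(sorted(digit_sets)))
    return problems
```

With the default policy, check_label("\u062F\u065C") reports the
combining mark, so the base letter U+062F never needs to be
listed as problematic on its own.  A check of this kind is only
an aid to a registry's policy work, not a substitute for it.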

Given an explanation of that type, the vast majority of the
Arabic code points could either be removed from the Internet
Draft or turned into cross-references to the explanation or more
authoritative documents.

Would that be a reasonable direction to take?

best regards,
    john

[1] Were the ASIWG recommendations ever completed and formally
approved by ESCWA and the relevant countries?   The table
mentioned above is the latest I can find but it is dated even
earlier than the 2009 meeting I attended, and the asiwg.org site
to which several references point appears to now be in Chinese
script.  An up-to-date reference, if it exists, would be
appreciated.

--On Sunday, July 30, 2017 08:37 +0000 "Abdulaziz H. Al-Zoman"
<azoman@citc.gov.sa> wrote:

> Good day Andrew,
> 
> Thanks for your follow-up email to my feedback.
> 
> 
> I do understand the goal of the internet draft as you've
> stated:
>       "to create the conditions for guidance
>       for operators and applications, so that
>       it is possible for (for instance) my
>       user agent to work globally, even when
>       some parts of the global linguistic
>       environment is foreign to me and when
>       there are no in-protocol clues about
>       the language I'm facing."
> 
> but I'm worried by the inclusion of "basic" code points (22
> out of 28 Arabic language alphabet) to the repository instead
> of only the problematic code points (non-spacing marks). This
> would make the Arabic language unusable (or troublesome) to
> operators and developers. While, I would like to see some
> encouragement to support Arabic IDNs rather than shun them
> away from it.
> 
> Therefore,  I would suggest that the registry includes only
> the problematic sequence of code points that incorporate
> non-spacing marks but not the basic (and essential)
> characters. 
> 
> So I would suggest that the repository table looks like the
> following table (for example) where: Column 1 represent the
> problematic (sequence of) code points and Column 2 contains
> the Reasons and Comments. The Column 1 should NOT contain a
> basic code point alone without non-spacing mark causing the
> problem.
> 
> Column 1             | Column 2
> ---------------------+------------------
> 062F, 065C           | Identical in appearance to ...
> ---------------------+------------------
> 062F, 06EC           | Identical in appearance to ...
> ---------------------+------------------
> 0631, 06EC           | Identical in appearance to ...
> ---------------------+------------------
> 0633, 06DB           | Identical in appearance to ...
> 
> 
> Yours,
> Abdulaziz Al-Zoman
>