Re: [I18n-discuss] Comments on "troublesome-characters" from Arabic script

On 7/30/2017 8:46 AM, John C Klensin wrote:
> Andrew, Dr. Al-Zoman,
>
> Let me suggest what may be a middle ground (a variation of one
> I've already suggested off-list to Andrew and Asmus) to see if
> it works for you.   I think the key here is an expansion of
> things Andrew has already said, which is that this particular
> draft should not (in IETF-speak, MUST NOT) be interpreted out of
> context with other documents.  Partially for that reason, it may
> need additional text to explain those connections.

The code point registry is based on data from those other documents.

An updated draft of the introductory text will clarify that.

>
> FWIW, my hypothesis is that Arabic script is not the only one
> that would be adversely affected if "troublesome" were
> interpreted as "bad character" or "don't allow this".  We were
> just lucky that you and your colleagues spotted the problems
> early, mentioned it, and were able to clearly explain the issue.

If we can find a way not to be limited by the RFC "text table" for
presentation of the registry content, it will be easier to clarify that
some code points are listed not to suggest that they be excluded, but
because they are part of a variant pair.
>
> Would you agree with the following:
>
> (1) The use of combining characters with Arabic script is
> problematic, to the point of requiring special attention and
> knowledge whether they should be prohibited or not.  AFAICT,
> combining characters are not allowed by the analysis and
> recommendations of RFC 5564 and, by adding more precombined
> forms, Unicode has decided to not encourage their use.

RFC 5564 does not cover all combining marks, but many. The Arabic VIP 
study did not address combining marks, but contained an extensive set of 
(possible) variants between single code points and sequences. The Root 
Zone LGR came down on excluding all combining marks for Arabic.

The draft, as shared, simply contained the combined information from 
these three sources.

However, in the process it became clear that the attempt to define a 
clean variant set using combining sequences is fraught with difficulties 
and it is not even clear that it could succeed and achieve a 
self-consistent result. Therefore, it is probably wisest to retreat to 
the position taken by the Root Zone LGR which, effectively, extends the 
reasoning of RFC 5564, and list all Arabic combining marks as 
"not-recommended". This has the benefit of removing all sequences (as 
they would require use of a code point that's not recommended) and 
therefore all description of possible variant relations between them and 
single code points.
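
As a rough illustration of that effect (not part of the draft), the 
sketch below drops every registry entry that contains a combining mark. 
The sample sequences are the four from Dr. Al-Zoman's example table 
quoted further down; the single-code-point entry, the function names, 
and the use of Unicode General_Category M* as a stand-in for "Arabic 
combining marks" are assumptions made for the example.

```python
import unicodedata

def is_combining_mark(cp: int) -> bool:
    # Assumption: treat General_Category Mn/Mc/Me as "combining mark".
    return unicodedata.category(chr(cp)).startswith("M")

def drop_mark_sequences(entries):
    # Keep only entries in which no code point is a combining mark.
    return [seq for seq in entries
            if not any(is_combining_mark(cp) for cp in seq)]

entries = [
    (0x062F, 0x065C),  # sequences from the example table
    (0x062F, 0x06EC),
    (0x0631, 0x06EC),
    (0x0633, 0x06DB),
    (0x0660,),         # hypothetical single-code-point entry
]

remaining = drop_mark_sequences(entries)
print([tuple(f"{cp:04X}" for cp in seq) for seq in remaining])
```

Once the marks are not recommended, every sequence disappears from the 
list and only single code points are left to describe.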

There will be remaining single code points, but these will be in mutual 
variant relations with other single code points (because of shared shapes).

In those cases, there is no preferred variant, so both will have to be 
listed, but the explanation should make clear that just because two code 
points are possibly variants it is not a useful technique to blindly 
exclude both (usability would suffer immensely and nobody is interested 
in zone policies that make a script unusable).
>
> (2) Arabic script as defined by Unicode is unique in having two
> sets of code points for digits within the same script.  The
> observation that European digits are often used with the Arabic
> language in some places further complicates that problem.
This is already handled by the suggestion to make these variants of each 
other.

The draft uses the language "mutually exclusive" because of perceived 
sensitivity of the term variant, but this experience has shown that 
avoiding this term, which is well understood in communities that use a 
script where variants are essential, will only add to the confusion. 
Therefore, the description of the registry should use the term variant 
(and in RFC 7940 and another pending RFC there will be enough 
description of that term to give it precision).
>
> (3) At least for the part of the world's population that does
> not use them every day, scripts that are normally written with
> the characters connected (a category that is certainly not
> limited to Arabic) are more of a problem than scripts written
> with more obvious character boundaries.  This does not reflect
> badly on those scripts in any way, but it does make it harder
> for those who are ignorant about the scripts to look characters
> up in tables, transcribe written forms into computer input, etc.

This does not make them "troublesome" on the code point level for the 
purposes of the proposed registry.

However, some scripts need context rules for certain code points (or 
classes of code points) and where that is the case, the registry should 
point that out. It would be one of the "considerations" that have to be 
taken into account in drafting a policy for a zone using that script.
>
> (4) Again, at least for those who do not use them every day,
> scripts that are normally written right to left, especially
> those that also use numerals that are written left to right
> (i.e., with the highest-valued digit on the left), are more
> difficult than left to right ones.  While this is not a
> significant problem with the scripts themselves, it can, unless
> care is exercised by people who understand what is going on,
> result in strange behavior in some circumstances.
> Fully-qualified domain names where some labels start or end with
> digits have proven to be particularly difficult examples.
>
> Now, if we agree on that much, I suggest that the document
> should probably say the following (in different words) and say
> it very clearly and strongly:
>
> (i) Arabic script (because of the first two characteristics
> above) and _any_ script with either or both of the other two
> characteristics make the requirements that a registry understand
> the script(s) it allows, that it carefully design a plan for
> working with each of those scripts, and that it consider itself
> accountable for the results, even more important than those
> requirements usually are.  Code points that are, or are not,
> listed by this document may be useful as input to such registry
> understanding and policy development efforts, but are not a
> substitute for them.  Again, those principles apply to all
> registries and all scripts, but the ones identified by (1)-(4)
> above require extra attention (so, by the way, do registries
> allowing more than one of Greek, Latin, or Cyrillic scripts and
> several other combinations -- perhaps the document should
> include a section of "troublesome" combinations of scripts).

At the moment, the proposed initial content of the registry does not 
include cross-script issues. However, I am coming to the conclusion that 
that was a bad idea and that the existing data collections for in-script 
and cross-script issues should be combined; it is easy enough to tag 
entries with a script value to help people weed out code points not 
relevant to their zone.
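
A minimal sketch of what such tagging could look like, assuming a 
hypothetical entry layout (the Entry fields, the sample rows, and the 
relevant() helper are invented for illustration and are not taken from 
the draft):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Entry:
    code_points: tuple   # one or more code points
    scripts: frozenset   # script tags this entry matters for
    note: str

# Invented sample rows: one in-script entry, two cross-script entries.
ENTRIES = [
    Entry((0x0660,), frozenset({"Arabic"}), "digit; variant of U+06F0"),
    Entry((0x0430,), frozenset({"Cyrillic", "Latin"}), "resembles U+0061"),
    Entry((0x03BF,), frozenset({"Greek", "Latin"}), "resembles U+006F"),
]

def relevant(entries, zone_scripts):
    # Keep entries tagged with at least one script the zone allows.
    zone = frozenset(zone_scripts)
    return [e for e in entries if e.scripts & zone]

for e in relevant(ENTRIES, {"Latin"}):
    print(" ".join(f"U+{cp:04X}" for cp in e.code_points), "-", e.note)
```

A zone that allows only Arabic would see only the first row, while a 
Latin zone would see the two cross-script rows.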
>
> (ii) The rule of numeral homogeneity (RFC 5564, Section 2.3.1),
> expanded to also prohibit mixing with "Eastern" Arabic-Indic
> digits, is far more important and useful than trying to call out
> any of those digit code points as problematic.

Being homogeneous inside a label is useful, but ultimately you have to 
look at the zone. The digits remain "troublesome" in the sense that you 
need to apply two things: (1) a context rule to limit each label to one 
set of digits, and (2) a variant relation to prevent two labels that 
differ only in the type of digit at a given position from coexisting in 
the zone. We have to get over the "troublesome == must-exclude" fallacy 
anyway, and replace it with "troublesome == need some rules to mitigate 
a potential issue".
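
The two mechanisms can be sketched roughly as follows. This is an 
illustration under stated assumptions, not the registry's or any LGR's 
actual rule set: the function names are hypothetical, and the choice of 
the three digit sets (ASCII, Arabic-Indic, Extended Arabic-Indic) and 
of ASCII as the comparison form is made only for the example.

```python
# The three digit sets considered here (an assumption of this sketch).
ASCII_DIGITS = {chr(c) for c in range(0x0030, 0x003A)}
ARABIC_INDIC = {chr(c) for c in range(0x0660, 0x066A)}
EXT_ARABIC_INDIC = {chr(c) for c in range(0x06F0, 0x06FA)}
DIGIT_SETS = (ASCII_DIGITS, ARABIC_INDIC, EXT_ARABIC_INDIC)

def digits_homogeneous(label: str) -> bool:
    # (1) Context rule: a label may use digits from at most one set.
    used = [s for s in DIGIT_SETS if any(ch in s for ch in label)]
    return len(used) <= 1

def variant_key(label: str) -> str:
    # (2) Variant relation: fold every digit to its ASCII counterpart,
    # so labels differing only in digit type share the same key.
    out = []
    for ch in label:
        cp = ord(ch)
        if 0x0660 <= cp <= 0x0669:
            out.append(chr(cp - 0x0660 + 0x0030))
        elif 0x06F0 <= cp <= 0x06F9:
            out.append(chr(cp - 0x06F0 + 0x0030))
        else:
            out.append(ch)
    return "".join(out)

def blocked_by_variant(candidate: str, zone) -> bool:
    # True if a label already in the zone is a digit variant of candidate.
    key = variant_key(candidate)
    return any(variant_key(existing) == key for existing in zone)

print(digits_homogeneous("\u0661\u06F2"))          # mixes two digit sets
print(blocked_by_variant("a\u06F1", {"a\u0661"}))  # same label, other digits
```

The context rule is checked per label; the variant check only makes 
sense against the set of labels already in the zone, which is why both 
are needed.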
>
> (iii) Combining marks are extremely problematic with Arabic
> script.  If combining marks are prohibited by the registry, then
> code points that are listed in the table as problematic because
> of possible combining sequences become non-problematic.  While,
> if I understand RFC 5564 and your comments correctly, simply
> prohibiting all of them is reasonable for the Arabic language, I
> note that the latest version of the ASIWG recommendations that I
> can find (June 2008,
> http://arabic-domains.org/docs/jun2010/IDNA200x_Arabic_Script_Disallowed_Codepoints.pdf)
> [1] allows a large number of combining characters.

I think this is now superseded by the Root Zone LGR effort for Arabic, 
which had a very qualified group look at this issue and they came away 
with recommending no combining marks to be allowed.
>
> (iv) If recommendations for domain name use (or identifier use
> more generally) have been developed by groups reflecting a broad
> range of expertise, including experts on the relevant
> linguistics and writing systems (not just, e.g.,
> name-marketing), for a particular script or language and those
> recommendations are very specific about what they cover and are
> applied carefully, they are most likely to be a better guide to
> what is appropriate and/or troublesome than a list, including
> this "troublesome" one, developed for general use.

Totally agreed, that's why the registry MUST be defined in a way that 
all content is attributed to existing recommendations (with some leeway 
to allow the registrar to update the list "by analogy" when new code 
points are added that the original authors couldn't have known about).
>   For the
> Arabic language written in Arabic script and a domain registry
> that does not allow labels based on other languages written in
> Arabic script, that implies that RFC 5564 is a better guide than
> this document, no matter what this one says.  By contrast, if
> one had a registry that was exclusively registering labels based
> on, e.g., Persian writing conventions, this document (perhaps in
> combination with the generic recommendations for Arabic script
> in the ASIWG work) would be more appropriate than trying to
> apply 5564 (if specific and carefully-developed recommendations
> for Persian existed, that would, of course, be better yet).
This one says what 5564 says, but it also says what later efforts by 
similar and well-qualified groups are saying.

The confusion came from the fact that the VIP report did not exclude 
combining marks from its analysis, not even the ones not recommended by 
RFC 5564. We now understand that that report simply was an intermediate 
step in the analysis, now completed by the Root Zone work done by 
TF-AIDN. There's a clear path forward to the next version of the ID, and 
then we can see whether it contains any further issues we should address.

A./
>
> Given an explanation of that type, the vast majority of the
> Arabic code points could either be removed from the Internet
> Draft or turned into cross-references to the explanation or more
> authoritative documents.
>
> Would that be a reasonable direction to take?
>
> best regards,
>      john
>
> [1] Were the ASIWG recommendations ever completed and formally
> approved by ESCWA and the relevant countries?   The table
> mentioned above is the latest I can find but it is dated even
> earlier than the 2009 meeting I attended, and the asiwg.org site
> to which several references point appears to now be in Chinese
> script.  An up-to-date reference, if it exists, would be
> appreciated.
>
> --On Sunday, July 30, 2017 08:37 +0000 "Abdulaziz H. Al-Zoman"
> <azoman@citc.gov.sa> wrote:
>
>> Good day Andrew,
>>
>> Thanks for your follow-up email to my feedback.
>>
>>
>> I do understand the goal of the internet draft as you've
>> stated:      "to create the conditions for guidance
>>        for operators and applications, so that
>>        it is possible for (for instance) my
>>        user agent to work globally, even when
>>        some parts of the global linguistic
>>        environment is foreign to me and when
>>        there are no in-protocol clues about
>>        the language I'm facing."
>>
>> but I'm worried by the inclusion of "basic" code points (22
>> out of 28 Arabic language alphabet) to the repository instead
>> of only the problematic code points (non-spacing marks). This
>> would make the Arabic language unusable (or troublesome) to
>> operators and developers. While, I would like to see some
>> encouragement to support Arabic IDNs rather than shun them
>> away from it.
>>
>> Therefore,  I would suggest that the registry includes only
>> the problematic sequence of code points that incorporate
>> non-spacing marks but not the basic (and essential)
>> characters.
>>
>> So I would suggest that the repository table looks like the
>> following table (for example) where: Column 1 represent the
>> problematic (sequence of) code points and Column 2 contains
>> the Reasons and Comments. The Column 1 should NOT contain a
>> basic code point alone without non-spacing mark causing the
>> problem.
>>
>> Column 1             | Column 2
>> ---------------------+------------------
>> 062F, 065C           | Identical in appearance to ...
>> ---------------------+------------------
>> 062F, 06EC           | Identical in appearance to ...
>> ---------------------+------------------
>> 0631, 06EC           | Identical in appearance to ...
>> ---------------------+------------------
>> 0633, 06DB           | Identical in appearance to ...
>>
>>
>> Yours,
>> Abdulaziz Al-Zoman
>>