Re: [I18n-discuss] Comments on "troublesome-characters" from Arabic script

Asmus Freytag <asmusf@ix.netcom.com> Fri, 28 July 2017 17:23 UTC

From: Asmus Freytag <asmusf@ix.netcom.com>
To: i18n-discuss@iab.org
Message-ID: <32d3bdab-dad3-f41a-6eb4-efdb1f3c7c93@ix.netcom.com>
Date: Fri, 28 Jul 2017 10:23:08 -0700
Archived-At: <https://mailarchive.ietf.org/arch/msg/i18n-discuss/FWlcQZzrDaQep2HneDUqS5FV6iA>
Subject: Re: [I18n-discuss] Comments on "troublesome-characters" from Arabic script

>
>       [I18n-discuss] Comments on "troublesome-characters" from Arabic
>       script
>
> Abdulaziz Al-Zoman <azoman@citc.gov.sa> Mon, 24 July 2017 04:42 UTC
>
> Dear Sir/Mam (i18n-discuss)
>
> Please find my comments concerning the draft-freytag-troublesome-characters-01.txt:
> “Those Troublesome Characters: A Registry of Unicode Code Points Needing Special
> Consideration When Used in Network Identifiers”
>
> As the aim of this Internet draft was to create a registry of code points that
> need special consideration when used as identifiers so that it would guide
> system administrators in setting parameters for allowable code points in an
> identifier system, and to aid applications in creating security aids for users.
> I have substantial concern on what code points should be part of this registry
> and hence how the recipients (software developer, registries, registrars, etc.)
> interpret the inclusion of a code point in the repository.

Dear Abdulaziz Al-Zoman,

The aim is indeed to identify code points that need "special considerations"
(read: code points that can be problematic if used indiscriminately in the
context of other code points).

This is not the same as suggesting that all the code points listed in the
registry need to be restricted. The vast majority of code points that ought
to be restricted outright are already DISALLOWED under IDNA 2008.

There are a handful of code points that should not be used because they have
been deprecated by Unicode (Unicode recommends that these code points not be
used for any purpose) but that were not DISALLOWED; these are an exception.

The majority of code points that are proposed for the registry are those that
happen to be identical in appearance to either another code point in the same
script, or a sequence of code points (usually a combining sequence).

If both code points, or both the code point and the matching combining
sequence, are included in the repertoire of eligible code points for a zone,
then labels can be registered that are identical in appearance but use
different code points, and thus are distinct but indistinguishable to any
user.

The registry attempts to list known examples of this, and it lists both sides
of each matching pair, without making any value judgment as to whether one or
the other is preferred.
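
To make "distinct, but indistinguishable" concrete, here is a minimal Python
sketch. It assumes that the cases of interest are those that Unicode
normalization (NFC) does not already unify. The function names and the choice
of example code points are mine, for illustration only; they are not taken
from the registry.

```python
import unicodedata


def describe(label):
    """List the code points of a label with their Unicode names."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch)}" for ch in label]


def distinct_after_nfc(a, b):
    """True if two labels remain different code point sequences after NFC."""
    return unicodedata.normalize("NFC", a) != unicodedata.normalize("NFC", b)


# A pair that normalization takes care of: ALEF WITH HAMZA ABOVE versus
# ALEF followed by COMBINING HAMZA ABOVE (canonically equivalent).
composed = "\u0623"
decomposed = "\u0627\u0654"

# A pair that normalization does not touch (illustrative choice): YEH and
# FARSI YEH, which share their initial and medial forms.
yeh = "\u064A"
farsi_yeh = "\u06CC"

print(distinct_after_nfc(composed, decomposed))
print(distinct_after_nfc(yeh, farsi_yeh))
print(describe(yeh), describe(farsi_yeh))
```

Only the second kind of pair needs attention from a zone's policy, because
nothing at the protocol level merges the two labels.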


>
> For example, the registry includes some essential characters (letters) that may
> result at the end useless identifiers if these characters are restricted or
> blocked (because they are part of the repository).   For instance, with
> respect to the Arabic language, the registry consists of a large portion of
> the Arabic basic alphabet that may result to a limited character set for creating
> identifiers (e.g., the Arabic language consists of 28 essential alphabet characters,
> while the repository suggests that out of these 28 characters there are 22
> troublesome-characters; i.e. more than 78% of the Arabic language characters
> can't be used if they are blocked).  This will cause impractical use of the
> Arabic language in network identifier. While the actual "troublesome-characters"
> are the non-spacing marks.

In the Arabic case, different groups at different times have come to the
conclusion that combining marks should not be used. The latest such group was
TF-AIDN, which concluded that the root zone should exclude all Arabic
combining marks.

If the policy for any zone follows this approach and excludes all combining
marks, then the remaining code points are no longer in danger of being
mistaken for a combining sequence or vice versa. We would say that the problem
that led to them being listed in the registry has been successfully mitigated.

However, if a zone decides to allow Arabic combining marks, then other steps
would be required to mitigate the problem. As you note, excluding the basic
letters of the Arabic alphabet is not a useful approach to mitigation.
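
As a rough illustration of the "exclude all combining marks" policy, a zone
could reject any label that still contains a mark after normalization. The
sketch below is an assumption on my part about how such a check might look;
it treats every code point whose General Category starts with "M" as a
combining mark, and the function names are hypothetical.

```python
import unicodedata


def combining_marks(label):
    """Return the combining marks left in a label after NFC normalization."""
    text = unicodedata.normalize("NFC", label)
    # General Category Mn, Mc and Me all start with "M".
    return [ch for ch in text if unicodedata.category(ch).startswith("M")]


def acceptable_without_marks(label):
    """True if the label is usable in a zone that excludes combining marks."""
    return not combining_marks(label)


plain = "\u0628\u0627\u0628"       # BEH ALEF BEH, no marks
with_fatha = "\u0628\u064E\u0627"  # BEH + FATHA + ALEF

print(acceptable_without_marks(plain))
print(acceptable_without_marks(with_fatha))
```

With such a rule in place, a basic letter can no longer collide with a
letter-plus-mark sequence, because the sequence can never be registered.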

>
> Therefore,  we would suggest that the registry includes only the problematic
> code points such as non-spacing marks but not the basic characters (they can
> be in the comment field or the reason for including such a non-spacing mark).

This would make a value judgment as to which are preferred; we felt that while
one can more or less objectively identify which code points and sequences have
an issue, such as identical appearance, there are no such objective grounds
for preferring one code point over another.

In the Arabic case, there are good reasons for the approach used in RFC 5564
and also in the Arabic Root Zone LGR. The registry already notes those
combining marks that RFC 5564 recommends for exclusion, and provides a
reference to the Root Zone LGR for the others. This language could be more
explicit to help users of the registry arrive at a better conclusion on what
form of mitigation to choose (without making an actual prescription).

However, even after excluding all combining marks, some code points will still
be in need of "special considerations" in Arabic. These include some of the
digits and some of the basic letters.

Again, for the general case, attempting to mitigate any issues by blanket
exclusion of all of these code points would severely affect the usability of
the set for purposes of network identifiers, and is therefore not practical.
Therefore, some other means should be investigated.

For the digits, for any zone, like the Root Zone, where it is not possible to
exclude one of the two sets, the goal ought to be to prevent the coexistence
of two labels that differ only in which code point is used for some digit, but
that look identical to the user. If one of these labels is delegated, the
other should be blocked. Making each pair of identical-appearing digits
*blocked* variants of each other would achieve this. (Using RFC 7940 this
policy can be expressed in a machine-readable way.)

Blocked variants thus implement "mutual exclusion" of identical-looking
identifiers; this mitigates the issue without having to apply an a priori
preference for one code point over another.
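
The mutual-exclusion idea can be sketched in a few lines of Python. This is
not RFC 7940 syntax and not any registry's actual implementation; it assumes,
purely for illustration, that every Extended Arabic-Indic digit
(U+06F0..U+06F9) is a blocked variant of the corresponding Arabic-Indic digit
(U+0660..U+0669). Which digit pairs really are identical in appearance is a
policy question for the zone. The names `index_label` and `Zone` are
hypothetical.

```python
# Assumed variant table: Extended Arabic-Indic digit -> Arabic-Indic digit.
# The direction of the mapping is arbitrary; it expresses no preference and
# is used only to compute a shared lookup key.
DIGIT_VARIANTS = {chr(0x06F0 + d): chr(0x0660 + d) for d in range(10)}


def index_label(label):
    """Map every code point to its variant-set representative."""
    return "".join(DIGIT_VARIANTS.get(ch, ch) for ch in label)


class Zone:
    """Toy zone in which variant labels block each other."""

    def __init__(self):
        self._delegated = {}  # index label -> label actually delegated

    def register(self, label):
        key = index_label(label)
        if key in self._delegated:
            return False  # blocked: this label or a variant is delegated
        self._delegated[key] = label
        return True


zone = Zone()
print(zone.register("\u0628\u0661"))  # BEH + ARABIC-INDIC DIGIT ONE
print(zone.register("\u0628\u06F1"))  # BEH + EXTENDED ARABIC-INDIC DIGIT ONE
```

Whichever of the two labels arrives first is delegated, and the other is then
refused.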

Some of the basic letters in Arabic share visual appearance, at least in some
positional forms. Labels that differ only in the use of an alternate letter
(but are indistinguishable because of the shared forms) could be blocked from
each other as discussed above.

However, this may be unsatisfactory, because of the barriers it creates for
some user communities in accessing network identifiers created by other
communities if these two communities happen to differ in their default choice
of which code point to use at that position in the word.

A possible solution might be to allow variants that are *allocatable*, so that
some entity is able to request both versions of the identifier, while all
others would be blocked from registering a look-alike label. While this may be
a suitable choice, it would be beyond the purposes of the proposed repository
to give any prescriptions.
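
For readers who want to see the difference between blocked and allocatable
variants, here is a small Python sketch under stated assumptions. It treats
FARSI YEH (U+06CC) and YEH (U+064A) as variants in every position, which is a
simplification, since the two only share some positional forms. The class
`Zone`, the method `request`, and the registrant names are hypothetical and
do not describe any registry's actual procedure.

```python
# Assumed variant table, for illustration only.
VARIANTS = {"\u06CC": "\u064A"}  # FARSI YEH -> YEH


def index_label(label):
    """Map every code point to its variant-set representative."""
    return "".join(VARIANTS.get(ch, ch) for ch in label)


class Zone:
    """Toy zone in which variant labels are allocatable to one registrant."""

    def __init__(self):
        self._owner = {}   # index label -> registrant holding the variant set
        self._labels = {}  # index label -> set of labels allocated so far

    def request(self, label, registrant):
        key = index_label(label)
        owner = self._owner.setdefault(key, registrant)
        if owner != registrant:
            return "blocked"  # someone else holds a look-alike label
        self._labels.setdefault(key, set()).add(label)
        return "allocated"


zone = Zone()
print(zone.request("\u064A\u0628", "registrant-a"))  # first request
print(zone.request("\u06CC\u0628", "registrant-a"))  # same holder, variant
print(zone.request("\u06CC\u0628", "registrant-b"))  # different holder
```

The only change from plain blocking is that the holder of one label may also
obtain its variants, so both communities can reach the same registrant.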

> As adding basic characters to the repository  may lead to preventing their use
>   in identifiers while the origin of the problem is due to the misuse of
> non-spacing marks. Thus, the repository should cover only the non-spacing
> marks indicating their risk-free usage, and their harmful usage which should be blocked.


Arabic is a bit of a special case in that excluding all combining marks is a
viable approach. In the Latin script, for example, it is not possible to
exclude all combining marks, because there are some languages that do not have
precomposed letters for all members of their alphabet.

In those cases, successful mitigation might consist of enumerating all
combining sequences known to be required for such alphabets (the list is not
very long) and then excluding all other combining sequences. (Using RFC 7940
this can be done in a machine-readable way.)
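
A minimal Python sketch of the "enumerate, then exclude the rest" approach
follows. The allow-list here contains a single sequence, LATIN SMALL LETTER
OPEN E followed by COMBINING TILDE, which has no precomposed form; it is my
own illustrative choice, and a real zone would have to compile its list from
the alphabets it means to support. The names `ALLOWED_SEQUENCES` and
`label_allowed` are hypothetical.

```python
import unicodedata

# Assumed allow-list of required combining sequences (illustration only).
ALLOWED_SEQUENCES = {"\u025B\u0303"}  # open e + combining tilde


def is_mark(ch):
    """True for code points with General Category Mn, Mc or Me."""
    return unicodedata.category(ch).startswith("M")


def label_allowed(label):
    """Accept a label only if every combining sequence is on the allow-list."""
    text = unicodedata.normalize("NFC", label)
    i = 0
    while i < len(text):
        if is_mark(text[i]):
            return False  # a mark with no base letter in front of it
        j = i + 1
        while j < len(text) and is_mark(text[j]):
            j += 1
        cluster = text[i:j]  # base letter plus any marks that follow it
        if len(cluster) > 1 and cluster not in ALLOWED_SEQUENCES:
            return False
        i = j
    return True


print(label_allowed("b\u025B\u0303"))  # uses the enumerated sequence
print(label_allowed("e\u0301"))        # NFC turns this into precomposed é
print(label_allowed("q\u0303"))        # q + tilde is not on the list
```

Precomposed letters pass untouched, so the allow-list only has to cover the
sequences for which Unicode offers no single code point.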

In conclusion, it appears to me that the concern is in how the facts are
presented and a possible lack of guidance on how to use the listed
information.

A./