Re: [I18n-discuss] Comments on "troublesome-characters" from Arabic script
Asmus Freytag <asmusf@ix.netcom.com> Fri, 28 July 2017 17:23 UTC
From: Asmus Freytag <asmusf@ix.netcom.com>
To: i18n-discuss@iab.org
Message-ID: <32d3bdab-dad3-f41a-6eb4-efdb1f3c7c93@ix.netcom.com>
Date: Fri, 28 Jul 2017 10:23:08 -0700
Archived-At: <https://mailarchive.ietf.org/arch/msg/i18n-discuss/FWlcQZzrDaQep2HneDUqS5FV6iA>
Subject: Re: [I18n-discuss] Comments on "troublesome-characters" from Arabic script

> [I18n-discuss] Comments on "troublesome-characters" from Arabic script
>
> Abdulaziz Al-Zoman <azoman@citc.gov.sa> Mon, 24 July 2017 04:42 UTC
>
> Dear Sir/Mam (i18n-discuss)
>
> Please find my comments concerning the draft-freytag-troublesome-characters-01.txt:
> “Those Troublesome Characters: A Registry of Unicode Code Points Needing Special
> Consideration When Used in Network Identifiers”
>
> As the aim of this Internet draft was to create a registry of code points that
> need special consideration when used as identifiers so that it would guide
> system administrators in setting parameters for allowable code points in an
> identifier system, and to aid applications in creating security aids for users.
> I have substantial concern on what code points should be part of this registry
> and hence how the recipients (software developer, registries, registrars, etc.)
> interpret the inclusion of a code point in the repository.

Dear Abdulaziz Al-Zoman,

The aim is indeed to identify code points that need "special considerations" (read: if used indiscriminately in the context of other code points can be problematic).

This is not the same as suggesting that all the code points listed in the registry need to be restricted. The vast majority of code points that ought to be restricted outright is already DISALLOWED under IDNA 2008. There are a handful of code points that should not be used because they have been deprecated by Unicode (Unicode recommends that these code points not be used for any purpose) but were not DISALLOWED, but these are an exception.

The majority of code points that are proposed for the registry are those that happen to be identical in appearance to either another code point in the same script, or a sequence of code points (usually a combining sequence).

If both code points, or both the code point and the matching combining sequence are included in the repertoire of eligible code points for a zone, then labels can be registered that are identical in appearance but use different code points, and thus are distinct, but indistinguishable by any user.

The registry attempts to list known examples of this, and it lists both sides of each matching pair, not making any value judgments as to whether one or the other is preferred.

> For example, the registry includes some essential characters (letters) that may
> result at the end useless identifiers if these characters are restricted or
> blocked (because they are part of the repository). For instance, with
> respect to the Arabic language, the registry consists of a large portion of
> the Arabic basic alphabet that may result to a limited character set for creating
> identifiers (e.g., the Arabic language consists of 28 essential alphabet characters,
> while the repository suggests that out of these 28 characters there are 22
> troublesome-characters; i.e. more than 78% of the Arabic language characters
> can't be used if they are blocked). This will cause impractical use of the
> Arabic language in network identifier. While the actual "troublesome-characters"
> are the non-spacing marks.

In the Arabic case, different groups at different times have come to the conclusion that combining marks should not be used. The latest such group was TF-AIDN, who concluded that the root zone should exclude all Arabic combining marks.

If the policy for any zone follows this approach and excludes all combining marks, then the remaining code points are no longer in danger of being misidentified for a combining sequence or vice versa. We would say that the problem that led to them being listed in the registry has been successfully mitigated.

However, if a zone decides to allow Arabic combining marks, then other steps would be required to mitigate the problem.
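
As an illustration of the "exclude all combining marks" policy described above, here is a minimal Python sketch. It is not taken from the draft or from any LGR; the function name `eligible_without_marks` is hypothetical, and the check assumes that "combining mark" means any code point whose Unicode general category starts with "M". Labels are normalized to NFC first, since IDNA 2008 expects labels in that form.

```python
import unicodedata


def eligible_without_marks(label: str) -> bool:
    """Return True if the NFC form of the label contains no combining marks.

    Assumption for this sketch: a combining mark is any code point whose
    general category is Mn, Mc, or Me.
    """
    normalized = unicodedata.normalize("NFC", label)
    return not any(
        unicodedata.category(ch).startswith("M") for ch in normalized
    )


if __name__ == "__main__":
    plain = "\u0643\u062A\u0628"             # KAF, TEH, BEH: letters only
    with_fatha = "\u0643\u064E\u062A\u0628"  # same, plus U+064E ARABIC FATHA
    print(eligible_without_marks(plain))
    print(eligible_without_marks(with_fatha))
```

Under such a policy a basic letter stays usable, because the sequence it could be confused with can never appear in a label of that zone.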

As you note, excluding the basic letters of the Arabic alphabet is not a useful approach to mitigation.

> Therefore, we would suggest that the registry includes only the problematic
> code points such as non-spacing marks but not the basic characters (they can
> be in the comment field or the reason for including such a non-spacing mark).

This would make a value judgment as to which are preferred; we felt that while one can more or less objectively identify which code points and sequences have an issue, such as identical appearance, there are no such objective grounds for the preference of one code point over another.

In the Arabic case, there are good reasons for the approach used in RFC 5564 and also in the Arabic Root Zone LGR. The registry already notes those combining marks that RFC 5564 recommends for exclusion, and provides a reference to the Root Zone LGR for the others. This language could be more explicit to help users of the registry arrive at a better conclusion on what form of mitigation to choose (without making an actual prescription).

However, even after excluding all combining marks, some code points will still be in need of "special considerations" in Arabic. These include some of the digits and some of the basic letters.

Again, for the general case, attempting to mitigate any issues by blanket exclusion of all of these code points would severely affect the usability of the set for purposes of network identifiers, and is therefore not practical. Therefore, some other means should be investigated.

For the digits, for any zone, like the Root Zone, where it is not possible to exclude one of the two sets, the goal ought to be to prevent two labels that differ only in which code point is used for some digit, but that look identical to the user. If one of these labels is delegated, the other should be blocked. Making each pair of identical-appearing digits *blocked* variants of each other would achieve this.
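
The mutual-exclusion idea for digits can be sketched as follows. This is only an illustration in Python, not an RFC 7940 LGR: the class `Zone`, the function `variant_key`, and the specific choice of which Arabic-Indic (U+0660..U+0669) and Extended Arabic-Indic (U+06F0..U+06F9) digits count as identical in appearance are assumptions made for the example, not data from the registry.

```python
# Hypothetical set of digit values whose two encodings are treated as
# identical in appearance for this sketch.
IDENTICAL_DIGIT_VALUES = (0, 1, 2, 3, 7, 8, 9)

# Internal index mapping only. It folds one member of each pair onto the
# other so that both produce the same lookup key; it expresses no
# preference, because either member may be registered first.
FOLD = {chr(0x06F0 + v): chr(0x0660 + v) for v in IDENTICAL_DIGIT_VALUES}


def variant_key(label: str) -> str:
    """Return a key that is equal for labels differing only by variants."""
    return "".join(FOLD.get(ch, ch) for ch in label)


class Zone:
    """Toy registry in which variant labels block each other."""

    def __init__(self) -> None:
        self._delegated: dict[str, str] = {}  # variant key -> delegated label

    def register(self, label: str) -> bool:
        """Delegate the label unless it or a blocked variant already exists."""
        key = variant_key(label)
        if key in self._delegated:
            return False
        self._delegated[key] = label
        return True


if __name__ == "__main__":
    zone = Zone()
    print(zone.register("\u0628\u0661"))  # BEH + ARABIC-INDIC DIGIT ONE
    print(zone.register("\u0628\u06F1"))  # BEH + EXTENDED ARABIC-INDIC DIGIT ONE
```

Whichever of the two labels is requested first is delegated; the second request is refused, which is the "blocked variant" behaviour described above.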

(Using RFC 7940 this policy can be expressed in a machine-readable way).

Blocked variants thus implement "mutual exclusion" of identical-looking identifiers; this mitigates the issue without having to apply an a-priori preference for one code point over another.

Some of the basic letters in Arabic share visual appearance, at least in some positional forms. Identical-looking labels that differ only in using an alternate letter (but are indistinguishable because of the shared forms) could be blocked from each other as discussed above.

However, this may be unsatisfactory, because of the barriers it creates for some user communities in accessing network identifiers created by other communities, if these two communities happen to differ in their default choice of which code point to use at that position in the word. A possible solution might be to allow variants that are *allocatable*, so that some entity is able to request both versions of the identifier, while all others would be blocked from registering a look-alike label.

While this may be a suitable choice, it would be beyond the purposes of the proposed repository to give any prescriptions.

> As adding basic characters to the repository may lead to preventing their use
> in identifiers while the origin of the problem is due to the misuse of
> non-spacing marks. Thus, the repository should cover only the non-spacing
> marks indicating their risk-free usage, and their harmful usage which should be blocked.

Arabic is a bit of a special case in that excluding all combining marks is a viable approach. In the Latin script, for example, it is not possible to exclude all combining marks, because there are some languages that do not have precomposed letters for all members of their alphabet.

In those cases, successful mitigation might consist of enumerating all combining sequences known to be required for such alphabets (the list is not very long) and then excluding all other combining sequences.
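
The enumeration approach for Latin might look roughly like this in Python. This is a sketch under stated assumptions: the names `ALLOWED_SEQUENCES`, `clusters`, and `uses_only_allowed_sequences` are hypothetical, and the two sequences listed (m with combining cedilla, n with combining macron, neither of which has a precomposed form) are placeholders; a real list would come from the repertoire work for the zone in question.

```python
import unicodedata

# Placeholder allow-list: combining sequences assumed to be required by
# some alphabet. Not taken from the registry or from any LGR.
ALLOWED_SEQUENCES = {"m\u0327", "n\u0304"}


def is_mark(ch: str) -> bool:
    """True for code points with general category Mn, Mc, or Me."""
    return unicodedata.category(ch).startswith("M")


def clusters(label: str) -> list[str]:
    """Split the NFC form of a label into base + trailing-marks clusters."""
    out: list[str] = []
    for ch in unicodedata.normalize("NFC", label):
        if is_mark(ch) and out:
            out[-1] += ch       # attach the mark to the preceding cluster
        else:
            out.append(ch)      # start a new cluster
    return out


def uses_only_allowed_sequences(label: str) -> bool:
    """Reject any cluster that contains a mark but is not on the allow-list."""
    for cluster in clusters(label):
        if any(is_mark(ch) for ch in cluster) and cluster not in ALLOWED_SEQUENCES:
            return False
    return True


if __name__ == "__main__":
    print(uses_only_allowed_sequences("m\u0327ajel"))  # enumerated sequence
    print(uses_only_allowed_sequences("q\u0301"))      # sequence not enumerated
```

Because the label is normalized to NFC first, a sequence that has a precomposed form (such as e followed by a combining acute) collapses to a single letter and is judged by the zone's ordinary repertoire rules, not by this check.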

(Using RFC 7940 this can be done in a machine-readable way).

In conclusion, it appears to me that the concern is in how the facts are presented and a possible lack of guidance on how to use the listed information.

A./