Re: [I18nrp] Last Call: <draft-faltstrom-unicode11-05.txt> (IDNA2008 and Unicode 11.0.0) to Informational RFC

Asmus Freytag <asmusf@ix.netcom.com> Tue, 04 December 2018 18:48 UTC

To: i18nrp@ietf.org
References: <154385119878.18333.5085298134102919486.idtracker@ietfa.amsl.com> <FF6F9EB9-C73B-4EC0-AC4F-3E3BFBABA0AB@vpnc.org> <8E20D432-01B0-4B52-80BB-3348C5FE73AF@vpnc.org> <A5B69D318689A6515CCB4883@PSB> <E5A7B829-FE59-4649-AE36-2918E910A667@frobbit.se> <95EB3D3321F6E91A22C9FE3B@PSB>
From: Asmus Freytag <asmusf@ix.netcom.com>
Message-ID: <db1db0c8-2b05-1ba9-8a15-71de3713fc14@ix.netcom.com>
Date: Tue, 04 Dec 2018 10:48:45 -0800
Archived-At: <https://mailarchive.ietf.org/arch/msg/i18nrp/e7f6WeG0HMN_uD1Tliz6NfFciLk>

On 12/4/2018 2:33 AM, John C Klensin wrote:
> Incidentally, if Asmus's comments about the insufficiency of the
> CONTEXTo model for a range of problems is correct, we should be
> looking at whether IDNA2008 is in need of changes/ updating.
> Those changes might turn out to be substantive or as minor as
> making it much more clear that what IDNA2008 permits is a
> superset of what should be allowed in a particular zone and who
> has responsibility (and accountability) for selecting a
> zone-specific subset (aided by whatever hints we can provide).
> It seems clear to me that we should not process this document
> until we have answers to, or at least much more clarity about,
> that issue.  YMMD.

Complex scripts (such as Indic and related scripts) are conceptually 
written as a stream of syllables. However, what is encoded is the set of 
elements that make up syllables. This leads to a well-formedness 
problem: it is possible to type sequences that are not well-formed 
syllables.

There are two kinds: syllables that "can't happen" given the phonetics 
of a given language, and syllables that are structurally unsound (violate 
the principles of the writing system).

Both can be problematic. The latter, because unsound sequences can look 
exactly like correct sequences, or, alternatively, because unsound 
sequences are not supported by renderers and fonts and lead to 
unpredictable display.
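
A minimal sketch of why such look-alike sequences are a problem, assuming 
the Devanagari letter AA (U+0906) and the sequence A + vowel sign AA 
(U+0905 U+093E) as the example pair. Unicode normalization does not make 
the two compare equal, so a registry cannot rely on NFC to catch the 
unsound spelling. The helper name is hypothetical; the KA + nukta pair is 
included only as a control that does normalize.

```python
import unicodedata

# Example pair (an assumption chosen for illustration):
LETTER_AA = "\u0906"             # DEVANAGARI LETTER AA
A_PLUS_SIGN_AA = "\u0905\u093E"  # LETTER A + VOWEL SIGN AA

# Control pair: U+0958 has a canonical decomposition to KA + NUKTA.
LETTER_QA = "\u0958"
KA_PLUS_NUKTA = "\u0915\u093C"


def same_after_normalization(a: str, b: str, form: str = "NFC") -> bool:
    """Return True if the two strings are identical after normalization."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


if __name__ == "__main__":
    for form in ("NFC", "NFKC"):
        print(form, same_after_normalization(LETTER_AA, A_PLUS_SIGN_AA, form))
    print("control", same_after_normalization(LETTER_QA, KA_PLUS_NUKTA))
```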

The "can't happen" sequences may cause issues for readers; the best 
analogy is that they are mentally processed like a nonsense letter (such 
as the ones invented by Dr. Seuss in "On Beyond Zebra"), instead of like 
a nonsense "word", as an arbitrary sequence of Latin letters would be.

Structurally unsound sequences should be prohibited by policy, while 
some or all of the phonetically impossible sequences may be worth 
prohibiting for added security, but the latter are language dependent. 
(Some languages may differ in what constitutes a structurally sound 
syllable as well, an issue we encountered in the Thai script, where 
renderers designed for the Thai language will not handle some of the 
other languages).

Because of the (potential/actual) language dependency, many of these 
constraints cannot be embedded in a protocol layer such as IDNA2008.

There is a subset of unsound sequences that Unicode calls out as having a 
"do not use" status independent of language (see Chapter 12 of the 
standard for examples). These could be captured with CONTEXTO (or 
similar) constraints in an updated IDNA2008.

However, they are always a subset of the "real" set of unsound 
sequences; any registry doing a good job for complex scripts would 
therefore find that implementing the CONTEXTO constraints becomes 
redundant. (Also, unlike rules expressed in the RFC 7940 format, the 
CONTEXTO rules are not machine readable and therefore not trivial to 
implement and verify.)

Fundamentally, IDNA2008 focuses primarily on repertoire, the few 
CONTEXTx rules notwithstanding. Changing that model would be a major 
undertaking, and it is questionable whether the results would be 
beneficial.

Perhaps some other tack could be taken: in your draft, it could be 
pointed out that Unicode assigns "do not use" to some sequences (which 
means that there is no text, identifier or not, that should contain 
them). Preventing the registration of such sequences could be declared 
part of the responsibility of registries under RFC 5891.

This approach is more flexible because it allows registries to adopt 
context rules that are a superset of the rules needed to exclude these 
particular sequences, which is precisely what we are doing in the Root 
Zone. If you go to http://icann.org/idn and look for the set of RZ-LGR 
proposals, you will find that the context rules for the various Indic 
scripts do not explicitly call out these "do not use" sequences, but 
that they adopt context rules on some of the constituent code points 
that also disallow those sequences, while implementing a more general 
constraint appropriate to the writing system.

For an example, see section 7 of 
https://www.icann.org/en/system/files/files/proposal-devanagari-lgr-27jul18-en.pdf

This shows the more generalized rules for Devanagari, but also where 
they had to be tweaked to allow for both Hindi and other languages, e.g. 
Santali.

For comparison, see, for example, Table 12-1 in 
http://www.unicode.org/versions/latest/ch12.pdf

You will find that Table 12-1 is covered by the rule:

     3. M: must be preceded by C or CN

     (Read: "Dependent vowel signs (matras) must follow a consonant
      with or without an optional Nukta").

This rule prevents dependent vowel signs from being combined with 
independent vowel letters, which is what the list in Table 12-1 
enumerates. However, the RZ-LGR rule also covers many other contexts 
that are inappropriate (unsound) for matras; some of the latter 
constraints may not be as universal and, unlike the prohibition of 
Table 12-1, would be out of scope for a protocol.
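
To make the rule concrete, here is a rough sketch of how a registry-side 
check for "M: must be preceded by C or CN" might look. The class 
definitions are assumptions made for illustration (simple code point 
ranges for Devanagari consonants and matras), not the actual RZ-LGR 
repertoire, and the function name is hypothetical.

```python
# Illustrative class definitions (assumed, not the RZ-LGR repertoire):
CONSONANTS = {chr(cp) for cp in range(0x0915, 0x093A)}  # C: KA..HA
NUKTA = "\u093C"                                        # N
MATRAS = {chr(cp) for cp in range(0x093E, 0x094D)}      # M: signs AA..AU


def violates_matra_rule(label: str) -> bool:
    """Return True if any matra is not preceded by C or by C + Nukta."""
    for i, ch in enumerate(label):
        if ch not in MATRAS:
            continue
        prev = label[i - 1] if i >= 1 else ""
        prev2 = label[i - 2] if i >= 2 else ""
        preceded_by_c = prev in CONSONANTS
        preceded_by_cn = prev == NUKTA and prev2 in CONSONANTS
        if not (preceded_by_c or preceded_by_cn):
            return True
    return False


if __name__ == "__main__":
    samples = {
        "KA + sign AA": "\u0915\u093E",
        "KA + nukta + sign AA": "\u0915\u093C\u093E",
        "letter A + sign AA": "\u0905\u093E",  # independent vowel + matra
        "sign AA alone": "\u093E",
    }
    for name, text in samples.items():
        print(name, "->", "rejected" if violates_matra_rule(text) else "ok")
```

Note that the same check rejects an independent vowel followed by a 
matra and a matra in any other unsupported position, without listing 
individual sequences.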

The Root Zone LGR uses RFC 7940 to translate this rule into a single 
line of XML, which can be machine processed.

A./