Re: [Idna-update] IDNA and combining sequences

"Asmus Freytag (c)" <asmusf@ix.netcom.com> Sat, 10 March 2018 18:08 UTC

To: John C Klensin <john-ietf@jck.com>, =?UTF-8?B?UGF0cmlrIEbDpGx0c3Ryw7Zt?= <paf@frobbit.se>
Cc: idna-update@ietf.org
References: <C4FBCF12821031786F472AA2@PSB> <02c29140-29f1-cc81-8c4f-8249d0f23b2c@ix.netcom.com> <1E562CDE39B4224F227E765D@PSB> <516E58F3-015D-4AD7-A3FD-0749A6890245@frobbit.se> <D26CE952D968BBEC0AB96A76@PSB>
From: "Asmus Freytag (c)" <asmusf@ix.netcom.com>
Message-ID: <70445d5e-6294-26f5-50b8-cb9ae7345c56@ix.netcom.com>
Date: Sat, 10 Mar 2018 10:08:08 -0800
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:52.0) Gecko/20100101 Thunderbird/52.6.0
MIME-Version: 1.0
In-Reply-To: <D26CE952D968BBEC0AB96A76@PSB>
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 8bit
Content-Language: en-US
Archived-At: <https://mailarchive.ietf.org/arch/msg/idna-update/2eDu2dj114IOYfsU8x4Vt4QxRCo>
Subject: Re: [Idna-update] IDNA and combining sequences

Patrik asked where we go from here.

Let me propose something, but first some response to John.

On 3/10/2018 5:51 AM, John C Klensin wrote:
>
> --On Saturday, March 10, 2018 09:06 -0400 Patrik Fältström
> <paf@frobbit.se> wrote:
>
>> I think we should do some scenario planning here. Remember
>> that we do not have a world where IDNA2008 based on Unicode
>> 6.x is what people use. People use all different kind of mix
>> between IDNA2008, IDNA2003 and Unicode versions. I have myself
>> worked with the curl library (that uses libidn) and to be
>> honest, I do not think people KNOW what they use. Or they
>> know, and they know they violate the rules. For the contracted
>> parties, they have the LGR coming down the road anyways, so...

Correct, and it doesn't stop there: some parties are off the reservation 
doing emoji...

>>
>> And *if* we go down this path, is it "enough" to do in the LGR
>> (i.e. ICANN) or should IETF do some adoption (and W3C?), or
>> should IETF say "we can move forward in a more safe way AS WE
>> KNOW ICANN DO LGR"...
> As I said at far more length in another note, the LGR is (at
> least by its charter/ Procedure) designed for, and applicable
> only to, the root.  While those code points would probably be
> safe to use in any zone, trying to apply them globally would be
> unduly restrictive, would violate the "administratively
> distributed hierarchy" principle, and, IMO, just wouldn't fly..
> The latter would probably quickly reach the point that attempts
> to apply the LGR to labels at the second level and beyond would
> encourage non-compliance and daring ICANN to do anything about
> it.

Totally agree. There's no benefit in "imposing" the RZ-LGR on other zones.

However, the various script LGRs can serve both as examples and as starting
points. They do contain (or are about to contain) worked examples of
repertoire, context rules, and/or variant rules applicable to modern users
of each script. They are also heavily documented, so that anybody can
understand the "why" behind their design. That makes them useful as
a starting point as well, to which you can add features as needed. One
thing almost anyone will want to add outside the root zone is digits
(and the hyphen).

That's why I have consistently argued that the RZ-LGR is a useful example
and starting point.
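
As a rough illustration of "starting point plus digits and hyphen", here is
a minimal sketch. BASE_REPERTOIRE is a hypothetical stand-in (plain ASCII
letters), not actual RZ-LGR data, and the function names are made up for
this example. The one hyphen rule shown, no leading or trailing hyphen, is
an ordinary label rule, not something taken from the RZ-LGR.

```python
import string

# Hypothetical stand-in for a script repertoire copied from a root-zone LGR.
# The real repertoires are per-script and far larger; this is illustration only.
BASE_REPERTOIRE = frozenset(string.ascii_lowercase)


def extend_for_lower_zone(base):
    """Return the base repertoire plus ASCII digits and the hyphen."""
    return frozenset(base) | frozenset(string.digits) | {"-"}


def label_in_repertoire(label, repertoire):
    """Check that every code point of the label is in the repertoire.

    Also rejects empty labels and labels that start or end with a hyphen.
    """
    if not label:
        return False
    if label.startswith("-") or label.endswith("-"):
        return False
    return all(ch in repertoire for ch in label)


zone_repertoire = extend_for_lower_zone(BASE_REPERTOIRE)
print(label_in_repertoire("shop-24", BASE_REPERTOIRE))  # root-style repertoire
print(label_in_repertoire("shop-24", zone_repertoire))  # extended repertoire
```

A real zone would layer context rules and variant rules on top of this
repertoire check; the sketch covers only the "add digits and hyphen" step.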

>
> As an example of the overly restrictive part, I assume that the
> LGR rules, like the ccTLD Fast Track Procedures, would prohibit
> the use of "ур" (U+0443 U+0440, or other permutations of those
> characters) in the root on the grounds of either intrinsic
> confusability or the potential conflict with the 3166-1 alpha-2
> list.

The LGR does not have to "prohibit" these. Instead, they are recognized as
so-called cross-script variants of the Latin letter.

Meaning, if you have a label containing Latin "yp" in the root, then you
do not get to use Cyrillic "ур" in some other label, if the two would
otherwise be (or look) the same.

However, you remain free to use Cyrillic "ур" in any label that does not
collide with some Latin label, either because no homograph label exists,
or because yours has some specific Cyrillic code point in it that would
distinguish it.

The ability to use the variant mechanism for such cases is the main
difference between using LGRs (aka idn tables) and relying on the
protocol to enforce restrictions.

The cross-script variant mechanism is a more appropriate method to address
issues like that, because it doesn't blindly ban perfectly acceptable uses
of code points, but instead resolves any conflicts in favor of the first
mover.
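
The first-mover behavior of blocked cross-script variants can be sketched
roughly as follows. The VARIANTS table is a tiny hypothetical sample of
Cyrillic/Latin look-alikes, not the actual Root Zone LGR data, and the Zone
class and index_label function are invented for this illustration. Real
LGRs express the same idea declaratively and handle many more cases.

```python
# Hypothetical sample of cross-script variants (Cyrillic -> Latin look-alike).
# Illustration only; this is NOT the actual Root Zone LGR variant set.
VARIANTS = {
    "\u0443": "y",  # CYRILLIC SMALL LETTER U
    "\u0440": "p",  # CYRILLIC SMALL LETTER ER
    "\u0430": "a",  # CYRILLIC SMALL LETTER A
    "\u0435": "e",  # CYRILLIC SMALL LETTER IE
    "\u043e": "o",  # CYRILLIC SMALL LETTER O
    "\u0441": "c",  # CYRILLIC SMALL LETTER ES
    "\u0445": "x",  # CYRILLIC SMALL LETTER HA
}


def index_label(label):
    """Map each code point to one representative of its variant set."""
    return "".join(VARIANTS.get(ch, ch) for ch in label)


class Zone:
    """Toy registry: a label is blocked if a variant label was delegated first."""

    def __init__(self):
        self._by_index = {}

    def register(self, label):
        key = index_label(label)
        if key in self._by_index:
            return False  # a variant (or the same label) already holds this slot
        self._by_index[key] = label
        return True


zone = Zone()
print(zone.register("yp"))                  # Latin label, first mover
print(zone.register("\u0443\u0440"))        # Cyrillic homograph of "yp"
print(zone.register("\u0443\u0440\u0431"))  # Cyrillic label with a distinguishing letter
```

No code point is banned outright here: the Cyrillic letters stay usable in
every label that does not collide with an already delegated one, and had the
Cyrillic label been registered first, the Latin one would be blocked instead.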

This reflects linguistic reality - in most cases, cross-script labels that
are homographs would look more than a bit "contrived", such as the licence
plate with the Russian expletive AXY HEXO that some Unicoder had for
a while - it makes no sense in English, and it passed the DMV.

At one point, we did some research in one of the scripts and found
that defining even a substantial number of such variants had little
impact on blocking legitimate words. Had a similar number of code
points been banned, the utility of the IDN labels would have been
seriously compromised.

For that reason alone, simply banning code points is more restrictive
than necessary.

> However, if we accept the logic that justifies IDN TLDs,
> that sequence would be acceptable in a subtree of a domain
> whose TLD label was in Cyrillic and that published and applied a
> policy that its subdomains were entirely Cyrillic.

There are other RZ restrictions (no digits, no hyphen) that aren't
appropriate for zones lower down the tree.

And there are a few cases where it wasn't possible to cater to
all languages in a given script simultaneously, as some had
conflicting conventions (for complex scripts).

Such cases, while rare, are another good reason to use the Root Zone
as a starting point (and not the end-point) of LGR design for other Zones.

However, the audience of 90%+ of existing zones would have been served
equally well had their idn tables and policies been based on nothing more
than the RZ-LGR plus digits and hyphen. The stuff that the Root Zone
restricts is very often simply not in use.


>
>> Or...
> I think the above puts us well into the "or..." range.
>

I think we've come to that conclusion before - and we keep revisiting
ground that I thought had somewhat settled. We had been working
on a two-prong approach, of which your ID was one of the prongs:

1) to reiterate that the "raw" protocol doesn't by itself design
useful and secure LGRs (aka. idn tables/policies), but requires
conscious selection (and other decisions).

2) to give guidance as to what the issues are and how to address
them.

Now, we thought that the best approach for the second prong was
to create a registry of "troublesome" code points.

That turned out to be more challenging than anticipated. Not
because it is difficult to come up with a list. That's actually not
as difficult as it looks - for modern scripts.

No, the issue is that the code points, like your example of "yp",
are not problematic in isolation. What then should get listed? If we list
just the Cyrillic code points, then people creating a Cyrillic zone
will complain loudly that the entry is unjustified. If we list both the
Latin and the Cyrillic code point, on the basis that their interaction
is problematic, we get that reaction from two groups.

However, in the RZ effort, both groups were perfectly happy
defining mutually blocked variants - in fact, we had to stop
them from going overboard. . . .

Where do we need to go?

We need some recommendation that goes beyond "know what
you are doing" (while true, it's too vague to be helpful). We
need a set of recommendations that is more or less like the
example I wrote down for the combining marks. Something
that's specific enough that people can act on it, but also can
be adjusted to fit circumstances.

In writing recommendations for (more) "secure" IDNs, it's
important to make a distinction between modern-use scripts
and the vast number of historical scripts and technical (phonetic)
notations. The latter are really not well suited to secure use
in public zones. (The protocol must cater for non-public zones
as well).

Now, one option would be to create something like a "profile"
on IDNA2008 for "(more) secure IDNs for public zones."

Doing that would sidestep the issue of backwards compatibility
because you could be more restrictive; however, any enumerated
restrictions would still run into the same issues of changes in
usage (spelling reforms) or changes in encoding.

However, qualitative descriptions wouldn't suffer as much
from that issue; I think they would add value, whether
as recommendation or as "profile".

A./