Re: [Idna-update] Expiration impending: <draft-klensin-idna-rfc5891bis-01.txt>

Asmus Freytag <asmusf@ix.netcom.com> Thu, 08 March 2018 20:02 UTC

To: idna-update@ietf.org
References: <C4FBCF12821031786F472AA2@PSB>
From: Asmus Freytag <asmusf@ix.netcom.com>
Message-ID: <02c29140-29f1-cc81-8c4f-8249d0f23b2c@ix.netcom.com>
Date: Thu, 8 Mar 2018 12:02:52 -0800
In-Reply-To: <C4FBCF12821031786F472AA2@PSB>
Archived-At: <https://mailarchive.ietf.org/arch/msg/idna-update/M1d0fBC7wGu5vHLoNMudUCQ8SOc>
Subject: Re: [Idna-update] Expiration impending: <draft-klensin-idna-rfc5891bis-01.txt>

On 3/8/2018 9:11 AM, John C Klensin wrote:
> One more thought about something it may be helpful to remind
> ourselves periodically.
>
> One of the notions floated when IDNs were first being discussed
> and raised a few times after that was to simply ban combining
> forms of all types.  The theory was that there was no
> entitlement to write any character form (e.g., any "word") in
> DNS labels, that those labels were all about mnemonics and
> nothing but mnemonics, and that a "no combining sequences" rule
> would eliminate most issues about normalization and grapheme
> clusters and make everything a lot easier to explain to
> implementers and others who wanted to conform but were not
> willing to make the investment to actually understand all of the
> complex issues and rules with which we are now dealing.

It would also very nicely prevent IDNs on the entire Indian Subcontinent.

Ironically, it is Arabic where most (if not all) of the combining marks can
safely be excluded from IDNs. This is being done for the Root Zone, for
example, as you acknowledge below.

The reason is that in Arabic, these marks are generally optional and/or
used for specific purposes in specialized text. Therefore, leaving them
out is not detrimental to the usability of IDNs and the Root Zone will
not allow them (extending the set of prohibited ones from RFC 5564).

Therefore, in the Root Zone the Unicode 7.0 addition of U+08A1 would
not be an issue.
>
> If that were the rule and someone really, really, wanted a
> grapheme that could only be formed in Unicode with a combining
> sequence, it would be up to them to convince the Unicode
> Consortium that their favorite character (grapheme) needed to be
> added to Unicode as a single code point.   However hard that
> might be, it would not be our problem.

Well, we see what happens if someone does get Unicode to add a
precomposed form: the entire process is derailed. That's what happened
with the addition of U+08A1 even though it is NOT the case that this is
a true "precomposed" form - while the same graphical elements are
involved, the result does not look identical. There is a strong similarity
of course, but despite what people read into the Unicode character
name, this is not a case of an exact homoglyph.
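
The lack of equivalence can be checked against the Unicode Character
Database. The following is a minimal sketch using Python's standard
unicodedata module, assuming an interpreter that ships Unicode 7.0 data or
later. The comparison with U+0623, which does have a canonical
decomposition, is added here only for contrast; the helper name is
illustrative.

```python
import unicodedata

BEH_HAMZA = "\u08A1"             # ARABIC LETTER BEH WITH HAMZA ABOVE
BEH_PLUS_HAMZA = "\u0628\u0654"  # ARABIC LETTER BEH + ARABIC HAMZA ABOVE
ALEF_HAMZA = "\u0623"            # ARABIC LETTER ALEF WITH HAMZA ABOVE
ALEF_PLUS_HAMZA = "\u0627\u0654" # ARABIC LETTER ALEF + ARABIC HAMZA ABOVE


def canonically_equivalent(a, b):
    """True if the two strings are identical after NFC normalization."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


# U+08A1 has no decomposition mapping, so normalization keeps it distinct
# from the BEH + HAMZA ABOVE sequence.
beh_equivalent = canonically_equivalent(BEH_HAMZA, BEH_PLUS_HAMZA)

# U+0623 decomposes to ALEF + HAMZA ABOVE, so the two spellings normalize
# to the same string.
alef_equivalent = canonically_equivalent(ALEF_HAMZA, ALEF_PLUS_HAMZA)
```

Normalization only tests encoding equivalence; it says nothing about how
similar the two forms look in a given font.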
>
> FWIW, a "nothing but precombined characters" rule is essentially
> the recommendation for Arabic IDNs in RFC 5564 and, I
> understand, in the emerging Arabic script rules for the root
> zone.

That is because of the way these function in Arabic.

Unicode could not generally add precomposed forms of, say, Latin code
points, say a new letter with a dot above, because of normalization
stability. In the Latin script, a combining dot above and a precomposed
dot above are identical.

However, even there you have a number of combining marks that are
not considered as part of possible decompositions: they are the code
points for various (stroke) overlays and some attached extenders.

Like the Arabic combining marks, they could (and should) be disallowed
from LGRs. (The Latin LGR for the Root Zone will not allow combining
marks other than in enumerated combinations - that's something that
works for Latin, Greek, Cyrillic and practically all scripts that are not
South or South East Asian complex scripts.)
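
The contrast in Latin between a mark that takes part in decompositions (dot
above) and one that does not (a stroke overlay) can be illustrated with a
short sketch. It uses Python's standard unicodedata module; the specific
letters chosen (z with dot above, l with stroke) are examples picked for
this illustration, not ones named in the discussion.

```python
import unicodedata


def nfc(s):
    """Normalize a string to Unicode Normalization Form C."""
    return unicodedata.normalize("NFC", s)


# z + COMBINING DOT ABOVE (U+0307) composes to U+017C, so both spellings
# are the same thing under normalization.
dot_above_composes = nfc("z\u0307") == "\u017C"

# l + COMBINING SHORT STROKE OVERLAY (U+0335) does not compose to
# U+0142 LATIN SMALL LETTER L WITH STROKE, which has no decomposition.
stroke_composes = nfc("l\u0335") == "\u0142"
stroke_has_decomposition = unicodedata.decomposition("\u0142") != ""
```

Because normalization leaves the overlay sequence alone, excluding such
marks has to be done in the repertoire itself rather than relying on NFC.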
>
> We didn't go down that path, not only because of impassioned
> pleas for some of the character forms that might be excluded but
> because of precisely the reassurance that arguably led to the
> non-decomposing characters thread -- assurances that no new code
> points would be added to Unicode if there were already a
> combining sequence that could reasonably substitute for it in
> the same script except under very unusual circumstances and,
> when those circumstances occurred, the new code points would
> decompose to those sequences.

Other participants in that discussion remember this claim differently.
Unicode Normalization forms C and D were never about "reasonable
substitutions" but about "exact equivalents" or "the same thing,
except for the encoding".

As there is generally no benefit in encoding another representation of
"the same thing", Unicode does not allow addition of precomposed
code points that can be decomposed into something that is the
exact equivalent.

>
> I don't suggest that we try to reverse that decision at this
> point.   I assume that, if nothing else, it would just be too
> disruptive.  However, it is worth pointing out that a "no
> combining sequences" rule would eliminate the non-decomposing
> character problem and at least a few other potential spoofing
> and related cases.   It might also be worth examining as a
> guideline or advice for registries who are interested in
> raising the safety level of what they allow to be registered
> without having to understand the underlying issues more deeply.

A useful set of recommendations for handling combining marks safely in LGRs
would consist of:

1) in all non-complex scripts: allow only fixed enumerations of base code
point and combining marks. (The number of required combinations is small,
even for a sprawling script like Latin).

2) in all complex scripts (where the number of combinations is too large),
provide context rules that ensure combining marks are not placed in the
wrong part of a syllable (such wrong contexts cannot be "read" by humans
and are not rendered correctly by machines). The Root LGR presents suitable
examples of this for SEA scripts; those for Indic scripts are in preparation.

3) in scripts where combining marks express optional elements (vowels, etc.)
disallow all of them. (Arabic; see the Root Zone LGR for an example)

4) in scripts where combining marks are used for historical/special purposes
disallow those (diacritics for classical Greek, stroke overlays and other
marks for linguistics).

5) in LGRs supporting variants, consider mutually blocking labels that vary
only in the presence or absence of some combining diacritic; some diacritics
are not reliably distinguished from each other (comma below, cedilla) or
from an undecorated base character (forms with/without Nukta).
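
Recommendations 1 and 5 above can be sketched in a few lines. This is only
an illustration under stated assumptions: the repertoire below is
hypothetical and not taken from any real LGR, the function names are
invented here, and a real LGR defines variants per code point rather than
by stripping every mark as this approximation does.

```python
import unicodedata

# Hypothetical repertoire: plain base letters plus a short enumeration of
# permitted base+mark combinations, all stored in NFC.
ALLOWED_BASES = set("abcdefghijklmnopqrstuvwxyz")
ALLOWED_SEQUENCES = {"\u00E9", "\u00FC", "\u0219", "m\u0303"}


def is_mark(ch):
    """True for code points with General_Category Mn, Mc or Me."""
    return unicodedata.category(ch).startswith("M")


def clusters(label):
    """Split an NFC-normalized label into base + trailing-mark units."""
    units = []
    for ch in unicodedata.normalize("NFC", label):
        if is_mark(ch) and units:
            units[-1] += ch
        else:
            units.append(ch)
    return units


def label_allowed(label):
    """Rule 1: every unit must be a listed base or a listed combination."""
    return all(
        unit in ALLOWED_BASES or unit in ALLOWED_SEQUENCES
        for unit in clusters(label)
    )


def blocking_key(label):
    """Rule 5 (approximation): labels differing only in marks share a key."""
    decomposed = unicodedata.normalize("NFD", label)
    return "".join(ch for ch in decomposed if not is_mark(ch))
```

With this sketch, a label spelled with s-comma-below and one spelled with
s-cedilla produce the same blocking key, so a registry could let only one
of them be delegated.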

What is a complex script? A good approximation is whether it contains a code
point with ccc=virama, plus Thaana, which isn't really complex but has
mandatory vowels classed as combining marks. Effectively, these are all the
South and South East Asian scripts that are abugidas.
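
That approximation is easy to test mechanically. A minimal sketch follows,
using Python's standard unicodedata module: Canonical_Combining_Class 9 is
the value named Virama, and Thaana is special-cased by its block range
U+0780..U+07BF. The function name and the choice to pass a repertoire as a
string of code points are assumptions made for this illustration.

```python
import unicodedata

VIRAMA_CCC = 9                  # Canonical_Combining_Class value "Virama"
THAANA = range(0x0780, 0x07C0)  # Thaana block, special-cased


def looks_complex(repertoire):
    """Heuristic: repertoire has a ccc=virama code point or is Thaana."""
    return any(
        unicodedata.combining(ch) == VIRAMA_CCC or ord(ch) in THAANA
        for ch in repertoire
    )


# Example repertoires built from whole blocks (unassigned code points
# report a combining class of 0 and are therefore harmless here).
devanagari = "".join(chr(cp) for cp in range(0x0900, 0x0980))
greek = "".join(chr(cp) for cp in range(0x0370, 0x0400))

devanagari_is_complex = looks_complex(devanagari)
greek_is_complex = looks_complex(greek)
thaana_is_complex = looks_complex("\u0780\u07A6")
```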

In addition, you will want recommendations that actually address the
homoglyph issues: there are a fair number of non-combining mark exact visual
duplicates in Unicode. These are not exact equivalents, because Unicode does
not treat code points that differ in script, case or digit/letter properties
as equivalent.
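
The point that normalization does not unify cross-script look-alikes can be
shown with a short sketch using Python's standard unicodedata module. The
pairs chosen (Latin a and Cyrillic a, Latin o and Greek omicron) are
illustrative examples, not ones named in this thread, and the helper name
is invented here.

```python
import unicodedata

FORMS = ("NFC", "NFD", "NFKC", "NFKD")


def unified_by_normalization(a, b):
    """True if any normalization form maps both strings to the same text."""
    return any(
        unicodedata.normalize(form, a) == unicodedata.normalize(form, b)
        for form in FORMS
    )


# U+0061 LATIN SMALL LETTER A vs U+0430 CYRILLIC SMALL LETTER A
latin_vs_cyrillic = unified_by_normalization("a", "\u0430")

# U+006F LATIN SMALL LETTER O vs U+03BF GREEK SMALL LETTER OMICRON
latin_vs_greek = unified_by_normalization("o", "\u03BF")

# For contrast: precomposed e-acute vs e + COMBINING ACUTE ACCENT
acute_spellings = unified_by_normalization("\u00E9", "e\u0301")
```

Handling such pairs therefore needs explicit variant definitions in the
LGR; no normalization form will catch them.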

A./


>
> best,
>      john