[Idna-update] IDNA and combining sequences (was: Re: Expiration impending: <draft-klensin-idna-rfc5891bis-01.txt>)

John C Klensin <john-ietf@jck.com> Fri, 09 March 2018 16:27 UTC

Date: Fri, 09 Mar 2018 11:27:25 -0500
From: John C Klensin <john-ietf@jck.com>
To: Asmus Freytag <asmusf@ix.netcom.com>, idna-update@ietf.org
Archived-At: <https://mailarchive.ietf.org/arch/msg/idna-update/N-3G3YOiaFNLoDjvcPxMmy5Unvk>

Asmus,

Thanks for the more detailed explanation.  I don't think we
really disagree.   Three observations:

* The Unicode Standard is very clear that code points are
assigned to abstract characters and not to particular fonts,
type styles, or type families.  While it specifies rendering
rules for particular scripts (and, if I recall without going
back and checking, sometimes even particular languages), those
rules are much more readily applied to actual blocks of text
than to relatively short mnemonic strings that may not obey the
word or phrase formation rules of the writing system for some
human language.  We've also seen applications and
implementations that are trying to do as little as possible (or,
stated a bit more positively, to incorporate as little code that
is exercised only for very specific cases as possible) and hence
are unlikely to carefully apply rendering rules to strings even
if they support the fonts needed to display the code points in
those strings.  There are also applications and formats out
there that allow incorporating fonts of the author's choice,
rather than picking from those we, or a different application
implementer, might consider optimal. Because of those factors
and because users will often see what they expect to see, I
think we need to be very careful when making statements about
what might or might not look like something else.  

* In retrospect, we missed an important opportunity when the
"preferred syntax" rules were written into RFC 1034 and its
predecessors.  It would have been possible at some time decades
ago to specify a rule for combining words, for mnemonic
purposes, to form DNS labels: porky-pig or porkypig, usc-isi or
uscisi, etc.   For those old enough to remember the host table
and NIC-approved names, we actually had such a convention, but
it was, at least as far as I can remember, never written down...
and it certainly did not make it to RFC 1034.   So we now have
internetsociety.org and not internet-society.org, but, if one
wants to have a discussion about confusion, remembering which of
those two labels is being used is at least as confusing to a
typical user as any of the more subtle character or
character-representation issues we have been discussing (if
there is anyone reading this with an inclination toward evil or
chaos, it might be interesting to see if internet-society.org
could be registered and, if so, whether it would cause any
interesting havoc).   Noting that, like so many other things,
IDNs don't create new problems but may make old ones more
complex, I observe that, AFAIK, ICANN has never seriously
discussed a guideline prohibiting having both porky-pig and
porkypig as labels in the same zone, a situation that adds to my
skepticism about blocked variants as a major part of any
solution.  All of that aside, had there been a discussion in RFC
1034 about the relationship between, e.g., "xerox-copier" and
"xeroxcopier", it certainly would have been reflected in the
extrapolations that generated most of the rules of IDNA2008.
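
As a rough sketch of what such a guideline might look like (the
function names and the rule itself are assumptions made for
illustration; nothing like this is specified by RFC 1034 or by
ICANN), a registry could compare labels under a key that ignores
hyphens:

```python
def hyphen_insensitive_key(label: str) -> str:
    """Hypothetical comparison key: lowercase the label and drop hyphens."""
    return label.lower().replace("-", "")


def conflicts_with_existing(new_label: str, zone_labels: set) -> bool:
    """True if some label already in the zone differs only by hyphens or case."""
    new_key = hyphen_insensitive_key(new_label)
    return any(hyphen_insensitive_key(existing) == new_key
               for existing in zone_labels)


zone = {"porky-pig", "usc-isi"}
print(conflicts_with_existing("porkypig", zone))   # True
print(conflicts_with_existing("uscisi", zone))     # True
print(conflicts_with_existing("daffy-duck", zone)) # False
```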

* Your explanation below is very helpful, at least to me.  I am
confident that, if it was what we were given when we started the
IDNA2008 design, some of the rules would have reflected it.   We
were given a different explanation and answer to our questions,
one much closer to "there is no problem and will be no problem
in  the future, so don't worry - no special rules are needed".  

The question is what to do now.  If we were to decide that your
discussion would make a good addition to the protocol, we would
run into at least two problems.  One is that we promised the
registry community that there would be no more disruptive
changes and, at least as important, the Unicode Consortium's
outrage at our altering the way a handful of code points were
treated would presumably be much greater if we were to make a
change that would invalidate a large number of labels that might
now be registered and in use somewhere in the tree.  Another is
the problem that started this thread -- there appears to be no
energy in the IETF to consider and process changes to IDNA, even
fairly trivial clarifications.

See my next note for a different aspect of this.

best,
   john


--On Thursday, March 8, 2018 12:02 -0800 Asmus Freytag
<asmusf@ix.netcom.com> wrote:

> On 3/8/2018 9:11 AM, John C Klensin wrote:
>> One more thought about something it may be helpful to remind
>> ourselves periodically.
>> 
>> One of the notions floated when IDNs were first being
>> discussed and raised a few times after that was to simply ban
>> combining forms of all types.  The theory was that there was
>> no entitlement to write any character form (e.g., any "word")
>> in DNS labels, that those labels were all about mnemonics and
>> nothing but mnemonics, and that a "no combining sequences"
>> rule would eliminate most issues about normalization and
>> grapheme clusters and make everything a lot easier to explain
>> to implementers and others who wanted to conform but were not
>> willing to make the investment to actually understand all of
>> the complex issues and rules with which we are now dealing.
> 
> It would also very  nicely prevent IDNs on the entire Indian
> Subcontinent.
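
To make the objection concrete, here is a minimal sketch of a "no
combining marks" check using Python's standard unicodedata module.
The function names and the rule as coded (normalize to NFC, then
reject any code point whose General_Category starts with "M") are
assumptions for illustration, not text from any IDNA document.

```python
import unicodedata


def has_combining_mark(label: str) -> bool:
    """True if any code point in the label has General_Category M* (a mark)."""
    return any(unicodedata.category(ch).startswith("M") for ch in label)


def allowed_under_no_combining_rule(label: str) -> bool:
    # Hypothetical rule: compose what can be composed (NFC), then
    # reject the label if any mark is still present.
    nfc = unicodedata.normalize("NFC", label)
    return not has_combining_mark(nfc)


# Precomposed e-acute: no marks at all.
print(allowed_under_no_combining_rule("caf\u00e9"))            # True
# e + COMBINING ACUTE ACCENT: composes under NFC, so it passes.
print(allowed_under_no_combining_rule("cafe\u0301"))           # True
# The Hindi word "hindi": its vowel signs and anusvara are marks
# with no precomposed alternative, so the label is rejected.
print(allowed_under_no_combining_rule("\u0939\u093f\u0902\u0926\u0940"))  # False
```
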
> 
> Ironically, it is Arabic where most (all) of the combining marks
> can safely be excluded from IDNs. This is being done for the Root
> Zone, for example, as you acknowledge below.
> 
> The reason is that in Arabic, these marks are generally optional
> and/or used for specific purposes in specialized text. Therefore,
> leaving them out is not detrimental to the usability of IDNs and
> the Root Zone will not allow them (extending the set of prohibited
> ones from RFC 5564).
> 
> Therefore, in the Root Zone the Unicode 7 addition of the U+08A1
> would not be an issue.
>> 
>> If that were the rule and someone really, really, wanted a
>> grapheme that could only be formed in Unicode with a combining
>> sequence, it would be up to them to convince the Unicode
>> Consortium that their favorite character (grapheme) needed to
>> be added to Unicode as a single code point.   However hard
>> that might be, it would not be our problem.
> 
> Well, we see what happens if someone does get Unicode to add a
> pre-composed form: the entire process is derailed. That's what
> happened with the addition of U+08A1 even though it is NOT the case
> that this is a true "precomposed" form - while the same graphical
> elements are involved, the result does not look identical. There is
> a strong similarity of course, but despite what people read into
> the Unicode character name, this is not a case of an exact
> homoglyph.
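
The normalization behavior at issue can be checked with Python's
standard unicodedata module. This is only a sketch; it assumes a
Python build whose Unicode data is version 7.0 or later (so that
U+08A1 is assigned), and the helper name is made up for the example.

```python
import unicodedata

BEH_WITH_HAMZA_ABOVE = "\u08a1"   # ARABIC LETTER BEH WITH HAMZA ABOVE
BEH_PLUS_HAMZA = "\u0628\u0654"   # ARABIC LETTER BEH + ARABIC HAMZA ABOVE


def same_under(form: str, a: str, b: str) -> bool:
    """Compare two strings after applying one normalization form to both."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


# U+08A1 carries no decomposition mapping, so this prints an empty string.
print(repr(unicodedata.decomposition(BEH_WITH_HAMZA_ABOVE)))

# No normalization form maps the single code point and the
# two-code-point sequence onto each other.
for form in ("NFC", "NFD", "NFKC", "NFKD"):
    print(form, same_under(form, BEH_WITH_HAMZA_ABOVE, BEH_PLUS_HAMZA))
```
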
>> 
>> FWIW, a "nothing but precombined characters" rule is
>> essentially the recommendation for Arabic IDNs in RFC 5564
>> and, I understand, in the emerging Arabic script rules for
>> the root zone.
> 
> That is because of the way these function in Arabic.
> 
> Unicode could not generally add precomposed forms of, say, Latin
> code points, such as a new letter with a dot above, because of
> normalization stability. In the Latin script, a combining dot above
> and a precomposed dot above are identical.
> 
> However, even there you have a number of combining marks that are
> not considered as part of possible decompositions: they are the
> code points for various (stroke) overlays and some attached
> extenders.
> 
> Like the Arabic combining marks, they could (and should) be
> disallowed from LGRs. (The Latin LGR for the Root Zone will not
> allow combining marks other than in enumerated combinations - that's
> something that works for Latin, Greek, Cyrillic and practically all
> scripts that are not South or South East Asian complex scripts.)
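
Both Latin cases can be illustrated with Python's standard
unicodedata module. The specific code points below (a with dot
above, o with stroke, and the long solidus overlay) are examples
chosen for this sketch, not ones named in the message.

```python
import unicodedata


def composes_to_single(seq: str) -> bool:
    """True if NFC turns the sequence into exactly one code point."""
    return len(unicodedata.normalize("NFC", seq)) == 1


# a + COMBINING DOT ABOVE is canonically equivalent to U+0227.
print(unicodedata.normalize("NFC", "a\u0307") == "\u0227")   # True
print(unicodedata.decomposition("\u0227"))                   # 0061 0307

# LATIN SMALL LETTER O WITH STROKE (U+00F8) has no decomposition,
# and o + COMBINING LONG SOLIDUS OVERLAY does not compose to it.
print(repr(unicodedata.decomposition("\u00f8")))             # ''
print(composes_to_single("o\u0338"))                         # False
```
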
>> 
>> We didn't go down that path, not only because of impassioned
>> pleas for some of the character forms that might be excluded
>> but because of precisely the reassurance that arguably led to
>> the non-decomposing characters thread -- assurances that no
>> new code points would be added to Unicode if there were
>> already a combining sequence that could reasonably substitute
>> for it in the same script except under very unusual
>> circumstances and, when those circumstances occurred, the new
>> code points would decompose to those sequences.
> 
> Other participants in that discussion remember this claim
> differently. Unicode Normalization forms C and D were never about
> "reasonable substitutions" but about "exact equivalents" or "the
> same thing, except for the encoding".
> 
> As there is generally no benefit in encoding another representation
> of "the same thing", Unicode does not allow addition of precomposed
> code points that can be decomposed into something that is the exact
> equivalent.
> 
>> 
>> I don't suggest that we try to reverse that decision at this
>> point.   I assume that, if nothing else, it would just be too
>> disruptive.  However, it is worth pointing out that a "no
>> combining sequences" rule would eliminate the non-decomposing
>> character problem and at least a few other potential spoofing
>> and related cases.   It might also be worth examining as a
>> guideline or advice for registries who are interested in
>> raising the safety level of what they allow to be registered
>> without having to understand the underlying issues more
>> deeply.
> 
> A useful set of recommendations for handling combining marks safely
> in LGRs would consist of:
> 
> 1) in all non-complex scripts: allow only fixed enumerations of
> base code point and combining marks. (The number of required
> combinations is small, even for a sprawling script like Latin).
> 
> 2) in all complex scripts (where the number of combinations is too
> large), provide context rules that ensure combining marks are not
> placed in the wrong part of a syllable (such wrong contexts cannot
> be "read" by humans and are not rendered correctly by machines).
> The Root LGR presents suitable examples of this for SEA scripts,
> with Indic scripts in preparation.
> 
> 3) in scripts where combining marks express optional elements
> (vowels, etc.) disallow all of them. (Arabic, see Root Zone LGR for
> example)
> 
> 4) in scripts where combining marks are used for historical/special
> purposes disallow those (diacritics for classical Greek, stroke
> overlays and other marks for linguistics).
> 
> 5) in LGRs supporting variants, consider mutually blocking labels
> that vary only in the presence or absence of some combining
> diacritic; some diacritics are not reliably distinguished from each
> other (comma below, cedilla) or from an undecorated base character
> (forms with/without Nukta).
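
Recommendations 1 and 5 lend themselves to a small sketch. The
repertoire, the single allowed sequence, and all function names
below are hypothetical; a real LGR enumerates its repertoire and
variant rules explicitly and in far more detail.

```python
import unicodedata

# Hypothetical repertoire, for illustration only.
ALLOWED_CODE_POINTS = set("abcdefghijklmnopqrstuvwxyz0123456789-") | {"\u00e9", "\u00fc"}
# Hypothetical enumerated sequence: m + COMBINING DIAERESIS has no
# precomposed form, so it must be listed as a sequence to be usable.
ALLOWED_SEQUENCES = {"m\u0308"}


def is_mark(ch: str) -> bool:
    return unicodedata.category(ch).startswith("M")


def label_allowed(label: str) -> bool:
    """Recommendation 1: marks are valid only inside enumerated sequences."""
    s = unicodedata.normalize("NFC", label)
    i = 0
    while i < len(s):
        j = i + 1
        while j < len(s) and is_mark(s[j]):   # gather marks following s[i]
            j += 1
        cluster = s[i:j]
        if len(cluster) == 1:
            if is_mark(cluster) or cluster not in ALLOWED_CODE_POINTS:
                return False
        elif cluster not in ALLOWED_SEQUENCES:
            return False
        i = j
    return True


def blocking_key(label: str) -> str:
    """Recommendation 5: labels differing only in diacritics share a key."""
    nfd = unicodedata.normalize("NFD", label)
    return "".join(ch for ch in nfd if not is_mark(ch))


print(label_allowed("cafe\u0301"))   # True: NFC yields the listed U+00E9
print(label_allowed("m\u0308"))      # True: enumerated sequence
print(label_allowed("q\u0308"))      # False: sequence not enumerated
# s with comma below and s with cedilla collapse to the same key.
print(blocking_key("\u0219i") == blocking_key("\u015fi"))   # True
```

Stripping every mark is a deliberately blunt choice for the blocking
key; a production rule set would list the specific diacritics it
treats as confusable.
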
> 
> What is a complex script: a good approximation is whether it
> contains a code point with ccc=virama, plus Thaana, which isn't
> really complex but has mandatory vowels classed as combining marks.
> Effectively all South and South East Asian scripts that are
> abugidas.
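
The ccc=virama heuristic can be approximated with Python's standard
unicodedata module, where combining() returns the canonical
combining class and the value 9 is the class named Virama. One
assumption to flag: unicodedata does not expose the Script property,
so this sketch scans Unicode block ranges as a stand-in for scripts.

```python
import unicodedata

VIRAMA_CCC = 9  # Canonical_Combining_Class value named "Virama"


def range_has_virama(start: int, end: int) -> bool:
    """True if any code point in [start, end] has ccc=9."""
    return any(unicodedata.combining(chr(cp)) == VIRAMA_CCC
               for cp in range(start, end + 1))


print(range_has_virama(0x0900, 0x097F))  # Devanagari block: True (U+094D)
print(range_has_virama(0x0000, 0x024F))  # Basic Latin to Latin Extended-B: False
print(range_has_virama(0x0780, 0x07BF))  # Thaana block: False
```

Thaana comes out False, which is why the message lists it
separately rather than relying on the virama test alone.
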
> 
> In addition, you will want recommendations that actually address
> the homoglyph issues: there are a fair number of non-combining mark
> exact visual duplicates in Unicode. These are not exact equivalents,
> because Unicode does not treat code points that differ in script,
> case or digit/letter properties as equivalent.
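
A quick check with Python's standard unicodedata module shows why
normalization alone does not help here. The Latin/Cyrillic pair is
an example chosen for this sketch, and the helper name is made up.

```python
import unicodedata

FORMS = ("NFC", "NFD", "NFKC", "NFKD")


def equivalent_under_any_form(a: str, b: str) -> bool:
    """True if some normalization form maps both strings to the same result."""
    return any(unicodedata.normalize(f, a) == unicodedata.normalize(f, b)
               for f in FORMS)


latin_a, cyrillic_a = "a", "\u0430"
print(unicodedata.name(latin_a))      # LATIN SMALL LETTER A
print(unicodedata.name(cyrillic_a))   # CYRILLIC SMALL LETTER A

# Visually identical in most fonts, yet never unified by normalization.
print(equivalent_under_any_form(latin_a, cyrillic_a))     # False
# Case differences are likewise not normalization equivalences.
print(equivalent_under_any_form("A", "a"))                # False
# For contrast, a true canonical equivalence:
print(equivalent_under_any_form("\u00e9", "e\u0301"))     # True
```
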
> 
> A./
> 
> 
>> 
>> best,
>>      john
>> 
>> 
>> 
>> _______________________________________________
>> IDNA-UPDATE mailing list
>> IDNA-UPDATE@ietf.org
>> https://www.ietf.org/mailman/listinfo/idna-update
>> 
> 