[Idna-update] IDNA and combining sequences (was: Re: Expiration impending: <draft-klensin-idna-rfc5891bis-01.txt>)
John C Klensin <john-ietf@jck.com> Fri, 09 March 2018 16:27 UTC
Date: Fri, 09 Mar 2018 11:27:25 -0500
From: John C Klensin <john-ietf@jck.com>
To: Asmus Freytag <asmusf@ix.netcom.com>, idna-update@ietf.org
Message-ID: <1E562CDE39B4224F227E765D@PSB>
In-Reply-To: <02c29140-29f1-cc81-8c4f-8249d0f23b2c@ix.netcom.com>
References: <C4FBCF12821031786F472AA2@PSB>
<02c29140-29f1-cc81-8c4f-8249d0f23b2c@ix.netcom.com>
Archived-At: <https://mailarchive.ietf.org/arch/msg/idna-update/N-3G3YOiaFNLoDjvcPxMmy5Unvk>
Subject: [Idna-update] IDNA and combining sequences (was: Re: Expiration
impending: <draft-klensin-idna-rfc5891bis-01.txt>)

Asmus,

Thanks for the more detailed explanation. I don't think we
really disagree. Three observations:

* The Unicode Standard is very clear that code points are
assigned to abstract characters and not to particular fonts,
type styles, or type families. While it specifies rendering
rules for particular scripts (and, if I recall without going
back and checking, sometimes even particular languages), those
rules are much more readily applied to actual blocks of text
than to relatively short mnemonic strings that may not obey the
word or phrase formation rules of the writing system for some
human language. We've also seen applications and
implementations that are trying to do as little as possible
(or, stated a bit more positively, to incorporate as little
code that is exercised only for very specific cases as
possible) and hence are unlikely to carefully apply rendering
rules to strings even if they support the fonts needed to
display the code points in those strings. There are also
applications and formats out there that allow incorporating
fonts of the author's choice, rather than picking from those
we, or a different application implementer, might consider
optional. Because of those factors and because users will
often see what they expect to see, I think we need to be very
careful when making statements about what might or might not
look like something else.

* In retrospect, we missed an important opportunity when the
"preferred syntax" rules were written into RFC 1034 and its
predecessors. It would have been possible at some time decades
ago to specify a rule for combining words for the purpose of
mnemonics to form DNS labels: porky-pig or porkypig, usc-isi or
uscisi, etc. For those old enough to remember the host table
and NIC-approved names, we actually had such a convention, but
it was, at least as far as I can remember, never written
down... and it certainly did not make it to RFC 1034.
So we now have internetsociety.org and not
internet-society.org, but, if one wants to have a discussion
about confusion, remembering which of those two labels is being
used is at least as confusing to a typical user as any of the
more subtle character or character-representation issues we
have been discussing (if there is anyone reading this with an
inclination toward evil or chaos, it might be interesting to
see if internet-society.org could be registered and, if so,
whether it would cause any interesting havoc). Noting that,
like so many other things, IDNs don't create new problems but
may make old ones more complex, I observe that, AFAIK, ICANN
has never seriously discussed a guideline prohibiting having
both porky-pig and porkypig as labels in the same zone, a
situation that adds to my skepticism about blocked variants as
a major part of any solution. All of that aside, had there
been a discussion in RFC 1034 about the relationship between,
e.g., "xerox-copier" and "xeroxcopier", it certainly would have
been reflected in the extrapolations that generated most of the
rules of IDNA2008.

* Your explanation below is very helpful, at least to me. I am
confident that, if it was what we were given when we started
the IDNA2008 design, some of the rules would have reflected it.
We were given a different explanation and answer to our
questions, one much closer to "there is no problem and will be
no problem in the future, so don't worry - no special rules are
needed".

The question is what to do now. If we were to decide that your
discussion would make a good addition to the protocol, we would
run into at least two problems.
One is that we promised the registry community that there would
be no more disruptive changes and, at least as important, the
Unicode Consortium's outrage at our altering the way a handful
of code points were treated would presumably be much greater if
we were to make a change that would invalidate a large number
of labels that might now be registered and in use somewhere in
the tree. Another is the problem that started this thread --
there appears to be no energy in the IETF to consider and
process changes to IDNA, even fairly trivial clarifications.

See my next note for a different aspect of this.

best,
   john


--On Thursday, March 8, 2018 12:02 -0800 Asmus Freytag
<asmusf@ix.netcom.com> wrote:

> On 3/8/2018 9:11 AM, John C Klensin wrote:
>> One more thought about something it may be helpful to remind
>> ourselves periodically.
>>
>> One of the notions floated when IDNs were first being
>> discussed and raised a few times after that was to simply ban
>> combining forms of all types. The theory was that there was
>> no entitlement to write any character form (e.g., any "word")
>> in DNS labels, that those labels were all about mnemonics and
>> nothing but mnemonics, and that a "no combining sequences"
>> rule would eliminate most issues about normalization and
>> grapheme clusters and make everything a lot easier to explain
>> to implementers and others who wanted to conform but were not
>> willing to make the investment to actually understand all of
>> the complex issues and rules with which we are now dealing.
>
> It would also very nicely prevent IDNs on the entire Indian
> Subcontinent.
>
> Ironically, it is Arabic, where most (all) of the combining marks
> can safely be excluded from IDNs. This is being done for the Root
> Zone, for example, as you acknowledge below.
>
> The reason is that in Arabic, these marks are generally optional
> and/or used for specific purposes in specialized text.
> Therefore, leaving them out is not detrimental to the usability of
> IDNs and the Root Zone will not allow them (extending the set of
> prohibited ones from RFC 5564).
>
> Therefore, in the Root Zone the Unicode 7 addition of the U+08A1
> would not be an issue.
>>
>> If that were the rule and someone really, really, wanted a
>> grapheme that could only be formed in Unicode with a combining
>> sequence, it would be up to them to convince the Unicode
>> Consortium that their favorite character (grapheme) needed to
>> be added to Unicode as a single code point. However hard
>> that might be, it would not be our problem.
>
> Well, we see what happens if someone does get Unicode to add a
> pre-composed form: the entire process is derailed. That's what
> happened with the addition of U+08A1 even though it is NOT the
> case that this is a true "precomposed" form - while the same
> graphical elements are involved, the result does not look
> identical. There is a strong similarity of course, but despite
> what people read into the Unicode character name, this is not a
> case of an exact homoglyph.
>>
>> FWIW, a "nothing but precombined characters" rule is
>> essentially the recommendation for Arabic IDNs in RFC 5564
>> and, I understand, in the emerging Arabic script rules for
>> the root zone.
>
> That is because of the way these function in Arabic.
>
> Unicode could not generally add precomposed forms of, say, Latin
> code points, say a new letter with a dot above, because of
> normalization stability. In the Latin script, a combining dot
> above and a precomposed dot above are identical.
>
> However, even there you have a number of combining marks that are
> not considered as part of possible decompositions: they are the
> code points for various (stroke) overlays and some attached
> extenders.
>
> Like the Arabic combining marks, they could (and should) be
> disallowed from LGRs.
> (The Latin LGR for the Root Zone will not allow combining marks
> other than in enumerated combinations - that's something that
> works for Latin, Greek, Cyrillic and practically all scripts that
> are not South or South East Asian complex scripts.)
>>
>> We didn't go down that path, not only because of impassioned
>> pleas for some of the character forms that might be excluded
>> but because of precisely the reassurance that arguably led to
>> the non-decomposing characters thread -- assurances that no
>> new code points would be added to Unicode if there were
>> already a combining sequence that could reasonably substitute
>> for it in the same script except under very unusual
>> circumstances and, when those circumstances occurred, the new
>> code points would decompose to those sequences.
>
> Other participants in that discussion remember this claim
> differently. Unicode Normalization forms C and D were never about
> "reasonable substitutions" but about "exact equivalents" or "the
> same thing, except for the encoding".
>
> As there is generally no benefit in encoding another
> representation of "the same thing", Unicode does not allow
> addition of precomposed code points that can be decomposed into
> something that is the exact equivalent.
>
>>
>> I don't suggest that we try to reverse that decision at this
>> point. I assume that, if nothing else, it would just be too
>> disruptive. However, it is worth pointing out that a "no
>> combining sequences" rule would eliminate the non-decomposing
>> character problem and at least a few other potential spoofing
>> and related cases. It might also be worth examining as a
>> guideline or advice for registries who are interested in
>> raising the safety level of what they allow to be registered
>> without having to understand the underlying issues more
>> deeply.
>
> A useful set of recommendations for handling combining marks
> safely in LGRs would consist of:
>
> 1) in all non-complex scripts: allow only fixed enumerations of
> base code point and combining marks. (The number of required
> combinations is small, even for a sprawling script like Latin).
>
> 2) in all complex scripts (where the number of combinations is too
> large), provide context rules that assure combining marks are not
> placed in the wrong part of a syllable (such wrong contexts cannot
> be "read" by humans and are not rendered correctly by machines).
> The Root LGR presents suitable examples of this for SEA scripts,
> Indic scripts in preparation.
>
> 3) in scripts where combining marks express optional elements
> (vowels, etc.) disallow all of them. (Arabic, see Root Zone LGR
> for example)
>
> 4) in scripts where combining marks are used for historical/
> special purposes disallow those (diacritics for classical Greek,
> stroke overlays and other marks for linguistics).
>
> 5) in LGRs supporting variants, consider mutually blocking labels
> that vary only in the presence or absence of some combining
> diacritic; some diacritics are not reliably distinguished from
> each other (comma below, cedilla) or from an undecorated base
> character (forms with/without Nukta).
>
> What is a complex script: a good approximation is whether it
> contains a code point with ccc=virama, plus Thaana, which isn't
> really complex but has mandatory vowels classed as combining
> marks. Effectively all South and South East Asian scripts that
> are abugidas.
>
> In addition, you will want recommendations that actually address
> the homoglyph issues: there are a fair number of non-combining
> mark exact visual duplicates in Unicode. These are not exact
> equivalents, because Unicode does not treat code points that
> differ in script, case or digit/letter properties as equivalent.
>
> A./
>
>>
>> best,
>>     john
>>
>> _______________________________________________
>> IDNA-UPDATE mailing list
>> IDNA-UPDATE@ietf.org
>> https://www.ietf.org/mailman/listinfo/idna-update
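
The normalization and combining-mark points in the exchange above can be illustrated with a small Python sketch that uses only the standard unicodedata module. The function names and the sample "allowed" set are hypothetical; they are not taken from IDNA2008, RFC 5564, or any actual LGR, and the results depend on the Unicode version bundled with the Python interpreter (U+08A1 requires Unicode 7.0 or later). The enumeration check is deliberately simplified: it accepts a combining mark only when it directly follows a base character and the two-code-point pair is listed.

```python
import unicodedata

VIRAMA_CCC = 9  # canonical combining class that Unicode assigns to viramas


def canonically_equivalent(a: str, b: str) -> bool:
    """True if the two strings are 'the same thing, except for the encoding'."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


def has_combining_mark(label: str) -> bool:
    """True if any code point has a General_Category of Mn, Mc, or Me."""
    return any(unicodedata.category(ch).startswith("M") for ch in label)


def uses_virama(label: str) -> bool:
    """Rough 'complex script' hint: the label contains a ccc=virama code point.

    Assumption: this only looks at the label itself, not at the whole script,
    and it does not cover the Thaana special case mentioned in the message.
    """
    return any(unicodedata.combining(ch) == VIRAMA_CCC for ch in label)


def only_enumerated_sequences(label: str, allowed: set) -> bool:
    """Sketch of recommendation 1: marks appear only in listed base+mark pairs."""
    for i, ch in enumerate(label):
        if unicodedata.category(ch).startswith("M"):
            if i == 0 or label[i - 1:i + 1] not in allowed:
                return False
    return True


if __name__ == "__main__":
    # Latin: combining dot above composes to the precomposed letter under NFC.
    print(canonically_equivalent("e\u0307", "\u0117"))
    # Arabic: BEH + HAMZA ABOVE is not canonically equivalent to U+08A1.
    print(canonically_equivalent("\u0628\u0654", "\u08A1"))
    # Devanagari KA + VIRAMA + SSA contains a ccc=virama code point.
    print(uses_virama("\u0915\u094D\u0937"))
    # Hypothetical enumeration that permits only n + combining macron.
    print(only_enumerated_sequences("n\u0304a", {"n\u0304"}))
```

A registry applying the "no combining sequences" idea would reject any label for which has_combining_mark returns True after NFC normalization; the enumerated-sequence check is the less drastic alternative described in the quoted recommendations.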