Re: IAB Statement on Identifiers and Unicode 7.0.0
John C Klensin <klensin@jck.com> Wed, 28 January 2015 18:51 UTC
Date: Wed, 28 Jan 2015 13:51:43 -0500
From: John C Klensin <klensin@jck.com>
To: idna-update@alvestrand.no
Subject: Re: IAB Statement on Identifiers and Unicode 7.0.0
Message-ID: <CF757F4D30DE3C2AF7080F93@JcK-HP8200.jck.com>
In-Reply-To: <20150128022230.GF77983@mx1.yitter.info>
References: <794C5485-8EF1-4A5F-B92E-4205C3CA0A8B@vigilsec.com> <AE1BD6E6-AF89-4B1E-B4B0-B67A5A72A7DF@vpnc.org> <CY1PR0301MB0731475C6E31B4C725745B4482320@CY1PR0301MB0731.namprd03.prod.outlook.com> <20150128000548.GD77983@mx1.yitter.info> <CY1PR0301MB0731532FF6351495549DB78C82330@CY1PR0301MB0731.namprd03.prod.outlook.com> <20150128022230.GF77983@mx1.yitter.info>
Hi.

I think Vint, Andrew, Patrik, and I are saying much the same thing, but let me try to respond to several messages with a slightly different viewpoint and vocabulary, much of which closely parallels the (IMO, very helpful) exchange between Asmus and myself earlier this week -- an exchange that, while long, I'd encourage those who are interested in this subject to read if you have not done so.

First, I'd like to encourage everyone to try to think about this issue as precisely as possible and to avoid both hyperbole and hysteria, if only because they create distractions and make it hard to identify the real issues and figure out what to do about them.

This is not about similar-looking ("confusable") characters or about characters, however related, from different scripts. It is not about revising Unicode's script classifications for characters used in normal, natural-language, writing systems either.

The IAB has not called for banning Hamza, or characters constructed with Hamza. It has only called for a temporary go-slow policy on a range of code points until we really figure out how to respond to this. If I had known when work started on that statement what I think I know now (thanks to help from John Cowan and Asmus among others), I think the statement would have said equally cautionary things about, e.g., some collections of phonetic description characters that are classified as Latin script letters but that have some very similar decomposition properties.

FWIW, the most radical long-term suggestions I have made, seen, or heard of would have the effect of disallowing one or the other of a combining sequence and a [pre]composed form of a character. That is no different from what IDNA2008's "U-labels must be in NFC form" rule (or the tables of IDNA2003) do for thousands of other characters, in any respect other than that NFC doesn't do the elimination job.
I would assume that any decision to ban _both_ the precomposed character and the combining sequence would need to be a matter of per-script or per-language recommendations to, and actions by, zone administrators (in the DNS case) and those who pick identifiers (in most or all of the other cases). That isn't because those "ban both" cases are somehow harder or more risky but because the decision as to what abstract characters to allow in a label (or other identifier) is fully consistent with IDNA2008's general call for care and good judgment and not an issue about how things are coded.

I understand that passions can run high about this, but the solution to the IAB's "defer using these things until there is a real solution" advice (at least that is how I have understood it) is to get to a real solution as quickly as possible. That doesn't require that we all focus on the same issues, but a narrow focus, with minimal distractions, would certainly help.

The problem _is_ about whether two ways to code the same abstract character, within the same script, can be reliably compared equal with the existing technology and, if not, what can be done to create technology that will work. That makes it, inevitably, about whether the meaning of "same abstract character" can be the same for Unicode purposes (which apparently include language, phonetic, and usage considerations) as for pure identifier ones (think "IETF identifier", but see the Historical Note at the end). IDNA2008's property and new-Unicode-version transition rules assumed the answer to that abstract character question was "yes". It is now clear that, for some groups of characters, it is "no" without further work on IDNA, and that is a problem for the reasons Vint, Patrik, and Andrew have referred to.
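
To make the comparison problem concrete, here is a minimal sketch using Python's standard unicodedata module. It assumes the Hamza-related case under discussion is U+08A1 (ARABIC LETTER BEH WITH HAMZA ABOVE, added in Unicode 7.0), which this message does not name explicitly. The helper name nfc_equal and the constant names are illustrative only.

```python
import unicodedata


def nfc_equal(a: str, b: str) -> bool:
    """Compare two strings after converting both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


# ALEF WITH HAMZA ABOVE has a canonical decomposition, so NFC unifies
# the precomposed form and the combining sequence.
ALEF_HAMZA_PRECOMPOSED = "\u0623"
ALEF_HAMZA_SEQUENCE = "\u0627\u0654"   # ALEF + combining HAMZA ABOVE

# Assumed case: BEH WITH HAMZA ABOVE has no canonical decomposition,
# so NFC leaves both codings unchanged and they compare unequal.
BEH_HAMZA_PRECOMPOSED = "\u08A1"
BEH_HAMZA_SEQUENCE = "\u0628\u0654"    # BEH + combining HAMZA ABOVE

print(nfc_equal(ALEF_HAMZA_PRECOMPOSED, ALEF_HAMZA_SEQUENCE))  # True
print(nfc_equal(BEH_HAMZA_PRECOMPOSED, BEH_HAMZA_SEQUENCE))    # False
```

The second comparison is the one that matters: an identifier rule that relies on NFC alone treats those two codings as different labels.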
Others have already said this, but an approach of lumping these code points (or sequences) that may or may not be the same abstract character together with confusables and handing the issue off to zone administrators is impractical and undesirable because of the "number of registries that act independently" and "no-registry" identifiers issues (see an earlier note from Andrew and/or my discussion with Asmus). If we cannot reasonably know whether two representations (via input methods, coding, or elsewhere) of an identifier match, then we don't have identifiers, we have only names by which things are (or might be) called.

For me, at least, the next steps, with the understanding that the first two may, to some extent, depend on the third, are:

(1) Try to figure out how, if at all possible, to disentangle Precis (and, to the degree relevant, JSON, IETF adaptations of PKCS, and other protocols and systems that need to accommodate non-ASCII strings) from this situation and its possible implications.

(2) Update draft-klensin-idna-5892upd-unicode70 to reflect our current understanding of the problem. Until and unless some other approach comes along, that document lies on the path to a solution for IDNA. If someone does want to suggest another approach, I'd be happy to work with them to incorporate it if they conclude it would be inefficient to write a separate document. However, I don't think denial is going to do it for us, for reasons given above and elsewhere.

(3) Try to get a better understanding of the scope and locations of the problem. We know about the Hamza-related cases, but not whether there are similar non-decomposing cases elsewhere within the Arabic script. The discussion in the Unicode Standard suggests that there are not and won't be; some of Asmus's comments appear to indicate that composed forms for many other "characters" that can now only be represented by combining sequences may be in our future.
We also know, now, about the phonetic description characters, some of which can be formed from Basic Latin characters and Latin or Common Script composing characters. If there are other types of cases, it would be really desirable, perhaps essential, to know what they are and where to find them, rather than just having comments that there are lots of cases out there (some of which turn out to be cross-script or "similar", not "identical", cases). See also (5) below.

I also think there is some work that UTC could do here that the IETF can't do and that would considerably improve the situation:

(4) There are statements in the Unicode Standard and about normalization stability that seem, to some of us who have read them multiple times and very carefully, to be very specific about conditions under which new code points are added and their interactions with normalization. It appears from the discussions of the last few weeks that there are additional considerations about phonetics, language issues within scripts, different treatments for different scripts, and perhaps other cases that call for what appear to be exceptions to those statements. It seems to me that it would be wise, in the interest of predictability on which the community can rely -- the very essence of applied stability -- to align the statements in the standard with the actual practices and guidelines used in assigning new code points. If the problem is not in the intent of the statements made in the standard but in how many of us have, in good faith, interpreted them, that also suggests that some textual revisions are in order. If those issues have to be worked out in collaboration with ISO/IEC JTC1/SC2, I believe that an increase in transparency would be beneficial to all concerned.

(5) I'm sensitive to the distinction Asmus made between legacy cases and rules and plans going forward.
Although I hate the idea if the list is long, we can handle the legacy cases by exception list (and have done some of that already). It would be far better if there were some property that would identify code points that would have been handled differently (wrt composition/decomposition (see (7) below) and anything else we should be worried about) if assigned under 2014 rules rather than whatever legacy principles applied. Only UTC can create such a property with any hope of getting it right. However, even such a property would not be of much help unless there are clear rules about new code point assignments that we can understand and rely on (see above) and a clear dividing line between "legacy" and "new and consistent rules". When we did IDNA2008, we were under the impression that the dividing line existed and was set by the additional stability rules introduced into Unicode 5.2 (if not a bit earlier). That inference appears to be incorrect, so help in formulating a better one seems to me to be important.

(6) If there are really some scripts (e.g., Latin) for which new characters are assigned code points based on "composing sequences preferred" and others (e.g., Arabic and, given the difficulties with using ideographic description sequences in identifiers, Han) in which new characters are assigned based on "precomposed grapheme with single code point preferred", it would be extremely desirable to have a property that distinguished the two rather than our having to make lists of scripts. The latter is inherently unstable on a Unicode version-by-version basis as new scripts are added; presumably the property could easily be updated as part of decisions about each new script.
(7) Similarly, the core of the current set of issues (but possibly not the only misconception around which IDNA2008 was designed) seems to be associated with characters that one would expect, under a "same script, same form" principle, to decompose (and, in most cases, to compose) when converted to the appropriate canonical form but that do not behave as predicted, for some well-thought-out reason. If UTC concludes that is, indeed, the key issue, a "you might expect this to decompose but it doesn't" property would be extremely helpful. Because that property would be new, it could be assigned to all existing cases of that situation without violating any stability rules. And the IETF could choose either to disallow all such code points (a decision that would favor stability of possible existing names) or, in conjunction with the "type of script" property described in (6), to construct a more complex rule that would favor predictability. Either, in principle, might have exception lists for particularly difficult legacy cases.

(8) It might be covered by the above case or might not, and might or might not be useful for other reasons, but it appears that a great many code points have been assigned to characters that have been assigned to the Latin (and perhaps Greek or other) scripts and given letter properties that, with a different set of conventions, would have been considered symbols. My current examples are the IPA Extensions block at 0250..02AF and the Phonetic Extensions and Phonetic Extensions Supplement blocks at 1D00..1DBF, but I have no reason to believe those are the only cases. It seems to me that those near-symbols are bound to cause problems sooner or later for some identifier context (even if IDNA can handle them in other ways), so I would encourage UTC to consider whether some new property that can be used to distinguish them from the letters normally used to write words in human languages would be appropriate.
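
As a rough illustration of what the property suggested in (7) would have to capture, the sketch below scans a range of code points for characters whose names suggest a base letter plus a mark but that have no canonical decomposition. This is a crude name-based heuristic and only an assumption about how one might go looking. It is not a substitute for a UTC-maintained property, the function name undecomposed_with_mark is hypothetical, and the results depend on the Unicode version bundled with the Python interpreter.

```python
import unicodedata


def undecomposed_with_mark(mark_phrase, start=0x0600, end=0x08FF):
    """Return (code point, name) pairs in [start, end] whose character
    name ends with mark_phrase but that have no decomposition mapping."""
    hits = []
    for cp in range(start, end + 1):
        ch = chr(cp)
        name = unicodedata.name(ch, "")  # "" for unassigned code points
        if not name.endswith(mark_phrase):
            continue
        # An empty string means the character has no decomposition mapping.
        if unicodedata.decomposition(ch) == "":
            hits.append((cp, name))
    return hits


# Default range covers the Arabic block and its neighbours through
# Arabic Extended-A; the phrase is an illustrative choice.
for cp, name in undecomposed_with_mark(" WITH HAMZA ABOVE"):
    print(f"U+{cp:04X} {name}")
```

Characters such as ALEF WITH HAMZA ABOVE (U+0623), which do decompose, are filtered out; what remains is a candidate list that would still need expert review, since character names are not a reliable guide to intended identity.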
best regards to all,
   john

Historical note: FWIW, the issue of whether identifiers were different if different languages were used as a base is not a new issue or an IETF-specific one. There were extensive discussions in the ISO, ANSI, and ECMA programming language standardization communities in the mid-1980s (long enough ago that what is now ISO/IEC JTC1/SC22 was still ISO TC97/SC5) about whether the concrete, machine-stored, form of identifiers other than those in what we now call Basic Latin characters needed to be a tuple of a language or CCS identifier with a coded string, much along the lines that Andrew suggested in one of his comments. That discussion, which was strongly influenced by increasing awareness of the difficulties with a collection of specialized CCSs, was one part of what led to the creation of the project that became ISO/IEC 10646. It also led to a liaison letter (if I recall, more than one) between the two SCs cautioning against multiple coding forms for the same character and warning against just the situation we find ourselves in today.

Perhaps those who write, or have written, programs can understand why the present situation is so disturbing by thinking about a requirement on programming languages that every program contain a declaration of the human language used as a basis for its identifiers, with that information carried, not only into object code, but into every procedure call, and with little hope of interoperation in the general case among programs or libraries with different language declarations (one could get around that by passing only pointer-references and not names, but we have presumably had enough experience with where that leads from a security vulnerability standpoint).