[I18nrp] Limits of IDNA2008 (was:Re: draft-faltstrom-unicode11-04.txt)
John C Klensin <john-ietf@jck.com> Wed, 10 October 2018 19:17 UTC
Date: Wed, 10 Oct 2018 15:17:25 -0400
From: John C Klensin <john-ietf@jck.com>
To: "Hollenbeck, Scott" <shollenbeck=40verisign.com@dmarc.ietf.org>
cc: "'paf=40frobbit.se@dmarc.ietf.org'" <paf=40frobbit.se@dmarc.ietf.org>,
"'i18nrp@ietf.org'" <i18nrp@ietf.org>, "'iab@iab.org'" <iab@iab.org>
Message-ID: <FB9181768D399AB7695B2E70@PSB>
Archived-At: <https://mailarchive.ietf.org/arch/msg/i18nrp/pwQCt4xQWILw2IUvVak4twfCON4>
Subject: [I18nrp] Limits of IDNA2008 (was:Re: draft-faltstrom-unicode11-04.txt)

--On Wednesday, October 10, 2018 11:12 +0000 "Hollenbeck, Scott" <shollenbeck=40verisign.com@dmarc.ietf.org> wrote:

>...
>> The choices for IETF when things like this happens are:
>>
>> 1. Keep IDNA2008 with no exceptions
>>
>> 2. Keep IDNA2008 with exceptions
>>
>> 3. Stop referring (directly) to Unicode as it is not stable
>> enough
>>
>> Probably more choices than these...
>>
>> My proposal is [1], together with a more forceful push to
>> strict IDNA2008 adoption. No IDNA2003, no UTS#46, no homebrew
>> mixes. Including that registries really do a careful
>> conservative selection of code points to be used in whatever
>> context it is to be used.
>
> I tend to agree. It's more stable for registry operators.

Scott (and everyone else),

FWIW, I do too, but I'm a little concerned about the sleeping dragon (not merely a nice elephant [1]) in this particular room.

When we designed IDNA2008 (and, to a considerable degree, IDNA2003, the JET variant model of RFC 3743, and the "preferred syntax" rules of RFC 1034/1035) we more or less assumed that almost everything could be handled by character rules. In other words, we would identify which characters were ok, which ones were not, and which ones were going to be treated as equivalent to which other ones. RFC 1034/1035 also contained a few rules about positioning of characters in labels: the "no leading digits" rule that was later abandoned and at least a general assumption that hyphens belonged in the middle of strings, not at either end (you will recall that it didn't take long into the ICANN period before registrants tested the latter). With IDNA2008, we recognized that some rules were needed to prevent real problems with multiple-character sequences within a label and reflected them in the CONTEXTJ and CONTEXTO rules, but the standard is basically still about valid and disallowed (invalid) characters for use in labels.
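
As a rough illustration of the "character rules plus a few contextual rules" model described above, here is a minimal Python sketch. The `toy_pvalid` function is a hypothetical, heavily simplified stand-in for the real PVALID derivation (it is not the RFC 5892 tables), and all function names are assumptions made for this example. The ZERO WIDTH JOINER check is modeled on the contextual rule that permits the joiner only directly after a virama.

```python
import unicodedata

VIRAMA_CCC = 9   # canonical combining class shared by virama characters
ZWJ = "\u200d"   # ZERO WIDTH JOINER, a CONTEXTJ code point in IDNA2008

def toy_pvalid(ch: str) -> bool:
    """Hypothetical stand-in for PVALID: NOT the real RFC 5892 derivation."""
    if ch == "-":
        return True
    # Lowercase/caseless letters, combining marks, and decimal digits only.
    return unicodedata.category(ch) in {"Ll", "Lo", "Lm", "Mn", "Mc", "Nd"}

def zwj_context_ok(label: str, i: int) -> bool:
    """Contextual rule sketch: ZWJ is acceptable only right after a virama."""
    return i > 0 and unicodedata.combining(label[i - 1]) == VIRAMA_CCC

def check_label(label: str) -> bool:
    """Accept a label if every code point passes its individual rule."""
    for i, ch in enumerate(label):
        if ch == ZWJ:
            if not zwj_context_ok(label, i):
                return False
        elif not toy_pvalid(ch):
            return False
    return True

print(check_label("example"))                   # plain lowercase ASCII
print(check_label("a\u200db"))                  # ZWJ with no preceding virama
print(check_label("\u0915\u094d\u200d\u0937"))  # KA, virama, ZWJ, SSA
```

Note that the decision is made one code point at a time, with at most a glance at the immediate neighbor. Nothing in this structure evaluates the label as a whole.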

What we didn't do was deal with a number of character-sequencing issues that essentially would prohibit some labels even though all of the characters (code points) in them are ok (PVALID or conforming to the CONTEXTx rules) individually. Most of the issues are fundamental to the relevant writing systems, not something that can be blamed on Unicode decisions. We didn't deal with them in the IDNA2008 rules and algorithm for at least three reasons: (i) those of us who did the IDNA2008 design work underestimated their importance and complexity and, as it turned out, no one set us straight, (ii) we didn't know how to specify appropriate rules, and (iii) we thought we had specified an effective workaround.

Well, we were wrong. Our understanding of the effectiveness and universality of the Unicode normalization rules was somewhere between insufficient and just plain wrong. We made some assumptions about future (relative to circa version 3.2) extensions to Unicode that were not quite adequate. We did not try to consider scripts that have rendering requirements that go well beyond simply displaying a Unicode string in sequence, one grapheme (treated atomically) at a time, and the risks posed by some systems trying to display things that way, others rendering those strings correctly, and possible confusion between the two groups. We also did not consider special measures for complex scripts in which certain sequences of characters just make no sense (and cannot be rendered in any plausible way) even if the equivalent sequence of code points can be formed into a string. In addition, a great many of the discussions about IDNs in recent years have focused on confusion among characters, and IDNA2008 (deliberately) did not deal with that either.
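
To make the gap concrete, the following Python sketch shows two things a purely per-code-point check does not catch. The `toy_codepoint_ok` function is again a hypothetical simplification, not the real IDNA2008 derivation, and the Devanagari sequence (one consonant followed by two dependent vowel signs) was chosen for illustration on the assumption that readers of the script would not consider it well-formed.

```python
import unicodedata

def toy_codepoint_ok(ch: str) -> bool:
    """Hypothetical per-code-point rule: NOT the real IDNA2008 derivation."""
    return ch == "-" or unicodedata.category(ch) in {
        "Ll", "Lo", "Lm", "Mn", "Mc", "Nd"
    }

def all_codepoints_ok(label: str) -> bool:
    """True when every code point is individually acceptable."""
    return all(toy_codepoint_ok(ch) for ch in label)

# 1. Sequencing: KA followed by VOWEL SIGN AA and then VOWEL SIGN I.
#    Each code point passes on its own; the check never looks at the sequence.
stacked_vowels = "\u0915\u093e\u093f"
print(all_codepoints_ok(stacked_vowels))

# 2. Normalization: NFC unifies canonically equivalent forms...
decomposed = "e\u0301"   # "e" + COMBINING ACUTE ACCENT
precomposed = "\u00e9"   # LATIN SMALL LETTER E WITH ACUTE
print(unicodedata.normalize("NFC", decomposed) == precomposed)

#    ...but it leaves cross-script lookalikes distinct.
latin_a = "\u0061"       # LATIN SMALL LETTER A
cyrillic_a = "\u0430"    # CYRILLIC SMALL LETTER A
print(unicodedata.normalize("NFC", latin_a)
      == unicodedata.normalize("NFC", cyrillic_a))
```

Catching the first case would need script-specific sequence rules, and the second is a confusability question rather than a normalization one. Neither fits a model built on classifying individual code points.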

It is clear (at least to me) that, if we decided we wanted to change the IDNA2008 rules and categories to address those issues, the solution would lie, not in an expanded exception list (Patrik's #2), but in actually revisiting and adding to the categories and rules themselves. Whether that is feasible (even if the IETF had more energy and expertise) is, to me at least, an open question.

The IDNA2008 spec proposed to deal with that range of issues by following and specifying a model that goes back at least to RFC 1591 and arguably earlier, a model that was reflected in the earliest of ICANN IDN guidelines. The model is that registries were expected to exercise considerable responsibility to the community and, in particular, that they not allow registration of strings involving scripts that they didn't understand and for which they were unwilling to be accountable [2]. That provision may have been hopelessly naive from the beginning. Certainly some registries have been better-behaved (by that criterion) than others and there is a perception that those who decided to behave well and carefully would put themselves at a commercial disadvantage. The possibility that it would have provided a full employment situation for the relatively small number of i18n experts out there may or may not be a consideration.

However, whether that requirement for responsible, community-serving, registry responsibility isn't up to the pressures now being placed on IDNA or whether it was hopeless from the beginning, it seems very clear that IDNA2008 without it is not up to the job many people think it should do. Even if the only thing we should do is to adjust our expectations and be clear about the modified versions, it seems to me that charging ahead without reviewing those questions and whether the standards need changing would border on irresponsible.

    john

[1] As another aspect of why true internationalization is difficult, I note that I am contemplating a nasty-tempered, fire-breathing, European dragon, not a kindly, luck-bringing East Asian one :-)

[2] One of the documents in the i18n queue, draft-klensin-idna-rfc5891bis, is about clarifying and reinforcing that "registries must be responsible" requirement. It is possible that a better alternative would be to face reality and abandon it, but, if we do, it would be good to have an alternative.