Re: [I18nrp] Limits of IDNA2008 (was:Re: draft-faltstrom-unicode11-04.txt)
Asmus Freytag <asmusf@ix.netcom.com> Thu, 11 October 2018 01:17 UTC
To: John C Klensin <john-ietf@jck.com>,
"Hollenbeck, Scott" <shollenbeck=40verisign.com@dmarc.ietf.org>
Cc: paf=40frobbit.se@dmarc.ietf.org, i18nrp@ietf.org, iab@iab.org
References: <FB9181768D399AB7695B2E70@PSB>
From: Asmus Freytag <asmusf@ix.netcom.com>
Message-ID: <e6002e32-a0db-9d25-976e-fb4f31fea9f4@ix.netcom.com>
Date: Wed, 10 Oct 2018 18:17:32 -0700
Archived-At: <https://mailarchive.ietf.org/arch/msg/i18nrp/DIIiQ6CmFncxxFUWdWiWLVqkbEM>
Subject: Re: [I18nrp] Limits of IDNA2008 (was:Re:
draft-faltstrom-unicode11-04.txt)

On 10/10/2018 12:17 PM, John C Klensin wrote:
>
> --On Wednesday, October 10, 2018 11:12 +0000 "Hollenbeck, Scott"
> <shollenbeck=40verisign.com@dmarc.ietf.org> wrote:
>
>> ...
>>> The choices for IETF when things like this happen are:
>>>
>>> 1. Keep IDNA2008 with no exceptions
>>>
>>> 2. Keep IDNA2008 with exceptions
>>>
>>> 3. Stop referring (directly) to Unicode as it is not stable
>>> enough
>>>
>>> Probably more choices than these...
>>>
>>> My proposal is [1], together with a more forceful push to
>>> strict IDNA2008 adoption. No IDNA2003, no UTS#46, no homebrew
>>> mixes. Including that registries really do a careful
>>> conservative selection of code points to be used in whatever
>>> context it is to be used.
>> I tend to agree. It's more stable for registry operators.
> Scott (and everyone else),
>
> FWIW, I do too, but I'm a little concerned about the
> sleeping dragon (not merely a nice elephant [1]) in
> this particular room. When we designed IDNA2008 (and, to a
> considerable degree IDNA2003, the JET variant model of RFC 3743,
> and the "preferred syntax" rules of RFC 1034/1035) we more or
> less assumed that almost everything could be handled by
> character rules. In other words, we would identify which
> characters were ok, which ones were not, and which ones were
> going to be treated as equivalent to which other ones. RFC
> 1034/1035 also contained a few rules about positioning of
> characters in labels: the "no leading digits" rule that was
> later abandoned and at least a general assumption that hyphens
> belonged in the middle of strings, not at either end (you will
> recall that it didn't take long into the ICANN period before
> registrants tested the latter).
>
> With IDNA2008, we recognized that some rules were needed to
> prevent real problems with multiple character sequences whole
> label and reflected them in the CONTEXTJ and CONTEXTO rules, but
> the standard is basically still about valid and disallowed
> (invalid) characters for use in labels.
>
> What we didn't do was deal with a number of character-sequencing
> issues that essentially would prohibit some labels even though
> all of the characters (code points) in them are ok (PVALID or
> conforming to the CONTEXTx rules) individually. Most of the
> issues are fundamental to the relevant writing systems, not
> something that can be blamed on Unicode decisions. We didn't
> deal with them in the IDNA2008 rules and algorithm for at least
> three reasons: (i) those of us who did the IDNA2008 design work
> underestimated their importance and complexity and, as it turned
> out, no one set us straight, (ii) we didn't know how to specify
> appropriate rules, and (iii) we thought we had specified an
> effective workaround.
>
> Well, we were wrong. Our understanding of the effectiveness and
> universality of the Unicode normalization rules was somewhere
> between insufficient and just plain wrong. We made some
> assumptions about future (relative to circa version 3.2)
> extensions to Unicode that were not quite adequate. We did not
> try to consider scripts that have rendering requirements that go
> well beyond simply displaying a Unicode string in sequence, one
> grapheme (treated atomically) at a time and the risks posed by
> some systems trying to display things that way, others rendering
> those strings correctly, and possible confusion between the two
> groups. We also did not consider special measures for complex
> scripts in which certain sequences of characters just make no
> sense (and cannot be rendered in any plausible way) even if the
> equivalent sequence of code points can be formed into a string.
> In addition, a great many of the discussions about IDNs in
> recent years have focused on confusion among characters, and
> IDNA2008 (deliberately) did not deal with that either.
>
> It is clear (at least to me) that, if we decided we wanted to
> change the IDNA2008 rules and categories to address those issues,
> the solution would lie, not in an expanded exception list
> (Patrik's #2) but in actually revisiting and adding to the
> categories and rules themselves. Whether that is feasible (even
> if the IETF had more energy and expertise) is, to me at least,
> an open question.

John (and all),

having deeply immersed myself for the last seven years in the
problem of drawing up conservative label generation rules for the
root zone that address all modern scripts (and languages), I feel
I've reached an understanding that allows me to perhaps draw a few
conclusions of my own.

First and foremost among them is the conclusion that the problem
space is too complex to yield to any attempts at creating a
"one-size-fits-all" approach.

From different requirements of different user communities to
different behavior and internal constraints of different writing
systems, a general solution will either have to be too restrictive
(least common denominator) or too permissive (allowing other, more
focused solutions to be built on it).

I believe that IDNA2008 is an example of the latter, and therefore
must be viewed as foundational; necessary to define the playing
field, but not sufficient to be complete without other
specifications built on top of it.

The JET specification for defining semantic variants or the RFC on
Arabic domain names are such specifications. As a result of the
additional research and development work performed for the root
zone LGRs, these RFCs should probably be updated by the respective
communities.

For Chinese, there's been a bit of a breakthrough in defining
allocatable code point variants in a way that restricts the number
of allocatable variant labels to a reasonable number, even for long
labels, avoiding the risk of combinatorial explosion. The design
should be codified in an RFC so it can be more easily applied for
other zones, if desired.

For Arabic, there's for the first time a full accounting of how to
treat *all* the orthographies, including African ones. Important
conclusions reached by the community might make for a useful
updated RFC.

For the Indic and SEA languages, nobody had worked out an analysis
of what is actually required until now. (You can find the current
LGR drafts at https://icann.org/idn - they make for interesting
reading). The community worked hard at creating a reasonably
systematic approach covering 10 or so scripts that share a common
history and varying degrees of a sometimes deep structural as well
as surface similarity.

... (read on)

> The IDNA2008 spec proposed to deal with that range of issues by
> following and specifying a model that goes back at least to RFC
> 1591 and arguably earlier, a model that was reflected in the
> earliest of ICANN IDN guidelines. The model is that registries
> were expected to exercise considerable responsibility to the
> community and, in particular, that they not allow registration
> of strings involving scripts that they didn't understand and for
> which they were unwilling to be accountable [2]. That
> provision may have been hopelessly naive from the beginning.
> Certainly some registries have been better-behaved (by that
> criterion) than others and there is a perception that those who
> decided to behave well and carefully would put themselves at a
> commercial disadvantage. The possibility that it would have
> provided a full employment situation for the relatively small
> number of i18n experts out there may or may not be a
> consideration.

...

These community efforts have brought together a very healthy mix of
expertise from the linguistic or writing system end as well as
domain name expertise. This is not something that the IETF has any
business duplicating, but perhaps, to the extent that the process
resulted in some settled conclusions, the same communities could be
encouraged to capture them also in an RFC.

Now that we have RFC7940 defining a unified mechanical
representation of label generation rules (including variants),
these efforts, unlike earlier RFCs, could focus on linguistic and
typographic constraints and would need to spend less effort on
defining file formats or notations.

...

> However, whether that requirement for responsible,
> community-serving, registry responsibility isn't up to the
> pressures now being placed on IDNA or whether it was hopeless
> from the beginning, it seems very clear that IDNA2008 without it
> is not up to the job many people think it should do. Even if
> the only thing we should do is to adjust our expectations and be
> clear about the modified versions, it seems to me that charging
> ahead without reviewing those questions and whether the
> standards need changing would border on irresponsible.

...

Finally, continuing with my understanding of IDNA2008 as both
foundational and (by necessity) having to be slightly permissive so
as to provide the base onto which all these efforts can be built, I
want to argue very strongly for separating all these questions:

(1) how to handle the update to Unicode 11.0 (and beyond) in the
    existing IDNA2008 framework

(2) how to handle any (minor) tweaking of IDNA2008 - as in
    clarifying its role, setting expectations, and so on

(3) how to best capture the emerging "state of the art" in actually
    designing robust label generation rulesets

and perhaps

(4) how and whether to address issues of confusing similarity; or
    defining a framework for handling that issue, if that's more
    appropriate.

I'm very leery of creating unnecessary dependencies among all of
these.

I think Patrik's got it right, that his suggested solution of (1)
is relatively cut&dried and can and should progress soonest (we are
not doing ourselves a favor by keeping everything added after
Unicode 6.3.0 in some kind of limbo).

A wholesale replacement of IDNA2008 is likewise not on the table
for (2), I believe; no matter its intentions it would just muddy
the waters. That said, some tinkering in the margins, perhaps
tightening some requirements, setting expectations etc. may well be
needed - John's draft is the starting point of that discussion.

In the unlikely event this effort results in the conclusion that it
is not possible to avoid some breaking changes to IDNA2008, I'm
rather of the opinion that they would not be limited to the few
issues covered in (1), and we would have deeper issues than the
question of whether we should have gone ahead and published (1).

(3) and (4) should build on top of IDNA2008 - as long as there are
no gravely incompatible changes introduced by (2), these efforts
can proceed at their own pace.

So, yes, my conclusion is, let's move ahead with (1) and get it out
of the way. Then tackle (2).

A./

>
> john
>
>
> [1] As another aspect of why true internationalization is
> difficult, I note that I am contemplating a nasty-tempered,
> fire-breathing, European dragon, not a kindly, luck-bringing
> East Asian one :-)
>
> [2] One of the documents in the i18n queue,
> draft-klensin-idna-rfc5891bis, is about clarifying and
> reinforcing that "registries must be responsible" requirement.
> It is possible that a better alternative would be to face
> reality and abandon it, but, if we do, it would be good to have
> an alternative.
>
> _______________________________________________
> i18nRP mailing list
> i18nRP@ietf.org
> https://www.ietf.org/mailman/listinfo/i18nrp
>