Re: [I18nrp] Limits of IDNA2008 (was:Re: draft-faltstrom-unicode11-04.txt)

Asmus Freytag <asmusf@ix.netcom.com> Thu, 11 October 2018 01:17 UTC

To: John C Klensin <john-ietf@jck.com>, "Hollenbeck, Scott" <shollenbeck=40verisign.com@dmarc.ietf.org>
Cc: paf=40frobbit.se@dmarc.ietf.org, i18nrp@ietf.org, iab@iab.org
Archived-At: <https://mailarchive.ietf.org/arch/msg/i18nrp/DIIiQ6CmFncxxFUWdWiWLVqkbEM>

On 10/10/2018 12:17 PM, John C Klensin wrote:
>
> --On Wednesday, October 10, 2018 11:12 +0000 "Hollenbeck, Scott"
> <shollenbeck=40verisign.com@dmarc.ietf.org> wrote:
>
>> ...
>>> The choices for IETF when things like this happen are:
>>>
>>> 1. Keep IDNA2008 with no exceptions
>>>
>>> 2. Keep IDNA2008 with exceptions
>>>
>>> 3. Stop referring (directly) to Unicode as it is not stable
>>> enough
>>>
>>> Probably more choices than these...
>>>
>>> My proposal is [1], together with a more forceful push to
>>> strict IDNA2008 adoption. No IDNA2003, no UTS#46, no homebrew
>>> mixes. Including that registries really do a careful
>>> conservative selection of code points to be used in whatever
>>> context it is to be used.
>> I tend to agree. It's more stable for registry operators.
> Scott (and everyone else),
>
> FWIW, I do too, but I'm a little concerned about the
> sleeping dragon (not merely a nice elephant [1]) in
> this particular room.  When we designed IDNA2008 (and, to a
> considerable degree IDNA2003, the JET variant model of RFC 3743,
> and the "preferred syntax" rules of RFC 1034/1035) we more or
> less assumed that almost everything could be handled by
> character rules.  In other words, we would identify which
> characters were ok, which ones were not, and which ones were
> going to be treated as equivalent to which other ones.  RFC
> 1034/1035 also contained a few rules about positioning of
> characters in labels: the "no leading digits" rule that was
> later abandoned and at least a general assumption that hyphens
> belonged in the middle of strings, not at either end (you will
> recall that it didn't take long into the ICANN period before
> registrants tested the latter).
>
> With IDNA2008, we recognized that some rules were needed to
> prevent real problems with multiple character sequences whole
> label and reflected them in the CONTEXTJ and CONTEXTO rules, but
> the standard is basically still about valid and disallowed
> (invalid) characters for use in labels.
>
> What we didn't do was deal with a number of character-sequencing
> issues that essentially would prohibit some labels even though
> all of the characters (code points) in them are ok (PVALID or
> conforming to the CONTEXTx rules) individually.  Most of the
> issues are fundamental to the relevant writing systems, not
> something that can be blamed on Unicode decisions.   We didn't
> deal with them in the IDNA2008 rules and algorithm for at least
> three reasons: (i) those of us who did the IDNA2008 design work
> underestimated their importance and complexity and, as it turned
> out, no one set us straight, (ii) we didn't know how to specify
> appropriate rules, and (iii) we thought we had specified an
> effective workaround.
>
> Well, we were wrong.  Our understanding of the effectiveness and
> universality of the Unicode normalization rules was somewhere
> between insufficient and just plain wrong.   We made some
> assumptions about future (relative to circa version 3.2)
> extensions to Unicode that were not quite adequate.  We did not
> try to consider scripts that have rendering requirements that go
> well beyond simply displaying a Unicode string in sequence, one
> grapheme (treated atomically) at a time and the risks posed by
> some systems trying to display things that way, others rendering
> those strings correctly, and possible confusion between the two
> groups.  We also did not consider special measures for complex
> scripts in which certain sequences of characters just make no
> sense (and cannot be rendered in any plausible way) even if the
> equivalent sequence of code points can be formed into a string.
> In addition, a great many of the discussions about IDNs in
> recent years have focused on confusion among characters, and
> IDNA2008 (deliberately) did not deal with that either.
>
> It is clear (at least to me) that, if we decided we wanted to
> change the IDNA2008 rules and categories to address those issues,
> the solution would lie, not in an expanded exception list
> (Patrik's #2) but in actually revisiting and adding to the
> categories and rules themselves.  Whether that is feasible (even
> if the IETF had more energy and expertise) is, to me at least,
> an open question.

John (and all),

having deeply immersed myself for the last seven years in the problem of
drawing up conservative label generation rules for the root zone that
address all modern scripts (and languages), I feel I've reached an
understanding that allows me to perhaps draw a few conclusions of my own.

First and foremost among them is the conclusion that the problem space
is too complex to yield to any attempts at creating a "one-size-fits-all"
approach. From different requirements of different user communities
to different behavior and internal constraints of different writing systems,
a general solution will either have to be too restrictive (least common
denominator) or too permissive (allowing other, more focused solutions
to be built on it).

I believe that IDNA2008 is an example of the latter, and therefore must
be viewed as foundational; necessary to define the playing field, but not
sufficient without other specifications built on top of it.

The JET specification for defining semantic variants or the RFC on Arabic
domain names are such specifications. As a result of the additional research
and development work performed for the root zone LGRs, these RFCs
should probably be updated by the respective communities.

For Chinese, there's been a bit of a breakthrough in defining allocatable
code point variants in a way that restricts the number of allocatable
variant labels to a reasonable number, even for long labels, avoiding
the risk of combinatorial explosion. The design should be codified in an
RFC so it can be more easily applied for other zones, if desired.
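
To make the risk of combinatorial explosion concrete, here is a minimal
sketch in Python. The variant table and the "allocatable" restriction in it
are hypothetical illustrations, not the design the community arrived at:
the table pairs a few simplified and traditional forms, and the restriction
assumes that only the applied-for label plus its uniformly simplified and
uniformly traditional forms may be allocated.

```python
from itertools import product
from math import prod

# Hypothetical variant table (illustrative only, not taken from any
# published LGR). Each code point maps to its simplified ("simp") and
# traditional ("trad") form.
VARIANTS = {
    "国": {"simp": "国", "trad": "國"},
    "國": {"simp": "国", "trad": "國"},
    "后": {"simp": "后", "trad": "後"},
    "後": {"simp": "后", "trad": "後"},
}

def forms(ch):
    """All distinct forms of one code point, including itself."""
    entry = VARIANTS.get(ch)
    return sorted(set(entry.values())) if entry else [ch]

def variant_count(label):
    """Number of variant labels: the product of per-position choices."""
    return prod(len(forms(ch)) for ch in label)

def all_variant_labels(label):
    """Enumerate every variant label (only feasible for short labels)."""
    return {"".join(p) for p in product(*(forms(ch) for ch in label))}

def allocatable_labels(label):
    """Assumed restriction: the applied-for label plus its uniformly
    simplified and uniformly traditional forms, so at most three labels."""
    result = {label}
    for kind in ("simp", "trad"):
        result.add("".join(VARIANTS.get(ch, {kind: ch})[kind] for ch in label))
    return result
```

Under these assumptions a 20-character label built from such code points
has 2**20 variant labels in total, while the allocatable set never
exceeds three.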

For Arabic, there's for the first time a full accounting of how to treat
*all* the orthographies, including African ones. Important conclusions
reached by the community might make for a useful updated RFC.

For the Indic and SEA languages, nobody had worked out an analysis
of what is actually required until now. (You can find the current LGR drafts
at https://icann.org/idn; they make for interesting reading.) The
community worked hard at creating a reasonably systematic approach
covering 10 or so scripts that share a common history and varying
degrees of a sometimes deep structural as well as surface similarity.
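
As a rough illustration of the kind of constraint involved, the Python
sketch below layers a whole-label rule on top of a per-code-point check.
The tiny Devanagari repertoire and the rule "a vowel sign must immediately
follow a consonant" are simplifying assumptions for illustration, not the
content of any actual LGR draft.

```python
# Hypothetical, heavily simplified repertoire for illustration only;
# a real LGR is far larger and its rules are more detailed.
CONSONANTS = {"\u0915", "\u092E", "\u0932"}  # KA, MA, LA
MATRAS = {"\u093E", "\u093F", "\u0940"}      # vowel signs AA, I, II
REPERTOIRE = CONSONANTS | MATRAS

def code_points_ok(label):
    """Layer 1: every code point is individually permitted."""
    return all(ch in REPERTOIRE for ch in label)

def sequence_ok(label):
    """Layer 2 (assumed rule): a vowel sign must immediately follow
    a consonant."""
    prev = None
    for ch in label:
        if ch in MATRAS and prev not in CONSONANTS:
            return False
        prev = ch
    return True

def label_ok(label):
    """A label is accepted only if both layers accept it."""
    return code_points_ok(label) and sequence_ok(label)
```

With this sketch, the sequence KA + AA + AA passes the per-code-point
check but is rejected by the sequence rule, which is the situation John
describes: every code point is fine, the label is not.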

... (read on)

> The IDNA2008 spec proposed to deal with that range of issues by
> following and specifying a model that goes back at least to RFC
> 1591 and arguably earlier, a model that was reflected in the
> earliest of ICANN IDN guidelines.  The model is that registries
> were expected to exercise considerable responsibility to the
> community and, in particular, that they not allow registration
> of strings involving scripts that they didn't understand and for
> which they were unwilling to be accountable [2].   That
> provision may have been hopelessly naive from the beginning.
> Certainly some registries have been better-behaved (by that
> criterion) than others and there is a perception that those who
> decided to behave well and carefully would put themselves at a
> commercial disadvantage.  The possibility that it would have
> provided a full employment situation for the relatively small
> number of i18n experts out there may or may not be a
> consideration.

...

These community efforts have brought together a very healthy mix
of expertise from the linguistic or writing system end as well as domain
name expertise. This is not something that IETF has any business duplicating,
but perhaps, to the extent that the process resulted in some settled
conclusions, the same communities could be encouraged to capture
them also in an RFC.

Now that we have RFC 7940 defining a unified mechanical representation
of label generation rules (including variants), these efforts, unlike earlier
RFCs, could focus on linguistic and typographic constraints and would
need to spend less effort on defining file formats or notations.

...

> However, whether that requirement for responsible,
> community-serving, registry responsibility isn't up to the
> pressures now being placed on IDNA or whether it was hopeless
> from the beginning, it seems very clear that IDNA2008 without it
> is not up to the job many people think it should do.   Even if
> the only thing we should do is to adjust our expectations and be
> clear about the modified versions, it seems to me that charging
> ahead without reviewing those questions and whether the
> standards need changing would border on irresponsible.

...

Finally, continuing with my understanding of IDNA2008 as both
foundational and (by necessity) having to be slightly permissive so as
to provide the base onto which all these efforts can be built, I want
to argue very strongly for separating all these questions:

(1) how to handle the update to Unicode 11.0 (and beyond) in the existing
IDNA2008 framework

(2) how to handle any (minor) tweaking of IDNA2008 - as in clarifying
its role, setting expectations, and so on

(3) how to best capture the emerging "state of the art" in actually
designing robust label generation rulesets

and perhaps

(4) how and whether to address issues of confusing similarity; or
defining a framework for handling that issue, if that's more appropriate.

I'm very leery of creating unnecessary dependencies among all
of these. I think Patrik's got it right, that his suggested solution of (1)
is relatively cut and dried and can and should progress soonest (we are
not doing ourselves a favor by keeping everything added after Unicode
6.3.0 in some kind of limbo).

A wholesale replacement of IDNA2008 is likewise not on the table for
(2), I believe; no matter its intentions it would just muddy the waters.
That said, some tinkering in the margins, perhaps tightening some
requirements, setting expectations etc. may well be needed - John's
draft is the starting point of that discussion. In the unlikely event this
effort results in the conclusion that it is not possible to avoid some
breaking changes to IDNA2008, I'm rather of the opinion that they
would not be limited to the few issues covered in (1),  and we would
have deeper issues than the question of whether we should have
gone ahead and published (1).

(3) and (4) should build on top of IDNA2008 - as long as there are no
gravely incompatible changes introduced by (2), these efforts can
proceed at their own pace.

So, yes, my conclusion is, let's move ahead with (1) and get it out of
the way. Then tackle (2).

A./

>
>      john
>
>
>
> [1] As another aspect of why true internationalization is
> difficult, I note that I am contemplating a nasty-tempered,
> fire-breathing, European dragon, not a kindly, luck-bringing
> East Asian one :-)
>
> [2] One of the documents in the i18n queue,
> draft-klensin-idna-rfc5891bis, is about clarifying and
> reinforcing that "registries must be responsible" requirement.
> It is possible that a better alternative would be to face
> reality and abandon it, but, if we do, it would be good to have
> an alternative.