[I18nrp] Limits of IDNA2008 (was:Re: draft-faltstrom-unicode11-04.txt)

John C Klensin <john-ietf@jck.com> Wed, 10 October 2018 19:17 UTC

Date: Wed, 10 Oct 2018 15:17:25 -0400
From: John C Klensin <john-ietf@jck.com>
To: "Hollenbeck, Scott" <shollenbeck=40verisign.com@dmarc.ietf.org>
cc: "'paf=40frobbit.se@dmarc.ietf.org'" <paf=40frobbit.se@dmarc.ietf.org>, "'i18nrp@ietf.org'" <i18nrp@ietf.org>, "'iab@iab.org'" <iab@iab.org>
Archived-At: <https://mailarchive.ietf.org/arch/msg/i18nrp/pwQCt4xQWILw2IUvVak4twfCON4>


--On Wednesday, October 10, 2018 11:12 +0000 "Hollenbeck, Scott"
<shollenbeck=40verisign.com@dmarc.ietf.org> wrote:

>...
>> The choices for IETF when things like this happens are:
>> 
>> 1. Keep IDNA2008 with no exceptions
>> 
>> 2. Keep IDNA2008 with exceptions
>> 
>> 3. Stop referring (directly) to Unicode as it is not stable
>> enough
>> 
>> Probably more choices than these...
>> 
>> My proposal is [1], together with a more forceful push to
>> strict IDNA2008 adoption. No IDNA2003, no UTS#46, no homebrew
>> mixes. Including that registries really do a careful
>> conservative selection of code points to be used in whatever
>> context it is to be used.
> 
> I tend to agree. It's more stable for registry operators.

Scott (and everyone else),

FWIW, I do too, but I'm a little concerned about the
sleeping dragon (not merely a nice elephant [1]) in
this particular room.  When we designed IDNA2008 (and, to a
considerable degree, IDNA2003, the JET variant model of RFC 3743,
and the "preferred syntax" rules of RFC 1034/1035) we more or
less assumed that almost everything could be handled by
character rules.  In other words, we would identify which
characters were ok, which ones were not, and which ones were
going to be treated as equivalent to which other ones.  RFC
1034/1035 also contained a few rules about positioning of
characters in labels: the "no leading digits" rule that was
later abandoned and at least a general assumption that hyphens
belonged in the middle of strings, not at either end (you will
recall that it didn't take long into the ICANN period before
registrants tested the latter).

With IDNA2008, we recognized that some rules were needed to
prevent real problems with sequences of multiple characters
within a whole label and reflected them in the CONTEXTJ and
CONTEXTO rules, but the standard is basically still about valid
and disallowed (invalid) characters for use in labels.
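
To make the "per-character rules plus a few contextual rules"
shape concrete, here is a minimal Python sketch.  It is an
illustration under stated assumptions, not an IDNA2008
implementation: toy_pvalid() is a hypothetical stand-in for the
real derived-property table, and only one contextual rule is
shown, the ZERO WIDTH JOINER rule of RFC 5892 Appendix A.2
(ZWJ is acceptable only immediately after a virama).

```python
import unicodedata

ZWJ = "\u200d"      # ZERO WIDTH JOINER, a CONTEXTJ code point
VIRAMA_CCC = 9      # canonical combining class "Virama"


def toy_pvalid(ch: str) -> bool:
    """Hypothetical stand-in for a PVALID lookup (NOT the RFC 5892 table)."""
    return ch == "-" or unicodedata.category(ch) in {
        "Ll", "Lo", "Lm", "Mn", "Mc", "Nd"
    }


def zwj_allowed(label: str, i: int) -> bool:
    """Contextual rule: ZWJ must directly follow a virama."""
    return i > 0 and unicodedata.combining(label[i - 1]) == VIRAMA_CCC


def check_label(label: str) -> bool:
    """Accept a label only if every code point passes its own rule."""
    if not label:
        return False
    for i, ch in enumerate(label):
        if ch == ZWJ:
            if not zwj_allowed(label, i):
                return False
        elif not toy_pvalid(ch):
            return False
    return True


# KA + VIRAMA + ZWJ + SSA: the ZWJ follows a virama, so it passes.
print(check_label("\u0915\u094d\u200d\u0937"))   # True
# ZWJ between two Latin letters: the contextual rule rejects it.
print(check_label("a\u200db"))                   # False
```

Note that every test in the sketch looks at one code point and,
at most, its immediate neighbor; nothing evaluates the label as
a whole.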

What we didn't do was deal with a number of character-sequencing
issues that essentially would prohibit some labels even though
all of the characters (code points) in them are ok (PVALID or
conforming to the CONTEXTx rules) individually.  Most of the
issues are fundamental to the relevant writing systems, not
something that can be blamed on Unicode decisions.   We didn't
deal with them in the IDNA2008 rules and algorithm for at least
three reasons: (i) those of us who did the IDNA2008 design work
underestimated their importance and complexity and, as it turned
out, no one set us straight, (ii) we didn't know how to specify
appropriate rules, and (iii) we thought we had specified an
effective workaround.  
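
A small Python sketch of what such a sequencing issue looks
like in practice.  Everything here is an assumption made for
illustration: the per-code-point test is a toy, and the
"doubled virama" rule is a hypothetical example of a sequence
rule, not something taken from IDNA2008 or from any registry's
policy.

```python
import unicodedata

VIRAMA_CCC = 9  # canonical combining class "Virama"


def per_code_point_ok(label: str) -> bool:
    """Toy per-character test: lowercase/other letters and combining marks."""
    return all(
        unicodedata.category(ch) in {"Ll", "Lo", "Mn", "Mc"} for ch in label
    )


def has_doubled_virama(label: str) -> bool:
    """Hypothetical sequence rule: flag two viramas in a row."""
    return any(
        unicodedata.combining(a) == VIRAMA_CCC
        and unicodedata.combining(b) == VIRAMA_CCC
        for a, b in zip(label, label[1:])
    )


# DEVANAGARI KA + VIRAMA + VIRAMA + SSA
label = "\u0915\u094d\u094d\u0937"
print(per_code_point_ok(label))    # True: each code point is fine alone
print(has_doubled_virama(label))   # True: the sequence is what is suspect
```

The point of the sketch is only that the second check cannot be
expressed as a property of any single code point.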

Well, we were wrong.  Our understanding of the effectiveness and
universality of the Unicode normalization rules was somewhere
between insufficient and just plain wrong.   We made some
assumptions about future (relative to circa version 3.2)
extensions to Unicode that were not quite adequate.  We did not
try to consider scripts that have rendering requirements that go
well beyond simply displaying a Unicode string in sequence, one
grapheme (treated atomically) at a time, and the risks posed by
some systems trying to display things that way, others rendering
those strings correctly, and possible confusion between the two
groups.  We also did not consider special measures for complex
scripts in which certain sequences of characters just make no
sense (and cannot be rendered in any plausible way) even if the
equivalent sequence of code points can be formed into a string.
In addition, a great many of the discussions about IDNs in
recent years have focused on confusion among characters, and
IDNA2008 (deliberately) did not deal with that either.
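
One way to see the limits of normalization is a short Python
sketch using the standard unicodedata module.  The choice of
example is an assumption made for illustration and is not taken
from this message: U+08A1 (ARABIC LETTER BEH WITH HAMZA ABOVE)
has no canonical decomposition, so NFC does not unify it with
the sequence U+0628 U+0654, whereas "e" plus a combining acute
accent does compose to U+00E9.

```python
import unicodedata


def nfc_equal(a: str, b: str) -> bool:
    """True if the two strings are identical after NFC normalization."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


# Case where normalization does what one would hope:
# LATIN SMALL LETTER E + COMBINING ACUTE ACCENT vs. precomposed U+00E9.
print(nfc_equal("e\u0301", "\u00e9"))        # True

# Case where it does not: BEH + HAMZA ABOVE vs. the single code
# point BEH WITH HAMZA ABOVE.  Neither string changes under NFC.
print(nfc_equal("\u0628\u0654", "\u08a1"))   # False
```

Two labels that differ only in this way remain distinct strings
after normalization, so any protection has to come from
somewhere other than NFC.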

It is clear (at least to me) that, if we decided we wanted to
change the IDNA2008 rules and categories to address those issues,
the solution would lie not in an expanded exception list
(Patrik's #2) but in actually revisiting and adding to the
categories and rules themselves.  Whether that is feasible (even
if the IETF had more energy and expertise) is, to me at least,
an open question.

The IDNA2008 spec proposed to deal with that range of issues by
following and specifying a model that goes back at least to RFC
1591 and arguably earlier, a model that was reflected in the
earliest of ICANN IDN guidelines.  The model is that registries
were expected to exercise considerable responsibility to the
community and, in particular, that they not allow registration
of strings involving scripts that they didn't understand and for
which they were unwilling to be accountable [2].   That
provision may have been hopelessly naive from the beginning.
Certainly some registries have been better-behaved (by that
criterion) than others and there is a perception that those who
decided to behave well and carefully would put themselves at a
commercial disadvantage.  The possibility that it would have
provided a full employment situation for the relatively small
number of i18n experts out there may or may not be a
consideration.
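
The registry-responsibility model can be sketched as a
conservative repertoire check applied at registration time.
The Python below is a minimal sketch under stated assumptions:
REPERTOIRE and registrable() are hypothetical names, and the
repertoire shown (ASCII letters, digits, hyphen, and three
additional Latin letters) is invented for the example rather
than taken from any real registry's policy.

```python
import unicodedata

# Hypothetical repertoire: only code points this registry claims
# to understand and is willing to be accountable for.
REPERTOIRE = frozenset("abcdefghijklmnopqrstuvwxyz0123456789-\u00e5\u00e4\u00f6")


def registrable(label: str) -> bool:
    """Accept a label only if, after NFC, every code point is in REPERTOIRE."""
    nfc = unicodedata.normalize("NFC", label)
    return bool(nfc) and all(ch in REPERTOIRE for ch in nfc)


print(registrable("malm\u00f6"))    # True: all code points are in the repertoire
print(registrable("m\u0430lmo"))    # False: U+0430 is CYRILLIC SMALL LETTER A
```

A check like this is deliberately narrower than what the
protocol permits; it rejects code points the protocol would
accept simply because the registry has not chosen to support
them.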

However, whether that requirement for responsible,
community-serving registry behavior is no longer up to the
pressures now being placed on IDNA or whether it was hopeless
from the beginning, it seems very clear that IDNA2008 without it
is not up to the job many people think it should do.   Even if
the only thing we should do is to adjust our expectations and be
clear about the modified versions, it seems to me that charging
ahead without reviewing those questions and whether the
standards need changing would border on irresponsible.

    john



[1] As another aspect of why true internationalization is
difficult, I note that I am contemplating a nasty-tempered,
fire-breathing, European dragon, not a kindly, luck-bringing
East Asian one :-)

[2] One of the documents in the i18n queue,
draft-klensin-idna-rfc5891bis, is about clarifying and
reinforcing that "registries must be responsible" requirement.
It is possible that a better alternative would be to face
reality and abandon it, but, if we do, it would be good to have
an alternative.