Re: [Idna-update] Expiration impending: <draft-klensin-idna-rfc5891bis-01.txt>

John C Klensin <john-ietf@jck.com> Tue, 06 March 2018 23:06 UTC

Archived-At: <https://mailarchive.ietf.org/arch/msg/idna-update/cKOzbsCVW8Mkr6v1pPsugFpNtPM>

Hi.

I have been mostly offline after sending my note a bit before
23:30 UTC yesterday and am going to try to address several
issues in one response.  I'm going to try to write this quickly,
so apologies if I miss something important.

(1) First, I think there are a few major principles that
separate IDNA2008 from IDNA2003.  They are:

(i) Moving away from the design of tables of permitted and not
permitted code points (including some transformations expressed
in the same tables) to principles about code point inclusion and
exclusion expressed primarily in rules about Unicode properties.
The IDNA2003 model requires revisions with every version of
Unicode to stay current.  The IDNA2008 model prefers a review
with each new version of Unicode but, in principle (and probably
in practice, if we got the rules right), does not require one.
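To make the contrast concrete, the rule-based approach can be
sketched in a few lines of Python.  This is a deliberately
incomplete approximation of the RFC 5892 derived-property
calculation, intended only to illustrate deriving a code point's
status from Unicode properties rather than from a hand-maintained
table; it is not a conforming implementation.

```python
import unicodedata

# A deliberately simplified approximation of the RFC 5892
# derived-property calculation.  The real algorithm has many more
# rules (exception lists, contextual rules for joiners, Hangul
# handling, unassigned code points, etc.); this only illustrates
# the idea of rules over Unicode properties instead of a table.
LETTER_DIGITS = {"Ll", "Lo", "Lm", "Mn", "Mc", "Nd"}

def approx_idna2008_status(cp: int) -> str:
    ch = chr(cp)
    if unicodedata.category(ch) not in LETTER_DIGITS:
        return "DISALLOWED"   # e.g. uppercase, symbols, punctuation
    if unicodedata.normalize("NFC", ch) != ch:
        return "DISALLOWED"   # not stable under NFC
    if ch.casefold() != ch:
        return "DISALLOWED"   # not stable under case folding
    return "PVALID"
```

Run against a new Unicode version, a rule set of this general
shape classifies newly assigned code points automatically, which
is the sense in which IDNA2008 does not, in principle, require a
revision per Unicode release.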

(ii) Clarifying responsibility for the protocol to work
correctly.  IDNA2003 put almost all of the responsibility on the
registration side of things by having only very weak checks on
lookup -- in essence, if one could find the name, it was ok.
That model, btw, persists in UTR #46 and the closely-related
WHATWG recommendations.  IDNA2008 requires lookup-time checks
for at least some minimal level of validity, a change that not
only makes the
protocol work better but that provides something of a check on
abusive and/or non-conforming registration processes at all
levels of the DNS tree (see below for more about that).  At the
same time, IDNA2008 clarifies the responsibility of registries
to know what they are doing in allowing a string to be
registered.   draft-klensin-idna-rfc5891bis-01 essentially
reinforces and restates that requirement without changing
anything.  It may also be helpful to remember that requirements
that registries understand what they are doing and act in a
responsible fashion predate IDNs, and ICANN, and even the
"trustee" language of RFC 1591.  Enforcement is, of course, not
the IETF's problem.   See below.

(iii) What is normative and what is just helpful.   It is
important to note that it is the rules of IDNA2008 that are
normative.  The derived tables (including the ones in the IANA
databases) are an important response to those who "just want to
be told what to do" or who don't have the time and resources to
do the calculations, but they are not part of the standard or,
as far as that standard is concerned, particularly important.
Of course, those tables do not remove the obligation to be
informed and careful from anyone, but it is good to be
realistic, especially in the current ICANN climate.  Of course,
users of the standard may treat tables more authoritatively.
That is fine as long as the tables for one year do not treat
previously disallowed characters as PVALID or vice versa without
careful explanation and documentation... see below.

(iv) Permissiveness and conservatism.  Depending on how one
reads IDNA2003, this may not be a difference, but, while one of
the IDNA2008 design principles is to extend the LDH "preferred
syntax" rules to non-ASCII characters, it is ultimately very
permissive, with few restrictions other than those motivated by
having the protocol work.  It is about specifying things that
allow reasonable parties to create mnemonic labels using
characters that they understand (see (ii) above and comments
below) and not about trying to guard against bad behavior except
in the most problematic circumstances.  In particular, there are
no rules requiring that labels make linguistic sense, no rules
against archaic scripts, no rules prohibiting mixed-script
labels or mixing, e.g., digit types within a label, any of which
rules might be reasonable for a registry to impose.

(v) Finally and driven strongly by user experience and reports,
IDNA2008 attempts to be sure that the round trip between a label
as expressed by the user and the label as stored in the DNS is
an identity relationship.  The restrictions on mapping and case
folding (especially the latter) have other motivations as well,
but are primarily just corollaries of the U-label <-> A-label
requirements.  The principle implies other requirements that we
haven't talked much about.  One is that using a non-ASCII string
in the domain context of a URI is a really bad idea (it wasn't a
good idea under IDNA2003, but the issues are now clearer).
Another is that names in reference or identifier contexts
(including but not limited to URIs) should probably be final
names (in more or less the same sense as the use of that term
for MX DATA (targets)).  What the user sees or can type are
other matters, but implementations should be designed so that
users do not enter or see one string and get an error (or other)
message about some other one.   IMO and in retrospect, we did
not do nearly a good enough job of explaining those issues in
the IDNA2008 documents; if there were energy to do the writing
(there probably is) and to review it (we have evidence that
there is not), explanatory and clarifying materials would
probably be helpful.
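The round-trip property itself can be illustrated with the
Punycode step alone.  The sketch below does only the Punycode
conversion (via Python's built-in "punycode" codec); a
conforming IDNA2008 implementation must also apply the RFC 5891
validity checks before converting, and the to_alabel/to_ulabel
names here are illustrative, not a standard API.

```python
# Punycode round trip only -- a conforming IDNA2008 implementation
# must also validate the label (RFC 5891/5892) before converting.
# to_alabel/to_ulabel are illustrative names, not a standard API.
ACE_PREFIX = "xn--"

def to_alabel(ulabel: str) -> str:
    # Pure ASCII labels pass through unchanged.
    if all(ord(c) < 128 for c in ulabel):
        return ulabel
    return ACE_PREFIX + ulabel.encode("punycode").decode("ascii")

def to_ulabel(alabel: str) -> str:
    if not alabel.startswith(ACE_PREFIX):
        return alabel
    return alabel[len(ACE_PREFIX):].encode("ascii").decode("punycode")

# The identity relationship IDNA2008 insists on:
label = "bücher"
assert to_ulabel(to_alabel(label)) == label
```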

(2) Whether the issues that we uncovered when we reviewed the
changes in Unicode 7.0 were "new" or not depends on definitions
and that is, IMO, not particularly important.  I've been told
that there were significant debates fairly early in Unicode's
history about whether normalization should be restricted to a
very small number of methods or whether there should be options
or variations for different circumstances, the latter partially
to address some of these issues.   What is clear is that we not
only didn't know about those cases when the design of IDNA2008
was being sorted out but that we asked explicitly about
relationships going forward and got answers that persuaded us
that additional rules were unnecessary and that gave us no hint
(at least that we understood) that we should look more deeply
into the existing cases.   It is clear to me that, had we known
about the situation in the 2006-2008 period, we would have
attempted to include rules to deal with it in IDNA2008.  That is
not to claim that we would have been successful in developing
such rules, but I believe the specs would have at least included
stronger warnings.
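For concreteness, the kind of case at issue can be shown in a
couple of lines.  The commonly cited Unicode 7.0 example (cited
here as an illustration, not as the only such case) is a newly
added precomposed character that renders identically to an
existing base-plus-combining-mark sequence, yet has no canonical
decomposition, so normalization never unifies the two and IDNA
sees two distinct labels.

```python
import unicodedata

# A precomposed code point (new in Unicode 7.0) that renders the
# same as an existing base + combining-mark sequence but has no
# canonical decomposition, so NFC does not unify them.
precomposed = "\u08A1"     # ARABIC LETTER BEH WITH HAMZA ABOVE
sequence = "\u0628\u0654"  # ARABIC LETTER BEH + ARABIC HAMZA ABOVE

# Visually identical, but still distinct after NFC --
# hence distinct labels to IDNA:
assert unicodedata.normalize("NFC", sequence) != \
       unicodedata.normalize("NFC", precomposed)
```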

(3) One key principle that is shared by IDNA2003 (and Stringprep
more specifically) and IDNA2008 is that the IETF has strong
consensus on avoiding getting into the business of going through
Unicode one code point at a time and making our own
classifications on an individual code point basis.   We haven't
wanted to delegate that job to an individual or small expert
committee, especially ones we don't know how to sustain
long-term.  That possibility has been examined multiple times,
in multiple IETF WGs and in the IAB, and the conclusion has
always been, at least AFAIK, the same.   

(4) ICANN's relationship to this situation requires some sort
shared understanding of their role, something I'm not sure we
have even today.  There is general agreement that they can and
should make rules for labels that can appear in the root zone
and nearly as general agreement that those rules should be very,
very, conservative -- much more narrow than what is allowed by
IDNA2008 with regard to what code points can be used.  Whether
they have, or should have, authority over second-level labels
that conform to IDNA2008 (or even those that don't) has been
hotly debated.   Even ignoring the consequences of the
"distributed administrative hierarchy" principle, the empirical
evidence has been that they are powerless in practice and hence
that debating applicability of rules developed for the root (or
even modified versions of such rules) to second level labels and
below is a waste of time.  There have been discussions about the
importance of different rules at the second and third level (and
beyond) since the earliest attempts by ICANN to define IDN
policy.  Those discussions (which occurred before anyone started
talking seriously about IDNs in the root) included, for example,
the clear understanding that labels using archaic scripts might
be very useful and appropriate within the DNS trees of
particular institutions or enterprises even if they were not
appropriate at the second level.  With the introduction of IDN
TLDs and the general flattening of the DNS and the root, the
same arguments may apply at the second level within an
appropriately-defined TLD.  

(5) As implied above, I think it is important to distinguish
between efforts designed to help well-intentioned people, and
registries who are willing to invest resources into doing the
right thing, and efforts about preventing confusion, especially
confusion created by those who have either evil intentions or
who don't care about the use of DNS names as identifiers.  That
distinction is particularly important when suggestions are made
about rules to be defined on the basis of what "most fonts" do
with particular code points.  At least until there are changes
in how generic presentation software (such as web browsers and
IM programs) works -- changes to which there appears to be
considerable resistance -- "most fonts" is not helpful because
malicious actors can not only select their own preferred
fonts but, if something like CSS is in use, they can even
overlay graphemes that were never intended to be overlaid,
making, among other things, concerns about normalization almost
trivial.  That does not mean we should give up, but it does
imply, at least to me, that we need to be very careful about our
expectations and promises, even if registries get smart enough
to require the malicious actors to turn things up a notch.

(6) Now, a few observations on some of Asmus's comments, and
then I will come back to ICANN and the IANA tables at the end of
this section.

>...
> In many cases, using the variant mechanism (blocked variants)
> would be the appropriate remedy - but this is a tool not
> available to the IANA review.

It is also not a tool that is part of IDNA (either version),
except insofar as it is part of the "registries must take
responsibility for what they choose to register, including the
scripts and characters involved" model.  While I think
identifying variants and blocking them is a useful tool (and
that has been proven to be the case in some domains), the
identification process itself requires considerable effort and
knowledge.  At least outside the root and especially if less
well-known scripts are involved, it is not going to be 100%
reliable except for dichotomous pairs of characters (e.g., for
Simplified and Traditional Chinese) and the implications of
missing a string that should be treated as a variant
(particularly relative to invitations to bad actors) should be
thoroughly understood and accepted (by lawyers as well as from a
technical standpoint).   We also have significant empirical
evidence that almost every effort to define variants and then
block them has rapidly led to some party who claims that they
want to/ need to/ have the right to/ are entitled to have some
of those variants delegated.  And then often get their wishes.

So, for the root where the resources to do the analyses are
available and we can at least hope that a "no delegated
variants" rule can hold, I think blocked variants are a fine
idea.  I do note however that the so-called ccTLD Fast Track
apparently allowed some variants to be delegated and that "it is
unfair to allow those who got in line first to get something
that is now barred to us" and "we are being put at a competitive
disadvantage" have often been persuasive arguments to ICANN in
the past.   Below the top level, where such a system has to
depend on registries investing significant resources to identify
and block names, I think it is just unrealistic.  Some
registries may comply, but experience indicates that, faced with
a choice between registering and delegating a name (thereby
producing revenue) and investing significant effort to avoid
delegating such names (and therefore not making the money), in
an environment in which there are no effective sanctions if they
register the name(s), registries will register.  I am
pessimistic about such a system being very effective.

I do think that blocked variants are a useful tool, but that
their being effective requires registries to be very
responsible, including seeking out and blocking questionable
cases.  To the extent to which that behavior not only costs them
money but that, in the absence of clear rules that do not
require subjective judgment, might possibly subject them to
litigation, I think expectations should be kept low.
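Mechanically, blocking variants is the easy part; as argued
above, building and maintaining the variant map is where the
cost and risk lie.  A minimal sketch, using a hypothetical,
deliberately tiny variant table:

```python
# A minimal sketch of variant blocking.  The variant map here is
# hypothetical and tiny; in practice constructing it is the
# expensive, error-prone part.  Keys and values are U-labels.
VARIANTS = {
    "\u56fd": "\u570b",   # Simplified / Traditional Chinese 'country'
    "\u570b": "\u56fd",
}

def variant_set(label: str) -> set:
    """All labels reachable by substituting per-character variants."""
    out = {""}
    for ch in label:
        alts = {ch, VARIANTS.get(ch, ch)}
        out = {prefix + a for prefix in out for a in alts}
    return out

def may_register(label: str, registered: set) -> bool:
    # Block if the label, or any variant of it, is already delegated.
    return not (variant_set(label) & registered)
```

The registry-side burden the text describes is visible even in
this toy: every registration requires computing and checking the
full variant set, and a single missing entry in VARIANTS silently
admits a confusable label.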

> In any case, there is little benefit in keeping the IANA
> tables stuck at Unicode 6.3.0 -- the number of pre-existing
> cases (going as far back as the earliest Unicode version
> covered by IDNA2008) definitely exceeds the expected
> incremental addition from pending versions of Unicode.

I'd love to know how you quantify the latter going forward,
especially in an environment in which the Unicode Consortium has
apparently decided to treat additional characters and code
points (via the "adopt a character" program) as a profit center.
If one were to follow the proposed rules for the root and
exclude archaic characters and other characters that would be
disallowed for other (non-IDNA2008) reasons (neither of which I
think you have proposed), the size of the expected incremental
additions gets even smaller.   

> And a
> comprehensive solution lies outside the methodology of
> property-based inclusion; however, the property-based IANA
> tables would make a solid base on which to implement
> additional mitigation.
> 
> Therefore, the rationale to maintain this process in a stalled
> state is tenuous at best.

Now let's return to my comment near the beginning.   The IANA
tables are not normative.  They are published as a guideline and
convenience for those who want and appreciate such things.  If
ICANN, as the registrar for the root zone, decided to make its
own calculation of a base code point repertoire from the
IDNA2008 rule set and use that repertoire as input to the LGR
process for the root zone, not only is there no one to stop
them, but their doing that is perfectly conformant to IDNA2008.
I don't think it would even violate the IAB statement as I read
it.

But the purpose of that IANA table as I have understood it since
we created the registry is, again, to help out the less wary,
less informed and knowledgeable, and, frankly, the more lazy.
The mere existence of that population is probably a sign that
the IDNA2008 requirement that registries not register strings
that they do not thoroughly understand (even if those strings
are otherwise IDNA2008-conformant and might be acceptable if
registered in a different zone) is not working well.  For them,
I think there has been no case made that holding off on revising
the tables is a cause of significant harm... and some argument
that it just represents a very conservative list, albeit not as
conservative as the list would be if similar-looking strings
within the same script were identified and somehow excluded (and
not as conservative as the lists that will presumably be the
result of the LGR process).

That is an issue with the "troublesome character" list as well.
If it is taken as one person's informative catalogue of code
points that even the most careful and informed of registries
should be extra-careful about, I think it provides a useful
service.  If, as has been the case with "whatever is allowed by
IDNA2003", "whatever is allowed by IDNA2008", or even "whatever
is allowed by UTR#46", regardless of what those documents
actually say, it is interpreted as "the IETF and/or ICANN say
that anything allowed by IDNA2008 and not on this list is safe
to register", then it becomes a problem... as well as colliding
with the principle described in (3) above.

>...
> Secondary to that is finding a way to communicate to the
> consumers of these tables that simply allowing all PVALID code
> points isn't a robust solution for many writing systems and
> additional due diligence needs to be performed - for example
> along the same lines as is being done now for the Root Zone
> (which is quickly defining the state of the art in that 
> respect).

Well, IDNA2008 says that additional due diligence is required
and that registries need to do it.
draft-klensin-idna-rfc5891bis-01 says it more clearly and
forcefully but can't seem to get processed in the IETF (which is
where this thread started).  ICANN has been asked, more than
once, to make some clear statements about registry
responsibility and to work on a plan to enforce that principle
(including a plan to have the GAC work out a plan for ccTLDs if
that is required).   Those suggestions are not proceeding any
faster than draft-klensin-idna-rfc5891bis.  I think they are
even more bogged down and have been for longer, in part because,
unlike the I-D in the IETF, the idea meets active resistance.

best,
   john

p.s., especially to IAB members who might be reading this.
ICANN study groups and review efforts have an extensive history,
especially in relatively recent years, of being turned into
efforts that can only produce conclusions that support whatever
ICANN is doing, perhaps suggesting more of it, perhaps
suggesting cosmetic fine-tuning, and sometimes both but with
actual criticisms prevented from being published and circulated.
If the IAB is convinced that will not occur in this case, then,
if someone reasonably qualified is needed and you don't have a
better candidate, I will volunteer despite my reluctance to do
further free consulting work for ICANN (and assuming the rules
for the study group do not require that I go out of pocket).
On the other hand, if you do not have good reason to be
convinced of that, I suggest that it is in the best interests of
the IAB and the DNS technical and user communities that you
decline the offer to supply someone for the position.

p.p.s. Just to respond, for context, to a few of the comments
made and discussed earlier, I could read text in three scripts
before I graduated from secondary school, with more than one
language in two of them and one of the three scripts running
right to left.  I can also read the characters of at least one
more script but have no claim on ever being able to read the
associated language except, painfully, for a few sections of
some classical materials.  I have some claim on at least some
understanding of how one other script, very different from those
above, works.  I also did some work on character coding,
identifiers in programming languages (including non-ASCII ones),
and multilingual thesauri, and worked a bit with a very
well-known typography expert, most long before there was a
Unicode. Do those things, or the fact that I studied a bit with
experts on the evolution of languages and writing systems, make
me an expert in this area?  Nope.  But it does give me some
basis for believing I've got a little bit more background for
constructing reasonable intuitions and having some idea what
questions to ask than someone who is familiar with only one
language, or maybe two, and who has deduced that everything else
follows the same rules (conclusions that are very common in the
IETF and ICANN although, I hope, not on this list).