Re: [I18nrp] Confusion among characters and strings

John C Klensin <john-ietf@jck.com> Wed, 17 October 2018 02:14 UTC

Date: Tue, 16 Oct 2018 22:14:30 -0400
From: John C Klensin <john-ietf@jck.com>
To: Asmus Freytag <asmusf@ix.netcom.com>, Larry Masinter <LMM@acm.org>, i18nrp@ietf.org
Message-ID: <77896C689E0BAE86D5EB44C6@PSB>
In-Reply-To: <4df1f049-bbdd-9c1c-7752-496fd3ff474c@ix.netcom.com>
References: <145D45F77511A9B1281FE35D@PSB> <033401d461f1$7d181590$774840b0$@acm.org> <4df1f049-bbdd-9c1c-7752-496fd3ff474c@ix.netcom.com>
Archived-At: <https://mailarchive.ietf.org/arch/msg/i18nrp/5_qpklO3fjHELnW8UcNWrLkCJxg>


--On Monday, October 15, 2018 04:36 -0700 Asmus Freytag
<asmusf@ix.netcom.com> wrote:

> That said, I see nothing wrong with making letter o and zero
> (and some other examples) either outright blocked variants or
> something that's flagged as potentially malicious. For the
> Root Zone we don't have digits, so I didn't have to
> investigate them, of course.

> All comes down to whether there's a will on the side of
> registries to police things, and whether, for those that do,
> we can define either minimal standards or best practices
> (perhaps on a per-script basis)

Just to be clear, this is a point about which Asmus and I are
100% in agreement.  

Another is that, given variations in human languages and writing
systems, no set of rules is likely to be 100% effective, at
least unless they are so restrictive as to guarantee pushback
and, for any registry that doesn't believe in them, probably
non-conformance.  As an extreme example of the latter, consider
the original, vintage RFC 1591 and earlier, rules about new gTLD
names, which were, in essence:

 (i) To get one, you are going to need to demonstrate (to
	a very skeptical review process) that you have a real
	need, that the requirement or application had a very
	broad scope and value to the global Internet, that the
	need cannot be satisfied by use of hierarchy within an
	existing TLD, and that the management of first-level
	subdomains (i.e., SLDs) was going to be responsible and
	was not going to turn into controversy or a problem for
	IANA.
 (ii) If you got past (i), the domain was going to be
	represented by a mnemonic label of exactly three ASCII
	characters in length.
 (iii) Those three letters were going to be letters -- no
	punctuation or digits. 
 (iv) There are actually advantages to the Internet to
	keeping the number of TLDs small and having the list
	change very slowly.  

Noting that ICANN almost immediately dropped the first two
criteria, the three in combination and especially the third,
which remains (note Asmus's comment about not having to deal
with digits in the root), are very effective in avoiding the
risk of confusion.  The fourth, which was dropped when the
current new gTLD program was introduced, provided an additional
measure of protection as well as reinforcing the level of proof
of benefits required by the first.

There was a clear understanding that application of those rules
to domains below the root would be impractical and, more
important, inappropriate.  It is key to the above that pre-ICANN
IANA was able (and was believed to be certain) to enforce its own
rules.

In today's world, as Asmus has explained from a different
perspective, the only realistic approach is to create a series
of layers of protocol rules and guidelines, getting increasingly
specific and localized as one progresses through the layers.
Specifically, we have

* The DNS and restrictions imposed by its structure as laid out
in RFC 1034 and 1035 and their successors.
* For ASCII strings, the "preferred syntax" of 1034/1035; for
labels containing non-ASCII characters, IDNA2008 and its rules
about allowable characters.  If one were to summarize IDNA2008
from a high conceptual level, it requires the same
letter-digit-hyphen restrictions, with extended interpretations
of "letter" and "digit" and some additional restrictions
required by combining characters, case-comparison relationships,
and characters that seem important but that clearly posed
problems if used in an unrestricted way.
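
As a rough illustration of the ASCII half of that second layer,
the "preferred syntax" label grammar from RFC 1035, Section 2.3.1
(a letter, then optional letters, digits, and hyphens, ending in a
letter or digit, at most 63 octets) can be sketched as below.  This
is a minimal sketch only: the function name is hypothetical, it
checks a single label rather than a whole FQDN, and it does not
attempt any of the IDNA2008 rules for non-ASCII labels.

```python
import re

# RFC 1035, Section 2.3.1:
#   <label> ::= <letter> [ [ <ldh-str> ] <let-dig> ]
# i.e. start with a letter, end with a letter or digit,
# and use only letters, digits, and hyphens in between.
_LDH_LABEL = re.compile(r"[A-Za-z](?:[A-Za-z0-9-]*[A-Za-z0-9])?")


def is_preferred_syntax_label(label: str) -> bool:
    """Return True if `label` fits the RFC 1035 preferred syntax.

    Hypothetical helper for illustration; labels are limited to
    63 octets, which for ASCII input equals 63 characters.
    """
    return len(label) <= 63 and _LDH_LABEL.fullmatch(label) is not None
```

Note that this follows the original grammar, which requires a
leading letter; later host-name practice relaxed that to allow a
leading digit, so a real checker would need to decide which rule
it is enforcing.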

Those two layers are inherently global.  If they are not, many
things stop working, or at least stop working with the same
interpretations in an interoperable way.   Then we start
restricting the repertoire of names for a particular zone
further...

* General guidelines that may not apply to some specific cases.
"Don't mix scripts in a label", "avoid leading and trailing
digits if possible", and "avoid strings that are trivially
confused with, or spelling variations on, well-known strings or
names" are examples of such guidelines.

* Rules or guidelines that restrict the characters, or character
sequences (actually quite a different matter), that can be used
in a particular script.  The rules either preclude mixed-script
labels or require additional explanation.

* Rules or guidelines that impose further restrictions because
of the way specific languages use particular scripts.  As a
trivial Latin script example, one might ban letters with more
than one diacritical mark from the DNS, but there are a few
languages that such a ban would make unusable.

* Guidelines that affect entire FQDNs, rather than individual
labels.  Some of these are obvious, e.g., a zone dedicated to a
single language or function might want to be sure that all of
its subdomains conform to that language or function.  Others are
not.

... and probably other examples.  I don't see those as really
layering neatly, but you get the idea and it probably does not
make much difference whether they do or not.
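
To make the "general guidelines" layer a little more concrete, two
of the examples above ("don't mix scripts in a label" and "avoid
leading and trailing digits") could be expressed as advisory checks
along the following lines.  This is a sketch under loud
assumptions: the function names are hypothetical, and the script
test uses the first word of each character's Unicode name as a
crude stand-in for the real Script property, which the Python
standard library does not expose.  It returns warnings rather than
a verdict, since these are guidelines that may not apply to some
specific cases.

```python
import unicodedata


def rough_script(ch: str) -> str:
    """Crude script guess: first word of the Unicode character name.

    ASSUMPTION: this is only a proxy (e.g. "LATIN", "CYRILLIC"),
    not the Unicode Script property.
    """
    return unicodedata.name(ch, "UNKNOWN").split()[0]


def guideline_warnings(label: str) -> list[str]:
    """Return advisory warnings for one label (hypothetical helper)."""
    warnings = []

    # "Don't mix scripts in a label": look only at letters.
    scripts = {rough_script(ch) for ch in label if ch.isalpha()}
    if len(scripts) > 1:
        warnings.append("mixed scripts: " + ", ".join(sorted(scripts)))

    # "Avoid leading and trailing digits if possible."
    if label and (label[0].isdigit() or label[-1].isdigit()):
        warnings.append("leading or trailing digit")

    return warnings
```

For example, guideline_warnings("p\u0430ypal"), where U+0430 is
CYRILLIC SMALL LETTER A, reports mixed scripts, while
guideline_warnings("example") reports nothing.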

As is often the case in other areas, if one wants safety
(including security and lack of even benign confusion), one
gives something up.   In the DNS case, that might be flexibility
of naming [1], low complexity, and ease of testing compliance
[2].

Where Asmus and I _may_ differ is a layering and boundary
question.  As we discussed when assembling
draft-klensin-idna-rfc5891bis, sooner or later there is a
necessity for registries to understand what they are doing and
be responsible, to use good sense, and to be accountable for
whatever is done.  At the level of an individual registry, that
is, almost by definition, the last rule in the chain.   The
question is how many sets of rules it is worth injecting between
IDNA2008 and "the registry needs to be responsible and
preferably accountable".

If the registry is not motivated to act responsibly -- whether
because it is too much trouble, there is financial incentive to
not do so, there are no penalties or costs, or for other
reasons -- then it is unlikely that the intermediate guidelines
will help (and possible that there won't even be IDNA2008
conformance, as we have seen with registries selling emoji
domain names).  Sadly, we have a good deal of empirical evidence
for that.  The other problem source, for which there is also
empirical evidence, is that there are registries out there whose
strategy is to identify some set of rules and then make the
claim that, if they follow those rules, they are not responsible
for anything that goes wrong: the fault lies with the
rule-specifier.  And the more specific the rules or guidance
gets to whatever the registry wants to do, the easier it seems
to be for a registry to take that "we followed the rules, so not
our fault" position.

So, where I think we differ is how much beyond IDNA2008 it is
worth going with guidelines of various sorts.   I wouldn't say
"zero", but, for the reasons above, I wouldn't go very far.  I
think Asmus would go much further and, if I were considering
this only on technical and linguistic grounds, I would probably
agree with him.

   john





[1] At the time the DNS was designed and early policies laid
out, there were very strong (and generally accepted) attitudes,
left over from the name assignments that ultimately comprised
the host table, that there were certain conventions and
decisions, and one might as well just get used to them because
there were no other options.  No more... and completely
incompatible with the idea that people should be able to choose
their own names in/from their own languages.


[2] Even for the root, compare the above four rules with the LGR
process and evaluation rules, with or without RFC 7940 and 8228.