Re: [I18nrp] Confusion among characters and strings
John C Klensin <john-ietf@jck.com> Wed, 17 October 2018 02:14 UTC
Date: Tue, 16 Oct 2018 22:14:30 -0400
From: John C Klensin <john-ietf@jck.com>
To: Asmus Freytag <asmusf@ix.netcom.com>, Larry Masinter <LMM@acm.org>,
i18nrp@ietf.org
Message-ID: <77896C689E0BAE86D5EB44C6@PSB>
In-Reply-To: <4df1f049-bbdd-9c1c-7752-496fd3ff474c@ix.netcom.com>
References: <145D45F77511A9B1281FE35D@PSB>
<033401d461f1$7d181590$774840b0$@acm.org>
<4df1f049-bbdd-9c1c-7752-496fd3ff474c@ix.netcom.com>
Archived-At: <https://mailarchive.ietf.org/arch/msg/i18nrp/5_qpklO3fjHELnW8UcNWrLkCJxg>
Subject: Re: [I18nrp] Confusion among characters and strings

--On Monday, October 15, 2018 04:36 -0700 Asmus Freytag
<asmusf@ix.netcom.com> wrote:

> That said, I see nothing wrong with making letter o and zero
> (and some other examples) either outright blocked variants or
> something that's flagged as potentially malicious. For the
> Root Zone we don't have digits, so I didn't have to
> investigate them, of course.
> All comes down to whether there's a will on the side of
> registries to police things, and whether, for those that do,
> we can define either minimal standards or best practices
> (perhaps on a per-script basis)

Just to be clear, this is a point about which Asmus and I are
100% in agreement.  Another is that, given variations in human
languages and writing systems, no set of rules is likely to be
100% effective, at least unless the rules are so restrictive as
to guarantee pushback and, for any registry that doesn't believe
in them, probably non-conformance.

As an extreme example of the latter, consider the original rules
(vintage RFC 1591 and earlier) about new gTLD names, which were,
in essence:

(i) To get one, you are going to need to demonstrate (to a very
    skeptical review process) that you have a real need, that the
    requirement or application has a very broad scope and value
    to the global Internet, that the need cannot be satisfied by
    use of hierarchy within an existing TLD, and that the
    management of first-level subdomains (i.e., SLDs) is going to
    be responsible and is not going to turn into controversy or a
    problem for IANA.

(ii) If you got past (i), the domain was going to be represented
    by a mnemonic label of exactly three ASCII characters in
    length.

(iii) Those three characters were going to be letters -- no
    punctuation or digits.

(iv) There are actually advantages to the Internet to keeping the
    number of TLDs small and having the list change very slowly.
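
Of those four rules, only (ii) and (iii) are mechanical enough to
test in code.  A minimal Python sketch follows; the function name
`fits_old_gtld_shape` is purely illustrative and is not taken from
any RFC or IANA procedure.

```python
import re

# Rules (ii) and (iii) combined: exactly three characters, all of
# them ASCII letters (no digits, no punctuation, no non-ASCII).
_THREE_ASCII_LETTERS = re.compile(r"[A-Za-z]{3}")


def fits_old_gtld_shape(label: str) -> bool:
    """Return True if the label has the three-ASCII-letter shape."""
    return _THREE_ASCII_LETTERS.fullmatch(label) is not None


if __name__ == "__main__":
    for candidate in ["com", "info", "c0m", "a-b"]:
        print(candidate, fits_old_gtld_shape(candidate))
```

Rules (i) and (iv) were judgment calls for the review process and
have no counterpart in the sketch.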

Noting that ICANN almost immediately dropped the first two
criteria, the first three in combination and especially the
third, which remains (note Asmus's comment about not having to
deal with digits in the root), are very effective in avoiding the
risk of confusion.  The fourth, which was dropped when the current
new gTLD program was introduced, provided an additional measure
of protection as well as reinforcing the level of proof of
benefits required by the first.  There was a clear understanding
that application of those rules to domains below the root would
be impractical and, more important, inappropriate.

It is key to the above that pre-ICANN IANA was able (and believed
to be certain) to enforce its own rules.

In today's world, as Asmus has explained from a different
perspective, the only realistic approach is to create a series of
layers of protocol rules and guidelines, getting increasingly
specific and localized as one progresses through the layers.
Specifically, we have

* The DNS and restrictions imposed by its structure as laid out
  in RFC 1034 and 1035 and their successors.

* For ASCII strings, the "preferred syntax" of 1034/1035; for
  labels containing non-ASCII characters, IDNA2008 and its rules
  about allowable characters.  If one were to summarize IDNA2008
  from a high conceptual level, it requires the same
  letter-digit-hyphen restrictions, with extended interpretations
  of "letter" and "digit" and some additional restrictions
  required by combining characters, case-comparison
  relationships, and characters that seem important but that
  clearly pose problems if used in an unrestricted way.

Those two layers are inherently global.  If they are not, many
things stop working, or no longer work with the same
interpretations in an interoperable way.

Then we start restricting the repertoire of names for a
particular zone further...

* General guidelines that may not apply to some specific cases.
  "Don't mix scripts in a label", "avoid leading and trailing
  digits if possible", and "avoid strings that are trivially
  confused with or spelling variations on well-known strings or
  names" are examples of such guidelines.

* Rules or guidelines that restrict the characters, or character
  sequences (actually quite a different matter), that can be used
  in a particular script.  The rules either preclude mixed-script
  labels or require additional explanation.

* Rules or guidelines that impose further restrictions because of
  the way specific languages use particular scripts.  As a
  trivial Latin script example, one might ban letters with more
  than one diacritical mark from the DNS, but that would make a
  few languages unusable.

* Guidelines that affect entire FQDNs, rather than individual
  labels.  Some of these are obvious, e.g., a zone dedicated to a
  single language or function might want to be sure that all of
  its subdomains conform to that language or function.  Others
  are not.

... and probably other examples.  I don't see those as really
layering neatly, but you get the idea and it probably does not
make much difference whether they do or not.

As is often the case in other areas, if one wants safety
(including security and lack of even benign confusion), one gives
something up.  In the DNS case, that might be flexibility of
naming [1], low complexity, and ease of testing compliance [2].

Where Asmus and I _may_ differ is a layering and boundary
question.  As we discussed when assembling
draft-klensin-idna-rfc5891bis, sooner or later there is a
necessity for registries to understand what they are doing and be
responsible, to use good sense, and to be accountable for
whatever is done.  At the level of an individual registry, that
is, almost by definition, the last rule in the chain.  The
question is how many sets of rules it is worth injecting between
IDNA2008 and "the registry needs to be responsible and preferably
accountable".

If the registry is not motivated to act responsibly -- whether
because it is too much trouble, there is financial incentive to
not do so, there are no penalties or costs, or for other
reasons -- then it is unlikely that the intermediate guidelines
will help (and possible that there won't even be IDNA2008
conformance, as we have seen with registries selling emoji domain
names).  Sadly, we have a good deal of empirical evidence for
that.

The other problem source, for which there is also empirical
evidence, is that there are registries out there whose strategy
is to identify some set of rules and then make the claim that, if
they follow those rules, they are not responsible for anything
that goes wrong: the fault lies with the rule-specifier.  And the
more specific the rules or guidance get to whatever the registry
wants to do, the easier it seems to be for a registry to take
that "we followed the rules, so not our fault" position.

So, where I think we differ is how much beyond IDNA2008 it is
worth going with guidelines of various sorts.  I wouldn't say
"zero", but, for the reasons above, I wouldn't go very far.  I
think Asmus would go much further and, if I were considering this
only on technical and linguistic grounds, I would probably agree
with him.

    john

[1] At the time the DNS was designed and early policies laid out,
there were very strong (and generally accepted) attitudes, left
over from the name assignments that ultimately comprised the host
table, that there were certain conventions and decisions, and one
might as well just get used to them because there were no other
options.  No more... and completely incompatible with the idea
that people should be able to choose their own names in/from
their own languages.

[2] Even for the root, compare the above four rules with the LGR
process and evaluation rules, with or without RFC 7940 and 8228.
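
As a rough illustration of the layering described in the body of
this message, the sketch below runs a protocol-level syntax check
first and only then applies two of the zone-level guidelines as
warnings.  It is Python using only the standard `unicodedata`
module.  Every function name is hypothetical, the character test
is a crude stand-in for the real IDNA2008 rules rather than an
implementation of them, and the script guess taken from Unicode
character names is an approximation, not the Unicode Script
property.

```python
import unicodedata


def syntax_problems(label):
    """Global layer (stand-in only, NOT IDNA2008): letters, marks,
    decimal digits and interior hyphens are accepted."""
    if not label:
        return ["empty label"]
    problems = []
    if label.startswith("-") or label.endswith("-"):
        problems.append("leading or trailing hyphen")
    for ch in label:
        if ch == "-":
            continue
        cat = unicodedata.category(ch)
        if not (cat.startswith("L") or cat.startswith("M") or cat == "Nd"):
            problems.append(f"disallowed character U+{ord(ch):04X}")
    return problems


def scripts_in(label):
    """Approximate the scripts used by taking the first word of each
    letter's Unicode name (e.g. LATIN, CYRILLIC, GREEK)."""
    return {
        unicodedata.name(ch, "UNKNOWN").split()[0]
        for ch in label
        if ch.isalpha()
    }


def guideline_warnings(label):
    """Zone layer: general guidelines, reported as warnings only."""
    warnings = []
    if len(scripts_in(label)) > 1:
        warnings.append("mixed scripts")
    if label[:1].isdigit() or label[-1:].isdigit():
        warnings.append("leading or trailing digit")
    return warnings


def review(label):
    """Apply the layers in order; guidelines run only if syntax passes."""
    problems = syntax_problems(label)
    if problems:
        return {"accepted": False, "problems": problems, "warnings": []}
    return {
        "accepted": True,
        "problems": [],
        "warnings": guideline_warnings(label),
    }


if __name__ == "__main__":
    for candidate in ["example", "3com", "p\u0430ypal", "\U0001F600"]:
        print(repr(candidate), review(candidate))
```

Keeping the guideline layer advisory reflects the point made
above that such guidelines may not apply to some specific cases;
what to do with a warning is left to the registry.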