Re: [I18nrp] Confusion among characters and strings
Nico Williams <nico@cryptonector.com> Thu, 14 June 2018 17:00 UTC
Date: Thu, 14 Jun 2018 12:00:03 -0500
From: Nico Williams <nico@cryptonector.com>
To: John C Klensin <john-ietf@jck.com>
Cc: i18nrp@ietf.org
Message-ID: <20180614170002.GA4218@localhost>
References: <145D45F77511A9B1281FE35D@PSB>
Archived-At: <https://mailarchive.ietf.org/arch/msg/i18nrp/PdgDpCTJvpQuIL4YNn4sgOVjoag>
Subject: Re: [I18nrp] Confusion among characters and strings

On Tue, Jun 12, 2018 at 01:26:14PM -0400, John C Klensin wrote:
> I'm still trying to resist discussions of specific proposed
> fixes, but, but "confusability" has now been mentioned by
> several people as if it where the only issue, or even nearly the
> only issue) and easily solved, let me push back on that as a
> sort of third prong to suggesting that reviewing actually i18n
> protocol work is different from reviewing protocols that have
> non-ASCII elements or implications.

Eh, I'm happy to discuss specifics too.

As to process ideas, I think those have been discussed extensively
already -- enough that a BoF could be had, though that is up to the IAB.

I don't think enhanced I18N processes will produce any of the specific
solutions you or I have in mind as to the issues listed below.

> First, if one could wave a magic algorithm (or wand) at the
> confusion issue and make it disappear, there would still be a
> number of outstanding issues, with the question of what to do
> about normalization anomalies at the top of today's version of
> my personal list.  So "if we solve confusion, there are no other
> important issues" is, at least IMO, fairly obviously false.

I'm surprised to read this.  I don't think anyone has said or implied
that "if we solve confusion, there are no other important issues", nor
has anyone waved away normalization issues.

However, other issues we seem to handle fairly well already, with few
controversies.

As to the controversies...

Others have commented on the I18N of email addresses.  I like Viktor's
comment in that regard that we had to choose a cardinal rule to break,
and given that, we probably broke the wrong one.

As to IDNA... it seems that UTR#46 is non-grata here, though I think
UTR#46 is correct, and IDNA2008 is wrong where it differs from it.

But enough provocations :^)

> [...]
>
> As a final example in this general group for people to think
> about, "3com" caused the original rule prohibiting domain name
> labels from having leading digits.  Is "Зcom" (first character
> is U+0417) confusingly similar?   If yes, is it confusing to the
> same or a greater degree in "label12З4"?

Did you mean that 3com caused that rule to be abandoned?  There are
plenty of domain names with labels that start with an ASCII digit, or
even all-digit labels.  E.g.,

 - 7up.nl
 - 911.gov

Anyways, wasn't this rule there to protect against old, lame parsers
that assume a string represents an IP address if it starts with a digit?
Such parsers would only look for ASCII digits...  No need to extend such
a rule to non-ASCII digits.

(I'm told there's a very large number of such domains.  Everything seems
to work, so any such rule should be dropped, if we still had it.)

> In another tier are coincidences in which similar graphemes
> occur, with different "meanings" on mostly-unrelated scripts.
> For example, are "o" (U+006F) and "o" (U+0665) confusable with
> each other?  The answer probably depends on type style, type
> size, how closely the user is looking and what she is expecting,
> etc.  [...]

I don't think we (the IETF) should look for any confusability beyond
grapheme look (including some range of representative fonts in the
analysis).  However, registries should, where culturally or legally
relevant.

> [...]  Or how about U+110B, U+17E0, or U+-25CB?.   For IDNs (but
> they are not the only important case), it may or may not be
> important that some of the code points in that list are
> prohibited by IDNA2008: if IDNA2008 is adhered to by registries
> or lookup applications follow the IDNA2008 rules for checking,
> they will be rejected even without worrying about confusion.

I'm not keen on IDNA updates forbidding any of these.  All of these
should be allowed as far as IDNA goes.

Instead, each registry should a) define policies, b) enforce them.

One obvious and general policy is to disallow any registration that is
confusable with an existing registration by a different entity.  To
enforce this the registry/registrars would check a proposed new domain
against all domains in that TLD looking for confusable similarity (more
on this in a bit) and if existing domains owned by other entities match,
then reject the registration.  Note that the confusable similarity check
algorithm might change over time.

Some registries might want to allow only one script, or a mix of one
script and ASCII (for historical reasons, say), or whatever.  Some ccTLD
registries might only allow digits from a script associated with a
language or languages associated with the corresponding country.

Another policy might be that malicious use of confusable domains not
rejected as confusable at registration time... can cause the
registration to be yanked.  Satiric use may or may not get treated the
same as malicious use.

I really like this approach of the registries setting TLD-specific
policies.  The registries are the entities most likely to know what is
culturally appropriate for each TLD!

> However, we know of registries who ignore those rules and
> register strings as labels that IDNA2008 prohibits (and there
> are almost certainly more we don't know about if the whole DNS
> is considered) and that UTR#46 recommends not making those
> lookup-time checks, so it is hard to rely on that.

Yes, we don't have an IETF police.  I don't know how we can create a
credible IETF police, and IMO we shouldn't want such a thing.

Sometimes we need to update RFCs to reflect reality.  We should find out
whether any such domains that break our rules are accidents (e.g., due
to poor software) or intentional (e.g., the registry thinks our rules
are lame).

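To make the registration-time policy described above more concrete
(reject a new label that is confusable with a label already registered
by a different entity), here is a minimal sketch.  Everything in it is
an assumption for illustration only: the names skeleton, Registry and
CONFUSABLE_TO_PROTOTYPE are hypothetical, and the mapping table is a toy
built from the examples in this thread, not real confusables data.  A
registry would presumably plug in a much larger, evolving data set (for
instance something derived from the Unicode Consortium's confusables
data) and its own TLD-specific rules.

```python
import unicodedata

# Toy mapping (an assumption, not real confusables data): each character
# is mapped to a "prototype" it can be mistaken for.  Keys are casefolded.
CONFUSABLE_TO_PROTOTYPE = {
    "\u0437": "3",  # CYRILLIC SMALL LETTER ZE (casefold of U+0417)
    "\u0441": "c",  # CYRILLIC SMALL LETTER ES
    "\u043e": "o",  # CYRILLIC SMALL LETTER O
    "\u043c": "m",  # CYRILLIC SMALL LETTER EM
    "0": "o",       # DIGIT ZERO
    "1": "l",       # DIGIT ONE
}


def skeleton(label: str) -> str:
    """Reduce a label to a comparison key; equal keys mean 'confusable'."""
    folded = unicodedata.normalize("NFKC", label).casefold()
    return "".join(CONFUSABLE_TO_PROTOTYPE.get(ch, ch) for ch in folded)


class Registry:
    """Hypothetical per-TLD registry enforcing the one generic policy."""

    def __init__(self) -> None:
        # skeleton -> (first registered label, owner)
        self._by_skeleton: dict[str, tuple[str, str]] = {}

    def register(self, label: str, owner: str) -> bool:
        key = skeleton(label)
        existing = self._by_skeleton.get(key)
        if existing is not None and existing[1] != owner:
            # Confusable with a label held by a different entity: reject.
            return False
        self._by_skeleton.setdefault(key, (label, owner))
        return True


if __name__ == "__main__":
    reg = Registry()
    print(reg.register("3com", "alice"))          # accepted
    print(reg.register("\u0417com", "mallory"))   # rejected: different owner
    print(reg.register("\u0417com", "alice"))     # accepted: same owner
```

Keeping the check behind a single comparison key means the similarity
algorithm can change over time without changing the policy itself; a
registry that changes the table would have to recompute the stored keys.
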
> The sequence continues with a number of other cases, but the
> above should make at least part of the problem clear: depending
> on circumstances and expectations, people cannot (or will not)
> distinguish among things that computer algorithms can
> distinguish easier, especially if they are relying on Unicode
> coed points and high-distinguishability fonts which the people
> (especially in the presence of CSS on web pages and some
> equivalents elsewhere).  [...]

Note that there are, e.g., browser plugins for detecting and
highlighting confusables.

Registries protecting against confusables is certainly desirable, I
think, but as you note, they will sometimes fail.

Having a user agent UI look for and highlight confusables is... tricky,
for as we know many users won't notice.  User agents can also
distinguish between user-whitelisted/blessed and other domains.

But I don't think those two problems add up to having to ban confusables
in IDNA.

> [...].  Even if one uses artificial
> intelligence-like techniques, the resulting algorithms and tests
> are going to be no better than their training: where humans
> cannot agree on what is confusable and there is no consensus
> about "right", it is difficult to believe in the reliability and
> accuracy of such methods.

The set of sets of confusable graphemes will have to be a) open as long
as Unicode is open to new characters, b) open as long as there are such
sets still to be identified.

We can expect both humans and automatic confusable detection (e.g., AI)
to be able to add, over time, to the set of sets of confusable
graphemes.

I don't think there's anything wrong with having this set be incomplete
at any time, especially early on.  That's just life.  As this set grows,
the protection of users will increase.

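As a rough illustration of an open-ended "set of sets" that humans or
tools add to over time, the sketch below keeps confusable graphemes in
merge-able groups.  The class name ConfusableSets and its methods are
hypothetical, and treating confusability as transitive (if A looks like
B and B looks like C, then A and C land in the same group) is a
simplifying assumption that a real corpus might not want to make.

```python
class ConfusableSets:
    """Growing collection of confusable groups (union-find, assumed design)."""

    def __init__(self) -> None:
        self._parent: dict[str, str] = {}

    def _find(self, ch: str) -> str:
        # Unknown characters start out as their own singleton group.
        self._parent.setdefault(ch, ch)
        while self._parent[ch] != ch:
            # Path halving keeps later lookups short.
            self._parent[ch] = self._parent[self._parent[ch]]
            ch = self._parent[ch]
        return ch

    def add_set(self, chars) -> None:
        """Record that all of these characters are mutually confusable."""
        chars = list(chars)
        if not chars:
            return
        root = self._find(chars[0])
        for ch in chars[1:]:
            self._parent[self._find(ch)] = root

    def confusable(self, a: str, b: str) -> bool:
        return a == b or self._find(a) == self._find(b)


if __name__ == "__main__":
    sets = ConfusableSets()
    print(sets.confusable("o", "\u25cb"))   # nothing recorded yet
    sets.add_set(["o", "\u0665"])           # U+006F and U+0665
    sets.add_set(["\u0665", "\u25cb"])      # later addition: U+25CB
    print(sets.confusable("o", "\u25cb"))   # now grouped together
```

An incomplete collection simply answers "not confusable" for pairs
nobody has recorded yet, which matches the idea that protection grows as
the corpus grows.
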
> Scripts that require special rendering and type families that
> use kerning sufficiently aggressively to construct
> near-ligatures create more "opportunities" in the form of cases
> in which multiple code points must be considered as a sequence
> even without combining characters.   As an ASCII example and
> with some type styles, it can be hard to distinguish "iN" in the
> middle of a string from "m" unless one has other context.

Again, we can discover these things over time and add them to an
evolving corpus of confusable detection metadata and algorithms to be
applied by registries and/or user agents.

> However, there is another point at the far end of that spectrum.
> A person with sufficient expectations of seeing one thing, e.g.,
> because they are expecting a particular script family and not
> some others, will see what they expect to see.  For example, is
>    Toys-Я-Us
> insertion of a Cyrillic character (in which case it should not
> match "Toys-R-Us" and might be considered confusingly similar)
> or it is someone overdosing on cleverness (in which case maybe
> it does compare equal -- a conclusions that I understand has
> been reached by trademark authorities).  Similarly, with the
> right choice of type styles, small type, or a hangover,
> "n" and "ฑ" (U+0E11)

We could also consider dyslexia...

At least as to lesser confusables which might fall outside a confusable
corpus used for registry policies, the fact that courts and trademark
authorities can enforce their own policies is useful: it allows us to
have a smaller confusable corpus, though we could also just add these to
it too.

> It is probably also worth remembering that, to a typical user,
> to question of whether "color" and "colour" are equivalent is at
> least as significant (and a potential source of confusion) as
> the relationship between "color" and "co1or".

Yes.  Where shall we stop?  :)

This is precisely why we need per-TLD policies.
We're not going to be able to make an agreeable one-size-fits-all
policy, but we can have a one-size-fits-all meta-policy.

> As an additional complication, there may be a difference between
> the rules we would make if we are concerned about problems that
> could arise from accidents or linguistic issues versus those are
> are likely to occur only in the presence of malice (or an excess
> of cleverness).  Coming back to the 3com example, "Зcom"
> easily detected and quite likely to be malicious, but "Зсом"
> (U+0417 U+0441 U+043E U+043C) is a great deal more problematic.
> Similarly "g00gle" or "g00g1e" are almost certain to be
> problematic (especially if one remembers those are all-ASCII
> strings and hence that the first can be written "G00GLE", which
> is less obvious), but we have many labels with digits in the
> middle (e.g., to allow names for routing nodes to match ITU
> standards) that are not issues at all.

Some of these might not be malicious but satiric.  Satire account names
that are confusable (to some degree) with those of their object of
satire are very common, and social media sites have to navigate this
with care, allowing some and rejecting others.  Some degree of
subjectivity will be needed.

Some registries may choose to not allow satiric domains.  Others may
allow some but not others, with decisions made by humans (with or
without AI help).

One might want to define something of a "confusable distance" metric for
deciding whether two names are "too confusable" for comfort.

> Now, it is clear to me that, while some cases are easily
> addressed, many are not, and some get well into the area of
> subjective, context-dependent and user-experience-dependent,
> judgments.  The question is what to do about it.   The two

Yes, there is way too much subjectivity and way too much local cultural
variation.  Have you noticed that people who learn some language in
their late teens or later years...
tend to have a hard time distinguishing some sounds from that language?
I suspect the same is true for grapheme recognition.

Thus, rules that make sense in TLDs with a prevalence of some small set
of languages will surely need to vary from those of other TLDs where
other languages prevail.  The more languages are associated with some
TLD (e.g., com), the more restrictive the rules will have to be.

> answers I personally find unsatisfactory are, to exaggerate a
> bit, "we can't solve all of the problem, so we should just give
> up and tell users they are on their own" and "we should solve
> the problems we can and then claim that all of the others are
> edge cases that should not occur in practice".  I think I've

I want to punt to the registries as much as possible.  That does not
fall into either of the two categories you list.

> heard positions stated that soul a lot to me like one or the
> other. so I might be in the minority.  At the same time, I think
> that, no matter what we do, there are cases that depend on
> either user vigilance or registrars taking much more
> responsibility than we have often seen in recent years (or
> both).  That observation isn't new: IDNA2008 imposes the latter
> as a requirement.  If we are going to try to draw a boundary
> between the cases we hope to address by rules and algorithms and
> those cases we want or need to leave to others, I think we are
> obligated to make that boundary as clear and easy to understand
> as possible... and that means we both have to figure out how to
> do the work and to convince ourselves and the IESG that we have
> consensus about it and that the IETF should accept that
> consensus.

The best we can do, IMO, is:

 - cajole registries to develop and enforce policies regarding
   confusables

 - coax ICANN to get the registries to do that

 - develop a few generic policies from which registries could choose to
   apply -- i.e., do their work for them to some extent

   maybe one required generic policy (described above)

 - work with the Unicode Consortium to develop some such generic
   policies

 - develop software tools for the registries and registrars

 - work with the Unicode Consortium to develop such tools

I don't think we can or should want to do anything else.

Note that all such work will be hard to fund at the IETF as it really
involves doing other people's work.  Most IETF participants are funded
to work on Internet protocols, though they tend to stray a little.  But
I doubt most IETF participants could get much time funded for
participating in the above listed activities.

This makes me think that the best we can really do is:

 - cajole ICANN, the registries, the registrars, and the UC

We probably can't develop too much in the way of generic policies or
tools without having people specifically funded to work on them.
However, perhaps ISOC can arrange or obtain funding for those.  Also,
the UC is the best-placed organization for putting together standard
confusable sets -- not the IETF.

All that said, if specific proposals are made, we *can* expect
participants to review them, even if they are not funded to produce such
proposals -- reviewing other people's work is understood to be a cost of
doing business at the IETF.

Nico
--