[I18nrp] Confusion among characters and strings

John C Klensin <john-ietf@jck.com> Tue, 12 June 2018 17:26 UTC

Archived-At: <https://mailarchive.ietf.org/arch/msg/i18nrp/NslO8oYGZxXpWqR5Io2Z4cJZl7A>

Hi.

I'm still trying to resist discussions of specific proposed
fixes, but, because "confusability" has now been mentioned by
several people as if it were the only issue (or even nearly the
only issue) and easily solved, let me push back on that as a
sort of third prong to suggesting that reviewing actual i18n
protocol work is different from reviewing protocols that have
non-ASCII elements or implications.

First, if one could wave a magic algorithm (or wand) at the
confusion issue and make it disappear, there would still be a
number of outstanding issues, with the question of what to do
about normalization anomalies at the top of today's version of
my personal list.  So "if we solve confusion, there are no other
important issues" is, at least IMO, fairly obviously false.

(A small note on what follows: where convenient, I've chosen
examples from scripts I think are likely to be readily
available, and to render in a reasonably consistent and
predictable way, for most people and computer setups who are
likely to be dealing with the ASCII in which most of this text
is written.  All non-ASCII characters have been identified by
the Unicode code points and I've sometimes omitted the graphic
forms because they were not easily inserted.  I've avoided scripts
that require special rendering rules, scripts that are
predominantly right-to-left, and scripts with very complex
characters for convenience, not because similar examples could
not be constructed with them.  I hope I've avoided this note
becoming an English/ASCII user's view of the world; if it seems
to drift in that direction, the reason is more likely the above
than that particular bias.)

The more important thing about confusion is that it spans a very
wide range of issues.  At one extreme are characters that are
assigned to different code points even though their graphemes
are, in closely-related scripts, historically copies of each
other.  Consider the oft-cited "a" (U+0061) and "а"
(U+0430) or "o" (U+006F) and "ο" (U+03BF).  As what I think of
as another tier, consider cases that are extremely dependent on
choices of type styles, such as  "y" (U+0079), "у" (U+0443) and
"ч" (U+0447).  Some of these can also be traps for the unwary
(whether automated or human).  For example, if one is not aware
of the existence of a character that looks like "ч" separate
from "у", the two are a lot more likely to be mistaken for each
other.  Of course, typestyle distinctions can be used to reduce
or eliminate confusion, not just to cause it, with the classic
ASCII example being the use of a slashed 0 for zero,
unambiguously distinguishing it from "O" and "o".  Proving that
one should be careful about what one wishes for, as one starts
using non-Basic Latin, that slashed zero is potentially
confusable with "ø" (U+00F8).
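To make the "different code points" point concrete, here is a
minimal sketch in Python, using only the standard unicodedata
module.  The helper name describe() and the PAIRS list are my own
illustration, not part of any specification; the pairs are taken
from the examples above.

```python
import unicodedata

# Look-alike pairs from the text, written as escapes so the
# distinction survives whatever font is used to read this.
PAIRS = [("a", "\u0430"), ("o", "\u03bf"), ("y", "\u0443")]

def describe(ch):
    """Return 'U+XXXX NAME' for a single character (illustrative helper)."""
    return "U+{:04X} {}".format(ord(ch), unicodedata.name(ch))

for left, right in PAIRS:
    # The glyphs may render alike, but the strings never compare equal.
    print(describe(left), "|", describe(right), "| equal:", left == right)
```

A comparison at the code point level always tells these apart;
the human reading the rendered glyphs may not.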

As a final example in this general group for people to think
about, "3com" caused the original rule prohibiting domain name
labels from having leading digits to be relaxed. Is "Зcom" (first character
is U+0417) confusingly similar?  If yes, is it confusing to the
same or a greater degree in "label12З4"?

In another tier are coincidences in which similar graphemes
occur, with different "meanings", in mostly-unrelated scripts.
For example, are "o" (U+006F) and "٥" (U+0665) confusable with
each other?  The answer probably depends on type style, type
size, how closely the user is looking and what she is expecting,
etc.  Or how about U+110B, U+17E0, or U+25CB?   For IDNs (but
they are not the only important case), it may or may not be
important that some of the code points in that list are
prohibited by IDNA2008: if registries adhere to IDNA2008, or
lookup applications follow the IDNA2008 rules for checking,
they will be rejected even without worrying about confusion.
However, we know of registries that ignore those rules and
register strings as labels that IDNA2008 prohibits (and there
are almost certainly more we don't know about if the whole DNS
is considered) and that UTR#46 recommends not making those
lookup-time checks, so it is hard to rely on that.

The sequence continues with a number of other cases, but the
above should make at least part of the problem clear: depending
on circumstances and expectations, people cannot (or will not)
distinguish among things that computer algorithms can
distinguish easily, especially if the algorithms are relying on
Unicode code points and high-distinguishability fonts that the
people may not be seeing (especially in the presence of CSS on
web pages and some equivalents elsewhere).  Even if one uses artificial
intelligence-like techniques, the resulting algorithms and tests
are going to be no better than their training: where humans
cannot agree on what is confusable and there is no consensus
about "right", it is difficult to believe in the reliability and
accuracy of such methods.

Scripts that require special rendering and type families that
use kerning sufficiently aggressively to construct
near-ligatures create more "opportunities" in the form of cases
in which multiple code points must be considered as a sequence
even without combining characters.   As an ASCII example and
with some type styles, it can be hard to distinguish "rn" in the
middle of a string from "m" unless one has other context.

However, there is another point at the far end of that spectrum.
A person with sufficient expectations of seeing one thing, e.g.,
because they are expecting a particular script family and not
some others, will see what they expect to see.   For example, is
  Toys-Я-Us
an insertion of a Cyrillic character (in which case it should
not match "Toys-R-Us" and might be considered confusingly
similar) or is it someone overdosing on cleverness (in which
case maybe it does compare equal -- a conclusion that I
understand has been reached by trademark authorities)?
Similarly, with the right choice of type styles, small type, or
a hangover,
    "n" and "ฑ" (U+0E11)
can be hard to tell apart.

It is probably also worth remembering that, to a typical user,
the question of whether "color" and "colour" are equivalent is at
least as significant (and a potential source of confusion) as
the relationship between "color" and "co1or".

As an additional complication, there may be a difference between
the rules we would make if we are concerned about problems that
could arise from accidents or linguistic issues versus those that
are likely to occur only in the presence of malice (or an excess
of cleverness).   Coming back to the 3com example, "Зcom" is
easily detected and quite likely to be malicious, but "Зсом"
(U+0417 U+0441 U+043E U+043C) is a great deal more problematic.
Similarly "g00gle" or "g00g1e" are almost certain to be
problematic (especially if one remembers those are all-ASCII
strings and hence that the first can be written "G00GLE", which
is less obvious), but we have many labels with digits in the
middle (e.g., to allow names for routing nodes to match ITU
standards) that are not issues at all.
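As a rough illustration of why the first of those strings is
easy to flag and the second is not, here is a sketch of a naive
mixed-script check in Python.  It is an assumption-laden toy: the
standard library does not expose the Unicode Script property, so
the first word of each character's name is used as a crude
stand-in, and the function names scripts_in() and
is_mixed_script() are hypothetical.

```python
import unicodedata

def scripts_in(label):
    """Crude script guess: first word of each letter's Unicode name.

    Assumption: this stands in for the real Script property, which
    Python's standard library does not provide.  Digits and
    punctuation are skipped.
    """
    return {unicodedata.name(ch).split()[0] for ch in label if ch.isalpha()}

def is_mixed_script(label):
    """True if the letters of the label come from more than one 'script'."""
    return len(scripts_in(label)) > 1

mixed = "\u0417com"                 # Cyrillic ZE followed by Latin "com"
whole = "\u0417\u0441\u043e\u043c"  # Cyrillic throughout
for label in (mixed, whole, "g00gle"):
    print(ascii(label), sorted(scripts_in(label)), is_mixed_script(label))
```

The mixed-script test catches the first label, but the
all-Cyrillic one and the all-ASCII "g00gle" pass untouched, which
is the sense in which those cases need something other than a
simple rule.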

Now, it is clear to me that, while some cases are easily
addressed, many are not, and some get well into the area of
subjective, context-dependent, and user-experience-dependent
judgments.  The question is what to do about it.  The two
answers I personally find unsatisfactory are, to exaggerate a
bit, "we can't solve all of the problem, so we should just give
up and tell users they are on their own" and "we should solve
the problems we can and then claim that all of the others are
edge cases that should not occur in practice".   I think I've
heard positions stated that sound a lot to me like one or the
other, so I might be in the minority.  At the same time, I think
that, no matter what we do, there are cases that depend on
either user vigilance or registrars taking much more
responsibility than we have often seen in recent years (or
both).  That observation isn't new: IDNA2008 imposes the latter
as a requirement.  If we are going to try to draw a boundary
between the cases we hope to address by rules and algorithms and
those cases we want or need to leave to others, I think we are
obligated to make that boundary as clear and easy to understand
as possible... and that means we both have to figure out how to
do the work and to convince ourselves and the IESG that we have
consensus about it and that the IETF should accept that
consensus.

best, 
  john


So it seems to me that another one of those decisions the IETF
needs to figure out how to address is where