Re: Possible BofF question -- I18n

Nico Williams <nico@cryptonector.com> Tue, 05 June 2018 06:53 UTC

References: <383c2404-7beb-63e9-b2b2-e75fd1b174f1@mozilla.com> <20180601041949.GH14446@localhost> <A13FFF23-49BD-459D-8B5B-D3448154EEBC@frobbit.se> <20180601151053.GI14446@localhost> <2584adb9-1622-8b49-7236-ecc7dd374974@mozilla.com> <alpine.OSX.2.21.1806011219340.7621@ary.qy> <CAK3OfOgv33SJiPJ6ypo8k5hcpnjcJdRso6EXb9b12YNcdDgMUg@mail.gmail.com> <6c5d5618-74a5-dcc8-d818-89243a41f307@gmail.com> <20180603061350.GM14446@localhost> <d125f213-c096-1e93-0a6e-ffdfc55a7ac6@gmail.com> <20180605031021.GO14446@localhost> <CAC4RtVAHd37mHFv7TypVdKATtHtBNX0pEszbn+ke5RMh-oExMA@mail.gmail.com>
In-Reply-To: <CAC4RtVAHd37mHFv7TypVdKATtHtBNX0pEszbn+ke5RMh-oExMA@mail.gmail.com>
From: Nico Williams <nico@cryptonector.com>
Date: Tue, 05 Jun 2018 02:53:30 -0400
X-Gmail-Original-Message-ID: <CAK3OfOh8SsGgKTSKFXvgpY3Ju1=Mz+csuu9P_AMcC8KXBxfkkQ@mail.gmail.com>
Message-ID: <CAK3OfOh8SsGgKTSKFXvgpY3Ju1=Mz+csuu9P_AMcC8KXBxfkkQ@mail.gmail.com>
Subject: Re: Possible BofF question -- I18n
To: Barry Leiba <barryleiba@computer.org>
Cc: Brian E Carpenter <brian.e.carpenter@gmail.com>, IETF general list <ietf@ietf.org>, John R Levine <johnl@taugh.com>
Content-Type: multipart/alternative; boundary="000000000000cb0675056ddf8246"
Archived-At: <https://mailarchive.ietf.org/arch/msg/ietf/A1I9CyoVogeO6MvszupTNS3IgkE>

On Tue, Jun 5, 2018 at 1:50 AM Barry Leiba <barryleiba@computer.org> wrote:

> On Mon, Jun 4, 2018 at 11:10 PM, Nico Williams <nico@cryptonector.com>
> wrote:
> >> We're in a space where the evaluation of A==B depends on more than
> >> the bit strings A and B. Your post about form-insensitive filename
> >> comparisons is a case in point, although I don't pretend to understand
> >> it. OK, we can argue whether that's a dark art or simply complicated
> >
> >   form_insensitive_strcmp(a, b) == memcmp(normalize(a), normalize(b))
> >
> > Except that actually one can greatly optimize this to avoid most of the
> > compute and memory cost of normalization.
> >
> > To see why consider comparing my first name as I usually write it
> > (Nicolas) vs.  how it should be written (Nicolás).  The two strings
> > should compare as not equivalent.  But the two ways to write the second
> > form (with the acute accent precomposed vs. decomposed) should compare as
> > equivalent (because they are).
>
> But there's one of the things that makes this a complicated topic:


I was describing a specific primitive, but I do like your taking this
further:

> - we say that "nicolas" is not equivalent to "nicolás"
> - but we say that "nicolás" *is* equivalent to "nicola´s", and we
> handle this using normalization


Right, that is simple enough.  For some value of simple.  You need
normalization code (which isn't trivial), then it is simple.
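
A minimal sketch of that primitive, assuming Python's standard
unicodedata module supplies the normalization code.  The function name
is hypothetical, and the two fast paths are only a crude stand-in for
the optimization mentioned earlier in the thread: identical strings and
pure-ASCII strings never need normalizing, because ASCII text is
unchanged by normalization.

```python
import unicodedata


def form_insensitive_equal(a: str, b: str) -> bool:
    """Compare two strings, ignoring differences in Unicode normalization form."""
    # Fast path 1: identical code point sequences are trivially equivalent.
    if a == b:
        return True
    # Fast path 2: ASCII is invariant under normalization, so two different
    # ASCII strings can never normalize to the same thing.
    if a.isascii() and b.isascii():
        return False
    # Slow path: normalize both sides to the same form and compare.
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


precomposed = "nicol\u00e1s"   # U+00E1 LATIN SMALL LETTER A WITH ACUTE
decomposed = "nicola\u0301s"   # 'a' followed by U+0301 COMBINING ACUTE ACCENT

print(form_insensitive_equal(precomposed, decomposed))  # True
print(form_insensitive_equal("nicolas", precomposed))   # False
```

A real implementation would normalize lazily, one character at a time,
rather than allocating two fully normalized copies as this sketch does.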

> - does that mean that it's OK to have "nicolas" and "nicolás" as two
> different usernames assigned to two different users?


In filesystems there's also the question of whether to be case-sensitive,
and that can be a per-filesystem opt-in.

As to usernames, principal names, and so on, well, it's a rather subjective
choice.  "Nicolas" is perfectly correct in French, and is distinct from
"Nicolás", though it can be confusing, especially if you have software that
cannot display accents...

Now, the obvious question is: is this something a protocol should address
by making 'á' equivalent to 'a' globally, or should this be policy
local to an appropriate administration domain?  Well, that's a bit of a
judgement call, but the best option is to give people the freedom to make
that choice where possible.  Thus, not globally considering any
accented forms of 'a' equivalent to each other or to plain 'a' is the better
approach.

In terms of DNS, consider a proposal to ban mixing of scripts in any one
label...  But in South Korea it is common to mix Hangul with -ing endings,
so why should .kr not be allowed to use at least that sort of script
mixing?  There are almost certainly other similar cases, and more will
arise as culture evolves!

Who is in a better position than the registries to make such a decision?
Certainly NOT the IETF, not any one participant and not the IETF
collectively.
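
As an illustration of what such a registry-local policy could look
like, here is a rough sketch.  Everything in it is an assumption: the
function names and the KR_POLICY table are hypothetical, and the script
of a character is guessed from the first word of its Unicode character
name, which is only a heuristic and not the real Unicode Script
property.

```python
import unicodedata


def scripts_in_label(label: str) -> set:
    """Return a rough set of scripts used by the letters of a label.

    Heuristic: the first word of the Unicode character name ("LATIN",
    "HANGUL", "CYRILLIC", ...).  Digits and hyphens are ignored.
    """
    scripts = set()
    for ch in label:
        if ch.isalpha():
            scripts.add(unicodedata.name(ch, "UNKNOWN").split()[0])
    return scripts


def label_allowed(label: str, allowed_mixes: list) -> bool:
    """Apply a local policy: single-script labels always pass, mixed-script
    labels pass only if the mix is explicitly allowed by the registry."""
    found = scripts_in_label(label)
    if len(found) <= 1:
        return True
    return any(found <= mix for mix in allowed_mixes)


# Hypothetical policy for a registry that permits Hangul mixed with Latin.
KR_POLICY = [{"HANGUL", "LATIN"}]

print(label_allowed("\ud55c\uae00ing", KR_POLICY))  # True: Hangul + Latin
print(label_allowed("nicol\u0430s", KR_POLICY))     # False: Latin + Cyrillic
```

The point of the sketch is only that the list of allowed mixes is data
owned by the registry, not something baked into the protocol.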

There is a big difference between form equivalence (same exact character,
two or more ways to represent it at the codepoint level) and confusables.
We can trivially (see above) deal with the former, but the latter is going
to need local policy.  I really don't see a better answer re: confusables,
and I know that's not a popular opinion, but I don't think it's wrong.

> - if yes, how do we deal with the human interface issues involved?
> What happens if the human identified as "nicolás" uses an input
> mechanism that doesn't have a way to enter "á"?  How can he log in?


Answered above.

> - if no, how do we make sure (in an automated way) that we don't make
> that assignment?


This one is easy: IF you really want this (I don't think we should want
this globally), decompose (normalize to NFD), then drop the combining
codepoints.  This answer won't work for cross-script confusables, naturally,
which is partly why I wouldn't recommend this approach.
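
The decompose-then-drop step can be sketched as follows, again assuming
Python's unicodedata module.  The function name is hypothetical, and
"combining codepoint" is interpreted here as any character in a Unicode
Mark category (Mn, Mc, Me), which is an assumption that is too
aggressive for scripts where marks carry vowels.

```python
import unicodedata


def strip_combining_marks(s: str) -> str:
    """Normalize to NFD, then drop every combining mark."""
    decomposed = unicodedata.normalize("NFD", s)
    return "".join(
        ch for ch in decomposed
        if not unicodedata.category(ch).startswith("M")
    )


# All accented Latin variants collapse to the same unaccented string.
for name in ["nicol\u00e1s", "nicol\u00e0s", "nicol\u00e2s", "nicol\u00e4s"]:
    print(strip_combining_marks(name) == "nicolas")  # True each time

# A Cyrillic 'a' (U+0430) has no decomposition, so it survives unchanged
# and the result still differs from the all-Latin string.
print(strip_combining_marks("nicol\u0430s") == "nicolas")  # False
```

The last line shows the limitation noted above: dropping combining
marks does nothing for cross-script confusables.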

> - does the answer change if "nicolás" is a domain name instead of a
> username?


Same answer!  The local authority (here: the registry) should decide this,
write a policy, and enforce it (by having the registrars implement it).
(See comments below about user-agents as vessels for local policy as well.)

I don't think we can make such a policy globally that doesn't risk angering
some local communities.

> - does the answer change if "nicolás" is a *password*?


It can.  Losing some entropy in a password might be safe, but this is
simpler as a global policy rather than as local policy.  It's even simpler
to tell users to only use characters they can reliably input on all devices
(this isn't as trivial as it should be, but by and large this approach
works).

> - and what about "nicolàs"?  and "nicolâs"?  and "nicoläs"?
> - what about "nicolаs" (that's a Cyrillic character in the penultimate
> position)?
> - what about "nicolαs" (that's a Greek character in the penultimate
> position)?
> - what about other Unicode characters that look like "a", either
> exactly (as with Cyrillic) or closely (as with Greek)?
> - what about handling of "ä" vs "ae"?  Do we want to avoid assigning
> "käse" and "kaese" as distinct usernames?


Same answers as above.

> Does the answer to this
> differ depending upon whether the language is German (where using "ae"
> to represent "ä" is common) or Swedish (where it is not)?


Only if the context can let an end-user choose one (or more) language(s).
DNS, for example, cannot.  A filesystem cannot either.  Text documents /
word processors can (and might, especially in a search function).

> Now extend this to the many other characters that can look similar
> (say, "n" vs "ñ" in Spanish).  Extend it to other language-related
> issues ("i" vs "ı" vs "İ" vs "I" in Turkish; all the character
> variants in Arabic).


Same answers as above.

The protocols should be permissive.  Local policies should be less so -
perhaps no more permissive than is absolutely necessary.

Note that a user-agent is also a place where local policy can be applied.
In fact, there exist browser extensions to deal with confusables.

> These are only some of the reasons it's difficult.  And the number of
> people who stand up and say, "oh, just <do this> and the problem is
> solved," demonstrates that too too too many people *think* they
> understand... and don't.


It's difficult because our world culture has globalized while at the same
time we are not willing to unify confusable characters.  When I say "we"
here I mean mankind in all its local polities.  We've tried Han
unification, and that failed as a matter of politics.  We (the IETF) can
hate this all we like, but we cannot change it and should not even try.
We've talked about human rights and I18N.  Some might say that getting
their characters drawn the way they want, without needing a user context,
is a human right.  These are global political issues way beyond the IETF's
reach.

Nico
--