Re: Possible BofF question -- I18n

Nico Williams <nico@cryptonector.com> Tue, 05 June 2018 06:53 UTC

From: Nico Williams <nico@cryptonector.com>
Date: Tue, 05 Jun 2018 02:53:30 -0400
Message-ID: <CAK3OfOh8SsGgKTSKFXvgpY3Ju1=Mz+csuu9P_AMcC8KXBxfkkQ@mail.gmail.com>
Subject: Re: Possible BofF question -- I18n
To: Barry Leiba <barryleiba@computer.org>
Cc: Brian E Carpenter <brian.e.carpenter@gmail.com>, IETF general list <ietf@ietf.org>, John R Levine <johnl@taugh.com>
Archived-At: <https://mailarchive.ietf.org/arch/msg/ietf/A1I9CyoVogeO6MvszupTNS3IgkE>

On Tue, Jun 5, 2018 at 1:50 AM Barry Leiba <barryleiba@computer.org> wrote:

> On Mon, Jun 4, 2018 at 11:10 PM, Nico Williams <nico@cryptonector.com>
> wrote:
> >> We're in a space where the evaluation of A==B depends on more than
> >> the bit strings A and B. Your post about form-insensitive filename
> >> comparisons is a case in point, although I don't pretend to understand
> >> it. OK, we can argue whether that's a dark art or simply complicated
> >
> >   form_insensitive_strcmp(a, b) == memcmp(normalize(a), normalize(b))
> >
> > Except that actually one can greatly optimize this to avoid most of the
> > compute and memory cost of normalization.
> >
> > To see why consider comparing my first name as I usually write it
> > (Nicolas) vs.  how it should be written (Nicolás).  The two strings
> > should compare as not equivalent.  But the two ways to write the second
> > form (with the á precomposed vs. decomposed) should compare as
> > equivalent (because they are).
>
> But there's one of the things that makes this a complicated topic:

I was describing a specific primitive, but I do like your taking this
further:

> - we say that "nicolas" is not equivalent to "nicolás"
> - but we say that "nicolás" *is* equivalent to "nicola´s", and we
> handle this using normalization

Right, that is simple enough.  For some value of simple.  You need
normalization code (which isn't trivial), then it is simple.
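
A minimal sketch of that primitive, assuming Python's standard
unicodedata module and NFC as the comparison form. The function name and
the two fast paths are illustrative assumptions, not the optimized
implementation alluded to in the thread:

```python
import unicodedata


def form_insensitive_equal(a: str, b: str, form: str = "NFC") -> bool:
    """Return True if a and b are equal up to Unicode canonical equivalence."""
    # Fast path: identical strings are trivially equivalent.
    if a == b:
        return True
    # Fast path: ASCII-only strings are unchanged by normalization, so two
    # ASCII strings that differ cannot be form-equivalent.
    if a.isascii() and b.isascii():
        return False
    # Slow path: normalize both sides, then compare.
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


plain = "nicolas"
precomposed = "nicol\u00e1s"   # U+00E1 LATIN SMALL LETTER A WITH ACUTE
decomposed = "nicola\u0301s"   # 'a' followed by U+0301 COMBINING ACUTE ACCENT

print(form_insensitive_equal(precomposed, decomposed))
print(form_insensitive_equal(plain, precomposed))
```

The fast paths only hint at the kind of saving available: most
comparisons never need to allocate a normalized copy at all.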

> - does that mean that it's OK to have "nicolas" and "nicolás" as two
> different usernames assigned to two different users?

In filesystems there's also the question of whether to be case-sensitive,
and that can be a per-filesystem opt-in.

As to usernames, principal names, and so on, well, it's a rather subjective
choice.  "Nicolas" is perfectly correct in French, and is distinct from
"Nicolás", though it can be confusing, especially if you have software that
cannot display accents...

Now, the obvious question is: is this something a protocol should address
by making á equivalent to 'a' globally, or should this be policy
local to an appropriate administrative domain?  Well, that's a bit of a
judgement call, but the best option is to give people the freedom to make
that choice where possible.  Thus, not globally treating 'a' and its
accented combinations as equivalent to each other is the better
approach.

In terms of DNS, consider a proposal to ban mixing of scripts in any one
label...  But in South Korea it is common to mix Hangul with -ing endings,
so why should .kr not be allowed to use at least that sort of script
mixing?  There are almost certainly other similar cases, and more will
arise as culture evolves!
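
To make the "local policy" idea concrete, here is a rough sketch in
Python of a per-registry script-mixing check. Everything in it is an
assumption for illustration: the policy tables are hypothetical, and
taking the first word of a character's Unicode name as its script is only
a crude stand-in for the real Script property, which the standard library
does not expose.

```python
import unicodedata


def scripts_in_label(label: str) -> set:
    """Approximate the set of scripts used by the letters in a label."""
    scripts = set()
    for ch in label:
        if ch.isalpha():
            # Assumption: the first word of the character name
            # ("LATIN", "HANGUL", "CYRILLIC", ...) approximates the script.
            scripts.add(unicodedata.name(ch, "UNKNOWN").split()[0])
    return scripts


def label_allowed(label: str, allowed_mixes: list) -> bool:
    """Accept a label if its scripts fit inside one permitted combination."""
    found = scripts_in_label(label)
    return any(found <= allowed for allowed in allowed_mixes)


# Hypothetical policies: one registry permits Hangul mixed with Latin,
# another permits only single-script labels.
HYPOTHETICAL_KR_POLICY = [{"HANGUL", "LATIN"}]
HYPOTHETICAL_STRICT_POLICY = [{"LATIN"}, {"HANGUL"}]

mixed_label = "\uc1fc\ud551ing"  # two Hangul syllables followed by "ing"

print(label_allowed(mixed_label, HYPOTHETICAL_KR_POLICY))
print(label_allowed(mixed_label, HYPOTHETICAL_STRICT_POLICY))
```

The point of the sketch is that the checking code is identical for both
registries; only the policy table differs.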

Who is in a better position than the registries to make such a decision?
Certainly NOT the IETF, not any one participant and not the IETF
collectively.

There is a big difference between form equivalence (same exact character,
two or more ways to represent it at the codepoint level) and confusables.
We can trivially (see above) deal with the former, but the latter is going
to need local policy.  I really don't see a better answer re: confusables,
and I know that's not a popular opinion, but I don't think it's wrong.

> - if yes, how do we deal with the human interface issues involved?
> What happens if the human identified as "nicolás" uses an input
> mechanism that doesn't have a way to enter "á"?  How can he log in?

Answered above.

> - if no, how do we make sure (in an automated way) that we don't make
> that assignment?

This one is easy: IF you really want this (I don't think we should want
this globally), decompose (normalize to NFD), then drop the combining
codepoints.  This answer won't work for cross-script confusables, naturally,
which is partly why I wouldn't recommend this approach.
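
A short sketch of the decompose-then-drop step, assuming Python's
unicodedata module. The function name is illustrative, and "combining
codepoint" is taken here to mean any character with a nonzero canonical
combining class, which is one possible reading:

```python
import unicodedata


def strip_combining_marks(s: str) -> str:
    """Normalize to NFD, then drop characters with a nonzero combining class."""
    decomposed = unicodedata.normalize("NFD", s)
    return "".join(ch for ch in decomposed if unicodedata.combining(ch) == 0)


accented = ["nicol\u00e1s", "nicol\u00e0s", "nicol\u00e2s", "nicol\u00e4s"]
for name in accented:
    # Each accented Latin variant collapses to the same unaccented skeleton.
    print(strip_combining_marks(name))

# A Cyrillic U+0430 has no decomposition, so this confusable is untouched
# and still differs from the all-Latin spelling.
cyrillic_variant = "nicol\u0430s"
print(strip_combining_marks(cyrillic_variant) == "nicolas")
```

The last comparison shows the limitation noted above: stripping accents
does nothing for look-alike characters from another script.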

> - does the answer change if "nicolás" is a domain name instead of a
> username?

Same answer!  The local authority (here: the registry) should decide this,
write a policy, and enforce it (by having the registrars implement it).
(See comments below about user-agents as vessels for local policy as well.)

I don't think we can make such a policy globally that doesn't risk angering
some local communities.

> - does the answer change if "nicolás" is a *password*?

It can.  Losing some entropy in a password might be safe, but this is
simpler as a global policy rather than as local policy.  It's even simpler
to tell users to only use characters they can reliably input on all devices
(this isn't as trivial as it should be, but by and large this approach
works).

> - and what about "nicolàs"?  and "nicolâs"?  and "nicoläs"?
> - what about "nicolаs" (that's a Cyrillic character in the penultimate
> position)?
> - what about "nicolαs" (that's a Greek character in the penultimate
> position)?
> - what about other Unicode characters that look like "a", either
> exactly (as with Cyrillic) or closely (as with Greek)?
> - what about handling of "ä" vs "ae"?  Do we want to avoid assigning
> "käse" and "kaese" as distinct usernames?

Same answers as above.

> Does the answer to this
> differ depending upon whether the language is German (where using "ae"
> to represent "ä" is common) or Swedish (where it is not)?

Only if the context can let an end-user choose one (or more) language(s).
DNS, for example, cannot.  A filesystem cannot either.  Text documents /
word processors can (and might, especially in a search function).
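
As a rough illustration of a language-aware context such as a
word-processor search function, here is a Python sketch in which the
folding applied depends on a language the user has selected. The folding
tables and the function name are hypothetical and far from complete:

```python
import unicodedata

# Hypothetical per-language folding tables. German commonly writes
# "ae" for "ä"; Swedish does not, so its table is left empty.
LANGUAGE_FOLDS = {
    "de": {"\u00e4": "ae", "\u00f6": "oe", "\u00fc": "ue"},
    "sv": {},
}


def search_key(text: str, lang: str) -> str:
    """Build a comparison key using the folding rules of the chosen language."""
    # Normalize first so precomposed and decomposed input behave the same.
    text = unicodedata.normalize("NFC", text).casefold()
    table = LANGUAGE_FOLDS.get(lang, {})
    return "".join(table.get(ch, ch) for ch in text)


print(search_key("K\u00e4se", "de") == search_key("kaese", "de"))
print(search_key("K\u00e4se", "sv") == search_key("kaese", "sv"))
```

A protocol such as DNS has no place to carry the lang argument, which is
why this kind of matching only fits contexts where a user can supply it.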

> Now extend this to the many other characters that can look similar
> (say, "n" vs "ñ" in Spanish).  Extend it to other language-related
> issues ("i" vs "ı" vs "İ" vs "I" in Turkish; all the character
> variants in Arabic).

Same answers as above.

The protocols should be permissive.  Local policies should be less so -
perhaps no more permissive than is absolutely necessary.

Note that a user-agent is also a place where local policy can be applied.
In fact, there exist browser extensions to deal with confusables.

> These are only some of the reasons it's difficult.  And the number of
> people who stand up and say, "oh, just <do this> and the problem is
> solved," demonstrates that too too too many people *think* they
> understand... and don't.

It's difficult because our world culture has globalized while at the same
time we are not willing to unify confusable characters.  When I say "we"
here I mean mankind in all its local polities.  We've tried Han
unification, and that failed as a matter of politics.  We (the IETF) can
hate this all we like, but we cannot change it and should not even try.
We've talked about human rights and I18N; some might say that getting
their characters drawn the way they want, without needing a user context,
is a human right.  These are global political issues way beyond the IETF's
reach.

Nico
--
