Re: Possible BofF question -- I18n

Barry Leiba <barryleiba@computer.org> Tue, 05 June 2018 05:50 UTC

To: Nico Williams <nico@cryptonector.com>
Cc: Brian E Carpenter <brian.e.carpenter@gmail.com>, John R Levine <johnl@taugh.com>, IETF general list <ietf@ietf.org>

On Mon, Jun 4, 2018 at 11:10 PM, Nico Williams <nico@cryptonector.com> wrote:
>> We're in a space where the evaluation of A==B depends on more than
>> the bit strings A and B. Your post about form-insensitive filename
>> comparisons is a case in point, although I don't pretend to understand
>> it. OK, we can argue whether that's a dark art or simply complicated
>
>   form_insensitive_strcmp(a, b) == memcmp(normalize(a), normalize(b))
>
> Except that actually one can greatly optimize this to avoid most of the
> compute and memory cost of normalization.
>
> To see why consider comparing my first name as I usually write it
> (Nicolas) vs.  how it should be written (Nicolás).  The two strings
> should compare as not equivalent.  But the two ways to write the second
> form (with the ´ precomposed vs. decomposed) should compare as
> equivalent (because they are).

But that's one of the things that make this a complicated topic:

- we say that "nicolas" is not equivalent to "nicolás"
- but we say that "nicolás" *is* equivalent to "nicola´s", and we
handle this using normalization
- does that mean that it's OK to have "nicolas" and "nicolás" as two
different usernames assigned to two different users?
- if yes, how do we deal with the human interface issues involved?
What happens if the human identified as "nicolás" uses an input
mechanism that doesn't have a way to enter "á"?  How can he log in?
- if no, how do we make sure (in an automated way) that we don't make
that assignment?
- does the answer change if "nicolás" is a domain name instead of a username?
- does the answer change if "nicolás" is a *password*?
- and what about "nicolàs"?  and "nicolâs"?  and "nicoläs"?
- what about "nicolаs" (that's a Cyrillic character in the penultimate
position)?
- what about "nicolαs" (that's a Greek character in the penultimate position)?
- what about other Unicode characters that look like "a", either
exactly (as with Cyrillic) or closely (as with Greek)?
- what about handling of "ä" vs "ae"?  Do we want to avoid assigning
"käse" and "kaese" as distinct usernames?  Does the answer to this
differ depending upon whether the language is German (where using "ae"
to represent "ä" is common) or Swedish (where it is not)?
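
A few of the cases in the list above can be checked with a minimal
Python sketch. It assumes NFC as the normalization form (the thread
does not say which one is meant), and the helper names
form_insensitive_equal and strip_marks are made up for illustration.
It shows only what normalization does and does not fold together; it
does not answer any of the policy questions.

```python
import unicodedata


def form_insensitive_equal(a: str, b: str) -> bool:
    """Compare two strings after normalizing both to NFC (assumed form)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


def strip_marks(s: str) -> str:
    """Hypothetical 'accent-insensitive' folding: decompose, drop combining marks."""
    decomposed = unicodedata.normalize("NFD", s)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


plain = "nicolas"
precomposed = "nicol\u00e1s"   # á as a single code point (U+00E1)
decomposed = "nicola\u0301s"   # a followed by COMBINING ACUTE ACCENT (U+0301)
cyrillic = "nicol\u0430s"      # CYRILLIC SMALL LETTER A (U+0430)
greek = "nicol\u03b1s"         # GREEK SMALL LETTER ALPHA (U+03B1)

# Normalization makes the two spellings of "nicolás" equal...
print(form_insensitive_equal(precomposed, decomposed))
# ...but keeps "nicolas" distinct from "nicolás"...
print(form_insensitive_equal(plain, precomposed))
# ...and does nothing about look-alike letters from other scripts.
print(form_insensitive_equal(plain, cyrillic))
print(form_insensitive_equal(plain, greek))

# Dropping combining marks folds á, à, â and ä all to "a", which also
# means "käse" becomes "kase", not the German transliteration "kaese".
print(strip_marks("nicol\u00e4s"), strip_marks("k\u00e4se"))
```

Whether any of these foldings is the right one for usernames, domain
names or passwords is exactly the open question; the code only makes
the differences visible.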

Now extend this to the many other characters that can look similar
(say, "n" vs "ñ" in Spanish).  Extend it to other language-related
issues ("i" vs "ı" vs "İ" vs "I" in Turkish; all the character
variants in Arabic).
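
The Turkish case can be illustrated the same way. Python's built-in
case mappings are locale-independent, so the sketch below contrasts
them with a hypothetical turkish_lower helper that applies the two
Turkish-specific mappings by hand. That helper is an assumption made
for illustration, not a complete or standard implementation.

```python
DOTTED_CAPITAL_I = "\u0130"   # İ
DOTLESS_SMALL_I = "\u0131"    # ı


def turkish_lower(s: str) -> str:
    """Hypothetical Turkish-aware lowercasing: İ -> i and I -> ı, then lower()."""
    s = s.replace(DOTTED_CAPITAL_I, "i").replace("I", DOTLESS_SMALL_I)
    return s.lower()


# Locale-independent mapping: "I" lowercases to dotted "i".
print("ISPARTA".lower())
# Turkish-aware mapping: "I" lowercases to dotless "ı".
print(turkish_lower("ISPARTA"))

# The default lowercase of İ is "i" plus COMBINING DOT ABOVE (U+0307),
# two code points, which a plain comparison with "i" will not match.
print(len(DOTTED_CAPITAL_I.lower()), DOTTED_CAPITAL_I.lower() == "i")

# Dotless ı and dotted i both uppercase to "I" by default, so a
# comparison done via upper() treats them as the same letter.
print(DOTLESS_SMALL_I.upper() == "i".upper())
```

So a case-insensitive comparison gives different answers depending on
whether the Turkish rules are applied, and the comparison code has to
learn from somewhere which language it is dealing with.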

These are only some of the reasons it's difficult.  And the number of
people who stand up and say, "oh, just <do this> and the problem is
solved," demonstrates that too too too many people *think* they
understand... and don't.

Barry