Re: Possible BofF question -- I18n

Nico Williams <nico@cryptonector.com> Tue, 05 June 2018 03:10 UTC

Date: Mon, 04 Jun 2018 22:10:24 -0500
From: Nico Williams <nico@cryptonector.com>
To: Brian E Carpenter <brian.e.carpenter@gmail.com>
Cc: John R Levine <johnl@taugh.com>, IETF general list <ietf@ietf.org>
Subject: Re: Possible BofF question -- I18n
Message-ID: <20180605031021.GO14446@localhost>
References: <383c2404-7beb-63e9-b2b2-e75fd1b174f1@mozilla.com> <20180601041949.GH14446@localhost> <A13FFF23-49BD-459D-8B5B-D3448154EEBC@frobbit.se> <20180601151053.GI14446@localhost> <2584adb9-1622-8b49-7236-ecc7dd374974@mozilla.com> <alpine.OSX.2.21.1806011219340.7621@ary.qy> <CAK3OfOgv33SJiPJ6ypo8k5hcpnjcJdRso6EXb9b12YNcdDgMUg@mail.gmail.com> <6c5d5618-74a5-dcc8-d818-89243a41f307@gmail.com> <20180603061350.GM14446@localhost> <d125f213-c096-1e93-0a6e-ffdfc55a7ac6@gmail.com>
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"
Content-Disposition: inline
In-Reply-To: <d125f213-c096-1e93-0a6e-ffdfc55a7ac6@gmail.com>
User-Agent: Mutt/1.5.24 (2015-08-30)
Content-Transfer-Encoding: quoted-printable
Archived-At: <https://mailarchive.ietf.org/arch/msg/ietf/KPekhzCYkVEzZChj1vULKI_cKRY>

On Mon, Jun 04, 2018 at 08:56:36AM +1200, Brian E Carpenter wrote:
> On 03/06/2018 18:13, Nico Williams wrote:
> > I disagree.  It's not a black art.  There are some corners where
> > reasonable people can and will disagree (should emoji be allowed in
> > domainnames?), and there will be some cases that require script-specific
> > expertise, and therefore a lot of time to sort out.  But I18N is not a
> > dark art at all.  If it were, then how would we get anything done in
> > that space?  The E in IETF stands for Engineering, not Dark Art.
> 
> We're in a space where the evaluation of A==B depends on more than
> the bit strings A and B. Your post about form-insensitive filename
> comparisons is a case in point, although I don't pretend to understand
> it. OK, we can argue whether that's a dark art or simply complicated

  form_insensitive_strcmp(a, b) == memcmp(normalize(a), normalize(b))

Except that actually one can greatly optimize this to avoid most of the
compute and memory cost of normalization.

To see why, consider comparing my first name as I usually write it
(Nicolas) vs. how it should be written (Nicolás).  The two strings
should compare as not equivalent.  But the two ways to write the second
form (with the ´ precomposed vs. decomposed) should compare as
equivalent (because they are).
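
A minimal sketch of the unoptimized comparison, in Python rather than
C.  The function name is hypothetical, and NFD is an assumption: any
one normalization form works as long as both sides use the same one.

```python
import unicodedata


def form_insensitive_equal_naive(a: str, b: str) -> bool:
    """Normalize both strings in full, then compare the results."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


plain = "Nicolas"
precomposed = "Nicol\u00e1s"   # U+00E1 LATIN SMALL LETTER A WITH ACUTE
decomposed = "Nicola\u0301s"   # "a" followed by U+0301 COMBINING ACUTE ACCENT

# The accented and unaccented spellings are different names.
assert not form_insensitive_equal_naive(plain, precomposed)
# The two encodings of the accented spelling differ as codepoint sequences...
assert precomposed != decomposed
# ...but are the same name once normalized.
assert form_insensitive_equal_naive(precomposed, decomposed)
```

This normalizes both inputs in full on every call, which is the cost
the fast path is meant to avoid.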

Notionally one can iterate the codepoints in the two strings and compare
them, with a fast-path in the case where pairs of codepoints are
byte-wise equal, and a slow path where they are not.

In most cases all strings are in the same form (as input via whatever
input methods), in which case equal strings compare as equal without
having to use the slow path, and non-equivalent strings compare as
non-equivalent with only as many slow-path executions as needed before
the point where they differ.

The slow path basically collects one non-combining codepoint and as many
combining codepoints as follow it, normalizes just that one character
from each string and memcmp()s them.  The slow path doesn't require
allocation either, since there is a limit to how many codepoints a
character can require.
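
One way the fast-path/slow-path idea might be sketched in Python.  All
names here are hypothetical, NFD is an assumed choice of form, and
Python strings allocate freely, so this shows the control flow only,
not the allocation-free C version described above.  One detail goes
beyond the description: a few characters (Hangul syllables, for
example) decompose into more than one non-combining codepoint, so the
slow path pulls in further segments until both sides line up again.

```python
import unicodedata


def _nfd(s: str) -> str:
    return unicodedata.normalize("NFD", s)


def _starts_segment(ch: str) -> bool:
    """True if text can be cut just before `ch` without changing its NFD."""
    if ch < "\u0300":  # below the first combining mark: cheap common case
        return True
    return unicodedata.combining(_nfd(ch)[0]) == 0


def _at_boundary(s: str, k: int) -> bool:
    return k == len(s) or _starts_segment(s[k])


def _segment_end(s: str, k: int) -> int:
    """Index just past the segment that begins at s[k] (requires k < len(s))."""
    k += 1
    while not _at_boundary(s, k):
        k += 1
    return k


def form_insensitive_equal(a: str, b: str) -> bool:
    i = j = 0
    na, nb = len(a), len(b)
    while True:
        # Fast path: skip identical codepoints, remembering where the
        # most recent segment started in case we have to back up.
        si, sj = i, j
        while i < na and j < nb and a[i] == b[j]:
            if _starts_segment(a[i]):
                si, sj = i, j
            i += 1
            j += 1
        if i == na and j == nb:
            return True
        # A difference in the middle of a segment: back up to its start.
        if not (_at_boundary(a, i) and _at_boundary(b, j)):
            i, j = si, sj
        if i == na or j == nb:
            return False  # one side still has a whole segment left over
        # Slow path: normalize one segment from each side and compare.
        ea, eb = _segment_end(a, i), _segment_end(b, j)
        pa, pb = _nfd(a[i:ea]), _nfd(b[j:eb])
        i, j = ea, eb
        while pa != pb:
            # If one side is a proper prefix of the other, it may just
            # need its next segment; otherwise the strings differ.
            if pb.startswith(pa) and i < na:
                ea = _segment_end(a, i)
                pa += _nfd(a[i:ea])
                i = ea
            elif pa.startswith(pb) and j < nb:
                eb = _segment_end(b, j)
                pb += _nfd(b[j:eb])
                j = eb
            else:
                return False


assert form_insensitive_equal("Nicol\u00e1s", "Nicola\u0301s")
assert not form_insensitive_equal("Nicolas", "Nicol\u00e1s")
```

For two strings already in the same form, the inner fast loop runs to
the end and the normalizer is only consulted for codepoints at or above
U+0300.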

For non-equivalent ASCII-mostly strings this is very fast.

Now, this optimization is pretty obvious -- it's the sort of thing
engineers do.  It's not a black art.

Now, of course, deciding what characters to allow or forbid in some
identifier... admits some subjectivity.  E.g., whether to allow emoji in
domainname labels.  I would submit to you that we already permit emoji
in domainname labels: what else are ideographs (Han/Kanji/whatever) if
not pictographs (emojis) [that have been in use for a long time]?  Is it
not snobbish/elitist to say that you can have any Kanji you want but not
a pictograph?  Have you seen how the cool kids write?  They are really
🆒, sometimes stringing along a sequence of emojis... much like one might
Kanji.

> engineering, but really what I need is (a) some generally applicable
> guidelines on protocol design in this area and (b) some people willing to
> review any relevant design work.

I agree.

For example, I've been saying for a long time that filesystem protocols
should not specify a normalization for filenames and such.  Instead the
filesystems (not the protocol implementations) should use form-
insensitive comparison.  For a protocol like Kerberos...  form-
insensitive comparison doesn't quite work as well as just normalizing as
soon as possible, so do that.  I mean, normalization is not really a
difficult thing anymore -- the code for it exists now.
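
To make the filesystem half of that concrete, here is a toy sketch of a
directory that stores each name exactly as it was created but matches
names form-insensitively.  The Directory class and its methods are
hypothetical, and keying a dict on the NFD form is just one simple way
to get form-insensitive lookup; it is not taken from any particular
filesystem.

```python
import unicodedata


class Directory:
    """Toy directory: names kept as given, matched form-insensitively."""

    def __init__(self):
        # Maps the NFD form of a name to (name as created, contents).
        self._entries = {}

    def create(self, name: str, contents: str) -> None:
        key = unicodedata.normalize("NFD", name)
        if key in self._entries:
            raise FileExistsError(name)
        self._entries[key] = (name, contents)

    def lookup(self, name: str) -> str:
        return self._entries[unicodedata.normalize("NFD", name)][1]

    def listing(self) -> list:
        # Names come back in the form the creator used, not a normalized one.
        return [stored for stored, _ in self._entries.values()]


d = Directory()
d.create("Nicol\u00e1s.txt", "hello")            # created precomposed
assert d.lookup("Nicola\u0301s.txt") == "hello"  # found via decomposed form
assert d.listing() == ["Nicol\u00e1s.txt"]       # original form preserved
```

The protocol carrying these names never needs to say which form to use;
only the comparison inside the directory is aware of normalization.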

As to what characters to allow/forbid in what contexts, I do think we're
better off getting the Unicode Consortium (who are, arguably, the
real experts in this) to do the heavy lifting there, and in our
protocols to forbid as little as possible.  Similarly for mappings.

That's just for starters.  I hope I've illustrated that I18N is not that
much of a dark art.

Nico
--