Re: [I18nrp] next steps

Nico Williams <nico@cryptonector.com> Thu, 26 July 2018 19:51 UTC

Date: Thu, 26 Jul 2018 14:51:12 -0500
From: Nico Williams <nico@cryptonector.com>
To: Larry Masinter <LMM@acm.org>
Cc: 'Peter Saint-Andre' <stpeter@stpeter.im>, 'Paul Hoffman' <paul.hoffman@vpnc.org>, i18nrp@ietf.org
Message-ID: <20180726195111.GC5700@localhost>
References: <E10F785F-39A8-4A03-B5F0-0672B806B440@vpnc.org> <de326e16-8f93-7afd-0090-06ee7e672471@stpeter.im> <5b569782.1c69fb81.4603.51f9@mx.google.com> <20180724070212.GA5700@localhost> <002801d42483$96a9e7a0$c3fdb6e0$@acm.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <002801d42483$96a9e7a0$c3fdb6e0$@acm.org>
User-Agent: Mutt/1.5.24 (2015-08-30)
Archived-At: <https://mailarchive.ietf.org/arch/msg/i18nrp/5XuEOwvB_EOhNgk0BElurBDqfHk>
Subject: Re: [I18nrp] next steps

On Wed, Jul 25, 2018 at 06:54:27PM -0700, Larry Masinter wrote:
> Here's a wild idea, flame away
> 
> > [...]
> 
> It isn't clear how Unicode defines "normalization" compared to how the
> term is used in mathematics. If you start with an equivalence relation

The important thing is that Unicode defines canonical normalization
forms.  I don't believe there is an argument that those forms aren't
normal -- that is, that they allow multiple equivalent forms in one NF,
which clearly they do not.

> https://en.wikipedia.org/wiki/Equivalence_relation 
> you can define a "canonicalization" if the set has a total ordering,
> and pick the "least" or "most". But the form you might get by doing so
> might not be desirable, or suitable to be called "normal", even if
> it's useful for determining equivalence.
> 
> You don't need to normalize locally for using hash-table caching
> if you allow the hash-table to reflect a cached redirect,
> and put the onus on the origin to supply a canonical string.

Take as an example LATIN SMALL LETTER U WITH MACRON AND DIAERESIS
(U+1E7B).  This character can be expressed in these canonically
equivalent ways:

  - U+1E7B
  - U+016B U+0308
  - U+0075 U+0304 U+0308

(U+0075 U+0308 U+0304 is not one of them: both combining marks have
combining class 230, so their order is significant, and that sequence
is equivalent to U+01D6 instead.)

That's three different ways, only two of which are in one NF or another.

Now imagine a string that has four such characters.  You'd now have 3^4
possible combinations -- 81(!!) -- only two of which would be in one NF
or another (four, if any of those characters had different K forms).
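
The counting can be checked with a short sketch using Python's standard
unicodedata module.  The candidate list and the names below are
illustrative assumptions, not anything from a spec.  Because U+0304 and
U+0308 share combining class 230, normalization does not reorder them,
so the sketch tests the swapped mark order as a separate candidate; that
sequence normalizes to U+01D6 rather than U+1E7B.

```python
import itertools
import unicodedata

# Candidate spellings to test against U+1E7B
# (LATIN SMALL LETTER U WITH MACRON AND DIAERESIS).
candidates = [
    "\u1e7b",            # precomposed
    "\u016b\u0308",      # u-with-macron + combining diaeresis
    "u\u0304\u0308",     # u + combining macron + combining diaeresis
    "u\u0308\u0304",     # same marks, opposite order
]

# Keep only the candidates that are canonically equivalent to U+1E7B.
target = unicodedata.normalize("NFC", "\u1e7b")
equivalent = [s for s in candidates
              if unicodedata.normalize("NFC", s) == target]

def in_some_nf(s):
    """True if s is already in NFC or NFD."""
    return any(unicodedata.normalize(form, s) == s for form in ("NFC", "NFD"))

# Every way to spell a string of four such characters.
spellings = ["".join(parts)
             for parts in itertools.product(equivalent, repeat=4)]
normalized = [s for s in spellings if in_some_nf(s)]

print(len(equivalent), len(spellings), len(normalized))
```

The three printed numbers are the count of equivalent spellings of one
character, the count of spellings of the four-character string, and how
many of those are already in NFC or NFD.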

I'm really not familiar enough with input methods to know how many of
all those combinations are actually likely in the real world, but
because most input methods pre-compose, and because HFS+ decomposes,
you'll have at least two forms.  Any more than two forms and this can't
scale, and it saves you nothing.

Better, then, to normalize as you hash, and only if necessary: all-ASCII
strings, for example, require no normalization.
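
A minimal sketch of that idea in Python follows.  The function name,
the choice of NFC, and the use of SHA-256 are assumptions made for
illustration; a real implementation would more likely feed an
incremental normalizer into the table's own hash function, which the
Python standard library does not offer, so this version normalizes the
whole string up front.

```python
import hashlib
import unicodedata

def form_insensitive_key(s: str) -> bytes:
    """Hash s so that canonically equivalent strings get the same key.

    Hypothetical helper: NFC and SHA-256 are illustrative choices.
    """
    if s.isascii():
        # ASCII is unchanged by normalization, so skip the work.
        data = s
    else:
        data = unicodedata.normalize("NFC", s)
    return hashlib.sha256(data.encode("utf-8")).digest()
```

With this, a precomposed U+1E7B and its fully decomposed spelling land
in the same bucket, and no table of redirects is needed.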

> You might think this would bloat the table with all of the
> possible redirects, but as long as you look up on data entry

Not just bloat the table.  Also bloat compute, and if you have to have
these sent to you, then bloat network bandwidth.

> and the origin is consistent, there should only be one redirect.
> In fact, you could limit it to one, or just periodically clear LRU.

The form-insensitive string comparison algorithm I described is actually
quite fast, as it can avoid normalization in most cases.  Normalizing as
you hash is not as fast, but it's OK.
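
The algorithm itself is not reproduced in this message.  As a rough
illustration only, here is a coarse Python sketch of how a comparison
can skip normalization on the common paths; the function name and the
whole-string fallback are assumptions, and a finer-grained version
would normalize only around the point where the inputs diverge.

```python
import unicodedata

def fi_equal(a: str, b: str) -> bool:
    """Form-insensitive equality (illustrative sketch, not a reference)."""
    if a == b:
        # Identical code points: equal in every normalization form.
        return True
    if a.isascii() and b.isascii():
        # ASCII is invariant under normalization, so a mismatch is final.
        return False
    # Slow path: at least one side has non-ASCII code points.
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)
```

Note that the ASCII shortcut applies only when both inputs are ASCII: a
non-ASCII string can still normalize to an ASCII one, as with U+212A
KELVIN SIGN, which is canonically equivalent to "K".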

> I'd even include case folding, punycode and %xx-URL percent encodings
> as possible input forms. The times they are typed in are
> rare.

If your hash table allows duplicates, sure, then you can still be
case-sensitive.  Otherwise only if you want case-insensitivity.

> It would allow different TLDs to define their own additional
> equivalences and case folding.
> 
> Some variation of this could be made to work for DNS, HTTP,
> and email addresses with some enormous deployment problems.
> Would need to prove that it works.

IMO it would not work at all.

> Sorry if this is terse

I got it.

I don't push form-insensitivity only as an optimization.  IMO it has the
best, most user-friendly semantics and interoperability characteristics,
and it minimizes the number of locations where normalization
capabilities are needed.  That's at least three big reasons to prefer
f-i.

Nico
--