Re: [I18nrp] next steps

Nico Williams <nico@cryptonector.com> Tue, 24 July 2018 07:12 UTC

On Mon, Jul 23, 2018 at 08:38:15PM -0700, Asmus Freytag wrote:
> If you define equivalence relations (e.g. variants) based on code points,
> then you'll need to apply these / compare strings in NFD otherwise you will
> miss out on the effect of re-ordering of combining sequences.

Or NFC.  It doesn't matter which form you choose as the internal form
for comparison, provided that the comparison is form-insensitive.
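A minimal sketch of what form-insensitive comparison means, in Python with the standard unicodedata module.  The helper name form_insensitive_eq is made up for illustration; the point is only that both operands are brought to the same form before comparing, and that the choice of form does not change the answer:

```python
import unicodedata

def form_insensitive_eq(a, b, form="NFD"):
    """Compare two strings after normalizing both to the same form.

    `form` is the internal form used for comparison; NFC and NFD give
    the same equality result, which is the claim being illustrated.
    """
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)

composed = "caf\u00e9"      # U+00E9 LATIN SMALL LETTER E WITH ACUTE
decomposed = "cafe\u0301"   # 'e' followed by U+0301 COMBINING ACUTE ACCENT

# A raw code point comparison treats these as different strings;
# the normalizing comparison treats them as equal in either form.
raw_equal = composed == decomposed
nfd_equal = form_insensitive_eq(composed, decomposed, "NFD")
nfc_equal = form_insensitive_eq(composed, decomposed, "NFC")
```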

Now, since normalization to NFC is defined as canonical decomposition
(i.e., NFD) followed by canonical composition, it's obviously best to
use NFD internally anyway.

Note, BTW, that form-insensitive comparison is not necessarily enough to
implement form-insensitivity.  If you have hash tables, say, and you're
hashing strings for hash table lookups, then you have to normalize
during hashing as well in order to have form-insensitive behavior.
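To make the hashing point concrete, here is a rough sketch of a dictionary key wrapper in Python.  The class name FormInsensitiveKey is hypothetical, and normalizing eagerly in the constructor is just one way to do it; what matters is that equality and hashing both see the same normalized form:

```python
import unicodedata

class FormInsensitiveKey:
    """Hypothetical wrapper: equality and hashing both use the NFD form."""

    def __init__(self, text):
        self.text = text                                  # original spelling
        self._nfd = unicodedata.normalize("NFD", text)    # internal form

    def __eq__(self, other):
        if not isinstance(other, FormInsensitiveKey):
            return NotImplemented
        return self._nfd == other._nfd

    def __hash__(self):
        # Hash the normalized form, so that equivalent strings land in
        # the same bucket before __eq__ is ever consulted.
        return hash(self._nfd)

table = {FormInsensitiveKey("caf\u00e9"): "found"}

# Lookup with the decomposed spelling succeeds through the wrapper...
hit = table.get(FormInsensitiveKey("cafe\u0301"))

# ...whereas a plain str-keyed table misses, since str compares and
# hashes raw code points.
plain = {"caf\u00e9": "found"}
miss = plain.get("cafe\u0301")
```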

The nice thing about this form-insensitive comparison/hashing is that it
can be done with a small, fixed memory allocation: you can step through
the strings character by character, and there is a maximum length in
codepoints for a character normalized to NFD, so the amount of memory
needed is fixed, and that memory can be allocated on the stack (where
you have one).
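A sketch of the incremental approach, in Python for brevity (so no stack allocation here; the names iter_nfd and streaming_eq are made up).  It decomposes one code point at a time and holds back only the current run of combining marks for canonical reordering, never the whole string.  One caveat of this sketch: the pending buffer is bounded by the longest run of combining marks in the input, which the code does not cap, so a truly fixed allocation would need a limit on such runs:

```python
import unicodedata
from itertools import zip_longest

def iter_nfd(s):
    """Yield the NFD form of s one code point at a time."""
    pending = []  # combining marks awaiting canonical reordering
    for ch in s:
        # Full canonical decomposition of a single code point.
        for cp in unicodedata.normalize("NFD", ch):
            if unicodedata.combining(cp) == 0:
                # A starter ends the current run: emit the run sorted by
                # combining class (list.sort is stable, as required).
                pending.sort(key=unicodedata.combining)
                yield from pending
                pending.clear()
                yield cp
            else:
                pending.append(cp)
    pending.sort(key=unicodedata.combining)
    yield from pending

def streaming_eq(a, b):
    """Form-insensitive equality without building normalized copies."""
    return all(x == y for x, y in zip_longest(iter_nfd(a), iter_nfd(b)))

# U+1E69 (s with dot below and dot above) vs. 's' with the two marks
# typed in the "wrong" order: equal only after reordering.
same = streaming_eq("\u1e69", "s\u0307\u0323")
```

Because all() stops at the first mismatch, strings that differ early are rejected without decomposing the rest.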

Indeed, you can even optimize form-insensitive string comparison so that
in the common cases (mostly-ASCII input, and inputs that are mostly not
equivalent or mostly equal) most characters do not require canonical
decomposition.
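One possible fast path, sketched in Python (the name nfd_equal_fast is hypothetical, and this is only one of several ways to do it).  It relies on ASCII code points being starters with no canonical decomposition, so under NFD they pass through unchanged and can be compared directly; normalization is applied only from the first non-ASCII code point onward:

```python
import unicodedata

def nfd_equal_fast(a, b):
    """Form-insensitive equality with an ASCII fast path."""
    i = 0
    n = min(len(a), len(b))
    # While both strings are still in ASCII, compare code points as-is.
    while i < n and a[i] < "\x80" and b[i] < "\x80":
        if a[i] != b[i]:
            return False   # differing ASCII: cannot be equivalent
        i += 1
    # Slow path for whatever is left (possibly nothing): normalize only
    # the remainders, since the ASCII prefixes are already known equal.
    return (unicodedata.normalize("NFD", a[i:])
            == unicodedata.normalize("NFD", b[i:]))

early_reject = nfd_equal_fast("example.com", "example.org")   # pure fast path
equivalent = nfd_equal_fast("cafe\u0301", "caf\u00e9")        # falls back at index 3
```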

Nico
--