Re: [precis] Enforcement as an Idempotent operation

Peter Saint-Andre <stpeter@stpeter.im> Sun, 12 February 2017 19:27 UTC

To: William Fisher <william.w.fisher@gmail.com>
References: <CAHVjMKHVvmS6jty3-jwnnuqy-xdw-xY2j+5ExLRr6tXCMRbC2Q@mail.gmail.com> <f9b49a96-2189-bccd-5dc0-a4dc8146cbcc@stpeter.im> <CAHVjMKEVTOCV68OTfXnXhWKiXT798m2osGkwHVRhw4Cs0RLw0w@mail.gmail.com>
From: Peter Saint-Andre <stpeter@stpeter.im>
Message-ID: <15c31273-c278-af61-2a01-0b68ab8af182@stpeter.im>
Date: Sun, 12 Feb 2017 12:27:24 -0700
Archived-At: <https://mailarchive.ietf.org/arch/msg/precis/XxuRqzK6Q7j6SH4TjE6fXiUaQ6s>
Cc: precis@ietf.org
Subject: Re: [precis] Enforcement as an Idempotent operation

Hi Bill, thanks for your message and sorry about the seriously delayed 
reply - I've been working to finish some other Internet-Drafts and now 
have time again to finish the PRECIS updates.

On 10/13/16 1:33 PM, William Fisher wrote:
> On Wed, Oct 12, 2016 at 9:03 PM, Peter Saint-Andre <stpeter@stpeter.im> wrote:
>> It's not clear to me that U+1F11 has the problem you describe; perhaps could you sketch it out further?
>
> Oops, that should be U+0001F11A.

Did you mean U+212A (KELVIN SIGN)? That decomposes to U+004B (LATIN 
CAPITAL LETTER K).

> The full example is:
> "\U0001f11aevin" => "(K)evin" => "(k)evin"

Yes, "U+212Aevin" => "Kevin" via NFKC.

However, "U+212Aevin" => "kevin" via toLower() if I am not mistaken.
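Both readings of the example can be checked with Python's standard unicodedata module. This is only a sketch: str.lower() stands in for toLower(), and the helper name lower_then_nfkc is made up for illustration. It mimics the case-mapping-then-normalization order under discussion, not a complete Nickname implementation.

```python
import unicodedata

def lower_then_nfkc(s):
    # Hypothetical helper: case-map first, then normalize with NFKC.
    return unicodedata.normalize("NFKC", s.lower())

kelvin = "\u212aevin"        # KELVIN SIGN followed by "evin"
paren_k = "\U0001f11aevin"   # PARENTHESIZED LATIN CAPITAL LETTER K followed by "evin"

# KELVIN SIGN has a lowercase mapping, so a single pass is already stable.
once = lower_then_nfkc(kelvin)
print(once, lower_then_nfkc(once))   # kevin kevin

# U+1F11A has no lowercase mapping; NFKC exposes a capital K only afterwards,
# so a second pass changes the string again.
once = lower_then_nfkc(paren_k)
print(once, lower_then_nfkc(once))   # (K)evin (k)evin
```

The second case is the one that is not idempotent: the first pass yields "(K)evin" and the second pass yields "(k)evin".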

> I wrote a program to categorize characters that are not idempotent
> under Nickname "ToLower" (ignoring white space). The numbers are the
> same for Unicode 6.3, 8.0 and 9.0.
>
> {
>   '<font>': 467,
>   '<square>': 90,
>   '<compat>': 35,
>   '<super>': 27,
>   '<circle>': 4
> }
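A categorization along these lines could be sketched as follows. This is not Bill's program, which was not posted to the list. It assumes that "not idempotent" means a second pass of lowercase-then-NFKC changes the result of the first pass, it uses str.lower() as a stand-in for toLower(), and all function names are hypothetical. The counts depend on the Unicode version bundled with the Python interpreter, so they need not match the numbers quoted above.

```python
import sys
import unicodedata
from collections import Counter

def lower_then_nfkc(s):
    # Assumed stand-in for the rule order under discussion: case-map, then NFKC.
    return unicodedata.normalize("NFKC", s.lower())

def decomposition_tag(ch):
    # Leading "<tag>" of the decomposition mapping, or "" when the mapping
    # is canonical (no tag) or absent.
    decomp = unicodedata.decomposition(ch)
    return decomp.split()[0] if decomp.startswith("<") else ""

def non_idempotent_tags():
    counts = Counter()
    for cp in range(sys.maxunicode + 1):
        if 0xD800 <= cp <= 0xDFFF:
            continue  # skip surrogate code points
        ch = chr(cp)
        if ch.isspace():
            continue  # ignore white space, as in the description above
        once = lower_then_nfkc(ch)
        if lower_then_nfkc(once) != once:
            counts[decomposition_tag(ch)] += 1
    return counts

print(dict(non_idempotent_tags()))
```

Characters whose decomposition carries no tag, such as U+03D3, end up under the empty-string key in this sketch.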

Would you mind sending me your list of characters? (I'm happy to receive 
it off-list.) I suspect that it might be similar to a list that emerged 
from differing assumptions regarding how to apply the PRECIS rules in 
implementations. My original implementation for testing purposes was 
rather naïve, whereas the implementation that Yoshiro Yoneya and 
Takahiro Nemoto created was smarter, in the sense that it would follow 
the chain of characters and decompose each one fully as it went along 
(this might require a few rounds of applying the normalization rule in 
order to fully decompose the original characters).
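One way to picture the "several rounds" idea is to reapply the rules until the output stops changing. The sketch below is an assumption about the general approach, not a description of the Yoneya/Nemoto implementation. The function names and the round limit are hypothetical, and str.lower() again stands in for toLower().

```python
import unicodedata

def lower_then_nfkc(s):
    # Assumed rule order: case-map first, then normalize with NFKC.
    return unicodedata.normalize("NFKC", s.lower())

def apply_until_stable(s, rule=lower_then_nfkc, max_rounds=5):
    # Reapply the rule until a fixed point is reached.
    # max_rounds is an arbitrary safety limit, not taken from any spec.
    for _ in range(max_rounds):
        result = rule(s)
        if result == s:
            return s
        s = result
    raise ValueError("no stable form within max_rounds")

print(apply_until_stable("\U0001f11aevin"))   # (k)evin
```

By construction, an output of apply_until_stable is unchanged by a further application of the rule, which is the idempotence property under discussion.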

> The following two characters also appear to fail the idempotent test.
> The initial decompositions do not begin with '<'.
>
> \u03d3 GREEK UPSILON WITH ACUTE AND HOOK SYMBOL
> \u03d4 GREEK UPSILON WITH DIAERESIS AND HOOK SYMBOL

These examples are different from the KELVIN SIGN example, because these 
code points have no lowercase mapping of their own, so toLower() leaves 
them unchanged - normalization needs to happen before toLower() is applied.

>> Thanks for your input. Personally I will think about it further and post again after I do so.
>
> To me, the problem is to take untrusted input, validate it using
> specified rules, and transform it into a stable, unambiguous format.
> I'm still learning more about Unicode.

Trust me: it never ends.

> Is there a reason that the case
> mapping rule has to be applied *before* the normalization rule?

As explained in Section 5.2.1 of RFC 7564, there is a good reason to 
apply the width mapping rule before the normalization rule. I'm now less 
sure that it makes sense, for comparison purposes, to apply the case 
mapping rule before the normalization rule.

> The
> order appears to make a difference for NFKC.  I suppose the Nickname
> "comparison" profile could re-apply the case mapping rule after the
> normalization rule?

If I understand correctly, you are suggesting that an implementation 
that is processing nickname strings for purposes of comparison would do 
the following:

1. Apply the "enforcement" action in Section 2.3 of RFC 7700
2. Apply the "comparison" action in Section 2.4 of RFC 7700

Let's choose a practical but somewhat contrived example: a nickname of 
ΨϓΧΗ, which is U+03A8 U+03D3 U+03A7 U+0397 (something like an uppercase 
version of the Greek word for soul, although the accent is wrong). This 
includes the code point U+03D3 that you mention above (which, by the 
way, is not the standard code point for the Greek letter upsilon but an 
alternative with a hook symbol, the usual character being U+03A5).

The two-step process you suggest would involve the following:

1. The "enforcement" action results in normalization (note that full 
normalization involves several steps):

U+03A8 U+03D3 U+03A7 U+0397
=>
U+03A8 U+03D2 U+0301 U+03A7 U+0397
=>
U+03A8 U+03A5 U+0301 U+03A7 U+0397

(note that U+03D2 has a compatibility equivalent of U+03A5)

2. The "comparison" action results in case mapping:

U+03A8 U+03A5 U+0301 U+03A7 U+0397
=>
U+03C8 U+03C5 U+0301 U+03C7 U+03B7

Thus, for comparison purposes, ΨϓΧΗ and ψύχη would be considered equivalent.
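The effect of the two orderings on this nickname can be compared with a short sketch using Python's unicodedata module. It assumes str.lower() as a stand-in for the case mapping rule, and the helper names are made up for illustration. Note that NFKC recomposes after decomposing, so U+03A5 U+0301 shows up here as the single code point U+038E (or U+03CD once lowercased) rather than as the decomposed sequences written out above.

```python
import unicodedata

nickname = "\u03a8\u03d3\u03a7\u0397"   # the contrived nickname from the example

def nfkc(s):
    return unicodedata.normalize("NFKC", s)

def code_points(s):
    # Render a string as a list of U+XXXX code points for inspection.
    return " ".join(f"U+{ord(c):04X}" for c in s)

# Normalize first, then case-map (the two-step process sketched above).
normalize_first = nfkc(nickname).lower()

# Case-map first, then normalize (the order that causes the problem).
case_map_first = nfkc(nickname.lower())

print(code_points(normalize_first))   # U+03C8 U+03CD U+03C7 U+03B7
print(code_points(case_map_first))    # U+03C8 U+038E U+03C7 U+03B7
```

With case mapping first, the capital U+038E survives the first pass, and only a second pass would lowercase it to U+03CD.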

Unfortunately, even though that seems to yield the correct outcome, it's 
not what RFC 7700 specifies.

I'll continue to think about this - in particular, about any negative 
implications of modifying the order of operations so that 
normalization comes before case mapping (unlike what we specified in RFC 
7700). We would prefer that all the PRECIS specs follow the same 
order, although that might not be possible, so we'd also need to look at 
the implications for RFC 7613 (bearing in mind that the OpaqueString 
profile we use there for passwords has quite a different purpose from 
the Nickname profile in RFC 7700).

Peter