Re: [precis] Enforcement as an Idempotent operation

William Fisher <william.w.fisher@gmail.com> Thu, 13 October 2016 19:33 UTC

In-Reply-To: <f9b49a96-2189-bccd-5dc0-a4dc8146cbcc@stpeter.im>
References: <CAHVjMKHVvmS6jty3-jwnnuqy-xdw-xY2j+5ExLRr6tXCMRbC2Q@mail.gmail.com> <f9b49a96-2189-bccd-5dc0-a4dc8146cbcc@stpeter.im>
From: William Fisher <william.w.fisher@gmail.com>
Date: Thu, 13 Oct 2016 12:33:14 -0700
Message-ID: <CAHVjMKEVTOCV68OTfXnXhWKiXT798m2osGkwHVRhw4Cs0RLw0w@mail.gmail.com>
To: Peter Saint-Andre <stpeter@stpeter.im>
Content-Type: text/plain; charset="UTF-8"
Archived-At: <https://mailarchive.ietf.org/arch/msg/precis/5IwBDVF88hxzI39KCn86k_M2O7U>
Cc: precis@ietf.org
Subject: Re: [precis] Enforcement as an Idempotent operation

On Wed, Oct 12, 2016 at 9:03 PM, Peter Saint-Andre <stpeter@stpeter.im> wrote:
> It's not clear to me that U+1F11 has the problem you describe; perhaps could you sketch it out further?

Oops, that should be U+1F11A (PARENTHESIZED LATIN CAPITAL LETTER K).
The full example, applying enforcement twice, is:
"\U0001f11aevin" => "(K)evin" => "(k)evin"
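
The two steps can be reproduced with a minimal Python sketch. It assumes
that str.lower() is an acceptable stand-in for the Nickname case mapping
rule and that enforcement is reduced to "lowercase, then NFKC"; the other
Nickname rules are left out, and enforce_once is a hypothetical helper
name, not part of any PRECIS library.

```python
import unicodedata


def enforce_once(s: str) -> str:
    """Hypothetical helper: case mapping first, then NFKC normalization.

    Assumption: str.lower() approximates the profile's case mapping rule.
    """
    return unicodedata.normalize("NFKC", s.lower())


original = "\U0001f11aevin"
first = enforce_once(original)   # U+1F11A has no lowercase; NFKC expands it
second = enforce_once(first)     # now the plain "K" gets lowercased

print(repr(first), repr(second))
print("idempotent:", first == second)
```

The second pass changes the string because the capital K only appears
after normalization, when the case mapping step has already run.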

I wrote a program to categorize the characters for which enforcement is
not idempotent under Nickname "ToLower" (ignoring white space), grouped
by decomposition tag. The counts are the same for Unicode 6.3, 8.0 and
9.0.

{
  '<font>': 467,
  '<square>': 90,
  '<compat>': 35,
  '<super>': 27,
  '<circle>': 4
}

The following two characters also appear to fail the idempotence test.
Their first-level decompositions do not begin with '<'.

\u03d3 GREEK UPSILON WITH ACUTE AND HOOK SYMBOL
\u03d4 GREEK UPSILON WITH DIAERESIS AND HOOK SYMBOL
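
A survey of this kind could be sketched roughly as follows. This is not
the program used for the numbers above. It assumes str.lower() followed
by NFKC as the enforcement step, skips separator characters as a rough
reading of "ignoring white space", and uses whatever Unicode version the
local Python ships, so its counts may differ from mine. The helper names
are hypothetical.

```python
import unicodedata
from collections import Counter


def enforce_once(s: str) -> str:
    # Assumption: lowercase, then NFKC, as a stand-in for enforcement.
    return unicodedata.normalize("NFKC", s.lower())


def tag_of(ch: str) -> str:
    # First-level decomposition tag such as "<font>", or a marker if none.
    decomp = unicodedata.decomposition(ch)
    return decomp.split()[0] if decomp.startswith("<") else "(no tag)"


def non_idempotent_by_tag() -> Counter:
    counts = Counter()
    for cp in range(0x110000):
        if 0xD800 <= cp <= 0xDFFF:
            continue  # skip surrogate code points
        ch = chr(cp)
        if unicodedata.category(ch).startswith("Z"):
            continue  # skip space, line and paragraph separators
        once = enforce_once(ch)
        if enforce_once(once) != once:
            counts[tag_of(ch)] += 1
    return counts


if __name__ == "__main__":
    for tag, n in non_idempotent_by_tag().most_common():
        print(tag, n)
```

Characters such as U+03D3 land in the "(no tag)" bucket: their own
decomposition is canonical, and the compatibility mapping only shows up
one level down.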


> Thanks for your input. Personally I will think about it further and post again after I do so.

To me, the problem is to take untrusted input, validate it using
specified rules, and transform it into a stable, unambiguous format.
I'm still learning more about Unicode. Is there a reason that the case
mapping rule has to be applied *before* the normalization rule?  The
order appears to make a difference for NFKC.  I suppose the Nickname
"comparison" profile could re-apply the case mapping rule after the
normalization rule?
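
To make the ordering question concrete, here is a small sketch comparing
the two alternatives on the examples from this message. It again assumes
str.lower() as the case mapping and NFKC as the only other step; both
function names are hypothetical. It only checks these inputs and does
not show that either ordering is idempotent in general.

```python
import unicodedata


def nfkc_then_lower(s: str) -> str:
    # Alternative 1: normalize first, then apply the case mapping.
    return unicodedata.normalize("NFKC", s).lower()


def lower_nfkc_lower(s: str) -> str:
    # Alternative 2: keep the current order, then re-apply the case mapping.
    return unicodedata.normalize("NFKC", s.lower()).lower()


samples = ["\U0001f11aevin", "\u03d3", "\u03d4"]

for fn in (nfkc_then_lower, lower_nfkc_lower):
    for s in samples:
        once = fn(s)
        twice = fn(once)
        print(fn.__name__, ascii(s), ascii(once), "stable:", once == twice)
```

For these three inputs both alternatives reach a stable result after one
pass.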

Thanks,
-Bill