Re: [precis] Applying the rules three times to get a stable output string?

Christian Schudt <christian.schudt@gmx.de> Sat, 09 December 2017 22:41 UTC

From: Christian Schudt <christian.schudt@gmx.de>
In-Reply-To: <CAHVjMKEEndoJhMvMEQPPvvCS+t_4vkpp61iFoKrXNksrCB6ohA@mail.gmail.com>
Date: Sat, 09 Dec 2017 23:41:11 +0100
Cc: precis@ietf.org
Message-Id: <81CE0A33-A6B8-464B-80E4-F2F94F44CB28@gmx.de>
References: <C64B78C6-8109-4F36-BB76-EA8AB229FCE2@gmx.de> <CAHVjMKGmZK1DQJmbM-4Gb6W8NUbzG-qQXnXBScr6Yh+o==wxuw@mail.gmail.com> <C31DFCC1-31BB-49E4-A9BD-071BF5AC6C02@gmx.de> <CAHVjMKEEndoJhMvMEQPPvvCS+t_4vkpp61iFoKrXNksrCB6ohA@mail.gmail.com>
To: William Fisher <william.w.fisher@gmail.com>
Archived-At: <https://mailarchive.ietf.org/arch/msg/precis/1ip7RDuSvnLor8wdFxWMeb2URhM>
Subject: Re: [precis] Applying the rules three times to get a stable output string?

I just wrote a Java test that checks idempotency for every code point (0 to 0x10FFFF) against all four profiles (OpaqueString, UsernameCaseMapped, UsernameCasePreserved, Nickname).

The result is as you suspected:

Only the Nickname profile requires an additional application of the rules to stabilize the output string.
But there is no code point that requires more than one additional iteration.

The other three profiles are already idempotent after the first application.
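
A minimal Java sketch of why the Nickname rules can need a second pass, using the four code points William listed earlier in this thread. It is an illustration under stated assumptions, not the test mentioned above and not a full PRECIS implementation: `NicknameSketch`, `applyOnce` and `hex` are made-up names, the FreeformClass validity checks are left out, and `String.toLowerCase(Locale.ROOT)` stands in for the case mapping rule.

```java
import java.text.Normalizer;
import java.util.Locale;

/** Simplified stand-in for one pass of the Nickname rules (hypothetical helper, no validity checks). */
final class NicknameSketch {

    /** One pass: space handling, then case mapping, then NFKC. */
    static String applyOnce(String input) {
        // Additional mapping rule: map every space separator (Zs) to U+0020,
        // collapse runs of spaces, then strip one leading/trailing space.
        String s = input.replaceAll("\\p{Zs}+", " ").replaceAll("^ | $", "");
        // Case mapping rule, approximated with locale-independent lowercasing.
        s = s.toLowerCase(Locale.ROOT);
        // Normalization rule: NFKC. This step can reintroduce uppercase letters or spaces.
        return Normalizer.normalize(s, Normalizer.Form.NFKC);
    }

    /** Formats a string as U+XXXX code points so that spaces and combining marks are visible. */
    static String hex(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> sb.append(String.format("U+%04X ", cp)));
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        for (String in : new String[] {"\u210c", "\u20a8", "\u00a8", "\u02dc"}) {
            String first = applyOnce(in);
            String second = applyOnce(first);
            System.out.println(hex(in) + " -> " + hex(first) + " -> " + hex(second));
        }
    }
}
```

With these simplified rules, U+210C becomes "H" after the first pass and "h" after the second, because NFKC runs after lowercasing. U+00A8 becomes U+0020 U+0308 after the first pass, and the leading space is only stripped by the second pass.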

I tested with Java 8, which uses Unicode 6.2.0.
If the tests fail on Java 9 (Unicode 8.0), I’ll report back.
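
A full scan of this kind could be structured roughly as below. This is a sketch under assumptions, not the actual test code: `IdempotencyScan`, `passesUntilStable` and `maxPasses` are hypothetical names, the rule set is passed in as a plain `UnaryOperator<String>`, and the stand-in rules in `main` (lowercase, then NFKC) only approximate a real profile. A real test would plug in each profile's enforcement step and skip the code points that profile disallows. Results depend on the Unicode version of the Java runtime.

```java
import java.text.Normalizer;
import java.util.Locale;
import java.util.function.UnaryOperator;

/** Hypothetical harness: counts how many passes a single-code-point input needs to stabilize. */
final class IdempotencyScan {

    /**
     * Returns the number of applications after which the output stops changing.
     * The result is capped at maxPasses, which then means "maxPasses or more".
     */
    static int passesUntilStable(UnaryOperator<String> rules, String input, int maxPasses) {
        String current = rules.apply(input);
        for (int pass = 1; pass < maxPasses; pass++) {
            String next = rules.apply(current);
            if (next.equals(current)) {
                return pass; // the output of this pass is already a fixed point
            }
            current = next;
        }
        return maxPasses;
    }

    /** Scans all non-surrogate code points and returns the largest pass count seen. */
    static int maxPasses(UnaryOperator<String> rules, int maxPasses) {
        int worst = 0;
        for (int cp = 0; cp <= Character.MAX_CODE_POINT; cp++) {
            if (cp >= Character.MIN_SURROGATE && cp <= Character.MAX_SURROGATE) {
                continue; // lone surrogates are not valid input strings
            }
            String input = new String(Character.toChars(cp));
            worst = Math.max(worst, passesUntilStable(rules, input, maxPasses));
        }
        return worst;
    }

    public static void main(String[] args) {
        // Stand-in rules (assumption): lowercase, then NFKC.
        UnaryOperator<String> rules =
            s -> Normalizer.normalize(s.toLowerCase(Locale.ROOT), Normalizer.Form.NFKC);
        System.out.println("max passes needed: " + maxPasses(rules, 5));
    }
}
```

A result of 1 means the rules are idempotent on the first run for every scanned code point; a result of 2 means at least one code point needs one extra iteration.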


— Christian


> On 09.12.2017 at 23:09, William Fisher <william.w.fisher@gmail.com> wrote:
> 
> I did not come across any code points where IdentifierClass/Usernames
> required multiple passes to make the result idempotent. Only the
> Nickname profile is affected, due to the interaction between NFKC and
> the case/space rules.
> 
> My implementation applies an extra iteration for the Nickname profile.
> The other profiles verify that the result is idempotent and raise a
> DISALLOWED/not_idempotent error if this is violated. I do not believe
> there are legal inputs for Usernames which violate the idempotency
> requirement, so this is purely defensive.
> 
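
The policy described in the quoted message (a bounded number of extra iterations, followed by a defensive idempotency check) might be sketched in Java as follows. This is not William's implementation: `StableEnforcer`, `enforce` and the `extraPasses` parameter are hypothetical, and the rule set is again modelled as a `UnaryOperator<String>`.

```java
import java.util.function.UnaryOperator;

/** Hypothetical helper: applies a rule set, allows a few extra passes, then insists on stability. */
final class StableEnforcer {

    /**
     * Applies the rules once plus at most extraPasses further times and returns the
     * first output that one more application leaves unchanged. Throws if the output
     * is still changing after the allowed passes.
     */
    static String enforce(UnaryOperator<String> rules, String input, int extraPasses) {
        String current = rules.apply(input);
        for (int extra = 0; ; extra++) {
            String next = rules.apply(current); // verification pass
            if (next.equals(current)) {
                return current; // verified stable
            }
            if (extra == extraPasses) {
                throw new IllegalArgumentException(
                    "not idempotent after " + (extraPasses + 1) + " passes");
            }
            current = next;
        }
    }
}
```

Under this sketch, a Nickname-like rule set would be called with `extraPasses = 1`, and the other profiles with `extraPasses = 0`, so that any non-idempotent result is rejected rather than silently accepted.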
> 
> On Sat, Dec 9, 2017 at 2:27 PM, Christian Schudt
> <christian.schudt@gmx.de> wrote:
>> Great, thanks! These code points revealed some bugs :-). They should have been included in the Examples.
>> 
>> Are there any known code points for the IdentifierClass / Usernames as well?
>> Seems like all these code points are disallowed anyway.
>> 
>> If not, implementations could save 1-2 iterations and only apply the "3-times" rule for FreeformClass.
>> 
>> 
>> 
>>> On 09.12.2017 at 20:34, William Fisher <william.w.fisher@gmail.com> wrote:
>>> 
>>> Where it makes a difference for NicknameCaseMapped:
>>> 
>>> "\u210c"
>>> "\u20a8"
>>> 
>>> Where it makes a difference for Nickname due to spaces:
>>> 
>>> "\u00a8"
>>> "\u02dc"
>>> 
>>> 
>>> On Sat, Dec 9, 2017 at 8:37 AM, Christian Schudt
>>> <christian.schudt@gmx.de> wrote:
>>>> Hi,
>>>> 
>>>> RFC 8264 introduced these new sentences:
>>>> 
>>>>  under certain circumstances, such as when Unicode
>>>>  Normalization Form KC is used, performing Unicode normalization after
>>>>  case mapping can still yield uppercase characters for certain code
>>>>  points
>>>> 
>>>>  Therefore, an implementation SHOULD apply the rules
>>>>  repeatedly until the output string is stable
>>>> 
>>>> 
>>>> I could imagine these sentences refer to code points of the "Unstable" category, but this category is unused.
>>>> 
>>>> Are there any concrete code points or input strings that show this unstable behaviour?
>>>> I am asking for some test vectors, i.e. input strings that do not yield the expected output string after the first rule application, but only after the second one.
>>>> 
>>>> Thanks,
>>>> — Christian
>>