Re: [precis] toLower() vs. toCaseFold()

John C Klensin <john-ietf@jck.com> Sat, 07 May 2016 02:40 UTC

Date: Fri, 06 May 2016 22:40:06 -0400
From: John C Klensin <john-ietf@jck.com>
To: "Martin J. Dürst" <duerst@it.aoyama.ac.jp>, Peter Saint-Andre <stpeter@stpeter.im>, precis@ietf.org
Archived-At: <http://mailarchive.ietf.org/arch/msg/precis/6wMNhM0YCKtMrN6P1LONAfnAcAU>

(sorry... earlier copy sent from wrong address)

--On Friday, May 06, 2016 15:54 +0900 "Martin J. Dürst"
<duerst@it.aoyama.ac.jp> wrote:

> Hello Peter,
> 
> On 2016/05/05 07:43, Peter Saint-Andre wrote:
> 
>> I suggested that we add some text about this to 7564bis. Here
>> is a proposed paragraph for insertion in §5.2.3
>> ("Case-Mapping Rule"):
>> 
>>    The Unicode toCaseFold() operation defined by the Unicode
>>    Default Case Folding algorithm is most appropriate when an
>>    application needs to compare two strings.  When an
>>    application merely wishes to convert uppercase and
>>    titlecase code points to the lowercase equivalents while
>>    preserving lowercase code points, the Unicode toLower()
>>    operation is more appropriate and is less likely to
>>    violate the "Principle of Least Astonishment".  Therefore,
>>    application developers are advised to carefully consider
>>    whether they truly need to use the toCaseFold() operation
>>    in a given situation, or whether the toLower() operation
>>    would be more appropriate than the toCaseFold() operation.
>> 
>> Suggestions for improvement are welcome, especially from
>> John. (E.g., we might want to more explicitly call out
>> comparison vs. other contexts in the normative text elsewhere
>> in §5.2.3).
> 
> I think 'compare' should be changed to 'search'. That's the
> prototypical use case for CaseFold.

Hmm.  If we have to choose, I think I prefer "compare".  I just
looked at the subsections on "Default Case Folding" and "Default
Caseless Matching" in Section 3.13 of TUS 8.0, and they say a lot
about comparison and nothing about search.  Recommended
compromise: make the relevant sentence fragment read "most
appropriate when an application needs to compare two strings,
such as in search operations".
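
To make the comparison case concrete, here is a minimal sketch in
Python.  It assumes that Python's str.casefold() and str.lower() are
reasonable stand-ins for toCaseFold() and toLower(); the helper
names are hypothetical and are not taken from 7564bis or TUS.

```python
def caseless_equal(a: str, b: str) -> bool:
    # Caseless match in the spirit of TUS 3.13: fold both sides,
    # then compare.  No normalization step is applied here.
    return a.casefold() == b.casefold()


def lower_equal(a: str, b: str) -> bool:
    # Same idea, but using lowercasing instead of case folding.
    return a.lower() == b.lower()


upper = "STRASSE"
mixed = "stra\u00dfe"  # U+00DF LATIN SMALL LETTER SHARP S

folded_match = caseless_equal(upper, mixed)
lowered_match = lower_equal(upper, mixed)
```

With these inputs the folded comparison matches and the lowercased
one does not, which is why folding is the operation the standard
describes for matching.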

I'd still prefer to denounce toCaseFold completely, especially
where identifiers are concerned.  It just has far too much
potential for being destructive and creating false results
(either positive or negative) when the language context is
unknown.  People/designers/implementers who are not prepared to
understand those issues and their implications should really not
be using the thing.
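
A small illustration of the kind of damage I have in mind, again
only a sketch, and again assuming Python's str.casefold() and
str.lower() approximate toCaseFold() and toLower().  The sample
strings are made up for the example.

```python
# Two already-lowercase strings that a user would consider distinct.
with_sharp_s = "ma\u00dfe"   # contains U+00DF SHARP S
with_double_s = "masse"

# Lowercasing leaves lowercase code points alone ...
lower_preserves = with_sharp_s.lower() == with_sharp_s

# ... while case folding rewrites them, so the two strings collide.
fold_rewrites = with_sharp_s.casefold() != with_sharp_s
fold_collides = with_sharp_s.casefold() == with_double_s.casefold()

# Greek final sigma (U+03C2) is also lowercase, and is also rewritten
# to U+03C3 by folding but not by lowercasing.
final_sigma = "\u03c2"
sigma_lower_unchanged = final_sigma.lower() == final_sigma
sigma_fold_changed = final_sigma.casefold() == "\u03c3"
```

Whether such a collision is a false positive or the intended result
depends on language context that the operation itself does not
have.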

> Also, the language in the "Therefore" sentence is somewhat
> convoluted. It's unclear which alternative this text prefers.
> I suggest that if we want to put the two alternatives on an
> equal footing (i.e. make sure the application designer thinks
> carefully), then a more parallel sentence structure, avoiding
> words such as "carefully", "truly", and "would", would be more
> appropriate. What about:
> 
>                                         Therefore, application
> developers
>     are advised to carefully consider whether toCaseFold() or
>     toLower() is more appropriate.

For the reasons above, I'm not sure that an even footing is
appropriate.  I'd rather have the guidance be closer to "use
toLower(), which your users are likely to understand, unless
you need toCaseFold() for some particular reason and understand
its implications".

best,
    john