Re: [precis] Ambiguity in specification of case mapping in RFC 7613 and draft-ietf-precis-nickname

John C Klensin <john-ietf@jck.com> Thu, 01 October 2015 13:50 UTC

Date: Thu, 01 Oct 2015 09:50:37 -0400
From: John C Klensin <john-ietf@jck.com>
To: Peter Saint-Andre - &yet <peter@andyet.net>, Tom Worster <fsb@thefsb.org>, Alexey Melnikov <Alexey.Melnikov@isode.com>
Message-ID: <588752141F4228C805E674FC@JcK-HP8200.jck.com>
In-Reply-To: <560C5149.5090607@andyet.net>
References: <D230767C.6587A%fsb@thefsb.org> <560C5149.5090607@andyet.net>
Archived-At: <http://mailarchive.ietf.org/arch/msg/precis/VymX0A55nS__OGVV-3TJGT2I7pw>
Cc: precis@ietf.org
Subject: Re: [precis] Ambiguity in specification of case mapping in RFC 7613 and draft-ietf-precis-nickname

--On Wednesday, September 30, 2015 15:16 -0600 Peter Saint-Andre
- &yet <peter@andyet.net> wrote:

> Hi Tom, thanks for the note.
> 
> My feeling is that we phrased things in a slightly wrong way,
> because we assumed that case-mapping applies primarily or only
> to uppercase and titlecase characters. I think this was more a
> matter of communication (because people think of case mapping
> as something needed only with respect to uppercase
> characters), whereas it obviously applies more generally
> (i.e., applying Unicode Default Case Folding will result in
> mapping of the code points you mention here).
> 
> We could do something like this in the nickname spec...
>...
  
> NEW
> 
>     3.  Case Mapping Rule: Unicode Default Case Folding MUST be
>         applied, as defined in the Unicode Standard [Unicode] (at
>         the time of this writing, the algorithm is specified in
>         Chapter 3 of [Unicode7.0]).  The primary result of doing so
>         is that uppercase characters are mapped to lowercase
>         characters. In applications that prohibit conflicting
>         nicknames, this rule helps to reduce the possibility of
>         confusion by ensuring that nicknames differing only by case
>         (e.g., "stpeter" vs. "StPeter") would not be presented to a
>         human user at the same time.
>...

Peter,

While your proposed text is an improvement, the desire of many
people for a magic "just tell me what to do" formula, one that
lets them avoid understanding the issues, may call for a little
more:

(1) First, toCaseFold is _not_ toLowerCase.  Saying "The primary
result of doing so is that uppercase characters are mapped to
lowercase characters" is true for toCaseFold, but toCaseFold also
has other effects that are information-losing and may be
counterintuitive in some locales and situations.
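
To make the difference concrete, here is a minimal sketch.  It
assumes Python's built-in str.lower() and str.casefold() as
stand-ins for toLowerCase and toCaseFold; they are not the
operations as PRECIS specifies them, and the sample strings are
chosen only for illustration.

```python
# Sketch only: str.lower() and str.casefold() stand in for
# toLowerCase and toCaseFold (an assumption, not PRECIS itself).
samples = ["StPeter", "ß", "ς", "\u03d0"]  # U+03D0 is GREEK BETA SYMBOL


def differs(s: str) -> bool:
    """Return True when lowercasing and case folding disagree."""
    return s.lower() != s.casefold()


for s in samples:
    print(repr(s), "lower:", repr(s.lower()),
          "casefold:", repr(s.casefold()), "differs:", differs(s))
```

For a Basic Latin string such as "StPeter" the two operations
agree, which is part of why the distinction is easy to overlook.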

(2) Second, probably as a result of having IDNA in the lead,
we've gotten sloppy about language and operations and should
probably start untangling that before it gets people in trouble.
The Unicode Standard, at least as I understand it, is fairly
clear that the most important (and really only safe) use of
toCaseFold is as part of a comparison operation.  Using your
example, it is entirely reasonable to treat "stpeter" and
"StPeter" as equivalent in a comparison operation, but accepting
one string and changing it to the other for display may not be a
really good idea.  While that transformation may be acceptable
(although I would be surprised if there were no people who share
your surname who would consider "stpeter" or "Stpeter"
unacceptable and might even believe that "StPeter" is an
unacceptable substitute for "St. Peter"), it also points out the
dangers of using Basic Latin script examples to illustrate
situations in which even more extended Latin script, much less
other scripts, may raise more complex issues.  IDNA is
essentially a workaround, adopted because changing the DNS
comparison rules was impractical for several reasons, so we ended
up using toCaseFold to map characters and strings into others in
IDNA2003.  PRECIS implementations that do not have the same
constraints would, in general, be better off confining the use of
toCaseFold, or even toLowerCase, to comparison operations.
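
A rough sketch of what "comparison only" could look like follows.
The NicknameRegistry class and its method names are hypothetical,
str.casefold() is again assumed as a stand-in for toCaseFold, and
the other PRECIS rules (width mapping, normalization, and so on)
are deliberately left out.

```python
from typing import Dict, Optional


def same_nickname(a: str, b: str) -> bool:
    """Compare two nicknames without altering either of them."""
    return a.casefold() == b.casefold()


class NicknameRegistry:
    """Hypothetical registry: folds for lookup, stores what was typed."""

    def __init__(self) -> None:
        self._by_key: Dict[str, str] = {}

    def register(self, nickname: str) -> bool:
        key = nickname.casefold()      # comparison key, never displayed
        if key in self._by_key:
            return False               # conflicts with an existing entry
        self._by_key[key] = nickname   # original spelling kept for display
        return True

    def display_name(self, nickname: str) -> Optional[str]:
        """Return the spelling originally registered, if any."""
        return self._by_key.get(nickname.casefold())
```

With this shape, registering "StPeter" blocks a later "stpeter",
but whoever registered first still sees "StPeter" on display.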

(3) Because toCaseFold loses information when used for more than
comparison (for comparison, it merely contributes to what some
people would consider false positives for matching), involves
some controversial decisions, and, because of stability
requirements, cannot be changed even if the controversies are
resolved in other ways, we end up with, e.g.,
    toCaseFold ("Nuß") -> "nuss"
which is considered an acceptable transformation in some places
that identify themselves as speaking/using German and two
different unacceptable errors in others.  Again, this will
almost always be much more serious if the transformation is used
to map and replace strings than if it is used to compare (fwiw,
that particular example is part of a continuing disagreement
between IDNA2008 and, among others, German domain registry
authorities on one side and UTC and UTR 46 on the other).
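
The information loss can be illustrated with a small sketch,
again assuming Python's str.casefold() as a stand-in for
toCaseFold.  The helper name fold_collisions is made up for this
example.

```python
from collections import defaultdict


def fold_collisions(strings):
    """Group input strings that become identical under case folding."""
    groups = defaultdict(list)
    for s in strings:
        groups[s.casefold()].append(s)
    # Keep only folded forms shared by more than one input string.
    return {k: v for k, v in groups.items() if len(v) > 1}


collisions = fold_collisions(["Nuß", "Nuss", "NUSS", "nuss", "Nut"])
print(collisions)
```

Once a stored string has been replaced by its folded form, there
is no way to tell which member of such a group was originally
entered; a comparison that folds both sides on the fly does not
have that problem.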

(4) If the motivation is really to avoid confusion, the correct
confusion-blocking rule for Latin script (but not others) and
many languages that use it (but certainly not all) involves
moving beyond toCaseFold and having all "decorated" characters
(characters normally represented by glyphs consisting of a Basic
Latin character and one or more diacritical or equivalent
markings) compare equal to their base characters, e.g., "á" not
only matches "Á" but also "a" and "A" and, as an unfortunate
side-effect, maybe "À" and "à" as well.  This is bad news for
languages in which decorated Latin characters are used to
represent phonetically and conceptually different characters,
not just pronunciation variations.  I am not qualified to
evaluate "how bad".   In addition, extrapolations from this
principle about Latin script to unrelated scripts will almost
certainly lead to serious errors and/or additional confusion.
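
For illustration only, one way such a "decorated equals base"
comparison key might be sketched is shown below.  It assumes
Python's unicodedata module, treats "decoration" as combining
marks with general category Mn after NFD decomposition, and uses
a made-up function name; it is not a recommendation and is not
anything PRECIS specifies.

```python
import unicodedata


def strip_decorations(s: str) -> str:
    """Fold case, decompose, and drop nonspacing marks (category Mn)."""
    decomposed = unicodedata.normalize("NFD", s.casefold())
    base = "".join(c for c in decomposed
                   if unicodedata.category(c) != "Mn")
    return unicodedata.normalize("NFC", base)


variants = ["á", "Á", "a", "A", "À", "à"]
keys = {strip_decorations(v) for v in variants}
print(keys)  # every variant collapses to the same key

# The same rule also merges letters that some languages treat as
# distinct, e.g. "ñ" and "n".
print(strip_decorations("ñ") == strip_decorations("n"))
```

The second print is the "bad news" case: the key cannot tell a
pronunciation variant from a conceptually different letter.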

More on this and Tom's question below...
 
> On 9/29/15 3:28 PM, Tom Worster wrote:
>> Peter, Alexey,
>> 
>> I think there is an ambiguity in the specification of case
>> mapping in RFC 7613 and draft-ietf-precis-nickname-19.
>...
>> But there are 55 code points in Unicode 7.0.0 that change
>> under default case folding that are neither uppercase nor
>> titlecase characters, 12 of which are Lowercase_Letter. I
>> suspect this stems from a confusion between Unicode case
>> mapping and case folding. 

Yes, I think so.   See above, but, if I were making the rules, I
would say "never use toCaseFold where case mapping is intended
and, in particular, where one wants to substitute one string for
another rather than checking a pair of strings for equivalence
or perhaps telling users what would be considered equivalent".
That interpretation is, I believe, consistent with most of the
Unicode FAQ text you have quoted and other Unicode statements.
However I have lost that argument before and hope, given
decisions that have been made and deployed, that I was wrong.
But there is another issue...

>...
>> The nickname profile can be corrected or the algorithm
>> clarified. I'm not sure what to do with a Proposed Standard
>> RFC. Errata? Can the case mapping rule be changed in IANA?
>> https://www.iana.org/assignments/precis-parameters/profiles/UsernameCaseMapped.txt
>> e.g. to "Apply Unicode default case folding"

Almost certainly not... an "update" revision of the spec would
be needed.

At least a few of the characters you questioned raise another
issue:

>...
>> Ll; 03D0; C; 03B2; # GREEK BETA SYMBOL
>> Ll; 03D1; C; 03B8; # GREEK THETA SYMBOL
>> Ll; 03D5; C; 03C6; # GREEK PHI SYMBOL
>> Ll; 03D6; C; 03C0; # GREEK PI SYMBOL
>> Ll; 03F0; C; 03BA; # GREEK KAPPA SYMBOL
>> Ll; 03F1; C; 03C1; # GREEK RHO SYMBOL
>> Ll; 03F5; C; 03B5; # GREEK LUNATE EPSILON SYMBOL
>...
>> Ll; 1FBE; C; 03B9; # GREEK PROSGEGRAMMENI
>> Nl; 2160; C; 2170; # ROMAN NUMERAL ONE
>  (etc)
>> So; 24B6; C; 24D0; # CIRCLED LATIN CAPITAL LETTER A
>> So; 24B7; C; 24D1; # CIRCLED LATIN CAPITAL LETTER B
>...

Those examples, and others, are, independent of their Unicode
categories, not characters used in writing "words" of normal
languages.  Most of them are inherently confusable with
similar-looking letters, e.g., U+2160 and U+2170 with upper- and
lower-case "I" and "i" respectively, or U+03D0 with "β".  The
latter also raises the now-purely-academic question of whether a
"variant letterform", such as U+03D0, violates the Unicode
principle that different code points are not assigned to
different glyph forms of the same letter, but those kinds of
questions are another thing that
makes these discussions difficult, especially for those who
don't want to get involved with even script-specific or
locale-specific details.    To the extent possible, we dealt
with such characters in IDNA2008 by identifying them as
DISALLOWED, but PRECIS permits enough additional flexibility to,
as you have noticed, allow people who don't understand what they
are doing (or who are trying to avoid that necessity) to get
themselves and their users into a lot of trouble.
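
Entries like the ones Tom quoted can be reproduced with a short
sketch.  It assumes Python's unicodedata module and
str.casefold(), which applies full case folding using whatever
Unicode version the interpreter bundles, so any count it gives
may not match the 55 reported for Unicode 7.0.0.  The function
names are invented for this example.

```python
import unicodedata


def folds_but_not_upper_or_title(cp: int) -> bool:
    """True if the code point changes under folding but is not Lu/Lt."""
    ch = chr(cp)
    return (ch.casefold() != ch
            and unicodedata.category(ch) not in ("Lu", "Lt"))


def describe(cp: int) -> str:
    """Format one code point as: category; code point; folding; # name."""
    ch = chr(cp)
    folded = " ".join(f"{ord(c):04X}" for c in ch.casefold())
    return (f"{unicodedata.category(ch)}; {cp:04X}; {folded}; "
            f"# {unicodedata.name(ch)}")


for cp in (0x03D0, 0x1FBE, 0x2160, 0x24B6):
    print(describe(cp))

# The total depends on the interpreter's Unicode version.
count = sum(folds_but_not_upper_or_title(cp) for cp in range(0x110000))
print(count)
```

A listing like this only shows which code points are affected; it
says nothing about which of them are confusable with ordinary
letters.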

Fewer easy answers here than one would like and would expect in
some alternate and easier reality.

best,
    john