Re: [precis] Ambiguity in specification of case mapping in RFC 7613 and draft-ietf-precis-nickname

Tom Worster <fsb@thefsb.org> Thu, 01 October 2015 16:15 UTC

Date: Thu, 01 Oct 2015 12:15:03 -0400
From: Tom Worster <fsb@thefsb.org>
To: precis@ietf.org, John C Klensin <john-ietf@jck.com>, Peter Saint-Andre - &yet <peter@andyet.net>, Alexey Melnikov <Alexey.Melnikov@isode.com>
Message-ID: <D232C6F6.65904%fsb@thefsb.org>
Thread-Topic: [precis] Ambiguity in specification of case mapping in RFC 7613 and draft-ietf-precis-nickname
References: <D230767C.6587A%fsb@thefsb.org> <560C5149.5090607@andyet.net> <588752141F4228C805E674FC@JcK-HP8200.jck.com>
In-Reply-To: <588752141F4228C805E674FC@JcK-HP8200.jck.com>
Archived-At: <http://mailarchive.ietf.org/arch/msg/precis/RbuFOeIKlnngy9JUPdbxZ3NbkDY>

Thanks, John, for the most interesting email I got this week:)

It raises a question I would like to put to everyone, not only to you. As an
implementer working through the specs, I got the clear impression that,
while the PRECIS framework can potentially be useful for various things,
the four profiles are explicitly specified exclusively for the purpose of
string comparison operations on usernames, passwords and nicknames. If
someone uses them for some other purpose and thereby causes problems, it's
not RFC 7613's fault.

The question is, did I get the wrong impression?

If not, then the discussion of how the profiles should specify case
mapping/folding, which is specified only in the profiles, is simpler: the
points to resolve are, as John put it, the "false positives".

Fwiw, the (immature and inadequately tested) implementation is open at
https://github.com/tom--/precis

Tom


On 10/1/15, 9:50 AM, "John C Klensin" <john-ietf@jck.com> wrote:

>--On Wednesday, September 30, 2015 15:16 -0600 Peter Saint-Andre
>- &yet <peter@andyet.net> wrote:
>
>> Hi Tom, thanks for the note.
>> 
>> My feeling is that we phrased things in a slightly wrong way,
>> because we assumed that case-mapping applies primarily or only
>> to uppercase and titlecase characters. I think this was more a
>> matter of communication (because people think of case mapping
>> as something needed only with respect to uppercase
>> characters), whereas it obviously applies more generally
>> (i.e., applying Unicode Default Case Folding will result in
>> mapping of the code points you mention here).
>> 
>> We could do something like this in the nickname spec...
>>...
>  
>> NEW
>> 
>>     3.  Case Mapping Rule: Unicode Default Case Folding MUST be applied,
>>         as defined in the Unicode Standard [Unicode] (at the time
>>         of this writing, the algorithm is specified in Chapter 3 of
>>         [Unicode7.0]).  The primary result of doing so is that uppercase
>>         characters are mapped to lowercase characters. In applications
>>         that prohibit conflicting nicknames, this rule helps to reduce
>>         the possibility of confusion by ensuring that nicknames
>>         differing only by case (e.g., "stpeter" vs. "StPeter") would not
>>         be presented to a human user at the same time.
>>...
>
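A minimal sketch of the conflict check that the proposed rule describes. It assumes Python's str.casefold() (Unicode full case folding for whatever Unicode version the interpreter bundles) as a stand-in for "Unicode Default Case Folding"; the other steps of the nickname profile are not shown. The names fold_key and find_conflicts are made up for illustration.

```python
from collections import defaultdict

def fold_key(nickname: str) -> str:
    """Comparison key only; the original nickname is kept for display."""
    return nickname.casefold()

def find_conflicts(nicknames):
    """Return groups of nicknames that compare equal after case folding."""
    groups = defaultdict(list)
    for name in nicknames:
        groups[fold_key(name)].append(name)
    return [names for names in groups.values() if len(names) > 1]

print(find_conflicts(["stpeter", "StPeter", "tom"]))
```

The folded key is used only to group the entries; the strings returned are the ones the users originally supplied.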
>Peter,
>
>While your proposed text is an improvement, the desire of many
>people for a magic "just tell me what to do" formula, one that
>lets them avoid understanding the issues, may call for a little
>more:
>
>(1) First, toCaseFold is _not_ toLowerCase.  Saying "The primary
>result of doing so is that uppercase characters are mapped to
>lowercase characters" is true for toCaseFold, but it has other
>effects that are information-losing and may be counterintuitive
>in some locales and situations.
>
>(2) Second, probably as a result of having IDNA in the lead,
>we've gotten sloppy about language and operations and should
>probably start untangling that before it gets people in trouble.
>The Unicode Standard, at least as I understand it, is fairly
>clear that the most important (and really only safe) use of
>toCaseFold is as part of a comparison operation.  Using your
>example it is entirely reasonable to treat, "stpeter" and
>"StPeter" as equivalent in a comparison operation, but accepting
>one string and changing it to the other for display may not be a
>really good idea.  While that transformation may be acceptable
>(although I would be surprised if there were no people who share
>your surname who could consider "stpeter" or "Stpeter"
>unacceptable and might even believe that "StPeter" is an
>unacceptable substitute for "St. Peter"), it also points out the
>dangers of using Basic Latin script examples to illustrate
>situations in which even more extended Latin script, much less
>other scripts, may raise more complex issues.  IDNA is
>essentially a workaround, adopted because changing the DNS
>comparison rules was impractical for several reasons, so we
>ended up using toCaseFold to map characters and strings into
>others in IDNA2003.  PRECIS implementations that do not have the
>same constraints would, in general, be better off confining the
>use of toCaseFold, or even toLowerCase, to comparison operations.
>
>(3) toCaseFold loses information when used for more than
>comparison (for comparison, it merely contributes to what some
>people would consider false positives for matching), involves
>some controversial decisions, and, because of stability
>requirements, cannot be changed even if the controversies are
>resolved in other ways.  As a result we end up with, e.g.,
>    toCaseFold ("Nuß") -> "nuss"
>which is considered an acceptable transformation in some places
>that identify themselves as speaking/using German and two
>different unacceptable errors in others.  Again, this will
>almost always be much more serious if the transformation is used
>to map and replace strings than if it is used to compare (fwiw,
>that particular example is part of a continuing disagreement
>between IDNA2008 and, among others, German domain registry
>authorities on one side and UTC and UTR 46 on the other).
>
>(4) If the motivation is really to avoid confusion, the correct
>confusion-blocking rule for Latin script (but not others) and
>many languages that use it (but certainly not all) involves
>moving beyond toCaseFold and making all "decorated" characters
>(characters normally represented by glyphs consisting of a Basic
>Latin character and one or more diacritical or equivalent
>markings) compare equal to their base characters, e.g., "á" not
>only matches "Á" but also "a" and "A" and, as an unfortunate
>side-effect, maybe "À" and "à" as well.  This is bad news for
>languages in which decorated Latin characters are used to
>represent phonetically and conceptually different characters,
>not just pronunciation variations.  I am not qualified to
>evaluate "how bad".   In addition, extrapolations from this
>principle about Latin script to unrelated scripts will almost
>certainly lead to serious errors and/or additional confusion.
>
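To make points (1) to (3) concrete, here is a small sketch assuming Python, where str.lower() plays the role of toLowerCase and str.casefold() the role of toCaseFold (full case folding). The helper name equivalent is hypothetical. It folds both inputs for the comparison and leaves the originals untouched, which is the "compare, don't replace" usage described above.

```python
def equivalent(a: str, b: str) -> bool:
    """Compare under case folding without altering either input."""
    return a.casefold() == b.casefold()

word = "Nuß"
print(word.lower())     # lowercasing keeps the sharp s
print(word.casefold())  # case folding expands the sharp s to "ss"
print(equivalent("Nuß", "NUSS"))
print(equivalent("stpeter", "StPeter"))
```

Storing or displaying word.casefold() in place of word would lose the distinction between "Nuß" and "Nuss"; using it only inside equivalent() limits the effect to what some would call a false-positive match.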
>More on this and Tom's question below...
> 
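The more aggressive matching in point (4) could be sketched roughly as follows, assuming Python's unicodedata module. This is only an illustration of the idea (decompose, drop combining marks, then case fold); it is not something the PRECIS profiles specify, and, as noted above, it is wrong for languages where the decorated letter is a different letter. The name base_key is hypothetical.

```python
import unicodedata

def base_key(s: str) -> str:
    """Comparison key that ignores combining marks and case."""
    decomposed = unicodedata.normalize("NFD", s)
    # Drop combining marks (nonzero canonical combining class).
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    return stripped.casefold()

for s in ["á", "Á", "a", "A", "À", "à"]:
    print(repr(s), "->", repr(base_key(s)))
```

All six inputs end up with the same key, including the grave-accented forms, which is the side effect mentioned in point (4).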
>> On 9/29/15 3:28 PM, Tom Worster wrote:
>>> Peter, Alexey,
>>> 
>>> I think there is an ambiguity in the specification of case
>>> mapping in RFC 7613 and draft-ietf-precis-nickname-19.
>>...
>>> But there are 55 code points in Unicode 7.0.0 that change
>>> under default case folding that are neither uppercase nor
>>> titlecase characters, 12 of which are Lowercase_Letter. I
>>> suspect this stems from a confusion between Unicode case
>>> mapping and case folding.
>
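A rough way to reproduce this kind of count, assuming Python. Two caveats: the interpreter uses whatever Unicode version it bundles (printed below), not necessarily 7.0.0, and str.casefold() applies full case folding, so the totals should not be expected to match the 55 and 12 quoted above. The function name is made up for illustration.

```python
import sys
import unicodedata

def folds_but_not_upper_or_title():
    """Code points changed by case folding whose category is not Lu or Lt."""
    hits = []
    for cp in range(sys.maxunicode + 1):
        ch = chr(cp)
        category = unicodedata.category(ch)
        if ch.casefold() != ch and category not in ("Lu", "Lt"):
            hits.append((cp, category))
    return hits

hits = folds_but_not_upper_or_title()
print("Unicode version:", unicodedata.unidata_version)
print("code points found:", len(hits))
print("of which Lowercase_Letter:", sum(1 for _, cat in hits if cat == "Ll"))
```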
>Yes, I think so.   See above, but, if I were making the rules, I
>would say "never use toCaseFold where case mapping is intended
>and, in particular, where one wants to substitute one string for
>another rather than checking a pair of strings for equivalence
>or perhaps telling users what would be considered equivalent".
>That interpretation is, I believe, consistent with most of the
>Unicode FAQ text you have quoted and other Unicode statements.
>However I have lost that argument before and hope, given
>decisions that have been made and deployed, that I was wrong.
>But there is another issue...
>
>>...
>>> The nickname profile can be corrected or the algorithm
>>> clarified. I'm not sure what to do with a Proposed Standard
>>> RFC. Errata? Can the case mapping rule be changed in IANA?
>>> https://www.iana.org/assignments/precis-parameters/profiles/UsernameCaseMapped.txt
>>> e.g. to "Apply Unicode default case folding"
>
>Almost certainly not... an "update" revision of the spec would
>be needed.
>
>At least a few of the characters you questioned raise another
>issue:
>
>>...
>>> Ll; 03D0; C; 03B2; # GREEK BETA SYMBOL
>>> Ll; 03D1; C; 03B8; # GREEK THETA SYMBOL
>>> Ll; 03D5; C; 03C6; # GREEK PHI SYMBOL
>>> Ll; 03D6; C; 03C0; # GREEK PI SYMBOL
>>> Ll; 03F0; C; 03BA; # GREEK KAPPA SYMBOL
>>> Ll; 03F1; C; 03C1; # GREEK RHO SYMBOL
>>> Ll; 03F5; C; 03B5; # GREEK LUNATE EPSILON SYMBOL
>>...
>>> Ll; 1FBE; C; 03B9; # GREEK PROSGEGRAMMENI
>>> Nl; 2160; C; 2170; # ROMAN NUMERAL ONE
>>  (etc)
>>> So; 24B6; C; 24D0; # CIRCLED LATIN CAPITAL LETTER A
>>> So; 24B7; C; 24D1; # CIRCLED LATIN CAPITAL LETTER B
>>...
>
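The mappings quoted above can be checked against an interpreter, assuming Python's str.casefold() as the folding function. The dictionary below simply restates the pairs from the quoted list; nothing else is assumed about the elided entries.

```python
import unicodedata

# Source code point -> expected folded code point, as quoted in the thread.
listed = {
    0x03D0: 0x03B2, 0x03D1: 0x03B8, 0x03D5: 0x03C6, 0x03D6: 0x03C0,
    0x03F0: 0x03BA, 0x03F1: 0x03C1, 0x03F5: 0x03B5, 0x1FBE: 0x03B9,
    0x2160: 0x2170, 0x24B6: 0x24D0, 0x24B7: 0x24D1,
}

for src, dst in listed.items():
    ch = chr(src)
    folded = ch.casefold()
    folded_hex = " ".join(f"{ord(c):04X}" for c in folded)
    print(f"{unicodedata.category(ch)}; {src:04X}; {folded_hex}; # {unicodedata.name(ch)}")
    assert folded == chr(dst)
```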
>Those examples, and others, are, independent of their Unicode
>categories, not characters used in writing "words" of normal
>languages.  Most of them are inherently confusable with the
>similar-looking letters, e.g., U+2160 and U+2170 with upper- and
>lower-case "I" and "i" respectively, or U+03D0 and its
>relationship to "β".  The latter also raises the
>now-purely-academic question of whether a "variant letterform",
>such as U+03D0, violates the Unicode principle that different
>code points are not assigned to different glyph forms of the
>same letter, but those kinds of questions are another thing that
>makes these discussions difficult, especially for those who
>don't want to get involved with even script-specific or
>locale-specific details.    To the extent possible, we dealt
>with such characters in IDNA2008 by identifying them as
>DISALLOWED, but PRECIS permits enough additional flexibility to,
>as you have noticed, allow people who don't understand what they
>are doing (or who are trying to avoid that necessity) to get
>themselves and their users into a lot of trouble.
>
>Fewer easy answers here than one would like and would expect in
>some alternate and easier reality.
>
>best,
>    john
>