Re: [precis] Ambiguity in specification of case mapping in RFC 7613 and draft-ietf-precis-nickname

Peter Saint-Andre - &yet <peter@andyet.net> Mon, 26 October 2015 21:51 UTC

From: Peter Saint-Andre - &yet <peter@andyet.net>
To: John C Klensin <john-ietf@jck.com>, Tom Worster <fsb@thefsb.org>, Alexey Melnikov <Alexey.Melnikov@isode.com>
References: <D230767C.6587A%fsb@thefsb.org> <560C5149.5090607@andyet.net> <588752141F4228C805E674FC@JcK-HP8200.jck.com>
Message-ID: <562EA055.3030404@andyet.net>
Date: Mon, 26 Oct 2015 15:51:17 -0600
In-Reply-To: <588752141F4228C805E674FC@JcK-HP8200.jck.com>
Archived-At: <http://mailarchive.ietf.org/arch/msg/precis/EYhJynP9UQy3J5FCx-aDjMCuSCE>
Cc: precis@ietf.org
Subject: Re: [precis] Ambiguity in specification of case mapping in RFC 7613 and draft-ietf-precis-nickname

My apologies for the delayed reply. Comments inline.

On 10/1/15 7:50 AM, John C Klensin wrote:
> --On Wednesday, September 30, 2015 15:16 -0600 Peter Saint-Andre
> - &yet <peter@andyet.net> wrote:
>
>> Hi Tom, thanks for the note.
>>
>> My feeling is that we phrased things in a slightly wrong way,
>> because we assumed that case-mapping applies primarily or only
>> to uppercase and titlecase characters. I think this was more a
>> matter of communication (because people think of case mapping
>> as something needed only with respect to uppercase
>> characters), whereas it obviously applies more generally
>> (i.e., applying Unicode Default Case Folding will result in
>> mapping of the code points you mention here).
>>
>> We could do something like this in the nickname spec...
>> ...
>
>> NEW
>>
>>      3.  Case Mapping Rule: Unicode Default Case Folding MUST
>> be applied,
>>          as defined in the Unicode Standard [Unicode] (at the
>> time
>>          of this writing, the algorithm is specified in Chapter
>> 3 of
>>          [Unicode7.0]).  The primary result of doing so is that
>> uppercase
>>          characters are mapped to lowercase characters. In
>> applications
>>          that prohibit conflicting nicknames, this rule helps
>> to reduce
>>          the possibility of confusion by ensuring that nicknames
>>          differing only by case (e.g., "stpeter" vs. "StPeter")
>> would not
>>          be presented to a human user at the same time.
>> ...
>
> Peter,
>
> While your proposed text is an improvement,

Happy to hear it. All I intended was a slight clarification.

> the desire of many
> people for a magic "just tell me what to do" formula, one that
> lets them avoid understanding the issues, may call for a little
> more:

There is always a need for more when it comes to i18n.

> (1) First, toCaseFold is _not_ toLowerCase.  Saying "The primary
> result of doing so is that uppercase characters are mapped to
> lowercase characters" is true for toCaseFold,

By "primary" I meant two things: (1) lowercasing is what happens to the 
preponderance of code points and (2) this is the result that most people 
care about.

> but it has other
> effects that are information-losing and may be counterintuitive
> in some locales and situations.

Indeed.

> (2) Second, probably as a result of having IDNA in the lead,
> we've gotten sloppy about language and operations and should
> probably start untangling that before it gets people in trouble.

Where is the right place to do that untangling? (I doubt that it is the 
precis-nickname document.)

> The Unicode Standard, at least as I understand it, is fairly
> clear that the most important (and really only safe) use of
> toCaseFold is as part of a comparison operation.

Thanks for noting that. For example, Section 5.18 of Unicode 8.0.0 says:

    Caseless matching is implemented using case folding, which is the
    process of mapping characters of different case to a single form, so
    that case differences in strings are erased. Case folding allows for
    fast caseless matches in lookups because only binary comparison is
    required. It is more than just conversion to lowercase.
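
To make the distinction concrete, here is a minimal sketch in Python. It assumes that Python's built-in str.casefold() (which applies Unicode full case folding) is an acceptable stand-in for toCaseFold; the helper name caseless_equal is hypothetical, and the sketch ignores the normalization steps that canonical caseless matching also calls for.

```python
def caseless_equal(a: str, b: str) -> bool:
    """Compare two strings by their case-folded forms (comparison only)."""
    # Assumption: str.casefold() stands in for Unicode toCaseFold.
    return a.casefold() == b.casefold()


# Lowercasing and case folding agree for plain ASCII...
same_for_ascii = "StPeter".lower() == "StPeter".casefold()

# ...but folding does more than lowercase: sharp s expands to "ss",
# and final sigma folds to the ordinary small sigma.
lowered = "Nuß".lower()
folded = "Nuß".casefold()
sigma_folded = "ς".casefold()
```

Used only inside caseless_equal, the folded form never reaches a user, so the information loss shows up as extra matches rather than as altered strings.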

> Using your
> example it is entirely reasonable to treat, "stpeter" and
> "StPeter" as equivalent in a comparison operation, but accepting
> one string and changing it to the other for display may not be a
> really good idea.  While that transformation may be acceptable
> (although I would be surprised if there were no people who share
> your surname who could consider "stpeter" or "Stpeter"
> unacceptable and might even believe that "StPeter" is an
> unacceptable substitute for "St. Peter"),

I do receive email at stpeter@gmail.com intended for st.peter@gmail.com 
but that's a separate topic...

> it also points out the
> dangers of using Basic Latin script examples to illustrate
> situations in which even more extended Latin script, much less
> other scripts, may raise more complex issues.    Because IDNA is
> essentially a workaround, adopted because changing the DNS comparison
> rules was impractical for several reasons, we ended up using
> toCaseFold to map characters and strings into others in IDNA2003
> but PRECIS implementations that do not have the same constraints
> would, in general, be better off confining the use of
> toCaseFold, or even toLowerCase, to comparison operations.

Unfortunately we didn't do that in RFC 7564 or RFC 7613. Does it make 
sense for this nickname specification to differ in this respect from the 
published RFCs? Shall we file errata against those documents? (This 
might apply only to RFC 7613, which says to apply case folding as part 
of the enforcement process - when exactly to apply case folding is not 
stipulated by RFC 7564.)
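
As a rough illustration of the difference between the two approaches, the following Python sketch folds case only to build a comparison key and keeps the string as the user entered it for display. The NicknameRegistry class and its method names are hypothetical, str.casefold() is assumed as a stand-in for Unicode Default Case Folding, and none of the other PRECIS rules (width mapping, normalization, and so on) are applied.

```python
class NicknameRegistry:
    """Hypothetical registry: fold case for comparison, never for storage."""

    def __init__(self):
        # Maps the case-folded comparison key to the nickname as entered.
        self._by_key = {}

    def register(self, nickname: str) -> bool:
        """Return False if an equivalent nickname is already taken."""
        key = nickname.casefold()  # comparison key only
        if key in self._by_key:
            return False
        self._by_key[key] = nickname  # original string is preserved
        return True

    def display(self, nickname: str):
        """Return the registered form that matches, or None."""
        return self._by_key.get(nickname.casefold())


registry = NicknameRegistry()
first = registry.register("StPeter")
second = registry.register("stpeter")
shown = registry.display("STPETER")
```

Applying case folding during enforcement would instead store and display the folded string, so "StPeter" would come back as "stpeter".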

> (3) Because toCaseFold loses information when used for more than
> comparison (for comparison, it merely contributes to what some
> people would consider false positives for matching), involves
> some controversial decisions, and, because of stability
> requirements, cannot be changed even if the controversies are
> resolved in other ways, we end up with, e.g.,
>      toCaseFold ("Nuß") -> "nuss"
> which is considered an acceptable transformation in some places
> that identify themselves as speaking/using German and two
> different unacceptable errors in others.  Again, this will
> almost always be much more serious if the transformation is used
> to map and replace strings than if it is used to compare (fwiw,
> that particular example is part of a continuing disagreement
> between IDNA2008 and, among others, German domain registry
> authorities on one side and UTC and UTR 46 on the other).

Agreed.

> (4) If the motivation is really to avoid confusion, the correct
> confusion-blocking rule for Latin script (but not others) and
> many languages that use it (but certainly not all) involves
> moving beyond toCaseFold and having all "decorated" characters
> (characters normally represented by glyphs consisting of a Basic
> Latin character and one or more diacritical or equivalent
> markings) compare equal to their base characters, e.g., "á" not
> only matches "Á" but also "a" and "A" and, as an unfortunate
> side-effect, maybe "À" and "à" as well.  This is bad news for
> languages in which decorated Latin characters are used to
> represent phonetically and conceptually different characters,
> not just pronunciation variations.  I am not qualified to
> evaluate "how bad".   In addition, extrapolations from this
> principle about Latin script to unrelated scripts will almost
> certainly lead to serious errors and/or additional confusion.

I would not be comfortable going that far...
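
For readers who want to see what such a rule would do in practice, here is a small Python sketch. It is only one possible reading of point (4): it decomposes to NFD, drops combining marks, and then case folds. The function names strip_marks and loose_key are hypothetical, and nothing like this is specified in PRECIS.

```python
import unicodedata


def strip_marks(s: str) -> str:
    """Decompose to NFD and drop combining marks (diacritics)."""
    decomposed = unicodedata.normalize("NFD", s)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def loose_key(s: str) -> str:
    """Hypothetical comparison key: base letters only, case folded."""
    return strip_marks(s).casefold()


# Acute and grave variants all collapse onto plain "a"...
keys = {loose_key(c) for c in ["á", "Á", "à", "À", "a", "A"]}

# ...and so do letters that some languages treat as distinct, such as "ö".
o_key = loose_key("ö")
```

The second result is the problem John describes: in languages where "ö" is a separate letter rather than a decorated "o", this key produces matches that users would consider wrong.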

> More on this and Tom's question below...
>
>> On 9/29/15 3:28 PM, Tom Worster wrote:
>>> Peter, Alexey,
>>>
>>> I think there is an ambiguity in the specification of case
>>> mapping in RFC 7613 and draft-ietf-precis-nickname-19.
>> ...
>>> But there are 55 code points in Unicode 7.0.0 that change
>>> under default case folding that are neither uppercase nor
>>> titlecase characters, 12 of which are Lowercase_Letter. I
>>> suspect this stems from a confusion between Unicode case
>>> mapping and case folding.
>
> Yes, I think so.   See above, but, if I were making the rules, I
> would say "never use toCaseFold where case mapping is intended
> and, in particular, where one wants to substitute one string for
> another rather than checking a pair of strings for equivalence
> or perhaps telling users what would be considered equivalent".
> That interpretation is, I believe, consistent with most of the
> Unicode FAQ text you have quoted and other Unicode statements.
> However I have lost that argument before and hope, given
> decisions that have been made and deployed, that I was wrong.
> But there is another issue...
>
>> ...
>>> The nickname profile can be corrected or the algorithm
>>> clarified. I'm not sure what to do with a Proposed Standard
>>> RFC. Errata? Can the case mapping rule be changed in IANA?
>>> https://www.iana.org/assignments/precis-parameters/profiles/UsernameCaseMapped.txt
>>> e.g. to "Apply Unicode default case folding"
>
> Almost certainly not... an "update" revision of the spec would
> be needed.

Yes, RFC 7613 would need to be updated via a separate spec or 7613bis.

> At least a few of the characters you questioned raise another
> issue:
>
>> ...
>>> Ll; 03D0; C; 03B2; # GREEK BETA SYMBOL
>>> Ll; 03D1; C; 03B8; # GREEK THETA SYMBOL
>>> Ll; 03D5; C; 03C6; # GREEK PHI SYMBOL
>>> Ll; 03D6; C; 03C0; # GREEK PI SYMBOL
>>> Ll; 03F0; C; 03BA; # GREEK KAPPA SYMBOL
>>> Ll; 03F1; C; 03C1; # GREEK RHO SYMBOL
>>> Ll; 03F5; C; 03B5; # GREEK LUNATE EPSILON SYMBOL
>> ...
>>> Ll; 1FBE; C; 03B9; # GREEK PROSGEGRAMMENI
>>> Nl; 2160; C; 2170; # ROMAN NUMERAL ONE
>>   (etc)
>>> So; 24B6; C; 24D0; # CIRCLED LATIN CAPITAL LETTER A
>>> So; 24B7; C; 24D1; # CIRCLED LATIN CAPITAL LETTER B
>> ...
>
> Those examples, and others, are, independent of their Unicode
> categories, not characters used in writing "words" of normal
> languages.  Most of them are inherently confusable with the
> similar-looking letters, e.g., U+2160 and U+2170 with upper and
> lower-case "I" (and "i") respectively or U+03D0 and its
> relationship to "β".  The latter also raises the
> now-purely-academic question of whether a "variant letterform",
> such as U+03D0, violates the Unicode principle that different
> code points are not assigned to different glyph forms of the
> same letter, but those kinds of questions are another thing that
> makes these discussions difficult, especially for those who
> don't want to get involved with even script-specific or
> locale-specific details.    To the extent possible, we dealt
> with such characters in IDNA2008 by identifying them as
> DISALLOWED, but PRECIS permits enough additional flexibility to,
> as you have noticed, allow people who don't understand what they
> are doing (or who are trying to avoid that necessity) to get
> themselves and their users into a lot of trouble.

This is certainly the case for FreeformClass in PRECIS. I would hope 
that we took a safer path with the IdentifierClass (e.g., U+03D0 is 
disallowed there).
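
Anyone who wants to reproduce the kind of list Tom posted can do so with a short Python sketch like the one below. It assumes str.casefold() as the folding operation, which applies full case folding (both the C and F mappings), and it uses whatever Unicode version ships with the interpreter, so the total will not necessarily match the 55 code points Tom counted for Unicode 7.0.0. The function name is hypothetical.

```python
import sys
import unicodedata


def folds_but_not_upper_or_title():
    """List code points that change under case folding but are not Lu or Lt."""
    hits = []
    for cp in range(sys.maxunicode + 1):
        ch = chr(cp)
        category = unicodedata.category(ch)
        if category in ("Lu", "Lt"):
            continue  # uppercase and titlecase are the expected cases
        folded = ch.casefold()
        if folded != ch:
            hits.append((cp, category, folded))
    return hits


hits = {cp: (category, folded)
        for cp, category, folded in folds_but_not_upper_or_title()}

# A few of the code points from Tom's list:
beta_symbol = hits[0x03D0]      # GREEK BETA SYMBOL
roman_one = hits[0x2160]        # ROMAN NUMERAL ONE
circled_a = hits[0x24B6]        # CIRCLED LATIN CAPITAL LETTER A
```

Each entry gives the General_Category and the folded form, which makes it easy to check which of these code points a given PRECIS string class actually allows.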

> Fewer easy answers here than one would like and would expect in
> some alternate and easier reality.

Always. :(

After all that, I have 3 questions:

(1) Is my proposed text enough of a clarification that we should make 
that change before the nickname I-D is published as an RFC?

(2) Should we modify draft-ietf-precis-nickname so that case folding is 
applied only as part of comparison and not as part of enforcement? If 
so, should we make that change before this document is published as an RFC?

(3) Should we update RFC 7613 so that case folding is applied only as 
part of comparison and not as part of enforcement?

Peter