Re: [precis] Ambiguity in specification of case mapping in RFC 7613 and draft-ietf-precis-nickname

Peter Saint-Andre <peter@andyet.net> Wed, 28 October 2015 21:53 UTC

To: John C Klensin <john-ietf@jck.com>, Tom Worster <fsb@thefsb.org>, Alexey Melnikov <Alexey.Melnikov@isode.com>
References: <0347834EBDC481BD99BDBE67@JcK-HP8200.jck.com> <56302E6D.5030901@andyet.net> <56312AAC.1000300@andyet.net> <56313616.8000801@andyet.net>
From: Peter Saint-Andre <peter@andyet.net>
Message-ID: <563143B9.7020707@andyet.net>
Date: Wed, 28 Oct 2015 15:52:57 -0600
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.10; rv:38.0) Gecko/20100101 Thunderbird/38.2.0
MIME-Version: 1.0
In-Reply-To: <56313616.8000801@andyet.net>
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 8bit
Archived-At: <http://mailarchive.ietf.org/arch/msg/precis/gk-1oNw5nY7k-gyWbpVmXZ9R1Rc>
Cc: precis@ietf.org
Subject: Re: [precis] Ambiguity in specification of case mapping in RFC 7613 and draft-ietf-precis-nickname

Example 7 needs to be corrected, too, in accordance with CaseFolding.txt.
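
[Editorial illustration, not part of the original message.] The folding behavior referred to here can be checked with a short sketch. It assumes that Python's built-in `str.casefold()` is an acceptable stand-in for Unicode Default Case Folding as recorded in CaseFolding.txt; the helper name `fold_report` is hypothetical.

```python
# Assumption: str.casefold() stands in for Unicode Default Case Folding.
SIGMAS = (0x03A3, 0x03C3, 0x03C2)  # capital sigma, small sigma, small final sigma


def fold_report(code_points):
    """Map each code point (as 'U+XXXX') to the code points of its case-folded form."""
    report = {}
    for cp in code_points:
        folded = chr(cp).casefold()
        report[f"U+{cp:04X}"] = " ".join(f"U+{ord(c):04X}" for c in folded)
    return report


for source, target in fold_report(SIGMAS).items():
    print(source, "->", target)
```

All three code points fold to U+03C3, which is why a comparison that applies case folding treats the three sigma forms as matching.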

On 10/28/15 2:54 PM, Peter Saint-Andre wrote:
> And here is another correction in Section 3...
>
> OLD
>
>     Regarding examples 5, 6, and 7: applying Unicode Default Case Folding
>     to GREEK CAPITAL LETTER SIGMA (U+03A3) results in GREEK SMALL LETTER
>     SIGMA (U+03C3), and doing so during comparison would result in
>     matching the nicknames in examples 5 and 6; however, because the
>     PRECIS mapping rules do not account for the special status of GREEK
>     SMALL LETTER FINAL SIGMA (U+03C2), the nicknames in examples 5 and 7
>     or examples 6 and 7 would not be matched.
>
> NEW
>
>     Regarding examples 5, 6, and 7: applying Unicode Default Case Folding
>     to GREEK CAPITAL LETTER SIGMA (U+03A3) results in GREEK SMALL LETTER
>     SIGMA (U+03C3), and the same is true of GREEK SMALL LETTER FINAL
>     SIGMA (U+03C2); therefore, the comparison operation defined in
>     Section 2.4 would result in matching of the nicknames in examples 5,
>     6, and 7.
>
> On 10/28/15 2:06 PM, Peter Saint-Andre wrote:
>> I propose the following text changes:
>>
>> ###
>>
>> OLD
>>
>>     3.  Case Mapping Rule: Uppercase and titlecase characters MUST be
>>         mapped to their lowercase equivalents using Unicode Default Case
>>         Folding as defined in the Unicode Standard [Unicode] (at the time
>>         of this writing, the algorithm is specified in Chapter 3 of
>>         [Unicode7.0]).  In applications that prohibit conflicting
>>         nicknames, this rule helps to reduce the possibility of confusion
>>         by ensuring that nicknames differing only by case (e.g.,
>>         "stpeter" vs. "StPeter") would not be presented to a human user
>>         at the same time.
>>
>> NEW
>>
>>     3.  Case Mapping Rule: Unicode Default Case Folding MUST be applied,
>>         as defined in the Unicode Standard [Unicode] (at the time
>>         of this writing, the algorithm is specified in Chapter 3 of
>>         [Unicode7.0]).  The primary result of doing so is that uppercase
>>         characters are mapped to lowercase characters. In applications
>>         that prohibit conflicting nicknames, this rule helps to reduce
>>         the possibility of confusion by ensuring that nicknames
>>         differing only by case (e.g., "stpeter" vs. "StPeter") would not
>>         be presented to a human user at the same time.
>>
>> ###
>>
>> (The foregoing was previously sent to the list.)
>>
>> ###
>>
>> OLD
>>
>> 2.3.  Enforcement
>>
>>     An entity that performs enforcement according to this profile MUST
>>     prepare a string as described in Section 2.2 and MUST also apply the
>>     rules specified in Section 2.1.  The rules MUST be applied in the
>>     order shown.
>>
>>     After all of the foregoing rules have been enforced, the entity MUST
>>     ensure that the nickname is not zero bytes in length (this is done
>>     after enforcing the rules to prevent applications from mistakenly
>>     omitting a nickname entirely, because when internationalized
>>     characters are accepted, a non-empty sequence of characters can
>>     result in a zero-length nickname after canonicalization).
>>
>> 2.4.  Comparison
>>
>>     An entity that performs comparison of two strings according to this
>>     profile MUST prepare each string and enforce the rules as specified
>>     in Sections 2.2 and 2.3.  The two strings are to be considered
>>     equivalent if they are an exact octet-for-octet match (sometimes
>>     called "bit-string identity").
>>
>> NEW
>>
>> 2.3.  Enforcement
>>
>>     An entity that performs enforcement according to this profile MUST
>>     prepare a string as described in Section 2.2 and MUST also apply the
>>     following rules specified in Section 2.1 in the order shown:
>>
>>     1. Additional Mapping Rule
>>     2. Normalization Rule
>>     3. Directionality Rule
>>
>>     After all of the foregoing rules have been enforced, the entity MUST
>>     ensure that the nickname is not zero bytes in length (this is done
>>     after enforcing the rules to prevent applications from mistakenly
>>     omitting a nickname entirely, because when internationalized
>>     characters are accepted, a non-empty sequence of characters can
>>     result in a zero-length nickname after canonicalization).
>>
>> 2.4.  Comparison
>>
>>     An entity that performs comparison of two strings according to this
>>     profile MUST prepare each string as specified in Section 2.2 and
>>     MUST apply the following rules specified in Section 2.1 in the order
>>     shown:
>>
>>     1. Additional Mapping Rule
>>     2. Case Mapping Rule
>>     3. Normalization Rule
>>     4. Directionality Rule
>>
>>     The two strings are to be considered equivalent if they are an exact
>>     octet-for-octet match (sometimes called "bit-string identity").
>>
>> ###
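
[Editorial illustration, not part of the original message.] The proposed split between enforcement and comparison might be sketched as follows. Several details are assumptions that this thread does not state: that the Additional Mapping Rule maps space separators to U+0020, trims, and collapses runs of spaces; that the Normalization Rule is NFKC; that the Directionality Rule is a no-op; and that `str.casefold()` approximates Unicode Default Case Folding. The Section 2.2 preparation step is omitted, and all function names are hypothetical.

```python
import re
import unicodedata


def _additional_mapping(s):
    # Assumed rule: map space separators (category Zs) to U+0020,
    # strip leading/trailing spaces, collapse interior runs of spaces.
    s = "".join(" " if unicodedata.category(c) == "Zs" else c for c in s)
    return re.sub(" +", " ", s.strip(" "))


def enforce(nickname):
    """Enforcement as proposed: no case mapping, so the stored form keeps its case."""
    s = _additional_mapping(nickname)
    s = unicodedata.normalize("NFKC", s)  # assumed Normalization Rule
    # Directionality Rule: assumed to be a no-op for this profile.
    if len(s.encode("utf-8")) == 0:
        raise ValueError("nickname is zero bytes long after enforcement")
    return s


def comparison_key(nickname):
    """Comparison as proposed: case folding is applied here and only here."""
    s = _additional_mapping(nickname)
    s = s.casefold()  # Case Mapping Rule (approximation)
    s = unicodedata.normalize("NFKC", s)
    return s.encode("utf-8")


def nicknames_match(a, b):
    # "Exact octet-for-octet match" of the two processed strings.
    return comparison_key(a) == comparison_key(b)
```

Under this sketch, `enforce("StPeter")` keeps the capital letters for display, while `nicknames_match("StPeter", "stpeter")` still reports a match.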
>>
>> In addition, some variation on John's proposed text about toLowerCase
>> vs. toCaseFold might be appropriate at the end of Section 4; however,
>> I'm still not sure that is necessary if we move the case mapping rule to
>> the comparison operation.
>>
>> Peter
>>
>> On 10/27/15 8:09 PM, Peter Saint-Andre wrote:
>>> On 10/27/15 11:32 AM, John C Klensin wrote:
>>>> Response to Monday's note immediately below; response to today's
>>>> follows it.  My apologies, but it is probably important to read
>>>> both.  My further apologies for the length of this note, but I
>>>> think we are in deep trouble here,
>>>
>>> Internationalization always seems to be a matter of how deep the trouble
>>> is...
>>>
>>>> trouble that is aggravated by
>>>> precis-mappings and precis-nickname both being post-approval and
>>>> that, as far as I know, there are no future plans for PRECIS
>>>> work (having precis-nickname in AUTH48 just emphasizes that --
>>>> see comment at end).
>>>
>>> We had not planned to work on PRECIS because we thought we were done for
>>> a while. If that's not the case and we need to fix things, then so be it.
>>> Whether there is sufficient and continued energy for such work is
>>> another question. Personally I don't want us to have broken RFCs out
>>> there.
>>>
>>>> --On Monday, October 26, 2015 15:51 -0600 Peter Saint-Andre -
>>>> &yet <peter@andyet.net> wrote:
>>>>
>>>>> My apologies for the delayed reply. Comments inline.
>>>>
>>>> A few remarks below... I can't tell whether we disagree or
>>>> whether at least one of us, probably me, is not being
>>>> adequately clear.  (Material on which we fairly clearly agree
>>>> elided.)
>>>>
>>>>
>>>>> On 10/1/15 7:50 AM, John C Klensin wrote:
>>>>>> --On Wednesday, September 30, 2015 15:16 -0600 Peter
>>>>>> Saint-Andre - &yet <peter@andyet.net> wrote:
>>>>> ...
>>>>>> Peter,
>>>>>>
>>>>>> While your proposed text is an improvement,
>>>>>
>>>>> Happy to hear it. All I intended was a slight clarification.
>>>>
>>>> But I'm not certain we are there yet...
>>>
>>> Agreed. The text I proposed addressed only a very small part of the
>>> problem.
>>>
>>>>>> the desire of many
>>>>>> people for a magic "just tell me what to do" formula, one that
>>>>>> lets them avoid understanding the issues, may call for a
>>>>>> little more:
>>>>>
>>>>> There is always a need for more when it comes to i18n.
>>>>
>>>> But I think it is a little more than that.  I've heard several
>>>> times, including in PRECIS meetings, requests for "just tell me
>>>> what to do and make sure it isn't complicated" (or "I don't want
>>>> to have to think about, much less understand, the issues").  We
>>>> can debate whether giving in to those requests in the I18n case
>>>> is wise.  I think it leads directly to conclusions equivalent to
>>>> "I understand my own script and writing system (or think I do)
>>>> and therefore, since all writing systems must be pretty much the
>>>> same, I understand all of the core issues in terms of my script
>>>> and understanding".   That, in turn, leads directly to the "how
>>>> do you spell 'Zürich'?" and "all spellings of 'Zuerich' should
>>>> be treated as equivalent" discussions that sounded like they
>>>> dominated a BOF at IETF 93.
>>>>
>>>> Now I actually think it is reasonable for someone to ask for a
>>>> library that will do the job most of the time and that will
>>>> almost never cause their users or customers to get angry at
>>>> them.  But, if we are going to call what we do "standards", they
>>>> should contain sufficient information that would-be library
>>>> authors can know what to do ... or understand that they are in
>>>> over their heads.  And, for these particular cases, we may need
>>>> to explain, or help the library authors explain, why some cases
>>>> will fail and, indeed, get users mad at vendors.
>>>>
>>>>
>>>>>> (1) First, toCaseFold is _not_ toLowerCase.  Saying "The
>>>>>> primary result of doing so is that uppercase characters are
>>>>>> mapped to lowercase characters" is true for toCaseFold,
>>>>>
>>>>> By "primary" I meant two things: (1) lowercasing is what
>>>>> happens to the preponderance of code points and (2) this is
>>>>> the result that most people care about.
>>>>
>>>> If I parse the above correctly, I think you are wrong.   I think
>>>> what most people want, care about, and think they are getting,
>>>> is lower case conversion, i.e., an operation that preserves
>>>> lower case characters and converts upper case characters to the
>>>> equivalent lower case.  toCaseFold isn't that operation.  It is
>>>> a much more complex and subtle operation that, as well as
>>>> converting upper case characters to lower case, sometimes
>>>> converts lower case characters to different lower case
>>>> characters (or strings of them).  It also requires a fairly good
>>>> understanding of Unicode (not just a relevant script) and
>>>> historical Unicode decisions to predict its behavior and to have
>>>> any hope of explaining that behavior to users.   If one is
>>>> trying to compare (as distinct from converting), then toCaseFold
>>>> may be exactly what is wanted, but it is really hard to explain
>>>> or justify that in terms of "nicknames" or "aliases", which are
>>>> about conversion.   And, if one hopes to explain what is going
>>>> on to users in terms of "lower casing", then toCaseFold is just
>>>> the wrong operation.  That is what toLowerCase is for and the
>>>> two operations are just not equivalent.
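
[Editorial illustration, not part of the original message.] The difference between the two operations can be seen in a minimal sketch. It assumes that Python's `str.lower()` and `str.casefold()` are reasonable stand-ins for Unicode toLowerCase and toCaseFold; the helper name `compare_operations` and the sample strings are chosen only for illustration.

```python
# Assumption: str.lower() ~ toLowerCase, str.casefold() ~ toCaseFold.
SAMPLES = ["StPeter", "Nu\u00df", "\u017f"]  # "Nuß" and LATIN SMALL LETTER LONG S


def compare_operations(s):
    """Return the lowercased and case-folded forms of s side by side."""
    return {"lower": s.lower(), "casefold": s.casefold()}


for sample in SAMPLES:
    result = compare_operations(sample)
    print(repr(sample), "lower:", repr(result["lower"]),
          "casefold:", repr(result["casefold"]))
```

For plain ASCII the two results agree. For the other samples, lowercasing leaves the already-lowercase characters alone, while case folding rewrites them into different lowercase characters or strings.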
>>>
>>> My recollection, quite possibly inaccurate or incomplete, from at least
>>> one and I think several in-person meetings of the PRECIS WG was: just
>>> use Unicode Default Case Folding because if you use anything else or try
>>> to roll your own you will be fubar forever. I do not recall any
>>> discussion of the issues you have raised in this thread (e.g., about the
>>> inadvisability of using case folding for anything but comparison
>>> operations) until the last few weeks. However, I freely admit that's
>>> probably because, through my own faults and ignorance, I didn't
>>> understand what you were saying.
>>>
>>>> FWIW and purely by coincidence wrt PRECIS and this document, I
>>>> had a conversation a few days ago with an expert on Arabic (and
>>>> Persian) calligraphy and writing systems (and good general
>>>> knowledge of writing systems) who is quite insistent that any
>>>> procedure we use for case-insensitive matching (e.g., case
>>>> folding) is discriminatory, inconsistent, and just
>>>> badly-thought-out if that same procedure doesn't treat isolated,
>>>> initial, and medial forms of the same character as equivalent.
>>>> He further strengthens his case (sic) by noting that Unicode
>>>> case folding maps U+03C2 (GREEK SMALL LETTER FINAL SIGMA,
>>>> unambiguously a lower-case character) to U+03C3 (GREEK SMALL
>>>> LETTER SIGMA), a relationship that depends entirely on
>>>> positional use and not case.  He also believes the same
>>>> relationships should apply to all other scripts that make form
>>>> distinctions for some characters based on positions in a string
>>>> and for which Unicode has chosen to assign different code
>>>> points.  Even if there were wide acceptance of his view, Unicode
>>>> stability principles would prevent changing toCaseFold (or
>>>> CaseFolding.txt), but this is more evidence that what toCaseFold
>>>> does and does not do is going to be hard to explain to either
>>>> casual users or to writing system experts whose primary
>>>> experience is not with the Greek-Latin-Cyrillic group.
>>>>
>>>> I don't think we want to say "these matching rules are somewhat
>>>> arbitrary and irrational, but, if you don't like it, blame
>>>> Unicode and not us", if only because it is our choice to use
>>>> those matching rules.  More below.
>>>>
>>>>
>>>>> ...
>>>>>> (2) Second, probably as a result of having IDNA in the lead,
>>>>>> we've gotten sloppy about language and operations and should
>>>>>> probably start untangling that before it gets people in
>>>>>> trouble.
>>>>>
>>>>> Where is the right place to do that untangling? (I doubt that
>>>>> it is the precis-nickname document.)
>>>>
>>>> I agree that precis-nickname isn't the ideal place.  I also
>>>> believe that you and it are the innocent victims of the
>>>> situation.  At the same time, I don't believe IETF should be
>>>> producing incomplete, ambiguous, erroneous, or misleading
>>>> standards because no one could get around to doing the right
>>>> foundational work.
>>>
>>> Agreed. I too want to get this right, even though it's not a lot of fun
>>> and it's certainly more work than I thought I was signing up for at the
>>> NEWPREP BoF years ago.
>>>
>>>>>> The Unicode Standard, at least as I understand it, is fairly
>>>>>> clear that the most important (and really only safe) use of
>>>>>> toCaseFold is as part of a comparison operation.
>>>>>
>>>>> Thanks for noting that. For example, Section 5.18 of Unicode
>>>>> 8.0.0 says:
>>>>>
>>>>>      Caseless matching is implemented using case folding, which
>>>>>      is the process of mapping characters of different case to
>>>>>      a single form, so that case differences in strings are
>>>>>      erased. Case folding allows for fast caseless matches in
>>>>>      lookups because only binary comparison is required. It is
>>>>>      more than just conversion to lowercase.
>>>>
>>>> Right.  But, again, when its use is appropriate (a very
>>>> controversial topic in itself with our painful IDNA history with
>>>> Final Sigma, Eszett and the case-independent versus
>>>> position-independent controversy called out above as examples)
>>>> that is "matches in lookups" (what I've described elsewhere as
>>>> "comparison only").  Not creating or defining nicknames or
>>>> aliases.  And that _is_ a problem for this document.
>>>
>>> I'm not convinced that things are as bad as you think. If we say in
>>> draft-ietf-precis-nickname that the case mapping rule is to be applied
>>> only as part of comparison and not as part of enforcement - which I
>>> think is really what we care about (e.g., to prevent spoofing of users
>>> in chat rooms) - then I think we might be most of the way there.
>>>
>>>>>> Using your
>>>>>> example it is entirely reasonable to treat, "stpeter" and
>>>>>> "StPeter" as equivalent in a comparison operation, but
>>>>>> accepting one string and changing it to the other for display
>>>>>> may not be a really good idea.  While that transformation may
>>>>>> be acceptable (although I would be surprised if there were no
>>>>>> people who share your surname who could consider "stpeter" or
>>>>>> "Stpeter" unacceptable and might even believe that "StPeter"
>>>>>> is an unacceptable substitute for "St. Peter"),
>>>>>
>>>>> I do receive email at stpeter@gmail.com intended for
>>>>> st.peter@gmail.com but that's a separate topic...
>>>>
>>>> One that is relevant because it "works" as a side-effect of a
>>>> decision Google has made about mailbox name equivalence, a
>>>> decision that, IMO, will sooner or later get someone into a lot
>>>> of trouble and,  more important, a decision and matching rule
>>>> that PRECIS, AFAICT, does not allow and that IDNA unambiguously
>>>> forbids.
>>>>
>>>>>> it also points out the
>>>>>> dangers of using Basic Latin script examples to illustrate
>>>>>> situations in which even more extended Latin script, much less
>>>>>> other scripts, may raise more complex issues.    Because IDNA
>>>>>> is essentially a workaround because changing the DNS
>>>>>> comparison rules was impractical for several reasons, we
>>>>>> ended up using toCaseFold to map characters and strings into
>>>>>> others in IDNA2003 but PRECIS implementations that do not
>>>>>> have the same constraints would, in general, be better off
>>>>>> confining the use of toCaseFold, or even toLowerCase, to
>>>>>> comparison operations.
>>>>>
>>>>> Unfortunately we didn't do that in RFC 7564 or RFC 7613. Does
>>>>> it make sense for this nickname specification to differ in
>>>>> this respect from the published RFCs? Shall we file errata
>>>>> against those documents? (This might apply only to RFC 7613,
>>>>> which says to apply case folding as part of the enforcement
>>>>> process - when exactly to apply case folding is not stipulated
>>>>> by RFC 7564.)
>>>>
>>>> To the extent to which this is a "botched that because the WG
>>>> didn't understand the issues well enough" conclusion, it would
>>>> be entirely reasonable to generate an updating RFC that repairs
>>>> 7613 and/or 7564, even doing so in an addendum to
>>>> precis-nickname if that is the only way to do that
>>>> expeditiously.  Per the above, we really don't want to give
>>>> library routine writers bad instructions.  As I understand it,
>>>> the current position of the RFC Editor and IESG is that
>>>> technical specification errors discovered in retrospect or after
>>>> people start using a spec are not appropriate topics for errata.
>>>> If the WG is not willing to do any of those things, then I
>>>> suggest that precis-nickname at least needs to contain a very
>>>> clear warning notice about this situation (see my response to
>>>> your question 1 below).
>>>
>>> I think we'll probably need to fix 7613 and 7564. I am hoping we can fix
>>> nickname now so that it is less incorrect than the other two. That
>>> doesn't necessarily mean we won't need to also further fix nickname
>>> later on.
>>>
>>> Granted, we were supposed to avoid this problem by working on all of the
>>> PRECIS specs simultaneously. Clearly we have not avoided the problem, so
>>> we need to solve it one way or another. If that means bis for them all,
>>> we need to deal with it.
>>>
>>>>>> (3) Because toCaseFold loses information when used for more
>>>>>> than comparison (for comparison, it merely contributes to
>>>>>> what some people would consider false positives for matching),
>>>>>> involves some controversial decisions, and, because of
>>>>>> stability requirements, cannot be changed even if the
>>>>>> controversies are resolved in other ways, we end up with,
>>>>>> e.g.,
>>>>>>       toCaseFold ("Nuß") -> "nuss"
>>>>>> which is considered an acceptable transformation in some
>>>>>> places that identify themselves as speaking/using German and
>>>>>> two different unacceptable errors in others.  Again, this will
>>>>>> almost always be much more serious if the transformation is
>>>>>> used to map and replace strings than if it is used to compare
>>>>>> (fwiw, that particular example is part of a continuing
>>>>>> disagreement between IDNA2008 and, among others, German
>>>>>> domain registry authorities on one side and UTC and UTR 46 on
>>>>>> the other).
>>>>>
>>>>> Agreed.
>>>>
>>>> See "warning notice" comment above and question 1 response below.
>>>>
>>>>>> (4) If the motivation is really to avoid confusion, the
>>>>>> correct confusion-blocking rule for Latin script (but not
>>>>>> others) and many languages that use it (but certainly not
>>>>>> all) involves moving beyond toCaseFold and making all
>>>>>> "decorated" characters (characters normally represented by
>>>>>> glyphs consisting of a Basic Latin character and one or more
>>>>>> diacritical or equivalent markings) compare equal to their
>>>>>> base characters, e.g., "á" not only matches "Á" but also
>>>>>> "a" and "A" and, as an unfortunate side-effect, maybe "À"
>>>>>> and "à" as well.  This is bad news for languages in which
>>>>>> decorated Latin characters are used to represent phonetically
>>>>>> and conceptually different characters, not just pronunciation
>>>>>> variations.  I am not qualified to evaluate "how bad".   In
>>>>>> addition, extrapolations from this principle about Latin
>>>>>> script to unrelated scripts will almost certainly lead to
>>>>>> serious errors and/or additional confusion.
>>>>>
>>>>> I would not be comfortable going that far...
>>>>
>>>> In case it isn't clear, I would not be either.  But it is where
>>>> getting sloppy about this stuff could easily take us.  It is
>>>> worth noting that it also identifies one of the difficulties
>>>> with doing a global system to be applied to many types of
>>>> applications (like the PRECIS work) and then applying it in user
>>>> interface software that end users will expect to be localized to
>>>> their assumptions because it has been mapped or translated into
>>>> their language (if one normally speaks Upper Slobbovian but has
>>>> some familiarity with English, an application interface in
>>>> English will probably be expected to be "foreign", odd, and
>>>> maybe even inconsistent with whatever expectations exist.  But,
>>>> if the interface is in Upper Slobbovian, the natural and
>>>> reasonable assumption will be that matching should conform to
>>>> normal Upper Slobbovian conventions).  FWIW, a matching rule
>>>> that says:
>>>>
>>>>   (i) Two instances of a base character with the same
>>>>     diacritical mark(s) match.
>>>>   (ii) Two instances of a base character with different
>>>>     diacritical mark(s) do not match.
>>>>   (iii) Two instances of a base character, one with
>>>>     diacritical mark(s) and the other without any decoration
>>>>     match.
>>>>
>>>> is precisely correct and normal behavior for at least one
>>>> language that uses Latin script.  It is also the normal practice
>>>> for at least one Latin script transcription system that is used
>>>> by a large fraction of a billion people (maybe more).
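
[Editorial illustration, not part of the original message.] Rules (i) through (iii) could be sketched as below. The approach is an assumption, not something the thread specifies: each string is decomposed with NFD, combining marks (detected via a nonzero canonical combining class) are attached to the preceding base character, and two positions match when the bases are equal and the mark sequences are either identical or empty on at least one side. Case is deliberately left out, and the function names are hypothetical.

```python
import unicodedata


def clusters(s):
    """Split s into (base, marks) pairs after canonical decomposition (NFD)."""
    out = []
    for ch in unicodedata.normalize("NFD", s):
        if unicodedata.combining(ch) and out:
            base, marks = out[-1]
            out[-1] = (base, marks + (ch,))
        else:
            out.append((ch, ()))
    return out


def marks_compatible(m1, m2):
    # (i) same marks match; (ii) different marks do not;
    # (iii) decorated vs. undecorated match.
    return m1 == m2 or not m1 or not m2


def loose_match(a, b):
    ca, cb = clusters(a), clusters(b)
    return len(ca) == len(cb) and all(
        base1 == base2 and marks_compatible(marks1, marks2)
        for (base1, marks1), (base2, marks2) in zip(ca, cb)
    )
```

One consequence worth noting: the relation is not transitive ("á" matches "a" and "a" matches "à", yet "á" does not match "à"), so it cannot be implemented by mapping every string to a single canonical key and comparing keys, which is the shape that case folding and the PRECIS comparison operation both assume.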
>>>
>>> That is indeed sobering.
>>>
>>>>>> More on this and Tom's question below...
>>>>>>
>>>>>>> On 9/29/15 3:28 PM, Tom Worster wrote:
>>>>>>>> Peter, Alexey,
>>>>>>>>
>>>>>>>> I think there is an ambiguity in the specification of case
>>>>>>>> mapping in RFC 7613 and draft-ietf-precis-nickname-19.
>>>>>>> ...
>>>>>>>> But there are 55 code points in Unicode 7.0.0 that change
>>>>>>>> under default case folding that are neither uppercase nor
>>>>>>>> titlecase characters, 12 of which are Lowercase_Letter. I
>>>>>>>> suspect this stems from a confusion between Unicode case
>>>>>>>> mapping and case folding.
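
[Editorial illustration, not part of the original message.] A rough way to look for such code points is sketched below. It rests on assumptions that may not match the method behind the figures quoted above: it uses the Unicode version bundled with the running Python interpreter (not necessarily 7.0.0), it treats "uppercase" and "titlecase" as General_Category Lu and Lt, and it uses `str.casefold()` (full case folding). The counts it prints can therefore differ from 55 and 12. The function name is hypothetical.

```python
import unicodedata


def folding_changes_outside_upper_title():
    """List characters that change under case folding but are neither Lu nor Lt."""
    hits = []
    for cp in range(0x110000):
        ch = chr(cp)
        if unicodedata.category(ch) in ("Lu", "Lt"):
            continue
        if ch.casefold() != ch:
            hits.append(ch)
    return hits


hits = folding_changes_outside_upper_title()
lowercase_hits = [c for c in hits if unicodedata.category(c) == "Ll"]
print("Unicode data version:", unicodedata.unidata_version)
print("changed by folding, not Lu/Lt:", len(hits))
print("of which Lowercase_Letter (Ll):", len(lowercase_hits))
```

Whatever the exact totals, the list includes lowercase letters such as U+03C2 and U+00DF, which is the point of the observation: describing the rule as applying only to uppercase and titlecase characters does not cover everything that case folding changes.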
>>>>
>>>> In the context of the above, a different way to say the same
>>>> thing is that people are looking at toCaseFold and assuming (and
>>>> explaining things in terms of) toLowerCase.  toCaseFold works
>>>> the way it is expected to and those 55 code points are, more or
>>>> less, collateral damage to get to a matching algorithm that
>>>> favors false positives over false negatives and various edge
>>>> cases (including in "edge cases" languages spoken by, and script
>>>> variations used by, millions of people).
>>>
>>> Sadly I suspect that is an accurate description of the current state of
>>> affairs (modulo my comment above about PRECIS WG discussions at one or
>>> more IETF meetings).
>>>
>>>>> ...
>>>>> After all that, I have 3 questions:
>>>>
>>>> Personal opinions about answers...
>>>>
>>>>> (1) Is my proposed text enough of a clarification that we
>>>>> should make that change before the nickname I-D is published
>>>>> as an RFC?
>>>>
>>>> I think the clarification is an improvement and is important
>>>> enough to incorporate (I know that is the answer to a slightly
>>>> different question).
>>>>
>>>> However, I think it is inadequate without a serious warning
>>>> about the situation.
>>>
>>> Yes.
>>>
>>>>  That warning could appear in either this
>>>> document or RFC 7613 (or 7613bis) with a pointer from the other,
>>>> but, unless you want to revise 7613 now, this one is handy.
>>>
>>> I suspect that we need to revise 7613. I suspect that we might also need
>>> to revise 7564 (at least with respect to the order in which operations
>>> are applied, since there has been some confusion among implementers).
>>>
>>> Well, we always knew that we would need to revise them. Just not so
>>> soon.
>>>
>>>> Comment about possible text below.
>>>>
>>>>> (2) Should we modify draft-ietf-precis-nickname so that case
>>>>> folding is applied only as part of comparison and not as part
>>>>> of enforcement? If so, should we make that change before this
>>>>> document is published as an RFC?
>>>>
>>>> Yes.  If something is used for "enforcement", it should be lower
>>>> casing or something else that can be explained to people who are
>>>> ordinarily familiar with one or more of the scripts that make
>>>> case distinctions.
>>>>
>>>> However, viewed in the light of this discussion, the whole
>>>> "enforcement" concept becomes a little dicey, especially if, as
>>>> I believe but don't have time to verify, the transformations
>>>> performed by toLowerCase are not a proper subset of those
>>>> performed by toCaseFold.
>>>
>>> My initial thought is that case mapping doesn't belong in the nickname
>>> enforcement operation at all - only in the comparison operation.
>>>
>>>>> (3) Should we update RFC 7613 so that case folding is applied
>>>>> only as part of comparison and not as part of enforcement?
>>>>
>>>> I think that is necessary.  Following up on the comment above, I
>>>> would prefer that the current Section 3.2.2 (3) of RFC 7613
>>>> either point to Unicode Lower Casing or contain a warning along
>>>> the lines of that below.
>>>
>>> Unlike the nickname profile (which I think can be cleaned up by moving
>>> the case mapping rule to the comparison operation and continuing to use
>>> Unicode Default Case Folding), I think you are right that for the
>>> UsernameCaseMapped profile we probably want Unicode Lower Casing. Thus
>>> the likely need, sooner rather than later, for 7613bis.
>>>
>>>>
>>>>     ----------
>>>>
>>>> --On Tuesday, October 27, 2015 07:52 -0600 Peter Saint-Andre
>>>> <peter@andyet.net> wrote:
>>>>
>>>>> This issue has greater urgency now because
>>>>> draft-ietf-precis-nickname is now in AUTH48...
>>>>>
>>>>> On 10/26/15 3:51 PM, Peter Saint-Andre - &yet wrote:
>>>>>
>>>>>> After all that, I have 3 questions:
>>>>>>
>>>>>> (1) Is my proposed text enough of a clarification that we
>>>>>> should make that change before the nickname I-D is published
>>>>>> as an RFC?
>>>>>
>>>>> I think so.
>>>>
>>>> See above.
>>>>
>>>>>> (2) Should we modify draft-ietf-precis-nickname so that case
>>>>>> folding is applied only as part of comparison and not as part
>>>>>> of enforcement? If so, should we make that change before this
>>>>>> document is published as an RFC?
>>>>>
>>>>> Although it seems to be the case that Unicode case folding is
>>>>> primarily designed for the purpose of matching (i.e.,
>>>>> comparison),
>>>>
>>>> "Seems" is a little weak.  The Unicode Standard is really quite
>>>> specific about that.
>>>>
>>>>> I have a concern that applying the PRECIS case
>>>>> mapping rule after applying the normalization and
>>>>> directionality rules might have unintended consequences that
>>>>> we haven't had a chance to consider yet. The PRECIS framework
>>>>> expresses a preference (actually a hard requirement) for
>>>>> applying the rules in a particular order. We made a late
>>>>> change to the username profiles (RFC 7613), such that width
>>>>> mapping is applied first (in order to accommodate fullwidth
>>>>> and halfwidth characters in certain East Asian scripts).
>>>>> Making a late change to the nickname profile also concerns me,
>>>>> even though both of these late changes seem reasonable on the
>>>>> face of it. I will try to find time to think about this
>>>>> further in the next 24 hours.
>>>>
>>>> First, a hint for the consideration process: there is a reason
>>>> why Unicode now supports a unified case folding and
>>>> normalization operation.  My recollection is that it is not only
>>>> more efficient to perform both operations at once (rather than
>>>> looking in one table and then the other), but that there are
>>>> some order-dependent or priority-dependent cases.
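
[Illustrative sketch of the kind of order dependence recalled above, using Python's `str.casefold()` and `unicodedata.normalize()` as stand-ins for Unicode case folding and NFKC. The helper names and the choice of sample character are assumptions made for illustration. U+210C has a compatibility decomposition to "H" but no case folding of its own, so the two orderings disagree.]

```python
import unicodedata


def fold_then_normalize(s: str) -> str:
    # Case folding first, then NFKC.
    return unicodedata.normalize("NFKC", s.casefold())


def normalize_then_fold(s: str) -> str:
    # NFKC first, then case folding.
    return unicodedata.normalize("NFKC", s).casefold()


sample = "\u210c"  # BLACK-LETTER CAPITAL H

print(fold_then_normalize(sample))  # folding finds nothing to do; NFKC then yields "H"
print(normalize_then_fold(sample))  # NFKC yields "H"; folding then yields "h"
```

[A single pass in either order is therefore not guaranteed to produce a string that is stable under both operations, which is one reason to treat folding and normalization as a combined step.]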
>>>>
>>>> The very fact that this issue exists (and is coming up again)
>>>> this late in the process (7613 published in August, WG winding
>>>> down and not, e.g., meeting next week) calls at least the PRECIS
>>>> quality of review and some fairly fundamental model issues into
>>>> question.  I first raised that issue a rather long time ago but
>>>> have continued to hope that we have an approximation to "good
>>>> enough" without going back and rethinking everything.
>>>>
>>>> The right solution, IMO, is that, if RFC 7613 is to rationalize
>>>> or explain the operation in terms of converting upper case
>>>> characters to lower case, then it should be using toLowerCase
>>>> because that is what the operation does.  After a quick look at
>>>> 7613, amending/updating it to simply convert to lower case would
>>>> be straightforward (and would not raise the ordering issue
>>>> called out above).  It would presumably require another IETF
>>>> Last Call, however, and I'd hope we would see some serious
>>>> discussion within the WG (and with UTC) before making the change
>>>> and about how it is explained.
>>>>
>>>> If we are not willing to make a change
>>>
>>> I'm willing. It would, as you note, require some careful thinking and
>>> review to make sure that we got it (more) right this time.
>>>
>>>> that significant and/or
>>>> if we conclude that the WG (and perhaps the IETF) have
>>>> completely run out of energy for dealing with i18n issues [1],
>>>> then I suggest that we introduce some additional text.  I've
>>>> just spent a half-hour trying to find the AUTH48 copy of
>>>> precis-nickname (aka RFC-to-be-7700), but the RFC Editor has
>>>> apparently changed naming conventions and the various queue
>>>> entry pages all point to the -19 I-D and not the current working
>>>> copy so I can't try to match text and insertion point to what is
>>>> there already.
>>>
>>> http://www.rfc-editor.org/authors/rfc7700.txt
>>>
>>>>  The suggestion is a patch (and a hack), not a
>>>> good fix but something like it is probably the least drastic
>>>> measure that would yield something that doesn't contain
>>>> unexplained known defects.
>>>>
>>>> Rough version of suggested text (possibly to go after your
>>>> revised paragraph and following up my comments in my 1 October
>>>> note).  Some of the terminology needs checking which I can do if
>>>> you want to go this route:
>>>>
>>>>     'Users of this specification should note that the
>>>>     concept of "lower case conversion" is somewhat elusive
>>>>     and more dependent on the conventions of different
>>>>     languages and notation systems that use the same script
>>>>     than may appear obvious at first glance, especially if
>>>>     that glance is at Basic Latin characters (i.e., the
>>>>     ASCII letter repertoire).  Unicode provides two
>>>>     different mapping procedures that produce lower-case
>>>>     characters, but they have different effects and results
>>>>     for many characters.  The more conservative one,
>>>>     typically appropriately applicable when lower case forms
>>>>     are needed, is actual lower-casing (embodied in the
>>>>     Unicode operation toLowerCase).  A more radical
>>>>     operation, normally suitable only for string matching in
>>>>     situations in which it is better to consider uncertain
>>>>     cases as matching than to treat them as distinct, is
>>>>     called "Case Folding" (Unicode operation toCaseFold).
>>>>     While the two operations will often produce the same
>>>>     results, Case Folding maps some lower case characters
>>>>     into others and performs other transformations that may
>>>>     be intuitively reasonable and expected for some users
>>>>     and quite astonishing (or just wrong) to others.  There
>>>>     may be no practical alternative, especially if the
>>>>     operations are to be used for mapping or enforcement, to
>>>>     developers of PRECIS-dependent applications understanding that the
>>>>     cases in which the two yield different results require
>>>>     careful understanding of the relevant user base and its
>>>>     needs [2].'
>>>
>>> Thanks.
>>>
>>> I am not sure if we need something like that if we move case mapping
>>> (here, case folding) to the comparison operation only - but something
>>> like that might still be appropriate.
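
[Illustrative sketch of the difference the suggested text describes between lower-casing and Case Folding. Python's `str.lower()` and `str.casefold()` are used here as approximations of the Unicode operations toLowerCase and toCaseFold; the sample characters are chosen for illustration only. Each one is already lower case, so lower-casing leaves it unchanged while case folding maps it to something else.]

```python
# Characters that are already lower case but are changed by case folding.
samples = {
    "LATIN SMALL LETTER SHARP S": "\u00df",
    "LATIN SMALL LETTER LONG S": "\u017f",
    "GREEK SMALL LETTER FINAL SIGMA": "\u03c2",
}

for name, ch in samples.items():
    lowered = ch.lower()     # approximates toLowerCase
    folded = ch.casefold()   # approximates toCaseFold
    print(f"{name}: lower={lowered!r} casefold={folded!r} same={lowered == folded}")
```

[If such a mapping is applied during enforcement rather than comparison, a user who entered a sharp s would see "ss" in the stored string, which is the sort of result the warning calls astonishing to some users.]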
>>>
>>>>>> (3) Should we update RFC 7613 so that case folding is applied
>>>>>> only as part of comparison and not as part of enforcement?
>>>>>
>>>>> That is less urgent so I suggest that we address the nickname
>>>>> spec first.
>>>>
>>>> Unless you (or someone else here) have a plausible plan to
>>>> continue and revitalize the WG and assign it that revision work
>>>> (and bring everyone actively participating up to the level
>>>> needed to easily understand this discussion thread and feel
>>>> embarrassed for not spotting the problems), I think we need to
>>>> assume that this is our last shot.  Absent an active and
>>>> committed WG, "do this first" could easily be equivalent to
>>>> "don't get around to the other, ever".
>>>
>>> As mentioned, I don't want to have broken RFCs out there.
>>>
>>>> I think that the particular set of issues that started this
>>>> thread is a known defect in the PRECIS specs, both nickname and
>>>> 7613, and that we are obligated to either fix the problems or at
>>>> least explain them.  The above warning text is an attempt to
>>>> explain and identify the problems even if it does not actually
>>>> provide a solution.  If it were published as part of
>>>> precis-nickname, it could include a statement to the effect that
>>>> it should also be treated as an update to 7613 or, if the IESG
>>>> and RFC Editor would agree in advance to accept, rather than
>>>> bury, the thing, I suppose we could publish it in
>>>> precis-nickname and create an erratum to 7613 indicating that it
>>>> should have included some form of that statement.  Neither
>>>> option implies a huge amount of work to update 7613.  But I
>>>> think that making the changes of (2) without doing anything
>>>> about (3) makes the two documents inconsistent with each other
>>>> and that would be an additional known defect.
>>>>
>>>> Procedural question: given that precis-nickname is in AUTH48 as
>>>> of yesterday and I don't see anything blocking publication next
>>>> week if you and Barry sign off on the revised text that the WG
>>>> hasn't seen,
>>>
>>> There is no revised text yet. That's why we're having this discussion.
>>>
>>>> does someone need to file a pro forma objection/
>>>> appeal to block that until this is sorted out and the WG has a
>>>> chance to review proposed publication text?
>>>
>>> I see no reason to invoke the specter of appeals quite yet. Everyone is
>>> working in good faith to do the right thing and get this mess cleaned
>>> up.
>>>
>>>> [1] I believe our collective inability to deal with the
>>>> within-script character forms that do not normalize to each
>>>> other because of language-dependent or other usage factors can
>>>> be taken as evidence of having run out of energy,
>>>
>>> Or in my case simple ignorance of some of the relevant issues and
>>> examples. It's not easy to know about all of this.
>>>
>>>> but it is
>>>> probably in the interest of finishing the PRECIS work to try to
>>>> treat that as a separate issue.
>>>
>>> Probably.
>>>
>>>> [2] Not unlike the reason to differentiate between NFC and NFKC
>>>> and understand the effects of each.
>>>
>>> Another thing that's not easy to grok in fulness.
>>>
>>> Peter
>>>