Re: [precis] Ambiguity in specification of case mapping in RFC 7613 and draft-ietf-precis-nickname

Peter Saint-Andre <peter@andyet.net> Wed, 28 October 2015 20:54 UTC

To: John C Klensin <john-ietf@jck.com>, Tom Worster <fsb@thefsb.org>, Alexey Melnikov <Alexey.Melnikov@isode.com>
References: <0347834EBDC481BD99BDBE67@JcK-HP8200.jck.com> <56302E6D.5030901@andyet.net> <56312AAC.1000300@andyet.net>
From: Peter Saint-Andre <peter@andyet.net>
Message-ID: <56313616.8000801@andyet.net>
Date: Wed, 28 Oct 2015 14:54:46 -0600
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.10; rv:38.0) Gecko/20100101 Thunderbird/38.2.0
MIME-Version: 1.0
In-Reply-To: <56312AAC.1000300@andyet.net>
Content-Type: text/plain; charset="utf-8"; format="flowed"
Content-Transfer-Encoding: 8bit
Archived-At: <http://mailarchive.ietf.org/arch/msg/precis/q2OB_d9zT8vOrOPPv9jsck7uRqw>
Cc: precis@ietf.org
Subject: Re: [precis] Ambiguity in specification of case mapping in RFC 7613 and draft-ietf-precis-nickname

And here is another correction in Section 3...

OLD

    Regarding examples 5, 6, and 7: applying Unicode Default Case Folding
    to GREEK CAPITAL LETTER SIGMA (U+03A3) results in GREEK SMALL LETTER
    SIGMA (U+03C3), and doing so during comparison would result in
    matching the nicknames in examples 5 and 6; however, because the
    PRECIS mapping rules do not account for the special status of GREEK
    SMALL LETTER FINAL SIGMA (U+03C2), the nicknames in examples 5 and 7
    or examples 6 and 7 would not be matched.

NEW

    Regarding examples 5, 6, and 7: applying Unicode Default Case Folding
    to GREEK CAPITAL LETTER SIGMA (U+03A3) results in GREEK SMALL LETTER
    SIGMA (U+03C3), and the same is true of GREEK SMALL LETTER FINAL
    SIGMA (U+03C2); therefore, the comparison operation defined in
    Section 2.4 would result in matching of the nicknames in examples 5,
    6, and 7.
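
The folding behaviour described in the NEW text can be checked with a minimal Python sketch. It assumes that Python's built-in str.casefold() is an acceptable stand-in for Unicode Default Case Folding (it follows the Unicode version bundled with the interpreter, not necessarily Unicode 7.0). The three one-character strings are illustrative stand-ins, not necessarily the exact nicknames used in examples 5, 6, and 7.

```python
# Illustrative only: str.casefold() stands in for Unicode Default Case Folding.
CAPITAL_SIGMA = "\u03a3"  # GREEK CAPITAL LETTER SIGMA
SMALL_SIGMA = "\u03c3"    # GREEK SMALL LETTER SIGMA
FINAL_SIGMA = "\u03c2"    # GREEK SMALL LETTER FINAL SIGMA


def fold(s: str) -> str:
    """Apply case folding to a string (assumed equivalent of toCaseFold)."""
    return s.casefold()


# All three forms collapse to the same folded string, so an
# octet-for-octet comparison of the folded forms reports a match.
folded_forms = {fold(CAPITAL_SIGMA), fold(SMALL_SIGMA), fold(FINAL_SIGMA)}
print(folded_forms == {SMALL_SIGMA})
```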

On 10/28/15 2:06 PM, Peter Saint-Andre wrote:
> I propose the following text changes:
>
> ###
>
> OLD
>
>     3.  Case Mapping Rule: Uppercase and titlecase characters MUST be
>         mapped to their lowercase equivalents using Unicode Default Case
>         Folding as defined in the Unicode Standard [Unicode] (at the time
>         of this writing, the algorithm is specified in Chapter 3 of
>         [Unicode7.0]).  In applications that prohibit conflicting
>         nicknames, this rule helps to reduce the possibility of confusion
>         by ensuring that nicknames differing only by case (e.g.,
>         "stpeter" vs. "StPeter") would not be presented to a human user
>         at the same time.
>
> NEW
>
>     3.  Case Mapping Rule: Unicode Default Case Folding MUST be applied,
>         as defined in the Unicode Standard [Unicode] (at the time
>         of this writing, the algorithm is specified in Chapter 3 of
>         [Unicode7.0]).  The primary result of doing so is that uppercase
>         characters are mapped to lowercase characters. In applications
>         that prohibit conflicting nicknames, this rule helps to reduce
>         the possibility of confusion by ensuring that nicknames
>         differing only by case (e.g., "stpeter" vs. "StPeter") would not
>         be presented to a human user at the same time.
>
> ###
>
> (The foregoing was previously sent to the list.)
>
> ###
>
> OLD
>
> 2.3.  Enforcement
>
>     An entity that performs enforcement according to this profile MUST
>     prepare a string as described in Section 2.2 and MUST also apply the
>     rules specified in Section 2.1.  The rules MUST be applied in the
>     order shown.
>
>     After all of the foregoing rules have been enforced, the entity MUST
>     ensure that the nickname is not zero bytes in length (this is done
>     after enforcing the rules to prevent applications from mistakenly
>     omitting a nickname entirely, because when internationalized
>     characters are accepted, a non-empty sequence of characters can
>     result in a zero-length nickname after canonicalization).
>
> 2.4.  Comparison
>
>     An entity that performs comparison of two strings according to this
>     profile MUST prepare each string and enforce the rules as specified
>     in Sections 2.2 and 2.3.  The two strings are to be considered
>     equivalent if they are an exact octet-for-octet match (sometimes
>     called "bit-string identity").
>
> NEW
>
> 2.3.  Enforcement
>
>     An entity that performs enforcement according to this profile MUST
>     prepare a string as described in Section 2.2 and MUST also apply the
>     following rules specified in Section 2.1 in the order shown:
>
>     1. Additional Mapping Rule
>     2. Normalization Rule
>     3. Directionality Rule
>
>     After all of the foregoing rules have been enforced, the entity MUST
>     ensure that the nickname is not zero bytes in length (this is done
>     after enforcing the rules to prevent applications from mistakenly
>     omitting a nickname entirely, because when internationalized
>     characters are accepted, a non-empty sequence of characters can
>     result in a zero-length nickname after canonicalization).
>
> 2.4.  Comparison
>
>     An entity that performs comparison of two strings according to this
>     profile MUST prepare each string as specified in Section 2.2 and
>     MUST apply the following rules specified in Section 2.1 in the order
>     shown:
>
>     1. Additional Mapping Rule
>     2. Case Mapping Rule
>     3. Normalization Rule
>     4. Directionality Rule
>
>     The two strings are to be considered equivalent if they are an exact
>     octet-for-octet match (sometimes called "bit-string identity").
>
> ###
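
The split between enforcement and comparison proposed above might be sketched as follows. This is a rough illustration under stated assumptions, not the profile itself: the preparation step of Section 2.2 is assumed to have been done by the caller, the body of additional_mapping (trim and collapse whitespace), the use of NFKC, and the no-op directionality rule are placeholders chosen for the sketch, and all function names are hypothetical.

```python
import unicodedata


def additional_mapping(s: str) -> str:
    # Assumption for this sketch: trim and collapse runs of whitespace.
    return " ".join(s.split())


def case_mapping(s: str) -> str:
    # str.casefold() stands in for Unicode Default Case Folding.
    return s.casefold()


def normalization(s: str) -> str:
    # Assumption for this sketch: NFKC.
    return unicodedata.normalize("NFKC", s)


def directionality(s: str) -> str:
    # Placeholder: no directionality processing in this sketch.
    return s


def enforce(nickname: str) -> str:
    """Enforcement: no case mapping, so the original case survives."""
    for rule in (additional_mapping, normalization, directionality):
        nickname = rule(nickname)
    if len(nickname.encode("utf-8")) == 0:
        raise ValueError("nickname is zero bytes in length after enforcement")
    return nickname


def comparison_key(nickname: str) -> bytes:
    """Comparison: the same rules plus case mapping, in the order shown."""
    for rule in (additional_mapping, case_mapping, normalization, directionality):
        nickname = rule(nickname)
    return nickname.encode("utf-8")


def nicknames_match(a: str, b: str) -> bool:
    # Octet-for-octet match of the two processed strings.
    return comparison_key(a) == comparison_key(b)
```

With this arrangement enforce("StPeter") keeps the capital letters for display, while nicknames_match("StPeter", "stpeter") still reports a conflict.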
>
> In addition, some variation on John's proposed text about toLowerCase
> vs. toCaseFold might be appropriate at the end of Section 4; however,
> I'm still not sure that is necessary if we move the case mapping rule to
> the comparison operation.
>
> Peter
>
> On 10/27/15 8:09 PM, Peter Saint-Andre wrote:
>> On 10/27/15 11:32 AM, John C Klensin wrote:
>>> Response to Monday's note immediately below; response to today's
>>> follows it.  My apologies, but it is probably important to read
>>> both.  My further apologies for the length of this note, but I
>>> think we are in deep trouble here,
>>
>> Internationalization always seems to be a matter of how deep the trouble
>> is...
>>
>>> trouble that is aggravated by
>>> precis-mappings and precis-nickname both being post-approval and
>>> that, as far as I know, there are no future plans for PRECIS
>>> work (having precis-nickname in AUTH48 just emphasizes that --
>>> see comment at end).
>>
>> We had not planned to work on PRECIS because we thought we were done for
>> a while. If that's not the case and we need to fix things, then so be it.
>> Whether there is sufficient and continued energy for such work is
>> another question. Personally I don't want us to have broken RFCs out
>> there.
>>
>>> --On Monday, October 26, 2015 15:51 -0600 Peter Saint-Andre -
>>> &yet <peter@andyet.net> wrote:
>>>
>>>> My apologies for the delayed reply. Comments inline.
>>>
>>> A few remarks below... I can't tell whether we disagree or
>>> whether at least one of us, probably me, is not being
>>> adequately clear.  (Material on which we fairly clearly agree
>>> elided.)
>>>
>>>
>>>> On 10/1/15 7:50 AM, John C Klensin wrote:
>>>>> --On Wednesday, September 30, 2015 15:16 -0600 Peter
>>>>> Saint-Andre - &yet <peter@andyet.net> wrote:
>>>> ...
>>>>> Peter,
>>>>>
>>>>> While your proposed text is an improvement,
>>>>
>>>> Happy to hear it. All I intended was a slight clarification.
>>>
>>> But I'm not certain we are there yet...
>>
>> Agreed. The text I proposed addressed only a very small part of the
>> problem.
>>
>>>>> the desire of many
>>>>> people for a magic "just tell me what to do" formula, one that
>>>>> lets them avoid understanding the issues, may call for a
>>>>> little more:
>>>>
>>>> There is always a need for more when it comes to i18n.
>>>
>>> But I think it is a little more than that.  I've heard several
>>> times, including in PRECIS meetings, requests for "just tell me
>>> what to do and make sure it isn't complicated" (or "I don't want
>>> to have to think about, much less understand, the issues").  We
>>> can debate whether giving in to those requests in the I18n case
>>> is wise.  I think it leads directly to conclusions equivalent to
>>> "I understand my own script and writing system (or think I do)
>>> and therefore, since all writing systems must be pretty much the
>>> same, I understand all of the core issues in terms of my script
>>> and understanding".   That, in turn, leads directly to the "how
>>> do you spell 'Zürich'?" and "all spellings of 'Zuerich' should
>>> be treated as equivalent" discussions that sounded like they
>>> dominated a BOF at IETF 93.
>>>
>>> Now I actually think it is reasonable for someone to ask for a
>>> library that will do the job most of the time and that will
>>> almost never cause their users or customers to get angry at
>>> them.  But, if we are going to call what we do "standards", they
>>> should contain sufficient information that would-be library
>>> authors can know what to do ... or understand that they are in
>>> over their heads.  And, for these particular cases, we may need
>>> to explain, or help the library authors explain, why some cases
>>> will fail and, indeed, get users mad at vendors.
>>>
>>>
>>>>> (1) First, toCaseFold is _not_ toLowerCase.  Saying "The
>>>>> primary result of doing so is that uppercase characters are
>>>>> mapped to lowercase characters" is true for toCaseFold,
>>>>
>>>> By "primary" I meant two things: (1) lowercasing is what
>>>> happens to the preponderance of code points and (2) this is
>>>> the result that most people care about.
>>>
>>> If I parse the above correctly, I think you are wrong.   I think
>>> what most people want, care about, and think they are getting,
>>> is lower case conversion, i.e., an operation that preserves
>>> lower case characters and converts upper case characters to the
>>> equivalent lower case.  toCaseFold isn't that operation.  It is
>>> a much more complex and subtle operation that, as well as
>>> converting upper case characters to lower case, sometimes
>>> converts lower case characters to different lower case
>>> characters (or strings of them).  It also requires a fairly good
>>> understanding of Unicode (not just a relevant script) and
>>> historical Unicode decisions to predict its behavior and to have
>>> any hope of explaining that behavior to users.   If one is
>>> trying to compare (as distinct from converting), then toCaseFold
>>> may be exactly what is wanted, but it is really hard to explain
>>> or justify that in terms of "nicknames" or "aliases", which are
>>> about conversion.   And, if one hopes to explain what is going
>>> on to users in terms of "lower casing", then toCaseFold is just
>>> the wrong operation.  That is what toLowerCase is for and the
>>> two operations are just not equivalent.
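
The difference between the two operations can be seen on input that is already lower case. A minimal sketch, assuming Python's str.lower() and str.casefold() as stand-ins for toLowerCase and toCaseFold (both follow the Unicode version bundled with the interpreter); the helper name is hypothetical.

```python
# Characters that are already lower case, so lower casing leaves them alone.
LONG_S = "\u017f"        # LATIN SMALL LETTER LONG S
N_APOSTROPHE = "\u0149"  # LATIN SMALL LETTER N PRECEDED BY APOSTROPHE


def lower_vs_fold(ch: str) -> tuple:
    """Return (result of lower casing, result of case folding) for ch."""
    return ch.lower(), ch.casefold()


for ch in (LONG_S, N_APOSTROPHE):
    lowered, folded = lower_vs_fold(ch)
    # Lower casing preserves the character; case folding replaces it with a
    # different lower case character or with a string of several characters.
    print(repr(ch), "lower:", repr(lowered), "casefold:", repr(folded))
```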
>>
>> My recollection, quite possibly inaccurate or incomplete, from at least
>> one and I think several in-person meetings of the PRECIS WG was: just
>> use Unicode Default Case Folding because if you use anything else or try
>> to roll your own you will be fubar forever. I do not recall any
>> discussion of the issues you have raised in this thread (e.g., about the
>> inadvisability of using case folding for anything but comparison
>> operations) until the last few weeks. However, I freely admit that's
>> probably because, through my own faults and ignorance, I didn't
>> understand what you were saying.
>>
>>> FWIW and purely by coincidence wrt PRECIS and this document, I
>>> had a conversation a few days ago with an expert on Arabic (and
>>> Persian) calligraphy and writing systems (and good general
>>> knowledge of writing systems) who is quite insistent that any
>>> procedure we use for case-insensitive matching (e.g., case
>>> folding) is discriminatory, inconsistent, and just
>>> badly-thought-out if that same procedure doesn't treat isolated,
>>> initial, and medial forms of the same character as equivalent.
>>> He further strengthens his case (sic) by noting that Unicode
>>> case folding maps U+03C2 (GREEK SMALL LETTER FINAL SIGMA,
>>> unambiguously a lower-case character) to U+03C3 (GREEK SMALL
>>> LETTER SIGMA), a relationship that depends entirely on
>>> positional use and not case.  He also believes the same
>>> relationships should apply to all other scripts that make form
>>> distinctions for some characters based on positions in a string
>>> and for which Unicode has chosen to assign different code
>>> points.  Even if there were wide acceptance of his view, Unicode
>>> stability principles would prevent changing toCaseFold (or
>>> CaseFolding.txt), but this is more evidence that what toCaseFold
>>> does and does not do is going to be hard to explain to either
>>> casual users or to writing system experts whose primary
>>> experience is not with the Greek-Latin-Cyrillic group.
>>>
>>> I don't think we want to say "these matching rules are somewhat
>>> arbitrary and irrational, but, if you don't like it, blame
>>> Unicode and not us", if only because it is our choice to use
>>> those matching rules.  More below.
>>>
>>>
>>>> ...
>>>>> (2) Second, probably as a result of having IDNA in the lead,
>>>>> we've gotten sloppy about language and operations and should
>>>>> probably start untangling that before it gets people in
>>>>> trouble.
>>>>
>>>> Where is the right place to do that untangling? (I doubt that
>>>> it is the precis-nickname document.)
>>>
>>> I agree that precis-nickname isn't the ideal place.  I also
>>> believe that you and it are the innocent victims of the
>>> situation.  At the same time, I don't believe IETF should be
>>> producing incomplete, ambiguous, erroneous, or misleading
>>> standards because no one could get around to doing the right
>>> foundational work.
>>
>> Agreed. I too want to get this right, even though it's not a lot of fun
>> and it's certainly more work than I thought I was signing up for at the
>> NEWPREP BoF years ago.
>>
>>>>> The Unicode Standard, at least as I understand it, is fairly
>>>>> clear that the most important (and really only safe) use of
>>>>> toCaseFold is as part of a comparison operation.
>>>>
>>>> Thanks for noting that. For example, Section 5.18 of Unicode
>>>> 8.0.0 says:
>>>>
>>>>      Caseless matching is implemented using case folding, which is
>>>>      the process of mapping characters of different case to a single
>>>>      form, so that case differences in strings are erased. Case
>>>>      folding allows for fast caseless matches in lookups because
>>>>      only binary comparison is required. It is more than just
>>>>      conversion to lowercase.
>>>
>>> Right.  But, again, when its use is appropriate (a very
>>> controversial topic in itself with our painful IDNA history with
>>> Final Sigma, Eszett and the case-independent versus
>>> position-independent controversy called out above as examples)
>>> that is "matches in lookups" (what I've described elsewhere as
>>> "comparison only").  Not creating or defining nicknames or
>>> aliases.  And that _is_ a problem for this document.
>>
>> I'm not convinced that things are as bad as you think. If we say in
>> draft-ietf-precis-nickname that the case mapping rule is to be applied
>> only as part of comparison and not as part of enforcement - which I
>> think is really what we care about (e.g., to prevent spoofing of users
>> in chat rooms) - then I think we might be most of the way there.
>>
>>>>> Using your
>>>>> example it is entirely reasonable to treat, "stpeter" and
>>>>> "StPeter" as equivalent in a comparison operation, but
>>>>> accepting one string and changing it to the other for display
>>>>> may not be a really good idea.  While that transformation may
>>>>> be acceptable (although I would be surprised if there were no
>>>>> people who share your surname who could consider "stpeter" or
>>>>> "Stpeter" unacceptable and might even believe that "StPeter"
>>>>> is an unacceptable substitute for "St. Peter"),
>>>>
>>>> I do receive email at stpeter@gmail.com intended for
>>>> st.peter@gmail.com but that's a separate topic...
>>>
>>> One that is relevant because it "works" as a side-effect of a
>>> decision Google has made about mailbox name equivalence, a
>>> decision that, IMO, will sooner or later get someone into a lot
>>> of trouble and,  more important, a decision and matching rule
>>> that PRECIS, AFAICT, does not allow and that IDNA unambiguously
>>> forbids.
>>>
>>>>> it also points out the
>>>>> dangers of using Basic Latin script examples to illustrate
>>>>> situations in which even more extended Latin script, much less
>>>>> other scripts, may raise more complex issues.    Because IDNA
>>>>> is essentially a workaround because changing the DNS
>>>>> comparison rules was impractical for several reasons, we
>>>>> ended up using toCaseFold to map characters and strings into
>>>>> others in IDNA2003 but PRECIS implementations that do not
>>>>> have the same constraints would, in general, be better off
>>>>> confining the use of toCaseFold, or even toLowerCase, to
>>>>> comparison operations.
>>>>
>>>> Unfortunately we didn't do that in RFC 7564 or RFC 7613. Does
>>>> it make sense for this nickname specification to differ in
>>>> this respect from the published RFCs? Shall we file errata
>>>> against those documents? (This might apply only to RFC 7613,
>>>> which says to apply case folding as part of the enforcement
>>>> process - when exactly to apply case folding is not stipulated
>>>> by RFC 7564.)
>>>
>>> To the extent to which this is a "botched that because the WG
>>> didn't understand the issues well enough" conclusion, it would
>>> be entirely reasonable to generate an updating RFC that repairs
>>> 7613 and/or 7564, even doing so in an addendum to
>>> precis-nickname if that is the only way to do that
>>> expeditiously.  Per the above, we really don't want to give
>>> library routine writers bad instructions.  As I understand it,
>>> the current position of the RFC Editor and IESG is that
>>> technical specification errors discovered in retrospect or after
>>> people start using a spec are not appropriate topics for errata.
>>> If the WG is not willing to do any of those things, then I
>>> suggest that precis-nickname at least needs to contain a very
>>> clear warning notice about this situation (see my response to
>>> your question 1 below).
>>
>> I think we'll probably need to fix 7613 and 7564. I am hoping we can fix
>> nickname now so that it is less incorrect than the other two. That
>> doesn't necessarily mean we won't need to also further fix nickname
>> later on.
>>
>> Granted, we were supposed to avoid this problem by working on all of the
>> PRECIS specs simultaneously. Clearly we have not avoided the problem, so
>> we need to solve it one way or another. If that means bis for them all,
>> we need to deal with it.
>>
>>>>> (3) Because toCaseFold loses information when used for more
>>>>> than comparison (for comparison, it merely contributes to
>>>>> what some people would consider false positives for matching),
>>>>> involves some controversial decisions and, because of
>>>>> stability requirements, cannot be changed even if the
>>>>> controversies are resolved in other ways, we end up with,
>>>>> e.g.,
>>>>>       toCaseFold ("Nuß") -> "nuss"
>>>>> which is considered an acceptable transformation in some
>>>>> places that identify themselves as speaking/using German and
>>>>> two different unacceptable errors in others.  Again, this will
>>>>> almost always be much more serious if the transformation is
>>>>> used to map and replace strings than if it is used to compare
>>>>> (fwiw, that particular example is part of a continuing
>>>>> disagreement between IDNA2008 and, among others, German
>>>>> domain registry authorities on one side and UTC and UTR 46 on
>>>>> the other).
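
The loss of information in that example can be made concrete. A minimal sketch, again assuming Python's str.casefold() as a stand-in for toCaseFold; the two helper names are hypothetical and only contrast "fold and replace" with "fold for comparison only".

```python
def store_folded(nickname: str) -> str:
    """Map-and-replace: only the folded string is kept."""
    return nickname.casefold()


def store_with_key(nickname: str) -> tuple:
    """Comparison only: keep the original for display, fold just the key."""
    return nickname, nickname.casefold()


# Distinct spellings all fold to the same string, so once the folded form
# has replaced the original there is no way back to what the user typed.
spellings = ["Nu\u00df", "NUSS", "nuss"]  # "\u00df" is LATIN SMALL LETTER SHARP S
print({store_folded(s) for s in spellings})

# Keeping the original alongside the key still allows matching.
display, key = store_with_key("Nu\u00df")
print(display, key)
```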
>>>>
>>>> Agreed.
>>>
>>> See "warning notice" comment above and question 1 response below.
>>>
>>>>> (4) If the motivation is really to avoid confusion, the
>>>>> correct confusion-blocking rule for Latin script (but not
>>>>> others) and many languages that use it (but certainly not
>>>>> all) involves moving beyond toCaseFold and treating all
>>>>> "decorated" characters (characters normally represented by
>>>>> glyphs consisting of a Basic Latin character and one or more
>>>>> diacritical or equivalent markings) as equal to their
>>>>> base characters, e.g., "á" not only matches "Á" but also
>>>>> "a" and "A" and, as an unfortunate side-effect, maybe "À"
>>>>> and "à" as well.  This is bad news for languages in which
>>>>> decorated Latin characters are used to represent phonetically
>>>>> and conceptually different characters, not just pronunciation
>>>>> variations.  I am not qualified to evaluate "how bad".   In
>>>>> addition, extrapolations from this principle about Latin
>>>>> script to unrelated scripts will almost certainly lead to
>>>>> serious errors and/or additional confusion.
>>>>
>>>> I would not be comfortable going that far...
>>>
>>> In case it isn't clear, I would not be either.  But it is where
>>> getting sloppy about this stuff could easily take us.  It is
>>> worth noting that it also identifies one of the difficulties
>>> with doing a global system to be applied to many types of
>>> applications (like the PRECIS work) and then applying it in user
>>> interface software that end users will expect to be localized to
>>> their assumptions because it has been mapped or translated into
>>> their language (if one normally speaks Upper Slobbovian but has
>>> some familiarity with English, an application interface in
>>> English will probably be expected to be "foreign", odd, and
>>> maybe even inconsistent with whatever expectations exist.  But,
>>> if the interface is in Upper Slobbovian, the natural and
>>> reasonable assumption will be that the matching should conform to
>>> normal Upper Slobbovian conventions).  FWIW, a matching rule
>>> that says:
>>>
>>>   (i) Two instances of a base character with the same
>>>     diacritical mark(s) match.
>>>   (ii) Two instances of a base character with different
>>>     diacritical mark(s) do not match.
>>>   (iii) Two instances of a base character, one with
>>>     diacritical mark(s) and the other without any decoration
>>>     match.
>>>
>>> Is precisely correct and normal behavior for at least one
>>> language that uses Latin script.  It is also the normal practice
>>> for at least one Latin script transcription system that is used
>>> by a large fraction of a billion people (maybe more).
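
Rules (i) through (iii) above could be sketched roughly as follows. This is an illustration only: it assumes that "diacritical marks" can be approximated by the combining characters produced by canonical decomposition (NFD), it deliberately leaves case alone, and the function names are hypothetical.

```python
import unicodedata


def clusters(s: str) -> list:
    """Split s into (base, marks) pairs after canonical decomposition."""
    result = []
    for ch in unicodedata.normalize("NFD", s):
        if unicodedata.combining(ch) and result:
            base, marks = result[-1]
            result[-1] = (base, marks + ch)  # attach mark to preceding base
        else:
            result.append((ch, ""))
    return result


def loose_match(a: str, b: str) -> bool:
    """Match per rules (i)-(iii): same marks, or one side undecorated."""
    ca, cb = clusters(a), clusters(b)
    if len(ca) != len(cb):
        return False
    for (base_a, marks_a), (base_b, marks_b) in zip(ca, cb):
        if base_a != base_b:
            return False
        if marks_a != marks_b and marks_a and marks_b:
            return False  # rule (ii): different decorations do not match
    return True


print(loose_match("\u00e1", "a"))       # a-acute vs. plain a
print(loose_match("\u00e1", "\u00e0"))  # a-acute vs. a-grave
```

Note that such a rule is not transitive: a-acute matches plain "a", and plain "a" matches a-grave, yet a-acute does not match a-grave. It therefore cannot be implemented by mapping each string to a single canonical key and comparing the keys.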
>>
>> That is indeed sobering.
>>
>>>>> More on this and Tom's question below...
>>>>>
>>>>>> On 9/29/15 3:28 PM, Tom Worster wrote:
>>>>>>> Peter, Alexey,
>>>>>>>
>>>>>>> I think there is an ambiguity in the specification of case
>>>>>>> mapping in RFC 7613 and draft-ietf-precis-nickname-19.
>>>>>> ...
>>>>>>> But there are 55 code points in Unicode 7.0.0 that change
>>>>>>> under default case folding that are neither uppercase nor
>>>>>>> titlecase characters, 12 of which are Lowercase_Letter. I
>>>>>>> suspect this stems from a confusion between Unicode case
>>>>>>> mapping and case folding.
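
A rough way to list such code points is sketched below. It assumes Python's str.casefold() as a stand-in for default case folding and uses the General_Category values Lu and Lt to approximate "uppercase or titlecase characters", which may not be the exact criterion used for the counts above. The results depend on the Unicode version bundled with the interpreter, so the numbers need not equal those quoted for Unicode 7.0.0.

```python
import sys
import unicodedata


def folded_but_not_upper_or_title() -> list:
    """Code points that change under case folding but are not Lu or Lt."""
    hits = []
    for cp in range(sys.maxunicode + 1):
        ch = chr(cp)
        if unicodedata.category(ch) in ("Lu", "Lt"):
            continue  # uppercase and titlecase letters are expected to change
        if ch.casefold() != ch:
            hits.append(ch)
    return hits


hits = folded_but_not_upper_or_title()
lowercase_hits = [ch for ch in hits if unicodedata.category(ch) == "Ll"]
print(len(hits), len(lowercase_hits))
```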
>>>
>>> In the context of the above, a different way to say the same
>>> thing is that people are looking at toCaseFold and assuming (and
>>> explaining things in terms of) toLowerCase.  toCaseFold works
>>> the way it is expected to and those 55 code points are, more or
>>> less, collateral damage to get to a matching algorithm that
>>> favors false positives over false negatives and various edge
>>> cases (including in "edge cases" languages spoken by, and script
>>> variations used by, millions of people).
>>
>> Sadly I suspect that is an accurate description of the current state of
>> affairs (modulo my comment above about PRECIS WG discussions at one or
>> more IETF meetings).
>>
>>>> ...
>>>> After all that, I have 3 questions:
>>>
>>> Personal opinions about answers...
>>>
>>>> (1) Is my proposed text enough of a clarification that we
>>>> should make that change before the nickname I-D is published
>>>> as an RFC?
>>>
>>> I think the clarification is an improvement and is important
>>> enough to incorporate (I know that is the answer to a slightly
>>> different question).
>>>
>>> However, I think it is inadequate without a serious warning
>>> about the situation.
>>
>> Yes.
>>
>>>  That warning could appear in either this
>>> document or RFC 7613 (or 7613bis) with a pointer from the other,
>>> but, unless you want to revise 7613 now, this one is handy.
>>
>> I suspect that we need to revise 7613. I suspect that we might also need
>> to revise 7564 (at least with respect to the order in which operations
>> are applied, since there has been some confusion among implementers).
>>
>> Well, we always knew that we would need to revise them. Just not so soon.
>>
>>> Comment about possible text below.
>>>
>>>> (2) Should we modify draft-ietf-precis-nickname so that case
>>>> folding is applied only as part of comparison and not as part
>>>> of enforcement? If so, should we make that change before this
>>>> document is published as an RFC?
>>>
>>> Yes.  If something is used for "enforcement", it should be lower
>>> casing or something else that can be explained to people who are
>>> ordinarily familiar with one or more of the scripts that make
>>> case distinctions.
>>>
>>> However, viewed in the light of this discussion, the whole
>>> "enforcement" concept becomes a little dicey, especially if, as
>>> I believe but don't have time to verify, the transformations
>>> performed by toLowerCase are not a proper subset of those
>>> performed by toCaseFold.
>>
>> My initial thought is that case mapping doesn't belong in the nickname
>> enforcement operation at all - only in the comparison operation.
>>
>>>> (3) Should we update RFC 7613 so that case folding is applied
>>>> only as part of comparison and not as part of enforcement?
>>>
>>> I think that is necessary.  Following up on the comment above, I
>>> would prefer that the current Section 3.2.2 (3) of RFC 7613
>>> either point to Unicode Lower Casing or contain a warning along
>>> the lines of that below.
>>
>> Unlike the nickname profile (which I think can be cleaned up by moving
>> the case mapping rule to the comparison operation and continuing to use
>> Unicode Default Case Folding), I think you are right that for the
>> UsernameCaseMapped profile we probably want Unicode Lower Casing. Thus
>> the likely need, sooner rather than later, for 7613bis.
>>
>>>
>>>     ----------
>>>
>>> --On Tuesday, October 27, 2015 07:52 -0600 Peter Saint-Andre
>>> <peter@andyet.net> wrote:
>>>
>>>> This issue has greater urgency now because
>>>> draft-ietf-precis-nickname is now in AUTH48...
>>>>
>>>> On 10/26/15 3:51 PM, Peter Saint-Andre - &yet wrote:
>>>>
>>>>> After all that, I have 3 questions:
>>>>>
>>>>> (1) Is my proposed text enough of a clarification that we
>>>>> should make that change before the nickname I-D is published
>>>>> as an RFC?
>>>>
>>>> I think so.
>>>
>>> See above.
>>>
>>>>> (2) Should we modify draft-ietf-precis-nickname so that case
>>>>> folding is applied only as part of comparison and not as part
>>>>> of enforcement? If so, should we make that change before this
>>>>> document is published as an RFC?
>>>>
>>>> Although it seems to be the case that Unicode case folding is
>>>> primarily designed for the purpose of matching (i.e.,
>>>> comparison),
>>>
>>> "Seems" is a little weak.  The Unicode Standard is really quite
>>> specific about that.
>>>
>>>> I have a concern that applying the PRECIS case
>>>> mapping rule after applying the normalization and
>>>> directionality rules might have unintended consequences that
>>>> we haven't had a chance to consider yet. The PRECIS framework
>>>> expresses a preference (actually a hard requirement) for
>>>> applying the rules in a particular order. We made a late
>>>> change to the username profiles (RFC 7613), such that width
>>>> mapping is applied first (in order to accommodate fullwidth
>>>> and halfwidth characters in certain East Asian scripts).
>>>> Making a late change to the nickname profile also concerns me,
>>>> even though both of these late changes seem reasonable on the
>>>> face of it. I will try to find time to think about this
>>>> further in the next 24 hours.
>>>
>>> First, a hint for the consideration process: there is a reason
>>> why Unicode now supports a unified case folding and
>>> normalization operation.  My recollection is that it is not only
>>> more efficient to perform both operations at once (rather than
>>> looking in one table and then the other), but that there are
>>> some order-dependent or priority-dependent cases.
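To illustrate the order dependence recalled above, here is a minimal Python sketch. It assumes that Python's built-in str.casefold() (full Unicode case folding) and unicodedata.normalize() are acceptable stand-ins for the Unicode operations under discussion. The helper names and the sample string are hypothetical, chosen only for illustration, and do not come from RFC 7613 or the nickname draft.

```python
import unicodedata


def fold_then_nfc(s: str) -> str:
    """Case fold first, then normalize to NFC."""
    return unicodedata.normalize("NFC", s.casefold())


def nfc_then_fold(s: str) -> str:
    """Normalize to NFC first, then case fold."""
    return unicodedata.normalize("NFC", s).casefold()


# Hypothetical sample: GREEK SMALL LETTER ALPHA, followed by
# COMBINING GREEK YPOGEGRAMMENI (U+0345) and COMBINING ACUTE ACCENT (U+0301).
# Case folding turns U+0345 into a full iota letter, which changes
# which base character the accent can combine with.
sample = "\u03b1\u0345\u0301"

a = fold_then_nfc(sample)
b = nfc_then_fold(sample)

# Show the resulting code points for each ordering.
print([f"U+{ord(c):04X}" for c in a])
print([f"U+{ord(c):04X}" for c in b])
print("same result:", a == b)
```

With this sample the two orderings should yield different code point sequences, which is one way to see why the order in which a profile applies its case mapping and normalization rules matters.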
>>>
>>> The very fact that this issue exists (and is coming up again)
>>> this late in the process (7613 published in August, WG winding
>>> down and not, e.g., meeting next week) calls at least the PRECIS
>>> quality of review and some fairly fundamental model issues into
>>> question.  I first raised that issue a rather long time ago but
>>> have continued to hope that we have an approximation to "good
>>> enough" without going back and rethinking everything.
>>>
>>> The right solution, IMO, is that, if RFC 7613 is to rationalize
>>> or explain the operation in terms of converting upper case
>>> characters to lower case, then it should be using toLowerCase
>>> because that is what the operation does.  After a quick look at
>>> 7613, amending/updating it to simply convert to lower case would
>>> be straightforward (and would not raise the ordering issue
>>> called out above).  It would presumably require another IETF
>>> Last Call, however, and I'd hope we would see some serious
>>> discussion within the WG (and with UTC) before making the change
>>> and about how it is explained.
>>>
>>> If we are not willing to make a change
>>
>> I'm willing. It would, as you note, require some careful thinking and
>> review to make sure that we got it (more) right this time.
>>
>>> that significant and/or
>>> if we conclude that the WG (and perhaps the IETF) have
>>> completely run out of energy for dealing with i18n issues [1],
>>> then I suggest that we introduce some additional text.  I've
>>> just spent a half-hour trying to find the AUTH48 copy of
>>> precis-nickname (aka RFC-to-be-7700), but the RFC Editor has
>>> apparently changed naming conventions and the various queue
>>> entry pages all point to the -19 I-D and not the current working
>>> copy so I can't try to match text and insertion point to what is
>>> there already.
>>
>> http://www.rfc-editor.org/authors/rfc7700.txt
>>
>>>  The suggestion is a patch (and a hack), not a
>>> good fix but something like it is probably the least drastic
>>> measure that would yield something that doesn't contain
>>> unexplained known defects.
>>>
>>> Rough version of suggested text (possibly to go after your
>>> revised paragraph and following up my comments in my 1 October
>>> note).  Some of the terminology needs checking which I can do if
>>> you want to go this route:
>>>
>>>     'Users of this specification should note that the
>>>     concept of "lower case conversion" is somewhat elusive
>>>     and more dependent on the conventions of different
>>>     languages and notation systems that use the same script
>>>     than may appear obvious at first glance, especially if
>>>     that glance is at Basic Latin characters (i.e., the
>>>     ASCII letter repertoire).  Unicode provides two
>>>     different mapping procedures that produce lower-case
>>>     characters, but they have different effects and results
>>>     for many characters.  The more conservative one,
>>>     typically appropriately applicable when lower case forms
>>>     are needed, is actual lower-casing (embodied in the
>>>     Unicode operation toLowerCase).  A more radical
>>>     operation, normally suitable only for string matching in
>>>     situations in which it is better to consider uncertain
>>>     cases as matching than to treat them as distinct, is
>>>     called "Case Folding" (Unicode operation toCaseFold).
>>>     While the two operations will often produce the same
>>>     results, Case Folding maps some lower case characters
>>>     into others and performs other transformations that may
>>>     be intuitively reasonable and expected for some users
>>>     and quite astonishing (or just wrong) to others.  There
>>>     may be no practical alternative, especially if the
>>>     operations are to be used for mapping or enforcement, to
>>>     developers of PRECIS-dependent applications understanding
>>>     that the cases in which the two yield different results
>>>     require careful understanding of the relevant user base
>>>     and its needs [2].'
>>
>> Thanks.
>>
>> I am not sure if we need something like that if we move case mapping
>> (here, case folding) to the comparison operation only - but something
>> like that might still be appropriate.
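The difference between the two operations described in the suggested text can be sketched in a few lines of Python. This assumes that str.lower() is a reasonable approximation of Unicode toLowerCase (without any locale tailoring) and that str.casefold() approximates toCaseFold with full folding. The function name and the sample strings are hypothetical and serve only as illustration.

```python
def lower_and_fold(text: str) -> tuple[str, str]:
    """Return the lower-cased and the case-folded form of text."""
    return text.lower(), text.casefold()


# Hypothetical samples: plain ASCII, LATIN SMALL LETTER SHARP S (U+00DF),
# and GREEK SMALL LETTER FINAL SIGMA (U+03C2).
samples = ["Peter", "\u00df", "\u03c2"]

for text in samples:
    lowered, folded = lower_and_fold(text)
    # Both inputs beyond ASCII are already lower case, so lower-casing
    # leaves them alone while case folding maps them to other characters.
    print(repr(text), repr(lowered), repr(folded), lowered == folded)
```

For ASCII input the two results agree. For the sharp s and the final sigma, lower-casing leaves the character unchanged, whereas case folding maps it to a different lower case sequence, which is the kind of surprise the suggested text warns about when folding is used for enforcement rather than only for comparison.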
>>
>>>>> (3) Should we update RFC 7613 so that case folding is applied
>>>>> only as part of comparison and not as part of enforcement?
>>>>
>>>> That is less urgent so I suggest that we address the nickname
>>>> spec first.
>>>
>>> Unless you (or someone else here) have a plausible plan to
>>> continue and revitalize the WG and assign it that revision work
>>> (and bring everyone actively participating up to the level
>>> needed to easily understand this discussion thread and feel
>>> embarrassed for not spotting the problems), I think we need to
>>> assume that this is our last shot.  Absent an active and
>>> committed WG, "do this first" could easily be equivalent to
>>> "don't get around to the other, ever".
>>
>> As mentioned, I don't want to have broken RFCs out there.
>>
>>> I think that the particular set of issues that started this
>>> thread is a known defect in the PRECIS specs, both nickname and
>>> 7613, and that we are obligated to either fix the problems or at
>>> least explain them.  The above warning text is an attempt to
>>> explain and identify the problems even if it does not actually
>>> provide a solution.  If it were published as part of
>>> precis-nickname, it could include a statement to the effect that
>>> it should also be treated as an update to 7613 or, if the IESG
>>> and RFC Editor would agree in advance to accept, rather than
>>> bury, the thing, I suppose we could publish it in
>>> precis-nickname and create an erratum to 7613 indicating that it
>>> should have included some form of that statement.  Neither
>>> option implies a huge amount of work to update 7613.  But I
>>> think that making the changes of (2) without doing anything
>>> about (3) makes the two documents inconsistent with each other
>>> and that would be an additional known defect.
>>>
>>> Procedural question: given that precis-nickname is in AUTH48 as
>>> of yesterday and I don't see anything blocking publication next
>>> week if you and Barry sign off on the revised text that the WG
>>> hasn't seen,
>>
>> There is no revised text yet. That's why we're having this discussion.
>>
>>> does someone need to file a pro forma objection/
>>> appeal to block that until this is sorted out and the WG has a
>>> chance to review proposed publication text?
>>
>> I see no reason to invoke the specter of appeals quite yet. Everyone is
>> working in good faith to do the right thing and get this mess cleaned up.
>>
>>> [1] I believe our collective inability to deal with the
>>> within-script character forms that do not normalize to each
>>> other because of language-dependent or other usage factors can
>>> be taken as evidence of having run out of energy,
>>
>> Or in my case simple ignorance of some of the relevant issues and
>> examples. It's not easy to know about all of this.
>>
>>> but it is
>>> probably in the interest of finishing the PRECIS work to try to
>>> treat that as a separate issue.
>>
>> Probably.
>>
>>> [2] Not unlike the reason to differentiate between NFC and NFKC
>>> and understand the effects of each.
>>
>> Another thing that's not easy to grok in fullness.
>>
>> Peter
>>