Re: [precis] Ambiguity in specification of case mapping in RFC 7613 and draft-ietf-precis-nickname

Peter Saint-Andre <peter@andyet.net> Wed, 28 October 2015 20:06 UTC

To: John C Klensin <john-ietf@jck.com>, Tom Worster <fsb@thefsb.org>, Alexey Melnikov <Alexey.Melnikov@isode.com>
References: <0347834EBDC481BD99BDBE67@JcK-HP8200.jck.com> <56302E6D.5030901@andyet.net>
From: Peter Saint-Andre <peter@andyet.net>
Message-ID: <56312AAC.1000300@andyet.net>
Date: Wed, 28 Oct 2015 14:06:04 -0600
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.10; rv:38.0) Gecko/20100101 Thunderbird/38.2.0
MIME-Version: 1.0
In-Reply-To: <56302E6D.5030901@andyet.net>
Content-Type: text/plain; charset="utf-8"; format="flowed"
Content-Transfer-Encoding: 8bit
Archived-At: <http://mailarchive.ietf.org/arch/msg/precis/c34jwtWaQDsO-hYamFP84pO5ym0>
Cc: precis@ietf.org
Subject: Re: [precis] Ambiguity in specification of case mapping in RFC 7613 and draft-ietf-precis-nickname

I propose the following text changes:

###

OLD

    3.  Case Mapping Rule: Uppercase and titlecase characters MUST be
        mapped to their lowercase equivalents using Unicode Default Case
        Folding as defined in the Unicode Standard [Unicode] (at the time
        of this writing, the algorithm is specified in Chapter 3 of
        [Unicode7.0]).  In applications that prohibit conflicting
        nicknames, this rule helps to reduce the possibility of confusion
        by ensuring that nicknames differing only by case (e.g.,
        "stpeter" vs. "StPeter") would not be presented to a human user
        at the same time.

NEW

    3.  Case Mapping Rule: Unicode Default Case Folding MUST be applied,
        as defined in the Unicode Standard [Unicode] (at the time
        of this writing, the algorithm is specified in Chapter 3 of
        [Unicode7.0]).  The primary result of doing so is that uppercase
        characters are mapped to lowercase characters. In applications
        that prohibit conflicting nicknames, this rule helps to reduce
        the possibility of confusion by ensuring that nicknames
        differing only by case (e.g., "stpeter" vs. "StPeter") would not
        be presented to a human user at the same time.

###

(The foregoing was previously sent to the list.)

###

OLD

2.3.  Enforcement

    An entity that performs enforcement according to this profile MUST
    prepare a string as described in Section 2.2 and MUST also apply the
    rules specified in Section 2.1.  The rules MUST be applied in the
    order shown.

    After all of the foregoing rules have been enforced, the entity MUST
    ensure that the nickname is not zero bytes in length (this is done
    after enforcing the rules to prevent applications from mistakenly
    omitting a nickname entirely, because when internationalized
    characters are accepted, a non-empty sequence of characters can
    result in a zero-length nickname after canonicalization).

2.4.  Comparison

    An entity that performs comparison of two strings according to this
    profile MUST prepare each string and enforce the rules as specified
    in Sections 2.2 and 2.3.  The two strings are to be considered
    equivalent if they are an exact octet-for-octet match (sometimes
    called "bit-string identity").

NEW

2.3.  Enforcement

    An entity that performs enforcement according to this profile MUST
    prepare a string as described in Section 2.2 and MUST also apply the
    following rules specified in Section 2.1 in the order shown:

    1. Additional Mapping Rule
    2. Normalization Rule
    3. Directionality Rule

    After all of the foregoing rules have been enforced, the entity MUST
    ensure that the nickname is not zero bytes in length (this is done
    after enforcing the rules to prevent applications from mistakenly
    omitting a nickname entirely, because when internationalized
    characters are accepted, a non-empty sequence of characters can
    result in a zero-length nickname after canonicalization).

2.4.  Comparison

    An entity that performs comparison of two strings according to this
    profile MUST prepare each string as specified in Section 2.2 and
    MUST apply the following rules specified in Section 2.1 in the order
    shown:

    1. Additional Mapping Rule
    2. Case Mapping Rule
    3. Normalization Rule
    4. Directionality Rule

    The two strings are to be considered equivalent if they are an exact
    octet-for-octet match (sometimes called "bit-string identity").
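
To make the split between the two operations concrete, here is a minimal Python sketch of enforcement and comparison as proposed above. It is illustrative only. The function names are hypothetical; the space handling in the Additional Mapping Rule and the use of NFKC for the Normalization Rule are assumptions about what Section 2.1 specifies; the Section 2.2 preparation step is omitted; the Directionality Rule is treated as a no-op; and Python's str.casefold() stands in for Unicode Default Case Folding.

```python
import unicodedata


def additional_mapping(s):
    # Assumed rule: map every space separator (category Zs) to U+0020,
    # trim both ends, and collapse interior runs of spaces to one space.
    spaced = "".join(" " if unicodedata.category(c) == "Zs" else c for c in s)
    return " ".join(part for part in spaced.split(" ") if part)


def enforce(nickname):
    # Enforcement as proposed: no case mapping, so the user's casing survives.
    out = additional_mapping(nickname)
    out = unicodedata.normalize("NFKC", out)  # assumed Normalization Rule
    # Directionality Rule: assumed to add no check in this sketch.
    if len(out.encode("utf-8")) == 0:
        raise ValueError("nickname is zero bytes long after enforcement")
    return out


def comparison_key(nickname):
    # Comparison as proposed: case folding sits between mapping and NFKC.
    out = additional_mapping(nickname)
    out = out.casefold()
    out = unicodedata.normalize("NFKC", out)
    return out


def equivalent(a, b):
    # Octet-for-octet match of the UTF-8 encodings of the two keys.
    return comparison_key(a).encode("utf-8") == comparison_key(b).encode("utf-8")
```

With this arrangement enforce("StPeter") keeps the capital letters for display, while equivalent("stpeter", "StPeter") is true, so an application can still refuse conflicting nicknames.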

###

In addition, some variation on John's proposed text about toLowerCase 
vs. toCaseFold might be appropriate at the end of Section 4; however, 
I'm still not sure that is necessary if we move the case mapping rule to 
the comparison operation.

Peter

On 10/27/15 8:09 PM, Peter Saint-Andre wrote:
> On 10/27/15 11:32 AM, John C Klensin wrote:
>> Response to Monday's note immediately below; response to today's
>> follows it.  My apologies, but it is probably important to read
>> both.  My further apologies for the length of this note, but I
>> think we are in deep trouble here,
>
> Internationalization always seems to be a matter of how deep the trouble
> is...
>
>> trouble that is aggravated by
>> precis-mappings and precis-nickname both being post-approval and
>> that, as far as I know, there are no future plans for PRECIS
>> work (having precis-nickname in AUTH48 just emphasizes that --
>> see comment at end).
>
> We had not planned to work on PRECIS because we thought we were done for
> a while. If that's not the case and we need to fix things, then so be it.
> Whether there is sufficient and continued energy for such work is
> another question. Personally I don't want us to have broken RFCs out there.
>
>> --On Monday, October 26, 2015 15:51 -0600 Peter Saint-Andre -
>> &yet <peter@andyet.net> wrote:
>>
>>> My apologies for the delayed reply. Comments inline.
>>
>> A few remarks below... I can't tell whether we disagree or
>> whether at least one of us, probably me, is not being
>> adequately clear.  (Material on which we fairly clearly agree
>> elided.)
>>
>>
>>> On 10/1/15 7:50 AM, John C Klensin wrote:
>>>> --On Wednesday, September 30, 2015 15:16 -0600 Peter
>>>> Saint-Andre - &yet <peter@andyet.net> wrote:
>>> ...
>>>> Peter,
>>>>
>>>> While your proposed text is an improvement,
>>>
>>> Happy to hear it. All I intended was a slight clarification.
>>
>> But I'm not certain we are there yet...
>
> Agreed. The text I proposed addressed only a very small part of the
> problem.
>
>>>> the desire of many
>>>> people for a magic "just tell me what to do" formula, one that
>>>> lets them avoid understanding the issues, may call for a
>>>> little more:
>>>
>>> There is always a need for more when it comes to i18n.
>>
>> But I think it is a little more than that.  I've heard several
>> times, including in PRECIS meetings, requests for "just tell me
>> what to do and make sure it isn't complicated" (or "I don't want
>> to have to think about, much less understand, the issues").  We
>> can debate whether giving in to those requests in the I18n case
>> is wise.  I think it leads directly to conclusions equivalent to
>> "I understand my own script and writing system (or think I do)
>> and therefore, since all writing systems must be pretty much the
>> same, I understand all of the core issues in terms of my script
>> and understanding".   That, in turn, leads directly to the "how
>> do you spell 'Zürich'?" and "all spellings of 'Zuerich' should
>> be treated as equivalent" discussions that sounded like they
>> dominated a BOF at IETF 93.
>>
>> Now I actually think it is reasonable for someone to ask for a
>> library that will do the job most of the time and that will
>> almost never cause their users or customers to get angry at
>> them.  But, if we are going to call what we do "standards", they
>> should contain sufficient information that would-be library
>> authors can know what to do ... or understand that they are in
>> over their heads.  And, for these particular cases, we may need
>> to explain, or help the library authors explain, why some cases
>> will fail and, indeed, get users mad at vendors.
>>
>>
>>>> (1) First, toCaseFold is _not_ toLowerCase.  Saying "The
>>>> primary result of doing so is that uppercase characters are
>>>> mapped to lowercase characters" is true for toCaseFold,
>>>
>>> By "primary" I meant two things: (1) lowercasing is what
>>> happens to the preponderance of code points and (2) this is
>>> the result that most people care about.
>>
>> If I parse the above correctly, I think you are wrong.   I think
>> what most people want, care about, and think they are getting,
>> is lower case conversion, i.e., an operation that preserves
>> lower case characters and converts upper case characters to the
>> equivalent lower case.  toCaseFold isn't that operation.  It is
>> a much more complex and subtle operation that, as well as
>> converting upper case characters to lower case, sometimes
>> converts lower case characters to different lower case
>> characters (or strings of them).  It also requires a fairly good
>> understanding of Unicode (not just a relevant script) and
>> historical Unicode decisions to predict its behavior and to have
>> any hope of explaining that behavior to users.   If one is
>> trying to compare (as distinct from converting), then toCaseFold
>> may be exactly what is wanted, but it is really hard to explain
>> or justify that in terms of "nicknames" or "aliases", which are
>> about conversion.   And, if one hopes to explain what is going
>> on to users in terms of "lower casing", then toCaseFold is just
>> the wrong operation.  That is what toLowerCase is for and the
>> two operations are just not equivalent.
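
The difference John describes can be seen in a few lines of Python. This is only a sketch: str.lower() and str.casefold() are used as stand-ins for toLowerCase and toCaseFold, and the helper name is hypothetical. The point it illustrates is that case folding also rewrites characters that are already lowercase.

```python
def lower_vs_fold(s):
    """Return (lower-cased, case-folded) forms of s for side-by-side comparison."""
    return s.lower(), s.casefold()


# One Basic Latin string, one with sharp s, one with a final sigma.
for text in ("StPeter", "nuß", "οδος"):
    lowered, folded = lower_vs_fold(text)
    print(f"{text!r}: lower={lowered!r} casefold={folded!r} same={lowered == folded}")
```

For the Basic Latin input the two results agree. For the other two inputs, which contain no uppercase characters at all, lower-casing returns the input unchanged while case folding returns a different string.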
>
> My recollection, quite possibly inaccurate or incomplete, from at least
> one and I think several in-person meetings of the PRECIS WG was: just
> use Unicode Default Case Folding because if you use anything else or try
> to roll your own you will be fubar forever. I do not recall any
> discussion of the issues you have raised in this thread (e.g., about the
> inadvisability of using case folding for anything but comparison
> operations) until the last few weeks. However, I freely admit that's
> probably because, through my own faults and ignorance, I didn't
> understand what you were saying.
>
>> FWIW and purely by coincidence wrt PRECIS and this document, I
>> had a conversation a few days ago with an expert on Arabic (and
>> Persian) calligraphy and writing systems (and good general
>> knowledge of writing systems) who is quite insistent that any
>> procedure we use for case-insensitive matching (e.g., case
>> folding) is discriminatory, inconsistent, and just
>> badly-thought-out if that same procedure doesn't treat isolated,
>> initial, and medial forms of the same character as equivalent.
>> He further strengthens his case (sic) by noting that Unicode
>> case folding maps U+03C2 (GREEK SMALL LETTER FINAL SIGMA,
>> unambiguously a lower-case character) to U+03C3 (GREEK SMALL
>> LETTER SIGMA), a relationship that depends entirely on
>> positional use and not case.  He also believes the same
>> relationships should apply to all other scripts that make form
>> distinctions for some characters based on positions in a string
>> and for which Unicode has chosen to assign different code
>> points.  Even if there were wide acceptance of his view, Unicode
>> stability principles would prevent changing toCaseFold (or
>> CaseFolding.txt), but this is more evidence that what toCaseFold
>> does and does not do is going to be hard to explain to either
>> casual users or to writing system experts whose primary
>> experience is not with the Greek-Latin-Cyrillic group.
>>
>> I don't think we want to say "these matching rules are somewhat
>> arbitrary and irrational, but, if you don't like it, blame
>> Unicode and not us", if only because it is our choice to use
>> those matching rules.  More below.
>>
>>
>>> ...
>>>> (2) Second, probably as a result of having IDNA in the lead,
>>>> we've gotten sloppy about language and operations and should
>>>> probably start untangling that before it gets people in
>>>> trouble.
>>>
>>> Where is the right place to do that untangling? (I doubt that
>>> it is the precis-nickname document.)
>>
>> I agree that precis-nickname isn't the ideal place.  I also
>> believe that you and it are the innocent victims of the
>> situation.  At the same time, I don't believe IETF should be
>> producing incomplete, ambiguous, erroneous, or misleading
>> standards because no one could get around to doing the right
>> foundational work.
>
> Agreed. I too want to get this right, even though it's not a lot of fun
> and it's certainly more work than I thought I was signing up for at the
> NEWPREP BoF years ago.
>
>>>> The Unicode Standard, at least as I understand it, is fairly
>>>> clear that the most important (and really only safe) use of
>>>> toCaseFold is as part of a comparison operation.
>>>
>>> Thanks for noting that. For example, Section 5.18 of Unicode
>>> 8.0.0 says:
>>>
>>>      Caseless matching is implemented using case folding, which
>>> is the
>>>      process of mapping characters of different case to a
>>> single form, so
>>>      that case differences in strings are erased. Case folding
>>> allows for
>>>      fast caseless matches in lookups because only binary
>>> comparison is
>>>      required. It is more than just conversion to lowercase.
>>
>> Right.  But, again, when its use is appropriate (a very
>> controversial topic in itself with our painful IDNA history with
>> Final Sigma, Eszett and the case-independent versus
>> position-independent controversy called out above as examples)
>> that is "matches in lookups" (what I've described elsewhere as
>> "comparison only").  Not creating or defining nicknames or
>> aliases.  And that _is_ a problem for this document.
>
> I'm not convinced that things are as bad as you think. If we say in
> draft-ietf-precis-nickname that the case mapping rule is to be applied
> only as part of comparison and not as part of enforcement - which I
> think is really what we care about (e.g., to prevent spoofing of users
> in chat rooms) - then I think we might be most of the way there.
>
>>>> Using your
>>>> example it is entirely reasonable to treat, "stpeter" and
>>>> "StPeter" as equivalent in a comparison operation, but
>>>> accepting one string and changing it to the other for display
>>>> may not be a really good idea.  While that transformation may
>>>> be acceptable (although I would be surprised if there were no
>>>> people who share your surname who could consider "stpeter" or
>>>> "Stpeter" unacceptable and might even believe that "StPeter"
>>>> is an unacceptable substitute for "St. Peter"),
>>>
>>> I do receive email at stpeter@gmail.com intended for
>>> st.peter@gmail.com but that's a separate topic...
>>
>> One that is relevant because it "works" as a side-effect of a
>> decision Google has made about mailbox name equivalence, a
>> decision that, IMO, will sooner or later get someone into a lot
>> of trouble and,  more important, a decision and matching rule
>> that PRECIS, AFAICT, does not allow and that IDNA unambiguously
>> forbids.
>>
>>>> it also points out the
>>>> dangers of using Basic Latin script examples to illustrate
>>>> situations in which even more extended Latin script, much less
>>>> other scripts, may raise more complex issues.    Because IDNA
>>>> is essentially a workaround because changing the DNS
>>>> comparison rules was impractical for several reasons, we
>>>> ended up using toCaseFold to map characters and strings into
>>>> others in IDNA2003 but PRECIS implementations that do not
>>>> have the same constraints would, in general, be better off
>>>> confining the use of toCaseFold, or even toLowerCase, to
>>>> comparison operations.
>>>
>>> Unfortunately we didn't do that in RFC 7564 or RFC 7613. Does
>>> it make sense for this nickname specification to differ in
>>> this respect from the published RFCs? Shall we file errata
>>> against those documents? (This might apply only to RFC 7613,
>>> which says to apply case folding as part of the enforcement
>>> process - when exactly to apply case folding is not stipulated
>>> by RFC 7564.)
>>
>> To the extent to which this is a "botched that because the WG
>> didn't understand the issues well enough" conclusion, it would
>> be entirely reasonable to generate an updating RFC that repairs
>> 7613 and/or 7564, even doing so in an addendum to
>> precis-nickname if that is the only way to do that
>> expeditiously.  Per the above, we really don't want to give
>> library routine writers bad instructions.  As I understand it,
>> the current position of the RFC Editor and IESG is that
>> technical specification errors discovered in retrospect or after
>> people start using a spec are not appropriate topics for errata.
>> If the WG is not willing to do any of those things, then I
>> suggest that precis-nickname at least needs to contain a very
>> clear warning notice about this situation (see my response to
>> your question 1 below).
>
> I think we'll probably need to fix 7613 and 7564. I am hoping we can fix
> nickname now so that it is less incorrect than the other two. That
> doesn't necessarily mean we won't need to also further fix nickname
> later on.
>
> Granted, we were supposed to avoid this problem by working on all of the
> PRECIS specs simultaneously. Clearly we have not avoided the problem, so
> we need to solve it one way or another. If that means bis for them all,
> we need to deal with it.
>
>>>> (3) Because toCaseFold loses information when used for more
>>>> than comparison (for comparison, it merely contributes to
>>>> what some people would consider false positives for matching),
>>>> involves some controversial decisions, and, because of
>>>> stability requirements, cannot be changed even if the
>>>> controversies are resolved in other ways, we end up with,
>>>> e.g.,
>>>>       toCaseFold ("Nuß") -> "nuss"
>>>> which is considered an acceptable transformation in some
>>>> places that identify themselves as speaking/using German and
>>>> two different unacceptable errors in others.  Again, this will
>>>> almost always be much more serious if the transformation is
>>>> used to map and replace strings than if it is used to compare
>>>> (fwiw, that particular example is part of a continuing
>>>> disagreement between IDNA2008 and, among others, German
>>>> domain registry authorities on one side and UTC and UTR 46 on
>>>> the other).
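
A small Python sketch of "fold to compare, never to replace" follows. The class name and methods are hypothetical and str.casefold() is assumed as the folding operation; the only idea shown is that the folded string is kept as a private lookup key while the spelling the user supplied is what gets stored and displayed.

```python
class NicknameRegistry:
    """Stores nicknames as submitted; uses a case-folded key only for matching."""

    def __init__(self):
        self._by_key = {}

    def register(self, nickname):
        key = nickname.casefold()  # comparison key, never shown to users
        if key in self._by_key:
            raise ValueError(
                f"{nickname!r} conflicts with {self._by_key[key]!r}"
            )
        self._by_key[key] = nickname  # original spelling is preserved
        return nickname

    def display_names(self):
        return list(self._by_key.values())
```

Registering "Nuß" keeps the sharp s for display, and a later attempt to register "NUSS" is rejected as a conflict. Whether that rejection is a desirable false positive is exactly the policy question raised above.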
>>>
>>> Agreed.
>>
>> See "warning notice" comment above and question 1 response below.
>>
>>>> (4) If the motivation is really to avoid confusion, the
>>>> correct confusion-blocking rule for Latin script (but not
>>>> others) and many languages that use it (but certainly not
>>>> all) involves moving beyond toCaseFold and making all
>>>> "decorated" characters (characters normally represented by
>>>> glyphs consisting of a Basic Latin character and one or more
>>>> diacritical or equivalent markings) compare equal to their
>>>> base characters, e.g., "á" not only matches "Á" but also
>>>> "a" and "A" and, as an unfortunate side-effect, maybe "À"
>>>> and "à" as well.  This is bad news for languages in which
>>>> decorated Latin characters are used to represent phonetically
>>>> and conceptually different characters, not just pronunciation
>>>> variations.  I am not qualified to evaluate "how bad".   In
>>>> addition, extrapolations from this principle about Latin
>>>> script to unrelated scripts will almost certainly lead to
>>>> serious errors and/or additional confusion.
>>>
>>> I would not be comfortable going that far...
>>
>> In case it isn't clear, I would not be either.  But it is where
>> getting sloppy about this stuff could easily take us.  It is
>> worth noting that it also identifies one of the difficulties
>> with doing a global system to be applied to many types of
>> applications (like the PRECIS work) and then applying it in user
>> interface software that end users will expect to be localized to
>> their assumptions because it has been mapped or translated into
>> their language (if one normally speaks Upper Slobbovian but has
>> some familiarity with English, an application interface in
>> English will probably be expected to be "foreign", odd, and
>> maybe even inconsistent with whatever expectations exist.  But,
>> if the interface is in Upper Slobbovian, the natural and
>> reasonable assumption will be that the matching should conform to
>> normal Upper Slobbovian conventions).  FWIW, a matching rule
>> that says:
>>
>>   (i) Two instances of a base character with the same
>>     diacritical mark(s) match.
>>   (ii) Two instances of a base character with different
>>     diacritical mark(s) do not match.
>>   (iii) Two instances of a base character, one with
>>     diacritical mark(s) and the other without any decoration
>>     match.
>>
>> is precisely correct and normal behavior for at least one
>> language that uses Latin script.  It is also the normal practice
>> for at least one Latin script transcription system that is used
>> by a large fraction of a billion people (maybe more).
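
The three-part rule can be sketched in Python as follows. This is an assumption-laden illustration, not a proposal: the function names are hypothetical, "diacritical mark" is approximated as any character in a Unicode Mark category after canonical decomposition (NFD), and case is not considered at all.

```python
import unicodedata


def clusters(s):
    """Split s into (base, marks) pairs using canonical decomposition (NFD)."""
    result = []
    for ch in unicodedata.normalize("NFD", s):
        if unicodedata.category(ch).startswith("M") and result:
            base, marks = result[-1]
            result[-1] = (base, marks + ch)  # attach mark to preceding base
        else:
            result.append((ch, ""))
    return result


def matches(a, b):
    ca, cb = clusters(a), clusters(b)
    if len(ca) != len(cb):
        return False
    for (base_a, marks_a), (base_b, marks_b) in zip(ca, cb):
        if base_a != base_b:
            return False
        # (i) identical marks match; (iii) decorated vs. undecorated match;
        # (ii) two different, non-empty sets of marks do not match.
        if marks_a and marks_b and marks_a != marks_b:
            return False
    return True
```

Note that this relation is not transitive: "á" matches "a" and "a" matches "à", but "á" does not match "à". A rule like this therefore cannot be implemented by mapping each string to a canonical key and comparing octets, which is the model the comparison text in these profiles relies on.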
>
> That is indeed sobering.
>
>>>> More on this and Tom's question below...
>>>>
>>>>> On 9/29/15 3:28 PM, Tom Worster wrote:
>>>>>> Peter, Alexey,
>>>>>>
>>>>>> I think there is an ambiguity in the specification of case
>>>>>> mapping in RFC 7613 and draft-ietf-precis-nickname-19.
>>>>> ...
>>>>>> But there are 55 code points in Unicode 7.0.0 that change
>>>>>> under default case folding that are neither uppercase nor
>>>>>> titlecase characters, 12 of which are Lowercase_Letter. I
>>>>>> suspect this stems from a confusion between Unicode case
>>>>>> mapping and case folding.
>>
>> In the context of the above, a different way to say the same
>> thing is that people are looking at toCaseFold and assuming (and
>> explaining things in terms of) toLowerCase.  toCaseFold works
>> the way it is expected to and those 55 code points are, more or
>> less, collateral damage to get to a matching algorithm that
>> favors false positives over false negatives and various edge
>> cases (including in "edge cases" languages spoken by, and script
>> variations used by, millions of people).
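
For anyone who wants to look at the affected code points directly, the following Python sketch enumerates them. It rests on several assumptions: str.casefold() is used as the folding operation, "uppercase" and "titlecase" are taken to mean general categories Lu and Lt, and the Unicode version is whatever the local Python build ships. The totals it prints therefore need not equal the figures Tom reported for Unicode 7.0.0.

```python
import sys
import unicodedata


def folded_but_not_upper_or_title():
    """Code points that change under case folding yet are neither Lu nor Lt."""
    hits = []
    for cp in range(sys.maxunicode + 1):
        ch = chr(cp)
        if ch.casefold() != ch and unicodedata.category(ch) not in ("Lu", "Lt"):
            hits.append(ch)
    return hits


hits = folded_but_not_upper_or_title()
lowercase_hits = [c for c in hits if unicodedata.category(c) == "Ll"]
print("Unicode version:", unicodedata.unidata_version)
print("changed by folding, not Lu/Lt:", len(hits))
print("of which Lowercase_Letter:", len(lowercase_hits))
```

The list includes U+00DF (sharp s) and U+03C2 (final sigma), both of which come up elsewhere in this thread.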
>
> Sadly I suspect that is an accurate description of the current state of
> affairs (modulo my comment above about PRECIS WG discussions at one or
> more IETF meetings).
>
>>> ...
>>> After all that, I have 3 questions:
>>
>> Personal opinions about answers...
>>
>>> (1) Is my proposed text enough of a clarification that we
>>> should make that change before the nickname I-D is published
>>> as an RFC?
>>
>> I think the clarification is an improvement and is important
>> enough to incorporate (I know that is the answer to a slightly
>> different question).
>>
>> However, I think it is inadequate without a serious warning
>> about the situation.
>
> Yes.
>
>>  That warning could appear in either this
>> document or RFC 7613 (or 7613bis) with a pointer from the other,
>> but, unless you want to revise 7613 now, this one is handy.
>
> I suspect that we need to revise 7613. I suspect that we might also need
> to revise 7564 (at least with respect to the order in which operations
> are applied, since there has been some confusion among implementers).
>
> Well, we always knew that we would need to revise them. Just not so soon.
>
>> Comment about possible text below.
>>
>>> (2) Should we modify draft-ietf-precis-nickname so that case
>>> folding is applied only as part of comparison and not as part
>>> of enforcement? If so, should we make that change before this
>>> document is published as an RFC?
>>
>> Yes.  If something is used for "enforcement", it should be lower
>> casing or something else that can be explained to people who are
>> ordinarily familiar with one or more of the scripts that make
>> case distinctions.
>>
>> However, viewed in the light of this discussion, the whole
>> "enforcement" concept becomes a little dicey, especially if, as
>> I believe but don't have time to verify, the transformations
>> performed by toLowerCase are not a proper subset of those
>> performed by toCaseFold.
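
One way to probe that belief is to look for inputs where lower-casing produces something case folding does not. The Python sketch below uses str.lower() and str.casefold() as stand-ins for toLowerCase and toCaseFold, which is an assumption; in particular, CPython's lower() applies the context-sensitive final sigma rule, and the Cherokee example depends on the Python build using Unicode 8.0 or later data. The helper name is hypothetical.

```python
def lower_and_fold_disagree(s):
    """True when lower-casing and case folding give different results for s."""
    return s.lower() != s.casefold()


# Greek capital sigma at the end of a word: lower-casing picks the final
# form, while case folding always produces the non-final small sigma.
greek = "ΟΔΟΣ"

# CHEROKEE LETTER A (U+13A0): lower-casing maps it to the small letter,
# while case folding leaves the capital form in place.
cherokee = "\u13a0"

for sample in (greek, cherokee, "StPeter"):
    print(repr(sample), repr(sample.lower()), repr(sample.casefold()),
          lower_and_fold_disagree(sample))
```

If these stand-ins are faithful, the two operations disagree even on all-uppercase input, so neither can be described simply as a restricted form of the other.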
>
> My initial thought is that case mapping doesn't belong in the nickname
> enforcement operation at all - only in the comparison operation.
>
>>> (3) Should we update RFC 7613 so that case folding is applied
>>> only as part of comparison and not as part of enforcement?
>>
>> I think that is necessary.  Following up on the comment above, I
>> would prefer that the current Section 3.2.2 (3) of RFC 7613
>> either point to Unicode Lower Casing or contain a warning along
>> the lines of that below.
>
> Unlike the nickname profile (which I think can be cleaned up by moving
> the case mapping rule to the comparison operation and continuing to use
> Unicode Default Case Folding), I think you are right that for the
> UsernameCaseMapped profile we probably want Unicode Lower Casing. Thus
> the likely need, sooner rather than later, for 7613bis.
>
>>
>>     ----------
>>
>> --On Tuesday, October 27, 2015 07:52 -0600 Peter Saint-Andre
>> <peter@andyet.net> wrote:
>>
>>> This issue has greater urgency now because
>>> draft-ietf-precis-nickname is now in AUTH48...
>>>
>>> On 10/26/15 3:51 PM, Peter Saint-Andre - &yet wrote:
>>>
>>>> After all that, I have 3 questions:
>>>>
>>>> (1) Is my proposed text enough of a clarification that we
>>>> should make that change before the nickname I-D is published
>>>> as an RFC?
>>>
>>> I think so.
>>
>> See above.
>>
>>>> (2) Should we modify draft-ietf-precis-nickname so that case
>>>> folding is applied only as part of comparison and not as part
>>>> of enforcement? If so, should we make that change before this
>>>> document is published as an RFC?
>>>
>>> Although it seems to be the case that Unicode case folding is
>>> primarily designed for the purpose of matching (i.e.,
>>> comparison),
>>
>> "Seems" is a little weak.  The Unicode Standard is really quite
>> specific about that.
>>
>>> I have a concern that applying the PRECIS case
>>> mapping rule after applying the normalization and
>>> directionality rules might have unintended consequences that
>>> we haven't had a chance to consider yet. The PRECIS framework
>>> expresses a preference (actually a hard requirement) for
>>> applying the rules in a particular order. We made a late
>>> change to the username profiles (RFC 7613), such that width
>>> mapping is applied first (in order to accommodate fullwidth
>>> and halfwidth characters in certain East Asian scripts).
>>> Making a late change to the nickname profile also concerns me,
>>> even though both of these late changes seem reasonable on the
>>> face of it. I will try to find time to think about this
>>> further in the next 24 hours.
>>
>> First, a hint for the consideration process: there is a reason
>> why Unicode now supports a unified case folding and
>> normalization operation.  My recollection is that it is not only
>> more efficient to perform both operations at once (rather than
>> looking in one table and then the other), but that there are
>> some order-dependent or priority-dependent cases.
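
A concrete order-dependent case is easy to show in Python. This is a sketch under the assumption that str.casefold() and unicodedata.normalize("NFKC", ...) correspond to the case mapping and normalization rules under discussion; the function names are hypothetical.

```python
import unicodedata


def fold_then_nfkc(s):
    # Case mapping first, then normalization.
    return unicodedata.normalize("NFKC", s.casefold())


def nfkc_then_fold(s):
    # Normalization first, then case mapping.
    return unicodedata.normalize("NFKC", s).casefold()


sample = "\u2122"  # TRADE MARK SIGN
print(repr(fold_then_nfkc(sample)), repr(nfkc_then_fold(sample)))
```

The trade mark sign has no case folding of its own, but its compatibility decomposition is the two capital letters "TM". Folding first therefore leaves uppercase letters in the output, whereas normalizing first yields "tm". Any profile that applies the two rules as separate steps has to pick an order with cases like this in mind.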
>>
>> The very fact that this issue exists (and is coming up again)
>> this late in the process (7613 published in August, WG winding
>> down and not, e.g., meeting next week) calls at least the PRECIS
>> quality of review and some fairly fundamental model issues into
>> question.  I first raised that issue a rather long time ago but
>> have continued to hope that we have an approximation to "good
>> enough" without going back and rethinking everything.
>>
>> The right solution, IMO, is that, if RFC 7613 is to rationalize
>> or explain the operation in terms of converting upper case
>> characters to lower case, then it should be using toLowerCase
>> because that is what the operation does.  After a quick look at
>> 7613, amending/updating it to simply convert to lower case would
>> be straightforward (and would not raise the ordering issue
>> called out above).  It would presumably require another IETF
>> Last Call, however, and I'd hope we would see some serious
>> discussion within the WG (and with UTC) before making the change
>> and about how it is explained.
>>
>> If we are not willing to make a change
>
> I'm willing. It would, as you note, require some careful thinking and
> review to make sure that we got it (more) right this time.
>
>> that significant and/or
>> if we conclude that the WG (and perhaps the IETF) have
>> completely run out of energy for dealing with i18n issues [1],
>> then I suggest that we introduce some additional text.  I've
>> just spent a half-hour trying to find the AUTH48 copy of
>> precis-nickname (aka RFC-to-be-7700), but the RFC Editor has
>> apparently changed naming conventions and the various queue
>> entry pages all point to the -19 I-D and not the current working
>> copy so I can't try to match text and insertion point to what is
>> there already.
>
> http://www.rfc-editor.org/authors/rfc7700.txt
>
>>  The suggestion is a patch (and a hack), not a
>> good fix but something like it is probably the least drastic
>> measure that would yield something that doesn't contain
>> unexplained known defects.
>>
>> Rough version of suggested text (possibly to go after your
>> revised paragraph and following up my comments in my 1 October
>> note).  Some of the terminology needs checking which I can do if
>> you want to go this route:
>>
>>     'Users of this specification should note that the
>>     concept of "lower case conversion" is somewhat elusive
>>     and more dependent on the conventions of different
>>     languages and notation systems that use the same script
>>     than may appear obvious at first glance, especially if
>>     that glance is at Basic Latin characters (i.e., the
>>     ASCII letter repertoire).  Unicode provides two
>>     different mapping procedures that produce lower-case
>>     characters, but they have different effects and results
>>     for many characters.  The more conservative one,
>>     typically appropriately applicable when lower case forms
>>     are needed, is actual lower-casing (embodied in the
>>     Unicode operation toLowerCase).  A more radical
>>     operation, normally suitable only for string matching in
>>     situations in which it is better to consider uncertain
>>     cases as matching than to treat them as distinct, is
>>     called "Case Folding" (Unicode operation toCaseFold).
>>     While the two operations will often produce the same
>>     results, Case Folding maps some lower case characters
>>     into others and performs other transformations that may
>>     be intuitively reasonable and expected for some users
>>     and quite astonishing (or just wrong) to others.  There
>>     may be no practical alternative, especially if the
>>     operations are to be used for mapping or enforcement, to
>>     developers of PRECIS-dependent applications understanding
>>     that the cases in which the two yield different results
>>     require careful understanding of the relevant user base
>>     and its needs [2].'
>
> Thanks.
>
> I am not sure if we need something like that if we move case mapping
> (here, case folding) to the comparison operation only - but something
> like that might still be appropriate.
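A small sketch of the difference between lower-casing and case folding described in the suggested text above. It assumes Python's str.lower() and str.casefold() as approximations of the Unicode toLowerCase and toCaseFold operations. The helper name and the sample characters are chosen only for illustration.

```python
def compare_mappings(s: str) -> tuple[str, str]:
    """Return (lower-cased, case-folded) forms of s for side-by-side comparison."""
    return s.lower(), s.casefold()


# Hypothetical sample set: two characters that are already lower case,
# plus one ASCII letter where both operations agree.
samples = {
    "LATIN SMALL LETTER SHARP S": "\u00df",
    "GREEK SMALL LETTER FINAL SIGMA": "\u03c2",
    "LATIN CAPITAL LETTER A": "A",
}

for name, ch in samples.items():
    lowered, folded = compare_mappings(ch)
    status = "same" if lowered == folded else "DIFFERENT"
    print(f"{name}: lower={lowered!r} casefold={folded!r} ({status})")
```

The first two samples are already lower case, so lower-casing leaves them alone while case folding rewrites them. That is the "maps some lower case characters into others" behavior the suggested text warns about.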
>
>>>> (3) Should we update RFC 7613 so that case folding is applied
>>>> only as part of comparison and not as part of enforcement?
>>>
>>> That is less urgent so I suggest that we address the nickname
>>> spec first.
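One way to picture question (3) is the sketch below, which folds case only when comparing and leaves the enforced string in the case the user supplied. This is a simplified illustration, not the RFC 7613 rules: the function names are hypothetical, NFC stands in for the whole enforcement step, and Python's str.casefold() stands in for Unicode case folding.

```python
import unicodedata


def enforce(s: str) -> str:
    """Hypothetical enforcement step: normalize, but preserve the user's case."""
    return unicodedata.normalize("NFC", s)


def comparison_key(s: str) -> str:
    """Fold case only to build a comparison key.

    The result is renormalized because case folding can yield
    output that is no longer in NFC.
    """
    return unicodedata.normalize("NFC", enforce(s).casefold())


def same_identifier(a: str, b: str) -> bool:
    """Compare two identifiers case-insensitively without altering either."""
    return comparison_key(a) == comparison_key(b)


stored = enforce("Alice")                    # keeps the capital "A"
print(stored, same_identifier(stored, "ALICE"))
```

Under this arrangement the surprising case-folding results affect only whether two strings match, not what is stored or displayed.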
>>
>> Unless you (or someone else here) have a plausible plan to
>> continue and revitalize the WG and assign it that revision work
>> (and bring everyone actively participating up to the level
>> needed to easily understand this discussion thread and feel
>> embarrassed for not spotting the problems), I think we need to
>> assume that this is our last shot.  Absent an active and
>> committed WG, "do this first" could easily be equivalent to
>> "don't get around to the other, ever".
>
> As mentioned, I don't want to have broken RFCs out there.
>
>> I think that the particular set of issues that started this
>> thread is a known defect in the PRECIS specs, both nickname and
>> 7613, and that we are obligated to either fix the problems or at
>> least explain them.  The above warning text is an attempt to
>> explain and identify the problems even if it does not actually
>> provide a solution.  If it were published as part of
>> precis-nickname, it could include a statement to the effect that
>> it should also be treated as an update to 7613 or, if the IESG
>> and RFC Editor would agree in advance to accept, rather than
>> bury, the thing, I suppose we could publish it in
>> precis-nickname and create an erratum to 7613 indicating that it
>> should have included some form of that statement.  Neither
>> option implies a huge amount of work to update 7613.  But I
>> think that making the changes of (2) without doing anything
>> about (3) makes the two documents inconsistent with each other
>> and that would be an additional known defect.
>>
>> Procedural question: given that precis-nickname is in AUTH48 as
>> of yesterday and I don't see anything blocking publication next
>> week if you and Barry sign off on the revised text that the WG
>> hasn't seen,
>
> There is no revised text yet. That's why we're having this discussion.
>
>> does someone need to file a pro forma objection/
>> appeal to block that until this is sorted out and the WG has a
>> chance to review proposed publication text?
>
> I see no reason to invoke the specter of appeals quite yet. Everyone is
> working in good faith to do the right thing and get this mess cleaned up.
>
>> [1] I believe our collective inability to deal with the
>> within-script character forms that do not normalize to each
>> other because of language-dependent or other usage factors can
>> be taken as evidence of having run out of energy,
>
> Or in my case simple ignorance of some of the relevant issues and
> examples. It's not easy to know about all of this.
>
>> but it is
>> probably in the interest of finishing the PRECIS work to try to
>> treat that as a separate issue.
>
> Probably.
>
>> [2] Not unlike the reason to differentiate between NFC and NFKC
>> and understand the effects of each.
>
> Another thing that's not easy to grok in fulness.
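For readers who want to see the NFC versus NFKC distinction from note [2] concretely, here is a minimal sketch using Python's standard unicodedata module. The helper name and the sample characters are assumptions chosen for the illustration.

```python
import unicodedata


def nfc_and_nfkc(s: str) -> tuple[str, str]:
    """Return the NFC and NFKC forms of s."""
    return unicodedata.normalize("NFC", s), unicodedata.normalize("NFKC", s)


samples = {
    "FULLWIDTH LATIN CAPITAL LETTER A": "\uff21",
    "LATIN SMALL LIGATURE FI": "\ufb01",
    "e + COMBINING ACUTE ACCENT": "e\u0301",
}

for name, text in samples.items():
    nfc, nfkc = nfc_and_nfkc(text)
    print(f"{name}: NFC={nfc!r} NFKC={nfkc!r}")
```

NFC only composes canonically equivalent sequences, so the fullwidth letter and the ligature survive it. NFKC also applies compatibility mappings and replaces them with plain ASCII letters.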
>
> Peter
>
>