Re: [precis] Ambiguity in specification of case mapping in RFC 7613 and draft-ietf-precis-nickname

Peter Saint-Andre <peter@andyet.net> Tue, 03 November 2015 04:42 UTC

To: John C Klensin <john-ietf@jck.com>, Tom Worster <fsb@thefsb.org>, Alexey Melnikov <Alexey.Melnikov@isode.com>
References: <0347834EBDC481BD99BDBE67@JcK-HP8200.jck.com> <56302E6D.5030901@andyet.net> <56312AAC.1000300@andyet.net> <56313616.8000801@andyet.net> <563143B9.7020707@andyet.net>
From: Peter Saint-Andre <peter@andyet.net>
Message-ID: <56383B2F.6080505@andyet.net>
Date: Mon, 02 Nov 2015 21:42:23 -0700
In-Reply-To: <563143B9.7020707@andyet.net>
Archived-At: <http://mailarchive.ietf.org/arch/msg/precis/7hPlg3Y1ll44prIkQi4miftaurY>
Cc: precis@ietf.org
Subject: Re: [precis] Ambiguity in specification of case mapping in RFC 7613 and draft-ietf-precis-nickname
Precedence: list

For ease of reviewing only, and with no presumption that these proposed 
changes have been accepted by anyone, I have asked the RFC Editor to 
provisionally update the document in AUTH48 as outlined in the messages 
I have sent to the list. The resulting file is here:

http://www.rfc-editor.org/authors/rfc7700.txt

Despite those caveats, if at all possible I would prefer to find an 
acceptable solution for publishing this RFC now without undue further 
delays (in part because draft-ietf-simple-chat has been held on this 
document for almost 3 years!). That does not mean, as I said earlier in 
this thread, that I want to have broken RFCs out there, but I think we 
can fix this one acceptably now and then update it again in strict 
coherence with updates to RFC 7564 and RFC 7613. I am committed to 
getting things right, but I am also committed to not holding up other 
people's work for years and years on end.

Peter

On 10/28/15 3:52 PM, Peter Saint-Andre wrote:
> Example 7 needs to be corrected, too, in accordance with CaseFolding.txt.
>
> On 10/28/15 2:54 PM, Peter Saint-Andre wrote:
>> And here is another correction in Section 3...
>>
>> OLD
>>
>>     Regarding examples 5, 6, and 7: applying Unicode Default Case Folding
>>     to GREEK CAPITAL LETTER SIGMA (U+03A3) results in GREEK SMALL LETTER
>>     SIGMA (U+03C3), and doing so during comparison would result in
>>     matching the nicknames in examples 5 and 6; however, because the
>>     PRECIS mapping rules do not account for the special status of GREEK
>>     SMALL LETTER FINAL SIGMA (U+03C2), the nicknames in examples 5 and 7
>>     or examples 6 and 7 would not be matched.
>>
>> NEW
>>
>>     Regarding examples 5, 6, and 7: applying Unicode Default Case Folding
>>     to GREEK CAPITAL LETTER SIGMA (U+03A3) results in GREEK SMALL LETTER
>>     SIGMA (U+03C3), and the same is true of GREEK SMALL LETTER FINAL
>>     SIGMA (U+03C2); therefore, the comparison operation defined in
>>     Section 2.4 would result in matching of the nicknames in examples 5,
>>     6, and 7.
>>
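As an illustration of the NEW text above (not part of the proposed RFC wording): Python's built-in `str.casefold()` implements Unicode full case folding, so it can serve as a rough stand-in for Unicode Default Case Folding. The sample strings are single characters chosen for this sketch; they are not the nicknames from examples 5, 6, and 7.

```python
# Sketch only: str.casefold() stands in for Unicode Default Case Folding.
CAPITAL_SIGMA = "\u03a3"  # GREEK CAPITAL LETTER SIGMA
SMALL_SIGMA = "\u03c3"    # GREEK SMALL LETTER SIGMA
FINAL_SIGMA = "\u03c2"    # GREEK SMALL LETTER FINAL SIGMA

# All three forms fold to GREEK SMALL LETTER SIGMA, so they compare equal
# once case folding is part of the comparison operation.
all_fold_to_small_sigma = all(
    ch.casefold() == SMALL_SIGMA
    for ch in (CAPITAL_SIGMA, SMALL_SIGMA, FINAL_SIGMA)
)

# Plain lowercasing leaves final sigma alone, which is why lowercasing
# by itself would not make the final-sigma form match the others.
final_sigma_survives_lower = FINAL_SIGMA.lower() == FINAL_SIGMA

print(all_fold_to_small_sigma, final_sigma_survives_lower)
```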
>> On 10/28/15 2:06 PM, Peter Saint-Andre wrote:
>>> I propose the following text changes:
>>>
>>> ###
>>>
>>> OLD
>>>
>>>     3.  Case Mapping Rule: Uppercase and titlecase characters MUST be
>>>         mapped to their lowercase equivalents using Unicode Default Case
>>>         Folding as defined in the Unicode Standard [Unicode] (at the time
>>>         of this writing, the algorithm is specified in Chapter 3 of
>>>         [Unicode7.0]).  In applications that prohibit conflicting
>>>         nicknames, this rule helps to reduce the possibility of confusion
>>>         by ensuring that nicknames differing only by case (e.g.,
>>>         "stpeter" vs. "StPeter") would not be presented to a human user
>>>         at the same time.
>>>
>>> NEW
>>>
>>>     3.  Case Mapping Rule: Unicode Default Case Folding MUST be applied,
>>>         as defined in the Unicode Standard [Unicode] (at the time
>>>         of this writing, the algorithm is specified in Chapter 3 of
>>>         [Unicode7.0]).  The primary result of doing so is that uppercase
>>>         characters are mapped to lowercase characters. In applications
>>>         that prohibit conflicting nicknames, this rule helps to reduce
>>>         the possibility of confusion by ensuring that nicknames
>>>         differing only by case (e.g., "stpeter" vs. "StPeter") would not
>>>         be presented to a human user at the same time.
>>>
>>> ###
>>>
>>> (The foregoing was previously sent to the list.)
>>>
>>> ###
>>>
>>> OLD
>>>
>>> 2.3.  Enforcement
>>>
>>>     An entity that performs enforcement according to this profile MUST
>>>     prepare a string as described in Section 2.2 and MUST also apply the
>>>     rules specified in Section 2.1.  The rules MUST be applied in the
>>>     order shown.
>>>
>>>     After all of the foregoing rules have been enforced, the entity MUST
>>>     ensure that the nickname is not zero bytes in length (this is done
>>>     after enforcing the rules to prevent applications from mistakenly
>>>     omitting a nickname entirely, because when internationalized
>>>     characters are accepted, a non-empty sequence of characters can
>>>     result in a zero-length nickname after canonicalization).
>>>
>>> 2.4.  Comparison
>>>
>>>     An entity that performs comparison of two strings according to this
>>>     profile MUST prepare each string and enforce the rules as specified
>>>     in Sections 2.2 and 2.3.  The two strings are to be considered
>>>     equivalent if they are an exact octet-for-octet match (sometimes
>>>     called "bit-string identity").
>>>
>>> NEW
>>>
>>> 2.3.  Enforcement
>>>
>>>     An entity that performs enforcement according to this profile MUST
>>>     prepare a string as described in Section 2.2 and MUST also apply the
>>>     following rules specified in Section 2.1 in the order shown:
>>>
>>>     1. Additional Mapping Rule
>>>     2. Normalization Rule
>>>     3. Directionality Rule
>>>
>>>     After all of the foregoing rules have been enforced, the entity MUST
>>>     ensure that the nickname is not zero bytes in length (this is done
>>>     after enforcing the rules to prevent applications from mistakenly
>>>     omitting a nickname entirely, because when internationalized
>>>     characters are accepted, a non-empty sequence of characters can
>>>     result in a zero-length nickname after canonicalization).
>>>
>>> 2.4.  Comparison
>>>
>>>     An entity that performs comparison of two strings according to this
>>>     profile MUST prepare each string as specified in Section 2.2 and
>>>     MUST apply the following rules specified in Section 2.1 in the order
>>>     shown:
>>>
>>>     1. Additional Mapping Rule
>>>     2. Case Mapping Rule
>>>     3. Normalization Rule
>>>     4. Directionality Rule
>>>
>>>     The two strings are to be considered equivalent if they are an exact
>>>     octet-for-octet match (sometimes called "bit-string identity").
>>>
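The split proposed in the NEW text above (case mapping in comparison but not in enforcement) can be sketched roughly as follows. This is a minimal sketch under several assumptions that the message does not spell out: the Additional Mapping Rule is approximated as "map space separators to U+0020, trim, and collapse runs of spaces", the Normalization Rule is taken to be NFKC, `str.casefold()` stands in for Unicode Default Case Folding, and both the Section 2.2 preparation step and the Directionality Rule are omitted. All function names are hypothetical.

```python
import unicodedata


def additional_mapping(s: str) -> str:
    # Assumed approximation of the Additional Mapping Rule: map every
    # space separator (category Zs) to U+0020, then drop leading and
    # trailing spaces and collapse interior runs to a single space.
    spaced = "".join(" " if unicodedata.category(ch) == "Zs" else ch for ch in s)
    return " ".join(part for part in spaced.split(" ") if part)


def enforce(nickname: str) -> str:
    # Enforcement: additional mapping, then normalization. No case mapping,
    # so the nickname keeps the case its owner chose.
    result = unicodedata.normalize("NFKC", additional_mapping(nickname))
    if len(result.encode("utf-8")) == 0:
        raise ValueError("nickname is zero bytes in length after enforcement")
    return result


def comparison_form(nickname: str) -> bytes:
    # Comparison: additional mapping, case mapping, normalization,
    # in that order, then an octet string for exact matching.
    mapped = additional_mapping(nickname)
    folded = mapped.casefold()
    return unicodedata.normalize("NFKC", folded).encode("utf-8")


def nicknames_match(a: str, b: str) -> bool:
    # Octet-for-octet comparison of the two comparison forms.
    return comparison_form(a) == comparison_form(b)
```

Under these assumptions, `enforce("StPeter")` leaves the capitalization alone, while `nicknames_match("StPeter", "stpeter")` reports a match.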
>>> ###
>>>
>>> In addition, some variation on John's proposed text about toLowerCase
>>> vs. toCaseFold might be appropriate at the end of Section 4; however,
>>> I'm still not sure that is necessary if we move the case mapping rule to
>>> the comparison operation.
>>>
>>> Peter
>>>
>>> On 10/27/15 8:09 PM, Peter Saint-Andre wrote:
>>>> On 10/27/15 11:32 AM, John C Klensin wrote:
>>>>> Response to Monday's note immediately below; response to today's
>>>>> follows it.  My apologies, but it is probably important to read
>>>>> both.  My further apologies for the length of this note, but I
>>>>> think we are in deep trouble here,
>>>>
>>>> Internationalization always seems to be a matter of how deep the
>>>> trouble is...
>>>>
>>>>> trouble that is aggravated by
>>>>> precis-mappings and precis-nickname both being post-approval and
>>>>> that, as far as I know, there are no future plans for PRECIS
>>>>> work (having precis-nickname in AUTH48 just emphasizes that --
>>>>> see comment at end).
>>>>
>>>> We had not planned to work on PRECIS because we thought we were done
>>>> for a while. If that's not the case and we need to fix things, then so
>>>> be it. Whether there is sufficient and continued energy for such work
>>>> is another question. Personally I don't want us to have broken RFCs
>>>> out there.
>>>>
>>>>> --On Monday, October 26, 2015 15:51 -0600 Peter Saint-Andre -
>>>>> &yet <peter@andyet.net> wrote:
>>>>>
>>>>>> My apologies for the delayed reply. Comments inline.
>>>>>
>>>>> A few remarks below... I can't tell whether we disagree or
>>>>> whether at least one of us, probably me, is not being
>>>>> adequately clear.  (Material on which we fairly clearly agree
>>>>> elided.)
>>>>>
>>>>>
>>>>>> On 10/1/15 7:50 AM, John C Klensin wrote:
>>>>>>> --On Wednesday, September 30, 2015 15:16 -0600 Peter
>>>>>>> Saint-Andre - &yet <peter@andyet.net> wrote:
>>>>>> ...
>>>>>>> Peter,
>>>>>>>
>>>>>>> While your proposed text is an improvement,
>>>>>>
>>>>>> Happy to hear it. All I intended was a slight clarification.
>>>>>
>>>>> But I'm not certain we are there yet...
>>>>
>>>> Agreed. The text I proposed addressed only a very small part of the
>>>> problem.
>>>>
>>>>>>> the desire of many
>>>>>>> people for a magic "just tell me what to do" formula, one that
>>>>>>> lets them avoid understanding the issues, may call for a
>>>>>>> little more:
>>>>>>
>>>>>> There is always a need for more when it comes to i18n.
>>>>>
>>>>> But I think it is a little more than that.  I've heard several
>>>>> times, including in PRECIS meetings, requests for "just tell me
>>>>> what to do and make sure it isn't complicated" (or "I don't want
>>>>> to have to think about, much less understand, the issues").  We
>>>>> can debate whether giving in to those requests in the I18n case
>>>>> is wise.  I think it leads directly to conclusions equivalent to
>>>>> "I understand my own script and writing system (or think I do)
>>>>> and therefore, since all writing systems must be pretty much the
>>>>> same, I understand all of the core issues in terms of my script
>>>>> and understanding".   That, in turn, leads directly to the "how
>>>>> do you spell 'Zürich'?" and "all spellings of 'Zuerich' should
>>>>> be treated as equivalent" discussions that sounded like they
>>>>> dominated a BOF at IETF 93.
>>>>>
>>>>> Now I actually think it is reasonable for someone to ask for a
>>>>> library that will do the job most of the time and that will
>>>>> almost never cause their users or customers to get angry at
>>>>> them.  But, if we are going to call what we do "standards", they
>>>>> should contain sufficient information that would-be library
>>>>> authors can know what to do ... or understand that they are in
>>>>> over their heads.  And, for these particular cases, we may need
>>>>> to explain, or help the library authors explain, why some cases
>>>>> will fail and, indeed, get users mad at vendors.
>>>>>
>>>>>
>>>>>>> (1) First, toCaseFold is _not_ toLowerCase.  Saying "The
>>>>>>> primary result of doing so is that uppercase characters are
>>>>>>> mapped to lowercase characters" is true for toCaseFold,
>>>>>>
>>>>>> By "primary" I meant two things: (1) lowercasing is what
>>>>>> happens to the preponderance of code points and (2) this is
>>>>>> the result that most people care about.
>>>>>
>>>>> If I parse the above correctly, I think you are wrong.   I think
>>>>> what most people want, care about, and think they are getting,
>>>>> is lower case conversion, i.e., an operation that preserves
>>>>> lower case characters and converts upper case characters to the
>>>>> equivalent lower case.  toCaseFold isn't that operation.  It is
>>>>> a much more complex and subtle operation that, as well as
>>>>> converting upper case characters to lower case, sometimes
>>>>> converts lower case characters to different lower case
>>>>> characters (or strings of them).  It also requires a fairly good
>>>>> understanding of Unicode (not just a relevant script) and
>>>>> historical Unicode decisions to predict its behavior and to have
>>>>> any hope of explaining that behavior to users.   If one is
>>>>> trying to compare (as distinct from converting), then toCaseFold
>>>>> may be exactly what is wanted, but it is really hard to explain
>>>>> or justify that in terms of "nicknames" or "aliases", which are
>>>>> about conversion.   And, if one hopes to explain what is going
>>>>> on to users in terms of "lower casing", then toCaseFold is just
>>>>> the wrong operation.  That is what toLowerCase is for and the
>>>>> two operations are just not equivalent.
>>>>
>>>> My recollection, quite possibly inaccurate or incomplete, from at least
>>>> one and I think several in-person meetings of the PRECIS WG was: just
>>>> use Unicode Default Case Folding because if you use anything else or
>>>> try to roll your own you will be fubar forever. I do not recall any
>>>> discussion of the issues you have raised in this thread (e.g., about
>>>> the inadvisability of using case folding for anything but comparison
>>>> operations) until the last few weeks. However, I freely admit that's
>>>> probably because, through my own faults and ignorance, I didn't
>>>> understand what you were saying.
>>>>
>>>>> FWIW and purely by coincidence wrt PRECIS and this document, I
>>>>> had a conversation a few days ago with an expert on Arabic (and
>>>>> Persian) calligraphy and writing systems (and good general
>>>>> knowledge of writing systems) who is quite insistent that any
>>>>> procedure we use for case-insensitive matching (e.g., case
>>>>> folding) is discriminatory, inconsistent, and just
>>>>> badly-thought-out if that same procedure doesn't treat isolated,
>>>>> initial, and medial forms of the same character as equivalent.
>>>>> He further strengthens his case (sic) by noting that Unicode
>>>>> case folding maps U+03C2 (GREEK SMALL LETTER FINAL SIGMA,
>>>>> unambiguously a lower-case character) to U+03C3 (GREEK SMALL
>>>>> LETTER SIGMA), a relationship that depends entirely on
>>>>> positional use and not case.  He also believes the same
>>>>> relationships should apply to all other scripts that make form
>>>>> distinctions for some characters based on positions in a string
>>>>> and for which Unicode has chosen to assign different code
>>>>> points.  Even if there were wide acceptance of his view, Unicode
>>>>> stability principles would prevent changing toCaseFold (or
>>>>> CaseFolding.txt), but this is more evidence that what toCaseFold
>>>>> does and does not do is going to be hard to explain to either
>>>>> casual users or to writing system experts whose primary
>>>>> experience is not with the Greek-Latin-Cyrillic group.
>>>>>
>>>>> I don't think we want to say "these matching rules are somewhat
>>>>> arbitrary and irrational, but, if you don't like it, blame
>>>>> Unicode and not us", if only because it is our choice to use
>>>>> those matching rules.  More below.
>>>>>
>>>>>
>>>>>> ...
>>>>>>> (2) Second, probably as a result of having IDNA in the lead,
>>>>>>> we've gotten sloppy about language and operations and should
>>>>>>> probably start untangling that before it gets people in
>>>>>>> trouble.
>>>>>>
>>>>>> Where is the right place to do that untangling? (I doubt that
>>>>>> it is the precis-nickname document.)
>>>>>
>>>>> I agree that precis-nickname isn't the ideal place.  I also
>>>>> believe that you and it are the innocent victims of the
>>>>> situation.  At the same time, I don't believe IETF should be
>>>>> producing incomplete, ambiguous, erroneous, or misleading
>>>>> standards because no one could get around to doing the right
>>>>> foundational work.
>>>>
>>>> Agreed. I too want to get this right, even though it's not a lot of fun
>>>> and it's certainly more work than I thought I was signing up for at the
>>>> NEWPREP BoF years ago.
>>>>
>>>>>>> The Unicode Standard, at least as I understand it, is fairly
>>>>>>> clear that the most important (and really only safe) use of
>>>>>>> toCaseFold is as part of a comparison operation.
>>>>>>
>>>>>> Thanks for noting that. For example, Section 5.18 of Unicode
>>>>>> 8.0.0 says:
>>>>>>
>>>>>>      Caseless matching is implemented using case folding, which is the
>>>>>>      process of mapping characters of different case to a single form, so
>>>>>>      that case differences in strings are erased. Case folding allows for
>>>>>>      fast caseless matches in lookups because only binary comparison is
>>>>>>      required. It is more than just conversion to lowercase.
>>>>>
>>>>> Right.  But, again, when its use is appropriate (a very
>>>>> controversial topic in itself with our painful IDNA history with
>>>>> Final Sigma, Eszett and the case-independent versus
>>>>> position-independent controversy called out above as examples)
>>>>> that is "matches in lookups" (what I've described elsewhere as
>>>>> "comparison only").  Not creating or defining nicknames or
>>>>> aliases.  And that _is_ a problem for this document.
>>>>
>>>> I'm not convinced that things are as bad as you think. If we say in
>>>> draft-ietf-precis-nickname that the case mapping rule is to be applied
>>>> only as part of comparison and not as part of enforcement - which I
>>>> think is really what we care about (e.g., to prevent spoofing of users
>>>> in chat rooms) - then I think we might be most of the way there.
>>>>
>>>>>>> Using your
>>>>>>> example it is entirely reasonable to treat, "stpeter" and
>>>>>>> "StPeter" as equivalent in a comparison operation, but
>>>>>>> accepting one string and changing it to the other for display
>>>>>>> may not be a really good idea.  While that transformation may
>>>>>>> be acceptable (although I would be surprised if there were no
>>>>>>> people who share your surname who could consider "stpeter" or
>>>>>>> "Stpeter" unacceptable and might even believe that "StPeter"
>>>>>>> is an unacceptable substitute for "St. Peter"),
>>>>>>
>>>>>> I do receive email at stpeter@gmail.com intended for
>>>>>> st.peter@gmail.com but that's a separate topic...
>>>>>
>>>>> One that is relevant because it "works" as a side-effect of a
>>>>> decision Google has made about mailbox name equivalence, a
>>>>> decision that, IMO, will sooner or later get someone into a lot
>>>>> of trouble and,  more important, a decision and matching rule
>>>>> that PRECIS, AFAICT, does not allow and that IDNA unambiguously
>>>>> forbids.
>>>>>
>>>>>>> it also points out the
>>>>>>> dangers of using Basic Latin script examples to illustrate
>>>>>>> situations in which even more extended Latin script, much less
>>>>>>> other scripts, may raise more complex issues.    Because IDNA
>>>>>>> is essentially a workaround because changing the DNS
>>>>>>> comparison rules was impractical for several reasons, we
>>>>>>> ended up using toCaseFold to map characters and strings into
>>>>>>> others in IDNA2003, but PRECIS implementations that do not
>>>>>>> have the same constraints would, in general, be better off
>>>>>>> confining the use of toCaseFold, or even toLowerCase, to
>>>>>>> comparison operations.
>>>>>>
>>>>>> Unfortunately we didn't do that in RFC 7564 or RFC 7613. Does
>>>>>> it make sense for this nickname specification to differ in
>>>>>> this respect from the published RFCs? Shall we file errata
>>>>>> against those documents? (This might apply only to RFC 7613,
>>>>>> which says to apply case folding as part of the enforcement
>>>>>> process - when exactly to apply case folding is not stipulated
>>>>>> by RFC 7564.)
>>>>>
>>>>> To the extent to which this is a "botched that because the WG
>>>>> didn't understand the issues well enough" conclusion, it would
>>>>> be entirely reasonable to generate an updating RFC that repairs
>>>>> 7613 and/or 7564, even doing so in an addendum to
>>>>> precis-nickname if that is the only way to do that
>>>>> expeditiously.  Per the above, we really don't want to give
>>>>> library routine writers bad instructions.  As I understand it,
>>>>> the current position of the RFC Editor and IESG is that
>>>>> technical specification errors discovered in retrospect or after
>>>>> people start using a spec are not appropriate topics for errata.
>>>>> If the WG is not willing to do any of those things, then I
>>>>> suggest that precis-nickname at least needs to contain a very
>>>>> clear warning notice about this situation (see my response to
>>>>> your question 1 below).
>>>>
>>>> I think we'll probably need to fix 7613 and 7564. I am hoping we can
>>>> fix nickname now so that it is less incorrect than the other two. That
>>>> doesn't necessarily mean we won't need to also further fix nickname
>>>> later on.
>>>>
>>>> Granted, we were supposed to avoid this problem by working on all of
>>>> the PRECIS specs simultaneously. Clearly we have not avoided the
>>>> problem, so we need to solve it one way or another. If that means bis
>>>> for them all, we need to deal with it.
>>>>
>>>>>>> (3) Because toCaseFold loses information when used for more
>>>>>>> than comparison (for comparison, it merely contributes to
>>>>>>> what some people would consider false positives for matching),
>>>>>>> involves some controversial decisions, and, because of
>>>>>>> stability requirements, cannot be changed even if the
>>>>>>> controversies are resolved in other ways, we end up with,
>>>>>>> e.g.,
>>>>>>>       toCaseFold ("Nuß") -> "nuss"
>>>>>>> which is considered an acceptable transformation in some
>>>>>>> places that identify themselves as speaking/using German and
>>>>>>> two different unacceptable errors in others.  Again, this will
>>>>>>> almost always be much more serious if the transformation is
>>>>>>> used to map and replace strings than if it is used to compare
>>>>>>> (fwiw, that particular example is part of a continuing
>>>>>>> disagreement between IDNA2008 and, among others, German
>>>>>>> domain registry authorities on one side and UTC and UTR 46 on
>>>>>>> the other).
>>>>>>
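To make the "Nuß" example in point (3) concrete, here is a small sketch that uses Python's `str.casefold()` and `str.lower()` as stand-ins for toCaseFold and toLowerCase. The variable names are illustrative only.

```python
# Sketch only: built-in string methods stand in for the Unicode operations.
word = "Nu\u00df"            # "Nuß", with LATIN SMALL LETTER SHARP S

folded = word.casefold()     # case folding expands the sharp s to "ss"
lowered = word.lower()       # lowercasing keeps the sharp s

# Folding is lossy: a nickname spelled with "ss" folds to the same string,
# so the original spelling cannot be recovered from the folded form.
collides_with_ss_spelling = "Nuss".casefold() == folded

print(folded, lowered, collides_with_ss_spelling)
```

The collision is harmless when the folded form is only used for comparison, but it changes what the user sees if the folded form replaces the stored nickname.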
>>>>>> Agreed.
>>>>>
>>>>> See "warning notice" comment above and question 1 response below.
>>>>>
>>>>>>> (4) If the motivation is really to avoid confusion, the
>>>>>>> correct confusion-blocking rule for Latin script (but not
>>>>>>> others) and many languages that use it (but certainly not
>>>>>>> all) involves moving beyond toCaseFold and making all
>>>>>>> "decorated" characters (characters normally represented by
>>>>>>> glyphs consisting of a Basic Latin character and one or more
>>>>>>> diacritical or equivalent markings) compare equal to their
>>>>>>> base characters, e.g., "á" not only matches "Á" but also
>>>>>>> "a" and "A" and, as an unfortunate side-effect, maybe "À"
>>>>>>> and "à" as well.  This is bad news for languages in which
>>>>>>> decorated Latin characters are used to represent phonetically
>>>>>>> and conceptually different characters, not just pronunciation
>>>>>>> variations.  I am not qualified to evaluate "how bad".   In
>>>>>>> addition, extrapolations from this principle about Latin
>>>>>>> script to unrelated scripts will almost certainly lead to
>>>>>>> serious errors and/or additional confusion.
>>>>>>
>>>>>> I would not be comfortable going that far...
>>>>>
>>>>> In case it isn't clear, I would not be either.  But it is where
>>>>> getting sloppy about this stuff could easily take us.  It is
>>>>> worth noting that it also identifies one of the difficulties
>>>>> with doing a global system to be applied to many types of
>>>>> applications (like the PRECIS work) and then applying it in user
>>>>> interface software that end users will expect to be localized to
>>>>> their assumptions because it has been mapped or translated into
>>>>> their language (if one normally speaks Upper Slobbovian but has
>>>>> some familiarity with English, an application interface in
>>>>> English will probably be expected to be "foreign", odd, and
>>>>> maybe even inconsistent with whatever expectations exist.  But,
>>>>> if the interface is in Upper Slobbovian, the natural and
>>>>> reasonable assumption will be that the matching should conform to
>>>>> normal Upper Slobbovian conventions).    FWIW, a matching rule
>>>>> that says:
>>>>>
>>>>>   (i) Two instances of a base character with the same
>>>>>     diacritical mark(s) match.
>>>>>   (ii) Two instances of a base character with different
>>>>>     diacritical mark(s) do not match.
>>>>>   (iii) Two instances of a base character, one with
>>>>>     diacritical mark(s) and the other without any decoration
>>>>>     match.
>>>>>
>>>>> is precisely correct and normal behavior for at least one
>>>>> language that uses Latin script.  It is also the normal practice
>>>>> for at least one Latin script transcription system that is used
>>>>> by a large fraction of a billion people (maybe more).
>>>>
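The three-part matching rule (i) to (iii) quoted above can be sketched as a pairwise comparison. This is only one possible reading of the rule, assuming that "base character" and "diacritical mark(s)" correspond to the starter and combining marks of the NFD decomposition; the function names are hypothetical, and case is deliberately left out.

```python
import unicodedata


def _clusters(s: str):
    # Split a string into (base character, combining marks) pairs,
    # using canonical decomposition (NFD) to separate the marks.
    clusters = []
    for ch in unicodedata.normalize("NFD", s):
        if unicodedata.combining(ch) and clusters:
            base, marks = clusters[-1]
            clusters[-1] = (base, marks + ch)
        else:
            clusters.append((ch, ""))
    return clusters


def lenient_match(a: str, b: str) -> bool:
    ca, cb = _clusters(a), _clusters(b)
    if len(ca) != len(cb):
        return False
    for (base_a, marks_a), (base_b, marks_b) in zip(ca, cb):
        if base_a != base_b:
            return False
        # (i) same marks match; (iii) decorated vs. undecorated match;
        # (ii) two different non-empty sets of marks do not match.
        if marks_a and marks_b and marks_a != marks_b:
            return False
    return True
```

One consequence worth noting: under this reading the rule is not transitive ("a" matches both "á" and "à", which do not match each other), so it cannot be implemented by mapping each string to a single canonical form and comparing octets.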
>>>> That is indeed sobering.
>>>>
>>>>>>> More on this and Tom's question below...
>>>>>>>
>>>>>>>> On 9/29/15 3:28 PM, Tom Worster wrote:
>>>>>>>>> Peter, Alexey,
>>>>>>>>>
>>>>>>>>> I think there is an ambiguity in the specification of case
>>>>>>>>> mapping in RFC 7613 and draft-ietf-precis-nickname-19.
>>>>>>>> ...
>>>>>>>>> But there are 55 code points in Unicode 7.0.0 that change
>>>>>>>>> under default case folding that are neither uppercase nor
>>>>>>>>> titlecase characters, 12 of which are Lowercase_Letter. I
>>>>>>>>> suspect this stems from a confusion between Unicode case
>>>>>>>>> mapping and case folding.
>>>>>
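A count like the one Tom reports can be approximated with a short script. This is a sketch, not a reproduction of his method: it uses Python's `str.casefold()` (full case folding) and whatever Unicode version the running interpreter bundles, so the numbers it prints will not necessarily be 55 and 12.

```python
import sys
import unicodedata


def folding_changes_outside_upper_and_title():
    # Collect code points that case folding changes even though their
    # General_Category is neither Lu (uppercase) nor Lt (titlecase).
    changed = []
    for cp in range(sys.maxunicode + 1):
        ch = chr(cp)
        if ch.casefold() != ch and unicodedata.category(ch) not in ("Lu", "Lt"):
            changed.append(ch)
    return changed


changed = folding_changes_outside_upper_and_title()
lowercase_letters = [ch for ch in changed if unicodedata.category(ch) == "Ll"]

# Report the counts together with the Unicode version they were computed for.
print(len(changed), len(lowercase_letters), unicodedata.unidata_version)
```

Both LATIN SMALL LETTER SHARP S (U+00DF) and GREEK SMALL LETTER FINAL SIGMA (U+03C2) show up in the lowercase subset.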
>>>>> In the context of the above, a different way to say the same
>>>>> thing is that people are looking at toCaseFold and assuming (and
>>>>> explaining things in terms of) toLowerCase.  toCaseFold works
>>>>> the way it is expected to and those 55 code points are, more or
>>>>> less, collateral damage to get to a matching algorithm that
>>>>> favors false positives over false negatives and various edge
>>>>> cases (including in "edge cases" languages spoken by, and script
>>>>> variations used by, millions of people).
>>>>
>>>> Sadly I suspect that is an accurate description of the current state of
>>>> affairs (modulo my comment above about PRECIS WG discussions at one or
>>>> more IETF meetings).
>>>>
>>>>>> ...
>>>>>> After all that, I have 3 questions:
>>>>>
>>>>> Personal opinions about answers...
>>>>>
>>>>>> (1) Is my proposed text enough of a clarification that we
>>>>>> should make that change before the nickname I-D is published
>>>>>> as an RFC?
>>>>>
>>>>> I think the clarification is an improvement and is important
>>>>> enough to incorporate (I know that is the answer to a slightly
>>>>> different question).
>>>>>
>>>>> However, I think it is inadequate without a serious warning
>>>>> about the situation.
>>>>
>>>> Yes.
>>>>
>>>>>  That warning could appear in either this
>>>>> document or RFC 7613 (or 7613bis) with a pointer from the other,
>>>>> but, unless you want to revise 7613 now, this one is handy.
>>>>
>>>> I suspect that we need to revise 7613. I suspect that we might also
>>>> need to revise 7564 (at least with respect to the order in which
>>>> operations are applied, since there has been some confusion among
>>>> implementers).
>>>>
>>>> Well, we always knew that we would need to revise them. Just not so
>>>> soon.
>>>>
>>>>> Comment about possible text below.
>>>>>
>>>>>> (2) Should we modify draft-ietf-precis-nickname so that case
>>>>>> folding is applied only as part of comparison and not as part
>>>>>> of enforcement? If so, should we make that change before this
>>>>>> document is published as an RFC?
>>>>>
>>>>> Yes.  If something is used for "enforcement", it should be lower
>>>>> casing or something else that can be explained to people who are
>>>>> ordinarily familiar with one or more of the scripts that make
>>>>> case distinctions.
>>>>>
>>>>> However, viewed in the light of this discussion, the whole
>>>>> "enforcement" concept becomes a little dicey, especially if, as
>>>>> I believe but don't have time to verify, the transformations
>>>>> performed by toLowerCase are not a proper subset of those
>>>>> performed by toCaseFold.
>>>>
>>>> My initial thought is that case mapping doesn't belong in the nickname
>>>> enforcement operation at all - only in the comparison operation.
>>>>
>>>>>> (3) Should we update RFC 7613 so that case folding is applied
>>>>>> only as part of comparison and not as part of enforcement?
>>>>>
>>>>> I think that is necessary.  Following up on the comment above, I
>>>>> would prefer that the current Section 3.2.2 (3) of RFC 7613
>>>>> either point to Unicode Lower Casing or contain a warning along
>>>>> the lines of that below.
>>>>
>>>> Unlike the nickname profile (which I think can be cleaned up by moving
>>>> the case mapping rule to the comparison operation and continuing to use
>>>> Unicode Default Case Folding), I think you are right that for the
>>>> UsernameCaseMapped profile we probably want Unicode Lower Casing. Thus
>>>> the likely need, sooner rather than later, for 7613bis.
>>>>
>>>>>
>>>>>     ----------
>>>>>
>>>>> --On Tuesday, October 27, 2015 07:52 -0600 Peter Saint-Andre
>>>>> <peter@andyet.net> wrote:
>>>>>
>>>>>> This issue has greater urgency now because
>>>>>> draft-ietf-precis-nickname is now in AUTH48...
>>>>>>
>>>>>> On 10/26/15 3:51 PM, Peter Saint-Andre - &yet wrote:
>>>>>>
>>>>>>> After all that, I have 3 questions:
>>>>>>>
>>>>>>> (1) Is my proposed text enough of a clarification that we
>>>>>>> should make that change before the nickname I-D is published
>>>>>>> as an RFC?
>>>>>>
>>>>>> I think so.
>>>>>
>>>>> See above.
>>>>>
>>>>>>> (2) Should we modify draft-ietf-precis-nickname so that case
>>>>>>> folding is applied only as part of comparison and not as part
>>>>>>> of enforcement? If so, should we make that change before this
>>>>>>> document is published as an RFC?
>>>>>>
>>>>>> Although it seems to be the case that Unicode case folding is
>>>>>> primarily designed for the purpose of matching (i.e.,
>>>>>> comparison),
>>>>>
>>>>> "Seems" is a little weak.  The Unicode Standard is really quite
>>>>> specific about that.
>>>>>
>>>>>> I have a concern that applying the PRECIS case
>>>>>> mapping rule after applying the normalization and
>>>>>> directionality rules might have unintended consequences that
>>>>>> we haven't had a chance to consider yet. The PRECIS framework
>>>>>> expresses a preference (actually a hard requirement) for
>>>>>> applying the rules in a particular order. We made a late
>>>>>> change to the username profiles (RFC 7613), such that width
>>>>>> mapping is applied first (in order to accommodate fullwidth
>>>>>> and halfwidth characters in certain East Asian scripts).
>>>>>> Making a late change to the nickname profile also concerns me,
>>>>>> even though both of these late changes seem reasonable on the
>>>>>> face of it. I will try to find time to think about this
>>>>>> further in the next 24 hours.
>>>>>
>>>>> First, a hint for the consideration process: there is a reason
>>>>> why Unicode now supports a unified case folding and
>>>>> normalization operation.  My recollection is that it is not only
>>>>> more efficient to perform both operations at once (rather than
>>>>> looking in one table and then the other), but that there are
>>>>> some order-dependent or priority-dependent cases.
>>>>>
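The order dependence mentioned above can be shown with a single character. This sketch assumes NFKC as the normalization form and uses Python's `str.casefold()` for case folding; the sample character is chosen for illustration and does not come from the thread.

```python
import unicodedata


def fold_then_normalize(s: str) -> str:
    # Case folding first, then NFKC.
    return unicodedata.normalize("NFKC", s.casefold())


def normalize_then_fold(s: str) -> str:
    # NFKC first, then case folding.
    return unicodedata.normalize("NFKC", s).casefold()


# BLACK-LETTER CAPITAL H has a compatibility decomposition to "H" but no
# case folding of its own, so the two orders give different results.
sample = "\u210c"

print(fold_then_normalize(sample), normalize_then_fold(sample))
```

With folding first, the character passes through folding unchanged and NFKC then yields an uppercase "H"; with normalization first, the result is "h".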
>>>>> The very fact that this issue exists (and is coming up again)
>>>>> this late in the process (7613 published in August, WG winding
>>>>> down and not, e.g., meeting next week) calls at least the PRECIS
>>>>> quality of review and some fairly fundamental model issues into
>>>>> question.  I first raised that issue a rather long time ago but
>>>>> have continued to hope that we have an approximation to "good
>>>>> enough" without going back and rethinking everything.
>>>>>
>>>>> The right solution, IMO, is that, if RFC 7613 is to rationalize
>>>>> or explain the operation in terms of converting upper case
>>>>> characters to lower case, then it should be using toLowerCase
>>>>> because that is what the operation does.  After a quick look at
>>>>> 7613, amending/updating it to simply convert to lower case would
>>>>> be straightforward (and would not raise the ordering issue
>>>>> called out above).  It would presumably require another IETF
>>>>> Last Call, however, and I'd hope we would see some serious
>>>>> discussion within the WG (and with UTC) before making the change
>>>>> and about how it is explained.
>>>>>
>>>>> If we are not willing to make a change
>>>>
>>>> I'm willing. It would, as you note, require some careful thinking and
>>>> review to make sure that we got it (more) right this time.
>>>>
>>>>> that significant and/or
>>>>> if we conclude that the WG (and perhaps the IETF) have
>>>>> completely run out of energy for dealing with i18n issues [1],
>>>>> then I suggest that we introduce some additional text.  I've
>>>>> just spent a half-hour trying to find the AUTH48 copy of
>>>>> precis-nickname (aka RFC-to-be-7700), but the RFC Editor has
>>>>> apparently changed naming conventions and the various queue
>>>>> entry pages all point to the -19 I-D and not the current working
>>>>> copy so I can't try to match text and insertion point to what is
>>>>> there already.
>>>>
>>>> http://www.rfc-editor.org/authors/rfc7700.txt
>>>>
>>>>>  The suggestion is a patch (and a hack), not a
>>>>> good fix but something like it is probably the least drastic
>>>>> measure that would yield something that doesn't contain
>>>>> unexplained known defects.
>>>>>
>>>>> Rough version of suggested text (possibly to go after your
>>>>> revised paragraph and following up my comments in my 1 October
>>>>> note).  Some of the terminology needs checking which I can do if
>>>>> you want to go this route:
>>>>>
>>>>>     'Users of this specification should note that the
>>>>>     concept of "lower case conversion" is somewhat elusive
>>>>>     and more dependent on the conventions of different
>>>>>     languages and notation systems that use the same script
>>>>>     than may appear obvious at first glance, especially if
>>>>>     that glance is at Basic Latin characters (i.e., the
>>>>>     ASCII letter repertoire).  Unicode provides two
>>>>>     different mapping procedures that produce lower-case
>>>>>     characters, but they have different effects and results
>>>>>     for many characters.  The more conservative one,
>>>>>     typically appropriately applicable when lower case forms
>>>>>     are needed, is actual lower-casing (embodied in the
>>>>>     Unicode operation toLowerCase).  A more radical
>>>>>     operation, normally suitable only for string matching in
>>>>>     situations in which it is better to consider uncertain
>>>>>     cases as matching than to treat them as distinct, is
>>>>>     called "Case Folding" (Unicode operation toCaseFold).
>>>>>     While the two operations will often produce the same
>>>>>     results, Case Folding maps some lower case characters
>>>>>     into others and performs other transformations that may
>>>>>     be intuitively reasonable and expected for some users
>>>>>     and quite astonishing (or just wrong) to others.  There
>>>>>     may be no practical alternative, especially if the
>>>>>     operations are to be used for mapping or enforcement, to
>>>>>     developers of PRECIS-dependent understanding that the
>>>>>     cases in which the two yield different results require
>>>>>     careful understanding of the relevant user base and its
>>>>>     needs [2].'
>>>>
>>>> Thanks.
>>>>
>>>> I am not sure if we need something like that if we move case mapping
>>>> (here, case folding) to the comparison operation only - but something
>>>> like that might still be appropriate.
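
The difference between the two operations contrasted in the suggested text can be sketched as follows, assuming Python's str.lower() and str.casefold() as approximations of Unicode toLowerCase and toCaseFold. The sample characters and the helper name are illustrative only:

```python
def folds_differently(chars):
    """Return the characters whose lower-casing and case folding differ."""
    return [c for c in chars if c.lower() != c.casefold()]


samples = [
    "A",        # LATIN CAPITAL LETTER A: both operations give "a"
    "\u00df",   # LATIN SMALL LETTER SHARP S: already lower case, folds to "ss"
    "\u03c2",   # GREEK SMALL LETTER FINAL SIGMA: folds to U+03C3
    "\u017f",   # LATIN SMALL LETTER LONG S: folds to "s"
]

for c in samples:
    print(f"U+{ord(c):04X} lower={c.lower()!r} casefold={c.casefold()!r}")

print(folds_differently(samples))
```

The last three samples are already lower case, so lower-casing leaves them unchanged while case folding maps them to other characters.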
>>>>
>>>>>>> (3) Should we update RFC 7613 so that case folding is applied
>>>>>>> only as part of comparison and not as part of enforcement?
>>>>>>
>>>>>> That is less urgent so I suggest that we address the nickname
>>>>>> spec first.
>>>>>
>>>>> Unless you (or someone else here) have a plausible plan to
>>>>> continue and revitalize the WG and assign it that revision work
>>>>> (and bring everyone actively participating up to the level
>>>>> needed to easily understand this discussion thread and feel
>>>>> embarrassed for not spotting the problems), I think we need to
>>>>> assume that this is our last shot.  Absent an active and
>>>>> committed WG, "do this first" could easily be equivalent to
>>>>> "don't get around to the other, ever".
>>>>
>>>> As mentioned, I don't want to have broken RFCs out there.
>>>>
>>>>> I think that the particular set of issues that started this
>>>>> thread is a known defect in the PRECIS specs, both nickname and
>>>>> 7613, and that we are obligated to either fix the problems or at
>>>>> least explain them.  The above warning text is an attempt to
>>>>> explain and identify the problems even if it does not actually
>>>>> provide a solution.  If it were published as part of
>>>>> precis-nickname, it could include a statement to the effect that
>>>>> it should also be treated as an update to 7613 or, if the IESG
>>>>> and RFC Editor would agree in advance to accept, rather than
>>>>> bury, the thing, I suppose we could publish it in
>>>>> precis-nickname and create an erratum to 7613 indicating that it
>>>>> should have included some form of that statement.  Neither
>>>>> option implies a huge amount of work to update 7613.  But I
>>>>> think that making the changes of (2) without doing anything
>>>>> about (3) makes the two documents inconsistent with each other
>>>>> and that would be an additional known defect.
>>>>>
>>>>> Procedural question: given that precis-nickname is in AUTH48 as
>>>>> of yesterday and I don't see anything blocking publication next
>>>>> week if you and Barry sign off on the revised text that the WG
>>>>> hasn't seen,
>>>>
>>>> There is no revised text yet. That's why we're having this discussion.
>>>>
>>>>> does someone need to file a pro forma objection/
>>>>> appeal to block that until this is sorted out and the WG has a
>>>>> chance to review proposed publication text?
>>>>
>>>> I see no reason to invoke the specter of appeals quite yet. Everyone is
>>>> working in good faith to do the right thing and get this mess cleaned
>>>> up.
>>>>
>>>>> [1] I believe our collective inability to deal with the
>>>>> within-script character forms that do not normalize to each
>>>>> other because of language-dependent or other usage factors can
>>>>> be taken as evidence of having run out of energy,
>>>>
>>>> Or in my case simple ignorance of some of the relevant issues and
>>>> examples. It's not easy to know about all of this.
>>>>
>>>>> but it is
>>>>> probably in the interest of finishing the PRECIS work to try to
>>>>> treat that as a separate issue.
>>>>
>>>> Probably.
>>>>
>>>>> [2] Not unlike the reason to differentiate between NFC and NFKC
>>>>> and understand the effects of each.
>>>>
>>>> Another thing that's not easy to grok in fullness.
>>>>
>>>> Peter
>>>>
