Re: [precis] Ambiguity in specification of case mapping in RFC 7613 and draft-ietf-precis-nickname
Tom Worster <fsb@thefsb.org> Tue, 03 November 2015 13:16 UTC
Date: Tue, 03 Nov 2015 08:16:13 -0500
From: Tom Worster <fsb@thefsb.org>
To: Peter Saint-Andre <peter@andyet.net>, John C Klensin <john-ietf@jck.com>, Alexey Melnikov <Alexey.Melnikov@isode.com>
Message-ID: <D25E1ABD.673A9%fsb@thefsb.org>
Thread-Topic: [precis] Ambiguity in specification of case mapping in RFC 7613 and draft-ietf-precis-nickname
References: <0347834EBDC481BD99BDBE67@JcK-HP8200.jck.com> <56302E6D.5030901@andyet.net> <56312AAC.1000300@andyet.net> <56313616.8000801@andyet.net> <563143B9.7020707@andyet.net> <56383B2F.6080505@andyet.net>
In-Reply-To: <56383B2F.6080505@andyet.net>
Mime-version: 1.0
Content-type: text/plain; charset="UTF-8"
Content-transfer-encoding: quoted-printable
Archived-At: <http://mailarchive.ietf.org/arch/msg/precis/OEzMK1rNLzgOoZuIBF44Wr2IqWU>
Cc: precis@ietf.org
Subject: Re: [precis] Ambiguity in specification of case mapping in RFC 7613 and draft-ietf-precis-nickname
Peter,

This is much better and, as far as the normative language goes, seems
correct and unambiguous.

The informative sentence in 2.1 Rule 3:

  "The primary result of doing so is that uppercase characters are
  mapped to lowercase characters."

is good, but I think it's worth spending a few more words to spell out
that "primary" implies exceptions:

  "While the primary result is that uppercase characters are mapped to
  lowercase characters, there are exceptions."

It might nudge a few more implementers to understand that toLowerCase()
isn't the right thing.

Tom

On 11/2/15, 11:42 PM, "Peter Saint-Andre" <peter@andyet.net> wrote:

>For ease of reviewing only, and with no presumption that these proposed
>changes have been accepted by anyone, I have asked the RFC Editor to
>provisionally update the document in AUTH48 as outlined in the messages
>I have sent to the list. The resulting file is here:
>
>http://www.rfc-editor.org/authors/rfc7700.txt
>
>Despite those caveats, if at all possible I would prefer to find an
>acceptable solution for publishing this RFC now without undue further
>delays (in part because draft-ietf-simple-chat has been held on this
>document for almost 3 years!). That does not mean, as I said earlier in
>this thread, that I want to have broken RFCs out there, but I think we
>can fix this one acceptably now and then update it again in strict
>coherence with updates to RFC 7564 and RFC 7613. I am committed to
>getting things right, but I am also committed to not holding up other
>people's work for years and years on end.
>
>Peter
>
>On 10/28/15 3:52 PM, Peter Saint-Andre wrote:
>> Example 7 needs to be corrected, too, in accordance with
>> CaseFolding.txt.
>>
>> On 10/28/15 2:54 PM, Peter Saint-Andre wrote:
>>> And here is another correction in Section 3...
>>> >>> OLD >>> >>> Regarding examples 5, 6, and 7: applying Unicode Default Case >>>Folding >>> to GREEK CAPITAL LETTER SIGMA (U+03A3) results in GREEK SMALL >>>LETTER >>> SIGMA (U+03C3), and doing so during comparison would result in >>> matching the nicknames in examples 5 and 6; however, because the >>> PRECIS mapping rules do not account for the special status of GREEK >>> SMALL LETTER FINAL SIGMA (U+03C2), the nicknames in examples 5 and >>>7 >>> or examples 6 and 7 would not be matched. >>> >>> NEW >>> >>> Regarding examples 5, 6, and 7: applying Unicode Default Case >>>Folding >>> to GREEK CAPITAL LETTER SIGMA (U+03A3) results in GREEK SMALL >>>LETTER >>> SIGMA (U+03C3), and the same is true of GREEK SMALL LETTER FINAL >>> SIGMA (U+03C2); therefore, the comparison operation defined in >>> Section 2.4 would result in matching of the nicknames in examples >>>5, >>> 6, and 7. >>> >>> On 10/28/15 2:06 PM, Peter Saint-Andre wrote: >>>> I propose the following text changes: >>>> >>>> ### >>>> >>>> OLD >>>> >>>> 3. Case Mapping Rule: Uppercase and titlecase characters MUST be >>>> mapped to their lowercase equivalents using Unicode Default >>>>Case >>>> Folding as defined in the Unicode Standard [Unicode] (at the >>>> time >>>> of this writing, the algorithm is specified in Chapter 3 of >>>> [Unicode7.0]). In applications that prohibit conflicting >>>> nicknames, this rule helps to reduce the possibility of >>>> confusion >>>> by ensuring that nicknames differing only by case (e.g., >>>> "stpeter" vs. "StPeter") would not be presented to a human >>>>user >>>> at the same time. >>>> >>>> NEW >>>> >>>> 3. Case Mapping Rule: Unicode Default Case Folding MUST be >>>>applied, >>>> as defined in the Unicode Standard [Unicode] (at the time >>>> of this writing, the algorithm is specified in Chapter 3 of >>>> [Unicode7.0]). The primary result of doing so is that >>>>uppercase >>>> characters are mapped to lowercase characters. 
In applications >>>> that prohibit conflicting nicknames, this rule helps to reduce >>>> the possibility of confusion by ensuring that nicknames >>>> differing only by case (e.g., "stpeter" vs. "StPeter") would >>>>not >>>> be presented to a human user at the same time. >>>> >>>> ### >>>> >>>> (The foregoing was previously sent to the list.) >>>> >>>> ### >>>> >>>> OLD >>>> >>>> 2.3. Enforcement >>>> >>>> An entity that performs enforcement according to this profile MUST >>>> prepare a string as described in Section 2.2 and MUST also apply >>>>the >>>> rules specified in Section 2.1. The rules MUST be applied in the >>>> order shown. >>>> >>>> After all of the foregoing rules have been enforced, the entity >>>>MUST >>>> ensure that the nickname is not zero bytes in length (this is done >>>> after enforcing the rules to prevent applications from mistakenly >>>> omitting a nickname entirely, because when internationalized >>>> characters are accepted, a non-empty sequence of characters can >>>> result in a zero-length nickname after canonicalization). >>>> >>>> 2.4. Comparison >>>> >>>> An entity that performs comparison of two strings according to >>>>this >>>> profile MUST prepare each string and enforce the rules as >>>>specified >>>> in Sections 2.2 and 2.3. The two strings are to be considered >>>> equivalent if they are an exact octet-for-octet match (sometimes >>>> called "bit-string identity"). >>>> >>>> NEW >>>> >>>> 2.3. Enforcement >>>> >>>> An entity that performs enforcement according to this profile MUST >>>> prepare a string as described in Section 2.2 and MUST also apply >>>>the >>>> following rules specified in Section 2.1 in the order shown: >>>> >>>> 1. Additional Mapping Rule >>>> 2. Normalization Rule >>>> 3. 
Directionality Rule >>>> >>>> After all of the foregoing rules have been enforced, the entity >>>>MUST >>>> ensure that the nickname is not zero bytes in length (this is done >>>> after enforcing the rules to prevent applications from mistakenly >>>> omitting a nickname entirely, because when internationalized >>>> characters are accepted, a non-empty sequence of characters can >>>> result in a zero-length nickname after canonicalization). >>>> >>>> 2.4. Comparison >>>> >>>> An entity that performs comparison of two strings according to >>>>this >>>> profile MUST prepare each string as specified in Section 2.2 and >>>> MUST apply the following rules specified in Section 2.1 in the >>>>order >>>> shown: >>>> >>>> 1. Additional Mapping Rule >>>> 2. Case Mapping Rule >>>> 3. Normalization Rule >>>> 4. Directionality Rule >>>> >>>> The two strings are to be considered equivalent if they are an >>>>exact >>>> octet-for-octet match (sometimes called "bit-string identity"). >>>> >>>> ### >>>> >>>> In addition, some variation on John's proposed text about toLowerCase >>>> vs. toCaseFold might be appropriate at the end of Section 4; however, >>>> I'm still not sure that is necessary if we move the case mapping rule >>>>to >>>> the comparison operation. >>>> >>>> Peter >>>> >>>> On 10/27/15 8:09 PM, Peter Saint-Andre wrote: >>>>> On 10/27/15 11:32 AM, John C Klensin wrote: >>>>>> Response to Monday's note immediately below; response to today's >>>>>> follows it. My apologies, but it is probably important to read >>>>>> both. My further apologies for the length of this note, but I >>>>>> think we are in deep trouble here, >>>>> >>>>> Internationalization always seems to be a matter of how deep the >>>>> trouble >>>>> is... 
>>>>>
>>>>>> trouble that is aggravated by
>>>>>> precis-mappings and precis-nickname both being post-approval and
>>>>>> that, as far as I know, there are no future plans for PRECIS
>>>>>> work (having precis-nickname in AUTH48 just emphasizes that --
>>>>>> see comment at end).
>>>>>
>>>>> We had not planned to work on PRECIS because we thought we were done
>>>>> for awhile. If that's not the case and we need to fix things, then
>>>>> so be it. Whether there is sufficient and continued energy for such
>>>>> work is another question. Personally I don't want us to have broken
>>>>> RFCs out there.
>>>>>
>>>>>> --On Monday, October 26, 2015 15:51 -0600 Peter Saint-Andre -
>>>>>> &yet <peter@andyet.net> wrote:
>>>>>>
>>>>>>> My apologies for the delayed reply. Comments inline.
>>>>>>
>>>>>> A few remarks below... I can't tell whether we disagree or
>>>>>> whether at least one of us, probably me, is not being
>>>>>> adequately clear. (Material on which we fairly clearly agree
>>>>>> elided.)
>>>>>>
>>>>>>
>>>>>>> On 10/1/15 7:50 AM, John C Klensin wrote:
>>>>>>>> --On Wednesday, September 30, 2015 15:16 -0600 Peter
>>>>>>>> Saint-Andre - &yet <peter@andyet.net> wrote:
>>>>>>> ...
>>>>>>>> Peter,
>>>>>>>>
>>>>>>>> While your proposed text is an improvement,
>>>>>>>
>>>>>>> Happy to hear it. All I intended was a slight clarification.
>>>>>>
>>>>>> But I'm not certain we are there yet...
>>>>>
>>>>> Agreed. The text I proposed addressed only a very small part of the
>>>>> problem.
>>>>>
>>>>>>>> the desire of many
>>>>>>>> people for a magic "just tell me what to do" formula, one that
>>>>>>>> lets them avoid understanding the issues, may call for a
>>>>>>>> little more:
>>>>>>>
>>>>>>> There is always a need for more when it comes to i18n.
>>>>>>
>>>>>> But I think it is a little more than that.
I've heard several >>>>>> times, including in PRECIS meetings, requests for "just tell me >>>>>> what to do and make sure it isn't complicated" (or "I don't want >>>>>> to have to think about, much less understand, the issues"). We >>>>>> can debate whether giving in to those requests in the I18n case >>>>>> is wise. I think it leads directly to conclusions equivalent to >>>>>> "I understand my own script and writing system (or think I do) >>>>>> and therefore, since all writing systems must be pretty much the >>>>>> same, I understand all of the core issues in terms of my script >>>>>> and understanding". That, in turn, leads directly to the "how >>>>>> do you spell 'Zürich'?" and "all spellings of 'Zuerich' should >>>>>> be treated as equivalent" discussion that sounded like they >>>>>> dominated a BOF at IETF 93. >>>>>> >>>>>> Now I actually think it is reasonable for someone to ask for a >>>>>> library that will do the job most of the time and that will >>>>>> almost never cause their users or customers to get angry at >>>>>> them. But, if we are going to call what we do "standards", they >>>>>> should contain sufficient information that would-be library >>>>>> authors can know what to do ... or understand that they are in >>>>>> over their heads. And, for these particular cases, we may need >>>>>> to explain, or help the library authors explain, why some cases >>>>>> will fail and, indeed, get users mad at vendors. >>>>>> >>>>>> >>>>>>>> (1) First, toCaseFold is _not_ toLowerCase. Saying "The >>>>>>>> primary result of doing so is that uppercase characters are >>>>>>>> mapped to lowercase characters" is true for toCaseFold, >>>>>>> >>>>>>> By "primary" I meant two things: (1) lowercasing is what >>>>>>> happens to the preponderance of code points and (2) this is >>>>>>> the result that most people care about. >>>>>> >>>>>> If I parse the above correctly, I think you are wrong. 
I think >>>>>> what most people want, care about, and think they are getting, >>>>>> is lower case conversion, i.e., an operation that preserves >>>>>> lower case characters and converts upper case characters to the >>>>>> equivalent lower case. toCaseFold isn't that operation. It is >>>>>> a much more complex and subtle operation that, as well as >>>>>> converting upper case characters to lower case, sometimes >>>>>> converts lower case characters to different lower case >>>>>> characters (or strings of them). It also requires a fairly good >>>>>> understanding of Unicode (not just a relevant script) and >>>>>> historical Unicode decisions to predict its behavior and to have >>>>>> any hope of explaining that behavior to users. If one is >>>>>> trying to compare (as distinct from converting), then toCaseFold >>>>>> may be exactly what it wanted. but it is really hard to explain >>>>>> or justify that in terms of "nicknames" or "aliases", which are >>>>>> about conversion. And, if one hopes to explain what is going >>>>>> on to users in terms of "lower casing", then toCaseFold is just >>>>>> the wrong operation. That is what toLowerCase is for and the >>>>>> two operations are just not equivalent. >>>>> >>>>> My recollection, quite possibly inaccurate or incomplete, from at >>>>>least >>>>> one and I think several in-person meetings of the PRECIS WG was: just >>>>> use Unicode Default Case Folding because if you use anything else or >>>>> try >>>>> to roll your own you will be fubar forever. I do not recall any >>>>> discussion of the issues you have raised in this thread (e.g., about >>>>> the >>>>> inadvisability of using case folding for anything but comparison >>>>> operations) until the last few weeks. However, I freely admit that's >>>>> probably because, through my own faults and ignorance, I didn't >>>>> understand what you were saying. 
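The distinction drawn above, that toCaseFold sometimes changes characters that are already lowercase while toLowerCase leaves them alone, can be checked directly. The following is only a rough sketch in Python, on the assumption that `str.lower()` and `str.casefold()` are acceptable stand-ins for Unicode toLowerCase and full Default Case Folding. The helper name `folds_without_being_upper` is hypothetical, and the counts it prints depend on the Unicode version bundled with the interpreter and on using General_Category (Lu/Lt) as the test, so they should not be expected to reproduce the figures quoted in this thread.

```python
import unicodedata


def folds_without_being_upper(limit=0x110000):
    """Return characters that change under str.casefold() even though their
    General_Category is neither Lu (uppercase) nor Lt (titlecase).

    Assumption: General_Category is used as the "uppercase/titlecase" test;
    using the Uppercase property instead would give a different list.
    """
    changed = []
    for cp in range(limit):
        ch = chr(cp)
        if unicodedata.category(ch) in ("Lu", "Lt"):
            continue
        if ch.casefold() != ch:
            changed.append(ch)
    return changed


changed = folds_without_being_upper()
lowercase_letters = [c for c in changed if unicodedata.category(c) == "Ll"]

# Counts vary with the interpreter's Unicode data; no fixed value is implied.
print("changed by case folding, not Lu/Lt:", len(changed))
print("of which Lowercase_Letter (Ll):", len(lowercase_letters))

# Side-by-side view of lowercasing versus case folding for a few strings.
for s in ("StPeter", "\u03c2", "\u00df", "\u00b5"):
    print(repr(s), "lower:", repr(s.lower()), "casefold:", repr(s.casefold()))
```

For the plain ASCII example the two operations agree; they diverge for final sigma (U+03C2), sharp s (U+00DF), and micro sign (U+00B5), which are all lowercase already.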
>>>>> >>>>>> FWIW and purely by coincidence wrt PRECIS and this document, I >>>>>> had a conversation a few days ago with an expert on Arabic (and >>>>>> Persian) calligraphy and writing systems (and good general >>>>>> knowledge of writing systems) who is quite insistent that any >>>>>> procedure we use for case-insensitive matching (e.g., case >>>>>> folding) is discriminatory, inconsistent, and just >>>>>> badly-thought-out if that same procedure doesn't treat isolated, >>>>>> initial, and medial forms of the same character as equivalent. >>>>>> He further strengthens his case (sic) by noting that Unicode >>>>>> case folding maps U+03C2 (GREEK SMALL LETTER FINAL SIGMA, >>>>>> unambiguously a lower-case character) to U+03C3 (GREEK SMALL >>>>>> LETTER SIGMA), a relationship that depends entirely on >>>>>> positional use and not case. He also believes the same >>>>>> relationships should apply to all other scripts that make form >>>>>> distinctions for some characters based on positions in a string >>>>>> and for which Unicode has chosen to assign different code >>>>>> points. Even if there were wide acceptance of his view, Unicode >>>>>> stability principles would prevent changing toCaseFold (or >>>>>> CaseFolding.txt), but this is more evidence that what toCaseFold >>>>>> does and does not do is going to be hard to explain to either >>>>>> casual users or to writing system experts whose primary >>>>>> experience is not with the Greek-Latin-Cyrillic group. >>>>>> >>>>>> I don't think we want to say "these matching rules are somewhat >>>>>> arbitrary and irrational, but, if you don't like it, blame >>>>>> Unicode and not us", if only because it is our choice to use >>>>>> those matching rules. More below. >>>>>> >>>>>> >>>>>>> ... >>>>>>>> (2) Second, probably as a result of having IDNA in the lead, >>>>>>>> we've gotten sloppy about language and operations and should >>>>>>>> probably start untangling that before it gets people in >>>>>>>> trouble. 
>>>>>>> >>>>>>> Where is the right place to do that untangling? (I doubt that >>>>>>> it is the precis-nickname document.) >>>>>> >>>>>> I agree that precis-nickname isn't the ideal place. I also >>>>>> believe that you and it are the innocent victims of the >>>>>> situation. At the same time, I don't believe IETF should be >>>>>> producing incomplete, ambiguous, erroneous, or misleading >>>>>> standards because no one could get around to doing the right >>>>>> foundational work. >>>>> >>>>> Agreed. I too want to get this right, even though it's not a lot of >>>>>fun >>>>> and it's certainly more work than I thought I was signing up for at >>>>>the >>>>> NEWPREP BoF years ago. >>>>> >>>>>>>> The Unicode Standard, at least as I understand it, is fairly >>>>>>>> clear that the most important (and really only safe) use of >>>>>>>> toCaseFold is as part of a comparison operation. >>>>>>> >>>>>>> Thanks for noting that. For example, Section 5.18 of Unicode >>>>>>> 8.0.0 says: >>>>>>> >>>>>>> Caseless matching is implemented using case folding, which >>>>>>> is the >>>>>>> process of mapping characters of different case to a >>>>>>> single form, so >>>>>>> that case differences in strings are erased. Case folding >>>>>>> allows for >>>>>>> fast caseless matches in lookups because only binary >>>>>>> comparison is >>>>>>> required. It is more than just conversion to lowercase. >>>>>> >>>>>> Right. But, again, when its use is appropriate (a very >>>>>> controversial topic in itself with our painful IDNA history with >>>>>> Final Sigma, Eszett and the case-independent versus >>>>>> position-independent controversy called out above as examples) >>>>>> that is "matches in lookups" (what I've described elsewhere as >>>>>> "comparison only"). Not creating or defining nicknames or >>>>>> aliases. And that _is_ a problem for this document. >>>>> >>>>> I'm not convinced that things are as bad as you think. 
>>>>> If we say in
>>>>> draft-ietf-precis-nickname that the case mapping rule is to be
>>>>> applied only as part of comparison and not as part of enforcement -
>>>>> which I think is really what we care about (e.g., to prevent
>>>>> spoofing of users in chat rooms) - then I think we might be most of
>>>>> the way there.
>>>>>
>>>>>>>> Using your
>>>>>>>> example it is entirely reasonable to treat "stpeter" and
>>>>>>>> "StPeter" as equivalent in a comparison operation, but
>>>>>>>> accepting one string and changing it to the other for display
>>>>>>>> may not be a really good idea. While that transformation may
>>>>>>>> be acceptable (although I would be surprised if there were no
>>>>>>>> people who share your surname who could consider "stpeter" or
>>>>>>>> "Stpeter" unacceptable and might even believe that "StPeter"
>>>>>>>> is an unacceptable substitute for "St. Peter"),
>>>>>>>
>>>>>>> I do receive email at stpeter@gmail.com intended for
>>>>>>> st.peter@gmail.com but that's a separate topic...
>>>>>>
>>>>>> One that is relevant because it "works" as a side-effect of a
>>>>>> decision Google has made about mailbox name equivalence, a
>>>>>> decision that, IMO, will sooner or later get someone into a lot
>>>>>> of trouble and, more important, a decision and matching rule
>>>>>> that PRECIS, AFAICT, does not allow and that IDNA unambiguously
>>>>>> forbids.
>>>>>>
>>>>>>>> it also points out the
>>>>>>>> dangers of using Basic Latin script examples to illustrate
>>>>>>>> situations in which even more extended Latin script, much less
>>>>>>>> other scripts, may raise more complex issues.
>>>>>>>> Because IDNA
>>>>>>>> is essentially a workaround because changing the DNS
>>>>>>>> comparison rules was impractical for several reasons, we
>>>>>>>> ended up using toCaseFold to map characters and strings into
>>>>>>>> others in IDNA2003 but PRECIS implementations that do not
>>>>>>>> have the same constraints would, in general, be better off
>>>>>>>> confining the use of toCaseFold, or even toLowerCase, to
>>>>>>>> comparison operations.
>>>>>>>
>>>>>>> Unfortunately we didn't do that in RFC 7564 or RFC 7613. Does
>>>>>>> it make sense for this nickname specification to differ in
>>>>>>> this respect from the published RFCs? Shall we file errata
>>>>>>> against those documents? (This might apply only to RFC 7613,
>>>>>>> which says to apply case folding as part of the enforcement
>>>>>>> process - when exactly to apply case folding is not stipulated
>>>>>>> by RFC 7564.)
>>>>>>
>>>>>> To the extent to which this is a "botched that because the WG
>>>>>> didn't understand the issues well enough" conclusion, it would
>>>>>> be entirely reasonable to generate an updating RFC that repairs
>>>>>> 7613 and/or 7564, even doing so in an addendum to
>>>>>> precis-nickname if that is the only way to do that
>>>>>> expeditiously. Per the above, we really don't want to give
>>>>>> library routine writers bad instructions. As I understand it,
>>>>>> the current position of the RFC Editor and IESG is that
>>>>>> technical specification errors discovered in retrospect or after
>>>>>> people start using a spec are not appropriate topics for errata.
>>>>>> If the WG is not willing to do any of those things, then I
>>>>>> suggest that precis-nickname at least needs to contain a very
>>>>>> clear warning notice about this situation (see my response to
>>>>>> your question 1 below).
>>>>>
>>>>> I think we'll probably need to fix 7613 and 7564. I am hoping we can
>>>>> fix nickname now so that it is less incorrect than the other two.
>>>>> That doesn't necessarily mean we won't need to also further fix
>>>>> nickname later on.
>>>>>
>>>>> Granted, we were supposed to avoid this problem by working on all of
>>>>> the PRECIS specs simultaneously. Clearly we have not avoided the
>>>>> problem, so we need to solve it one way or another. If that means
>>>>> bis for them all, we need to deal with it.
>>>>>
>>>>>>>> (3) Because toCaseFold loses information when used for more
>>>>>>>> than comparison (for comparison, it merely contributes to
>>>>>>>> what some people would consider false positives for matching),
>>>>>>>> involves some controversial decisions, and, because of
>>>>>>>> stability requirements, cannot be changed even if the
>>>>>>>> controversies are resolved in other ways, we end up with,
>>>>>>>> e.g.,
>>>>>>>>     toCaseFold ("Nuß") -> "nuss"
>>>>>>>> which is considered an acceptable transformation in some
>>>>>>>> places that identify themselves as speaking/using German and
>>>>>>>> two different unacceptable errors in others. Again, this will
>>>>>>>> almost always be much more serious if the transformation is
>>>>>>>> used to map and replace strings than if it is used to compare
>>>>>>>> (fwiw, that particular example is part of a continuing
>>>>>>>> disagreement between IDNA2008 and, among others, German
>>>>>>>> domain registry authorities on one side and UTC and UTR 46 on
>>>>>>>> the other).
>>>>>>>
>>>>>>> Agreed.
>>>>>>
>>>>>> See "warning notice" comment above and question 1 response below.
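The "map and replace" versus "compare" distinction in point (3) can be illustrated with a small sketch. It is deliberately simplified and is not an implementation of the nickname profile: it assumes Python's `str.casefold()` for Unicode Default Case Folding, approximates the additional mapping rule with whitespace stripping and collapsing via `str.split()`, uses NFKC for the normalization rule, and omits the directionality rule and all string-class checks. The function names `enforce_nickname`, `comparison_key`, and `nicknames_match` are hypothetical.

```python
import unicodedata


def enforce_nickname(raw: str) -> str:
    """Simplified enforcement: whitespace cleanup and NFKC, NO case mapping.

    The result is what would be stored or displayed, so the original
    casing and characters such as sharp s are preserved.
    """
    collapsed = " ".join(raw.split())  # approximation of the mapping rule
    stored = unicodedata.normalize("NFKC", collapsed)
    if not stored:
        raise ValueError("nickname is empty after enforcement")
    return stored


def comparison_key(raw: str) -> str:
    """Simplified comparison form: mapping, then case folding, then NFKC."""
    collapsed = " ".join(raw.split())
    folded = collapsed.casefold()
    return unicodedata.normalize("NFKC", folded)


def nicknames_match(a: str, b: str) -> bool:
    """Two nicknames match if their comparison keys are identical."""
    return comparison_key(a) == comparison_key(b)


stored = enforce_nickname("  Nu\u00df ")
print(repr(stored))                          # display form keeps the sharp s
print(nicknames_match(stored, "NUSS"))       # folding is used only to compare
print(nicknames_match("stpeter", "StPeter"))
```

Because folding happens only inside `comparison_key`, the lossy "Nuß" to "nuss" transformation never replaces what the user entered; it only widens what is treated as a conflict.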
>>>>>> >>>>>>> (4) If the motivation is really to avoid confusion, the >>>>>>>> correct confusion-blocking rule for Latin script (but not >>>>>>>> others) and many languages that use it (but certainly not >>>>>>>> all) involves moving beyond toCaseFold and treating all >>>>>>>> "decorated" characters (characters normally represented by >>>>>>>> glyphs consisting of a Basic Latin character and one or more >>>>>>>> diacritical or equivalent markings) compare equal to their >>>>>>>> base characters, e.g., "á" not only matches "Á" but also >>>>>>>> "a" and "A" and, as an unfortunate side-effect, maybe "À" >>>>>>>> and "à" as well. This is bad news for languages in which >>>>>>>> decorated Latin characters are used to represent phonetically >>>>>>>> and conceptually different characters, not just pronunciation >>>>>>>> variations. I am not qualified to evaluate "how bad". In >>>>>>>> addition, extrapolations from this principle about Latin >>>>>>>> script to unrelated scripts will almost certainly lead to >>>>>>>> serious errors and/or additional confusion. >>>>>>> >>>>>>> I would not be comfortable going that far... >>>>>> >>>>>> In case it isn't clear, I would not be either. But it is where >>>>>> getting sloppy about this stuff could easily take us. It is >>>>>> worth noting that it also identifies one of the difficulties >>>>>> with doing a global system to be applied to many types of >>>>>> applications (like the PRECIS work) and then applying it in user >>>>>> interface software that end users will expect to be localized to >>>>>> their assumptions because it has been mapped or translated into >>>>>> their language (if one normally speaks Upper Slobbovian but has >>>>>> some familiarity with English, an application interface in >>>>>> English will probably be expected to be "foreign", odd, and >>>>>> maybe even inconsistent with whatever expectations exist. 
But, >>>>>> if the interface is in Upper Slobbovian, the natural and >>>>>> reasonable assumption will be the matching should conform to >>>>>> normal Upper Slobbovian conventions. FWIW, a matching rule >>>>>> that says: >>>>>> >>>>>> (i) Two instances of a base character with the same >>>>>> diacritical mark(s) match. >>>>>> (ii) Two instances of a base character with different >>>>>> diacritical mark(s) do not match. >>>>>> (iii) Two instances of a base character, one with >>>>>> diacritical mark(s) and the other without any decoration >>>>>> match. >>>>>> >>>>>> Is precisely correct and normal behavior for at least one >>>>>> language that uses Latin script. It is also the normal practice >>>>>> for at least one Latin script transcription system that is used >>>>>> by a large fraction of a billion people (maybe more). >>>>> >>>>> That is indeed sobering. >>>>> >>>>>>>> More on this and Tom's question below... >>>>>>>> >>>>>>>>> On 9/29/15 3:28 PM, Tom Worster wrote: >>>>>>>>>> Peter, Alexey, >>>>>>>>>> >>>>>>>>>> I think there is an ambiguity in the specification of case >>>>>>>>>> mapping in RFC 7613 and draft-ietf-precis-nickname-19. >>>>>>>>> ... >>>>>>>>>> But there are 55 code points in Unicode 7.0.0 that change >>>>>>>>>> under default case folding that are neither uppercase nor >>>>>>>>>> titlecase characters, 12 of which are Lowercase_Letter. I >>>>>>>>>> suspect this stems from a confusion between Unicode case >>>>>>>>>> mapping and case folding. >>>>>> >>>>>> In the context of the above, a different way to say the same >>>>>> thing is that people are looking at toCaseFold and assuming (and >>>>>> explaining things in terms of) toLowerCase. 
toCaseFold works >>>>>> the way it is expected to and those 55 code points are, more or >>>>>> less, collateral damage to get to a matching algorithm that >>>>>> favors false positives over false negatives and various edge >>>>>> cases (including in "edge cases" languages spoken by, and script >>>>>> variations used by, millions of people). >>>>> >>>>> Sadly I suspect that is an accurate description of the current state >>>>>of >>>>> affairs (modulo my comment above about PRECIS WG discussions at one >>>>>or >>>>> more IETF meetings). >>>>> >>>>>>> ... >>>>>>> After all that, I have 3 questions: >>>>>> >>>>>> Personal opinions about answers... >>>>>> >>>>>>> (1) Is my proposed text enough of a clarification that we >>>>>>> should make that change before the nickname I-D is published >>>>>>> as an RFC? >>>>>> >>>>>> I think the clarification is an improvement and is important >>>>>> enough to incorporate (I know that is the answer to a slightly >>>>>> different question). >>>>>> >>>>>> However, I think it is inadequate without a serious warning >>>>>> about the situation. >>>>> >>>>> Yes. >>>>> >>>>>> That warning could appear in either this >>>>>> document or RFC 7613 (or 7613bis) with a pointer from the other, >>>>>> but, unless you want to revise 7613 now, this one is handy. >>>>> >>>>> I suspect that we need to revise 7613. I suspect that we might also >>>>> need >>>>> to revise 7564 (at least with respect to the order in which >>>>>operations >>>>> are applied, since there has been some confusion among implementers). >>>>> >>>>> Well, we always knew that we would need to revise them. Just not so >>>>> soon. >>>>> >>>>>> Comment about possible text below. >>>>>> >>>>>>> (2) Should we modify draft-ietf-precis-nickname so that case >>>>>>> folding is applied only as part of comparison and not as part >>>>>>> of enforcement? If so, should we make that change before this >>>>>>> document is published as an RFC? >>>>>> >>>>>> Yes. 
If something is used for "enforcement", it should be lower >>>>>> casing or something else that can be explained to people who are >>>>>> ordinarily familiar with one or more of the scripts that make >>>>>> case distinctions. >>>>>> >>>>>> However, viewed in the light of this discussion, the whole >>>>>> "enforcement" concept becomes a little dicey, especially if, as >>>>>> I believe but don't have time to verify, the transformations >>>>>> performed by toLowerCase are not a proper subset of those >>>>>> performed by toCaseFold. >>>>> >>>>> My initial thought is that case mapping doesn't belong in the >>>>>nickname >>>>> enforcement operation at all - only in the comparison operation. >>>>> >>>>>>> (3) Should we update RFC 7613 so that case folding is applied >>>>>>> only as part of comparison and not as part of enforcement? >>>>>> >>>>>> I think that is necessary. Following up on the comment above, I >>>>>> would prefer that the current Section 3.2.2 (3) of RFC 7613 >>>>>> either point to Unicode Lower Casing or contain a warning along >>>>>> the lines of that below. >>>>> >>>>> Unlike the nickname profile (which I think can be cleaned up by >>>>>moving >>>>> the case mapping rule to the comparison operation and continuing to >>>>>use >>>>> Unicode Default Case Folding), I think you are right that for the >>>>> UsernameCaseMapped profile we probably want Unicode Lower Casing. >>>>>Thus >>>>> the likely need, sooner rather than later, for 7613bis. >>>>> >>>>>> >>>>>> ---------- >>>>>> >>>>>> --On Tuesday, October 27, 2015 07:52 -0600 Peter Saint-Andre >>>>>> <peter@andyet.net> wrote: >>>>>> >>>>>>> This issue has greater urgency now because >>>>>>> draft-ietf-precis-nickname is now in AUTH48... 
>>>>>>>
>>>>>>> On 10/26/15 3:51 PM, Peter Saint-Andre - &yet wrote:
>>>>>>>
>>>>>>>> After all that, I have 3 questions:
>>>>>>>>
>>>>>>>> (1) Is my proposed text enough of a clarification that we
>>>>>>>> should make that change before the nickname I-D is published
>>>>>>>> as an RFC?
>>>>>>>
>>>>>>> I think so.
>>>>>>
>>>>>> See above.
>>>>>>
>>>>>>>> (2) Should we modify draft-ietf-precis-nickname so that case
>>>>>>>> folding is applied only as part of comparison and not as part
>>>>>>>> of enforcement? If so, should we make that change before this
>>>>>>>> document is published as an RFC?
>>>>>>>
>>>>>>> Although it seems to be the case that Unicode case folding is
>>>>>>> primarily designed for the purpose of matching (i.e.,
>>>>>>> comparison),
>>>>>>
>>>>>> "Seems" is a little weak. The Unicode Standard is really quite
>>>>>> specific about that.
>>>>>>
>>>>>>> I have a concern that applying the PRECIS case
>>>>>>> mapping rule after applying the normalization and
>>>>>>> directionality rules might have unintended consequences that
>>>>>>> we haven't had a chance to consider yet. The PRECIS framework
>>>>>>> expresses a preference (actually a hard requirement) for
>>>>>>> applying the rules in a particular order. We made a late
>>>>>>> change to the username profiles (RFC 7613), such that width
>>>>>>> mapping is applied first (in order to accommodate fullwidth
>>>>>>> and halfwidth characters in certain East Asian scripts).
>>>>>>> Making a late change to the nickname profile also concerns me,
>>>>>>> even though both of these late changes seem reasonable on the
>>>>>>> face of it. I will try to find time to think about this
>>>>>>> further in the next 24 hours.
>>>>>>
>>>>>> First, a hint for the consideration process: there is a reason
>>>>>> why Unicode now supports a unified case folding and
>>>>>> normalization operation.
>>>>>> My recollection is that it is not only
>>>>>> more efficient to perform both operations at once (rather than
>>>>>> looking in one table and then the other), but that there are
>>>>>> some order-dependent or priority-dependent cases.
>>>>>>
>>>>>> The very fact that this issue exists (and is coming up again)
>>>>>> this late in the process (7613 published in August, WG winding
>>>>>> down and not, e.g., meeting next week) calls at least the PRECIS
>>>>>> quality of review and some fairly fundamental model issues into
>>>>>> question. I first raised that issue a rather long time ago but
>>>>>> have continued to hope that we have an approximation to "good
>>>>>> enough" without going back and rethinking everything.
>>>>>>
>>>>>> The right solution, IMO, is that, if RFC 7613 is to rationalize
>>>>>> or explain the operation in terms of converting upper case
>>>>>> characters to lower case, then it should be using toLowerCase
>>>>>> because that is what the operation does. After a quick look at
>>>>>> 7613, amending/updating it to simply convert to lower case would
>>>>>> be straightforward (and would not raise the ordering issue
>>>>>> called out above). It would presumably require another IETF
>>>>>> Last Call, however and I'd hope we would see some serious
>>>>>> discussion within the WG (and with UTC) before making the change
>>>>>> and about how it is explained.
>>>>>>
>>>>>> If we are not willing to make a change
>>>>>
>>>>> I'm willing. It would, as you note, require some careful thinking and
>>>>> review to make sure that we got it (more) right this time.
>>>>>
>>>>>> that significant and/or
>>>>>> if we conclude that the WG (and perhaps the IETF) have
>>>>>> completely run out of energy for dealing with i18n issues [1],
>>>>>> then I suggest that we introduce some additional text.
>>>>>> I've
>>>>>> just spent a half-hour trying to find the AUTH48 copy of
>>>>>> precis-nickname (aka RFC-to-be-7700), but the RFC Editor has
>>>>>> apparently changed naming conventions and the various queue
>>>>>> entry pages all point to the -19 I-D and not the current working
>>>>>> copy so I can't try to match text and insertion point to what is
>>>>>> there already.
>>>>>
>>>>> http://www.rfc-editor.org/authors/rfc7700.txt
>>>>>
>>>>>> The suggestion is a patch (and a hack), not a
>>>>>> good fix but something like it is probably the least drastic
>>>>>> measure that would yield something that doesn't contain
>>>>>> unexplained known defects.
>>>>>>
>>>>>> Rough version of suggested text (possibly to go after your
>>>>>> revised paragraph and following up my comments in my 1 October
>>>>>> note). Some of the terminology needs checking which I can do if
>>>>>> you want to go this route:
>>>>>>
>>>>>> 'Users of this specification should note that the
>>>>>> concept of "lower case conversion" is somewhat elusive
>>>>>> and more dependent on the conventions of different
>>>>>> languages and notation systems that use the same script
>>>>>> than may appear obvious at first glance, especially if
>>>>>> that glance is at Basic Latin characters (i.e., the
>>>>>> ASCII letter repertoire). Unicode provides two
>>>>>> different mapping procedures that produce lower-case
>>>>>> characters, but they have different effects and results
>>>>>> for many characters. The more conservative one,
>>>>>> typically appropriately applicable when lower case forms
>>>>>> are needed, is actual lower-casing (embodied in the
>>>>>> Unicode operation toLowerCase). A more radical
>>>>>> operation, normally suitable only for string matching in
>>>>>> situations in which it is better to consider uncertain
>>>>>> cases as matching than to treat them as distinct, is
>>>>>> called "Case Folding" (Unicode operation toCaseFold).
>>>>>> While the two operations will often produce the same
>>>>>> results, Case Folding maps some lower case characters
>>>>>> into others and performs other transformations that may
>>>>>> be intuitively reasonable and expected for some users
>>>>>> and quite astonishing (or just wrong) to others. There
>>>>>> may be no practical alternative, especially if the
>>>>>> operations are to be used for mapping or enforcement, to
>>>>>> developers of PRECIS-dependent understanding that the
>>>>>> cases in which the two yield different results require
>>>>>> careful understanding of the relevant user base and its
>>>>>> needs [2].'
>>>>>
>>>>> Thanks.
>>>>>
>>>>> I am not sure if we need something like that if we move case mapping
>>>>> (here, case folding) to the comparison operation only - but something
>>>>> like that might still be appropriate.
>>>>>
>>>>>>>> (3) Should we update RFC 7613 so that case folding is applied
>>>>>>>> only as part of comparison and not as part of enforcement?
>>>>>>>
>>>>>>> That is less urgent so I suggest that we address the nickname
>>>>>>> spec first.
>>>>>>
>>>>>> Unless you (or someone else here) have a plausible plan to
>>>>>> continue and revitalize the WG and assign it that revision work
>>>>>> (and bring everyone actively participating up to the level
>>>>>> needed to easily understand this discussion thread and feel
>>>>>> embarrassed for not spotting the problems), I think we need to
>>>>>> assume that this is our last shot. Absent an active and
>>>>>> committed WG, "do this first" could easily be equivalent to
>>>>>> "don't get around to the other, ever".
>>>>>
>>>>> As mentioned, I don't want to have broken RFCs out there.
>>>>>
>>>>>> I think that the particular set of issues that started this
>>>>>> thread as a known defect in the PRECIS specs, both nickname and
>>>>>> 7613 and that we are obligated to either fix the problems or at
>>>>>> least explain them.
>>>>>> The above warning text is an attempt to
>>>>>> explain and identify the problems even if it does not actually
>>>>>> provide a solution. If it were published as part of
>>>>>> precis-nickname, it could include a statement to the effect that
>>>>>> it should also be treated as an update to 7613 or, if the IESG
>>>>>> and RFC Editor would agree in advance to accept, rather than
>>>>>> bury, the thing, I suppose we could publish it in
>>>>>> precis-nickname and create an erratum to 7613 indicating that it
>>>>>> should have included some form of that statement. Neither
>>>>>> option implies a huge amount of work to update 7613. But I
>>>>>> think that making the changes of (2) without doing anything
>>>>>> about (3) makes the two documents inconsistent with each other
>>>>>> and that would be an additional known defect.
>>>>>>
>>>>>> Procedural question: given that precis-nickname is in AUTH48 as
>>>>>> of yesterday and I don't see anything blocking publication next
>>>>>> week if you and Barry sign off on the revised text that the WG
>>>>>> hasn't seen,
>>>>>
>>>>> There is no revised text yet. That's why we're having this discussion.
>>>>>
>>>>>> does someone need to file a pro forma objection/
>>>>>> appeal to block that until this is sorted out and the WG has a
>>>>>> chance to review proposed publication text?
>>>>>
>>>>> I see no reason to invoke the specter of appeals quite yet. Everyone is
>>>>> working in good faith to do the right thing and get this mess cleaned
>>>>> up.
>>>>>
>>>>>> [1] I believe our collective inability to deal with the
>>>>>> within-script character forms that do not normalize to each
>>>>>> other because of language-dependent or other usage factors can
>>>>>> be taken as evidence of having run out of energy,
>>>>>
>>>>> Or in my case simple ignorance of some of the relevant issues and
>>>>> examples. It's not easy to know about all of this.
>>>>>
>>>>>> but it is
>>>>>> probably in the interest of finishing the PRECIS work to try to
>>>>>> treat that as a separate issue.
>>>>>
>>>>> Probably.
>>>>>
>>>>>> [2] Not unlike the reason to differentiate between NFC and NFKC
>>>>>> and understand the effects of each.
>>>>>
>>>>> Another thing that's not easy to grok in fulness.
>>>>>
>>>>> Peter
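
For anyone following the quoted exchange who wants to see the toLowerCase / toCaseFold difference concretely, here is a rough sketch in Python. It assumes that Python's built-in str.lower() and str.casefold() are reasonable stand-ins for the two Unicode operations. The names enforce() and same_nickname() are hypothetical, and the use of NFKC alone for enforcement is an assumption made for illustration; none of this is the PRECIS algorithm as specified in any RFC or draft.

```python
import unicodedata


def lower_vs_fold(s):
    """Return (lower-cased, case-folded) forms of s for side-by-side comparison."""
    return s.lower(), s.casefold()


def enforce(nick):
    """Hypothetical enforcement step: normalize only, leave case untouched.

    NFKC is an assumption for this sketch, not a statement of what
    the nickname profile requires.
    """
    return unicodedata.normalize("NFKC", nick)


def same_nickname(a, b):
    """Hypothetical comparison step: case folding is applied here and only here.

    A real implementation would also have to consider the ordering of
    normalization and case folding raised in the thread.
    """
    return enforce(a).casefold() == enforce(b).casefold()


if __name__ == "__main__":
    # Case folding maps some lower case characters into others;
    # lower-casing leaves them alone.
    for ch in ["ß", "ς", "ſ"]:
        print(repr(ch), lower_vs_fold(ch))

    # The stored form keeps the user's spelling; matching is deliberately loose.
    print(enforce("Straße"))
    print(same_nickname("Straße", "STRASSE"))
```

With this split the stored nickname keeps whatever case the user typed, and the more aggressive folding only affects whether two nicknames are treated as the same, which is the arrangement suggested for the nickname profile in the quoted discussion.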