Re: [precis] Ambiguity in specification of case mapping in RFC 7613 and draft-ietf-precis-nickname
John C Klensin <john-ietf@jck.com> Tue, 27 October 2015 17:32 UTC
Date: Tue, 27 Oct 2015 13:32:04 -0400
From: John C Klensin <john-ietf@jck.com>
To: Peter Saint-Andre - &yet <peter@andyet.net>, Tom Worster <fsb@thefsb.org>, Alexey Melnikov <Alexey.Melnikov@isode.com>
Archived-At: <http://mailarchive.ietf.org/arch/msg/precis/Jua858TgnE6GFGb5XUnPqPK3CmQ>
Cc: precis@ietf.org

Response to Monday's note immediately below; response to today's follows it. My apologies, but it is probably important to read both. My further apologies for the length of this note, but I think we are in deep trouble here, trouble that is aggravated by precis-mappings and precis-nickname both being post-approval and by the fact that, as far as I know, there are no future plans for PRECIS work (having precis-nickname in AUTH48 just emphasizes that -- see comment at end).

--On Monday, October 26, 2015 15:51 -0600 Peter Saint-Andre - &yet <peter@andyet.net> wrote:

> My apologies for the delayed reply. Comments inline.

A few remarks below... I can't tell whether we disagree or whether at least one of us, probably me, is not being adequately clear. (Material on which we fairly clearly agree elided.)

> On 10/1/15 7:50 AM, John C Klensin wrote:
>> --On Wednesday, September 30, 2015 15:16 -0600 Peter
>> Saint-Andre - &yet <peter@andyet.net> wrote:
>...
>> Peter,
>>
>> While your proposed text is an improvement,
>
> Happy to hear it. All I intended was a slight clarification.

But I'm not certain we are there yet...

>> the desire of many
>> people for a magic "just tell me what to do" formula, one that
>> lets them avoid understanding the issues, may call for a
>> little more:
>
> There is always a need for more when it comes to i18n.

But I think it is a little more than that. I've heard several times, including in PRECIS meetings, requests for "just tell me what to do and make sure it isn't complicated" (or "I don't want to have to think about, much less understand, the issues"). We can debate whether giving in to those requests in the I18n case is wise. I think it leads directly to conclusions equivalent to "I understand my own script and writing system (or think I do) and therefore, since all writing systems must be pretty much the same, I understand all of the core issues in terms of my script and understanding". That, in turn, leads directly to the "how do you spell 'Zürich'?"
and "all spellings of 'Zuerich' should be treated as equivalent" discussions that sounded like they dominated a BOF at IETF 93.

Now I actually think it is reasonable for someone to ask for a library that will do the job most of the time and that will almost never cause their users or customers to get angry at them. But, if we are going to call what we do "standards", they should contain sufficient information that would-be library authors can know what to do ... or understand that they are in over their heads. And, for these particular cases, we may need to explain, or help the library authors explain, why some cases will fail and, indeed, get users mad at vendors.

>> (1) First, toCaseFold is _not_ toLowerCase. Saying "The
>> primary result of doing so is that uppercase characters are
>> mapped to lowercase characters" is true for toCaseFold,
>
> By "primary" I meant two things: (1) lowercasing is what
> happens to the preponderance of code points and (2) this is
> the result that most people care about.

If I parse the above correctly, I think you are wrong. I think what most people want, care about, and think they are getting, is lower case conversion, i.e., an operation that preserves lower case characters and converts upper case characters to the equivalent lower case. toCaseFold isn't that operation. It is a much more complex and subtle operation that, as well as converting upper case characters to lower case, sometimes converts lower case characters to different lower case characters (or strings of them). It also requires a fairly good understanding of Unicode (not just a relevant script) and historical Unicode decisions to predict its behavior and to have any hope of explaining that behavior to users. If one is trying to compare (as distinct from converting), then toCaseFold may be exactly what is wanted, but it is really hard to explain or justify that in terms of "nicknames" or "aliases", which are about conversion.
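
As a small illustration of the distinction drawn above, the sketch below contrasts the two operations. It assumes that Python's built-in `str.lower()` and `str.casefold()` are acceptable stand-ins for Unicode toLowerCase and toCaseFold; the helper names `describe` and `caseless_equal` are hypothetical, and the exact results depend on the Unicode version bundled with the interpreter.

```python
# Assumption: str.lower() ~ toLowerCase, str.casefold() ~ toCaseFold (full folding).

def describe(s: str) -> dict:
    """Show what each operation does to the same input string."""
    return {"input": s, "lower": s.lower(), "casefold": s.casefold()}

def caseless_equal(a: str, b: str) -> bool:
    """Comparison-only use: fold both sides, leave the stored strings untouched."""
    return a.casefold() == b.casefold()

# U+03C2 is GREEK SMALL LETTER FINAL SIGMA, already a lower-case letter.
for sample in ["StPeter", "Nuß", "\u03c2"]:
    print(describe(sample))

print(caseless_equal("stpeter", "StPeter"))
```

For plain ASCII input the two operations agree, which is part of why the difference is easy to overlook; the second and third samples are lower-case characters that case folding nevertheless changes.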
And, if one hopes to explain what is going on to users in terms of "lower casing", then toCaseFold is just the wrong operation. That is what toLowerCase is for and the two operations are just not equivalent.

FWIW and purely by coincidence wrt PRECIS and this document, I had a conversation a few days ago with an expert on Arabic (and Persian) calligraphy and writing systems (and good general knowledge of writing systems) who is quite insistent that any procedure we use for case-insensitive matching (e.g., case folding) is discriminatory, inconsistent, and just badly-thought-out if that same procedure doesn't treat isolated, initial, and medial forms of the same character as equivalent. He further strengthens his case (sic) by noting that Unicode case folding maps U+03C2 (GREEK SMALL LETTER FINAL SIGMA, unambiguously a lower-case character) to U+03C3 (GREEK SMALL LETTER SIGMA), a relationship that depends entirely on positional use and not case. He also believes the same relationships should apply to all other scripts that make form distinctions for some characters based on positions in a string and for which Unicode has chosen to assign different code points.

Even if there were wide acceptance of his view, Unicode stability principles would prevent changing toCaseFold (or CaseFolding.txt), but this is more evidence that what toCaseFold does and does not do is going to be hard to explain to either casual users or to writing system experts whose primary experience is not with the Greek-Latin-Cyrillic group. I don't think we want to say "these matching rules are somewhat arbitrary and irrational, but, if you don't like it, blame Unicode and not us", if only because it is our choice to use those matching rules.

More below.

>...
>> (2) Second, probably as a result of having IDNA in the lead,
>> we've gotten sloppy about language and operations and should
>> probably start untangling that before it gets people in
>> trouble.
>
> Where is the right place to do that untangling? (I doubt that
> it is the precis-nickname document.)

I agree that precis-nickname isn't the ideal place. I also believe that you and it are the innocent victims of the situation. At the same time, I don't believe IETF should be producing incomplete, ambiguous, erroneous, or misleading standards because no one could get around to doing the right foundational work.

>> The Unicode Standard, at least as I understand it, is fairly
>> clear that the most important (and really only safe) use of
>> toCaseFold is as part of a comparison operation.
>
> Thanks for noting that. For example, Section 5.18 of Unicode
> 8.0.0 says:
>
>    Caseless matching is implemented using case folding, which is the
>    process of mapping characters of different case to a single form, so
>    that case differences in strings are erased. Case folding allows for
>    fast caseless matches in lookups because only binary comparison is
>    required. It is more than just conversion to lowercase.

Right. But, again, when its use is appropriate (a very controversial topic in itself with our painful IDNA history with Final Sigma, Eszett and the case-independent versus position-independent controversy called out above as examples) that is "matches in lookups" (what I've described elsewhere as "comparison only"). Not creating or defining nicknames or aliases. And that _is_ a problem for this document.

>> Using your
>> example it is entirely reasonable to treat, "stpeter" and
>> "StPeter" as equivalent in a comparison operation, but
>> accepting one string and changing it to the other for display
>> may not be a really good idea. While that transformation may
>> be acceptable (although I would be surprised if there were no
>> people who share your surname who could consider "stpeter" or
>> "Stpeter" unacceptable and might even believe that "StPeter"
>> is an unacceptable substitute for "St.
Peter"),
>
> I do receive email at stpeter@gmail.com intended for
> st.peter@gmail.com but that's a separate topic...

One that is relevant because it "works" as a side-effect of a decision Google has made about mailbox name equivalence, a decision that, IMO, will sooner or later get someone into a lot of trouble and, more important, a decision and matching rule that PRECIS, AFAICT, does not allow and that IDNA unambiguously forbids.

>> it also points out the
>> dangers of using Basic Latin script examples to illustrate
>> situations in which even more extended Latin script, much less
>> other scripts, may raise more complex issues. Because IDNA
>> is essentially a workaround because changing the DNS
>> comparison rules was impractical for several reasons, we
>> ended up using toCaseFold to map characters and strings into
>> others in IDNA2003 but PRECIS implementations that do not
>> have the same constraints would, in general, be better off
>> confining the use of toCaseFold, or even toLowerCase, to
>> comparison operations.
>
> Unfortunately we didn't do that in RFC 7564 or RFC 7613. Does
> it make sense for this nickname specification to differ in
> this respect from the published RFCs? Shall we file errata
> against those documents? (This might apply only to RFC 7613,
> which says to apply case folding as part of the enforcement
> process - when exactly to apply case folding is not stipulated
> by RFC 7564.)

To the extent to which this is a "botched that because the WG didn't understand the issues well enough" conclusion, it would be entirely reasonable to generate an updating RFC that repairs 7613 and/or 7564, even doing so in an addendum to precis-nickname if that is the only way to do that expeditiously. Per the above, we really don't want to give library routine writers bad instructions.
As I understand it, the current position of the RFC Editor and IESG is that technical specification errors discovered in retrospect or after people start using a spec are not appropriate topics for errata.

If the WG is not willing to do any of those things, then I suggest that precis-nickname at least needs to contain a very clear warning notice about this situation (see my response to your question 1 below).

>> (3) Because toCaseFold loses information when used for more
>> than comparison (for comparison, it merely contributes to
>> what some people would consider false positives for matching),
>> involves some controversial decisions and, because of
>> stability requirements, cannot be changed even if the
>> controversies are resolved in other ways, we end up with,
>> e.g.,
>>    toCaseFold ("Nuß") -> "nuss"
>> which is considered an acceptable transformation in some
>> places that identify themselves as speaking/using German and
>> two different unacceptable errors in others. Again, this will
>> almost always be much more serious if the transformation is
>> used to map and replace strings than if it is used to compare
>> (fwiw, that particular example is part of a continuing
>> disagreement between IDNA2008 and, among others, German
>> domain registry authorities on one side and UTC and UTR 46 on
>> the other).
>
> Agreed.

See "warning notice" comment above and question 1 response below.

>> (4) If the motivation is really to avoid confusion, the
>> correct confusion-blocking rule for Latin script (but not
>> others) and many languages that use it (but certainly not
>> all) involves moving beyond toCaseFold and making all
>> "decorated" characters (characters normally represented by
>> glyphs consisting of a Basic Latin character and one or more
>> diacritical or equivalent markings) compare equal to their
>> base characters, e.g., "á" not only matches "Á" but also
>> "a" and "A" and, as an unfortunate side-effect, maybe "À"
>> and "à" as well.
>> This is bad news for languages in which
>> decorated Latin characters are used to represent phonetically
>> and conceptually different characters, not just pronunciation
>> variations. I am not qualified to evaluate "how bad". In
>> addition, extrapolations from this principle about Latin
>> script to unrelated scripts will almost certainly lead to
>> serious errors and/or additional confusion.
>
> I would not be comfortable going that far...

In case it isn't clear, I would not be either. But it is where getting sloppy about this stuff could easily take us. It is worth noting that it also identifies one of the difficulties with doing a global system to be applied to many types of applications (like the PRECIS work) and then applying it in user interface software that end users will expect to be localized to their assumptions because it has been mapped or translated into their language (if one normally speaks Upper Slobbovian but has some familiarity with English, an application interface in English will probably be expected to be "foreign", odd, and maybe even inconsistent with whatever expectations exist. But, if the interface is in Upper Slobbovian, the natural and reasonable assumption will be that the matching should conform to normal Upper Slobbovian conventions).

FWIW, a matching rule that says:

   (i) Two instances of a base character with the same diacritical mark(s) match.

   (ii) Two instances of a base character with different diacritical mark(s) do not match.

   (iii) Two instances of a base character, one with diacritical mark(s) and the other without any decoration, match.

is precisely correct and normal behavior for at least one language that uses Latin script. It is also the normal practice for at least one Latin script transcription system that is used by a large fraction of a billion people (maybe more).

>> More on this and Tom's question below...
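
The three-part matching rule (i)-(iii) described above can be sketched for single characters as follows. This is a minimal illustration under stated assumptions, not something taken from any PRECIS document: it treats "diacritical marks" as whatever combining characters appear in the NFD decomposition, ignores case entirely, and the function names `split_marks` and `chars_match` are hypothetical.

```python
import unicodedata

def split_marks(ch: str) -> tuple[str, str]:
    """Split one character into (base, combining marks) using NFD."""
    decomposed = unicodedata.normalize("NFD", ch)
    base = "".join(c for c in decomposed if not unicodedata.combining(c))
    marks = "".join(c for c in decomposed if unicodedata.combining(c))
    return base, marks

def chars_match(a: str, b: str) -> bool:
    """Apply rules (i)-(iii) to a pair of single characters."""
    base_a, marks_a = split_marks(a)
    base_b, marks_b = split_marks(b)
    if base_a != base_b:
        return False      # different base characters never match
    if marks_a == marks_b:
        return True       # rule (i): same marks (or both undecorated)
    if not marks_a or not marks_b:
        return True       # rule (iii): decorated vs. undecorated
    return False          # rule (ii): different marks

print(chars_match("á", "a"), chars_match("a", "à"), chars_match("á", "à"))
```

One consequence worth noticing is that such a rule is not transitive: "á" matches "a" and "a" matches "à", yet "á" and "à" do not match, so it cannot be implemented by mapping each string to a single canonical form and comparing the results.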
>>
>>> On 9/29/15 3:28 PM, Tom Worster wrote:
>>>> Peter, Alexey,
>>>>
>>>> I think there is an ambiguity in the specification of case
>>>> mapping in RFC 7613 and draft-ietf-precis-nickname-19.
>>> ...
>>>> But there are 55 code points in Unicode 7.0.0 that change
>>>> under default case folding that are neither uppercase nor
>>>> titlecase characters, 12 of which are Lowercase_Letter. I
>>>> suspect this stems from a confusion between Unicode case
>>>> mapping and case folding.

In the context of the above, a different way to say the same thing is that people are looking at toCaseFold and assuming (and explaining things in terms of) toLowerCase. toCaseFold works the way it is expected to and those 55 code points are, more or less, collateral damage to get to a matching algorithm that favors false positives over false negatives and various edge cases (including in "edge cases" languages spoken by, and script variations used by, millions of people).

>...
> After all that, I have 3 questions:

Personal opinions about answers...

> (1) Is my proposed text enough of a clarification that we
> should make that change before the nickname I-D is published
> as an RFC?

I think the clarification is an improvement and is important enough to incorporate (I know that is the answer to a slightly different question). However, I think it is inadequate without a serious warning about the situation. That warning could appear in either this document or RFC 7613 (or 7613bis) with a pointer from the other, but, unless you want to revise 7613 now, this one is handy. Comment about possible text below.

> (2) Should we modify draft-ietf-precis-nickname so that case
> folding is applied only as part of comparison and not as part
> of enforcement? If so, should we make that change before this
> document is published as an RFC?

Yes.
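
One rough way to explore the kind of code points Tom describes in the quoted text above is to scan the code space for characters that change under case folding although they are neither uppercase nor titlecase. The sketch below makes several assumptions: it uses Python's `str.casefold()` as a stand-in for default case folding, it approximates "uppercase or titlecase" by General_Category Lu/Lt rather than by the Uppercase property, and the function name `folded_non_uppercase` is hypothetical. The resulting counts depend on the interpreter's Unicode version and on that approximation, so they should not be expected to equal the figures quoted for Unicode 7.0.0.

```python
import sys
import unicodedata

def folded_non_uppercase() -> list[tuple[int, str, str]]:
    """List (code point, category, name) for characters that casefold()
    changes even though their General_Category is neither Lu nor Lt."""
    hits = []
    for cp in range(sys.maxunicode + 1):
        ch = chr(cp)
        category = unicodedata.category(ch)
        if category in ("Lu", "Lt"):
            continue
        if ch.casefold() != ch:
            hits.append((cp, category, unicodedata.name(ch, "<unnamed>")))
    return hits

hits = folded_non_uppercase()
lowercase_letters = [h for h in hits if h[1] == "Ll"]
print(len(hits), "code points change;", len(lowercase_letters), "are Lowercase_Letter")
for cp, category, name in lowercase_letters:
    print(f"U+{cp:04X} {category} {name}")
```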
If something is used for "enforcement", it should be lower casing or something else that can be explained to people who are ordinarily familiar with one or more of the scripts that make case distinctions. However, viewed in the light of this discussion, the whole "enforcement" concept becomes a little dicey, especially if, as I believe but don't have time to verify, the transformations performed by toLowerCase are not a proper subset of those performed by toCaseFold.

> (3) Should we update RFC 7613 so that case folding is applied
> only as part of comparison and not as part of enforcement?

I think that is necessary. Following up on the comment above, I would prefer that the current Section 3.2.2 (3) of RFC 7613 either point to Unicode Lower Casing or contain a warning along the lines of that below.

----------

--On Tuesday, October 27, 2015 07:52 -0600 Peter Saint-Andre <peter@andyet.net> wrote:

> This issue has greater urgency now because
> draft-ietf-precis-nickname is now in AUTH48...
>
> On 10/26/15 3:51 PM, Peter Saint-Andre - &yet wrote:
>
>> After all that, I have 3 questions:
>>
>> (1) Is my proposed text enough of a clarification that we
>> should make that change before the nickname I-D is published
>> as an RFC?
>
> I think so.

See above.

>> (2) Should we modify draft-ietf-precis-nickname so that case
>> folding is applied only as part of comparison and not as part
>> of enforcement? If so, should we make that change before this
>> document is published as an RFC?
>
> Although it seems to be the case that Unicode case folding is
> primarily designed for the purpose of matching (i.e.,
> comparison),

"Seems" is a little weak. The Unicode Standard is really quite specific about that.

> I have a concern that applying the PRECIS case
> mapping rule after applying the normalization and
> directionality rules might have unintended consequences that
> we haven't had a chance to consider yet.
> The PRECIS framework
> expresses a preference (actually a hard requirement) for
> applying the rules in a particular order. We made a late
> change to the username profiles (RFC 7613), such that width
> mapping is applied first (in order to accommodate fullwidth
> and halfwidth characters in certain East Asian scripts).
> Making a late change to the nickname profile also concerns me,
> even though both of these late changes seem reasonable on the
> face of it. I will try to find time to think about this
> further in the next 24 hours.

First, a hint for the consideration process: there is a reason why Unicode now supports a unified case folding and normalization operation. My recollection is that it is not only more efficient to perform both operations at once (rather than looking in one table and then the other), but that there are some order-dependent or priority-dependent cases.

The very fact that this issue exists (and is coming up again) this late in the process (7613 published in August, WG winding down and not, e.g., meeting next week) calls at least the PRECIS quality of review and some fairly fundamental model issues into question. I first raised that issue a rather long time ago but have continued to hope that we have an approximation to "good enough" without going back and rethinking everything.

The right solution, IMO, is that, if RFC 7613 is to rationalize or explain the operation in terms of converting upper case characters to lower case, then it should be using toLowerCase because that is what the operation does. After a quick look at 7613, amending/updating it to simply convert to lower case would be straightforward (and would not raise the ordering issue called out above). It would presumably require another IETF Last Call, however, and I'd hope we would see some serious discussion within the WG (and with UTC) before making the change and about how it is explained.
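
To make the ordering discussion above a little more concrete, here is a rough sketch of a "width mapping, then case mapping, then normalization" sequence in which the case-mapping step can be swapped between lower-casing and case folding. It is an illustration only, not an implementation of any PRECIS profile: the names `width_map` and `enforce_sketch` are hypothetical, Python's `str.lower()` and `str.casefold()` stand in for toLowerCase and toCaseFold, NFC is assumed as the normalization form, and the additional-mapping, directionality, and string-class checks are omitted.

```python
import unicodedata
from typing import Callable

def width_map(s: str) -> str:
    """Replace fullwidth/halfwidth characters by their decomposition mappings."""
    out = []
    for ch in s:
        decomp = unicodedata.decomposition(ch)  # e.g. "<wide> 0041" for U+FF21
        if decomp.startswith(("<wide>", "<narrow>")):
            # Drop the tag and rebuild the character(s) from the hex code points.
            out.append("".join(chr(int(cp, 16)) for cp in decomp.split()[1:]))
        else:
            out.append(ch)
    return "".join(out)

def enforce_sketch(s: str, case_map: Callable[[str], str]) -> str:
    """Width mapping first, then the chosen case mapping, then NFC."""
    return unicodedata.normalize("NFC", case_map(width_map(s)))

for text in ["\uff33\uff54\uff30\uff45\uff54\uff45\uff52", "Nuß"]:
    print(repr(enforce_sketch(text, str.lower)), repr(enforce_sketch(text, str.casefold)))
```

With the fullwidth form of "StPeter" both variants produce the same ASCII result; with "Nuß" the stored form differs depending on which case operation is used for enforcement, which is the situation the proposed change is meant to avoid.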
If we are not willing to make a change that significant and/or if we conclude that the WG (and perhaps the IETF) have completely run out of energy for dealing with i18n issues [1], then I suggest that we introduce some additional text. I've just spent a half-hour trying to find the AUTH48 copy of precis-nickname (aka RFC-to-be-7700), but the RFC Editor has apparently changed naming conventions and the various queue entry pages all point to the -19 I-D and not the current working copy so I can't try to match text and insertion point to what is there already. The suggestion is a patch (and a hack), not a good fix but something like it is probably the least drastic measure that would yield something that doesn't contain unexplained known defects.

Rough version of suggested text (possibly to go after your revised paragraph and following up my comments in my 1 October note). Some of the terminology needs checking which I can do if you want to go this route:

'Users of this specification should note that the concept of "lower case conversion" is somewhat elusive and more dependent on the conventions of different languages and notation systems that use the same script than may appear obvious at first glance, especially if that glance is at Basic Latin characters (i.e., the ASCII letter repertoire). Unicode provides two different mapping procedures that produce lower-case characters, but they have different effects and results for many characters. The more conservative one, typically appropriately applicable when lower case forms are needed, is actual lower-casing (embodied in the Unicode operation toLowerCase). A more radical operation, normally suitable only for string matching in situations in which it is better to consider uncertain cases as matching than to treat them as distinct, is called "Case Folding" (Unicode operation toCaseFold).
While the two operations will often produce the same results, Case Folding maps some lower case characters into others and performs other transformations that may be intuitively reasonable and expected for some users and quite astonishing (or just wrong) to others. There may be no practical alternative, especially if the operations are to be used for mapping or enforcement, to developers of PRECIS-dependent applications understanding that the cases in which the two yield different results require careful understanding of the relevant user base and its needs [2].'

>> (3) Should we update RFC 7613 so that case folding is applied
>> only as part of comparison and not as part of enforcement?
>
> That is less urgent so I suggest that we address the nickname
> spec first.

Unless you (or someone else here) have a plausible plan to continue and revitalize the WG and assign it that revision work (and bring everyone actively participating up to the level needed to easily understand this discussion thread and feel embarrassed for not spotting the problems), I think we need to assume that this is our last shot. Absent an active and committed WG, "do this first" could easily be equivalent to "don't get around to the other, ever".

I think that the particular set of issues that started this thread is a known defect in the PRECIS specs, both nickname and 7613, and that we are obligated to either fix the problems or at least explain them. The above warning text is an attempt to explain and identify the problems even if it does not actually provide a solution. If it were published as part of precis-nickname, it could include a statement to the effect that it should also be treated as an update to 7613 or, if the IESG and RFC Editor would agree in advance to accept, rather than bury, the thing, I suppose we could publish it in precis-nickname and create an erratum to 7613 indicating that it should have included some form of that statement.
Neither option implies a huge amount of work to update 7613. But I think that making the changes of (2) without doing anything about (3) makes the two documents inconsistent with each other and that would be an additional known defect.

Procedural question: given that precis-nickname is in AUTH48 as of yesterday and I don't see anything blocking publication next week if you and Barry sign off on the revised text that the WG hasn't seen, does someone need to file a pro forma objection/appeal to block that until this is sorted out and the WG has a chance to review proposed publication text?

best,
   john

[1] I believe our collective inability to deal with the within-script character forms that do not normalize to each other because of language-dependent or other usage factors can be taken as evidence of having run out of energy, but it is probably in the interest of finishing the PRECIS work to try to treat that as a separate issue.

[2] Not unlike the reason to differentiate between NFC and NFKC and understand the effects of each.