Re: [precis] Ambiguity in specification of case mapping in RFC 7613 and draft-ietf-precis-nickname
Peter Saint-Andre <peter@andyet.net> Wed, 28 October 2015 21:53 UTC
To: John C Klensin <john-ietf@jck.com>, Tom Worster <fsb@thefsb.org>, Alexey Melnikov <Alexey.Melnikov@isode.com>
References: <0347834EBDC481BD99BDBE67@JcK-HP8200.jck.com> <56302E6D.5030901@andyet.net> <56312AAC.1000300@andyet.net> <56313616.8000801@andyet.net>
From: Peter Saint-Andre <peter@andyet.net>
Message-ID: <563143B9.7020707@andyet.net>
Date: Wed, 28 Oct 2015 15:52:57 -0600
Archived-At: <http://mailarchive.ietf.org/arch/msg/precis/gk-1oNw5nY7k-gyWbpVmXZ9R1Rc>
Cc: precis@ietf.org
Subject: Re: [precis] Ambiguity in specification of case mapping in RFC 7613 and draft-ietf-precis-nickname
Example 7 needs to be corrected, too, in accordance with CaseFolding.txt. On 10/28/15 2:54 PM, Peter Saint-Andre wrote: > And here is another correction in Section 3... > > OLD > > Regarding examples 5, 6, and 7: applying Unicode Default Case Folding > to GREEK CAPITAL LETTER SIGMA (U+03A3) results in GREEK SMALL LETTER > SIGMA (U+03C3), and doing so during comparison would result in > matching the nicknames in examples 5 and 6; however, because the > PRECIS mapping rules do not account for the special status of GREEK > SMALL LETTER FINAL SIGMA (U+03C2), the nicknames in examples 5 and 7 > or examples 6 and 7 would not be matched. > > NEW > > Regarding examples 5, 6, and 7: applying Unicode Default Case Folding > to GREEK CAPITAL LETTER SIGMA (U+03A3) results in GREEK SMALL LETTER > SIGMA (U+03C3), and the same is true of GREEK SMALL LETTER FINAL > SIGMA (U+03C2); therefore, the comparison operation defined in > Section 2.4 would result in matching of the nicknames in examples 5, > 6, and 7. > > On 10/28/15 2:06 PM, Peter Saint-Andre wrote: >> I propose the following text changes: >> >> ### >> >> OLD >> >> 3. Case Mapping Rule: Uppercase and titlecase characters MUST be >> mapped to their lowercase equivalents using Unicode Default Case >> Folding as defined in the Unicode Standard [Unicode] (at the time >> of this writing, the algorithm is specified in Chapter 3 of >> [Unicode7.0]). In applications that prohibit conflicting >> nicknames, this rule helps to reduce the possibility of confusion >> by ensuring that nicknames differing only by case (e.g., >> "stpeter" vs. "StPeter") would not be presented to a human user >> at the same time. >> >> NEW >> >> 3. Case Mapping Rule: Unicode Default Case Folding MUST be applied, >> as defined in the Unicode Standard [Unicode] (at the time >> of this writing, the algorithm is specified in Chapter 3 of >> [Unicode7.0]). The primary result of doing so is that uppercase >> characters are mapped to lowercase characters. 
In applications >> that prohibit conflicting nicknames, this rule helps to reduce >> the possibility of confusion by ensuring that nicknames >> differing only by case (e.g., "stpeter" vs. "StPeter") would not >> be presented to a human user at the same time. >> >> ### >> >> (The foregoing was previously sent to the list.) >> >> ### >> >> OLD >> >> 2.3. Enforcement >> >> An entity that performs enforcement according to this profile MUST >> prepare a string as described in Section 2.2 and MUST also apply the >> rules specified in Section 2.1. The rules MUST be applied in the >> order shown. >> >> After all of the foregoing rules have been enforced, the entity MUST >> ensure that the nickname is not zero bytes in length (this is done >> after enforcing the rules to prevent applications from mistakenly >> omitting a nickname entirely, because when internationalized >> characters are accepted, a non-empty sequence of characters can >> result in a zero-length nickname after canonicalization). >> >> 2.4. Comparison >> >> An entity that performs comparison of two strings according to this >> profile MUST prepare each string and enforce the rules as specified >> in Sections 2.2 and 2.3. The two strings are to be considered >> equivalent if they are an exact octet-for-octet match (sometimes >> called "bit-string identity"). >> >> NEW >> >> 2.3. Enforcement >> >> An entity that performs enforcement according to this profile MUST >> prepare a string as described in Section 2.2 and MUST also apply the >> following rules specified in Section 2.1 in the order shown: >> >> 1. Additional Mapping Rule >> 2. Normalization Rule >> 3. 
Directionality Rule >> >> After all of the foregoing rules have been enforced, the entity MUST >> ensure that the nickname is not zero bytes in length (this is done >> after enforcing the rules to prevent applications from mistakenly >> omitting a nickname entirely, because when internationalized >> characters are accepted, a non-empty sequence of characters can >> result in a zero-length nickname after canonicalization). >> >> 2.4. Comparison >> >> An entity that performs comparison of two strings according to this >> profile MUST prepare each string as specified in Section 2.2 and >> MUST apply the following rules specified in Section 2.1 in the order >> shown: >> >> 1. Additional Mapping Rule >> 2. Case Mapping Rule >> 3. Normalization Rule >> 4. Directionality Rule >> >> The two strings are to be considered equivalent if they are an exact >> octet-for-octet match (sometimes called "bit-string identity"). >> >> ### >> >> In addition, some variation on John's proposed text about toLowerCase >> vs. toCaseFold might be appropriate at the end of Section 4; however, >> I'm still not sure that is necessary if we move the case mapping rule to >> the comparison operation. >> >> Peter >> >> On 10/27/15 8:09 PM, Peter Saint-Andre wrote: >>> On 10/27/15 11:32 AM, John C Klensin wrote: >>>> Response to Monday's note immediately below; response to today's >>>> follows it. My apologies, but it is probably important to read >>>> both. My further apologies for the length of this note, but I >>>> think we are in deep trouble here, >>> >>> Internationalization always seems to be a matter of how deep the trouble >>> is... >>> >>>> trouble that is aggravated by >>>> precis-mappings and precis-nickname both being post-approval and >>>> that, as far as I know, there are no future plans for PRECIS >>>> work (having precis-nickname in AUTH48 just emphasizes that -- >>>> see comment at end). >>> >>> We had not planned to work on PRECIS because we thought we were done for >>> awhile. 
If that's not the case and we need to fix things, then so be it. >>> Whether there is sufficient and continued energy for such work is >>> another question. Personally I don't want us to have broken RFCs out >>> there. >>> >>>> --On Monday, October 26, 2015 15:51 -0600 Peter Saint-Andre - >>>> &yet <peter@andyet.net> wrote: >>>> >>>>> My apologies for the delayed reply. Comments inline. >>>> >>>> A few remarks below... I can't tell whether we disagree or >>>> whether at least one of us, probably me, is not being >>>> adequately clear. (Material on which we fairly clearly agree >>>> elided.) >>>> >>>> >>>>> On 10/1/15 7:50 AM, John C Klensin wrote: >>>>>> --On Wednesday, September 30, 2015 15:16 -0600 Peter >>>>>> Saint-Andre - &yet <peter@andyet.net> wrote: >>>>> ... >>>>>> Peter, >>>>>> >>>>>> While your proposed text is an improvement, >>>>> >>>>> Happy to hear it. All I intended was a slight clarification. >>>> >>>> But I'm not certain we are there yet... >>> >>> Agreed. The text I proposed addressed only a very small part of the >>> problem. >>> >>>>>> the desire of many >>>>>> people for a magic "just tell me what to do" formula, one that >>>>>> lets them avoid understanding the issues, may call for a >>>>>> little more: >>>>> >>>>> There is always a need for more when it comes to i18n. >>>> >>>> But I think it is a little more than that. I've heard several >>>> times, including in PRECIS meetings, requests for "just tell me >>>> what to do and make sure it isn't complicated" (or "I don't want >>>> to have to think about, much less understand, the issues"). We >>>> can debate whether giving in to those requests in the I18n case >>>> is wise. I think it leads directly to conclusions equivalent to >>>> "I understand my own script and writing system (or think I do) >>>> and therefore, since all writing systems must be pretty much the >>>> same, I understand all of the core issues in terms of my script >>>> and understanding". 
That, in turn, leads directly to the "how >>>> do you spell 'Zürich'?" and "all spellings of 'Zuerich' should >>>> be treated as equivalent" discussion that sounded like they >>>> dominated a BOF at IETF 93. >>>> >>>> Now I actually think it is reasonable for someone to ask for a >>>> library that will do the job most of the time and that will >>>> almost never cause their users or customers to get angry at >>>> them. But, if we are going to call what we do "standards", they >>>> should contain sufficient information that would-be library >>>> authors can know what to do ... or understand that they are in >>>> over their heads. And, for these particular cases, we may need >>>> to explain, or help the library authors explain, why some cases >>>> will fail and, indeed, get users mad at vendors. >>>> >>>> >>>>>> (1) First, toCaseFold is _not_ toLowerCase. Saying "The >>>>>> primary result of doing so is that uppercase characters are >>>>>> mapped to lowercase characters" is true for toCaseFold, >>>>> >>>>> By "primary" I meant two things: (1) lowercasing is what >>>>> happens to the preponderance of code points and (2) this is >>>>> the result that most people care about. >>>> >>>> If I parse the above correctly, I think you are wrong. I think >>>> what most people want, care about, and think they are getting, >>>> is lower case conversion, i.e., an operation that preserves >>>> lower case characters and converts upper case characters to the >>>> equivalent lower case. toCaseFold isn't that operation. It is >>>> a much more complex and subtle operation that, as well as >>>> converting upper case characters to lower case, sometimes >>>> converts lower case characters to different lower case >>>> characters (or strings of them). It also requires a fairly good >>>> understanding of Unicode (not just a relevant script) and >>>> historical Unicode decisions to predict its behavior and to have >>>> any hope of explaining that behavior to users. 
If one is >>>> trying to compare (as distinct from converting), then toCaseFold >>>> may be exactly what is wanted, but it is really hard to explain >>>> or justify that in terms of "nicknames" or "aliases", which are >>>> about conversion. And, if one hopes to explain what is going >>>> on to users in terms of "lower casing", then toCaseFold is just >>>> the wrong operation. That is what toLowerCase is for and the >>>> two operations are just not equivalent. >>> >>> My recollection, quite possibly inaccurate or incomplete, from at least >>> one and I think several in-person meetings of the PRECIS WG was: just >>> use Unicode Default Case Folding because if you use anything else or try >>> to roll your own you will be fubar forever. I do not recall any >>> discussion of the issues you have raised in this thread (e.g., about the >>> inadvisability of using case folding for anything but comparison >>> operations) until the last few weeks. However, I freely admit that's >>> probably because, through my own faults and ignorance, I didn't >>> understand what you were saying. >>> >>>> FWIW and purely by coincidence wrt PRECIS and this document, I >>>> had a conversation a few days ago with an expert on Arabic (and >>>> Persian) calligraphy and writing systems (and good general >>>> knowledge of writing systems) who is quite insistent that any >>>> procedure we use for case-insensitive matching (e.g., case >>>> folding) is discriminatory, inconsistent, and just >>>> badly-thought-out if that same procedure doesn't treat isolated, >>>> initial, and medial forms of the same character as equivalent. >>>> He further strengthens his case (sic) by noting that Unicode >>>> case folding maps U+03C2 (GREEK SMALL LETTER FINAL SIGMA, >>>> unambiguously a lower-case character) to U+03C3 (GREEK SMALL >>>> LETTER SIGMA), a relationship that depends entirely on >>>> positional use and not case. 
He also believes the same >>>> relationships should apply to all other scripts that make form >>>> distinctions for some characters based on positions in a string >>>> and for which Unicode has chosen to assign different code >>>> points. Even if there were wide acceptance of his view, Unicode >>>> stability principles would prevent changing toCaseFold (or >>>> CaseFolding.txt), but this is more evidence that what toCaseFold >>>> does and does not do is going to be hard to explain to either >>>> casual users or to writing system experts whose primary >>>> experience is not with the Greek-Latin-Cyrillic group. >>>> >>>> I don't think we want to say "these matching rules are somewhat >>>> arbitrary and irrational, but, if you don't like it, blame >>>> Unicode and not us", if only because it is our choice to use >>>> those matching rules. More below. >>>> >>>> >>>>> ... >>>>>> (2) Second, probably as a result of having IDNA in the lead, >>>>>> we've gotten sloppy about language and operations and should >>>>>> probably start untangling that before it gets people in >>>>>> trouble. >>>>> >>>>> Where is the right place to do that untangling? (I doubt that >>>>> it is the precis-nickname document.) >>>> >>>> I agree that precis-nickname isn't the ideal place. I also >>>> believe that you and it are the innocent victims of the >>>> situation. At the same time, I don't believe IETF should be >>>> producing incomplete, ambiguous, erroneous, or misleading >>>> standards because no one could get around to doing the right >>>> foundational work. >>> >>> Agreed. I too want to get this right, even though it's not a lot of fun >>> and it's certainly more work than I thought I was signing up for at the >>> NEWPREP BoF years ago. >>> >>>>>> The Unicode Standard, at least as I understand it, is fairly >>>>>> clear that the most important (and really only safe) use of >>>>>> toCaseFold is as part of a comparison operation. >>>>> >>>>> Thanks for noting that. 
For example, Section 5.18 of Unicode >>>>> 8.0.0 says: >>>>> >>>>> Caseless matching is implemented using case folding, which >>>>> is the >>>>> process of mapping characters of different case to a >>>>> single form, so >>>>> that case differences in strings are erased. Case folding >>>>> allows for >>>>> fast caseless matches in lookups because only binary >>>>> comparison is >>>>> required. It is more than just conversion to lowercase. >>>> >>>> Right. But, again, when its use is appropriate (a very >>>> controversial topic in itself with our painful IDNA history with >>>> Final Sigma, Eszett and the case-independent versus >>>> position-independent controversy called out above as examples) >>>> that is "matches in lookups" (what I've described elsewhere as >>>> "comparison only"). Not creating or defining nicknames or >>>> aliases. And that _is_ a problem for this document. >>> >>> I'm not convinced that things are as bad as you think. If we say in >>> draft-ietf-precis-nickname that the case mapping rule is to be applied >>> only as part of comparison and not as part of enforcement - which I >>> think is really what we care about (e.g., to prevent spoofing of users >>> in chat rooms) - then I think we might be most of the way there. >>> >>>>>> Using your >>>>>> example it is entirely reasonable to treat, "stpeter" and >>>>>> "StPeter" as equivalent in a comparison operation, but >>>>>> accepting one string and changing it to the other for display >>>>>> may not be a really good idea. While that transformation may >>>>>> be acceptable (although I would be surprised if there were no >>>>>> people who share your surname who could consider "stpeter" or >>>>>> "Stpeter" unacceptable and might even believe that "StPeter" >>>>>> is an unacceptable substitute for "St. Peter"), >>>>> >>>>> I do receive email at stpeter@gmail.com intended for >>>>> st.peter@gmail.com but that's a separate topic... 
>>>> >>>> One that is relevant because it "works" as a side-effect of a >>>> decision Google has made about mailbox name equivalence, a >>>> decision that, IMO, will sooner or later get someone into a lot >>>> of trouble and, more important, a decision and matching rule >>>> that PRECIS, AFAICT, does not allow and that IDNA unambiguously >>>> forbids. >>>> >>>>>> it also points out the >>>>>> dangers of using Basic Latin script examples to illustrate >>>>>> situations in which even more extended Latin script, much less >>>>>> other scripts, may raise more complex issues. Because IDNA >>>>>> is essentially a workaround because changing the DNS >>>>>> comparison rules was impractical for several reasons, we >>>>>> ended up using toCaseFold to map characters and strings into >>>>>> others in IDNA2003 but PRECIS implementations that do not >>>>>> have the same constraints would, in general, be better off >>>>>> confining the use of toCaseFold, or even toLowerCase, to >>>>>> comparison operations. >>>>> >>>>> Unfortunately we didn't do that in RFC 7564 or RFC 7613. Does >>>>> it make sense for this nickname specification to differ in >>>>> this respect from the published RFCs? Shall we file errata >>>>> against those documents? (This might apply only to RFC 7613, >>>>> which says to apply case folding as part of the enforcement >>>>> process - when exactly to apply case folding is not stipulated >>>>> by RFC 7564.) >>>> >>>> To the extent to which this is a "botched that because the WG >>>> didn't understand the issues well enough" conclusion, it would >>>> be entirely reasonable to generate an updating RFC that repairs >>>> 7613 and/or 7564, even doing so in an addendum to >>>> precis-nickname if that is the only way to do that >>>> expeditiously. Per the above, we really don't want to give >>>> library routine writers bad instructions. 
As I understand it, >>>> the current position of the RFC Editor and IESG is that >>>> technical specification errors discovered in retrospect or after >>>> people start using a spec are not appropriate topics for errata. >>>> If the WG is not willing to do any of those things, then I >>>> suggest that precis-nickname at least needs to contain a very >>>> clear warning notice about this situation (see my response to >>>> your question 1 below). >>> >>> I think we'll probably need to fix 7613 and 7564. I am hoping we can fix >>> nickname now so that it is less incorrect than the other two. That >>> doesn't necessarily mean we won't need to also further fix nickname >>> later on. >>> >>> Granted, we were supposed to avoid this problem by working on all of the >>> PRECIS specs simultaneously. Clearly we have not avoided the problem, so >>> we need to solve it one way or another. If that means bis for them all, >>> we need to deal with it. >>> >>>>>> (3) Because toCaseFold loses information when used for more >>>>>> than comparison (for comparison, it merely contributes to >>>>>> what some people would consider false positives for matching), >>>>>> involves some controversial decisions, and, because of >>>>>> stability requirements, cannot be changed even if the >>>>>> controversies are resolved in other ways, we end up with, >>>>>> e.g., >>>>>> toCaseFold ("Nuß") -> "nuss" >>>>>> which is considered an acceptable transformation in some >>>>>> places that identify themselves as speaking/using German and >>>>>> two different unacceptable errors in others. Again, this will >>>>>> almost always be much more serious if the transformation is >>>>>> used to map and replace strings than if it is used to compare >>>>>> (fwiw, that particular example is part of a continuing >>>>>> disagreement between IDNA2008 and, among others, German >>>>>> domain registry authorities on one side and UTC and UTR 46 on >>>>>> the other). >>>>> >>>>> Agreed. 
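The difference between lowercasing and case folding discussed above can be seen with a short Python sketch. It assumes that Python's built-in `str.lower()` and `str.casefold()` are acceptable stand-ins for Unicode toLowerCase and toCaseFold; the sample list is chosen for illustration and is not taken from any of the drafts.

```python
# Minimal sketch: compare lowercasing with case folding on strings that
# come up in this thread. Treating str.lower() as toLowerCase and
# str.casefold() as toCaseFold is an assumption of this example.
samples = [
    "StPeter",   # Basic Latin: both operations agree
    "Nu\u00df",  # "Nuß": LATIN SMALL LETTER SHARP S
    "\u03c2",    # GREEK SMALL LETTER FINAL SIGMA
]

def describe(s):
    """Return (lowercased, case-folded) forms of s."""
    return s.lower(), s.casefold()

results = {s: describe(s) for s in samples}

for s, (lowered, folded) in results.items():
    note = "same" if lowered == folded else "different"
    print(f"{s!r}: lower={lowered!r} casefold={folded!r} ({note})")
```

For the Basic Latin sample the two operations agree, which is why examples such as "stpeter" vs. "StPeter" hide the problem. The other two samples are already lowercase, yet case folding still changes them.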
>>>> >>>> See "warning notice" comment above and question 1 response below. >>>> >>>>> (4) If the motivation is really to avoid confusion, the >>>>>> correct confusion-blocking rule for Latin script (but not >>>>>> others) and many languages that use it (but certainly not >>>>>> all) involves moving beyond toCaseFold and treating all >>>>>> "decorated" characters (characters normally represented by >>>>>> glyphs consisting of a Basic Latin character and one or more >>>>>> diacritical or equivalent markings) compare equal to their >>>>>> base characters, e.g., "á" not only matches "Á" but also >>>>>> "a" and "A" and, as an unfortunate side-effect, maybe "À" >>>>>> and "à" as well. This is bad news for languages in which >>>>>> decorated Latin characters are used to represent phonetically >>>>>> and conceptually different characters, not just pronunciation >>>>>> variations. I am not qualified to evaluate "how bad". In >>>>>> addition, extrapolations from this principle about Latin >>>>>> script to unrelated scripts will almost certainly lead to >>>>>> serious errors and/or additional confusion. >>>>> >>>>> I would not be comfortable going that far... >>>> >>>> In case it isn't clear, I would not be either. But it is where >>>> getting sloppy about this stuff could easily take us. It is >>>> worth noting that it also identifies one of the difficulties >>>> with doing a global system to be applied to many types of >>>> applications (like the PRECIS work) and then applying it in user >>>> interface software that end users will expect to be localized to >>>> their assumptions because it has been mapped or translated into >>>> their language (if one normally speaks Upper Slobbovian but has >>>> some familiarity with English, an application interface in >>>> English will probably be expected to be "foreign", odd, and >>>> maybe even inconsistent with whatever expectations exist. 
But, >>>> if the interface is in Upper Slobbovian, the natural and >>>> reasonable assumption will be that the matching should conform to >>>> normal Upper Slobbovian conventions). FWIW, a matching rule >>>> that says: >>>> >>>> (i) Two instances of a base character with the same >>>> diacritical mark(s) match. >>>> (ii) Two instances of a base character with different >>>> diacritical mark(s) do not match. >>>> (iii) Two instances of a base character, one with >>>> diacritical mark(s) and the other without any decoration >>>> match. >>>> >>>> is precisely correct and normal behavior for at least one >>>> language that uses Latin script. It is also the normal practice >>>> for at least one Latin script transcription system that is used >>>> by a large fraction of a billion people (maybe more). >>> >>> That is indeed sobering. >>> >>>>>> More on this and Tom's question below... >>>>>> >>>>>>> On 9/29/15 3:28 PM, Tom Worster wrote: >>>>>>>> Peter, Alexey, >>>>>>>> >>>>>>>> I think there is an ambiguity in the specification of case >>>>>>>> mapping in RFC 7613 and draft-ietf-precis-nickname-19. >>>>>>> ... >>>>>>>> But there are 55 code points in Unicode 7.0.0 that change >>>>>>>> under default case folding that are neither uppercase nor >>>>>>>> titlecase characters, 12 of which are Lowercase_Letter. I >>>>>>>> suspect this stems from a confusion between Unicode case >>>>>>>> mapping and case folding. >>>> >>>> In the context of the above, a different way to say the same >>>> thing is that people are looking at toCaseFold and assuming (and >>>> explaining things in terms of) toLowerCase. toCaseFold works >>>> the way it is expected to and those 55 code points are, more or >>>> less, collateral damage to get to a matching algorithm that >>>> favors false positives over false negatives and various edge >>>> cases (including in "edge cases" languages spoken by, and script >>>> variations used by, millions of people). 
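A rough way to explore the kind of count Tom describes is to enumerate code points that change under case folding but are neither uppercase nor titlecase. The sketch below makes several assumptions: it uses Python's `str.casefold()` (full case folding) and general categories `Lu`/`Lt` as the definition of "uppercase or titlecase", and it runs against whatever Unicode version the local Python ships. The totals will therefore not necessarily match the figures quoted for Unicode 7.0.0, so no count is claimed here.

```python
import sys
import unicodedata

def folded_but_not_upper():
    """Code points changed by case folding whose category is not Lu or Lt."""
    hits = []
    for cp in range(sys.maxunicode + 1):
        ch = chr(cp)
        if ch.casefold() != ch and unicodedata.category(ch) not in ("Lu", "Lt"):
            hits.append(ch)
    return hits

hits = folded_but_not_upper()
# Subset that is already Lowercase_Letter (general category Ll).
lowercase_hits = [c for c in hits if unicodedata.category(c) == "Ll"]

print("Unicode data version:", unicodedata.unidata_version)
print("changed, not Lu/Lt:", len(hits))
print("of which Ll:", len(lowercase_hits))
```

Both U+00DF (sharp s) and U+03C2 (final sigma) appear in the lowercase subset, which is the behavior the thread identifies as surprising to anyone expecting plain lowercasing.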
>>> >>> Sadly I suspect that is an accurate description of the current state of >>> affairs (modulo my comment above about PRECIS WG discussions at one or >>> more IETF meetings). >>> >>>>> ... >>>>> After all that, I have 3 questions: >>>> >>>> Personal opinions about answers... >>>> >>>>> (1) Is my proposed text enough of a clarification that we >>>>> should make that change before the nickname I-D is published >>>>> as an RFC? >>>> >>>> I think the clarification is an improvement and is important >>>> enough to incorporate (I know that is the answer to a slightly >>>> different question). >>>> >>>> However, I think it is inadequate without a serious warning >>>> about the situation. >>> >>> Yes. >>> >>>> That warning could appear in either this >>>> document or RFC 7613 (or 7613bis) with a pointer from the other, >>>> but, unless you want to revise 7613 now, this one is handy. >>> >>> I suspect that we need to revise 7613. I suspect that we might also need >>> to revise 7564 (at least with respect to the order in which operations >>> are applied, since there has been some confusion among implementers). >>> >>> Well, we always knew that we would need to revise them. Just not so >>> soon. >>> >>>> Comment about possible text below. >>>> >>>>> (2) Should we modify draft-ietf-precis-nickname so that case >>>>> folding is applied only as part of comparison and not as part >>>>> of enforcement? If so, should we make that change before this >>>>> document is published as an RFC? >>>> >>>> Yes. If something is used for "enforcement", it should be lower >>>> casing or something else that can be explained to people who are >>>> ordinarily familiar with one or more of the scripts that make >>>> case distinctions. 
>>>> >>>> However, viewed in the light of this discussion, the whole >>>> "enforcement" concept becomes a little dicey, especially if, as >>>> I believe but don't have time to verify, the transformations >>>> performed by toLowerCase are not a proper subset of those >>>> performed by toCaseFold. >>> >>> My initial thought is that case mapping doesn't belong in the nickname >>> enforcement operation at all - only in the comparison operation. >>> >>>>> (3) Should we update RFC 7613 so that case folding is applied >>>>> only as part of comparison and not as part of enforcement? >>>> >>>> I think that is necessary. Following up on the comment above, I >>>> would prefer that the current Section 3.2.2 (3) of RFC 7613 >>>> either point to Unicode Lower Casing or contain a warning along >>>> the lines of that below. >>> >>> Unlike the nickname profile (which I think can be cleaned up by moving >>> the case mapping rule to the comparison operation and continuing to use >>> Unicode Default Case Folding), I think you are right that for the >>> UsernameCaseMapped profile we probably want Unicode Lower Casing. Thus >>> the likely need, sooner rather than later, for 7613bis. >>> >>>> >>>> ---------- >>>> >>>> --On Tuesday, October 27, 2015 07:52 -0600 Peter Saint-Andre >>>> <peter@andyet.net> wrote: >>>> >>>>> This issue has greater urgency now because >>>>> draft-ietf-precis-nickname is now in AUTH48... >>>>> >>>>> On 10/26/15 3:51 PM, Peter Saint-Andre - &yet wrote: >>>>> >>>>>> After all that, I have 3 questions: >>>>>> >>>>>> (1) Is my proposed text enough of a clarification that we >>>>>> should make that change before the nickname I-D is published >>>>>> as an RFC? >>>>> >>>>> I think so. >>>> >>>> See above. >>>> >>>>>> (2) Should we modify draft-ietf-precis-nickname so that case >>>>>> folding is applied only as part of comparison and not as part >>>>>> of enforcement? If so, should we make that change before this >>>>>> document is published as an RFC? 
>>>>> >>>>> Although it seems to be the case that Unicode case folding is >>>>> primarily designed for the purpose of matching (i.e., >>>>> comparison), >>>> >>>> "Seems" is a little weak. The Unicode Standard is really quite >>>> specific about that. >>>> >>>>> I have a concern that applying the PRECIS case >>>>> mapping rule after applying the normalization and >>>>> directionality rules might have unintended consequences that >>>>> we haven't had a chance to consider yet. The PRECIS framework >>>>> expresses a preference (actually a hard requirement) for >>>>> applying the rules in a particular order. We made a late >>>>> change to the username profiles (RFC 7613), such that width >>>>> mapping is applied first (in order to accommodate fullwidth >>>>> and halfwidth characters in certain East Asian scripts). >>>>> Making a late change to the nickname profile also concerns me, >>>>> even though both of these late changes seem reasonable on the >>>>> face of it. I will try to find time to think about this >>>>> further in the next 24 hours. >>>> >>>> First, a hint for the consideration process: there is a reason >>>> why Unicode now supports a unified case folding and >>>> normalization operation. My recollection is that it is not only >>>> more efficient to perform both operations at once (rather than >>>> looking in one table and then the other), but that there are >>>> some order-dependent or priority-dependent cases. >>>> >>>> The very fact that this issue exists (and is coming up again) >>>> this late in the process (7613 published in August, WG winding >>>> down and not, e.g., meeting next week) calls at least the PRECIS >>>> quality of review and some fairly fundamental model issues into >>>> question. I first raised that issue a rather long time ago but >>>> have continued to hope that we have an approximation to "good >>>> enough" without going back and rethinking everything. 
>>>>
>>>> The right solution, IMO, is that, if RFC 7613 is to rationalize
>>>> or explain the operation in terms of converting upper case
>>>> characters to lower case, then it should be using toLowerCase
>>>> because that is what the operation does. After a quick look at
>>>> 7613, amending/updating it to simply convert to lower case would
>>>> be straightforward (and would not raise the ordering issue
>>>> called out above). It would presumably require another IETF
>>>> Last Call, however and I'd hope we would see some serious
>>>> discussion within the WG (and with UTC) before making the change
>>>> and about how it is explained.
>>>>
>>>> If we are not willing to make a change
>>>
>>> I'm willing. It would, as you note, require some careful thinking and
>>> review to make sure that we got it (more) right this time.
>>>
>>>> that significant and/or
>>>> if we conclude that the WG (and perhaps the IETF) have
>>>> completely run out of energy for dealing with i18n issues [1],
>>>> then I suggest that we introduce some additional text. I've
>>>> just spent a half-hour trying to find the AUTH48 copy of
>>>> precis-nickname (aka RFC-to-be-7700), but the RFC Editor has
>>>> apparently changed naming conventions and the various queue
>>>> entry pages all point to the -19 I-D and not the current working
>>>> copy so I can't try to match text and insertion point to what is
>>>> there already.
>>>
>>> http://www.rfc-editor.org/authors/rfc7700.txt
>>>
>>>> The suggestion is a patch (and a hack), not a
>>>> good fix but something like it is probably the least drastic
>>>> measure that would yield something that doesn't contain
>>>> unexplained known defects.
>>>>
>>>> Rough version of suggested text (possibly to go after your
>>>> revised paragraph and following up my comments in my 1 October
>>>> note). Some of the terminology needs checking which I can do if
>>>> you want to go this route:
>>>>
>>>> 'Users of this specification should note that the
>>>> concept of "lower case conversion" is somewhat elusive
>>>> and more dependent on the conventions of different
>>>> languages and notation systems that use the same script
>>>> than may appear obvious at first glance, especially if
>>>> that glance is at Basic Latin characters (i.e., the
>>>> ASCII letter repertoire). Unicode provides two
>>>> different mapping procedures that produce lower-case
>>>> characters, but they have different effects and results
>>>> for many characters. The more conservative one,
>>>> typically appropriately applicable when lower case forms
>>>> are needed, is actual lower-casing (embodied in the
>>>> Unicode operation toLowerCase). A more radical
>>>> operation, normally suitable only for string matching in
>>>> situations in which it is better to consider uncertain
>>>> cases as matching than to treat them as distinct, is
>>>> called "Case Folding" (Unicode operation toCaseFold).
>>>> While the two operations will often produce the same
>>>> results, Case Folding maps some lower case characters
>>>> into others and performs other transformations that may
>>>> be intuitively reasonable and expected for some users
>>>> and quite astonishing (or just wrong) to others. There
>>>> may be no practical alternative, especially if the
>>>> operations are to be used for mapping or enforcement, to
>>>> developers of PRECIS-dependent understanding that the
>>>> cases in which the two yield different results require
>>>> careful understanding of the relevant user base and its
>>>> needs [2].'
>>>
>>> Thanks.
>>>
>>> I am not sure if we need something like that if we move case mapping
>>> (here, case folding) to the comparison operation only - but something
>>> like that might still be appropriate.
>>>
>>>>>> (3) Should we update RFC 7613 so that case folding is applied
>>>>>> only as part of comparison and not as part of enforcement?
>>>>>
>>>>> That is less urgent so I suggest that we address the nickname
>>>>> spec first.
>>>>
>>>> Unless you (or someone else here) have a plausible plan to
>>>> continue and revitalize the WG and assign it that revision work
>>>> (and bring everyone actively participating up to the level
>>>> needed to easily understand this discussion thread and feel
>>>> embarrassed for not spotting the problems), I think we need to
>>>> assume that this is our last shot. Absent an active and
>>>> committed WG, "do this first" could easily be equivalent to
>>>> "don't get around to the other, ever".
>>>
>>> As mentioned, I don't want to have broken RFCs out there.
>>>
>>>> I think that the particular set of issues that started this
>>>> thread is a known defect in the PRECIS specs, both nickname and
>>>> 7613 and that we are obligated to either fix the problems or at
>>>> least explain them. The above warning text is an attempt to
>>>> explain and identify the problems even if it does not actually
>>>> provide a solution. If it were published as part of
>>>> precis-nickname, it could include a statement to the effect that
>>>> it should also be treated as an update to 7613 or, if the IESG
>>>> and RFC Editor would agree in advance to accept, rather than
>>>> bury, the thing, I suppose we could publish it in
>>>> precis-nickname and create an erratum to 7613 indicating that it
>>>> should have included some form of that statement. Neither
>>>> option implies a huge amount of work to update 7613. But I
>>>> think that making the changes of (2) without doing anything
>>>> about (3) makes the two documents inconsistent with each other
>>>> and that would be an additional known defect.
>>>>
>>>> Procedural question: given that precis-nickname is in AUTH48 as
>>>> of yesterday and I don't see anything blocking publication next
>>>> week if you and Barry sign off on the revised text that the WG
>>>> hasn't seen,
>>>
>>> There is no revised text yet. That's why we're having this discussion.
>>>
>>>> does someone need to file a pro forma objection/
>>>> appeal to block that until this is sorted out and the WG has a
>>>> chance to review proposed publication text?
>>>
>>> I see no reason to invoke the specter of appeals quite yet. Everyone is
>>> working in good faith to do the right thing and get this mess cleaned
>>> up.
>>>
>>>> [1] I believe our collective inability to deal with the
>>>> within-script character forms that do not normalize to each
>>>> other because of language-dependent or other usage factors can
>>>> be taken as evidence of having run out of energy,
>>>
>>> Or in my case simple ignorance of some of the relevant issues and
>>> examples. It's not easy to know about all of this.
>>>
>>>> but it is
>>>> probably in the interest of finishing the PRECIS work to try to
>>>> treat that as a separate issue.
>>>
>>> Probably.
>>>
>>>> [2] Not unlike the reason to differentiate between NFC and NFKC
>>>> and understand the effects of each.
>>>
>>> Another thing that's not easy to grok in fulness.
>>>
>>> Peter
>>>
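
For readers following along, the difference between lower-casing and case folding discussed in the quoted text can be seen in a small sketch. This is only an illustration, not text proposed for either document: it assumes Python's built-in str.lower() and str.casefold() are acceptable stand-ins for the Unicode toLowerCase and toCaseFold operations, and the helper name compare_case_ops and the sample strings are hypothetical, chosen for this example.

```python
# Illustrative only: str.lower() and str.casefold() are used here as
# stand-ins for Unicode toLowerCase and toCaseFold (an assumption of
# this sketch, not something stated in the thread).

def compare_case_ops(text):
    """Return (lowercased, case-folded, whether the two results agree)."""
    lowered = text.lower()
    folded = text.casefold()
    return lowered, folded, lowered == folded


# Hypothetical sample strings picked for this example.
SAMPLES = {
    "ASCII letters": "PRECIS",  # both operations agree
    "sharp s": "Stra\u00dfe",   # U+00DF is already lower case
    "final sigma": "\u03c2",    # U+03C2 is already lower case
}

if __name__ == "__main__":
    for label, text in SAMPLES.items():
        lowered, folded, same = compare_case_ops(text)
        print(f"{label}: lower={lowered!r} casefold={folded!r} same={same}")
```

On these samples, lower-casing leaves the sharp s and the final sigma alone while case folding rewrites them (to "ss" and to U+03C3 respectively), which is the "maps some lower case characters into others" behavior that the suggested warning text describes.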