Re: [precis] Ambiguity in specification of case mapping in RFC 7613 and draft-ietf-precis-nickname
Peter Saint-Andre <peter@andyet.net> Tue, 03 November 2015 04:42 UTC
To: John C Klensin <john-ietf@jck.com>, Tom Worster <fsb@thefsb.org>, Alexey Melnikov <Alexey.Melnikov@isode.com>
References: <0347834EBDC481BD99BDBE67@JcK-HP8200.jck.com> <56302E6D.5030901@andyet.net> <56312AAC.1000300@andyet.net> <56313616.8000801@andyet.net> <563143B9.7020707@andyet.net>
From: Peter Saint-Andre <peter@andyet.net>
Message-ID: <56383B2F.6080505@andyet.net>
Date: Mon, 02 Nov 2015 21:42:23 -0700
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.10; rv:38.0) Gecko/20100101 Thunderbird/38.3.0
MIME-Version: 1.0
In-Reply-To: <563143B9.7020707@andyet.net>
Content-Type: text/plain; charset="utf-8"; format="flowed"
Content-Transfer-Encoding: 8bit
Archived-At: <http://mailarchive.ietf.org/arch/msg/precis/7hPlg3Y1ll44prIkQi4miftaurY>
Cc: precis@ietf.org
Subject: Re: [precis] Ambiguity in specification of case mapping in RFC 7613 and draft-ietf-precis-nickname
List-Id: Preparation and Comparison of Internationalized Strings <precis.ietf.org>
For ease of reviewing only, and with no presumption that these proposed changes have been accepted by anyone, I have asked the RFC Editor to provisionally update the document in AUTH48 as outlined in the messages I have sent to the list. The resulting file is here:

http://www.rfc-editor.org/authors/rfc7700.txt

Despite those caveats, if at all possible I would prefer to find an acceptable solution for publishing this RFC now without undue further delays (in part because draft-ietf-simple-chat has been held on this document for almost 3 years!). That does not mean, as I said earlier in this thread, that I want to have broken RFCs out there, but I think we can fix this one acceptably now and then update it again in strict coherence with updates to RFC 7564 and RFC 7613.

I am committed to getting things right, but I am also committed to not holding up other people's work for years and years on end.

Peter

On 10/28/15 3:52 PM, Peter Saint-Andre wrote:
> Example 7 needs to be corrected, too, in accordance with CaseFolding.txt.
>
> On 10/28/15 2:54 PM, Peter Saint-Andre wrote:
>> And here is another correction in Section 3...
>>
>> OLD
>>
>> Regarding examples 5, 6, and 7: applying Unicode Default Case Folding
>> to GREEK CAPITAL LETTER SIGMA (U+03A3) results in GREEK SMALL LETTER
>> SIGMA (U+03C3), and doing so during comparison would result in
>> matching the nicknames in examples 5 and 6; however, because the
>> PRECIS mapping rules do not account for the special status of GREEK
>> SMALL LETTER FINAL SIGMA (U+03C2), the nicknames in examples 5 and 7
>> or examples 6 and 7 would not be matched.
>>
>> NEW
>>
>> Regarding examples 5, 6, and 7: applying Unicode Default Case Folding
>> to GREEK CAPITAL LETTER SIGMA (U+03A3) results in GREEK SMALL LETTER
>> SIGMA (U+03C3), and the same is true of GREEK SMALL LETTER FINAL
>> SIGMA (U+03C2); therefore, the comparison operation defined in
>> Section 2.4 would result in matching of the nicknames in examples 5,
>> 6, and 7.
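The corrected statement about the three sigma code points can be checked with a few lines of Python. This is only a sketch: it assumes that Python's `str.casefold()` (Unicode full case folding) is an acceptable stand-in for Unicode Default Case Folding for these particular code points, and the variable names are illustrative, not taken from any specification.

```python
# Fold the three sigma code points discussed for examples 5, 6, and 7.
CAPITAL_SIGMA = "\u03A3"  # GREEK CAPITAL LETTER SIGMA
SMALL_SIGMA = "\u03C3"    # GREEK SMALL LETTER SIGMA
FINAL_SIGMA = "\u03C2"    # GREEK SMALL LETTER FINAL SIGMA

# Assumption: str.casefold() stands in for Unicode Default Case Folding.
folded = [ch.casefold() for ch in (CAPITAL_SIGMA, SMALL_SIGMA, FINAL_SIGMA)]

# True when all three fold to the same code point, which is what the
# NEW text says the comparison operation relies on.
all_fold_to_small_sigma = all(f == SMALL_SIGMA for f in folded)
print(all_fold_to_small_sigma)

# Final sigma is already lowercase, so the folding above is not "lowercasing".
final_sigma_is_lowercase = FINAL_SIGMA.lower() == FINAL_SIGMA
print(final_sigma_is_lowercase)
```

Note that the final sigma changes under folding even though it is already a lowercase letter, which is the point raised elsewhere in this thread about case folding not being the same operation as lowercasing.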
>>
>> On 10/28/15 2:06 PM, Peter Saint-Andre wrote:
>>> I propose the following text changes:
>>>
>>> ###
>>>
>>> OLD
>>>
>>> 3. Case Mapping Rule: Uppercase and titlecase characters MUST be
>>> mapped to their lowercase equivalents using Unicode Default Case
>>> Folding as defined in the Unicode Standard [Unicode] (at the time
>>> of this writing, the algorithm is specified in Chapter 3 of
>>> [Unicode7.0]). In applications that prohibit conflicting
>>> nicknames, this rule helps to reduce the possibility of confusion
>>> by ensuring that nicknames differing only by case (e.g.,
>>> "stpeter" vs. "StPeter") would not be presented to a human user
>>> at the same time.
>>>
>>> NEW
>>>
>>> 3. Case Mapping Rule: Unicode Default Case Folding MUST be applied,
>>> as defined in the Unicode Standard [Unicode] (at the time
>>> of this writing, the algorithm is specified in Chapter 3 of
>>> [Unicode7.0]). The primary result of doing so is that uppercase
>>> characters are mapped to lowercase characters. In applications
>>> that prohibit conflicting nicknames, this rule helps to reduce
>>> the possibility of confusion by ensuring that nicknames
>>> differing only by case (e.g., "stpeter" vs. "StPeter") would not
>>> be presented to a human user at the same time.
>>>
>>> ###
>>>
>>> (The foregoing was previously sent to the list.)
>>>
>>> ###
>>>
>>> OLD
>>>
>>> 2.3. Enforcement
>>>
>>> An entity that performs enforcement according to this profile MUST
>>> prepare a string as described in Section 2.2 and MUST also apply the
>>> rules specified in Section 2.1. The rules MUST be applied in the
>>> order shown.
>>>
>>> After all of the foregoing rules have been enforced, the entity MUST
>>> ensure that the nickname is not zero bytes in length (this is done
>>> after enforcing the rules to prevent applications from mistakenly
>>> omitting a nickname entirely, because when internationalized
>>> characters are accepted, a non-empty sequence of characters can
>>> result in a zero-length nickname after canonicalization).
>>>
>>> 2.4. Comparison
>>>
>>> An entity that performs comparison of two strings according to this
>>> profile MUST prepare each string and enforce the rules as specified
>>> in Sections 2.2 and 2.3. The two strings are to be considered
>>> equivalent if they are an exact octet-for-octet match (sometimes
>>> called "bit-string identity").
>>>
>>> NEW
>>>
>>> 2.3. Enforcement
>>>
>>> An entity that performs enforcement according to this profile MUST
>>> prepare a string as described in Section 2.2 and MUST also apply the
>>> following rules specified in Section 2.1 in the order shown:
>>>
>>> 1. Additional Mapping Rule
>>> 2. Normalization Rule
>>> 3. Directionality Rule
>>>
>>> After all of the foregoing rules have been enforced, the entity MUST
>>> ensure that the nickname is not zero bytes in length (this is done
>>> after enforcing the rules to prevent applications from mistakenly
>>> omitting a nickname entirely, because when internationalized
>>> characters are accepted, a non-empty sequence of characters can
>>> result in a zero-length nickname after canonicalization).
>>>
>>> 2.4. Comparison
>>>
>>> An entity that performs comparison of two strings according to this
>>> profile MUST prepare each string as specified in Section 2.2 and
>>> MUST apply the following rules specified in Section 2.1 in the order
>>> shown:
>>>
>>> 1. Additional Mapping Rule
>>> 2. Case Mapping Rule
>>> 3. Normalization Rule
>>> 4. Directionality Rule
>>>
>>> The two strings are to be considered equivalent if they are an exact
>>> octet-for-octet match (sometimes called "bit-string identity").
>>>
>>> ###
>>>
>>> In addition, some variation on John's proposed text about toLowerCase
>>> vs. toCaseFold might be appropriate at the end of Section 4; however,
>>> I'm still not sure that is necessary if we move the case mapping rule to
>>> the comparison operation.
>>>
>>> Peter
>>>
>>> On 10/27/15 8:09 PM, Peter Saint-Andre wrote:
>>>> On 10/27/15 11:32 AM, John C Klensin wrote:
>>>>> Response to Monday's note immediately below; response to today's
>>>>> follows it. My apologies, but it is probably important to read
>>>>> both. My further apologies for the length of this note, but I
>>>>> think we are in deep trouble here,
>>>>
>>>> Internationalization always seems to be a matter of how deep the
>>>> trouble is...
>>>>
>>>>> trouble that is aggravated by
>>>>> precis-mappings and precis-nickname both being post-approval and
>>>>> that, as far as I know, there are no future plans for PRECIS
>>>>> work (having precis-nickname in AUTH48 just emphasizes that --
>>>>> see comment at end).
>>>>
>>>> We had not planned to work on PRECIS because we thought we were done
>>>> for awhile. If that's not the case and we need to fix things, then so
>>>> be it. Whether there is sufficient and continued energy for such work
>>>> is another question. Personally I don't want us to have broken RFCs
>>>> out there.
>>>>
>>>>> --On Monday, October 26, 2015 15:51 -0600 Peter Saint-Andre -
>>>>> &yet <peter@andyet.net> wrote:
>>>>>
>>>>>> My apologies for the delayed reply. Comments inline.
>>>>>
>>>>> A few remarks below... I can't tell whether we disagree or
>>>>> whether at least one of us, probably me, are not being
>>>>> adequately clear. (Material on which we fairly clearly agree
>>>>> elided.)
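The rule order proposed above for Section 2.4 (Additional Mapping, Case Mapping, Normalization, Directionality, then an octet-for-octet match) can be sketched roughly as follows. Several things here are assumptions that the quoted text does not state: that the Additional Mapping Rule maps Unicode space separators to U+0020 and trims and collapses spaces, that the Normalization Rule is NFKC, that the Directionality Rule is a no-op, and that `str.casefold()` stands in for Unicode Default Case Folding. The Section 2.2 preparation step is omitted entirely, and the function names are hypothetical.

```python
import re
import unicodedata


def nickname_comparison_key(nickname: str) -> bytes:
    """Hypothetical helper: apply the four comparison rules in the proposed order."""
    # 1. Additional Mapping Rule (assumed): map every space separator
    #    (General_Category Zs) to U+0020, strip leading and trailing
    #    spaces, and collapse interior runs of spaces to one.
    s = "".join(" " if unicodedata.category(ch) == "Zs" else ch for ch in nickname)
    s = re.sub(" +", " ", s.strip(" "))
    # 2. Case Mapping Rule: case folding, applied only for comparison.
    s = s.casefold()
    # 3. Normalization Rule (NFKC assumed).
    s = unicodedata.normalize("NFKC", s)
    # 4. Directionality Rule: assumed to impose nothing, so no code here.
    # The octets compared are the UTF-8 encoding of the result.
    return s.encode("utf-8")


def nicknames_match(a: str, b: str) -> bool:
    """Octet-for-octet comparison of the two processed strings."""
    return nickname_comparison_key(a) == nickname_comparison_key(b)


print(nicknames_match("stpeter", "StPeter"))
print(nicknames_match("stpeter", "st.peter"))
```

Because folding happens only inside `nickname_comparison_key`, the string a user originally supplied is never rewritten; the enforcement path in the proposed Section 2.3 would simply skip step 2.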
>>>>>
>>>>>
>>>>>> On 10/1/15 7:50 AM, John C Klensin wrote:
>>>>>>> --On Wednesday, September 30, 2015 15:16 -0600 Peter
>>>>>>> Saint-Andre - &yet <peter@andyet.net> wrote:
>>>>>> ...
>>>>>>> Peter,
>>>>>>>
>>>>>>> While your proposed text is an improvement,
>>>>>>
>>>>>> Happy to hear it. All I intended was a slight clarification.
>>>>>
>>>>> But I'm not certain we are there yet...
>>>>
>>>> Agreed. The text I proposed addressed only a very small part of the
>>>> problem.
>>>>
>>>>>>> the desire of many
>>>>>>> people for a magic "just tell me what to do" formula, one that
>>>>>>> lets them avoid understanding the issues, may call for a
>>>>>>> little more:
>>>>>>
>>>>>> There is always a need for more when it comes to i18n.
>>>>>
>>>>> But I think it is a little more than that. I've heard several
>>>>> times, including in PRECIS meetings, requests for "just tell me
>>>>> what to do and make sure it isn't complicated" (or "I don't want
>>>>> to have to think about, much less understand, the issues"). We
>>>>> can debate whether giving in to those requests in the I18n case
>>>>> is wise. I think it leads directly to conclusions equivalent to
>>>>> "I understand my own script and writing system (or think I do)
>>>>> and therefore, since all writing systems must be pretty much the
>>>>> same, I understand all of the core issues in terms of my script
>>>>> and understanding". That, in turn, leads directly to the "how
>>>>> do you spell 'Zürich'?" and "all spellings of 'Zuerich' should
>>>>> be treated as equivalent" discussions that sounded like they
>>>>> dominated a BOF at IETF 93.
>>>>>
>>>>> Now I actually think it is reasonable for someone to ask for a
>>>>> library that will do the job most of the time and that will
>>>>> almost never cause their users or customers to get angry at
>>>>> them. But, if we are going to call what we do "standards", they
>>>>> should contain sufficient information that would-be library
>>>>> authors can know what to do ... or understand that they are in
>>>>> over their heads. And, for these particular cases, we may need
>>>>> to explain, or help the library authors explain, why some cases
>>>>> will fail and, indeed, get users mad at vendors.
>>>>>
>>>>>
>>>>>>> (1) First, toCaseFold is _not_ toLowerCase. Saying "The
>>>>>>> primary result of doing so is that uppercase characters are
>>>>>>> mapped to lowercase characters" is true for toCaseFold,
>>>>>>
>>>>>> By "primary" I meant two things: (1) lowercasing is what
>>>>>> happens to the preponderance of code points and (2) this is
>>>>>> the result that most people care about.
>>>>>
>>>>> If I parse the above correctly, I think you are wrong. I think
>>>>> what most people want, care about, and think they are getting,
>>>>> is lower case conversion, i.e., an operation that preserves
>>>>> lower case characters and converts upper case characters to the
>>>>> equivalent lower case. toCaseFold isn't that operation. It is
>>>>> a much more complex and subtle operation that, as well as
>>>>> converting upper case characters to lower case, sometimes
>>>>> converts lower case characters to different lower case
>>>>> characters (or strings of them). It also requires a fairly good
>>>>> understanding of Unicode (not just a relevant script) and
>>>>> historical Unicode decisions to predict its behavior and to have
>>>>> any hope of explaining that behavior to users. If one is
>>>>> trying to compare (as distinct from converting), then toCaseFold
>>>>> may be exactly what is wanted, but it is really hard to explain
>>>>> or justify that in terms of "nicknames" or "aliases", which are
>>>>> about conversion. And, if one hopes to explain what is going
>>>>> on to users in terms of "lower casing", then toCaseFold is just
>>>>> the wrong operation. That is what toLowerCase is for and the
>>>>> two operations are just not equivalent.
>>>> >>>> My recollection, quite possibly inaccurate or incomplete, from at least >>>> one and I think several in-person meetings of the PRECIS WG was: just >>>> use Unicode Default Case Folding because if you use anything else or >>>> try >>>> to roll your own you will be fubar forever. I do not recall any >>>> discussion of the issues you have raised in this thread (e.g., about >>>> the >>>> inadvisability of using case folding for anything but comparison >>>> operations) until the last few weeks. However, I freely admit that's >>>> probably because, through my own faults and ignorance, I didn't >>>> understand what you were saying. >>>> >>>>> FWIW and purely by coincidence wrt PRECIS and this document, I >>>>> had a conversation a few days ago with an expert on Arabic (and >>>>> Persian) calligraphy and writing systems (and good general >>>>> knowledge of writing systems) who is quite insistent that any >>>>> procedure we use for case-insensitive matching (e.g., case >>>>> folding) is discriminatory, inconsistent, and just >>>>> badly-thought-out if that same procedure doesn't treat isolated, >>>>> initial, and medial forms of the same character as equivalent. >>>>> He further strengthens his case (sic) by noting that Unicode >>>>> case folding maps U+03C2 (GREEK SMALL LETTER FINAL SIGMA, >>>>> unambiguously a lower-case character) to U+03C3 (GREEK SMALL >>>>> LETTER SIGMA), a relationship that depends entirely on >>>>> positional use and not case. He also believes the same >>>>> relationships should apply to all other scripts that make form >>>>> distinctions for some characters based on positions in a string >>>>> and for which Unicode has chosen to assign different code >>>>> points. 
Even if there were wide acceptance of his view, Unicode >>>>> stability principles would prevent changing toCaseFold (or >>>>> CaseFolding.txt), but this is more evidence that what toCaseFold >>>>> does and does not do is going to be hard to explain to either >>>>> casual users or to writing system experts whose primary >>>>> experience is not with the Greek-Latin-Cyrillic group. >>>>> >>>>> I don't think we want to say "these matching rules are somewhat >>>>> arbitrary and irrational, but, if you don't like it, blame >>>>> Unicode and not us", if only because it is our choice to use >>>>> those matching rules. More below. >>>>> >>>>> >>>>>> ... >>>>>>> (2) Second, probably as a result of having IDNA in the lead, >>>>>>> we've gotten sloppy about language and operations and should >>>>>>> probably start untangling that before it gets people in >>>>>>> trouble. >>>>>> >>>>>> Where is the right place to do that untangling? (I doubt that >>>>>> it is the precis-nickname document.) >>>>> >>>>> I agree that precis-nickname isn't the ideal place. I also >>>>> believe that you and it are the innocent victims of the >>>>> situation. At the same time, I don't believe IETF should be >>>>> producing incomplete, ambiguous, erroneous, or misleading >>>>> standards because no one could get around to doing the right >>>>> foundational work. >>>> >>>> Agreed. I too want to get this right, even though it's not a lot of fun >>>> and it's certainly more work than I thought I was signing up for at the >>>> NEWPREP BoF years ago. >>>> >>>>>>> The Unicode Standard, at least as I understand it, is fairly >>>>>>> clear that the most important (and really only safe) use of >>>>>>> toCaseFold is as part of a comparison operation. >>>>>> >>>>>> Thanks for noting that. 
For example, Section 5.18 of Unicode >>>>>> 8.0.0 says: >>>>>> >>>>>> Caseless matching is implemented using case folding, which >>>>>> is the >>>>>> process of mapping characters of different case to a >>>>>> single form, so >>>>>> that case differences in strings are erased. Case folding >>>>>> allows for >>>>>> fast caseless matches in lookups because only binary >>>>>> comparison is >>>>>> required. It is more than just conversion to lowercase. >>>>> >>>>> Right. But, again, when its use is appropriate (a very >>>>> controversial topic in itself with our painful IDNA history with >>>>> Final Sigma, Eszett and the case-independent versus >>>>> position-independent controversy called out above as examples) >>>>> that is "matches in lookups" (what I've described elsewhere as >>>>> "comparison only"). Not creating or defining nicknames or >>>>> aliases. And that _is_ a problem for this document. >>>> >>>> I'm not convinced that things are as bad as you think. If we say in >>>> draft-ietf-precis-nickname that the case mapping rule is to be applied >>>> only as part of comparison and not as part of enforcement - which I >>>> think is really what we care about (e.g., to prevent spoofing of users >>>> in chat rooms) - then I think we might be most of the way there. >>>> >>>>>>> Using your >>>>>>> example it is entirely reasonable to treat, "stpeter" and >>>>>>> "StPeter" as equivalent in a comparison operation, but >>>>>>> accepting one string and changing it to the other for display >>>>>>> may not be a really good idea. While that transformation may >>>>>>> be acceptable (although I would be surprised if there were no >>>>>>> people who share your surname who could consider "stpeter" or >>>>>>> "Stpeter" unacceptable and might even believe that "StPeter" >>>>>>> is an unacceptable substitute for "St. Peter"), >>>>>> >>>>>> I do receive email at stpeter@gmail.com intended for >>>>>> st.peter@gmail.com but that's a separate topic... 
>>>>>
>>>>> One that is relevant because it "works" as a side-effect of a
>>>>> decision Google has made about mailbox name equivalence, a
>>>>> decision that, IMO, will sooner or later get someone into a lot
>>>>> of trouble and, more important, a decision and matching rule
>>>>> that PRECIS, AFAICT, does not allow and that IDNA unambiguously
>>>>> forbids.
>>>>>
>>>>>>> it also points out the
>>>>>>> dangers of using Basic Latin script examples to illustrate
>>>>>>> situations in which even more extended Latin script, much less
>>>>>>> other scripts, may raise more complex issues. Because IDNA
>>>>>>> is essentially a workaround because changing the DNS
>>>>>>> comparison rules was impractical for several reasons, we
>>>>>>> ended up using toCaseFold to map characters and strings into
>>>>>>> others in IDNA2003 but PRECIS implementations that do not
>>>>>>> have the same constraints would, in general, be better off
>>>>>>> confining the use of toCaseFold, or even toLowerCase, to
>>>>>>> comparison operations.
>>>>>>
>>>>>> Unfortunately we didn't do that in RFC 7564 or RFC 7613. Does
>>>>>> it make sense for this nickname specification to differ in
>>>>>> this respect from the published RFCs? Shall we file errata
>>>>>> against those documents? (This might apply only to RFC 7613,
>>>>>> which says to apply case folding as part of the enforcement
>>>>>> process - when exactly to apply case folding is not stipulated
>>>>>> by RFC 7564.)
>>>>>
>>>>> To the extent to which this is a "botched that because the WG
>>>>> didn't understand the issues well enough" conclusion, it would
>>>>> be entirely reasonable to generate an updating RFC that repairs
>>>>> 7613 and/or 7564, even doing so in an addendum to
>>>>> precis-nickname if that is the only way to do that
>>>>> expeditiously. Per the above, we really don't want to give
>>>>> library routine writers bad instructions. As I understand it,
>>>>> the current position of the RFC Editor and IESG is that
>>>>> technical specification errors discovered in retrospect or after
>>>>> people start using a spec are not appropriate topics for errata.
>>>>> If the WG is not willing to do any of those things, then I
>>>>> suggest that precis-nickname at least needs to contain a very
>>>>> clear warning notice about this situation (see my response to
>>>>> your question 1 below).
>>>>
>>>> I think we'll probably need to fix 7613 and 7564. I am hoping we can
>>>> fix nickname now so that it is less incorrect than the other two. That
>>>> doesn't necessarily mean we won't need to also further fix nickname
>>>> later on.
>>>>
>>>> Granted, we were supposed to avoid this problem by working on all of
>>>> the PRECIS specs simultaneously. Clearly we have not avoided the
>>>> problem, so we need to solve it one way or another. If that means bis
>>>> for them all, we need to deal with it.
>>>>
>>>>>>> (3) Because toCaseFold loses information when used for more
>>>>>>> than comparison (for comparison, it merely contributes to
>>>>>>> what some people would consider false positives for matching)
>>>>>>> involves some controversial decisions and, because of
>>>>>>> stability requirements, cannot be changed even if the
>>>>>>> controversies are resolved in other ways, we end up with,
>>>>>>> e.g.,
>>>>>>> toCaseFold ("Nuß") -> "nuss"
>>>>>>> which is considered an acceptable transformation in some
>>>>>>> places that identify themselves as speaking/using German and
>>>>>>> two different unacceptable errors in others. Again, this will
>>>>>>> almost always be much more serious if the transformation is
>>>>>>> used to map and replace strings than if it is used to compare
>>>>>>> (fwiw, that particular example is part of a continuing
>>>>>>> disagreement between IDNA2008 and, among others, German
>>>>>>> domain registry authorities on one side and UTC and UTR 46 on
>>>>>>> the other).
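The "Nuß" example is easy to reproduce. The sketch below uses Python's `str.casefold()` and `str.lower()` as stand-ins for the toCaseFold and toLowerCase operations named in the thread; that correspondence is an assumption made for illustration, and the variable names are not from any specification.

```python
original = "Nuß"

folded = original.casefold()   # full case folding expands U+00DF
lowered = original.lower()     # lowercasing leaves U+00DF alone

print(folded)    # nuss
print(lowered)   # nuß

# Folding erases the distinction between the two spellings, which is
# harmless for a match test but lossy if the result replaces the input.
spellings_collide = "Nuß".casefold() == "Nuss".casefold()
print(spellings_collide)  # True
```

Used only for comparison, the collision is at worst a false positive; used to map and store the string, the original "ß" cannot be recovered from "nuss".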
>>>>>> >>>>>> Agreed. >>>>> >>>>> See "warning notice" comment above and question 1 response below. >>>>> >>>>>> (4) If the motivation is really to avoid confusion, the >>>>>>> correct confusion-blocking rule for Latin script (but not >>>>>>> others) and many languages that use it (but certainly not >>>>>>> all) involves moving beyond toCaseFold and treating all >>>>>>> "decorated" characters (characters normally represented by >>>>>>> glyphs consisting of a Basic Latin character and one or more >>>>>>> diacritical or equivalent markings) compare equal to their >>>>>>> base characters, e.g., "á" not only matches "Á" but also >>>>>>> "a" and "A" and, as an unfortunate side-effect, maybe "À" >>>>>>> and "à" as well. This is bad news for languages in which >>>>>>> decorated Latin characters are used to represent phonetically >>>>>>> and conceptually different characters, not just pronunciation >>>>>>> variations. I am not qualified to evaluate "how bad". In >>>>>>> addition, extrapolations from this principle about Latin >>>>>>> script to unrelated scripts will almost certainly lead to >>>>>>> serious errors and/or additional confusion. >>>>>> >>>>>> I would not be comfortable going that far... >>>>> >>>>> In case it isn't clear, I would not be either. But it is where >>>>> getting sloppy about this stuff could easily take us. It is >>>>> worth noting that it also identifies one of the difficulties >>>>> with doing a global system to be applied to many types of >>>>> applications (like the PRECIS work) and then applying it in user >>>>> interface software that end users will expect to be localized to >>>>> their assumptions because it has been mapped or translated into >>>>> their language (if one normally speaks Upper Slobbovian but has >>>>> some familiarity with English, an application interface in >>>>> English will probably be expected to be "foreign", odd, and >>>>> maybe even inconsistent with whatever expectations exist. 
But, >>>>> if the interface is in Upper Slobbovian, the natural and >>>>> reasonable assumption will be the matching should conform to >>>>> normal Upper Slobbovian conventions. FWIW, a matching rule >>>>> that says: >>>>> >>>>> (i) Two instances of a base character with the same >>>>> diacritical mark(s) match. >>>>> (ii) Two instances of a base character with different >>>>> diacritical mark(s) do not match. >>>>> (iii) Two instances of a base character, one with >>>>> diacritical mark(s) and the other without any decoration >>>>> match. >>>>> >>>>> Is precisely correct and normal behavior for at least one >>>>> language that uses Latin script. It is also the normal practice >>>>> for at least one Latin script transcription system that is used >>>>> by a large fraction of a billion people (maybe more). >>>> >>>> That is indeed sobering. >>>> >>>>>>> More on this and Tom's question below... >>>>>>> >>>>>>>> On 9/29/15 3:28 PM, Tom Worster wrote: >>>>>>>>> Peter, Alexey, >>>>>>>>> >>>>>>>>> I think there is an ambiguity in the specification of case >>>>>>>>> mapping in RFC 7613 and draft-ietf-precis-nickname-19. >>>>>>>> ... >>>>>>>>> But there are 55 code points in Unicode 7.0.0 that change >>>>>>>>> under default case folding that are neither uppercase nor >>>>>>>>> titlecase characters, 12 of which are Lowercase_Letter. I >>>>>>>>> suspect this stems from a confusion between Unicode case >>>>>>>>> mapping and case folding. >>>>> >>>>> In the context of the above, a different way to say the same >>>>> thing is that people are looking at toCaseFold and assuming (and >>>>> explaining things in terms of) toLowerCase. toCaseFold works >>>>> the way it is expected to and those 55 code points are, more or >>>>> less, collateral damage to get to a matching algorithm that >>>>> favors false positives over false negatives and various edge >>>>> cases (including in "edge cases" languages spoken by, and script >>>>> variations used by, millions of people). 
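Tom's observation about code points that change under case folding without being uppercase or titlecase can be explored with a short script. This is a sketch under stated assumptions: it uses Python's `str.casefold()` (full case folding) and General_Category values Lu and Lt as the definition of "uppercase" and "titlecase", and it runs against whatever Unicode version the local Python ships, so the counts it prints will not necessarily equal the Unicode 7.0.0 figures quoted above.

```python
import sys
import unicodedata


def folded_but_not_upper_or_title():
    """Code points that casefold() changes although they are not Lu or Lt."""
    hits = []
    for cp in range(sys.maxunicode + 1):
        ch = chr(cp)
        if ch.casefold() != ch and unicodedata.category(ch) not in ("Lu", "Lt"):
            hits.append(cp)
    return hits


hits = folded_but_not_upper_or_title()
# Subset that is already a lowercase letter (General_Category Ll).
lowercase_hits = [cp for cp in hits if unicodedata.category(chr(cp)) == "Ll"]

print("Unicode data version:", unicodedata.unidata_version)
print("changed but not Lu/Lt:", len(hits))
print("of which Ll:", len(lowercase_hits))
```

GREEK SMALL LETTER FINAL SIGMA (U+03C2) and LATIN SMALL LETTER SHARP S (U+00DF), both mentioned in this thread, are among the lowercase letters the script reports.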
>>>> >>>> Sadly I suspect that is an accurate description of the current state of >>>> affairs (modulo my comment above about PRECIS WG discussions at one or >>>> more IETF meetings). >>>> >>>>>> ... >>>>>> After all that, I have 3 questions: >>>>> >>>>> Personal opinions about answers... >>>>> >>>>>> (1) Is my proposed text enough of a clarification that we >>>>>> should make that change before the nickname I-D is published >>>>>> as an RFC? >>>>> >>>>> I think the clarification is an improvement and is important >>>>> enough to incorporate (I know that is the answer to a slightly >>>>> different question). >>>>> >>>>> However, I think it is inadequate without a serious warning >>>>> about the situation. >>>> >>>> Yes. >>>> >>>>> That warning could appear in either this >>>>> document or RFC 7613 (or 7613bis) with a pointer from the other, >>>>> but, unless you want to revise 7613 now, this one is handy. >>>> >>>> I suspect that we need to revise 7613. I suspect that we might also >>>> need >>>> to revise 7564 (at least with respect to the order in which operations >>>> are applied, since there has been some confusion among implementers). >>>> >>>> Well, we always knew that we would need to revise them. Just not so >>>> soon. >>>> >>>>> Comment about possible text below. >>>>> >>>>>> (2) Should we modify draft-ietf-precis-nickname so that case >>>>>> folding is applied only as part of comparison and not as part >>>>>> of enforcement? If so, should we make that change before this >>>>>> document is published as an RFC? >>>>> >>>>> Yes. If something is used for "enforcement", it should be lower >>>>> casing or something else that can be explained to people who are >>>>> ordinarily familiar with one or more of the scripts that make >>>>> case distinctions. 
>>>>>
>>>>> However, viewed in the light of this discussion, the whole
>>>>> "enforcement" concept becomes a little dicey, especially if, as
>>>>> I believe but don't have time to verify, the transformations
>>>>> performed by toLowerCase are not a proper subset of those
>>>>> performed by toCaseFold.
>>>>
>>>> My initial thought is that case mapping doesn't belong in the nickname
>>>> enforcement operation at all - only in the comparison operation.
>>>>
>>>>>> (3) Should we update RFC 7613 so that case folding is applied
>>>>>> only as part of comparison and not as part of enforcement?
>>>>>
>>>>> I think that is necessary. Following up on the comment above, I
>>>>> would prefer that the current Section 3.2.2 (3) of RFC 7613
>>>>> either point to Unicode Lower Casing or contain a warning along
>>>>> the lines of that below.
>>>>
>>>> Unlike the nickname profile (which I think can be cleaned up by moving
>>>> the case mapping rule to the comparison operation and continuing to use
>>>> Unicode Default Case Folding), I think you are right that for the
>>>> UsernameCaseMapped profile we probably want Unicode Lower Casing. Thus
>>>> the likely need, sooner rather than later, for 7613bis.
>>>>
>>>>>
>>>>> ----------
>>>>>
>>>>> --On Tuesday, October 27, 2015 07:52 -0600 Peter Saint-Andre
>>>>> <peter@andyet.net> wrote:
>>>>>
>>>>>> This issue has greater urgency now because
>>>>>> draft-ietf-precis-nickname is now in AUTH48...
>>>>>>
>>>>>> On 10/26/15 3:51 PM, Peter Saint-Andre - &yet wrote:
>>>>>>
>>>>>>> After all that, I have 3 questions:
>>>>>>>
>>>>>>> (1) Is my proposed text enough of a clarification that we
>>>>>>> should make that change before the nickname I-D is published
>>>>>>> as an RFC?
>>>>>>
>>>>>> I think so.
>>>>>
>>>>> See above.
>>>>>
>>>>>>> (2) Should we modify draft-ietf-precis-nickname so that case
>>>>>>> folding is applied only as part of comparison and not as part
>>>>>>> of enforcement? If so, should we make that change before this
>>>>>>> document is published as an RFC?
>>>>>>
>>>>>> Although it seems to be the case that Unicode case folding is
>>>>>> primarily designed for the purpose of matching (i.e.,
>>>>>> comparison),
>>>>>
>>>>> "Seems" is a little weak. The Unicode Standard is really quite
>>>>> specific about that.
>>>>>
>>>>>> I have a concern that applying the PRECIS case
>>>>>> mapping rule after applying the normalization and
>>>>>> directionality rules might have unintended consequences that
>>>>>> we haven't had a chance to consider yet. The PRECIS framework
>>>>>> expresses a preference (actually a hard requirement) for
>>>>>> applying the rules in a particular order. We made a late
>>>>>> change to the username profiles (RFC 7613), such that width
>>>>>> mapping is applied first (in order to accommodate fullwidth
>>>>>> and halfwidth characters in certain East Asian scripts).
>>>>>> Making a late change to the nickname profile also concerns me,
>>>>>> even though both of these late changes seem reasonable on the
>>>>>> face of it. I will try to find time to think about this
>>>>>> further in the next 24 hours.
>>>>>
>>>>> First, a hint for the consideration process: there is a reason
>>>>> why Unicode now supports a unified case folding and
>>>>> normalization operation. My recollection is that it is not only
>>>>> more efficient to perform both operations at once (rather than
>>>>> looking in one table and then the other), but that there are
>>>>> some order-dependent or priority-dependent cases.
>>>>>
>>>>> The very fact that this issue exists (and is coming up again)
>>>>> this late in the process (7613 published in August, WG winding
>>>>> down and not, e.g., meeting next week) calls at least the PRECIS
>>>>> quality of review and some fairly fundamental model issues into
>>>>> question. I first raised that issue a rather long time ago but
>>>>> have continued to hope that we have an approximation to "good
>>>>> enough" without going back and rethinking everything.
>>>>>
>>>>> The right solution, IMO, is that, if RFC 7613 is to rationalize
>>>>> or explain the operation in terms of converting upper case
>>>>> characters to lower case, then it should be using toLowerCase
>>>>> because that is what the operation does. After a quick look at
>>>>> 7613, amending/updating it to simply convert to lower case would
>>>>> be straightforward (and would not raise the ordering issue
>>>>> called out above). It would presumably require another IETF
>>>>> Last Call, however and I'd hope we would see some serious
>>>>> discussion within the WG (and with UTC) before making the change
>>>>> and about how it is explained.
>>>>>
>>>>> If we are not willing to make a change
>>>>
>>>> I'm willing. It would, as you note, require some careful thinking and
>>>> review to make sure that we got it (more) right this time.
>>>>
>>>>> that significant and/or
>>>>> if we conclude that the WG (and perhaps the IETF) have
>>>>> completely run out of energy for dealing with i18n issues [1],
>>>>> then I suggest that we introduce some additional text. I've
>>>>> just spent a half-hour trying to find the AUTH48 copy of
>>>>> precis-nickname (aka RFC-to-be-7700), but the RFC Editor has
>>>>> apparently changed naming conventions and the various queue
>>>>> entry pages all point to the -19 I-D and not the current working
>>>>> copy so I can't try to match text and insertion point to what is
>>>>> there already.
>>>>
>>>> http://www.rfc-editor.org/authors/rfc7700.txt
>>>>
>>>>> The suggestion is a patch (and a hack), not a
>>>>> good fix but something like it is probably the least drastic
>>>>> measure that would yield something that doesn't contain
>>>>> unexplained known defects.
>>>>>
>>>>> Rough version of suggested text (possibly to go after your
>>>>> revised paragraph and following up my comments in my 1 October
>>>>> note). Some of the terminology needs checking which I can do if
>>>>> you want to go this route:
>>>>>
>>>>>     'Users of this specification should note that the
>>>>>     concept of "lower case conversion" is somewhat elusive
>>>>>     and more dependent on the conventions of different
>>>>>     languages and notation systems that use the same script
>>>>>     than may appear obvious at first glance, especially if
>>>>>     that glance is at Basic Latin characters (i.e., the
>>>>>     ASCII letter repertoire). Unicode provides two
>>>>>     different mapping procedures that produce lower-case
>>>>>     characters, but they have different effects and results
>>>>>     for many characters. The more conservative one,
>>>>>     typically appropriately applicable when lower case forms
>>>>>     are needed, is actual lower-casing (embodied in the
>>>>>     Unicode operation toLowerCase). A more radical
>>>>>     operation, normally suitable only for string matching in
>>>>>     situations in which it is better to consider uncertain
>>>>>     cases as matching than to treat them as distinct, is
>>>>>     called "Case Folding" (Unicode operation toCaseFold).
>>>>>     While the two operations will often produce the same
>>>>>     results, Case Folding maps some lower case characters
>>>>>     into others and performs other transformations that may
>>>>>     be intuitively reasonable and expected for some users
>>>>>     and quite astonishing (or just wrong) to others. There
>>>>>     may be no practical alternative, especially if the
>>>>>     operations are to be used for mapping or enforcement, to
>>>>>     developers of PRECIS-dependent understanding that the
>>>>>     cases in which the two yield different results require
>>>>>     careful understanding of the relevant user base and its
>>>>>     needs [2].'
>>>>
>>>> Thanks.
>>>>
>>>> I am not sure if we need something like that if we move case mapping
>>>> (here, case folding) to the comparison operation only - but something
>>>> like that might still be appropriate.
>>>>
>>>>>>> (3) Should we update RFC 7613 so that case folding is applied
>>>>>>> only as part of comparison and not as part of enforcement?
>>>>>>
>>>>>> That is less urgent so I suggest that we address the nickname
>>>>>> spec first.
>>>>>
>>>>> Unless you (or someone else here) have a plausible plan to
>>>>> continue and revitalize the WG and assign it that revision work
>>>>> (and bring everyone actively participating up to the level
>>>>> needed to easily understand this discussion thread and feel
>>>>> embarrassed for not spotting the problems), I think we need to
>>>>> assume that this is our last shot. Absent an active and
>>>>> committed WG, "do this first" could easily be equivalent to
>>>>> "don't get around to the other, ever".
>>>>
>>>> As mentioned, I don't want to have broken RFCs out there.
>>>>
>>>>> I think that the particular set of issues that started this
>>>>> thread as a known defect in the PRECIS specs, both nickname and
>>>>> 7613 and that we are obligated to either fix the problems or at
>>>>> least explain them. The above warning text is an attempt to
>>>>> explain and identify the problems even if it does not actually
>>>>> provide a solution. If it were published as part of
>>>>> precis-nickname, it could include a statement to the effect that
>>>>> it should also be treated as an update to 7613 or, if the IESG
>>>>> and RFC Editor would agree in advance to accept, rather than
>>>>> bury, the thing, I suppose we could publish it in
>>>>> precis-nickname and create an erratum to 7613 indicating that it
>>>>> should have included some form of that statement. Neither
>>>>> option implies a huge amount of work to update 7613. But I
>>>>> think that making the changes of (2) without doing anything
>>>>> about (3) makes the two documents inconsistent with each other
>>>>> and that would be an additional known defect.
>>>>>
>>>>> Procedural question: given that precis-nickname is in AUTH48 as
>>>>> of yesterday and I don't see anything blocking publication next
>>>>> week if you and Barry sign off on the revised text that the WG
>>>>> hasn't seen,
>>>>
>>>> There is no revised text yet. That's why we're having this discussion.
>>>>
>>>>> does someone need to file a pro forma objection/
>>>>> appeal to block that until this is sorted out and the WG has a
>>>>> chance to review proposed publication text?
>>>>
>>>> I see no reason to invoke the specter of appeals quite yet. Everyone is
>>>> working in good faith to do the right thing and get this mess cleaned
>>>> up.
>>>>
>>>>> [1] I believe our collective inability to deal with the
>>>>> within-script character forms that do not normalize to each
>>>>> other because of language-dependent or other usage factors can
>>>>> be taken as evidence of having run out of energy,
>>>>
>>>> Or in my case simple ignorance of some of the relevant issues and
>>>> examples. It's not easy to know about all of this.
>>>>
>>>>> but it is
>>>>> probably in the interest of finishing the PRECIS work to try to
>>>>> treat that as a separate issue.
>>>>
>>>> Probably.
>>>>
>>>>> [2] Not unlike the reason to differentiate between NFC and NFKC
>>>>> and understand the effects of each.
>>>>
>>>> Another thing that's not easy to grok in fulness.
>>>>
>>>> Peter
>>>>
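For readers following the toLowerCase vs. toCaseFold distinction in the thread above, here is a minimal sketch in Python. It assumes that Python's str.lower() and str.casefold() are acceptable stand-ins for the Unicode lowercasing and case folding operations. The helper names enforce(), comparison_key() and nicknames_match() are hypothetical and are not taken from RFC 7613 or draft-ietf-precis-nickname; the sketch only illustrates the idea of keeping case mapping out of enforcement and applying case folding at comparison time, and it is not a PRECIS implementation.

```python
import unicodedata


def enforce(nickname: str) -> str:
    """Hypothetical enforcement step: normalize only, preserve case.

    Simplification: a real PRECIS profile applies more rules than this.
    """
    return unicodedata.normalize("NFKC", nickname)


def comparison_key(nickname: str) -> str:
    """Hypothetical comparison step: case fold, then re-normalize.

    Re-normalizing after folding is a defensive assumption, since case
    folding can produce strings that are no longer normalized.
    """
    folded = enforce(nickname).casefold()
    return unicodedata.normalize("NFKC", folded)


def nicknames_match(a: str, b: str) -> bool:
    """Two nicknames match if their comparison keys are equal."""
    return comparison_key(a) == comparison_key(b)


if __name__ == "__main__":
    # Lowercasing leaves already-lowercase letters alone; case folding
    # may map one lowercase letter to another (or to several).
    for ch in ["ß", "ς", "A"]:
        print(repr(ch), "lower:", repr(ch.lower()),
              "casefold:", repr(ch.casefold()))

    # Enforcement keeps what the user typed; only comparison folds.
    print(enforce("Straße"))
    print(nicknames_match("Straße", "STRASSE"))
```

With this split, the stored form keeps the user's original casing, and the more aggressive folding (for example sharp s to "ss", final sigma to sigma) is confined to matching, which is the use the thread says case folding is designed for.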