Re: [precis] Ambiguity in specification of case mapping in RFC 7613 and draft-ietf-precis-nickname
Peter Saint-Andre <peter@andyet.net> Wed, 28 October 2015 20:54 UTC
To: John C Klensin <john-ietf@jck.com>, Tom Worster <fsb@thefsb.org>, Alexey Melnikov <Alexey.Melnikov@isode.com>
References: <0347834EBDC481BD99BDBE67@JcK-HP8200.jck.com> <56302E6D.5030901@andyet.net> <56312AAC.1000300@andyet.net>
From: Peter Saint-Andre <peter@andyet.net>
Message-ID: <56313616.8000801@andyet.net>
Date: Wed, 28 Oct 2015 14:54:46 -0600
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.10; rv:38.0) Gecko/20100101 Thunderbird/38.2.0
MIME-Version: 1.0
In-Reply-To: <56312AAC.1000300@andyet.net>
Content-Type: text/plain; charset="utf-8"; format="flowed"
Content-Transfer-Encoding: 8bit
Archived-At: <http://mailarchive.ietf.org/arch/msg/precis/q2OB_d9zT8vOrOPPv9jsck7uRqw>
Cc: precis@ietf.org

And here is another correction in Section 3...

OLD

   Regarding examples 5, 6, and 7: applying Unicode Default Case
   Folding to GREEK CAPITAL LETTER SIGMA (U+03A3) results in GREEK
   SMALL LETTER SIGMA (U+03C3), and doing so during comparison would
   result in matching the nicknames in examples 5 and 6; however,
   because the PRECIS mapping rules do not account for the special
   status of GREEK SMALL LETTER FINAL SIGMA (U+03C2), the nicknames
   in examples 5 and 7 or examples 6 and 7 would not be matched.

NEW

   Regarding examples 5, 6, and 7: applying Unicode Default Case
   Folding to GREEK CAPITAL LETTER SIGMA (U+03A3) results in GREEK
   SMALL LETTER SIGMA (U+03C3), and the same is true of GREEK SMALL
   LETTER FINAL SIGMA (U+03C2); therefore, the comparison operation
   defined in Section 2.4 would result in matching of the nicknames
   in examples 5, 6, and 7.

On 10/28/15 2:06 PM, Peter Saint-Andre wrote:
> I propose the following text changes:
>
> ###
>
> OLD
>
>    3.  Case Mapping Rule: Uppercase and titlecase characters MUST be
>        mapped to their lowercase equivalents using Unicode Default Case
>        Folding as defined in the Unicode Standard [Unicode] (at the time
>        of this writing, the algorithm is specified in Chapter 3 of
>        [Unicode7.0]). In applications that prohibit conflicting
>        nicknames, this rule helps to reduce the possibility of confusion
>        by ensuring that nicknames differing only by case (e.g.,
>        "stpeter" vs. "StPeter") would not be presented to a human user
>        at the same time.
>
> NEW
>
>    3.  Case Mapping Rule: Unicode Default Case Folding MUST be applied,
>        as defined in the Unicode Standard [Unicode] (at the time
>        of this writing, the algorithm is specified in Chapter 3 of
>        [Unicode7.0]). The primary result of doing so is that uppercase
>        characters are mapped to lowercase characters. In applications
>        that prohibit conflicting nicknames, this rule helps to reduce
>        the possibility of confusion by ensuring that nicknames
>        differing only by case (e.g., "stpeter" vs. "StPeter") would not
>        be presented to a human user at the same time.
>
> ###
>
> (The foregoing was previously sent to the list.)
>
> ###
>
> OLD
>
> 2.3. Enforcement
>
>    An entity that performs enforcement according to this profile MUST
>    prepare a string as described in Section 2.2 and MUST also apply the
>    rules specified in Section 2.1. The rules MUST be applied in the
>    order shown.
>
>    After all of the foregoing rules have been enforced, the entity MUST
>    ensure that the nickname is not zero bytes in length (this is done
>    after enforcing the rules to prevent applications from mistakenly
>    omitting a nickname entirely, because when internationalized
>    characters are accepted, a non-empty sequence of characters can
>    result in a zero-length nickname after canonicalization).
>
> 2.4. Comparison
>
>    An entity that performs comparison of two strings according to this
>    profile MUST prepare each string and enforce the rules as specified
>    in Sections 2.2 and 2.3. The two strings are to be considered
>    equivalent if they are an exact octet-for-octet match (sometimes
>    called "bit-string identity").
>
> NEW
>
> 2.3. Enforcement
>
>    An entity that performs enforcement according to this profile MUST
>    prepare a string as described in Section 2.2 and MUST also apply the
>    following rules specified in Section 2.1 in the order shown:
>
>    1. Additional Mapping Rule
>    2. Normalization Rule
>    3. Directionality Rule
>
>    After all of the foregoing rules have been enforced, the entity MUST
>    ensure that the nickname is not zero bytes in length (this is done
>    after enforcing the rules to prevent applications from mistakenly
>    omitting a nickname entirely, because when internationalized
>    characters are accepted, a non-empty sequence of characters can
>    result in a zero-length nickname after canonicalization).
>
> 2.4. Comparison
>
>    An entity that performs comparison of two strings according to this
>    profile MUST prepare each string as specified in Section 2.2 and
>    MUST apply the following rules specified in Section 2.1 in the order
>    shown:
>
>    1. Additional Mapping Rule
>    2. Case Mapping Rule
>    3. Normalization Rule
>    4. Directionality Rule
>
>    The two strings are to be considered equivalent if they are an exact
>    octet-for-octet match (sometimes called "bit-string identity").
>
> ###
>
> In addition, some variation on John's proposed text about toLowerCase
> vs. toCaseFold might be appropriate at the end of Section 4; however,
> I'm still not sure that is necessary if we move the case mapping rule to
> the comparison operation.
>
> Peter
>
> On 10/27/15 8:09 PM, Peter Saint-Andre wrote:
>> On 10/27/15 11:32 AM, John C Klensin wrote:
>>> Response to Monday's note immediately below; response to today's
>>> follows it. My apologies, but it is probably important to read
>>> both. My further apologies for the length of this note, but I
>>> think we are in deep trouble here,
>>
>> Internationalization always seems to be a matter of how deep the trouble
>> is...
>>
>>> trouble that is aggravated by
>>> precis-mappings and precis-nickname both being post-approval and
>>> that, as far as I know, there are no future plans for PRECIS
>>> work (having precis-nickname in AUTH48 just emphasizes that --
>>> see comment at end).
>>
>> We had not planned to work on PRECIS because we thought we were done for
>> awhile. If that's not the case and we need to fix things, then so be it.
>> Whether there is sufficient and continued energy for such work is
>> another question. Personally I don't want us to have broken RFCs out
>> there.
>>
>>> --On Monday, October 26, 2015 15:51 -0600 Peter Saint-Andre -
>>> &yet <peter@andyet.net> wrote:
>>>
>>>> My apologies for the delayed reply. Comments inline.
>>>
>>> A few remarks below... I can't tell whether we disagree or
>>> whether at least one of us, probably me, is not being
>>> adequately clear. (Material on which we fairly clearly agree
>>> elided.)
>>>
>>>
>>>> On 10/1/15 7:50 AM, John C Klensin wrote:
>>>>> --On Wednesday, September 30, 2015 15:16 -0600 Peter
>>>>> Saint-Andre - &yet <peter@andyet.net> wrote:
>>>> ...
>>>>> Peter,
>>>>>
>>>>> While your proposed text is an improvement,
>>>>
>>>> Happy to hear it. All I intended was a slight clarification.
>>>
>>> But I'm not certain we are there yet...
>>
>> Agreed. The text I proposed addressed only a very small part of the
>> problem.
>>
>>>>> the desire of many
>>>>> people for a magic "just tell me what to do" formula, one that
>>>>> lets them avoid understanding the issues, may call for a
>>>>> little more:
>>>>
>>>> There is always a need for more when it comes to i18n.
>>>
>>> But I think it is a little more than that. I've heard several
>>> times, including in PRECIS meetings, requests for "just tell me
>>> what to do and make sure it isn't complicated" (or "I don't want
>>> to have to think about, much less understand, the issues"). We
>>> can debate whether giving in to those requests in the I18n case
>>> is wise. I think it leads directly to conclusions equivalent to
>>> "I understand my own script and writing system (or think I do)
>>> and therefore, since all writing systems must be pretty much the
>>> same, I understand all of the core issues in terms of my script
>>> and understanding". That, in turn, leads directly to the "how
>>> do you spell 'Zürich'?" and "all spellings of 'Zuerich' should
>>> be treated as equivalent" discussions that sounded like they
>>> dominated a BOF at IETF 93.
>>>
>>> Now I actually think it is reasonable for someone to ask for a
>>> library that will do the job most of the time and that will
>>> almost never cause their users or customers to get angry at
>>> them. But, if we are going to call what we do "standards", they
>>> should contain sufficient information that would-be library
>>> authors can know what to do ... or understand that they are in
>>> over their heads. And, for these particular cases, we may need
>>> to explain, or help the library authors explain, why some cases
>>> will fail and, indeed, get users mad at vendors.
>>>
>>>
>>>>> (1) First, toCaseFold is _not_ toLowerCase. Saying "The
>>>>> primary result of doing so is that uppercase characters are
>>>>> mapped to lowercase characters" is true for toCaseFold,
>>>>
>>>> By "primary" I meant two things: (1) lowercasing is what
>>>> happens to the preponderance of code points and (2) this is
>>>> the result that most people care about.
>>>
>>> If I parse the above correctly, I think you are wrong. I think
>>> what most people want, care about, and think they are getting,
>>> is lower case conversion, i.e., an operation that preserves
>>> lower case characters and converts upper case characters to the
>>> equivalent lower case. toCaseFold isn't that operation. It is
>>> a much more complex and subtle operation that, as well as
>>> converting upper case characters to lower case, sometimes
>>> converts lower case characters to different lower case
>>> characters (or strings of them). It also requires a fairly good
>>> understanding of Unicode (not just a relevant script) and
>>> historical Unicode decisions to predict its behavior and to have
>>> any hope of explaining that behavior to users. If one is
>>> trying to compare (as distinct from converting), then toCaseFold
>>> may be exactly what is wanted, but it is really hard to explain
>>> or justify that in terms of "nicknames" or "aliases", which are
>>> about conversion. And, if one hopes to explain what is going
>>> on to users in terms of "lower casing", then toCaseFold is just
>>> the wrong operation. That is what toLowerCase is for and the
>>> two operations are just not equivalent.
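
To make the toLowerCase / toCaseFold distinction above concrete, here is a minimal sketch. It assumes Python's built-in str.lower() and str.casefold() are acceptable stand-ins for the two Unicode operations; the helper name lower_and_fold is made up for illustration and does not come from any PRECIS specification.

```python
# Sketch only: str.lower() and str.casefold() are used as stand-ins for
# Unicode toLowerCase and toCaseFold. lower_and_fold is a hypothetical helper.

def lower_and_fold(text: str) -> tuple[str, str]:
    """Return (lowercased, case-folded) forms of text."""
    return text.lower(), text.casefold()

samples = [
    "StPeter",   # Basic Latin: both operations agree
    "\u03a3",    # GREEK CAPITAL LETTER SIGMA
    "\u03c3",    # GREEK SMALL LETTER SIGMA
    "\u03c2",    # GREEK SMALL LETTER FINAL SIGMA (already lower case)
]

for s in samples:
    lowered, folded = lower_and_fold(s)
    marker = "same" if lowered == folded else "DIFFERENT"
    print(f"{s!r}: lower={lowered!r} fold={folded!r} ({marker})")
```

For the Basic Latin sample the two results agree, which is why ASCII-only examples hide the problem; for final sigma, lower casing leaves the character alone while case folding replaces one lower-case character with another.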
>>
>> My recollection, quite possibly inaccurate or incomplete, from at least
>> one and I think several in-person meetings of the PRECIS WG was: just
>> use Unicode Default Case Folding because if you use anything else or try
>> to roll your own you will be fubar forever. I do not recall any
>> discussion of the issues you have raised in this thread (e.g., about the
>> inadvisability of using case folding for anything but comparison
>> operations) until the last few weeks. However, I freely admit that's
>> probably because, through my own faults and ignorance, I didn't
>> understand what you were saying.
>>
>>> FWIW and purely by coincidence wrt PRECIS and this document, I
>>> had a conversation a few days ago with an expert on Arabic (and
>>> Persian) calligraphy and writing systems (and good general
>>> knowledge of writing systems) who is quite insistent that any
>>> procedure we use for case-insensitive matching (e.g., case
>>> folding) is discriminatory, inconsistent, and just
>>> badly-thought-out if that same procedure doesn't treat isolated,
>>> initial, and medial forms of the same character as equivalent.
>>> He further strengthens his case (sic) by noting that Unicode
>>> case folding maps U+03C2 (GREEK SMALL LETTER FINAL SIGMA,
>>> unambiguously a lower-case character) to U+03C3 (GREEK SMALL
>>> LETTER SIGMA), a relationship that depends entirely on
>>> positional use and not case. He also believes the same
>>> relationships should apply to all other scripts that make form
>>> distinctions for some characters based on positions in a string
>>> and for which Unicode has chosen to assign different code
>>> points. Even if there were wide acceptance of his view, Unicode
>>> stability principles would prevent changing toCaseFold (or
>>> CaseFolding.txt), but this is more evidence that what toCaseFold
>>> does and does not do is going to be hard to explain to either
>>> casual users or to writing system experts whose primary
>>> experience is not with the Greek-Latin-Cyrillic group.
>>>
>>> I don't think we want to say "these matching rules are somewhat
>>> arbitrary and irrational, but, if you don't like it, blame
>>> Unicode and not us", if only because it is our choice to use
>>> those matching rules. More below.
>>>
>>>
>>>> ...
>>>>> (2) Second, probably as a result of having IDNA in the lead,
>>>>> we've gotten sloppy about language and operations and should
>>>>> probably start untangling that before it gets people in
>>>>> trouble.
>>>>
>>>> Where is the right place to do that untangling? (I doubt that
>>>> it is the precis-nickname document.)
>>>
>>> I agree that precis-nickname isn't the ideal place. I also
>>> believe that you and it are the innocent victims of the
>>> situation. At the same time, I don't believe IETF should be
>>> producing incomplete, ambiguous, erroneous, or misleading
>>> standards because no one could get around to doing the right
>>> foundational work.
>>
>> Agreed. I too want to get this right, even though it's not a lot of fun
>> and it's certainly more work than I thought I was signing up for at the
>> NEWPREP BoF years ago.
>>
>>>>> The Unicode Standard, at least as I understand it, is fairly
>>>>> clear that the most important (and really only safe) use of
>>>>> toCaseFold is as part of a comparison operation.
>>>>
>>>> Thanks for noting that. For example, Section 5.18 of Unicode
>>>> 8.0.0 says:
>>>>
>>>>    Caseless matching is implemented using case folding, which is
>>>>    the process of mapping characters of different case to a
>>>>    single form, so that case differences in strings are erased.
>>>>    Case folding allows for fast caseless matches in lookups
>>>>    because only binary comparison is required. It is more than
>>>>    just conversion to lowercase.
>>>
>>> Right. But, again, when its use is appropriate (a very
>>> controversial topic in itself with our painful IDNA history with
>>> Final Sigma, Eszett and the case-independent versus
>>> position-independent controversy called out above as examples)
>>> that is "matches in lookups" (what I've described elsewhere as
>>> "comparison only"). Not creating or defining nicknames or
>>> aliases. And that _is_ a problem for this document.
>>
>> I'm not convinced that things are as bad as you think. If we say in
>> draft-ietf-precis-nickname that the case mapping rule is to be applied
>> only as part of comparison and not as part of enforcement - which I
>> think is really what we care about (e.g., to prevent spoofing of users
>> in chat rooms) - then I think we might be most of the way there.
>>
>>>>> Using your
>>>>> example it is entirely reasonable to treat, "stpeter" and
>>>>> "StPeter" as equivalent in a comparison operation, but
>>>>> accepting one string and changing it to the other for display
>>>>> may not be a really good idea. While that transformation may
>>>>> be acceptable (although I would be surprised if there were no
>>>>> people who share your surname who could consider "stpeter" or
>>>>> "Stpeter" unacceptable and might even believe that "StPeter"
>>>>> is an unacceptable substitute for "St. Peter"),
>>>>
>>>> I do receive email at stpeter@gmail.com intended for
>>>> st.peter@gmail.com but that's a separate topic...
>>>
>>> One that is relevant because it "works" as a side-effect of a
>>> decision Google has made about mailbox name equivalence, a
>>> decision that, IMO, will sooner or later get someone into a lot
>>> of trouble and, more important, a decision and matching rule
>>> that PRECIS, AFAICT, does not allow and that IDNA unambiguously
>>> forbids.
>>>
>>>>> it also points out the
>>>>> dangers of using Basic Latin script examples to illustrate
>>>>> situations in which even more extended Latin script, much less
>>>>> other scripts, may raise more complex issues. Because IDNA
>>>>> is essentially a workaround because changing the DNS
>>>>> comparison rules was impractical for several reasons, we
>>>>> ended up using toCaseFold to map characters and strings into
>>>>> others in IDNA2003 but PRECIS implementations that do not
>>>>> have the same constraints would, in general, be better off
>>>>> confining the use of toCaseFold, or even toLowerCase, to
>>>>> comparison operations.
>>>>
>>>> Unfortunately we didn't do that in RFC 7564 or RFC 7613. Does
>>>> it make sense for this nickname specification to differ in
>>>> this respect from the published RFCs? Shall we file errata
>>>> against those documents? (This might apply only to RFC 7613,
>>>> which says to apply case folding as part of the enforcement
>>>> process - when exactly to apply case folding is not stipulated
>>>> by RFC 7564.)
>>>
>>> To the extent to which this is a "botched that because the WG
>>> didn't understand the issues well enough" conclusion, it would
>>> be entirely reasonable to generate an updating RFC that repairs
>>> 7613 and/or 7564, even doing so in an addendum to
>>> precis-nickname if that is the only way to do that
>>> expeditiously. Per the above, we really don't want to give
>>> library routine writers bad instructions. As I understand it,
>>> the current position of the RFC Editor and IESG is that
>>> technical specification errors discovered in retrospect or after
>>> people start using a spec are not appropriate topics for errata.
>>> If the WG is not willing to do any of those things, then I
>>> suggest that precis-nickname at least needs to contain a very
>>> clear warning notice about this situation (see my response to
>>> your question 1 below).
>>
>> I think we'll probably need to fix 7613 and 7564. I am hoping we can fix
>> nickname now so that it is less incorrect than the other two. That
>> doesn't necessarily mean we won't need to also further fix nickname
>> later on.
>>
>> Granted, we were supposed to avoid this problem by working on all of the
>> PRECIS specs simultaneously. Clearly we have not avoided the problem, so
>> we need to solve it one way or another. If that means bis for them all,
>> we need to deal with it.
>>
>>>>> (3) Because toCaseFold loses information when used for more
>>>>> than comparison (for comparison, it merely contributes to
>>>>> what some people would consider false positives for matching)
>>>>> involves some controversial decisions and, because of
>>>>> stability requirements, cannot be changed even if the
>>>>> controversies are resolved in other ways, we end up with,
>>>>> e.g.,
>>>>>     toCaseFold ("Nuß") -> "nuss"
>>>>> which is considered an acceptable transformation in some
>>>>> places that identify themselves as speaking/using German and
>>>>> two different unacceptable errors in others. Again, this will
>>>>> almost always be much more serious if the transformation is
>>>>> used to map and replace strings than if it is used to compare
>>>>> (fwiw, that particular example is part of a continuing
>>>>> disagreement between IDNA2008 and, among others, German
>>>>> domain registry authorities on one side and UTC and UTR 46 on
>>>>> the other).
>>>>
>>>> Agreed.
>>>
>>> See "warning notice" comment above and question 1 response below.
>>>
>>>>> (4) If the motivation is really to avoid confusion, the
>>>>> correct confusion-blocking rule for Latin script (but not
>>>>> others) and many languages that use it (but certainly not
>>>>> all) involves moving beyond toCaseFold and treating all
>>>>> "decorated" characters (characters normally represented by
>>>>> glyphs consisting of a Basic Latin character and one or more
>>>>> diacritical or equivalent markings) compare equal to their
>>>>> base characters, e.g., "á" not only matches "Á" but also
>>>>> "a" and "A" and, as an unfortunate side-effect, maybe "À"
>>>>> and "à" as well. This is bad news for languages in which
>>>>> decorated Latin characters are used to represent phonetically
>>>>> and conceptually different characters, not just pronunciation
>>>>> variations. I am not qualified to evaluate "how bad". In
>>>>> addition, extrapolations from this principle about Latin
>>>>> script to unrelated scripts will almost certainly lead to
>>>>> serious errors and/or additional confusion.
>>>>
>>>> I would not be comfortable going that far...
>>>
>>> In case it isn't clear, I would not be either. But it is where
>>> getting sloppy about this stuff could easily take us. It is
>>> worth noting that it also identifies one of the difficulties
>>> with doing a global system to be applied to many types of
>>> applications (like the PRECIS work) and then applying it in user
>>> interface software that end users will expect to be localized to
>>> their assumptions because it has been mapped or translated into
>>> their language (if one normally speaks Upper Slobbovian but has
>>> some familiarity with English, an application interface in
>>> English will probably be expected to be "foreign", odd, and
>>> maybe even inconsistent with whatever expectations exist. But,
>>> if the interface is in Upper Slobbovian, the natural and
>>> reasonable assumption will be the matching should conform to
>>> normal Upper Slobbovian conventions). FWIW, a matching rule
>>> that says:
>>>
>>>    (i) Two instances of a base character with the same
>>>    diacritical mark(s) match.
>>>    (ii) Two instances of a base character with different
>>>    diacritical mark(s) do not match.
>>>    (iii) Two instances of a base character, one with
>>>    diacritical mark(s) and the other without any decoration
>>>    match.
>>>
>>> Is precisely correct and normal behavior for at least one
>>> language that uses Latin script. It is also the normal practice
>>> for at least one Latin script transcription system that is used
>>> by a large fraction of a billion people (maybe more).
>>
>> That is indeed sobering.
>>
>>>>> More on this and Tom's question below...
>>>>>
>>>>>> On 9/29/15 3:28 PM, Tom Worster wrote:
>>>>>>> Peter, Alexey,
>>>>>>>
>>>>>>> I think there is an ambiguity in the specification of case
>>>>>>> mapping in RFC 7613 and draft-ietf-precis-nickname-19.
>>>>>> ...
>>>>>>> But there are 55 code points in Unicode 7.0.0 that change
>>>>>>> under default case folding that are neither uppercase nor
>>>>>>> titlecase characters, 12 of which are Lowercase_Letter. I
>>>>>>> suspect this stems from a confusion between Unicode case
>>>>>>> mapping and case folding.
>>>
>>> In the context of the above, a different way to say the same
>>> thing is that people are looking at toCaseFold and assuming (and
>>> explaining things in terms of) toLowerCase. toCaseFold works
>>> the way it is expected to and those 55 code points are, more or
>>> less, collateral damage to get to a matching algorithm that
>>> favors false positives over false negatives and various edge
>>> cases (including in "edge cases" languages spoken by, and script
>>> variations used by, millions of people).
>>
>> Sadly I suspect that is an accurate description of the current state of
>> affairs (modulo my comment above about PRECIS WG discussions at one or
>> more IETF meetings).
>>
>>>> ...
>>>> After all that, I have 3 questions:
>>>
>>> Personal opinions about answers...
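
Tom Worster's observation, quoted above, about code points that change under default case folding without being uppercase or titlecase can be explored with a short script. This is a rough sketch with several assumptions: it uses Python's str.casefold() (full case folding) and the General_Category values Lu and Lt as the definition of "uppercase" and "titlecase", and it runs against whatever Unicode version the local Python build ships, so the counts it prints need not equal the Unicode 7.0.0 figures cited in the thread. The function name is made up for illustration.

```python
# Sketch only: list code points that change under case folding even though
# their General_Category is neither Lu (uppercase) nor Lt (titlecase).
import sys
import unicodedata

def folded_but_not_upper_or_title() -> list[str]:
    hits = []
    for cp in range(sys.maxunicode + 1):
        ch = chr(cp)
        if unicodedata.category(ch) in ("Lu", "Lt"):
            continue                      # the expected cases; skip them
        if ch.casefold() != ch:
            hits.append(ch)               # changed despite not being Lu/Lt
    return hits

hits = folded_but_not_upper_or_title()
lowercase_hits = [c for c in hits if unicodedata.category(c) == "Ll"]

print("Unicode version:", unicodedata.unidata_version)
print("changed, not Lu/Lt:", len(hits))
print("of which Lowercase_Letter:", len(lowercase_hits))
```

Final sigma (U+03C2) and sharp s (U+00DF) both show up among the Lowercase_Letter results, which matches the examples discussed elsewhere in this thread.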
>>> >>>> (1) Is my proposed text enough of a clarification that we >>>> should make that change before the nickname I-D is published >>>> as an RFC? >>> >>> I think the clarification is an improvement and is important >>> enough to incorporate (I know that is the answer to a slightly >>> different question). >>> >>> However, I think it is inadequate without a serious warning >>> about the situation. >> >> Yes. >> >>> That warning could appear in either this >>> document or RFC 7613 (or 7613bis) with a pointer from the other, >>> but, unless you want to revise 7613 now, this one is handy. >> >> I suspect that we need to revise 7613. I suspect that we might also need >> to revise 7564 (at least with respect to the order in which operations >> are applied, since there has been some confusion among implementers). >> >> Well, we always knew that we would need to revise them. Just not so soon. >> >>> Comment about possible text below. >>> >>>> (2) Should we modify draft-ietf-precis-nickname so that case >>>> folding is applied only as part of comparison and not as part >>>> of enforcement? If so, should we make that change before this >>>> document is published as an RFC? >>> >>> Yes. If something is used for "enforcement", it should be lower >>> casing or something else that can be explained to people who are >>> ordinarily familiar with one or more of the scripts that make >>> case distinctions. >>> >>> However, viewed in the light of this discussion, the whole >>> "enforcement" concept becomes a little dicey, especially if, as >>> I believe but don't have time to verify, the transformations >>> performed by toLowerCase are not a proper subset of those >>> performed by toCaseFold. >> >> My initial thought is that case mapping doesn't belong in the nickname >> enforcement operation at all - only in the comparison operation. >> >>>> (3) Should we update RFC 7613 so that case folding is applied >>>> only as part of comparison and not as part of enforcement? 
>>> >>> I think that is necessary. Following up on the comment above, I >>> would prefer that the current Section 3.2.2 (3) of RFC 7613 >>> either point to Unicode Lower Casing or contain a warning along >>> the lines of that below. >> >> Unlike the nickname profile (which I think can be cleaned up by moving >> the case mapping rule to the comparison operation and continuing to use >> Unicode Default Case Folding), I think you are right that for the >> UsernameCaseMapped profile we probably want Unicode Lower Casing. Thus >> the likely need, sooner rather than later, for 7613bis. >> >>> >>> ---------- >>> >>> --On Tuesday, October 27, 2015 07:52 -0600 Peter Saint-Andre >>> <peter@andyet.net> wrote: >>> >>>> This issue has greater urgency now because >>>> draft-ietf-precis-nickname is now in AUTH48... >>>> >>>> On 10/26/15 3:51 PM, Peter Saint-Andre - &yet wrote: >>>> >>>>> After all that, I have 3 questions: >>>>> >>>>> (1) Is my proposed text enough of a clarification that we >>>>> should make that change before the nickname I-D is published >>>>> as an RFC? >>>> >>>> I think so. >>> >>> See above. >>> >>>>> (2) Should we modify draft-ietf-precis-nickname so that case >>>>> folding is applied only as part of comparison and not as part >>>>> of enforcement? If so, should we make that change before this >>>>> document is published as an RFC? >>>> >>>> Although it seems to be the case that Unicode case folding is >>>> primarily designed for the purpose of matching (i.e., >>>> comparison), >>> >>> "Seems" is a little weak. The Unicode Standard is really quite >>> specific about that. >>> >>>> I have a concern that applying the PRECIS case >>>> mapping rule after applying the normalization and >>>> directionality rules might have unintended consequences that >>>> we haven't had a chance to consider yet. The PRECIS framework >>>> expresses a preference (actually a hard requirement) for >>>> applying the rules in a particular order. 
We made a late >>>> change to the username profiles (RFC 7613), such that width >>>> mapping is applied first (in order to accommodate fullwidth >>>> and halfwidth characters in certain East Asian scripts). >>>> Making a late change to the nickname profile also concerns me, >>>> even though both of these late changes seem reasonable on the >>>> face of it. I will try to find time to think about this >>>> further in the next 24 hours. >>> >>> First, a hint for the consideration process: there is a reason >>> why Unicode now supports a unified case folding and >>> normalization operation. My recollection is that it is not only >>> more efficient to perform both operations at once (rather than >>> looking in one table and then the other), but that there are >>> some order-dependent or priority-dependent cases. >>> >>> The very fact that this issue exists (and is coming up again) >>> this late in the process (7613 published in August, WG winding >>> down and not, e.g., meeting next week) calls at least the PRECIS >>> quality of review and some fairly fundamental model issues into >>> question. I first raised that issue a rather long time ago but >>> have continued to hope that we have an approximation to "good >>> enough" without going back and rethinking everything. >>> >>> The right solution, IMO, is that, if RFC 7613 is to rationalize >>> or explain the operation in terms of converting upper case >>> characters to lower case, then it should be using toLowerCase >>> because that is what the operation does. After a quick look at >>> 7613, amending/updating it to simply convert to lower case would >>> be straightforward (and would not raise the ordering issue >>> called out above). It would presumably require another IETF >>> Last Call, however and I'd hope we would see some serious >>> discussion within the WG (and with UTC) before making the change >>> and about how it is explained. >>> >>> If we are not willing to make a change >> >> I'm willing. 
>> It would, as you note, require some careful thinking and
>> review to make sure that we got it (more) right this time.
>>
>>> that significant and/or
>>> if we conclude that the WG (and perhaps the IETF) have
>>> completely run out of energy for dealing with i18n issues [1],
>>> then I suggest that we introduce some additional text. I've
>>> just spent a half-hour trying to find the AUTH48 copy of
>>> precis-nickname (aka RFC-to-be-7700), but the RFC Editor has
>>> apparently changed naming conventions and the various queue
>>> entry pages all point to the -19 I-D and not the current working
>>> copy so I can't try to match text and insertion point to what is
>>> there already.
>>
>> http://www.rfc-editor.org/authors/rfc7700.txt
>>
>>> The suggestion is a patch (and a hack), not a
>>> good fix but something like it is probably the least drastic
>>> measure that would yield something that doesn't contain
>>> unexplained known defects.
>>>
>>> Rough version of suggested text (possibly to go after your
>>> revised paragraph and following up my comments in my 1 October
>>> note). Some of the terminology needs checking which I can do if
>>> you want to go this route:
>>>
>>> 'Users of this specification should note that the
>>> concept of "lower case conversion" is somewhat elusive
>>> and more dependent on the conventions of different
>>> languages and notation systems that use the same script
>>> than may appear obvious at first glance, especially if
>>> that glance is at Basic Latin characters (i.e., the
>>> ASCII letter repertoire). Unicode provides two
>>> different mapping procedures that produce lower-case
>>> characters, but they have different effects and results
>>> for many characters. The more conservative one,
>>> typically appropriately applicable when lower case forms
>>> are needed, is actual lower-casing (embodied in the
>>> Unicode operation toLowerCase).
>>> A more radical
>>> operation, normally suitable only for string matching in
>>> situations in which it is better to consider uncertain
>>> cases as matching than to treat them as distinct, is
>>> called "Case Folding" (Unicode operation toCaseFold).
>>> While the two operations will often produce the same
>>> results, Case Folding maps some lower case characters
>>> into others and performs other transformations that may
>>> be intuitively reasonable and expected for some users
>>> and quite astonishing (or just wrong) to others. There
>>> may be no practical alternative, especially if the
>>> operations are to be used for mapping or enforcement, to
>>> developers of PRECIS-dependent understanding that the
>>> cases in which the two yield different results require
>>> careful understanding of the relevant user base and its
>>> needs [2].'
>>
>> Thanks.
>>
>> I am not sure if we need something like that if we move case mapping
>> (here, case folding) to the comparison operation only - but something
>> like that might still be appropriate.
>>
>>>>> (3) Should we update RFC 7613 so that case folding is applied
>>>>> only as part of comparison and not as part of enforcement?
>>>>
>>>> That is less urgent so I suggest that we address the nickname
>>>> spec first.
>>>
>>> Unless you (or someone else here) have a plausible plan to
>>> continue and revitalize the WG and assign it that revision work
>>> (and bring everyone actively participating up to the level
>>> needed to easily understand this discussion thread and feel
>>> embarrassed for not spotting the problems), I think we need to
>>> assume that this is our last shot. Absent an active and
>>> committed WG, "do this first" could easily be equivalent to
>>> "don't get around to the other, ever".
>>
>> As mentioned, I don't want to have broken RFCs out there.
>>
>>> I think that the particular set of issues that started this
>>> thread is a known defect in the PRECIS specs, both nickname and
>>> 7613, and that we are obligated to either fix the problems or at
>>> least explain them. The above warning text is an attempt to
>>> explain and identify the problems even if it does not actually
>>> provide a solution. If it were published as part of
>>> precis-nickname, it could include a statement to the effect that
>>> it should also be treated as an update to 7613 or, if the IESG
>>> and RFC Editor would agree in advance to accept, rather than
>>> bury, the thing, I suppose we could publish it in
>>> precis-nickname and create an erratum to 7613 indicating that it
>>> should have included some form of that statement. Neither
>>> option implies a huge amount of work to update 7613. But I
>>> think that making the changes of (2) without doing anything
>>> about (3) makes the two documents inconsistent with each other
>>> and that would be an additional known defect.
>>>
>>> Procedural question: given that precis-nickname is in AUTH48 as
>>> of yesterday and I don't see anything blocking publication next
>>> week if you and Barry sign off on the revised text that the WG
>>> hasn't seen,
>>
>> There is no revised text yet. That's why we're having this discussion.
>>
>>> does someone need to file a pro forma objection/
>>> appeal to block that until this is sorted out and the WG has a
>>> chance to review proposed publication text?
>>
>> I see no reason to invoke the specter of appeals quite yet. Everyone is
>> working in good faith to do the right thing and get this mess cleaned up.
>>
>>> [1] I believe our collective inability to deal with the
>>> within-script character forms that do not normalize to each
>>> other because of language-dependent or other usage factors can
>>> be taken as evidence of having run out of energy,
>>
>> Or in my case simple ignorance of some of the relevant issues and
>> examples.
>> It's not easy to know about all of this.
>>
>>> but it is
>>> probably in the interest of finishing the PRECIS work to try to
>>> treat that as a separate issue.
>>
>> Probably.
>>
>>> [2] Not unlike the reason to differentiate between NFC and NFKC
>>> and understand the effects of each.
>>
>> Another thing that's not easy to grok in fulness.
>>
>> Peter
>>
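
As a rough illustration of the toLowerCase versus toCaseFold distinction drawn in the suggested warning text above, here is a minimal Python sketch. It assumes that Python's built-in str.lower() and str.casefold() are acceptable stand-ins for the Unicode operations toLowerCase and toCaseFold; the thread itself names no programming language. The sample characters and the helper names lower_vs_fold and changed_by_folding are chosen here purely for illustration and do not come from RFC 7613 or the nickname draft.

```python
# Sketch only: str.lower() and str.casefold() are assumed to stand in for
# the Unicode operations toLowerCase and toCaseFold named in the thread.
import unicodedata

# Illustrative sample: characters that are already lower case, yet are
# still rewritten by case folding ("maps some lower case characters into
# others", in the words of the suggested warning text).
SAMPLES = ["\u00df", "\u03c2", "\u017f"]  # sharp s, final sigma, long s


def lower_vs_fold(text):
    """Return the pair (lower-cased text, case-folded text)."""
    return text.lower(), text.casefold()


def changed_by_folding(chars):
    """Return the characters whose case-folded form differs from their lower-cased form."""
    return [ch for ch in chars if ch.lower() != ch.casefold()]


if __name__ == "__main__":
    for ch in SAMPLES:
        lowered, folded = lower_vs_fold(ch)
        # Print code point, character name, and both results side by side.
        print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: "
              f"lower={lowered!r} casefold={folded!r}")
```

For plain ASCII input such as "A" the two operations agree, which is why the difference is easy to miss when testing only with Basic Latin. Whether a given profile should store the folded form or use it only for comparison is exactly the question under discussion in the thread; the sketch takes no position on that.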