Re: [precis] Ambiguity in specification of case mapping in RFC 7613 and draft-ietf-precis-nickname

John C Klensin <john-ietf@jck.com> Tue, 27 October 2015 17:32 UTC

Date: Tue, 27 Oct 2015 13:32:04 -0400
From: John C Klensin <john-ietf@jck.com>
To: Peter Saint-Andre - &yet <peter@andyet.net>, Tom Worster <fsb@thefsb.org>, Alexey Melnikov <Alexey.Melnikov@isode.com>
Archived-At: <http://mailarchive.ietf.org/arch/msg/precis/Jua858TgnE6GFGb5XUnPqPK3CmQ>
Cc: precis@ietf.org

Response to Monday's note immediately below; response to today's
follows it.  My apologies, but it is probably important to read
both.  My further apologies for the length of this note, but I
think we are in deep trouble here, trouble that is aggravated by
precis-mappings and precis-nickname both being post-approval and
by the fact that, as far as I know, there are no future plans
for PRECIS work (having precis-nickname in AUTH48 just
emphasizes that -- see comment at end).

--On Monday, October 26, 2015 15:51 -0600 Peter Saint-Andre -
&yet <peter@andyet.net> wrote:

> My apologies for the delayed reply. Comments inline.

A few remarks below... I can't tell whether we disagree or
whether at least one of us, probably me, is not being
adequately clear.  (Material on which we fairly clearly agree
elided.)


> On 10/1/15 7:50 AM, John C Klensin wrote:
>> --On Wednesday, September 30, 2015 15:16 -0600 Peter
>> Saint-Andre - &yet <peter@andyet.net> wrote:
>...
>> Peter,
>> 
>> While your proposed text is an improvement,
> 
> Happy to hear it. All I intended was a slight clarification.

But I'm not certain we are there yet...

>> the desire of many
>> people for a magic "just tell me what to do" formula, one that
>> lets them avoid understanding the issues, may call for a
>> little more:
> 
> There is always a need for more when it comes to i18n.

But I think it is a little more than that.  I've heard several
times, including in PRECIS meetings, requests for "just tell me
what to do and make sure it isn't complicated" (or "I don't want
to have to think about, much less understand, the issues").  We
can debate whether giving in to those requests in the I18n case
is wise.  I think it leads directly to conclusions equivalent to
"I understand my own script and writing system (or think I do)
and therefore, since all writing systems must be pretty much the
same, I understand all of the core issues in terms of my script
and understanding".   That, in turn, leads directly to the "how
do you spell 'Zürich'?" and "all spellings of 'Zuerich' should
be treated as equivalent" discussions that sounded like they
dominated a BOF at IETF 93.

Now I actually think it is reasonable for someone to ask for a
library that will do the job most of the time and that will
almost never cause their users or customers to get angry at
them.  But, if we are going to call what we do "standards", they
should contain sufficient information that would-be library
authors can know what to do ... or understand that they are in
over their heads.  And, for these particular cases, we may need
to explain, or help the library authors explain, why some cases
will fail and, indeed, get users mad at vendors.

 
>> (1) First, toCaseFold is _not_ toLowerCase.  Saying "The
>> primary result of doing so is that uppercase characters are
>> mapped to lowercase characters" is true for toCaseFold,
> 
> By "primary" I meant two things: (1) lowercasing is what
> happens to the preponderance of code points and (2) this is
> the result that most people care about.

If I parse the above correctly, I think you are wrong.   I think
what most people want, care about, and think they are getting,
is lower case conversion, i.e., an operation that preserves
lower case characters and converts upper case characters to the
equivalent lower case.  toCaseFold isn't that operation.  It is
a much more complex and subtle operation that, as well as
converting upper case characters to lower case, sometimes
converts lower case characters to different lower case
characters (or strings of them).  It also requires a fairly good
understanding of Unicode (not just a relevant script) and
historical Unicode decisions to predict its behavior and to have
any hope of explaining that behavior to users.   If one is
trying to compare (as distinct from converting), then toCaseFold
may be exactly what is wanted, but it is really hard to explain
or justify that in terms of "nicknames" or "aliases", which are
about conversion.   And, if one hopes to explain what is going
on to users in terms of "lower casing", then toCaseFold is just
the wrong operation.  That is what toLowerCase is for and the
two operations are just not equivalent.
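
The difference can be seen directly in any environment that
exposes both operations.  The following is a minimal sketch,
assuming Python's str.lower() and str.casefold() as stand-ins
for Unicode toLowerCase and (full) toCaseFold; the helper name
lower_and_fold is made up for illustration:

```python
# Assumption: str.lower() and str.casefold() approximate Unicode
# toLowerCase and full toCaseFold for single characters.

def lower_and_fold(text):
    """Return (lowercased, case-folded) forms of text."""
    return text.lower(), text.casefold()

# An upper case letter: both operations agree.
print(lower_and_fold("A"))

# U+00DF LATIN SMALL LETTER SHARP S is already lower case, so
# lowercasing keeps it, but case folding rewrites it to "ss".
print(lower_and_fold("\u00df"))

# U+017F LATIN SMALL LETTER LONG S is also lower case; case
# folding maps it to a different lower case letter, "s".
print(lower_and_fold("\u017f"))
```

The last two cases are lower case characters that lower case
conversion preserves and case folding changes.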

FWIW and purely by coincidence wrt PRECIS and this document, I
had a conversation a few days ago with an expert on Arabic (and
Persian) calligraphy and writing systems (and good general
knowledge of writing systems) who is quite insistent that any
procedure we use for case-insensitive matching (e.g., case
folding) is discriminatory, inconsistent, and just
badly-thought-out if that same procedure doesn't treat isolated,
initial, and medial forms of the same character as equivalent.
He further strengthens his case (sic) by noting that Unicode
case folding maps U+03C2 (GREEK SMALL LETTER FINAL SIGMA,
unambiguously a lower-case character) to U+03C3 (GREEK SMALL
LETTER SIGMA), a relationship that depends entirely on
positional use and not case.  He also believes the same
relationships should apply to all other scripts that make form
distinctions for some characters based on positions in a string
and for which Unicode has chosen to assign different code
points.  Even if there were wide acceptance of his view, Unicode
stability principles would prevent changing toCaseFold (or
CaseFolding.txt), but this is more evidence that what toCaseFold
does and does not do is going to be hard to explain to either
casual users or to writing system experts whose primary
experience is not with the Greek-Latin-Cyrillic group.  
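
A small sketch of the asymmetry he is pointing at, assuming
Python's str.casefold() as the folding operation and using the
Arabic presentation-form code points for BEH purely as an
illustrative stand-in for "positional forms with separate code
points" (that choice is mine, not his):

```python
# Assumption: str.casefold() stands in for Unicode toCaseFold.

FINAL_SIGMA = "\u03c2"   # GREEK SMALL LETTER FINAL SIGMA
SIGMA = "\u03c3"         # GREEK SMALL LETTER SIGMA
BEH_INITIAL = "\ufe91"   # ARABIC LETTER BEH INITIAL FORM
BEH_MEDIAL = "\ufe92"    # ARABIC LETTER BEH MEDIAL FORM

def caseless_equal(a, b):
    """Compare two strings after case folding only."""
    return a.casefold() == b.casefold()

# Two lower case Greek letters that differ only by position in a
# word are treated as equivalent by case folding...
print(caseless_equal(FINAL_SIGMA, SIGMA))

# ...while two positional forms of one Arabic letter are not.
print(caseless_equal(BEH_INITIAL, BEH_MEDIAL))
```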

I don't think we want to say "these matching rules are somewhat
arbitrary and irrational, but, if you don't like it, blame
Unicode and not us", if only because it is our choice to use
those matching rules.  More below.


>...
>> (2) Second, probably as a result of having IDNA in the lead,
>> we've gotten sloppy about language and operations and should
>> probably start untangling that before it gets people in
>> trouble.
> 
> Where is the right place to do that untangling? (I doubt that
> it is the precis-nickname document.)

I agree that precis-nickname isn't the ideal place.  I also
believe that you and it are the innocent victims of the
situation.  At the same time, I don't believe IETF should be
producing incomplete, ambiguous, erroneous, or misleading
standards because no one could get around to doing the right
foundational work.  

>> The Unicode Standard, at least as I understand it, is fairly
>> clear that the most important (and really only safe) use of
>> toCaseFold is as part of a comparison operation.
> 
> Thanks for noting that. For example, Section 5.18 of Unicode
> 8.0.0 says:
> 
>     Caseless matching is implemented using case folding, which
> is the
>     process of mapping characters of different case to a
> single form, so
>     that case differences in strings are erased. Case folding
> allows for
>     fast caseless matches in lookups because only binary
> comparison is
>     required. It is more than just conversion to lowercase.

Right.  But, again, when its use is appropriate (a very
controversial topic in itself with our painful IDNA history with
Final Sigma, Eszett and the case-independent versus
position-independent controversy called out above as examples)
that is "matches in lookups" (what I've described elsewhere as
"comparison only").  Not creating or defining nicknames or
aliases.  And that _is_ a problem for this document.
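
To make "comparison only" concrete, here is a rough sketch of a
table that folds strings solely to build a lookup key and always
hands back the string as it was originally supplied.  The class
NicknameTable and its methods are hypothetical, it uses Python's
str.casefold() as the folding operation, and it deliberately
omits every other PRECIS step (normalization, width mapping,
and so on):

```python
# Hypothetical illustration only; not a PRECIS implementation.

class NicknameTable:
    """Store nicknames as entered; fold only for matching."""

    def __init__(self):
        self._by_key = {}

    def register(self, nickname):
        key = nickname.casefold()      # used for matching only
        if key in self._by_key:
            raise ValueError("nickname already taken: " + nickname)
        self._by_key[key] = nickname   # original form is preserved

    def lookup(self, candidate):
        """Return the stored form that matches, or None."""
        return self._by_key.get(candidate.casefold())

table = NicknameTable()
table.register("StPeter")
print(table.lookup("stpeter"))    # the stored form, not the folded one
print(table.lookup("st.peter"))   # no match: "." is not a case difference
```

The folded string never leaves the table, so nothing the user
typed is replaced by the output of toCaseFold.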

>> Using your
>> example it is entirely reasonable to treat, "stpeter" and
>> "StPeter" as equivalent in a comparison operation, but
>> accepting one string and changing it to the other for display
>> may not be a really good idea.  While that transformation may
>> be acceptable (although I would be surprised if there were no
>> people who share your surname who could consider "stpeter" or
>> "Stpeter" unacceptable and might even believe that "StPeter"
>> is an unacceptable substitute for "St. Peter"),
> 
> I do receive email at stpeter@gmail.com intended for
> st.peter@gmail.com but that's a separate topic...

One that is relevant because it "works" as a side-effect of a
decision Google has made about mailbox name equivalence, a
decision that, IMO, will sooner or later get someone into a lot
of trouble and, more important, a decision and matching rule
that PRECIS, AFAICT, does not allow and that IDNA unambiguously
forbids.

>> it also points out the
>> dangers of using Basic Latin script examples to illustrate
>> situations in which even more extended Latin script, much less
>> other scripts, may raise more complex issues.    Because IDNA
>> is essentially a workaround because changing the DNS
>> comparison rules was impractical for several reasons, we
>> ended up using toCaseFold to map characters and strings into
>> others in IDNA2003 but PRECIS implementations that do not
>> have the same constraints would, in general, be better off
>> confining the use of toCaseFold, or even toLowerCase, to
>> comparison operations.
> 
> Unfortunately we didn't do that in RFC 7564 or RFC 7613. Does
> it make sense for this nickname specification to differ in
> this respect from the published RFCs? Shall we file errata
> against those documents? (This might apply only to RFC 7613,
> which says to apply case folding as part of the enforcement
> process - when exactly to apply case folding is not stipulated
> by RFC 7564.)

To the extent to which this is a "botched that because the WG
didn't understand the issues well enough" conclusion, it would
be entirely reasonable to generate an updating RFC that repairs
7613 and/or 7564, even doing so in an addendum to
precis-nickname if that is the only way to do that
expeditiously.  Per the above, we really don't want to give
library routine writers bad instructions.  As I understand it,
the current position of the RFC Editor and IESG is that
technical specification errors discovered in retrospect or after
people start using a spec are not appropriate topics for errata.
If the WG is not willing to do any of those things, then I
suggest that precis-nickname at least needs to contain a very
clear warning notice about this situation (see my response to
your question 1 below).

>> (3) Because toCaseFold loses information when used for more
>> than comparison (for comparison, it merely contributes to
>> what some people would consider false positives for matching)
>> involves some controversial decisions and, because of
>> stability requirements, cannot be changed even if the
>> controversies are resolved in other ways, we end up with,
>> e.g.,
>>      toCaseFold ("Nuß") -> "nuss"
>> which is considered an acceptable transformation in some
>> places that identify themselves as speaking/using German and
>> two different unacceptable errors in others.  Again, this will
>> almost always be much more serious if the transformation is
>> used to map and replace strings than if it is used to compare
>> (fwiw, that particular example is part of a continuing
>> disagreement between IDNA2008 and, among others, German
>> domain registry authorities on one side and UTC and UTR 46 on
>> the other).
> 
> Agreed.

See "warning notice" comment above and question 1 response below.
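
The information loss in the "Nuß" example can be sketched as
follows, again assuming Python's str.casefold() as the folding
operation; fold_classes is a made-up helper that groups strings
by their folded form:

```python
# Assumption: str.casefold() stands in for Unicode toCaseFold.

def fold_classes(strings):
    """Group the given strings by their case-folded form."""
    classes = {}
    for s in strings:
        classes.setdefault(s.casefold(), []).append(s)
    return classes

spellings = ["Nu\u00df", "NUSS", "nuss", "Nuss"]
for folded, originals in fold_classes(spellings).items():
    print(folded, "<-", originals)
```

All four spellings land in one class.  For comparison that is a
(possibly false) positive; for mapping and replacement it means
the folded string alone cannot tell anyone which spelling the
user actually supplied.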

>> (4) If the motivation is really to avoid confusion, the
>> correct confusion-blocking rule for Latin script (but not
>> others) and many languages that use it (but certainly not
>> all) involves moving beyond toCaseFold and treating all
>> "decorated" characters (characters normally represented by
>> glyphs consisting of a Basic Latin character and one or more
>> diacritical or equivalent markings) compare equal to their
>> base characters, e.g., "á" not only matches "Á" but also
>> "a" and "A" and, as an unfortunate side-effect, maybe "À"
>> and "à" as well.  This is bad news for languages in which
>> decorated Latin characters are used to represent phonetically
>> and conceptually different characters, not just pronunciation
>> variations.  I am not qualified to evaluate "how bad".   In
>> addition, extrapolations from this principle about Latin
>> script to unrelated scripts will almost certainly lead to
>> serious errors and/or additional confusion.
> 
> I would not be comfortable going that far...

In case it isn't clear, I would not be either.  But it is where
getting sloppy about this stuff could easily take us.  It is
worth noting that it also identifies one of the difficulties
with doing a global system to be applied to many types of
applications (like the PRECIS work) and then applying it in user
interface software that end users will expect to be localized to
their assumptions because it has been mapped or translated into
their language (if one normally speaks Upper Slobbovian but has
some familiarity with English, an application interface in
English will probably be expected to be "foreign", odd, and
maybe even inconsistent with whatever expectations exist.  But,
if the interface is in Upper Slobbovian, the natural and
reasonable assumption will be that the matching should conform
to normal Upper Slobbovian conventions).  FWIW, a matching rule
that says:

 (i) Two instances of a base character with the same
	diacritical mark(s) match.
 (ii) Two instances of a base character with different
	diacritical mark(s) do not match.
 (iii) Two instances of a base character, one with
	diacritical mark(s) and the other without any decoration
	match.

is precisely correct and normal behavior for at least one
language that uses Latin script.  It is also the normal practice
for at least one Latin script transcription system that is used
by a large fraction of a billion people (maybe more).
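
One possible reading of rules (i)-(iii), sketched in Python.
The function names are made up, the use of NFD decomposition and
unicodedata.combining() to separate a base character from its
marks is my assumption about how one might implement it, and
case differences are deliberately left out of the picture:

```python
import unicodedata

def clusters(text):
    """Split text into (base, marks) pairs using NFD."""
    out = []
    for ch in unicodedata.normalize("NFD", text):
        if unicodedata.combining(ch) and out:
            base, marks = out[-1]
            out[-1] = (base, marks + ch)   # attach mark to its base
        else:
            out.append((ch, ""))
    return out

def marks_tolerant_match(a, b):
    """Rules (i)-(iii): same marks match, different marks do not,
    marked versus unmarked matches."""
    ca, cb = clusters(a), clusters(b)
    if len(ca) != len(cb):
        return False
    return all(
        base_a == base_b and (marks_a == marks_b
                              or not marks_a or not marks_b)
        for (base_a, marks_a), (base_b, marks_b) in zip(ca, cb)
    )

print(marks_tolerant_match("\u00e1", "\u00e1"))  # rule (i)
print(marks_tolerant_match("\u00e1", "\u00e0"))  # rule (ii)
print(marks_tolerant_match("\u00e1", "a"))       # rule (iii)
```

Note that the relation is not transitive ("á" matches "a" and
"a" matches "à", but "á" does not match "à"), so it cannot be
implemented by mapping each string to a single canonical form,
which is one more reason to treat it as a comparison rule only.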

>> More on this and Tom's question below...
>> 
>>> On 9/29/15 3:28 PM, Tom Worster wrote:
>>>> Peter, Alexey,
>>>> 
>>>> I think there is an ambiguity in the specification of case
>>>> mapping in RFC 7613 and draft-ietf-precis-nickname-19.
>>> ...
>>>> But there are 55 code points in Unicode 7.0.0 that change
>>>> under default case folding that are neither uppercase nor
>>>> titlecase characters, 12 of which are Lowercase_Letter. I
>>>> suspect this stems from a confusion between Unicode case
>>>> mapping and case folding.

In the context of the above, a different way to say the same
thing is that people are looking at toCaseFold and assuming (and
explaining things in terms of) toLowerCase.  toCaseFold works
the way it is expected to and those 55 code points are, more or
less, collateral damage to get to a matching algorithm that
favors false positives over false negatives and various edge
cases (including in "edge cases" languages spoken by, and script
variations used by, millions of people).
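
Anyone who wants to look at the affected code points can
enumerate them.  A sketch, with the caveat that Python's
str.casefold() performs full case folding against whatever
Unicode version the interpreter bundles, so the count it yields
need not equal the 55 Tom reported for Unicode 7.0.0; the
function name is made up:

```python
import sys
import unicodedata

def folding_changes_outside_upper_title():
    """Code points changed by case folding that are neither
    uppercase (Lu) nor titlecase (Lt) letters."""
    changed = []
    for cp in range(sys.maxunicode + 1):
        ch = chr(cp)
        if ch.casefold() != ch and \
                unicodedata.category(ch) not in ("Lu", "Lt"):
            changed.append(ch)
    return changed

affected = folding_changes_outside_upper_title()
lowercase_letters = [c for c in affected
                     if unicodedata.category(c) == "Ll"]
print(len(affected), "code points,",
      len(lowercase_letters), "of them Lowercase_Letter")
for ch in lowercase_letters[:5]:
    print("U+%04X" % ord(ch), unicodedata.name(ch, "?"))
```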

>...
> After all that, I have 3 questions:

Personal opinions about answers...

> (1) Is my proposed text enough of a clarification that we
> should make that change before the nickname I-D is published
> as an RFC?

I think the clarification is an improvement and is important
enough to incorporate (I know that is the answer to a slightly
different question).

However, I think it is inadequate without a serious warning
about the situation.   That warning could appear in either this
document or RFC 7613 (or 7613bis) with a pointer from the other,
but, unless you want to revise 7613 now, this one is handy.
Comment about possible text below.

> (2) Should we modify draft-ietf-precis-nickname so that case
> folding is applied only as part of comparison and not as part
> of enforcement? If so, should we make that change before this
> document is published as an RFC?

Yes.  If something is used for "enforcement", it should be lower
casing or something else that can be explained to people who are
ordinarily familiar with one or more of the scripts that make
case distinctions.

However, viewed in the light of this discussion, the whole
"enforcement" concept becomes a little dicey, especially if, as
I believe but don't have time to verify, the transformations
performed by toLowerCase are not a proper subset of those
performed by toCaseFold.
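
One way to check that belief is to look for an upper case
character whose lower case mapping and case folding differ.  A
sketch, assuming Python's str.lower() and str.casefold() with a
bundled Unicode version of 8.0 or later (where Cherokee gained
lower case letters but continues to fold to the older upper
case ones); lower_vs_fold is a made-up helper:

```python
# Assumption: interpreter bundles Unicode 8.0 or later.

CHEROKEE_CAPITAL_A = "\u13a0"   # CHEROKEE LETTER A (uppercase)
CHEROKEE_SMALL_A = "\uab70"     # CHEROKEE SMALL LETTER A

def lower_vs_fold(ch):
    """Return (toLowerCase result, toCaseFold result) for ch."""
    return ch.lower(), ch.casefold()

# Lowercasing maps the capital to the small letter, while case
# folding leaves the capital alone...
print([hex(ord(c)) for c in lower_vs_fold(CHEROKEE_CAPITAL_A)])

# ...and maps the small letter *up* to the capital.
print([hex(ord(c)) for c in lower_vs_fold(CHEROKEE_SMALL_A)])
```

So here is at least one script in which toLowerCase changes a
character that toCaseFold does not, and the two operations move
in opposite directions.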

> (3) Should we update RFC 7613 so that case folding is applied
> only as part of comparison and not as part of enforcement?

I think that is necessary.  Following up on the comment above, I
would prefer that the current Section 3.2.2 (3) of RFC 7613
either point to Unicode Lower Casing or contain a warning along
the lines of that below.

   ----------

--On Tuesday, October 27, 2015 07:52 -0600 Peter Saint-Andre
<peter@andyet.net> wrote:

> This issue has greater urgency now because
> draft-ietf-precis-nickname is now in AUTH48...
> 
> On 10/26/15 3:51 PM, Peter Saint-Andre - &yet wrote:
> 
>> After all that, I have 3 questions:
>> 
>> (1) Is my proposed text enough of a clarification that we
>> should make that change before the nickname I-D is published
>> as an RFC?
> 
> I think so.

See above.

>> (2) Should we modify draft-ietf-precis-nickname so that case
>> folding is applied only as part of comparison and not as part
>> of enforcement? If so, should we make that change before this
>> document is published as an RFC?
> 
> Although it seems to be the case that Unicode case folding is
> primarily designed for the purpose of matching (i.e.,
> comparison),

"Seems" is a little weak.  The Unicode Standard is really quite
specific about that.

> I have a concern that applying the PRECIS case
> mapping rule after applying the normalization and
> directionality rules might have unintended consequences that
> we haven't had a chance to consider yet. The PRECIS framework
> expresses a preference (actually a hard requirement) for
> applying the rules in a particular order. We made a late
> change to the username profiles (RFC 7613), such that width
> mapping is applied first (in order to accommodate fullwidth
> and halfwidth characters in certain East Asian scripts).
> Making a late change to the nickname profile also concerns me,
> even though both of these late changes seem reasonable on the
> face of it. I will try to find time to think about this
> further in the next 24 hours.

First, a hint for the consideration process: there is a reason
why Unicode now supports a unified case folding and
normalization operation.  My recollection is that it is not only
more efficient to perform both operations at once (rather than
looking in one table and then the other), but that there are
some order-dependent or priority-dependent cases.
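
A sketch of one such order-dependent case, assuming Python's
unicodedata.normalize() and str.casefold() as the two operations
and using U+0345 COMBINING GREEK YPOGEGRAMMENI, which case
folding turns into a base letter (iota); the helper names are
made up:

```python
import unicodedata

def fold_then_normalize(text):
    return unicodedata.normalize("NFC", text.casefold())

def normalize_then_fold(text):
    return unicodedata.normalize("NFC",
                                 unicodedata.normalize("NFC", text)
                                 .casefold())

# alpha + combining ypogegrammeni + combining acute accent
sample = "\u03b1\u0345\u0301"

a = fold_then_normalize(sample)
b = normalize_then_fold(sample)
print([hex(ord(c)) for c in a])
print([hex(ord(c)) for c in b])
print(a == b)
```

Folding first turns the ypogegrammeni into an iota before the
marks are reordered, so the accent ends up attached to the iota;
normalizing first reorders the marks and the accent stays with
the alpha.  The two pipelines therefore disagree on this input.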

The very fact that this issue exists (and is coming up again)
this late in the process (7613 published in August, WG winding
down and not, e.g., meeting next week) calls at least the
quality of PRECIS review and some fairly fundamental model
issues into question.  I first raised that issue a rather long
time ago but
have continued to hope that we have an approximation to "good
enough" without going back and rethinking everything.  

The right solution, IMO, is that, if RFC 7613 is to rationalize
or explain the operation in terms of converting upper case
characters to lower case, then it should be using toLowerCase
because that is what the operation does.  After a quick look at
7613, amending/updating it to simply convert to lower case would
be straightforward (and would not raise the ordering issue
called out above).  It would presumably require another IETF
Last Call, however, and I'd hope we would see some serious
discussion within the WG (and with UTC) before making the change
and about how it is explained.

If we are not willing to make a change that significant and/or
if we conclude that the WG (and perhaps the IETF) have
completely run out of energy for dealing with i18n issues [1],
then I suggest that we introduce some additional text.  I've
just spent a half-hour trying to find the AUTH48 copy of
precis-nickname (aka RFC-to-be-7700), but the RFC Editor has
apparently changed naming conventions and the various queue
entry pages all point to the -19 I-D and not the current working
copy so I can't try to match text and insertion point to what is
there already.   The suggestion is a patch (and a hack), not a
good fix but something like it is probably the least drastic
measure that would yield something that doesn't contain
unexplained known defects.

Rough version of suggested text (possibly to go after your
revised paragraph and following up my comments in my 1 October
note).  Some of the terminology needs checking which I can do if
you want to go this route:

	'Users of this specification should note that the
	concept of "lower case conversion" is somewhat elusive
	and more dependent on the conventions of different
	languages and notation systems that use the same script
	than may appear obvious at first glance, especially if
	that glance is at Basic Latin characters (i.e., the
	ASCII letter repertoire).  Unicode provides two
	different mapping procedures that produce lower-case
	characters, but they have different effects and results
	for many characters.  The more conservative one,
	typically appropriately applicable when lower case forms
	are needed, is actual lower-casing (embodied in the
	Unicode operation toLowerCase).  A more radical
	operation, normally suitable only for string matching in
	situations in which it is better to consider uncertain
	cases as matching than to treat them as distinct, is
	called "Case Folding" (Unicode operation toCaseFold).
	While the two operations will often produce the same
	results, Case Folding maps some lower case characters
	into others and performs other transformations that may
	be intuitively reasonable and expected for some users
	and quite astonishing (or just wrong) to others.  There
	may be no practical alternative, especially if the
	operations are to be used for mapping or enforcement, to
	developers of PRECIS-dependent software understanding
	that the cases in which the two yield different results
	require careful understanding of the relevant user base
	and its needs [2].'

>> (3) Should we update RFC 7613 so that case folding is applied
>> only as part of comparison and not as part of enforcement?
> 
> That is less urgent so I suggest that we address the nickname
> spec first.

Unless you (or someone else here) have a plausible plan to
continue and revitalize the WG and assign it that revision work
(and bring everyone actively participating up to the level
needed to easily understand this discussion thread and feel
embarrassed for not spotting the problems), I think we need to
assume that this is our last shot.  Absent an active and
committed WG, "do this first" could easily be equivalent to
"don't get around to the other, ever".

I think that the particular set of issues that started this
thread is a known defect in the PRECIS specs, both nickname and
7613, and that we are obligated to either fix the problems or at
least explain them.  The above warning text is an attempt to
explain and identify the problems even if it does not actually
provide a solution.  If it were published as part of
precis-nickname, it could include a statement to the effect that
it should also be treated as an update to 7613 or, if the IESG
and RFC Editor would agree in advance to accept, rather than
bury, the thing, I suppose we could publish it in
precis-nickname and create an erratum to 7613 indicating that it
should have included some form of that statement.  Neither
option implies a huge amount of work to update 7613.  But I
think that making the changes of (2) without doing anything
about (3) makes the two documents inconsistent with each other
and that would be an additional known defect.

Procedural question: given that precis-nickname is in AUTH48 as
of yesterday and I don't see anything blocking publication next
week if you and Barry sign off on the revised text that the WG
hasn't seen, does someone need to file a pro forma objection/
appeal to block that until this is sorted out and the WG has a
chance to review proposed publication text?

    best,
     john






[1] I believe our collective inability to deal with the
within-script character forms that do not normalize to each
other because of language-dependent or other usage factors can
be taken as evidence of having run out of energy, but it is
probably in the interest of finishing the PRECIS work to try to
treat that as a separate issue.

[2] Not unlike the reason to differentiate between NFC and NFKC
and understand the effects of each.