[precis] local case mapping

Peter Saint-Andre <stpeter@stpeter.im> Wed, 04 September 2013 21:43 UTC


As mentioned, I find Section 2.3 of draft-ietf-precis-mappings hard to
understand. This message is a bit long because I was puzzling over
things as I wrote it. I hope it's not too rambling.

A relatively minor point: there's a bit of confusion between
locale-dependent mapping and context-dependent mapping. For instance:

   Local case mapping is case folding that depends on language and
   context.  For example, the mapping of LATIN CAPITAL LETTER I (U+0049)
   depends on the language context of the user: if the language is
   Turkish (or one of several other languages), the character should be
   mapped into LATIN SMALL LETTER DOTLESS I (U+0131) as this character's
   lower case equivalent.

The first sentence is confusing because it says that local case mapping
depends on language *and* context. I think we mean language *or* context
(or, even better, locale or context, since the Unicode standard talks
about "locale-dependent case mapping"). Also, the use of "case folding"
might not be quite right here, because that term has a very specific
meaning (e.g., Section 5.18 of the Unicode standard says things like
"The Unicode case folding algorithm is defined to be simpler and more
efficient than case mappings."). In the next sentence, the phrase
"language context" will cause even more confusion. :-) I think
"language" is enough here. Also I think it would be helpful to add an
example of a context-dependent mapping (but see below!).

Thus I suggest:

   Local case mapping depends on locale or context.  As an example of
   locale-dependent mapping, LATIN CAPITAL LETTER I (U+0049) is
   normally mapped to LATIN SMALL LETTER I (U+0069); however, if the
   language is Turkish (or one of several other languages), then the
   character should be mapped to LATIN SMALL LETTER DOTLESS I (U+0131).
   As an example of context-dependent mapping, GREEK CAPITAL LETTER
   SIGMA (U+03A3) is mapped to GREEK SMALL LETTER SIGMA (U+03C3) if it
   is followed by another letter, but is mapped to GREEK SMALL LETTER
   FINAL SIGMA (U+03C2) if it is not followed by another letter.

So that's the first paragraph. :-)
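
To make the two kinds of mapping concrete, here is a rough Python sketch.
It assumes CPython 3.3 or later, whose str.lower() is locale-independent
but does apply the Final_Sigma context rule. The helper
lower_for_language() and the TURKIC_LANGS set are hypothetical names made
up for this illustration; they cover only the two Turkic "I" characters
and ignore the dot-related conditions in SpecialCasing.txt.

```python
# Hypothetical helper, for illustration only: str.lower() knows nothing
# about the user's language, so the Turkic rule is applied by hand first.
TURKIC_LANGS = {"tr", "az"}


def lower_for_language(text: str, lang: str) -> str:
    """Lowercase text, applying the Turkic dotted/dotless I rule if needed."""
    if lang in TURKIC_LANGS:
        # U+0130 (capital I with dot above) -> U+0069, then
        # U+0049 (capital I) -> U+0131 (dotless i).
        text = text.replace("\u0130", "\u0069").replace("\u0049", "\u0131")
    return text.lower()


# Locale-dependent mapping: the same input, two different results.
default_i = lower_for_language("I", "en")   # U+0069
turkish_i = lower_for_language("I", "tr")   # U+0131

# Context-dependent mapping: capital sigma at the end of a word becomes
# final sigma (U+03C2); at the start of a word it becomes U+03C3.
word_final = "\u039f\u0394\u039f\u03a3".lower()
word_initial = "\u03a3\u039f\u03a6\u0399\u0391".lower()
```

Note that the locale has to be supplied by the caller, whereas the sigma
context can be read off the string itself.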

The second paragraph says:

   Local case mapping targets only characters that
   get two different results to perform just casefolding that is defined
   in the Casefolding.txt [Casefolding] and perform special casefolding
   that is defined in the Specialcasing.txt then casefolding, because
   PRECIS framework have casefolding.

(Nit: The file names are actually CamelCase in the UCD, i.e.,
CaseFolding.txt and SpecialCasing.txt.)

That's a long sentence. I suggest breaking it up (I started to do that
but then bigger questions arose).

As I understand it, "just casefolding" isn't really defined in the
CaseFolding.txt file, because four kinds of casefolding are specified
there: (1) common case folding, (2) full case folding, (3) simple case
folding, and (4) Turkic mappings for uppercase I and dotted uppercase I.

It's not clear to me exactly what draft-ietf-precis-mappings is
recommending here. Is it saying that local case mapping applies (a) when
there is a difference between common case folding and anything else, (b)
when there is no common case folding, (c) when there are multiple
possible case foldings, or (d) something else?

As examples:

1. Common case folding handles things like LATIN CAPITAL LETTER A
(U+0041) to LATIN SMALL LETTER A (U+0061). There are no alternative
mappings for this character and it's handled by UnicodeData.txt.

2. Full case folding can result in mapping to multiple characters, such
as LATIN CAPITAL LETTER SHARP S (U+1E9E) to LATIN SMALL LETTER S
(U+0073) + LATIN SMALL LETTER S (U+0073).

3. Simple case folding always maps to a single character, such as LATIN
CAPITAL LETTER SHARP S (U+1E9E) to LATIN SMALL LETTER SHARP S (U+00DF).

(As we see, there are several possible case foldings for U+1E9E.)

4. LATIN CAPITAL LETTER I commonly maps to LATIN SMALL LETTER I but
under Turkic mappings it maps to LATIN SMALL LETTER DOTLESS I.
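
The two foldings of U+1E9E in examples 2 and 3 can be observed from
Python, as a rough sketch. It assumes CPython's str.casefold(), which is
documented to use full case folding. Python has no built-in for simple
case folding, so str.lower() is used as a stand-in here; for this
particular character the lowercase mapping and the simple folding happen
to coincide. The codepoints() helper is just a made-up name for display.

```python
CAPITAL_SHARP_S = "\u1e9e"


def codepoints(s: str) -> str:
    """Render a string as a space-separated list of U+XXXX values."""
    return " ".join(f"U+{ord(c):04X}" for c in s)


# Full case folding: one character may expand to several.
full_fold = CAPITAL_SHARP_S.casefold()

# One-to-one lowercase mapping (stand-in for simple case folding here).
one_to_one = CAPITAL_SHARP_S.lower()

print(codepoints(full_fold))    # two code points
print(codepoints(one_to_one))   # a single code point

# The interop consequence: two implementations agree on whether
# "STRA\u1e9eE" matches "strasse" only if both use the same folding.
matches_under_full = "STRA\u1e9eE".casefold() == "strasse".casefold()
matches_under_lower = "STRA\u1e9eE".lower() == "strasse".lower()
```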

The UnicodeData.txt file specifies one-to-one case mappings that are
independent of language, locale, and context. Thus it doesn't cover:

- full case folding (e.g., in UnicodeData.txt U+1E9E is mapped to
U+00DF, not U+0073 U+0073) since that's not one-to-one

- context-specific mappings such as uppercase Greek sigma to Greek final
sigma when followed by a space

- locale-specific mappings such as LATIN CAPITAL LETTER I to LATIN SMALL
LETTER DOTLESS I in Turkic languages

I think we need to be clear about what we're trying to accomplish here.
Are we actually addressing all mappings that are not handled by the
UnicodeData.txt file (i.e., everything except "common case folding" as
specified by the CaseFolding.txt file)? It seems that way to me (i.e.,
we need to say something about context-specific mappings for sure, and
probably also case folding that isn't one-to-one such as LATIN CAPITAL
LETTER SHARP S to ss), but that's not what draft-ietf-precis-mappings
says now...

   There are two types casefoldings defined as Unconditional Mappings
   and Conditional Mappings in the Specialcasing.txt file.  Conditional
   mappings have Language-Insensitive Mappings that target characters
   whose full case mappings do not depend on language, but do depend on
   context.  Language-Sensitive Mappings that these are characters whose
   full case mappings depend on language and perhaps also context.

   Of these mappings, characters with Unconditional Mappings or with
   Language-Insensitive Mappings in Conditional Mappings target are
   mapped into same codepoint(s) with just casefolding or special
   casefolding then casefolding.  But characters with Language-Sensitive
   Mappings in Conditional Mappings targets are mapped into different
   codepoints.  Therefore this document defines characters that are a
   part of characters of Lithuanian(lt), Turkish(tr) and
   Azerbaijanian(az) that Language-Sensitive Mappings targets as targets
   for local case mapping.

As I read that text, it says:

1. "Local case mapping" applies only to the Language-Sensitive Mappings
(one division of the Conditional Mappings) from SpecialCasing.txt

2. "Local case mapping" does not apply to the Language-Insensitive
Mappings from SpecialCasing.txt (e.g., handling of Greek Final Sigma,
which is context-dependent instead of language-dependent but still one
of the Conditional Mappings)

3. "Local case mapping" also does not apply to the Unconditional
Mappings from SpecialCasing.txt (which in general are mappings from
lowercase to titlecase and uppercase anyway, so we might not care about
them)

If we look at the results in Appendix B.1 of draft-ietf-precis-mappings
we see that they match the Language-Sensitive Mappings from the
SpecialCasing.txt file. So if that's all we're doing here, I wonder why
Section 2.3 of our document needs to be so long -- why not just say
"apply the Language-Sensitive Mappings in SpecialCasing.txt" and be done
with it?
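
For what it's worth, picking out the Language-Sensitive Mappings is
mechanical. The sketch below is only an illustration: the SAMPLE lines
follow the SpecialCasing.txt field layout (code; lower; title; upper;
optional condition list; comment) but are typed in by hand and should be
checked against the real UCD file, and the function names are made up.
It assumes that a lowercase token in the condition list is a language ID
and that any other token is a context condition.

```python
# Hand-typed sample in SpecialCasing.txt layout; verify against the UCD.
SAMPLE = """\
00DF; 00DF; 0053 0073; 0053 0053; # LATIN SMALL LETTER SHARP S
03A3; 03C2; 03A3; 03A3; Final_Sigma; # GREEK CAPITAL LETTER SIGMA
0049; 0131; 0049; 0049; tr Not_Before_Dot; # LATIN CAPITAL LETTER I
0130; 0069; 0130; 0130; az; # LATIN CAPITAL LETTER I WITH DOT ABOVE
"""


def classify(line: str) -> tuple[str, str]:
    """Return (code point, category) for one data line."""
    data = line.split("#", 1)[0]                 # drop the trailing comment
    fields = [f.strip() for f in data.split(";")]
    # fields[4] is the condition list, or empty for unconditional mappings.
    conditions = fields[4].split() if len(fields) > 4 else []
    if not conditions:
        return fields[0], "unconditional"
    if any(tok.islower() for tok in conditions):  # e.g. "tr", "az", "lt"
        return fields[0], "language-sensitive"
    return fields[0], "language-insensitive"


def language_sensitive_codes(text: str) -> list[str]:
    """Collect the code points that have a language-sensitive mapping."""
    codes = set()
    for line in text.splitlines():
        if not line.strip() or line.lstrip().startswith("#"):
            continue                              # skip blanks and comments
        code, category = classify(line)
        if category == "language-sensitive":
            codes.add(code)
    return sorted(codes)


targets = language_sensitive_codes(SAMPLE)
```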

Or, do we also need to specify some of the context-dependent
mappings? (The major one is Greek final sigma.) As an example, in the
PRECIS nickname spec would uppercase "ΦΙΛΟΣ ΜΟΙ" ("my friend") be case
folded for comparison purposes to "φιλος μοι" (with a Greek final sigma,
which is correct in Greek) or to "φιλοσ μοι" (with a Greek medial sigma,
which is incorrect in Greek)?
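
As a data point (not a recommendation), here is how one existing
implementation answers that question. The sketch assumes CPython 3.3 or
later, where str.lower() applies the Final_Sigma rule and str.casefold()
applies full case folding from CaseFolding.txt; other libraries may
behave differently.

```python
nickname = "ΦΙΛΟΣ ΜΟΙ"

# Lowercasing is context-sensitive: the sigma at the end of the first
# word comes out as final sigma (U+03C2).
lowered = nickname.lower()

# Case folding is context-free: every sigma, including an existing
# final sigma, is folded to U+03C3.
folded = nickname.casefold()

# For comparison purposes the two spellings still match once both
# sides have been case folded.
same_after_folding = lowered.casefold() == folded
```

So with folding the "incorrect" medial form is what gets compared, but
it is never shown to the user; with lowercasing the result reads
correctly but depends on context.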

And what about full case folding vs. simple case folding (e.g., ẞ =
U+1E9E to ss instead of ß = U+00DF)? Do we need to specify one way to
handle characters where either full case folding or simple case folding
can be applied, so that we have consistency for interop purposes?

Perhaps I'm missing something obvious, but IMHO at least Section 2.3
needs to explain a bit more what we're trying to accomplish, and why.

Peter

-- 
Peter Saint-Andre
https://stpeter.im/