[precis] local case mapping
Peter Saint-Andre <stpeter@stpeter.im> Wed, 04 September 2013 21:43 UTC
Message-ID: <5227A979.7050403@stpeter.im>
Date: Wed, 04 Sep 2013 15:43:21 -0600
From: Peter Saint-Andre <stpeter@stpeter.im>
As mentioned, I find Section 2.3 of draft-ietf-precis-mappings hard to understand. This message is a bit long because I was puzzling over things as I wrote it. I hope it's not too rambling.

A relatively minor point: there's a bit of confusion between locale-dependent mapping and context-dependent mapping. For instance:

   Local case mapping is case folding that depends on language and context. For example, the mapping of LATIN CAPITAL LETTER I (U+0049) depends on the language context of the user: if the language is Turkish (or one of several other languages), the character should be mapped into LATIN SMALL LETTER DOTLESS I (U+0131) as this character's lower case equivalent.

The first sentence is confusing because it says that local case mapping depends on language *and* context. I think we mean language *or* context (or, even better, locale or context, since the Unicode standard talks about "locale-dependent case mapping"). Also, the use of "case folding" might not be quite right here, because that term has a very specific meaning (e.g., Section 5.18 of the Unicode standard says things like "The Unicode case folding algorithm is defined to be simpler and more efficient than case mappings.").

In the next sentence, the phrase "language context" will cause even more confusion. :-) I think "language" is enough here. Also I think it would be helpful to add an example of a context-dependent mapping (but see below!).

Thus I suggest:

   Local case mapping depends on locale or context. As an example of locale-dependent mapping, LATIN CAPITAL LETTER I (U+0049) is normally mapped to LATIN SMALL LETTER I (U+0069); however, if the language is Turkish (or one of several other languages), then the character should be mapped to LATIN SMALL LETTER DOTLESS I (U+0131).
   As an example of context-dependent mapping, GREEK CAPITAL LETTER SIGMA (U+03A3) is mapped to GREEK SMALL LETTER SIGMA (U+03C3) if it is followed by another letter, but is mapped to GREEK SMALL LETTER FINAL SIGMA (U+03C2) if it is not followed by another letter.

So that's the first paragraph. :-)

The second paragraph says:

   Local case mapping targets only characters that get two different results to perform just casefolding that is defined in the Casefolding.txt [Casefolding] and perform special casefolding that is defined in the Specialcasing.txt then casefolding, because PRECIS framework have casefolding.

(Nit: The file names are actually CamelCase in the UCD, i.e., CaseFolding.txt and SpecialCasing.txt.)

That's a long sentence. I suggest breaking it up (I started to do that but then bigger questions arose).

As I understand it, "just casefolding" isn't really defined in the CaseFolding.txt file, because four kinds of case folding are specified there: (1) common case folding, (2) full case folding, (3) simple case folding, and (4) Turkic mappings for uppercase I and dotted uppercase I.

It's not clear to me exactly what draft-ietf-precis-mappings is recommending here. Is it saying that local case mapping applies (a) when there is a difference between common case folding and anything else, (b) when there is no common case folding, (c) when there are multiple possible case foldings, or (d) something else?

As examples:

1. Common case folding handles things like LATIN CAPITAL LETTER A (U+0041) to LATIN SMALL LETTER A (U+0061). There are no alternative mappings for this character and it's handled by UnicodeData.txt.

2. Full case folding can result in mapping to multiple characters, such as LATIN CAPITAL LETTER SHARP S (U+1E9E) to LATIN SMALL LETTER S (U+0073) + LATIN SMALL LETTER S (U+0073).

3. Simple case folding always maps to a single character, such as LATIN CAPITAL LETTER SHARP S (U+1E9E) to LATIN SMALL LETTER SHARP S (U+00DF).
   (As we see, there are several possible case foldings for U+1E9E.)

4. LATIN CAPITAL LETTER I commonly maps to LATIN SMALL LETTER I but under Turkic mappings it maps to LATIN SMALL LETTER DOTLESS I.

The UnicodeData.txt file specifies one-to-one case mappings that are independent of language, locale, and context. Thus it doesn't cover:

- full case folding (e.g., in UnicodeData.txt U+1E9E is mapped to U+00DF, not U+0073 U+0073) since that's not one-to-one

- context-specific mappings such as uppercase Greek sigma to Greek final sigma when followed by a space

- locale-specific mappings such as LATIN CAPITAL LETTER I to LATIN SMALL LETTER DOTLESS I in Turkic languages

I think we need to be clear about what we're trying to accomplish here. Are we actually addressing all mappings that are not handled by the UnicodeData.txt file (i.e., everything except "common case folding" as specified by the CaseFolding.txt file)? It seems that way to me (i.e., we need to say something about context-specific mappings for sure, and probably also case folding that isn't one-to-one such as LATIN CAPITAL LETTER SHARP S to ss), but that's not what draft-ietf-precis-mappings says now...

   There are two types casefoldings defined as Unconditional Mappings and Conditional Mappings in the Specialcasing.txt file. Conditional mappings have Language-Insensitive Mappings that target characters whose full case mappings do not depend on language, but do depend on context. Language-Sensitive Mappings that these are characters whose full case mappings depend on language and perhaps also context. Of these mappings, characters with Unconditional Mappings or with Language-Insensitive Mappings in Conditional Mappings target are mapped into same codepoint(s) with just casefolding or special casefolding then casefolding. But characters with Language-Sensitive Mappings in Conditional Mappings targets are mapped into different codepoints.
   Therefore this document defines characters that are a part of characters of Lithuanian(lt), Turkish(tr) and Azerbaijanian(az) that Language-Sensitive Mappings targets as targets for local case mapping.

As I read that text, it says:

1. "Local case mapping" applies only to the Language-Sensitive Mappings (one division of the Conditional Mappings) from SpecialCasing.txt.

2. "Local case mapping" does not apply to the Language-Insensitive Mappings from SpecialCasing.txt (e.g., handling of Greek final sigma, which is context-dependent instead of language-dependent but still one of the Conditional Mappings).

3. "Local case mapping" also does not apply to the Unconditional Mappings from SpecialCasing.txt (which in general are mappings from lowercase to titlecase and uppercase anyway, so we might not care about them).

If we look at the results in Appendix B.1 of draft-ietf-precis-mappings we see that they match the Language-Sensitive Mappings from the SpecialCasing.txt file. So if that's all we're doing here, I wonder why Section 2.3 of our document needs to be so long -- why not just say "apply the Language-Sensitive Mappings in SpecialCasing.txt" and be done with it?

Or, do we also need to specify some of the context-dependent mappings? (The major one is Greek final sigma.) As an example, in the PRECIS nickname spec would uppercase "ΦΙΛΟΣ ΜΟΙ" ("my friend") be case folded for comparison purposes to "φιλος μοι" (with a Greek final sigma, which is correct in Greek) or to "φιλοσ μοι" (with a Greek medial sigma, which is incorrect in Greek)?

And what about full case folding vs. simple case folding (e.g., ẞ = U+1E9E to ss instead of ß = U+00DF)? Do we need to specify one way to handle characters where either full case folding or simple case folding can be applied, so that we have consistency for interop purposes?

Perhaps I'm missing something obvious, but IMHO at least Section 2.3 needs to explain a bit more what we're trying to accomplish, and why.
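To make the three cases discussed above concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions: Python's built-in str.lower() and str.casefold() stand in for "lowercasing with context rules" and "full case folding", and turkic_lower() is a hypothetical helper (not part of any PRECIS draft or library) that covers only the two capital I letters from the Turkish/Azerbaijani entries.

```python
# Illustrative sketch only; nothing here is mandated by draft-ietf-precis-mappings.

# Context-dependent mapping: "ΦΙΛΟΣ ΜΟΙ", written with escapes so the
# code points are unambiguous.
GREEK = "\u03a6\u0399\u039b\u039f\u03a3 \u039c\u039f\u0399"

# str.lower() applies the Final_Sigma context rule, so a word-final
# capital sigma becomes U+03C2 (final sigma).
lowered = GREEK.lower()

# str.casefold() applies full case folding, which folds every sigma
# (capital, medial, and final) to U+03C3.
folded = GREEK.casefold()

# Simple vs. full mapping of LATIN CAPITAL LETTER SHARP S (U+1E9E).
CAPITAL_SHARP_S = "\u1e9e"
simple = CAPITAL_SHARP_S.lower()    # one-to-one mapping: U+00DF
full = CAPITAL_SHARP_S.casefold()   # full case folding: "ss"


def turkic_lower(text: str) -> str:
    """Hypothetical helper: lowercase with the tr/az rules for I and İ only.

    Assumption: the sequence I + COMBINING DOT ABOVE (U+0307) and the
    Lithuanian rules are deliberately ignored to keep the sketch short.
    """
    table = {
        0x0049: 0x0131,  # LATIN CAPITAL LETTER I -> LATIN SMALL LETTER DOTLESS I
        0x0130: 0x0069,  # LATIN CAPITAL LETTER I WITH DOT ABOVE -> LATIN SMALL LETTER I
    }
    # Apply the locale-specific exceptions first, then the default lowercasing.
    return text.translate(table).lower()


if __name__ == "__main__":
    print(lowered, folded, simple, full, turkic_lower("ISTANBUL"))
```

Note that lower() and casefold() disagree on the final sigma here, which is the same interop question raised above for the nickname spec: the two operations give different comparison keys unless the spec says which one applies.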
Peter

--
Peter Saint-Andre
https://stpeter.im/