Re: [precis] local case mapping

Takahiro Nemoto <t.nemo10@kmd.keio.ac.jp> Tue, 17 September 2013 09:56 UTC

Hi, Peter-san and Alexey-san

Thank you for your thorough review and helpful comments.
Let me respond to the concerns that you raised in the previous email.

I'll try to simplify the mappings document that I wrote according to your suggestion.

Local Case Mapping targets special characters whose case mapping depends on the language.
Also, the term "Case Folding" that I used in the mappings document refers 
to the "Common Case Folding" that is mentioned in the CaseFolding.txt file.

The targets of this "Local Case Mapping" are the following characters, which are described
in the Language-Sensitive Mappings section of the SpecialCasing.txt file.
(Language-Sensitive Mappings are characters whose full case mappings 
depend on language and perhaps also context.)

For Turkish and Azeri:
<Language>; <Codepoint>; <Lowercase>; <Comments>
<Language> is the alpha-2 code in [ISO.3166-1].
tr; 0307; ; COMBINING DOT ABOVE
tr; 0130; 0069; LATIN CAPITAL LETTER I WITH DOT ABOVE
tr; 0049; 0131; LATIN CAPITAL LETTER I
az; 0307; ; COMBINING DOT ABOVE
az; 0130; 0069; LATIN CAPITAL LETTER I WITH DOT ABOVE
az; 0049; 0131; LATIN CAPITAL LETTER I
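
As a rough illustration of the table above, here is a minimal Python sketch that applies the tr/az entries before any further case folding. The function name local_case_map and the constants are hypothetical and do not come from the draft. The context conditions that SpecialCasing.txt attaches to these entries are simplified: the combining dot is only dropped when it immediately follows a capital I.

```python
TURKIC_LANGS = {"tr", "az"}  # languages covered by the entries listed above
DOT_ABOVE = "\u0307"         # COMBINING DOT ABOVE


def local_case_map(text: str, lang: str) -> str:
    """Apply the tr/az entries; text in any other language passes through."""
    if lang not in TURKIC_LANGS:
        return text
    out = []
    i = 0
    while i < len(text):
        ch = text[i]
        if ch == "\u0130":
            # LATIN CAPITAL LETTER I WITH DOT ABOVE -> i
            out.append("\u0069")
        elif ch == "\u0049":
            if text[i + 1:i + 2] == DOT_ABOVE:
                # I + COMBINING DOT ABOVE -> i (the dot is removed)
                out.append("\u0069")
                i += 1
            else:
                # LATIN CAPITAL LETTER I -> dotless i
                out.append("\u0131")
        else:
            out.append(ch)
        i += 1
    return "".join(out)


# Show the code points after local case mapping followed by case folding.
for lang in ("tr", "en"):
    folded = local_case_map("I\u0130", lang).casefold()
    print(lang, [f"{ord(c):04X}" for c in folded])
```

The point of running this step first is that the later, language-independent case folding never sees the capital I or the dotted capital I for Turkish and Azeri text.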

For Lithuanian, see the open issue described later in this message.

In CaseFolding.txt, "Final Sigma" (as shown below) case-folds to the lowercase sigma
under Common Case Folding. So even if an uppercase sigma at the end of a word were
converted to "Final Sigma" by Local Case Mapping, the subsequent case folding would turn
it into the lowercase sigma anyway, and the conversion would have no effect. Because of
this, I would not currently include Final Sigma as a target for Local Case Mapping.

In CaseFolding.txt:
03C2; C; 03C3; # GREEK SMALL LETTER FINAL SIGMA
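
This reasoning can be checked with a short Python sketch, assuming that str.casefold() follows the Unicode case folding data (the variable names are only for illustration). It compares the folded form of an uppercase word with the folded form of the same word already lowercased with a final sigma.

```python
CAPITAL_SIGMA = "\u03a3"
SMALL_SIGMA = "\u03c3"
FINAL_SIGMA = "\u03c2"

# Uppercase Greek word ending in capital sigma.
word_upper = "\u03a6\u0399\u039b\u039f" + CAPITAL_SIGMA
# The same word as a context-aware lowercasing would write it (final sigma).
word_final = "\u03c6\u03b9\u03bb\u03bf" + FINAL_SIGMA

folded_upper = word_upper.casefold()
folded_final = word_final.casefold()

# Both forms fold to the same string, so a prior mapping to final sigma is lost.
print(folded_upper == folded_final)
print(FINAL_SIGMA in folded_final)
```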

By the way, I investigated the Lithuanian mappings described in the SpecialCasing.txt file
and found some problems. As an open issue, I would like to ask whether Lithuanian should be
considered a target for local case mapping.

Lithuanian has some characters that are not assigned a single code point in Unicode.
For this reason, when an uppercase character with an accent (as listed in SpecialCasing.txt)
is mapped to lowercase, the character used in Italian is substituted for the formal
Lithuanian form.

For example, the character Ì (00CC: capital I with a grave accent) is used in both Lithuanian and Italian.
In Italian, its lowercase form is ì (00EC: grave accent on a dotless i),
whereas in Lithuanian it is i̇̀ (0069 0307 0300: grave accent on a dotted i), so the two are distinct.
However, in the input systems of PCs and smartphones set up for Lithuanian,
the Italian character ì (00EC) is used in that role.

The entries for 00CC in CaseFolding.txt and SpecialCasing.txt differ as follows:
CaseFolding.txt -> 00CC; C; 00EC; # LATIN CAPITAL LETTER I WITH GRAVE
SpecialCasing.txt -> 00CC; 0069 0307 0300; LATIN CAPITAL LETTER I WITH GRAVE

Consider comparing two Lithuanian strings, one containing 00CC and the other 00EC.
When local case mapping under the current definition is applied to them,
followed by case mapping, they become different strings.
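
The mismatch can be reproduced with a small Python sketch. The table LT_LOCAL_MAP and the helper names are hypothetical, the table holds only the single 00CC entry quoted above, and str.casefold() stands in for the later case mapping step.

```python
# Only the 00CC entry quoted from SpecialCasing.txt; not the full Lithuanian set.
LT_LOCAL_MAP = {"\u00cc": "\u0069\u0307\u0300"}


def lt_local_case_map(text: str) -> str:
    """Replace each character that has a Lithuanian local mapping."""
    return "".join(LT_LOCAL_MAP.get(ch, ch) for ch in text)


def compare_key(text: str) -> str:
    """Local case mapping first, then language-independent case folding."""
    return lt_local_case_map(text).casefold()


def code_points(text: str) -> list:
    return [f"{ord(ch):04X}" for ch in text]


key_from_upper = compare_key("\u00cc")  # starts from capital I with grave
key_from_lower = compare_key("\u00ec")  # starts from the precomposed lowercase

print(code_points(key_from_upper))
print(code_points(key_from_lower))
print(key_from_upper == key_from_lower)
```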

So I'd like to hear your comments on this.

I have two proposals for this problem.
1)
For the Lithuanian locale, 00EC is also mapped to 0069 0307 0300 by local case mapping.

2)
Because the sequence 0069 0307 0300 may be registered in Unicode as a single character in the future, local case mapping does not need to handle this case.
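
Proposal 1 could be sketched in Python as follows. This is only an illustration under the assumption that the Lithuanian table gains one extra entry for 00EC; the names LT_MAP_PROPOSAL_1 and compare_key_p1 are hypothetical.

```python
DOTTED_I_GRAVE = "\u0069\u0307\u0300"  # i + COMBINING DOT ABOVE + COMBINING GRAVE

# Proposal 1: the precomposed lowercase 00EC is mapped as well as 00CC.
LT_MAP_PROPOSAL_1 = {
    "\u00cc": DOTTED_I_GRAVE,
    "\u00ec": DOTTED_I_GRAVE,
}


def compare_key_p1(text: str) -> str:
    """Local case mapping under proposal 1, then case folding."""
    mapped = "".join(LT_MAP_PROPOSAL_1.get(ch, ch) for ch in text)
    return mapped.casefold()


# With the extra entry, both inputs produce the same comparison key.
print(compare_key_p1("\u00cc") == compare_key_p1("\u00ec"))
```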

Which proposal is preferable? Please let me know what you think.

Regards,

Nemo

--
Takahiro Nemoto
t.nemo10@kmd.keio.ac.jp




On 2013/09/05, at 6:43, Peter Saint-Andre <stpeter@stpeter.im> wrote:

> As mentioned, I find Section 2.3 of draft-ietf-precis-mappings hard to
> understand. This message is a bit long because I was puzzling over
> things as I wrote it. I hope it's not too rambling.
> 
> A relatively minor point: there's a bit of confusion between
> locale-dependent mapping and context-dependent mapping. For instance:
> 
>   Local case mapping is case folding that depends on language and
>   context.  For example, the mapping of LATIN CAPITAL LETTER I (U+0049)
>   depends on the language context of the user: if the language is
>   Turkish (or one of several other languages), the character should be
>   mapped into LATIN SMALL LETTER DOTLESS I (U+0131) as this character's
>   lower case equivalent.
> 
> The first sentence is confusing because it says that local case mapping
> depends on language *and* context. I think we mean language *or* context
> (or, even better, locale or context, since the Unicode standard talks
> about "locale-dependent case mapping"). Also, the use of "case folding"
> might not be quite right here, because that term has a very specific
> meaning (e.g., Chapter 5.18 of the Unicode standard says things like
> "The Unicode case folding algorithm is defined to be simpler and more
> efficient than case mappings."). In the next sentence, the phrase
> "language context" will cause even more confusion. :-) I think
> "language" is enough here. Also I think it would be helpful to add an
> example of a context-dependent mapping (but see below!).
> 
> Thus I suggest:
> 
>   Local case mapping depends on locale or context.  As an example of
>   locale-dependent mapping, LATIN CAPITAL LETTER I (U+0049) is
>   normally mapped to LATIN SMALL LETTER I (U+0069); however, if the
>   language is Turkish (or one of several other languages), then the
>   character should be mapped to LATIN SMALL LETTER DOTLESS I (U+0131).
>   As an example of context-dependent mapping, GREEK CAPITAL LETTER
>   SIGMA (U+03A3) is mapped to GREEK SMALL LETTER SIGMA (U+03C3) if it
>   is followed by another letter, but is mapped to GREEK SMALL LETTER
>   FINAL SIGMA (U+03C2) if it is not followed by another letter.
> 
> So that's the first paragraph. :-)
> 
> The second paragraph says:
> 
>   Local case mapping targets only characters that
>   get two different results to perform just casefolding that is defined
>   in the Casefolding.txt [Casefolding] and perform special casefolding
>   that is defined in the Specialcasing.txt then casefolding, because
>   PRECIS framework have casefolding.
> 
> (Nit: The file names are actually CamelCase in the UCD, i.e.,
> CaseFolding.txt and SpecialCasing.txt.)
> 
> That's a long sentence. I suggest breaking it up (I started to do that
> but then bigger questions arose).
> 
> As I understand it "just casefolding" isn't really defined in the
> CaseFolding.txt file, because four kinds of casefolding are specified
> there: (1) common case folding, (2) full case folding, (3) simple case
> folding, and (4) Turkic mappings for uppercase I and dotted uppercase I.
> 
> It's not clear to me exactly what draft-ietf-precis-mappings is
> recommending here. Is it saying that local case mapping applies (a) when
> there is a difference between common case mapping and anything else, (b)
> when there is no common case folding, (c) when there are multiple
> possible case foldings, or (d) something else?
> 
> As examples:
> 
> 1. Common case folding handles things like LATIN CAPITAL LETTER A
> (U+0041) to LATIN SMALL LETTER A (U+0061). There are no alternative
> mappings for this character and it's handled by UnicodeData.txt.
> 
> 2. Full case folding can result in mapping to multiple characters, such
> as LATIN CAPITAL LETTER SHARP S (U+1E9E) to LATIN SMALL LETTER S
> (U+0073) + LATIN SMALL LETTER S (U+0073).
> 
> 3. Simple case folding always maps to a single character, such as LATIN
> CAPITAL LETTER SHARP S (U+1E9E) to LATIN SMALL LETTER SHARP S (U+00DF).
> 
> (As we see, there are several possible case foldings for U+1E9E.)
> 
> 4. LATIN CAPITAL LETTER I commonly maps to LATIN SMALL LETTER I but
> under Turkic mappings it maps to LATIN SMALL LETTER DOTLESS I.
> 
> The UnicodeData.txt file specifies one-to-one case mappings that are
> independent of language, locale, and context. Thus it doesn't cover:
> 
> - full case folding (e.g., in UnicodeData.txt U+1E9E is mapped to
> U+00DF, not U+0073 U+0073) since that's not one-to-one
> 
> - context-specific mappings such as uppercase Greek sigma to Greek final
> sigma when followed by a space
> 
> - locale-specific mappings such as LATIN CAPITAL LETTER I to LATIN SMALL
> LETTER DOTLESS I in Turkic languages
> 
> I think we need to be clear about what we're trying to accomplish here.
> Are we actually addressing all mappings that are not handled by the
> UnicodeData.txt file (i.e., everything except "common case folding" as
> specified by the CaseFolding.txt file)? It seems that way to me (i.e.,
> we need to say something about context-specific mappings for sure, and
> probably also case folding that isn't one-to-one such as LATIN CAPITAL
> SHARP S to ss), but that's not what draft-ietf-precis-mappings says now...
> 
>   There are two types casefoldings defined as Unconditional Mappings
>   and Conditional Mappings in the Specialcasing.txt file.  Conditional
>   mappings have Language-Insensitive Mappings that target characters
>   whose full case mappings do not depend on language, but do depend on
>   context.  Language-Sensitive Mappings that these are characters whose
>   full case mappings depend on language and perhaps also context.
> 
>   Of these mappings, characters with Unconditional Mappings or with
>   Language-Insensitive Mappings in Conditional Mappings target are
>   mapped into same codepoint(s) with just casefolding or special
>   casefolding then casefolding.  But characters with Language-Sensitive
>   Mappings in Conditional Mappings targets are mapped into different
>   codepoints.  Therefore this document defines characters that are a
>   part of characters of Lithuanian(lt), Turkish(tr) and
>   Azerbaijanian(az) that Language-Sensitive Mappings targets as targets
>   for local case mapping.
> 
> As I read that text, it says:
> 
> 1. "Local case mapping" applies only to the Language-Sensitive Mappings
> (one division of the Conditional Mappings) from SpecialCasing.txt
> 
> 2. "Local case mapping" does not apply to the Language-Insensitive
> Mappings from SpecialCasing.txt (e.g., handling of Greek Final Sigma,
> which is context-dependent instead of language-dependent but still one
> of the Conditional Mappings)
> 
> 3. "Local case mapping" also does not apply to the Unconditional
> Mappings from SpecialCasing.txt (which in general are mappings from
> lowercase to titlecase and uppercase anyway, so we might not care about
> them)
> 
> If we look at the results in Appendix B.1 of draft-ietf-precis-mappings
> we see that they match the Language-Sensitive Mappings from the
> SpecialCasing.txt file. So if that's all we're doing here, I wonder why
> Section 2.3 of our document needs to be so long -- why not just say
> "apply the Language-Sensitive Mappings in SpecialCasing.txt" and be done
> with it?
> 
> Or, do we also need to specify some of the context-dependent
> mappings? (The major one is Greek final sigma.) As an example, in the
> PRECIS nickname spec would uppercase "ΦΙΛΟΣ ΜΟΙ" ("my friend") be case
> folded for comparison purposes to "φιλος μοι" (with a Greek final sigma,
> which is correct in Greek) or to "φιλοσ μοι" (with a Greek medial sigma,
> which is incorrect in Greek)?
> 
> And what about full case folding vs. simple case folding (e.g., ẞ =
> U+1E9E to ss instead of ß = U+00DF)? Do we need to specify one way to
> handle characters where either full case folding or simple case folding
> can be applied, so that we have consistency for interop purposes?
> 
> Perhaps I'm missing something obvious, but IMHO at least Section 2.3
> needs to explain a bit more what we're trying to accomplish, and why.
> 
> Peter
> 
> -- 
> Peter Saint-Andre
> https://stpeter.im/