Re: [Lucid] FW: [mark@macchiato.com: Re: Non-normalizable diacritics - new property]

John C Klensin <john-ietf@jck.com> Thu, 19 March 2015 20:36 UTC

Date: Thu, 19 Mar 2015 16:36:39 -0400
From: John C Klensin <john-ietf@jck.com>
To: Asmus Freytag <asmusf@ix.netcom.com>, lucid@ietf.org
Archived-At: <http://mailarchive.ietf.org/arch/msg/lucid/NmHa_YGjpIlNUuJBo3YeJ_0DNPE>
Cc: Andrew Sullivan <ajs@anvilwalrusden.com>


--On Thursday, March 19, 2015 10:50 -0700 Asmus Freytag
<asmusf@ix.netcom.com> wrote:

>...
>> Coming back to the issue that started this, had Unicode
>> (followed by the IETF) not deprecated embedded code points to
>> identify languages, U+08A1 could have been coded (using an odd
>> notation whose intent should be obvious) as
>>      <lang:fula>U+0628 U+0654</lang>
>> without loss of any phonetic or semantic information and the
>> justification for adding the code point at all would disappear
>> (at least absent what we generically describe as "politics",
>> in which case the justification for its not decomposing would
>> disappear).
> 
> It's not just that a language unaccountably prefers a
> different sequence,
> but that the Hamza normally is used to indicate a glottal
> stop. 

"unaccountability" is a judgment that has nothing to do with
this.  I've talked about it and constructed an example in
language terms because that is the way this example was first
justified to us and I was trying to refer back to that example
and presentation.   Pursuing the same example, one could as
easily, in a hypothetical coding system different from Unicode,
but taking advantage of the same basic code space, write 
    U+0628 U+0654 <phoneme:U+0294/> 
which would use U+0654 to identify an abstraction for the Hamza
grapheme and qualify its phonetic use with a qualifier that
identifies the phoneme.  The same story could be applied to our
old example of "ö" with different phonetic and/or language
qualifiers for German and Swedish.

Again, I'm not advocating any of those alternatives, only trying
to reinforce my point that these results are the consequences of
decisions about coding (either due to a general, if complex, set
of rules or to per character decisions) and not intrinsic
properties of the grapheme clusters.
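
That coding decision can be checked directly. A minimal sketch using Python's standard unicodedata module (assuming an interpreter whose Unicode database is version 7.0 or later, when U+08A1 was added) shows that no normalization form relates U+08A1 to the BEH + HAMZA ABOVE sequence, while the analogous precomposed "ö" does round-trip with its combining sequence:

```python
import unicodedata

precomposed = "\u08A1"      # ARABIC LETTER BEH WITH HAMZA ABOVE (Unicode 7.0)
sequence = "\u0628\u0654"   # ARABIC LETTER BEH + ARABIC HAMZA ABOVE

# No normalization form maps one spelling onto the other: U+08A1 was
# assigned no canonical decomposition, so the distinction is permanent.
for form in ("NFC", "NFD", "NFKC", "NFKD"):
    assert unicodedata.normalize(form, precomposed) != \
           unicodedata.normalize(form, sequence)

# By contrast, precomposed "ö" and "o" + COMBINING DIAERESIS are
# canonically equivalent, so they compare equal after normalization.
assert unicodedata.normalize("NFC", "o\u0308") == "\u00F6"
```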

> It's a
> separate letter, even though, because it graphically floats,
> Unicode
> represents it as a combining character.  In the case of the
> Fula language,
> the same shape is used as a decoration on a letter, making a
> new,
> single entity, that just happens to look like a glottal stop
> was placed after a beh.

Given the above, and various other conversations, I believe the
above statement about "separate letter" is true, but that the
argument is ultimately circular.   In other words, Unicode has
adopted a set of rules about when characters are distinct (rules
that, sadly, are not completely described in the standard) and
then applied them.  Application of those rules (which I'm
prepared to believe are reasonable for some general case and
self-consistent) results in "combining character" in one place
and "separate character" in another.  Application of a
hypothetical different set of rules (including, purely as an
example, my reading of the [now obviously incomplete] set of
rules that actually appear in The Unicode Standard) might yield
a different result, such as a precomposed character that
decomposed or no precomposed character at all.
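
The per-character outcome of those rules is visible in the character database itself: U+0654 carries General_Category Mn (a combining mark), U+08A1 is a letter with an empty decomposition mapping, and precomposed "ö" records a decomposition. A small sketch, again using Python's unicodedata:

```python
import unicodedata

# How the rules, as applied, classified each character: general category
# and whatever decomposition (if any) the database records for it.
for cp in ("\u0654", "\u08A1", "\u0628", "\u00F6"):
    print("U+%04X" % ord(cp),
          unicodedata.name(cp),
          unicodedata.category(cp),
          unicodedata.decomposition(cp) or "(none)")

assert unicodedata.category("\u0654") == "Mn"       # combining mark
assert unicodedata.decomposition("\u08A1") == ""    # no decomposition
assert unicodedata.decomposition("\u00F6") == "006F 0308"
```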

> There's no constraint, when it comes to Fula, for the writers
> and font
> designers (over long periods) to maintain the precise shape of
> the
> decoration on that letter. While they started out borrowing
> the shape
> of a hamza, who knows where this letter shape will end up --
> it's not
> constrained, I would argue, because the decoration isn't a
> hamza (glottal stop) any longer.
>...

And I suppose that, if we wait enough centuries and Fula evolves
backwards from being written primarily in Latin script (after
being written primarily in Arabic script until a few centuries
ago), we will find out.  But, if things are going to be coded
differently because the writing systems associated with
different languages might evolve differently (as I agree they
have over the centuries) and the desire is to future-proof
things, then we should really code "ö" differently for Swedish
than for German because, after all, Swedish writing might evolve
in the direction of Danish or Norwegian so that "ö" would be
replaced by the phonetically-similar "ø".     Or perhaps the
evolution would work the other way and we should therefore code
Norwegian and Danish separately lest "ø" in Norwegian evolve in
the direction of "ö" while Danish remains unchanged.  The
"borrowed letter" analogy applies there too because few, if any,
of those languages were originally written in Latin script
(whether we should permit, e.g., Runic in the DNS is of course a
separate issue as is the likelihood of, e.g., Norwegian evolving
back toward it).  I imagine those changes are probably unlikely,
but they are less drastic than the Fula example which would
require a complete reversal of the writing system's choice of
basic script.

Unless Unicode is going to go back and review the decisions
(which various stability rules now prohibit), I don't think this
helps except for understanding whether there are actually
predictable rules or whether code point allocation decisions are
ultimately made by a small cluster of people sitting more or
less in private and saying "that should be a separate,
non-decomposing, code point because it feels that way to us and
that other thing shouldn't because it doesn't" (or in the words
of a notorious US judge "I know it when I see it").   I still
hope the situation is closer to the former and your explanations
have helped me a lot in understanding why that may be true.  

But our bottom line is still (or again) that we need some sort
of overlay or mechanism that will allow identifier comparison
among strings with absolutely no language, phonetic, or other
usage-context information (other than script) to work, present
tense, to a first order approximation of what those writing the
grapheme clusters or seeing them in isolation perceive.  How
symbols are pronounced or affect pronunciation, how those
symbols might evolve as the writing system evolves, whether they
are used as word-forming letters in a script or as, e.g.,
Mathematical notations, or what languages they are used with are
simply not relevant to the issue.  

Either enough of the alternate coding sequences have to be
disallowed to eliminate ambiguity because only one combination
is left, the alternatives have to somehow compare equal, or we
need to conclude that certain combinations are sufficiently
unlikely to justify not worrying about them.  I imagine we might
up with a combination of those options for different cases.   If
the choice rules can be turned into properties that help in that
process, that would be great.  If not, I think we may need to
lose interest in them other than accepting that they will
sometimes produce these distinctions in coding choices.
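
One way to picture the "compare equal" branch of that choice is an overlay table applied after ordinary normalization. The table and folding function below are purely illustrative assumptions of mine, not anything Unicode or the IETF defines:

```python
import unicodedata

# Hypothetical overlay: extra folds applied after NFC so that spellings
# Unicode treats as distinct still compare equal for identifier purposes.
# The table's membership is an assumption for illustration only.
OVERLAY = {
    "\u0628\u0654": "\u08A1",   # fold BEH + HAMZA ABOVE to U+08A1
}

def identifier_key(s: str) -> str:
    """Return a comparison key: NFC-normalize, then apply overlay folds."""
    s = unicodedata.normalize("NFC", s)
    for seq, folded in OVERLAY.items():
        s = s.replace(seq, folded)
    return s

# The two spellings now compare equal, which plain NFC cannot achieve.
assert identifier_key("\u0628\u0654") == identifier_key("\u08A1")
```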

> While it's possible to convey all of these distinctions by
> invisible
> codes (markup) they are not necessarily the most robust choice:
> invisible codes and markup of any kind have a way of getting
> separated from the letters they are supposed to affect.

Of course, that is an argument against direction indicators in
Bidi situations and a number of other things.  I trust we all
understand by now that this is all about tradeoffs, optimization
for particular classes of usage, good choices about what to
optimize for, and, as you have pointed out, sometimes the result
that optimizing a coding system for one group of things may
pessimize it for another.

best,
    john