Re: [Lucid] FW: [mark@macchiato.com: Re: Non-normalizable diacritics - new property]

Asmus Freytag <asmusf@ix.netcom.com> Thu, 19 March 2015 22:15 UTC

To: John C Klensin <john-ietf@jck.com>, lucid@ietf.org
Cc: Andrew Sullivan <ajs@anvilwalrusden.com>
Subject: Re: [Lucid] FW: [mark@macchiato.com: Re: Non-normalizable diacritics - new property]

On 3/19/2015 1:36 PM, John C Klensin wrote:
>
> --On Thursday, March 19, 2015 10:50 -0700 Asmus Freytag
> <asmusf@ix.netcom.com> wrote:
>
>> ...
>>> Coming back to the issue that started this, had Unicode
>>> (followed by the IETF) not deprecated embedded code points to
>>> identify languages, U+08A1 could have been coded (using an odd
>>> notation whose intent should be obvious) as
>>>       <lang:fula>U+0628 U+0654</lang>
>>> without loss of any phonetic or semantic information and the
>>> justification for adding the code point at all would disappear
>>> (at least absent what we generically describe as "politics",
>>> in which case the justification for its not decomposing would
>>> disappear).
>> It's not just that a language unaccountably prefers a
>> different sequence,
>> but that the Hamza normally is used to indicate a glottal
>> stop.
> "unaccountability" is a judgment that has nothing to do with
> this.

I used that term merely to mean "apparently without cause".

>    I've talked about it and constructed an example in
> language terms because that is the way this example was first
> justified to us and I was trying to refer back to that example
> and presentation.   Pursuing the same example, one could as
> easily, in a hypothetical coding system different from Unicode,
> but taking advantage of the same basic code space, write
>      U+0628 U+0654 <phoneme:U+0294/>
> which would use U+0654 to identify an abstraction for the Hamza
> grapheme and qualify its phonetic use with a qualifier that
> identifies the phoneme.  The same story could be applied to our
> old example of "ö" with different phonetic and/or language
> qualifiers for German and Swedish.

Character encoding is not the process of numbering entities by shape;
it is the process of numbering entities by their underlying identity
in writing.

For well over 90% of cases, the distinction seemingly doesn't matter,
because, especially in context, we experience the different entities
as having different shapes. We don't even realize that we abstract
away the shape, radically in many cases, and we treat unexpected
overlaps between the shapes associated with different characters as
accidents of fonts, low-resolution rendering, or whatnot.

But U+0041 really encodes the idea of an uppercase letter A.
That it can be rendered in a Fraktur font, where it may well look
(to the uninitiated) like a kind of triangular U, and that this
(usually) causes no issues, is fully indicative of this: to repeat,
what is encoded is the (abstract) identity, not the concrete shape.

This works really well for ordinary texts, where context allows us
to read characters more or less effortlessly even if they are shown
in (artistically) distorted shapes, such as those used in decorative
fonts.

There are cases, mostly among symbols and punctuation, where there
is less of a shared understanding of what the expected range of
appearance is for a given text element defined in the abstract.

These are the cases where character encoders (including Unicode)
decide to defer more to appearance in marking the boundary between
different abstract text elements, and thus disunify them in the
encoding.

On top of that, frozen letter forms (selected, very specific
renderings that no longer admit the same range of glyph shapes as
the original did) have been borrowed into technical notation -- most
prominently phonetic and mathematical. (Much phonetic notation has
in turn given rise to new alphabets, returning the frozen form to
the pool, but now as a new letter.)
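
To make the "frozen form" point concrete, here is a minimal sketch in
Python (assuming only the standard unicodedata module): the
mathematical Fraktur A is encoded as its own character, with a
compatibility mapping back to the plain letter, while U+0041 itself
stays abstract and can be rendered in any font, Fraktur included.

    import unicodedata

    # U+0041 is the abstract letter; U+1D504 is a frozen mathematical form.
    print(unicodedata.name('\u0041'))      # LATIN CAPITAL LETTER A
    print(unicodedata.name('\U0001D504'))  # MATHEMATICAL FRAKTUR CAPITAL A

    # NFKC folds the frozen form back onto the abstract letter;
    # NFC leaves it alone, because it is a distinct character.
    print(unicodedata.normalize('NFKC', '\U0001D504'))                 # 'A'
    print(unicodedata.normalize('NFC', '\U0001D504') == '\U0001D504')  # True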

No wonder, then, that Unicode doesn't have an "algorithm" for
deciding where the boundaries fall between separate abstract
entities to be encoded. In the end, more often than not, this
bottoms out in a pragmatic judgment of the "I know it when I see it"
variety.

It's a fundamental misunderstanding of character encoding, then, to
expect it to be glyph-based, and an equally fundamental
misunderstanding to treat combining marks primarily as Lego blocks
for glyph construction.

Combining marks represent, variously, separate letters that happen
to graphically "float" above other characters or result in merged
shapes (Arabic and Indic scripts have plenty of examples), or they
represent identifiable decorations of characters, best illustrated
by the combining arrow placed above a letter to denote a vector in
math.

The decoration, in this case, has a very clear semantic component
of its own, and hence is a separate character, but graphically it
exists only as a decoration (it requires a base).

Most diacritics (accent marks, etc.), at least in phonetics, also
have these separate identities.

True dual encoding entered the picture because of the conservative
nature of writing systems (and character encodings). The precursor
encodings had (largely, but not exclusively) chosen a simpler
analysis and encoded precomposed entities. Unicode inherited that
model (but also the competing model). It turned out that such
combinations could not be rationally supported in a universal set,
because to be universal one would have to cater to the needs of
scholars to apply these marks to (practically) *any* other
character. Incomplete drafts of lists of known combinations, for
Latin alone, ran to several thousand combinations before sanity
finally prevailed and the precomposed entities were limited to the
then-existing set needed for legacy support.
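
To make the dual encoding concrete, a minimal Python sketch (standard
unicodedata module only): the legacy precomposed o-with-diaeresis and
the combining sequence are canonically equivalent, so normalization
maps freely between the two spellings.

    import unicodedata

    precomposed = '\u00F6'  # LATIN SMALL LETTER O WITH DIAERESIS
    sequence = 'o\u0308'    # 'o' + COMBINING DIAERESIS

    # The two spellings are canonically equivalent: NFC composes the
    # sequence, NFD decomposes the precomposed letter.
    assert unicodedata.normalize('NFC', sequence) == precomposed
    assert unicodedata.normalize('NFD', precomposed) == sequence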

So why does Unicode continue to encode "precomposed" Latin
letters? The reason is that for some decorations, it's not sufficient
to say "apply the decoration" -- you also have to say where and
how to apply it. And there are many possible options. Therefore,
Unicode has identified certain combinations that will always be
encoded whenever they are part of an alphabet -- leaving the
use of the combining mark for non-standard purposes, like a one-off
use in a phonetics paper.

The list of combining marks that are treated that way (for the
purposes of Latin and similar scripts) is not currently documented
in machine-readable form, but Unicode Technical Committee experts
are quite willing to entertain the needs of the larger community to
identify them via code point properties, so that the IETF can use
such a property to DISALLOW those combining marks.

Other combining marks can be identified as never (realistically)
leading to use in letters that are part of conceivable alphabets;
these can be added to the set of combining marks for which NFC
already prevents additional precomposed entities.

Any discussion leading to the adoption of such properties by the
IETF would receive the necessary support from Unicode, and would
make it possible to cut the problem down significantly in the
future.
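
As a rough sketch of how that might look on the protocol side
(hypothetical: no machine-readable list has been published yet, so
the set name and its two example members below are placeholders of
mine, not a real property), a registry could simply reject labels
containing marks that carry the property:

    import unicodedata

    # HYPOTHETICAL set: a future Unicode property would supply the real
    # contents; the marks listed here are illustrative placeholders only.
    DISALLOWED_COMBINING_MARKS = {'\u0654', '\u0653'}

    def label_ok(label: str) -> bool:
        """Reject labels using a combining mark reserved for forming
        standard (precomposed/atomic) letters."""
        # Decompose first so that marks hidden inside precomposed
        # characters are visible to the check as well.
        nfd = unicodedata.normalize('NFD', label)
        return not any(ch in DISALLOWED_COMBINING_MARKS for ch in nfd)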

The hamza cases may be harder in some ways.
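
For reference, the behavior that makes them hard, in Python terms
(with a Unicode 7.0 or later character database): the atomic U+08A1
has no canonical decomposition, so it and the sequence beh + hamza
above never compare equal under any normalization form.

    import unicodedata

    atomic = '\u08A1'          # ARABIC LETTER BEH WITH HAMZA ABOVE
    sequence = '\u0628\u0654'  # ARABIC LETTER BEH + ARABIC HAMZA ABOVE

    # No normalization form bridges the two spellings.
    for form in ('NFC', 'NFD', 'NFKC', 'NFKD'):
        assert (unicodedata.normalize(form, atomic)
                != unicodedata.normalize(form, sequence))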

>
> Again, I'm not advocating any of those alternatives, only trying
> to reinforce my point that these results are the consequences of
> decisions about coding (either due to a general, if complex, set
> of rules or to per character decisions) and not intrinsic
> properties of the grapheme clusters.

I digressed earlier away from a point-by-point reply. But the key
item I want to reiterate is that "identifying the abstract entity"
to be encoded is a matter that cannot be reduced to an algorithm; it
cannot even be reduced to a pseudo-algorithm (such as human-applied
but otherwise inflexible principles). In its full generality
(including all edge cases) it will always boil down to judgment
calls.
>
>> It's a
>> separate letter, even though, because it graphically floats,
>> Unicode
>> represents it as a combining character.  In the case of the
>> Fula language,
>> the same shape is used as a decoration on a letter, making a
>> new,
>> single entity, that just happens to look like a glottal stop
>> was placed after a beh.
> Given the above, and various other conversations, I believe the
> above statement about "separate letter" is true, but that the
> argument is ultimately circular.   In other words, Unicode has
> adopted a set of rules about when characters are distinct (rules
> that, sadly, are not completely described in the standard) and
> then applied them.

See above about rules.

Principles exist, in the form of "guiding principles", but they
cannot be absolute, because the complexity of human writing systems
is such that the only way you can make inflexible principles is
retroactively -- and not only is Unicode still a work in progress,
writing systems themselves are never fully stable.

> Application of those rules (which I'm
> prepared to believe are reasonable for some general case and
> self-consistent) results in "combining character" in one place
> and "separate character" in another.  Application of a
> hypothetical different set of rules (including, purely as an
> example, my reading of the [now obviously incomplete] set of
> rules that actually appear in The Unicode Standards) might yield
> a different result, such as a precomposed character that
> decomposed or no precomposed character at all.

A good chunk of the problem is treatable "by rule" -- that is,
Unicode experts have agreed that it is reasonable to assign
properties to some subsets of combining marks that would identify
their participation or non-participation in forming sequences that
are used as letters. (Non-participation means that no current use of
the combining mark is considered a "letter" in the sense of "member
of an alphabet".)

There will be edge cases, hamza being one of them, where such
simple solutions do not appear possible. Some code points, again
with hamza as an example, could be explicitly given a property
flagging them as problematic in that regard -- if the IETF would
find that helpful.
>
>> There's no constraint, when it comes to Fula, for the writers
>> and font
>> designers (over long periods) to maintain the precise shape of
>> the
>> decoration on that letter. While they started out borrowing
>> the shape
>> of a hamza, who knows where this letter shape will end up --
>> it's not
>> constrained, I would argue, because the decoration isn't a
>> hamza (glottal stop) any longer.
>> ...
> And I suppose that, if we wait enough centuries and Fula evolves
> backwards from being written primarily in Latin script (after
> being written primarily in Arabic script until a few centuries
> ago), we will find out.
I'm not proposing that we should wait -- the point is rather that
things that can evolve along different paths are probably distinct
entities, and since character encoding is not a cataloging of shapes
but of entities, this consideration enters into the decision of
where to draw the boundary between entities.
>   But, if things are going to be coded
> differently because the writing systems associated with
> different languages might evolve differently (as I agree they
> have over the centuries) and the desire is to future-proof
> things, then we should really code "ö" differently for Swedish
> than for German because, after all, Swedish writing might evolve
> in the direction of Danish or Norwegian so that "ö" would be
> replaced by the phonetically-similar "ø".     Or perhaps the
> evolution would work the other way and we should therefore code
> Norwegian and Danish separately lest "ø" in Norwegian evolve in
> the direction of "ö" while Danish remains unchanged.  The
> "borrowed letter" analogy applies there too because few, if any,
> of those languages were originally written in Latin script
> (whether we should permit, e.g., Runic in the DNS is of course a
> separate issue as is the likelihood of, e.g., Norwegian evolving
> back toward it).  I imagine those changes are probably unlikely,
> but they are less drastic than the Fula example which would
> require a complete reversal of the writing system's choice of
> basic script.

We have, in the history of Latin writing, nothing that might serve as a
strong precedent. Most orthographic reforms result in the adoption of
entirely new forms (like a-with-ring instead of aa in Danish).

The closest we come is a preference for different angles of
the accent when used in Polish vs. French.

The judgment call was to leave that to font designers.

A more problematic case is the distinction between comma below
and cedilla below on S and T, which are claimed (against available
evidence of actual use) to track with local preference.

The actual situation is that font designs are all over the map,
which is fine for the precomposed shapes. But for the combining
sequences, the two marks, comma below and cedilla below, are
distinct. That presents a quandary, because the name of each mark
clearly indicates a preferred shape and there is only one possible
decomposition for each precomposed shape. As a result, Unicode was
forced, by the internal logic of a universal set, to provide the
alternate shapes with alternate decompositions -- doing a huge
disservice to users, who now have confusability and data-migration
issues. (This went down before NFC was frozen, so only the
confusability issue remains, because, depending on the font, both
comma below and cedilla below may appear the same in the precomposed
form.)
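
In code, the two precomposed letters and their distinct
decompositions look like this (Python, standard unicodedata module);
the confusability is purely visual, since normalization keeps them
apart:

    import unicodedata

    s_comma = '\u0219'    # LATIN SMALL LETTER S WITH COMMA BELOW
    s_cedilla = '\u015F'  # LATIN SMALL LETTER S WITH CEDILLA

    # Each letter decomposes to a different combining mark, so the two
    # never compare equal under NFC/NFD, however similarly the fonts
    # happen to draw them.
    print(unicodedata.decomposition(s_comma))    # '0073 0326' (comma below)
    print(unicodedata.decomposition(s_cedilla))  # '0073 0327' (cedilla)
    assert (unicodedata.normalize('NFC', s_comma)
            != unicodedata.normalize('NFC', s_cedilla))
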
>
> Unless Unicode is going to go back and review the decisions
> (which various stability rules now prohibit), I don't think this
> helps except for understanding whether there are actually
> predictable rules or whether code point allocation decisions are
> ultimately made by a small cluster of people sitting more or
> less in private and saying "that should be a separate,
> non-decomposing, code point because it feels that way to us and
> that other thing shouldn't because it doesn't" (or in the words
> of a notorious US judge "I know it when I see it ").   I still
> hope the situation is closer to the former and your explanations
> have helped me a lot in understanding why that may be true.

I think the bulk of all cases tends to fall into the predictable
scenario, because the UTC actually strives quite hard to follow
precedent. There is an irreducible set of cases where things come
down to judgment. You see that more often among code points in the
symbols and punctuation area, because of the difficulty of
separating form and function there.

>
> But our bottom line is still (or again) that we need some sort
> of overlay or mechanism that will allow identifier comparison
> among strings with absolutely no language, phonetic, or other
> usage-context information (other than script) to work, present
> tense, to a first order approximation of what those writing the
> grapheme clusters or seeing them in isolation perceive.  How
> symbols are pronounced or affect pronunciation, how those
> symbols might evolve as the writing system evolves, whether they
> are used as word-forming letters in a script or as, e.g.,
> Mathematical notations, or what languages they are used with are
> simply not relevant to the issue.

See above.
>
> Either enough of the alternate coding sequences have to be
> disallowed to eliminate ambiguity because only one combination
> is left, the alternatives have to somehow compare equal, or we
> need to conclude that certain combinations are sufficiently
> unlikely to justify not worrying about.
Yes, that too.
> I imagine we might end
> up with a combination of those options for different cases.

I agree - this will not be a "one size fits all".

>    If
> the choice rules can be turned into properties that help in that
> process, that would be great.  If not, I think we may need to
> lose interest in them other than accepting that they will
> sometimes produce these distinctions in coding choices.

Properties. Definitely.

>> While it's possible to convey all of these distinctions by
>> invisible
>> codes (markup) they are not necessarily the most robust choice:
>> invisible codes and markup of any kind have a way of getting
>> separated from the letters they are supposed to affect.
> Of course, that is an argument against direction indicators in
> Bidi situations and a number of other things.

Correct -- and to the degree that they are not implemented, at some
point protocols can decide that they can safely be prohibited.
Unicode just recently added a new set of controls, ones that can be
used to isolate one bidi stretch from another. Those look like
useful ones for protocols to require when displaying labels. It is
too early to see whether they will be widely adopted. (It looks
likely, though, because they would allow a plain-text mapping of
HTML directional markup.)
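
If the new set meant here is the isolate controls added in Unicode
6.3 (my assumption), then the protocol-side use is a small sketch:
wrap the label before handing it to display. The function name is
mine; the two characters are real.

    # Bidi isolates from Unicode 6.3; roughly the plain-text
    # counterpart of HTML's dir="auto" / <bdi> markup.
    FSI = '\u2068'  # FIRST STRONG ISOLATE
    PDI = '\u2069'  # POP DIRECTIONAL ISOLATE

    def isolate_for_display(label: str) -> str:
        """Keep a label's bidi behavior from leaking into, or being
        affected by, the surrounding text."""
        return FSI + label + PDI
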
>   I trust we all
> understand by now that this is all about tradeoffs, optimization
> for particular classes of usage, good choices about what to
> optimize for, and, as you have pointed out, sometimes the result
> that optimizing a coding system for one group of things may
> pessimize it for another.

Yep. As long as we understand that Unicode can't be optimized for IDNA,
we can then find ways to optimize IDNA for Unicode :)

A./