Re: [Lucid] FW: [mark@macchiato.com: Re: Non-normalizable diacritics - new property]
Asmus Freytag <asmusf@ix.netcom.com> Thu, 19 March 2015 22:15 UTC
Message-ID: <550B4A78.2020108@ix.netcom.com>
Date: Thu, 19 Mar 2015 15:15:20 -0700
From: Asmus Freytag <asmusf@ix.netcom.com>
To: John C Klensin <john-ietf@jck.com>, lucid@ietf.org
In-Reply-To: <65FEF8E132C1656B0E10BCAC@JcK-HP8200.jck.com>
Archived-At: <http://mailarchive.ietf.org/arch/msg/lucid/6ogi2cJwpTIj9-nIiUmdYZkip9w>
Cc: Andrew Sullivan <ajs@anvilwalrusden.com>
On 3/19/2015 1:36 PM, John C Klensin wrote:
>
> --On Thursday, March 19, 2015 10:50 -0700 Asmus Freytag
> <asmusf@ix.netcom.com> wrote:
>
>> ...
>>> Coming back to the issue that started this, had Unicode
>>> (followed by the IETF) not deprecated embedded code points to
>>> identify languages, U+08A1 could have been coded (using an odd
>>> notation whose intent should be obvious) as
>>> <lang:fula>U+0628 U+0654</lang>
>>> without loss of any phonetic or semantic information and the
>>> justification for adding the code point at all would disappear
>>> (at least absent what we generically describe as "politics",
>>> in which case the justification for its not decomposing would
>>> disappear).
>>
>> It's not just that a language unaccountably prefers a different
>> sequence, but that the Hamza normally is used to indicate a
>> glottal stop.
>
> "unaccountability" is a judgment that has nothing to do with
> this.

I used that term merely to mean "apparently without cause".

> I've talked about it and constructed an example in
> language terms because that is the way this example was first
> justified to us and I was trying to refer back to that example
> and presentation. Pursuing the same example, one could as
> easily, in a hypothetical coding system different from Unicode,
> but taking advantage of the same basic code space, write
>
> U+0628 U+0654 <phoneme:U+0294/>
>
> which would use U+0654 to identify an abstraction for the Hamza
> grapheme and qualify its phonetic use with a qualifier that
> identifies the phoneme. The same story could be applied to our
> old example of "ö" with different phonetic and/or language
> qualifiers for German and Swedish.

Character encoding is not the process of numbering entities by
shape; it is the process of numbering entities by their underlying
identity in writing. In well over 90% of all cases the distinction
seemingly doesn't matter because, especially in context, we
experience the different entities as having different shapes.
We don't even realize that we abstract away the shape, radically in
many cases, and we treat unexpected overlaps between the shapes
associated with different characters as accidents of fonts,
low-resolution rendering, or whatnot. But U+0041 really encodes the
idea of an uppercase letter A. That it is possible to render it in a
Fraktur font, where it may well look (to the uninitiated) like a
form of triangular U, and that this does not (usually) cause any
issues, is fully indicative of this: to repeat, what is encoded is
the (abstract) identity, not the concrete shape.

This works really well for ordinary texts, where context allows us
to (more or less) effortlessly read characters even when they are
shown in artistically distorted shapes, such as those used in
decorative fonts.

There are cases, mostly for symbols and punctuation, where there is
less of a shared understanding of the expected range of appearance
for a given text element defined in the abstract. These are the
cases where character encoders (including Unicode) will decide to
defer more to appearance in marking the boundary between different
abstract text elements, and thus disunify them in the encoding.

On top of that, frozen letter forms (selected, very specific
renderings that no longer admit the same range of glyph shapes as
the original did) have been borrowed into technical notation -- most
prominently phonetic and mathematical. (Much phonetic notation has
in turn given rise to new alphabets, returning the frozen form to
the pool, but now as a new letter.)

No wonder that Unicode doesn't have an "algorithm" to decide where
the boundaries fall in identifying the separate abstract entities to
be encoded. In the end, more often than not, this bottoms out in a
pragmatic judgment of the "I know it when I see it" variety.
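The non-normalizable case that started this thread can be observed directly with Python's standard-library unicodedata module (a sketch; U+08A1 requires a Unicode 7.0 or later character database, i.e. a reasonably recent Python):

```python
import unicodedata

# U+08A1 ARABIC LETTER BEH WITH HAMZA ABOVE has no canonical
# decomposition, so the visually similar sequence
# <beh, combining hamza above> never folds into it under NFC.
atomic = "\u08a1"            # precomposed, non-decomposing
sequence = "\u0628\u0654"    # beh + combining hamza above

assert unicodedata.decomposition(atomic) == ""
assert unicodedata.normalize("NFC", sequence) == sequence
assert unicodedata.normalize("NFC", atomic) != unicodedata.normalize("NFC", sequence)

# Contrast with "ö", where the precomposed letter *does* decompose,
# so both spellings compare equal after normalization:
assert unicodedata.normalize("NFC", "o\u0308") == "\u00f6"
```

In other words, the two spellings of the Fula letter remain distinct code point sequences forever, which is exactly the identifier-comparison problem under discussion.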
It's a fundamental misunderstanding, then, of character encoding to
expect it to be glyph based, and an equally fundamental
misunderstanding to treat combining marks primarily as lego blocks
for glyph construction. Combining marks represent, variously,
separate letters that happen to graphically "float" above other
characters or result in merged shapes (Arabic and Indic scripts have
plenty of examples), or they represent identifiable decorations of
characters, best exemplified by the combining arrow placed above a
letter to represent a vector in math. The decoration, in this case,
has a very clear semantic component of its own, hence is a separate
character, but graphically exists only as a decoration (it requires
a base). Most diacritics (accent marks, etc.), at least in
phonetics, also have these separate identities.

True dual encoding entered the picture because of the conservative
nature of writing systems (and character encodings). The precursor
encodings had (largely, but not exclusively) chosen a simpler
analysis and encoded precomposed entities. Unicode inherited that
model (but also the competing model). It turns out that all
combinations could not be rationally supported in a universal set,
because to be universal one would have to cater to the needs of
scholars to apply these marks to (practically) *any* other
character. Incomplete drafts of lists of known combinations, for
Latin alone, reached several thousand combinations before sanity
finally prevailed and the precomposed entities were limited to the
then-existing set needed for legacy support.

So why does Unicode continue to encode "precomposed" Latin letters?
The reason is that for some decorations it's not sufficient to say
"apply the decoration" -- you also have to say where and how to
apply it. And there are many possible options.
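The "legacy set only" outcome is visible in normalization behavior: NFC composes a mark with a base only where a precomposed character already existed, and never invents new ones (a sketch using Python's stdlib unicodedata):

```python
import unicodedata

# "e" + combining acute has a legacy precomposed form (é),
# so NFC composes the sequence:
assert unicodedata.normalize("NFC", "e\u0301") == "\u00e9"

# "q" + combining acute has no precomposed form; NFC must leave the
# sequence alone -- the repertoire of precomposed Latin letters is
# frozen, no matter how plausible the combination looks.
assert unicodedata.normalize("NFC", "q\u0301") == "q\u0301"
```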
Therefore, Unicode has identified certain combinations that will
always be encoded whenever they are part of an alphabet -- leaving
the use of the combining mark for non-standard purposes, like a
one-off use in a phonetics paper. The list of combining marks that
are treated that way (for the purposes of Latin and similar scripts)
is not currently documented in machine-readable form, but Unicode
Technical Committee experts are quite willing to entertain the needs
of the larger community to identify them via code point properties,
so that the IETF can use such a property to DISALLOW those combining
marks. Other combining marks can be identified as never
(realistically) leading to potential use in letters that are part of
conceivable alphabets; these can be identified and added to the set
of combining marks for which NFC already prevents additional
precomposed entities. Any discussion leading to the adoption of such
properties by the IETF would receive the necessary support from
Unicode, and would allow us to cut the problem down significantly
for the future. The hamza cases may be harder in some ways.

> Again, I'm not advocating any of those alternatives, only trying
> to reinforce my point that these results are the consequences of
> decisions about coding (either due to a general, if complex, set
> of rules or to per character decisions) and not intrinsic
> properties of the grapheme clusters.

I digressed earlier away from a point-by-point reply. But the key
item I want to reiterate is that "identifying the abstract entity"
to be encoded is a matter that cannot be reduced to an algorithm; it
cannot even be reduced to a pseudo-algorithm (such as human-applied,
but otherwise inflexible, principles). In its full generality
(including all edge cases) it will always boil down to judgment
calls.

>> It's a separate letter, even though, because it graphically
>> floats, Unicode represents it as a combining character.
>> In the case of the Fula language, the same shape is used as a
>> decoration on a letter, making a new, single entity, that just
>> happens to look like a glottal stop was placed after a beh.
>
> Given the above, and various other conversations, I believe the
> above statement about "separate letter" is true, but that the
> argument is ultimately circular. In other words, Unicode has
> adopted a set of rules about when characters are distinct (rules
> that, sadly, are not completely described in the standard) and
> then applied them.

See above about rules. Principles exist, in the form of "guiding
principles", but they cannot be absolute, because the complexity of
human writing systems is such that the only way you can make
inflexible principles is retroactively -- and not only is Unicode
still a work in progress, writing systems themselves are never fully
stable.

> Application of those rules (which I'm prepared to believe are
> reasonable for some general case and self-consistent) results in
> "combining character" in one place and "separate character" in
> another. Application of a hypothetical different set of rules
> (including, purely as an example, my reading of the [now
> obviously incomplete] set of rules that actually appear in The
> Unicode Standard) might yield a different result, such as a
> precomposed character that decomposed or no precomposed character
> at all.

A good chunk of the problem is treatable "by rule" -- that is,
Unicode experts have agreed that it is reasonable to assign
properties to some subsets of combining marks that would identify
their participation or non-participation in forming sequences that
are used as letters. (Non-participation means that any use of the
combining mark today is not considered a "letter" in the sense of
"member of an alphabet".) There will be edge cases, hamza being one
of them, where such simple solutions do not appear possible.
Some code points, again hamza being an example, could be explicitly
given a property flagging them as problematic in that regard -- if
the IETF would find that helpful.

>> There's no constraint, when it comes to Fula, for the writers
>> and font designers (over long periods) to maintain the precise
>> shape of the decoration on that letter. While they started out
>> borrowing the shape of a hamza, who knows where this letter
>> shape will end up -- it's not constrained, I would argue,
>> because the decoration isn't a hamza (glottal stop) any longer.
>> ...
>
> And I suppose that, if we wait enough centuries and Fula evolves
> backwards from being written primarily in Latin script (after
> being written primarily in Arabic script until a few centuries
> ago), we will find out.

I'm not proposing that we should wait -- the point is rather that
things which can evolve along different paths are probably distinct
entities; and since character encoding is a cataloging not of shapes
but of entities, this consideration enters into the decision of
where to draw the boundary between entities.

> But, if things are going to be coded differently because the
> writing systems associated with different languages might evolve
> differently (as I agree they have over the centuries) and the
> desire is to future-proof things, then we should really code "ö"
> differently for Swedish than for German because, after all,
> Swedish writing might evolve in the direction of Danish or
> Norwegian so that "ö" would be replaced by the
> phonetically-similar "ø". Or perhaps the evolution would work the
> other way and we should therefore code Norwegian and Danish
> separately lest "ø" in Norwegian evolve in the direction of "ö"
> while Danish remains unchanged.
> The "borrowed letter" analogy applies there too because few, if
> any, of those languages were originally written in Latin script
> (whether we should permit, e.g., Runic in the DNS is of course a
> separate issue, as is the likelihood of, e.g., Norwegian evolving
> back toward it). I imagine those changes are probably unlikely,
> but they are less drastic than the Fula example, which would
> require a complete reversal of the writing system's choice of
> basic script.

We have, in the history of Latin writing, nothing that might serve
as a strong precedent. Most orthographic reforms result in the
adoption of entirely new forms (like a-with-ring instead of aa in
Danish). The closest we come is a preference for different angles of
the acute accent in Polish vs. French use; the judgment call there
was to leave that to font designers.

A more problematic case is the distinction between comma below and
cedilla on S and T, which are claimed (against available evidence of
actual use) to track with local preference. The actual situation is
that font designs are all over the map, which is fine for the
precomposed shapes. But for the combining sequence, the two marks,
comma below and cedilla, are distinct. That presents a quandary,
because the name of each mark clearly indicates a preferred shape,
and there is only one possible decomposition for each precomposed
shape. As a result, Unicode was forced, by the internal logic of a
universal set, to provide the alternate shapes with alternate
decompositions -- doing a huge disservice to users, who now have
confusability and data-migration issues. (This went down before NFC
was frozen, so only the confusability issue remains: depending on
the font, both comma below and cedilla may appear the same in the
precomposed form.)
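The S/T quandary is visible in the character data: both precomposed letters exist, each with its own decomposition, so the two spellings stay distinct under normalization even when fonts draw them identically (a sketch using Python's stdlib unicodedata):

```python
import unicodedata

s_cedilla = "\u015e"   # LATIN CAPITAL LETTER S WITH CEDILLA
s_comma   = "\u0218"   # LATIN CAPITAL LETTER S WITH COMMA BELOW

# Each precomposed letter decomposes to a *different* combining mark...
assert unicodedata.normalize("NFD", s_cedilla) == "S\u0327"  # combining cedilla
assert unicodedata.normalize("NFD", s_comma) == "S\u0326"    # combining comma below

# ...so the two forms never compare equal under NFC, even though many
# fonts render them with the same shape -- the confusability issue.
assert unicodedata.normalize("NFC", s_cedilla) != unicodedata.normalize("NFC", s_comma)
```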
> Unless Unicode is going to go back and review the decisions
> (which various stability rules now prohibit), I don't think this
> helps except for understanding whether there are actually
> predictable rules or whether code point allocation decisions are
> ultimately made by a small cluster of people sitting more or less
> in private and saying "that should be a separate,
> non-decomposing, code point because it feels that way to us and
> that other thing shouldn't because it doesn't" (or, in the words
> of a notorious US judge, "I know it when I see it"). I still hope
> the situation is closer to the former and your explanations have
> helped me a lot in understanding why that may be true.

I think the bulk of all cases falls into the predictable scenario,
because the UTC actually strives quite hard to follow precedent.
There is an irreducible set of cases where things come down to
judgment. You see that more often among code points in the symbols
and punctuation area, because of the difficulty of separating form
and function there.

> But our bottom line is still (or again) that we need some sort of
> overlay or mechanism that will allow identifier comparison among
> strings with absolutely no language, phonetic, or other
> usage-context information (other than script) to work, present
> tense, to a first order approximation of what those writing the
> grapheme clusters or seeing them in isolation perceive. How
> symbols are pronounced or affect pronunciation, how those symbols
> might evolve as the writing system evolves, whether they are used
> as word-forming letters in a script or as, e.g., Mathematical
> notations, or what languages they are used with are simply not
> relevant to the issue.

See above.
> Either enough of the alternate coding sequences have to be
> disallowed to eliminate ambiguity because only one combination is
> left, the alternatives have to somehow compare equal, or we need
> to conclude that certain combinations are sufficiently unlikely
> to justify not worrying about.

Yes, that too.

> I imagine we might end up with a combination of those options for
> different cases.

I agree -- this will not be "one size fits all".

> If the choice rules can be turned into properties that help in
> that process, that would be great. If not, I think we may need to
> lose interest in them other than accepting that they will
> sometimes produce these distinctions in coding choices.

Properties. Definitely.

>> While it's possible to convey all of these distinctions by
>> invisible codes (markup), they are not necessarily the most
>> robust choice: invisible codes and markup of any kind have a way
>> of getting separated from the letters they are supposed to
>> affect.
>
> Of course, that is an argument against direction indicators in
> Bidi situations and a number of other things.

Correct -- and to the degree that they are not implemented, at some
point protocols can decide that they are safely prohibited. Unicode
just recently added a new set of bidi controls, ones that can be
used to isolate one bidi stretch from another. Those look like
useful ones for protocols to require when displaying labels. It is
too early to see whether they will be widely adopted. (It looks
promising, though, because they allow a plain-text mapping of HTML
directional markup.)

> I trust we all understand by now that this is all about
> tradeoffs, optimization for particular classes of usage, good
> choices about what to optimize for, and, as you have pointed out,
> sometimes the result that optimizing a coding system for one
> group of things may pessimize it for another.

Yep.
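The "new set" referred to here is presumably the bidi isolate controls added in Unicode 6.3: U+2066 LRI, U+2067 RLI, U+2068 FSI, and U+2069 PDI. As a sketch of the kind of use a protocol or UI might make of them, one could wrap each displayed label in FSI...PDI so that its direction is resolved from its own content, independently of surrounding text (the `isolate` helper below is hypothetical, not from any specification):

```python
# Unicode 6.3 bidi isolates: FSI (U+2068) takes the isolate's direction
# from its own first strong character; PDI (U+2069) closes the isolate.
FSI = "\u2068"
PDI = "\u2069"

def isolate(label: str) -> str:
    """Wrap a label so its bidi resolution cannot leak into context.
    Illustrative only; the actual display rules live in UAX #9."""
    return f"{FSI}{label}{PDI}"

# Mixed LTR/RTL labels, each displayed in its own directional bubble:
parts = ["example", "\u05d0\u05d1\u05d2", "com"]
display = ".".join(isolate(p) for p in parts)
```

The design point is that, unlike the older embedding controls (LRE/RLE/PDF), isolates are guaranteed not to affect the ordering of text outside the wrapped span.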
As long as we understand that Unicode can't be optimized for IDNA,
we can then find ways to optimize IDNA for Unicode :)

A./