Re: [Lucid] FW: [mark@macchiato.com: Re: Non-normalizable diacritics - new property]

Asmus Freytag <asmusf@ix.netcom.com> Thu, 19 March 2015 17:50 UTC

Return-Path: <asmusf@ix.netcom.com>
X-Original-To: lucid@ietfa.amsl.com
Delivered-To: lucid@ietfa.amsl.com
Received: from localhost (ietfa.amsl.com [127.0.0.1]) by ietfa.amsl.com (Postfix) with ESMTP id D12591A8885 for <lucid@ietfa.amsl.com>; Thu, 19 Mar 2015 10:50:44 -0700 (PDT)
X-Virus-Scanned: amavisd-new at amsl.com
X-Spam-Flag: NO
X-Spam-Score: -4
X-Spam-Level:
X-Spam-Status: No, score=-4 tagged_above=-999 required=5 tests=[BAYES_00=-1.9, DKIM_SIGNED=0.1, DKIM_VALID=-0.1, DKIM_VALID_AU=-0.1, GB_I_LETTER=-2, RCVD_IN_DNSWL_NONE=-0.0001] autolearn=ham
Received: from mail.ietf.org ([4.31.198.44]) by localhost (ietfa.amsl.com [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id DXUxWj6utwne for <lucid@ietfa.amsl.com>; Thu, 19 Mar 2015 10:50:41 -0700 (PDT)
Received: from elasmtp-junco.atl.sa.earthlink.net (elasmtp-junco.atl.sa.earthlink.net [209.86.89.63]) by ietfa.amsl.com (Postfix) with ESMTP id C58091A874C for <lucid@ietf.org>; Thu, 19 Mar 2015 10:50:40 -0700 (PDT)
DomainKey-Signature: a=rsa-sha1; q=dns; c=nofws; s=dk20050327; d=ix.netcom.com; b=RGLbyBk38Az0JhFiYpz0fE49z7QmdjjH+7qKGJtdK97ip0Y0AKXoEd51lJsTFRuE; h=Received:Message-ID:Date:From:User-Agent:MIME-Version:To:CC:Subject:References:In-Reply-To:Content-Type:Content-Transfer-Encoding:X-ELNK-Trace:X-Originating-IP;
Received: from [72.244.206.133] (helo=[192.168.0.107]) by elasmtp-junco.atl.sa.earthlink.net with esmtpa (Exim 4.67) (envelope-from <asmusf@ix.netcom.com>) id 1YYeac-0001CR-B2; Thu, 19 Mar 2015 12:50:39 -0500
Message-ID: <550B0C70.7030003@ix.netcom.com>
Date: Thu, 19 Mar 2015 10:50:40 -0700
From: Asmus Freytag <asmusf@ix.netcom.com>
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:31.0) Gecko/20100101 Thunderbird/31.5.0
MIME-Version: 1.0
To: lucid@ietf.org
References: <20150311013300.GC12479@dyn.com> <CA+9kkMDZW9yPtDxtLTfY1=VS6itvHtXHF1qdZKtXdwwORwqnew@mail.gmail.com> <55008F97.8040701@ix.netcom.com> <CA+9kkMAcgSA1Ch0B9W1Np0LMn2udegZ=AzU1b26dAi+SDcbGgg@mail.gmail.com> <CY1PR0301MB07310C68F6CFDD46AE22086F82190@CY1PR0301MB0731.namprd03.prod.outlook.com> <20150311200941.GV15037@mx1.yitter.info> <CY1PR0301MB0731F4EBE5EB5C3340F7059282190@CY1PR0301MB0731.namprd03.prod.outlook.com> <20150319014018.GI5743@mx1.yitter.info> <BLUPR03MB1378184CE32E928A3086665582010@BLUPR03MB1378.namprd03.prod.outlook.com> <20150319023029.GA6046@mx1.yitter.info> <BLUPR03MB137886903F15000BB01E3F5882010@BLUPR03MB1378.namprd03.prod.outlook.com> <A62526FD387D08270363E96E@JcK-HP8200.jck.com>
In-Reply-To: <A62526FD387D08270363E96E@JcK-HP8200.jck.com>
Content-Type: text/plain; charset="utf-8"; format="flowed"
Content-Transfer-Encoding: 8bit
X-ELNK-Trace: 464f085de979d7246f36dc87813833b2b65b6112f89115373fcf221f6c58394fdb10b840b3f82fbe350badd9bab72f9c350badd9bab72f9c350badd9bab72f9c
X-Originating-IP: 72.244.206.133
Archived-At: <http://mailarchive.ietf.org/arch/msg/lucid/9mOFg4fso0p7cRSpbMhCWUDZp3U>
Cc: Andrew Sullivan <ajs@anvilwalrusden.com>
Subject: Re: [Lucid] FW: [mark@macchiato.com: Re: Non-normalizable diacritics - new property]
X-BeenThere: lucid@ietf.org
X-Mailman-Version: 2.1.15
Precedence: list
List-Id: "Locale-free UniCode Identifiers \(LUCID\)" <lucid.ietf.org>
List-Unsubscribe: <https://www.ietf.org/mailman/options/lucid>, <mailto:lucid-request@ietf.org?subject=unsubscribe>
List-Archive: <http://www.ietf.org/mail-archive/web/lucid/>
List-Post: <mailto:lucid@ietf.org>
List-Help: <mailto:lucid-request@ietf.org?subject=help>
List-Subscribe: <https://www.ietf.org/mailman/listinfo/lucid>, <mailto:lucid-request@ietf.org?subject=subscribe>
X-List-Received-Date: Thu, 19 Mar 2015 17:50:45 -0000

On 3/19/2015 1:07 AM, John C Klensin wrote:
>
> --On Thursday, March 19, 2015 04:31 +0000 Shawn Steele
> <Shawn.Steele@microsoft.com> wrote:
>
>>>> No, even all NFC or NFKC would be 100% unique to the machine
>>> This is either tautologically true, or false.  Certainly we
>>> learned with IDNA2003 that NFKC doesn't work, because while
>>> it's good for increasing match probability the identifiers
>>> aren't stable.  So when they're handed around through
>>> different environments, stuff happens that is bad.
>> I said "to the machine".  They're "just numbers", and the
>> NFC/NFKC rules are mathematical.  Yes, you do have to exclude
>> unassigned code points as those don't have defined behavior,
>> however "Assigned NFC or NFKC for defined code points" would
>> be 100% unique to the machine.
> Shawn,
>
> A few observations, although we may need to agree to disagree.
>
> First, saying "NFC/NFKC rules" tends to obscure the issue here.
> An important property of NFC/NFD is reversibility, i.e., a dual
> relationship in the mathematical sense.  You may not be able to
> recover whatever the original form was, but you can go back and
> forth between the two normalized forms without loss of information.
> By contrast, NFKC (and NFKD) are potentially information-losing,
> and that is significant (although more so for some sets of
> compatibility equivalences than others).

Correct about the information loss, but, like case folding (which also loses
information that is significant to the human), NFKC does pick a unique
element out of each equivalence set, satisfying Shawn's statement about
"unique to the machine".

Any of these forms, if they can be guaranteed to be stable, will satisfy
that condition, but the point of the discussion is that identifiers
that are "reasonably mnemonic", as one could characterize IDNs, do
occupy a space between machine and human interaction with writing systems.
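
For anyone who wants to see the "machine view" concretely, here is a
minimal sketch using Python's standard unicodedata module (the specific
code points are just illustrations):

    import unicodedata

    # NFC and NFD are mutually reversible: nothing is lost going back
    # and forth between the two canonical forms.
    s = "o\u0308"                                  # 'o' + COMBINING DIAERESIS
    nfc = unicodedata.normalize("NFC", s)          # -> U+00F6 (precomposed)
    assert unicodedata.normalize("NFD", nfc) == s

    # NFKC additionally folds compatibility equivalents, which is where
    # information significant to the human reader can disappear.
    assert unicodedata.normalize("NFKC", "\ufb01") == "fi"       # fi ligature
    assert unicodedata.normalize("NFKC", "\U0001D400") == "A"    # MATHEMATICAL BOLD CAPITAL A

    # Either way, each equivalence set is mapped to exactly one
    # representative, which is all "unique to the machine" requires.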
>
>> ...
>>> I think speculating about the anthropological facts here is
>>> going to lead us to grief.  Let's stick with a domain of
>>> discourse we know well.
>> IDN is a sociological exercise.  If the need were purely
>> scientific/mathematical, then we'd only need a bunch of
>> numbers or opaque IDs.  In order to make much progress here I
>> think we need to think about how people use them.
> At one level, I agree about "sociological exercise".  I would
> normally characterize it a bit differently, but let's accept
> that for the purposes of this note.  I would say much the same
> thing, not just about IDNs, but about almost any human-readable
> identifier, noting that, as soon as a "bunch of numbers" is
> expressed as numerals in written forms, there are "sociological"
> issues of number base and the set of numerals to be used (in
> Unicode-land and modern writing systems bound to script for
> everything but Arabic).

Actually, Arabic even has potential "variants" among the digits, so while
you can base the identifier on the numeric value of what the human wrote
in whatever writing system, you get into issues of recognition of
identifiers etc. once they are communicated in their human-readable
form.
>
> However, maybe there is a different way to characterize this
> issue that avoids the sociological exercise.

It's not just the conventions of a group of users, but also human perception
issues that plague human-readable identifiers, and the "sociological" issue
of the need to counteract fraud.

>   I continue to
> accept that, at least in general, Unicode has a reasonable set
> of coding principles and that, in general, a reasonable set of
> decisions has been made given those principles.  I firmly
> believe that a different coding system, with different
> principles (or priorities among principles) would simply trade
> one set of issues for another but might otherwise be equally
> "good".

Definitely agree with this take.

> However, many of those decisions -- about coding, not
> about characters -- do have alternatives.  For example, instead
> of coding lower and upper case characters differently, one could
> have dealt with case distinctions by assigning a single code
> point to the case-insensitive abstract character and then using
> a qualifier to designate case (either always for one case or
> when anyone cared for both -- two different coding style
> distinctions).  Similarly, ZWJ and ZWNJ are, in some sense,
> indicators and artifacts of coding decisions, not "characters":
> one could have avoided them by assigning separate code points
> to the characters they affect on rendering (note that there have
> been passionate arguments that just that should have been done
> with some Indic scripts).

Also definitely agree. The task of a universal character set is to make
available a set of what Unicode calls "code elements" that are sufficiently
fine-grained and have the correct properties so that you can use them to
represent all "text elements" needed in writing.

There is certainly some latitude in that analysis of the problem so that the
outcome could, in principle, be a different set of code elements.

Usually, competing solutions are not equal in their degree of efficiency or
usability, but, perhaps unfortunately, the metric for that is not a single
scale: each type of operation on text, from input, to storage, search, sort
and finally output, could potentially benefit from a different way of
analyzing the fundamental text elements. Creating and processing identifiers
is, not surprisingly, no different.
>
> In a way, decisions about script boundaries are, themselves,
> coding decisions rather than inherent character properties

Correct. You see that when Unicode is forced to duplicate letters because
they were borrowed from one script into another (like Latin Q and W, which
were borrowed into Cyrillic when it was used to write Kurdish).
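
(A quick way to see the duplication, assuming I recall the code points
correctly -- I believe the Kurdish borrowings were encoded at U+051A and
U+051C:)

    import unicodedata

    # Visually near-identical letters, but the script lives in the
    # character identity, not in the glyph.
    for ch in ("Q", "\u051a", "W", "\u051c"):
        print(f"U+{ord(ch):04X}  {unicodedata.name(ch)}")
    # U+0051  LATIN CAPITAL LETTER Q
    # U+051A  CYRILLIC CAPITAL LETTER QA
    # U+0057  LATIN CAPITAL LETTER W
    # U+051C  CYRILLIC CAPITAL LETTER WE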

> but,
> having been vaguely involved with an attempt to define a
> universal character set that did not depend on such boundaries,
> that path seems to lead to madness.  Unicode still could have
> defined its script boundaries differently, e.g., seeking a
> higher level of integration (or "unification") across the board
> but, while I think understanding that as a choice is helpful,
> pursuing it very far probably is not.

In some sense one has to accept that writing is a historical process that
accumulates design decisions over time, and very conservatively sticks
to them for long time periods.

Not surprisingly, character encoding, though much younger than writing,
mimics that to some degree. For better or for worse, Unicode largely
inherited the analysis of many writing systems into code elements as
it had been present in existing precursors. And, yes, collectively, these
did include the duality of combining sequences and precomposed forms,
for example.

The value-added proposition of a single, universal set was such that it
resulted in the critical mass needed to lead to first adoption and then
rapid replacement of the preceding encodings -- a process that's still
in the tail end of the curve.

While most likely many choices made by Unicode (where there was a
choice to be made that was not constrained) are not "perfect", it's
highly doubtful that they are collectively so bad that any alternative
would represent enough value added to offset the cost and disruption
of a wholesale replacement. Certainly not for some decades.

(I won't say never, because, just as languages go through wrenching
orthography reforms, nothing is truly static when it comes to writing
systems and the technology to implement them.)

>
> Modulo the script boundary issue, it is (approximately) possible
> to see all of the "confusion" and "too-similar appearance"
> problems as human perception issues involving the
> recipient/viewer (not really "sociological", but the distinction
> may not be important).   Seen that way, the issue here is not
> about the viewer perception issue but about the coding one --
> resolving differences in decisions about how something might
> have been coded (again and for convenience, within a script).  If
> one goes back a few hundred years, the question might become,
> not whether someone viewing the character would "see" the
> difference but whether the calligrapher or perhaps even someone
> trying to set a string in cold type would see an important
> difference -- also, in a way, a coding decision problem.
>
> Coming back to the issue that started this, had Unicode
> (followed by the IETF) not deprecated embedded code points to
> identify languages, U+08A1 could have been coded (using an odd
> notation whose intent should be obvious) as
>      <lang:fula>U+0628 U+0654</lang>
> without loss of any phonetic or semantic information and the
> justification for adding the code point at all would disappear
> (at least absent what we generically describe as "politics", in
> which case the justification for its not decomposing would
> disappear).

It's not just that a language unaccountably prefers a different sequence,
but that the Hamza is normally used to indicate a glottal stop. It's a
separate letter, even though, because it graphically floats, Unicode
represents it as a combining character.  In the case of the Fula language,
the same shape is used as a decoration on a letter, making a new,
single entity that just happens to look as if a glottal stop was placed
after a beh.

There's no constraint, when it comes to Fula, for the writers and font
designers (over long periods) to maintain the precise shape of the
decoration on that letter. While they started out borrowing the shape
of a hamza, who knows where this letter shape will end up -- it's not
constrained, I would argue, because the decoration isn't a hamza
(glottal stop) any longer.

This, in addition to sorting, is one of the arguments for duplicating
letters across scripts, in case of borrowings. While this hasn't happened
on a large scale, glyph shapes and font designs definitely drift over
time and it's best to allow these borrowed letters to drift with the
new script they are now embedded in, and not link them to the
old script that they originated from.

While it's possible to convey all of these distinctions by invisible
codes (markup), they are not necessarily the most robust choice:
invisible codes and markup of any kind have a way of getting
separated from the letters they are supposed to affect.
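
For the archive, the machine-level consequence is easy to verify: since
U+08A1 carries no canonical decomposition, no normalization form will
ever unify it with the beh + hamza sequence. A small Python check
(assuming a Unicode 7.0 or later character database, so that U+08A1 is
assigned):

    import unicodedata

    beh_hamza = "\u0628\u0654"   # ARABIC LETTER BEH + ARABIC HAMZA ABOVE (combining)
    fula_beh  = "\u08a1"         # ARABIC LETTER BEH WITH HAMZA ABOVE, no decomposition

    for form in ("NFC", "NFD", "NFKC", "NFKD"):
        assert (unicodedata.normalize(form, beh_hamza)
                != unicodedata.normalize(form, fula_beh))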
> Similarly, a different CCS could have avoided at least the
> portion of the "Mathematical" collection that are ultimately
> Latin or Greek letters in special fonts by different coding
> conventions that would use the base character plus qualifiers
> for usage and/or type style.
>
> Another example lies in the collection of combining characters
> that can be used to form precomposed characters that don't
> decompose, not for, e.g., phonetic reasons but because the
> definitions of those combining characters don't contain quite
> enough information (see Section 3.3.2.3 of
> draft-klensin-idna-5892upd-unicode70-04.txt and its citations).
> At least in theory, Unicode could have chosen to assign code
> points to more precisely-defined (as to how they affect base
> characters) combining characters.  A coding system with
> different principles might have used position and/or size
> indicator coding with similar effect (an approach that will
> probably be needed if Unicode ever takes on, e.g., Classic Mayan
> script).
>
> Now, from that perspective, this issue is about smoothing over
> (by either some form of equivalence rules or exclusion (or
> non-inclusion)) differences among character code sequences that
> are the result of coding decisions (decisions that are at least
> semi-arbitrary because others could have been made with no
> "important" information loss -- and, yes, "important" is
> sometimes debatable without yet other coding decisions).  For
> sequences that compose and decompose symmetrically, NFC (or NFD)
> normalization does the necessary job.  IDNA2008 disallows those
> mathematical characters as a way to do a different part of the
> job without making non-reversible compatibility equivalences
> part of the standard.

Unfortunately, IDNA2008 lacks a tool that would make many of these issues
easy to manage: there is a subset of the problem where neither normalization
nor non-inclusion is the most adequate way of handling the issue.

Normalization presupposes that in each equivalence set, one can pick one
canonical element without prejudice. Where two different traditions of
writing each use a different element in the set, any canonical choice
in favor of one represents a prejudice against the other.

The same problem plagues non-inclusion. One would have to pick
favorites.

Outside the protocol, other means of addressing the problem exist, such
as mutual exclusion with first-come, first-served. This requires a registry
with appropriate policies (resulting in a definition of a blocked variant).
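
Purely as a sketch of the mechanics (the names and the variant table
below are invented for illustration; a real registry would derive its
variant sets from a vetted table and a more careful label-generation
step):

    # Hypothetical registry enforcing mutual exclusion among variant
    # spellings, first-come, first-served: registering one spelling
    # blocks the others.
    VARIANTS = {
        "\u08a1": "\u0628\u0654",     # each sequence maps to its
        "\u0628\u0654": "\u08a1",     # blocked counterpart (illustration only)
    }

    class Registry:
        def __init__(self):
            self.registered = set()   # labels actually delegated
            self.blocked = set()      # labels withheld as variants

        def register(self, label):
            if label in self.registered or label in self.blocked:
                raise ValueError("label or one of its variants is already taken")
            self.registered.add(label)
            # Withhold every variant spelling of the newly registered label.
            for seq, alt in VARIANTS.items():
                if seq in label:
                    self.blocked.add(label.replace(seq, alt))

    r = Registry()
    r.register("\u0628\u0654")        # first come, first served
    # r.register("\u08a1")            # would now raise: blocked variant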

The touchstone case for the current discussion possibly falls into this
category, except that it seems that the use of the glottal stop in Arabic
in general is seen as less essential for identifier purposes (for example,
the root zone would not include it).

Luckily, many of the other cases that have been identified following
the publication of the original paper would be addressable by non-inclusion.
For those, Unicode's operational principle of always encoding any
combinations needed for orthographic (as opposed to technical) use means
that non-inclusion of certain combining marks would be feasible.
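
Mechanically, such a non-inclusion rule is trivial to state and to check;
a sketch (the exclusion set shown is a placeholder, not a proposal):

    # Hypothetical non-inclusion rule: reject any label that contains a
    # combining mark from an agreed-upon exclusion set.
    EXCLUDED_MARKS = {"\u0654"}       # ARABIC HAMZA ABOVE, illustration only

    def label_allowed(label: str) -> bool:
        return not any(ch in EXCLUDED_MARKS for ch in label)

    assert label_allowed("\u08a1")            # precomposed form passes
    assert not label_allowed("\u0628\u0654")  # excluded combining mark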
>
> As another coding decision matter, all of this would be
> significantly easier were Unicode consistent about its coding
> distinctions.   Such consistency is likely impossible, at least
> given other decisions, but that doesn't mean it wouldn't be
> helpful.   However, we have, instead, a combining sequence
> preference (with exceptions) for Latin but a precomposed
> character preference for Arabic.

It's actually more complex. There are systematic exceptions in Latin, based
on the nature of the mark.

The way Arabic ended up is perhaps more historical accident than perfect
design; few would dispute that. But we are stuck with it, just as we are
stuck with other historical accidents.
>   We have all precombined Latin
> characters decomposing into combining sequences, except for
> some combining characters.  Most European scripts code the
> abstract graphics in grapheme clusters but East and South Asian
> ones use indicators like ZWJ and ZWNJ.  There is a strict rule
> against assigning separate code points to typestyle distinctions
> but an exception for some usage contexts such as mathematics and
> phonetic description.  Unicode does not have indicator codes or
> separate code points for layout or presentation (leaving that to
> external markup), but such coding has proved necessary for
> writing systems that are primarily right-to-left and in such
> cases as non-breaking space.  There are no language-dependent or
> pronunciation-dependent coding distinctions (see Section 2 of
> draft-klensin-idna-5892upd-unicode70-04.txt and/or Chapter 2 of
> Unicode 7.0) except where there are.

An important point to remember is that code elements aren't text elements;
they are merely guaranteed to be able to represent all needed text elements.

And, while the basic goal is to cover text elements needed to make the text
"readable" (for content) as opposed to final form (with style), some cases
exist where the borderline between form and content is drawn differently.
In some of those cases, in order to process a text, more information about
content has to be supplied, such as using a no-break space between words
that form some sort of compound (e.g. title and name) without being written
together.
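
A trivial example of that kind of content-bearing invisible distinction,
just to anchor it:

    import unicodedata

    # An ordinary space and a no-break space look identical, but only
    # one carries the "keep these words together" information.
    plain    = "Dr. Jones"
    no_break = "Dr.\u00a0Jones"      # U+00A0 NO-BREAK SPACE
    assert plain != no_break
    assert unicodedata.category("\u00a0") == "Zs"   # still a space separator to the machine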

Incidentally, not all of the invisible (format or control) characters have
been accepted by the user community (LS, PS, or NEL anyone?). But where they
have been, it's a sign that they were addressing a real need in a way that
added value.
>
> Again, I don't think any of those decisions are "wrong".  But
> they are all problematic for the IETF's language-insensitive,
> fairly context-free, identifier comparison purposes.  And they
> are, at least IMO, worth some effort because (again, independent
> of discussions about "confusion"), at least,
>
> (i) We have already established the precedent of dealing with
> all of the important groups of coding artifacts we knew about
> when IDNA2008 was under consideration by adopting normalization
> rules, DISALLOWing a lot of characters, and even developing
> special context-dependent rules for some of them.
>
> (ii) When different input methods, using data entry devices that
> are indistinguishable to the user (e.g., the alphabetic key
> layouts on a German keyboard for Windows, Linux, and the Mac are
> the same), produce different output (stored) strings for
> the same input, we are dealing with coding artifacts, not
> "visual confusion".   Whether the difference in internal coding
> is the decision of one system to prefer NFC and that of another
> to prefer NFD or the result of one typist using an "ö" key and
> another deciding to type an "o" and a "dead key" umlaut, we have
> (and IMO, should have) comparison systems that eliminate those
> coding differences.    This is, from that point of view, just a
> new set of coding decision differences that neither Unicode nor
> we arranged to compensate for earlier.
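
The coding difference in (ii) is indeed precisely what canonical
normalization was designed to erase; a quick Python check, for the archive:

    import unicodedata

    # Two input methods, one user intention: an "ö" key vs. "o" plus a
    # dead-key umlaut.  Raw comparison differs; consistent canonical
    # normalization (NFC here, NFD would do equally well) makes them match.
    from_o_umlaut_key = "\u00f6"      # LATIN SMALL LETTER O WITH DIAERESIS
    from_dead_key     = "o\u0308"     # 'o' + COMBINING DIAERESIS

    assert from_o_umlaut_key != from_dead_key
    assert (unicodedata.normalize("NFC", from_o_umlaut_key)
            == unicodedata.normalize("NFC", from_dead_key))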

I think the process will bump up against the fundamental limitations that
I pointed out earlier, but, as long as the IETF isn't trying to pretend
those don't exist, searching for a solution for the tractable cases (and
there are many) seems worthwhile to me.

I just don't think it adds value to get hung up over any intractable edge
cases, because there are whole classes of problems that are also not
tractable on the protocol level, and, perforce, these have to be handled
outside of it.

That's where the recognition of the human perception issue as a big part
of the overall issue with identifiers is a necessary corrective to any
attempts to use only the tools available in the protocol.

A./
>
>       john
>
>
>
> One way to look at the issue involved here
>
> _______________________________________________
> Lucid mailing list
> Lucid@ietf.org
> https://www.ietf.org/mailman/listinfo/lucid