Re: [Lucid] FW: [ Re: Non-normalizable diacritics - new property]

John C Klensin <> Thu, 19 March 2015 08:07 UTC

Date: Thu, 19 Mar 2015 04:07:15 -0400
From: John C Klensin <>
To: Shawn Steele <>
Cc: , Andrew Sullivan <>
Subject: Re: [Lucid] FW: [ Re: Non-normalizable diacritics - new property]
List-Id: "Locale-free UniCode Identifiers (LUCID)" <>

--On Thursday, March 19, 2015 04:31 +0000 Shawn Steele
<> wrote:

>> > No, even all NFC or NFKC would be 100% unique to the machine
>> This is either tautologically true, or false.  Certainly we
>> learned with IDNA2003 that NFKC doesn't work, because while
>> it's good for increasing match probability the identifiers
>> aren't stable.  So when they're handed around through
>> different environments, stuff happens that is bad.
> I said "to the machine".  They're "just numbers", and the
> NFC/NFKC rules are mathematical.  Yes, you do have to exclude
> unassigned code points as those don't have defined behavior,
> however "Assigned NFC or NFKC for defined code points" would
> be 100% unique to the machine.


A few observations, although we may need to agree to disagree.

First, saying "NFC/NFKC rules" tends to obscure the issue here.
An important property of NFC/NFD is reversibility, i.e., a dual
relationship in the mathematical sense.  You may not be able to
recover the original input, but you can go back and forth
between the two normalized forms without loss of information.
By contrast, NFKC (and NFKD) are potentially information-losing,
and that loss is significant (although more so for some sets of
compatibility equivalences than others).
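As a quick sketch of that distinction (my illustration, using Python's
standard unicodedata module, not an example from this thread):

```python
import unicodedata

# "o-umlaut" as a precomposed code point (U+00F6) and as a base letter
# followed by COMBINING DIAERESIS (U+0308).
precomposed = "\u00F6"
combining = "o\u0308"

# NFC and NFD convert between the two forms with no loss of information:
assert unicodedata.normalize("NFC", combining) == precomposed
assert unicodedata.normalize("NFD", precomposed) == combining

# NFKC, by contrast, folds compatibility equivalences irreversibly:
# U+2460 CIRCLED DIGIT ONE becomes a plain "1", and no normalization
# form maps it back.
circled_one = "\u2460"
assert unicodedata.normalize("NFKC", circled_one) == "1"
```

The NFC/NFD pair round-trips; the NFKC step discards the "circled"
information entirely, which is exactly the stability problem IDNA2003
ran into.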

> I'm not trying to go from "this can't be perfect" to "we
> shouldn't try".  I am trying to say that this is good enough.
> The additional cost of trying doesn't add value.   

Sorry.  I think you can argue that it doesn't add enough value
to be worth the trouble.  I disagree with that given an
appropriate definition of the problem and believe we don't agree
on that definition.   But "doesn't add value"... well, there we
simply disagree.
>> I think speculating about the anthropololical facts here is
>> going to lead us to grief.  Let's stick with a domain of
>> discourse we know well.
> IDN is sociological exercise.  If the need was purely
> scientific/mathematical, then we'd only need a bunch of
> numbers or opaque IDs.  In order to make much progress here I
> think we need to think about how people use them.

At one level, I agree about "sociological exercise".  I would
normally characterize it a bit differently, but let's accept
that for the purposes of this note.  I would say much the same
thing, not just about IDNs, but about almost any human-readable
identifier, noting that, as soon as a "bunch of numbers" is
expressed as numerals in written forms, there are "sociological"
issues of number base and the set of numerals to be used (in
Unicode-land and modern writing systems bound to script for
everything but Arabic).  

However, maybe there is a different way to characterize this
issue that avoids the sociological exercise.  I continue to
accept that, at least in general, Unicode has a reasonable set
of coding principles and that, in general, a reasonable set of
decisions has been made given those principles.  I firmly
believe that a different coding system, with different
principles (or priorities among principles) would simply trade
one set of issues for another but might otherwise be equally
"good".  However, many of those decisions -- about coding, not
about characters -- do have alternatives.  For example, instead
of coding lower and upper case characters differently, one could
have dealt with case distinctions by assigning a single code
point to the case-insensitive abstract character and then using
a qualifier to designate case (either always for one case or
when anyone cared for both -- two different coding style
distinctions).  Similarly, ZWJ and ZWNJ are, in some sense,
indicators and artifacts of coding decisions, not "characters":
one could have avoided them by assigning separate code points
to the characters they affect on rendering (note that there have
been passionate arguments that just that should have been done
with some Indic scripts).  
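To make the "artifact of coding decisions" point concrete, here is a
small sketch (mine, not from the message) showing that normalization
treats ZWNJ as an ordinary code point rather than smoothing it away:

```python
import unicodedata

# A Persian word written with ZWNJ (U+200C) to suppress cursive joining,
# and the same letters without it.  The rendered shapes differ, but to
# the coding system these are simply two different code point sequences,
# and NFC leaves both of them alone.
with_zwnj = "\u0645\u06CC\u200C\u062E\u0648\u0627\u0647\u0645"
without_zwnj = with_zwnj.replace("\u200C", "")

assert unicodedata.normalize("NFC", with_zwnj) == with_zwnj
assert unicodedata.normalize("NFC", with_zwnj) != \
       unicodedata.normalize("NFC", without_zwnj)
```

Had separate code points been assigned to the joining and non-joining
presentation forms instead, the distinction would live in the character
repertoire rather than in an invisible format control.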

In a way, decisions about script boundaries are, themselves,
coding decisions rather than inherent character properties but,
having been vaguely involved with an attempt to define a
universal character set that did not depend on such boundaries,
that path seems to lead to madness.  Unicode still could have
defined its script boundaries differently, e.g., seeking a
higher level of integration (or "unification") across the board
but, while I think understanding that as a choice is helpful,
pursuing it very far probably is not.

Modulo the script boundary issue, it is (approximately) possible
to see all of the "confusion" and "too-similar appearance"
problems as human perception issues involving the
recipient/viewer (not really "sociological", but the distinction
may not be important).   Seen that way, the issue here is not
about the viewer perception issue but about the coding one --
resolving differences in decisions about how something might
have been coded (again and for convenience, within a script).  If
one goes back a few hundred years, the question might become,
not whether someone viewing the character would "see" the
difference but whether the calligrapher or perhaps even someone
trying to set a string in cold type would see an important
difference -- also, in a way, a coding decision problem.

Coming back to the issue that started this, had Unicode
(followed by the IETF) not deprecated embedded code points to
identify languages, U+08A1 could have been coded (using an odd
notation whose intent should be obvious) as 
    <lang:fula>U+0628 U+0654</lang>
without loss of any phonetic or semantic information and the
justification for adding the code point at all would disappear
(at least absent what we generically describe as "politics", in
which case the justification for its not decomposing would be
equally political).
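One can see the consequence of the coding decision that was actually
made directly in the character data (again my sketch, via Python's
unicodedata module; it needs a Python whose Unicode data is 7.0 or
later):

```python
import unicodedata

# U+08A1 ARABIC LETTER BEH WITH HAMZA ABOVE was assigned with no
# canonical decomposition, so no normalization form can equate it with
# the visually similar sequence BEH (U+0628) + HAMZA ABOVE (U+0654).
precomposed = "\u08A1"
sequence = "\u0628\u0654"

assert unicodedata.decomposition(precomposed) == ""  # nothing to decompose
assert unicodedata.normalize("NFC", sequence) != precomposed
assert unicodedata.normalize("NFD", precomposed) == precomposed
```

So the two spellings remain distinct identifiers under every
normalization form, which is the heart of the problem being discussed.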

Similarly, a different CCS could have avoided at least the
portion of the "Mathematical" collection that are ultimately
Latin or Greek letters in special fonts by different coding
conventions that would use the base character plus qualifiers
for usage and/or type style.

Another example lies in the collection of combining characters
that can be used to form precomposed characters that don't
decompose, not for, e.g., phonetic reasons but because the
definitions of those combining characters don't contain quite
enough information (see Section of
draft-klensin-idna-5892upd-unicode70-04.txt and its citations).
At least in theory, Unicode could have chosen to assign code
points to more precisely-defined (as to how they affect base
characters) combining characters.  A coding system with
different principles might have used position and/or size
indicator coding with similar effect (an approach that will
probably be needed if Unicode ever takes on, e.g., Classic
Mayan).

Now, from that perspective, this issue is about smoothing over
(by either some form of equivalence rules or exclusion (or
non-inclusion)) differences among character code sequences that
are the result of coding decisions (decisions that are at least
semi-arbitrary because others could have been made with no
"important" information loss -- and, yes, "important" is
sometimes debatable without yet other coding decisions).  For
sequences that compose and decompose symmetrically, NFC (or NFD)
normalization does the necessary job.  IDNA2008 disallows those
mathematical characters as a way to do a different part of the
job without making non-reversible compatibility equivalences
part of the standard.
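The mathematical-characters case is easy to demonstrate (my example,
not from the thread):

```python
import unicodedata

# U+1D400 MATHEMATICAL BOLD CAPITAL A is, underneath the typestyle, a
# Latin "A".  NFC keeps it distinct (it has no canonical decomposition);
# only the information-losing NFKC folds it to the base letter.
math_bold_a = "\U0001D400"

assert unicodedata.normalize("NFC", math_bold_a) == math_bold_a
assert unicodedata.normalize("NFKC", math_bold_a) == "A"
```

Because equating the two would require a non-reversible compatibility
mapping, IDNA2008 disallows the mathematical character outright rather
than building NFKC into the protocol.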

As another coding decision matter, all of this would be
significantly easier were Unicode consistent about its coding
distinctions.   Such consistency is likely impossible, at least
given other decisions, but that doesn't mean it wouldn't be
helpful.   However, we have, instead, a combining sequence
preference (with exceptions) for Latin but a precomposed
character preference for Arabic.  We have precomposed Latin
characters that decompose to combining sequences, except for
those involving some combining characters.  Most European
scripts code the
abstract graphics in grapheme clusters but East and South Asian
ones use indicators like ZWJ and ZWNJ.  There is a strict rule
against assigning separate code points to typestyle distinctions
but an exception for some usage contexts such as mathematics and
phonetic description.  Unicode does not have indicator codes or
separate code points for layout or presentation (leaving that to
external markup), but such coding has proved necessary for
writing systems that are primarily right-to-left and in such
cases as non-breaking space.  There are no language-dependent or
pronunciation-dependent coding distinctions (see Section 2 of
draft-klensin-idna-5892upd-unicode70-04.txt and/or Chapter 2 of
Unicode 7.0) except where there are. 

Again, I don't think any of those decisions are "wrong".  But
they are all problematic for the IETF's language-insensitive,
fairly context-free, identifier comparison purposes.  And they
are, at least IMO, worth some effort because (again, independent
of discussions about "confusion"), at least,

(i) We have already established the precedent of dealing with
all of the important groups of coding artifacts we knew about
when IDNA2008 was under consideration by adopting normalization
rules, DISALLOWing a lot of characters, and even developing
special context-dependent rules for some of them.

(ii) When different input methods, using data entry devices that
are indistinguishable to the user (e.g., the alphabetic key
layouts on a German keyboard for Windows, Linux, and the Mac are
the same) produce different output (stored) strings for
the same input, we are dealing with coding artifacts, not
"visual confusion".   Whether the difference in internal coding
is the decision of one system to prefer NFC and that of another
to prefer NFD or the result of one typist using an "ö" key and
another deciding to type an "o" and a "dead key" umlaut, we have
(and IMO, should have) comparison systems that eliminate those
coding differences.    This is, from that point of view, just a
new set of coding decision differences that neither Unicode nor
we arranged to compensate for earlier.
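The comparison systems alluded to above amount to something like the
following sketch (the helper name is mine, purely for illustration):

```python
import unicodedata

def identifiers_match(a: str, b: str) -> bool:
    """Hypothetical comparison helper: normalize both strings to NFC
    before comparing, so an NFC-preferring system and an NFD-preferring
    one -- or an "o-umlaut"-key typist and a dead-key typist -- produce
    identifiers that compare equal."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

# "schoen" stored precomposed vs. stored as a combining sequence:
assert identifiers_match("sch\u00F6n", "scho\u0308n")
assert "sch\u00F6n" != "scho\u0308n"  # a raw comparison would fail
```

This eliminates the NFC-vs-NFD class of coding difference; the point of
the rest of this note is that there are now other classes (U+08A1 being
the trigger) that no normalization step compensates for.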


One way to look at the issue involved here