Re: Fwd: Unicode 7.0.0, (combining) Hamza Above, and normalization for comparison

--On Tuesday, August 05, 2014 16:07 -0700 Mark Davis ☕️
<mark@macchiato.com> wrote:

>  I hadn't heard back from John, but I'm guessing that the
> right place to discuss this is here, based on Marc's email.

Mark (and, by extension, Ken and others who have said much the
same thing),

It is certainly a reasonable place.  As Patrik suggested, it may
not make much difference where it is discussed.

I think we have a fundamental difference in perspectives here
or, if you prefer, assumptions about criteria for making
decisions.   As long as that difference persists, while we can
do our best to understand each other and explain those
differences carefully, it is not likely that anyone will
convince anyone else to change positions.   That situation is
further complicated by the observation that I have believed all
along (where "all along" goes back prior to Unicode 1.0 to the
work that led to ISO 10646 DIS-1) that the nature of human
writing systems and their evolution leads to a situation in
which there are no perfect solutions, only choices among
tradeoffs and compromises.   I presume that belief is not
controversial: the first sentences of Section 2.2 (Unicode
Standard, versions 3 - 6.2 at least) say, I think, almost the
same thing.

Many of those reading this will recall, from the earliest days
of IDN discussions, that some people argued that the balance
struck by Unicode is fundamentally inappropriate for the needs
of IDNs and hence that the choice of Unicode as a base was,
itself, inappropriate.  Some have believed that because they
felt that unification lost too much information, others that
"compatibility ... with existing standards" led to bad design
decisions that even those prior standards would not have made
had they not been constrained (e.g., to seven or eight bits) or
to arbitrary and duplicative division into "scripts", and so on.
There was even an (IMO extreme) position that IDNs should be
based on a phonetic rendering rather than on any sort of coding
of writing systems.   

More of us have taken the position that Unicode represented a
reasonable balance --both generally and for IDN needs-- although
not a perfect one and that any other existing or hypothetical
system would just represent a different balance and different
set of tradeoffs and resulting issues.

There were also some fundamental IDN design decisions that had
little to do with the structure or contents of Unicode.
Perhaps the most important example was that IDNs would not be
coded by language or contain any sort of language identifier.
That decision involved its own set of tradeoffs: language coding
as part of IDN labels would have been possible and would clearly
convey some advantages, but, for example, the costs of having
false negatives on comparison if a user could not correctly
guess the language associated with a string seen "on the side of
a bus" were seen as unacceptable.  There are also some technical
issues resulting from the DNS context in which IDN labels are
accessed: even today, thinking persists in some quarters that a
top-level domain can be used to identify a language and the
language of all of the nodes associated with it.  To a [very]
limited extent (see below), that could be done as a matter of
policy but the DNS structure makes doing it as a technical
matter --essentially conditioning comparison of labels deep in
the tree based on higher (or TLD) nodes-- impossible.

This note is written in the spirit of a desire to explain, with
little expectation of convincing you that my position is,
indeed, heading in the right direction.   

The "no language coding for IDNs" decision mentioned above has
profound implications, especially when multiple languages share
a script.  Your experience is obviously much broader, but the
vast majority of the Unicode applications to coded character
strings I've seen fall into one of three categories:

 * The language context is explicit

 * Only a subset of Unicode is in use, with that subset
	chosen for the needs of a particular language community.
	In some respects, that sort of use is no different from
	the earlier tradition of choosing or designating (but
	not necessarily explicitly identifying) a national (or
	language) standard, "code table", or "code page" and
	then using it, except that the universality of Unicode
	is much more elegant (and non-problematic due to
	statelessness) than, e.g., the designation rules of
	ISO/IEC 2022.

 * The coded character strings are used for display
	purposes only and no one cares what the codes are as
	long as the display on the page comes out the same.

Because of the "no language coding" principle and the need to be
able to compare character strings to labels (and labels to each
other) with a high degree of accuracy, IDNs do not fall into any
of those categories.  Subject only to the exclusions (DISALLOWED
code points and character sequences not allowed by CONTEXTx or
Bidi rules) imposed by IDNA2008 (RFCs 5890-5893) and string
length limits, an IDNA label can consist of any sequence of
arbitrarily-chosen Unicode characters without regard to
language, script, or homogeneity constraints.  

Whether it was correct or not in retrospect, one of the
assumptions of the IDNA design --an assumption basic enough that
we didn't bother to make it explicit in any of the five base
documents -- was that, within a script, normalization would be
sufficient to cause two different ways of coding the same
abstracted glyph form to compare equal.  That assumption was
based, in part, on strong assurances that few, if any, new
precomposed characters would be added and that, if they were,
they would decompose back to the previously-existing combining
forms.  At least some of us believed that, if there were
exception cases of which we should be aware, someone more expert
in Unicode would tell us and no one did.  Would the knowledge
have made any difference to how IDNA was designed?  I don't know
-- more of those tradeoffs -- but at least we would have been
aware that there were cases in which normalization would not
behave as we felt we had been led to expect.  

Your comments (and earlier ones) about letters in Fula, etc.,
are extremely interesting and obviously important to Unicode
decisions about coding.  However, for better or worse, that "no
language information" element of the IDNA design renders any
distinctions (within a script) based on language irrelevant for
IDNA purposes.  From that IDNA perspective, U+08A1 (and U+0681
and U+076C) are simply exceptions to the (assumed and possibly
incorrect) principle that NFC or NFD normalization is
sufficient to assure IDNA comparison accuracy, where "IDNA
comparison" is bound to the appearance of the character forms
given a choice of type styles but without language information
or even language specific rendering.
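
The exception described above can be checked mechanically.  What
follows is a minimal sketch, assuming Python's standard
unicodedata module; the helper name nfc_equal and the PAIRS table
are illustrative only and come from no IDNA document.  U+0623 is
included purely as a contrast case, because it does have a
canonical decomposition.

```python
import unicodedata

HAMZA_ABOVE = "\u0654"  # ARABIC HAMZA ABOVE (combining)

# Illustrative table: precomposed letter -> base letter it visually builds on.
PAIRS = {
    "\u0623": "\u0627",  # ALEF WITH HAMZA ABOVE (contrast: decomposes canonically)
    "\u08A1": "\u0628",  # BEH WITH HAMZA ABOVE (no decomposition)
    "\u0681": "\u062D",  # HAH WITH HAMZA ABOVE (no decomposition)
    "\u076C": "\u0631",  # REH WITH HAMZA ABOVE (no decomposition)
}


def nfc_equal(a: str, b: str) -> bool:
    """Compare two strings after NFC normalization."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


if __name__ == "__main__":
    for precomposed, base in PAIRS.items():
        same = nfc_equal(precomposed, base + HAMZA_ABOVE)
        print(f"U+{ord(precomposed):04X} equals base + U+0654 under NFC: {same}")
```

Only the contrast case compares equal; for the three code points
discussed here, the precomposed form and the combining sequence
remain distinct strings after normalization.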

In retrospect, that assumption about comparison integrity under
normalization may have been naive.  Perhaps we would even have
done some things differently had we known that a half-dozen
years ago.   The question is what to do today.  In that regard,
the I-D really has two pieces.  One is an explanation of the
situation.   It could (and would) clearly be written differently
if IDN labels were language-sensitive within a script or if we
could allow different comparison rules for different scripts,
but those considerations and options are irrelevant to IDNA
except for historical perspective on how we got here.  The other
is the more operational question of whether U+08A1 should be
exceptionally DISALLOWED and, if not, what should be done with
it.  It seems to me there are only a few possibilities:

	(i) To DISALLOW it as the draft now suggests,
	acknowledging that doing so creates some issues for
	anyone intending to include Arabic/Ajami forms of
	Fula-specific characters or label strings or substrings
	in the DNS.  Doing so requires accepting the
	inconsistency with U+0681 and U+076C on the assumption
	that it is too late to change how those characters (or
	the composing sequences that could mechanically produce
	them) are handled.

	(ii) To DISALLOW U+08A1 and do something unpleasant,
	despite the risks of invalidating now-valid labels, with
	U+0681 and U+076C as well, thereby ending up in a
	consistent state.   Or to somehow figure out how to
	DISALLOW the combining sequences associated with all
	three.

	(iii) To take the "the problem has existed for a very
	long time" argument to the conclusion it seems to me
	that you are arguing for, which is that the existence of
	other cases means that this case has to be allowed too
	and doesn't make things significantly worse.  That may
	be the right answer for this case but I have to admit
	that, as a principle, it makes me very nervous, in part
	because those criteria in Section 2.2 have, necessarily,
	not been applied consistently across scripts and
	languages (the text more or less says that -- the
	comment is not a criticism).

	(iv) To conclude that a requirement for NFC is not
	adequate for IDNA and create a new exception category
	that applies special, IDNA-specific, normalization rules
	that would have the effect of disallowing certain
	character sequences that cannot be disallowed by the
	current contextual rules and then apply that category to
	these in-script "it is really a character for that
	language but only a combining sequence for those others"
	situations.
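
To make option (iv) concrete, here is a rough sketch of what an
extra, IDNA-specific rule of that kind might look like.
Everything in it is hypothetical: the table LOOKALIKE_SEQUENCES,
the function names, and the choice to list only the three cases
named in this thread.  It is not taken from any RFC, and it
deliberately ignores harder cases such as other combining marks
that canonical ordering would place between the base letter and
the Hamza.

```python
import unicodedata

# Hypothetical table: combining sequences that render like a precomposed
# letter which has no canonical decomposition (sequence -> precomposed form).
LOOKALIKE_SEQUENCES = {
    "\u0628\u0654": "\u08A1",  # BEH + HAMZA ABOVE vs. U+08A1
    "\u062D\u0654": "\u0681",  # HAH + HAMZA ABOVE vs. U+0681
    "\u0631\u0654": "\u076C",  # REH + HAMZA ABOVE vs. U+076C
}


def find_lookalike_sequences(label: str) -> list:
    """Return the listed combining sequences that occur in the NFC form of label."""
    nfc = unicodedata.normalize("NFC", label)
    return [seq for seq in LOOKALIKE_SEQUENCES if seq in nfc]


def passes_extra_rule(label: str) -> bool:
    """True if the label contains none of the listed combining sequences."""
    return not find_lookalike_sequences(label)
```

Under this sketch a label containing U+08A1 itself passes, while
one spelling the same form as U+0628 U+0654 is rejected.  That is
one of several possible policies; the reverse choice (rejecting
the precomposed form) is what option (i) amounts to.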

Again, all of this is a result of the IDNA "no language
information" design constraint.  To say things equivalent to
"NFC works this way, its normalizations have been defined this
way, and it is incorrect to say that things 'should' work some
other way" is, in that context, not a comment about my draft as
much as it is an assertion that NFC, as defined, is less
appropriate for IDNA than we might have thought.  That may or
may not lead to the conclusion that (iv) above is the proper
solution for IDNA, that (i) and (ii) are just poor surrogates
for it, and that (iii) is just a form of denial.   As I have
said before, none of this is in any way a criticism of Unicode
decisions made with other contexts in mind.  All I can say
personally about those four options right now is that I think it
is very important that we document the problems, the risks that
go with them, and, by extension, the difficulties created for IDNA
by those decisions that [otherwise] make sense for Unicode.
Which of those options is then chosen may be of less importance.

Especially given the language issue, I don't think it is worth
going through your note below point by point.  I should,
however, comment on one IDNA-related issue.  You (or whoever
you are quoting) wrote:

> There are four levels at which confusables, including
> homoglyphs, can be addressed for domain names
> 
> 1. Encoding
> 2. Protocol (IDNA)
> 3. Label Generation Ruleset
> 4. String Review

> A more natural level [for addressing
> confusables] would be the Label Generation Ruleset level.
>...

Unless "string review" is intended to include it, there is at
least one more level.   Since RFC 1591 or earlier (i.e., long
before IDNs) any domain registry, at any level of the tree, can
apply rules to the particular labels that will be accepted for
entry into its zone, rules that are more restrictive than the
constraints of the relevant protocols.  We usually assume that
such restrictions will further narrow whatever is permitted at
the level above and, often, that it will propagate downward, but
there is no technically-plausible way to enforce those
relationships (what can be done by draconian administration is
another matter).  

That level and set of distinctions are important because of
their corollaries.  First, the LGR process is applicable only to
ICANN-delegated TLDs (I believe strictly only those created as
part of the new gTLD process since other rules apply to IDN
ccTLDs).  As far as I know, no one has seriously suggested that
they be imposed on second-level registrations within
ICANN-contracted TLDs, much less on other second-level
registrations or registrations below that level.   Second,
despite the obvious relationship between "two identical
character forms within the same script, coded in different ways,
do not compare equal" and confusability, my concern about U+08A1
(and similar situations) is not about the rather subjective
issue of confusability, it is about IDNA's expectations about
the properties of normalization and their relationship to
equality comparison.  By itself, the confusability issue, as you
have pointed out several times, is mostly a cross-script
similarity issue.  ICANN has chosen to sweep many examples of
those cross-script similarities away by a prohibition on
mixed-script labels (with "COMMON", etc., as special cases) but,
again, that prohibition is effective for labels created in
ICANN-created zones and perhaps by contracted parties and is
nothing more than a recommendation for other zones.  I note that
the IDNA work considered a protocol prohibition on mixed-script
labels and decided against it because such labels make
considerable sense in some cases, especially ones that were
expected to occur lower in the tree.

Now, perhaps the best (or least-bad) solution for the present
situation would be to combine (iv) with a strong warning about
this and other situations in which normalization is insufficient
to provide a good language-independent test of visual equality
within a script but without language or writing system
information, enumerating those cases, and advising registries at
all levels of the tree to take appropriate precautions.  I can't
speak for others, but I'd welcome suggested text and a list of
cases along those lines.  I'm skeptical about the adoption of
such a list as long as UTR 46, and other efforts encouraged by
it, essentially discourage adoption of IDNA 2008, but that is
just my opinion.
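
As one illustration of the kind of precaution a registry could
take on its own, independent of the protocol, the sketch below
blocks a new label when it differs from an already-registered
one only in how one of the enumerated forms is coded.  The Zone
class, variant_key, and the VARIANT_FOLD table are all
hypothetical names invented for this example; a real registry
would work from a reviewed list of cases rather than these three.

```python
import unicodedata

# Hypothetical fold table: precomposed letter -> look-alike combining sequence.
VARIANT_FOLD = {
    "\u08A1": "\u0628\u0654",  # BEH WITH HAMZA ABOVE
    "\u0681": "\u062D\u0654",  # HAH WITH HAMZA ABOVE
    "\u076C": "\u0631\u0654",  # REH WITH HAMZA ABOVE
}


def variant_key(label: str) -> str:
    """Comparison key only: NFC, then fold the listed letters to sequences."""
    nfc = unicodedata.normalize("NFC", label)
    return "".join(VARIANT_FOLD.get(ch, ch) for ch in nfc)


class Zone:
    """Toy registry that refuses labels colliding under variant_key."""

    def __init__(self):
        self._labels_by_key = {}

    def register(self, label: str) -> bool:
        key = variant_key(label)
        if key in self._labels_by_key:
            return False  # blocked: an equivalent-looking label already exists
        self._labels_by_key[key] = label
        return True
```

Either coding of the form can be registered first; whichever
arrives second is refused.  The key is used for comparison only
and is never itself stored in the DNS.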

best regards,
    john

> ---------- Forwarded message ----------
> From: Mark Davis ☕️ <mark@macchiato.com>
> Date: Wed, Jul 30, 2014 at 7:38 AM
> Subject: Re: Unicode 7.0.0, (combining) Hamza Above, and
> normalization for comparison
> To: John C Klensin <john+w3c@jck.com>, Patrik Fältström
> <paf@frobbit.se>
> Cc: member-i18n-core@w3.org, Asmus Freytag
> <asmusf@ix.netcom.com>
> 
> 
> On what email address is this being discussed?  I'd like to
> convey to that list some comments from an internal
> discussion about
> draft-klensin-idna-5892upd-unicode70-00.txt.
> 
> (These are not my wording, but I agree with them. I edited
> slightly for flow. I will add that from a confusability
> standpoint, the proposed draft accomplishes nothing, since
> there are thousands of cases of confusable characters;
> restricting just this one character has no useful effect; like
> removing a quart of water from a lake.)
> 
> For U+08A1, this certainly is a *letter* of Fula (Fulfulde,
> Pula, ...), a large language spoken across swaths of
> West Africa. Fula is mostly written with the Latin script,
> but Islamists also write it in Ajami (Arabic extensions for
> African languages), particularly in Guinea.
> See:
> 
> http://en.wikipedia.org/wiki/Fula_orthographies
> 
> The *letter* in question is the one used to write the phoneme
> /ɓ/, the bilabial implosive. See:
> 
> http://en.wikipedia.org/wiki/%C6%81
> 
> for the African alphabet convention for the Latin writing of
> this letter.
> 
> For the Arabic Ajami alphabets for Fula, the form has been
> missing. For whatever reason, in at least one Fulfulde Ajami
> orthography, this implosive was (reasonably) represented by
> using a Hamza diacritic on the beh letter. Following the way
> such *diacritic* (ijam) letter derivations are encoded in the
> Unicode Standard, a separate, non-decomposed entry was
> required. Note that this use of Hamza is *different* from the
> Arabic (language) use of a combining Hamza to indicate a
> glottal stop, often in combination with a letter that is
> actually pronounced as a vowel.
> 
> As to *why* it was encoded as a single, undecomposed letter,
> that is explained at length in the proposal document, as well
> as in the section on Hamza in the Unicode Standard, which you
> have referred to in the Internet Draft you mention.
> 
> The newly encoded character U+08A1 for Unicode 7.0 has
> *already* been added to the relevant table "Arabic Letters
> With Hamza Above" in the draft core specification for
> Unicode 7.0, where, like the long-encoded U+0681
> and U+076C, it is noted as having no decomposition.
> (The core specification will be posted around October -- it
> is still undergoing its final editing for all the 7.0
> additions.)
> 
> U+08A1 does not have a canonical decomposition in Unicode 7.0
> (nor, of course, will it *ever* have a canonical decomposition,
> because of normalization stability). This is exactly the same
> treatment that U+0681 and U+076C got, and for exactly the same
> reasons. (And, as you know, of course, those characters date
> back to Unicode 4.1 for U+076C and even earlier, Unicode 1.1
> for U+0681.)
> 
> Note that it is incorrect to assert that U+08A1 ARABIC LETTER
> BEH WITH HAMZA ABOVE "should" be normalized to U+0628 ARABIC
> LETTER BEH + U+0654 ARABIC HAMZA ABOVE. Those are distinct
> sequences, and they are never going to compare equal in their
> NFC normalizations.
> 
> I am concerned that the Internet Draft here is heading in
> exactly the wrong direction. If it ends up changing RFC 5892
> to override the derivation for U+08A1 and force it to INVALID,
> all I can see that accomplishing is to guarantee forever that
> correctly spelled Ajami Fulfulde cannot be used in domain
> names, and that instead people would end up having to use
> misspellings to represent their implosive b in domain names.
> 
> With all due respect to the Arabic script experts that have
> been consulted, I rather doubt that they are experts on Ajami
> orthographies in West Africa, or are in touch with the people
> who would be supporting those languages and implementing
> keyboarding and such for West Africa.
> 
> Also, I don't see any way you can justify the abrupt (and
> permanent) discontinuity that this would put in place between
> the treatment of U+08A1 for Fulfulde and U+076C for Ormuri or
> U+0681 for Pashto.
> 
> 
> If you are looking for a more analogous precedent I suggest,
> for example:
> 
> U+2C65 LATIN SMALL LETTER A WITH STROKE
> 
> That was added in Unicode 5.0, and nobody has ever had any
> problem with it being PVALID in IDNA. It only has limited use
> in a minor orthography, but what is the harm?
> 
> Now, if you examine U+2C65, you could well claim that it
> *should* be decomposed to "a" plus the combining stroke
> overlay, U+0338. And both of those have been encoded for a
> long, long time in the standard, so in principle, somebody
> *could* have been representing their data for a letter a with
> stroke before Unicode 5.0 using the sequence with the stroke
> overlay. It might even look o.k. in text, depending on the
> font support for the combina̸tion. But the Unicode Standard
> has rules now for the encoding of certain combinations of base
> letters and diacritic modifiers that overlay or modify the
> base character form. So U+2C65 was separately encoded. And
> there is no normalization of the sequence involved. That
> stroked letter use is, in text, distinct from somebody, say,
> using a bunch of overlay strokes as a strikethrough convention
> for some reason: a̸a̸a̸a̸a̸a̸
> 
> Consider the Hamza diacritic as falling in this same class of
> edge cases, if you will.
> 
> And in this case, I don't think it will be doing anybody any
> favors to update RFC 5892 to make U+08A1 DISALLOWED in IDNA.
> It doesn't "fix" normalization for it. All it accomplishes is
> to force any Fulfulde user of Ajami orthography to misspell
> their text in order to use a /ɓ/ in a domain name. It would
> just create an unexplained (and unfixable) discontinuity
> between what the domain registrations would accept and what
> the Fulfulde input and spelling tools would support. Or I
> guess it would just force people to give up the Arabic
> spellings and go back to the more widely supported Latin
> alphabets for Fula to get their domain names.
> 
> What would be accomplished by forcing another point
> incompatibility that just ends up getting carried around
> forever?
> 
> ====
> 
> There are four levels at which confusables, including
> homoglyphs, can be addressed for domain names
> 
> 1. Encoding
> 2. Protocol (IDNA)
> 3. Label Generation Ruleset
> 4. String Review
> 
> A more natural level [for addressing confusables]
> would be the Label Generation Ruleset level. For an LGR,
> there are three ways to deal with homoglyphs, one of which is
> not available on the protocol level. The first two of these
> are to rule out a code point (by not including it in the LGR's
> repertoire), or to rule out a code point or sequence
> conditionally. Unlike using these methods on the Protocol
> level, doing so on the LGR level means that it is possible to
> be more restrictive, say, for the root of the DNS than for
> domains several levels down the tree. The downside of using the
> LGR is, of course, that it is specific to the given zone on
> the internet.
> 
> The upside is that an LGR has additional mechanisms, such as
> defining a "blocked" variant. That creates an "either/or"
> situation, where both are permitted, but not at the same time
> in the same position of an otherwise identical label. This is
> a very nice solution for a number of confusables/homoglyphs
> that are systemic (not dependent on accidents of rendering or
> "arms length" similarity).
> 
> Unlike the final level, String Review, an LGR has the
> advantage of being applied mechanically without any
> case-by-case review, which is why it's appropriate for cases
> like the one that gave rise to this discussion.
> 
> In principle, both the Label Generation Ruleset and the String
> Review are created/carried out by people/entities that have
> access to the necessary and specific linguistic and script
> expertise, unlike IDNA which seems to be created largely by
> protocol experts.
> 
> 
> On Tue, Jul 22, 2014 at 12:22 AM, John C Klensin
> <john+w3c@jck.com> wrote:
> 
>> Hi.   I was asked to forward the announcement of this Internet
>> Draft to this group once it was posted.  See attached.
>> 
>> For information -- comments welcome, but the core issue may be
>> rather specific to concerns that surround IDNs and IDNA.  Or
>> not.
>> 
>> Of course, if I/we are still completely confused, corrections
>> and explanations would be welcome.
>> 
>>     john
>> 
>> 
>> ---------- Forwarded message ----------
>> From: internet-drafts@ietf.org
>> To: i-d-announce@ietf.org
>> Cc:
>> Date: Mon, 21 Jul 2014 04:03:58 -0700
>> Subject: I-D Action:
>> draft-klensin-idna-5892upd-unicode70-00.txt
>> 
>> A New Internet-Draft is available from the on-line
>> Internet-Drafts directories.
>> 
>> 
>>         Title           : IDNA Update for Unicode 7.0.0
>>         Authors         : John C Klensin
>>                           Patrik Faltstrom
>>         Filename        : draft-klensin-idna-5892upd-unicode70-00.txt
>>         Pages           : 10
>>         Date            : 2014-07-21
>> 
>> Abstract:
>>    The current version of the IDNA specifications anticipated
>>    that each new version of Unicode would be reviewed to
>>    verify that no changes had been introduced that required
>>    adjustments to the set of rules and, in particular,
>>    whether new exceptions or backward compatibility
>>    adjustments were needed.  That review was conducted for
>>    Unicode 7.0.0 and identified a problematic new code point.
>>    This specification updates RFC 5892 to disallow that code
>>    point and provides information about the reasons why that
>>    exclusion is appropriate.  It also applies an editorial
>>    clarification that was the subject of an earlier erratum.
>> 
>> 
>> The IETF datatracker status page for this draft is:
>> https://datatracker.ietf.org/doc/draft-klensin-idna-5892upd-unicode70/
>> 
>> There's also a htmlized version available at:
>> http://tools.ietf.org/html/draft-klensin-idna-5892upd-unicode70-00
>> 
>> 
>> Please note that it may take a couple of minutes from the
>> time of submission until the htmlized version and diff are
>> available at tools.ietf.org.
>> 
>> Internet-Drafts are also available by anonymous FTP at:
>> ftp://ftp.ietf.org/internet-drafts/
>> 
>> _______________________________________________
>> I-D-Announce mailing list
>> I-D-Announce@ietf.org
>> https://www.ietf.org/mailman/listinfo/i-d-announce
>> Internet-Draft directories: http://www.ietf.org/shadow.html
>> or ftp://ftp.ietf.org/ietf/1shadow-sites.txt
>> 
>>