Re: [I18nrp] Last Call: <draft-faltstrom-unicode11-05.txt> (IDNA2008 and Unicode 11.0.0) to Informational RFC

On 12/3/2018 4:02 PM, Paul Hoffman wrote:
> Before I go to the ietf@ietf.org mailing list with my concerns about 
> this draft, I hope it is OK to bounce them off people here in case I'm 
> wildly off track.
>
> =====
>
> In Section 1:
>    Specifically, the Internet Architecture Board did issue a statement
>    [IAB] which requested IETF to resolve the issues related to the code
>    point ARABIC LETTER BEH WITH HAMZA ABOVE (U+08A1), introduced in
>    Unicode 7.0.0 [Unicode-7.0.0].  This document resolves this issue and
>    suggests IDNA2008 standard is to follow the Unicode Standard and not
>    update RFC 5892 [RFC5892] or any other IDNA2008 RFCs.
>
> In Section 4.1:
>    The discussion in the IETF concluded that although it is possible to
>    create "the same" character in multiple ways, the issue with U+08A1
>    is not unique.  In the case of U+08A1, it can be represented with the
>    sequence ARABIC LETTER BEH (U+0628) and ARABIC HAMZA ABOVE (U+0654).
>    Just like LATIN SMALL LETTER A WITH DIAERESIS (U+00E4) can be
>    represented via the sequence LATIN SMALL LETTER A (U+0061), and
>    COMBINING DIAERESIS (U+0308).  One difference between these sequences
>    is how they are treated in the normalization forms specified by the
>    Unicode Consortium.
>
> This sounds like the IETF is saying that if the Unicode Consortium 
> changes how a character appears in a normalization form other than for 
> case folding (Section 2.2 of RFC 5892), that change does not affect 
> the tables for IDNA2008. Is that correct?

It's actually not correct that these are necessarily the "same" 
characters, if by "same" you mean _identical_ glyph outlines in 
high-resolution/high-quality layout.

While U+08A1 contains a graphical element that looks like the letter BEH 
and another that looks like a HAMZA, the relative placement of these two 
is not necessarily the same as when both BEH and HAMZA are used 
independently. (This effectively reflects that the combination is an 
independent letter, not an alternate encoding for the sequence.)

This is unlike the case of Latin Ä, where high-resolution/high-quality 
rendering ideally results in the exact same appearance, reflecting the 
fact that the precomposed code point and the sequence are alternate 
encodings of the same letter.
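
The difference in treatment can be checked directly. The following is a 
minimal sketch, assuming Python's standard unicodedata module (whose 
Unicode database version depends on the Python release); the constant 
and function names are mine, chosen only for illustration.

```python
import unicodedata

A_DIAERESIS = "\u00e4"         # LATIN SMALL LETTER A WITH DIAERESIS
A_SEQUENCE = "\u0061\u0308"    # LATIN SMALL LETTER A + COMBINING DIAERESIS
BEH_HAMZA = "\u08a1"           # ARABIC LETTER BEH WITH HAMZA ABOVE
BEH_SEQUENCE = "\u0628\u0654"  # ARABIC LETTER BEH + ARABIC HAMZA ABOVE


def nfc(text):
    """Return the canonical composed form (NFC) of text."""
    return unicodedata.normalize("NFC", text)


def nfd(text):
    """Return the canonical decomposed form (NFD) of text."""
    return unicodedata.normalize("NFD", text)


# Latin: the precomposed code point and the sequence fold into each other.
latin_folds = nfc(A_SEQUENCE) == A_DIAERESIS and nfd(A_DIAERESIS) == A_SEQUENCE

# Arabic: U+08A1 has no decomposition, so neither form maps one to the other.
arabic_folds = nfc(BEH_SEQUENCE) == BEH_HAMZA or nfd(BEH_HAMZA) == BEH_SEQUENCE

print(latin_folds, arabic_folds)  # True False
```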

Unicode normalization has never been a full "glyph folding", and it was 
never intended to be one: its design point is to fold alternate 
encodings for the same abstract entity. This is not the same as 
preventing exact lookalikes, because there are some entities that are 
quite obviously logically distinct, yet are lookalikes.

The cases of Latin/Greek capitals (not so relevant for IDNs) and 
Latin/Cyrillic (both capitals and lowercase, therefore more relevant for 
IDNs) are well known. Everybody accepts that once a letter is part of a 
different script, it is different and cannot be normalized.

Another distinction that cannot be normalized is that between letters 
and digits. Yet there are quite a few scripts where native digits 
sometimes have the same shape as a native letter. (In Latin we have 0 
and O; in our typographic tradition most, but not all, fonts will strive 
to make these distinct - that is not so in other scripts, where the 
shapes are often identical). Again, there's a shared understanding that 
normalizing digits to letters causes more problems than it solves and 
therefore the two are treated as distinct.
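
As a small illustration of both kinds of distinction, here is a sketch 
(again assuming Python's standard unicodedata module; the pair list and 
helper name are hypothetical) that checks whether any of the four 
normalization forms would fold some well-known lookalikes together.

```python
import unicodedata

# Lookalike pairs that are logically distinct entities.
PAIRS = [
    ("\u0061", "\u0430"),  # LATIN SMALL LETTER A vs CYRILLIC SMALL LETTER A
    ("\u0041", "\u0391"),  # LATIN CAPITAL LETTER A vs GREEK CAPITAL LETTER ALPHA
    ("\u004f", "\u0030"),  # LATIN CAPITAL LETTER O vs DIGIT ZERO
]
FORMS = ("NFC", "NFD", "NFKC", "NFKD")


def folded_together(first, second):
    """Return True if any normalization form maps both strings to the same result."""
    return any(
        unicodedata.normalize(form, first) == unicodedata.normalize(form, second)
        for form in FORMS
    )


results = [folded_together(first, second) for first, second in PAIRS]
print(results)  # [False, False, False]
```

None of the pairs is folded, which is the expected result: normalization 
leaves cross-script lookalikes and letter/digit lookalikes untouched.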

What I see Patrik's document reflecting is the inherent limitation of 
the normalization algorithm: it folds alternate encodings of the same 
abstract entity, but does not purport to be a universal glyph folding.

Given that the set of PVALID code points has long included both code 
points and sequences that have identical glyphs, it was surprising that 
the process was stopped for something that was merely a set of close 
lookalikes.

Finally, there's nothing here about a "change" in normalization. Whether 
any code point that is contemplated for addition should have a 
decomposition or not is something that depends on the identity of the 
abstract entity that it represents, not simply on the shape, and 
certainly not on an approximate shape - or a shape-oriented character name.

Now, once a decomposition is found to be appropriate, the stability 
rules actually stipulate that the code point should not be added, so the 
only cases that are at all permissible are code points that represent a 
distinct entity that is not otherwise encoded. And, to continue that 
line of thought, the fact that some entities share a shape (or that some 
sequences could result in a more or less identical shape) is not 
something that affects this determination. Unicode is fundamentally not 
a lego set for creating glyph shapes; what is encoded is abstract 
characters (even if there's the historical baggage of precomposed 
characters in the face of open-ended use of diacritics that has led to 
normalization in the Latin, Greek and Cyrillic scripts).

>
> =====
>
> In Section 4.1:
>    As U+08A1 is discussed in draft-freytag-troublesome-characters
>    [I-D.freytag-troublesome-characters] and elsewhere.  Regardless of
>    whether those discussions ends in recommending including the code
>    point in the repertoire of characters permissable for registration or
>    not, it is acceptable to allow the code point to have a derived
>    property value of PVALID.
>
> This sounds like it is saying that even though 
> draft-freytag-troublesome-characters is meant for standards track, 
> because it is not yet finished, this document (which is informational) 
> can ignore the other document and make changes to the IANA registry. 
> If that's correct, it concerns me because it could make the IANA 
> registry unstable for characters that we know about and are actively 
> discussing. If I'm not correct, I'd like to hear why so that maybe 
> this document can be reworded.

The set of code points that are PVALID is overbroad; it represents an 
outer envelope of what is allowed under the protocol for ANY given zone, 
but that set is much wider than what should be RECOMMENDED for inclusion 
in ALL zones (or, setting aside the special restrictions for the Root, 
recommended for all public zones).

The draft ID cited in Section 4.1 is clear that it does not intend to 
change the set of PVALID code points; there are some zones where 
allowing any PVALID code point may well not cause issues. However, in 
widely shared zones, particularly those with users from different 
languages and scripts, a well-chosen subset of the PVALID code points 
would reduce security issues and improve usability.

The details of the best approach in these cases are not always obvious. 
In Arabic, for example, it is the combining HAMZA that would become NOT 
RECOMMENDED, because (unlike the letter U+08A1 and similar) it is not 
needed to form "useful mnemonics" in Arabic or any of the other 
languages using the Arabic script.
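
To make the subset idea concrete, here is a minimal sketch in Python. 
The repertoire below is a tiny hypothetical example, not an actual zone 
policy or a recommendation, and the names ZONE_REPERTOIRE and 
label_in_repertoire are invented for illustration. It assumes a zone 
that admits the letter U+08A1 but leaves out the combining HAMZA.

```python
# Hypothetical zone repertoire (illustration only, not a real policy):
# ALEF, BEH, TEH and BEH WITH HAMZA ABOVE are admitted; the combining
# ARABIC HAMZA ABOVE (U+0654) is deliberately left out.
ZONE_REPERTOIRE = frozenset({0x0627, 0x0628, 0x062A, 0x08A1})


def label_in_repertoire(label, repertoire=ZONE_REPERTOIRE):
    """Return True if every code point of label is in the zone's repertoire."""
    return all(ord(char) in repertoire for char in label)


with_letter = label_in_repertoire("\u08a1\u0627\u0628")    # uses U+08A1
with_sequence = label_in_repertoire("\u0628\u0654\u0627")  # uses BEH + U+0654

print(with_letter, with_sequence)  # True False
```

Both labels would pass a check against the full PVALID set; only the 
zone's narrower repertoire distinguishes them.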

PVALID code points include those that look like (!) and (') - these are 
also not to be recommended for general use.

In the Indic scripts there are many examples of compound glyphs that can 
be achieved by more than one sequence of code points. In these 
instances, the Unicode standard has gone on record identifying the 
sequences that MUST NOT be used. (As only one of the alternate sequences 
is legitimate, there is no normalization -- because normalization would 
have implied both sequences are equivalent).

Unfortunately, IDNA2008 lacks a scalable mechanism to DISALLOW sequences 
of code points. (The only available formalism, the use of CONTEXTO rules, 
is not scalable beyond a few cases, as it is not presented in a 
machine-readable format -- unlike, for example, RFC 7940.)

The draft ID cited in Section 4.1 may well RECOMMEND that registries not 
allow such sequences (and that they use machine-readable specifications 
based on RFC 7940 to do so). However, that entire discussion is 
unaffected by Patrik's document and unaffected by IANA proceeding in the 
context of the existing parameters for IDNA2008.

A deeper linguistic reason for keeping the domains of these IDs 
separate lies in the fact that best practices for support of complex 
scripts in public zones would implement policies that have context rules 
that go beyond preventing specific alternate sequences. In many cases, 
the latter fall out from applying more general restrictions that 
reflect, for example, the syllable structure that is a common feature of 
complex scripts. Such general restrictions are sensitive not only to the 
script, but in some cases, to the language(s) to be supported. A 
one-size-fits-all approach (as would be required on the protocol level) 
is therefore inappropriate.

Perhaps it might be possible to edit the text of Section 4.1 to make 
sure that the "repertoire" mentioned is to be understood as the 
repertoire for "any particular zone" as opposed to something like the 
set of PVALID code points.

With that clarification, I see no issue in that section.

A./

>
> --Paul Hoffman
>
> _______________________________________________
> i18nRP mailing list
> i18nRP@ietf.org
> https://www.ietf.org/mailman/listinfo/i18nrp
>