[Idna-update] Heads up -- Arabic Normalization or rendering issue -- draft UTR 53

John C Klensin <john-ietf@jck.com> Thu, 05 October 2017 17:35 UTC

Date: Thu, 05 Oct 2017 13:35:00 -0400
From: John C Klensin <john-ietf@jck.com>
To: idna-update@ietf.org
Message-ID: <8EB791E33C4FFB1EEE2614FB@PSB>
MIME-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit
Content-Disposition: inline
Archived-At: <https://mailarchive.ietf.org/arch/msg/idna-update/NUhnoMeTDpMDKa3HRWWBB3hBQL4>
Subject: [Idna-update] Heads up -- Arabic Normalization or rendering issue -- draft UTR 53
Precedence: list

Hi.

I call the attention of this group to draft UTR#53, "UNICODE
ARABIC MARK ORDERING ALGORITHM".   I'm not going to try to
summarize it because people should read it for themselves, but
it _may_ have some or all of the following implications:

* Special normalization rules, based on combining classes and
inconsistent with current Unicode canonical normalization, may
be needed for Arabic script when combining marks are used.
That would presumably have significant impact on IDNA.  On the
other hand, if we continue to require that all strings that are
input to IDNA be in NFC form, this might be irrelevant to IDNA
even if it is important for Arabic.

* If certain marks are used, Arabic script requires more
attention to rendering than current Unicode specifications
assume.  That probably would not affect IDNA because we do not
deal with rendering at all.

* Independent of the proposed solution, the document suggests
that a situation exists that is the opposite of "confusability",
i.e., a single valid U-label might look very different, and
possibly even be interpreted as different strings, depending on
whether the label is in NFC form or has had something like this
algorithm applied after normalization.
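
To make the first and third points concrete, here is a minimal
sketch (not taken from the draft) of how canonical ordering treats
two common Arabic marks.  It uses only Python's standard
unicodedata module; the choice of BEH, SHADDA, and FATHA and the
helper name describe() are illustrative assumptions.

```python
import unicodedata

# Illustrative code points: a base letter and two combining marks.
BEH, SHADDA, FATHA = "\u0628", "\u0651", "\u064E"

def describe(s):
    """Return (code point, canonical combining class) for each character."""
    return [(f"U+{ord(c):04X}", unicodedata.combining(c)) for c in s]

# A plausible input order: base letter, then shadda, then fatha.
typed = BEH + SHADDA + FATHA

# Canonical ordering sorts adjacent marks by combining class, so
# FATHA (class 30) is moved ahead of SHADDA (class 33).
nfc = unicodedata.normalize("NFC", typed)

print(describe(typed))  # shadda (33) precedes fatha (30)
print(describe(nfc))    # fatha (30) precedes shadda (33)
```

The NFC string is what IDNA would see; whether a renderer handles
that mark order as intended is the question the draft raises.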

I will leave other IDNA implications to the imagination (and
further research), but it is possible that we need to build
contextual rules around combining classes.   On the other hand,
if combining sequences were banned entirely with Arabic script,
as has been proposed a few times, this (real or potential)
problem would disappear for IDNs.  PRECIS would probably be
another story.

If this really is intended as a normalization add-on or
post-processing step, it seems to me it is really a new
normalization form or set of forms (additional to NFC, NFD, NFKC,
and NFKD) in disguise and that treating it as such might be a lot
less problematic than trying to be sure UAOA is applied (if
needed) each time one of the traditional normalizations is
applied.
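
The fragility of a post-processing step can be sketched as
follows.  The function shadda_first() below is a hypothetical toy
reordering, NOT the algorithm in draft UTR#53; it only stands in
for "some mark reordering applied after normalization" to show
that the result is no longer in NFC and that renormalizing
silently undoes it.

```python
import unicodedata

SHADDA = "\u0651"

def shadda_first(s):
    """Toy reordering (hypothetical): within each run of combining
    marks, move SHADDA to the front and keep the other marks in order."""
    out, run = [], []

    def flush():
        # Stable sort: SHADDA sorts first, everything else keeps its order.
        out.extend(sorted(run, key=lambda c: c != SHADDA))
        run.clear()

    for ch in s:
        if unicodedata.combining(ch):
            run.append(ch)      # still inside a run of marks
        else:
            flush()             # a base character ends the run
            out.append(ch)
    flush()
    return "".join(out)

label = "\u0628\u064E\u0651"            # beh, fatha, shadda: already NFC
reordered = shadda_first(label)         # beh, shadda, fatha
renormalized = unicodedata.normalize("NFC", reordered)

print(reordered == label)       # False: the reordered string is not NFC
print(renormalized == label)    # True: NFC puts the marks back
```

Any processing stage that renormalizes would therefore have to
reapply the reordering afterwards, which is the bookkeeping burden
that a distinct, named normalization form would avoid.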

Note that this document is out for public review.  Anyone having
comments on it should send them directly to Unicode as specified
in http://www.unicode.org/review/pri359/.   The document is
probably worth discussing on this list only with regard to
IDNA-specific effects.

    john
