[Idna-update] Heads up -- Arabic Normalization or rendering issue -- draft UTR 53

John C Klensin <john-ietf@jck.com> Thu, 05 October 2017 17:35 UTC

Archived-At: <https://mailarchive.ietf.org/arch/msg/idna-update/NUhnoMeTDpMDKa3HRWWBB3hBQL4>

Hi.

I call the attention of this group to draft UTR#53, "UNICODE
ARABIC MARK ORDERING ALGORITHM".   I'm not going to try to
summarize it because people should read it for themselves, but
it _may_ have some or all of the following implications:

* Special normalization rules, based on combining classes and
inconsistent with current Unicode canonical normalization, may
be needed for Arabic script when combining marks are used.
That would presumably have significant impact on IDNA.  On the
other hand, if we continue to require that all strings that are
input to IDNA be in NFC form, this might be irrelevant to IDNA
even if it is important for Arabic.

* If certain marks are used, Arabic script requires more
attention to rendering than current Unicode specifications
assume.  That probably would not affect IDNA because we do not
deal with rendering at all.

* Independent of the proposed solution, the document suggests
that a situation exists that is the opposite of "confusability",
i.e., a single valid U-label might look very different, and
possibly even be interpreted as different strings, depending on
whether the label is in NFC form or has had something like this
algorithm applied after normalization.
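
To make the combining-class point concrete, here is a minimal
sketch using Python's standard unicodedata module.  The example
string (BEH followed by SHADDA and FATHA) is an illustrative
choice, not one taken from the draft, and the sketch shows only
what canonical reordering does under NFC; it does not implement
the algorithm the draft proposes.

```python
import unicodedata

def combining_classes(s):
    """Return (code point, canonical combining class) pairs for each character."""
    return [(f"U+{ord(ch):04X}", unicodedata.combining(ch)) for ch in s]

# Illustrative input: BEH, then SHADDA, then FATHA.
typed = "\u0628\u0651\u064E"

# NFC applies canonical ordering, which sorts adjacent marks with
# non-zero combining classes into ascending class order.
nfc = unicodedata.normalize("NFC", typed)

print(combining_classes(typed))
print(combining_classes(nfc))
print("unchanged by NFC:", typed == nfc)
```

Because FATHA has a lower combining class than SHADDA, NFC
stores the FATHA first, whatever order the marks were entered
in.  Any step that rearranged marks after normalization would
therefore produce a code point sequence that is no longer the
NFC form of the same string.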

I will leave other IDNA implications to the imagination (and
further research), but it is possible that we need to build
contextual rules around combining classes.   On the other hand,
if combining sequences were banned entirely with Arabic script,
as has been proposed a few times, this (real or potential)
problem would disappear for IDNs.  PRECIS would probably be
another story.
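
As a rough illustration of what a ban on combining sequences
with Arabic script could look like as a check, the sketch below
flags any label that contains both an Arabic character and a
combining mark.  The function names and the rule itself are
hypothetical and are not taken from any IDNA specification.
Matching on the character name is a stand-in for a real script
property lookup, which Python's standard library does not
provide.

```python
import unicodedata

def is_arabic_char(ch):
    """Approximate an Arabic-script test via the character name (an assumption)."""
    return "ARABIC" in unicodedata.name(ch, "")

def is_combining_mark(ch):
    """True for general categories Mn, Mc, and Me."""
    return unicodedata.category(ch).startswith("M")

def violates_hypothetical_ban(label):
    """Hypothetical rule: no combining marks in a label that uses Arabic script."""
    has_arabic = any(is_arabic_char(ch) for ch in label)
    has_mark = any(is_combining_mark(ch) for ch in label)
    return has_arabic and has_mark

print(violates_hypothetical_ban("\u0628\u064E"))        # BEH + FATHA
print(violates_hypothetical_ban("\u0628\u0627\u0628"))  # BEH ALEF BEH, no marks
```

A real rule would need to define its scope by script property
rather than by character names, and would need to decide how to
treat labels that mix scripts.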

If this really is intended as a normalization add-on or
post-processing step, it seems to me it is really a new
normalization form or set of forms (additional to NFC, NFD,
NFKC, and NFKD) in disguise and that treating it as such might
be a lot less problematic than trying to be sure UAOA is applied
(if needed) each time one of the traditional normalizations is
applied.

Note that this document is out for public review.  Anyone having
comments on it should send them directly to Unicode as specified
in http://www.unicode.org/review/pri359/.   The document is
probably worth discussing on this list only with regard to
IDNA-specific effects.

    john