Re: [Ietf-languages] Fwd: I-D Action: draft-msporny-d-langtag-ext-00.txt

"Doug Ewell" <doug@ewellic.org> Mon, 27 May 2019 17:08 UTC

From: Doug Ewell <doug@ewellic.org>
To: 'Mark Davis ☕️' <mark@macchiato.com>, "'Phillips, Addison'" <addison@lab126.com>
Cc: 'IETF Languages Discussion' <ietf-languages@iana.org>, "'Martin J. Dürst'" <duerst@it.aoyama.ac.jp>, 'Manu Sporny' <msporny@digitalbazaar.com>
Date: Mon, 27 May 2019 11:07:39 -0600
Archived-At: <https://mailarchive.ietf.org/arch/msg/ietf-languages/pYRWGo51dHdEftOVplMeqDA0M0Q>

Mark Davis wrote:

> 1. So from the tag "ar-Arab", we get the script "Arab". Then use
> https://github.com/unicode-org/cldr/blob/master/common/properties/scriptMetadata.txt,
> which has a mapping from script to direction (RTL=YES). (I'm pointing
> to trunk, just so people can read the file easily; one would use the
> latest release.)

Yes, all right, if you are a text process that renders or formats text, but you don't know that the Arabic script is RTL, then fine, use CLDR to tell you that. OK.
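
As a rough sketch of step 1, assuming a small hand-written table standing in for the RTL column of CLDR's scriptMetadata.txt (the table contents and both function names are illustrative, not taken from CLDR or any library):

```python
# Illustrative stand-in for the script-to-direction data in CLDR's
# scriptMetadata.txt. Only a few scripts are listed here; a real process
# would load the full file from the latest CLDR release.
SCRIPT_IS_RTL = {"Arab": True, "Hebr": True, "Latn": False, "Cyrl": False}


def script_subtag(tag):
    """Return the explicit script subtag of a language tag, or None."""
    subtags = tag.replace("_", "-").split("-")
    for subtag in subtags[1:]:
        if len(subtag) == 1:
            # A singleton starts an extension or private-use sequence,
            # so nothing after it can be a script subtag.
            break
        if len(subtag) == 4 and subtag.isalpha():
            return subtag.title()
    return None


def direction_from_tag(tag):
    """Return 'rtl' or 'ltr' when the tag names a known script, else None."""
    script = script_subtag(tag)
    if script is None or script not in SCRIPT_IS_RTL:
        return None
    return "rtl" if SCRIPT_IS_RTL[script] else "ltr"
```

With this table, "ar-Arab" yields "rtl", while a bare "ar" yields None because no script is explicit in the tag.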

> 2. But let's suppose that you have just "ar". Since the script is not
> explicit, the best way to get it is also CLDR. You can use
> https://github.com/unicode-org/cldr/blob/master/common/supplemental/likelySubtags.xml,
> which has a mapping from language or language+region to default
> language-script-region. So "ar" => "ar_Arab_EG", from which we get the
> script "Arab", and then use step 1. Or from "fr" you'd get "Latn" and
> map it to RTL=NO.

You don't need CLDR for this step. The BCP 47 Language Subtag Registry could have told you that:

Type: language
Subtag: ar
Description: Arabic
Added: 2005-10-16
Suppress-Script: Arab      ← THIS
Scope: macrolanguage

I mean, jeez, that's why we invented Suppress-Script in the first place. (Well, sort of. We invented it to go the other way, when you're *creating* a tag, so when you're tempted to write "ar-Arab" you can stop and just write "ar" instead. But this scenario works just as well.)
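
A minimal sketch of that lookup, assuming the record above has already been pulled out of the registry file as text. The function names are hypothetical, and the parser deliberately ignores folded continuation lines and repeated fields such as multiple Description lines, which the real registry does contain:

```python
# The registry record quoted above, as plain "Field: value" lines.
RECORD = """\
Type: language
Subtag: ar
Description: Arabic
Added: 2005-10-16
Suppress-Script: Arab
Scope: macrolanguage
"""


def parse_record(text):
    """Parse one registry record into a dict, keeping the first value per field."""
    fields = {}
    for line in text.splitlines():
        name, sep, value = line.partition(":")
        if sep:
            fields.setdefault(name.strip(), value.strip())
    return fields


def default_script(language, records):
    """Return the Suppress-Script of a language subtag, or None if absent."""
    for rec in records:
        if rec.get("Type") == "language" and rec.get("Subtag") == language.lower():
            return rec.get("Suppress-Script")
    return None


records = [parse_record(RECORD)]
print(default_script("ar", records))
```

The result, "Arab", can then be fed into whatever script-to-direction lookup is used in step 1.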

If you have a more complex matching scenario in mind, involving region and variant subtags, or crossing over from Breton to French or whatever, then go ahead and use CLDR. Or, I suppose, you can also use it here if you didn't already know Arabic was an RTL script in #1 and you had to have CLDR available for that anyway.

This is a lot like RFC 4647 matching: it isn't good enough for all scenarios, so you might need customized or more ambitious matching algorithms for those, but for many simpler cases it's just fine. But it DOES exist and there's no need to pretend it doesn't and invent another basic scheme.
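
For reference, the simplest of the RFC 4647 schemes, basic filtering, fits in a few lines. This is only a sketch: the function name is made up, and it assumes the tags are already well-formed:

```python
def basic_filter(language_range, tags):
    """Return the tags matched by a basic language range (RFC 4647 basic filtering).

    A range matches a tag if the two are equal, or if the range is a prefix
    of the tag and the next character in the tag is "-". Comparison is
    case-insensitive, and the range "*" matches every tag.
    """
    wanted = language_range.lower()
    matches = []
    for tag in tags:
        candidate = tag.lower()
        if wanted == "*" or candidate == wanted or candidate.startswith(wanted + "-"):
            matches.append(tag)
    return matches
```

So the range "de" picks up "de" and "de-CH" but not "den". Anything smarter than that is where the customized matching comes in.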

> It isn't that Arabic would be displayed left to right, it is what
> establishes the paragraph ordering. The problem arises when you have
> mixed text. Look at the following example, using the convention that
> lowercase = English and uppercase=Arabic. The majority of the text and
> the first strong character are both English, but the sentence is meant
> to be used in an Arabic environment, so the default paragraph
> embedding level needs to be RTL.

It still doesn't sound as though inserting 'ltr' or 'rtl' into the language tag would solve this problem. We have Unicode bidi controls for this. (I know HTML has its own tags for this, which are preferred in that context, but the GitHub thread indicates this isn't mostly about HTML.)
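
To illustrate with one such control, here is a simplified sketch of the first-strong heuristic and of how a leading RIGHT-TO-LEFT MARK (U+200F) changes its result. The function is a stand-in for the full Unicode Bidirectional Algorithm, not an implementation of it: it does not skip over isolated runs, and the sample string is invented for the example:

```python
import unicodedata

RLM = "\u200f"  # RIGHT-TO-LEFT MARK: invisible, but a strong RTL character


def first_strong_direction(text):
    """Guess paragraph direction from the first strong character, if any."""
    for ch in text:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class == "L":
            return "ltr"
        if bidi_class in ("R", "AL"):
            return "rtl"
    return None


# English first, then an Arabic word: the first strong character is Latin.
mixed = "hello \u0645\u0631\u062d\u0628\u0627"

print(first_strong_direction(mixed))
print(first_strong_direction(RLM + mixed))
```

The bare string comes out as "ltr"; with the mark in front it comes out as "rtl", with no change to the language tag at all.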

>> 4. Scripts exist in other directionalities besides LTR and RTL...
>
> While this is true, for the [v]ast majority of cases, LTR and RTL are
> the important issues. Most computer systems don't really handle
> vertical natively; one needs to have more specialized text processing
> systems, and that is not, I imagine, the target for this syntax.

Today, yes. N years from now, when these specialized text processing systems become mainstream, maybe it will be a different story. You don't want to lock out that possibility by hard-coding the set of allowable values into the RFC for all time.

> I don't see that there is any reason to approve it, given that it is,
> as far as I can tell, completely unnecessary and would just complicate
> implementer's lives to no good end.

Agreed, obviously.

What we need to do is get together with the folks on the GitHub thread and explain the situation to them, how this proposed solution is neither necessary nor sufficient, and show them the right way(s) to do what they need.

--
Doug Ewell | Thornton, CO, US | ewellic.org