Re: [Ietf-languages] First cut at a BCP 47 extension structure for ISO TR 21636 (was: Language subtag registration form)

Doug Ewell <doug@ewellic.org> Sat, 28 November 2020 06:30 UTC

From: Doug Ewell <doug@ewellic.org>
To: Sebastian Drude <drude@xs4all.nl>, Mark Davis ☕ <mark@macchiato.com>
Cc: "ietf-languages@iana.org" <ietf-languages@iana.org>
Date: Fri, 27 Nov 2020 23:29:32 -0700
Archived-At: <https://mailarchive.ietf.org/arch/msg/ietf-languages/As_bHtgRKVMnBXIxx_MDAfKuEhs>
Subject: Re: [Ietf-languages] First cut at a BCP 47 extension structure for ISO TR 21636 (was: Language subtag registration form)

Sebastian Drude wrote:

> I believe we should not worry about the personal varieties at this
> point.

I think this would be a prudent exclusion. As with the 'u' and 't'
extensions, keys can be added after the initial rollout, but I can't
imagine a scenario in which encoding this dimension would ever be
practical.

> Probably we should form a small committee to continue this discussion,
> instead of involving (and spamming, at this point) the whole list.  I
> am certainly willing to be instrumental in (contributing to) forming
> and driving such a group, in whichever role I can contribute best.

Maybe a custom mailing list could be formed. Everyone here would need to
be eligible to join the discussion if they choose to.

> Points that I can see now that would need to be discussed, besides
> fleshing out what you have begun:
>
> -- interaction with existing subtags, in particular dialects --
> probably we want to avoid synonym language tags composed according to
> different frameworks

I can envision this becoming a real time and effort sink. As just one
example, script subtags imply "written" (except 'Zxxx', which implies
"not written"), and this extension would provide a "medium" dimension.
What if they disagree within the same tag? Trying to cherry-pick the
21636 values to exclude those that might conflict with other subtags
would probably be very messy.
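
To make the disagreement concrete, here is a rough sketch of the check a
tag consumer would have to run. Everything about the extension is an
assumption for illustration: the singleton 'v' is taken from the example
later in this thread, while the key 'md' and the values "written" and
"spoken" are invented here and appear in no draft. The sketch also treats
any medium value other than "written" as not written.

```python
def medium_conflict(tag, singleton="v", key="md"):
    """Return a description of a script/medium disagreement, or None.

    Hypothetical: 'md' as the "medium" key and its values are invented.
    """
    subtags = tag.lower().split("-")
    script, media, section, current_key = None, [], None, None
    for subtag in subtags[1:]:
        if len(subtag) == 1:
            if subtag == "x":
                break  # private use: nothing after this concerns us
            section, current_key = subtag, None
        elif section is None:
            # Before any singleton, a 4-letter subtag is the script.
            if len(subtag) == 4 and subtag.isalpha():
                script = subtag
        elif section == singleton:
            if len(subtag) == 2:
                current_key = subtag
            elif current_key == key:
                media.append(subtag)
    if script is None or not media:
        return None
    # 'Zxxx' is the one script subtag that means "unwritten".
    script_says = "unwritten" if script == "zxxx" else "written"
    for medium in media:
        if (medium == "written") != (script_says == "written"):
            return (f"script {script.title()} implies {script_says}, "
                    f"medium says {medium}")
    return None


print(medium_conflict("sr-Latn-v-md-spoken"))
print(medium_conflict("sr-Latn-v-md-written"))
```

The first call reports a disagreement and the second does not. Note that
the check only works because the script subtag and the extension value
are both inspected; neither part of the tag is wrong on its own.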

> -- can more than one value in the SAME dimension be indicated? (I would
> argue yes, if that is syntactically okay)

Syntactically, sure.

Sometimes this will make sense, such as for "communicative functioning."
In many cases it clearly won't, as in combining "formal" and "informal"
situations. Some semantic restrictions might be appropriate in this
extension to guard against the latter. Our approach in BCP 47 has simply
been to advise tag creators to "tag wisely," discouraging nonsensical
combinations but not making them invalid.
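
As a sketch of the difference between "syntactically okay" and "makes
sense," assume the extension borrows the key/value layout of the 'u'
extension: two-character keys, each followed by one or more values of
three to eight alphanumerics. The singleton 'v' and the key 'sp' come
from the example later in this thread; the keys 'cf' and 'st', all the
values, and the table of contradictory pairs are made up for
illustration.

```python
import re

# Assumed layout, modelled on the 'u' extension.
KEY = re.compile(r"[a-z0-9][a-z]")
VALUE = re.compile(r"[a-z0-9]{3,8}")

# Hypothetical pairs that make no sense together: flagged, not rejected.
CONTRADICTORY = {"st": [{"formal", "informal"}]}


def parse_extension(tag, singleton="v"):
    """Return {key: [values, ...]} for one extension; {} if it is absent."""
    result, key, inside = {}, None, False
    for subtag in tag.lower().split("-")[1:]:
        if len(subtag) == 1:
            if inside or subtag == "x":
                break  # our extension has ended, or private use begins
            inside = (subtag == singleton)
        elif inside:
            if KEY.fullmatch(subtag):
                key = subtag
                result.setdefault(key, [])
            elif VALUE.fullmatch(subtag) and key is not None:
                result[key].append(subtag)  # repeated values are allowed
            else:
                raise ValueError(f"unexpected subtag {subtag!r}")
    return result


def advise(parsed):
    """Return "tag wisely" notes; the tag itself stays valid."""
    notes = []
    for key, values in parsed.items():
        for pair in CONTRADICTORY.get(key, []):
            if pair <= set(values):
                notes.append(
                    f"{key}: {' and '.join(sorted(pair))} are contradictory")
    return notes


print(parse_extension("pt-BR-v-cf-abc-def-sp-xyz"))
print(advise(parse_extension("en-v-st-formal-informal")))
```

The parser accepts both tags; only advise() comments on the second one,
which mirrors the BCP 47 approach of discouraging a combination without
making it invalid.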

> -- the 'certainty' and similar "adjectives" (yes, that is how I would
> see them; -- e.g. primary vs. secondary modality, genuine vs.
> imitated, ...)

I knew I had missed some in reading quickly through the NP document. If
there are many, this would require some thought.

> -- default values

I know CLDR has this concept for the 'u' extension. It may just amount
to defaulting true/false values to true. If this is needed on a
per-dimension basis, there could be a "default" attribute on one value
in the code list. But interpreting this would require tag consumers to
have access to the code list, which is not usually desirable for
syntactic analysis.
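
A small sketch of why defaults pull in the code list. The code list
layout, the key 'md', and its values are all hypothetical; the only idea
taken from the paragraph above is a "default" attribute on one value per
dimension.

```python
# Hypothetical code list: one value per dimension carries "default".
CODE_LIST = {
    "md": {"spoken": {"default": True}, "written": {}},
}


def effective_values(parsed, code_list):
    """Fill in defaults for dimensions that the tag leaves unspecified.

    'parsed' maps keys to the values found in the tag, e.g. {"md": ["written"]}.
    """
    resolved = {}
    for key, values in code_list.items():
        if parsed.get(key):
            resolved[key] = list(parsed[key])  # explicit values win
        else:
            resolved[key] = [name for name, attrs in values.items()
                             if attrs.get("default")]
    return resolved


print(effective_values({}, CODE_LIST))
print(effective_values({"md": ["written"]}, CODE_LIST))
```

Without CODE_LIST in hand, the first call has nothing to return, which
is the dependency on the code list described above.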

> -- requirements for a registry, and its feasibility, and finally
> implementation

Oh, we'll go there.

> One question I have: need the values in the key-value-pairs be unique
> over different languages?  As an example, can two dialects of
> different languages share the same string xyz as a designation used as
> value in ...v-...-sp-xyz?

The existing, analogous rule for variants is a point of controversy
here. BCP 47 says that variants should not be defined to have different
meanings depending on the language. So we have variant subtags like
'pinyin' that can apply to both Chinese and Tibetan, because it's nearly
the same romanization scheme for both languages; and we have variants
that can apply to almost any language, such as 'fonipa', which means
"written in IPA." But we can't have something like 'western' because the
concept of "language X as used in the western part of country Y" has
different meanings for different values of X and Y. Not everyone here
agrees where to draw the line.

I assume that many of the 21636 concepts have the same meaning
regardless of the language, but this needs more study.

> Alternatively/additionally, we could use the glottocodes for the major
> dialects already included in the Glottolog (they are a string of 4
> letters and 4 digits), and come to an agreement with the Glottolog
> folks to extend their list of dialects.
> (It would be excellent to be in touch with them anyways, also for
> cases where ISO is less accurate than the Glottolog.)

When I saw this originally, I wasn't in a position to respond, but it
alarmed me. Establishing Glottolog as a competing standard to ISO 639
for encoding language information in BCP 47, even in different subtag
types, can only lead to confusion and duplicate representations, exactly
what you expressed a desire to avoid earlier.

Then I saw the following, and became even more alarmed:

> However, differently from when BCP 47 was created, Glottolog now is an
> impressive and very complete and accurate work.  Many scholars combine
> it with, or even prefer it over, ISO 639; it is used in Wikipedia, etc., so
> perhaps it is the time for rethinking the relationship between BCP and
> Glottolog, especially for the cases of languages that are missing in
> the Ethnologue / ISO 639.

So, just to be 100% clear: the core standards that are used as the
source for subtag structure in BCP 47, and for values of certain subtag
types in the Language Subtag Registry, are fixed. They cannot be swapped
in and out. Any proposal for "rethinking the relationship between BCP
[47] and Glottolog," to the extent that means switching from ISO 639
code elements to Glottolog code elements, is completely off the table.

(Incidentally, Wikipedia uses an extension of ISO 639 for language
coding, and only displays Glottolog code elements in language boxes. The
Portuguese Wikipedia is at pt.wikipedia.org, not
port1283.wikipedia.org.)

The core standards were chosen deliberately, for a variety of reasons
such as existing practice, suitability, and stability promises. We know
there are other encoding systems for languages, some of which may even
have isolated or perceived advantages over ISO 639. But this is what we
have chosen. There is no significant likelihood this will be overturned.

--
Doug Ewell | Thornton, CO, US | ewellic.org