Re: [Ietf-languages] First cut at a BCP 47 extension structure for ISO TR 21636 (was: Language subtag registration form)

Doug Ewell <doug@ewellic.org> Sun, 29 November 2020 03:54 UTC

From: Doug Ewell <doug@ewellic.org>
To: 'Sebastian Drude' <drude@xs4all.nl>, 'Mark Davis ☕' <mark@macchiato.com>
Cc: ietf-languages@iana.org, iso639-3@sil.org
References: <20201127232932.665a7a7059d7ee80bb4d670165c8327d.20171979ac.wbe@email15.godaddy.com> <7903ae59-951e-9f46-0af8-b2a3f6657513@xs4all.nl>
In-Reply-To: <7903ae59-951e-9f46-0af8-b2a3f6657513@xs4all.nl>
Date: Sat, 28 Nov 2020 20:53:44 -0700
Message-ID: <000301d6c603$3cf4b540$b6de1fc0$@ewellic.org>
Archived-At: <https://mailarchive.ietf.org/arch/msg/ietf-languages/emf1d3_WVSqltipHuKn5ZclM8ow>

Sebastian Drude wrote:

>> Maybe a custom mailing list could be formed. Everyone here would need
>> to be eligible to join the discussion if they choose to.
>
> Yes, that would be great.  I am not connected any more to any service
> where I could do so, whom could we ask?  The LinguistList?

That shouldn’t be necessary. Many folks have access to Mailman and can spin up an ad-hoc mailing list. I know Michael Everson has done so on several occasions, including one for a Unicode encoding proposal I’ve been involved with that really couldn’t have succeeded without it. But Michael doesn’t run a mailing-list business, and I hesitate to pester him for this.

>>> -- can more than one value in the SAME dimension be indicated? (I
>>> would argue yes, if that is syntactically okay)
>
> That is fine.  One could need it for at least two cases:
>
> - more general and more specific indications (dialects, sociolects...)
>
> - idiolects belonging to intersections of varieties (e.g., showing all
> the defining criteria for more than one dialect, in border areas)

I need to read Mark’s response about this, and the issues surrounding it, a bit more thoroughly. I was under the impression that a given extension could quite happily allow you to string together “fr-v-xx-abcdef-ghijk”, or not, depending on how the spec is written. Mark seems to imply some inherent limitations on this in BCP 47. Fortunately we have plenty of time to figure out what can and cannot, or should not, be done here.
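As a rough illustration of the syntax question only: the generic BCP 47 grammar (RFC 5646) defines an extension as a singleton followed by one or more subtags of 2 to 8 alphanumeric characters, with each singleton appearing at most once. The sketch below splits a tag along those lines. The 'v' singleton is the hypothetical one from this thread, the function name is invented here, and the code assumes a well-formed, non-grandfathered tag; it says nothing about what a particular extension's own spec may further restrict.

```python
def split_extensions(tag):
    """Split a language tag into leading subtags, extensions, and private use.

    Simplified sketch: the language/script/region/variant subtags are not
    validated, and grandfathered tags are not handled.
    """
    subtags = tag.lower().split("-")
    base, extensions, private_use = [], {}, []
    current = None  # singleton whose subtags are currently being collected
    for i, sub in enumerate(subtags):
        if len(sub) == 1 and i > 0:
            if sub == "x":
                # Everything after 'x' is private use, through the end of the tag.
                private_use = subtags[i + 1:]
                break
            if sub in extensions:
                raise ValueError(f"singleton {sub!r} appears twice")
            current = sub
            extensions[current] = []
        elif current is None:
            base.append(sub)
        else:
            # Generic extension subtags are 2 to 8 ASCII alphanumerics.
            if not (2 <= len(sub) <= 8 and sub.isascii() and sub.isalnum()):
                raise ValueError(f"bad extension subtag {sub!r}")
            extensions[current].append(sub)
    for singleton, subs in extensions.items():
        if not subs:
            raise ValueError(f"extension {singleton!r} has no subtags")
    return base, extensions, private_use


print(split_extensions("fr-v-xx-abcdef-ghijk"))
```

At this generic level nothing stops several subtags from following one singleton; any limit on repeating values within one dimension would have to come from the extension's own specification.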

> For the first case, several ever more specific indications would only
> be admissible if each is relevant for some application, and one cannot
> presume that the language-tag-consumer knows of logical implications
> (South Tirol German implies Bavarian which implies High German, for
> instance).

Nothing about BCP 47 tagging is ever intended to presume that the user (at either end) knows anything about language family hierarchies, about which there is much disagreement anyway.

I suspect I am not reading this comment carefully enough, and am taking it out of context.

>>> -- the 'certainty' and similar "adjectives" (yes, that is how I
>>> would see them; -- e.g. primary vs. secondary modality, genuine vs.
>>> imitated, ...)
>>
>> I knew I had missed some in reading quickly through the NP document.
>> If there are many, this would require some thought.
>
> At this point, I do not foresee this to be heavily used, but what do I
> know about possible needs in 20 years?  I would need to compile a list
> of such "adjectives", perhaps there is one more case besides the three
> we have identified here.  Again (see below, next comment), there are
> default values, and only exceptions would need coding.

No, I literally meant I did not even know there were three at present. I thought only "certainty" fell into this category. That's why I need to read through that part of the NP document again.

But yes, the mechanism does need to allow for future modifiers of this type, just like the 'u' extension allows for additional keys, as well as additional values within each key.
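For comparison, here is a minimal sketch of how the 'u' extension's key/value layout can be read: two-character subtags act as keys, and the longer subtags that follow are that key's values (anything before the first key is an attribute). The function name is invented here, and raising an error on a repeated key is a choice made for this sketch, not a statement of what the 'u' specification requires.

```python
def parse_u_style(subtags):
    """Group the subtags of a 'u'-style extension into attributes and keywords."""
    attributes, keywords = [], {}
    key = None
    for sub in subtags:
        if len(sub) == 2:
            # A two-character subtag starts a new key.
            if sub in keywords:
                raise ValueError(f"key {sub!r} appears twice")
            key = sub
            keywords[key] = []
        elif key is None:
            # Subtags before the first key are attributes.
            attributes.append(sub)
        else:
            keywords[key].append(sub)
    return attributes, keywords


# The subtags of "de-u-co-phonebk-nu-latn" that follow the 'u' singleton:
print(parse_u_style(["co", "phonebk", "nu", "latn"]))
```

Because a reader of this layout does not need to know the set of keys in advance, new keys and new values can be added later without changing the syntax, which is the property wanted for future modifiers.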

>>> -- default values
>
> I understand.  A common objection to the 8-dimension-approach is that
> it is very cumbersome to indicate values for each dimension for each
> resource, and I could not agree more.
>
> One obvious solution is to imply default values and code only
> 'deviant' varieties.  The default values could be: (1) the respective
> "standard varieties" in the case of the space dimension, (2) "current
> period" for time, (3) "middle-class" or "socially neutral" for social
> group, (5) "neutral" for register, (7) "full" proficiency, and (8)
> "regular functioning" (no 'anomaly') for the commun. funct. dimension.
> (We leave (6), the person dimension, out, as discussed.)
>
> For (4) the medium dimension, the default value would perhaps depend
> on (a) the media carrier ("oral" for an audio recording, "written" for
> a text document or PDF, although one may want to indicate the specific
> writing system etc. -- that is currently also the case), and (b) the
> language (sign languages would have the signed modality as default for
> videos, other languages the multimodal modality; Latin would have the
> written modality as default independent of the media carrier, and so
> forth).

I have to confess that it never occurred to me that users of this extension, or indeed users of the TR in any form, would always be expected to provide values for all eight (or seven) dimensions, and that a defaulting mechanism would be necessary to permit eliding some of them.
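For reference, the defaulting scheme described in the quoted text amounts to merging explicitly coded values over a table of defaults. The sketch below shows that idea only. Every dimension name and value label is invented here for illustration; neither the TR nor any extension draft defines them. The medium dimension is left out of the table because its default is said to depend on the media carrier and the language, and the person dimension is omitted as discussed.

```python
# Hypothetical labels for the defaults listed in the quoted text.
DEFAULTS = {
    "space": "standard",        # the respective standard variety
    "time": "current",          # current period
    "social": "neutral",        # middle-class or socially neutral
    "register": "neutral",
    "proficiency": "full",
    "functioning": "regular",   # no anomaly
}

# "medium" has no fixed default here; it may only be given explicitly.
KNOWN_DIMENSIONS = set(DEFAULTS) | {"medium"}


def effective_values(explicit):
    """Return the full set of dimension values, with explicit ones overriding defaults."""
    unknown = set(explicit) - KNOWN_DIMENSIONS
    if unknown:
        raise ValueError(f"unknown dimensions: {sorted(unknown)}")
    return {**DEFAULTS, **explicit}


# Only the deviating dimension needs to be coded.
print(effective_values({"time": "historical"}))
```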

> I agree.  I was not at all proposing to complement, let alone replace,
> the ISO 639 identifiers in the main language subtags or any other
> crucial area of BCP 47 by glottocodes.

Thank you very much. I am glad in this case that I simply misread this and panicked for no reason.

> Still, that is what I mean: while not changing the current setting for
> the URLs, Wikipedia at some time recognized that the Glottocodes
> exist, and are important enough to be now a standard information given
> for each language.

Well, I mean, we also know that Glottocodes exist. For that matter, we also know the Linguasphere coding system exists. Whether we consider these to be “standards,” on a par with ISO 639 or not, can be debated.

> Instead, one could think of introducing an extension, say, with subtag
> "g" (if that is not already assigned -- where can I find the list of
> existing extensions?),

https://www.iana.org/assignments/language-tag-extensions-registry

> to be followed by a glottocode, in order to allow for a user to
> indicate some variety that is well defined by its Glottolog entry.
> Especially in cases where ISO does not have an appropriate language
> code (yet), a user could use, for instance, mis-g-adha1238, where the
> g-extension subtag adha1238 is the Glottocode for the extinct Adhari
> language, which is not (yet) in ISO 639.  Perhaps even a better
> solution can be imagined than using the special purpose ISO 639
> identifier "mis" -- miscellaneous language (no ISO 639 code element is
> assigned for this language or included in the part of ISO 639 used by
> a given application).

ISO 639-3/RA does have a well-established process to add (and modify, and delete) code elements to reflect linguistic realities. The reviewers may simply not have seen the documentation for Adhari (Old Azeri) that they were looking for. Submitting a proposal to add a language like this to 639-3, and thus to the Registry, might be more productive than building an ISO 2022-style mechanism to swap in another coding system. This is especially true if the Adhari example isn’t intended to represent hundreds or thousands of languages missing from 639-3.
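Purely on the syntactic side, a tag such as mis-g-adha1238 would be well-formed, since an 8-character code fits the 2 to 8 alphanumeric limit for extension subtags. The sketch below checks that shape. It assumes glottocodes consist of four lowercase alphanumerics followed by four digits, and the 'g' singleton is the hypothetical one proposed in the quoted text, not a registered extension. Passing this check says nothing about whether such an extension would be a good idea.

```python
import re

# Assumed glottocode shape: four lowercase alphanumerics, then four digits.
GLOTTOCODE = re.compile(r"[a-z0-9]{4}[0-9]{4}")
# Generic BCP 47 extension subtag: 2 to 8 ASCII alphanumerics.
EXTENSION_SUBTAG = re.compile(r"[A-Za-z0-9]{2,8}")


def fits_extension_subtag(code):
    """True if the code looks like a glottocode and is a legal extension subtag."""
    return bool(GLOTTOCODE.fullmatch(code)) and bool(EXTENSION_SUBTAG.fullmatch(code))


print(fits_extension_subtag("adha1238"))
```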

--
Doug Ewell, CC, ALB | Thornton, CO, US | ewellic.org