Re: [Ietf-languages] First cut at a BCP 47 extension structure for ISO TR 21636 (was: Language subtag registration form)

Sebastian Drude <drude@xs4all.nl> Sat, 28 November 2020 19:44 UTC

To: Doug Ewell <doug@ewellic.org>, Mark Davis ☕ <mark@macchiato.com>
Cc: "ietf-languages@iana.org" <ietf-languages@iana.org>
References: <20201127232932.665a7a7059d7ee80bb4d670165c8327d.20171979ac.wbe@email15.godaddy.com>
From: Sebastian Drude <drude@xs4all.nl>
Message-ID: <7903ae59-951e-9f46-0af8-b2a3f6657513@xs4all.nl>
Date: Sat, 28 Nov 2020 16:43:48 -0300
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:68.0) Gecko/20100101 Thunderbird/68.10.0
MIME-Version: 1.0
In-Reply-To: <20201127232932.665a7a7059d7ee80bb4d670165c8327d.20171979ac.wbe@email15.godaddy.com>
Content-Type: multipart/alternative; boundary="------------E85C53ABACA62839AF627A35"
Content-Language: pt-BR
Archived-At: <https://mailarchive.ietf.org/arch/msg/ietf-languages/taFCufrE5j6lyFfbHA969bwRjPg>
Subject: Re: [Ietf-languages] First cut at a BCP 47 extension structure for ISO TR 21636 (was: Language subtag registration form)

Thanks again, dear Doug.

My comments below.

Sebastian

-- 

Museu P.E. Goeldi, CCH, Linguistica ▪ Av. Perimetral, 1901
Terra Firme, CEP: 66077-530 ▪ Belém do Pará – PA ▪ Brazil
drude@xs4all.nl ▪ +55 (91) 3217 6024 ▪ +55 (91) 983733319
Priv: Tv. Juvenal Cordeiro, 184, Apt 104 ▪ 66070-300 Belém

On 28/11/2020 03:29, Doug Ewell wrote:
> Sebastian Drude wrote:
>
>> I believe we should not worry about the personal varieties at this
>> point.
> I think this would be a prudent exclusion. As with the 'u' and 't'
> extensions, keys can be added after the initial rollout, but I can't
> imagine a scenario in which encoding this dimension would ever be
> practical.
Indeed.

>
>> Probably we should form a small committee to continue this discussion,
>> instead of involving (and spamming, at this point) the whole list.  I
>> am certainly willing to be instrumental in (contributing to) forming
>> and driving such a group, in whichever role I can contribute best.
> Maybe a custom mailing list could be formed. Everyone here would need to
> be eligible to join the discussion if they choose to.

Yes, that would be great.  I am no longer connected to any service
where I could set one up.  Whom could we ask?  The LinguistList?

>> Points that I can see now that would need to be discussed, besides
>> fleshing out what you have begun:
>>
>> -- interaction with existing subtags, in particular dialects --
>> probably we want to avoid synonym language tags composed according to
>> different frameworks
> I can envision this becoming a real time and effort sink. As just one
> example, script subtags imply "written" (one implies "not written"), and
> this extension would provide a "medium" dimension. What if they disagree
> within the same tag? Trying to cherry-pick the 21636 values to exclude
> those that might conflict with other subtags would probably be very
> messy.

I understand.  We would then, as you write below, just trust the wit and
common sense of those creating tags.


>> -- can more than one value in the SAME dimension be indicated? (I would
>> argue yes, if that is syntactically okay)
> Syntactically, sure:
>
> Sometimes this will make sense, such as for "communicative functioning."
> In many cases it clearly won't, as in combining "formal" and "informal"
> situations. Some semantic restrictions might be appropriate in this
> extension to guard against the latter. Our approach in BCP 47 has simply
> been to advise tag creators to "tag wisely," discouraging nonsensical
> combinations but not making them invalid.

That is fine.  One might need it in at least two cases:

- more general and more specific indications (dialects, sociolects...)

- idiolects belonging to intersections of varieties (e.g., showing all 
the defining criteria for more than one dialect, in border areas)

For the first case, several ever more specific indications would only be
admissible if each is relevant for some application, and one cannot
presume that the language-tag consumer knows of logical implications
(South Tirol German implies Bavarian which implies High German, for
instance).
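
To illustrate the first case, here is a minimal Python sketch of how a
tagging tool could spell out the implied, more general varieties, so that
the consumer does not have to know them.  The labels and the parent table
are hypothetical placeholders of mine, not codes from ISO TR 21636 or
from any registry:

```python
# Hypothetical variety labels and "implies" links, for illustration only.
IMPLIES = {
    "sthtirol": "bavarian",   # South Tirol German implies Bavarian
    "bavarian": "highger",    # Bavarian implies High German
}


def with_implied(value):
    """Return the given variety followed by every variety it implies."""
    chain = [value]
    seen = {value}
    while chain[-1] in IMPLIES:
        parent = IMPLIES[chain[-1]]
        if parent in seen:  # guard against an accidental cycle in the table
            break
        chain.append(parent)
        seen.add(parent)
    return chain
```

A tag creator would then include only those members of the chain that
are relevant for some application.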

>> -- the 'certainty' and similar "adjectives" (yes, that is how I would
>> see them; -- e.g. primary vs. secondary modality, genuine vs.
>> imitated, ...)
> I knew I had missed some in reading quickly through the NP document. If
> there are many, this would require some thought.
At this point, I do not foresee this being heavily used, but what do I
know about possible needs in 20 years?  I would need to compile a list
of such "adjectives"; perhaps there is one more case besides the three
we have identified here.  Again (see my next comment below), there are
default values, and only exceptions would need coding.

>> -- default values
> I know CLDR has this concept for the 'u' extension. It may just amount
> to defaulting true/false values to true. If this is needed on a
> per-dimension basis, there could be a "default" attribute on one value
> in the code list. But interpreting this would require tag consumers to
> have access to the code list, which is not usually desirable for
> syntactic analysis.

I understand.  A common objection to the 8-dimension approach is that it
is very cumbersome to indicate values for each dimension for each
resource, and I could not agree more.

One obvious solution is to imply default values and code only 'deviant'
varieties.  The default values could be: (1) the respective "standard
varieties" in the case of the space dimension, (2) "current period" for
time, (3) "middle-class" or "socially neutral" for social group, (5)
"neutral" for register, (7) "full" proficiency, and (8) "regular
functioning" (no 'anomaly') for the communicative functioning dimension.
(We leave (6), the person dimension, out, as discussed.)
For (4), the medium dimension, the default value would perhaps depend on
(a) the media carrier ("oral" for an audio recording, "written" for a
text document or PDF, although one may want to indicate the specific
writing system etc. -- that is currently also the case), and (b) the
language (sign languages would have the signed modality as default for
videos, other languages the multimodal modality; Latin would have the
written modality as default independent of the media carrier, and so forth).
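
The default scheme just described could be sketched roughly as follows
(Python).  This is only an illustration under assumptions: apart from
"sp", which appears elsewhere in this thread, all dimension keys, all
value strings, and the function names are hypothetical, and the list of
sign languages is a one-element placeholder:

```python
# Hypothetical keys and values; only "sp" (space) appears in this thread.
STATIC_DEFAULTS = {
    "sp": "standard",  # the respective standard variety
    "ti": "current",   # current period
    "so": "neutral",   # socially neutral group
    "si": "neutral",   # neutral register
    "pr": "full",      # full proficiency
    "cf": "regular",   # regular functioning, no anomaly
}

SIGN_LANGUAGES = {"ase"}  # illustrative subset (American Sign Language)


def default_medium(language, carrier):
    """Pick a default medium from the language and the media carrier."""
    if language == "la":  # Latin: written, independent of the carrier
        return "written"
    if carrier == "audio":
        return "oral"
    if carrier == "text":
        return "written"
    if carrier == "video":
        return "signed" if language in SIGN_LANGUAGES else "multimodal"
    return None  # unknown carrier: no default assumed


def resolve(language, carrier, coded):
    """Fill in defaults; explicitly coded ('deviant') values win."""
    values = dict(STATIC_DEFAULTS)
    medium = default_medium(language, carrier)
    if medium is not None:
        values["me"] = medium
    values.update(coded)
    return values
```

For example, resolve("de", "audio", {"sp": "xyz"}) keeps the coded space
value "xyz" and fills in every other dimension from the defaults.  Note
that this presupposes that the consumer has access to the default table.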

>> One question I have: need the values in the key-value-pairs be unique
>> over different languages?  As an example, can two dialects of
>> different languages share the same string xyz as a designation used as
>> value in ...v-...-sp-xyz?
> The existing, analogous rule for variants is a point of controversy
> here. BCP 47 says that variants should not be defined to have different
> meanings depending on the language. So we have variant subtags like
> 'pinyin' that can apply to both Chinese and Tibetan, because it's nearly
> the same romanization scheme for both languages; and we have variants
> that can apply to almost any language, such as the one that means
> "written in IPA." But we can't have something like 'western' because the
> concept of "language X as used in the western part of country Y" has
> different meanings for different values of X and Y. Not everyone here
> agrees where to draw the line.
>
> I assume that many of the 21636 concepts have the same meaning
> regardless of the language, but this needs more study.
Indeed, except for dialects, I would assume that most values are
universal, but some languages may have additional values
(Southeast Asian languages are famous for their many registers based on
the social status of speaker and addressee, for instance, and certain
sociolects will only exist for, say, Indian castes but not elsewhere).  I
do not see any problem with that: each value is conceived as universal
and each language picks the values it needs.

Similarly, periods and epochs (time dimension) can cover different time
spans -- Middle English covers very different centuries from
Middle Persian.  But it would still be fine to use these labels for each
language; the concrete time span is less crucial than the respective
variety being named by, e.g., "middle".

>> Alternatively/additionally, we could use the glottocodes for the major
>> dialects already included in the Glottolog (they are a string of 4
>> letters and 4 digits), and come to an agreement with the Glottolog
>> folks to extend their list of dialects.
>> (It would be excellent to be in touch with them anyways, also for
>> cases where ISO is less accurate than the Glottolog.)
> When I saw this originally, I wasn't in a position to respond, but it
> alarmed me. Establishing Glottolog as a competing standard to ISO 639
> for encoding language information in BCP 47, even in different subtag
> types, can only lead to confusion and duplicate representations, exactly
> what you expressed a desire to avoid earlier.
I agree.  I was not at all proposing to complement, let alone replace, 
the ISO 639 identifiers in the main language subtags or any other 
crucial area of BCP 47 by glottocodes.

> Then I saw the following, and became even more alarmed:
>
>> However, differently from when BCP 47 was created, Glottolog now is an
>> impressive and very complete and accurate work.  Many scholars combine
>> it, or even prefer it over ISO 639, it is used in Wikipedia, etc., so
>> perhaps it is the time for rethinking the relationship between BCP and
>> Glottolog, especially for the cases of languages that are missing in
>> the Ethnologue / ISO 639.
> So, just to be 100% clear: the core standards that are used as the
> source for subtag structure in BCP 47, and for values of certain subtag
> types in the Language Subtag Registry, are fixed. They cannot be swapped
> in and out.

Absolutely, I am aware of that and agree with it.  I am sorry if I 
expressed myself in a way that was easy to misunderstand.  I would not 
be so heavily involved in ISO TC37/SC2 if I wanted to replace ISO 639 by 
Glottocodes.

> Any proposal for "rethinking the relationship between BCP
> [47] and Glottolog," to the extent that means switching from ISO 639
> code elements to Glottolog code elements, is completely off the table.
Sure.
> (Incidentally, Wikipedia uses an extension of ISO 639 for language
> coding, and only displays Glottolog code elements in language boxes. The
> Portuguese Wikipedia is at pt.wikipedia.org, not
> port1283.wikipedia.org.)

Exactly.

Still, that is what I mean: while not changing the current setting for
the URLs, Wikipedia at some point recognized that Glottocodes exist
and are important enough to be given as standard information for each
language.

Similarly, when I said "perhaps it is the time for rethinking the
relationship between BCP and Glottolog, especially for the cases of
languages that are missing in the Ethnologue / ISO 639", I meant that
merely allowing people to use Glottocodes in the "x" private use area,
without any official status, is too weak; it does not do them
justice, so to say.
Instead, one could think of introducing an extension, say, with subtag 
"g" (if that is not already assigned -- where can I find the list of 
existing extensions?), to be followed by a glottocode, in order to allow 
for a user to indicate some variety that is well defined by its 
Glottolog entry.  Especially in cases where ISO does not have an 
appropriate language code (yet), a user could use, for instance, 
*mis-g-adha1238*, where the g-extension subtag *adha1238* is the 
Glottocode for the extinct Adhari language, which is not (yet) in ISO 
639.  Perhaps an even better solution can be imagined than using the
special purpose ISO 639 identifier "mis" -- /miscellaneous language (no
ISO 639 code element is assigned for this language or included in the
part of ISO 639 used by a given application)/.
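
As a rough sketch of how a consumer could pick the glottocode out of such
a tag (Python): the "g" singleton is only the proposal made above, not an
existing extension, and the pattern follows the description earlier in
this thread (4 letters and 4 digits); the function name is my own.

```python
import re

# Glottocode shape as described in this thread: 4 letters, then 4 digits.
GLOTTOCODE = re.compile(r"[a-z]{4}[0-9]{4}")


def glottocode_from_tag(tag):
    """Return the glottocode after a hypothetical 'g' singleton, else None."""
    subtags = tag.lower().split("-")
    for i, subtag in enumerate(subtags):
        if subtag == "x":
            # Everything after 'x' is private use and is not interpreted.
            return None
        if subtag == "g" and i > 0 and i + 1 < len(subtags):
            candidate = subtags[i + 1]
            return candidate if GLOTTOCODE.fullmatch(candidate) else None
    return None
```

With this, glottocode_from_tag("mis-g-adha1238") yields "adha1238", while
a tag without the proposed extension, such as "pt-BR", yields None.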

> The core standards were chosen deliberately, for a variety of reasons
> such as existing practice, suitability, and stability promises. We know
> there are other encoding systems for languages, some of which may even
> have isolated or perceived advantages over ISO 639. But this is what we
> have chosen. There is no significant likelihood this will be overturned.

Absolutely, and I would not have it otherwise, unless ISO becomes 
unfeasible for some reason, or in practice everybody is abandoning it 
for some good reason -- but I do not see that coming.


>
> --
> Doug Ewell | Thornton, CO, US | ewellic.org