Re: [Ietf-languages] First cut at a BCP 47 extension structure for ISO TR 21636 (was: Language subtag registration form)

Sebastian Drude <drude@xs4all.nl> Sat, 28 November 2020 02:22 UTC

To: Hugh Paterson III <sil.linguist@gmail.com>
Cc: Doug Ewell <doug@ewellic.org>, Mark Davis ☕ <mark@macchiato.com>, IETF Languages Discussion <ietf-languages@iana.org>
References: <001201d6c511$3b953b40$b2bfb1c0$@ewellic.org> <f3c2ced8-c9e9-abef-cd23-1f65a7c4d97e@xs4all.nl> <CAE=3Ky8WNVj1qVS_mRvU9N5h4x90TfJ+rca6fdSSgbWV93x7Qw@mail.gmail.com>
From: Sebastian Drude <drude@xs4all.nl>
Message-ID: <3ab381b3-c2d5-3f18-4f1b-13c08c72b26f@xs4all.nl>
Date: Fri, 27 Nov 2020 23:22:06 -0300
In-Reply-To: <CAE=3Ky8WNVj1qVS_mRvU9N5h4x90TfJ+rca6fdSSgbWV93x7Qw@mail.gmail.com>
Archived-At: <https://mailarchive.ietf.org/arch/msg/ietf-languages/7A4fTWCcMcgiaBujyRXJ8tnZi98>
Subject: Re: [Ietf-languages] First cut at a BCP 47 extension structure for ISO TR 21636 (was: Language subtag registration form)

Thanks again, Hugh.

The interaction with or building on Glottolog would certainly have to be 
thought through thoroughly, and I have not even started to do so.  But 
they have a good list of major language varieties, especially if these 
have already received some kind of description.

More comments below.

Best,

Sebastian

-- 

Museu P.E. Goeldi, CCH, Linguistica ▪ Av. Perimetral, 1901
Terra Firme, CEP: 66077-530 ▪ Belém do Pará – PA ▪ Brazil
drude@xs4all.nl ▪ +55 (91) 3217 6024 ▪ +55 (91) 983733319
Priv: Tv. Juvenal Cordeiro, 184, Apt 104 ▪ 66070-300 Belém

On 27/11/2020 21:54, Hugh Paterson III wrote:
> Sebastian,
>
> This is all very interesting; I am particularly interested in the
> comment related to the Glottolog.
> 1. You refer to their "list of dialects" but are you certain that some 
> of these sub-varieties are not actually socio-lects, and therefore 
> should not be placed in the space dimension? It would be problematic 
> if one were to assume that the Glottolog's list of ways to 
> subcategorize languages were all within one dimension.
Very true.  Glottolog is one more resource that was started without
paying attention to the different dimensions of inner-language
linguistic variation.

> Having said that, my understanding, having talked with one of their
> editors, is that they don't create their own data but rather
> aggregate other people's data.
Indeed, but they (mainly Harald Hammarström) review the literature 
carefully and come to an informed conclusion, for instance regarding the 
genealogical trees.

> But perhaps the editors do have first hand or documentary evidence for 
> their sub-categorization as you seem to suggest.
No, I did not intend to suggest that.  I know they work with many
specialists for local/regional language families and groups to ensure
maximum accuracy, and because it is done by the same small circle of
people, the work as a whole is quite consistent in the criteria used
for distinguishing languages, dialects, etc.

> 2. My understanding is that people can already use glottocodes (even 
> now) with BCP47 with the -x- extension in the private use area.
Sure, but is it not true that in that area you can do anything? However,
unlike when BCP 47 was created, Glottolog is now an impressive, very
complete and accurate work.  Many scholars combine it with, or even
prefer it over, ISO 639, it is used in Wikipedia, etc., so perhaps it is
time to rethink the relationship between BCP 47 and Glottolog,
especially for languages that are missing from the Ethnologue /
ISO 639.
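
As a rough illustration of the private-use route mentioned in point 2, here is a minimal Python sketch. It assumes the shape given elsewhere in this thread for a glottocode (4 letters followed by 4 digits), which happens to fit the 8-character limit of a BCP 47 private-use subtag. The function name with_private_glottocode and the code "abcd1234" are made up for illustration; "abcd1234" is a placeholder, not a real glottocode.

```python
import re

# Assumption from this thread: a glottocode is 4 letters followed by 4 digits.
GLOTTOCODE = re.compile(r"[a-z]{4}[0-9]{4}")


def with_private_glottocode(base_tag: str, glottocode: str) -> str:
    """Append a glottocode to a language tag as a private-use subtag.

    Hypothetical helper: it only checks the shape of the glottocode and
    does not validate base_tag against BCP 47.
    """
    code = glottocode.lower()
    if not GLOTTOCODE.fullmatch(code):
        raise ValueError(f"not shaped like a glottocode: {glottocode!r}")
    # If the tag already carries a private-use sequence, extend it;
    # otherwise start one with the 'x' singleton.
    if "-x-" in f"-{base_tag.lower()}-":
        return f"{base_tag}-{code}"
    return f"{base_tag}-x-{code}"


print(with_private_glottocode("pt", "abcd1234"))
```

Because everything after 'x' is private by definition, a tag built this way is only meaningful between parties who have agreed on the convention.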

> 3. Having communicated with a different editor of the Glottolog, my
> understanding from him was that the seed data for the Glottolog came
> from the Ethnologue, both directly and indirectly via Multi-tree, and
> that the indirect route included the sub-categorization. So, maybe it
> would be equally good to consult that reference work.
The Ethnologue and ISO 639 are twins maintained by SIL International.
Multi-Tree is exactly that: MULTI.  For many major language families
there are a large number of trees, compiled from decades of scholarly
work, and unfortunately nothing ensures that the same terms for
languages, dialects and (sub)families refer to the same entities in the
different trees, so that is quite a mess to start with.  It is good for
historical-comparative linguists who want to compare hypotheses, but not
very good for standardization. True, they have a "composite" or
compromise tree for many or all families, but in my experience it is
either very close to the Ethnologue or else inferior in quality to
Glottolog. So I believe that, in particular for language classification
(which is not at all relevant for BCP 47, as far as I understand),
Glottolog is the best resource.

> 4. Please correct me if I am wrong, but my impression after having
> read some of the Glottolog's about pages was that they do occasionally
> obliterate ID codes that they use, which is a different process from
> the one in ISO 639-3, where retired language codes get a special
> status when they enter retirement. See:
> https://github.com/glottolog/glottolog/issues/243

Glottolog (and even the Glottocodes) was not made for standardization or
tagging, but then, that was true of the Ethnologue 20 years ago, too.
I understand Glottolog also forbids recycling of Glottocodes, so
somewhere there must be a list of once-valid codes that have been
retired, withdrawn or deprecated.  Perhaps we can come to agreements
with them.  It would indeed be good if Glottolog also recognized the
multi-dimensionality of inner-language linguistic variation.


>
> all the best,
> - Hugh
>
> On Sat, Nov 28, 2020 at 12:44 AM Sebastian Drude <drude@xs4all.nl> wrote:
>
>     Dear Doug,
>
>     thank you so much for this.  It goes EXACTLY along the lines of what
>     I have imagined could be achieved, as much as I have understood of
>     BCP 47.
>
>     For me, everything you have proposed could be taken over as the seed
>     of a future regulation.
>
>     I also agree that repeating values in different dimensions should be
>     avoided, even if they are syntactically possible.
>
>     I believe we should not worry about the personal varieties at this
>     point.  I am quite confident that any metadata framework (which is
>     what BCP 47 is all about, or is always used in the context of) will
>     already provide ways of indicating the author/speaker/signer/...
>     (one, or several in a resource with a dialogue, for instance), so
>     that it would be a redundant feature for the language tag anyway.
>     Also, this dimension would be relevant only in very special
>     applications, if ever.
>
>
>     Probably we should form a small committee to continue this
>     discussion, instead of involving (and spamming, at this point) the
>     whole list.  I am certainly willing to be instrumental in
>     (contributing to) forming and driving such a group, in whichever
>     role I can contribute best.
>
>
>     Points that I can see now that would need to be discussed, besides
>     fleshing out what you have begun:
>
>     -- interaction with existing subtags, in particular dialects --
>     probably we want to avoid synonymous language tags composed
>     according to different frameworks
>
>     -- can more than one value in the SAME dimension be indicated? (I
>     would argue yes, if that is syntactically okay)
>
>     -- the 'certainty' and similar "adjectives" (yes, that is how I would
>     see them; e.g. primary vs. secondary modality, genuine vs.
>     imitated, ...)
>
>     -- default values
>
>     -- requirements for a registry, its feasibility, and finally its
>     implementation
>
>     One question I have: do the values in the key-value pairs need to be
>     unique across different languages?  As an example, can two dialects
>     of different languages share the same string xyz as a designation
>     used as value in ...v-...-sp-xyz?
>     Alternatively/additionally, we could use the glottocodes for the
>     major dialects already included in the Glottolog (they are a string
>     of 4 letters and 4 digits), and come to an agreement with the
>     Glottolog folks to extend their list of dialects.
>     (It would be excellent to be in touch with them anyways, also for
>     cases where ISO is less accurate than the Glottolog.)
>
>
>     Again, thanks so much.  I hope we can build on this initial proposal!
>
>     Sebastian
>
>
>     On 27/11/2020 20:01, Doug Ewell wrote:
>     > Sebastian Drude wrote:
>     >
>     >> I am here participating in this group exactly to touch base and see
>     >> how we can implement this in a way that it is usable and compatible
>     >> with BCP 47, BEFORE a list is created which may not be compatible in
>     >> some way with IETF/Languages' approach.
>     > Here's one possible way the structure of ISO TR 21636 could be made
>     > to fit into a BCP 47 extension. This shows the type of groundwork
>     > that should, and really must, be done up front if it is a major goal
>     > to make 21636 fit into BCP 47, without requiring all or even most of
>     > the values or code elements to be known.
>     >
>     > As a partial reference, I'm using a document called "NP-21636,"
>     > provided by Sebastian and dated 2019-09-30. It's up to Sebastian
>     > whether I may redistribute that document to anyone.
>     >
>     > First, an extension singleton must be chosen. Here I suggest 'v' for
>     > "varieties," as the title of the TR is "Identification and
>     > description of language varieties." The alphabetically consecutive
>     > nature of the three extensions, 't' and 'u' and now 'v', may or may
>     > not be unfortunate.
>     >
>     > Next, the TR identifies eight "dimensions" which are independent of
>     > each other to a greater or lesser extent. It is clearly desirable to
>     > be able to specify more than one dimension in a single tag, up to all
>     > eight, but not necessarily so. For this I will suggest a key-value
>     > mechanism like that devised for the 'u' extension, so that each
>     > dimension is indicated by a two-character key, derived from the names
>     > Sebastian provided on the 24th:
>     >
>     > sp    Space
>     > ti    Time
>     > so    Social group
>     > me    Medium
>     > si    Situation
>     > pe    Person
>     > pr    Proficiency
>     > co    Communicative functioning
>     >
>     > (The NP-21636 document used the term "performance" instead of
>     > "communicative functioning," which would have caused a minor
>     > inconvenience by overloading 'pe'.)
>     >
>     > Then there would be "value" identifiers of 3 to 8 characters each,
>     > providing values for each of the dimensions. The lengths of each of
>     > these subtags, 1 for the extension singleton and 2 for the dimensions
>     > and 3 to 8 for the values, are a critical part of BCP 47 syntax,
>     > permitting humans and processes to parse the tag and figure out what
>     > modifies what.
>     >
>     > So you might have "pt-BR-v-so-business-pr-fullpro" as one example.
>     >
>     > A finite number of values for each dimension would need to be
>     > identified so they could be encoded, keeping uniqueness and
>     > syntactical constraints in mind. This is the process of building a
>     > code list, or registry. The set of values would be extensible; no one
>     > and no group could be expected to get this perfect the first time.
>     > But stability rules would also come into play here.
>     >
>     > Note that the values might not have to be unique across dimensions;
>     > you could have "en-v-so-foo-pr-foo" if the string 'foo' represented a
>     > social group value and coincidentally also a proficiency value. Too
>     > much of this could lead to human confusion, though.
>     >
>     > I have no idea what to do about things like "certainty status," but
>     > some syntax along these lines could be devised. If this is an
>     > "adjective" modifying dimensional values, rather than a value itself,
>     > that should be syntactically clear.
>     >
>     > The dimension that troubles me the most, of course, is "person." The
>     > document makes it clear in several passages that Sebastian has his
>     > own personal variety (times the number of languages in which he can
>     > communicate), I have my own, Michael and John and Leon and Peter have
>     > their own, and so forth. Obviously neither we at ietf-languages, nor
>     > anyone else, are going to encode identifiers for each of 7.7 billion
>     > living people, plus those deceased, plus those not yet born. It may
>     > be that the only possible way to encode this dimension will be with a
>     > private-use subtag at (by definition) the end of the tag.
>     >
>     > This list may or may not be the proper venue to continue this
>     > discussion.
>     >
>     > --
>     > Doug Ewell, CC, ALB | Thornton, CO, US | ewellic.org
>     >
>     >
>
>
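
The layout proposed in Doug Ewell's quoted message (singleton 'v', two-character dimension keys, values of 3 to 8 characters) can be sketched as a small Python parser. This is only an illustration under stated assumptions: nothing here is registered, the function name parse_v_extension is made up, allowing several values under one key is still an open question in this thread, and requiring at least one value per key is my own assumption. The rest of the tag is not validated.

```python
import re

# Dimension keys as proposed in the quoted message; none of this is registered.
DIMENSIONS = {
    "sp": "Space", "ti": "Time", "so": "Social group", "me": "Medium",
    "si": "Situation", "pe": "Person", "pr": "Proficiency",
    "co": "Communicative functioning",
}
VALUE = re.compile(r"[a-z0-9]{3,8}")  # proposed value length: 3 to 8 characters


def parse_v_extension(tag: str) -> dict[str, list[str]]:
    """Return {dimension key: [values]} for a hypothetical 'v' extension."""
    subtags = tag.lower().split("-")
    start = None
    for i, sub in enumerate(subtags[1:], start=1):
        if sub == "x":        # private use begins; no extensions after it
            break
        if sub == "v":
            start = i + 1
            break
    if start is None:
        return {}
    result: dict[str, list[str]] = {}
    key = None
    for sub in subtags[start:]:
        if len(sub) == 1:     # the next singleton (or 'x') ends this extension
            break
        if len(sub) == 2:     # subtag length alone marks this as a key
            if sub not in DIMENSIONS:
                raise ValueError(f"unknown dimension key: {sub!r}")
            if sub in result:
                raise ValueError(f"repeated dimension key: {sub!r}")
            key = sub
            result[key] = []
        elif VALUE.fullmatch(sub) and key is not None:
            result[key].append(sub)
        else:
            raise ValueError(f"unexpected subtag: {sub!r}")
    for k, values in result.items():
        if not values:        # assumption: every key needs at least one value
            raise ValueError(f"dimension key without a value: {k!r}")
    return result


print(parse_v_extension("pt-BR-v-so-business-pr-fullpro"))
```

The parser relies only on subtag length to tell singleton, key and value apart, which is the property the quoted message calls a critical part of BCP 47 syntax. The same string may appear under two different keys, as in "en-v-so-foo-pr-foo", because each value is stored under its own dimension.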