[Ltru] Re: [ExtV] Glottolog source?
Sebastian Drude <Sebastian_Drude@museu-goeldi.br> Thu, 01 August 2024 12:06 UTC
Message-ID: <68c28ba4-2d8b-440a-964c-6e52dcb7377e@museu-goeldi.br>
Date: Thu, 01 Aug 2024 09:06:39 -0300
MIME-Version: 1.0
User-Agent: Mozilla Thunderbird
To: Doug Ewell <doug@ewellic.org>, "ltru@ietf.org" <ltru@ietf.org>
References: <SJ0PR03MB6598AB5D3CC5E24280FC5773CAAD2@SJ0PR03MB6598.namprd03.prod.outlook.com> <3b121609-cefe-4183-b97d-031f9b6fadad@museu-goeldi.br> <SJ0PR03MB6598D21E5188DA9B6A8C9B44CAB02@SJ0PR03MB6598.namprd03.prod.outlook.com>
Content-Language: en-US
From: Sebastian Drude <Sebastian_Drude@museu-goeldi.br>
In-Reply-To: <SJ0PR03MB6598D21E5188DA9B6A8C9B44CAB02@SJ0PR03MB6598.namprd03.prod.outlook.com>
Content-Type: text/plain; charset="UTF-8"; format="flowed"
Content-Transfer-Encoding: quoted-printable
Subject: [Ltru] Re: [ExtV] Glottolog source?
List-Id: Language Tag Registry Update working group discussion list <ltru.ietf.org>
Archived-At: <https://mailarchive.ietf.org/arch/msg/ltru/uPCjxXyzkqV-y8Jxd6vzCnwkKA0>
List-Archive: <https://mailarchive.ietf.org/arch/browse/ltru>
List-Help: <mailto:ltru-request@ietf.org?subject=help>
List-Owner: <mailto:ltru-owner@ietf.org>
List-Post: <mailto:ltru@ietf.org>
List-Subscribe: <mailto:ltru-join@ietf.org>
List-Unsubscribe: <mailto:ltru-leave@ietf.org>
Dear Doug, all,

this is indeed a problem (which I had not seen). I will inquire with Harald and Robert if this policy can be improved.

Otherwise, are all of our working group "on board" (subscribed to this list)? Then we should be able to start work on the future internet draft.

I am in the middle of other urgent projects and therefore somewhat slow to answer these days.

Best wishes,
Sebastian

On 30/07/2024 01:00, Doug Ewell wrote:
> On Monday, July 22, 2024 at 6:51, Sebastian Drude wrote:
>
>> For dialects, I believe the basic resource is languoid.csv,
>> considering all the lines which have "False" in the column
>> "bookkeeping" and "Dialect" in the column "level". The Glottocode
>> sits in the first column.
>> [...]
>> Retired entries are maintained (with the value in the "bookkeeping"
>> column set to "True").
>
> Unfortunately, this turns out to be true only for "language" entries, not for "family" or (crucially) "dialect" entries.
>
> From the 2021 article by Forkel and Hammarström, which Sebastian forwarded to us as a PDF on June 27 (privately, not on this list):
>
> "If a language-level languoid was completely erroneous, it is moved to the Bookkeeping category. [...] In contrast, the family-level and dialect-level glottocodes are not 'protected' and may be removed from the inventory between releases. Since they do not necessarily reflect a real-world entity like an L1 language, it cannot systematically be explained what 'happened' to them, e.g., if they never really existed."
>
> Regardless of whether Glottolog's curators or contributors believe that a previously coded dialect may not actually exist, the fact remains that someone, somewhere, at some point, may have used its Glottocode conformantly to represent it. That is why Section 3.7 of RFC 5646 states clearly with regard to extensions:
>
> "The specification MUST be stable. That is, extension subtags, once defined by a specification, MUST NOT be retracted or change in meaning in any substantial way."
>
> This requirement exists for a very sound reason: It is absolutely imperative that BCP 47 tags, including those that use extensions, must not lose their meaning because one of its components simply vanishes from the external standard which once defined it. That was one of the key reasons for replacing RFC 3066 in the first place -- to prevent tags that depended directly on ISO 639 and 3166 from losing their meaning when code elements were removed from those standards with no trace.
>
> I had actually suspected, before reading the article, that some Glottocodes might disappear from the list, and have confirmed that, in every version of Glottolog since 3.1 -- when the format of 'languoid.csv' was changed to more or less its present form -- at least one dialect code marked "False" (i.e. not Bookkeeping) and "dialect" has been quietly removed from the following version. In the table below:
>
> column 1 = last version in which the code appeared
> column 2 = following version, in which it was permanently removed
> column 3 = Glottocode
> column 4 = description
>
> 3.1   3.2   bika1249   Bika
> 3.2   3.3   aiwa1238   Aiwanat
> 3.3   3.4   bula1258   Bulalakawnon
> 3.4   4.0   kara1463   Karapapakh
> 4.0   4.1   awun1244   Awuna
> 4.1   4.2   aizu1242   Aizuare
> 4.2   4.3   belu1238   Beludji
> 4.3   4.4   bedu1242   Beduanda
> 4.4   4.5   akit1239   Akit
> 4.5   4.6   serb1266   Serbo-Bosnian Vlax
> 4.6   4.7   ante1238   Antekerrepenh
> 4.7   4.8   abaw1238   Abawa
> 4.8   5.0   bara1408   Baraba Tatar
>
> Most of these codes begin with 'a' or 'b' for a reason: there are actually several codes in each release that are erased in this way, and I simply chose the one for this table that appeared first alphabetically.
>
> The article by Forkel and Hammarström continues:
>
> "However, some tracking possibilities are always guaranteed because both families and dialects are linked to language-level languoids. [...] if a dialect-level glottocode disappears, it is possible to check which language-level languoid it pertained to and to check which dialects are now associated with it."
>
> This also is not true, at least when considering the reasonably sized file on the Glottolog site. Each of the retired codes listed above was completely removed, with no direct or indirect reference in the following version. Thus it is not possible to check which language-level languoid it pertained to, unless one consults the previous, superseded version of Glottolog.
>
> It is possible that these deleted codes are carried over into the 'superseded.csv' file which is included in 'glottolog-X.X.zip', a comprehensive zip file which also includes the entire Glottolog tree structure (resulting, BTW, in path names which exceed the maximum length supported by many Windows tools) and which totals 99 megabytes as of version 5.0. I would hope that users of Extension V would not be required to download a 99-megabyte file to get the code list needed to support the extension. (Then again, I also hoped that users of Extensions T and U would not have to download the 30-megabyte 'cldr-common-X.X.zip' file to get the code lists for those extensions, but that argument was lost.)
>
> Given the current Glottolog policy of deleting Glottocodes that refer to dialects, rather than marking them "obsolete" or "retired" or "bookkeeping" or some such, it seems the only way to use Glottolog conformantly in Extension V is for us to create our own registry, using codes taken from 'languoids.csv' but with an added policy of preserving and identifying retired codes. This is essentially what we had to do with ISO 639 and 3166, back in 2005.
>
> This would be a large registry; version 5.0 of 'languoids.csv' lists 13,507 dialect codes, which is about twice the number of languages in ISO 639-3. But at least the deleted Glottocodes from versions prior to 5.0 would not need to be included, because they would not have been valid in Extension V. The grandfathering process, if you will, would only have to begin once the extension is approved.
>
> Without a mechanism like this, or unless Harald and Robert change their policy so that dialect codes are guaranteed to be preserved, I do not see any way we can use Glottolog for this extension and stay conformant with BCP 47. Perhaps Sebastian may be able to persuade them.
>
> --
> Doug Ewell, CC, ALB | Lakewood, CO, US | ewellic.org
>

Sebastian

--
Museu P.E. Goeldi, CCH, Linguistica ▪ Av. Perimetral, 1901
Terra Firme, CEP: 66077-530 ▪ Belém do Pará – PA ▪ Brazil
Sebastian_Drude@museu-goeldi.br ▪ Mobile: +55 (91) 983 733 319
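
The check and the registry idea discussed in this thread can be sketched roughly as follows. This is only an illustration under stated assumptions, not the procedure anyone on the list actually ran: it assumes each release's 'languoid.csv' (the thread also writes 'languoids.csv') has a header row with columns named `id`, `name`, `bookkeeping` and `level`, and the function names and registry layout are hypothetical.

```python
import csv
import io


def active_dialects(csv_text):
    """Return {glottocode: name} for rows that are dialects and not bookkeeping.

    Assumption: the header row names the columns "id", "name",
    "bookkeeping" and "level"; values are compared case-insensitively
    because the thread spells the level both "Dialect" and "dialect".
    """
    found = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        is_bookkeeping = row["bookkeeping"].strip().lower() == "true"
        is_dialect = row["level"].strip().lower() == "dialect"
        if is_dialect and not is_bookkeeping:
            found[row["id"]] = row["name"]
    return found


def removed_dialects(old_csv_text, new_csv_text):
    """Dialect codes active in the old release that are absent from the new one.

    "Absent" means the code does not appear in any row of the new file,
    not even as a bookkeeping entry.
    """
    old = active_dialects(old_csv_text)
    new_ids = {row["id"] for row in csv.DictReader(io.StringIO(new_csv_text))}
    return {code: name for code, name in old.items() if code not in new_ids}


def update_registry(registry, csv_text, version):
    """Fold one release into a registry that never deletes a code.

    Hypothetical layout: registry[code] = {"name", "added", "retired"}.
    A code that is no longer an active dialect is marked with the version
    in which it went missing; its entry and name are kept unchanged.
    """
    current = active_dialects(csv_text)
    for code, name in current.items():
        if code not in registry:
            registry[code] = {"name": name, "added": version, "retired": None}
    for code, entry in registry.items():
        if code not in current and entry["retired"] is None:
            entry["retired"] = version
    return registry
```

Calling `update_registry` once per release, starting from the first version accepted for the extension, would keep every dialect code that was ever valid and record when it disappeared, which is the kind of stability Section 3.7 of RFC 5646 asks for. Whether a retired code should also record the language-level languoid it belonged to is left open here.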