[Ltru] Re: [ExtV] Glottolog source?

Doug Ewell <doug@ewellic.org> Tue, 30 July 2024 04:00 UTC

From: Doug Ewell <doug@ewellic.org>
To: Sebastian Drude <Sebastian_Drude@museu-goeldi.br>, "ltru@ietf.org" <ltru@ietf.org>
Date: Tue, 30 Jul 2024 04:00:32 +0000
Subject: [Ltru] Re: [ExtV] Glottolog source?
List-Id: Language Tag Registry Update working group discussion list <ltru.ietf.org>
Archived-At: <https://mailarchive.ietf.org/arch/msg/ltru/Z0Ky22ENmyyya74j14ncHBv299k>

On Monday, July 22, 2024 at 6:51, Sebastian Drude wrote:

> For dialects, I believe the basic resource is languoid.csv,
> considering all the lines which have "False" in the column
> "bookkeeping" and "Dialect" in the column "level".  The Glottocode
> sits in the first column.
> [...]
> Retired entries are maintained (with the value in the "bookkeeping"
> column set to "True").

Unfortunately, this turns out to be true only for "language" entries, not for "family" or (crucially) "dialect" entries.

From the 2021 article by Forkel and Hammarström, which Sebastian forwarded to us as a PDF on June 27 (privately, not on this list):

"If a language-level languoid was completely erroneous, it is moved to the Bookkeeping category. [...] In contrast, the family-level and dialect-level glottocodes are not 'protected' and may be removed from the inventory between releases. Since they do not necessarily reflect a real-world entity like an L1 language, it cannot systematically be explained what 'happened' to them, e.g., if they never really existed."

Regardless of whether Glottolog's curators or contributors believe that a previously coded dialect may not actually exist, the fact remains that someone, somewhere, at some point, may have used its Glottocode conformantly to represent it. That is why Section 3.7 of RFC 5646 states clearly with regard to extensions:

"The specification MUST be stable.  That is, extension subtags, once defined by a specification, MUST NOT be retracted or change in meaning in any substantial way."

This requirement exists for a very sound reason: It is absolutely imperative that BCP 47 tags, including those that use extensions, not lose their meaning because one of their components simply vanishes from the external standard which once defined it. That was one of the key reasons for replacing RFC 3066 in the first place -- to prevent tags that depended directly on ISO 639 and 3166 from losing their meaning when code elements were removed from those standards with no trace.

I had actually suspected, before reading the article, that some Glottocodes might disappear from the list. I have now confirmed that, in every version of Glottolog since 3.1 -- when the format of 'languoid.csv' was changed to more or less its present form -- at least one code with "False" in the "bookkeeping" column (i.e. not Bookkeeping) and "dialect" in the "level" column has been quietly removed from the following version. In the table below:

column 1 = last version in which the code appeared
column 2 = following version, in which it was permanently removed
column 3 = Glottocode
column 4 = description

3.1	3.2	bika1249	Bika
3.2	3.3	aiwa1238	Aiwanat
3.3	3.4	bula1258	Bulalakawnon
3.4	4.0	kara1463	Karapapakh
4.0	4.1	awun1244	Awuna
4.1	4.2	aizu1242	Aizuare
4.2	4.3	belu1238	Beludji
4.3	4.4	bedu1242	Beduanda
4.4	4.5	akit1239	Akit
4.5	4.6	serb1266	Serbo-Bosnian Vlax
4.6	4.7	ante1238	Antekerrepenh
4.7	4.8	abaw1238	Abawa
4.8	5.0	bara1408	Baraba Tatar

Most of these codes begin with 'a' or 'b' for a reason: there are actually several codes in each release that are erased in this way, and I simply chose the one for this table that appeared first alphabetically.
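
A comparison of this kind can be reproduced with a short script. What follows is only a minimal sketch in Python, not the procedure actually used to build the table above. It relies on the column names "bookkeeping" and "level" and on the Glottocode sitting in the first column, as Sebastian described; the function names, the "id" and "name" column headings, and all sample rows except the Bika line are invented for illustration.

```python
import csv
import io


def active_dialects(csv_text):
    """Return the set of codes whose row has bookkeeping=False and level=dialect."""
    reader = csv.DictReader(io.StringIO(csv_text))
    code_column = reader.fieldnames[0]  # the Glottocode sits in the first column
    codes = set()
    for row in reader:
        not_bookkeeping = row["bookkeeping"].strip().lower() == "false"
        is_dialect = row["level"].strip().lower() == "dialect"
        if not_bookkeeping and is_dialect:
            codes.add(row[code_column])
    return codes


def all_codes(csv_text):
    """Return every code in the file, whatever its level or bookkeeping status."""
    reader = csv.DictReader(io.StringIO(csv_text))
    code_column = reader.fieldnames[0]
    return {row[code_column] for row in reader}


def removed_dialects(old_csv_text, new_csv_text):
    """Active dialect codes in the old release with no row at all in the new one."""
    return sorted(active_dialects(old_csv_text) - all_codes(new_csv_text))


# Tiny stand-ins for two consecutive releases (illustrative rows only).
OLD = """id,name,bookkeeping,level
aaaa0001,Placeholder Language,False,language
bika1249,Bika,False,dialect
bbbb0002,Placeholder Dialect,False,dialect
"""
NEW = """id,name,bookkeeping,level
aaaa0001,Placeholder Language,False,language
bbbb0002,Placeholder Dialect,False,dialect
"""

print(removed_dialects(OLD, NEW))
```

With real data, the two strings would be the contents of 'languoid.csv' from consecutive releases, read as UTF-8 text. A code that merely changed level or moved to Bookkeeping still has a row in the newer file, so it is not reported; only codes with no remaining row are.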

The article by Forkel and Hammarström continues:

"However, some tracking possibilities are always guaranteed because both families and dialects are linked to language-level languoids. [... ] if a dialect-level glottocode disappears, it is possible to check which language-level languoid it pertained to and to check which dialects are now associated with it."

This also is not true, at least when considering only 'languoid.csv', the reasonably sized file on the Glottolog site. Each of the retired codes listed above was completely removed, with no direct or indirect reference in the following version. Thus it is not possible to check which language-level languoid it pertained to, unless one consults the previous, superseded version of Glottolog.

It is possible that these deleted codes are carried over into the 'superseded.csv' file, which is included in 'glottolog-X.X.zip'. This comprehensive zip file also includes the entire Glottolog tree structure (resulting, BTW, in path names which exceed the maximum length supported by many Windows tools) and totals 99 megabytes as of version 5.0. I would hope that users of Extension V would not be required to download a 99-megabyte file to get the code list needed to support the extension. (Then again, I also hoped that users of Extensions T and U would not have to download the 30-megabyte 'cldr-common-X.X.zip' file to get the code lists for those extensions, but that argument was lost.)

Given the current Glottolog policy of deleting Glottocodes that refer to dialects, rather than marking them "obsolete" or "retired" or "bookkeeping" or some such, it seems the only way to use Glottolog conformantly in Extension V is for us to create our own registry, using codes taken from 'languoid.csv' but with an added policy of preserving and identifying retired codes. This is essentially what we had to do with ISO 639 and 3166, back in 2005.

This would be a large registry; version 5.0 of 'languoid.csv' lists 13,507 dialect codes, which is about twice the number of languages in ISO 639-3. But at least the deleted Glottocodes from versions prior to 5.0 would not need to be included, because they would not have been valid in Extension V. The grandfathering process, if you will, would only have to begin once the extension is approved.
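
To make the "preserve and identify" policy concrete, here is a rough sketch in Python of how such a registry might be updated from each new release. It is an illustration under stated assumptions, not a proposed format: the record fields "added" and "deprecated" are hypothetical (only loosely modelled on the fields of the same name in the Language Subtag Registry), the version labels "N" and "N+1" are placeholders, and the codes are invented.

```python
def update_registry(registry, release_codes, version):
    """Merge one release into the registry without ever deleting a record.

    registry:      dict mapping code -> record (hypothetical structure)
    release_codes: dict mapping code -> description, taken from one release
    version:       label of that release
    """
    for code, description in release_codes.items():
        record = registry.get(code)
        if record is None:
            # First time this code is seen: register it.
            registry[code] = {
                "description": description,
                "added": version,
                "deprecated": None,
            }
        else:
            # Assumption: a code that reappears is treated as current again.
            record["deprecated"] = None

    for code, record in registry.items():
        # A code missing from this release is kept and marked, not removed.
        if code not in release_codes and record["deprecated"] is None:
            record["deprecated"] = version
    return registry


registry = {}
update_registry(registry, {"aaaa0001": "Placeholder A", "bbbb0002": "Placeholder B"}, "N")
update_registry(registry, {"bbbb0002": "Placeholder B"}, "N+1")

for code, record in sorted(registry.items()):
    print(code, record)
```

The point of the design is simply that the second loop marks a record instead of dropping it, so a tag that used 'aaaa0001' under release N can still be interpreted after release N+1.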

Without a mechanism like this, or unless Harald and Robert change their policy so that dialect codes are guaranteed to be preserved, I do not see any way we can use Glottolog for this extension and stay conformant with BCP 47. Perhaps Sebastian can persuade them.

--
Doug Ewell, CC, ALB | Lakewood, CO, US | ewellic.org