Re: [Ietf-languages] A long post about long Description and Comments fields

Peter Constable <pgcon6@msn.com> Sat, 12 September 2020 20:46 UTC

From: Peter Constable <pgcon6@msn.com>
To: Doug Ewell <doug@ewellic.org>, "ietf-languages@ietf.org" <ietf-languages@ietf.org>
Date: Sat, 12 Sep 2020 20:46:32 +0000
Message-ID: <MWHPR1301MB21124E6EEDCE805AFC14B9C286250@MWHPR1301MB2112.namprd13.prod.outlook.com>
References: <001b01d68941$ed51d110$c7f57330$@ewellic.org>
In-Reply-To: <001b01d68941$ed51d110$c7f57330$@ewellic.org>
Archived-At: <https://mailarchive.ietf.org/arch/msg/ietf-languages/qQqmtYsoux7M-uXgDzOzAVtJpcw>
Subject: Re: [Ietf-languages] A long post about long Description and Comments fields

>"Name" might have been better in retrospect.

In ISO 639-3:2007, I attempted to convey the idea that there are symbols (the language IDs) that are used to code a conceptual meaning (the _denotation_), and that the standard provides a "reference name" as minimal information to capture the denotation (normative to the extent that it indicates the intended denotation). The RA could supplement this with other alternate names, and a complete characterization of the denotation will in general require other encyclopedic information, which the RA might supply.

It's not unlike character names in Unicode: ARABIC NUMBER SIGN is a minimal, sufficient way to capture the concept that is encoded, though there's a bunch of other information needed (in character property data and prose description in the core text) to fully understand the concept.

I think there's a similar issue here: one field should provide a minimal, sufficient indication of the meaning, but there should be supplemental information to provide a more complete explanation of the meaning. The suggestion that the former could be used in UI does make it a little bit trickier to get that right, but probably not by much.


Peter

-----Original Message-----
From: Ietf-languages <ietf-languages-bounces@ietf.org> On Behalf Of Doug Ewell
Sent: Saturday, September 12, 2020 1:19 PM
To: ietf-languages@ietf.org
Subject: [Ietf-languages] A long post about long Description and Comments fields

While we wait for the review period for variant subtag 'vecdruka' to conclude, I did a little research into the longest Description and Comments fields we currently have in the Registry. This is an attempt to justify my constant and probably tedious complaints about what I consider overly long proposed values for these fields, not just in this most recent proposal but for many years now.

My opinion is that Description fields should be suitable to identify the language, script, region, variation, etc. in a list, possibly in a user interface. It is almost never necessary to explain the entity in great detail in order to identify it, and the longer the Description field, the more cumbersome it is for users to determine whether it is the subtag they need.

Here is a quote from RFC 5646, Section 3.1.5, which I use as guidance:

"The 'Description' field is used for identification purposes. Descriptions SHOULD contain all and only that information necessary to distinguish one subtag from others with which it might be confused.  They are not intended to provide general background information or to provide all possible alternate names or designations.  'Description' fields don't necessarily represent the actual native name of the item in the record, nor are any of the  descriptions guaranteed to be in any particular language (such as English or French, for example)."

Perhaps the name "Description" tends to encourage explanatory text; "Name" might have been better in retrospect.

The longest Description fields we currently have are for the three Portuguese orthographic variants added in 2015:

Type: variant
Subtag: abl1943
Description: Orthographic formulation of 1943 - Official in Brazil
  (Formulário Ortográfico de 1943 - Oficial no Brasil)
[106 characters]

Type: variant
Subtag: ao1990
Description: Portuguese Language Orthographic Agreement of 1990 (Acordo
  Ortográfico da Língua Portuguesa de 1990)
[100 characters]

Type: variant
Subtag: colb1945
Description: Portuguese-Brazilian Orthographic Convention of 1945
  (Convenção Ortográfica Luso-Brasileira de 1945)
[100 characters]

All of these state the title of the agreement in both English and Portuguese, effectively doubling their length. I don't know why it was felt necessary to do this, and don't remember if that was a discussion we had at the time.
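For anyone who wants to reproduce the character counts quoted in this post, here is a minimal Python sketch. It rests on an assumption about how the counts were made: the field name is dropped, the folded continuation lines are joined with single spaces, and the result is counted in characters rather than bytes. The function names `unfold` and `field_length` are hypothetical, not part of any registry tooling.

```python
def unfold(field_body: str) -> str:
    """Join the folded physical lines of a field body into one logical value."""
    parts = [line.strip() for line in field_body.splitlines()]
    return " ".join(part for part in parts if part)


def field_length(field: str) -> int:
    """Length in characters of a field's value, ignoring the field name."""
    # Split on the first colon only, so colons inside the value are kept.
    _name, _sep, body = field.partition(":")
    return len(unfold(body))


if __name__ == "__main__":
    abl1943 = (
        "Description: Orthographic formulation of 1943 - Official in Brazil\n"
        "  (Formul\u00e1rio Ortogr\u00e1fico de 1943 - Oficial no Brasil)"
    )
    print(field_length(abl1943))  # 106 under the assumptions above
```

Under these assumptions the 'abl1943' Description comes out at 106 characters, matching the figure given above.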

Beyond these, the longest Description for a variant subtag is this:

Type: variant
Subtag: 1959acad
Description: "Academic" ("governmental") variant of Belarusian as
  codified in 1959
[69 characters]

That at least seems reasonable; the extra wording is needed to explain the choice of subtag value.

By comparison, for the current Latvian request, the originally proposed Description was this:

Description: Latvian language in the old orthography used before 1910s
  ("vecā druka" in Latvian)
[83 characters]

which I have proposed to reduce to the following:

Description: Latvian orthography used before 1920s ("vecā druka")
[52 characters]

which I hope still communicates the same meaning, and explains the subtag value equally well, but more compactly. (The change from "1910s" to "1920s" is for consistency with the Comments field, as mentioned last week.)

For the Comments field, RFC 5646 has this to say about brevity (Section 3.1.12):

"The primary reason for the 'Comments' field is subtag identification -- to help distinguish the subtag from others with which it might be confused as an aid to usage.  Large amounts of information about the use, history, or general background of a subtag are frowned upon, as these generally belong in a registration request rather than in the registry."

The longest Comments field we currently have is this beast:

Type: variant
Subtag: baku1926
Description: Unified Turkic Latin Alphabet (Historical)
Comments: Denotes alphabet used in Turkic republics/regions of the
  former USSR in late 1920s, and throughout 1930s, which aspired to
  represent equivalent phonemes in a unified fashion. Also known as: New
  Turkic Alphabet; Birlәşdirilmiş Jeni Tyrk
  Әlifbasь (Birlesdirilmis Jeni Tyrk Elifbasi);
  Jaŋalif (Janalif).
[300 characters]

At the time this was registered (2007, under RFC 4646), I believe there was concern that software or fonts would not be able to render some of the more obscure Latin letters, yet it was important to document the alphabet by including them; so in addition to explaining why the alphabet was used, all of its alternative names were spelled out twice. I would hope we never do this again. Maybe we weren't frowning hard enough.

(The bizarre line-wrapping is due to an RFC 5646 requirement that physical lines in the Registry be no more than 72 bytes (UTF-8 code units) -- not characters -- in length. All of the non-Basic Latin characters in the non-parenthesized names take two bytes each in UTF-8, and they add up.)
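The gap between the visible length of a line and its encoded length can be illustrated with a short Python sketch. The 72 limit is taken from the description above and is applied here to the UTF-8 encoded length of each physical line; the names `MAX_LINE_BYTES`, `line_lengths`, and `overlong_lines` are hypothetical and do not come from RFC 5646 or any registry tool.

```python
from __future__ import annotations

# Limit as described in this post; treated here as an assumption.
MAX_LINE_BYTES = 72


def line_lengths(line: str) -> tuple[int, int]:
    """Return (characters, UTF-8 bytes) for one physical line."""
    return len(line), len(line.encode("utf-8"))


def overlong_lines(record: str, limit: int = MAX_LINE_BYTES) -> list[str]:
    """Return the physical lines whose UTF-8 encoding exceeds the limit."""
    return [
        line
        for line in record.splitlines()
        if len(line.encode("utf-8")) > limit
    ]


if __name__ == "__main__":
    # U+014B (LATIN SMALL LETTER ENG) takes two bytes in UTF-8.
    chars, octets = line_lengths("  Ja\u014balif (Janalif).")
    print(chars, octets)
```

A line can therefore look comfortably short on screen and still exceed the limit once every non-ASCII letter is counted as two or more bytes.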

After this, the next two longest Comments fields each include a bibliographic reference (including publisher) to the work that defines the variant, which might have been placed in the registration form instead:

Type: variant
Subtag: tarask
Description: Belarusian in Taraskievica orthography
Comments: The subtag represents Branislau Taraskievic's Belarusian
  orthography as published in "Bielaruski klasycny pravapis" by Juras
  Buslakou, Vincuk Viacorka, Zmicier Sanko, and Zmicier Sauka (Vilnia-
  Miensk 2005).
[206 characters]

Type: variant
Subtag: alalc97
Description: ALA-LC Romanization, 1997 edition
Comments: Romanizations recommended by the American Library Association
  and the Library of Congress, in "ALA-LC Romanization Tables:
  Transliteration Schemes for Non-Roman Scripts" (1997), ISBN
  978-0-8444-0940-5.
[201 characters]

For the current Latvian proposal, the original Comments field outstripped even the 'baku1926' example:

Comments: The subtag represents the old orthography of Latvian language
  used during c. 1600s–1920s. It was first described in 1863 by August
  Bielenstein in his book "Die lettische Sprache, nach ihren Lauten und
  Formen". The orthography has been been official for Latvian language
  till new orthography was approved in 1908 and fully adopted in 1930.
[348 characters]

All of this background can go in the registration form. I have proposed the following text:

Comments: The subtag represents the old orthography of the Latvian
  language used during c. 1600s–1920s.
[93 characters]

which may be so much briefer that it doesn't add much information beyond the Description. I'm open to suggestions that some additional content is important to understanding the identity of this subtag, but probably not three and a half times as much.

--
Doug Ewell | Thornton, CO, US | ewellic.org

