Re: [Ietf-languages] First cut at a BCP 47 extension structure for ISO TR 21636 (was: Language subtag registration form)

Mark Davis ☕️ <mark@macchiato.com> Sun, 29 November 2020 02:33 UTC

From: Mark Davis ☕️ <mark@macchiato.com>
Date: Sat, 28 Nov 2020 18:32:20 -0800
Message-ID: <CAJ2xs_HnUGXTzFmvVdq3ov5Y5ArBwPnAX8ZxS=tW3ANEYMYCQw@mail.gmail.com>
To: Doug Ewell <doug@ewellic.org>
Cc: Sebastian Drude <drude@xs4all.nl>, "ietf-languages@iana.org" <ietf-languages@iana.org>
Archived-At: <https://mailarchive.ietf.org/arch/msg/ietf-languages/BZTzBXdhIYnU38bzKGtz9XPLfw0>
Subject: Re: [Ietf-languages] First cut at a BCP 47 extension structure for ISO TR 21636 (was: Language subtag registration form)

> BCP 47 says that variants should not be defined to have different
> meanings depending on the language.

1. I think the actual phrasing implies something slightly different:

   Requests to add a 'Prefix' field to a variant subtag that imply a
   different semantic meaning SHOULD be rejected.  For example, a
   request to add the prefix "de" to the subtag '1994' so that the tag
   "de-1994" represented some German dialect or orthographic form would
   be rejected.  The '1994' subtag represents a particular Slovenian
   orthography, and the additional registration would change or blur the
   semantic meaning assigned to the subtag.

The main purpose of that clause is that if variant1 modifies prefix1 in a
particular way, then variant1 shouldn't modify prefix2 in a very different
way. It doesn't apply when there isn't a prefix, such as with fonipa, or
anything else that represents a well-defined modification of previous
subtags. It also doesn't apply to defined extensions; it is up to the spec
for that extension to determine the right mechanisms.

Certainly it doesn't require gratuitously different spellings for a
general-purpose subtag, based on the locales it is applied to. That is,
there is no necessity to have subtags for a 'register' dimension of
'formal' that add extra letters or numbers just to be different.

fr-v-re-formalfr
de-v-fr-formde
en-v-fr-formen
en-v-fr-formengb
...

However, it would be very ill-advised to define a subtag 'formal' for a
*register* dimension, and use that same subtag for a different dimension
like *space*.

Moreover, for the highest degree of utility, the functional application of
each value should be applicable across languages. For example, if 'formal'
for one language is reserved for the emperor, while 'formal' in another
language is for any people that you don't know well, then it becomes very
difficult for developers to apply the value consistently.
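
To make the point about dimensions concrete, here is a minimal sketch of a
value registry that is keyed only by dimension, never by language. The class
name, the method names, and the keys 're' (register) and 'sp' (space) are
assumptions for illustration; nothing like this has been specified for the
proposed extension.

```python
class DimensionRegistry:
    """Toy registry: each value subtag belongs to exactly one dimension key."""

    def __init__(self):
        self._key_of = {}  # value subtag -> dimension key that owns it

    def register(self, key, value):
        key, value = key.lower(), value.lower()
        owner = self._key_of.get(value)
        if owner is not None and owner != key:
            # Reusing 'formal' under a second dimension is what we want to avoid.
            raise ValueError(f"{value!r} already belongs to dimension {owner!r}")
        self._key_of[value] = key

    def is_valid(self, key, value):
        # No language argument: the answer is the same for fr, de, en, en-GB.
        return self._key_of.get(value.lower()) == key.lower()


registry = DimensionRegistry()
registry.register("re", "formal")  # hypothetical 'register' dimension
```

Because is_valid() never sees the language, there is no place for a
per-language spelling to creep in, and a second registration of 'formal' under
a different key fails loudly.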

2. There was an issue around multiple subtag values, e.g.
fr-v-xx-abcdef-ghijk. If you allow them, you need to specify precisely
which combinations are allowed, and what the semantics of each combination
are.
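
As a rough illustration of what "multiple values" looks like to a consumer,
the sketch below splits a tag into keys and value lists. It assumes the
proposed extension uses the singleton 'v' and follows the same shape as the
'u' extension (two-character keys, each followed by longer value subtags);
that shape and the function name are assumptions, not something agreed on
this list.

```python
def parse_extension(tag, singleton="v"):
    """Return {key: [value subtags]} for one extension, or {} if it is absent."""
    subtags = tag.lower().split("-")
    if singleton not in subtags:
        return {}
    start = subtags.index(singleton) + 1
    result = {}
    key = None
    for subtag in subtags[start:]:
        if len(subtag) == 1:
            break                      # the next singleton ends this extension
        if len(subtag) == 2:
            key = subtag               # a new key starts here
            result[key] = []
        elif key is not None:
            result[key].append(subtag)  # another value for the current key
    return result
```

For fr-v-xx-abcdef-ghijk this yields one key 'xx' with two values. The parser
can only report that there are two; whether that pair is allowed, and what it
means, has to come from the extension's specification.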

In CLDR, for example, we currently allow a few different kinds of
combination. The examples that follow are from the machine-readable
specification for CLDR extension key-value pairs.

*incremental*: combinations where successive elements 'narrow' the scope of
the previous element, e.g.:

    <type name="islamic" description="Islamic calendar"/>
    <type name="islamic-umalqura" description="Islamic calendar, Umm al-Qura" since="24"/>
    <type name="islamic-tbla" description="Islamic calendar, tabular (intercalary years [2,5,7,10,13,16,18,21,24,26,29] - astronomical epoch)" since="24"/>
    <type name="islamic-civil" description="Islamic calendar, tabular (intercalary years [2,5,7,10,13,16,18,21,24,26,29] - civil epoch)" since="24"/>
    <type name="islamic-rgsa" description="Islamic calendar, Saudi Arabia sighting" since="24"/>

*multiple*: any ordered list, such as collation reordering codes, e.g.:

    <type name="space" description="Whitespace reordering code, see LDML Part 5: Collation" since="21"/>
    <type name="punct" description="Punctuation reordering code, see LDML Part 5: Collation" since="21"/>
    <type name="symbol" description="Symbol reordering code (other than currency), see LDML Part 5: Collation" since="21"/>
    <type name="currency" description="Currency reordering code, see LDML Part 5: Collation" since="21"/>
    <type name="digit" description="Digit (number) reordering code, see LDML Part 5: Collation" since="21"/>
    <type name="REORDER_CODE" description="Other collation reorder code — for script, see LDML Part 5: Collation" since="21"/>

*any*:

    <type name="PRIVATE_USE" description="Private use transform identifier. All subfields consistent with rfc6497 (that is, subtags of 3-8 alphanum characters) are valid, and do not require registration." since="21.0.2"/>
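
The three policies could be checked along these lines. This is only a sketch
under stated assumptions: the value sets are copied from the examples above
rather than from the full CLDR data, input is assumed to be lowercase already,
REORDER_CODE is approximated as "any four-letter script code", rejecting
duplicates in an ordered list is this sketch's own choice, and the function
names are invented for illustration.

```python
import re

INCREMENTAL_VALUES = {
    "islamic", "islamic-umalqura", "islamic-tbla",
    "islamic-civil", "islamic-rgsa",
}
NAMED_REORDER_CODES = {"space", "punct", "symbol", "currency", "digit"}
SCRIPT_CODE = re.compile(r"[a-z]{4}")        # stand-in for REORDER_CODE
PRIVATE_SUBTAG = re.compile(r"[a-z0-9]{3,8}")


def valid_incremental(subtags):
    """The joined value as a whole must be one of the listed values."""
    return "-".join(subtags) in INCREMENTAL_VALUES


def valid_multiple(subtags):
    """Ordered list: every element must be a known code, with no repeats."""
    if not subtags or len(set(subtags)) != len(subtags):
        return False
    return all(
        s in NAMED_REORDER_CODES or SCRIPT_CODE.fullmatch(s) is not None
        for s in subtags
    )


def valid_any(subtags):
    """Anything goes, as long as each subtag is 3-8 alphanumerics."""
    return bool(subtags) and all(
        PRIVATE_SUBTAG.fullmatch(s) is not None for s in subtags
    )
```

Each function encodes a different answer to "which combinations are allowed".
A new extension would need to say, per key, which of these answers applies.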

3. I would advise proceeding slowly and carefully; as remarked earlier, you
need to guarantee that there will not be backwards compatibility problems,
or people won't touch the extension with a 10-foot pole. So it would be
best to make sure that each dimension's core values are well defined, with
lots of examples from multiple languages, before adding each of them.

Mark


On Fri, Nov 27, 2020 at 10:30 PM Doug Ewell <doug@ewellic.org> wrote:

> Sebastian Drude wrote:
>
> > I believe we should not worry about the personal varieties at this
> > point.
>
> I think this would be a prudent exclusion. As with the 'u' and 't'
> extensions, keys can be added after the initial rollout, but I can't
> imagine a scenario in which encoding this dimension would ever be
> practical.
>
> > Probably we should form a small committee to continue this discussion,
> > instead of involving (and spamming, at this point) the whole list.  I
> > am certainly willing to be instrumental in (contributing to) forming
> > and driving such a group, in whichever role I can contribute best.
>
> Maybe a custom mailing list could be formed. Everyone here would need to
> be eligible to join the discussion if they choose to.
>
> > Points that I can see now that would need to be discussed, besides
> > fleshing out what you have begun:
> >
> > -- interaction with existing subtags, in particular dialects --
> > probably we want to avoid synonym language tags composed according to
> > different frameworks
>
> I can envision this becoming a real time and effort sink. As just one
> example, script subtags imply "written" (one implies "not written"), and
> this extension would provide a "medium" dimension. What if they disagree
> within the same tag? Trying to cherry-pick the 21636 values to exclude
> those that might conflict with other subtags would probably be very
> messy.
>
> > -- can more than one value in the SAME dimension be indicated? (I would
> > argue yes, if that is syntactically okay)
>
> Syntactically, sure:
>
> Sometimes this will make sense, such as for "communicative functioning."
> In many cases it clearly won't, as in combining "formal" and "informal"
> situations. Some semantic restrictions might be appropriate in this
> extension to guard against the latter. Our approach in BCP 47 has simply
> been to advise tag creators to "tag wisely," discouraging nonsensical
> combinations but not making them invalid.
>
> > -- the 'certainty' and similar "adjectives" (yes, that is how I would
> > see them; -- e.g. primary vs. secondary modality, genuine vs.
> > imitated, ...)
>
> I knew I had missed some in reading quickly through the NP document. If
> there are many, this would require some thought.
>
> > -- default values
>
> I know CLDR has this concept for the 'u' extension. It may just amount
> to defaulting true/false values to true. If this is needed on a
> per-dimension basis, there could be a "default" attribute on one value
> in the code list. But interpreting this would require tag consumers to
> have access to the code list, which is not usually desirable for
> syntactic analysis.
>
> > -- requirements for a registry, and its feasibility, and finally
> > implementation
>
> Oh, we'll go there.
>
> > One question I have: need the values in the key-value-pairs be unique
> > over different languages?  As an example, can two dialects of
> > different languages share the same string xyz as a designation used as
> > value in ...v-...-sp-xyz?
>
> The existing, analogous rule for variants is a point of controversy
> here. BCP 47 says that variants should not be defined to have different
> meanings depending on the language. So we have variant subtags like
> 'pinyin' that can apply to both Chinese and Tibetan, because it's nearly
> the same romanization scheme for both languages; and we have variants
> that can apply to almost any language, such as the one that means
> "written in IPA." But we can't have something like 'western' because the
> concept of "language X as used in the western part of country Y" has
> different meanings for different values of X and Y. Not everyone here
> agrees where to draw the line.
>
> I assume that many of the 21636 concepts have the same meaning
> regardless of the language, but this needs more study.
>
> > Alternatively/additionally, we could use the glottocodes for the major
> > dialects already included in the Glottolog (they are a string of 4
> > letters and 4 digits), and come to an agreement with the Glottolog
> > folks to extend their list of dialects.
> > (It would be excellent to be in touch with them anyways, also for
> > cases where ISO is less accurate than the Glottolog.)
>
> When I saw this originally, I wasn't in a position to respond, but it
> alarmed me. Establishing Glottolog as a competing standard to ISO 639
> for encoding language information in BCP 47, even in different subtag
> types, can only lead to confusion and duplicate representations, exactly
> what you expressed a desire to avoid earlier.
>
> Then I saw the following, and became even more alarmed:
>
> > However, differently from when BCP 47 was created, Glottolog now is an
> > impressive and very complete and accurate work.  Many scholars combine
> > it, or even prefer it over ISO 639, it is used in Wikipedia, etc., so
> > perhaps it is the time for rethinking the relationship between BCP and
> > Glottolog, especially for the cases of languages that are missing in
> > the Ethnologue / ISO 639.
>
> So, just to be 100% clear: the core standards that are used as the
> source for subtag structure in BCP 47, and for values of certain subtag
> types in the Language Subtag Registry, are fixed. They cannot be swapped
> in and out. Any proposal for "rethinking the relationship between BCP
> [47] and Glottolog," to the extent that means switching from ISO 639
> code elements to Glottolog code elements, is completely off the table.
>
> (Incidentally, Wikipedia uses an extension of ISO 639 for language
> coding, and only displays Glottolog code elements in language boxes. The
> Portuguese Wikipedia is at pt.wikipedia.org, not
> port1283.wikipedia.org.)
>
> The core standards were chosen deliberately, for a variety of reasons
> such as existing practice, suitability, and stability promises. We know
> there are other encoding systems for languages, some of which may even
> have isolated or perceived advantages over ISO 639. But this is what we
> have chosen. There is no significant likelihood this will be overturned.
>
> --
> Doug Ewell | Thornton, CO, US | ewellic.org