Re: [Ietf-languages] First cut at a BCP 47 extension structure for ISO TR 21636 (was: Language subtag registration form)

Hugh Paterson III <sil.linguist@gmail.com> Sat, 28 November 2020 00:54 UTC

From: Hugh Paterson III <sil.linguist@gmail.com>
Date: Sat, 28 Nov 2020 01:54:08 +0100
Message-ID: <CAE=3Ky8WNVj1qVS_mRvU9N5h4x90TfJ+rca6fdSSgbWV93x7Qw@mail.gmail.com>
To: Sebastian Drude <drude@xs4all.nl>
Cc: Doug Ewell <doug@ewellic.org>, Mark Davis ☕ <mark@macchiato.com>, IETF Languages Discussion <ietf-languages@iana.org>
Archived-At: <https://mailarchive.ietf.org/arch/msg/ietf-languages/o7bV_veqFasjCldYEo4hTTgxE4s>

Sebastian,

This is all very interesting; I am particularly interested in the comment
related to the Glottolog.
1. You refer to their "list of dialects", but are you certain that none of
these sub-varieties are actually sociolects, which therefore should not be
placed in the Space dimension? It would be problematic to assume that the
Glottolog's ways of subcategorizing languages all fall within one
dimension. Having said that, my understanding, from talking with one of
their editors, is that they don't create their own data but rather
aggregate other people's data. But perhaps the editors do have first-hand
or documentary evidence for their sub-categorization, as you seem to
suggest.
2. My understanding is that people can already use glottocodes with BCP 47
today, as private-use subtags following the -x- singleton.
3. Having communicated with a different editor of the Glottolog, my
understanding from him is that the seed data for the Glottolog came from
the Ethnologue, both directly and indirectly via Multi-tree, and that the
indirect route included the sub-categorization. So it might be equally good
to consult that reference work.
4. Please correct me if I am wrong, but my impression after reading some of
the Glottolog's about pages is that they do occasionally obliterate ID
codes they have used. That is a different process from ISO 639-3, where
retired language codes are given a special status when they enter
retirement. See:
https://github.com/glottolog/glottolog/issues/243
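
To illustrate point 2, here is a minimal sketch in Python of attaching a
glottocode as a private-use subtag. The function name tag_with_glottocode
and the placeholder code abcd1234 are hypothetical; the shape of 4 letters
plus 4 digits is taken from Sebastian's description quoted below, and the
base tag is assumed not to contain a private-use part yet.

```python
import re

# Glottocode shape as described in this thread: 4 letters, then 4 digits.
# (Assumption: this sketch does not try to cover any other shapes.)
GLOTTOCODE = re.compile(r"^[a-z]{4}[0-9]{4}$")


def tag_with_glottocode(base_tag: str, glottocode: str) -> str:
    """Append a glottocode to a language tag as a private-use subtag.

    Assumes base_tag is a well-formed tag without an existing "-x-" part.
    """
    code = glottocode.lower()
    if not GLOTTOCODE.match(code):
        raise ValueError(f"not a glottocode-shaped string: {glottocode!r}")
    # Private-use subtags may be 1 to 8 alphanumeric characters,
    # so an 8-character code fits in a single subtag.
    return f"{base_tag}-x-{code}"


# "abcd1234" is a made-up placeholder, not a real glottocode.
print(tag_with_glottocode("de", "abcd1234"))
```

Because everything after -x- is private use, such a tag is only meaningful
between parties who have agreed on that convention.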

all the best,
- Hugh

On Sat, Nov 28, 2020 at 12:44 AM Sebastian Drude <drude@xs4all.nl> wrote:

> Dear Doug,
>
> thank you so much for this.  It goes EXACTLY along the lines of what I
> have imagined could be achieved, as much as I have understood of BCP 47.
>
> For me, everything you have proposed could be adopted as the seed of
> a future regulation.
>
> I also agree that repeating values in different dimensions should be
> avoided, even if they are syntactically possible.
>
> I believe we should not worry about the personal varieties at this
> point.  I am quite confident that whatever metadata framework a tag is
> used in (metadata is what BCP 47 is all about, or at least the context
> it always appears in) will already provide ways of indicating the
> author/speaker/signer/... (one -- or several, in a resource with a
> dialogue, for instance), so that it would be a redundant feature for the
> language tag anyway.  Also, this dimension would be relevant only in
> very special applications, if ever.
>
>
> Probably we should form a small committee to continue this discussion,
> instead of involving (and spamming, at this point) the whole list.  I am
> certainly willing to be instrumental in (contributing to) forming and
> driving such a group, in whichever role I can contribute best.
>
>
> Points that I can see now that would need to be discussed, besides
> fleshing out what you have begun:
>
> -- interaction with existing subtags, in particular dialects -- probably
> we want to avoid synonymous language tags composed according to different
> frameworks
>
> -- can more than one value in the SAME dimension be indicated? (I would
> argue yes, if that is syntactically okay)
>
> -- the 'certainty' and similar "adjectives" (yes, that is how I would
> see them; -- e.g. primary vs. secondary modality, genuine vs. imitated,
> ...)
>
> -- default values
>
> -- requirements for a registry, and its feasibility, and finally
> implementation
>
> One question I have: do the values in the key-value pairs need to be
> unique across different languages?  As an example, can two dialects of
> different languages share the same string xyz as a designation used as
> value in ...v-...-sp-xyz?
> Alternatively/additionally, we could use the glottocodes for the major
> dialects already included in the Glottolog (they are a string of 4
> letters and 4 digits), and come to an agreement with the Glottolog folks
> to extend their list of dialects.
> (It would be excellent to be in touch with them anyway, also for cases
> where ISO is less accurate than the Glottolog.)
>
>
> Again, thanks so much.  I hope we can build on this initial proposal!
>
> Sebastian
>
> --
>
> Museu P.E. Goeldi, CCH, Linguistica ▪ Av. Perimetral, 1901
> Terra Firme, CEP: 66077-530 ▪ Belém do Pará – PA ▪ Brazil
> drude@xs4all.nl ▪ +55 (91) 3217 6024 ▪ +55 (91) 983733319
> Priv: Tv. Juvenal Cordeiro, 184, Apt 104 ▪ 66070-300 Belém
>
> On 27/11/2020 20:01, Doug Ewell wrote:
> > Sebastian Drude wrote:
> >
> >> I am here participating in this group exactly to touch base and see
> >> how we can implement this in a way that it is usable and compatible
> >> with BCP 47, BEFORE a list is created which may not be compatible in
> >> some way to IETF/Languages' approach.
> >
> > Here's one possible way the structure of ISO TR 21636 could be made to
> > fit into a BCP 47 extension. This shows the type of groundwork that
> > should, and really must, be done up front if it is a major goal to make
> > 21636 fit into BCP 47, without requiring all or even most of the values
> > or code elements to be known.
> >
> > As a partial reference, I'm using a document called "NP-21636,"
> > provided by Sebastian and dated 2019-09-30. It's up to Sebastian whether
> > I may redistribute that document to anyone.
> >
> > First, an extension singleton must be chosen. Here I suggest 'v' for
> > "varieties," as the title of the TR is "Identification and description
> > of language varieties." The alphabetically consecutive nature of the
> > three extensions, 't' and 'u' and now 'v', may or may not be unfortunate.
> >
> > Next, the TR identifies eight "dimensions" which are independent of each
> > other to a greater or lesser extent. It is clearly desirable to be able
> > to specify more than one dimension in a single tag, up to all eight, but
> > not necessarily so. For this I will suggest a key-value mechanism like
> > that devised for the 'u' extension, so that each dimension is indicated
> > by a two-character key, derived from the names Sebastian provided on the
> > 24th:
> >
> > sp    Space
> > ti    Time
> > so    Social group
> > me    Medium
> > si    Situation
> > pe    Person
> > pr    Proficiency
> > co    Communicative functioning
> >
> > (The NP-21636 document used the term "performance" instead of
> > "communicative functioning," which would have caused a minor
> > inconvenience by overloading 'pe'.)
> >
> > Then there would be "value" identifiers of 3 to 8 characters each,
> > providing values for each of the dimensions. The lengths of each of
> > these subtags, 1 for the extension singleton and 2 for the dimensions
> > and 3 to 8 for the values, are a critical part of BCP 47 syntax,
> > permitting humans and processes to parse the tag and figure out what
> > modifies what.
> >
> > So you might have "pt-BR-v-so-business-pr-fullpro" as one example.
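
As a rough sketch of how a process might parse such a tag purely by subtag
length, here is a small Python example. Everything in it is an assumption
for illustration: the 'v' singleton and the key table are only the proposal
in this message (nothing is registered), parse_v_extension is a
hypothetical name, and allowing several values under one key is one
possible answer to a question that is still open in this thread.

```python
# Two-character keys proposed in this thread for a hypothetical 'v' extension.
DIMENSIONS = {
    "sp": "Space",
    "ti": "Time",
    "so": "Social group",
    "me": "Medium",
    "si": "Situation",
    "pe": "Person",
    "pr": "Proficiency",
    "co": "Communicative functioning",
}


def parse_v_extension(tag: str) -> dict[str, list[str]]:
    """Return {key: [values]} for the proposed 'v' extension, or {} if absent."""
    subtags = tag.lower().split("-")

    # Find the 'v' singleton; anything after 'x' is private use, so stop there.
    start = None
    for i, subtag in enumerate(subtags[1:], start=1):
        if subtag == "x":
            break
        if subtag == "v":
            start = i + 1
            break
    if start is None:
        return {}

    result: dict[str, list[str]] = {}
    key = None
    for subtag in subtags[start:]:
        if len(subtag) == 1:
            break  # next singleton: another extension or private use
        if len(subtag) == 2:
            if subtag not in DIMENSIONS:
                raise ValueError(f"unknown dimension key: {subtag!r}")
            key = subtag
            result.setdefault(key, [])
        elif 3 <= len(subtag) <= 8 and key is not None:
            result[key].append(subtag)
        else:
            raise ValueError(f"unexpected subtag: {subtag!r}")
    return result


print(parse_v_extension("pt-BR-v-so-business-pr-fullpro"))
```

The sketch checks only the shape of the tag; whether "business" or
"fullpro" are acceptable values would be a matter for a registry.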
> >
> > A finite number of values for each dimension would need to be
> > identified so they could be encoded, keeping uniqueness and syntactical
> > constraints in mind. This is the process of building a code list, or
> > registry. The set of values would be extensible; no one and no group
> > could be expected to get this perfect the first time. But stability
> > rules would also come into play here.
> >
> > Note that the values might not have to be unique across dimensions; you
> > could have "en-v-so-foo-pr-foo" if the string 'foo' represented a social
> > group value and coincidentally also a proficiency value. Too much of
> > this could lead to human confusion, though.
> >
> > I have no idea what to do about things like "certainty status," but
> > some syntax along these lines could be devised. If this is an
> > "adjective" modifying dimensional values, rather than a value itself,
> > that should be syntactically clear.
> >
> > The dimension that troubles me the most, of course, is "person." The
> > document makes it clear in several passages that Sebastian has his own
> > personal variety (times the number of languages in which he can
> > communicate), I have my own, Michael and John and Leon and Peter have
> > their own, and so forth. Obviously neither we at ietf-languages, nor
> > anyone else, are going to encode identifiers for each of 7.7 billion
> > living people, plus those deceased, plus those not yet born. It may be
> > that the only possible way to encode this dimension will be with a
> > private-use subtag at (by definition) the end of the tag.
> >
> > This list may or may not be the proper venue to continue this discussion.
> >
> > --
> > Doug Ewell, CC, ALB | Thornton, CO, US | ewellic.org
> >
> >
>