[Ietf-languages] Default tagging

Martin Hosken <martin_hosken@sil.org> Fri, 14 August 2020 03:57 UTC

Date: Fri, 14 Aug 2020 10:57:24 +0700
From: Martin Hosken <martin_hosken@sil.org>
To: Hugh Paterson III <sil.linguist@gmail.com>
Cc: r12a <ishida@w3.org>, Daniel LaVon Billings <daniel=40ChurchofJesusChrist.org@dmarc.ietf.org>, ietf-languages@ietf.org, Doug Ewell <doug@ewellic.org>
Message-ID: <20200814105724.6e494a48@sil-mh8>
In-Reply-To: <CAE=3Ky-ZR1py3+Ok1i+YjDR-WUH1Q=0bahZhcAC_Y+i+xc80Cw@mail.gmail.com>
References: <CY4PR0401MB36203305BEFEBF938B654E8FC6420@CY4PR0401MB3620.namprd04.prod.outlook.com> <000201d670e8$d25e7e60$771b7b20$@ewellic.org> <CY4PR0401MB362045E1E4D11D92E1F89443C6420@CY4PR0401MB3620.namprd04.prod.outlook.com> <001a01d670ed$9c868530$d5938f90$@ewellic.org> <f4fa9f5c-3bb6-6b27-f294-7df9e0afa3d4@w3.org> <CAE=3Ky-ZR1py3+Ok1i+YjDR-WUH1Q=0bahZhcAC_Y+i+xc80Cw@mail.gmail.com>
X-Mailer: Claws Mail 3.16.0 (GTK+ 2.24.32; x86_64-pc-linux-gnu)
MIME-Version: 1.0
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
Archived-At: <https://mailarchive.ietf.org/arch/msg/ietf-languages/su5mXwtcCeXfTNN9g9jwEy-bWRc>
Subject: [Ietf-languages] Default tagging

Dear Hugh,

> In Nigeria Hausa can also be written with the Latin script. Where can I go
> to find what the basic default assumptions are for a language tag? Is the
> default always Latin?

This is an area that SIL has been working on recently. You can find the results of our work at https://github.com/silnrsi/langtags/tree/master/pub in JSON and as a summary text file. The JSON has much more information.

The work is based on a theory of tag sets: a tag set is a set of tags that are considered semantically equivalent, with a suggested default tag and a full tag for each set. The reason for producing such a database is that in 'normal' tagging, people generally skip tag elements that add no distinction from the notional default. There is therefore a need for a list of those defaults, for systems that need that information.
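
To make the tag set idea concrete, here is a minimal sketch in Python. It is not the code or the file format of the langtags project: the class name, the field names and the sample Hausa entries are all assumptions made up for illustration, and the real data should be taken from the JSON at the URL above. It only shows the lookup: any member of a set resolves to the same suggested default tag and the same full tag.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TagSet:
    """A group of tags treated as semantically equivalent (hypothetical model)."""
    tag: str              # suggested default (minimal) tag for the set
    full: str             # fully specified tag for the set
    others: tuple = ()    # any other equivalent spellings

    def members(self):
        return {self.tag, self.full, *self.others}


# Illustrative entries only; not copied from the langtags database.
TAG_SETS = [
    TagSet(tag="ha", full="ha-Latn-NG", others=("ha-Latn", "ha-NG")),
    TagSet(tag="ha-Arab", full="ha-Arab-NG"),
]


def build_index(tag_sets):
    """Map every member tag (case-insensitively) to the set it belongs to."""
    index = {}
    for tag_set in tag_sets:
        for member in tag_set.members():
            index[member.lower()] = tag_set
    return index


INDEX = build_index(TAG_SETS)


def default_tag(tag):
    """Return the suggested default tag for the set containing `tag`, or None."""
    tag_set = INDEX.get(tag.lower())
    return tag_set.tag if tag_set else None


def full_tag(tag):
    """Return the full tag for the set containing `tag`, or None."""
    tag_set = INDEX.get(tag.lower())
    return tag_set.full if tag_set else None


if __name__ == "__main__":
    print(default_tag("ha-Latn-NG"))
    print(full_tag("ha"))
    print(full_tag("ha-Arab"))
```

With sample data like this, a system that receives a bare "ha" can look up what the notional defaults are, and a system that receives the full form can shorten it to the suggested default.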

Notice that we have not included aran, latv or any of the other variant scripts and have kept to the core scripts. The tag sets would get rather big if we included those as well. If people are actually using those variants, I would be open to including them. But I do agree that they are unhelpful noise, and that information about such things as font choice would be better held in the CLDR. In fact we have done this in our more humble embracing and extending of the CLDR to support minority languages, for which there is much less information. That work is at https://github.com/silnrsi/sldr/

We hope that this work to collect information on minority languages will help enable better support for them in wider technologies.

One difficulty we have is that, with so many dialects and the informal requests for orthography variants (like the one you list below), we are not sure how to go about registering all these variants. We do not want to become a registrar of dialect information. It's bad enough running ISO 639-3. But this list probably does not want to wade through all the various orthography variants and dialects that arise at the local level either. So, currently, they are all tagged with -x-. Is anyone willing to help with that?

> In an interesting case where I have done some research, the Ivory Coast
> and Liberia share a language called Dan. I know of 4
> orthographies used in print in Dan, all of them Latin or Latin with
> borrowings from Cyrillic. One in Liberia, three in CI. So dnj_ci is not
> sufficient to distinguish the three in the Ivory Coast. My work focuses on
> the production of optimized keyboard layouts which are orthography
> specific, so I use the -x- component of bcp47 to distinguish the texts and
> the tools. This seems to be a valid way but is it the best way? It seems to
> me that the script and the orthography layers are independent (and perhaps
> also the writing style), and orthography is only explicitly addressed by
> the sub-tag registry, i.e. the German language tags including the 1996
> related tag.

GB,
Martin

> 
> 
> On Thu, Aug 13, 2020 at 12:54 PM r12a <ishida@w3.org> wrote:
> 
> > Doug Ewell wrote on 12/08/2020 22:14:
> >
> > We can certainly check with ISO 15924/RA-JAC to see if there is any
> > unstated expectation that ‘Arab’ implies the Naskh variant.
> >
> >
> > I would hope not, since Naskh is only one of several writing styles used
> > for Arabic.  These include Naskh, Nastaliq (Aran), Ruq'a, Kano, Kufi, and
> > so on.  If Arab was equated with naskh only, we'd be stuck for what to use
> > to represent text written in the other styles.
> >
> > I would have thought that, generally speaking, the presence of ur would
> > already indicate that an application should by default use a nastaliq font
> > for Urdu (and ks for Kashmiri), without the need to further qualify.
> > Additional subtags are mostly useful for modifying the default assumptions
> > that come with the language, rather than completing the intent.
> >
> > It seems to me that Aran might appeal for languages such as Persian, which
> > are commonly written in naskh style, but can be written in a kind of
> > nastaliq, so the -Aran marker could help to indicate that distinction. In a
> > similar way, then, -Arab could be used after ur to indicate that a
> > non-nastaliq font should be used.  But the problem here seems to be that
> > -Arab and -Aran only work for a tiny subset of the actual list of
> > writing-style identifiers that are actually needed. There are also other
> > places where it would be useful to distinguish between particular styles.
> > For example, Hausa in Arabic script can be written with the hafs or warsh
> > orthographies (typically requiring different fonts because they include
> > different character repertoires), but in Nigeria Hausa also uses the Kano
> > writing-style.  One might also want to label text that uses a magrebi style
> > font in North Africa. Etc.
> >
> > I think the -Arab subtag is mostly useful for languages (such as the many
> > in Central Asia and nearby) that can be written in more than one script,
> > and where you need to explicitly distinguish whether, say, Latin, Arabic,
> > or Cyrillic, should be used for a given bit of text. But even there my
> > personal preference is only to use the script tag when I need to make a
> > meaningful distinction, not all the time.
> >
> > I find myself wondering whether Aran ought really to have been a variant
> > subtag, to which we could add others for different writing styles.  In
> > particular, because some of the usage distinctions just mentioned can't be
> > expressed by combining language and region tags.
> >
> >
> > ri
> >
> >
> >
> >  
> -- 
> All the best,
> -Hugh
> 
> Sent from my iPhone
> Paris, France