Re: [Ltru] Re: Macrolanguage and extlang

John Cowan <> Wed, 18 July 2007 05:14 UTC

Date: Wed, 18 Jul 2007 01:14:26 -0400
To: Addison Phillips <>
Cc: LTRU Working Group <>, Doug Ewell <>

Addison Phillips scripsit:

> Good, although I note that it can't be (as) informative if we have 
> extlangs. I mean, we could allow extlangs to lose their encompassed 
> status, but not their Prefix. This is a problem we don't have to have if 
> we don't have extlangs.

Not necessarily.  For example, 639-3/RA might decide to group the
four Min Chinese languages (cdo, czo, mnp, nan) as a macrolanguage
within the general zh macrolanguage.  In that case, we would change the
Macrolanguage: field for these four languages from zh to whatever the
new code is, but we would leave the Prefix: field at zh.  So Prefix:
will remain normative and Macrolanguage: informative.
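To make the Prefix:/Macrolanguage: split concrete, here is a minimal
sketch in Python.  It assumes registry records can be modeled as plain
dictionaries; the helper names are hypothetical, and "qaa" is only a
stand-in for whatever code 639-3/RA might actually assign.

```python
# The four Min Chinese languages named above.
MIN_LANGUAGES = ["cdo", "czo", "mnp", "nan"]


def make_records(subtags, prefix, macrolanguage):
    """Build one hypothetical registry record per subtag."""
    return {
        s: {"Subtag": s, "Prefix": prefix, "Macrolanguage": macrolanguage}
        for s in subtags
    }


def regroup(records, subtags, new_macrolanguage):
    """Return a copy with Macrolanguage: updated and Prefix: left untouched."""
    updated = {s: dict(r) for s, r in records.items()}
    for s in subtags:
        updated[s]["Macrolanguage"] = new_macrolanguage
    return updated


def extlang_tag(records, subtag):
    """Form the tag from the (normative) Prefix: plus the subtag."""
    return f"{records[subtag]['Prefix']}-{subtag}"


before = make_records(MIN_LANGUAGES, prefix="zh", macrolanguage="zh")
# "qaa" is a placeholder, not a real assignment for a Min macrolanguage.
after = regroup(before, MIN_LANGUAGES, new_macrolanguage="qaa")
```

In this sketch the tag built from Prefix: is the same before and after
the regrouping; only the informative Macrolanguage: value changes.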

> With extlangs we get the benefit of the subtag hierarchy being visible. 
> Users are required to retag data to get full benefit from the new 
> subtags, but they aren't *required* to retag their data. Neither are 
> they required to retag their data if we just include all of the 639-3 
> codes as primary language subtags and forego extlangs.

Indeed, without extlangs it is a disadvantage to tag your data correctly,
and that's the main argument for having them.  Suppose that you have
a written document in Cantonese.  The difference between that and the
equivalent Mandarin document is not zero, but it will be plausible
that someone who can read Mandarin will be able to make sense of it.
If you tag it zh, on the one hand, such people will discover it; if
you tag it yue, on the other, they will not discover it unless they are
using advanced matchers.  But on the gripping hand, if you use zh-yue
(which you may, because that is a grandfathered tag), you combine the
advantages of discoverability and correctness.  Allowing extlangs extends
the advantage from zh-yue to the other zh languages.  Currently they
are not often written, but "not often" is not "never".
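The discoverability argument can be sketched in Python, assuming a
simple prefix matcher in the style of RFC 4647 basic filtering.  The
function names and the three sample documents are made up for
illustration only.

```python
def basic_filter(language_range: str, tag: str) -> bool:
    """True if the range equals the tag, or is a prefix of it ending at a '-'."""
    r = language_range.lower()
    t = tag.lower()
    if r == "*":
        return True  # the wildcard range matches every tag
    return t == r or t.startswith(r + "-")


def discover(language_range, docs):
    """Names of the documents whose tag matches the range, sorted."""
    return sorted(
        name for name, tag in docs.items() if basic_filter(language_range, tag)
    )


# Hypothetical Cantonese documents tagged three different ways.
documents = {"doc-a": "zh", "doc-b": "yue", "doc-c": "zh-yue"}

found_by_zh = discover("zh", documents)       # doc-a and doc-c, not doc-b
found_by_zh_yue = discover("zh-yue", documents)  # doc-c only
found_by_yue = discover("yue", documents)     # doc-b only
```

A reader asking for zh finds the zh and zh-yue documents but not the one
tagged yue, while a reader asking for zh-yue still finds the precisely
tagged one.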

> Rather limited amounts of content are tagged with the macrolanguage, let 
> alone with the enclosed sub language. Users can choose between primary 
> language subtags on their own and need not use complex (and possibly not 
> well-understood) subtag sequences. The various Hmongs, Quechuas, and 
> Zapotecs fall into this category.

Actually, you don't know that.  There are a ferocious number of books
in the world, and 639-2 began as a bibliographic standard.  The whole
point of tags like "hmn", "qu", and "zap" is that bibliographers often
don't know exactly which language they are dealing with, but a partial
tag is much better than nothing.

It's important not to take a parochial view of language tagging, as if
only electronic resources count.

> The languages which are really affected are Arabic and Chinese. For 
> Arabic, I think we could deprecate (or just frown at) the use of 'arb' 
> (Standard Arabic) and let people choose regional Arabic dialects--or 
> just use 'ar-*'.

Standard Arabic is the most common kind of written Arabic, but it
is rather rare as spoken Arabic goes.  My guess would be that
a plurality of recorded spoken Arabic is Egyptian Arabic.

> This situation isn't that different, I might as well note, as the one 
> for Norwegian (no/nn/nb) and some other languages. 

Eh?  The distinction between nn and nb is strictly written.  Spoken
Norwegian is a congeries of dialects with none really dominant,
much like spoken American English.

> Either of these solutions would be acceptable to me and each involves 
> some level of compromise. However, with only primary language subtags, 
> we don't have to invent any silly rules or procedures for stability, 
> maintenance, or choosing subtag types. We don't have to cherry-pick and 
> we aren't reliant on ISO 639 being picture perfect with their 
> macrolanguage definitions.

You overestimate the problem.  Once we get past the initial 4645bis load, we
never have to do an extlang again, except in the unlikely case
(and I would be OK with ignoring this) that a new macrolanguage
is created encompassing a bunch of languages created at the same
time; e.g. 639-3/RA adds the macrolanguage Foovian encompassing
Barvian and Bazvian.

> And I am concerned that users have a hard enough time figuring out 
> language tags as it is. Extlang is another level of complexity that 
> users will not fully understand. So that's what starts me leaning away 
> from extlangs.

This is also an overestimation.  There will be an entry for Cree
and another for Plains Cree: if you know you have Cree, use
cre; if you further know you have Plains Cree, use cre-crk.
That's not much more complicated than saying "use crk".

> In addition, my guess is that the world mostly won't notice. Chinese 
> will mostly be tagged as "zh-*". Some Cantonese, Min-Nan, Hakka, Xiang, 
> etc. users will use the available subtags, but mostly in a recognizable 
> context. No matter which solution we choose, we'll have to write 
> extensive documents called, roughly, "HOW TO TAG YOUR CHINESE" :-).


> --
> In most cases, use the Macrolanguage to form the language tag in 
> preference to the encompassed language. Only use the encompassed 
> language if it adds useful distinguishing information to the tag within 
> your application.
> --
> Note that one can (and should) as easily s/use the encompassed 
> language/use the extlang subtag/

I'm happy with the "use the extlang subtag" version of this.

Your worships will perhaps be thinking          John Cowan
that it is an easy thing to blow up a dog?
[Or] to write a book?
    --Don Quixote, Introduction       
