Re: [Ltru] Re: Macrolanguage and extlang

Addison Phillips <> Tue, 17 July 2007 19:34 UTC

Date: Tue, 17 Jul 2007 12:34:16 -0700
From: Addison Phillips <>
To: John Cowan <>
Subject: Re: [Ltru] Re: Macrolanguage and extlang
Cc: Doug Ewell <>, LTRU Working Group <>
List-Id: Language Tag Registry Update working group discussion list <>

So far I've let Mark take this thread on, but now I have a few minutes 
to chime in.

>>> Addision and I have discussed the issue of extlang and Macrolanguages 
>>> and are proposing the following text replacing the use of extlang.
>> I won't object to this proposed solution.  
> I don't object to having a Macrolanguage header as such, and even to
> making it mutable (and therefore informative).

Good, although I note that it can't be (as) informative if we have 
extlangs. I mean, we could allow extlangs to lose their encompassed 
status, but not their Prefix. This is a problem we avoid entirely if 
we don't have extlangs.

> I do most strenuously object to upsetting the extlang applecart.  Extlangs
> are a sensible and necessary shim between the 639-2-only world of 4646
> and the mixed 639-2/3 world of 4646bis, and I have seen no compelling
> reasons why they should be discarded.  We have been planning them for
> years, and we should stick to our plan unless there is some hard reason
> to change it.

Up until now I've been pushing the extlang applecart (as it were), so 
Mark's message marks a departure for me. The question is whether the 
shim is actually necessary.

With extlangs we get the benefit of the subtag hierarchy being visible. 
Users must retag data to get the full benefit of the new subtags, but 
they aren't *required* to retag their data. Neither are they required to 
retag their data if we just include all of the 639-3 codes as primary 
language subtags and forgo extlangs.

For most of the affected languages, extlangs are nothing but a hassle. 
Rather limited amounts of content are tagged with the macrolanguage, let 
alone with the encompassed language. Users can choose between primary 
language subtags on their own and need not use complex (and possibly not 
well-understood) subtag sequences. The various Hmongs, Quechuas, and 
Zapotecs fall into this category.

The languages which are really affected are Arabic and Chinese. For 
Arabic, I think we could deprecate (or just frown at) the use of 'arb' 
(Standard Arabic) and let people choose regional Arabic dialects--or 
just use 'ar-*'.

So we're left with Chinese, which is a huge sticky wicket. It's quite 
clear that there is a *lot* of content tagged with 'zh-*'. It's also 
clear that, for written content, we can maintain the "fiction" of 
Chinese and basically ignore regional languages/dialects for most 
tagging purposes. Users who really need the different Sinitic languages 
can tag them using the appropriate primary language subtag.

The people who are hurt by this are those with non-written content, such 
as audio or video recordings, or similar kinds of applications (such as 
text-to-speech). For these sorts of applications, the users will have to 
retag their data. Either they use extlangs or they just use primary 
language subtags. It does mean that I might get a DVD with:

   <subtitles xml:lang="zh-Hant-HK" />
   <audio xml:lang="yue-HK" />

This situation isn't that different, I might as well note, from the one 
for Norwegian (no/nn/nb) and some other languages. The range of tag 
choices is appalling at times, but not intractable.

Either of these solutions would be acceptable to me and each involves 
some level of compromise. However, with only primary language subtags, 
we don't have to invent any silly rules or procedures for stability, 
maintenance, or choosing subtag types. We don't have to cherry-pick and 
we aren't reliant on ISO 639 being picture perfect with their 
macrolanguage definitions.

And I am concerned that users have a hard enough time figuring out 
language tags as it is. Extlang is another level of complexity that 
users will not fully understand. So that's what starts me leaning away 
from extlangs.

In addition, my guess is that the world mostly won't notice. Chinese 
will mostly be tagged as "zh-*". Some Cantonese, Min-Nan, Hakka, Xiang, 
etc. users will use the available subtags, but mostly in a recognizable 
context. No matter which solution we choose, we'll have to write 
extensive documents called, roughly, "HOW TO TAG YOUR CHINESE" :-).

>> It still floors me that we expect any parsers at all to be able to match 
>> "yue" with "zh" but not to be able to pick "en-US" out of "en-Latn-US". 
>> And of course all of the grandfathered tags of the form "zh-yue" will 
>> now have to be deprecated in favor of "yue", instead of being made 
>> redundant.
> Exactly.

And this is bad because...?

The truth is, "yue" (or, more importantly, "cmn") and "zh" don't match. 
And neither do "no" and "nb". Which is semantically silly, but not 
fatal. For that matter, if 'cmn' or 'yue' mark an actual linguistic 
distinction, either "zh" != "zh-cmn" or "zh" matches both "zh-yue" and 
"zh-cmn". Any change to tagging yields new matching problems. If, for 
the most part, people do not need to change their tagging behavior, then 
that's best.
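
To make the matching point concrete, here is a minimal sketch, assuming 
simple prefix matching in the style of RFC 4647 basic filtering. The 
function name basic_filter_match is hypothetical and is not part of any 
proposed text:

```python
def basic_filter_match(language_range: str, tag: str) -> bool:
    """Prefix matching in the style of RFC 4647 basic filtering.

    A range matches a tag if the two are equal (ignoring case) or if the
    range is a prefix of the tag that ends on a subtag boundary ("-").
    The wildcard range "*" matches every tag.
    """
    rng = language_range.lower()
    candidate = tag.lower()
    if rng == "*":
        return True
    return candidate == rng or candidate.startswith(rng + "-")


# Illustrative pairs taken from the tags discussed in this thread.
pairs = [
    ("zh", "zh-yue"),         # extlang-style tag
    ("zh", "zh-cmn"),
    ("zh", "yue"),            # primary-language-only tag
    ("zh", "zh-Hant-HK"),
    ("no", "nb"),
    ("en-US", "en-Latn-US"),
]
for rng, tag in pairs:
    print(f"{rng!r} vs {tag!r}: {basic_filter_match(rng, tag)}")
```

Under this rule "zh" matches "zh-yue", "zh-cmn", and "zh-Hant-HK", but 
not "yue"; "no" does not match "nb"; and "en-US" does not match 
"en-Latn-US".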

Note that the proposed text probably isn't strong enough in this area. 
It should probably say something like:

In most cases, use the Macrolanguage to form the language tag in 
preference to the encompassed language. Only use the encompassed 
language if it adds useful distinguishing information to the tag within 
your application.

Note that, if we keep extlangs, one can (and should) just as easily 
apply s/use the encompassed language/use the extlang subtag/ to that 
text.
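
As a rough illustration of the suggested guidance, the sketch below 
prefers the macrolanguage unless the application says the finer 
distinction matters. The table is a small hand-picked excerpt for 
illustration only, not registry data, and the names MACROLANGUAGE_OF and 
preferred_language_subtag are hypothetical:

```python
# Hand-picked excerpt of encompassed language -> macrolanguage pairs,
# limited to languages mentioned in this thread. Illustration only.
MACROLANGUAGE_OF = {
    "cmn": "zh",  # Mandarin
    "yue": "zh",  # Cantonese
    "nan": "zh",  # Min-Nan
    "hak": "zh",  # Hakka
    "hsn": "zh",  # Xiang
    "arb": "ar",  # Standard Arabic
    "nb": "no",   # Norwegian Bokmal
    "nn": "no",   # Norwegian Nynorsk
}


def preferred_language_subtag(language: str, needs_distinction: bool = False) -> str:
    """Pick the primary language subtag to use when tagging content.

    By default, return the macrolanguage if the given language has one in
    the table. Return the encompassed language itself only when the caller
    says the distinction is useful to the application.
    """
    code = language.lower()
    if needs_distinction:
        return code
    return MACROLANGUAGE_OF.get(code, code)


# Written subtitles: the macrolanguage is usually enough.
print(preferred_language_subtag("yue") + "-Hant-HK")
# Audio track: the spoken language is a useful distinction.
print(preferred_language_subtag("yue", needs_distinction=True) + "-HK")
```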


Addison Phillips
Globalization Architect -- Yahoo! Inc.
Chair -- W3C Internationalization Core WG

Internationalization is an architecture.
It is not a feature.
