Re: [Ltru] extlang for users

John Cowan <> Tue, 27 May 2008 14:04 UTC

Stephane Bortzmeyer scripsit:

> PS: people who understand "fr" are welcome to read
> <>

I fed it through Systran and fixed a few things.  My apologies for any
errors: I am no translator, and neither is Systran.

NOTE error in the original: for "zaa-zap" read "zap-zaa"; "zap" is the
macrolanguage tag.

The IETF's LTRU (Language Tag Registry Update) working group is very late
in its project of standardizing the future language tags, those short
character strings which make it possible to indicate the language of a
document or to specify the language one wishes to use on the Internet.
One of the reasons for this delay is the endless debate on "extlangs",
the mechanism that makes it possible to represent the concept of
"macrolanguage" introduced by the ISO 639-3 standard. Roughly: should
Egyptian Arabic be tagged ar-arz or arz? And should Cantonese be zh-yue
or simply yue?

In the current standard, RFC 4646, language tags are made of several
subtags, each one identifying the language, the script or the
country. The subtags which identify the language are drawn from ISO 639-1
or ISO 639-2, standards which place all languages on an equal footing.

But the world of human languages is complex. The traditional definition of
a language rests on the criterion of mutual comprehension. Despite
differences of accent, vocabulary and orthography, the English of
Australia can be understood by the Irish (sometimes with effort). It is
thus the same language, whose tag is "en". By the same criterion, however
closely related French and Catalan are by their common origin, speakers of
the two languages cannot understand each other. They are thus indeed two
distinct languages, whose tags are "fr" and "ca".

Naturally, there is a grey area: is Danish so different from
Norwegian? Moroccan Arabic from Tunisian? The case is all the more
difficult because certain languages differ in their spoken forms but
less so in writing (this is precisely the case of Arabic). In one sense,
there is only one Arabic language. In another, there are several (which
do not necessarily follow national borders exactly). Sometimes history
or politics help to muddle linguistic perceptions, as in the case of
Mandarin, often loosely called Chinese because it is the language of the
Chinese state.

To try to model this very complex world, which was not conceived
rationally, SIL, the organization which was principally responsible for
the ISO 639-3 standard, created a new concept, that of macrolanguages. A
macrolanguage is a single language according to certain criteria, and
a group of languages according to others. The two most famous examples
are Chinese (with the subtag "zh") and Arabic ("ar"). The registry of
ISO 639-3 thus indicates for each language whether it is encompassed by a
macrolanguage. Cantonese is thus encompassed by Chinese. Let us note at
once that the encompassed languages (sometimes pejoratively called
"microlanguages") are not dialects; they are distinct languages, typically
lacking mutual comprehensibility.

How should macrolanguages be represented in the successor of RFC 4646? The
first idea, whose beginnings appear in RFC 4646, was to use a new concept,
the Extended Language Subtag, or extlang. In this system, as originally
envisaged, the first subtag identified the macrolanguage (or the language
itself in the more frequent case where there was no macrolanguage)
and one or more extlangs followed it. Thus, Tunisian Arabic was ar-aeb
and Sierra de Juarez Zapotec was zap-zaa (zap being the subtag of the
Zapotec macrolanguage).

To understand the point of this system, one should see that software
which handles language tags, when it does not find an appropriate
language, generally searches by removing subtags from the right, where
the least significant subtags are (this mechanism is described in RFC
4647). Thus, if one requests from a search engine an az-Arab-IR document
(Azeri written in the Arabic script as used in Iran), and no document
corresponds to this tag, the software can check whether it has az-Arab (by
giving up the Iranian specifics) or even az (Azeri, whatever its other
characteristics). Extlangs fit this model well. A request for
"sq-als" (Tosk) would thus be truncated to "sq" (the macrolanguage for
Albanian), which would not be a bad mapping solution.
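
As an illustration, the right-truncation fallback described above can be
sketched in Python. This is a simplified sketch loosely following the
Lookup scheme of RFC 4647, not a complete implementation; the function
names fallback_chain and lookup are hypothetical, chosen for this example
only.

```python
def fallback_chain(tag):
    """Return the tag followed by its progressively truncated forms."""
    subtags = tag.split("-")
    chain = []
    while subtags:
        chain.append("-".join(subtags))
        # Drop the rightmost (least significant) subtag.
        subtags.pop()
        # A single-character subtag (such as "x") cannot end a tag,
        # so it is removed together with the subtag that followed it.
        while subtags and len(subtags[-1]) == 1:
            subtags.pop()
    return chain


def lookup(requested, available):
    """Return the first available tag matching the fallback chain, or None."""
    # Language tags are compared case-insensitively.
    by_lowercase = {t.lower(): t for t in available}
    for candidate in fallback_chain(requested):
        match = by_lowercase.get(candidate.lower())
        if match is not None:
            return match
    return None


print(fallback_chain("az-Arab-IR"))
print(lookup("sq-als", ["fr", "sq"]))
```

With an extlang-style tag such as "sq-als", the macrolanguage "sq" is
reached by truncation alone, with no need to consult any registry data.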

But on closer examination of extlangs, several problems were
found. First, the method of mapping by truncating from the right did
not always give correct results. Nothing guarantees that the languages
encompassed by the same macrolanguage are mutually comprehensible, quite
the contrary. Therefore, the "blind" mapping obtained by removing
subtags from the right will not be satisfactory. Second, extlangs
complicate the model, since they are a new case to specify, implement
and explain. Lastly, extlangs can carry an erroneous message, that one
particular language of the group is the principal language. This message
may be desirable (most Arabic speakers are very attached to the concept
of a single Arabic language) or not, but it is always delicate, since it
places the languages in a hierarchy. Many discussions, some of them
extensive, animated the working group around these problems.

Thus LTRU had, as early as December 2007, given up on extlangs and changed
the Internet-Drafts to implement this decision. In this new version,
the Language Subtag Registry kept the information of which languages
were encompassed by a macrolanguage, but the language tags were only
formed with the language itself. Tosk was thus tagged "als", Sierra de
Juarez Zapotec "zaa", and Mandarin "cmn".
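
To illustrate this extlang-free scheme, here is a small Python sketch. The
table is a hand-copied excerpt limited to languages mentioned in this
text, not the actual registry format, and the names MACROLANGUAGE,
to_extlang_form and candidates are hypothetical. It shows that the
macrolanguage relation stays available as data even though it no longer
appears in the tag itself.

```python
# Encompassed language -> macrolanguage, for languages named in this text.
# Illustrative excerpt only; the real registry holds many more records.
MACROLANGUAGE = {
    "als": "sq",   # Tosk -> Albanian
    "zaa": "zap",  # Sierra de Juarez Zapotec -> Zapotec
    "cmn": "zh",   # Mandarin -> Chinese
    "yue": "zh",   # Cantonese -> Chinese
    "arz": "ar",   # Egyptian Arabic -> Arabic
    "aeb": "ar",   # Tunisian Arabic -> Arabic
}


def to_extlang_form(language):
    """Return the extlang-style tag, e.g. "cmn" -> "zh-cmn"."""
    macro = MACROLANGUAGE.get(language)
    return f"{macro}-{language}" if macro else language


def candidates(language):
    """Return the bare tag, then its macrolanguage if it has one."""
    macro = MACROLANGUAGE.get(language)
    return [language, macro] if macro else [language]


print(to_extlang_form("cmn"))
print(candidates("als"))
```

Under this assumption, software that wants the macrolanguage as a fallback
has to look it up explicitly, since truncating "als" yields nothing.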

But at the IETF things do not always go so simply. Five months
after this change, problems with the new version appeared, often raised
by people who had hardly contributed to the actual work, and had said
nothing at the time the extlangs were abandoned. Whenever the group
around a macrolanguage includes a dominant language (which is the case
of Chinese with the preponderance of Mandarin, or that of Arabic with
standard Arabic, but not the Zapotec case), many documents were already
tagged with the tag which became that of the macrolanguage. For example,
Mandarin resources should no longer be tagged "zh", as formerly, but
"cmn". What is to be done with the innumerable Mandarin documents which
had been tagged "zh" in accordance with RFC 4646?

Ideally, the Language Subtag Registry would be enough to find good
solutions. But, in fact, the decisions to be taken by software often
depend on nonlinguistic criteria. For example, if a user has configured
Breton as his preferred language, it is not unreasonable to serve him
French text. Not that the two languages are similar (the first is a Celtic
language and the second a Romance language) but because, in practice,
almost everyone who understands Breton also understands French. But one
cannot put this information (of significant size and which often changes)
in the registry! One would thus have to accept, which does not seem
easy, that the registry will not contain the information needed for an
"intelligent" mapping.

After yet another very heated debate, the LTRU group finally reversed
course on May 26, 2008, returning to extlangs. That requires going back
to a previous version of the Internet-Drafts and re-examining documents
and implementations. And there is no guarantee that the process will not
start again in a few months...

My personal view is that the system is unstable: one can put together a
good argument for either solution (see Mark Davis's document against
extlangs) but nothing is perfect, because human languages were not
conceived to facilitate the task of the IETF. It is necessary to
adopt a solution and to stick to it, but the IETF decision mechanisms
do not make that easy.

Híggledy-pìggledy / XML programmers            John Cowan
Try to escape those / I-eighteen-N woes;
Incontrovertibly / What we need more of is
Unicode weenies and / François Yergeaus.