Re: [Ltru] extlang for users
John Cowan <cowan@ccil.org> Tue, 27 May 2008 14:04 UTC
Date: Tue, 27 May 2008 10:04:11 -0400
To: Stephane Bortzmeyer <bortzmeyer@nic.fr>
Message-ID: <20080527140411.GB18303@mercury.ccil.org>
References: <20080527000859.7a70d079@sil-mh4> <20080526200237.GA19588@sources.org>
MIME-Version: 1.0
Content-Disposition: inline
In-Reply-To: <20080526200237.GA19588@sources.org>
User-Agent: Mutt/1.5.13 (2006-08-11)
From: John Cowan <cowan@ccil.org>
Cc: LTRU Working Group <ltru@ietf.org>
Subject: Re: [Ltru] extlang for users
Stephane Bortzmeyer scripsit:

> PS: people who understand "fr" are welcome to read
> <http://www.bortzmeyer.org/extlang-or-not-extlang.html>

I fed it through Systran and fixed a few things. My apologies for any errors: I am no translator, and neither is Systran.

NOTE error in the original: for "zaa-zap" read "zap-zaa"; "zap" is the macrolanguage tag.

The IETF's LTRU (Language Tag Registry Update) working group is very late in its project of standardizing the future language tags, those short character strings that make it possible to indicate the language of a document or to specify the language one wishes to use on the Internet. One of the reasons for this delay is the endless debate on "extlangs", the mechanism for representing the concept of "macrolanguage" introduced by the ISO 639-3 standard. Roughly: should Egyptian Arabic be tagged ar-arz or arz? And should Cantonese be zh-yue or simply yue?

In the current standard, RFC 4646, language tags are made of several subtags, each identifying the language, the script or the country. The subtags that identify the language are drawn from ISO 639-1 or ISO 639-2, standards that place all languages on an equal footing.

But the world of human languages is complex. The traditional criterion for defining a language is mutual comprehension. Despite differences of accent, vocabulary and orthography, the English of Australia can be understood by the Irish (sometimes with effort). It is thus the same language, whose tag is "en". Conversely, however closely related French and Catalan are by their common origin, speakers of these two languages cannot understand each other. They are thus properly two distinct languages, whose tags are "fr" and "ca". Naturally, there is a grey area: is Danish so different from Norwegian? Moroccan Arabic from Tunisian?
The case is all the more difficult because certain languages differ in their spoken forms but less so in writing (this is precisely the case of Arabic). In one sense, there is only one Arabic language. In another, there are several (which do not necessarily follow national borders exactly). Sometimes history or politics helps to blur linguistic perceptions, as in the case of Mandarin, often loosely called Chinese because it is the language of the Chinese state.

To try to model this very complex world, which was not designed rationally, SIL, the organization principally responsible for the ISO 639-3 standard, created a new concept, that of macrolanguages. A macrolanguage is a single language according to certain criteria, and a group of languages according to others. The two best-known examples are Chinese (with the subtag "zh") and Arabic ("ar"). The ISO 639-3 registry thus indicates for each language whether it is encompassed by a macrolanguage. Cantonese is thus encompassed by Chinese. Let us note at once that the encompassed languages (sometimes pejoratively called "microlanguages") are not dialects; they are distinct languages, typically lacking mutual comprehensibility.

How should macrolanguages be represented in the successor of RFC 4646? The first idea, whose premises appear in RFC 4646, was to use a new concept, the Extended Language Subtag, or extlang. In this system, as originally envisaged, the first subtag identified the macrolanguage (or the language itself in the more frequent case where there was no macrolanguage) and one or more extlangs followed it. Thus, Tunisian Arabic was ar-aeb and Sierra de Juarez Zapotec was zap-zaa (zap being the subtag of the Zapotec macrolanguage).
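
[Editorial illustration, not part of the original article.] The two spellings at issue, the macrolanguage-prefixed form and the bare form, can be sketched in a few lines of Python. The table below is an assumption built only from the examples named in this message; a real implementation would derive it from the registry, and the function names are hypothetical.

```python
# Illustrative table: encompassed language -> macrolanguage.
# Limited to examples named in this message; not a complete registry.
MACROLANGUAGE = {
    "aeb": "ar",   # Tunisian Arabic
    "arz": "ar",   # Egyptian Arabic
    "yue": "zh",   # Cantonese
    "zaa": "zap",  # Sierra de Juarez Zapotec
}


def extlang_form(lang):
    """Return the macrolanguage-prefixed form, or lang itself if none applies."""
    macro = MACROLANGUAGE.get(lang)
    return f"{macro}-{lang}" if macro else lang


def bare_form(tag):
    """Drop a leading macrolanguage subtag when the next subtag is encompassed by it."""
    subtags = tag.split("-")
    if len(subtags) >= 2 and MACROLANGUAGE.get(subtags[1]) == subtags[0]:
        return "-".join(subtags[1:])
    return tag
```

With this table, `extlang_form("aeb")` gives `"ar-aeb"` and `bare_form("zh-yue")` gives `"yue"`; a tag such as `"fr"`, which has no macrolanguage, is returned unchanged by both functions.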
To understand the point of this system, one should know that software handling language tags, when it does not find an appropriate language, generally searches by removing subtags from the right, where the least significant subtags are (this mechanism is described in RFC 4647). Thus, if one asks a search engine for an az-Arab-IR document (Azeri written in the Arabic script as used in Iran), and no document corresponds to this tag, the software can check whether it has az-Arab (giving up the Iranian specifics) or even az (Azeri, whatever its other characteristics).

Extlangs fitted this model well. A request for "sq-als" (Tosk) would thus be truncated to "sq" (the macrolanguage for Albanian), which would not be a bad match.

But on closer examination of extlangs, several problems were found. First, matching by truncating from the right did not always give correct results. Nothing guarantees that the languages encompassed by the same macrolanguage are mutually comprehensible, quite the contrary. Therefore, the "blind" matching obtained by removing subtags from the right will not be satisfactory. Second, extlangs complicate the model, since they are one more case to specify, implement and explain. Lastly, extlangs can carry an erroneous message, that one particular language of the group is the principal language. This message may be desirable (most Arabic speakers are very attached to the concept of a single Arabic language) or not, but it is always delicate, since it places the languages in a hierarchy.

These problems prompted many discussions, sometimes extensive, in the working group. As early as December 2007, LTRU had thus given up extlangs and changed the Internet-Drafts to implement this decision.
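
[Editorial illustration, not part of the original article.] The right-truncation fallback described above can be sketched as follows. This is a simplified reading of the lookup idea in RFC 4647, not a conforming implementation: it ignores the special handling of single-character subtags, and the function names are hypothetical.

```python
def fallback_chain(tag):
    """Yield the tag, then each shorter tag obtained by dropping the last subtag."""
    subtags = tag.split("-")
    while subtags:
        yield "-".join(subtags)
        subtags.pop()


def lookup(requested, available):
    """Return the first tag in the fallback chain that is available, else None."""
    have = {t.lower() for t in available}  # tags compare case-insensitively
    for candidate in fallback_chain(requested):
        if candidate.lower() in have:
            return candidate
    return None
```

For a request of `"az-Arab-IR"` the chain is `az-Arab-IR`, `az-Arab`, `az`. With extlangs, `lookup("sq-als", ["sq"])` falls back to `"sq"`; with the bare tag, `lookup("als", ["sq"])` finds nothing, because truncation alone cannot recover the macrolanguage.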
In this new version, the Language Subtag Registry kept the information about which languages were encompassed by a macrolanguage, but language tags were formed with the language itself only. Tosk was thus tagged "als", Sierra de Juarez Zapotec "zaa", and Mandarin "cmn".

But at the IETF things are never that simple. Five months after this change, problems with the new version appeared, often raised by people who had hardly contributed to the concrete work and had said nothing when extlangs were abandoned. Whenever the group around a macrolanguage includes a dominant language (which is the case of Chinese with the preponderance of Mandarin, or of Arabic with Standard Arabic, but not of Zapotec), many documents were already tagged with the tag that became that of the macrolanguage. For example, Mandarin resources should no longer be tagged "zh", as formerly, but "cmn". What is to be done with the innumerable Mandarin documents that had been tagged "zh" in accordance with RFC 4646?

Ideally, the Language Subtag Registry would be enough to find good solutions. But in fact the decisions to be taken by software often depend on nonlinguistic criteria. For example, if a user has configured Breton as his preferred language, it is not unreasonable to serve him French text. Not because the two languages are similar (the first is a Celtic language and the second a Romance language) but because, in practice, almost everyone who understands Breton also understands French. But one cannot put this information (which is large and changes often) in the registry! One would thus have to resign oneself, which does not seem easy, to not having in the registry the information that would allow an "intelligent" matching.

After, again, a very heated debate, the LTRU group finally reversed course on May 26, 2008, returning to extlangs. That requires going back to a previous version of the Internet-Drafts and re-examining documents and implementations.
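
[Editorial illustration, not part of the original article.] The "zh" versus "cmn" problem above can be made concrete with a small sketch of what software would have to do without extlangs: consult a macrolanguage table instead of relying on truncation. The table and the function name are assumptions for illustration; the entries come from the examples in this message.

```python
# Illustrative table: encompassed language -> macrolanguage.
MACROLANGUAGE = {"cmn": "zh", "yue": "zh", "als": "sq", "zaa": "zap"}


def candidates(requested):
    """List tags to try: the requested tag first, then its macrolanguage form if any."""
    primary, _, rest = requested.partition("-")
    result = [requested]
    macro = MACROLANGUAGE.get(primary.lower())
    if macro:
        # Swap only the primary language subtag; keep script, region, etc.
        result.append(f"{macro}-{rest}" if rest else macro)
    return result
```

A request for `"cmn"` would then also try documents tagged `"zh"`. Note that the same rule sends a request for `"yue"` to `"zh"` documents, which in practice are mostly Mandarin, so the table alone does not make the match a good one. A preference such as Breton falling back to French cannot be derived from it at all.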
And with no guarantee that the process will not start again in a few months...

My personal view is that the system is unstable: one can put together a good argument for either solution (see Mark Davis's document against extlangs) but neither is perfect, because human languages were not designed to make the IETF's task easier. It would have been necessary to adopt a solution and stick to it, but the IETF decision mechanisms do not make that easy.

-- 
Híggledy-pìggledy / XML programmers             John Cowan
Try to escape those / I-eighteen-N woes;        http://www.ccil.org/~cowan
Incontrovertibly / What we need more of is      cowan@ccil.org
Unicode weenies and / François Yergeaus.
_______________________________________________
Ltru mailing list
Ltru@ietf.org
https://www.ietf.org/mailman/listinfo/ltru