RE: [Ltru] Re: Macrolanguage and extlang

Peter Constable <> Mon, 16 July 2007 15:52 UTC

From: Peter Constable <>
To: LTRU Working Group <>
Date: Mon, 16 Jul 2007 08:52:21 -0700
Subject: RE: [Ltru] Re: Macrolanguage and extlang

It's not clear to me why matching

(A) zh-Hans-CN with zh-yue-Hans-CN

is much harder than matching

(B) zh-Hans-CN with yue-Hans-CN.

In the (B) case, you need to treat "zh" and "yue" as a match at some level. In the (A) case, you need to treat "zh" and "zh-yue" as a match at that same level. The only added work for (A) is that you need to recognize "zh-yue" as the entity to be compared. But then, for (B) you still need to recognize that "yue" may match something other than "yue". There is some extra work either way.
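To make the comparison concrete, here is a minimal Python sketch of the "recognize the entity, then compare at some level" step for both tagging styles. The `ENCOMPASSED` table and the function names are hypothetical and exist only for illustration; nothing here is taken from a draft or from the registry.

```python
# Hypothetical data for illustration only: encompassed language -> macrolanguage.
ENCOMPASSED = {"yue": "zh"}

def language_entity(tag):
    """Split a tag into (language entity, remaining subtags)."""
    subtags = tag.lower().split("-")
    # Style (A): macrolanguage followed by an extlang, e.g. "zh-yue".
    if len(subtags) > 1 and ENCOMPASSED.get(subtags[1]) == subtags[0]:
        return "-".join(subtags[:2]), subtags[2:]
    # Style (B), or a plain language subtag: the first subtag is the entity.
    return subtags[0], subtags[1:]

def macro_level(entity):
    """Map an entity to the level at which "zh", "zh-yue" and "yue" compare equal."""
    primary = entity.split("-")[0]
    return ENCOMPASSED.get(primary, primary)

def matches_at_macro_level(tag_a, tag_b):
    """True if both tags agree once their language entities are lifted to that level."""
    entity_a, rest_a = language_entity(tag_a)
    entity_b, rest_b = language_entity(tag_b)
    return macro_level(entity_a) == macro_level(entity_b) and rest_a == rest_b
```

In this sketch the extra work for (A) lives in `language_entity`, and the extra work for (B) lives in `macro_level`; both need the same table.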

The RFC 1766/3066 remove-from-right algorithm won't get a match in case (A), but neither will it do so in case (B). At least with the approach used in (A), it will match zh and zh-yue, whereas it won't match zh and yue, as Doug pointed out.
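As a rough illustration of that behaviour, the sketch below implements plain prefix matching in Python: a range matches a tag if the two are equal or the range is a prefix of the tag ending at a "-" boundary. The function names are mine, and this is a simplification for discussion, not a conformant implementation of either RFC.

```python
def truncations(tag):
    """List the tag followed by each remove-from-right truncation of it."""
    subtags = tag.lower().split("-")
    return ["-".join(subtags[:n]) for n in range(len(subtags), 0, -1)]

def prefix_match(language_range, tag):
    """True if the range equals the tag or one of its right-truncations."""
    return language_range.lower() in truncations(tag)

# Cases from the discussion above:
print(prefix_match("zh-Hans-CN", "zh-yue-Hans-CN"))  # case (A)
print(prefix_match("zh-Hans-CN", "yue-Hans-CN"))     # case (B)
print(prefix_match("zh", "zh-yue"))
print(prefix_match("zh", "yue"))
```

Only the third call reports a match: truncating "zh-yue" from the right reaches "zh", while no truncation of "yue" ever does.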

On the other hand...

It's not clear how important the zh/zh-yue versus zh/yue example is for Chinese: it makes a valid argument against (B) if existing content is tagged "zh-yue"/"zh-yue-Han?" and a language range "zh" is used, but says nothing about cases of content tagged "zh"/"zh-Han?"/"zh-??" and a language range "zh-yue". If we're concerned about legacy content created before the extlang "yue" was introduced, it is the latter case that matters.

Mark's and Addison's proposal does have a couple of points in its favour -- these may or may not be significant:

- Whereas using an extlang in a case like "zh-yue" was easy to consider, not all other cases will necessarily be as easy. (Clearly we don't want to introduce macrolanguage/extlang pairs for Norwegian or Serbo-Croatian, but I think we've assumed all along these can be treated as grandfathered cases.) Mark's proposal means we're never required to consider whether "xxx-yyy" is really helpful: we just always use "yyy".

- If there's ever a case in which ISO 639 has related languages "xxx" and "yyy" and a new macrolanguage "mmm" encompassing the two is later added, we don't need to retag existing content as "mmm-xxx" and "mmm-yyy", nor do we need to do any special-case processing with the registry to keep "mmm-xxx" and "mmm-yyy" from being used. And neither do implementers need to introduce new mechanisms to get appropriate matches between "mmm" and "xxx"/"yyy" since, under M&A's proposal, they already would have a general mechanism in their matching process for this; all they need is to incorporate the new data.
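The second point can be sketched in Python as follows. I'm assuming a matcher that takes the macrolanguage mapping as data; `range_matches` and `macro_of` are hypothetical names, and "mmm", "xxx" and "yyy" are the placeholder codes used above, not real ISO 639 identifiers.

```python
def range_matches(language_range, tag, macro_of):
    """Prefix-match a range against a tag, consulting macrolanguage data.

    macro_of maps an encompassed language subtag to its macrolanguage subtag.
    """
    range_subtags = language_range.lower().split("-")
    tag_subtags = tag.lower().split("-")
    # If the range names the macrolanguage of the tag's primary language,
    # compare as though the tag began with that macrolanguage.
    if macro_of.get(tag_subtags[0]) == range_subtags[0]:
        tag_subtags = [range_subtags[0]] + tag_subtags[1:]
    return tag_subtags[:len(range_subtags)] == range_subtags

# Before "mmm" exists, there is no relationship to record.
before = {}
# After "mmm" is added, only the data changes; content stays tagged "xxx"/"yyy".
after = {"xxx": "mmm", "yyy": "mmm"}

print(range_matches("mmm", "xxx", before))
print(range_matches("mmm", "xxx", after))
print(range_matches("mmm", "yyy-Latn", after))
```

The matching code is the same in both calls on "xxx"; the only difference is the table passed in, which is the point about incorporating new data.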

