RE: [Ltru] Re: extlang

Peter Constable <petercon@microsoft.com> Thu, 30 August 2007 14:29 UTC

From: Peter Constable <petercon@microsoft.com>
To: LTRU Working Group <ltru@ietf.org>
Date: Thu, 30 Aug 2007 07:29:01 -0700
Subject: RE: [Ltru] Re: extlang
Thread-Topic: [Ltru] Re: extlang
Thread-Index: Acfp2aWuIFgKV1mIQFeq1nr3p3argABKwY4g
Message-ID: <DDB6DE6E9D27DD478AE6D1BBBB83579561ABDC7644@NA-EXMSG-C117.redmond.corp.microsoft.com>
References: <30b660a20708281459r6000d746qe007f2882fae6d73@mail.gmail.com> <20070828223536.GB31670@mercury.ccil.org> <30b660a20708281812s3401e193u7c90d3ab22ac3eda@mail.gmail.com>
In-Reply-To: <30b660a20708281812s3401e193u7c90d3ab22ac3eda@mail.gmail.com>

Here are my responses to this mail from Mark. I critique his argumentation, but be careful not to jump to conclusions about where I stand on the open issue: what I say here critiques Mark’s arguments against extlang, but does not attempt to make a sufficient case for extlang.


From: Mark Davis [mailto:mark.davis@icu-project.org]
Sent: Tuesday, August 28, 2007 6:12 PM

> If a language yyy has the macrolanguage xx, we are talking about two possible representations
> a) extlang: xx-yyy
> b) lang: yyy

> The main reason I've heard from you for doing (a) instead of (b) is that (a) it has better fallback behavior. For that to be true, xx has to be a good fallback for users of yyy, in the majority of cases.
I think “fallback behavior” needs more careful consideration here. It appears that Mark is focusing on a particular scenario: the request is for a resource in language A, but that is not available, so the process needs to fall back to a likely next-best choice that is available.
When I first proposed the extlang mechanism, that was not the intent. Rather, the intent was focused on another scenario: an author wants to tag content using the more specific ID yyy, but many requests, especially from legacy implementations, will use xx, which has been in use for some time. This scenario pertains to language-range as it has been defined since HTTP/1.1.
                Language-range = xx
                Content to be matched = yyy or xx-yyy
If the content is tagged xx-yyy, there is a match. But if the content is tagged yyy, there is no match using a basic algorithm – one would need a more advanced algorithm that knows about the relationship between xx and yyy.
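The basic matching described here can be sketched in a few lines. This is only an illustration, assuming the HTTP/1.1 prefix rule for language-range (a range matches a tag if it equals the tag, or equals a prefix of the tag that is immediately followed by "-"). The function name basic_match is hypothetical, and xx / yyy are the placeholders used in this message.

```python
def basic_match(language_range: str, tag: str) -> bool:
    """Prefix matching in the style of HTTP/1.1 language-range.

    Comparison is case-insensitive; "*" matches any tag.
    """
    r, t = language_range.lower(), tag.lower()
    return r == "*" or t == r or t.startswith(r + "-")


# Legacy request for "xx", content tagged with the extlang form: match.
print(basic_match("xx", "xx-yyy"))  # True

# Same request, content tagged with the bare specific ID: no match,
# because nothing in the tag itself relates yyy to xx.
print(basic_match("xx", "yyy"))     # False
```

Nothing in this rule consults a registry; the relationship between xx and yyy is visible to it only when it is spelled out in the tag.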
I think Mark’s is the reciprocal scenario: the user prefers the specific variety yyy, but most content is already tagged using xx, which has been in use for some time. This is a particular case in the general set of fallback scenarios, and it seems to be exactly the one Mark is focusing on. But note that language-range does not apply here (or, from a different perspective, doesn’t work here):
                Language-range = yyy or xx-yyy
                Content to be matched = xx
Whichever way the language-range is expressed, there is no match with a basic algorithm – one would need a more advanced algorithm that knows about the relationship between xx and yyy.
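One possible shape for such a "more advanced algorithm" is sketched below. It is an illustration under stated assumptions, not a proposal: the table MACROLANGUAGE_OF and the functions expand and aware_match are hypothetical names, the table holds only the placeholder pair from this message, and the fallback rule (a request for the specific variety may accept content tagged with the bare macrolanguage) is just one policy an implementation might choose.

```python
# Hypothetical relationship table: member language -> macrolanguage.
MACROLANGUAGE_OF = {"yyy": "xx"}


def basic_match(language_range: str, tag: str) -> bool:
    """Prefix matching in the style of HTTP/1.1 language-range."""
    r, t = language_range.lower(), tag.lower()
    return r == "*" or t == r or t.startswith(r + "-")


def expand(tag: str) -> str:
    """Rewrite 'yyy' as 'xx-yyy' when yyy has a known macrolanguage."""
    tag = tag.lower()
    first = tag.split("-", 1)[0]
    macro = MACROLANGUAGE_OF.get(first)
    return f"{macro}-{tag}" if macro else tag


def aware_match(language_range: str, tag: str, allow_fallback: bool = False) -> bool:
    """Basic matching after expansion, with an optional macrolanguage fallback."""
    r, t = expand(language_range), expand(tag)
    if basic_match(r, t):
        return True
    parts = r.split("-")
    # Fallback only applies when the range names a member of a known macrolanguage.
    if allow_fallback and len(parts) >= 2 and MACROLANGUAGE_OF.get(parts[1]) == parts[0]:
        return t == parts[0]
    return False


# Request "xx", content tagged "yyy": the table supplies the missing link.
print(aware_match("xx", "yyy"))                          # True

# Request for the specific variety, content tagged "xx":
# still no match unless the fallback policy is switched on.
print(aware_match("yyy", "xx"))                          # False
print(aware_match("xx-yyy", "xx"))                       # False
print(aware_match("yyy", "xx", allow_fallback=True))     # True
print(aware_match("xx-yyy", "xx", allow_fallback=True))  # True
```

With the table available, the two tag forms behave identically in both directions; the difference between them shows up only for implementations that lack the table.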
So, considering just those two scenarios, xx-yyy comes out ahead of yyy: it works with a basic algorithm in the first scenario, where yyy does not, and the two alternatives are equal with respect to the second. Of course, those aren’t the only scenarios. We need to consider a broader set of scenarios, and also consider how they rank in priority.


> There are (at least) two cases to consider here.

>  1.  There is a predominant choice in the industry for the macro language. For example, the content for zh is typically always Mandarin; the content for ar is typically always standard Arabic.
Typically, yes. We just must not assume that is always the case. The very thing that started me thinking about the macrolanguage concept in the first place, rather than equating existing ISO 639 IDs like zh and ar with the predominant variety, was the fact that there were language tags registered with IANA explicitly associating zh with Chinese languages other than Mandarin. When faced with pre-existing usage “zh-wuu”, we have two choices:
a) wuu is a specific variety of zh
b) wuu relates to zh in some other way, such as zh being a good fallback choice
I suspect that the request for zh-wuu was based on the first perspective, and that was the perspective I assumed.

> Look at the concrete implications. It means that whenever Joe looks for a web page in Hakka Chinese, he will typically fall back to Mandarin. Whenever Sarah looks for a page in Tunisian Arabic, it will fall back to standard Mandarin.
We need to be a bit more careful here: exactly how does Joe go about requesting Hakka? If he asks for “zh” hoping to get content in Hakka, there is a very high probability he will get pages in Mandarin. But if he asks for “zh-hakka”, he will only get pages tagged “zh-hakka” from servers implementing HTTP/1.1 language-range, not “zh” pages: that is how language-range works.
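To make Joe's two options concrete, here is a small sketch of a server filtering its pages by language-range. It assumes the HTTP/1.1 prefix rule; the function names basic_match and select and the page names are hypothetical, and the tags are the ones used in this message.

```python
def basic_match(language_range: str, tag: str) -> bool:
    """Prefix matching in the style of HTTP/1.1 language-range."""
    r, t = language_range.lower(), tag.lower()
    return r == "*" or t == r or t.startswith(r + "-")


def select(language_range: str, pages: dict) -> list:
    """Return the names of the pages whose tag matches the range."""
    return [name for name, tag in pages.items() if basic_match(language_range, tag)]


# Hypothetical inventory: one page tagged generically, one tagged specifically.
pages = {"mandarin-page": "zh", "hakka-page": "zh-hakka"}

# Asking for "zh" makes both pages candidates, so Joe may well get Mandarin.
print(select("zh", pages))        # ['mandarin-page', 'hakka-page']

# Asking for "zh-hakka" never returns the page tagged just "zh".
print(select("zh-hakka", pages))  # ['hakka-page']
```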

> If you are saying however, that zh is not necessarily Mandarin, that Arabic is not necessarily Standard Arabic, then we fall through to case 2.

>  2.  There is not a predominant choice in the industry, let's say for Hmong. In this case, the situation is different. I could choose any of the Hmong for the content for hmn. We then have an even dicier case for the value of extlang. I localize my hmn locale with contents appropriate for Northeastern Dian Hmong; is that a good default for someone speaking Eastern Xiangxi Hmong? for Luopohe Hmong? For all the other Hmongs?
> For extlang to be a good apparatus, these always have to be good choices, since we are baking the structure into the tag.

Again, let’s be more careful in the analysis and argumentation. You’re questioning whether a request for hmn should be able to return content in any Hmong language, and saying that “hmn-hmd” (NE Dian) is bad because someone asking for Hmong might really be a speaker of Luopohe Hmong. It seems to me this is a fallacious argument: its premise is that “hmn” can and must be sufficient for any of these various speakers. Well, either it is or it isn’t. If it is, then the argument fails. If it isn’t, then all that proves is that “hmn” really is never sufficient: a more specific language-range really is needed, and anyone requesting their resources using “hmn” is making a vague request that will be subject to somewhat arbitrary results. Once you decide a more specific language-range is needed, it makes no difference whether the language-range used is “hmd” or “hmn-hmd”: both would succeed in obtaining the desired result.

So, we really can’t use the macrolanguage cases like Hmong to decide this open issue. If NE Dian is not a good choice for Luopohe, the *only* thing that points to is that “hmn” is too vague to be useful for requesting resources.

Btw, keep in mind why “hmn” was created in the first place, and why it was created as an individual-language identifier:


- librarians needed a tag for content
- they were not Hmong specialists and had no ability to differentiate between varieties
- as these are not-highly-developed varieties (in the language-development sense: literature, media, standardization), there was no reason for these non-specialists to suppose that these varieties were anything more than dialects of a single language (assuming they had much awareness of any variations in the first place)


Now, as I approached how to deal with “hmn” in ISO 639-2 when it came to creating ISO 639-3, I had two options: argue that “hmn” should really be a collection, or treat “hmn” as a macrolanguage. In the original analysis (http://www.ethnologue.com/14/iso639/analysis.asp), I had concluded that “hmn” is really a collection. But since I also am not a Hmong specialist and didn’t have the capacity (far from it!) to get an expert analysis of this and every other uncertain case in ISO 639 in any reasonable amount of time, I chose the path of least resistance: treat it as a macrolanguage, since ISO 639-2 and its user community consider it an individual language. That way I don’t have to get the JAC to take action on yet one more decision where the impact is unclear and the internal expertise on which to base the decision is minimal.

But note that for our purposes here it really doesn’t matter whether “hmn” ended up as a macrolanguage or as a collection: either way, it is still vague and therefore not a good tag to use for requesting resources if the distinctions matter to you.


> If Peter Constable came out and said the following, then I would admit to my sins, give in gracefully, and go along with extlang.

>  *   "Yes, each of the Hmongs (encompassed by hmn) are mutually intelligible, and are better for each one than another fallback like Chinese"
>
>     *   and the same is true for all the other cases with no predominant variant.
I haven’t come out saying that; I assume they are not. But, I’m saying that this case is irrelevant for what we need to decide.
>  *   "Yes, standard Arabic is intelligible for all the encompassed languages from Algerian Saharan Arabic to Shihhi Arabic, and are better for each one than another fallback like French"
>
>     *   and the same is true for all the other cases with a predominant variant.
Well, what you’re asking me to say here is that a request for e.g. Algerian Saharan Arabic can appropriately be serviced with Standard Arabic resources rather than, say, French. In other words, a language-range “ar-arq” can appropriately match “ar” content. But that would never happen because that is not how language-range works. So, I’m not sure how this is relevant: whether the language-range is “ar-arq” or “arq” the only results returned will be Algerian Saharan Arabic, unless some more advanced fallback behaviour is invoked.

The question that would be relevant in terms of language-range for the Arabic/Mandarin cases is this: if someone requests “ar”, how bad is it if they get pages in “ar-arq”? If someone requests “zh”, how bad is it if they get pages in “zh-hakka”? These are the cases that relate to the way language-range works.


Peter