Re: [Ltru] Re: extlang

"Mark Davis" <mark.davis@icu-project.org> Thu, 30 August 2007 16:39 UTC

Message-ID: <30b660a20708300939i959d765o4bc21c0b67a46bda@mail.gmail.com>
Date: Thu, 30 Aug 2007 09:39:26 -0700
From: Mark Davis <mark.davis@icu-project.org>
To: Peter Constable <petercon@microsoft.com>
Subject: Re: [Ltru] Re: extlang
In-Reply-To: <DDB6DE6E9D27DD478AE6D1BBBB83579561ABDC7644@NA-EXMSG-C117.redmond.corp.microsoft.com>
MIME-Version: 1.0
References: <30b660a20708281459r6000d746qe007f2882fae6d73@mail.gmail.com> <20070828223536.GB31670@mercury.ccil.org> <30b660a20708281812s3401e193u7c90d3ab22ac3eda@mail.gmail.com> <DDB6DE6E9D27DD478AE6D1BBBB83579561ABDC7644@NA-EXMSG-C117.redmond.corp.microsoft.com>
Cc: LTRU Working Group <ltru@ietf.org>

Thanks Peter; background material like this and specific scenarios really
help to make clear what the requirements are, because they are very
different depending on the kinds of usage patterns people have in mind.
While I may disagree with some of your points, this discussion is going in a
good direction.

It is key to identify exactly how extlang would work in different
situations. We thought long and hard about the introduction of scripts, and
looked at many different aspects -- we need to do the same with the extlang
proposal.

It appears that there are two substantially different kinds of usage that we
need to examine:

Resource Lookup: user requests language X; supplier doesn't have exactly X
but wants to get the best match (in the sense of most likely to be
understood by the user). The extlang proponents appeared to be arguing that
extlang was better for this case, but it just doesn't seem to work well there.

Query: user requests content matching X, in some "loose" fashion, such as a
library lookup. Unless I'm misreading Peter, it appears that this was
actually the primary target of the extlang mechanism. So this bears some
significant attention, to see how the extlang mechanism would work in this
case.
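
To make the two kinds of usage concrete, here is a minimal sketch, assuming
they correspond roughly to the "lookup" and "basic filtering" schemes of
RFC 4647. The function names are made up for illustration, wildcard ("*")
ranges are not handled, and xx/yyy are the same placeholders used in the
quoted message below.

```python
def basic_filter(language_range, tags):
    """Query-style matching: return every tag the range matches.

    A range matches a tag if the two are equal (ignoring case) or if the
    range is a prefix of the tag and is followed by "-".
    """
    r = language_range.lower()
    return [t for t in tags
            if t.lower() == r or t.lower().startswith(r + "-")]


def lookup(language_range, tags, default=None):
    """Resource-lookup-style matching: return the single best tag.

    The range is truncated from the right, one subtag at a time, until
    it equals some available tag exactly.
    """
    available = {t.lower(): t for t in tags}
    subtags = language_range.lower().split("-")
    while subtags:
        candidate = "-".join(subtags)
        if candidate in available:
            return available[candidate]
        subtags.pop()
        # A trailing single-character subtag is dropped together with
        # the subtag that followed it.
        if subtags and len(subtags[-1]) == 1:
            subtags.pop()
    return default


if __name__ == "__main__":
    # Query: a request for xx picks up xx-yyy content but not bare yyy.
    print(basic_filter("xx", ["xx", "xx-yyy", "yyy"]))
    # Lookup: xx-yyy truncates to xx, while bare yyy has nothing to fall to.
    print(lookup("xx-yyy", ["xx", "fr"]))
    print(lookup("yyy", ["xx", "fr"]))
```

Under these assumptions the extlang form changes the outcome in both cases;
whether the changed outcome is the one users want is the open question.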

Once we get a sense of the relative benefits/drawbacks of extlang for each
of these -- and weigh the relative importance of these with respect to
language tag usage -- we'd be in better shape to know the overall pros
and cons of adding it.

Mark

On 8/30/07, Peter Constable <petercon@microsoft.com> wrote:
>
>   Here are my responses to this mail from Mark. I critique his
> argumentation, but be careful not to jump to conclusions about what I'm
> saying regarding the open issue: what I say here critiques Mark's arguments
> against extlang, but does not attempt to make a sufficient case for extlang.
>
>
>
>
> *From:* Mark Davis [mailto:mark.davis@icu-project.org]
> *Sent:* Tuesday, August 28, 2007 6:12 PM
>
>
> > If a language yyy has the macrolanguage xx, we are talking about two
> possible representations
>
> a) extlang: xx-yyy
> b) lang: yyy
>
>
> >The main reason I've heard from you for doing (a) instead of (b) is that
> (a) has better fallback behavior. For that to be true, xx has to be a
> good fallback for users of yyy, in the majority of cases.
>
> I think "fallback behavior" needs more careful consideration here. It
> appears that Mark is focusing on a particular scenario: the request is for
> a resource in lang A, but that is not available, so the process needs to
> fall back to a likely-next-best choice that is available.
>
> When I first proposed the extlang mechanism, that was not the intent.
> Rather, the intent was focused on another scenario: author wants to tag
> content using more specific ID yyy, but many requests, especially from
> legacy implementations, will use xx, which has been in use for some time.
> This scenario pertains to Language-range as defined since HTTP/1.1.
>
>                 Language-range = xx
>
>                 Content to be matched = yyy or xx-yyy
>
> If the content is tagged xx-yyy, there is a match. But if the content is
> tagged yyy, there is no match using a basic algorithm – one would need a
> more advanced algorithm that knows about the relationship between xx and
> yyy.
>
> I think Mark's is the reciprocal scenario: the user prefers the specific
> variety yyy, but most content is already tagged using xx, which has been in
> use for some time. This is a particular case in the general set of fallback
> scenarios, and it seems to be exactly the one Mark is focusing on. But note
> that language-range does not apply here – or, from a different perspective,
> doesn't work here:
>
>                 Language-range = yyy or xx-yyy
>
>                 Content to be matched = xx
>
> Whichever way the language-range is expressed, there is no match with a
> basic algorithm – one would need a more advanced algorithm that knows about
> the relationship between xx and yyy.
>
> So, considering just those scenarios, xx-yyy has an advantage over yyy: it
> matches under the basic algorithm in the first scenario, while the two
> alternatives are equal wrt the other. Of course, those aren't the only
> scenarios. We need to consider a broader set of scenarios, and also consider
> how they rank in priority.
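
The two scenarios above can be checked mechanically. The following is a
minimal sketch, assuming the HTTP/1.1 rule that a language-range matches a
tag when it equals the tag or equals a prefix of it that is followed by "-".
The helper name range_matches is made up for illustration.

```python
def range_matches(language_range, tag):
    """Basic language-range matching: exact match or prefix before '-'."""
    r, t = language_range.lower(), tag.lower()
    return t == r or t.startswith(r + "-")


# (language-range, content tag) pairs for the two scenarios in the message.
scenarios = [
    ("xx", "xx-yyy"),   # first scenario, content tagged in extlang form
    ("xx", "yyy"),      # first scenario, content tagged with bare yyy
    ("xx-yyy", "xx"),   # reciprocal scenario, range in extlang form
    ("yyy", "xx"),      # reciprocal scenario, bare yyy range
]

results = {(r, t): range_matches(r, t) for r, t in scenarios}

if __name__ == "__main__":
    for (r, t), ok in results.items():
        print(f"range {r!r:10} content {t!r:10} -> "
              f"{'match' if ok else 'no match'}")
```

Only the first pair matches, which is the asymmetry described above: the
extlang form helps when the range is the macrolanguage, and neither form
helps in the reciprocal direction without extra knowledge.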
>
>
>
>
> > There are (at least) two cases to consider here.
>
>    1. There is a predominant choice in the industry for the macro
>    language. For example, the content for zh is typically Mandarin; the
>    content for ar is typically standard Arabic.
>
> Typically, yes. We just must not assume that is always the case. The very
> thing that started me thinking about the macrolanguage concept in the first
> place, rather than equating existing ISO 639 IDs like zh and ar with the
> predominant variety, was the fact that there were language tags registered
> with IANA explicitly associating zh with Chinese languages other than
> Mandarin. When faced with pre-existing usage "zh-wuu", we have two choices:
>
> a) wuu is a specific variety of zh
>
> b) wuu relates to zh in some other way, such as zh being a good fallback
> choice
>
> I suspect that the request for zh-wuu was based on the first perspective,
> and that was the perspective I assumed.
>
>
>
> > Look at the concrete implications. It means that whenever Joe looks for
> a web page in Hakka Chinese, he will typically fall back to Mandarin.
> Whenever Sarah looks for a page in Tunisian Arabic, it will fall back to
> standard Arabic.
>
> We need to be a bit more careful here: exactly *how* does Joe go about
> requesting Hakka? If he asks for "zh" hoping to get content in Hakka, there
> is a very high probability he will get pages in Mandarin. But if he asks for
> "zh-hakka", he will only get pages tagged "zh-hakka" from servers
> implementing HTTP/1.1 language-range, not "zh" pages: that is how
> language-range works.
>
>
> >If you are saying however, that zh is not necessarily Mandarin, that
> Arabic is not necessarily Standard Arabic, then we fall through to case 2.
>
>    2. There is not a predominant choice in the industry, let's say for
>    Hmong. In this case, the situation is different. I could choose any of the
>    Hmong for the content for hmn. We then have an even dicier case for the value
>    of extlang. I localize my hmn locale with contents appropriate for
>    Northeastern Dian Hmong; is that a good default for someone speaking Eastern
>    Xiangxi Hmong? For Luopohe Hmong? For all the other Hmongs?
>
> > For extlang to be a good apparatus, these always have to be good
> choices, since we are baking the structure into the tag.
>
>  Again, let's be more careful in the analysis and argumentation. You're
> questioning whether a request for hmn should be able to return content in
> any Hmong language, and saying that "hmn-hmd" (NE Dian) is bad because
> someone asking for Hmong might really be a speaker of Luopohe Hmong. It
> seems to me this is a fallacious argument: its premise is that "hmn" can
> and must be sufficient for any of these various speakers. Well, either it is
> or it isn't. If it is, then the argument fails. If it isn't, then all that
> proves is that "hmn" really is never sufficient: a more specific
> language-range really is needed, and anyone requesting their resources using
> "hmn" is making a vague request that will be subject to somewhat arbitrary
> results. Once you decide a more specific language-range is needed,
> it makes no difference whether the language range used is "hmd" or
> "hmn-hmd": both would succeed in obtaining the desired result.
>
>
>
> So, we really can't use the macrolanguage cases like Hmong to decide this
> open issue. If NE Dian is not a good choice for Luopohe, the **only**
> thing that points to is that "hmn" is too vague to be useful for requesting
> resources.
>
>
>
> Btw, keep in mind why "hmn" was created in the first place, and why it was
> created as an individual-language identifier:
>
>
>
> - librarians needed a tag for content
>
> - they were not Hmong specialists and had no ability to differentiate
>   between varieties
>
> - as these are not-highly-developed varieties (in the language-development
>   sense – literature, media, standardization), there was no reason for
>   these non-specialists to suppose that these varieties were anything more
>   than dialects of a single language (assuming they had much awareness of
>   any variations in the first place)
>
>
>
> Now, as I approached how to deal with "hmn" in ISO 639-2 when it came to
> creating ISO 639-3, I had two options: argue that "hmn" should really be a
> collection, or treat "hmn" as a macrolanguage. In the original analysis
> (http://www.ethnologue.com/14/iso639/analysis.asp), I had concluded "hmn"
> is really a collection. But since I also am not a Hmong specialist and
> didn't have the capacity (far from it!) of getting an expert analysis of
> this and every other uncertain case in ISO 639 in any reasonable amount of
> time, I chose the path of least resistance: treat it as a macrolanguage
> since ISO 639-2 and its user community considers it an individual language,
> and that way I don't have to get the JAC to take action on yet one more decision
> where the impact is unclear and the internal expertise on which to base the
> decision is minimal.
>
>
>
> But note that for our purposes here it really doesn't matter whether "hmn"
> ended up as a macrolanguage or as a collection: either way, it is still
> vague and therefore not a good tag to use for requesting resources if the
> distinctions matter to you.
>
>
>
>
>
> > If Peter Constable came out and said the following, then I would admit
> to my sins, give in gracefully, and go along with extlang.
>
>    - "Yes, each of the Hmongs (encompassed by hmn) are mutually
>    intelligible, and are better for each one than another fallback like
>    Chinese"
>
>
>     - and the same is true for all the other cases with no predominant
>       variant.
>
> I haven't come out saying that; I assume they are not. But, I'm saying
> that this case is irrelevant for what we need to decide.
>
> >
>
>    - "Yes, standard Arabic is intelligible for all the encompassed
>    languages from Algerian Saharan Arabic to Shihhi Arabic, and is better for
>    each one than another fallback like French"
>
>
>     - and the same is true for all the other cases with a predominant
>       variant.
>
>  Well, what you're asking me to say here is that a request for e.g.
> Algerian Saharan Arabic can appropriately be serviced with Standard Arabic
> resources rather than, say, French. In other words, a language-range
> "ar-arq" can appropriately match "ar" content. But that would never happen
> because that is not how language-range works. So, I'm not sure how this is
> relevant: whether the language-range is "ar-arq" or "arq" the only results
> returned will be Algerian Saharan Arabic, unless some more advanced fallback
> behaviour is invoked.
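
As an illustration of what "more advanced fallback behaviour" could look
like, here is a rough sketch, assuming a matcher that carries its own table
of macrolanguage relationships. The MACROLANGUAGE table and all function
names are hypothetical; the table holds only tags used in this thread and is
not drawn from any registry. With such a table, "arq" and "ar-arq" behave
the same, so the tag syntax stops mattering.

```python
# Hypothetical table: encompassed language -> macrolanguage.
# Entries are the examples used in this thread, not registry data.
MACROLANGUAGE = {"arq": "ar", "wuu": "zh", "hmd": "hmn"}


def expand(tag):
    """Normalize to the extlang-style form, e.g. 'arq' -> 'ar-arq'."""
    subtags = tag.lower().split("-")
    macro = MACROLANGUAGE.get(subtags[0])
    if macro is not None:
        subtags.insert(0, macro)
    return "-".join(subtags)


def aware_matches(language_range, tag):
    """Prefix matching applied after both sides are normalized."""
    r, t = expand(language_range), expand(tag)
    return t == r or t.startswith(r + "-")


def aware_lookup(language_range, tags, default=None):
    """Truncating lookup applied after both sides are normalized."""
    available = {expand(t): t for t in tags}
    subtags = expand(language_range).split("-")
    while subtags:
        candidate = "-".join(subtags)
        if candidate in available:
            return available[candidate]
        subtags.pop()
    return default


if __name__ == "__main__":
    print(aware_matches("ar", "arq"))          # range ar, content tagged arq
    print(aware_lookup("arq", ["ar", "fr"]))   # request arq, only ar on hand
```

Whether falling back from "arq" to "ar" is actually desirable is exactly the
point under discussion; the sketch only shows that the behaviour can come
from the matcher's data and need not be baked into the tag.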
>
>
>
> The question that would be relevant in terms of language-range for the
> Arabic/Mandarin cases is this: if someone requests "ar", how bad is it if
> they get pages in "ar-arq"? If someone requests "zh", how bad is it if they
> get pages in "zh-hakka"? These are the cases that relate to the way
> language-range works.
>
>
>
>
>
> Peter
>
> _______________________________________________
> Ltru mailing list
> Ltru@ietf.org
> https://www1.ietf.org/mailman/listinfo/ltru
>
>


-- 
Mark