Re: [Ltru] Re: extlang

Addison Phillips <addison@yahoo-inc.com> Tue, 20 March 2007 01:26 UTC

Message-ID: <45FF380E.5050402@yahoo-inc.com>
Date: Mon, 19 Mar 2007 18:25:34 -0700
From: Addison Phillips <addison@yahoo-inc.com>
To: Mark Davis <mark.davis@icu-project.org>
Subject: Re: [Ltru] Re: extlang
Cc: ltru@lists.ietf.org

Mark Davis wrote:
> I see it a somewhat different way. The fact that there is a macro 
> language (zh), should not skew the way we encode an individual language 
> (yue). You say: "Extlangs help this a little by avoiding the choice on 
> the initial subtag. Thus: "ar-EG" is related to "ar-arz-EG" and 
> "ar-arb-EG" in distinct and somewhat logical ways."

I'm not sure we see it a different way.

My main problem is not with the body of your argument: most of the
macrolanguages are actually rare in practice and their constituent
languages can be encoded on the top level without rancor. This will
represent a minimum of interruption for users. Some content will be best
tagged with the macro language and some with the constituent language.

My basic problem is with the two widely used and accepted subtags 'ar'
and 'zh'. You're asking for users to change to using a new set of tags
for these languages/"languages". How should users choose between the
current practice ("zh-Hans-CN", "ar-EG") and "modern" practice
("cmn-Hans-CN", "arb-EG")? It isn't clear.

This is the original justification for extlang: to help users of widely
used macro-language tags adopt the right tagging approach in a
compatible manner.

> 
> However, functionally, we have to see advantage in subordinating some 
> languages as extlangs. There isn't value if they just add complication, 
> as per your "The problem with lookup (and I use lookup extensively, so 
> it concerns me deeply) might suggest that some extra smarts related to 
> extlangs is going to be needed."

The advantage of subordinating some languages is that there is a
significant body of content already tagged with the macro-language code
and that the macro-language retains a broadly applicable set of uses: 
there is, in many senses, a real thing called "Chinese" or "Arabic".
With extlangs, the "subordinated" language code can be added to tags in 
a manner consistent with the matching schemes and tagging already in place.

"sgn" benefits mightily from this scheme. I believe that at least "zh"
does (given that written forms of Chinese, to a significant degree, are
not very distinctive between the various Sinitic languages).

> 
> In order to have a good case for the extlang model, we would need to see 
> concrete scenarios where we can demonstrate that "zh-cmn" and "zh-yue" 
> work better than "cmn" and "yue" resp., and demonstration that those are 
> more important than the scenarios where they cause problems.

I don't think Norwegian is a good example to cite here.
Historically the subtags 'nn' and 'nb' have been problematic: they
overlap with the more common 'no'. Having three subtags for the
language and its variations is confusing and, ultimately, not that
useful, since no one is ever sure whether resources labeled 'no' also
exist. Perhaps this case can be extended to the
Mandarin/Cantonese/etc. case as an example where extlangs would have
been less painful than full-fledged language codes?

> 
> Note that ISO 639-3 does not at all force the use of the extlang model; 
> the extlang model is just one possible way of expressing the information 
> in ISO 639-3.

That's right. I think I suggested this as one of the solutions.

>  We already have macrolanguages and "subordinate" languages 
> in BCP 47, but we put them at the same level. 

No, we didn't recognize the distinction: each primary language subtag is 
supposed to encode "a language". This has been occasionally confusing 
for users who expect the relationship to have meaning.

>  We shouldn't be 
> making Cantonese a subordinate language either.

The anti-question would be: why *not*? It's quite clear that "zh-HK"
doesn't mean "Cantonese" any more than "zh-TW" meant "Traditional
Chinese". The question of whether to use extlang or not is not a value
judgment about the language, but merely an encoding question.

In fact, for the most part, what you're suggesting is little different
from using extlang. Let's explore...

Let's say I request "zh-yue-Hant-HK". The default lookup scheme produces:

zh-yue-Hant-HK
zh-yue-Hant
zh-yue
zh
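
As a rough sketch of that default truncation (Python; the helper name
lookup_fallback is hypothetical and is not taken from any
implementation discussed in this thread):

```python
def lookup_fallback(language_range):
    """Return the ranges tried by lookup, most specific first."""
    subtags = language_range.split("-")
    chain = []
    while subtags:
        chain.append("-".join(subtags))
        subtags.pop()  # truncate from the end
        # RFC 4647 lookup also removes a single-character subtag
        # (extension or private-use introducer) left at the end.
        while subtags and len(subtags[-1]) == 1:
            subtags.pop()
    return chain


for candidate in lookup_fallback("zh-yue-Hant-HK"):
    print(candidate)
```

This only generates the candidate list; matching each candidate
against the available content is left to the caller.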

One way to have a lookup implementation use macro-language information
would be to leave it free to treat extlangs as equivalences or as
ignorable. The fallback could be:

zh-yue-Hant-HK
zh-Hant-HK (implied)
zh-yue-Hant
zh-Hant (implied)
zh-yue
zh
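
A minimal sketch of how such an "extlang is ignorable" fallback might
be generated (Python). The function names are hypothetical, and the
test for an extlang (a three-letter alphabetic subtag directly after
the primary language) is a simplifying assumption, not a full parse of
the tag syntax:

```python
def drop_extlang(subtags):
    # Assumption: an extlang is a 3-letter alphabetic subtag that
    # immediately follows the primary language subtag.
    if len(subtags) > 1 and len(subtags[1]) == 3 and subtags[1].isalpha():
        return subtags[:1] + subtags[2:]
    return subtags


def fallback_ignoring_extlang(language_range):
    """At each truncation step, also try the tag without its extlang."""
    subtags = language_range.split("-")
    chain = []
    while subtags:
        for candidate in (subtags, drop_extlang(subtags)):
            tag = "-".join(candidate)
            if tag not in chain:  # "zh" would otherwise appear twice
                chain.append(tag)
        subtags = subtags[:-1]
    return chain


for candidate in fallback_ignoring_extlang("zh-yue-Hant-HK"):
    print(candidate)
```

Note that no external table is consulted: the macro-language is
already present as the first subtag of the request.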

...and let's say we request without extlang (zh-Hant-HK):

zh-Hant-HK
zh-Hant
zh

This misses any content tagged with 'yue'.

We could say that any "zh-yue" or "zh-cmn" content also matches on each
level (an implied match), but this would be more troubling (the user
didn't request that specific content):

zh-Hant-HK
zh-yue-Hant-HK (implied)
zh-Hant
zh-yue-Hant (implied)
zh
zh-cmn (implied, just for the sake of ickiness)

Now consider extended filtering for a second. A range of "zh-Hant"
matches each of:

zh-Hant
zh-yue-Hant
zh-cmn-Hant
zh-Hant-CN
zh-yue-Hant-CN

This is logical. Specifying "zh-yue" would not find a tag "zh-Hant" that
might be Cantonese, though.
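
For reference, a sketch of the extended filtering comparison as I read
it in RFC 4647, section 3.3.2 (Python; the function name is
hypothetical, and tags are assumed to be well-formed):

```python
def extended_filter_match(language_range, tag):
    """Return True if the tag matches the extended language range."""
    r = language_range.lower().split("-")
    t = tag.lower().split("-")
    # The first subtags must match, unless the range starts with "*".
    if r[0] != "*" and r[0] != t[0]:
        return False
    ri, ti = 1, 1
    while ri < len(r):
        if r[ri] == "*":
            ri += 1                # wildcard: move on in the range
        elif ti >= len(t):
            return False           # range has subtags the tag lacks
        elif r[ri] == t[ti]:
            ri += 1
            ti += 1
        elif len(t[ti]) == 1:
            return False           # never skip past a singleton
        else:
            ti += 1                # skip an unmatched tag subtag
    return True


tags = ["zh-Hant", "zh-yue-Hant", "zh-cmn-Hant", "zh-Hant-CN",
        "zh-yue-Hant-CN"]
print([tag for tag in tags if extended_filter_match("zh-Hant", tag)])
print(extended_filter_match("zh-yue", "zh-Hant"))
```

The range "zh-Hant" matches all five tags because the extlang subtag
is simply skipped, while "zh-yue" fails against "zh-Hant" because the
tag has no 'yue' subtag to find.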

Now what you're suggesting is that we put "yue" on the top level. Some
Cantonese content will thus be tagged as "yue-*" and some as "zh-*". One
of two things has to happen: either a request for "yue" might return
some content labeled "zh", or a request for "yue" does NOT return any
content labeled "zh" that might actually be Cantonese. This doesn't
solve the lookup problem:

yue-Hant-HK
zh-Hant-HK (implied)
yue-Hant
zh-Hant (implied)
yue
zh (implied)
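
A sketch of what the table-driven version might look like (Python).
The MACROLANGUAGE dictionary is a tiny illustrative stand-in for the
registry field being proposed, and the function name is hypothetical:

```python
# Illustrative subset only; a real table would come from the registry.
MACROLANGUAGE = {"yue": "zh", "cmn": "zh", "arb": "ar", "arz": "ar"}


def fallback_with_table(language_range, table=MACROLANGUAGE):
    """At each truncation step, also try the macro-language form."""
    subtags = language_range.split("-")
    macro = table.get(subtags[0])
    chain = []
    while subtags:
        chain.append("-".join(subtags))
        if macro is not None:
            # Implied candidate: swap the primary language subtag
            # for its macro-language.
            chain.append("-".join([macro] + subtags[1:]))
        subtags = subtags[:-1]
    return chain


for candidate in fallback_with_table("yue-Hant-HK"):
    print(candidate)
```

The difference from the extlang case is the dictionary lookup: without
the table, nothing in "yue-Hant-HK" says that 'zh' is related.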

This pattern is identical to the extlang fallback pattern above (the
first example, with its implied matches), but with the added
complication of needing the mapping table.

The compelling part of your argument is if we think in terms of a
language priority list instead. Then my fallback is:

yue-Hant-HK
yue-Hant
yue
zh-Hant-HK
zh-Hant
zh
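
A sketch of lookup driven by a language priority list (Python). The
function names are hypothetical, and tags are assumed to be already
normalized in case:

```python
def lookup_fallback(language_range):
    """Truncate from the end, most specific first (no singletons)."""
    subtags = language_range.split("-")
    return ["-".join(subtags[:n]) for n in range(len(subtags), 0, -1)]


def priority_fallback(priority_list):
    """Exhaust each range's fallback before moving to the next range."""
    chain = []
    for language_range in priority_list:
        for candidate in lookup_fallback(language_range):
            if candidate not in chain:
                chain.append(candidate)
    return chain


def lookup(priority_list, available, default=None):
    """Return the first candidate that is actually available."""
    for candidate in priority_fallback(priority_list):
        if candidate in available:
            return candidate
    return default


print(priority_fallback(["yue-Hant-HK", "zh-Hant-HK"]))
print(lookup(["yue-Hant-HK", "zh-Hant-HK"], {"zh", "zh-Hant", "yue"}))
```

With the available set shown, the plain 'yue' resource is chosen
before either 'zh-Hant' or 'zh' is considered.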

Here we see the advantage for using 'yue' on the top level: we don't get
to the (probably Simplified) 'zh' and (probably Mandarin) 'zh-Hant'
resources too early (or at all, if we omit the pass through 'zh'-bearing 
ranges).

> 
> At one point, I did think that having the extlang structure would be 
> better, but the more I get into actually implementing them, the more I 
> find that they are just a complication for no good result. What I think 
> we should instead be doing is adding a field to the registry that says 
> that X is a macro language for Y, and adding information to 4647 that 
> indicates how one can make use of this information in matching. 

My implementations just don't make me think it is that complicated. Of
course, most of my implementations have control over both the content
(tags) and the ranges used in the selection, not to mention defaulting
behavior. I haven't yet needed to identify the spoken Chinese variations
in uncontrolled content, so I haven't really run into the problems you
have. Thinking about them suggests, however, that I'll want the broader
selection profile that extlang gives me rather than the narrower one
that a (multiple choice) primary language subtag gives me on the top
level.

That's because the more structure a tag has, the more information it can
contain and thus the simpler the lookup/filtering is to do. If an
equivalence table is required, we will, for the first time, be reliant
on external data tables to do simple select-type matching (think about
simplistic implementations such as the :lang pseudo-class in CSS for
a moment!). And I'd like to avoid that if possible.
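
To make the point concrete, here is a sketch of the kind of simple
prefix test such implementations use, similar in spirit to basic
filtering in RFC 4647 (Python; the function name is hypothetical):

```python
def prefix_match(language_range, tag):
    """True if the tag equals the range or extends it at a hyphen."""
    r = language_range.lower()
    t = tag.lower()
    return t == r or t.startswith(r + "-")


print(prefix_match("zh", "zh-yue-Hant-HK"))  # extlang form
print(prefix_match("zh", "yue-Hant-HK"))     # top-level form
```

With no data table available, a range of "zh" finds the extlang form
but cannot find the top-level 'yue' form.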

I recognize what you're saying is valid, for the kinds of data you're
working with. But I'm concerned equally that content tagging will be
mixed up for years to come because users do not understand when and
whether to use 'zh' vs. 'cmn' or 'ar' vs 'arb'.

Addison

-- 
Addison Phillips
Globalization Architect -- Yahoo! Inc.

Internationalization is an architecture.
It is not a feature.

