Re: [Ltru] Re: extlang
Addison Phillips <addison@yahoo-inc.com> Tue, 20 March 2007 01:26 UTC
Message-ID: <45FF380E.5050402@yahoo-inc.com>
Date: Mon, 19 Mar 2007 18:25:34 -0700
From: Addison Phillips <addison@yahoo-inc.com>
User-Agent: Thunderbird 1.5.0.10 (Windows/20070221)
MIME-Version: 1.0
To: Mark Davis <mark.davis@icu-project.org>
Subject: Re: [Ltru] Re: extlang
References: <E1HRsNL-0001ob-5h@megatron.ietf.org> <30b660a20703161617u85dbfe1r44ddc29fcfcf1a6d@mail.gmail.com> <45FB2C4E.9090303@yahoo-inc.com> <006e01c7682b$f0687b10$d1397130$@net> <004501c768bb$3bc185e0$6401a8c0@DGBP7M81> <00fd01c76914$18377ae0$48a670a0$@net> <45FD1A0A.2EED@xyzzy.claranet.de> <30b660a20703181137y6448508exb3e75f8e21a80a64@mail.gmail.com> <01b801c76990$e3e9b5a0$abbd20e0$@net> <45FEA785.2080003@yahoo-inc.com> <30b660a20703190910u636658b1g56489b0d30d2333a@mail.gmail.com>
In-Reply-To: <30b660a20703190910u636658b1g56489b0d30d2333a@mail.gmail.com>
Content-Type: text/plain; charset="UTF-8"; format="flowed"
Content-Transfer-Encoding: 7bit
Cc: ltru@lists.ietf.org
Mark Davis wrote:
> I see it a somewhat different way. The fact that there is a macro
> language (zh), should not skew the way we encode an individual language
> (yue). You say: "Extlangs help this a little by avoiding the choice on
> the initial subtag. Thus: "ar-EG" is related to "ar-arz-EG" and
> "ar-arb-EG" in distinct and somewhat logical ways."

I'm not sure we see it a different way. My main problem is not with the body of your argument: most of the macrolanguages are actually rare in practice and their constituent languages can be encoded on the top level without rancor. This will represent a minimum of interruption for users. Some content will be best tagged with the macro language and some with the constituent language.

My basic problem is with the two widely used and accepted subtags 'ar' and 'zh'. You're asking for users to change to using a new set of tags for these languages/"languages". How should users choose between the current practice ("zh-Hans-CN", "ar-EG") and "modern" practice ("cmn-Hans-CN", "arb-EG")? It isn't clear. This is the original justification for extlang: to help users of widely used macro-language tags adopt the right tagging approach in a compatible manner.

>
> However, functionally, we have to see advantage in subordinating some
> languages as extlangs. There isn't value if they just add complication,
> as per your "The problem with lookup (and I use lookup extensively, so
> it concerns me deeply) might suggest that some extra smarts related to
> extlangs is going to be needed."

The advantage of subordinating some languages is that there is a significant body of content already tagged with the macro-language code and that the macro-language retains a broadly applicable set of uses: there is, in many senses, a real thing called "Chinese" or "Arabic". With extlangs, the "subordinated" language code can be added to tags in a manner consistent with the matching schemes and tagging already in place.

"sgn" benefits mightily from this scheme. I believe that at least "zh" does (given that written forms of Chinese, to a significant degree, are not very distinctive between the various Sinitic languages).

>
> In order to have a good case for the extlang model, we would need to see
> concrete scenarios where we can demonstrate that "zh-cmn" and "zh-yue"
> work better than "cmn" and "yue" resp., and demonstration that those are
> more important than the scenarios where they cause problems.

I think your citation of Norwegian is not a good example to pick on. Historically the subtags 'nn' and 'nb' have been problematic: they overlap with the more common 'no'. The existence of three subtags for the language and its variations is confusing, and, ultimately, not that useful... since no one is ever sure if some 'no' labeled resources exist (or not). This case can perhaps be extended to the Mandarin/Cantonese/etc. case as an example where extlangs would have been less painful than full-fledged language codes?

>
> Note that ISO 639-3 does not at all force the use of the extlang model;
> the extlang model is just one possible way of expressing the information
> in ISO 639-3.

That's right. I think I suggested this as one of the solutions.

> We already have macrolanguages and "subordinate" languages
> in BCP 47, but we put them at the same level.

No, we didn't recognize the distinction: each primary language subtag is supposed to encode "a language". This has been occasionally confusing for users who expect the relationship to have meaning.

> We shouldn't be
> making Cantonese a subordinate language either.

The anti-question would be: why *not*? It's quite clear that "zh-HK" doesn't mean "Cantonese" any more than "zh-TW" meant "Traditional Chinese". The question of whether to use extlang or not is not a value judgment about the language, but merely an encoding question.

In fact, for the most part, what you're suggesting is little different from using extlang. Let's explore...

Let's say I request "zh-yue-Hant-HK". The default lookup scheme produces:

   zh-yue-Hant-HK
   zh-yue-Hant
   zh-yue
   zh

One way to have a lookup implementation use macro-language information would be if it were free to treat extlangs as equivalencies or ignorable. The fallback could be:

   zh-yue-Hant-HK
   zh-Hant-HK (implied)
   zh-yue-Hant
   zh-Hant (implied)
   zh-yue
   zh

...and let's say we request without extlang (zh-Hant-HK):

   zh-Hant-HK
   zh-Hant
   zh

This misses any content tagged with 'yue'. We could say that any "zh-yue" or "zh-cmn" content also matches on each level (an implied match), but this would be more troubling (the user didn't request that specific content):

   zh-Hant-HK
   zh-yue-Hant-HK (implied)
   zh-Hant
   zh-yue-Hant (implied)
   zh
   zh-cmn (implied, just for the sake of ickiness)

Now consider extended filtering for a second. A range of "zh-Hant" matches each of:

   zh-Hant
   zh-yue-Hant
   zh-cmn-Hant
   zh-Hant-CN
   zh-yue-Hant-CN

This is logical. Specifying "zh-yue" would not find a tag "zh-Hant" that might be Cantonese, though.

Now what you're suggesting is that we put "yue" on the top-level. Some Cantonese content will thus be tagged as "yue-*" and some as "zh-*". One of two things has to happen. If I request "yue", I might get some content labeled "zh". Or if I request some content labeled "yue" I do NOT get any content labeled "zh" that might actually be Cantonese. This doesn't solve the lookup problem:

   yue-Hant-HK
   zh-Hant-HK (implied)
   yue-Hant
   zh-Hant (implied)
   yue
   zh (implied)

This pattern is identical to the pattern I get (above, first example), but with the complication of having the mapping table.

The compelling part of your argument is if we think in terms of a language priority list instead. Then my fallback is:

   yue-Hant-HK
   yue-Hant
   yue
   zh-Hant-HK
   zh-Hant
   zh

Here we see the advantage for using 'yue' on the top level: we don't get to the (probably Simplified) 'zh' and (probably Mandarin) 'zh-Hant' resources too early (or at all, if we omit the pass through 'zh'-bearing ranges).
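
The fallback chains and the extended-filtering example above can be reproduced with a few lines of Python. This is only a minimal sketch, assuming the truncation and matching rules of RFC 4647 (lookup and extended filtering). The function names and the hard-coded EXTLANGS set are hypothetical, chosen for illustration; a real implementation would presumably take extlang data from the registry.

```python
# Illustrative sketch only: helper names and the EXTLANGS set are assumptions
# made for this example, not part of any registry, RFC, or library.

EXTLANGS = {"yue", "cmn"}  # assumed to be registered as extlangs under "zh"


def lookup_fallback(language_range):
    """Truncate a range from the right, one subtag at a time (RFC 4647 lookup)."""
    subtags = language_range.split("-")
    chain = []
    while subtags:
        chain.append("-".join(subtags))
        subtags.pop()
        # A trailing single-character subtag is removed along with its successor.
        while subtags and len(subtags[-1]) == 1:
            subtags.pop()
    return chain


def lookup_fallback_ignoring_extlang(language_range):
    """Same chain, but each step is followed by the tag with its extlang dropped."""
    chain = []
    for tag in lookup_fallback(language_range):
        subtags = tag.split("-")
        chain.append(tag)
        if len(subtags) > 2 and subtags[1] in EXTLANGS:
            chain.append("-".join([subtags[0]] + subtags[2:]))  # "(implied)"
    return chain


def priority_list_fallback(ranges):
    """Exhaust each range of a language priority list before trying the next."""
    chain = []
    for language_range in ranges:
        chain.extend(lookup_fallback(language_range))
    return chain


def extended_filter_matches(language_range, tag):
    """Extended filtering: range subtags must appear in order; others may be skipped."""
    r = language_range.lower().split("-")
    t = tag.lower().split("-")
    if r[0] != "*" and r[0] != t[0]:
        return False
    ri, ti = 1, 1
    while ri < len(r):
        if r[ri] == "*":
            ri += 1
        elif ti >= len(t):
            return False          # tag ran out before the range did
        elif r[ri] == t[ti]:
            ri += 1
            ti += 1
        elif len(t[ti]) == 1:
            return False          # never skip over a singleton
        else:
            ti += 1               # skip a non-matching tag subtag
    return True


if __name__ == "__main__":
    print(lookup_fallback("zh-yue-Hant-HK"))
    print(lookup_fallback_ignoring_extlang("zh-yue-Hant-HK"))
    print(priority_list_fallback(["yue-Hant-HK", "zh-Hant-HK"]))
    print(extended_filter_matches("zh-yue", "zh-Hant"))
```

Note that the extlang-aware variant needs only the tag itself plus a list of which subtags are extlangs, whereas mapping a top-level 'yue' onto 'zh' would need a separate macrolanguage table.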

>
> At one point, I did think that having the extlang structure would be
> better, but the more I get into actually implementing them, the more I
> find that they are just a complication for no good result. What I think
> we should instead be doing is adding a field to the registry that says
> that X is a macro language for Y, and adding information to 4647 that
> indicates how one can make use of this information in matching.

My implementations just don't make me think it is that complicated. Of course, most of my implementations have control over both the content (tags) and ranges used in the selection---not to mention defaulting behavior. I haven't gotten to the point yet of needing to identify the spoken Chinese variations in uncontrolled content, so I haven't really run into the problems you have yet. Thinking about them suggests, however, that I'll want the broader selection profile that extlang gives me rather than the narrower one that (multiple choice) primary language gives me on the top level. That's because the more structure a tag has, the more information it can contain and thus the simpler the lookup/filtering is to do.

If an equivalence table is required, we'll, for the first time, be reliant on external data tables to do simple select-type matching (think about simplistic implementations such as the :lang pseudo-attribute in CSS for a moment!). And I'd like to avoid that if possible.

I recognize what you're saying is valid, for the kinds of data you're working with. But I'm concerned equally that content tagging will be mixed up for years to come because users do not understand when and whether to use 'zh' vs. 'cmn' or 'ar' vs. 'arb'.

Addison

-- 
Addison Phillips
Globalization Architect -- Yahoo! Inc.

Internationalization is an architecture.
It is not a feature.