Re: [Ltru] Extended language tags (long reply)
Addison Phillips <addison@yahoo-inc.com> Sun, 07 October 2007 17:16 UTC
Message-ID: <4709146F.6020504@yahoo-inc.com>
Date: Sun, 07 Oct 2007 10:16:31 -0700
From: Addison Phillips <addison@yahoo-inc.com>
User-Agent: Thunderbird 2.0.0.6 (Windows/20070728)
MIME-Version: 1.0
To: Shawn Steele <Shawn.Steele@microsoft.com>
Cc: "ltru@ietf.org" <ltru@ietf.org>
(warning: this is a long screed, which, now that I've alluded to it on list--and even quoted from the abominable thing--I may as well share, even though I'll probably think better of it in about an hour when the responses start to arrive)

Shawn Steele wrote:
>
> * We (the bigger software ecosystem we, tools, etc) have encouraged use of zh-HK, etc for labeling both Mandarin and Cantonese. Whatever happens in the future a large number of legacy documents, both Mandarin and Cantonese, will be labeled with the existing language tag.

Not exactly. "We" have (at least historically) abused the tags "zh-HK" and "zh-TW" to represent Traditional Chinese and ignored the Cantonese/Mandarin split (which is possible if one only means written content, which historically has composed most of the material in our information systems and on the Web). More recently, language tags have allowed for direct tagging of these differences as "zh-Hant" and "zh-Hans" (Traditional vs. Simplified), allowing TW and HK to represent regional differences (there are regional preferential differences, even in written texts, even when ignoring the Cantonese/Mandarin split--we encounter these at Yahoo!, for example). It is unclear how quickly the transition to these subtags is proceeding. Some user-agents don't support these tags yet.

With the advent of audio and visual content in particular, tagging all Chinese as 'zh' becomes less useful (at least for those cases), since we suddenly find applications for which we need to describe specific spoken variations not captured by 'zh' and not directly discernible from the region subtag.

I note that Chinese *may* be a special case, in comparison to others, in which we have very large document bases in a variety of Sinitic languages tagged with a wide variety of regional subtags representing something other than regional variations. Other macrolanguages don't exhibit this sort or range of imputed meaning.
For example, the Arabic "sub-languages" are mostly regionally based, so one might very well consider 'arz' (Egyptian Arabic) to be a synonym for "ar-EG" (Arabic as used in Egypt). Of course, some of the other Arabic languages don't map so closely to modern nation-states... for the record, I oppose assigning "secret handshake" meaning to region subtags. Explicit subtags seem to work better. [Yes, there really are regional differences, for which region subtags are good, even when the region where the language is spoken leaks over borders a bit.]

>
> * Whether or not yue or zh-yue was used, it is a change from the current label (in most cases). It seems that in nearly all cases this will require a code change. In particular applications that want to include zh-HK in lists containing zh-cmn-HK or cmn-HK will need extra awareness.

It requires tag changes to indicate things with more precision, yes. Nothing says that we might not have continued use of the existing tags; indeed, I would be surprised by a wholesale conversion over to the new scheme (whatever it is). Most people won't "get the message" for a while. And sometimes the additional specificity doesn't matter ('zh' is perfectly good for many written documents). However, for many applications where content labeling can be controlled, converting over makes life easier.

For Chinese in particular, I expect a pretty messy tagging situation to persist for quite some time (since we have overlapping levels of imputed meaning in region subtags and elsewhere).

For other languages, a messy situation may not be necessary. Arabic probably doesn't *require* extlangs, although that may be my parochial perspective. While Andrew and Don point out the usefulness of deliberate vagaries in tagging, mightn't we be just as well served by tagging "generic" Dinka documents (for example) as 'din' and specific Dinka documents as, for example, 'dib' or 'dik'?
Again, the question here seems to depend on whether tags or ranges are what matter most.

>
> * I don't think the Breton example applies. http-accept-lang allows for "br-FR;fr-FR" type fallback, so that is a solution. If an application independently wanted to make this assumption that's fine by me, but this seems orthogonal to the problem we're trying to solve here.

I agree that this fallback case is valid as a use-case for language priority lists, but not for language tags. But I think this misses the point. It is quoted without Mark's preamble, in which he posits Breton as a sublanguage of Welsh. Mark is using it more as a parable to illustrate the extlang case, not as an example-in-fact.

Let me rephrase Mark's Breton case: when you want Chippewa, any old Ojibwa (its macrolanguage), which might be a language such as Ottawa, will not do. It is unintelligible (maybe: I don't actually know in this case). If it is unintelligible, you're probably better off *not* mapping Chippewa to Ojibwa. You're better off serving some useful default (for Chippewa, this is probably English, but is up to the application, not something RFC 4647 or the LSR do) or even failing.

The counter-example would be the ietf-languages adventure in Norwegian tagging. Were we better off when Norwegian was represented by the older registered values (which are quite extlang-like, eh?):

   no-nyn
   no-bok

than by the codes:

   nn
   nb
   no

Things certainly were easier when 'no' meant generic Norwegian (which, it turns out, tends to be 'nb') and Nynorsk could be represented extlang/variant-like. In code that deals with Norwegian language tags today (as well as other odd cases, such as the mistaken 'he'/'iw' pair), you often have this irksome special mapping table from the (what would be) sublanguage codes to their macrolanguage, which is what we "mean".
But you can't eat/change the original subtag--other processors might interpret or map the codes differently or be able to serve specific resources tailored to the original request. This is a flaw in the no-extlang case and why Mark and others have insisted that Macrolanguage becomes an important piece of information in the registry.

>
> * Lots of zh data right now is zh-cmn. Most of "us" seem to agree that we can't narrow the meaning of zh because it does allow Cantonese. However if some application wanted to make an assumption that zh == zh-cmn, then that seems up to the application for fallback.

Agreed. More to the point, the "zh" tag encompasses a resource containing *some* (which is to say, exactly ONE) variety of Chinese. It is probably also Simplified Chinese, for example--it certainly can't be both Simplified and Traditional. But you can't make that assumption safely! There are also differences in (for example) the accepted level of English word borrowing between (say) Taiwan and HK, etc. "zh" embodies the least common denominator, which may not be very common in certain circumstances (if you're talking about our HK web site, 'zh' probably means content that could be tagged "zh-yue-Hant-HK", whereas I'm pretty sure other sites or applications would prefer it to be pretty similar to the stuff that could be tagged "zh-cmn-Hans-CN"). Heck, when I started my internationalization career, 'zh' meant Traditional, because many of us did business in Taiwan and hardly anyone in the PRC. Now 'zh' mostly "means" Simplified.

>
> * Even for the current tags, many of the people in the teleconference seem to extend RFC 4647 in ways that are best for them. Strict use of 4647 behavior seems rare. It seems reasonable to me to expect that in the future people may continue to do so and that RFC 4647 and the registry can only provide guidelines. I don't think that it can solve all problems for all applications, and I'm fine with that.

I disagree.
RFC 4647 is fine for many purposes. I'm proposing, if we do extlangs, a small modification due to finding a slightly better version of just one algorithm (lookup). I think our ideal goal would be for RFC 4647 matching schemes to work fully for everyone for the scope they are designed for. We might make one or another tweak to the document and possibly the algorithms based on further testing. For now, the existing schemes mostly work. They exhibit varying impact on tag and range choice in either extlang or no-extlang clothes.

I disagree that you have to have far more complex matching systems in all cases. Yes, there are use cases that call for more complex systems. But most low-level users have no need of these elaborations. For basic protocols, I think we would prefer the simplest possible algorithms that produce mostly the right results. Locale systems, for example, are based on lookup (well, more like lookup is based on them), and they produce meaningful results for resource lookup pretty reliably. I have a concern that lookup work well, because I have a vested interest in the continued good behavior of systems such as CLDR, which use and depend on these simplest possible matching systems, such as lookup.

So we can change the algorithm or "change" the (as yet undefined) tags, which is what is suggested as the other route.

>
> So from these points, my conclusion is that the zh-cmn form is preferable.
>
> My reasoning is that either cmn by itself or zh-cmn will require code change in nearly all cases, either to include or exclude existing data or fallback rules. Either way will require a code change, and either way may require knowledge that zh-HK might be interesting if the request is for {Mandarin tag}-HK. Searching may need to include zh files for {Mandarin tag} queries and exclude them for {Cantonese tag} queries, but the actual tag doesn't really change this logic.
> Neither variation is likely to work with existing code when the request is for the new name and the data is tagged with the old tag.
>
> The deciding factor for me is that to know that cmn is related to zh I'd have to look in the registry, but zh-cmn contains that information. Otherwise I don't really see advantages with either method.

The question is: do we want to do this for all time? Or can we help people migrate and just get started on the migration? Yes, retagging with "yue-HK" is a PITA, but so is retagging with "zh-yue-HK". If we have to convince folks to retag their data anyway, what is the best solution/most sensible?

The benefit to extlang partially comes from the expectation that the tags will be changed but users won't update their ranges. For that case, you can switch to extended filtering and a slightly modified lookup and use extlangs and go about your business. Of course, you have to explain these fairly complex tags to the Chinese... (and Arabic, Dinka, Quechua, Zapotec, etc. speakers)

On the other hand, if you go with primary language subtags for everyone, you are expecting that users will change or augment their language priority lists (i.e. "yue-HK; zh-HK") to deal with un-retagged data or provide alternate lookup tables in your implementations. Matching algorithms do not change, but tags eventually have to catch up for the majority of documents or there is (probably quite icky) tagging chaos.

So, for me, the main issue is whether we are going to explicitly break the connection between the language and its macrolanguage (at the tag level). In some cases (Norwegian) we already know this can be problematic; in others, it may actually be desirable.

Addison

--
Addison Phillips
Globalization Architect -- Yahoo! Inc.
Chair -- W3C Internationalization Core WG

Internationalization is an architecture.
It is not a feature.

_______________________________________________
Ltru mailing list
Ltru@ietf.org
https://www1.ietf.org/mailman/listinfo/ltru
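
For reference, the "lookup" scheme discussed in the message above is the one described in RFC 4647, Section 3.4: each range in the language priority list is truncated from the end, one subtag at a time, until it equals an available tag. What follows is only a minimal sketch of that unmodified scheme, not the "slightly modified lookup" the message proposes. The function name `lookup`, its parameters, and the sample tag sets are illustrative assumptions, and "zh-yue-HK" is used purely as the hypothetical extlang-style form mentioned in the thread.

```python
def lookup(priority_list, available_tags, default=None):
    """Sketch of RFC 4647 lookup (Section 3.4); names here are illustrative."""
    # Language tags compare case-insensitively; remember the original spelling.
    available = {tag.lower(): tag for tag in available_tags}
    for language_range in priority_list:
        if language_range == "*":
            continue  # a bare wildcard range is skipped by lookup
        subtags = language_range.lower().split("-")
        while subtags:
            candidate = "-".join(subtags)
            if candidate in available:
                return available[candidate]
            # Truncate from the end ...
            subtags.pop()
            # ... and drop any single-character subtag (e.g. "x") left at the
            # end, since it is removed together with its trailing subtag.
            while subtags and len(subtags[-1]) == 1:
                subtags.pop()
    return default


if __name__ == "__main__":
    # Un-retagged data ("zh-HK") is still found when the user augments the
    # priority list, as in the "yue-HK; zh-HK" example from the message.
    print(lookup(["yue-HK", "zh-HK"], ["zh-HK", "en"]))
    # Plain truncation of the hypothetical extlang form passes through
    # "zh-yue" and lands on "zh"; it never tries "zh-HK".
    print(lookup(["zh-yue-HK"], ["zh-HK", "zh"]))
    # Without the augmented list, only the caller's default is left.
    print(lookup(["yue-HK"], ["zh-HK"], default="en"))
```

The second call shows why unmodified truncation interacts awkwardly with region subtags in extlang-style tags: the region is discarded before the extended language subtag is.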