Re: [Ltru] Consensus call: extlang

Peter Constable <petercon@microsoft.com> Fri, 30 May 2008 05:50 UTC

From: Peter Constable <petercon@microsoft.com>
To: LTRU Working Group <ltru@ietf.org>
Date: Thu, 29 May 2008 22:50:13 -0700

> From: ltru-bounces@ietf.org [mailto:ltru-bounces@ietf.org] On Behalf Of
> Broome, Karen


> Before I spend too much time picking apart your lengthy screed
> involving a scenario where the BBC presents its web site in Sudanese
> Creole Arabic with rotating languages code logic for each day of the
> week ... (ahem) ... here's my real-world Chinese language list:
>
> Chinese (Variant Unknown)
> Chinese (Cantonese, Spoken)
> Chinese (Cantonese, Written)
> Chinese (Mandarin, Spoken)
> Chinese (Mandarin, Spoken Taiwanese)
> Chinese (Mandarin, Simplified)
> Chinese (Mandarin, Traditional)
> Chinese (Taiwanese, Spoken)
> Chinese (Taiwanese, Written)


> 1             2          3         4
> zh            zh         zh        zh
> zh-yue        yue        yue       yue
> zh-yue        yue        yue       yue
> zh-cmn        cmn        zh        cmn
> zh-cmn-TW     cmn-TW     zh-TW     cmn-TW
> zh-cmn-Hans   cmn-Hans   zh-Hans   zh-Hans
> zh-cmn-Hant   cmn-Hant   zh-Hant   zh-Hant
> zh-min-nan    nan        nan       nan
> zh-min-nan    nan        nan       nan


> Comments:
>
> * Option #1 is unambiguous and shows that there is a relationship
> between these languages.

The only difference with #2 is that the relationship is not reflected directly in the tags.

Since you mention having the relationship shown, I gather you think there's a benefit, but you don't explain what that benefit is.
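
To make "reflected directly in the tags" concrete, here is a minimal Python sketch. It assumes a consumer that does simple prefix matching in the style of RFC 4647 basic filtering; the function name and the two lists (the distinct tags from columns 1 and 2 of the table above) are illustrative only, not anything the draft defines.

```python
def basic_filter(language_range, tag):
    """Prefix match in the style of RFC 4647 basic filtering:
    the range matches if it equals the tag, or is a prefix of the
    tag that ends at a '-' boundary (case-insensitive)."""
    r, t = language_range.lower(), tag.lower()
    return t == r or t.startswith(r + "-")

# Distinct tags from column 1 and column 2 of the table (assumed lists).
OPTION_1 = ["zh", "zh-yue", "zh-cmn", "zh-cmn-TW",
            "zh-cmn-Hans", "zh-cmn-Hant", "zh-min-nan"]
OPTION_2 = ["zh", "yue", "cmn", "cmn-TW", "cmn-Hans", "cmn-Hant", "nan"]

# Which tags does a request for plain "zh" pick up under each option?
matched_1 = [t for t in OPTION_1 if basic_filter("zh", t)]
matched_2 = [t for t in OPTION_2 if basic_filter("zh", t)]

print(matched_1)  # every option #1 tag
print(matched_2)  # only "zh" itself
```

Under #2 the same grouping is still possible, but only with an external lookup from "yue", "cmn", and "nan" back to "zh", since nothing in the tag itself carries that information.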


> It also preserves the legacy "zh" tag so
> developers that aren't hip to later versions of BCP 47 or 639-3 will
> have some idea what these tags mean.

I'm not sure how significant a benefit that is. Developers may encounter many subtags that they're not familiar with, and they'll need to do some learning to make sense of them. And if the feedback we got the first time we tried a last call for 3066bis (before this WG was formed) is to be believed, there are developers out there who will be confused by any new pattern for forming tags. I agree that the presence of "zh" will give them a hint, but I expect that in any of these situations, people who need to will find out what the subtags they don't recognize mean.


> The tags are maybe longer than
> they need to be, but if I need a fixed-length tag, I can wait for 639-6.

The length is not a problem, nor is the variation in length a problem.

Some people in my company have expressed concern that people (meaning people in my company, not users) will be confused by the fact that some languages will be represented as "aa", others as "aaa", and (worst of all) still others as "aa-bbb" or "aaa-bbb". While I expect that people in a technology company who need to work directly with these things ought to be able to learn about and understand extlangs, I think that if anything we've been talking about is likely to be a source of confusion, it's probably this (extlang). I'm still not sure that's a serious concern, but my point is that something like unfamiliarity with "cmn" is less of a concern.


> The languages may not be mutually intelligible in some contexts, but
> they are related.

English and Frisian are related. French and Spanish are related. Vietnamese and Khmer are related. I'm not sure how those facts are especially significant for tagging.



> * Option #2 is unambiguous, but Microsoft, Google, and Amazon won't be
> using the same tags for Chinese that I do. Even if I don't follow their
> lead, others likely will. This worries me.

The potential for different parties to use different tags exists in any of the schemes you list.

While we may not always use the same tags that you do, it isn't necessarily the case that we would never use the same tags that you do. I can't speak for other vendors, but I wouldn't at all rule out MS using the same tags you use in scenarios where your content needs to be supported.


> Also, the rules for #2 must
> include fuzzy guidelines such as, "use the 'zh' tag except when you
> think it's a bad idea" and "use the shortest tag except when you don't
> want to."

That's the same for #1, though.


> This presents complications in trying to explain some sort of
> consistent method to the LTRU madness to others. Given this, I start to
> wish ISO 639-6 a safe and speedy passage.

Have you considered what complications there would be in going to that? If all you're thinking about is "identification" (how to declare an identity), then I can imagine you might not have. Being on the side of thinking about how to consume tags and do something with them, I don't see that as a rosy and simple alternative.


> * Option #3 is what I believe you might suggest, but for me, that's the
> worst list of all. There are five ambiguous "zh" categories on that
> list. It follows the "always use the shortest tag" rule and respects
> history, but it's useless to me from an identification perspective.

I won't argue against your concerns here: they are valid. You do need to acknowledge, though, that the tags you find problematic in this case (zh, zh-TW, zh-Han?) would all be valid in *any* case, no matter what we decide, and these *will* be encountered.


> * Option #4 has three ambiguous tags and means I have to explain to
> people who aren't in this industry about why I use different tags for
> the same language. This strategy is less ambiguous than #3, but I'm not
> sure I can explain it to other content creators for the same reasons as
> #2 and presents the spoken/written complication others may not want. In
> the long run, this seems messy and unclear enough that it will result
> in bad tagging.

Again, with or without extlang, some kind of guidance needs to be given, and some reasonable means must be available for those who must somehow support legacy processes and data.
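
For anyone who wants to check the counts quoted above (five ambiguous tags in #3, three in #4), here is a small Python sketch over the four columns of the table. The helper name and the test for "ambiguous" are my own assumptions: a tag counts as ambiguous if its primary subtag is "zh" and it has no three-letter second subtag naming a specific variety.

```python
# The nine rows of the table, one list per option (column).
OPTIONS = {
    1: ["zh", "zh-yue", "zh-yue", "zh-cmn", "zh-cmn-TW",
        "zh-cmn-Hans", "zh-cmn-Hant", "zh-min-nan", "zh-min-nan"],
    2: ["zh", "yue", "yue", "cmn", "cmn-TW",
        "cmn-Hans", "cmn-Hant", "nan", "nan"],
    3: ["zh", "yue", "yue", "zh", "zh-TW",
        "zh-Hans", "zh-Hant", "nan", "nan"],
    4: ["zh", "yue", "yue", "cmn", "cmn-TW",
        "zh-Hans", "zh-Hant", "nan", "nan"],
}

def is_bare_zh(tag):
    """True if the tag says 'zh' without naming a specific variety.

    Assumption for this sketch: a three-letter alphabetic second
    subtag (as in zh-yue, zh-cmn, zh-min-nan) names a variety, while
    a two-letter region or four-letter script subtag does not."""
    subtags = tag.split("-")
    if subtags[0].lower() != "zh":
        return False
    names_variety = (len(subtags) > 1
                     and len(subtags[1]) == 3
                     and subtags[1].isalpha())
    return not names_variety

# Count rows per option whose tag is "zh" with no variety named.
counts = {n: sum(is_bare_zh(t) for t in tags) for n, tags in OPTIONS.items()}
print(counts)
```

Note that every option has at least one such row, because "Chinese (Variant Unknown)" is tagged "zh" in all four columns.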


> * Options #2,3,4: In general, it worries me that RFC 4646bis offers so
> many "preferred" options for the same thing. I really can't see how
> this simplifies things for anyone.

There is no avoiding certain options for Chinese, whether with or without extlang, whether with RFC 4646bis or some entirely new scheme based on ISO 639-6 or whatever (since IETF language tags as we know them aren't about to disappear).



Peter