Re: [Ltru] Consensus call: extlang

Peter Constable <petercon@microsoft.com> Wed, 28 May 2008 16:13 UTC

From: Peter Constable <petercon@microsoft.com>
To: LTRU Working Group <ltru@ietf.org>
Date: Wed, 28 May 2008 09:13:58 -0700
Subject: Re: [Ltru] Consensus call: extlang

> From: ltru-bounces@ietf.org [mailto:ltru-bounces@ietf.org] On Behalf Of
> Broome, Karen

> Of course, if you had actually asked for Mandarin using a precise code
> for Mandarin (either "cmn" or "zh-cmn"), that list doesn't look so ugly,
> does it?
>
> But you didn't ask for Mandarin, you asked for Chinese...

I completely agree. I also note some validity in Mark's concern: there are a lot of users that have been and are still using zh rather than cmn/zh-cmn, and probably a large proportion of them are expecting Mandarin content. But as your following comments note, there's no silver bullet for dealing with that.


> This is where I think we are prioritizing generalized fallback over
> precision in identification -- even if what we are precisely
> identifying is intentionally vague. I lose the ability to identify
> "some kind of Chinese, I don't know what kind" if zh is thought to be
> synonymous with Mandarin. (A real need given the history of these tags.)

I think there's general consensus here that zh is *not* thought to be synonymous with Mandarin; there's just a high-ish degree of statistical correlation in existing usage.


> I also will not know what the user intended if they use the tag "zh".

At best, we can only make a statistical inference: based on usage to date, Mandarin is most likely meant.


> When it becomes possible to tag more specifically, we won't be able to
> know whether the decision to use the more general tag ("zh") was a
> conscious choice or not.

In some closed systems, it may be possible to know, but in an open system involving interchange with other parties in other spheres of control, no.


> zh is a bad tag; its semantics have always been muddy. We are going to
> rather great lengths to avoid deprecating it.

Deprecating something that has been so widely used is a big step, and it may take some time for people to buy into that. But you are right in saying it is problematic: if there are other Chinese choices besides Mandarin, zh just leaves people forced to make assumptions or guesses that may not be valid.

I'd point out that all these things are true whether we use extlang or not, and that extlang is helpful in relation to that problem only to the extent that it can help processes take actions that will match user expectations better.

Initially I thought it might, on the basis that anyone asking for zh-xxx could, by right-truncation fallback, still get matches with existing zh content. I was assuming that would generally be a useful response from the user's perspective. While it probably is for Chinese, it isn't in the general case, across all macrolanguages, if for no other reason than that there isn't a bunch of existing content tagged aym or bal or bik etc. that people want to continue getting matches for.

Also, I was thinking of filtering (in RFC 4647 terms), and for filtering I got it backwards: right-truncation fallback allows a request for zh to match content tagged zh-xxx. In that case it *isn't* helpful in the general case: if someone is making a request using a macrolanguage ID, they're still looking for some particular variety, and in most cases returning *any* encompassed variety probably isn't helpful.
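
To make the filtering direction concrete, here is a minimal sketch of basic filtering as I read RFC 4647, section 3.3.1. The function name basic_filter and the sample tag list are made up for illustration; zh-cmn and zh-yue are the extlang-style forms under discussion.

```python
def basic_filter(language_range, tags):
    """Return every tag matched by a basic language range.

    A range matches a tag if the two are equal (ignoring case) or if the
    range is a prefix of the tag and the next character is "-".
    The range "*" matches every tag.
    """
    rng = language_range.lower()
    matched = []
    for tag in tags:
        t = tag.lower()
        if rng == "*" or t == rng or t.startswith(rng + "-"):
            matched.append(tag)
    return matched


# Hypothetical content tags, for illustration only.
content = ["zh", "zh-cmn", "zh-yue", "zh-Hant", "en"]

print(basic_filter("zh", content))      # every Chinese variety comes back
print(basic_filter("zh-cmn", content))  # existing plain "zh" content is not matched
```

So under filtering, a request for zh picks up zh-cmn and zh-yue alike, while a request for zh-cmn does not reach content tagged plain zh.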

Rather than filtering, it's *lookup* for which a request for zh-cmn would match existing zh content. But I think it's fair to say that lookup is most typically used for UI resource matching or similar scenarios in which the options for available content are likely to be fairly limited and controlled, with users being presented with the set of specific options available. In that kind of controlled environment, fallback from (zh-)cmn to zh is less likely to be needed.
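
For comparison, a minimal sketch of lookup as I read RFC 4647, section 3.4: the requested range is truncated from the right, one subtag at a time, until something available matches. The function name lookup, the default parameter, and the sample tags are assumptions for illustration, not anything taken from a real implementation.

```python
def lookup(language_range, available, default=None):
    """Return the single best available tag for a range, or default.

    The range is progressively truncated from the right. A trailing
    single-character subtag (such as "x") is dropped together with the
    subtag that followed it.
    """
    by_lower = {tag.lower(): tag for tag in available}
    subtags = language_range.lower().split("-")
    while subtags:
        candidate = "-".join(subtags)
        if candidate in by_lower:
            return by_lower[candidate]
        subtags.pop()
        # Do not leave a dangling singleton at the end of the range.
        while subtags and len(subtags[-1]) == 1:
            subtags.pop()
    return default


# Hypothetical UI resource sets, for illustration only.
print(lookup("zh-cmn", ["zh", "en"]))      # falls back to plain "zh"
print(lookup("zh-cmn", ["zh-cmn", "zh"]))  # exact match wins
print(lookup("zh", ["zh-cmn", "en"]))      # no match: lookup never widens a range
print(lookup("cmn", ["zh", "en"]))         # no match: bare "cmn" has nothing to truncate to
```

Note that truncation alone only gets from zh-cmn to zh; a request for bare cmn would need some additional mapping to reach zh content.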



Peter