[Ltru] my technical position on extlang
"Mark Davis" <mark.davis@icu-project.org> Sun, 18 May 2008 18:49 UTC
Date: Sun, 18 May 2008 11:49:42 -0700
From: Mark Davis <mark.davis@icu-project.org>
To: Martin Duerst <duerst@it.aoyama.ac.jp>
Cc: LTRU Working Group <ltru@ietf.org>
Here are some thoughts on extlang. The more readable version is at:
http://docs.google.com/Doc?docid=dfqr8rd5_676kxxxjhd&hl=en

Copied here for the archive:

Extlang

The argument for extlang is that it gives superior results, and is thus worth the complication of making some languages unavailable as language tags in their own right, usable only in a secondary position. I believe that when extlang is examined carefully by people who have implemented language tag lookup, it will be clear that on balance most people will be worse off than if we retain the structure of RFC 4646. We should not complicate the structure to the overall detriment of implementations, and we should not put encompassed languages in the inferior, extlang position.

*Links*

- This document: http://docs.google.com/Doc?id=dfqr8rd5_676kxxxjhd
- BCP 47: http://www.rfc-editor.org/rfc/bcp/bcp47.txt
- Current draft: http://www.inter-locale.com/ID/draft-ietf-ltru-4646bis-14.html
- Macrolanguage mappings: http://www.sil.org/iso639-3/macrolanguages.asp

Process

I looked back over the emails, since people may not remember everything that was discussed around the time that we came to (rough) consensus on extlang. While Shawn claimed that the topic should be reopened -- well after the last call went out -- as far as I can tell there is in fact *no* new information being presented that was not available on the list 6 months ago.

- everything mentioned recently about the use of 'zh' and 'cmn' was discussed long ago in the email of September through December last year.
- the teleconferences were discussed at length during September (see "[Ltru] 70th IETF - ltru session?"), and we ended up with a few sessions. We made efforts to have them at times that would work for all those interested in discussing the issue; we also had jabber available for those who couldn't phone in. Nobody objected to the time except for Peter, and we delayed a week so that we could then include him. A number of people from both 'camps' evidenced interest in having them. We took notes and distributed them. (Doug bailed out at one point: "Go ahead and include me out of future teleconferences, and feel free to move them back to a more comfortable time for all.")
- as a result of those telecons, we were able to come to rough consensus.
- by December, we were moving on to whether to remove extlang from the ABNF or not, and settled on doing that.
- the results of the telecon were announced on the list. Nothing prevented people who might have felt left out from contesting the results at that point or any time since, including the last call.
- there have been no new technical reasons provided since December that would give us any reason for overturning the result of the telecon, nor for believing that we would get consensus for having encompassed languages be extlangs in RFC4646bis.
- moreover, we now have workable text in draft 14 for handling the macrolanguage/encompassed language issues without recourse to a new, untested mechanism.

Since this issue seems to be reopening (for no good reason that I can see), I put together some responses from past emails on the topic. They are not wonderfully organized; I just tried to cast them as Q/A pairs to make them a bit more readable. There is definitely some repetition that ought to be edited away. The tone may seem too harsh at times -- sometimes that was in the heat of the moment, and I haven't had time to moderate the text, so I apologize in advance for any offense I may give.

Q. Where would extlang make a difference?

A. As RFC 4647 describes, there are two main processes for matching using language tags: filtering and lookup. It is worth looking at how the two different macrolanguage models affect RFC 4647. If the primary reason cited for the extlang model is compatibility, we have to see what effects each model has on commonly used matching functions, which is where the rubber hits the road.

For filtering, extlang offers no particular advantage.
Let's look at queries for "ar-ary" (Moroccan Arabic) vs "ary". In either case I need a way to match all and only Moroccan Arabic; I must not fall back to "ar". If I fall back to include all Arabic, the actual content that is in Moroccan Arabic would be completely and utterly swamped by Standard Arabic. So for filtering, the extlang model just gives us a more complicated syntax, with no benefit.

The only possible advantage of extlang would be in lookup.

Q. Isn't extlang better for lookup?

A. Take an example:

- My site has support for zh, zh-Hans, and zh-Hant. zh has Mandarin content, since that is what 99.999% of web sites mean by "zh" currently. As is customary in fallback, my site's 'zh' content uses the predominant form (zh-Hans). [This is just an example; a TW site can use the opposite convention.]
- A user comes in with different requests, listed below.

*Scenario 1.* The user's browser has the proposed "zh-yue-Hant-US". My lookup falls back to zh, so I serve it up to the user. So even if the target of the match (zh) is not Cantonese, you want a fallback to zh. I'm guessing that you see this as better than if we defined the tag as "yue-Hant-US", since it gets to some fallback that the user is likely to understand. But I don't see this as much different than if we had fr-br-BE (meaning Breton, but falling back to French), or ro-mo (meaning Moldavian, but falling back to Romanian). *And note that in the fallback, the script and region are completely lost.*

*Scenario 2.* The user's browser has zh-cmn-Hant-US. In matching, we fall back to zh. Note that in the fallback, the script and region are completely lost. *We have essentially just introduced a synonym for zh which causes fallback to lose information, for no good reason.*

Q. What's wrong with simply including extlang?

A.
If we bake it in, then every simple algorithm will in practice automatically fall back from Cantonese to Mandarin, from Dari to Persian, from Khetrani to Lahnda -- and in doing so, strip the script and country information. That is, unless implementers fix the algorithm to pretend that the secondary language is in fact a primary language. So we are forcing people into a model that is often, or mostly, wrong.

*If we supply the information in the registry, then implementations can choose whatever they think is appropriate, given the particular facts about languages and the particular needs of their applications, without having to work around the extlang mechanism.*

I'm reminded again of a similar case with C++. The assignment operator gets a default implementation. That must have seemed like a nice convenience for the user, but *except for toy programs, it is always, always wrong.* So supplying that default just means that people usually have to take extra steps to disable it, and prevent it from causing bugs in their programs.

I'm worried about this being similar. It is clear that companies like Google or Yahoo can work around the problems with extlang -- what I'm worried about are the people who don't have a lot of experience with these matters, and are just led down a garden path. We need to look long and hard at the experience of people who have had detailed implementation experience with filtering and matching these tags in production environments.

Q. Where are some cases where extlang works particularly badly?

A. Extlang plays especially badly in many cases. Suppose that we have macrolanguage m1, and microlanguages x1 and x2. By the design of ISO 639, we *can't* assume that a speaker of x1 can also speak x2 or vice versa. If a user has as accept language the list <x1-Ssss-Rr, en, fr>, it works fine without extlang: she gets the fallback

(A) Script fallback

   1. x1-Ssss-Rr
   2. x1-Ssss
   3. x1
   4. en
   5. fr

If she also speaks/reads x2, then she can specify <x1-Ssss-Rr, x2, en, fr> or <x1-Ssss-Rr, en, fr, x2>; that is, putting x2 in the list in the position she wants it. If x2 is the predominant microlanguage, meaning that m1 is essentially always assumed to be x2, then the priority list can be <x1-Ssss-Rr, en, fr, m1>, again with m1 wherever it belongs. Thus, the user gets something she can understand, based on the list she supplied.

If we are using extlang, and x2 is the content for m1, then we get the fallback

(B) Script + extlang fallback

   1. m1-x1-Ssss-Rr
   2. m1-x1-Ssss
   3. m1-x1
   4. m1
   5. en
   6. fr

That has two problems. First, the script and region are lost. That can be fixed by hacking the fallback (although there is a *lot* of installed base that won't do this), to

(C) Hacked Script + extlang fallback

   1. m1-x1-Ssss-Rr
   2. m1-x1-Ssss
   3. m1-x1
   4. m1-Ssss-Rr
   5. m1-Ssss
   6. m1
   7. en
   8. fr

But even more importantly, we are disabling the user's explicit choice. If the user doesn't speak x2 (or whatever the content of m1 is), she's screwed. There is no way that she can indicate that she wants x1 *but no other version of m1 because she can't understand them.* We'd have to change the fallback to be quite substantially different to get around this, with

(D) More Hacked Script + extlang fallback

   1. m1-x1-Ssss-Rr
   2. m1-x1-Ssss
   3. m1-x1
   4. en
   5. fr
   6. m1-Ssss-Rr
   7. m1-Ssss
   8. m1

This is, however, still only appropriate if it is likely that a user of x1 speaks *whatever happens to be the content for m1*. That is an extremely shaky assumption.

If Peter Constable said that, for each and every macrolanguage on http://www.sil.org/iso639-3/macrolanguages.asp, there is at least one microlanguage that all speakers (or even most speakers) of each of the other microlanguages would understand, I'd say: fine, let's do extlang and incorporate that information into the registry, with the "default microlanguage" for each macrolanguage.
Then, for example, implementers would know that, say, "fuf <http://www.sil.org/iso639-3/documentation.asp?id=fuf> Pular" is understood by all the speakers of the microlanguages under "ful <http://www.sil.org/iso639-3/documentation.asp?id=ful> Fulah", so we can tell people to always have the content of "ful" be "fuf", and bake the macrolanguages in as extlang.

The point of the suggested text is that if your application wants to use macrolanguages to support extlang-equivalent fallback, there is nothing stopping you from doing so. If there are particular environments where an extlang-like fallback is right for a particular language community, it is simple to do. But we don't need to bake shaky assumptions into the structure of language tags.

Q. Isn't extlang just like script fallback?

A. The problem with extlang is that the fallback from encompassed language to macrolanguage is fundamentally different in kind from a fallback from region to script to base language. In the case of script or region, like uz-Arab vs uz-Latn, or en-US vs en-GB, we really have variations on the same language, and fallback makes sense. We ordered the subtags so that it works optimally overall.

The encompassed languages, on the other hand, are not just dialects, not just variants. *They are languages in their own right.* Trying to insert them into the fallback process just screws things up, because they need a "sideways" matching, not just simple truncation fallback. If you want to do any fallback with extlang, it would be to fall back from zh-yue-<other stuff> to zh-<other stuff>. That means that in order to do reasonable fallback, you can't just use truncation fallback anyway.

So I see the situation this way:

   1. The only reason for adding the complication of the extlang mechanism is to make truncation fallback work better.
   2. Truncation fallback with extlang doesn't work better.
   3. So there is no need to make encompassed languages be "secondary" languages by making them be "secondary" subtags.
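The contrast between plain truncation and the "sideways" fallback described above can be made concrete with a small sketch. This is only an illustration, not text from any draft: the function names are hypothetical, tags are assumed to be well-formed and already case-normalized, and an extlang is assumed to be recognizable as a three-letter subtag in second position (the proposed zh-yue style).

```python
def truncation_chain(tag):
    """Progressively truncated forms of a tag, most specific first."""
    subtags = tag.split("-")
    chain = []
    while subtags:
        chain.append("-".join(subtags))
        subtags.pop()
        # RFC 4647 lookup also drops a trailing single-character subtag.
        if subtags and len(subtags[-1]) == 1:
            subtags.pop()
    return chain


def sideways_chain(tag):
    """Hypothetical hacked fallback: try the extlang forms first, then
    retry with the extlang removed so script and region survive."""
    subtags = tag.split("-")
    # Assumption: a 3-letter alphabetic second subtag is an extlang.
    has_extlang = (len(subtags) > 1 and len(subtags[1]) == 3
                   and subtags[1].isalpha())
    if not has_extlang:
        return truncation_chain(tag)
    with_extlang = [t for t in truncation_chain(tag) if "-" in t]
    without_extlang = truncation_chain("-".join([subtags[0]] + subtags[2:]))
    return with_extlang + without_extlang


def lookup(priority_list, available, chain_fn, default=None):
    """Return the first available tag reached by walking each chain in order."""
    for tag in priority_list:
        for candidate in chain_fn(tag):
            if candidate in available:
                return candidate
    return default


site = {"zh", "zh-Hans", "zh-Hant", "en"}
print(truncation_chain("zh-yue-Hant-US"))
print(lookup(["zh-yue-Hant-US"], site, truncation_chain))  # zh
print(lookup(["zh-yue-Hant-US"], site, sideways_chain))    # zh-Hant
```

With plain truncation the request lands on zh and the Hant preference is lost; only the hacked chain reaches zh-Hant. Neither variant lets the user say "x1 but no other form of m1".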
The goals of extlang are good: to make matching work better. But in practice it just makes things worse. [Speaking to those familiar with C++, it feels a bit like the default assignment operator in C++. Nice in theory, but in practice it gums things up more than it fixes, since once you are beyond very simple (toy) classes, the default is almost always wrong -- but because it is supplied behind your back you don't realize it.]

So instead of adding the extlang mechanism to RFC 4646, what we really need to do is to point people to how to handle yue and other encompassed languages along with mo/ro, tl/fil, and other edge cases in a reasonable way, by augmenting matching.

Q. Where might the macrolanguage be useful?

A. An implementation may choose to use that information in falling back from some encompassed languages to macrolanguages. For example, given the language priority list with Cantonese in Traditional script as used in Hong Kong, followed by French ("yue-Hant-HK, fr"), the lookup could be the following:

   1. yue-Hant-HK
   2. yue-Hant
   3. yue-HK
   4. fr
   5. implementation defined default:
      5a. zh-Hant-HK
      5b. zh-Hant
      5c. zh
      5d. en

Whether such fallback should be used -- and if so, the precise way in which such a fallback is done -- is application-dependent. Where it is very likely that the audience requesting Cantonese (as above) will accept and understand Mandarin (the predominant content for 'zh'), then this fallback might be useful. Where there is a risk that the audience requesting Cantonese will not be conversant with Mandarin, and would prefer an alternative in the language priority list, it should be avoided. (This might be the case, for example, with audio using yue-Zxxx-US.)

Q. Why not have extlang for using macrolanguages if the suggested text adds macrolanguages back into the fallback chain?

A. The suggested text doesn't add it back in. It only says that *IF* an application wants to do extlang-equivalent fallback, the text in BCP 47 already allows for that.
We really have *no* idea whether using macrolanguages in the fallback chain is a good idea or not. Some people think it will be an advantage for a few specific examples that are cited. (Cantonese comes up, but when I tested some of the assumptions with Cantonese speakers, they didn't quite hold up.) But nobody has substantiated that it will give better results for all or most macrolanguages, nor given even a rough list of those for which it will be better. Nor has anyone effectively argued that the situation between yue and zh is substantially different from the situation between gsw and de, where we get along just fine without extlang.

So the suggested text just provides it as an option, and leaves it up to the application.

Q. How should we look at macrolanguages?

A. I think one of the things that we realized when looking at how this would work in practice is that we are better off if we treat macrolanguage as a piece of *perhaps* useful information for matching, but one that can be enhanced (that is, changed) over time as more information becomes available. Hard-coding it into extlang doesn't serve that purpose, and causes other problems, notably that the other fields are lost in fallback: if we had zh-yue-Hant, then by the time we get to zh in fallback, we've lost the Hant. So that is the origin of the text we are proposing.

There are a number of edge cases such as deprecated codes, closely related languages, or practical fallbacks (e.g. if someone speaks X they are likely to speak Y, even if X and Y are not linguistically related) that are simply unsuited to hard-coding in the tag. If I hit a code like "iw-PL", I want to match that with "he-PL", not depend on some kind of fallback between "iw" and "he"; otherwise it loses information (the PL).
We do provide that kind of information in the Deprecated field, and with the Macrolanguage field we would provide more (for example, it would make clear the relation between no, nb, and nn and how that could be used in matching). Other useful information is the set of scripts used with a language in practice; Suppress-Script supplies just a little information, but doesn't tell me that Uzbek is customarily written with Arabic, Latin, or Cyrillic, but not with (say) Tagalog. The more information there is available, whether it be in the language subtag registry or somewhere else, the better a job people can do in dealing with some of the edge cases that turn up in matching.

Q. Doesn't the macrolanguage relationship uniquely define the best fallback?

A. Certain languages are closely related, and the lookup process may take that into account. Macrolanguage is just one factor that may (or might not) be useful.

For example, since the tag "gsw-CH" (for Swiss German as used in Switzerland) only became available on 2006-12-08, Swiss German ("Schwyzerduetsch") text may have been tagged with "de-CH" instead. ISO 639 was not (and is still not) clear on whether "de" meant only High German or also included variants such as Low German. Thus Swiss German material may have been, and may still be, tagged with "de". Essentially all Swiss German speakers are comfortable in High German, so where Swiss German is not available, High German is a very good fallback.

Thus when given the language priority list "gsw-CH, fr-CH", an implementation using lookup may augment the default values to also include the lookup of related values, such as the following search order:

   1. gsw-CH
   2. gsw
   3. fr-CH // next language
   4. fr
   5. implementation defined default:
      5a. de-CH // special fallback from gsw-CH
      5b. de
      5c. en // root

In this way, other likely possibilities are tried before the final fallback to the root value.
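The search order above can be generated mechanically. The following is a minimal sketch under stated assumptions: the RELATED table is an illustrative, application-chosen mapping (it is not registry data), the function names are hypothetical, and "en" stands in for the root default.

```python
# Illustrative, application-chosen pairs (an assumption, not registry data).
RELATED = {"gsw": "de", "br": "fr"}


def truncations(tag):
    """Simple truncation chain: 'gsw-CH' -> ['gsw-CH', 'gsw']."""
    parts = tag.split("-")
    return ["-".join(parts[:n]) for n in range(len(parts), 0, -1)]


def search_order(priority_list, related=RELATED, root="en"):
    """Truncate every requested tag first, then try related languages with
    the script/region subtags carried over, then the root default."""
    order = []

    def add_all(tags):
        for t in tags:
            if t not in order:
                order.append(t)

    for tag in priority_list:
        add_all(truncations(tag))
    for tag in priority_list:
        primary, sep, rest = tag.partition("-")
        if primary in related:
            # Swap only the primary subtag; keep the rest (e.g. the region).
            add_all(truncations(related[primary] + sep + rest))
    add_all([root])
    return order


print(search_order(["gsw-CH", "fr-CH"]))
# ['gsw-CH', 'gsw', 'fr-CH', 'fr', 'de-CH', 'de', 'en']
```

Because the related-language step runs only after the whole priority list is exhausted, the user's explicit ordering is never overridden by the application's guess.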
Note that typically the fallback to related languages should include the script and region codes if available. In this way, the lookup process may take into account what languages people are likely to understand, given a language priority list. Similarly, the close relations between Romanian and Moldavian, Tagalog and Filipino, Serbo-Croatian and Croatian, and so on may all be useful in doing related language lookup.

This is not restricted to related languages. For example, a Breton speaker is very likely to also understand French. Thus, given the language priority list "br-FR, de", the implementation may choose to use the following lookup:

   1. br-FR
   2. br
   3. de
   4. implementation defined default:
      4a. fr-FR // special fallback from br (Breton)
      4b. fr
      4c. en

Q. When I see "zh" and "cmn", I have no way of telling that they're related without looking at the registry, which basically means a hard-coded table. If "zh" is preferred, then I may want to move from "zh" to "cmn" or whatever.

A. *You can't tell that from the registry either!* Sometimes microlanguages are related in such a way as to be a good fallback, but *usually they are not.* The handful of actual cases where we think it might be a good idea are listed in the text around Table 8, *not* in the registry.

Knowing that X is a macro/microlanguage does not necessarily mean -- *and usually doesn't mean* -- that you want to use it in fallbacks. If there is no predominant form, then it's a crapshoot as to whether the macro/microlanguage is a good fallback.

There is no special runtime information in the registry. When a new macro/microlanguage shows up in the registry *and* you support one of the pair, then it may be useful to review whether or not you want to add a fallback. Automatically updating fallbacks blindly according to the registry macro/microlanguages is not a good idea. Nor is it complete, since it misses tl/fil, ro/mo, and many others that are much higher frequency cases than most of the macro/microlanguages.

If an implementation provides a UI for selecting language priority lists, it may be better to give the user the option of having explicit fallbacks (such as from Cantonese to Mandarin or Tagalog to Filipino), rather than trying to guess the user's intent (and run the distinct risk of getting it wrong). For that purpose, when a user adds a language to the priority list, the UI may suggest macrolanguages, or other related languages, as additional fallbacks.

Q. In lookup, if there is a predominant form how is it best (in practice) to deal with the macrolanguage?

A. For most programs, I believe that treating them as synonyms is the right thing to do, and alternative approaches would be extremely counterproductive. And this goes for any of the macrolanguage cases where there is a predominant encompassed language with long usage in the computer industry.

Let's take a scenario where this is not done (a la Ewell). Suppose that a user picks Arabic as her Accept-Language in her browser. Any existing browser will represent that with "ar". Then she goes to the BBC site. The entire site is translated, not into Standard Arabic, but into Sudanese Creole Arabic. The user complains, since she can't understand it, and the BBC responds that they are just following the standard to the letter and spirit: "ar" means any kind of Arabic whatsoever, and so in the interests of fairness, they pick a different encompassed language to serve up each day. They inform the user that it is her fault for using 'ar' if she really only wants Standard Arabic.
So they have the following schedule (each code is documented at http://www.sil.org/iso639-3/documentation.asp?id=<code>):

   Monday     aao  Algerian Saharan Arabic
   Tuesday    abh  Tajiki Arabic
   Wednesday  abv  Baharna Arabic
   ...        acm  Mesopotamian Arabic
              acq  Ta'izzi-Adeni Arabic
              acw  Hijazi Arabic
              acx  Omani Arabic
              acy  Cypriot Arabic
              adf  Dhofari Arabic
              aeb  Tunisian Arabic
              aec  Saidi Arabic
              afb  Gulf Arabic
              ajp  South Levantine Arabic
              apc  North Levantine Arabic
              apd  Sudanese Arabic
              arb  Standard Arabic
              arq  Algerian Arabic
              ars  Najdi Arabic
              ary  Moroccan Arabic
              arz  Egyptian Arabic
              auz  Uzbeki Arabic
              avl  Eastern Egyptian Bedawi Arabic
              ayh  Hadrami Arabic
              ayl  Libyan Arabic
              ayn  Sanaani Arabic
              ayp  North Mesopotamian Arabic
              bbz  Babalia Creole Arabic
              pga  Sudanese Creole Arabic
              shu  Chadian Arabic
              ssh  Shihhi Arabic

If 'ar' means any Arabic, without any preference, this would be a perfectly reasonable thing to do. But for users, it would hardly be satisfactory. And this would be a bizarrely stupid thing for the BBC to do.

Someone might respond that, well, everyone needs to convert over to cmn for Mandarin since it is now the Right Thing to Do. Even if that magically happened, it would take years, and during the transition we would get all kinds of screwups with different programs transitioning at different paces. And there isn't much magic around; it is very hard to get people to change infrastructure that works just fine -- you have to give them a compelling case for why users are served better by the change. And that would be a very hard sell, since there isn't any real advantage.

The right approach for the BBC is to treat a request 'ar' as a request for Standard Arabic, *just as they have always done*. Internally, that means treating 'ar' and 'arb', and any language tag that starts with them, as a request for Standard Arabic. That is, treating ar-EG and arb-EG, or ar-SA and arb-SA, or other combinations, as synonyms for the purpose of lookup.

Now, this "treating as synonyms" could be done in different ways. One way is to mash on input; the other is to have a fancier fallback, eg arb-EG => ar-EG => arb => ar (mutatis mutandis, when starting with ar-EG).

Nor was this solved at all by extlang -- as we discussed at some length, it discarded all script and region info when falling back, and produces worse results in many cases, especially where the macrolanguage does not have a predominant form.
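Both ways of "treating as synonyms" mentioned above are short to write down. The sketch below is an illustration only: the PREDOMINANT table holds just the two pairs discussed in this message (ar/arb and zh/cmn), the function names are hypothetical, and tags are assumed to be well-formed and consistently cased.

```python
# Macrolanguage -> predominant encompassed language. Only the pairs
# discussed here; any real table is an application decision.
PREDOMINANT = {"ar": "arb", "zh": "cmn"}
TO_MACRO = {enc: macro for macro, enc in PREDOMINANT.items()}
SWAP = {**PREDOMINANT, **TO_MACRO}


def mash(tag):
    """Option 1, mash on input: rewrite arb-* to ar-* before any matching."""
    primary, sep, rest = tag.partition("-")
    return TO_MACRO.get(primary, primary) + sep + rest


def synonym_chain(tag):
    """Option 2, fancier fallback: at each truncation level, try the tag
    as given and then the same tag with the synonymous primary subtag."""
    parts = tag.split("-")
    chain = []
    for n in range(len(parts), 0, -1):
        chain.append("-".join(parts[:n]))
        if parts[0] in SWAP:
            chain.append("-".join([SWAP[parts[0]]] + parts[1:n]))
    return chain


print(mash("arb-EG"))           # ar-EG
print(synonym_chain("arb-EG"))  # ['arb-EG', 'ar-EG', 'arb', 'ar']
print(synonym_chain("ar-EG"))   # ['ar-EG', 'arb-EG', 'ar', 'arb']
```

Mashing on input is simpler and keeps stored data uniform; the chain variant leaves tags untouched, which may matter when content tagged both ways already exists.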
The "treating as synonyms" strategy is always going to be the right answer. There are undoubtedly scenarios where this strategy is not necessary, although I can't think of any off the top of my head.

Moreover, I think one of the more productive things we can do is to push for the incorporation of Language Priority Lists in any query-like protocols. That way I could say I'd like "ary, fr, ar" if my preferred ordering is Moroccan Arabic, then French, then as a last resort, Standard Arabic.

Q. How do I use fancier fallback for the predominant form?

A. Here is a more detailed case:

   1. cmn-Hant-HK
   2. zh-Hant-HK
   3. cmn-Hant
   4. zh-Hant
   5. yue-HK
   6. zh-HK
   7. yue
   8. zh

Q. Haven't people always interpreted zh as meaning anything from Mandarin to Hakka to Min?

A. That's very unclear. People usually choose 'zh' not literally, but through a UI that shows a human-readable form. So the question is, how many people have looked at interfaces that say simply "中文" and think

   1. "that could mean Mandarin but could also mean Hakka", vs
   2. "that means just Mandarin, they don't offer Hakka so I'll pick something else", vs
   3. "that means just Mandarin, they don't offer Hakka, but Mandarin is the closest to Hakka that is offered so I'll pick that."

In lookup on computer systems it is clear that nobody's expectation is that by picking 'zh', they will get Hakka. And anything but Mandarin is a vanishingly small percentage of tagged text; the same is true of 'ar', where anything but Standard Arabic is a vanishingly small percentage.

As Karen said on a related topic:

*"My experience is that the users who need to specify Cantonese most often make up an illegal tag. Not saying that's what we should recommend, but I believe my experience does not support the statement as worded."*

Q. Isn't the written form of Cantonese the same as Mandarin?

A. No, no more than the "written form of Swiss German is the same as High German".
What is true is that when the Swiss write, they write in High German; that's different. They are using different words than what is spoken, and very different syntax: "I bi doo gsy." => "Ich war hier."

Here is what I have on the subject from John Jenkins:

> I believe you said that a Mandarin speaker can read written Cantonese, but will not understand everything (a bit like a Dane reading Swedish).

More like French and Spanish, actually.

> Some characters would not (normally) be used in Mandarin.

Most famously U+4E5C, the Cantonese for "what".

> Some characters would have different meanings than in Mandarin

Best illustrated with U+4FC2, which means "to bind" in Mandarin but is frequently borrowed for the Cantonese word for "to be."

> Some syntax would be different

Yes, but I can't think of any examples off the top of my head and the book I've got that lists the differences is successfully playing hide-and-seek at the moment. There are actually not an awful lot of these. The main differences between the two are phonetic and lexical. The grammars are very similar.

> Can you point me to some web pages with written Cantonese that would demonstrate that to a Chinese reader?

Nothing better to start with than the Cantonese Wikipedia article on Cantonese: <http://zh-yue.wikipedia.org/wiki/%E7%B2%B5%E8%AA%9E> (粵語). Similarly, <http://zh-yue.wikipedia.org/wiki/%E9%A6%99%E6%B8%AF> (香港), <http://zh-yue.wikipedia.org/wiki/Unicode>, and pretty much everything else in the Cantonese Wikipedia.

I can give you the relative character frequencies in the Cantonese and traditional (or simplified Chinese) Wikipediae, if you like. I've still got that data around somewhere.
*The thing you have to emphasize is the difference between what-Cantonese-speakers-generally-read-and-write, which is just Mandarin with Cantonese phonetics, and writing-down-what-Cantonese-speakers-actually-speak, which is what "written Cantonese" should be used to mean. Unfortunately, not everybody groks this. Fortunately, Wikipedia does.*

Meanwhile, I quote from Stephen Matthews and Virginia Yip, _Cantonese: A Comprehensive Grammar_ (London: Routledge, 1994), pp. 5-6:

"Traditionally, Cantonese has been regarded as one of the many Chinese dialects. It does not have a standardized written form on a par with standard written Chinese. No form of written Cantonese is taught in schools or used in academic settings in any Cantonese-speaking community. When it comes to the written form, it is standard written Chinese that is taught and learnt. For educated Cantonese speakers, standard written Chinese is the written form they use in most contexts. However, in colloquial genres such as novels, popular magazines, newspaper gossip columns, informal personal communications, written Cantonese may be used. When written Cantonese contains too many exclusively Cantonese words and expressions, non-Cantonese speakers may find it totally unintelligible."

Since they wrote, however, there's been a distinct upsurge in the use of written Cantonese. (It's tied in with a kind of Hong Kongese pseudo-nationalism.) It's still not exactly *common*, but it's a lot more common than it used to be.

--
Mark
_______________________________________________ Ltru mailing list Ltru@ietf.org https://www.ietf.org/mailman/listinfo/ltru