Re: [Ltru] Consensus call: extlang

Peter's addressed some of the questions. Back to your original question.
For backward compatibility, we'll continue to represent Mandarin as "zh",
Standard Arabic as "ar", and so on. Note that this is independent of whether
extlang is used or not. That is, if extlang exists, we'll treat incoming
"zh-cmn" as if it were "zh"; if it doesn't, we'll treat "cmn" as if it
were "zh". And under either scenario it is conformant to tag Mandarin as
'zh'.
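
A minimal sketch of that normalization, in Python, for anyone who wants to
see it concretely. The table and function names are hypothetical, and "arb"
(the 639-3 code for Standard Arabic) is my addition for illustration; this
is not a description of any actual implementation.

```python
# Hypothetical table: encompassed-language code -> legacy macrolanguage code.
# Only the entries discussed here; a real table would come from the registry.
LEGACY_PRIMARY = {"cmn": "zh", "arb": "ar"}

def normalize_tag(tag):
    """Map 'zh-cmn...' (extlang form) or 'cmn...' (bare form) back to 'zh...'."""
    subtags = tag.split("-")
    first = subtags[0].lower()
    # Extlang form: the second subtag is an encompassed language of the first.
    if len(subtags) >= 2 and LEGACY_PRIMARY.get(subtags[1].lower()) == first:
        return "-".join([first] + subtags[2:])
    # Bare form: the primary subtag is itself the encompassed language.
    if first in LEGACY_PRIMARY:
        return "-".join([LEGACY_PRIMARY[first]] + subtags[1:])
    # Anything else (e.g. 'zh-yue', 'yue') is left untouched.
    return tag
```

Either way the result is the same: normalize_tag("zh-cmn") and
normalize_tag("cmn") both give "zh", which is why the choice is independent
of whether extlang is adopted.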

Why?
Although 639-3 now specifies clearly that "de" means (for example) just
Standard German, whereas "zh" means any Chinese, this clarity of
specification was not present earlier. The code "zh" has been used in the
past for Mandarin, overwhelmingly so; not just 99% or 99.9%, but many 9's.
As you said, the tendency was to use illegal (or private-use) codes for
non-Mandarin content. All of our internal software and any external software
that we talk to will expect Mandarin to be tagged as 'zh' for the
foreseeable future. Of course, we recognize that others may end up using
'zh-cmn' / 'cmn', so we're prepared to deal with that.

Note also that really the whole premise of extlang is that 'zh' continues to
normally map to Mandarin. After all, if 'zh' really meant that you were as
likely to get Gan or Hakka as Mandarin, then having "zh-yue" in order to get
some kind of automatic fallback wouldn't make any sense.

Other comments below.

On Thu, May 29, 2008 at 6:04 PM, Broome, Karen <Karen_Broome@spe.sony.com>
wrote:

> Mark,
>
> One thing I think you aren't acknowledging is that "treat as synonyms"
> means something very different to the vast numbers of content creators who
> use this standard than it does the handful of search engines that use the
> fuzzy logic associated with companion standards. As you note in your
> document, "It is clear that companies like Google or Yahoo can work around
> the problems with extlang." How many other users need and can afford to
> implement the extended fallback and filtering logic? Enough that this logic
> should be the primary driver behind the chosen solution?
>
> Before I spend too much time picking apart your lengthy screed involving a
> scenario where the BBC presents its web site in Sudanese Creole Arabic with
> rotating languages code logic for each day of the week ... (ahem) ... here's
> my real-world Chinese language list:
>
> Chinese (Variant Unknown)
> Chinese (Cantonese, Spoken)
> Chinese (Cantonese, Written)
> Chinese (Mandarin, Spoken)
> Chinese (Mandarin, Spoken Taiwanese)
> Chinese (Mandarin, Simplified)
> Chinese (Mandarin, Traditional)
> Chinese (Taiwanese, Spoken)
> Chinese (Taiwanese, Written)

Sorry you consider it a screed. I realize that the emails have sometimes
gotten heated -- email really is a poor substitute for audio discussions on
controversial issues; I've seen many, many issues in Unicode and other
standards flare for months in email, and be resolved in a few hours of
discussion.

My real point is that if a query for 'ar' really meant "give me any kind of
Arabic", then a query for 'ar' would be almost meaningless, since it could
return any of a number of mutually incomprehensible alternatives. Although
639-3 now defines it to be "any Arabic", in practice what users will expect
to get back is Standard Arabic, and they would be unpleasantly surprised to
get back other varieties. And our purpose should be to avoid our users'
getting unpleasant surprises.

>
>
> (Apologies, this is hard to represent in ASCII. I have a mini-spreadsheet
> if someone wants it.)
>
>
>    1             2             3           4
> a. zh            zh            zh          zh
> b. zh-yue        yue           yue         yue
> c. zh-yue        yue           yue         yue
> d. zh-cmn        cmn           zh          cmn
> e. zh-cmn-TW     cmn-TW        zh-TW       cmn-TW
> f. zh-cmn-Hans   cmn-Hans      zh-Hans     zh-Hans
> g. zh-cmn-Hant   cmn-Hant      zh-Hant     zh-Hant
> h. zh-min-nan    nan           nan         nan
> i. zh-min-nan    nan           nan         nan

(Table above modified slightly to add row references.)

>
>
>
> * Option #1 (RFC 4646) contains the codes as I have them today.

Note that this is not actually RFC 4646 conformant: zh-cmn-TW is not valid.

>
> * Option #2 (RFC 4646bis) contains the codes if I choose to go against the
> grain and use "cmn".

> * Option #3 (RFC 4646bis) treats "zh" and "cmn" as synonyms; avoids using
> "cmn" for compatibility.
> * Option #4 (RFC 4646bis) contains the codes "cmn" for spoken context
> (where distinction is essential) and "zh" for written context.
>
> Comments:
>
> * Option #1 is unambiguous and shows that there is a relationship between
> these languages. It also preserves the legacy "zh" tag so developers that
> aren't hip to later versions of BCP 47 or 639-3 will have some idea what
> these tags mean. The tags are maybe longer than they need to be, but if I
> need a fixed-length tag, I can wait for 639-6. The languages may not be
> mutually intelligible in some contexts, but they are related.
>
> * Option #2 is unambiguous, but Microsoft, Google, and Amazon won't be
> using the same tags for Chinese that I do. Even if I don't follow their
> lead, others likely will. This worries me. Also, the rules for #2 must
> include fuzzy guidelines such as, "use the 'zh' tag except when you think
> it's a bad idea" and "use the shortest tag except when you don't want to."
> This presents complications in trying to explain some sort of consistent
> method to the LTRU madness to others. Given this, I start to wish ISO 639-6
> a safe and speedy passage.
>
> * Option #3 is what I believe you might suggest, but for me, that's the
> worst list of all. There are five ambiguous "zh" categories on that list. It
> follows the "always use the shortest tag" rule and respects history, but
> it's useless to me from an identification perspective.

Your list is already ambiguous for columns 1 and 2; you are using "yue" for
two different things (written and spoken). The only change it really makes
is that you don't have a term for "any Chinese".

RFC 4646 lacks terms for many, many combinations of things: a term for "any
German" (including de, gsw, ...), "any French", "any Scandinavian", or any
one of the countless other possible sets of languages that people consider
to be important for some particular purpose. That's why lists of languages
are really the appropriate vehicle.
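
To illustrate what I mean by a list, here is a rough Python sketch. The list
contents and the function name are hypothetical; the codes are just the ones
already mentioned in this thread, not a complete inventory.

```python
# Hypothetical, incomplete lists standing in for "any Chinese" / "any German".
ANY_CHINESE = ["zh", "cmn", "yue", "nan"]
ANY_GERMAN = ["de", "gsw"]

def matches_any(content_tag, language_list):
    """True if the primary language subtag of content_tag is in the list."""
    primary = content_tag.lower().split("-")[0]
    return primary in {lang.lower() for lang in language_list}
```

With this, "yue", "cmn-Hans" and "zh-Hant" all match ANY_CHINESE, and the
caller decides which languages belong together instead of relying on a
single tag to carry that meaning.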

>
> * Option #4 has three ambiguous tags and means I have to explain to people
> who aren't in this industry about why I use different tags for the same
> language. This strategy is less ambiguous that #3, but I'm not sure I can
> explain it to other content creators for the same reasons as #2 and presents
> the spoken/written complication others may not want. In the long run, this
> seems messy and unclear enough that it will result in bad tagging.
>
> * Options #2,3,4: In general, it worries me that RFC 4646bis offers so many
> "preferred" options for the same thing. I really can't see how this
> simplifies things for anyone.
>
> I don't have a need for fuzzy fallback scenarios. I need precise tags and
> mostly simple lookup. I think if you take the fallback scenarios and
> absurdities out of the document you reference, I don't think there's much
> left.

The only purpose I have heard for extlang *is* for fallback; that's why the
document goes into (painful) depth on that topic. For identification alone,
"zh" and "zh-cmn" really mean just the same thing. It is only in the context
of matching (filtering and lookup) that they differ in semantics *because of
their behavior*: where "cmn" means simply Mandarin, "zh-cmn" effectively
means "Mandarin but fall back to any Chinese".
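
The behavioral difference can be sketched with a simplified version of the
RFC 4647 lookup scheme (truncate the requested tag from the right until
something matches). This is only an illustration in Python; the function
name is mine, and it omits default values and other details of the full
algorithm.

```python
def lookup(requested, available):
    """Simplified RFC 4647-style lookup: progressively truncate `requested`."""
    by_lower = {tag.lower(): tag for tag in available}
    subtags = requested.lower().split("-")
    while subtags:
        candidate = "-".join(subtags)
        if candidate in by_lower:
            return by_lower[candidate]
        # Drop the last subtag and retry with the shorter tag.
        subtags.pop()
        # A single-character subtag (e.g. "x") is removed together with
        # the subtag that followed it.
        if subtags and len(subtags[-1]) == 1:
            subtags.pop()
    return None  # no match; a caller would substitute its own default
```

Given content available only as ["zh", "en"], lookup("zh-cmn", ...) falls
back to "zh", while lookup("cmn", ...) finds nothing. That difference in
fallback is the whole of what the extlang form buys.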

>
>
> Regards,
>
> Karen Broome
>
>
>
>
> >-----Original Message-----
> >From: ltru-bounces@ietf.org [mailto:ltru-bounces@ietf.org] On Behalf
> >Of Mark Davis
> >Sent: Thursday, May 29, 2008 4:00 PM
> >To: debbie@ictmarketing.co.uk
> >Cc: LTRU Working Group
> >Subject: Re: [Ltru] Consensus call: extlang
> >
> >What would be useful is to hear from the extlangistas what their
> >concerns are specifically; many have not given reasons for favoring
> >encompassed languages into extlang instead of into the primary
> >language subtag. It would be useful for them to give the scenarios
> >where they think extlang is an improvement. It would be useful to
> >find out why they think the scenarios such as in
> >http://docs.google.com/Doc?docid=dfqr8rd5_676kxxxjhd&hl=en are not a
> >problem.
> >
> >Clearly people think that using the extlang model solves more
> >problems than it causes, so it would be useful to example specific
> >cases and see if that is, in fact, true.
> >
> >
> >Mark
>
>

-- 
Mark