RE: [Ltru] Re: Punjabi

Perhaps it would help not to treat all macrolanguages (or putative macrolanguages; keep in mind that "lah" still needs to be evaluated) as comparable cases. There's a big difference between "lah" and "zh": the former is rare in existing data (certain libraries perhaps being exceptional), while the latter is *very* common. The extlang construct was suggested with cases like the latter in mind, not the former.

Peter

From: Mark Davis [mailto:mark.davis@icu-project.org]
Sent: Friday, March 16, 2007 6:06 PM
To: Addison Phillips
Cc: Doug Ewell; LTRU Working Group
Subject: Re: [Ltru] Re: Punjabi

> You might retag with "zh-cmn-Hans-CN", but this interferes less drastically with matching than outright change of the primary subtag.

This is the crux of the matter, whether this is in fact true. We've all been sort-of assuming that it is, but our goal at the time of 4646 was to make sure that we had some structure to accommodate the possibility in case we wanted it, long before we'd really take hard looks at the pros and cons. In order to establish whether this statement is true or not, I think we need to walk through some of the scenarios, and see how important it is in practice. I'll take a crack at a few cases, although I think we need a more extensive look.

Filtering, No Extlang Model

Let's suppose that I have content tagged with the following:

#1 zh-Hant-HK
#2 zh-Hans-HK
#3 yue-SG

If I do a basic filter for zh, I'll get #1 and #2. With a slightly smarter filter, using macrolanguage information from ISO 639 (maybe put into our registry), I'll also match #3. If I search with zh-Hant, I'll get #1 in either case. If I search for yue, I'll get just #3.
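
A minimal sketch of the two filters described above, assuming prefix-style basic filtering in the manner of RFC 4647. The MACROLANGUAGE_MEMBERS table and both function names are hypothetical, made up for illustration; a real table would be derived from ISO 639-3 data.

```python
# Hypothetical macrolanguage table (illustration only; not real registry data).
MACROLANGUAGE_MEMBERS = {"zh": {"cmn", "yue"}}


def basic_filter(language_range, tags):
    """Match tags equal to the range, or starting with the range plus '-'."""
    r = language_range.lower()
    return [t for t in tags if t.lower() == r or t.lower().startswith(r + "-")]


def macro_aware_filter(language_range, tags):
    """Basic filtering, also trying each member language of the range's
    primary subtag in place of that primary subtag."""
    subtags = language_range.lower().split("-")
    matched = set(basic_filter(language_range, tags))
    for member in MACROLANGUAGE_MEMBERS.get(subtags[0], set()):
        alt_range = "-".join([member] + subtags[1:])
        matched.update(basic_filter(alt_range, tags))
    # Preserve the original ordering of the content list.
    return [t for t in tags if t in matched]


content = ["zh-Hant-HK", "zh-Hans-HK", "yue-SG"]

print(basic_filter("zh", content))             # ['zh-Hant-HK', 'zh-Hans-HK']
print(macro_aware_filter("zh", content))       # ['zh-Hant-HK', 'zh-Hans-HK', 'yue-SG']
print(basic_filter("zh-Hant", content))        # ['zh-Hant-HK']
print(macro_aware_filter("zh-Hant", content))  # ['zh-Hant-HK']
print(basic_filter("yue", content))            # ['yue-SG']
```

The "smarts" here amount to one extra table lookup per range; the cost is that every matching implementation needs access to that table.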

Filtering, Extlang Model

#1 zh-Hant-HK
#2 zh-Hans-HK
#3 zh-yue-SG

If I search for content matching zh, I don't have to use the "bit smarter" filter: the basic filter already returns all three. Otherwise the results are the same.
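
A minimal sketch of the same search against extlang-style tags, again assuming prefix-style basic filtering in the manner of RFC 4647; the function name is hypothetical.

```python
def basic_filter(language_range, tags):
    """Match tags equal to the range, or starting with the range plus '-'."""
    r = language_range.lower()
    return [t for t in tags if t.lower() == r or t.lower().startswith(r + "-")]


content = ["zh-Hant-HK", "zh-Hans-HK", "zh-yue-SG"]

print(basic_filter("zh", content))       # ['zh-Hant-HK', 'zh-Hans-HK', 'zh-yue-SG']
print(basic_filter("zh-Hant", content))  # ['zh-Hant-HK']
print(basic_filter("zh-yue", content))   # ['zh-yue-SG']
print(basic_filter("yue", content))      # []
```

One thing the sketch makes visible: under plain prefix matching, the range that finds the Cantonese content in this model is zh-yue; a bare yue range matches nothing.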

Lookup, No Extlang Model

#1 cmn-Hant-HK
#2 cmn-Hant
#3 cmn
#4 yue-Hant-HK
#5 yue-Hant
#6 yue
#7 zh

If I request zh-Hans-CN, I'll fall all the way back to zh. If I didn't have #7, I'd fail, unless I had some smarts to do an alias to cmn. If I request yue-Hans, I'd fall back to yue.
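
A minimal sketch of this lookup, assuming truncation-based fallback in the manner of RFC 4647 lookup. The ALIASES table and the function names are hypothetical; the alias step stands in for the "smarts" mentioned above.

```python
# Hypothetical alias from a macrolanguage to a preferred member language.
ALIASES = {"zh": "cmn"}


def lookup(requested, available):
    """Return the first available tag found by truncating the request
    from the end, one subtag at a time; None if nothing matches."""
    by_lower = {t.lower(): t for t in available}
    subtags = requested.lower().split("-")
    while subtags:
        candidate = "-".join(subtags)
        if candidate in by_lower:
            return by_lower[candidate]
        subtags.pop()
        # A single-character subtag is dropped together with what followed it.
        if subtags and len(subtags[-1]) == 1:
            subtags.pop()
    return None


def lookup_with_alias(requested, available):
    """Plain lookup first; on failure, retry with the primary subtag aliased."""
    result = lookup(requested, available)
    if result is None:
        subtags = requested.lower().split("-")
        if subtags[0] in ALIASES:
            aliased = "-".join([ALIASES[subtags[0]]] + subtags[1:])
            result = lookup(aliased, available)
    return result


available = ["cmn-Hant-HK", "cmn-Hant", "cmn",
             "yue-Hant-HK", "yue-Hant", "yue", "zh"]
without_zh = [t for t in available if t != "zh"]

print(lookup("zh-Hans-CN", available))              # zh
print(lookup("yue-Hans", available))                # yue
print(lookup("zh-Hans-CN", without_zh))             # None
print(lookup_with_alias("zh-Hans-CN", without_zh))  # cmn
```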

Lookup, Extlang Model

#1 zh-cmn-Hant-HK
#2 zh-cmn-Hant
#3 zh-cmn
#4 zh-yue-Hant-HK
#5 zh-yue-Hant
#6 zh-yue
#7 zh

If I request zh-Hans-CN, I'll fall all the way back to zh. If I didn't have #7, I'd fail, unless I had some smarts to do an alias to cmn. If I request zh-yue-Hans, I'd fall back to zh-yue.
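
A minimal sketch of the same lookup against extlang-style tags, assuming truncation-based fallback in the manner of RFC 4647 lookup; the function name is hypothetical. Truncation only removes subtags, so a request without an extlang never reaches the zh-cmn entries on its own.

```python
def lookup(requested, available):
    """Return the first available tag found by truncating the request
    from the end, one subtag at a time; None if nothing matches."""
    by_lower = {t.lower(): t for t in available}
    subtags = requested.lower().split("-")
    while subtags:
        candidate = "-".join(subtags)
        if candidate in by_lower:
            return by_lower[candidate]
        subtags.pop()
    return None


available = ["zh-cmn-Hant-HK", "zh-cmn-Hant", "zh-cmn",
             "zh-yue-Hant-HK", "zh-yue-Hant", "zh-yue", "zh"]
without_zh = [t for t in available if t != "zh"]

print(lookup("zh-Hans-CN", available))       # zh
print(lookup("zh-yue-Hans", available))      # zh-yue
print(lookup("zh-cmn-Hans-CN", available))   # zh-cmn
print(lookup("zh-Hans-CN", without_zh))      # None
print(lookup("zh-yue-Hans", without_zh))     # zh-yue
```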

I'm not seeing a lot of difference here, but I certainly haven't explored all the possibilities. We need to try some more scenarios to see where we have some compelling differences. In particular, we need to look at lookup cases where the target tags contain cmn or yue and the requested key doesn't, and vice versa.

Mark
On 3/16/07, Addison Phillips <addison@yahoo-inc.com> wrote:
What extlangs buys us is: for languages that are already tagged, the
current primary language subtags remain consistently the primary
language subtag of choice. You might add additional subtags, but you
don't have to retag all of your content.

Thus "zh-Hans-CN" is still valid. You would not (could not) retag with
"cmn-Hans-CN". You might retag with "zh-cmn-Hans-CN", but this
interferes less drastically with matching than outright change of the
primary subtag.

What we want are consistent choices for language tags.

One alternative would be to allow both "zh-cmn" and "cmn". Users would
have to be careful to use these consistently in their content and range
selection.

Another alternative would be to forget extlang altogether and permit
*either* "zh" *or* "cmn" but not both in the same tag (except by
grandfathering). This frees the extlang up for other, nefarious, purposes.

My surmise is that macro-languages are a one-time event: "discovery" of
future macro-languages will mostly be prohibited by rule (since most of
the languages will already have codes in the "primary" position when
they become part of a macro-language collection). If my surmise is
correct, we could ban future extlang additions and use the remainder of
that namespace for (well) nefarious purposes.

The only exception to my surmise would be: a language not previously
given a code that is part of an existing macro-language. A language that
isn't currently a macro-language that receives a new sublanguage would
be a problem (all of the macro-languages are, by definition, one-to-many
mappings). That is, the sub languages always travel in (at least) pairs.

Does that make sense?

Addison

Mark Davis wrote:
> I must not be clearly stating my point. Let me try again.
>
> I'm getting at the fundamental reasoning behind extlang at all. I'm not
> arguing that we should use different DATA than ISO 639-3 in order to
> decide what are extlangs and what are not. What I'm saying is: why do we
> need the extlang construction at all? Why do we need to have zh-cmn
> instead of just cmn?
>
>  > Worst case we have some languages that could have been extlangs that
>  > become primary language subtags instead.
>
> That is: Why don't we simply have *all* languages be primary language
> subtags instead of extlangs? What do extlangs buy us, that we need the
> complication that they introduce?
>
> Mark
>
> On 3/16/07, *Addison Phillips* <addison@yahoo-inc.com> wrote:
>
>
>
>     Mark Davis wrote:
>      > Yes, and it concerns me that we are baking in a particular view of the
>      > world by requiring that some language tags can only be used in
>      > conjunction with others (eg pmu can only be used with lah-pmu). Now
>      > maybe I'm missing something, can you articulate the reasons why we must
>      > use lah-pmu (for example) instead of just pmu?
>      >
>
>     The reasons are all procedural: any ISO 639-3 code that is contained by
>     a macro-language and is not previously encoded by ISO 639-1 or ISO 639-2
>     must be an extlang whose Prefix is the macro-language code.
>
>     This allows us to piggyback on ISO 639-3's work in this area to create
>     tags such as zh-cmn and avoid naked 'cmn' language tags without having,
>     ourselves, to squint at the lists and make separate, possibly
>     unreasonable, decisions.
>
>     One possible way to avoid this would be to limit the "automatic"
>     creation of mappings to those languages that have ISO 639-1 codes. This
>     greatly limits the impact. The problem here is that it is arbitrary (why
>     Aymara and not Baluchi? why Cree and not Delaware?)
>
>     To respond to Doug's point, I think that we are not *forced* to delay
>     RFC 4646bis or even 4645bis's appearance. What we need are clear rules
>     for the incorporation of ISO 639-3 into the registry scheme. This
>     *could* take the form of a Big Bang insertion. But it would certainly be
>     valid to insert only the language (and not the extlang) codes initially
>     or to include only the finalized and stable extlang codes when they are
>     mature---on a different day.
>
>     I would suggest that a mechanism for doing this would be to take each
>     macro-language, as a collection, and vet the contents with the RA and
>     on-list with ietf-languages before doing an insert. The Chinese and
>     Arabic collections probably get in straight-away. Lahnda's subtags (by
>     way of example) could wait---and no one is hurt by that delay. The only
>     hurt is getting the mapping wrong. Worst case we have some languages
>     that could have been extlangs that become primary language subtags
>     instead.
>
>     Addison
>
>     --
>     Addison Phillips
>     Globalization Architect -- Yahoo! Inc.
>
>     Internationalization is an architecture.
>     It is not a feature.
>
>
>
>
> --
> Mark

--
Mark