Re: [Ltru] Extended language tags (long reply)

(warning: this is a long screed, which, now that I've alluded to it on 
list--and even quoted from the abominable thing--I may as well share, 
even though I'll probably think better of it in about an hour when the 
responses start to arrive)

Shawn Steele wrote:
> 
>         * We (the bigger software ecosystem we, tools, etc) have encouraged use of zh-HK, etc for labeling both Mandarin and Cantonese.  Whatever happens in the future a large number of legacy documents, both Mandarin and Cantonese, will be labeled with the existing language tag.

Not exactly. "We" have (at least historically) abused the tags "zh-HK"
and "zh-TW" to represent Traditional Chinese and ignored the
Cantonese/Mandarin split (which is possible if one only means written
content, which historically has composed most of the material in our
information systems and on the Web).

More recently, language tags have allowed for direct tagging of these
differences as "zh-Hant" and "zh-Hans" (Traditional vs. Simplified)
allowing TW and HK to represent regional differences (there are regional
preferential differences, even in written texts, even when ignoring the
Cantonese/Mandarin split--we encounter these at Yahoo!, for example). It
is unclear how quickly the transition to these subtags is proceeding.
Some user-agents don't support these tags yet.
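
To make the shift concrete, here is a minimal Python sketch of rewriting
legacy region-only Chinese tags with an explicit script subtag. The
table and the function name are my own illustrative assumptions (they
are not taken from any RFC or from the registry), and the sketch
deliberately leaves any tag that isn't a plain "zh-REGION" alone:

```python
# Assumed, illustrative mapping from region to the script that legacy
# "zh-REGION" tags have conventionally implied. Not from any spec.
LEGACY_SCRIPT_BY_REGION = {
    "TW": "Hant",
    "HK": "Hant",
    "MO": "Hant",
    "CN": "Hans",
    "SG": "Hans",
}


def make_script_explicit(tag):
    """Rewrite a plain 'zh-REGION' tag as 'zh-SCRIPT-REGION'.

    Any other tag (different language, script already present, extra
    subtags, unknown region) is returned unchanged.
    """
    subtags = tag.split("-")
    if len(subtags) != 2 or subtags[0].lower() != "zh":
        return tag
    region = subtags[1].upper()
    script = LEGACY_SCRIPT_BY_REGION.get(region)
    if script is None:
        return tag
    return f"zh-{script}-{region}"


print(make_script_explicit("zh-HK"))    # zh-Hant-HK
print(make_script_explicit("zh-CN"))    # zh-Hans-CN
print(make_script_explicit("zh-Hant"))  # zh-Hant (left alone)
```

Note that this only makes the script guess explicit. It says nothing
about Cantonese vs. Mandarin, which is the part the region subtag can't
tell you.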

With the advent of audio and visual content in particular, tagging all
Chinese as 'zh' becomes less useful (at least for those cases), since we
suddenly find applications for which we need to describe specific spoken
variations not captured by 'zh' and not directly discernible from the
region subtag.

I note that Chinese *may* be a special case, in comparison to others, in
which we have very large document bases in a variety of Sinitic
languages tagged with a wide variety of regional subtags representing
something other than regional variations. Other macrolanguages don't
exhibit this sort or range of imputed meaning.

For example, the Arabic "sub-languages" are mostly regionally based, so
one might very well consider 'arz' (Egyptian Arabic) to be a synonym for
"ar-EG" (Arabic as used in Egypt). Of course, some of the other Arabic
languages don't map so closely to modern nation-states... for the
record, I oppose assigning "secret handshake" meaning to region subtags.
Explicit subtags seem to work better. [Yes, there really are regional
differences, for which region subtags are good, even when the region
where the language is spoken leaks over borders a bit.]

> 
>         * Whether or not yue or zh-yue was used, it is a change from the current label (in most cases).  It seems that in nearly all cases this will require a code change.  In particular applications that want to include zh-HK in lists containing zh-cmn-HK or cmn-HK will need extra awareness.

It requires tag changes to indicate things with more precision, yes.
Nothing says that we might not have continued use of the existing tags;
indeed, I would be surprised by a wholesale conversion over to the new
scheme (whatever it is). Most people won't "get the message" for a while.
And sometimes the additional specificity doesn't matter ('zh' is
perfectly good for many written documents). However, for many
applications where content labeling can be controlled, converting over
makes life easier. For Chinese in particular, I expect a pretty messy
tagging situation to persist for quite some time (since we have
overlapping levels of imputed meaning in region subtags and elsewhere).

For other languages, a messy situation may not be necessary. Arabic
probably doesn't *require* extlangs, although that may be my parochial
perspective. While Andrew and Don point out the usefulness of deliberate
vagueness in tagging, mightn't we be just as well served by tagging
"generic" Dinka documents (for example) as 'din' and specific Dinka
documents as, for example, 'dib' or 'dik'? Again, the question here
seems to depend on whether tags or ranges are what matter most.

> 
>         * I don't think the Breton example applies.  http-accept-lang allows for "br-FR;fr-FR" type fallback, so that is a solution.  If an application independently wanted to make this assumption that's fine by me, but this seems orthogonal to the problem we're trying to solve here.

I agree that this fallback case is valid as a use-case for language
priority lists, but not for language tags. But I think this misses the
point. It is quoted without Mark's preamble, in which he posits Breton
as a sublanguage of Welsh. Mark is using it more as a parable to
illustrate the extlang case, not as an example-in-fact.

Let me rephrase Mark's Breton case: when you want Chippewa, any old
Ojibwa (its macrolanguage), which might be a language such as Ottawa,
will not do. It is unintelligible (maybe: I don't actually know in this
case). If it is unintelligible, you're probably better off *not* mapping
Chippewa to Ojibwa. You're better off serving some useful default (for
Chippewa, this is probably English, but that is up to the application,
not something RFC 4647 or the LSR dictates) or even failing.
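
A rough Python sketch of the RFC 4647 lookup scheme shows the
difference. The function name and the sample data are mine, and the
"oj-ciw" form is only a hypothetical extlang-style tag used for
contrast; treat this as an illustration, not a reference
implementation:

```python
def lookup(priority_list, available_tags, default=None):
    """Sketch of RFC 4647 lookup: progressively truncate each range.

    Returns the first available tag that equals a range or one of its
    truncations (case-insensitively), else the caller's default.
    """
    available = {tag.lower(): tag for tag in available_tags}
    for language_range in priority_list:
        if language_range == "*":
            continue  # the wildcard range is skipped in lookup
        subtags = language_range.lower().split("-")
        while subtags:
            candidate = "-".join(subtags)
            if candidate in available:
                return available[candidate]
            subtags.pop()
            # A trailing single-character subtag goes with the subtag
            # that followed it.
            while subtags and len(subtags[-1]) == 1:
                subtags.pop()
    return default


content = ["oj", "otw", "en"]  # Ojibwa, Ottawa, English

# Primary-language form: Chippewa never truncates to Ojibwa, so the
# application's own default is served.
print(lookup(["ciw"], content, default="en"))     # en

# Hypothetical extlang form: truncation lands on the macrolanguage.
print(lookup(["oj-ciw"], content, default="en"))  # oj
```

Which of those two outcomes is "right" depends on whether the
macrolanguage content is actually any use to the requester.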

The counter-example would be the ietf-languages adventure in Norwegian
tagging. Were we better off when Norwegian was represented by the older
registered values (which are quite extlang like, eh?):

   no-nyn
   no-bok

than by the codes:

   nn
   nb
   no

Things certainly were easier when 'no' meant generic Norwegian (which,
it turns out, tends to be 'nb') and Nynorsk could be represented
extlang/variant-like. In code that deals with Norwegian language tags
today (as well as other odd cases, such as the mistaken 'he'/'iw' pair),
you often have this irksome special mapping table from the (what would
be) sublanguage codes to their macrolanguage, which is what we "mean".
But you can't eat/change the original subtag: other processors might
interpret or map the codes differently or be able to serve specific
resources tailored to the original request. This is a flaw in the
no-extlang case and why Mark and others have insisted that Macrolanguage
become an important piece of information in the registry.
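
For the record, the kind of table I mean looks something like this
Python sketch. The table contents and the function name are
illustrative assumptions, not anything the registry provides today.
The point is that the original tag is kept and tried first, and the
mapped form is only an additional candidate:

```python
# Assumed, hand-maintained table of primary subtags that should also be
# tried under another code when matching.
MATCHING_ALIASES = {
    "nb": "no",  # Bokmal -> generic Norwegian
    "nn": "no",  # Nynorsk -> generic Norwegian
    "iw": "he",  # old Hebrew code -> current one
}


def matching_candidates(tag):
    """Return the tags to try, original first, never rewriting it."""
    primary, sep, rest = tag.partition("-")
    alias = MATCHING_ALIASES.get(primary.lower())
    if alias is None:
        return [tag]
    return [tag, alias + sep + rest]


print(matching_candidates("nn-NO"))  # ['nn-NO', 'no-NO']
print(matching_candidates("iw"))     # ['iw', 'he']
print(matching_candidates("en-US"))  # ['en-US']
```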

> 
>         * Lots of zh data right now is zh-cmn.  Most of "us" seem to agree that we can't narrow the meaning of zh because it does allow Cantonese.  However if some application wanted to make an assumption that zh == zh-cmn, then that seems up to the application for fallback.

Agreed. More to the point, the "zh" tag encompasses a resource
containing *some* (which is to say, exactly ONE) variety of Chinese. It
is probably also Simplified Chinese, for example--it certainly can't be
both Simplified and Traditional.

But you can't make that assumption safely! There are also differences in
(for example) the accepted level of English word borrowing between (say)
Taiwan and HK, etc. "zh" embodies the least common denominator, which
may not be very common in certain circumstances (if you're talking about
our HK web site, 'zh' probably means content that could be tagged
"zh-yue-Hant-HK", whereas I'm pretty sure other sites or applications
would prefer it to be pretty similar to the stuff that could be tagged
"zh-cmn-Hans-CN"). Heck, when I started my internationalization career,
'zh' meant Traditional, because many of us did business in Taiwan and
hardly anyone in the PRC. Now 'zh' mostly "means" Simplified.

> 
>         * Even for the current tags, many of the people in the teleconference seem to extend RFC 4647 in ways that are best for them.  Strict use of 4647 behavior seems rare.  It seems reasonable to me to expect that in the future people may continue to do so and that RFC 4647 and the registry can only provide guidelines.  I don't think that it can solve all problems for all applications, and I'm fine with that.

I disagree. RFC 4647 is fine for many purposes. I'm proposing,
if we do extlangs, a small modification due to finding a slightly better
version of just one algorithm (lookup).

I think our ideal goal would be for RFC 4647 matching schemes to work
fully for everyone for the scope they are designed for.

We might make one tweak or another to the document and possibly the
algorithms based on further testing. For now, the existing schemes
mostly work. They exhibit varying impact on tag and range choice in
either extlang or no-extlang clothes.

I disagree that you have to have far more complex matching systems in
all cases. Yes, there are use cases that call for more complex systems.
But most low-level users have no need of these elaborations. For basic
protocols, I think we would prefer the simplest possible algorithms that
produce mostly the right results. Locale systems, for example, are based
on lookup (well, more like lookup is based on them), and they produce
meaningful results for resource lookup pretty reliably. I care that
lookup works well, because I have a vested interest in the continued
good behavior of systems such as CLDR, which use and depend on these
simplest possible matching systems, such as lookup.

So we can change the algorithm or "change" the (as yet undefined) tags,
which is what is suggested as the other route.

> 
> So from these points, my conclusion is that the zh-cmn form is preferable.
> 
> My reasoning is that either cmn by itself or zh-cmn will require code change in nearly all cases, either to include or exclude existing data or fallback rules.   Either way will require a code change, and either way may require knowledge that zh-HK might be interesting if the request is for {Mandarin tag}-HK.  Searching may need to include zh files for {Mandarin tag} queries and exclude them for {Cantonese tag} queries, but the actual tag doesn't really change this logic.  Neither variation is likely to work with existing code when the request is for the new name and the data is tagged with the old tag.
> 
> The deciding factor for me is that to know that cmn is related to zh I'd have to look in the registry, but zh-cmn contains that information.  Otherwise I don't really see advantages with either method.

The question is: do we want to do this for all time? Or can we help
people migrate and just get started on the migration? Yes, retagging
with "yue-HK" is a PITA, but so is retagging with "zh-yue-HK". If we
have to convince folks to retag their data anyway, what is the best,
most sensible solution?

The benefit to extlang partially comes from the expectation that the 
tags will be changed but users won't update their ranges. For that case, 
you can switch to extended filtering and a slightly modified lookup and 
use extlangs and go about your business. Of course, you have to explain
these fairly complex tags to the Chinese... (and Arabic, Dinka,
Quechua, Zapotec, etc. speakers)
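
Here is a small Python sketch of RFC 4647 extended filtering, to show
why an un-updated range such as "zh-HK" keeps working against
extlang-style tags. The function name is mine and the tags are the
proposed forms under discussion, so read it as an illustration under
those assumptions:

```python
def extended_filter_match(language_range, tag):
    """Sketch of RFC 4647 extended filtering for one range and one tag."""
    range_subtags = language_range.lower().split("-")
    tag_subtags = tag.lower().split("-")

    # The first subtags must match, unless the range starts with "*".
    if range_subtags[0] != "*" and range_subtags[0] != tag_subtags[0]:
        return False

    r, t = 1, 1
    while r < len(range_subtags):
        if range_subtags[r] == "*":
            r += 1                      # wildcard: nothing to check
        elif t >= len(tag_subtags):
            return False                # tag ran out before the range did
        elif range_subtags[r] == tag_subtags[t]:
            r += 1
            t += 1
        elif len(tag_subtags[t]) == 1:
            return False                # never skip past a singleton
        else:
            t += 1                      # skip an intervening tag subtag
    return True


print(extended_filter_match("zh-HK", "zh-yue-HK"))       # True
print(extended_filter_match("zh-HK", "yue-HK"))          # False
print(extended_filter_match("zh-HK", "zh-cmn-Hans-CN"))  # False
```

The old range still finds the retagged content in the extlang form, and
misses it entirely in the primary-language form.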

On the other hand, if you go with primary language subtags for everyone,
you are expecting that users will change or augment their language
priority lists (e.g. "yue-HK; zh-HK") to deal with un-retagged data, or
that implementations will provide alternate lookup tables. Matching
algorithms do not change, but tags eventually have to catch up for the
majority of documents or there is (probably quite icky) tagging chaos.
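
A quick Python sketch of RFC 4647 basic filtering over a priority list
illustrates the dependency on the user's list. The function name and
the sample tags are my own assumptions for the example:

```python
def basic_filter(priority_list, tags):
    """Sketch of RFC 4647 basic filtering.

    A range matches a tag if they are equal or the tag starts with the
    range followed by "-" (case-insensitively). "*" matches everything.
    Results keep priority-list order, without duplicates.
    """
    matched = []
    for language_range in priority_list:
        prefix = language_range.lower()
        for tag in tags:
            if tag in matched:
                continue
            lowered = tag.lower()
            if (
                prefix == "*"
                or lowered == prefix
                or lowered.startswith(prefix + "-")
            ):
                matched.append(tag)
    return matched


# A mix of retagged and un-retagged Hong Kong content.
documents = ["zh-HK", "yue-HK", "zh-CN"]

# Only the new range: the legacy "zh-HK" content is missed.
print(basic_filter(["yue-HK"], documents))           # ['yue-HK']

# Augmented priority list: both old and new tags are found.
print(basic_filter(["yue-HK", "zh-HK"], documents))  # ['yue-HK', 'zh-HK']
```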

So, for me, the main issue is whether we are going to explicitly break 
the connection between the language and its macrolanguage (at the tag
level). In some cases (Norwegian) we already know this can be 
problematic; in others, it may actually be desirable.

Addison

-- 
Addison Phillips
Globalization Architect -- Yahoo! Inc.
Chair -- W3C Internationalization Core WG

Internationalization is an architecture.
It is not a feature.
