[Ltru] tags for Chinese

[warning: this is very long]

On May 27, Debbie wrote:

> > > we should consider each of the applications of language tags:
> > > identification, lookup, filtering, and Accept-Language, and
> > > be able to have a reasoned judgment on the technical merits
> >
> > I would only add: we need to do this with *carefully*
> > reasoned judgment, pausing to make sure the arguments
> > presented really do stand up.
>
> I think this is probably a good way forward.  It would be best to start
> a new thread for each application.  I would also add backward
> compatibility and historical usage to the pot.

In the spirit of this, I'm going to attempt some analysis here of Chinese tags. I'm putting this all in one mail rather than separate threads per application since I'm organizing what I write around different conceptual categories and their corresponding tags rather than by applications. There are lots of details to sift, and in trying to get something put down and sent out I know I haven't done as careful an analysis as could be done -- certainly I haven't gotten much past cataloguing lots of facts.

I hope this will be useful, and that responses will take the analysis further. If you want to go on a tangent, please start a different thread.

On May 29, Karen wrote:

> here's my real-world Chinese language list:
>
> Chinese (Variant Unknown)
> Chinese (Cantonese, Spoken)
...

I gather Karen's approach here has been to take different "Chinese" categories of interest to her and consider what tag would be recommended for each under different proposals we might adopt. I'd like to approach things from a different direction for a moment: list what "Chinese" tags would be available under each of three proposals and consider what we might say about each tag.

The three proposals are (1) extlang, (2) no-extlang, (3) mixed: extlang, but with IDs for encompassed languages also allowed in first-subtag position.

Before proceeding, I'll briefly review matching processes and scenarios:

- Filtering returns a number of records from a (typically sizeable) corpus. Here, a record tagged with greater specificity than was indicated in the query can always be returned (e.g., if the user asks for en, then en-GB is certainly valid); the relevance of records that are tagged more generically (e.g., en when the user asked specifically for en-GB) is very much in question. A Web search would be a typical scenario in which filtering is appropriate.

- Lookup returns a single record that is a best match for some language request. Here, it is appropriate for a user to indicate very specific preferences and then to match that as best as possible, returning a more-generically-tagged resource if needed. (E.g., if the user asks for en-GB, then en-GB is best, but en is certainly appropriate if no en-GB resource is available.) Software UI and Web sites are typical examples in which lookup is appropriate.
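To make the two processes concrete, here is a rough Python sketch of basic filtering and lookup along the lines of RFC 4647. The function names are mine, not from any spec or library, and the sketch ignores the "*" wildcard and private-use subtleties; treat it as an illustration rather than a conforming implementation.

```python
def basic_filter(language_range, tags):
    """Return every tag that equals the range or extends it with further subtags."""
    prefix = language_range.lower()
    return [t for t in tags
            if t.lower() == prefix or t.lower().startswith(prefix + "-")]


def lookup(language_range, tags, default=None):
    """Return a single best tag, truncating the range from the right until one matches."""
    available = {t.lower(): t for t in tags}
    subtags = language_range.lower().split("-")
    while subtags:
        candidate = "-".join(subtags)
        if candidate in available:
            return available[candidate]
        subtags.pop()
        # RFC 4647 lookup also drops a single-character subtag left at the end.
        if subtags and len(subtags[-1]) == 1:
            subtags.pop()
    return default


records = ["en", "en-GB", "en-US", "fr"]
print(basic_filter("en", records))     # more specific records are returned
print(basic_filter("en-GB", records))  # the generic "en" record is not
print(lookup("en-GB", ["en", "fr"]))   # falls back to the generic resource
```

The asymmetry is the point: filtering only ever widens toward more specific tags, while lookup only ever falls back toward more generic ones.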

Considerations for macrolanguages in filtering: if a user specifies the macrolanguage, the result set may vary considerably in relevance if encompassed languages are not mutually intelligible; and if there is a predominant language, records in that may be relevant for all while records in other languages will be relevant only for certain users -- though there may be many fewer records in other languages. If a record is tagged for the macrolanguage, then it is likely in a predominant language that may be relevant to speakers of any of the encompassed languages (probably true for more developed languages such as Chinese, but not necessarily true for less developed language-nets, such as Ojibwa).

In lookup scenarios, the number of language varieties for a given resource is likely managed as a relatively small set; across a number of resources, the number of language varieties may still be limited, or there may be a very diverse set of varieties. For instance, in a closed system such as UI resources for a software title, the number of varieties overall will be limited, and the number of varieties for a single resource may be much smaller. (E.g. there may be resources for en-GB, but there probably won't be resources for a lot of English dialects; and as for distinct languages, 100 would be exceptionally high.) In an open system such as the Web, a given site will likely support a limited set of varieties, though across the Web there are a lot more varieties represented. (E.g. there are going to be sites in many English dialects (though not necessarily tagged distinctly), and there are going to be sites in very many distinct languages.)

Considerations for macrolanguages in lookup: If a user specifies the macrolanguage, only resources tagged exactly the same way will match. If a user specifies an encompassed language, resources tagged for either the encompassed language or the macrolanguage will match. Since resources must be in some particular variety, it may not make sense to tag resources for the macrolanguage unless all the encompassed varieties really are mutually intelligible, or unless the resources are in a predominant language acceptable to speakers of all encompassed languages.

OK, now let's look at "Chinese" tags that would be available under different proposals. I won't give fully exhaustive lists (e.g. won't give tags involving every possible Region subtag), but will cover all the relevant cases with one or more representatives.

Here's the list -- since this will likely get wrapped and be hard to read, I'm also attaching a txt:

extlang          no-extlang    mixed                          meaning                              Notes
----------------------------------------------------------------------------------------------------------
zh               zh            zh                             'Chinese'                            (1)
zh-CN            zh-CN         zh-CN                          'Chinese as used in PRC'             (2)
zh-TW            zh-TW         zh-TW                          'Chinese as used in TW'              (3)
zh-HK            zh-HK         zh-HK                          'Chinese as used in Hong Kong'
zh-Hant          zh-Hant       zh-Hant                        'Chinese, Tdnl writing'              (4)
zh-Hans          zh-Hans       zh-Hans                        'Chinese, Simp writing'              (4)
zh-Hant-CN       zh-Hant-CN    zh-Hant-CN                     'Chinese, Tdnl as used in PRC'       (4)
zh-Hans-CN       zh-Hans-CN    zh-Hans-CN                     'Chinese, Simp as used in PRC'       (4)
zh-Hant-TW       zh-Hant-TW    zh-Hant-TW                     'Chinese, Tdnl as used in Taiwan'    (4)
zh-Hans-TW       zh-Hans-TW    zh-Hans-TW                     'Chinese, Simp as used in Taiwan'    (4)
zh-cmn           zh-cmn        zh-cmn                         'Mandarin'                           (4)(6)
                 cmn           cmn                            'Mandarin'                           (7)
zh-yue           zh-yue        zh-yue                         'Cantonese'                          (5)(6)
                 yue           yue                            'Cantonese'                          (7)
zh-cmn-Hant      zh-cmn-Hant   zh-cmn-Hant                    'Mandarin, Tdnl writing'             (4)(6)
                 cmn-Hant      cmn-Hant                       'Mandarin, Tdnl writing'             (7)
zh-cmn-Hans      zh-cmn-Hans   zh-cmn-Hans                    'Mandarin, Simp writing'             (4)(6)
                 cmn-Hans      cmn-Hans                       'Mandarin, Simp writing'             (7)
zh-yue-Hant      yue-Hant      zh-yue-Hant / yue-Hant         'Cantonese, Tdnl writing'            (7)
zh-yue-Hans      yue-Hans      zh-yue-Hans / yue-Hans         'Cantonese, Simp writing'            (7)
zh-cmn-CN        cmn-CN        zh-cmn-CN / cmn-CN             'Mandarin as used in PRC'            (7)
zh-cmn-TW        cmn-TW        zh-cmn-TW / cmn-TW             'Mandarin as used in Taiwan'         (7)
zh-yue-CN        yue-CN        zh-yue-CN / yue-CN             'Cantonese as used in PRC'           (7)
zh-yue-HK        yue-HK        zh-yue-HK / yue-HK             'Cantonese as used in Hong Kong'     (7)
zh-cmn-Hant-CN   cmn-Hant-CN   zh-cmn-Hant-CN / cmn-Hant-CN   'Mandarin, Tdnl as used in PRC'      (7)
zh-cmn-Hans-CN   cmn-Hans-CN   zh-cmn-Hans-CN / cmn-Hans-CN   'Mandarin, Simp as used in PRC'      (7)
zh-cmn-Hant-TW   cmn-Hant-TW   zh-cmn-Hant-TW / cmn-Hant-TW   'Mandarin, Tdnl as used in Taiwan'   (7)
zh-cmn-Hans-TW   cmn-Hans-TW   zh-cmn-Hans-TW / cmn-Hans-TW   'Mandarin, Simp as used in Taiwan'   (7)
zh-yue-Hant-CN   yue-Hant-CN   zh-yue-Hant-CN / yue-Hant-CN   'Cantonese, Tdnl as used in PRC'     (7)
zh-yue-Hans-CN   yue-Hans-CN   zh-yue-Hans-CN / yue-Hans-CN   'Cantonese, Simp as used in PRC'     (7)
zh-yue-Hant-HK   yue-Hant-HK   zh-yue-Hant-HK / yue-Hant-HK   'Cantonese, Tdnl as used in HK'      (7)
zh-yue-Hans-HK   yue-Hans-HK   zh-yue-Hans-HK / yue-Hans-HK   'Cantonese, Simp as used in HK'      (7)

Notes:
1. Most (by far) existing content tagged "zh" is Mandarin. (Not
   clear how much content is tagged simply "zh" rather than
   "zh-CN", etc. or "zh-Hant"/"zh-Hans".)

2. "zh-CN" has often been used to declare content as "Chinese,
   Simp writing" (almost all, of course, being Mandarin), though
   the preferred tag for this, "zh-Hans", has been available
   since mid 2005.

3. "zh-TW" has often been used to declare content as "Chinese,
   Tdnl writing" (almost all, of course, being Mandarin), though
   the preferred tag for this, "zh-Hant", has been available
   since mid 2005.

4. Tag has been available since mid 2005.

5. Tag "zh-yue" has been available since 1999. (Also the case for
   "zh-gan" and "zh-wuu".)

6. Tag is grandfathered in current LSTR.

7. Not valid under RFC 4646.

Of course, some of those possibilities are currently more common than others,* and some are more useful than others. The challenge is to figure out which would be recommended under each proposal, and how to handle those that might not be generally recommended yet are currently in use and so likely to be encountered.

*(I wonder if Mark can get Web stats on all of these.)

So, let's consider each semantic and the tags available:

"Chinese":

The only tag option is "zh". Because "zh" is widely used and couldn't quickly go away (however much we might want it to), we will have to deal with it for some time.

It may be a reasonable choice to use "zh" to tag text content (see more below), but it is probably not a good choice for A/V media unless a system is closed and providing Mandarin content only.

Apart from maintaining current practice of using "zh" to tag Mandarin content, it is unclear how it would ever be useful to tag content as "Chinese". There may be scenarios in which it would be useful for a user to request "Chinese" in filtering or lookup queries, though the association between "zh" and Mandarin in existing usage limits the feasibility of using "zh" to indicate a truly general "Chinese" request.

In filtering, users that ask for "zh" probably want Mandarin or would consider Mandarin acceptable, so content tagged "zh-cmn*" or "cmn*" is probably relevant, while content tagged "zh-yue"/"zh-gan"/etc. (or "yue"/"gan"/etc.) has a much lower likelihood of being relevant. In a no-extlang paradigm, "zh" records should probably be returned when "cmn" is requested, but not for requests of "yue"/"gan"/etc.; search engines would need to understand the relationship between "zh" and "cmn". Going into the future, increased importance of Cantonese (especially for A/V media) suggests that "zh" should be discouraged, though transition will be feasible only if requests for "zh" get appropriate results.
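As a sketch of what "understanding the relationship between zh and cmn" might look like in a no-extlang paradigm, the following Python tailors plain prefix filtering with a small equivalence table. The table, the function names and the sample records are all hypothetical; nothing like this is specified in RFC 4647.

```python
# Hypothetical table: the encompassed language that legacy "zh" content is assumed to be in.
PREDOMINANT = {"zh": "cmn"}


def prefix_match(language_range, tag):
    """Plain prefix matching: equal, or the tag extends the range with more subtags."""
    r, t = language_range.lower(), tag.lower()
    return t == r or t.startswith(r + "-")


def tailored_filter(language_range, tags):
    """Prefix filtering plus a zh <-> cmn equivalence on the primary subtag."""
    first, _, rest = language_range.partition("-")
    suffix = "-" + rest if rest else ""
    ranges = [language_range]
    for macro, predominant in PREDOMINANT.items():
        if first.lower() == macro:
            ranges.append(predominant + suffix)
        elif first.lower() == predominant:
            ranges.append(macro + suffix)
    return [t for t in tags if any(prefix_match(r, t) for r in ranges)]


records = ["zh", "zh-Hans", "cmn", "cmn-Hans-CN", "yue", "yue-Hant-HK", "gan"]
print(tailored_filter("cmn", records))  # "zh" records come back too
print(tailored_filter("zh", records))   # "cmn" records come back too
print(tailored_filter("yue", records))  # "zh" records do not
```

Note that the equivalence is deliberately one-to-one: "yue" and "gan" get no path to "zh", which is the behaviour argued for above.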

Wrt lookup, in a closed system there are no problems using "zh" if only Mandarin resources are provided; it may not be a problem to use for Mandarin if both Mandarin and Cantonese resources are provided -- because the system is closed -- but that's probably not a good choice. For Web sites, many may use "zh", and many browsers may send requests for "zh". In a no-extlang paradigm requests for "cmn" should match "zh", and "cmn" sites should match "zh" requests; the implication is that browsers should send both "cmn" and "zh", and "cmn" shouldn't be used for a site unless the server can match "zh".

"Chinese as used in PRC/Taiwan/etc."

The only tag options are "zh-CN"/"zh-TW"/etc. As noted, these have been used in the past to indicate (Mandarin) content in Simp or Tdnl writing. This is definitely to be discouraged, and hopefully is on the decline. Such tags continue to be used as locale IDs in software systems, and that would take some time to phase out (assuming that's feasible). Because of such usage in the wild we will need to deal with these tags for some time.

Apart from such legacy usage, these categories/tags probably are not that useful, and even if there were usefulness for a user to make a request in terms of the generic "Chinese" language but for a specific region, the association between "zh" and Mandarin limits feasibility of making such a general request.

In filtering, users that ask for "zh-??" probably want Mandarin or would consider Mandarin acceptable. Comments for "Chinese" above are applicable, modulo the added consideration regarding writing system. Users that ask for "zh-CN" or "zh-SG" probably want Simp or would find that acceptable, so records tagged "zh-Hans*", "zh-cmn-Hans*" (or "cmn-Hans*") would probably have high relevance. On the other hand, users that ask for "zh-TW", "zh-HK" or "zh-MO" probably want Tdnl or would find that acceptable, so records tagged "zh-Hant*", "zh-cmn-Hant*" (or "cmn-Hant*") would probably have high relevance. Similarly, users that ask for "zh-Hant"/"zh-cmn-Hant"/"cmn-Hant" would probably find records tagged "zh-TW" highly relevant; mutatis mutandis for "*-Hans" and "zh-CN".
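The region/script associations just described could be captured along these lines. This is only a sketch: the mapping table is taken from the paragraph above, and the helper names and the "Mandarin-like" test are my own assumptions about how a search engine might tailor relevance.

```python
# Associations taken from the paragraph above.
REGION_TO_SCRIPT = {"CN": "Hans", "SG": "Hans", "TW": "Hant", "HK": "Hant", "MO": "Hant"}


def mandarin_like(tag):
    """True for 'cmn*', 'zh-cmn*' and for 'zh*' tags with no other extlang-style subtag."""
    subtags = tag.lower().split("-")
    if subtags[0] == "cmn":
        return True
    if subtags[0] != "zh":
        return False
    # A three-letter second subtag is treated as an encompassed language (yue, gan, ...).
    second = subtags[1] if len(subtags) > 1 else ""
    return not (len(second) == 3 and second.isalpha() and second != "cmn")


def script_of(tag):
    """Explicit four-letter script subtag, else the script implied by 'zh-<region>'."""
    subtags = tag.split("-")
    for s in subtags[1:]:
        if len(s) == 4 and s.isalpha():
            return s.title()
    if len(subtags) == 2 and subtags[0].lower() == "zh":
        return REGION_TO_SCRIPT.get(subtags[1].upper())
    return None


def likely_relevant(request, record_tag):
    """Both sides look like Mandarin and agree on explicit or implied script."""
    if not (mandarin_like(request) and mandarin_like(record_tag)):
        return False
    wanted = script_of(request)
    return wanted is not None and wanted == script_of(record_tag)


print(likely_relevant("zh-CN", "cmn-Hans-CN"))
print(likely_relevant("zh-Hant", "zh-TW"))
print(likely_relevant("zh-CN", "zh-Hant"))
```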

Wrt lookup: these categories/tags should not be used. In closed systems, they could be used for Simp/Tdnl resources, or these tags used as locale IDs could map to more preferred tags -- "zh-Hant", or "zh-cmn-Hans"/"cmn-Hans", or "zh-Hant-TW", etc.

"Chinese, Tdnl/Simp writing"

The only tag options are "zh-Hant" and "zh-Hans". As noted, these have been available for a few years; they were registered as preferred alternatives to "zh-TW" and "zh-CN". It is unclear how much these have been used for content; they are used in some software platforms (e.g. .Net cultures). Of course, existing content would almost certainly be Mandarin. Also, they are obviously relevant for text but not for A/V.

Wrt filtering, see above for association between "zh" and "cmn", and between "*-Hant"/"*-Hans" and region subtags.

Wrt lookup: these may be useful in closed systems that support only Mandarin, and are definitely preferable to "zh-CN" and "zh-TW". If both Mandarin and Cantonese resources are supported, it is probably best not to use "zh" to represent Mandarin. Some Web sites or browsers may be using these for Mandarin; see comments under "Chinese" for some implications.

"Chinese, Tdnl/Simp as used in PRC/Taiwan/etc."

The tag options are the same for all three proposals: "zh-Hant-CN", "zh-Hans-CN", "zh-Hant-TW", etc. These have been available for a few years, though it is unclear how much they have been used (my guess: very, very little). If they are currently used, "zh" is probably assumed to represent Mandarin. Issues for filtering and lookup described above due to this association between "zh" and Mandarin apply here as well.

Since there is likely little existing usage and these perpetuate issues surrounding "zh" and Mandarin, I don't see any scenario in which it would make sense to use these categories/tags.

"Mandarin", "Cantonese", etc.

For Mandarin, the tag "zh-cmn" would be available under all proposals; the no-extlang and mixed proposals would also make "cmn" available. This category would be appropriate for A/V content. For text, written form should probably also be included (i.e. this is probably not a useful category for text scenarios). The tag "zh-cmn" is in use for A/V content (not sure how widespread already, but it is likely growing); it is probably not currently used for text content. Under the no-extlang and mixed proposals, one of "zh-cmn" and "cmn" would have to be recommended for A/V scenarios, and the other deprecated in general. While "cmn" would be canonical, it is "zh-cmn" that has prior usage.

Similar comments apply to "Cantonese", "zh-yue" and "yue" (and likewise for other languages -- though not all currently registered tags would become redundant in an extlang paradigm; e.g. "zh-hakka").

The particular case of "Mandarin" has the added complication of existing usage of "zh"; the relevant issues are discussed above. In general, the more explicit tags "zh-cmn" or "cmn" are preferable when "Mandarin" is intended, though applications need to take the existing use of "zh" into consideration.

Wrt filtering, the extlang/no-extlang decision affects results. For filtering as spec'd in RFC 4647 in an extlang paradigm, a request for the macrolanguage will return any records tagged for the macrolanguage or any encompassed language. Since a request for "zh" is (at least currently) likely intended to be a request for Mandarin, the result set would include any records in other encompassed languages -- generally, probably low relevance, but also probably low frequency. Reversing things, if "zh-cmn" is requested, it is probably appropriate to return records tagged "zh", though the currently-spec'd algorithm would not do that. (Records tagged "zh" also would not be returned if any other encompassed language is requested -- which is probably appropriate.) In a paradigm of no-extlang 4646bis, "zh" would not match for records tagged for any encompassed language by the currently-spec'd algorithm. The filtering algorithm would have to be modified or tailored to get a match between "zh" and "cmn".
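The difference can be seen by running plain prefix matching (RFC 4647 basic filtering, sketched here without wildcard support) over two hypothetical record sets, one tagged in each paradigm. The record lists are invented for illustration.

```python
def basic_filter(language_range, tags):
    """Basic filtering: a tag matches if it equals the range or extends it."""
    r = language_range.lower()
    return [t for t in tags if t.lower() == r or t.lower().startswith(r + "-")]


# Hypothetical record sets tagged under each paradigm.
extlang_records = ["zh", "zh-cmn", "zh-cmn-Hans", "zh-yue", "zh-gan"]
no_extlang_records = ["zh", "cmn", "cmn-Hans", "yue", "gan"]

print(basic_filter("zh", extlang_records))      # everything, including zh-yue and zh-gan
print(basic_filter("zh-cmn", extlang_records))  # zh-cmn*, but not plain "zh"
print(basic_filter("zh", no_extlang_records))   # only "zh"
print(basic_filter("cmn", no_extlang_records))  # cmn*, but not "zh"
```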

Wrt lookup also, the extlang/no-extlang decision affects results. With extlang, a request for any encompassed language would match resources tagged "zh". This could potentially lead to missing preferred matches. Thus, suppose a user indicated a preference for "zh-yue" then "zh-gan", and after that "zh-cmn" or "zh"; and suppose the available resources included resources tagged "zh-gan" and "zh": the match would be on "zh", not "zh-gan". In a no-extlang 4646bis paradigm, none of the tags for encompassed languages match "zh". This would avoid the problem just described, but it also means that "cmn" would not match "zh" -- unless the algorithm were modified or tailored.
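The missed-preference scenario can be reproduced with a small sketch of RFC 4647 lookup over a priority list. The function name is hypothetical and the sketch omits wildcard and private-use handling; each range is fully truncated before the next range in the list is tried, which is what causes the problem.

```python
def lookup(priority_list, tags, default=None):
    """Try each range in order, truncating it from the right until a tag matches."""
    available = {t.lower(): t for t in tags}
    for language_range in priority_list:
        subtags = language_range.lower().split("-")
        while subtags:
            candidate = "-".join(subtags)
            if candidate in available:
                return available[candidate]
            subtags.pop()
            # A single-character subtag left at the end is dropped as well.
            if subtags and len(subtags[-1]) == 1:
                subtags.pop()
    return default


# extlang: "zh-yue" falls back to "zh" before "zh-gan" is ever considered
print(lookup(["zh-yue", "zh-gan", "zh-cmn", "zh"], ["zh-gan", "zh"]))
# no-extlang: "yue" has nothing to fall back to, so "gan" wins
print(lookup(["yue", "gan", "cmn", "zh"], ["gan", "zh"]))
# no-extlang: "cmn" alone never reaches "zh"
print(lookup(["cmn"], ["zh"]))
```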

"Mandarin, Tdnl/Simp writing"/"Cantonese, Tdnl/Simp writing"/etc.

As above, the currently-available tags for Mandarin would remain available under all proposals: "zh-cmn-Hant" and "zh-cmn-Hans"; the no-extlang and mixed proposals would add "cmn-Hant" and "cmn-Hans". These categories are appropriate for text but not A/V scenarios. Of course, in text scenarios "zh-Hant"/"zh-Hans" have been used to represent these categories (due to the association between "zh" and Mandarin). Tags using "cmn" should be recommended in general, but applications need to take the legacy tags into consideration. Under the no-extlang and mixed proposals, a choice to recommend one of "zh-cmn-Han?" or "cmn-Han?" and to deprecate the other would be needed.

Some, but not all, of these comments apply to Cantonese, "zh-yue-Han?" and "yue-Han?". No tags were registered in the past, and so no issues with synonyms will arise: either "zh-yue-Han?" will become available, or "yue-Han?" will become available, but not both. (These comments apply for the other Chinese languages as well.)

In a no-extlang or mixed paradigm, there would be alternate tags for a single category (e.g., "zh-cmn-Hans" and "cmn-Hans"); for each such pair, a choice would need to be made to recommend one and deprecate the other. In the no-extlang paradigm, such a pair would exist only for Mandarin; in that case, consistency would argue for recommending "cmn-Han?" rather than "zh-cmn-Han?". In the mixed paradigm, however, such pairs would exist for every language.

In filtering, users that ask for "zh-Han?" probably want Mandarin or would consider Mandarin acceptable. In this case, the filtering algorithm spec'd in RFC 4647 would not provide a match for content tagged either as "zh-cmn-Han?" or as "cmn-Han?"; that algorithm would need to be tailored or modified. If users request "zh-cmn", the current algorithm would match content tagged "zh-cmn-Han?" but not if tagged "cmn-Han?". (Mutatis mutandis for "zh-yue-Han?", etc.) If no-extlang tags are used in the request, e.g. "cmn", then only no-extlang tags ("cmn", "cmn-CN", "cmn-Hans", etc.) would be matched unless the algorithm were tailored or modified.

In lookup, a request for "zh-cmn-Han?" would match content tagged "zh-cmn" or "zh"; but there would not be a match in the more useful case of "zh-Han?" unless the current algorithm were tailored or modified. Also requests involving other languages, such as "zh-yue-Han?" could produce a match on resources tagged "zh", possibly missing a preferred match -- see the scenario discussed in previous section. (Since these are text scenarios, there is potential consolation in that "zh" resources are likely to be acceptable, provided the text follows the user's Tdnl/Simp preference.) In a no-extlang paradigm, none of the tags for encompassed languages would match "zh", and so this problem is avoided.

"Mandarin as used in PRC/TW/etc." / "Cantonese as used in ..." / etc.

No tags for these categories were ever registered, and so the available tags would depend on the choice between the extlang, no-extlang and mixed proposals: e.g., "zh-cmn-CN" vs. "cmn-CN" vs. "zh-cmn-CN" / "cmn-CN". In the mixed paradigm, there would be two tags available for each semantic, and a choice would need to be made to recommend one and deprecate the other.

These categories would be useful for A/V scenarios, but not text scenarios.

In filtering, a user requesting "zh" may perhaps want Mandarin, though since "zh" is not a great choice for Mandarin for A/V scenarios in general, it probably would not be used that way. In this case, "zh" could perhaps be used for a generic request, and in an extlang paradigm the filtering algorithm spec'd in RFC 4647 would treat that generically, matching any tag "zh*". It's not really clear when users would want to make such a generic request for A/V content, though: users are more likely to specify which particular language they want. A request for "zh-cmn" would match records tagged "zh-cmn-??" by the spec'd filtering algorithm, but not records tagged "cmn-??". Likewise, a request for "cmn" would match records tagged "cmn-??", but not records tagged "zh-cmn-??" (a potential issue only for the mixed paradigm since there are no currently registered tags of the form "zh-cmn-??").

In lookup, a request for "zh-cmn-??" would match resources tagged "zh-cmn" or "zh"; a request for "cmn-??" would match resources tagged "cmn". The problem mentioned in the previous two sections of scenarios in which a preferred match could be missed exists here in an extlang paradigm but not in a no-extlang paradigm.

"Mandarin, Tdnl/Simp as used in PRC/TW/etc." / "Cantonese, Tdnl/Simp as..." / etc.

No tags for these categories were ever registered, and so the available tags would depend on the choice between the extlang, no-extlang and mixed proposals: e.g., "zh-cmn-Hans-CN" vs. "cmn-Hans-CN" vs. "zh-cmn-Hans-CN" / "cmn-Hans-CN". In the mixed paradigm, there would be two tags available for each semantic, and a choice would need to be made to recommend one and deprecate the other.

These categories would be useful for text scenarios, but not A/V scenarios.

In filtering, a user requesting "zh" probably wants Mandarin or would find Mandarin acceptable. In an extlang paradigm, this request would match "zh-cmn-Han?-??". It would also match "zh-yue-Han?-??" (or "zh-gan-Han?-??", etc.), which would likely be of low relevance, but also of low frequency. In a no-extlang paradigm, "zh" would not match the tags for these categories (e.g. "cmn-Hans-CN") unless the spec'd algorithm were tailored or modified.

If a user requests "zh-cmn" (or "zh-yue"), the categories from this set they would get matches on would be relevant in terms of language (though maybe not written form); this would happen using the RFC 4647 algorithm as is in an extlang paradigm, but tailoring or modification would be needed in a no-extlang paradigm.

If a user requests "zh-Han?", they probably want Mandarin or would find Mandarin acceptable. The spec'd algorithm would not produce matches for any of the potential tags for these categories under any of the proposals unless the algorithm were tailored or modified.

In lookup, a request for "zh-cmn-Han?-??" would match resources tagged "zh-cmn-Han?" or "zh-cmn" or "zh"; a request for "cmn-Han?-??" would match resources tagged "cmn-Han?" or "cmn". The problem mentioned in the previous sections of scenarios in which a preferred match could be missed exists here in an extlang paradigm but not in a no-extlang paradigm.
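The fallback sequences described here are just successive right-truncations of the requested tag. A minimal sketch (the function name is mine, and it ignores wildcards) that lists what a lookup would try, in order:

```python
def fallback_chain(language_range):
    """List the tags a lookup would try, from most to least specific."""
    subtags = language_range.split("-")
    chain = []
    while subtags:
        chain.append("-".join(subtags))
        subtags.pop()
        # A single-character subtag left at the end is dropped as well.
        if subtags and len(subtags[-1]) == 1:
            subtags.pop()
    return chain


print(fallback_chain("zh-cmn-Hans-CN"))  # ends at "zh"
print(fallback_chain("cmn-Hans-CN"))     # ends at "cmn", never reaching "zh"
```

The extlang chain always bottoms out at "zh", which is both what lets legacy "zh" resources be found and what lets them pre-empt a later, more specific preference.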

All this needs further refinement and processing to get to some useful conclusions, but I've gone on long enough for one reading. I'll stop here for now.

Peter

Attachment: TagsForChinese-Alternatives.txt