Re: extlang (was Re: Suggested language for "mis" (Re: [Ltru] RE: ISO 639-2 decision: "mis"))

On 6/18/07, John Cowan <cowan@ccil.org> wrote:
>
> Mark Davis scripsit:
>
> > We added extlang to allow ourselves the freedom to make choices
> > when 639-3 came along. We *very clearly did not define its meaning*,
> > because we didn't know what 639-3 was finally going to look like,
>
> Not so much.  We had an excellent idea of what 639-3 would both look and
> actually be like when 4646 was finalized.  We couldn't include 639-3 or
> extlangs because 639-3 itself was not yet final.

I think perhaps both of our "we"s are overstated. Since we disagree on this
point, neither my "we" nor your "we" is inclusive.

For example, you can't mean that every member of the working group had
looked it over thoroughly, and explored all the implementation
ramifications. I'd like to see a show of hands for those who did -- maybe
everyone except for me had, but I'd be rather surprised at that.

> > nor did we have agreement on what we should actually do.
>
> We had at least the consensus of silence; at least, I don't remember any
> complaints at the time.  Remember that the development of 4646 started at
> least a year before LTRU was formally created.

I don't think everyone had reviewed it in detail, but perhaps the show of
hands would prove me the only one.

> > We *already* had macrolanguages with ISO 639-2 in RFC 4646 and we *did
> > not* use extlang for them: examples are "sr", "hr", "nb", etc.
>
> (Rather, these are examples of languages *encompassed* by macrolanguages,
> henceforth "encompassed languages".  I realize that's just a slip.)

True, thanks.

> > We are not going to (and cannot) be forcing users to encode nb as no-nb,
> > nor sr as sh-sr.
>
> Nobody has ever proposed that.  Language subtags coming from 639-1
> or 639-2 will not change, even if 639-3 says they encode encompassed
> languages.

My point was that anyone who wants to deal with macrolanguages already has
to deal with sr, hr, etc. as primary subtags, not as secondary ones. Thus
whatever mechanisms people have to work with sh, etc. can be extended to
other cases without the extlang mechanism.

> > When we ("Google") tried implementing matching with "zh-yue" and others,
> > we found it made things *more* difficult, not less.
>
> Respectfully I suggest that because you ("Google") assign tags to incoming
> rather than outgoing content, your use of BCP 47 is essentially private
> rather than in interchange.  That makes it potentially important, but
> definitely not prototypical.

First of all, that isn't true (but thanks for the "respectfully"!). We
actually use language tags in a huge number of products, many of which have
APIs. We are not fully BCP 47 compliant, but are working towards that.

But the main point is, I want those people who have tried to implement
extlang -- professionally, not just in toy programs -- to speak up about
their experiences. So if you want to speak to your experience implementing
this professionally, I'm all ears.

Addison mentions that fallback matching is not a problem. Mechanically it is
a no-brainer. But the *results* of that fallback are what we are having
problems with. And *that's* why I raise this issue, and call it "baking in
an assumption". Let me try to set this out, yet again. I'll use
"microlanguage" as shorthand for a "language encompassed by a
macrolanguage".

I see two options. I may have captured the extlang reasoning incorrectly, so
please bear with me. And if there are other reasons for having extlang,
that'd be good to hear.

Premises

   1. The reason for making microlanguages be extlangs instead of primary
   language subtags is so that truncation-style matching will have better
   results.
   2. Fallback works when there is mutual comprehensibility (not
   necessarily 100%, but to a high degree); if you fall back to something
   that is not comprehensible, then fallback has failed.
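
To make premise 1 concrete, here is a minimal sketch of truncation-style
lookup in the spirit of RFC 4647. The function names and the list of
available tags are hypothetical, chosen only for illustration; this is not
taken from any particular implementation.

```python
def truncation_fallbacks(tag):
    """Yield the tag, then progressively shorter prefixes of it."""
    subtags = tag.split("-")
    while subtags:
        yield "-".join(subtags)
        subtags.pop()
        # A single-character subtag (such as "x") is removed together
        # with the subtag that followed it.
        while subtags and len(subtags[-1]) == 1:
            subtags.pop()


def lookup(requested, available, default=None):
    """Return the first available tag reached by truncating `requested`."""
    by_lower = {a.lower(): a for a in available}
    for candidate in truncation_fallbacks(requested.lower()):
        if candidate in by_lower:
            return by_lower[candidate]
    return default


# With an extlang-style tag, truncation lands on the macrolanguage.
print(lookup("zh-yue", ["zh", "en"], default="en"))  # zh
# With a primary-subtag form, truncation never reaches "zh".
print(lookup("yue", ["zh", "en"], default="en"))  # en
```

Whether landing on "zh" is a *better* result is exactly what premise 2
puts in question: the mechanics say nothing about comprehensibility.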

Option A.

   1. Thus for extlang to work for microlanguages, the speakers of any
   microlanguages sharing a macrolanguage need to be able to understand the
   speakers of any other microlanguages sharing that macrolanguage.
   2. Peter and the ISO JAC can verify that A1 is true (that every
   speaker of Hakka can understand Jinyu; every speaker of Shihhi Arabic
   can understand Cypriot Arabic; and so on).
   3. Everything is hunky-dory.

Option B.

   1. The macrolanguage alone is always assumed to be the "standard", and
   that can be identified. That is, "zh" is always assumed to be Mandarin,
   "ar" is always assumed to be "Standard Arabic", etc. (That is, I think, the
   correct approach, but is *not* currently in the spec.)
   2. Thus for extlang to work for microlanguages, the speakers of any
   microlanguages sharing a macrolanguage need to be able to understand the
   speakers of the standard used for that macrolanguage.
   3. Peter and the ISO JAC can verify that B2 is true (that every
   speaker of Hakka can understand Mandarin; every speaker of Shihhi
   Arabic can understand Standard Arabic; and so on).
   4. Everything is hunky-dory.

If neither A nor B is plausible, and we can't find some other
compelling reason to have extlangs, it is *far* better to add the
Macrolanguage field to the registry, and let people implement their own
matching making use of that AND other factors.

After all, it is trivial to make a 4647bis that adds an optional step for
microlanguages: when you get to a microlanguage, the next step is to look
at its macrolanguage before falling back to the default. That has the same
result (and same problems) as extlang, but it is not baked into the
standard -- it is something that people can implement if they want without
impacting matching for everyone else.
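
A rough sketch of what that optional step could look like, assuming the
registry carried a Macrolanguage field. The MACROLANGUAGE table, the
function name, and the use_macrolanguage switch are all hypothetical, and
the table holds only a few illustrative entries. Single-character subtags
are not specially handled here, to keep the sketch short.

```python
# Hypothetical stand-in for a registry "Macrolanguage:" field
# (illustrative entries only).
MACROLANGUAGE = {
    "yue": "zh",
    "hak": "zh",
    "nb": "no",
    "sr": "sh",
    "hr": "sh",
}


def lookup_with_macrolanguage(requested, available, default=None,
                              use_macrolanguage=True):
    """Truncation lookup with an opt-in macrolanguage step."""
    by_lower = {a.lower(): a for a in available}
    subtags = requested.lower().split("-")
    while subtags:
        candidate = "-".join(subtags)
        if candidate in by_lower:
            return by_lower[candidate]
        # Optional step: once only the primary subtag is left, try its
        # macrolanguage before giving up and using the default.
        if use_macrolanguage and len(subtags) == 1:
            macro = MACROLANGUAGE.get(subtags[0])
            if macro in by_lower:
                return by_lower[macro]
        subtags.pop()
    return default


print(lookup_with_macrolanguage("yue-HK", ["zh", "en"], default="en"))  # zh
print(lookup_with_macrolanguage("yue-HK", ["zh", "en"], default="en",
                                use_macrolanguage=False))  # en
```

The point of the switch is that an implementation that does not want
"yue" to fall back to "zh" simply leaves the step off; nothing in the tag
syntax forces the choice.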

> > Matching "zh" and "yue" is not something you want to do
> > automatically. Moreover, because of #2 we had to have a mechanism for
> > dealing with macrolanguages in RFC 4646 *anyway*.
>
> Very plausible in your circumstances.  But note that Yue (Cantonese)
> content is *already* properly tagged "zh-yue" on precisely the theory
> that's being applied to 639-3 encompassed languages.
>
> I realize that this cause is probably not important to you, because
> you can (comparatively) easily change all "zh-yue" tags to "yue", but
> this is not the case for other users of BCP 47 on and off the Internet,
> who will never even hear about the change.

These are irregular tags anyway, and can stay irregular tags afterwards. We
and everyone else already have to deal with equivalences with grandfathered
and irregular tags anyway; these are not a real problem.
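
As an illustration of the kind of equivalence handling meant here, a
minimal sketch follows. The EQUIVALENTS table and the canonicalize name
are hypothetical; in particular, mapping "zh-yue" to "yue" is the outcome
under discussion in this thread, not something the current registry says.
Only whole-tag matches are handled.

```python
# Illustrative equivalence table (assumption, not registry data):
# old whole tags mapped to the form an implementation prefers.
EQUIVALENTS = {
    "i-klingon": "tlh",
    "zh-yue": "yue",  # hypothetical: the change being debated here
}


def canonicalize(tag):
    """Replace a known old tag with its preferred equivalent."""
    return EQUIVALENTS.get(tag.lower(), tag)


print(canonicalize("zh-yue"))  # yue
print(canonicalize("en-US"))   # en-US
```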

> > Thus to make a proposed change from 4646 to use the extlang mechanism
> > for languages that have macrolanguages, we need a very compelling
> > case that the additional complication solves more problems than it
> > creates. We haven't seen that yet, and certainly have no consensus
> > that it is the case.
>
> On the contrary, the burden of persuasion is with you.  Tags like
> zh-yue are already present in 4646, and it's up to you to provide a
> convincing argument to deprecate them.  Furthermore, LTRU and its ad hoc
> predecessor has been assuming the extlang structure since at least 2004.
> Derailing that is what will take a "very compelling case".

I disagree strongly. If you can't make a compelling case that extlang will
make BCP 47 better instead of worse, and won't even look at the reasons not
to do it, nor even bother to set out a case for it, then why should we add
it?

> > A. [Don't use extlangs.]
>
> I continue in opposition to this.
>
> > B. (optional) Add a field Macrolanguage: to the language subtag
> > registry.
>
> I am not opposed to this, precisely because encompassed languages and
> the corresponding macrolanguage cannot be identified syntactically.

Good.

--
> May the hair on your toes never fall out!       John Cowan
>         --Thorin Oakenshield (to Bilbo)         cowan@ccil.org
>

I hope not....

-- 
Mark