RE: extlang (was Re: Suggested language for "mis" (Re: [Ltru] RE: ISO 639-2 decision: "mis"))

[Mark: the quoting behavior of your mail client is inconsistent and extremely hard to follow.]

From: Mark Davis [mailto:mark.davis@icu-project.org]
Sent: Tuesday, June 19, 2007 11:31 AM

> Premises
> 1. The reason for making microlanguages be extlang
> instead of primary sublanguages is so that
> truncation-style matching will have better results.
> 2. Fallback works when there is mutual comprehensibility (not necessarily 100%, but to a high degree); if you fallback to something that is not comprehensible, then fallback has failed.

I'd put this differently. Given macro ID "xxx" and encompassed micro ID "yyy":

1. There is no direct connection between primary subtags "xxx" and "yyy"; a connection would have to be maintained by tables in matching processes. But if "yyy" always required "xxx" as a prefix, then that relationship would be carried in the tag itself, and because truncation is used in matching (it's part of basic filtering, extended filtering and lookup in RFC 4647), the relationship falls out of the matching process.

2. The relationship is important because there is a significant level of existing usage of "xxx", and we want to allow for future usage of "yyy".

The point about existing usage is significant here, I think. If "xxx" and "yyy" were introduced at the same time and users adopted "yyy" while "xxx" never took off, then there's no particular need to relate them. But if both are going to get used (i.e. "xxx" would get used apart from "yyy", and "yyy" would also get used -- whether that is with or without "xxx"), then we probably want the relationship to be captured.
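
To make point 1 concrete, here is a rough Python sketch of truncation-style lookup, loosely following the lookup scheme in RFC 4647. It is a simplification, not a conforming implementation, and the function names and the MACRO_OF table are hypothetical, introduced only for illustration. With the prefixed form "xxx-yyy", truncation alone reaches "xxx"; with a bare "yyy", the matcher needs a separately maintained table.

```python
def lookup(language_range, available, default=None):
    """Simplified truncation lookup: drop trailing subtags until a tag matches."""
    tags = {t.lower(): t for t in available}
    subtags = language_range.lower().split("-")
    while subtags:
        candidate = "-".join(subtags)
        if candidate in tags:
            return tags[candidate]
        subtags.pop()
        # A single-character subtag left at the end is removed along with
        # the subtag that followed it.
        while subtags and len(subtags[-1]) == 1:
            subtags.pop()
    return default


# Hypothetical side table a matcher would need if "yyy" were a bare primary subtag.
MACRO_OF = {"yyy": "xxx"}


def lookup_with_table(language_range, available):
    """Fall back through the (assumed) macrolanguage table when truncation fails."""
    found = lookup(language_range, available)
    if found is None:
        primary = language_range.lower().split("-")[0]
        macro = MACRO_OF.get(primary)
        if macro is not None:
            found = lookup(macro, available)
    return found


available = ["xxx", "en"]
print(lookup("xxx-yyy", available))        # xxx  (relationship carried in the tag)
print(lookup("yyy", available))            # None (no relationship without a table)
print(lookup_with_table("yyy", available)) # xxx  (relationship supplied by the table)
```

The trade-off is where the macro/micro relationship lives: in the tag syntax, where every truncating matcher gets it for free, or in data that each matching process has to carry and keep current.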

> Option A.
> 1. Thus for extlang to work for microlanguages, the
> speakers of any microlanguages sharing a macrolanguage
> need to be able to understand the speakers of any
> other microlanguages sharing that macrolanguage.
> 2. Peter and the ISO JAC can verify that A1 is true...

A.2 may be a little more than the JAC can guarantee. The idea I worked with is that macrolanguages are coded because in some application the group of microlanguages is being coded as one -- they are not distinguished. Presumably that would happen because they are mutually intelligible, though that may not always be the case.

I don't assume that Mandarin and Cantonese are mutually intelligible at a functional level (though perhaps in their written forms they may be); appropriately or not -- perhaps an accident of history -- a single ID "zh" has been used for both. What we need to do is to be able to relate *as appropriate* "zh" content with "Cantonese", "Mandarin", etc., or Cantonese, etc. content with requests for "zh". Similarly for other cases in which -- for whatever reason -- things are, in practice, sometimes split but sometimes clumped.
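
The other direction, a request for "zh" being served content tagged for a specific variety, can be sketched with prefix-style filtering in the manner of RFC 4647 basic filtering. This is a minimal illustration under assumptions: basic_filter is a hypothetical helper, and the tags "zh-yue" and "zh-cmn" stand for the prefixed (extlang-style) form under discussion here, with "yue" as the bare alternative.

```python
def basic_filter(language_range, tags):
    """Return tags that equal the range or extend it by whole subtags."""
    rng = language_range.lower()
    if rng == "*":
        return list(tags)
    return [
        t for t in tags
        if t.lower() == rng or t.lower().startswith(rng + "-")
    ]


content = ["zh-yue", "zh-cmn", "yue", "en"]
print(basic_filter("zh", content))   # ['zh-yue', 'zh-cmn']
print(basic_filter("yue", content))  # ['yue']
```

A request for "zh" picks up the prefixed Cantonese and Mandarin content with no extra data, while content tagged with the bare "yue" is missed unless the matcher consults a table.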

> Option B.
> 1. The macrolanguage alone is always assumed to be
> the "standard", and that can be identified. That is,
> "zh" is always assumed to be Mandarin, "ar" is
> always assumed to be "Standard Arabic", etc. (That
> is, I think, the correct approach, but is *not*
> currently in the spec.)

Of course, this assumes that for macro/micro cases there always *is* one variety that can be considered *the* major variety.

When Gary Simons first began to analyze 639-2 with a view to mapping it onto Ethnologue, the original operational principles we established looked for just that -- we referred to it as the "major language variety (MLV) principle". (See http://www.sil.org/silewp/2002/SILEWP2002-004.pdf, page 7.) The idea was that we would equate the existing 639-2 ID with *the* major variety, when one could be identified. Of course, we immediately ran into the reality that there isn't always *one* major variety. See http://www.ethnologue.com/14/iso639/analysis.asp#U for cases we found in our initial analysis that we could not resolve.

I had expected we would try to work with the JAC to come up with some resolution for those cases. That was until the macrolanguage concept occurred to me. And a key thing that prompted it was the pre-existing IANA registrations involving zh: we could not equate "zh" with Mandarin because in existing usage it was clearly being used for Mandarin, Cantonese, Minnan, etc. An available alternative was to consider "zh" to be a collection (and that was, in fact, what we did in our initial analysis), but that just didn't reflect the way it was actually being used: "zh" was in widespread use as though it represented an individual language. Somehow, I had to marry those two: in use as though it represents an individual language, but also in use for multiple languages. Hence the prototype for macrolanguage.

Now, you're introducing the possibility of having the one-language/many-language dichotomy while also maintaining the MLV principle. (Of course, the most widespread use of "zh" for an *individual* language would have been for Mandarin.) That's not formalized in 639-3, but that doesn't block it from being a useful idea. One issue that would need to be worked out is when "xxx" denotes the MLV-preferred microlanguage as an individual language, when it denotes the undifferentiated group of microlanguages, and whether this difference matters in practice.

But there's also the other issue: not all macrolanguages have one microlanguage that's clearly picked out by the MLV principle.

Peter
