Re: [Ltru] my technical position on extlang

A general comment, repeated below.

People keep confusing two orthogonal concepts, and I think this is at the
root of the extlang issue. These are:

   1. "get me languages that are mutually intelligible with X" (maybe to
   degree Y), and
   2. "get me the languages that have the same macrolanguage as X"

Number 1 is very interesting, and would be very useful; but it is not at all
the same as #2. If macrolanguage were defined as #1, I would probably be all
in favor of baking it into extlangs. But it is not at all the same, as many,
many examples illustrate. Moreover, "mutual intelligibility" differs depending
on whether the content is written or spoken; baking it into the syntax
does not allow for that difference.

Also, as I read your response, I think at least part of our apparent
disagreement comes from using different terminology.

I was using "content negotiation" in the lookup sense, which is what is
typically done with Accept-Language (not always, but typically). That is,
the client supplies a list of languages, perhaps with q values, and
expects to get one thing back: for example, a web page in one of the
requested languages, though it could be any sort of resource. In such
interactions, you do want lookup on the individual items; "ar" is not
treated like "ar(-.*)".

Filtering is much different. An example would be: give me a list of all the
documents in my index that are "fr, ar, sku" that have X in field 5. In that
case, I do want to treat "ar" as if it were "ar(-.*)". Accept-Language
could also be used for this, but with a different meaning.
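
To make the two senses concrete, here is a minimal sketch in Python. It is
only loosely modeled on the lookup and basic filtering schemes of RFC 4647
and makes simplifying assumptions: no "*" range, no special handling of
single-letter subtags, and the function names and sample tags are
hypothetical.

```python
def lookup(priority_list, available, default=None):
    """Return the single best available tag (lookup, simplified)."""
    have = {tag.lower(): tag for tag in available}
    for language_range in priority_list:
        candidate = language_range.lower()
        while candidate:
            if candidate in have:
                return have[candidate]
            # Truncate the *range* from the right: "ar-sa" -> "ar" -> "".
            candidate = candidate.rpartition("-")[0]
    return default


def basic_filter(priority_list, available):
    """Return every available tag that matches any range (filtering, simplified)."""
    matches = []
    for tag in available:
        lowered = tag.lower()
        for language_range in priority_list:
            prefix = language_range.lower()
            # Here "ar" does behave like "ar(-.*)".
            if lowered == prefix or lowered.startswith(prefix + "-"):
                matches.append(tag)
                break
    return matches


available = ["ar", "ar-EG", "fr", "sku"]  # hypothetical index contents
print(lookup(["ar"], available))                     # one result: ar
print(basic_filter(["fr", "ar", "sku"], available))  # every match, incl. ar-EG
```

With lookup, "ar" yields exactly one resource; with filtering, the same "ar"
also pulls in everything tagged "ar-...".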

Bearing that in mind, let me see if I can make some useful responses.

On Fri, May 23, 2008 at 9:09 AM, John Cowan <cowan@ccil.org> wrote:

> Mark Davis scripsit:
>
> [points 1, 2, 3a snipped]
>
> These all amount to "Some implementations are buggy and don't follow
> RFC 2616."  Such bugs that could be fixed at any time, in which
> case the servers would come into compliance.  Historical practice in
> tagging documents is important, buggy protocol implementations are not.
> (A variant of the ISO C motto about "code matters, implementations
> don't".)

I was really referring to the 'lookup content negotiation' side of things,
which may or may not be relevant to what you are discussing. And I think
reality is not completely irrelevant to our discussion. ;-)

>
> > [3b]  - Interpreting 'ar' as Standard Arabic is permissible now
> > and after,
>
> Certainly.
>
> > With the above list, the practical impact will be that it will stop at
> > 'ar' anyway and return Standard Arabic.
>
> As I say, that's a clear violation of the RFC.

No, not if we are talking about content negotiation (lookup) rather than
filtering. (This may be due to our differences in terminology.)

>
>
> > The odds of an implementation supporting "pga" but not ar/arb are
> > pretty darn'd low.
>
> Unless it just happens to be a site with lots of audio from around
> the arabophone world.  *Across all sites* all non-standard Arabics are
> rare, but that doesn't apply to *specific* sites that really need to do
> language negotiation.

Again, if we are talking about content negotiation, where we return a 'best
fit', it is going to be a rare case where 'pga' is supported but 'ar' is
not. So if one stops at 'ar', "pga" would never be seen, whether it were
expressed as "ar-pga" or not.
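
As a rough illustration of the "stops at 'ar'" point, the sketch below
orders an Accept-Language value by q and returns the first supported tag.
The header values, the q weights, and the function names are assumptions
made up for this example; the parsing is deliberately minimal and ignores
malformed input.

```python
def parse_accept_language(header):
    """Split an Accept-Language value into ranges ordered by descending q."""
    weighted = []
    for position, item in enumerate(header.split(",")):
        language_range, _, params = item.strip().partition(";")
        quality = 1.0
        params = params.strip()
        if params.startswith("q="):
            quality = float(params[2:])
        # Sort by q (descending), then by original position (stable order).
        weighted.append((-quality, position, language_range.strip().lower()))
    return [language_range for _, _, language_range in sorted(weighted)]


def best_fit(header, supported):
    """Return the first supported tag, truncating each range as lookup does."""
    supported = {tag.lower() for tag in supported}
    for language_range in parse_accept_language(header):
        candidate = language_range
        while candidate:
            if candidate in supported:
                return candidate
            candidate = candidate.rpartition("-")[0]
    return None


# Bare tags for the encompassed languages.
plain = "arz, arb;q=0.9, ar;q=0.8, pga;q=0.5"
print(best_fit(plain, {"ar", "pga"}))

# The same preferences written with extlang-style tags.
extlang = "ar-arz, ar-arb;q=0.9, ar;q=0.8, ar-pga;q=0.5"
print(best_fit(extlang, {"ar", "ar-pga"}))
```

Both calls return "ar": once the server supports 'ar', the later entry for
"pga" is never reached, however it is spelled.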

If we are talking about real filtering (all documents that match the list),
see below.

>
> > Even if the server doesn't support extlang matching, you can get exactly
> > what you said you want -- reliably -- with an explicit list:
> >
> > arz, arb, ar, arq, aao, bbz, abv, acy, adf, avl, afb, ayh, acw, ayl, acm,
> > ary, ars, apc, ayp, acx, aec, ayn, ssh, ajp, apd, pga, acq, abh, aeb, auz
>
> Which is just what I said.  However, that isn't very stable with time
> as new Arabic varieties get encoded, not to mention extremely obnoxious
> to the user without special support in the client.

But again, this is an extremely unusual query, since these languages are not
mutually comprehensible. And the very fact that you excluded "shu" in your
example means that you don't want all the things that could be encompassed
by "ar". If a new one shows up in the future, maybe you can understand it,
maybe you can't and would want it excluded like "shu". And what happens with
other macrolanguages: will a speaker of Gan be able to understand Hakka?
Better to just explicitly list the ones that are wanted.

>
> > And this will work even if filtering isn't supported according to the
> > spec; and note that we have no guarantee that the server will filter
> > according to the spec -- based on the experience with Accept-Language,
> > it's actually unlikely.
>
> Most authors don't tag according to the spec either, so why have language
> tags at all?  Search engines can and should ignore them; that doesn't
> mean they are worthless for other purposes.  We design around the central
> point that people tag wisely and programs behave properly, which is not
> necessarily the best *implementation* strategy for special cases like
> search engines.

That isn't my point. My point is that the explicit list

   - works today.
   - handles everything that extlang could, with finer control.
   - is more powerful, since you can express things that extlang simply can't
   (e.g. "ro, mo, no, nn, nb"), and provide exactly the list you want.
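
A small sketch of what filtering by an explicit list could look like. The
document index, the titles, and the function name are hypothetical, and
matching is done on the primary language subtag only, which is an
assumption of this example rather than anything a spec requires.

```python
def filter_by_explicit_list(wanted, documents):
    """Keep documents whose primary language subtag is in the explicit list."""
    wanted = {tag.lower() for tag in wanted}
    return [
        (tag, title)
        for tag, title in documents
        if tag.lower().split("-")[0] in wanted
    ]


# Hypothetical index of (language tag, title) pairs.
documents = [
    ("ro", "doc-1"),
    ("mo", "doc-2"),
    ("nb-NO", "doc-3"),
    ("nn", "doc-4"),
    ("sv", "doc-5"),
    ("arz", "doc-6"),
    ("shu", "doc-7"),
]

# A grouping that no single macrolanguage prefix can express.
print(filter_by_explicit_list(["ro", "mo", "no", "nn", "nb"], documents))

# Some Arabic varieties, deliberately leaving out "shu".
print(filter_by_explicit_list(["ar", "arb", "arz", "pga"], documents))
```

The first call keeps the ro, mo, nb-NO, and nn documents; the second keeps
only the arz document and leaves shu out, which a bare "ar" prefix over
extlang-style tags could not do.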

>
> > And really, what are the odds that someone wants that exact list
> > above? And MUST have it supported with a short enumeration? And does
> > not care if additional encompassed languages are added?  According
> > to information from Peter, the microlanguages are all independent
> > languages, meaning that they are not mutually comprehensible.
>
> The Arabic languages don't have full mutual intelligibility, but that's not
> the same as saying intelligibility is zero.  I specifically picked Shuwa
> (Chadian/Nigerian) to exclude because it's very remote from most others.

People keep confusing two orthogonal concepts, and I think this is at the
root of the extlang issue.

   1. "get me languages that are mutually intelligible with X" (maybe to
   degree Y), and
   2. "get me the languages that have the same macrolanguage as X"

Number 1 is very interesting, and would be very useful; but it is not at all
the same as #2.

If macrolanguage were defined as #1, I would probably be all in favor of
baking it into extlangs. But it is not at all the same.

> > - Languages that are *very* closely related do not have a macrolanguage
> > ("ro" and "mo").
>
> I know.  That doesn't mean not to take advantage of the ISO 639-3
> information we do have.

See comment on mutual intelligibility.

>
> > I don't see extlang as any really practical benefit in filtering. It
> > isn't useful in a great many other cases of related languages; if I want
> > to use with nn, nb, no it doesn't help, nor with ro/mo, various varieties
> > of German, and so on. Moreover, if a customer searches for 'ar' and get a
> > bunch of documents in shu, they are as or more likely to be unhappy as if
> > they search for 'de' documents and get back a bunch of 'gsw'.
>
> As I keep saying, *search* isn't the issue here.  It's *negotiation
> by filtering*.

See my discussion above.

>
>
> > Putting some ISO language tags into the extlang position just because
> > they have a macrolanguage is an unnecessary complication for
> > implementations and represents a substantive, controversial change to
> > RFC 4646.
>
> *Now* it does, yes.  All that shows is that I should have thought of
> this argument before giving in earlier.

I didn't quite understand this.

>
>
> --
> Values of beeta will give rise to dom!          John Cowan
> (5th/6th edition 'mv' said this if you tried    http://www.ccil.org/~cowan
> to rename '.' or '..' entries; see              cowan@ccil.org
> http://cm.bell-labs.com/cm/cs/who/dmr/odd.html)
>

-- 
Mark