Re: [Ltru] my technical position on extlang

John Cowan <cowan@ccil.org> Fri, 23 May 2008 04:43 UTC

Date: Fri, 23 May 2008 00:43:05 -0400
To: Mark Davis <mark.davis@icu-project.org>
Message-ID: <20080523044305.GB7960@mercury.ccil.org>
References: <30b660a20805181149u2e1e3fb9y1a3b5b751c3e6998@mail.gmail.com>
MIME-Version: 1.0
Content-Disposition: inline
In-Reply-To: <30b660a20805181149u2e1e3fb9y1a3b5b751c3e6998@mail.gmail.com>
User-Agent: Mutt/1.5.13 (2006-08-11)
From: John Cowan <cowan@ccil.org>
Cc: LTRU Working Group <ltru@ietf.org>
Subject: Re: [Ltru] my technical position on extlang

Mark Davis scripsit:

> For filtering, extlang offers no particular advantage. Let's look at queries
> of "ar-ary" Moroccan Arabic vs "ary". In either case I need a way to match
> all and only Moroccan Arabic; I must not fallback to "ar". If I fallback to
> include all Arabic, the actual content that is in Moroccan Arabic would be
> completely and utterly swamped by Standard Arabic. So for filtering, the
> extlang model just gives us a more complicated syntax, with no benefit.

I agree with most of your document, which is almost entirely about
lookup, but I can't agree with your characterization of filtering.
It would hold for filtering as done by a search engine, but a search
engine is a very atypical case of a Web server -- its content is vast
and not controlled by the server's owner except in the broadest sense.

Filtering is the algorithm applied to language-tag negotiation by the HTTP
RFC (2616); indeed, RFC 4647's basic filtering algorithm is taken directly
from 2616.  So as an IETF group we must take filtering seriously, and
it is my contention that using extlang syntax is a big win for filtering.

Remember that the operational difference between filtering and lookup
is that in filtering, a range with *fewer* subtags can match a resource
tagged with *more* subtags; whereas in lookup, a range with *more*
subtags can match a resource tagged with *fewer* subtags if no longer
match is available.
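
To make the contrast concrete, here is a minimal Python sketch of the
two operations, loosely following RFC 4647's basic filtering and lookup.
The function names basic_filter and lookup are illustrative only, and
the sketch assumes well-formed tags and a single range rather than a
full priority list.

```python
def basic_filter(language_range, tags):
    """Keep every tag that equals the range or extends it by whole subtags."""
    r = language_range.lower()
    if r == "*":
        return list(tags)
    return [t for t in tags if t.lower() == r or t.lower().startswith(r + "-")]


def lookup(language_range, tags):
    """Truncate the range from the right until it equals an available tag."""
    available = {t.lower(): t for t in tags}
    subtags = language_range.lower().split("-")
    while subtags:
        candidate = "-".join(subtags)
        if candidate in available:
            return available[candidate]
        subtags.pop()
        # A single-character subtag left at the end is removed as well.
        if subtags and len(subtags[-1]) == 1:
            subtags.pop()
    return None  # the caller would fall back to some default


# Filtering: the shorter range "ar" matches the longer tags.
print(basic_filter("ar", ["ar-arz", "ar-arb", "arz", "en"]))
# Lookup: the longer range "ar-arz" falls back to the shorter tag "ar".
print(lookup("ar-arz", ["ar", "en"]))
```

Note that the bare tag "arz" is not matched by the range "ar" under
filtering, which is the point developed below.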

Suppose we are downloading Arabic-language audio, and we wish to receive
Egyptian Arabic by choice; failing that, Standard Arabic; and failing
that, any available Arabic variety except Shuwa Arabic, which we do not
understand at all.  We can inform the server of our wishes using the
following header:

	Accept-Language: ar-arz, ar-arb, ar, ar-shw;q=0

A 2616-conformant server will apply the filtering algorithm and return
anything beginning with "ar-arz"; if none, then anything beginning with
"ar-arb"; if none, then anything beginning with "ar" except something
beginning with "ar-shw".  (Some web servers ignore "q=0", making the
exclusion of Shuwa not work, but that's just a bug; RFC 2616 section
3.9 is clear that ";q=0" means content so tagged is not acceptable to
the client.)
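
The behaviour just described can be sketched in Python as follows.
This is an illustration under stated assumptions, not any real server's
code: the names parse_accept_language, prefix_matches and negotiate are
hypothetical; the quality value for a tag is taken from the longest
matching range, as RFC 2616 section 14.4 specifies; and ties between
equal q-values are broken by order in the header, which is how the
header above is meant to be read but is an assumption of this sketch.

```python
def parse_accept_language(header):
    """Return (range, q) pairs in header order; q defaults to 1.0."""
    ranges = []
    for item in header.split(","):
        parts = [p.strip() for p in item.split(";")]
        if not parts[0]:
            continue
        q = 1.0
        for param in parts[1:]:
            if param.lower().startswith("q="):
                q = float(param[2:])
        ranges.append((parts[0].lower(), q))
    return ranges


def prefix_matches(language_range, tag):
    """Prefix match on whole subtags, with '*' matching everything."""
    tag = tag.lower()
    return (language_range == "*" or tag == language_range
            or tag.startswith(language_range + "-"))


def negotiate(header, available_tags):
    """Return the acceptable tags, most preferred first."""
    ranges = parse_accept_language(header)
    ranked = []
    for tag in available_tags:
        matching = [(len(r), -i, q)
                    for i, (r, q) in enumerate(ranges)
                    if prefix_matches(r, tag)]
        if not matching:
            continue
        # The longest matching range decides the quality value.
        _, neg_index, q = max(matching)
        if q > 0:  # q=0 means "not acceptable"
            ranked.append((-q, -neg_index, tag))
    return [tag for _, _, tag in sorted(ranked)]


header = "ar-arz, ar-arb, ar, ar-shw;q=0"
print(negotiate(header, ["ar-shw", "ar-ary", "ar-arb", "ar-arz"]))
```

With these inputs the Shuwa resource is dropped because "ar-shw;q=0" is
a longer match than "ar", and the remaining tags come back in the order
ar-arz, ar-arb, ar-ary.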

What is more (and this is key), if the server has more than one match
for ar-arb, say ar-arb-koranic and ar-arb-modern, it can return either,
or return a disambiguation page allowing the user to choose.  This is
entirely different from lookup, where at most one resource can be
returned, and we remove subtags from the range until we get an exact
match.

Now what if extlang syntax is not used?  Then we must say

	Accept-Language: arz, arb, ar, shw;q=0

And in order for this to mean what it is intended to mean, the server
must extend RFC 2616 and treat filtering for "ar" as magically matching
resources that begin with "arq", "aao", "bbz", ..., "aeb", "auz" (see
http://www.sil.org/iso639-3/documentation.asp?id=ara ).  Either that,
or the user must specify all these tags in Accept-Language.  What's more,
if the list grows, the server or the client must be updated.  Whereas if
all Arabics begin with "ar", there is no such issue.
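
A rough Python sketch of what that server-side extension would look
like, for illustration only.  The table MACROLANGUAGE_MEMBERS and the
function names are hypothetical, and the table is deliberately partial:
it holds only codes mentioned in this thread, not the full list.

```python
# Partial, illustrative table: only codes named in this thread.
# A real table would be longer and would need updating over time.
MACROLANGUAGE_MEMBERS = {
    "ar": {"arq", "aao", "bbz", "aeb", "auz", "arz", "arb", "ary", "shw"},
}


def filter_plain(language_range, tags):
    """Plain prefix filtering, with no knowledge of macrolanguages."""
    r = language_range.lower()
    return [t for t in tags if t.lower() == r or t.lower().startswith(r + "-")]


def filter_with_table(language_range, tags):
    """Prefix filtering extended by a macrolanguage membership table."""
    r = language_range.lower()
    members = MACROLANGUAGE_MEMBERS.get(r, set())
    result = []
    for t in tags:
        low = t.lower()
        primary = low.split("-")[0]
        if low == r or low.startswith(r + "-") or primary in members:
            result.append(t)
    return result


resources = ["arz", "arq", "ar-arz", "en"]
print(filter_plain("ar", resources))       # bare "arz" and "arq" are missed
print(filter_with_table("ar", resources))  # found only because of the table
```

The extra table is the cost: every server (or client) doing this has to
carry it and keep it current, whereas plain prefix matching needs no
data at all when the tags themselves begin with "ar".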

Obviously extlang syntax won't solve *all* problems with filtering,
but it will help a great deal with some important cases.  That's why I
think the issue needs to be reconsidered.

-- 
If you understand,                      John Cowan
   things are just as they are;         http://www.ccil.org/~cowan
if you do not understand,               cowan@ccil.org
   things are just as they are.