Re: [Ltru] my technical position on extlang

"Mark Davis" <mark.davis@icu-project.org> Fri, 23 May 2008 15:51 UTC

Date: Fri, 23 May 2008 08:51:15 -0700
From: Mark Davis <mark.davis@icu-project.org>
To: John Cowan <cowan@ccil.org>
Cc: LTRU Working Group <ltru@ietf.org>
Subject: Re: [Ltru] my technical position on extlang

The contrast is between:

   - ar-arz, ar-arb, ar, ar-shu;q=0 // with extlang=macrolanguage
   - arz, arb, ar, shu;q=0 // without extlang=macrolanguage, eg plain RFC
   4646

There are two cases where people use lists: filtering and content
negotiation.

   1. For content negotiation most people use some form of fallback
   (implicit or explicit). For example, if an Accept-Language list is handed to
   a Java implementation, many implementations will actually do fallback on the
   items.
   2. We'd have to look at how Accept-Language is implemented in practice.
   The spec is pretty lame, and a lot of implementations just disregard the q
   values. Take Tomcat, for example. With the Accept-Language string "ar-arz,
   ar-arb, ar, ar-shu;q=0, de;q=0.5", it gets the locales [ar_ARZ, ar_ARB, ar,
   ar_SHU, de]. Note that they are not even in order of precedence, and that
   mentioning ar-shu actually includes it explicitly instead of removing it!
   3. Even if they pay attention to q values, normally implementations just
   reorder the list, then walk through the list and pick the first one that
   works. By your definition, the content for "ar" could be any of the
   microlanguages, including "arb" but also "shu" or "pga", so this would not
   control that unless there were some metalanguage in the system that said
   that the content for "ar" happened to be "shu". I don't know of anyone who
   does that, and I doubt it would ever be worth the effort to support.
      - Interpreting 'ar' as Standard Arabic is permissible now and will
      remain so, whatever happens with extlang. With the above list, the
      practical impact will be that it will stop at 'ar' anyway and return
      Standard Arabic. The odds of an implementation supporting "pga" but not
      ar/arb are pretty darned low. In practice, of course, every
      implementation that I know of will treat 'ar' as simply Standard Arabic.
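
To make item 3 concrete, here is a minimal sketch in Python of the "reorder
by q, then walk the list and pick the first one that works" behavior. It is
not taken from Tomcat or any other server; the function names and the
`supported` set are hypothetical, and the q-value parsing is deliberately
simplified.

```python
def parse_accept_language(header):
    """Return [(range, q)] in header order; a missing q defaults to 1.0."""
    items = []
    for part in header.split(","):
        part = part.strip()
        if not part:
            continue
        pieces = [p.strip() for p in part.split(";")]
        lang_range = pieces[0].lower()
        q = 1.0
        for param in pieces[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0  # assumption: an unparsable q means "not acceptable"
        items.append((lang_range, q))
    return items


def ordered_ranges(header):
    """Drop q=0 entries, then order by descending q (stable for ties)."""
    accepted = [(r, q) for r, q in parse_accept_language(header) if q > 0]
    return [r for r, _ in sorted(accepted, key=lambda item: -item[1])]


def pick_first(header, supported):
    """Walk the ordered ranges; truncate subtags until a supported tag is found."""
    for lang_range in ordered_ranges(header):
        subtags = lang_range.split("-")
        while subtags:
            candidate = "-".join(subtags)
            if candidate in supported:
                return candidate
            subtags.pop()
    return None
```

With a hypothetical server that supports only {"ar", "de"}, the first range
"ar-arz" already falls back to "ar", so the walk stops there, which is the
practical outcome described in the bullet above.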

Look then at filtering.

Even if the server doesn't support extlang matching, you can get exactly
what you said you want -- reliably -- with an explicit list:

arz, arb, ar, arq, aao, bbz, abv, acy, adf, avl, afb, ayh, acw, ayl, acm,
ary, ars, apc, ayp, acx, aec, ayn, ssh, ajp, apd, pga, acq, abh, aeb, auz
// excluding shu

And this will work even if filtering isn't supported according to the spec;
and note that we have no guarantee that the server will filter according to
the spec -- based on the experience with Accept-Language, it's actually
unlikely.
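
As an illustration, here is a small Python sketch of prefix-style filtering
in the manner of RFC 4647 basic filtering (a range matches a tag if it
equals the tag, or equals a prefix of it that is followed by "-"), applied
to the explicit list above. The `available` tags are invented for the
example, and the function names are hypothetical.

```python
def matches(lang_range, tag):
    """Case-insensitive match: equal, or a prefix that ends at a '-' boundary."""
    lang_range, tag = lang_range.lower(), tag.lower()
    return (
        lang_range == "*"
        or tag == lang_range
        or tag.startswith(lang_range + "-")
    )


def basic_filter(ranges, tags):
    """Return the tags matched by at least one range, in range priority order."""
    result = []
    for lang_range in ranges:
        for tag in tags:
            if matches(lang_range, tag) and tag not in result:
                result.append(tag)
    return result


# The explicit list from the message: every listed Arabic code except shu.
EXPLICIT_RANGES = [
    r.strip()
    for r in (
        "arz, arb, ar, arq, aao, bbz, abv, acy, adf, avl, afb, ayh, acw, ayl, "
        "acm, ary, ars, apc, ayp, acx, aec, ayn, ssh, ajp, apd, pga, acq, abh, "
        "aeb, auz"
    ).split(",")
]

# Hypothetical content tags on a server that uses plain RFC 4646 tags.
available = ["shu", "ary-MA", "arb", "de", "ar-EG"]
selected = basic_filter(EXPLICIT_RANGES, available)
```

Here `selected` contains "arb", "ar-EG" and "ary-MA" but neither "shu" nor
"de": the exclusion comes from simply not listing shu, so it does not depend
on the server honoring q=0.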

And really, what are the odds that someone wants that exact list above? And
MUST have it supported with a short enumeration? And does not care if
additional encompassed languages are added?

According to information from Peter, the microlanguages are all independent
languages, meaning that they are not mutually comprehensible. Someone might
know more than one, but there is no particular expectation in the absence of
language-specific information. So while it is possible that someone happens
to know 15-odd different microlanguages under "ar" (but not "shu"), it is
pretty darned unlikely. What is more likely is that someone knows a few of
them, and simply lists the ones they know. The usefulness of having such a
shorthand is really not there.

Peter did his best with the macrolanguage construction, given what was
already there. But the information content useful to implementations is
extremely low, and cannot be relied on.

   - Languages that are *very* closely related do not have a macrolanguage
   ("ro" and "mo").
   - Languages that are mutually incomprehensible do share a macrolanguage
   (most examples).

I don't see extlang as offering any real practical benefit in filtering. It
isn't useful in a great many other cases of related languages; if I want to
use it with nn, nb, no, it doesn't help, nor with ro/mo, various varieties of
German, and so on. Moreover, if a customer searches for 'ar' and gets a bunch
of documents in shu, they are at least as likely to be unhappy as if they
search for 'de' documents and get back a bunch of 'gsw'.

Extlang is really just not a good mechanism for implementation in terms of
getting good matches, and baking that into the syntax just makes it worse.
It is a negative in lookup, and no positive for filtering either.

Putting some ISO language tags into the extlang position just because they
have a macrolanguage is an unnecessary complication for implementations and
represents a substantive, controversial change to RFC 4646.

Mark

On Thu, May 22, 2008 at 9:43 PM, John Cowan <cowan@ccil.org> wrote:

> Mark Davis scripsit:
>
> > For filtering, extlang offers no particular advantage. Let's look at queries
> > of "ar-ary" Moroccan Arabic vs "ary". In either case I need a way to match
> > all and only Moroccan Arabic; I must not fallback to "ar". If I fallback to
> > include all Arabic, the actual content that is in Moroccan Arabic would be
> > completely and utterly swamped by Standard Arabic. So for filtering, the
> > extlang model just gives us a more complicated syntax, with no benefit.
>
> I agree with most of your document, which is almost entirely about
> lookup, but I can't agree with your characterization of filtering.
> It would apply to filtering applied to a search engine, but a search
> engine is a very atypical case of a Web server -- its content is vast
> and not controlled by the server's owner except in the broadest sense.
>
> Filtering is the algorithm applied to language-tag negotiation by the HTTP
> RFC (2616); indeed, 4647's simple filtering algorithm is taken directly
> from 2616.  So as an IETF group we must take filtering seriously, and
> it is my contention that using extlang syntax is a big win for filtering.
>
> Remember that the operational difference between filtering and lookup
> is that in filtering, a range with *fewer* subtags can match a resource
> tagged with *more* subtags; whereas in lookup, a range with *more*
> subtags can match a resource tagged with *fewer* subtags if no longer
> match is available.
>
> Suppose we are downloading Arabic-language audio, and we wish to receive
> Egyptian Arabic by choice; failing that, Standard Arabic; and failing
> that, any available Arabic variety except Shuwa Arabic, which we do not
> understand at all.  We can inform the server of our wishes using the
> following header:
>
>        Accept-Language: ar-arz, ar-arb, ar, ar-shw;q=0
>
> A 2616-conformant server will apply the filtering algorithm and return
> anything beginning with "ar-arz"; if none, then anything beginning with
> "ar-arb"; if none, then anything beginning with "ar" except something
> beginning with "ar-shw".  (Some web servers ignore "q=0", making the
> exclusion of Shuwa not work, but that's just a bug; RFC 2616 section
> 3.9 is clear that ";q=0" means content so tagged is not acceptable to
> the client.)
>
> What is more (and this is key) if the server has more than one match
> for ar-arb, say ar-arb-koranic and ar-arb-modern, it can return either,
> or return a disambiguation page allowing the user to choose.  This is
> entirely different from lookup, where at most one resource must be
> returned, and we remove subtags until we get an exact match.
>
> Now what if extlang syntax is not used?  Then we must say
>
>        Accept-Language: arz, arb, ar, shw;q=0
>
> And in order for this to mean what it is intended to mean, the server
> must extend RFC 2616 and treat filtering for "ar" as magically matching
> resources that begin with "arq", "aao", "bbz", ..., "aeb", "auz" (see
> http://www.sil.org/iso639-3/documentation.asp?id=ara ).  Either that,
> or the user must specify all these tags in Accept-Language.  What's more,
> if the list grows, the server or the client must be updated.  Whereas if
> all Arabics begin with "ar", there is no such issue.
>
> Obviously extlang syntax won't solve *all* problems with filtering,
> but it will help a great deal with some important cases.  That's why I
> think the issue needs to be reconsidered.
>
> --
> If you understand,                      John Cowan
>   things are just as they are;         http://www.ccil.org/~cowan
> if you do not understand,               cowan@ccil.org
>   things are just as they are.
>


