RE: [Ltru] Extended language tags (long reply)

"Debbie Garside" <debbie@ictmarketing.co.uk> Wed, 10 October 2007 16:25 UTC

From: Debbie Garside <debbie@ictmarketing.co.uk>
To: mark.davis@icu-project.org, 'Addison Phillips' <addison@yahoo-inc.com>
Subject: RE: [Ltru] Extended language tags (long reply)
Date: Wed, 10 Oct 2007 17:24:09 +0100
Cc: ltru@ietf.org

Hi Mark
 
I haven't time at the moment to put together the fallback for this (as it
would take me ages) but I think fas-prs and fas-pes are good cases for
macrolanguages to be included as extended language tags.  These languages
are mutually intelligible and to be able to choose prs and have a system
create a fallback based on fas would be a huge bonus.  Implementations
should have the extended language tags option and end users should be
presented with a question asking if they want macrolanguage related
languages matched IMHO. It is not just about whether they understand the
macrolanguage; it is about whether a user wants the related languages
within the macrolanguage returned. I do understand that the end user would
not want this in many cases, but where is the harm in the provision?  If I
were querying using prs I would certainly want to have pes- and fas-tagged
items returned as a secondary match.
 
Best regards
 
Debbie  


  _____  

From: Mark Davis [mailto:mark.davis@icu-project.org] 
Sent: 10 October 2007 16:55
To: Addison Phillips
Cc: ltru@ietf.org
Subject: Re: [Ltru] Extended language tags (long reply)


Some detailed comments on these scenarios.


On 10/7/07, Addison Phillips <addison@yahoo-inc.com> wrote: 

I would say:

1. With Extlangs.

- change to filtering: none, but you probably want to use extended 
filtering instead of basic filtering (i.e. "zh-Hant-HK" matches
"zh-yue-Hant-HK" and "zh-cmn-Hant-HK")


Actually, I think the change is the reverse; you have to disable extended
filtering. In the filtering / search business, it is just as bad (if not
worse) to give too many responses as it is to give too few. Otherwise your
user is swamped in irrelevant documents. 
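
To make the two filtering modes concrete, here is a small Python sketch of
basic and extended filtering along the lines described in RFC 4647. The
function names are made up for this illustration, and the code is a
simplified reading of the matching rules, not a reference implementation.

```python
def basic_filter_match(language_range: str, tag: str) -> bool:
    """Basic filtering: the range must equal the tag or be a prefix of it
    that ends on a subtag boundary."""
    r, t = language_range.lower(), tag.lower()
    return r == "*" or t == r or t.startswith(r + "-")


def extended_filter_match(language_range: str, tag: str) -> bool:
    """Extended filtering: subtags of the range must appear in order in the
    tag, but the tag may carry extra subtags in between."""
    range_subtags = language_range.lower().split("-")
    tag_subtags = tag.lower().split("-")
    # The first subtag must match exactly unless the range starts with '*'.
    if range_subtags[0] != "*" and range_subtags[0] != tag_subtags[0]:
        return False
    r, t = 1, 1
    while r < len(range_subtags):
        if range_subtags[r] == "*":
            r += 1                      # wildcard matches zero or more subtags
        elif t >= len(tag_subtags):
            return False                # tag ran out before the range did
        elif range_subtags[r] == tag_subtags[t]:
            r += 1
            t += 1
        elif len(tag_subtags[t]) == 1:
            return False                # never skip past a singleton
        else:
            t += 1                      # skip an extra subtag such as 'yue'
    return True


print(basic_filter_match("zh-Hant-HK", "zh-yue-Hant-HK"))
print(extended_filter_match("zh-Hant-HK", "zh-yue-Hant-HK"))
```

With extended filtering the range "zh-Hant-HK" picks up both
"zh-yue-Hant-HK" and "zh-cmn-Hant-HK"; with basic filtering it picks up
neither.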

Except in very few cases, notably Arabic and Chinese, according to what I
have heard Peter say, we have no reason to believe that a speaker of one
microlanguage speaks any other particular macrolanguage. If my user searches
for Fulah, does s/he want to get Maasina Fulfulde, Adamawa Fulfulde,
Pulaar, Central-Eastern Niger Fulfulde, and so on, which I have no reason to
believe that the original person speaks? 

And even for Arabic and Chinese, unmodified extended filtering with extlang
might make sense only because ar-arb is equivalent to arb and zh-cmn is
equivalent to cmn, for all practical purposes, and where all speakers of the
microlanguages understand the macrolanguages (which is unclear -- I've asked
on the list whether the latter is true -- and it is a crucial fact for
extlang -- but got no answer). It is better that if someone really wants both
standard Arabic and Tajiki Arabic, then the input be "ar, abh", just like
now with Norwegian or Serbo-Croatian, where someone can just list "nn, nb,
no" and "sr, hr, bs, sh".

So with extlang we have a backwards compatibility issue with extended
filtering; we need to disable macrolanguage lookup in all but a few cases,
and even there it is better for the user to just supply the list of items
they want. 

http://www.sil.org/iso639-3/macrolanguages.asp



- change to lookup: treat extlang as atomic with the primary language
subtag; potentially loop-back through the subtags. That is, given the
range "zh-yue-Hant-HK", the fallback pattern is this:

  zh-yue-Hant-HK 
  zh-yue-Hant
  zh-yue
  zh-Hant-HK
  zh-Hant
  zh
  (default)
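
The fallback pattern above can be sketched in Python as follows. This is an
illustration only: the helper names are hypothetical, the test for "is the
second subtag an extlang" is a rough heuristic (three letters directly after
the primary language), and the singleton handling of RFC 4647 lookup is left
out for brevity.

```python
def truncate(subtags, floor):
    """Progressively drop trailing subtags, keeping at least `floor` of them."""
    return ["-".join(subtags[:n]) for n in range(len(subtags), floor - 1, -1)]


def has_extlang(subtags):
    """Heuristic (assumption): a 3-letter second subtag is an extlang."""
    return len(subtags) > 1 and len(subtags[1]) == 3 and subtags[1].isalpha()


def fallback_with_extlang(language_range):
    """Treat 'primary-extlang' as atomic, then loop back without the extlang."""
    subtags = language_range.split("-")
    if not has_extlang(subtags):
        return truncate(subtags, 1)
    chain = truncate(subtags, 2)                       # zh-yue-... down to zh-yue
    chain += truncate([subtags[0]] + subtags[2:], 1)   # zh-... down to zh
    return chain


for candidate in fallback_with_extlang("zh-yue-Hant-HK"):
    print(candidate)
```

The caller would still append its own default after the last candidate.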


And again, we disable this for all but a few specific cases. It only makes
sense to do this as a heuristic when the speaker of the microlanguage is
extremely likely to speak the macrolanguage. 

If the query is Maasina Fulfulde as used in Ghana (ful-)ffm-GH, we don't
want to fall back to some arbitrary Fulah; instead we want to probably fall
back to Hausa.

And this would need to be handled even more carefully where the input is
multiple tags, as in Accept-Language. If the user supplies "zh-yue; en",
meaning that they prefer Cantonese, then English -- and may not know
Mandarin at all! -- then the stated algorithm, the one that everyone uses
now, would give completely incorrect results. 

(1) Incorrect
  zh-yue-Hant-HK
  zh-yue-Hant
  zh-yue
  zh
  en
  (default)

Even the modified version doesn't work

(2) Still incorrect
  zh-yue-Hant-HK
  zh-yue-Hant
  zh-yue 
  zh-Hant-HK
  zh-Hant
  zh
  en
  (default)

What we need to do is more complicated:

(3) what the user asked for, Cantonese, then English, then maybe something
else.
  zh-yue-Hant-HK
  zh-yue-Hant 
  zh-yue
  en
  zh-Hant-HK
  zh-Hant
  zh
  (default)
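
A rough Python sketch of the difference between orderings (2) and (3): the
macrolanguage candidates are collected separately and only tried after every
range the user actually listed has been exhausted. The function names are
hypothetical and the extlang test is the same rough heuristic (a 3-letter
second subtag); this is one possible reading, not a specified algorithm.

```python
def split_chain(language_range):
    """Return (specific, macro) candidate lists for one language range."""
    subtags = language_range.split("-")
    is_ext = len(subtags) > 1 and len(subtags[1]) == 3 and subtags[1].isalpha()
    if not is_ext:
        return ["-".join(subtags[:n]) for n in range(len(subtags), 0, -1)], []
    # Keep 'primary-extlang' together for the specific candidates.
    specific = ["-".join(subtags[:n]) for n in range(len(subtags), 1, -1)]
    # Drop the extlang to get the macrolanguage candidates.
    macro_subtags = [subtags[0]] + subtags[2:]
    macro = ["-".join(macro_subtags[:n])
             for n in range(len(macro_subtags), 0, -1)]
    return specific, macro


def naive_order(priority_list):
    """Ordering (2): macro candidates follow each range immediately."""
    out = []
    for language_range in priority_list:
        specific, macro = split_chain(language_range)
        out += [c for c in specific + macro if c not in out]
    return out


def deferred_order(priority_list):
    """Ordering (3): macro candidates come after all listed ranges."""
    specific_all, macro_all = [], []
    for language_range in priority_list:
        specific, macro = split_chain(language_range)
        specific_all += [c for c in specific if c not in specific_all]
        macro_all += [c for c in macro if c not in macro_all]
    return specific_all + [c for c in macro_all if c not in specific_all]


print(naive_order(["zh-yue-Hant-HK", "en"]))
print(deferred_order(["zh-yue-Hant-HK", "en"]))
```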

Number 3 is especially needed whenever the macrolanguage is not
overwhelmingly likely to be understood by the microlanguage users.

So with extlang we also have a backwards compatibility issue with lookup;
code that used to work fine will fail. 



Or this:

  zh-yue-Hant-HK
  zh-yue-Hant
  zh-yue
  (default)


This would be better, but still requires a change to algorithms, and then
simply amounts to adding extlang, but coding around it so that the results
were as if there were no extlang -- better to not use the extlang mechanism.




2. Without extlangs.

- change to filtering: none

- change to lookup: none 


Agreed so far, but... 



BUT... you want to include the macro language in your ranges in some 
cases. Alternatively, we would have to define new filtering and lookup
options that include mapping to macrolanguages. For example, with the
range "yue-Hant-HK", you would want the fallback to be:

  yue-Hant-HK
  yue-Hant
  yue
  zh-Hant-HK
  zh-Hant
  zh
  (default)
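
As a sketch of what such a mapping-based fallback might look like without
extlang, here is a short Python example. The table is a tiny illustrative
excerpt and acts as an opt-in list: only languages placed in it get the
extra macrolanguage step. The names MACRO_FALLBACK and fallback_chain are
invented for this example.

```python
# Opt-in table (illustrative excerpt): individual language -> macrolanguage.
MACRO_FALLBACK = {"yue": "zh", "cmn": "zh", "arb": "ar"}


def truncations(tag):
    """Progressively drop trailing subtags: a-b-c, a-b, a."""
    subtags = tag.split("-")
    return ["-".join(subtags[:n]) for n in range(len(subtags), 0, -1)]


def fallback_chain(language_range):
    """Ordinary truncation, then the same again under the macrolanguage
    if (and only if) the primary language is in the opt-in table."""
    chain = truncations(language_range)
    primary, _, rest = language_range.partition("-")
    macro = MACRO_FALLBACK.get(primary.lower())
    if macro is not None:
        macro_range = macro + ("-" + rest if rest else "")
        chain += truncations(macro_range)
    return chain


print(fallback_chain("yue-Hant-HK"))
print(fallback_chain("ffm-GH"))   # not in the table: no macrolanguage step
```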



In the case of no extlangs, we really do not want to make any changes to the
algorithms. Instead, we want to point out in the text that macro languages
*MAY* be a useful resource for having some extended fallback. That is, it
*MAY* be reasonable to fall back from the Chinese and Arabic microlanguages
to their macrolanguages, but also may not. And the interaction between this
additional fallback and multiple tags needs to be taken into consideration. 

Suppose that the input is "yue-Hant-HK; en"; the current algorithm works.

(1) 
  yue-Hant-HK
  yue-Hant
  yue
  en
  (default)

If a particular implementation wants to change the default processing to
have an extra step that adds fallback to zh-Hant-HK (and so on) for certain
specific languages, such as particular Chinese and Arabic microlanguages,
that's a small amount of work. Moreover, it is just what a more enhanced
algorithm *currently* would do with any of "nn, nb, no" and "sr, hr, bs,
sh" -- in these cases they are known to be good, practical fallbacks. 

Alternatively, if the user just feeds in "yue-Hant-HK, en, zh-Hant-HK" or
the reverse "yue-Hant-HK, zh-Hant-HK, en", that will work without any code
changes, or any enhancements; the user gets what is asked for. 
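
To illustrate, here is a minimal Python sketch of unmodified lookup over a
user-supplied priority list, following the truncation procedure described
in RFC 4647. The function name, the set of available tags, and the default
value are all assumptions made for the example.

```python
def lookup(priority_list, available, default):
    """Return the first available tag matched by progressively truncating
    each range in priority order, or `default` if nothing matches."""
    available_by_lower = {tag.lower(): tag for tag in available}
    for language_range in priority_list:
        subtags = language_range.lower().split("-")
        while subtags:
            candidate = "-".join(subtags)
            if candidate in available_by_lower:
                return available_by_lower[candidate]
            subtags.pop()
            # A trailing single-character subtag is removed as well.
            if subtags and len(subtags[-1]) == 1:
                subtags.pop()
    return default


available = {"zh-Hant", "en"}   # hypothetical content on the server
print(lookup(["yue-Hant-HK", "en", "zh-Hant-HK"], available, "de"))
print(lookup(["yue-Hant-HK", "zh-Hant-HK", "en"], available, "de"))
```

The order the user writes the list in decides whether English or Chinese
wins; no macrolanguage knowledge is needed in the code.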

================

What we don't want to do is make recommendations that if implemented, are
harder for people to control and get the right answer. And baking extlang
into the tags is even worse -- since it introduces backwards
incompatibilities that require old code to be modified to work around. 

Someone mentioned on this list that they thought that these were only
philosophical differences. I don't see it that way at all. The choices we
make here will affect the ability of people to work with their own languages
for a long time -- it is important to get this right, and pay attention both
to backwards compatibility and to future capabilities -- the choice of
architecture will make real, practical differences for people, especially
for minority language users. My fear is that it is exactly minority language
users who will be disadvantaged by extlang. It isn't the French or German speakers that
will end up with problems; it is the Gondi and Grebo speakers. 

Mark

