Re: [Ltru] Re: extlang

"Mark Davis" <mark.davis@icu-project.org> Mon, 19 March 2007 16:10 UTC

Message-ID: <30b660a20703190910u636658b1g56489b0d30d2333a@mail.gmail.com>
Date: Mon, 19 Mar 2007 09:10:23 -0700
From: Mark Davis <mark.davis@icu-project.org>
To: Addison Phillips <addison@yahoo-inc.com>
Subject: Re: [Ltru] Re: extlang
In-Reply-To: <45FEA785.2080003@yahoo-inc.com>
MIME-Version: 1.0
References: <E1HRsNL-0001ob-5h@megatron.ietf.org> <30b660a20703161617u85dbfe1r44ddc29fcfcf1a6d@mail.gmail.com> <45FB2C4E.9090303@yahoo-inc.com> <006e01c7682b$f0687b10$d1397130$@net> <004501c768bb$3bc185e0$6401a8c0@DGBP7M81> <00fd01c76914$18377ae0$48a670a0$@net> <45FD1A0A.2EED@xyzzy.claranet.de> <30b660a20703181137y6448508exb3e75f8e21a80a64@mail.gmail.com> <01b801c76990$e3e9b5a0$abbd20e0$@net> <45FEA785.2080003@yahoo-inc.com>
Cc: Frank Ellermann <nobody@xyzzy.claranet.de>, ltru@lists.ietf.org

I see it in a somewhat different way. The fact that there is a macrolanguage
(zh) should not skew the way we encode an individual language (yue). You
say: "Extlangs help this a little by avoiding the choice on the initial
subtag. Thus: "ar-EG" is related to "ar-arz-EG" and "ar-arb-EG" in distinct
and somewhat logical ways."

However, functionally, we have to see an advantage in subordinating some
languages as extlangs. There isn't value if they just add complication, as
per your "The problem with lookup (and I use lookup extensively, so it
concerns me deeply) might suggest that some extra smarts related to extlangs
is going to be needed."

In order to have a good case for the extlang model, we would need to see
concrete scenarios where we can demonstrate that "zh-cmn" and "zh-yue" work
better than "cmn" and "yue" respectively, and a demonstration that those
scenarios are more important than the ones where extlangs cause problems.
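
One way to make such a comparison concrete is to run both tag forms through
the two RFC 4647 filtering schemes. The following is a minimal sketch, not
taken from any draft: the function names and the tag list are illustrative
assumptions, and the matching follows the basic and extended filtering rules
of RFC 4647 as I understand them.

```python
def basic_filter(lang_range, tag):
    """Basic filtering: the range equals the tag or is a prefix ending at '-'."""
    r, t = lang_range.lower(), tag.lower()
    return r == "*" or t == r or t.startswith(r + "-")


def extended_filter(lang_range, tag):
    """Extended filtering: range subtags must appear in order in the tag,
    and non-singleton tag subtags in between may be skipped."""
    r = lang_range.lower().split("-")
    t = tag.lower().split("-")
    # The first subtags must match unless the range starts with a wildcard.
    if r[0] != "*" and r[0] != t[0]:
        return False
    ri, ti = 1, 1
    while ri < len(r):
        if r[ri] == "*":
            ri += 1              # wildcard: nothing to consume in the tag
        elif ti >= len(t):
            return False         # range has subtags left, tag does not
        elif r[ri] == t[ti]:
            ri += 1
            ti += 1
        elif len(t[ti]) == 1:
            return False         # never skip past a singleton such as "x"
        else:
            ti += 1              # skip an unmatched tag subtag (e.g. an extlang)
    return True


# Illustrative tags: extlang forms and their stand-alone equivalents.
TAGS = ["zh-yue-Hans-CN", "zh-cmn-Hans", "zh-Hans-CN", "yue-Hans-CN", "cmn-Hans"]

basic_hits = [t for t in TAGS if basic_filter("zh-Hans", t)]
extended_hits = [t for t in TAGS if extended_filter("zh-Hans", t)]
# basic_hits    -> ["zh-Hans-CN"]
# extended_hits -> ["zh-yue-Hans-CN", "zh-cmn-Hans", "zh-Hans-CN"]
```

With the range "zh-Hans", basic filtering skips the extlang forms, extended
filtering picks them up, and neither scheme connects "yue-Hans-CN" to "zh"
without some additional information.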

Note that ISO 639-3 does not at all force the use of the extlang model; the
extlang model is just one possible way of expressing the information in ISO
639-3. We already have macrolanguages and "subordinate" languages in BCP 47,
but we put them at the same level. For example, ISO 639-3 already
categorizes no and sh as macrolanguages, with nb and nn subordinate to no,
and sr, hr, and bs subordinate to sh. We are not going to (and cannot) force
users to encode nb as no-nb, nor sr as sh-sr. We shouldn't be making
Cantonese a subordinate language either.

At one point, I did think that having the extlang structure would be better,
but the more I get into actually implementing extlangs, the more I find that
they are just a complication for no good result. What I think we should
instead be doing is adding a field to the registry that says that X is a
macrolanguage for Y, and adding information to RFC 4647 that indicates how one
can make use of this information in matching. That would also extend to the
current situation with no and sh in a uniform manner. It is also a much less
fragile mechanism: one can add more such X,Y relations over time without
gumming up everything.
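
A rough sketch of how such a relation could feed into lookup, under stated
assumptions: the MACROLANGUAGE table stands in for the proposed registry
field (which does not exist in this form), the function names are
hypothetical, and the fallback order (exhaust the specific language first,
then retry with the macrolanguage) is just one possible choice.

```python
# Hypothetical "X is a macrolanguage for Y" relations, keyed by Y.
# Illustrative subset only; a real table would come from the registry.
MACROLANGUAGE = {
    "yue": "zh", "cmn": "zh",
    "arz": "ar", "arb": "ar",
    "nb": "no", "nn": "no",
    "sr": "sh", "hr": "sh", "bs": "sh",
}


def truncations(tag):
    """Yield the tag, then progressively shorter forms, as in RFC 4647 lookup."""
    subtags = tag.lower().split("-")
    while subtags:
        yield "-".join(subtags)
        subtags.pop()
        # A trailing singleton (e.g. "x") is removed together with its subtag.
        if subtags and len(subtags[-1]) == 1:
            subtags.pop()


def lookup(lang_range, available, default=None):
    """Return the best available tag for lang_range, or default if none fits."""
    supported = {t.lower(): t for t in available}
    candidates = [lang_range]
    primary, _, rest = lang_range.partition("-")
    macro = MACROLANGUAGE.get(primary.lower())
    if macro:
        # Retry with the macrolanguage swapped in for the primary subtag.
        candidates.append(macro + ("-" + rest if rest else ""))
    for candidate in candidates:
        for form in truncations(candidate):
            if form in supported:
                return supported[form]
    return default


available = ["zh-Hant", "en", "no"]
cantonese = lookup("yue-Hant-HK", available)   # falls back via zh-Hant-HK
bokmal = lookup("nb-NO", available)            # falls back via no-NO
```

Here "yue-Hant-HK" resolves to "zh-Hant" and "nb-NO" resolves to "no", so
the no and sh cases are handled by the same table as zh, and new X,Y pairs
only add entries to it.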

Mark

On 3/19/07, Addison Phillips <addison@yahoo-inc.com> wrote:
>
> The idea of macro-languages is that they are not, themselves, languages.
> Rather, they are groupings of languages that can be usefully referred to
> collectively. That is, strictly speaking, "Chinese" (meaning "zh") isn't
> a language---and neither is "Arabic" (by which I mean the code "ar").
>
> Unfortunately, that isn't common usage or understanding of the
> situation. To most people, "Arabic" is a language. The idea that it has
> regional or other variations seems natural enough, but not, perhaps,
> that it is a set of somewhat related languages that share a historical
> and/or written tradition.
>
> Allowing both the macro- and plain-language codes on the same level is
> a recipe for confusion: does "ar-EG" == "arb-EG" == "arz" == "arz-EG"?
> What is supposed to match? Which tag should be used for a given
> document? How should one distinguish these?
>
> We have a tradition, furthermore, of not having "secret information" in
> language tags requiring extra mapping tables to make sense of or process
> the tags. This tradition is punctuated (and punctured) by an equal
> tradition of assigning "secret" meaning to language tags. Thus, for the
> longest time, "zh-TW" meant "Traditional Chinese".
>
> Extlangs help this a little by avoiding the choice on the initial
> subtag. Thus: "ar-EG" is related to "ar-arz-EG" and "ar-arb-EG" in
> distinct and somewhat logical ways.
>
> Mark's concern is that this tagging system doesn't play nicely with
> basic filtering. He's not cited the fact that we have two filtering
> schemes: extended filtering works more reasonably with extlangs.
> "zh-yue-Hans-CN", "zh-cmn-Hans", and "zh-Hans-CN" all match the range
> "zh-Hans", for example.
>
> The problem with lookup (and I use lookup extensively, so it concerns me
> deeply) might suggest that some extra smarts related to extlangs is
> going to be needed. On the other hand, some of this is going to have to
> be related to Maxim #1: "Tag Content Wisely". Choosing to avoid extlangs
> where they add no distinguishing information (not uncommon in resource
> file lookup for Arabic or Chinese, say) or *consistently* including them
> when they do (Lahnda???) will make things better.
>
> I admit to a good bit of trepidation writing the above, though.
>
> Addison
>
> Don Osborn wrote:
> > Thanks Frank for the summary list and Mark for the pointer. Actually I
> > was going to reference that page and ask if it is exactly what is in
> > question.
> >
> > Next (trying my luck here), is there any kind of way(s) that these can
> > be subgrouped? For instance ar ends up referring to standard Arabic, if
> > I understand correctly, and zh has been discussed already a lot; but
> > some other macrolanguages do not necessarily have a single standard
> > form. Some (macro) languages have a lot more in writing than others.
> > What I'm getting at is are there different sets of (possible)
> > complications that can be identified for the macrolanguages, languages
> > and extlang relationships, such that the list of 54 can be disaggregated
> > (perhaps in more than one way)?
> >
> > Not as if folks don't have enough else to think about, but it seems like
> > such an analysis, if it hasn't been done, might raise other productive
> > questions.
> >
> > Don
> >
> > *From:* Mark Davis [mailto:mark.davis@icu-project.org]
> > *Sent:* Sunday, March 18, 2007 2:37 PM
> > *To:* Frank Ellermann
> > *Cc:* ltru@lists.ietf.org
> > *Subject:* Re: [Ltru] Re: extlang
> >
> >
> >
> > Another way to look at it is on
> > http://www.sil.org/iso639-3/macrolanguages.asp, which provides the
> > language names and breakdowns.
> >
> > Mark
> >
> > On 3/18/07, Frank Ellermann <nobody@xyzzy.claranet.de> wrote:
> >
> > Don Osborn wrote:
> >
> >  > what is the actual number of (macro)languages where extlang issues arise?
> >
> > I try to determine this "manually" for Doug's latest 4645bis:
> > There are 480 "Prefix:" lines in the extlang part.
> >
> > ar      30              ay       2              az       2
> > bal      3              bik      5              bua      3
> > chm      2              cr       6              del      2
> > den      2              din      5              doi      2
> > fa       2              ff       9              gba      5
> > gn       5              gon      2              grb      5
> > hai      2              hmn     21              ik       2
> > iu       2              jrb      5              kg       3
> > kok      2              kpe      2              kr       3
> > ku       3              kv       2              lah      8
> > man      7              mg      10              mn       2
> > ms      13              mwr      6              oc       5
> > oj       7              om       4              ps       3
> > qu      44              raj      6              rom      7
> > sc       4              sgn    124              sq       4
> > sw       2              syr      2              tnh      4
> > uz       2              yi       2              za       2
> > zap     58              zh      13              zza      2
> >
> > That's 54 at the moment.
> >
> > Frank
> >
> >
> >
> > _______________________________________________
> > Ltru mailing list
> > Ltru@ietf.org <mailto:Ltru@ietf.org>
> > https://www1.ietf.org/mailman/listinfo/ltru
> >
> >
> >
> >
> > --
> > Mark
> >
> >
>
> --
> Addison Phillips
> Globalization Architect -- Yahoo! Inc.
>
> Internationalization is an architecture.
> It is not a feature.
>



-- 
Mark