Re: [Ltru] Macrolanguage and extlang

"Mark Davis" <mark.davis@icu-project.org> Sat, 14 July 2007 16:32 UTC

Message-ID: <30b660a20707140932qa998ab3y23d07c062d08aab1@mail.gmail.com>
Date: Sat, 14 Jul 2007 09:32:22 -0700
From: Mark Davis <mark.davis@icu-project.org>
To: Don Osborn <dzo@bisharat.net>
Subject: Re: [Ltru] Macrolanguage and extlang
In-Reply-To: <001001c7c612$30a30710$91e91530$@net>
MIME-Version: 1.0
References: <30b660a20707131806o19919cc7v97cc82f3eada43ff@mail.gmail.com> <001001c7c612$30a30710$91e91530$@net>
Cc: LTRU Working Group <ltru@ietf.org>

Thanks.

I think one of the things that we realized when looking at how this would
work in practice is that we are better off if we treat macrolanguage as a
piece of very useful information for matching, but one that can be enhanced
(that is, changed) over time as more information becomes available.
Hard-coding it into extlang doesn't serve that purpose, and causes other
problems, notably that the other fields are lost in fallback: if we had
zh-yue-Hant, then by the time we get to zh in fallback, we've lost the Hant.
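
To make that concrete, here is a rough sketch in Python. The function names and the small MACROLANGUAGE table are made up for illustration; they are not registry data or any library's API. It compares plain RFC 4647-style truncation of an extlang-style tag with a fallback that swaps in the macrolanguage while keeping the remaining subtags:

```python
# Hypothetical table: encompassed language -> macrolanguage (illustrative only).
MACROLANGUAGE = {"yue": "zh", "nb": "no", "nn": "no"}


def truncation_chain(tag):
    """RFC 4647 Lookup-style fallback: drop the last subtag at each step."""
    subtags = tag.split("-")
    chain = []
    while subtags:
        chain.append("-".join(subtags))
        subtags.pop()
        # A single-character subtag left at the end is dropped as well.
        if subtags and len(subtags[-1]) == 1:
            subtags.pop()
    return chain


def macro_aware_chain(tag):
    """Truncate as above, then retry with the macrolanguage swapped in
    for the primary subtag, so that script and region are kept."""
    chain = truncation_chain(tag)
    primary, _, rest = tag.partition("-")
    macro = MACROLANGUAGE.get(primary)
    if macro:
        swapped = macro + ("-" + rest if rest else "")
        chain += [t for t in truncation_chain(swapped) if t not in chain]
    return chain


print(truncation_chain("zh-yue-Hant"))  # 'zh-Hant' is never tried
print(macro_aware_chain("yue-Hant"))    # 'zh-Hant' is tried before 'zh'
```

With the extlang form, the candidates are zh-yue-Hant, zh-yue, zh. With the macrolanguage kept as data, the candidates for yue-Hant also include zh-Hant ahead of bare zh.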

So that is the origin of the text we are proposing.

There are a number of edge cases such as deprecated codes, closely related
languages, or practical fallbacks (e.g., if someone speaks X they are likely to
speak Y, even if X and Y are not linguistically related) that are simply
unsuited to hard-coding in the tag. If I hit a code like "iw-PL", I want to
match that with "he-PL", not depend on some kind of fallback between "iw"
and "he"; otherwise it loses information (the PL). We do provide that kind
of information in the deprecated field, and with the macrolanguage field we
would provide more (for example, it would make clear the relation between
no, nb, and nn and how that could be used in matching). Other useful
information is the scripts used with a language in practice; Suppress-Script
supplies just a little of that, but doesn't tell me that Uzbek is
customarily written with Arabic, Latin, or Cyrillic, but not with (say)
Tagalog.
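
As a rough illustration of the "iw-PL" case (a Python sketch; the PREFERRED table and the function names are hypothetical, and the table is only a small excerpt of deprecated primary subtags and their replacements), the deprecated subtag is replaced before comparison, so the rest of the tag survives:

```python
# Hypothetical excerpt: deprecated primary language subtag -> replacement.
PREFERRED = {"iw": "he", "in": "id", "ji": "yi"}


def canonicalize(tag):
    """Replace a deprecated primary subtag; leave all other subtags alone."""
    primary, sep, rest = tag.partition("-")
    return PREFERRED.get(primary.lower(), primary) + sep + rest


def same_language_tag(a, b):
    """Compare two tags case-insensitively after canonicalizing both."""
    return canonicalize(a).lower() == canonicalize(b).lower()


print(canonicalize("iw-PL"))
print(same_language_tag("iw-PL", "he-PL"))
```

The point is that "iw-PL" is compared as "he-PL" directly, rather than falling back to "iw" and dropping the PL along the way.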

The more information there is available, whether it be in the language
subtag registry or somewhere else, the better a job people can do in dealing
with some of the edge cases that turn up in matching.

Mark

On 7/14/07, Don Osborn <dzo@bisharat.net> wrote:
>
>  Mark, Thanks for this update. In reading this over (and trying to see
> between the lines) with an eye to implications for many African languages
> and macrolanguages, the mention of Romanian/Moldavian seems particularly
> relevant, as an example of cases "where the 'best fit' information is not
> contained in the language registry." This is an area in which I hope that
> experts on African languages who have a familiarity with tagging issues can
> be organized to propose amendments to the system (i.e., ISO 639, which I
> realize is not the purview of this list, and the current RFCs).
>
>
>
> There are also cases where macrolanguages are defined, but a still
> somewhat fluid situation wrt standardization makes defining their use
> problematic (I've posted previously on some of the issues as I see them,
> both on this list and on ietf-languages - such as cases where the language
> tag may be less appropriate than the macrolanguage tag).  Here too there
> seems to be a need for input by experts on African languages in discussions
> of tagging as well as in language planning.
>
>
>
> In the meantime, I hope that the new wording can accommodate all such
> situations, especially for languages with fewer resources and emerging
> standards.
>
>
>
> The fallback language issue (mentioned in the example re Breton) raises
> another question: can there be more than one fallback language? In the case
> of many of the crossborder languages in Africa (such as Hausa, Swahili,
> Wolof, Fula, Tsonga, Oshiwambo, etc.) this would be helpful.
>
>
>
> Don
>
>
>
>
>
>
>
> *From:* Mark Davis [mailto:mark.davis@icu-project.org]
> *Sent:* Friday, July 13, 2007 9:06 PM
> *To:* LTRU Working Group
> *Subject:* [Ltru] Macrolanguage and extlang
>
>
>
> Addison and I have discussed the issue of extlang and Macrolanguages and
> are proposing the following text replacing the use of extlang.
>
> *[A new section called Macrolanguages: ]*
>
> The Macrolanguage field contains a primary language subtag that *
> encompasses* this subtag. That is, this language is a dialect or
> sub-language of the Macrolanguage, and is called an *encompassed* subtag.
> The Macrolanguage value is defined by ISO 639-3. The field can be useful to
> applications or users when selecting language tags or as additional metadata
> useful in matching. The Macrolanguage field can only occur in records of
> type 'language'. Only values assigned by ISO 639-3 will be considered for
> inclusion. Macrolanguage fields MAY be added via the normal registration
> process whenever ISO 639-3 defines new values. Macrolanguages are
> informational, and MAY be removed or changed if ISO 639-3 changes the
> values.
>
> For example, the language subtags 'nb' (Norwegian Bokmal) and 'nn'
> (Norwegian Nynorsk) each have a Macrolanguage entry of 'no' (Norwegian). For
> more information see [Choice].
>
> *[A new section in tag choice (section 4.1), referenced from the above] *
>
> Languages with a Macrolanguage field in the registry sometimes can be
> usefully referenced using their Macrolanguage. However, the Macrolanguage
> field doesn't define what the relationship is between the language subtag
> whose record it appears in and its encompassed language or languages. Nor
> does it define how the encompassed languages are related to one another. In
> some cases, the Macrolanguage has a standard form as well as a variety of
> less-common dialects. For example, the Macrolanguage 'ar' (Arabic) and the
> subtag 'arb' (Standard Arabic) generally describe the same language, with
> other subtags describing less-common local variations. In other cases there
> is no particular standard form and the encompassed subtags describe specific
> variations within the parent language.
>
> Applications MAY use Macrolanguage information to improve matching or
> language negotiation. For example, the information that 'sr' and 'hr' share
> a Macrolanguage expresses a closer relation between those languages than
> between, say, 'sr' and 'mk' (Macedonian). It is valid to use either the
> encompassed language or its Macrolanguage to form language tags. However,
> many matching applications will not be aware of the relationship between the
> languages. Care in selecting which subtags are used is crucial to
> interoperability. In general, use the most specific tag. However, where the
> standard written form of an encompassed language is captured by the
> Macrolanguage, the Macrolanguage should still be used for written material.
>
> In particular, Chinese language(s) and dialects call for special
> consideration. Because the written form is very similar for most languages
> having 'zh' as a Macrolanguage (and because historically subtags for the
> various sub-languages and dialects were not available), languages such as
> 'yue' (Cantonese) have usually used tags beginning with the subtag 'zh'.
> This past practice of tagging means that Macrolanguage information is
> encouraged when searching for content or when providing fallbacks in
> language negotiation. For example, the information that 'yue' has a
> Macrolanguage of 'zh' could be used in the Lookup algorithm to fall back from
> a request for "yue-Hans-CN" to "zh-Hans-CN" *without losing the script and
> region information* (even though the user did not specify "zh-Hans-CN" in
> their language priority list).
>
> However, the Macrolanguage is only one of many additional pieces of
> information that can be used in matching languages. There are many other
> circumstances where the "best fit" information is not contained in the
> language registry. For example, the languages "ro" (Romanian) and "mo"
> (Moldavian) are very closely related, and so for searching it is often best
> to treat them as being the same. In other cases, the best fallback for a
> requested language may be a completely unrelated language, but one that a
> majority of speakers of the requested language may understand. For example,
> in a given application the best fallback for "br" (Breton) may be "fr"
> (French) -- rather than the more closely related "cy" (Welsh) -- because
> Breton readers are far more likely to be able to read French than Welsh.
>
> For more information on matching, see [RFC 4647].
>
> *[In the section talking about updates]*
>
> The Macrolanguage field is added whenever a language has a corresponding
> Macrolanguage in [ISO 639-3]. For example, 'sr' (Serbian) will have the
> Macrolanguage value 'sh' (Serbo-Croatian).
>
> *[Other changes]*
>
> [Search for instances of "Suppress-Script" (just as a place to find where
> field descriptions are) and make an addition of "Macrolanguage" if
> appropriate, e.g., in the "LANGUAGE SUBTAG REGISTRATION FORM"]
>
>
>
> --
> Mark
>



-- 
Mark