Re: [Ltru] Re: Macrolanguage and extlang

"Mark Davis" <> Tue, 17 July 2007 19:19 UTC

Date: Tue, 17 Jul 2007 12:19:46 -0700
From: Mark Davis <>
To: Doug Ewell <>
Subject: Re: [Ltru] Re: Macrolanguage and extlang
Cc: LTRU Working Group <>

I just find your position hard to understand; perhaps you can help me. Maybe
some scenarios would help.

Suppose that:

   - My site has support for zh, zh-Hans, and zh-Hant. zh = Mandarin
   since that is what everyone means by "zh" currently. As is customary in
   fallback, the content of zh is the predominant form (zh-Hans). [This is just
   an example; if a TW site has a different convention, alternate examples can
   be given.]
   - A user comes in with different requests, listed below.

Scenario 1. The user's browser has the proposed "zh-yue-Hant-US". My lookup
falls back to zh, so I serve it up to the user. So even if the target of the
match (zh) is not Cantonese, you want a fallback to zh. I'm guessing that
you see this as better than if we defined the tag as "yue-Hant-US", since it
gets to some fallback that the user is likely to understand. But I don't see
this as much different than if we had fr-br-BE (meaning Breton, but fall
back to French), or ro-mo (meaning Moldavian, but falling back to Romanian).
And note that in the fallback, the script and region are completely lost.

Scenario 2. The user's browser has zh-cmn-Hant-US. In matching, we fall back
to zh. Note that in the fallback, the script and region are completely lost.
We have essentially just introduced a synonym for zh which causes fallback
to lose information, for no good reason.
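
As a rough illustration of Scenarios 1 and 2, here is a minimal Python sketch
of remove-from-right lookup against the example site. The function name
truncation_lookup and the SUPPORTED list are assumptions made up for this
example, not part of any real library, and the loop is a simplified reading of
the lookup scheme rather than a complete implementation.

```python
def truncation_lookup(requested, supported, default=None):
    """Remove-from-right lookup: drop the last subtag until something matches."""
    # Compare case-insensitively, but return the tag as the site spells it.
    by_lower = {tag.lower(): tag for tag in supported}
    subtags = requested.lower().split("-")
    while subtags:
        candidate = "-".join(subtags)
        if candidate in by_lower:
            return by_lower[candidate]
        subtags.pop()
        # A single-character subtag left dangling at the end is dropped too.
        if subtags and len(subtags[-1]) == 1:
            subtags.pop()
    return default


# Hypothetical site from the scenarios above.
SUPPORTED = ["zh", "zh-Hans", "zh-Hant"]

print(truncation_lookup("zh-yue-Hant-US", SUPPORTED))  # zh
print(truncation_lookup("zh-cmn-Hant-US", SUPPORTED))  # zh
print(truncation_lookup("zh-Hant-US", SUPPORTED))      # zh-Hant
print(truncation_lookup("yue-Hant-US", SUPPORTED))     # None
```

Both extlang requests land on plain zh, skipping zh-Hant, while the same
request without the extlang keeps its script.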

The problem with extlang is that the fallback from encompassed language to
macrolanguage is fundamentally different in kind than a fallback from region
to script to base language. In the case of script, like uz-Arab and uz-Latn,
or en-US vs en-GB, we really have variations on the same language, and
fallback makes sense. We ordered the subtags so that it works optimally.

The encompassed languages, on the other hand, are not just dialects, not
just variants. They are languages in their own right. Trying to insert them
into the fallback process just screws things up, because they need a
"sideways" matching, not just simple truncation fallback. If you want to do
any fallback with extlang, it would be to fall back from zh-yue-<other
stuff> to zh-<other stuff>. That means that in order to do reasonable
fallback, you can't just use truncation fallback anyway. So I see the
situation this way:

   1. The only reason for adding the complication of the extlang
   mechanism is to make truncation fallback work better.
   2. Truncation fallback with extlang doesn't work better.
   3. So there is no need to make encompassed languages be "secondary"
   languages by making them be "secondary" subtags.
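
The fallback from zh-yue-<other stuff> to zh-<other stuff> described above
could be sketched as follows. This is only one possible ordering of candidates,
and the names strip_extlang, truncation_chain and sideways_lookup are
hypothetical. The one syntactic assumption is that an extlang is a three-letter
subtag directly after the primary language, as in the RFC 4646 grammar.

```python
def strip_extlang(tag):
    """Drop any 3-letter subtags that directly follow the primary language."""
    subtags = tag.split("-")
    while len(subtags) > 1 and len(subtags[1]) == 3 and subtags[1].isalpha():
        del subtags[1]
    return "-".join(subtags)


def truncation_chain(tag):
    """'a-b-c' -> ['a-b-c', 'a-b', 'a'] (plain remove-from-right)."""
    subtags = tag.split("-")
    return ["-".join(subtags[:n]) for n in range(len(subtags), 0, -1)]


def sideways_lookup(requested, supported, default=None):
    by_lower = {tag.lower(): tag for tag in supported}
    requested = requested.lower()
    chain = truncation_chain(requested)
    stripped = strip_extlang(requested)
    if stripped != requested:
        # Try the extlang-bearing tags first, then the same tag with the
        # extlang removed, so script and region get a chance to match.
        chain = chain[:-1] + truncation_chain(stripped)
    for candidate in chain:
        if candidate in by_lower:
            return by_lower[candidate]
    return default


SUPPORTED = ["zh", "zh-Hans", "zh-Hant"]  # hypothetical site

print(sideways_lookup("zh-yue-Hant-US", SUPPORTED))  # zh-Hant
print(sideways_lookup("zh-cmn-Hant-US", SUPPORTED))  # zh-Hant
```

The point of the sketch is that the matcher already has to do more than
truncate, whether or not the extlang sits inside the tag.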

The goals of extlang are good, to make matching work better, but in practice
it just makes things worse. [Speaking to those familiar with C++, it feels a
bit like the default assignment operator in C++. Nice in theory, but in
practice it gums things up more than it fixes, since once you are beyond
very simple (toy) classes, the default is almost always wrong -- but because
it is supplied behind your back you don't realize it.]

So instead of adding the extlang mechanism to RFC 4646, what we really need
to do is to point people to how to handle yue and other encompassed
languages along with mo/ro, tl/fil, and other edge cases in a reasonable
way, by augmenting matching.
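
One way such augmented matching might look, purely as a sketch: keep a small
table of "try this language next" hints and retry truncation with the primary
language swapped. The RELATED table, its entries and their directions, and the
function names are all illustrative assumptions, not registry data or an
existing API.

```python
# Hypothetical hints: if nothing matches for the key, retry with the value.
RELATED = {"yue": "zh", "cmn": "zh", "mo": "ro", "fil": "tl", "tl": "fil"}


def truncation_chain(tag):
    """'a-b-c' -> ['a-b-c', 'a-b', 'a'] (plain remove-from-right)."""
    subtags = tag.split("-")
    return ["-".join(subtags[:n]) for n in range(len(subtags), 0, -1)]


def augmented_lookup(requested, supported, default=None):
    by_lower = {tag.lower(): tag for tag in supported}
    tag = requested.lower()
    seen = set()
    while True:
        for candidate in truncation_chain(tag):
            if candidate in by_lower:
                return by_lower[candidate]
        primary, _, rest = tag.partition("-")
        seen.add(primary)
        fallback = RELATED.get(primary)
        if fallback is None or fallback in seen:
            # Nothing left to try, or a tl/fil style cycle.
            return default
        # Swap the primary language, keep script/region/etc. as they were.
        tag = fallback + ("-" + rest if rest else "")


print(augmented_lookup("yue-Hant-US", ["zh", "zh-Hans", "zh-Hant"]))  # zh-Hant
print(augmented_lookup("mo-MD", ["ro", "en"]))                        # ro
print(augmented_lookup("tl-PH", ["fil"]))                             # fil
```

With a table like this, a plain "yue-Hant-US" request still reaches zh-Hant
without any extlang in the tag.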


On 7/15/07, Doug Ewell <> wrote:
> Mark Davis wrote:
>
> > The main argument I've heard for extlang is behind-the-scenes-inertia.
> > While we made provision in 4646 for possibly accepting them in the
> > future, it was by no means a done-deal.
>
> The main argument that has been offered is that it makes matching
> easier.  Whether that argument was heard is a different question.
>
> > The only reason I've heard advocated for them is that it makes
> > matching easier. But in practice, we have simply not found that to be
> > true. If it is indeed worthwhile to add this mechanism to 4646, a
> > good case needs to be made for it; and inertia isn't a good case.
>
> Here is the argument restated.  It is based not on behind-the-scenes
> inertia, but on backward compatibility, the exact same issue that caused
> us to adopt Suppress-Script.
>
> Existing Cantonese text has been tagged as "zh", the basic ISO
> 639-1-based tag, or as "zh-yue", the tag that was registered for this
> purpose back in 1999.
>
> The extlang mechanism would have established "yue" as an extlang under
> "zh", so the proper tagging of Cantonese would continue to be "zh" (more
> general) or "zh-yue" (more specific).  Matching engines would continue
> to operate as they do now.
>
> The proposed mechanism establishes "yue" as a primary language subtag,
> so the proper tagging of Cantonese becomes "yue".  Matching engines must
> be upgraded to RFC 4646bis in order to have any chance at finding a
> match.  The much-beloved and much-catered-to RFC 3066 remove-from-right
> "fallback" algorithm will NEVER find a match between "yue" and "zh",
> regardless of whether script and/or region subtags are involved.
>
> Put another way: Extlangs may not make matching easier, but *not* having
> extlangs will make matching harder.
>
> --
> Doug Ewell  *  Fullerton, California, USA  *  RFC 4645  *  UTN #14
