Re: [Ltru] updated demo

Mark Davis ⌛ <mark@macchiato.com> Sun, 28 June 2009 20:07 UTC

In-Reply-To: <ba4134970906280207td8dbdd4l8a4860f7ee4de28@mail.gmail.com>
References: <30b660a20906271138o186f82a5xd2531f70806ab3be@mail.gmail.com> <ba4134970906280207td8dbdd4l8a4860f7ee4de28@mail.gmail.com>
Date: Sun, 28 Jun 2009 13:07:24 -0700
Message-ID: <30b660a20906281307p7324a2a4uf1a29a41d6271378@mail.gmail.com>
From: Mark Davis ⌛ <mark@macchiato.com>
To: Felix Sasaki <felix.sasaki@fh-potsdam.de>
Cc: LTRU Working Group <ltru@ietf.org>
Subject: Re: [Ltru] updated demo

It does look similar. After looking them over, I think sometimes one is
better and sometimes the other is. The differences I can see (other than UI)
are:

*Feedback on ill-formed, invalid, or non-preferred values.*

de-x

   - http://unicode.org/cldr/utility/languageid.jsp?a=de-x shows the tag
   where the problem lies, but not potential fixes. (I use a regex based on the
   ABNF for well-formedness, and if it fails I just show the tag where the
   problem is.)
   - http://www.w3.org/2008/05/lta/language-tags/q?input=de-x shows what the
   potential subtags at that point might be.

iw-su

   - http://unicode.org/cldr/utility/languageid.jsp?a=iw-su&l=en shows the
   replacement values for iw and su.
   - http://www.w3.org/2008/05/lta/language-tags/q?input=iw-su&output=html&hl=en
   just says they are valid. It does show all the registry information, like
   when the code was added.

eng-840

   - http://unicode.org/cldr/utility/languageid.jsp?a=eng-840&l=en shows the
   replacements when the wrong source code was chosen: a 3-letter language
   code where a 2-letter one exists (common in the field), or a 3-digit
   region code where a 2-letter one exists.
   - http://www.w3.org/2008/05/lta/language-tags/q?input=eng-840&output=html&hl=en
   just says they are invalid.
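
To make the "replacement values" idea concrete, here is a minimal sketch of
that kind of lookup. The two tables and the function name are hypothetical
and hand-picked for this message; a real tool would read the IANA subtag
registry and CLDR data rather than hard-code entries, and the SU list is
deliberately partial.

```python
# Hypothetical, hand-picked excerpts; not the data the demo actually uses.
LANGUAGE_REPLACEMENTS = {
    "iw": ["he"],   # deprecated code for Hebrew
    "eng": ["en"],  # 3-letter code where a 2-letter code exists
}
REGION_REPLACEMENTS = {
    "840": ["US"],       # numeric code where a 2-letter code exists
    "SU": ["RU", "AZ"],  # partial list: several successors, no single answer
}


def suggest_replacements(tag):
    """Return {subtag: [candidates]} for subtags with a known replacement."""
    subtags = tag.replace("_", "-").split("-")
    suggestions = {}
    language = subtags[0].lower()
    if language in LANGUAGE_REPLACEMENTS:
        suggestions[subtags[0]] = LANGUAGE_REPLACEMENTS[language]
    # Simplification: any later subtag that appears in the region table is
    # treated as a region, without checking its position in the tag.
    for subtag in subtags[1:]:
        region = subtag.upper()
        if region in REGION_REPLACEMENTS:
            suggestions[subtag] = REGION_REPLACEMENTS[region]
    return suggestions
```

A subtag with more than one candidate (like SU here) is reported with all of
them instead of a single guess.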

*Localization:*

sl-Cyrl-YU - Arabic, German

   - http://unicode.org/cldr/utility/languageid.jsp?a=sl-Cyrl-YU&l=ar and
   http://unicode.org/cldr/utility/languageid.jsp?a=sl-Cyrl-YU&l=de show
   localized subtag names.
   - http://www.w3.org/2008/05/lta/language-tags/q?input=sl-Cyrl-YU&output=html&hl=ar
   omits text;
   http://www.w3.org/2008/05/lta/language-tags/q?input=sl-Cyrl-YU&output=html&hl=de
   has a localized UI, but not localized subtag names.

*Prefix Warnings*

en-cmn-rozaj

   - http://unicode.org/cldr/utility/languageid.jsp?a=en-cmn&l=ar doesn't
   give a warning (it just applies strict validity).
   - http://www.w3.org/2008/05/lta/language-tags/q?input=en-cmn-rozaj&output=html&hl=en
   does supply warnings for missing variant prefixes.
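
A prefix warning of this kind could be produced roughly as follows. This is
only a sketch: VARIANT_PREFIXES is a hypothetical two-entry excerpt of the
registry's Prefix fields, prefix_warnings is a made-up name, and the
comparison is a plain string-prefix test, which is simpler than what either
tool may actually do.

```python
# Hypothetical excerpt of registry "Prefix" fields for two variants.
VARIANT_PREFIXES = {
    "rozaj": ["sl"],
    "1901": ["de"],
}


def prefix_warnings(tag):
    """Return a warning for each known variant used without its prefix."""
    subtags = tag.lower().replace("_", "-").split("-")
    warnings = []
    for index, subtag in enumerate(subtags[1:], start=1):
        prefixes = VARIANT_PREFIXES.get(subtag)
        if prefixes is None:
            continue  # not a variant this sketch knows about
        preceding = "-".join(subtags[:index])
        matched = any(
            preceding == prefix or preceding.startswith(prefix + "-")
            for prefix in prefixes
        )
        if not matched:
            warnings.append(
                f"variant '{subtag}' expects one of the prefixes "
                f"{prefixes}, found '{preceding}'"
            )
    return warnings
```

The result is a list of warnings, not a validity verdict, so a tag like
en-cmn-rozaj can still be reported as valid with a note attached.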

*Canonical Form*

sl-cyrl-Yu-rozaj-Solba-1994-b-1234-a-Foobar-x-b-1234-a-Foobar

   - http://unicode.org/cldr/utility/languageid.jsp?a=sl-cyrl-Yu-rozaj-Solba-1994-b-1234-a-Foobar-x-b-1234-a-Foobar&l=en
   puts the results in canonical casing and order (and shows canonical
   replacements). It does not validate extensions, like "b-1234". (It follows
   LDML canonical order for variants - alphabetical.)
   - http://www.w3.org/2008/05/lta/language-tags/q?input=sl-Cyrl-YU-rozaj-solba-1994-b-1234-a-Foobar-x-b-1234-a-Foobar&output=html&hl=en
   doesn't. It also gives a validation error on extensions.

Validating extensions is debatable - their validity is established outside
of the spec and the IANA subtag registry. The best behavior is probably
neither of the above: a warning rather than an error.

Note that http://unicode.org/cldr/utility/languageid.jsp says
"suggested canonical form", since in the case of multiple replacements
it doesn't try to pick the best one. E.g., the best guess for ru-SU is ru-RU,
but the best guess for az-SU would be az-AZ. It also doesn't try to find
missing prefix values for variants; that's probably of such low frequency
that it doesn't pay.
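
For illustration, the casing and variant-ordering part of a "suggested
canonical form" might look like the sketch below. It assumes the input is
already well-formed, classifies subtags by length and character type only,
applies no replacements, and leaves extension and private-use subtags in
their original order; the function name is made up, and this is not the
demo's actual code.

```python
def suggest_canonical_form(tag):
    """Apply conventional casing and sort variants alphabetically."""
    subtags = tag.replace("_", "-").split("-")
    head = [subtags[0].lower()]  # language
    variants = []
    tail = []
    for position, subtag in enumerate(subtags[1:], start=1):
        if len(subtag) == 1:
            # Singleton: the rest is extension or private-use material.
            # Lowercase it but keep its order.
            tail = [s.lower() for s in subtags[position:]]
            break
        if len(subtag) == 3 and subtag.isalpha():
            head.append(subtag.lower())   # extlang
        elif len(subtag) == 4 and subtag.isalpha():
            head.append(subtag.title())   # script
        elif len(subtag) == 2 or (len(subtag) == 3 and subtag.isdigit()):
            head.append(subtag.upper())   # region
        else:
            variants.append(subtag.lower())
    return "-".join(head + sorted(variants) + tail)
```

With the long example above, this sketch would yield
sl-Cyrl-YU-1994-rozaj-solba-b-1234-a-foobar-x-b-1234-a-foobar.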

*Completeness*

   - http://unicode.org/cldr/utility/languageid.jsp?a=i-default&l=en doesn't
   allow grandfathered codes. (Following LDML.)
   - http://www.w3.org/2008/05/lta/language-tags/q?input=i-default does.


FYI, the regex it uses is:

      (?: ( [a-z A-Z]{2,8} | [a-z A-Z]{2,3} [-_] [a-z A-Z]{3} )
      (?: [-_] ( [a-z A-Z]{4} ) )?
      (?: [-_] ( [a-z A-Z]{2} | [0-9]{3} ) )?
      (?: [-_] ( (?: [0-9 a-z A-Z]{5,8} | [0-9] [0-9 a-z A-Z]{3} )
         (?: [-_] (?: [0-9 a-z A-Z]{5,8} | [0-9] [0-9 a-z A-Z]{3} ) )* ) )?
      (?: [-_] ( [a-w y-z A-W Y-Z] (?: [-_] [0-9 a-z A-Z]{2,8} )+
         (?: [-_] [a-w y-z A-W Y-Z] (?: [-_] [0-9 a-z A-Z]{2,8} )+ )* ) )?
      (?: [-_] ( [xX] (?: [-_] [0-9 a-z A-Z]{1,8} )+ ) )? )
    | ( [xX] (?: [-_] [0-9 a-z A-Z]{1,8} )+ )
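
In case anyone wants to try the expression outside the demo, here is a rough
Python sketch. Python's verbose mode does not ignore blanks inside character
classes, so the sketch strips all whitespace from the pattern before
compiling it. The names (LANGUAGE_ID, FIELDS, parse) and the choice to
require a whole-string match are assumptions made for this example, not a
description of the demo's code.

```python
import re

# The expression from the message; only the line breaks differ.
LANGUAGE_ID_SOURCE = r"""
  (?: ( [a-z A-Z]{2,8} | [a-z A-Z]{2,3} [-_] [a-z A-Z]{3} )
  (?: [-_] ( [a-z A-Z]{4} ) )?
  (?: [-_] ( [a-z A-Z]{2} | [0-9]{3} ) )?
  (?: [-_] ( (?: [0-9 a-z A-Z]{5,8} | [0-9] [0-9 a-z A-Z]{3} )
     (?: [-_] (?: [0-9 a-z A-Z]{5,8} | [0-9] [0-9 a-z A-Z]{3} ) )* ) )?
  (?: [-_] ( [a-w y-z A-W Y-Z] (?: [-_] [0-9 a-z A-Z]{2,8} )+
     (?: [-_] [a-w y-z A-W Y-Z] (?: [-_] [0-9 a-z A-Z]{2,8} )+ )* ) )?
  (?: [-_] ( [xX] (?: [-_] [0-9 a-z A-Z]{1,8} )+ ) )? )
| ( [xX] (?: [-_] [0-9 a-z A-Z]{1,8} )+ )
"""

# No whitespace in the pattern is significant, so remove all of it.
LANGUAGE_ID = re.compile("".join(LANGUAGE_ID_SOURCE.split()))

# Names for capture groups 1-6; group 7 is a tag that is entirely private use.
FIELDS = ("language", "script", "region", "variants", "extensions",
          "private_use")


def parse(tag):
    """Return the matched fields, or None if the tag is not well-formed."""
    match = LANGUAGE_ID.fullmatch(tag)
    if match is None:
        return None
    if match.group(7) is not None:
        return {"private_use": match.group(7)}
    return {name: value
            for name, value in zip(FIELDS, match.groups())
            if value is not None}
```

With this sketch, parse("de-x") and parse("i-default") return None, while
parse("eng-840") returns the language and region fields: the expression
checks well-formedness only, and validity is a separate step.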

Mark


On Sun, Jun 28, 2009 at 02:07, Felix Sasaki <felix.sasaki@fh-potsdam.de> wrote:

> Hello Mark,
>
> this looks similar to
> http://www.w3.org/2008/05/lta/
> my language tag parser currently based on draft 21 of rfc4646bis. lta also
> contains some error checking mechanisms, see examples like
>
> http://www.w3.org/2008/05/lta/language-tags/q?input=de-x
> http://www.w3.org/2008/05/lta/language-tags/q?input=xa
> http://www.w3.org/2008/05/lta/language-tags/q?input=en-latn
> http://www.w3.org/2008/05/lta/language-tags/q?input=ja-1901
> http://www.w3.org/2008/05/lta/language-tags/q?input=fr-cmn
> http://www.w3.org/2008/05/lta/language-tags/q?input=zh-cmn-cmn
> http://www.w3.org/2008/05/lta/language-tags/q?input=zh-cmn-a-bbb-a-ccc
> http://www.w3.org/2008/05/lta/language-tags/q?input=de-de-1901-1901
>
> Output is available in HTML with German UI and English, and in an XML
> format, see e.g.
>
> http://www.w3.org/2008/05/lta/language-tags/q?input=de-de-1901-1901&output=xml
>
> My comment on your tool is that to co-ordinate such efforts it would be
> great to have a common machine-readable output format for language tag
> parsing, also e.g. to deal with error descriptions like
>
> <lta:variant>
>    <lta:subtag>1901</lta:subtag>
>    <lta:registryInfo>
>       <lta:var ty="variant" su="1901" ad="2005-10-16">
>          <lta:ds>Traditional German orthography</lta:ds>
>          <lta:pref>de</lta:pref>
>       </lta:var>
>    </lta:registryInfo>
>    <lta:matchedPrefix>de</lta:matchedPrefix>
>    <lta:error type="e007">
>       <lta:errorText>Variant repetition</lta:errorText>
>       <lta:errorAddInfo>
>          <lta:subtag>1901</lta:subtag>
>       </lta:errorAddInfo>
>    </lta:error>
> </lta:variant>
>
>
> Felix
>
> 2009/6/27 Mark Davis ⌛ <mark@macchiato.com>
>
>> I updated the demo at http://unicode.org/cldr/utility/languageid.jsp to
>> parse extlangs. The samples include official languages and the scripts they
>> use (based on CLDR data), and the names have localizations where available.
>>
>> Comments welcome.
>>
>> Mark