Re: [Ltru] updated demo

Felix Sasaki <felix.sasaki@fh-potsdam.de> Sun, 28 June 2009 20:47 UTC

From: Felix Sasaki <felix.sasaki@fh-potsdam.de>
To: Mark Davis ⌛ <mark@macchiato.com>
Cc: LTRU Working Group <ltru@ietf.org>
Subject: Re: [Ltru] updated demo

2009/6/28 Mark Davis ⌛ <mark@macchiato.com>

> It does look similar. After looking them over, I think sometimes one is
> better and sometimes the other is. The differences I can see (other than UI)
> are:
>
> *Feedback on ill-formed, invalid, or non-preferred values.*
>
> de-x
>
>    - http://unicode.org/cldr/utility/languageid.jsp?a=de-x shows the tag
>    where the problem lies, but not potential fixes. (I use a regex based on the
>    ABNF for well-formedness, and if it fails I just show the tag where the
>    problem is.)
>
>
FYI, my proposed fixes are based on an LL(1) parsing implementation of the
rfc4646bis ABNF, see
http://www.w3.org/2008/05/lta/04/abnf-check.xsl
Of course, one might argue about whether LL(1) parsing is the appropriate
means of generating the proposals, but the proposals looked intuitive to me.
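
To illustrate the kind of proposal meant here, a much-simplified Python
sketch follows. It is not the XSLT in abnf-check.xsl; it only scans a tag
with one subtag of lookahead and, where the scan fails, reports which kinds
of subtag could stand at that position. The names STAGES, TAIL and analyse
are made up for this example, and the sketch ignores grandfathered tags,
tags consisting only of private use, the 4-8 letter language forms, the
limit on the number of extlangs, and all duplicate checks.

```python
import re

# Positional parts of the langtag production, in order:
# (name, pattern, repeatable). Simplified from the ABNF.
STAGES = [
    ("language", r"[a-z]{2,3}", False),
    ("extlang", r"[a-z]{3}", True),
    ("script", r"[a-z]{4}", False),
    ("region", r"[a-z]{2}|[0-9]{3}", False),
    ("variant", r"[a-z0-9]{5,8}|[0-9][a-z0-9]{3}", True),
]
TAIL = ["extension singleton", "x (private use)"]


def analyse(tag):
    """Return (ok, index, expected).

    index is the position of the offending subtag (len(subtags) means the
    input ended too early); expected lists what could stand there.
    """
    subtags = tag.lower().split("-")
    i, stage = 0, 0
    while i < len(subtags):
        # The language subtag is mandatory; later parts may be skipped.
        candidates = [0] if stage == 0 else range(stage, len(STAGES))
        hit = next((s for s in candidates
                    if re.fullmatch(STAGES[s][1], subtags[i])), None)
        if hit is None:
            break
        stage = hit if STAGES[hit][2] else hit + 1
        i += 1
    if stage == 0:
        return False, 0, ["language"]
    expected = [name for name, _, _ in STAGES[stage:]] + TAIL
    # Whatever is left must be extensions and/or a private-use sequence.
    while i < len(subtags):
        sub = subtags[i]
        if not re.fullmatch(r"[a-z0-9]", sub):
            return False, i, expected
        private = sub == "x"
        body = r"[a-z0-9]{1,8}" if private else r"[a-z0-9]{2,8}"
        kind = "private-use subtag" if private else "extension subtag"
        j = i + 1
        while j < len(subtags) and re.fullmatch(body, subtags[j]):
            j += 1
        # A singleton needs at least one subtag after it, and nothing
        # but private-use subtags may follow "x".
        if j == i + 1 or (private and j < len(subtags)):
            return False, j, [kind]
        i = j
        expected = [kind] + TAIL
    return True, None, []


print(analyse("de-x"))
```

For "de-x" the scan accepts "de", sees the private-use singleton "x", and
then runs out of input, so the proposal is a private-use subtag at
position 2.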


>
>    - http://www.w3.org/2008/05/lta/language-tags/q?input=de-x shows what
>    the potential subtags at that point might be.
>
> iw-su
>
>    - http://unicode.org/cldr/utility/languageid.jsp?a=iw-su&l=en shows the
>    replacement values for iw and su.
>    - http://www.w3.org/2008/05/lta/language-tags/q?input=iw-su&output=html&hl=en
>    just says they are valid. It does show all the registry information, like
>    when the code was added.
>
>
Correct. That and the replacement values you mention below are still on my
TODO list.

>
>
> eng-840
>
>    - http://unicode.org/cldr/utility/languageid.jsp?a=eng-840&l=en shows
>    the replacements for the wrong choice of source code (3 letter language when
>    2 letter exists (common in the field), 3 digit region when 2 letter exists)
>    - http://www.w3.org/2008/05/lta/language-tags/q?input=eng-840&output=html&hl=en
>    just says they are invalid.
>
> *Localization:*
>
> sl-Cyrl-YU - Arabic, German
>
>    - http://unicode.org/cldr/utility/languageid.jsp?a=sl-Cyrl-YU&l=ar and
>    http://unicode.org/cldr/utility/languageid.jsp?a=sl-Cyrl-YU&l=de show
>    localized subtag names.
>    - http://www.w3.org/2008/05/lta/language-tags/q?input=sl-Cyrl-YU&output=html&hl=ar
>    omits text;
>    http://www.w3.org/2008/05/lta/language-tags/q?input=sl-Cyrl-YU&output=html&hl=de
>    has a localized UI, but not localized subtag names.
>
>
Yes. I guess you are using CLDR data for the localized subtag names? For
including such data easily, a common result format for language tag analysis
would be good.



>
>
> *Prefix Warnings*
>
> en-cmn-rozaj
>
>    - http://unicode.org/cldr/utility/languageid.jsp?a=en-cmn&l=ar doesn't
>    give a warning (it just applies strict validity).
>    - http://www.w3.org/2008/05/lta/language-tags/q?input=en-cmn-rozaj&output=html&hl=en
>    does supply warnings for missing variant prefixes.
>
>
Yes, for extlang prefixes and for variant prefixes.
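
A rough Python sketch of such a variant prefix check, for illustration
only. The table VARIANT_PREFIXES is a two-entry excerpt typed in by hand
(1901 with prefix de, rozaj with prefix sl); a real tool would load the
full registry. The function name prefix_warnings and the simple
"starts with the prefix" test are assumptions of this example, and prefix
matching in lta itself may be stricter.

```python
# Hand-typed excerpt of registry data: variant subtag -> allowed prefixes.
VARIANT_PREFIXES = {
    "1901": ["de"],
    "rozaj": ["sl"],
}


def prefix_warnings(tag):
    """Return one warning string per variant whose prefix is missing."""
    subtags = tag.lower().split("-")
    warnings = []
    for i, sub in enumerate(subtags):
        if sub == "x":
            break  # private use follows; nothing there is a variant
        prefixes = VARIANT_PREFIXES.get(sub)
        if prefixes is None:
            continue  # not a variant known to this excerpt
        before = "-".join(subtags[:i])
        if not any(before == p or before.startswith(p + "-")
                   for p in prefixes):
            warnings.append(
                "variant '%s' expects one of the prefixes %s, found '%s'"
                % (sub, prefixes, before))
    return warnings


print(prefix_warnings("en-cmn-rozaj"))
print(prefix_warnings("sl-Cyrl-YU-rozaj"))
```

The first call yields a warning because "en-cmn" does not start with "sl";
the second yields none.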


>
>
> *Canonical Form*
>
> sl-cyrl-Yu-rozaj-Solba-1994-b-1234-a-Foobar-x-b-1234-a-Foobar
>
>    - http://unicode.org/cldr/utility/languageid.jsp?a=sl-cyrl-Yu-rozaj-Solba-1994-b-1234-a-Foobar-x-b-1234-a-Foobar&l=en
>    puts the results in canonical casing and order (and shows canonical
>    replacements). It does not validate extensions, like "b-1234". (It follows
>    LDML canonical order for variants - alphabetical.)
>
>



>
>    - http://www.w3.org/2008/05/lta/language-tags/q?input=sl-Cyrl-YU-rozaj-solba-1994-b-1234-a-Foobar-x-b-1234-a-Foobar&output=html&hl=en
>    doesn't. It also gives a validation error on extensions.
>
> Validating extensions is debatable - the validity of these is established
> outside of the spec and iana subtag registry. Probably best would be neither
> of the above: a warning, not an error.



Reading
"Note that there might not be a registry of these subtags and validating
processors are not required to validate extensions."
in sec. 2.2.6 of
http://www.ietf.org/internet-drafts/draft-ietf-ltru-4646bis-23.txt
I think you are correct. I will change the error to a warning in my next
version.
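
Roughly what that change amounts to, as a Python sketch rather than the
actual lta code: an extension that cannot be validated yields a warning,
while a repeated singleton stays an error, since the draft allows each
singleton only once per tag. The function name check_extensions, the
parameter known_singletons and the message texts are invented for this
example, and the tag is assumed to be well-formed already.

```python
def check_extensions(tag, known_singletons=frozenset()):
    """Return (severity, subtag, message) tuples for the extensions of a
    tag that is assumed to be well-formed."""
    diagnostics = []
    seen = set()
    for sub in tag.lower().split("-")[1:]:
        if sub == "x":
            break  # everything after "x" is private use; nothing to check
        if len(sub) != 1:
            continue  # not a singleton
        if sub in seen:
            diagnostics.append(("error", sub, "singleton repeated"))
            continue
        seen.add(sub)
        if sub not in known_singletons:
            # No registry information for this extension: warn only.
            diagnostics.append(("warning", sub, "extension not validated"))
    return diagnostics


tag = "sl-Cyrl-YU-rozaj-solba-1994-b-1234-a-Foobar-x-b-1234-a-Foobar"
for severity, subtag, message in check_extensions(tag):
    print(severity, subtag, message)
```

For the long example tag this prints two warnings (for "b" and "a") and
nothing for the subtags after "x".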



>
>
> Note that http://unicode.org/cldr/utility/languageid.jsp <http://unicode.org/cldr/utility/languageid.jsp?a=sl-cyrl-Yu-rozaj-Solba-1994-b-1234-a-Foobar-x-b-1234-a-Foobar&l=en> says "suggested canonical form", since in the case of multiple replacements
> it doesn't try to pick the best one. Eg the best guess for ru-SU is ru-RU,
> but the best guess for az-SU would be az-AZ. It also doesn't try to find
> missing prefix values for variants; that's probably of such low frequency
> that it doesn't pay.
>
> *Completeness*
>
>    - http://unicode.org/cldr/utility/languageid.jsp?a=i-default&l=en doesn't allow grandfathered codes. (Following LDML.)
>    - http://www.w3.org/2008/05/lta/language-tags/q?input=i-default does.
>
>
Yes, it implements the ABNF of rfc4646bis, see
http://www.w3.org/2008/05/lta/04/abnf.xsl



>
>
>
> FYI, the regex it uses is:



Thank you for this and for your feedback! FYI, lta has two goals: a) to be
educational about language tags, probably similar to your tool, and b) to be
used in RESTful web services that need language tag information. For b)
there is the XML output, see
http://www.w3.org/2008/05/lta/language-tags/q?input=sl-Cyrl-YU-rozaj-solba-1994-b-1234-a-Foobar-x-b-1234-a-Foobar&output=xml
Other formats such as JSON or RDF might follow, depending on users' needs.
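
For use case b), a client could look roughly like the Python sketch below.
It builds query URLs of the form shown in this thread and pulls the error
entries out of an XML fragment shaped like the one quoted further down.
The function names query_url and errors are made up here, and LTA_NS is
only a placeholder: the real namespace URI bound to the lta: prefix has to
be taken from the actual service output. The sketch parses a local sample
and does not contact the service.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

BASE = "http://www.w3.org/2008/05/lta/language-tags/q"
LTA_NS = "http://example.org/lta"  # placeholder, NOT the real namespace


def query_url(tag, output="xml", hl=None):
    """Build a query URL with the input/output/hl parameters seen above."""
    params = {"input": tag, "output": output}
    if hl is not None:
        params["hl"] = hl
    return BASE + "?" + urlencode(params)


def errors(xml_text):
    """Return (type, text) pairs for every lta:error element."""
    root = ET.fromstring(xml_text)
    ns = {"lta": LTA_NS}
    return [(e.get("type"), e.findtext("lta:errorText", namespaces=ns))
            for e in root.iter("{%s}error" % LTA_NS)]


# Cut-down local sample in the shape of the quoted fragment.
SAMPLE = """<lta:variant xmlns:lta="%s">
  <lta:subtag>1901</lta:subtag>
  <lta:error type="e007">
    <lta:errorText>Variant repetition</lta:errorText>
  </lta:error>
</lta:variant>""" % LTA_NS

print(query_url("de-de-1901-1901"))
print(errors(SAMPLE))
```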

Felix



>
>
>       (?: ( [a-z A-Z]{2,8} | [a-z A-Z]{2,3} [-_] [a-z A-Z]{3} )
>       (?: [-_] ( [a-z A-Z]{4} ) )?
>       (?: [-_] ( [a-z A-Z]{2} | [0-9]{3} ) )?
>       (?: [-_] ( (?: [0-9 a-z A-Z]{5,8} | [0-9] [0-9 a-z A-Z]{3} ) (?: [-_] (?: [0-9 a-z A-Z]{5,8} | [0-9] [0-9 a-z A-Z]{3} ) )* ) )?
>       (?: [-_] ( [a-w y-z A-W Y-Z] (?: [-_] [0-9 a-z A-Z]{2,8} )+ (?: [-_] [a-w y-z A-W Y-Z] (?: [-_] [0-9 a-z A-Z]{2,8} )+ )* ) )?
>       (?: [-_] ( [xX] (?: [-_] [0-9 a-z A-Z]{1,8} )+ ) )? )
>     | ( [xX] (?: [-_] [0-9 a-z A-Z]{1,8} )+ )
>
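
For anyone who wants to try the expression quoted above, here is a Python
transcription as a sketch. The original is written for an engine that
ignores blanks even inside character classes; Python's re.VERBOSE does not,
so the blanks inside the brackets are removed here. The name WELL_FORMED
is my own, and the group numbering simply follows the capturing
parentheses of the original.

```python
import re

# Transcription of the quoted regex; blanks inside [...] removed because
# re.VERBOSE treats whitespace inside a character class as literal.
WELL_FORMED = re.compile(r"""
      (?: ( [a-zA-Z]{2,8} | [a-zA-Z]{2,3} [-_] [a-zA-Z]{3} )
      (?: [-_] ( [a-zA-Z]{4} ) )?
      (?: [-_] ( [a-zA-Z]{2} | [0-9]{3} ) )?
      (?: [-_] ( (?: [0-9a-zA-Z]{5,8} | [0-9] [0-9a-zA-Z]{3} )
                 (?: [-_] (?: [0-9a-zA-Z]{5,8} | [0-9] [0-9a-zA-Z]{3} ) )* ) )?
      (?: [-_] ( [a-wy-zA-WY-Z] (?: [-_] [0-9a-zA-Z]{2,8} )+
                 (?: [-_] [a-wy-zA-WY-Z] (?: [-_] [0-9a-zA-Z]{2,8} )+ )* ) )?
      (?: [-_] ( [xX] (?: [-_] [0-9a-zA-Z]{1,8} )+ ) )? )
    | ( [xX] (?: [-_] [0-9a-zA-Z]{1,8} )+ )
""", re.VERBOSE)

# Groups: 1 language(+extlang), 2 script, 3 region, 4 variants,
# 5 extensions, 6 private use, 7 private-use-only tag.
for tag in ["de-x", "eng-840", "i-default",
            "sl-Cyrl-YU-rozaj-solba-1994-b-1234-a-Foobar-x-b-1234-a-Foobar"]:
    m = WELL_FORMED.fullmatch(tag)
    print(tag, "->", m.groups() if m else "not well-formed")
```

With this transcription "de-x" and "i-default" do not match, while
"eng-840" is accepted as well-formed, which fits the behaviour described
for the demo above.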
> Mark
>
>
>
> On Sun, Jun 28, 2009 at 02:07, Felix Sasaki <felix.sasaki@fh-potsdam.de> wrote:
>
>> Hello Mark,
>>
>> this looks similar to
>> http://www.w3.org/2008/05/lta/
>> my language tag parser currently based on draft 21 of rfc4646bis. lta also
>> contains some error checking mechanisms, see examples like
>>
>> http://www.w3.org/2008/05/lta/language-tags/q?input=de-x
>> http://www.w3.org/2008/05/lta/language-tags/q?input=xa
>> http://www.w3.org/2008/05/lta/language-tags/q?input=en-latn
>> http://www.w3.org/2008/05/lta/language-tags/q?input=ja-1901
>> http://www.w3.org/2008/05/lta/language-tags/q?input=fr-cmn
>> http://www.w3.org/2008/05/lta/language-tags/q?input=zh-cmn-cmn
>> http://www.w3.org/2008/05/lta/language-tags/q?input=zh-cmn-a-bbb-a-ccc
>> http://www.w3.org/2008/05/lta/language-tags/q?input=de-de-1901-1901
>>
>> Output is available in HTML with German UI and English, and in an XML
>> format, see e.g.
>>
>> http://www.w3.org/2008/05/lta/language-tags/q?input=de-de-1901-1901&output=xml
>>
>> My comment on your tool is that to co-ordinate such efforts it would be
>> great to have a common machine-readable output format for language tag
>> parsing, also e.g. to deal with error descriptions like
>>
>>    <lta:variant>
>>       <lta:subtag>1901</lta:subtag>
>>       <lta:registryInfo>
>>          <lta:var ty="variant" su="1901" ad="2005-10-16">
>>             <lta:ds>Traditional German orthography</lta:ds>
>>             <lta:pref>de</lta:pref>
>>          </lta:var>
>>       </lta:registryInfo>
>>       <lta:matchedPrefix>de</lta:matchedPrefix>
>>       <lta:error type="e007">
>>          <lta:errorText>Variant repetition</lta:errorText>
>>          <lta:errorAddInfo>
>>             <lta:subtag>1901</lta:subtag>
>>          </lta:errorAddInfo>
>>       </lta:error>
>>    </lta:variant>
>>
>>
>> Felix
>>
>> 2009/6/27 Mark Davis ⌛ <mark@macchiato.com>
>>
>>> I updated the demo at http://unicode.org/cldr/utility/languageid.jsp to
>>> parse extlangs. The samples include official languages and the scripts they
>>> use (based on CLDR data), and the names have localizations where available.
>>>
>>> Comments welcome.
>>>
>>> Mark
>>>
>>> _______________________________________________
>>> Ltru mailing list
>>> Ltru@ietf.org
>>> https://www.ietf.org/mailman/listinfo/ltru
>>>
>>>
>>
>