Re: [Ltru] Re: Test suite for language tags?

"Mark Davis" <mark.davis@icu-project.org> Sat, 16 September 2006 23:28 UTC

Message-ID: <30b660a20609161628t22ab3c4flc81ea92f40800a09@mail.gmail.com>
Date: Sat, 16 Sep 2006 16:28:01 -0700
From: Mark Davis <mark.davis@icu-project.org>
To: Martin Duerst <duerst@it.aoyama.ac.jp>
Subject: Re: [Ltru] Re: Test suite for language tags?
In-Reply-To: <6.0.0.20.2.20060901024806.109a6d90@localhost>
MIME-Version: 1.0
References: <20060801203351.GA8854@sources.org> <20060802072709.GA17404@nic.fr> <44D21ACD.4040707@yahoo-inc.com> <20060804165720.GA24037@sources.org> <44D4AC42.79E0@xyzzy.claranet.de> <20060830093000.GA31895@nic.fr> <44F6313D.2070000@yahoo-inc.com> <6.0.0.20.2.20060831201004.101ab8d0@localhost> <44F6EF0E.20602@yahoo-inc.com> <6.0.0.20.2.20060901024806.109a6d90@localhost>
Cc: Frank Ellermann <nobody@xyzzy.claranet.de>, ltru@lists.ietf.org
List-Id: Language Tag Registry Update working group discussion list <ltru.ietf.org>

BTW, I have updated my regex to match the final spec for RFC 4646. Here is a
single Perl or Java regex, written in free-spacing form (whitespace in the
pattern is not significant), that does most of the parse:

Regex: ((?: [a-z A-Z]{2,3} (?: [-] [a-z A-Z]{3} ){0,3} | [a-z A-Z]{4,8}
))(?: [-] ((?: [a-z A-Z]{4} )) )?(?: [-] ((?: [a-z A-Z]{2} | [0-9]{3} ))
)?(?: [-] ((?: (?: [0-9] [a-z A-Z 0-9]{3} | [a-z A-Z 0-9]{5,8} ) (?: [-] (?:
[0-9] [a-z A-Z 0-9]{3} | [a-z A-Z 0-9]{5,8} ) )* )) )?(?: [-] ((?: (?: [a-w
y-z A-W Y-Z] (?: [-] [a-z A-Z 0-9]{2,8} )+ ) (?: [-] (?: [a-w y-z A-W Y-Z]
(?: [-] [a-z A-Z 0-9]{2,8} )+ ) )* )) )?(?: [-] ((?: [xX] (?: [-] [a-z A-Z
0-9]{1,8} )+ )) )?| ( (?i) art [-] lojban| cel [-] gaulish| en [-] (?: boont
| GB [-] oed | scouse )| i [-] (?: ami | bnn | default | enochian | hak |
klingon | lux | mingo | navajo | pwn | tao | tay | tsu )| no [-] (?: bok |
nyn)| sgn [-] (?: BE [-] fr | BE [-] nl | CH [-] de)| zh [-] (?: cmn | cmn
[-] Hans | cmn [-] Hant | gan | guoyu | hakka | min | min [-] nan |
wuu | xiang | yue))| ((?: [xX] (?: [-] [a-z A-Z 0-9]{1,8} )+ ))
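
A minimal sketch of how the regex above might be used from Java. It assumes the pattern is compiled with Pattern.COMMENTS, so that the embedded whitespace (including the spaces inside the character classes) is ignored; as far as I know Perl's plain /x does not ignore spaces inside bracketed classes, so a Perl port would probably need those classes written without spaces. The class name LangTagRegexDemo and the helper isWellFormed are hypothetical and not part of the CLDR tools, and the grandfathered alternatives are written out following the RFC 4646 list.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LangTagRegexDemo {
    // Transcription of the regex from this message, split per component.
    // Capture groups: 1 language (+extlang), 2 script, 3 region, 4 variants,
    // 5 extensions, 6 private use, 7 grandfathered, 8 private-use-only tag.
    static final String LANGTAG_REGEX =
        "((?: [a-z A-Z]{2,3} (?: [-] [a-z A-Z]{3} ){0,3} | [a-z A-Z]{4,8} ))"
      + "(?: [-] ((?: [a-z A-Z]{4} )) )?"
      + "(?: [-] ((?: [a-z A-Z]{2} | [0-9]{3} )) )?"
      + "(?: [-] ((?: (?: [0-9] [a-z A-Z 0-9]{3} | [a-z A-Z 0-9]{5,8} )"
      + " (?: [-] (?: [0-9] [a-z A-Z 0-9]{3} | [a-z A-Z 0-9]{5,8} ) )* )) )?"
      + "(?: [-] ((?: (?: [a-w y-z A-W Y-Z] (?: [-] [a-z A-Z 0-9]{2,8} )+ )"
      + " (?: [-] (?: [a-w y-z A-W Y-Z] (?: [-] [a-z A-Z 0-9]{2,8} )+ ) )* )) )?"
      + "(?: [-] ((?: [xX] (?: [-] [a-z A-Z 0-9]{1,8} )+ )) )?"
      // Grandfathered tags, matched case-insensitively.
      + "| ( (?i) art [-] lojban | cel [-] gaulish"
      + " | en [-] (?: boont | GB [-] oed | scouse )"
      + " | i [-] (?: ami | bnn | default | enochian | hak | klingon | lux"
      + " | mingo | navajo | pwn | tao | tay | tsu )"
      + " | no [-] (?: bok | nyn )"
      + " | sgn [-] (?: BE [-] fr | BE [-] nl | CH [-] de )"
      + " | zh [-] (?: cmn | cmn [-] Hans | cmn [-] Hant | gan | guoyu | hakka"
      + " | min | min [-] nan | wuu | xiang | yue ) )"
      // A tag consisting only of private-use subtags.
      + "| ((?: [xX] (?: [-] [a-z A-Z 0-9]{1,8} )+ ))";

    // COMMENTS makes Java skip the whitespace used to lay out the pattern.
    static final Pattern LANGTAG = Pattern.compile(LANGTAG_REGEX, Pattern.COMMENTS);

    // True if the whole string matches the pattern (syntax check only).
    static boolean isWellFormed(String tag) {
        return LANGTAG.matcher(tag).matches();
    }

    public static void main(String[] args) {
        String[] samples = {"en", "zh-Hant-TW", "sl-rozaj", "i-klingon", "x-whatever", "en--US"};
        for (String tag : samples) {
            Matcher m = LANGTAG.matcher(tag);
            if (m.matches()) {
                System.out.println(tag + ": language=" + m.group(1) + " script=" + m.group(2)
                        + " region=" + m.group(3) + " grandfathered=" + m.group(7));
            } else {
                System.out.println(tag + ": no match");
            }
        }
    }
}
```

Note that matches() anchors the pattern at both ends, which matters here because the leading alternative would otherwise happily match just a prefix of a malformed tag.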

The regex lists the grandfathered tags explicitly, since otherwise too much
cruft sneaks in. What it can't check is that each extension singleton appears
only once. (In retrospect we could have allowed repeated singletons: we could
have accepted en-a-bcdef-ghijk-b-123-a-lmnop as equivalent to the canonical
form en-a-bcdef-ghijk-lmnop-b-123, but that's water under the bridge at this
point.) Of course, I didn't put this together by hand. The table used to
build it is much more readable, at

http://unicode.org/cldr/data/tools/java/org/unicode/cldr/util/data/langtagRegex.txt

and a test file that includes strings mentioned on this list is at:

http://unicode.org/cldr/data/tools/java/org/unicode/cldr/util/data/langtagTest.txt
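
Regarding the singleton point above: the uniqueness check is easy to do as a separate pass once a tag has passed the syntax check. The following is only a sketch under that assumption (the input is already a well-formed, non-grandfathered tag); the class SingletonCheck and the method hasUniqueSingletons are hypothetical names, not something from the CLDR tools.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

class SingletonCheck {
    // Returns true if no extension singleton occurs more than once.
    // Assumes the tag already passed the well-formedness regex.
    static boolean hasUniqueSingletons(String tag) {
        String[] subtags = tag.toLowerCase(Locale.ROOT).split("-");
        Set<String> seen = new HashSet<>();
        // Start at 1: the first subtag is the language, never a singleton.
        for (int i = 1; i < subtags.length; i++) {
            String subtag = subtags[i];
            if (subtag.length() != 1) {
                continue; // ordinary subtag, not a singleton
            }
            if (subtag.equals("x")) {
                break; // private use: everything after "x" is opaque
            }
            if (!seen.add(subtag)) {
                return false; // this singleton was already used
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(hasUniqueSingletons("en-a-bcdef-ghijk-b-123"));
        System.out.println(hasUniqueSingletons("en-a-bcdef-ghijk-b-123-a-lmnop"));
    }
}
```

The comparison is done on the lowercased tag because language tags are case-insensitive, so "A" and "a" count as the same singleton.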

Mark
_______________________________________________
Ltru mailing list
Ltru@ietf.org
https://www1.ietf.org/mailman/listinfo/ltru