[Ltru] Regex

Mark Davis <mark@macchiato.com> Fri, 03 April 2009 01:31 UTC

To: LTRU Working Group <ltru@ietf.org>

I just redid my regex for the new version. The only tweak I made was that it
only allows one extlang (as per the description, item 4), which means that I
had to retain zh-min-nan as irregular.

(?i)
  (?:
      (?: ( [a-z]{2,8} | [a-z]{2,3} [-_] [a-z]{3} )
      (?: [-_] ( [a-z]{4} ) )?
      (?: [-_] ( [a-z]{2} | [0-9]{3} ) )?
      (?: [-_] ( (?: [a-z 0-9]{5,8} | [0-9] [a-z 0-9]{3} )
                 (?: [-_] (?: [a-z 0-9]{5,8} | [0-9] [a-z 0-9]{3} ) )* ) )?
      (?: [-_] ( [a-w y-z] (?: [-_] [a-z 0-9]{2,8} )+
                 (?: [-_] [a-w y-z] (?: [-_] [a-z 0-9]{2,8} )+ )* ) )?
      (?: [-_] ( x (?: [-_] [a-z 0-9]{1,8} )+ ) )? )
    | ( x (?: [-_] [a-z 0-9]{1,8} )+ )
    | ( en [-_] GB [-_] oed
      | i [-_] (?: ami | bnn | default | enochian | hak | klingon | lux
                 | mingo | navajo | pwn | tao | tay | tsu )
      | no [-_] (?: bok | nyn )
      | sgn [-_] (?: BE [-_] (?: fr | nl) | CH [-_] de )
      | zh [-_] min [-_] nan ) )

As before,

   - the (?i) is for case insensitive matching,
   - the [-_] is for implementations (like Unicode) that allow alternate
   separators (can be replaced by [-] otherwise), and
   - the (?: is Perl/Java syntax for non-capturing groups. [That is, the
   regular (..) capture the main components of the regex, for extraction
   later.]
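
For anyone who wants to try the pattern, here is a minimal sketch in Python
rather than Perl/Java. It is an adaptation under a few assumptions, not the
pattern exactly as posted: it assumes the posted layout is meant to be read in
a free-spacing mode (whitespace ignored), and because Python's re.VERBOSE keeps
spaces inside character classes, classes such as [a-z 0-9] are written as
[a-z0-9]. The group names and the parse_tag helper are hypothetical, added only
so the captured components are easy to read.

```python
import re

# Adapted from the posted pattern. Group names are illustrative only.
# re.ASCII keeps [a-z] from matching non-ASCII letters under IGNORECASE.
LANGTAG = re.compile(r"""
  (?:
      (?: (?P<language> [a-z]{2,8} | [a-z]{2,3} [-_] [a-z]{3} )
      (?: [-_] (?P<script> [a-z]{4} ) )?
      (?: [-_] (?P<region> [a-z]{2} | [0-9]{3} ) )?
      (?: [-_] (?P<variants> (?: [a-z0-9]{5,8} | [0-9] [a-z0-9]{3} )
               (?: [-_] (?: [a-z0-9]{5,8} | [0-9] [a-z0-9]{3} ) )* ) )?
      (?: [-_] (?P<extensions> [a-wy-z] (?: [-_] [a-z0-9]{2,8} )+
               (?: [-_] [a-wy-z] (?: [-_] [a-z0-9]{2,8} )+ )* ) )?
      (?: [-_] (?P<privateuse> x (?: [-_] [a-z0-9]{1,8} )+ ) )? )
    | (?P<privateuse_only> x (?: [-_] [a-z0-9]{1,8} )+ )
    | (?P<grandfathered> en [-_] GB [-_] oed
      | i [-_] (?: ami | bnn | default | enochian | hak | klingon | lux
                 | mingo | navajo | pwn | tao | tay | tsu )
      | no [-_] (?: bok | nyn )
      | sgn [-_] (?: BE [-_] (?: fr | nl) | CH [-_] de )
      | zh [-_] min [-_] nan ) )
""", re.IGNORECASE | re.VERBOSE | re.ASCII)


def parse_tag(tag):
    """Return the captured components of a tag, or None if it does not match."""
    m = LANGTAG.fullmatch(tag)  # the whole string must be a tag
    if m is None:
        return None
    # Keep only the groups that actually took part in the match.
    return {name: value for name, value in m.groupdict().items()
            if value is not None}


print(parse_tag("zh-Hant-TW"))   # language, script, region
print(parse_tag("zh-yue"))       # one extlang is folded into the language group
print(parse_tag("zh-min-nan"))   # matched only by the irregular alternative
print(parse_tag("zh-cmn-yue"))   # two extlangs: no match
```

With a whole-string match, zh-min-nan cannot be read as language plus extlang
plus anything else, so it only matches through the explicit irregular list,
which is the reason it has to stay there.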


Mark