Re: [Ltru] Test suite for language tags?

"Mark Davis" <mark.davis@icu-project.org> Tue, 01 August 2006 21:31 UTC

Date: Tue, 01 Aug 2006 14:30:57 -0700
From: Mark Davis <mark.davis@icu-project.org>
To: Addison Phillips <addison@yahoo-inc.com>
Subject: Re: [Ltru] Test suite for language tags?
Cc: ltru@ietf.org

One thing that may be useful: ICU has a test string generator that produces
random strings matching a specified BNF grammar. It augments the regular
syntax with percent values that give the relative weight of each
alternative. That is, if you have

x = (a | b | c)

in the BNF, you can make it

x = (a 25% | b 45% | c 30%)

so that it generates those alternatives with those frequencies.
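
To illustrate the idea (this is not ICU's class or its API, just a minimal
Python sketch), weighted alternatives can be expanded as follows. The names
RULES and generate are hypothetical, and the rule table is written by hand
rather than parsed from BNF text:

```python
import random

# Hypothetical rule table, NOT ICU's format: each nonterminal maps to a list
# of (alternative, weight) pairs, mirroring  x = (a 25% | b 45% | c 30%).
RULES = {
    "x": [("a", 25), ("b", 45), ("c", 30)],
}

def generate(symbol, rules=RULES, rng=random):
    """Expand `symbol`; a name with no rule is emitted as a literal."""
    if symbol not in rules:
        return symbol
    alternatives, weights = zip(*rules[symbol])
    # random.choices picks one alternative with probability proportional
    # to its weight.
    choice = rng.choices(alternatives, weights=weights, k=1)[0]
    # An alternative may be a space-separated sequence of symbols.
    return "".join(generate(part, rules, rng) for part in choice.split())
```

Calling generate("x") repeatedly yields "a", "b", and "c" in roughly the
25/45/30 proportions; passing a seeded random.Random instance as rng makes
the output reproducible.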

It is an internal testing class, and doesn't have much documentation, but I
thought I'd mention it in case you'd find it useful.

Mark

On 8/1/06, Addison Phillips <addison@yahoo-inc.com> wrote:
>
> > I just wrote a non-validating parser for language tags and I'm looking
> > for test data. I want to test bizarre tags to see if the parser
> > classifies them properly.
>
> Good for you!
>
> > I'm especially interested in badly-formed tags: the I-D contains mostly
> > well-formed tags.
>
> Your best bet is probably to generate subtag sequences based on the
> ABNF. Some particular problem cases to check would be:
>
> - singletons in the first position (except for 'x' and the grandfathered
> list)
> - overlong subtags (longer than 8 characters)
> - more than three extlangs
> - misplaced extlang (3ALPHA in the third or later position following any
> of these: 4ALPHA, 2ALPHA, 3DIGIT, 5*8alphanum, DIGIT 3alphanum) [note:
> stop at singleton]
> - misplaced script (4ALPHA following any of these: 2ALPHA, 3DIGIT,
> 5*8alphanum, DIGIT 3alphanum) [note: stop at singleton]
> - misplaced variant (five or more characters, or four or more starting
> with a digit; either occurring before an extlang/script/region is an
> error).
> - non-x singleton followed immediately by a singleton (including 'x')
> - missing subtag ("--")
> - a dangling hyphen ("foo-bar-baz-") or initial hyphen ("-foo-bar-baz")
> - digits in the primary (first) subtag
> - repeated singleton (note case insensitivity)
>
> Thus, these are all errors:
>
> "a-foo"
> "abcdefghi-012345678"
> "ab-abc-abc-abc-abc"
> "ab-abcd-abc"
> "ab-ab-abc"
> "ab-123-abc"
> "ab-abcde-abc"
> "ab-1abc-abc"
> "ab-ab-abcd"
> "ab-123-abcd"
> "ab-abcde-abcd"
> "ab-1abc-abcd"
> "ab-a-b"
> "ab-a-x"
> "ab--ab"
> "ab-abc-"
> "-ab-abc"
> "ab-a-abc-a-abc"
>
> These are not errors:
>
> "ab-x-abc-x-abc" // anything goes after x
> "ab-x-abc-a-a"   // ditto
> "i-default"      // grandfathered
>
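The cases above can be turned into a quick regression check. What follows
is a minimal Python sketch, assuming the subtag shapes described in this
message (up to three extlangs, 4ALPHA script, 2ALPHA or 3DIGIT region,
variants of 5*8alphanum or DIGIT 3alphanum, extensions, then private use).
The names LANGTAG, PRIVATE_USE, GRANDFATHERED, and is_well_formed are
hypothetical, and the grandfathered set holds only the one tag quoted here,
not the full registered list:

```python
import re

# Stub: only the grandfathered tag mentioned in this thread.
GRANDFATHERED = {"i-default"}

PRIVATE_USE = re.compile(r"x(?:-[a-z0-9]{1,8})+", re.IGNORECASE)

LANGTAG = re.compile(r"""
    (?: [a-z]{2,3} (?: -[a-z]{3} ){0,3}    # language, up to three extlangs
      | [a-z]{4,8} )                       # or a longer primary subtag
    (?: -[a-z]{4} )?                       # script
    (?: -(?: [a-z]{2} | [0-9]{3} ) )?      # region
    (?: -(?: [a-z0-9]{5,8} | [0-9][a-z0-9]{3} ) )*   # variants
    (?: -[a-wy-z0-9] (?: -[a-z0-9]{2,8} )+ )*        # extensions
    (?: -x (?: -[a-z0-9]{1,8} )+ )?                  # private use
""", re.IGNORECASE | re.VERBOSE)

def is_well_formed(tag):
    """Syntax check only; no subtag is looked up in the registry."""
    lowered = tag.lower()
    if lowered in GRANDFATHERED or PRIVATE_USE.fullmatch(lowered):
        return True
    if not LANGTAG.fullmatch(lowered):
        return False
    # Reject repeated singletons; anything goes after 'x', so stop there.
    seen = set()
    for subtag in lowered.split("-")[1:]:
        if subtag == "x":
            break
        if len(subtag) == 1:
            if subtag in seen:
                return False
            seen.add(subtag)
    return True
```

With this sketch, is_well_formed returns False for every tag in the error
list above and True for the three tags listed as not being errors.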
> Hope that helps,
>
> Addison
>
> Addison Phillips
> Globalization Architect − Yahoo! Inc.
>
> Internationalization is an architecture.
> It is not a feature.
>
> _______________________________________________
> Ltru mailing list
> Ltru@ietf.org
> https://www1.ietf.org/mailman/listinfo/ltru
>