[newprep] code!

Mark Lentczner <markl@lindenlab.com> Thu, 13 May 2010 19:41 UTC

From: Mark Lentczner <markl@lindenlab.com>
Content-Type: multipart/alternative; boundary="Apple-Mail-7--880681827"
Date: Thu, 13 May 2010 12:34:21 -0700
Message-Id: <3FEEE0B7-9CDE-47D5-87A9-1A4988295381@lindenlab.com>
To: newprep@ietf.org
Mime-Version: 1.0 (Apple Message framework v1078)
X-Mailer: Apple Mail (2.1078)
Subject: [newprep] code!

I'm happy to announce that Linden Lab has open sourced the exploratory code I wrote (see my earlier experience report messages). The code can be found here:
	http://hg.secondlife.com/newprep

Please let me know if this is useful, and if you need any help using it.

	- Mark

From the README file:

Utilities and code for exploring Unicode string preparation.

This package contains code for exploring different strategies for
preparing Unicode strings for use as identifiers. Several approaches to
this problem have been proposed over the years, and this code lets you
explore them, compare them, and create new ones.

At present, this code focuses only on which characters are allowed, rejected,
stripped, and so on by these preparation methods. Some of these methods also
apply further processing to the whole string, including contextual checks,
which is not yet embodied in this code base.
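
To make "allowed, rejected, stripped" concrete, here is a minimal sketch of a
per-code-point classifier. It does not use the newprep package; it relies only
on the RFC 3454 tables in Python's standard-library stringprep module (Python 3
assumed). The function name, the labels, and the choice of which C tables count
as prohibited are assumptions made for illustration, not the profile
implemented in this code base.

```python
import stringprep  # standard-library tables from RFC 3454

ALLOWED = "allowed"
STRIPPED = "stripped"
REJECTED = "rejected"
UNASSIGNED = "unassigned"

# Assumption: an illustrative set of prohibited tables, not a real profile.
_PROHIBITED_TABLES = (
    stringprep.in_table_c12,      # non-ASCII space characters
    stringprep.in_table_c21_c22,  # control characters
    stringprep.in_table_c3,       # private use
    stringprep.in_table_c4,       # non-character code points
    stringprep.in_table_c5,       # surrogate codes
    stringprep.in_table_c6,       # inappropriate for plain text
    stringprep.in_table_c7,       # inappropriate for canonical representation
    stringprep.in_table_c8,       # change display properties / deprecated
    stringprep.in_table_c9,       # tagging characters
)


def classify(cp):
    """Classify one code point (an int) under the illustrative profile."""
    ch = chr(cp)
    # Table B.1: "commonly mapped to nothing", i.e. silently removed.
    if stringprep.in_table_b1(ch):
        return STRIPPED
    if any(in_table(ch) for in_table in _PROHIBITED_TABLES):
        return REJECTED
    # Table A.1: unassigned in Unicode 3.2, the version RFC 3454 is tied to.
    if stringprep.in_table_a1(ch):
        return UNASSIGNED
    return ALLOWED
```

With this sketch, classify(0x00AD) (SOFT HYPHEN) is stripped and
classify(0x00A0) (NO-BREAK SPACE) is rejected. Checking B.1 first matters,
because a few code points appear in both B.1 and the C tables.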

You can start by running the included command line utility:

    > python preputil.py --help
    Usage:
    preputil.py (-m | --map) [method]               -- map out a method
    preputil.py (-d | --diff) [method1] [method2]   -- compare two methods
    preputil.py (-l | --list)                       -- list available methods
    preputil.py (-h | --help)                       -- this help text

    > python preputil.py --list
    Available methods, and aliases to use on command line:
        IDNA2008:    idna idna2008 idnabis
        StringPrep:  string stringprep
        UAX31:       id uax31
        VName:       vname

    > python preputil.py --map idna2008 | less
    > python preputil.py --diff stringprep uax31 | less

Note that the '--map' and '--diff' modes compute functions over the entire
Unicode code point range, and so take some time to run. On my 2.2 GHz MacBook
they take ~17 seconds each!
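
As a rough illustration of why these modes are slow, the sketch below maps a
classifier over every code point and collapses the result into runs, then
compares two classifiers the same way. This is not the code behind
preputil.py; map_method, diff_methods, and the two toy classifiers are
hypothetical names invented for this example (Python 3 assumed).

```python
import unicodedata
from itertools import groupby

MAX_CODE_POINT = 0x10FFFF


def map_method(classify, limit=MAX_CODE_POINT):
    """Return [(first, last, label), ...] runs covering 0..limit."""
    runs = []
    labelled = ((cp, classify(cp)) for cp in range(limit + 1))
    # groupby merges consecutive code points that share a label.
    for label, group in groupby(labelled, key=lambda pair: pair[1]):
        cps = [cp for cp, _ in group]
        runs.append((cps[0], cps[-1], label))
    return runs


def diff_methods(classify_a, classify_b, limit=MAX_CODE_POINT):
    """Return the runs on which the two classifiers disagree."""
    paired = map_method(lambda cp: (classify_a(cp), classify_b(cp)), limit)
    return [run for run in paired if run[2][0] != run[2][1]]


# Toy classifiers, for demonstration only.
def ascii_only(cp):
    return "allowed" if cp < 0x80 else "rejected"


def letters_only(cp):
    is_letter = unicodedata.category(chr(cp)).startswith("L")
    return "allowed" if is_letter else "rejected"


if __name__ == "__main__":
    for first, last, labels in diff_methods(ascii_only, letters_only, 0xFF):
        print("U+%04X..U+%04X  %s -> %s" % (first, last, labels[0], labels[1]))
```

Both functions call the classifier once per code point, so a full run makes
over a million calls. Passing a smaller limit is a cheap way to experiment.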

The processing code is all in the package newprep. See the docstrings in the
modules for more extensive notes, caveats, and information on how to use and
extend this code.
    
Main Modules:
    codepoint   - Utilities for handling code points and sets of code points.
    methods     - A list of available preparation methods.
    UCD         - Information and utilities from the Unicode Character Database.
    prep        - Base class for representing a preparation method.
    
Method Modules:             
    idna2008    - IDNA2008's code point classification
    stringprep  - An RFC 3454 ("StringPrep") style profile
    uax31       - Unicode's Identifier Syntax
    vname       - A proposed method for use in VWRAP
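
The split between a base class and per-method modules could look roughly like
the sketch below. Every name here (PrepMethod, XidSketch, METHODS) is
hypothetical; the real interfaces are described in the module docstrings. The
subclass approximates UAX31-style identifier classification with Python 3's
str.isidentifier(), which also accepts '_' as a start character, so it is not
an exact UAX31 implementation.

```python
class PrepMethod:
    """Hypothetical base class: one classification per code point."""

    name = "abstract"

    def classify(self, cp):
        raise NotImplementedError


class XidSketch(PrepMethod):
    """Approximates identifier start/continue classes via str.isidentifier()."""

    name = "xid-sketch"

    def classify(self, cp):
        ch = chr(cp)
        # A one-character identifier must consist of a valid start character.
        if ch.isidentifier():
            return "start"
        # Prefixing a known start character tests the continue class.
        if ("a" + ch).isidentifier():
            return "continue"
        return "rejected"


# Hypothetical registry, in the spirit of a "list of available methods".
METHODS = {method.name: method for method in (XidSketch(),)}
```

A new method would then be a subclass with its own classify() plus an entry
in the registry.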

This code was written by Mark Lentczner at Linden Lab. It was inspired by the
newprep BoF session at IETF 77 (Anaheim, 2010) and work needed for both the
VWRAP working group and internal Linden Lab development. Discussion of this
code and the surrounding issues can take place on the newprep mailing list:
    https://www.ietf.org/mailman/listinfo/newprep
or directly with the author.
    
The code is released open source under an "MIT style" license. See LICENSE.

    - Mark Lentczner
      markl@lindenlab.com
      May, 2010



Mark Lentczner
Sr. Systems Architect
Technology Integration
Linden Lab

markl@lindenlab.com

Zero Linden
zero.linden@secondlife.com