[newprep] phase 2 experience report

Mark Lentczner <markl@lindenlab.com> Tue, 27 April 2010 23:11 UTC

I've been continuing my work and thought I'd share my findings here.

As I mentioned before, I have both IETF and internal product needs to develop a string preparation method for validating the names of various entities. Based on my earlier investigation (see my prior report to this list), I decided to build a string preparation method with the aims of:
	1) Following the experience and guidelines from the IDNAbis group
	2) Basing it as much as possible on published specs or guidelines
	3) Making it implementable in Python

IDNA2008 specifications aren't designed to be adapted to any use other than IDNA, which makes reconciling #1 and #2 difficult. However, IDNA2008 isn't all that far in practice from a "reasonable" profile based on UAX #31, and UAX #31 is designed to be adapted. Hence, I decided to build the preparation method based on UAX #31.

The definition of the function, which I'll call DNamePrep() here, has two distinct phases, preparation and validation:

1) Preparation
    1.1) Normalize the input via NFKC
    1.2) Strip leading and trailing whitespace ([:White_Space:]) [a]

2) Validation
    2.1) If the string is empty, FAIL
    2.2) For each character of the string
        2.2.1) If it is in [:Noncharacter_Code_Point:], FAIL
        2.2.2) If it is in [:General_Category=Cn:], FAIL [b]
        2.2.3) If it is not in DName_Continue, FAIL
    2.3) VALID

[a] At present, my implementation uses Python's strip(), which strips a slightly different set of characters.
[b] When preparing a string for search, "FAIL" here is changed to "skip 2.2.3", so an unassigned code point is passed through rather than rejected.
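
The two phases above can be sketched in Python roughly as follows. This is an illustrative sketch only, not the implementation this report refers to. The names dname_prep, strip_white_space, is_noncharacter and WHITE_SPACE are made up for the example, the DName_Continue test is passed in as a caller-supplied predicate, and the hard-coded White_Space list is an assumption that should be verified against the Unicode version in use.

```python
import unicodedata

# Code points with White_Space=Yes, hard-coded for this sketch.
# Assumption: verify against PropList.txt for your Unicode version.
WHITE_SPACE = frozenset(
    [0x20, 0x85, 0xA0, 0x1680, 0x2028, 0x2029, 0x202F, 0x205F, 0x3000]
    + list(range(0x09, 0x0E))        # U+0009..U+000D
    + list(range(0x2000, 0x200B))    # U+2000..U+200A
)


def is_noncharacter(cp):
    """True for U+FDD0..U+FDEF and the last two code points of every plane."""
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def strip_white_space(s):
    """Strip leading and trailing [:White_Space:] (not Python's strip() set)."""
    start, end = 0, len(s)
    while start < end and ord(s[start]) in WHITE_SPACE:
        start += 1
    while end > start and ord(s[end - 1]) in WHITE_SPACE:
        end -= 1
    return s[start:end]


def dname_prep(text, is_dname_continue, for_search=False):
    """Return the prepared string, or None where the steps say FAIL.

    is_dname_continue is a hypothetical predicate: it takes one character
    and returns True if that character is in DName_Continue.
    """
    s = unicodedata.normalize("NFKC", text)       # step 1.1
    s = strip_white_space(s)                      # step 1.2
    if not s:                                     # step 2.1
        return None
    for ch in s:                                  # step 2.2
        if is_noncharacter(ord(ch)):              # step 2.2.1
            return None
        if unicodedata.category(ch) == "Cn":      # step 2.2.2
            if for_search:
                continue                          # footnote [b]: skip 2.2.3
            return None
        if not is_dname_continue(ch):             # step 2.2.3
            return None
    return s                                      # step 2.3
```

Note that which code points count as unassigned (Cn) depends on the Unicode version bundled with the Python interpreter, so step 2.2.2 can give different answers on different Python releases.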


DName_Continue is defined, in the manner suggested by UAX #31, as:

    DName_Continue =
        [:ID_Continue:]
        - [:General_Category=Nl:]  # no letter-like numbers
        - [:General_Category=Pc:]  # no connecting punctuation
        - [:Other_ID_Start:]       # no need to support legacy ID characters
        - [:Other_ID_Continue:]    # ditto
        - ignored_blocks
        + name_additional
        + U+0020                   # names can have spaces

    name_additional = 
        # Additional characters that we choose to include in names. This
        # is a subset of the suggested additional characters in UAX #31.
        U+0027, U+002D, U+002E, U+00B7,
        U+058A, U+05F3, U+05F4, U+0F0B,
        U+200C, U+200D,
        U+30FB

    ignored_blocks =
        # These are blocks of characters that we exclude. This is a
        # subset of the blocks recommended for exclusion by UAX #31.
        [:block=Combining_Diacritical_Marks_for_Symbols:]
        + [:block=Musical_Symbols:]
        + [:block=Ancient_Greek_Musical_Notation:]
        + [:Default_Ignorable_Code_Point:]

N.B.: Collections of characters given as [:prop:] are defined by the Unicode Character Database (UCD).
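
The set algebra in the definition above can be sketched in Python as plain set operations applied from top to bottom. This is a sketch under stated assumptions: build_dname_continue and the dictionary keys are hypothetical names, and the caller is assumed to supply each UCD property or block as a set of integer code points. Only the name_additional list is taken directly from the definition.

```python
# name_additional, copied from the definition above.
NAME_ADDITIONAL = frozenset([
    0x0027, 0x002D, 0x002E, 0x00B7,
    0x058A, 0x05F3, 0x05F4, 0x0F0B,
    0x200C, 0x200D,
    0x30FB,
])


def build_dname_continue(ucd):
    """Build DName_Continue from caller-supplied sets of code points.

    ucd is a hypothetical mapping from property or block name to a set of
    ints, e.g. ucd["ID_Continue"], ucd["Nl"], ucd["Musical_Symbols"].
    """
    ignored_blocks = (
        ucd["Combining_Diacritical_Marks_for_Symbols"]
        | ucd["Musical_Symbols"]
        | ucd["Ancient_Greek_Musical_Notation"]
        | ucd["Default_Ignorable_Code_Point"]
    )
    allowed = set(ucd["ID_Continue"])
    allowed -= ucd["Nl"]                  # no letter-like numbers
    allowed -= ucd["Pc"]                  # no connecting punctuation
    allowed -= ucd["Other_ID_Start"]      # no legacy ID characters
    allowed -= ucd["Other_ID_Continue"]   # ditto
    allowed -= ignored_blocks
    allowed |= NAME_ADDITIONAL            # added after the removals
    allowed.add(0x0020)                   # names can have spaces
    return frozenset(allowed)
```

The order matters: because name_additional is added after the subtractions, characters such as U+200C and U+200D (removed as Default_Ignorable_Code_Point) and U+00B7 (removed as Other_ID_Continue) end up back in the set.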


Implementation in Python is hindered somewhat because, while Python has the 'unicodedata' module, it exposes only part of the UCD. Hence, in my implementation, I had to hard-code some of the character collections defined by properties. This algorithm also does away with the contextual checks that UAX #31 suggests for a few characters (fewer than IDNA 2008 requires). I didn't see the harm in always accepting those characters, rather than limiting them to the contexts given. This might merit further study.
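
One alternative to hard-coding, sketched here purely as an illustration, is to read the property collections from the UCD text files (for example PropList.txt or DerivedCoreProperties.txt), whose data lines have the form "code point or range ; property # comment". The function name parse_ucd_property is hypothetical, and this is not how the implementation described in this report obtains its data.

```python
def parse_ucd_property(lines, prop):
    """Collect the code points assigned to `prop` in UCD-format lines.

    Each data line looks like
        0009..000D    ; White_Space # Cc   [5] <control-0009>..<control-000D>
    or
        0020          ; White_Space # Zs       SPACE
    Comment-only and blank lines are ignored.
    """
    result = set()
    for line in lines:
        data = line.split("#", 1)[0].strip()   # drop the trailing comment
        if not data:
            continue
        fields = [field.strip() for field in data.split(";")]
        if len(fields) < 2 or fields[1] != prop:
            continue
        first, _, last = fields[0].partition("..")
        lo = int(first, 16)
        hi = int(last, 16) if last else lo     # single code point or range
        result.update(range(lo, hi + 1))
    return result
```

With a local copy of the file, the lines argument can simply be the open file object. The file should come from the same Unicode version that the rest of the application targets.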

This definition leverages UAX #31's design and spec, and should gracefully handle evolution of Unicode, as Unicode gives very strong stability guarantees for the character properties used in UAX #31.

Comparison with IDNA 2008:

IDNA 2008 has no "preparation" step, per se. However, it is clear from the design that applications will have to perform some preparation of user input before it enters the protocol. Exactly what preparation is left open, but one can infer that at the very least case folding to lower case will occur. I chose to assume casefold(input). Given that, I compared how the two algorithms treat every input character.
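
A per-character comparison of this kind can be sketched as below. This is a hedged illustration, not the code used for this report: compare_acceptance and with_casefold are hypothetical names, the two acceptance tests are caller-supplied predicates, and treating a multi-character case folding as accepted only when every resulting character is accepted is an assumption of the sketch.

```python
import sys


def with_casefold(accepts):
    """Wrap a per-character predicate so it sees case-folded input.

    str.casefold() can expand one character into several, so the wrapped
    predicate requires every resulting character to be accepted.
    """
    return lambda ch: all(accepts(c) for c in ch.casefold())


def compare_acceptance(accepts_a, accepts_b, code_points=None):
    """Return (only_a, only_b): code points accepted by just one predicate."""
    if code_points is None:
        code_points = range(sys.maxunicode + 1)
    only_a, only_b = [], []
    for cp in code_points:
        if 0xD800 <= cp <= 0xDFFF:
            continue                       # surrogates are not characters
        ch = chr(cp)
        a, b = accepts_a(ch), accepts_b(ch)
        if a and not b:
            only_a.append(cp)
        elif b and not a:
            only_b.append(cp)
    return only_a, only_b
```

In use, one predicate would wrap the IDNA 2008 rules with with_casefold() and the other would run the DName preparation and validation on the single character. The two returned lists are then the raw material for a list of differences.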

The rather small set of significant differences is:
1) IDNA 2008 allows four characters (from §2.6) that DName rejects
2) IDNA 2008 disallows ten characters (from §2.6) that DName accepts
3) IDNA 2008 rejects U+A77D as Unstable (§2.2) and U+1D79 by default, both of which DName accepts
4) DName accepts four punctuation characters by choice, which IDNA 2008 rejects
5) IDNA 2008 rejects non-combining Jamo (§2.9), which DName accepts
6) IDNA 2008 accepts four symbols which are normalized to other (accepted) characters by DName

If further harmony with IDNA 2008 were desired, differences 1, 2, 4 & 5 could be easily accounted for, staying within the UAX #31 framework. Difference 3 would be harder to codify as UAX #31 has no rule like that in IDNA 2008, and simply hard-coding the code points would be against the spirit of IDNA 2008. Lastly, I can't find any rationale for why IDNA 2008 rejects the characters in difference 2, so I'm not sure why one should replicate it.

There is also a large set of characters that IDNA 2008 rejects, whereas DName never sees them because of the NFKC normalization step. If one assumes NFKC normalization for any input, then this difference is almost inconsequential except for the four characters listed under #6. However, I can't see the value in supporting, for example, the ANGSTROM SIGN as distinct from its NFKC equivalent LATIN CAPITAL LETTER A WITH RING ABOVE.
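
The ANGSTROM SIGN case can be checked directly with Python's unicodedata module. The helper name nfkc_changes is made up for this small illustration.

```python
import unicodedata


def nfkc_changes(ch):
    """True if NFKC normalization maps ch to something other than itself."""
    return unicodedata.normalize("NFKC", ch) != ch


angstrom_sign = "\u212B"   # ANGSTROM SIGN
a_with_ring = "\u00C5"     # LATIN CAPITAL LETTER A WITH RING ABOVE

# After step 1.1 the validator only ever sees U+00C5, never U+212B.
assert unicodedata.normalize("NFKC", angstrom_sign) == a_with_ring
assert nfkc_changes(angstrom_sign)
assert not nfkc_changes(a_with_ring)
```

Any character for which nfkc_changes() is true can never reach the validation phase in its original form, which is why such characters drop out of the comparison.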

Also, remember, the contextual checks required by IDNA 2008 were not used here, since I was checking individual characters, and so such characters were treated as accepted (i.e. assuming they were in the correct context.)

Conclusion (for now)

It seems to me that building a string preparation and validation method based on UAX #31, yet in line with the lessons and work from IDNA 2008, is practical. I think that as we consider creating a new framework to replace StringPrep, this is a fruitful direction.

	- Mark Lentczner

P.S.: I have full technical details of this work, including full character lists by protocol and in comparison, and Python code. If you are interested in any of it, just ask.


Mark Lentczner
Sr. Systems Architect
Technology Integration
Linden Lab

markl@lindenlab.com

Zero Linden
zero.linden@secondlife.com