[newprep] phase 2 experience report

Mark Lentczner <markl@lindenlab.com> Tue, 27 April 2010 23:11 UTC

From: Mark Lentczner <markl@lindenlab.com>
Date: Tue, 27 Apr 2010 16:11:27 -0700
Message-Id: <18ECEEF3-3280-4ABB-B866-9495D03FAD8C@lindenlab.com>
To: newprep@ietf.org
Subject: [newprep] phase 2 experience report

I've been continuing my work and thought I'd share my findings here.

As I mentioned before, I have both IETF and internal product needs to develop a string preparation method for use in validating names for various entities. Based on my earlier investigation (see my prior report to this list), I decided to build a string preparation method with the aims of:
1) Following the experience and guidelines from the IDNAbis group
2) Being based as much as possible on published specs or guidelines
3) Being implementable in Python

The IDNA 2008 specifications aren't designed to be adapted to any use other than IDNA, which makes reconciling #1 and #2 difficult. However, IDNA 2008 isn't all that far in practice from a "reasonable" profile based on UAX #31, and UAX #31 is designed to be adapted. Hence, I decided to build the preparation method on UAX #31.

The definition of the function, which here I'll call DNamePrep(), contains two distinct phases, preparation and validation:

1) Preparation
1.1) Normalize the input via NFKC
1.2) Strip leading and trailing whitespace ([:White_Space:]) [a]

2) Validation
2.1) If the string is empty, FAIL
2.2) For each character of the string
2.2.1) If it is in [:Noncharacter_Code_Point:], FAIL
2.2.2) If it is in [:General_Category=Cn:], FAIL [b]
2.2.3) If it is not in DName_Continue, FAIL
2.3) VALID

[a] At present, my implementation uses Python's strip(), which strips a slightly different set of characters.
[b] If preparing a string for search, then "FAIL" is changed to "skip 2.2.3"
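The two phases above might be sketched in Python roughly as follows. This is a minimal sketch under stated assumptions, not the implementation described in this report: the names dname_prep, is_noncharacter, is_continue and for_search are hypothetical, the DName_Continue test is passed in as a predicate, and strip() is used for step 1.2 with the caveat given in note [a].

```python
import unicodedata


def is_noncharacter(ch):
    """True for [:Noncharacter_Code_Point:]: U+FDD0..U+FDEF plus the last
    two code points of every plane (U+xxFFFE and U+xxFFFF)."""
    cp = ord(ch)
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def dname_prep(text, is_continue, for_search=False):
    """Return the prepared string if it is valid, else None (FAIL).

    is_continue is a caller-supplied predicate standing in for
    DName_Continue; for_search selects the behaviour of note [b].
    """
    # Phase 1: preparation (steps 1.1 and 1.2).
    prepared = unicodedata.normalize("NFKC", text).strip()

    # Phase 2: validation.
    if not prepared:                          # step 2.1
        return None
    for ch in prepared:                       # step 2.2
        if is_noncharacter(ch):               # step 2.2.1
            return None
        if unicodedata.category(ch) == "Cn":  # step 2.2.2
            if for_search:
                continue                      # note [b]: skip 2.2.3
            return None
        if not is_continue(ch):               # step 2.2.3
            return None
    return prepared                           # step 2.3


# Usage with a stand-in predicate (not the real DName_Continue):
stand_in = lambda ch: ch.isalnum() or ch == " "
print(dname_prep("  Zero Linden ", stand_in))
```

Returning None for FAIL is only one possible convention; raising an exception would serve equally well.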

DName_Continue is defined, in the manner suggested by UAX #31, as:

DName_Continue =
[:ID_Continue:]
- [:General_Category=Nl:] # no letter-like numbers
- [:General_Category=Pc:] # no connecting punctuation
- [:Other_ID_Start:] # no need to support legacy ID characters
- [:Other_ID_Continue:] # ditto
- ignored_blocks
+ name_additional
+ U+0020 # names can have spaces

name_additional =
# Additional characters that we choose to include in names. This
# is a subset of the suggested additional characters in UAX #31.
U+0027, U+002D, U+002E, U+00B7,
U+058A, U+05F3, U+05F4, U+0F0B,
U+200C, U+200D,
U+30FB

ignored_blocks =
# These are blocks of characters that we exclude. This is a
# subset of the blocks recommended for exclusion by UAX #31
[:block=Combining_Diacritical_Marks_for_Symbols:]
+ [:block=Musical_Symbols:]
+ [:block=Ancient_Greek_Musical_Notation:]
+ [:Default_Ignorable_Code_Point:]

N.B.: Collections of characters given as [:prop:] are defined by the Unicode Character Database (UCD).
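As an illustration only, the set arithmetic above could be approximated with what Python's unicodedata does provide, namely general categories. The sketch below assumes that [:ID_Continue:] with Nl, Pc, Other_ID_Start and Other_ID_Continue removed is close to "letters, marks and decimal digits"; that approximation can disagree with the real properties for a handful of code points. The block ranges are hard coded, and DEFAULT_IGNORABLE_SAMPLE is a deliberately incomplete stand-in for [:Default_Ignorable_Code_Point:]. All names in the code are hypothetical.

```python
import unicodedata

# Assumption: approximates ID_Continue - Nl - Pc - Other_ID_Start
# - Other_ID_Continue by general category (letters, marks, decimal digits).
BASE_CATEGORIES = {"Lu", "Ll", "Lt", "Lm", "Lo", "Mn", "Mc", "Nd"}

# name_additional, copied from the definition above.
NAME_ADDITIONAL = {
    0x0027, 0x002D, 0x002E, 0x00B7,
    0x058A, 0x05F3, 0x05F4, 0x0F0B,
    0x200C, 0x200D,
    0x30FB,
}

# The three excluded blocks, as inclusive code point ranges.
IGNORED_BLOCKS = [
    (0x20D0, 0x20FF),    # Combining Diacritical Marks for Symbols
    (0x1D100, 0x1D1FF),  # Musical Symbols
    (0x1D200, 0x1D24F),  # Ancient Greek Musical Notation
]

# Assumption: an incomplete sample of Default_Ignorable_Code_Point.
# A real implementation would load the full list from the UCD.
DEFAULT_IGNORABLE_SAMPLE = [
    (0x00AD, 0x00AD), (0x034F, 0x034F), (0x180B, 0x180D),
    (0x200B, 0x200F), (0x2060, 0x2064), (0xFE00, 0xFE0F),
    (0xFEFF, 0xFEFF),
]


def _in_ranges(cp, ranges):
    return any(lo <= cp <= hi for lo, hi in ranges)


def is_dname_continue(ch):
    """Approximate membership test for DName_Continue."""
    cp = ord(ch)
    # Additions are applied last in the definition, so they win.
    if cp == 0x0020 or cp in NAME_ADDITIONAL:
        return True
    if _in_ranges(cp, IGNORED_BLOCKS) or _in_ranges(cp, DEFAULT_IGNORABLE_SAMPLE):
        return False
    return unicodedata.category(ch) in BASE_CATEGORIES
```

Checking the additions first mirrors the order of the definition: U+200C and U+200D are default ignorable, yet name_additional adds them back.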

Implementation in Python is hindered somewhat because, while Python has the module 'unicodedata', that module exposes only part of the UCD. Hence, in my implementation, I had to hard code some of the character collections defined by properties. This algorithm also does away with the contextual checks that UAX #31 suggests for a few characters (fewer than IDNA 2008 requires). I didn't see the harm in always accepting those characters, rather than limiting them to the contexts given. This might merit further study.

This definition leverages UAX #31's design and spec, and should gracefully handle evolution of Unicode, as Unicode gives very strong stability guarantees for the character properties used in UAX #31.

Comparison with IDNA 2008:

IDNA 2008 has no "preparation" step, per se. However, it is clear from the design that applications will have to perform some preparation of user input before entering it into the protocol. Exactly what preparation is left unspecified, but one can infer that, at the very least, case folding to lower case will occur. I chose to assume casefold(input). Given that, I compared how the two algorithms treat every input character.

The rather small set of significant differences is:
1) IDNA 2008 allows four characters (from §2.6) that DName rejects
2) IDNA 2008 disallows ten characters (from §2.6) that DName accepts
3) IDNA 2008 rejects U+A77D as Unstable (§2.2) and U+1D79 by default, where DName accepts
4) DName accepts four punctuation characters by choice, which IDNA 2008 rejects
5) IDNA 2008 rejects non-combining Jamo (§2.9), which DName accepts
6) IDNA 2008 accepts four symbols which are normalized to other (accepted) characters by DName

If further harmony with IDNA 2008 were desired, differences 1, 2, 4 and 5 could easily be accounted for while staying within the UAX #31 framework. Difference 3 would be harder to codify, as UAX #31 has no rule like the one in IDNA 2008, and simply hard coding the code points would be against the spirit of IDNA 2008. Lastly, I can't find any rationale for why IDNA 2008 rejects the characters in difference 2, so I'm not sure why one should replicate it.

There is also a large set of characters that IDNA 2008 rejects, whereas DName never sees them because of the NFKC normalization step. If one assumes NFKC normalization for any input, then this difference is almost inconsequential, except for the four characters listed under #6. However, I can't see the value in supporting, for example, the ANGSTROM SIGN as distinct from its NFKC equivalent, LATIN CAPITAL LETTER A WITH RING ABOVE.
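The ANGSTROM SIGN case can be checked directly with Python's unicodedata. This small example only demonstrates the normalization step; the variable names are illustrative.

```python
import unicodedata

angstrom = "\u212B"
normalized = unicodedata.normalize("NFKC", angstrom)

# Show the character names before and after normalization.
print(unicodedata.name(angstrom))
print(unicodedata.name(normalized))

# After step 1.1, validation only ever sees U+00C5.
print(normalized == "\u00C5")
```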

Also, remember that the contextual checks required by IDNA 2008 were not used here, since I was checking individual characters, so characters subject to those checks were treated as accepted (i.e., assumed to be in the correct context).
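A per-character comparison of this kind might be driven by a small harness like the one below. This is a sketch under assumptions, not the code used for this report: compare_acceptance and its parameters are hypothetical names, and the two acceptance predicates (for example, one wrapping an IDNA 2008 check applied to casefold(input) and one wrapping DNamePrep) must be supplied by the caller. Contextual rules are not modeled, matching the caveat above.

```python
import sys


def compare_acceptance(accepts_a, accepts_b, code_points=None):
    """Return (only_a, only_b): code points accepted by exactly one predicate.

    Each predicate takes a one-character string and returns a bool.
    code_points defaults to the whole Unicode range.
    """
    if code_points is None:
        code_points = range(sys.maxunicode + 1)
    only_a, only_b = [], []
    for cp in code_points:
        if 0xD800 <= cp <= 0xDFFF:
            continue  # surrogates are not characters
        ch = chr(cp)
        a, b = accepts_a(ch), accepts_b(ch)
        if a and not b:
            only_a.append(cp)
        elif b and not a:
            only_b.append(cp)
    return only_a, only_b


# Usage with stand-in predicates over ASCII letters only:
only_a, only_b = compare_acceptance(str.isalpha, str.islower, range(0x41, 0x7B))
print(len(only_a), len(only_b))
```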

Conclusion (for now)

It seems to me that building a string preparation and validation method based on UAX #31, yet in line with the lessons and work from IDNA 2008, is practical. I think that as we consider creating a new framework to replace StringPrep, this is a fruitful direction.

- Mark Lentczner

P.S.: I have full technical details of this work, including full character lists by protocol and in comparison, and Python code. If you are interested in any of it, just ask.

Mark Lentczner
Sr. Systems Architect
Technology Integration
Linden Lab

markl@lindenlab.com

Zero Linden
zero.linden@secondlife.com
