[precis] string classes and normalization forms
Peter Saint-Andre <stpeter@stpeter.im> Fri, 04 March 2011 22:18 UTC
Message-ID: <4D71655E.1070409@stpeter.im>
Date: Fri, 04 Mar 2011 15:19:10 -0700
From: Peter Saint-Andre <stpeter@stpeter.im>
<hat type='individual'/>

I started to write a document outlining results of my own research and discussion within the XMPP WG, but then I realized it would be more productive to provide feedback on draft-blanchet-precis-framework-00. Please take these comments in the spirit of exploration and as a spur to discussion in the PRECIS WG. (Thanks to various XMPP WG folks, esp. Joe Hildebrand, for productive conversations about these issues.)

Issue #1: String Classes

draft-blanchet-precis-framework-00 describes these string classes:

   o domain U-label
   o domain A-label
   o domain name
   o email address
   o restricted identifier
   o less-restrictive identifier

We can leave the first four to other specs, no? In the document I started to write, I was going to define two classes:

   a. "names" (or "usernamey things" if you like)
   b. "codes" (or "passwordy things" if you like)

(There is also the possibility that we might want something like a free-form string, but it's not clear to me if we really need a technology for preparing and comparing those; we can simply treat them as UTF-8 encoded Unicode codepoints, or somesuch.)

Let me try to describe the classes I had in mind:

a. NAMES. I see a "name" as a word or set of words that is used to identify or address a network entity such as a user, an account, a venue (e.g., a chatroom), an information source (e.g., a feed), or a collection of data (e.g., a file). For the convenience of humans, a name typically consists of a memorable sequence of letters, numbers, and a few conventional symbol and punctuation characters. The "name" class would disallow spaces, the at-sign (because usernamey things are often used as the left-hand side of email addresses and Jabber IDs and such), almost all symbol characters (except those from the ASCII range), etc.
Also disallowed would be any character that is compatibility decomposable into another character (e.g., U+017F "ſ" is compatibility decomposable into U+0073 "s") or into a sequence of characters (e.g., U+2163 "Ⅳ" is compatibility decomposable into U+0049 "I" and U+0056 "V"). All members of the "name" class would contain only lowercase letters, not uppercase letters or titlecase letters (this is different from IDNA, where uppercase letters are allowed and preserved but case is ignored for comparison purposes).

The foregoing description is similar to the "Less-Restrictive Identifier" class from draft-blanchet-precis-framework-00. I don't know if I see a need for the "Restricted Identifier" class from the I-D, i.e., a string class that disallows all punctuation and all display characters (BTW what exactly is a display character?).

b. CODES. I see a "code" as a sequence of letters, numbers, and symbols that is used as a secret for access to some resource on a network (e.g., an account or a venue). To improve security, codes would be case-sensitive. The "@" character and other punctuation and basic symbol characters would be allowed, but symbols outside the US-ASCII range would be disallowed. We would also still disallow any character that is compatibility decomposable into another character or into a sequence of characters.

Issue #2: Normalization

Following IDNA2003, existing stringprep profiles all use Unicode Normalization Form KC (NFKC), which performs compatibility decomposition followed by canonical composition. This choice made sense in IDNA2003 because the DNS packet format has length-limited labels, and NFKC in effect compresses a sequence of characters into the smallest number of bytes possible by performing recomposition. However, experience with some of the application protocols that are currently using NFKC (e.g., XMPP) has shown that recomposition is an expensive operation to perform in application servers.
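
The two compatibility-decomposition examples and the difference between the normalization forms can be checked with Python's standard unicodedata module. What follows is only a rough sketch: the helper names has_compat_decomposition and name_class_violations are made up for illustration, and the rule set is a partial reading of the "name" class described above (the "etc." in that description is not filled in), not a proposed specification.

```python
import unicodedata


def has_compat_decomposition(ch: str) -> bool:
    """Return True if a compatibility (not merely canonical) mapping applies.

    NFKD applies canonical and compatibility mappings, NFD only canonical
    ones, so a difference between the two means a compatibility mapping fired.
    """
    return unicodedata.normalize("NFKD", ch) != unicodedata.normalize("NFD", ch)


def name_class_violations(s: str) -> list:
    """Hypothetical, partial check of the "name" class rules sketched above."""
    problems = []
    for ch in s:
        cat = unicodedata.category(ch)
        label = "U+%04X" % ord(ch)
        if cat == "Zs":
            problems.append(label + ": space")
        elif ch == "@":
            problems.append(label + ": at-sign")
        elif cat in ("Lu", "Lt"):
            problems.append(label + ": uppercase or titlecase letter")
        elif cat.startswith("S") and ord(ch) > 0x7F:
            problems.append(label + ": symbol outside the ASCII range")
        elif has_compat_decomposition(ch):
            problems.append(label + ": compatibility decomposable")
    return problems


# The two examples from the text.
print(has_compat_decomposition("\u017f"))       # True  (maps to "s")
print(has_compat_decomposition("\u2163"))       # True  (maps to "I" + "V")
print(has_compat_decomposition("\u00e9"))       # False (canonical mapping only)

# NFKC rewrites compatibility characters; NFD only decomposes canonically.
print(unicodedata.normalize("NFKC", "\u2163"))  # IV
print(unicodedata.normalize("NFD", "\u00e9") == "e\u0301")  # True

print(name_class_violations("stpeter"))         # []
print(name_class_violations("St Peter@"))
```

Note that once compatibility-decomposable characters are rejected up front, as in the sketch, NFKD and NFD give the same result for every string that remains, which is the premise behind dropping the "K" forms.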
In addition, the application protocols that use stringprep all use TCP with security-layer or application-layer compression (e.g., via TLS or things like XEP-0138 in XMPP), so minimizing the length of strings is much less important.

What matters most in application protocols is ensuring that network entities (such as clients and servers) all communicate a consistent string representation over the wire. For this purpose, Normalization Form D (NFD), which simply performs canonical decomposition, provides the most efficient approach.

As noted above, we can disallow any characters that would require compatibility decomposition, thus removing the need for compatibility decomposition and recomposition. This is what happened in IDNA2008, enabling the IDNA folks to move from NFKC to NFC. If we take the same approach in PRECIS but also get rid of recomposition entirely, we can move from NFKC (the most complex and therefore most computationally intensive normalization form) to NFD (the least complex and therefore least computationally intensive normalization form). This will be a big win for application servers.

OK, I think that's enough controversy for today. :)

Peter

--
Peter Saint-Andre
https://stpeter.im/