[precis] string classes and normalization forms

Peter Saint-Andre <stpeter@stpeter.im> Fri, 04 March 2011 22:18 UTC

Message-ID: <4D71655E.1070409@stpeter.im>
Date: Fri, 04 Mar 2011 15:19:10 -0700
From: Peter Saint-Andre <stpeter@stpeter.im>
To: precis@ietf.org
Subject: [precis] string classes and normalization forms

<hat type='individual'/>

I started to write a document outlining results of my own research and
discussion within the XMPP WG, but then I realized it would be more
productive to provide feedback on draft-blanchet-precis-framework-00.
Please take these comments in the spirit of exploration and as a spur to
discussion in the PRECIS WG. (Thanks to various XMPP WG folks, esp. Joe
Hildebrand, for productive conversations about these issues.)

Issue #1: String Classes

draft-blanchet-precis-framework-00 describes these string classes:

   o  domain U-label
   o  domain A-label
   o  domain name
   o  email address
   o  restricted identifier
   o  less-restrictive identifier

We can leave the first four to other specs, no?

In the document I started to write, I was going to define two classes:

a. "names" (or "usernamey things" if you like)
b. "codes" (or "passwordy things" if you like)

(There is also the possibility that we might want something like a
free-form string, but it's not clear to me if we really need a
technology for preparing and comparing those -- we can simply treat them
as UTF-8 encoded Unicode codepoints, or somesuch.)

Let me try to describe the classes I had in mind:

a. NAMES. I see a "name" as a word or set of words that is used to
identify or address a network entity such as a user, an account, a venue
(e.g., a chatroom), an information source (e.g., a feed), or a
collection of data (e.g., a file). For the convenience of humans, a name
typically consists of a memorable sequence of letters, numbers, and a
few conventional symbol and punctuation characters. The "name" class
would disallow spaces, the at-sign (because usernamey things are often
used as the left-hand side of email addresses and Jabber IDs and such),
almost all symbol characters (except those from the ASCII range), etc.
Also disallowed would be any character that is compatibility
decomposable into another character (e.g., U+017F "ſ" is compatibility
decomposable into U+0073 "s") or into a sequence of characters (e.g.,
U+2163 "Ⅳ" is compatibility decomposable into U+0049 "I" and U+0056
"V"). All members of the "name" class would contain only lowercase
letters, not uppercase letters or titlecase letters (this is different
from IDNA, where uppercase letters are allowed and preserved but case is
ignored for comparison purposes).
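
To make the rules above concrete, here is a minimal Python sketch of a
membership test for the "name" class. It is an illustration only, not
text from any I-D: the function names (is_name,
has_compat_decomposition) are made up, and treating "symbol characters"
as Unicode general categories S* is an assumption. Punctuation outside
the ASCII range is not addressed above, so the sketch leaves it alone.

```python
import unicodedata


def has_compat_decomposition(ch: str) -> bool:
    # unicodedata.decomposition() returns e.g. "<compat> 0073" for U+017F.
    # A leading "<tag>" marks a compatibility (not canonical) mapping.
    return unicodedata.decomposition(ch).startswith("<")


def is_name(s: str) -> bool:
    """Hypothetical check for the "name" class sketched in this message."""
    if not s:
        return False
    for ch in s:
        cat = unicodedata.category(ch)
        # No spaces and no at-sign.
        if ch == "@" or ch.isspace() or cat.startswith("Z"):
            return False
        # Lowercase only: reject uppercase (Lu) and titlecase (Lt) letters.
        if cat in ("Lu", "Lt"):
            return False
        # Assumption: "symbol" means general category S*; ASCII ones are kept.
        if cat.startswith("S") and ord(ch) > 0x7F:
            return False
        # No character with a compatibility decomposition.
        if has_compat_decomposition(ch):
            return False
    return True
```

With these rules "juliet" and "jos\u00e9" pass, while "Juliet",
"juliet@example", "\u017ftpeter", and "\u2163" are rejected.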

The foregoing description is similar to the "Less-Restrictive
Identifier" class from draft-blanchet-precis-framework-00. I don't know
if I see a need for the "Restricted Identifier" class from the I-D --
i.e., a string class that disallows all punctuation and all display
characters (BTW what exactly is a display character?).

b. CODES. I see a "code" as a sequence of letters, numbers, and symbols
that is used as a secret for access to some resource on a network (e.g.,
an account or a venue). To improve security, codes would be
case-sensitive. The "@" character and other punctuation and basic symbol
characters would be allowed, but symbols outside the US-ASCII range
would be disallowed. We would also still disallow any character that is
compatibility decomposable into another character or into a sequence of
characters.
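
The "code" class could be sketched the same way. Again this is only an
illustration under stated assumptions: is_code and codes_match are
hypothetical names, "symbols" is taken to mean general categories S*,
and since spaces are not addressed above the sketch does not reject
them.

```python
import unicodedata


def is_code(s: str) -> bool:
    """Hypothetical check for the "code" class sketched in this message."""
    if not s:
        return False
    for ch in s:
        # Assumption: "symbol" means general category S*; only ASCII ones pass.
        if unicodedata.category(ch).startswith("S") and ord(ch) > 0x7F:
            return False
        # A leading "<tag>" in the decomposition marks a compatibility mapping.
        if unicodedata.decomposition(ch).startswith("<"):
            return False
    return True


def codes_match(a: str, b: str) -> bool:
    # Codes are case-sensitive, so no case folding before comparison.
    return a == b
```

Here "P@ssw0rd!" is accepted, "pass\u20acword" (euro sign) is rejected,
and "Secret" does not match "secret".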

Issue #2: Normalization.

Following IDNA2003, existing stringprep profiles all use Unicode
Normalization Form KC (NFKC), which performs compatibility decomposition
followed by canonical composition (there is no separate compatibility
recomposition step). This choice made sense in IDNA2003 because the DNS
limits each label to 63 octets, and NFKC tends to shorten a sequence of
characters by recomposing it. However, experience with some of the application
protocols that are currently using NFKC (e.g., XMPP) has shown that
recomposition is an expensive operation to perform in application
servers. In addition, the application protocols that use stringprep all
use TCP with security-layer or application-layer compression (e.g., via
TLS or things like XEP-0138 in XMPP), so keeping strings short is
much less important.
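
For readers who want to see what NFKC does to the characters mentioned
earlier, here is a small Python sketch using the standard unicodedata
module. The helper names (nfkc, utf8_len) are made up for illustration,
and the three samples are just convenient examples, not a benchmark.

```python
import unicodedata


def nfkc(s: str) -> str:
    # Compatibility decomposition followed by canonical composition.
    return unicodedata.normalize("NFKC", s)


def utf8_len(s: str) -> int:
    return len(s.encode("utf-8"))


samples = [
    ("long s (U+017F)", "\u017f"),
    ("roman numeral four (U+2163)", "\u2163"),
    ("e + combining acute (U+0065 U+0301)", "e\u0301"),
]

for label, s in samples:
    out = nfkc(s)
    print(f"{label}: {utf8_len(s)} -> {utf8_len(out)} UTF-8 bytes")
```

In these three cases the output is shorter than the input: U+017F
becomes "s", U+2163 becomes "IV", and "e" plus U+0301 is recomposed into
U+00E9.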

What matters most in application protocols is ensuring that network
entities (such as clients and servers) all communicate a consistent
string representation over the wire. For this purpose, Normalization
Form D (NFD), which simply performs canonical decomposition, provides
the most efficient approach. As noted above, we can disallow any
characters that would require compatibility decomposition, thus removing
the need for compatibility decomposition and recomposition. This is what
happened in IDNA2008, enabling the IDNA folks to move from NFKC to NFC.
If we take the same approach in PRECIS but also get rid of recomposition
entirely, we can move from NFKC (the most complex and therefore most
computationally intensive normalization form) to NFD (the least complex
and therefore least computationally intensive normalization form). This
will be a big win for application servers.
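
A minimal Python sketch of that idea, assuming NFD is the agreed wire
form (wire_form, same_string, and needs_compat_step are hypothetical
names): two spellings of the same string compare equal after NFD, and
for input that contains no compatibility-decomposable characters, NFKD
and NFD give the same result, so nothing is lost by skipping the
compatibility step.

```python
import unicodedata


def wire_form(s: str) -> str:
    # Canonical decomposition only (NFD); no recomposition pass.
    return unicodedata.normalize("NFD", s)


def same_string(a: str, b: str) -> bool:
    # Compare two strings by their NFD forms.
    return wire_form(a) == wire_form(b)


def needs_compat_step(s: str) -> bool:
    # True if compatibility decomposition would change the result,
    # i.e. the string contains a compatibility-decomposable character.
    return unicodedata.normalize("NFKD", s) != unicodedata.normalize("NFD", s)
```

For example, "jos\u00e9" and "jose\u0301" are the same string under
same_string, and needs_compat_step is False for both; it is True for
"\u017f", which is exactly the kind of character the string classes
above would disallow.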

OK, I think that's enough controversy for today. :)

Peter

-- 
Peter Saint-Andre
https://stpeter.im/
