Re: [precis] string classes and normalization forms

Patrik Fältström <patrik@frobbit.se> Sat, 05 March 2011 15:19 UTC

On 5 mar 2011, at 16.17, Marc Blanchet wrote:

> On 11-03-05 01:59, Patrik Fältström wrote:
>> Sorry if this has been discussed already...
>> 
>> Lots of the information in this document is the same as RFC 5892.
>> 
>> Would it not be a better solution to make this document a "diff" that builds upon RFC 5892?
>> 
> 
> To me, it is just too early. The framework was a first step to get the discussion started. Once we have agreement on what to do, we can decide on the best way to write it down, whether as a diff or otherwise.

Makes sense...

   Patrik

> Marc.
> 
>>    Patrik
>> 
>> On 4 mar 2011, at 23.19, Peter Saint-Andre wrote:
>> 
>>> <hat type='individual'/>
>>> 
>>> I started to write a document outlining results of my own research and
>>> discussion within the XMPP WG, but then I realized it would be more
>>> productive to provide feedback on draft-blanchet-precis-framework-00.
>>> Please take these comments in the spirit of exploration and as a spur to
>>> discussion in the PRECIS WG. (Thanks to various XMPP WG folks, esp. Joe
>>> Hildebrand, for productive conversations about these issues.)
>>> 
>>> Issue #1: String Classes
>>> 
>>> draft-blanchet-precis-framework-00 describes these string classes:
>>> 
>>>   o  domain U-label
>>>   o  domain A-label
>>>   o  domain name
>>>   o  email address
>>>   o  restricted identifier
>>>   o  less-restrictive identifier
>>> 
>>> We can leave the first four to other specs, no?
>>> 
>>> In the document I started to write, I was going to define two classes:
>>> 
>>> a. "names" (or "usernamey things" if you like)
>>> b. "codes" (or "passwordy things" if you like)
>>> 
>>> (There is also the possibility that we might want something like a
>>> free-form string, but it's not clear to me if we really need a
>>> technology for preparing and comparing those -- we can simply treat them
>>> as UTF-8 encoded Unicode codepoints, or somesuch.)
>>> 
>>> Let me try to describe the classes I had in mind:
>>> 
>>> a. NAMES. I see a "name" as a word or set of words that is used to
>>> identify or address a network entity such as a user, an account, a venue
>>> (e.g., a chatroom), an information source (e.g., a feed), or a
>>> collection of data (e.g., a file). For the convenience of humans, a name
>>> typically consists of a memorable sequence of letters, numbers, and a
>>> few conventional symbol and punctuation characters. The "name" class
>>> would disallow spaces, the at-sign (because usernamey things are often
>>> used as the left-hand side of email addresses and Jabber IDs and such),
>>> almost all symbol characters (except those from the ASCII range), etc.
>>> Also disallowed would be any character that is compatibility
>>> decomposable into another character (e.g., U+017F "ſ" is compatibility
>>> decomposable into U+0073 "s") or into a sequence of characters (e.g.,
>>> U+2163 "Ⅳ" is compatibility decomposable into U+0049 "I" and U+0056
>>> "V"). All members of the "name" class would contain only lowercase
>>> letters, not uppercase letters or titlecase letters (this is different
>>> from IDNA, where uppercase letters are allowed and preserved but case is
>>> ignored for comparison purposes).
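
A minimal sketch of how the "name" rules described above could be checked, in Python with the standard unicodedata module. The function names has_compat_decomposition and is_name_candidate are hypothetical, and the rule set is only one reading of the description (for instance, it treats any Unicode whitespace as a space and rejects the empty string); it is not a PRECIS algorithm.

```python
import unicodedata


def has_compat_decomposition(ch: str) -> bool:
    """True if ch has a compatibility (tagged) decomposition mapping."""
    # unicodedata.decomposition() returns e.g. "<compat> 0073" for U+017F;
    # canonical mappings carry no "<tag>" prefix.
    return unicodedata.decomposition(ch).startswith("<")


def is_name_candidate(s: str) -> bool:
    """Hypothetical check for the 'name' class (assumed rule set)."""
    if not s:
        return False
    for ch in s:
        cat = unicodedata.category(ch)
        if ch.isspace() or ch == "@":
            return False  # no spaces, no at-sign
        if cat in ("Lu", "Lt"):
            return False  # no uppercase or titlecase letters
        if cat.startswith("S") and ord(ch) > 0x7F:
            return False  # symbols only from the ASCII range
        if has_compat_decomposition(ch):
            return False  # e.g. U+017F or U+2163
    return True
```

Under these assumptions "fältström" is accepted, while "Juliet", "ju liet", "juliet@example", U+017F, and U+2163 are each rejected.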
>>> 
>>> The foregoing description is similar to the "Less-Restrictive
>>> Identifier" class from draft-blanchet-precis-framework-00. I don't know
>>> if I see a need for the "Restricted Identifier" class from the I-D --
>>> i.e., a string class that disallows all punctuation and all display
>>> characters (BTW what exactly is a display character?).
>>> 
>>> b. CODES. I see a "code" as a sequence of letters, numbers, and symbols
>>> that is used as a secret for access to some resource on a network (e.g.,
>>> an account or a venue). To improve security, codes would be
>>> case-sensitive. The "@" character and other punctuation and basic symbol
>>> characters would be allowed, but symbols outside the US-ASCII range
>>> would be disallowed. We would also still disallow any character that is
>>> compatibility decomposable into another character or into a sequence of
>>> characters.
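
The "code" rules differ from the "name" rules mainly in what they allow, which a short Python sketch can make concrete. The names is_code_candidate and codes_match are hypothetical, and the treatment of the empty string (rejected) is an assumption; the message does not say whether spaces are allowed, so the sketch does not test for them.

```python
import unicodedata


def is_code_candidate(s: str) -> bool:
    """Hypothetical check for the 'code' class (assumed rule set)."""
    for ch in s:
        # Symbols (general categories Sm, Sc, Sk, So) are allowed only in ASCII.
        if unicodedata.category(ch).startswith("S") and ord(ch) > 0x7F:
            return False
        # A "<tag>" prefix marks a compatibility decomposition mapping.
        if unicodedata.decomposition(ch).startswith("<"):
            return False
    return bool(s)


def codes_match(a: str, b: str) -> bool:
    """Case-sensitive comparison: no case folding is applied."""
    return a == b
```

Here "Pa$$w0rd@!" passes because "@" and ASCII symbols are permitted and case is preserved, whereas a string containing U+2603 or U+FF21 does not.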
>>> 
>>> Issue #2: Normalization.
>>> 
>>> Following IDNA2003, existing stringprep profiles all use Unicode
>>> Normalization Form KC (NFKC), which performs compatibility decomposition
>>> (which includes canonical decomposition), followed by canonical
>>> recomposition. This choice made sense in IDNA2003 because the DNS packet
>>> format has length-limited labels, and NFKC in effect compresses a sequence
>>> of characters into the smallest number of bytes possible by performing
>>> recomposition. However, experience with some of the application
>>> protocols that are currently using NFKC (e.g., XMPP) has shown that
>>> recomposition is an expensive operation to perform in application
>>> servers. In addition, the application protocols that use stringprep all
>>> use TCP with security-layer or application-layer compression (e.g., via
>>> TLS or things like XEP-0138 in XMPP), so fixing the length of strings is
>>> much less important.
>>> 
>>> What matters most in application protocols is ensuring that network
>>> entities (such as clients and servers) all communicate a consistent
>>> string representation over the wire. For this purpose, Normalization
>>> Form D (NFD), which simply performs canonical decomposition, provides
>>> the most efficient approach. As noted above, we can disallow any
>>> characters that would require compatibility decomposition, thus removing
>>> the need for compatibility decomposition and recomposition. This is what
>>> happened in IDNA2008, enabling the IDNA folks to move from NFKC to NFC.
>>> If we take the same approach in PRECIS but also get rid of recomposition
>>> entirely, we can move from NFKC (the most complex and therefore most
>>> computationally intensive normalization form) to NFD (the least complex
>>> and therefore least computationally intensive normalization form). This
>>> will be a big win for application servers.
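
The normalization argument above can be illustrated with Python's standard unicodedata.normalize. This is only a sketch of the behaviour of the forms on a few sample strings (the variable names and the helper same_after_nfd are hypothetical); it makes no claim about relative cost.

```python
import unicodedata

precomposed = "\u00e9"   # "é" as a single code point
decomposed = "e\u0301"   # "e" followed by COMBINING ACUTE ACCENT

# NFD performs canonical decomposition only, so both spellings converge.
nfd_a = unicodedata.normalize("NFD", precomposed)
nfd_b = unicodedata.normalize("NFD", decomposed)

# NFKC also maps compatibility characters (U+017F becomes "s");
# NFC leaves such characters untouched.
long_s_nfkc = unicodedata.normalize("NFKC", "\u017f")
long_s_nfc = unicodedata.normalize("NFC", "\u017f")


def same_after_nfd(a: str, b: str) -> bool:
    """Compare two strings by their NFD forms."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)
```

If characters with compatibility decompositions are disallowed up front, NFKC and NFC give the same result on the remaining strings (for example on "fältström"), which is the observation behind the move from NFKC to NFC in IDNA2008.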
>>> 
>>> OK, I think that's enough controversy for today. :)
>>> 
>>> Peter
>>> 
>>> --
>>> Peter Saint-Andre
>>> https://stpeter.im/
>>> 
>>> 
>>> 
>>> _______________________________________________
>>> precis mailing list
>>> precis@ietf.org
>>> https://www.ietf.org/mailman/listinfo/precis
>> 
> 
> 
> -- 
> =========
> IPv6 book: Migrating to IPv6, Wiley. http://www.ipv6book.ca
> Stun/Turn server for VoIP NAT-FW traversal: http://numb.viagenie.ca
> DTN Implementation: http://postellation.viagenie.ca
> NAT64-DNS64 Opensource: http://ecdysis.viagenie.ca
> 