[precis] spaces in names, and compound names?

Peter Saint-Andre <stpeter@stpeter.im> Mon, 07 May 2012 23:35 UTC

[Sorry about the length of this message.]

In Paris, we had a 20-minute discussion about the inclusion of space
characters in names. Here are some notes I just made based on listening
to the recording. [1]

###

Alexey Melnikov raised an issue about SASLprepbis needing ASCII space in
simple user names (and also perhaps for LDAP). One approach would be to
allow space in the NameClass and tell protocols that don't want it in
usernames to forbid the space.

Marc Blanchet voiced a preference for defining a common class that would
be usable by the majority of PRECIS customers, and noted that so far
(except for SASLprepbis) they don't seem to need space.

Andrew Sullivan pointed out that in some locales the "space bar" on a
keyboard can generate not an ASCII space but a zero-width non-joiner
(ZWNJ), and voiced concern that allowing space might be the thin edge
of a wedge leading to significant problems later on.

Pete Resnick noted that the left hand side of an email address can
contain all sorts of things if it is enclosed in quotes, and leaned
toward saying that some application protocols use something like
FreeClass or a subclass thereof because it appears that they might not
need just ASCII space but all sorts of interesting characters.

Alexey Melnikov noted that for SASLprepbis we only need ASCII space.

Joe Hildebrand pointed out that we might provide advice to protocols
regarding which codepoints are safe to use.

Marc Blanchet suggested formulating a safe class that would be less
exclusive than the (currently very restricted) NameClass.

Peter Saint-Andre concurred, adding that it would be better to do that
than to make the NameClass more inclusive.

Andrew Sullivan went further and suggested that the NameClass just needs
to be a good class for names, not a class that can be used to represent
every possible name. He suggested that if you have something like email
that is less restrictive, then use FreeClass and subclass it carefully
yourself. His conclusion was that putting spaces in the NameClass would
be a bad idea.

Joe Hildebrand suggested other options: (1) say spaces are bad so don't
use them and they are disallowed in NameClass or (2) spaces aren't so
bad, allow them in all protocols.

[Note: Discussion of superclassing elided because we decided against that.]

Pete Resnick noted that the working group concluded earlier that spaces
are to be avoided in names because sometimes they are visible and
sometimes they are not etc. ("spaces are the sorts of things that would
get names into trouble").

Andrew Sullivan brought up the Unicode confusables table [2] for spaces
and pointed out that there is no way to tell the difference between,
say, an em-space and a space; as a result, he concluded that this is a
disaster because we don't control the input method.

Joe Hildebrand said another approach would be to come back to the SASL
folk and say "this is a bad idea, don't do it". He asked: how widely
used are spaces in simple user names, and can't we tell folks to clean
up their databases (just as we'll need to do in XMPP)?

Marc Blanchet suggested that we could define an "unsafe name class"
which would include spaces, with big warnings not to use it.

Pete Resnick noted that we could go to the security area and ask "we're
planning to remove spaces from our internationalized names spec, what
would break?", because it might be safer to clean up existing code and
databases than to allow spaces in usernames.

David Black noted that if we get around to tackling NFS, spaces would
become relevant. However, filenames in NFC are really weird and
probably don't fit in the PRECIS framework.

Marc Blanchet pointed out that NFS would probably need to use
FreeClass, not NameClass.

The conclusion was that we needed to follow up with our friends in the
Security Area.

###

Since the Paris meeting, I have indeed done just that:

http://www.ietf.org/mail-archive/web/kitten/current/msg03054.html

Later messages in that thread are here:

http://www.ietf.org/mail-archive/web/kitten/current/msg03055.html
http://www.ietf.org/mail-archive/web/kitten/current/msg03056.html
http://www.ietf.org/mail-archive/web/kitten/current/msg03057.html
http://www.ietf.org/mail-archive/web/kitten/current/msg03058.html
http://www.ietf.org/mail-archive/web/kitten/current/msg03059.html
http://www.ietf.org/mail-archive/web/kitten/current/msg03060.html
http://www.ietf.org/mail-archive/web/kitten/current/msg03061.html
http://www.ietf.org/mail-archive/web/kitten/current/msg03073.html
http://www.ietf.org/mail-archive/web/kitten/current/msg03074.html

As you can see, there was some pushback about removing spaces from SASL
simple user names. However, I did not ask the question I heard from Pete
Resnick at the end of that discussion in Paris, which was: "what would
break if we did this?" We seem to have anecdotal evidence of some usage
of spaces in simple user names, but we don't seem to know how widespread
that usage is, nor what would break if we defined NameClass to exclude
space. Note that, among other things, SASLprepbis could subclass the
FreeClass to meet its needs; it doesn't absolutely need to use NameClass
just because it has a construct called a simple user *name*. I think
there's a lack of precision about these matters in the thread I pointed
to above.

Another approach I've been thinking about would be to define (in the
framework document or in SASLprepbis) a class that we could call
CompoundNameClass. A compound name would consist of two or more
instances of the NameClass, each separated from the next by a single
ASCII space (thus it could not begin or end with a space and could not
contain multiple spaces in a row). This CompoundNameClass would be less
safe than the NameClass, but it would explicitly allow only U+0020; all
other spacelike codepoints would need to be mapped to U+0020. We'd still
have the same challenges with regard to input methods and possible
confusion that were voiced during the discussion in Paris, but from a
protocol perspective only U+0020 would be allowed on the wire.
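
To make the CompoundNameClass idea concrete, here is a rough sketch in
Python. It is only an illustration, not a proposed definition. The
function names are made up for this example, is_name() is a crude
stand-in for the real NameClass (which is not reproduced here),
"spacelike" is assumed to mean Unicode general category Zs, and a
compound name is assumed to need at least two names.

```python
import unicodedata


def map_spacelike(s: str) -> str:
    """Map every space separator (category Zs, an assumption) to U+0020."""
    return "".join(" " if unicodedata.category(ch) == "Zs" else ch for ch in s)


def is_name(part: str) -> bool:
    """Hypothetical stand-in for NameClass: non-empty, letters and digits only."""
    return part != "" and all(
        unicodedata.category(ch)[0] in ("L", "N") for ch in part
    )


def is_compound_name(s: str) -> bool:
    """Check for two or more names, each separated by exactly one U+0020."""
    # Splitting on a literal space yields an empty part for a leading,
    # trailing, or doubled space, and is_name() rejects empty parts.
    parts = map_spacelike(s).split(" ")
    return len(parts) >= 2 and all(is_name(p) for p in parts)
```

With this sketch, "juliet capulet" is accepted, and so is the same
string typed with an em space (U+2003), because the em space is mapped
to U+0020 first. A string with a leading space or with two spaces in a
row is rejected.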

I'd like to discuss that option a bit before taking this topic back to
the KITTEN WG.

Peter

[1] You can listen to the discussion yourself at around the 35-minute
mark of the audio recording, which is available here:

http://www.ietf.org/audio/ietf83/ietf83-253-20120329-1256-pm.mp3

[2] See the Unicode confusables.txt file for this table:

180E ;  0020 ;  SL      #* ( <180e> →   ) MONGOLIAN VOWEL SEPARATOR → SPACE     #
2028 ;  0020 ;  SL      #* (  →   ) LINE SEPARATOR → SPACE      #
2029 ;  0020 ;  SL      #* (  →   ) PARAGRAPH SEPARATOR → SPACE #
2000 ;  0020 ;  SL      #* (   →   ) EN QUAD → SPACE    #
2001 ;  0020 ;  SL      #* (   →   ) EM QUAD → SPACE    #
2002 ;  0020 ;  SL      #* (   →   ) EN SPACE → SPACE   #
2003 ;  0020 ;  SL      #* (   →   ) EM SPACE → SPACE   #
2004 ;  0020 ;  SL      #* (   →   ) THREE-PER-EM SPACE → SPACE #
2005 ;  0020 ;  SL      #* (   →   ) FOUR-PER-EM SPACE → SPACE  #
2006 ;  0020 ;  SL      #* (   →   ) SIX-PER-EM SPACE → SPACE   #
2008 ;  0020 ;  SL      #* (   →   ) PUNCTUATION SPACE → SPACE  #
2009 ;  0020 ;  SL      #* (   →   ) THIN SPACE → SPACE #
200A ;  0020 ;  SL      #* (   →   ) HAIR SPACE → SPACE #
205F ;  0020 ;  SL      #* (   →   ) MEDIUM MATHEMATICAL SPACE → SPACE  #
00A0 ;  0020 ;  SL      #* (   →   ) NO-BREAK SPACE → SPACE     #
2007 ;  0020 ;  SL      #* (   →   ) FIGURE SPACE → SPACE       #
202F ;  0020 ;  SL      #* (   →   ) NARROW NO-BREAK SPACE → SPACE      #
1680 ;  0020 ;  SL      #* (   →   ) OGHAM SPACE MARK → SPACE   #
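
As a rough illustration of how rows like these could be used to fold
spacelike codepoints to U+0020, here is a small Python sketch. The
function and variable names are made up for this example. It assumes
each row has a single source codepoint and a single target codepoint,
as in the rows quoted above, and its sample uses ASCII "->" in the
comments instead of the arrows in the real file.

```python
from typing import Tuple


def parse_confusable_line(line: str) -> Tuple[int, int]:
    """Return (source, target) codepoints from one 'XXXX ; YYYY ; SL # ...' row."""
    data = line.split("#", 1)[0]  # drop the trailing comment
    fields = [f.strip() for f in data.split(";")]
    return int(fields[0], 16), int(fields[1], 16)


# A few of the rows from the table above, with simplified comments.
SAMPLE = """\
2003 ;  0020 ;  SL      # EM SPACE -> SPACE
00A0 ;  0020 ;  SL      # NO-BREAK SPACE -> SPACE
1680 ;  0020 ;  SL      # OGHAM SPACE MARK -> SPACE
"""

# Translation table keyed by codepoint, suitable for str.translate().
TO_SPACE = dict(
    parse_confusable_line(row) for row in SAMPLE.splitlines() if row.strip()
)


def fold_spaces(s: str) -> str:
    """Replace every codepoint listed in TO_SPACE with its target."""
    return s.translate(TO_SPACE)
```

Folding this way only addresses what goes on the wire; it does not
help a user tell an em space from a space on the screen.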