Re: [idn] Re: stringprep: PRI #29

"Adam M. Costello" <idn.amc+0@nicemice.net.RemoveThisWord.cnri.reston.va.us> Sun, 27 March 2005 21:36 UTC

Received: from psg.com (mailnull@psg.com [147.28.0.62]) by ietf.org (8.9.1a/8.9.1a) with ESMTP id QAA07728 for <idn-archive@lists.ietf.org>; Sun, 27 Mar 2005 16:36:51 -0500 (EST)
Received: from majordom by psg.com with local (Exim 4.44 (FreeBSD)) id 1DFfKU-00042q-7N for idn-data@psg.com; Sun, 27 Mar 2005 21:29:54 +0000
Received: from [64.36.79.201] (helo=nicemice.net) by psg.com with esmtps (TLSv1:AES256-SHA:256) (Exim 4.44 (FreeBSD)) id 1DFfKR-00042V-1T for idn@ops.ietf.org; Sun, 27 Mar 2005 21:29:51 +0000
Received: from amc by nicemice.net with local (Exim 4.44) id 1DFfKP-0004Dm-KU for idn@ops.ietf.org; Sun, 27 Mar 2005 13:29:49 -0800
Date: Sun, 27 Mar 2005 21:29:49 +0000
From: "Adam M. Costello" <idn.amc+0@nicemice.net.RemoveThisWord.cnri.reston.va.us>
To: idn@ops.ietf.org
Subject: Re: [idn] Re: stringprep: PRI #29
Message-ID: <20050327212949.GA15994~@nicemice.net>
Reply-To: IETF idn working group <idn@ops.ietf.org>
References: <423754F3.50405@vanderpoel.org> <ilumzt47ezc.fsf@latte.josefsson.org> <20050316091126.GA24254~@nicemice.net> <iluzmx36h6t.fsf@latte.josefsson.org> <423CD9DC.5080401@vanderpoel.org> <ilur7iaycuj.fsf@latte.josefsson.org> <423DB883.1020408@vanderpoel.org> <iluis3mxbqq.fsf@latte.josefsson.org> <423F12A9.60009@vanderpoel.org> <ilupsxs7skt.fsf@latte.josefsson.org>
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Disposition: inline
In-Reply-To: <ilupsxs7skt.fsf@latte.josefsson.org>
User-Agent: Mutt/1.5.6+20040907i
X-Spam-Checker-Version: SpamAssassin 3.0.1 (2004-10-22) on psg.com
X-Spam-Status: No, score=-2.6 required=5.0 tests=AWL,BAYES_00 autolearn=ham version=3.0.1
Sender: owner-idn@ops.ietf.org
Precedence: bulk

Simon Josefsson <jas@extundo.com> wrote:

> Having a draft outlining the alternatives would be a useful
> contribution.

For an implementation, I see the following alternatives:

1) Reject the problem sequences (and no others) before normalizing (add
   a step 0 to the stringprep algorithm).

2) Reject some superset of the problem sequences that is easier to
   compute and still rejects only implausible strings.

3) Reject nothing, but make sure the normalization algorithm follows the
   proposed fixed normalization spec (and the original sample code).

4) Reject nothing, but make sure the normalization algorithm follows the
   current broken normalization spec (not the original sample code).

5) Reject nothing, and follow whichever normalization spec is already
   implemented (that is, ignore the problem).

I have listed these in order of decreasing quality (though the order of
2 and 3 is debatable), and in order of decreasing complexity (except
that 3 and 4 are equally complex).

I don't expect anyone to defend 5.

Given that there already exist implementations of both kinds of
normalization, and neither is more right or more wrong than the other
(in light of the internally inconsistent spec), it would be silly for
stringprep to standardize on the broken one, so we can discard 4.

I can imagine an implementor making a case for any of 1-3.  The next
question is how much latitude the stringprep spec should give the
implementor.

We might consider allowing implementations to choose from among more
than one approved alternative.  I lean against that idea, because
stringprep is used to define valid strings in protocols, and it's hard
to analyze what can happen when implementations disagree about the
validity of a string.  (There is already one way this can happen in
stringprep: unassigned code points become assigned.  The analysis was
very tricky, and I'd prefer to keep that sort of thing to a minimum.)

If stringprep were to pick alternative 3 as the one and only allowed
behavior, some implementations would consider that an unacceptable
security risk and implement 1 or 2 anyway, so in practice we'd still
have implementations choosing from multiple alternatives.

This argues for the stringprep spec to require the rejection of either
(1) the problem sequences or (2) some particular superset of the problem
sequences that is easier to compute.  But the implementation simplicity
of (3) is very appealing, and part of me hopes to be persuaded to go
with that in spite of the analytical complexity.

If we choose 1 or 2, we still need to decide how to define the sequences
that must be rejected.  For example, if we choose 1, I suggest something
like this:

    Definition:  A string must be rejected if and only if oldNormalize
    and newNormalize return different results.

    Theorem:  The rejected strings are exactly the ones containing
    code points with such-and-such properties in such-and-such
    combinations...

    Theorem:  In Unicode version x.y, the previous theorem amounts to
    the following table of specific code point combinations...

Notice that if an implementation has access to both oldNormalize and
newNormalize, it can implement the definition directly without the
complexity of applying either theorem.  It could also do a simple check
for the presence of any code points from the table, and optimize away
the second normalization if none are present.

AMC