Re: Complexity of :matches ?

Ned Freed <ned.freed@mrochek.com> Mon, 11 July 2005 16:10 UTC

Received: from above.proper.com (localhost.vpnc.org [127.0.0.1]) by above.proper.com (8.12.11/8.12.9) with ESMTP id j6BGAOJ4059957; Mon, 11 Jul 2005 09:10:24 -0700 (PDT) (envelope-from owner-ietf-mta-filters@mail.imc.org)
Received: (from majordom@localhost) by above.proper.com (8.12.11/8.12.9/Submit) id j6BGAORq059956; Mon, 11 Jul 2005 09:10:24 -0700 (PDT)
X-Authentication-Warning: above.proper.com: majordom set sender to owner-ietf-mta-filters@mail.imc.org using -f
Received: from mauve.mrochek.com (mauve.mrochek.com [209.55.107.55]) by above.proper.com (8.12.11/8.12.9) with ESMTP id j6BGAMoF059949 for <ietf-mta-filters@imc.org>; Mon, 11 Jul 2005 09:10:23 -0700 (PDT) (envelope-from ned.freed@mrochek.com)
Received: from mauve.mrochek.com by mauve.mrochek.com (PMDF V6.1-1 #35243) id <01LQH95TGZXS000092@mauve.mrochek.com> for ietf-mta-filters@imc.org; Mon, 11 Jul 2005 09:10:21 -0700 (PDT)
Cc: ietf-mta-filters@imc.org
To: Michael Haardt <michael@freenet-ag.de>
Message-id: <01LQHS7CFONU000092@mauve.mrochek.com>
Date: Mon, 11 Jul 2005 08:15:21 -0700
From: Ned Freed <ned.freed@mrochek.com>
Subject: Re: Complexity of :matches ?
In-reply-to: "Your message dated Mon, 11 Jul 2005 12:48:37 +0200" <E1Drvq1-00039y-Vw@nostromo.freenet-ag.de>
MIME-version: 1.0
References: <E1Drvq1-00039y-Vw@nostromo.freenet-ag.de>
Sender: owner-ietf-mta-filters@mail.imc.org
Precedence: bulk
List-Archive: <http://www.imc.org/ietf-mta-filters/mail-archive/>
List-ID: <ietf-mta-filters.imc.org>
List-Unsubscribe: <mailto:ietf-mta-filters-request@imc.org?body=unsubscribe>

> I just had a user using a wildcard pattern that took a large amount of
> CPU time.  Now that this particular problem is solved, I wonder about
> ":matches".

> Does anybody know the complexity order of recognising wildcard patterns,
> which are a subset of regular expressions?

The complexity of a naive implementation of greedy glob-matching is pretty
clearly O(M*N**K), where N is the length of the target string, M is the length
of the pattern, and K is the number of *'s present in the pattern, and it is
pretty easy to construct inputs that will exhibit worse-case behavior - all you
need is lots of *'s in a pattern that never matches.

Using a  non-greedy approach makes it much harder to construct inputs that will
take a long time to process, especially if you also do some simple
optimizations like collapsing strings of repeated *'s. In practice this seems
to avoid many of the more common problems, such as when users specify an
argument to :matches that only makes sense to give to :contains, such as
"******** ATTENTION ********".

However, I do not know if simply using an optimized non-greedy approach
actually reduces the computational complexity to something less than O(M*N**K)
for extremely pathological cases. If it doesn't, my guess is that there's some
sort of dynamic programming trick that can get the complexity down to O(N*M) or
something similar. I haven't tried to find such a trick, mostly because we've
never encountered a pathological case when using optimized non-greedy matching
that would make it relevant.

In any case, the computational complexity of glob-style matches is clearly less
than that of regexp, which is hardly surprising given that regexps require
a NFA. This is why we disable regexp support by default in our implementation.

Perhaps it makes sense to recommend a non-greedy approach in the security
considerations section of the base specification? Allowing implementations to
limit the number of *'s they allog in glob-style patterns might also make
sense, if only to catch cases where users used :matches when they intended to
use :contains.

				Ned

P.S. I'm curious what pattern you encountered that took a long time. Mind
sharing it?