Re: Complexity of :matches ?

Michael Haardt <michael@freenet-ag.de> Tue, 12 July 2005 08:52 UTC

To: ietf-mta-filters@imc.org
Subject: Re: Complexity of :matches ?
Message-Id: <E1DsGVb-0003gO-Fg@nostromo.freenet-ag.de>
From: Michael Haardt <michael@freenet-ag.de>
Date: Tue, 12 Jul 2005 10:52:55 +0200
Sender: owner-ietf-mta-filters@mail.imc.org
Precedence: bulk
List-Archive: <http://www.imc.org/ietf-mta-filters/mail-archive/>
List-ID: <ietf-mta-filters.imc.org>
List-Unsubscribe: <mailto:ietf-mta-filters-request@imc.org?body=unsubscribe>

> The complexity of a naive implementation of greedy glob-matching is pretty
> clearly O(M*N**K), where N is the length of the target string, M is the length
> of the pattern, and K is the number of *'s present in the pattern, and it is
> pretty easy to construct inputs that will exhibit worst-case behavior - all you
> need is lots of *'s in a pattern that never matches.

Right, and that's what happened.  It wasn't a problem until recently, when
users started to use multiple stars instead of one, like ********@domain.
Don't ask me why.
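
To make the blow-up concrete, here is a minimal sketch of such a
naive greedy matcher (a simplification for illustration, not our
production code), handling only '*' and '?':

    #include <stdbool.h>
    #include <string.h>

    /* Naive greedy matcher: '*' matches any run of characters, '?'
     * exactly one.  Each star tries the longest run first and
     * backtracks, so K stars can force on the order of N**K retries. */
    static bool glob_naive(const char *p, const char *s)
    {
        if (*p == '\0')
            return *s == '\0';
        if (*p == '*') {
            const char *t;
            for (t = s + strlen(s); t >= s; t--)  /* greedy: longest first */
                if (glob_naive(p + 1, t))
                    return true;
            return false;
        }
        if (*s != '\0' && (*p == '?' || *p == *s))
            return glob_naive(p + 1, s + 1);
        return false;
    }

On ********@domain against an address that never matches, every star
retries every split point, which is exactly the behaviour described
above.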

> However, I do not know if simply using an optimized non-greedy approach
> actually reduces the computational complexity to something less than O(M*N**K)
> for extremely pathological cases. If it doesn't, my guess is that there's some
> sort of dynamic programming trick that can get the complexity down to O(N*M) or
> something similar. I haven't tried to find such a trick, mostly because we've
> never encountered a pathological case when using optimized non-greedy matching
> that would make it relevant.

So you suspect that there might be an algorithm that does not perform
backtracking?  I cannot imagine O(N*M) as the worst case unless we can
kick backtracking out, although it might well be the average case.  But
that's just a feeling, and it may be wrong.
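
One candidate for such a trick, sketched here purely for illustration
(mine, and again handling only '*' and '?'), is a dynamic programming
pass over (input position, pattern position).  If it holds up, it
needs no backtracking at all and runs in O(N*M) time with O(M) space:

    #include <stdbool.h>
    #include <string.h>

    /* prev[j]/cur[j] == true means the first j pattern characters can
     * match the input consumed so far; one pass per input character. */
    static bool glob_dp(const char *p, const char *s)
    {
        size_t m = strlen(p), i, j;
        bool row_a[m + 1], row_b[m + 1];
        bool *prev = row_a, *cur = row_b, *tmp;

        prev[0] = true;                     /* empty pattern vs empty input */
        for (j = 1; j <= m; j++)            /* leading stars match nothing */
            prev[j] = prev[j - 1] && p[j - 1] == '*';

        for (i = 0; s[i] != '\0'; i++) {
            cur[0] = false;                 /* non-empty input, empty pattern */
            for (j = 1; j <= m; j++) {
                if (p[j - 1] == '*')
                    cur[j] = cur[j - 1]     /* star matches the empty string */
                          || prev[j];       /* or absorbs s[i] as well */
                else
                    cur[j] = prev[j - 1] &&
                             (p[j - 1] == '?' || p[j - 1] == s[i]);
            }
            tmp = prev; prev = cur; cur = tmp;
        }
        return prev[m];
    }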

> In any case, the computational complexity of glob-style matches is clearly less
> than that of regexp, which is hardly surprising given that regexps require
> an NFA. This is why we disable regexp support by default in our implementation.

That's what I wonder about.  The NFA can be converted to a DFA, with
backtracking being just one possible implementation strategy, and the
same appears to me to be true for wildcard patterns, if multiple stars
are allowed.
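
To make that concrete: every pattern position can be treated as an
NFA state, and the set of currently active states can be carried
across the input, which is what a DFA construction would precompute.
A rough sketch of mine (it assumes patterns of at most 63 characters,
so that the state set fits into one 64-bit word):

    #include <stdbool.h>
    #include <stdint.h>
    #include <string.h>

    /* Bit j of 'states' means: pattern prefix p[0..j) matches the
     * input consumed so far.  Each input character updates the whole
     * set at once, so there is no backtracking. */
    static bool glob_states(const char *p, const char *s)
    {
        size_t m = strlen(p), j;
        uint64_t states = 1;

        for (j = 0; j < m && p[j] == '*'; j++)   /* leading stars match "" */
            states |= (uint64_t)1 << (j + 1);

        for (; *s != '\0'; s++) {
            uint64_t next = 0;
            for (j = 0; j < m; j++) {
                if (!(states & ((uint64_t)1 << j)))
                    continue;
                if (p[j] == '*')
                    next |= (uint64_t)1 << j;        /* star absorbs *s */
                else if (p[j] == '?' || p[j] == *s)
                    next |= (uint64_t)1 << (j + 1);
            }
            for (j = 0; j < m; j++)                  /* stars also match "" */
                if ((next & ((uint64_t)1 << j)) && p[j] == '*')
                    next |= (uint64_t)1 << (j + 1);
            states = next;
        }
        return (states & ((uint64_t)1 << m)) != 0;
    }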

> Perhaps it makes sense to recommend a non-greedy approach in the security
> considerations section of the base specification? Allowing implementations to
> limit the number of *'s they allow in glob-style patterns might also make
> sense, if only to catch cases where users used :matches when they intended to
> use :contains.

If we cannot come up with a proof of the worst-case complexity of
non-greedy algorithms, then I think we should document that pattern
matching may require time exponential in the number of stars in the
pattern, and indeed allow implementations to put a limit on pattern
matching, generating an error if the limit is exceeded.  The
implementation should be free to express the limit in terms of the
number of stars, the number of backtracking operations performed, the
CPU time used, or whatever else.

Interestingly, the PCRE library already contains functions to do that
for regular expressions, which is why the Exim filter language (quite
similar to Sieve) can offer regular expressions at no risk.
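
For reference, this is roughly how that looks with PCRE (my sketch;
check the pcreapi documentation of your version for the exact field
names): the match_limit field of pcre_extra caps the amount of
internal matching work, and pcre_exec() returns PCRE_ERROR_MATCHLIMIT
when the cap is hit, which can then be reported as a runtime error:

    #include <pcre.h>
    #include <string.h>

    /* Returns 1 on match, 0 on no match, -1 on a compile error and
     * -2 when the match limit was exceeded. */
    static int regex_limited(const char *pattern, const char *subject)
    {
        const char *errtext;
        int erroffset, rc, ovector[30];
        pcre *re;
        pcre_extra extra;

        re = pcre_compile(pattern, 0, &errtext, &erroffset, NULL);
        if (re == NULL)
            return -1;

        memset(&extra, 0, sizeof extra);
        extra.flags = PCRE_EXTRA_MATCH_LIMIT;
        extra.match_limit = 10000;          /* cap on internal matching steps */

        rc = pcre_exec(re, &extra, subject, (int)strlen(subject),
                       0, 0, ovector, 30);
        pcre_free(re);

        if (rc == PCRE_ERROR_MATCHLIMIT)
            return -2;
        return rc >= 0 ? 1 : 0;
    }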

Michael