Re: Complexity of :matches ?

Tim Showalter <tjs@psaux.com> Mon, 11 July 2005 19:51 UTC

Received: from above.proper.com (localhost.vpnc.org [127.0.0.1]) by above.proper.com (8.12.11/8.12.9) with ESMTP id j6BJpfuu081647; Mon, 11 Jul 2005 12:51:41 -0700 (PDT) (envelope-from owner-ietf-mta-filters@mail.imc.org)
Received: (from majordom@localhost) by above.proper.com (8.12.11/8.12.9/Submit) id j6BJpfDi081645; Mon, 11 Jul 2005 12:51:41 -0700 (PDT)
X-Authentication-Warning: above.proper.com: majordom set sender to owner-ietf-mta-filters@mail.imc.org using -f
Received: from mrout1-b.corp.dcn.yahoo.com (mrout1-b.corp.dcn.yahoo.com [216.109.112.27]) by above.proper.com (8.12.11/8.12.9) with ESMTP id j6BJpbpK081634 for <ietf-mta-filters@imc.org>; Mon, 11 Jul 2005 12:51:40 -0700 (PDT) (envelope-from tjs@psaux.com)
Received: from [66.228.163.20] (trustwho.corp.yahoo.com [66.228.163.20]) by mrout1-b.corp.dcn.yahoo.com (8.13.4/8.13.4/y.out) with ESMTP id j6BJonRO097650; Mon, 11 Jul 2005 12:50:53 -0700 (PDT)
Message-ID: <42D2CD99.4050601@psaux.com>
Date: Mon, 11 Jul 2005 12:50:49 -0700
From: Tim Showalter <tjs@psaux.com>
User-Agent: Mozilla Thunderbird 1.0 (X11/20050215)
X-Accept-Language: en-us, en
MIME-Version: 1.0
To: Ned Freed <ned.freed@mrochek.com>
CC: Michael Haardt <michael@freenet-ag.de>, ietf-mta-filters@imc.org
Subject: Re: Complexity of :matches ?
References: <E1Drvq1-00039y-Vw@nostromo.freenet-ag.de> <01LQHS7CFONU000092@mauve.mrochek.com>
In-Reply-To: <01LQHS7CFONU000092@mauve.mrochek.com>
Content-Type: text/plain; charset="UTF-8"; format="flowed"
Content-Transfer-Encoding: 8bit
Sender: owner-ietf-mta-filters@mail.imc.org
Precedence: bulk
List-Archive: <http://www.imc.org/ietf-mta-filters/mail-archive/>
List-ID: <ietf-mta-filters.imc.org>
List-Unsubscribe: <mailto:ietf-mta-filters-request@imc.org?body=unsubscribe>

> However, I do not know if simply using an optimized non-greedy approach
> actually reduces the computational complexity to something less than O(M*N**K)
> for extremely pathological cases.

I had a implementation (wasn't Sieve, but had * and ? and worked the 
same way) that was only O(M^N), that is, the number of stars didn't 
matter.  Barring pathological cases that reduce the match to O(N^2), I 
believe this algorithm is Θ(N).  It's possible to be actually O(M+N) if 
you use Boyer-Moore.

The algorithm I used was just essentially this.
- ? can be handled trivially.
- if the pattern doesn't begin with *
   - handle that leader
- find the next string between stars
   - search for string between the stars
   - search returns a location of that substring
   - no match -> return false
   - we have a match; from now on, search only
     from the location of that match forward
- if the pattern doesn't end with *
   - handle that trailer

That is, for with a pattern like *a*b*c* is three strings, search for 
each of which can be found in best-case O(N) time.  If we search for 'a' 
first and find the first match at position 10, we start searching for 
'b' at position 11.  For a pattern like a*b*c*d*e, we do the same thing 
after making sure the string starts with a and ends with e.

The pathological case for this, like string comparison, is only, uh, 
O((M+N)^2) or so, and then only if you're using naive string comparisons 
(which my implementation did).

Since I'm not sure that I've gotten my point across, let me make a more 
straightforward statement.  If you're doing globbing, and you don't care 
if it's greedy or not, and you're doing recursion, then you're doing 
something wrong.  Finding one match is something that can be done in 
roughly linear time.  Recursion is only necessary for finding the "best" 
match which is not something that's in the base Sieve spec.

I think that this implements right-to-left greedy matching, that is, 
completely ungreedy matching.  If you wanted greedy matching, you might 
be able to get it by doing the search backwards.  I'm real sure that I 
can't prove this.

Tim