Velocity Reviews - Computer Hardware Reviews

Velocity Reviews > Newsgroups > Programming > Python > Re: Cancel or timeout a long running regular expression

Reply
Thread Tools

Re: Cancel or timeout a long running regular expression

 
 
Terry Reedy
Guest
Posts: n/a
 
      09-15-2011
On 9/15/2011 1:19 AM, http://www.velocityreviews.com/forums/(E-Mail Removed) wrote:
> Is there a way to cancel or timeout a long running regular expression?
> I have a program that accepts regular expressions from users and I'm
> concerned about how to handle worst case regular expressions that seem
> to run forever. Ideally I'm looking for a way to evaluate a regular
> expression and timeout after a specified time period if the regular
> expression hasn't completed yet. Or a way for a user to cancel a long
> running regular expression.


This is a general problem when evaluating *any* expression from the
outside. [0]*10000*10000 will eat space as well as time. At least, as
far as I know, an re cannot cause a disk reformat .

There have been previous discussions on this generally topic.

> I was thinking there might be a technique I could use to evaluate
> regular expressions in a thread or another process launched via
> multiprocessing module and then kill the thread/process after a
> specified timeout period.


Only solution I remember ever seen posted. I wonder if there are any
heuristics for detecting exponential time re's.

> My concern about the multiprocessing module
> technique is that launching a process for every regex evaluation sounds
> pretty inefficient. And I don't think the threading module supports the
> ability to kill threads from outside a thread itself.


--
Terry Jan Reedy

 
Reply With Quote
 
 
 
 
Nobody
Guest
Posts: n/a
 
      09-16-2011
On Thu, 15 Sep 2011 14:54:57 -0400, Terry Reedy wrote:

>> I was thinking there might be a technique I could use to evaluate
>> regular expressions in a thread or another process launched via
>> multiprocessing module and then kill the thread/process after a
>> specified timeout period.

>
> Only solution I remember ever seen posted. I wonder if there are any
> heuristics for detecting exponential time re's.


Exponential growth results from non-determinism, i.e. if there are
multiple transitions for a given character from a given state.

Common patterns include:

...(a...)?a...
...(a...)*a...
...(a...)+a...

with a choice between matching the "a" at the start of the bracketed
pattern or the "a" following it.

(xxxa...|xxxb...|xxxc...)

with a choice between branches which cannot be resolved until more data
has been read.

For re.search (as opposed to re.match):

axxxa...

When axxx has been read, a following "a" gives a choice between
continuing the existing match with the second "a", or aborting and
matching against the first "a". For patterns which contain many copies of
the initial character, each copy creates another branch.

Also, using back-references in a regexp [sic] can be a significant
performance killer, as it rules out the use of a DFA (which, IIRC, Python
doesn't do anyhow) and can require brute-forcing many combinations. A
particularly bad case is:

(a*)\1

matching against "aaaaaaaa...".

If the performance issue is with re.match/re.search rather than with
re.compile, one option is to use ctypes to access libc's regexp functions.
Those are likely to provide better searching throughput at the expense of
potentially increased compilation time.

 
Reply With Quote
 
 
 
 
Terry Reedy
Guest
Posts: n/a
 
      09-16-2011
On 9/16/2011 9:57 AM, Nobody wrote:
>>I wonder if there are any
>> heuristics for detecting exponential time re's.

>
> Exponential growth results from non-determinism, i.e. if there are
> multiple transitions for a given character from a given state.
>
> Common patterns include:
>
> ...(a...)?a...
> ...(a...)*a...
> ...(a...)+a...
>
> with a choice between matching the "a" at the start of the bracketed
> pattern or the "a" following it.
>
> (xxxa...|xxxb...|xxxc...)
>
> with a choice between branches which cannot be resolved until more data
> has been read.
>
> For re.search (as opposed to re.match):
>
> axxxa...
>
> When axxx has been read, a following "a" gives a choice between
> continuing the existing match with the second "a", or aborting and
> matching against the first "a". For patterns which contain many copies of
> the initial character, each copy creates another branch.
>
> Also, using back-references in a regexp [sic] can be a significant
> performance killer, as it rules out the use of a DFA (which, IIRC, Python
> doesn't do anyhow) and can require brute-forcing many combinations. A
> particularly bad case is:
>
> (a*)\1
>
> matching against "aaaaaaaa...".
>
> If the performance issue is with re.match/re.search rather than with
> re.compile, one option is to use ctypes to access libc's regexp functions.
> Those are likely to provide better searching throughput at the expense of
> potentially increased compilation time.


Now, can you write that as a heuristic *algorithm* def
dangerous_re(some_re)

--
Terry Jan Reedy

 
Reply With Quote
 
Nobody
Guest
Posts: n/a
 
      09-17-2011
On Fri, 16 Sep 2011 18:01:27 -0400, Terry Reedy wrote:

> Now, can you write that as a heuristic *algorithm*
> def dangerous_re(some_re)


return re.search(r'\\\d', some_re) is not None

That will handle the most extreme case

If the OP is serious about analysing regexps, sre_parse.parse() will
decompose a regexp to a more convenient form.

However, I wouldn't rely upon being able to catch every possible bad case.
The only robust solution is to use a separate process (threads won't
suffice, as they don't have a .kill() method).

 
Reply With Quote
 
python@bdurham.com
Guest
Posts: n/a
 
      09-18-2011
Thanks for everyone's comments - much appreciated!

Malcolm (the OP)
 
Reply With Quote
 
 
 
Reply

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off
Trackbacks are On
Pingbacks are On
Refbacks are Off


Similar Threads
Thread Thread Starter Forum Replies Last Post
Killing threads (was Re: Cancel or timeout a long running regularexpression) Chris Angelico Python 2 09-21-2011 09:25 AM
Having compilation error: no match for call to (const __gnu_cxx::hash<long long int>) (const long long int&) veryhotsausage C++ 1 07-04-2008 05:41 PM
HELP CANCEL CANCEL CANCEL Carmen Rosario Computer Support 7 04-06-2005 11:04 PM
Timeout::timeout and Socket timeout Mark Probert Ruby 1 10-06-2004 09:30 AM
Dynamically changing the regular expression of Regular Expression validator VSK ASP .Net 2 08-24-2003 02:47 PM



Advertisments