Velocity Reviews - Computer Hardware Reviews

Velocity Reviews > Newsgroups > Programming > Python > Pathological regular expression

Reply
Thread Tools

Pathological regular expression

 
 
David Liang
Guest
Posts: n/a
 
      04-09-2009
Hi all,
I'm having a weird problem with a regular expression (tested in 2.6
and 3.0):

Basically, any of these:
_re_comments = re.compile(r'^(([^\\]+|\\.|"([^"\\]+|\\.)*")*)#.*$')
_re_comments = re.compile(r'^(([^#]+|\\.|"([^"\\]+|\\.)*")*)#.*$')
_re_comments = re.compile(r'^(([^"]+|\\.|"([^"\\]+|\\.)*")*)#.*$')

followed by for example,
line = r'~/.[m]ozilla/firefox/*.default/chrome'
print(_re_comments.sub(r'\1', line))

....hangs the interpreter. For reference, if the first command had been
_re_comments = re.compile(r'^(([^z]+|\\.|"([^"\\]+|\\.)*")*)#.*$')

(off by one character z) it works fine, and all the equivalent
operations work in sed and awk. Am I missing something about Python
RE's?

-David
 
Reply With Quote
 
 
 
 
David Liang
Guest
Posts: n/a
 
      04-09-2009
On Apr 9, 2:56*am, David Liang <(E-Mail Removed)> wrote:
> Hi all,
> I'm having a weird problem with a regular expression (tested in 2.6
> and 3.0):
>
> Basically, any of these:
> _re_comments = re.compile(r'^(([^\\]+|\\.|"([^"\\]+|\\.)*")*)#.*$')
> _re_comments = re.compile(r'^(([^#]+|\\.|"([^"\\]+|\\.)*")*)#.*$')
> _re_comments = re.compile(r'^(([^"]+|\\.|"([^"\\]+|\\.)*")*)#.*$')
>
> followed by for example,
> line = r'~/.[m]ozilla/firefox/*.default/chrome'
> print(_re_comments.sub(r'\1', line))
>
> ...hangs the interpreter. For reference, if the first command had been
> _re_comments = re.compile(r'^(([^z]+|\\.|"([^"\\]+|\\.)*")*)#.*$')
>
> (off by one character z) it works fine, and all the equivalent
> operations work in sed and awk. Am I missing something about Python
> RE's?
>
> -David


The problem was the redundant +'s; the fixed RE is

_re_comments = re.compile(r'^(([^#"\\]|\\.|"([^"\\]|\\.)*")*)#.*')
 
Reply With Quote
 
 
 
 
David Liang
Guest
Posts: n/a
 
      04-09-2009
On Apr 9, 2:56*am, David Liang <(E-Mail Removed)> wrote:
> Hi all,
> I'm having a weird problem with a regular expression (tested in 2.6
> and 3.0):
>
> Basically, any of these:
> _re_comments = re.compile(r'^(([^\\]+|\\.|"([^"\\]+|\\.)*")*)#.*$')
> _re_comments = re.compile(r'^(([^#]+|\\.|"([^"\\]+|\\.)*")*)#.*$')
> _re_comments = re.compile(r'^(([^"]+|\\.|"([^"\\]+|\\.)*")*)#.*$')
>
> followed by for example,
> line = r'~/.[m]ozilla/firefox/*.default/chrome'
> print(_re_comments.sub(r'\1', line))
>
> ...hangs the interpreter. For reference, if the first command had been
> _re_comments = re.compile(r'^(([^z]+|\\.|"([^"\\]+|\\.)*")*)#.*$')
>
> (off by one character z) it works fine, and all the equivalent
> operations work in sed and awk. Am I missing something about Python
> RE's?
>
> -David


The problem was the redundant +'s; the fixed RE is

_re_comments = re.compile(r'^(([^#"\\]|\\.|"([^"\\]|\\.)*")*)#.*')
 
Reply With Quote
 
Steven D'Aprano
Guest
Posts: n/a
 
      04-11-2009
On Thu, 09 Apr 2009 02:56:00 -0700, David Liang wrote:

> Hi all,
> I'm having a weird problem with a regular expression (tested in 2.6 and
> 3.0):
>
> Basically, any of these:
> _re_comments = re.compile(r'^(([^\\]+|\\.|"([^"\\]+|\\.)*")*)#.*$')
> _re_comments = re.compile(r'^(([^#]+|\\.|"([^"\\]+|\\.)*")*)#.*$')
> _re_comments = re.compile(r'^(([^"]+|\\.|"([^"\\]+|\\.)*")*)#.*$')
>
> followed by for example,
> line = r'~/.[m]ozilla/firefox/*.default/chrome'
> print(_re_comments.sub(r'\1', line))
>
> ...hangs the interpreter.



I can confirm the first one hangs the interpreter in Python 2.5 as well.
I haven't tested the other two.

To my mind, this is a bug in the RE engine. Is there any reason to not
treat it as a bug?


--
Steven
 
Reply With Quote
 
MRAB
Guest
Posts: n/a
 
      04-11-2009
Steven D'Aprano wrote:
> On Thu, 09 Apr 2009 02:56:00 -0700, David Liang wrote:
>
>> Hi all,
>> I'm having a weird problem with a regular expression (tested in 2.6 and
>> 3.0):
>>
>> Basically, any of these:
>> _re_comments = re.compile(r'^(([^\\]+|\\.|"([^"\\]+|\\.)*")*)#.*$')
>> _re_comments = re.compile(r'^(([^#]+|\\.|"([^"\\]+|\\.)*")*)#.*$')
>> _re_comments = re.compile(r'^(([^"]+|\\.|"([^"\\]+|\\.)*")*)#.*$')
>>
>> followed by for example,
>> line = r'~/.[m]ozilla/firefox/*.default/chrome'
>> print(_re_comments.sub(r'\1', line))
>>
>> ...hangs the interpreter.

>
>
> I can confirm the first one hangs the interpreter in Python 2.5 as well.
> I haven't tested the other two.
>
> To my mind, this is a bug in the RE engine. Is there any reason to not
> treat it as a bug?
>

It's not a bug but one of those pathological cases where there are a LOT
of possibilities to try (we're talking about exponential growth here).

Hmm, maybe the re module needs to let the user set a timeout control
that raises an exception if it takes longer than n seconds (it can
already be interrupted by ^C)...
 
Reply With Quote
 
John Machin
Guest
Posts: n/a
 
      04-11-2009
On Apr 12, 1:07*am, Steven D'Aprano <st...@REMOVE-THIS-
cybersource.com.au> wrote:
> On Thu, 09 Apr 2009 02:56:00 -0700, David Liang wrote:
> > Hi all,
> > I'm having a weird problem with a regular expression (tested in 2.6 and
> > 3.0):

>
> > Basically, any of these:
> > _re_comments = re.compile(r'^(([^\\]+|\\.|"([^"\\]+|\\.)*")*)#.*$')
> > _re_comments = re.compile(r'^(([^#]+|\\.|"([^"\\]+|\\.)*")*)#.*$')
> > _re_comments = re.compile(r'^(([^"]+|\\.|"([^"\\]+|\\.)*")*)#.*$')

>
> > followed by for example,
> > line = r'~/.[m]ozilla/firefox/*.default/chrome'
> > print(_re_comments.sub(r'\1', line))

>
> > ...hangs the interpreter.

>
> I can confirm the first one hangs the interpreter in Python 2.5 as well.
> I haven't tested the other two.
>
> To my mind, this is a bug in the RE engine. Is there any reason to not
> treat it as a bug?


IMHO it's not a bug -- s/hang/takes a long time to compute/

Just look at it: 2 + operators and 3 * operators ... It's one of those
"come back after lunch" REs.

 
Reply With Quote
 
Dotan Cohen
Guest
Posts: n/a
 
      04-11-2009
> IMHO it's not a bug -- s/hang/takes a long time to compute/
>


‎That is quite what a hang is, and why the timeout was invented. The
real bug is that there is no timeout mechanism.

> Just look at it: 2 + operators and 3 * operators ... It's one of those
> "come back after lunch" REs.
>


Some users would not be able to tell that from the beginning. While a
timeout override would be necessary in these cases, a timeout should
be set.


--
Dotan Cohen

http://what-is-what.com
http://gibberish.co.il
 
Reply With Quote
 
MRAB
Guest
Posts: n/a
 
      04-11-2009
Dotan Cohen wrote:
>> IMHO it's not a bug -- s/hang/takes a long time to compute/
>>

>
> ‎That is quite what a hang is, and why the timeout was invented. The
> real bug is that there is no timeout mechanism.
>

I wouldn't call it a "hang" because it is actually doing work. If it was
'stuck' on a certain part of the text and not progressing then it would
be a hang.

>> Just look at it: 2 + operators and 3 * operators ... It's one of those
>> "come back after lunch" REs.
>>

>
> Some users would not be able to tell that from the beginning. While a
> timeout override would be necessary in these cases, a timeout should
> be set.
>

 
Reply With Quote
 
Aaron Brady
Guest
Posts: n/a
 
      04-11-2009
On Apr 11, 10:07*am, Steven D'Aprano <st...@REMOVE-THIS-
cybersource.com.au> wrote:
> On Thu, 09 Apr 2009 02:56:00 -0700, David Liang wrote:
> > Hi all,
> > I'm having a weird problem with a regular expression (tested in 2.6 and
> > 3.0):

>
> > Basically, any of these:
> > _re_comments = re.compile(r'^(([^\\]+|\\.|"([^"\\]+|\\.)*")*)#.*$')
> > _re_comments = re.compile(r'^(([^#]+|\\.|"([^"\\]+|\\.)*")*)#.*$')
> > _re_comments = re.compile(r'^(([^"]+|\\.|"([^"\\]+|\\.)*")*)#.*$')

>
> > followed by for example,
> > line = r'~/.[m]ozilla/firefox/*.default/chrome'
> > print(_re_comments.sub(r'\1', line))

>
> > ...hangs the interpreter.

>
> I can confirm the first one hangs the interpreter in Python 2.5 as well.
> I haven't tested the other two.
>
> To my mind, this is a bug in the RE engine. Is there any reason to not
> treat it as a bug?
>
> --
> Steven


How long does it take on a short example?
 
Reply With Quote
 
Steven D'Aprano
Guest
Posts: n/a
 
      04-11-2009
On Sat, 11 Apr 2009 08:40:03 -0700, John Machin wrote:

>> To my mind, this is a bug in the RE engine. Is there any reason to not
>> treat it as a bug?

>
> IMHO it's not a bug -- s/hang/takes a long time to compute/
>
> Just look at it: 2 + operators and 3 * operators ... It's one of those
> "come back after lunch" REs.


Well, it's been running now for about two and a half hours, that's a
rather long lunch. And despite MRAB's assertion, it *cannot* be
interrupted by ctrl-C. That means that to all intents and purposes, the
interpreter has locked up for the duration of the calculation, which may
be days or weeks for all I know.



--
Steven
 
Reply With Quote
 
 
 
Reply

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off
Trackbacks are On
Pingbacks are On
Refbacks are Off


Similar Threads
Thread Thread Starter Forum Replies Last Post
Seek xpath expression where an attribute name is a regular expression GIMME XML 3 12-29-2008 03:11 PM
C/C++ language proposal: Change the 'case expression' from "integral constant-expression" to "integral expression" Adem C++ 42 11-04-2008 12:39 PM
inject's pathological case... Just Another Victim of the Ambient Morality Ruby 22 04-02-2008 11:51 AM
Matching abitrary expression in a regular expression =?iso-8859-1?B?bW9vcJk=?= Java 8 12-02-2005 12:51 AM
Dynamically changing the regular expression of Regular Expression validator VSK ASP .Net 2 08-24-2003 02:47 PM



Advertisments