Velocity Reviews - Computer Hardware Reviews

Velocity Reviews > Newsgroups > Programming > Python > Python RegExp

Reply
Thread Tools

Python RegExp

 
 
davidchipping@gmail.com
Guest
Posts: n/a
 
      03-22-2005
I have a string which I wish to match using RE, however when I run my
comparison (using the match method) on my machine it never returns,
using the CPU fully.

The code is (very simply):

------------------------------
import re

buffer = r"#1 1 550 111 SYNC_PEER RES
<YP>syncpeers=(id=54325432;add=10." \

"0.0.1;port=89;up=89),(id=97899;add=10.0.0.1;port= 543;up=543)," \
"(id=54325432;add=10.0.0.1;port=89;up=8"

p = re.compile(r'#(?P<tran>[0-9]+\s)(?P<sess>[0-9]+\s)' \
+ '(?P<ndto>[0-9]+\s)(?P<ndfr>[0-9]+\s)' \
+ '(?P<cmd>[a-zA-Z_]+\s)(?P<dir>[a-zA-Z_]+)' \
+ '(?P<parm>(\s<[a-zA-Z]+>(\s*[a-zA-Z]+=[a-zA' \
+ '-Z0-9.,=();]+\s*)+<[a-zA-Z]+>)*)#(?P<left>.*$)')

print "starting the match"
rst = (p.match (buffer) != None)
print "finishing the match"
--------------------------------

Should this every actually happen? Regardless of whether the statment
matches or not, surely it should do this?

 
Reply With Quote
 
 
 
 
Fredrik Lundh
Guest
Posts: n/a
 
      03-22-2005
http://www.velocityreviews.com/forums/(E-Mail Removed) wrote:

>I have a string which I wish to match using RE, however when I run my
> comparison (using the match method) on my machine it never returns,
> using the CPU fully.
>
> The code is (very simply):
>
> ------------------------------
> import re
>
> buffer = r"#1 1 550 111 SYNC_PEER RES
> <YP>syncpeers=(id=54325432;add=10." \
>
> "0.0.1;port=89;up=89),(id=97899;add=10.0.0.1;port= 543;up=543)," \
> "(id=54325432;add=10.0.0.1;port=89;up=8"
>
> p = re.compile(r'#(?P<tran>[0-9]+\s)(?P<sess>[0-9]+\s)' \
> + '(?P<ndto>[0-9]+\s)(?P<ndfr>[0-9]+\s)' \
> + '(?P<cmd>[a-zA-Z_]+\s)(?P<dir>[a-zA-Z_]+)' \
> + '(?P<parm>(\s<[a-zA-Z]+>(\s*[a-zA-Z]+=[a-zA' \
> + '-Z0-9.,=();]+\s*)+<[a-zA-Z]+>)*)#(?P<left>.*$)')
>
> print "starting the match"
> rst = (p.match (buffer) != None)
> print "finishing the match"
> --------------------------------
>
> Should this every actually happen? Regardless of whether the statment
> matches or not, surely it should do this?


surely the following should never be slow?

n = len(s)
for a in range(n):
for b in range(n):
for c in range(n):
for d in range(n):
for e in range(n):
... etc ...

(your RE contains several nested and/or related repeat operators,
which forces the poor engine to test zillions of combinations before
giving up. see Friedl's RE book for more information on this)

I suggest using a real parser for this task (or using the RE engine as
a tokenizer, and doing the rest in code; see
http://effbot.org/zone/xml-scanner.htm for more on this)

</F>



 
Reply With Quote
 
 
 
 
Jeff Epler
Guest
Posts: n/a
 
      03-22-2005
On my machine the program finishes in 30 seconds. (it's a 1.5GHz machine)
If the 'parm' group is removed, or if the buffer is shortened, the time
is reduced considerably.

There are "pathological cases" for regular expressions which can take
quite a long time. In the case of your expression, it's happening for
the group 'parm'. I think, but don't know, that each time a candidate
for 'parm' is found, the following '#' (or maybe the second '<'?) is not
found, and it backtracks to try to match 'parm' in a different way,
which involves considering many different combinations (basically, each
'name=' could be the start of a new instance of the first parenthsized
subgroup of <parm>, or it could be part of the character class that
includes [a-zA-Z=])

You may wish to consider using other approaches for parsing this text
than regular expressions.

Jeff

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.6 (GNU/Linux)

iD8DBQFCQK+2Jd01MZaTXX0RApByAJ4hIU1RVFWwdOeX28RLwQ H1rSk5awCeJXrE
mhBvK2/gMBghI/VamQfVFYc=
=8+da
-----END PGP SIGNATURE-----

 
Reply With Quote
 
David Chipping
Guest
Posts: n/a
 
      03-23-2005
Jeff,

Thanks very much for that, I didn't even consider the option of it
finishing, considering I'm using a much slower machine it was running
for over 2 minutes when I just killed it! I think I get the rest now.

Cheers again,

-David


On Tue, 22 Mar 2005 17:52:22 -0600, Jeff Epler <(E-Mail Removed)> wrote:
> On my machine the program finishes in 30 seconds. (it's a 1.5GHz machine)
> If the 'parm' group is removed, or if the buffer is shortened, the time
> is reduced considerably.
>
> There are "pathological cases" for regular expressions which can take
> quite a long time. In the case of your expression, it's happening for
> the group 'parm'. I think, but don't know, that each time a candidate
> for 'parm' is found, the following '#' (or maybe the second '<'?) is not
> found, and it backtracks to try to match 'parm' in a different way,
> which involves considering many different combinations (basically, each
> 'name=' could be the start of a new instance of the first parenthsized
> subgroup of <parm>, or it could be part of the character class that
> includes [a-zA-Z=])
>
> You may wish to consider using other approaches for parsing this text
> than regular expressions.
>
> Jeff
>
>
>

 
Reply With Quote
 
D H
Guest
Posts: n/a
 
      03-23-2005
(E-Mail Removed) wrote:
> I have a string which I wish to match using RE, however when I run my
> comparison (using the match method) on my machine it never returns,
> using the CPU fully.


In your case it may be simpler to just split the string into groups.
You don't even need regular expressions or a parser.

buff = r"#1 1 550 111 SYNC_PEER RES <YP>syncpeers=(id=54325432;add=10." \
"0.0.1;port=89;up=89),(id=97899;add=10.0.0.1;port= 543;up=543)," \
"(id=54325432;add=10.0.0.1;port=89;up="

tran, sess, ndto, ndf, cmd, dirr, rest = buff.split()

eq = rest.find("=")
parmname = rest[0:eq]
parms = rest[eq+1:].split(",")

for parm in parms:
parmitems = parm[1:-1].split(";")
for item in parmitems:
name, val = item.split("=")
print name, val
print "---"
 
Reply With Quote
 
 
 
Reply

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off
Trackbacks are On
Pingbacks are On
Refbacks are Off


Similar Threads
Thread Thread Starter Forum Replies Last Post
new RegExp().test() or just RegExp().test() Matěj Cepl Javascript 3 11-24-2009 02:41 PM
[regexp] How to convert string "/regexp/i" to /regexp/i - ? Joao Silva Ruby 16 08-21-2009 05:52 PM
Ruby 1.9 - ArgumentError: incompatible encoding regexp match(US-ASCII regexp with ISO-2022-JP string) Mikel Lindsaar Ruby 0 03-31-2008 10:27 AM
Programmatically turning a Regexp into an anchored Regexp Greg Hurrell Ruby 4 02-14-2007 06:56 PM
RegExp.exec() returns null when there is a match - a JavaScript RegExp bug? Uldis Bojars Javascript 2 12-17-2006 09:59 PM



Advertisments