RegExp performance?

 
 
Christian Sonne
02-25-2007
Long story short, I'm trying to find all ISBN-10 numbers in a multiline
string (approximately 10 pages of a normal book), and as far as I can
tell, the *correct* thing to match would be this:
".*\D*(\d{10}|\d{9}X)\D*.*"

(it should be noted that I've removed all '-'s in the string, because
they have a tendency to be mixed into ISBN's)

however, on my 3200+ amd64, running the following:

reISBN10 = re.compile(".*\D*(\d{10}|\d{9}X)\D*.*")
isbn10s = reISBN10.findall(contents)

(where contents is the string)

this takes about 14 minutes - and there are only one or two matches...

if I change this to match ".*[ ]*(\d{10}|\d{9}X)[ ]*.*" instead, I risk
losing results, but it runs in about 0.3 seconds

So what's the deal? - why would it take so long to run the correct one?
- especially when a slight modification makes it run as fast as I'd
expect from the beginning...


I'm sorry I cannot supply test data; in my case it comes from
copyrighted material - however, if it proves needed, I can probably
construct dummy data to illustrate the problem


Any and all guidance would be greatly appreciated,
kind regards
Christian Sonne

PS: be gentle - it's my first post here
 
Gabriel Genellina
02-25-2007
On Sun, 25 Feb 2007 05:21:49 -0300, Christian Sonne <(E-Mail Removed)>
wrote:

> Long story short, I'm trying to find all ISBN-10 numbers in a multiline
> string (approximately 10 pages of a normal book), and as far as I can
> tell, the *correct* thing to match would be this:
> ".*\D*(\d{10}|\d{9}X)\D*.*"


Why the .* at the start and end? You don't want to match those, and they
make your regexp slow.
You didn't say exactly what an ISBN-10 number looks like, but if you want
to match 10 digits, or 9 digits followed by an X:
reISBN10 = re.compile("\d{10}|\d{9}X")
That is, just the () group in your expression. But perhaps this other one
is better (I think it should be faster, but you should measure it):
reISBN10 = re.compile("\d{9}[\dX]")
("Nine digits followed by another digit or an X")

> if I change this to match ".*[ ]*(\d{10}|\d{9}X)[ ]*.*" instead, I risk
> losing results, but it runs in about 0.3 seconds


Using my suggested expressions you might match some garbage, but not lose
anything (except two ISBN numbers joined together without any separator in
between), assuming you have stripped all the "-"s, as you said.

> So what's the deal? - why would it take so long to run the correct one?
> - especially when a slight modification makes it run as fast as I'd
> expect from the beginning...


Those .* make the expression match a LOT of things at first, just to
discard them in the next step.

--
Gabriel Genellina

 
Marc 'BlackJack' Rintsch
02-25-2007
In <45e145a0$0$90264$(E-Mail Removed)>, Christian Sonne wrote:

> Long story short, I'm trying to find all ISBN-10 numbers in a multiline
> string (approximately 10 pages of a normal book), and as far as I can
> tell, the *correct* thing to match would be this:
> ".*\D*(\d{10}|\d{9}X)\D*.*"
>
> (it should be noted that I've removed all '-'s in the string, because
> they have a tendency to be mixed into ISBN's)
>
> however, on my 3200+ amd64, running the following:
>
> reISBN10 = re.compile(".*\D*(\d{10}|\d{9}X)\D*.*")
> isbn10s = reISBN10.findall(contents)
>
> (where contents is the string)
>
> this takes about 14 minutes - and there are only one or two matches...


First of all try to get rid of the '.*' at both ends of the regexp. Don't
let the re engine search for any characters that you are not interested in
anyway.

Then leave off the '*' after '\D'. It doesn't matter if there are
multiple non-digits before or after the ISBN; there just has to be at
least one. BTW, with the star it even matches *no* non-digit at all!

So the re looks like this: '\D(\d{10}|\d{9}X)\D'
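A quick check on toy data (not real ISBNs) shows that findall() then
returns just the captured group:

import re

reISBN10 = re.compile(r'\D(\d{10}|\d{9}X)\D')
sample = "foo 1234567890 bar 123456789X baz"
print(reISBN10.findall(sample))   # ['1234567890', '123456789X']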

Ciao,
Marc 'BlackJack' Rintsch
 
John Machin
02-25-2007
On Feb 25, 7:21 pm, Christian Sonne <(E-Mail Removed)> wrote:
> Long story short, I'm trying to find all ISBN-10 numbers in a multiline
> string (approximately 10 pages of a normal book), and as far as I can
> tell, the *correct* thing to match would be this:
> ".*\D*(\d{10}|\d{9}X)\D*.*"


All of those *s are making it work too hard.

Starting with your r".*\D*(\d{10}|\d{9}X)\D*.*" [you do have the
r"..." not just "....", don't you?]

Step 1: Lose the .* off each end -- this is meaningless in the context
of a search() or findall() and would slow the re engine down if it
doesn't optimise it away.

r"\D*(\d{10}|\d{9}X)\D*"

Step 2: I presume that the \D* at each (remaining) end is to ensure
that you don't pick up a number with 11 or more digits. You only need
to test that your presumed ISBN is not preceded/followed by ONE
suspect character. Is ABC1234567890DEF OK? I think not; I'd use \b
instead of \D

r"\b(\d{10}|\d{9}X)\b"

Step 3: Now that we have only \b (which matches 0 characters) at each
end of the ISBN, we can lose the capturing ()

r"\b\d{10}|\d{9}X\b"

Step 4: In the case of 123456789X, it fails on the X and then scans
the 123456789 again -- a tiny waste compared to all the * stuff, but
still worth fixing.

r"\b\d{9}[0-9X]\b"

Give that a whirl and let us know how correct and how fast it is.
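
For instance, a quick sanity check on made-up data:

import re

isbn10 = re.compile(r"\b\d{9}[0-9X]\b")
junk = "ok 1234567890 and 123456789X, not 12345678901 or ABC1234567890DEF"
print(isbn10.findall(junk))   # ['1234567890', '123456789X']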

>
> (it should be noted that I've removed all '-'s in the string, because
> they have a tendency to be mixed into ISBN's)
>
> however, on my 3200+ amd64, running the following:
>
> reISBN10 = re.compile(".*\D*(\d{10}|\d{9}X)\D*.*")


You should really get into the habit of using raw strings with re.

> isbn10s = reISBN10.findall(contents)
>
> (where contents is the string)
>
> this takes about 14 minutes - and there are only one or two matches...


How many actual matches and how many expected matches?

Note on "and there are only one or two matches": if your data
consisted only of valid ISBNs separated by a single space or comma, it
would run much faster. It is all the quadratic/exponential mucking
about with the in-between bits that slows it down. To demonstrate
this, try timing dummy data like "1234567890 " * 1000000 and
"123456789X " * 1000000 with your various regexes, and with the step1,
step2 etc regexes above.
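Roughly like this (only the original and the step 4 patterns shown; add
the intermediate ones the same way):

import re
import time

data = "1234567890 " * 1000000
for pattern in (r".*\D*(\d{10}|\d{9}X)\D*.*",   # original
                r"\b\d{9}[0-9X]\b"):            # step 4
    start = time.time()
    n = len(re.findall(pattern, data))
    print("%-30s %d matches, %.2f s" % (pattern, n, time.time() - start))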

>
> if I change this to match ".*[ ]*(\d{10}|\d{9}X)[ ]*.*" instead, I risk
> losing results, but it runs in about 0.3 seconds
>
> So what's the deal? - why would it take so long to run the correct one?


Because you have .*\D*, in other words 0 or more occurrences of almost
anything followed by 0 or more non-digits. Even
assuming it ignores the .*, it will find the longest possible sequence
of non-digits, then try to match the ISBN stuff. If it fails, it will
shorten that sequence of non-digits, try the ISBN stuff again, etc. etc.
until it matches the ISBN stuff or that sequence of non-digits is down
to zero length. It will do that for each character position in the
file contents. Why is it faster when you change \D to [ ]? Presumably
because in your data, sequences of non-digits are longer than
sequences of spaces, IOW there is less stuff to back-track over.


> - especially when a slight modification makes it run as fast as I'd
> expect from the beginning...
>
> I'm sorry I cannot supply test data, in my case, it comes from
> copyrighted material - however if it proves needed, I can probably
> construct dummy data to illustrate the problem


What you call "dummy data", I'd call "test data". You should make a
sample of "dummy data" and test that your regex is (a) finding all
ISBNs that it should (b) not reporting incorrect matches, *BEFORE*
being concerned with speed.
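
Something as small as this would do for a start (the sample text and
numbers are only for illustration):

import re

isbn10 = re.compile(r"\b\d{9}[0-9X]\b")
dummy = ("title page blah 0672328976 blah\n"
         "colophon 155860832X more blah\n"
         "phone number 0412345678901 should not be reported\n")
assert isbn10.findall(dummy) == ['0672328976', '155860832X']
print("looks OK")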

>
> Any and all guidance would be greatly appreciated,


For some light reading borrow a copy of Friedl's book (mentioned
in the Python re docs) and read the parts on how backtracking regex
engines work.

HTH,
John

 
Christian Sonne
02-25-2007
John Machin wrote:
> [...]
> Step 4: In the case of 123456789X, it fails on the X and then scans
> the 123456789 again -- a tiny waste compared to all the * stuff, but
> still worth fixing.
>
> r"\b\d{9}[0-9X]\b"
>
> Give that a whirl and let us know how correct and how fast it is.
> [...]

Thanks to all of you for your replies - they have been most helpful, and
my program is now running at a reasonable pace...


I ended up using r"\b\d{9}[0-9X]\b" which seems to do the trick - if it
turns out to misbehave in further testing, I'll know where to turn
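
Concretely, the only change from my original snippet is the pattern -
roughly:

reISBN10 = re.compile(r"\b\d{9}[0-9X]\b")   # re imported and contents built as before
isbn10s = reISBN10.findall(contents)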
 
Kirk Sluder
02-26-2007
In article <45e1d367$0$90273$(E-Mail Removed)>,
Christian Sonne <(E-Mail Removed)> wrote:

> Thanks to all of you for your replies - they have been most helpful, and
> my program is now running at a reasonable pace...
>
>
> I ended up using r"\b\d{9}[0-9X]\b" which seems to do the trick - if it
> turns out to misbehave in further testing, I'll know where to turn


Anything with variable-length wildcard matching (*, +, ?) is going to
drag your performance down. There was an earlier thread on this very
topic. Another (perhaps stupid) question: how are you planning on handling
ISBNs formatted with hyphens for readability?

In general I've found the following factors to be critical in
getting good performance from re:

1: Reducing the number of times you call re.match or re.search.
2: Reducing the number of bytes that re has to search through.
3: Minimizing the use of wildcards in the expression.

If you can pre-filter your input with string.find before running
re.match, you will improve performance quite a bit compared to
running re expressions over all 10 pages. I played around a bit and
attached some example code below searching over 21K of text for the
ISBN number. testPrefilter() runs in about 1/5th the execution time of
line-by-line re calls or of a single re call over a 21K string.
Interestingly, this ratio scales up to something as big as Moby Dick.

The searchLabels() function below beats all the other functions by
searching for "ISBN" or "International Book" and then using RE on
the surrounding 500 bytes. You might also try searching for
"Copyright" or "Library of Congress", since most modern books will
have it all on the same page.

A caveat here is that this only works if you can find a reasonably
unique string at or near what you want to find with re. If you need
to run re.search on every byte of the file anyway, this isn't going
to help.

---------------
timing test code
---------------

#!/usr/bin/env python

from timeit import Timer
import re

textString = """The text of a sample page using with an ISBN 10
number ISBN 0672328976 and some more text to compare."""

#add the full text of Moby Dick to make the search functions
#work for their bread.
fileHandle = open("./mobey.txt")
junkText = fileHandle.readlines()
fileHandle.close()
junkText.append(textString)
textString = ''.join(junkText)
#print textString

#compile the regex
isbn10Re = re.compile(r"\b\d{9}[0-9X]\b")

def testPrefilter():
    """Work through a pre-loaded array running re only on lines
    containing ISBN"""
    for line in junkText:
        #search for 'ISBN'
        if line.find('ISBN') > -1:
            thisMatch = isbn10Re.search(line)
            if thisMatch:
                return thisMatch.group(0)

def testNofilter():
    """Run re.search on every line."""
    for line in junkText:
        #searching using RE
        thisMatch = isbn10Re.search(line)
        if thisMatch:
            return thisMatch.group(0)

def testFullre():
    """Run re.search on a long text string."""
    thisMatch = isbn10Re.search(textString)
    if thisMatch:
        return thisMatch.group(0)

def searchLabels():
    #identify some text that might be near an ISBN number.
    isbnLabels = ["ISBN",
                  "International Book"]

    #use the fast string.find method to locate those
    #labels in the text
    isbnIndexes = [textString.find(x) for x in isbnLabels]

    #exclude labels not found in the text.
    isbnIndexes = [y for y in isbnIndexes if y > -1]

    #run re.search on a 500 character slice around each text label.
    for x in isbnIndexes:
        thisMatch = isbn10Re.search(textString[max(x - 250, 0):x + 250])
        if thisMatch:
            return thisMatch.group(0)


#print searchLabels()
#print testPrefilter()
#print testNofilter()
t = Timer("testNofilter()", "from __main__ import testNofilter")
print t.timeit(100)
u = Timer("testPrefilter()", "from __main__ import testPrefilter")
print u.timeit(100)
v = Timer("testFullre()", "from __main__ import testFullre")
print v.timeit(100)
w = Timer("searchLabels()", "from __main__ import searchLabels")
print w.timeit(100)
 
John Machin
02-26-2007
On Feb 26, 2:01 pm, Kirk Sluder <(E-Mail Removed)> wrote:
> In article <45e1d367$0$90273$(E-Mail Removed)>,
> Christian Sonne <(E-Mail Removed)> wrote:
>
> > Thanks to all of you for your replies - they have been most helpful, and
> > my program is now running at a reasonable pace...

>
> > I ended up using r"\b\d{9}[0-9X]\b" which seems to do the trick - if it
> > turns out to misbehave in further testing, I'll know where to turn

>
> Anything with variable-length wildcard matching (*, +, ?) is going to
> drag your performance down. There was an earlier thread on this very
> topic. Another (perhaps stupid) question: how are you planning on handling
> ISBNs formatted with hyphens for readability?


According to the OP's first message, 2nd paragraph:
"""
(it should be noted that I've removed all '-'s in the string, because
they have a tendency to be mixed into ISBN's)
"""

Given a low density of ISBNs in the text, it may well be better to
avoid the preliminary pass to rip out the '-'s, and instead:

1. use an RE like r"\b\d[-\d]{8,11}[\dX]\b" (allows up to 3 '-'s
inside the number)

2. post-process the matches: strip out any '-'s, check for remaining
length == 10.

Another thought for the OP: Consider (irrespective of how you arrive
at a candidate ISBN) validating the ISBN check-digit.
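
A rough sketch of both ideas together (the helper names and the sample
line are mine, purely for illustration):

import re

candidate = re.compile(r"\b\d[-\d]{8,11}[\dX]\b")

def is_valid_isbn10(isbn):
    """ISBN-10 check digit: weighted sum (weights 10 down to 1) must be
    divisible by 11; a trailing 'X' counts as 10."""
    total = 0
    for weight, ch in zip(range(10, 0, -1), isbn):
        total += weight * (10 if ch == 'X' else int(ch))
    return total % 11 == 0

def find_isbn10s(text):
    results = []
    for match in candidate.findall(text):
        isbn = match.replace('-', '')            # strip embedded '-'s
        if len(isbn) == 10 and is_valid_isbn10(isbn):
            results.append(isbn)
    return results

print(find_isbn10s("ISBN 0-672-32897-6, ISBN 155860832X, but not 1234567890"))
# -> ['0672328976', '155860832X']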

Cheers,
John

 