Program inefficiency?

hall.jeff@gmail.com
 
      09-29-2007
I wrote the following simple program to loop through our help files
and fix some errors (in case you can't see the subtle RE search that's
happening, we're replacing spaces in bookmarks with _'s)

the program works great except for one thing. It's significantly
slower through the later files in the search than through the early
ones... Before anyone criticizes, I recognize that the middle section
could be simplified with a for loop... I just haven't cleaned it
up...

The problem is that the first 300 files take about 10-15 seconds and
the last 300 take about 2 minutes... If we do more than about 1500
files in one run, it just hangs up and never finishes...

Is there a solution here that I'm missing? What am I doing that is so
inefficient?

# File: masseditor.py

import re
import os
import time

def massreplace():
    editfile = open("pathname\editfile.txt")
    filestring = editfile.read()
    filelist = filestring.splitlines()
##    errorcheck = re.compile('(a name=)+(.*)(-)+(.*)(></a>)+')
    for i in range(len(filelist)):
        source = open(filelist[i])
        starttext = source.read()
        interimtext = replacecycle(starttext)
        interimtext = replacecycle(interimtext)
        interimtext = replacecycle(interimtext)
        interimtext = replacecycle(interimtext)
        interimtext = replacecycle(interimtext)
        interimtext = replacecycle(interimtext)
        interimtext = replacecycle(interimtext)
        interimtext = replacecycle(interimtext)
        interimtext = replacecycle(interimtext)
        interimtext = replacecycle(interimtext)
        interimtext = replacecycle(interimtext)
        interimtext = replacecycle(interimtext)
        finaltext = replacecycle(interimtext)
        source.close()
        source = open(filelist[i], "w")
        source.write(finaltext)
        source.close()
##        if errorcheck.findall(finaltext) != []:
##            print errorcheck.findall(finaltext)
##            print filelist[i]
        if i == 100:
            print "done 100"
            print time.clock()
        elif i == 300:
            print "done 300"
            print time.clock()
        elif i == 600:
            print "done 600"
            print time.clock()
        elif i == 1000:
            print "done 1000"
            print time.clock()
    print "done"
    print i
    print time.clock()

def replacecycle(starttext):
    # NB: the posted copy defined p7/p8 twice, so the first pair was
    # silently shadowed, and the second pair's '(' literals were garbled
    # in transit; the second pair is reconstructed here as p11/p12.
    p1 = re.compile('(href=|HREF=)+(.*)(#)+(.*)( )+(.*)(">)+')
    p2 = re.compile('(name=")+(.*)( )+(.*)(">)+')
    p3 = re.compile('(href=|HREF=)+(.*)(#)+(.*)(\')+(.*)(">)+')
    p4 = re.compile('(name=")+(.*)(\')+(.*)(">)+')
    p5 = re.compile('(href=|HREF=)+(.*)(#)+(.*)(-)+(.*)(">)+')
    p6 = re.compile('(name=")+(.*)(-)+(.*)(">)+')
    p7 = re.compile('(href=|HREF=)+(.*)(#)+(.*)(<)+(.*)(">)+')
    p8 = re.compile('(name=")+(.*)(<)+(.*)(">)+')
    p9 = re.compile('(href=|HREF=")+(.*)(#)+(.*)(\?)+(.*)(">)+')
    p10 = re.compile('(name=")+(.*)(\?)+(.*)(">)+')
    p11 = re.compile('(href=|HREF=")+(.*)(#)+(.*)(\()+(.*)(">)+')
    p12 = re.compile('(name=")+(.*)(\()+(.*)(">)+')
    p100 = re.compile('(a name=)+(.*)(-)+(.*)(></a>)+')
    q1 = r"\1\2\3\4_\6\7"
    q2 = r"\1\2_\4\5"
    interimtext = p1.sub(q1, starttext)
    interimtext = p2.sub(q2, interimtext)
    interimtext = p3.sub(q1, interimtext)
    interimtext = p4.sub(q2, interimtext)
    interimtext = p5.sub(q1, interimtext)
    interimtext = p6.sub(q2, interimtext)
    interimtext = p7.sub(q1, interimtext)
    interimtext = p8.sub(q2, interimtext)
    interimtext = p9.sub(q1, interimtext)
    interimtext = p10.sub(q2, interimtext)
    interimtext = p11.sub(q1, interimtext)
    interimtext = p12.sub(q2, interimtext)
    interimtext = p100.sub(q2, interimtext)

    return interimtext

massreplace()

 
Grant Edwards
 
      09-29-2007
> [...]
> the program works great except for one thing. It's significantly
> slower through the later files in the search than through the early
> ones... Before anyone criticizes, I recognize that the middle section
> could be simplified with a for loop... I just haven't cleaned it
> up...
>
> The problem is that the first 300 files take about 10-15 seconds and
> the last 300 take about 2 minutes... If we do more than about 1500
> files in one run, it just hangs up and never finishes...
>
> Is there a solution here that I'm missing? What am I doing that is so
> inefficient?


The only thing I see is that you compile all of the RE's every
time you call replacecycle(). They really only need to be
compiled once, but I don't know why that would cause the
progressive slowing.

FWIW, it seems to me like a shell+sed script would be the
obvious solution to the problem.
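Something like this is what I mean by compiling once (a sketch, not
code from the thread; only two of the thirteen patterns are shown):

import re

# Compile once at module load instead of on every replacecycle() call.
P1 = re.compile('(href=|HREF=)+(.*)(#)+(.*)( )+(.*)(">)+')
P2 = re.compile('(name=")+(.*)( )+(.*)(">)+')
Q1 = r"\1\2\3\4_\6\7"
Q2 = r"\1\2_\4\5"

def replacecycle(text):
    # Reuse the precompiled pattern objects on every call.
    text = P1.sub(Q1, text)
    text = P2.sub(Q2, text)
    return text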

> [original program quoted in full; snipped]



--
Grant Edwards <grante at visi.com>
Yow! Are you still SEXUALLY ACTIVE? Did you BRING th' REINFORCEMENTS?
 
Carsten Haese
 
      09-29-2007
On Sat, 2007-09-29 at 15:22 +0000, hall.jeff@gmail.com wrote:
> [...]
> def replacecycle(starttext):
>     p1= re.compile('(href=|HREF=)+(.*)(#)+(.*)( )+(.*)(">)+')
>     [the remaining twelve pattern definitions snipped]
> [...]


One obvious opportunity for optimization is to compile those re's only
once at the beginning of the program instead of every time
replacecycle() is called (which is inexplicably called 13 times for each
file).
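For example (a sketch, not from the original post), a fixed-point loop
would make the number of passes self-adjusting:

def replace_until_stable(text):
    # Reapply replacecycle() until the text stops changing, instead of
    # hard-coding how many passes are "enough".
    while True:
        newtext = replacecycle(text)
        if newtext == text:
            return newtext
        text = newtext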

--
Carsten Haese
http://informixdb.sourceforge.net


 
 
hall.jeff@gmail.com
 
      09-29-2007
I did try moving the re.compile's up and out of replacecycle(),
but it didn't impact the time in any meaningful way (2 seconds
maybe)...

I'm not sure what a shell+sed script is... I'm fairly new to Python
and my only other coding experience is with VBA... This was my first
Python program.

In case it helps... We started with only 6 loops of replacecycle() but
had to keep adding progressively more as we found more and more links
with lots of spaces in them... As we did that, the program's runtime
grew, but only in proportion to the added number of cycles... This is
exactly what I would have expected, and it leads me to believe that
the problem does not lie in the replacecycle() def but in the
massreplace() def... *shrug*

 
 
Pablo Ziliani
 
      09-29-2007
hall.jeff@gmail.com wrote:
> Is there a solution here that I'm missing? What am I doing that is so
> inefficient?
>


Hi Jeff,

Yes, it seems you have plenty of performance leaks.
Please see my notes below.

> def massreplace():
>     editfile = open("pathname\editfile.txt")
>     filestring = editfile.read()
>     filelist = filestring.splitlines()
> ##    errorcheck = re.compile('(a name=)+(.*)(-)+(.*)(></a>)+')
>     for i in range(len(filelist)):
>         source = open(filelist[i])
>


Read this post:
http://mail.python.org/pipermail/pyt...st/275319.html
Instead of reading the whole document, storing it in a variable,
splitting it and then iterating, you could simply do:

def massreplace():
    editfile = open("pathname\editfile.txt")
    for source in editfile:


> starttext = source.read()
> interimtext = replacecycle(starttext)
> (...)
>


Excuse me, but this is insane. Do just one call (or none at all, I don't
see why you need to split this into two functions) and let the function
manage the replacement "layers".

I'm skipping the next part (don't want to understand all your logic now).

> (...)
>
> def replacecycle(starttext):
>



Unneeded, IMHO.

> p1= re.compile('(href=|HREF=)+(.*)(#)+(.*)( )+(.*)(">)+')
> (...)
> interimtext = p100.sub(q2, interimtext)
>


Same euphemism applies here. I might be wrong, but I'm pretty confident
you can make all this in one simple regex.
Anyway, although regexes are supposed to be cached, you don't need to
define them every time the function gets called. Do it once, outside
the function. At the very least you avoid one of the most important
performance hits in Python: function calls. Read this:
http://wiki.python.org/moin/PythonSpeed/PerformanceTips
Also, if you are parsing HTML, consider using BeautifulSoup or
ElementTree or something similar (especially if you don't feel
confident with regexes).
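For instance, if the goal is just to fix spaces inside href/name
attribute values, a single substitution with a callback can do all the
work in one pass (a sketch only; the quoted-attribute grammar, and
fixing the whole value rather than just the part after '#', are my
assumptions about the real files):

import re

# Match quoted href/name attribute values, case-insensitively.
ATTR = re.compile(r'(?i)\b(href|name)\s*=\s*"([^"]*)"')

def fix_bookmarks(text):
    def repl(m):
        # Replace every space in the attribute value at once.
        return '%s="%s"' % (m.group(1), m.group(2).replace(' ', '_'))
    return ATTR.sub(repl, text)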


Hope you find this helpful.
Pablo
 
 
Grant Edwards
 
      09-29-2007
On 2007-09-29, hall.jeff@gmail.com wrote:

> I'm not sure what an shell+sed script is...


http://tldp.org/LDP/Bash-Beginners-G...#sect_05_01_01
http://tldp.org/LDP/Bash-Beginners-G...l/chap_05.html
http://www.grymoire.com/Unix/Sed.html

http://www.gnu.org/software/bash/
http://en.wikipedia.org/wiki/Bash

Unfortunately it appears you're using Windows (a particularly bad
choice for this sort of file processing). You can, however,
get bash and sed for Windows if you wish:

http://www.cygwin.com/

> In case it helps... We started with only 6 loops of replacecycle() but
> had to keep adding progressively more as we found more and more links
> with lots of spaces in them...


I would think with the correct RE's you'd only have to call it
once.
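To see why the extra passes were needed at all: each sub() with these
patterns fixes only one run of spaces per bookmark, so a value with
several spaces needs several passes. A quick demonstration (not from
the thread):

import re

p2 = re.compile('(name=")+(.*)( )+(.*)(">)+')
q2 = r"\1\2_\4\5"

text = '<a name="a b c">'
for _ in range(3):
    text = p2.sub(q2, text)
    print text
# prints: <a name="a b_c">, then <a name="a_b_c">, then unchanged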

> As we did that, the program's time grew progressively longer
> but the length grew multiplicatively with the added number of
> cycles... This is exactly what I would have expected and it
> leads me to believe that the problem does not lie in the
> replacecycle() def but in the masseditor() def... *shrug*


As the program runs on progressively more files does the
process's memory usage grow without bounds? Does the machine
start swapping?

--
Grant Edwards <grante at visi.com>
Yow! I'm pretending that we're all watching PHIL SILVERS instead of
RICARDO MONTALBAN!
 
 
hall.jeff@gmail.com
 
      09-29-2007
no swaps... memory usage is about 14k (these are small HTML files)...
no hard drive cranking away or fan on my laptop going nutty... CPU
usage isn't even pegged... that's what makes me think it's not some
sort of bizarre memory leak... Unfortunately, it also means I'm out of
ideas...

 
 
hall.jeff@gmail.com
 
      09-29-2007
For anyone that cares, I figured out the "problem"... the webhelp
files that it hits the wall on are the compiled search files... They
are the only files in the system with lines that are RIDICULOUS in
length... I'm looking at one right now that has 32767 characters all
on one line...

I'm absolutely certain that that's the problem...
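A rough way to demonstrate the effect (hypothetical timing code, not
from the thread): the nested (.*) groups backtrack heavily when one
very long line almost matches, while the same text split into short
lines fails fast:

import re, time

# A long "near miss": contains href=, '#', and spaces, but no closing
# '">', so the nested (.*) groups backtrack before the match fails.
p1 = re.compile('(href=|HREF=)+(.*)(#)+(.*)( )+(.*)(">)+')
nearmiss = 'href=# ' + 'x ' * 5000

for label, text in [('one long line:', nearmiss),
                    ('short lines:', '\n'.join([nearmiss[:80]] * 125))]:
    start = time.clock()
    p1.sub(r"\1\2\3\4_\6\7", text)
    print label, time.clock() - start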

Thanks for everyone's help

 
 
thebjorn
 
      09-29-2007
On Sep 29, 5:22 pm, hall.jeff@gmail.com wrote:
> I wrote the following simple program to loop through our help files
> and fix some errors (in case you can't see the subtle RE search that's
> happening, we're replacing spaces in bookmarks with _'s)
>
> the program works great except for one thing. It's significantly
> slower through the later files in the search than through the early
> ones... Before anyone criticizes, I recognize that the middle section
> could be simplified with a for loop... I just haven't cleaned it
> up...
>
> The problem is that the first 300 files take about 10-15 seconds and
> the last 300 take about 2 minutes... If we do more than about 1500
> files in one run, it just hangs up and never finishes...
>
> Is there a solution here that I'm missing? What am I doing that is so
> inefficient?


Ugh, that was entirely too many regexps for my taste

How about something like:

def attr_ndx_iter(txt, attribute):
    "Return all the start and end indices for the values of attribute."
    txt = txt.lower()
    attribute = attribute.lower() + '='
    alen = len(attribute)
    chunks = txt.split(attribute)
    if len(chunks) == 1:
        return

    start = len(chunks[0]) + alen
    end = -1

    for chunk in chunks[1:]:
        qchar = chunk[0]
        end = start + chunk.index(qchar, 1)
        yield start + 1, end
        start += len(chunk) + alen

def substr_map(txt, indices, fn):
    "Apply fn to text within indices."
    res = []
    cur = 0

    for i, j in indices:
        res.append(txt[cur:i])
        res.append(fn(txt[i:j]))
        cur = j

    res.append(txt[cur:])
    return ''.join(res)

def transform(s):
    "The transformation to do on the attribute values."
    return s.replace(' ', '_')

def zap_spaces(txt, *attributes):
    for attr in attributes:
        txt = substr_map(txt, attr_ndx_iter(txt, attr), transform)
    return txt

def mass_replace():
    import sys
    w = sys.stdout.write

    for f in open(r'pathname\editfile.txt'):
        f = f.strip()  # drop the trailing newline
        try:
            # Read first, then reopen for writing: opening with 'w'
            # truncates the file, so reading afterwards would get ''.
            fixed = zap_spaces(open(f).read(), 'href', 'name')
            open(f, 'w').write(fixed)
            w('.')  # progress-meter
        except:
            print 'Error processing file:', f

minimally-tested'ly y'rs
-- bjorn

 
 
Pablo Ziliani
 
      09-29-2007
thebjorn wrote:
> On Sep 29, 5:22 pm, hall.jeff@gmail.com wrote:
>
>> I wrote the following simple program to loop through our help files
>> and fix some errors (in case you can't see the subtle RE search that's
>> happening, we're replacing spaces in bookmarks with _'s)
>> (...)
>>

>
> Ugh, that was entirely too many regexps for my taste
>
> How about something like:
>
> def attr_ndx_iter(txt, attribute):
> (...)
> def substr_map(txt, indices, fn):
> (...)
> def transform(s):
> (...)
> def zap_spaces(txt, *attributes):
> (...)
> def mass_replace():
> (...)


Oh yeah, now it's clear as mud.
I do think that the whole program shouldn't take more than 10 lines of
code using one sensible regex (impossible to define without knowing the
real input and output formats).
And (sorry to tell) I'm convinced this is a problem for regexes, in
spite of anybody's personal taste.

Pablo
 