Help with script with performance problems

 
 
Dennis Roberts
11-23-2003
I have a script to parse a DNS querylog and generate some statistics.
For a 750MB file, a Perl script using the same methods (splits) can
parse the file in 3 minutes. My Python script takes 25 minutes. That
is enough of a difference that unless I can figure out what I did
wrong, or find a better way of doing it, I might not be able to use
Python (since most of what I do is parsing various logs). The main
reason to try Python is that I had to look at some early scripts I
wrote in Perl and had no idea what the hell I was thinking or what the
script even did! After some googling and reading Eric Raymond's essay
on Python, I jumped in. Here is my script. I am looking for
constructive comments - please don't bash my newbie code.

#!/usr/bin/python -u

import string
import sys

clients = {}
queries = {}
count = 0

print "Each dot is 100000 lines..."

f = sys.stdin

while 1:

    line = f.readline()

    if count % 100000 == 0:
        sys.stdout.write(".")

    if line:
        splitline = string.split(line)

        try:
            (month, day, time, stype, source, qtype, query, ctype,
             record) = splitline
        except:
            print "problem splitting line", count
            print line
            break

        try:
            words = string.split(source, '#')
            source = words[0]
        except:
            print "problem splitting source", count
            print line
            break

        if clients.has_key(source):
            clients[source] = clients[source] + 1
        else:
            clients[source] = 1

        if queries.has_key(query):
            queries[query] = queries[query] + 1
        else:
            queries[query] = 1

    else:
        print
        break

    count = count + 1

f.close()

print count, "lines processed"

for numclient, count in clients.items():
    if count > 100000:
        print "%s,%s" % (numclient, count)

for numquery, count in queries.items():
    if count > 100000:
        print "%s,%s" % (numquery, count)
 
Ville Vainio
11-23-2003
(E-Mail Removed) (Dennis Roberts) writes:

> is enough of a difference that unless I can figure out what I did
> wrong or a better way of doing it I might not be able to use python
> (since most of what I do is parsing various logs). The main reason to


Isn't parsing logs a batch-oriented thing, where 20 extra minutes
wouldn't matter all that much? Log parsing is the home field of Perl,
so Python probably can't match its performance there, but Python's
other advantages might still make you want to avoid going back to
Perl. As long as it's 'efficient enough', who cares?

> f = sys.stdin


Have you tried using a normal file instead of stdin? BTW, you can
iterate over a file easily with "for line in open("mylog.log"):". ISTR
it's also more efficient than readline()s, because it caches the
lines instead of reading them one by one. You can also get the line
numbers by doing "for linenum, line in enumerate(open("mylog.log")):"
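
For instance, a minimal sketch of those two idioms applied to a log
like yours (the filename "query.log" is just a placeholder):

import sys

for linenum, line in enumerate(open("query.log")):
    if linenum % 100000 == 0:       # progress dot, as in your script
        sys.stdout.write(".")
    fields = line.split()           # whitespace split via the string method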


> splitline = string.split(line)


Do not use the 'string' module (it's deprecated); use string methods
instead: line.split()

> clients[source] = clients[source] + 1


clients[source] += 1

or another way to handle the common 'add 1, might not exist' idiom:


clients[source] = 1 + clients.get(source,0)

See http://aspn.activestate.com/ASPN/Coo...n/Recipe/66516
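
Pulling those suggestions together, a rough sketch of the counting
loop (assuming the nine whitespace-separated fields from your script;
"query.log" is again a placeholder):

clients = {}
queries = {}
for line in open("query.log"):
    fields = line.split()
    source = fields[4].split('#')[0]   # client address is the 5th field
    query = fields[6]                  # query name is the 7th field
    clients[source] = clients.get(source, 0) + 1
    queries[query] = queries.get(query, 0) + 1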


--
Ville Vainio http://www.students.tut.fi/~vainio24
 
Miki Tebeka
11-23-2003
Hello Dennis,

A general note: Use the "hotshot" module to find where you spend most of your time.
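
Something like this minimal sketch (assuming the parsing is wrapped in
a main() function; "parse.prof" is just a scratch file name):

import hotshot, hotshot.stats

prof = hotshot.Profile("parse.prof")
prof.runcall(main)                  # run the wrapped parser under the profiler
prof.close()

stats = hotshot.stats.load("parse.prof")
stats.sort_stats("time")
stats.print_stats(20)               # show the 20 biggest time consumers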

> splitline = string.split(line)

My guess is that if you use the "re" module, things will be much faster.

import re
ws_split = re.compile(r"\s+").split
...
splitline = ws_split(line)
...

HTH.

Miki
 
Paul Clinch
11-23-2003
(E-Mail Removed) (Miki Tebeka) wrote in message news:<(E-Mail Removed)>...
> Hello Dennis,
>
> A general note: Use the "hotshot" module to find where you spend most of your time.
>
> > splitline = string.split(line)

> My guess is that if you use the "re" module, things will be much faster.
>
> import re
> ws_split = re.compile(r"\s+").split
> ...
> splitline = ws_split(line)
> ...
>
> HTH.
>
> Miki



An alternative in Python 2.3 is the timeit module; the following is
extracted from the docs:
import timeit

timer1 = timeit.Timer('unicode("abc")')
timer2 = timeit.Timer('"abc" + u""')

# Run three trials
print timer1.repeat(repeat=3, number=100000)
print timer2.repeat(repeat=3, number=100000)

# On my laptop this outputs:
# [0.36831796169281006, 0.37441694736480713, 0.35304892063140869]
# [0.17574405670166016, 0.18193507194519043, 0.17565798759460449]

Regards Paul Clinch
 
Dennis Roberts
11-23-2003
Ville Vainio <(E-Mail Removed)> wrote in message news:<(E-Mail Removed)>...
> > f = sys.stdin

>
> Have you tried using a normal file instead of stdin? BTW, you can
> iterate over a file easily by "for line in open("mylog.log"):". ISTR
> it's also more efficient than readline()s, because it caches the
> lines instead of reading them one by one. You can also get the line
> numbers by doing "for linenum, line in enumerate(open("mylog.log")):"
>


I have a 240207-line sample log file that I test with. The script I
submitted parsed it in 18 seconds. My Perl script parsed it in 4
seconds.

The new Python script, using a normal file as suggested above, does it
in 3 seconds!

Changed "f = sys.stdin" to "f = open('sample', 'r')".

Thanks Ville!

Note: I made the other changes one at a time as well; the file open
change was the only one that made it faster.
 
Aahz
11-23-2003
In article <(E-Mail Removed)>,
Dennis Roberts <(E-Mail Removed)> wrote:
>
>I have a script to parse a DNS querylog and generate some statistics.
>For a 750MB file, a Perl script using the same methods (splits) can
>parse the file in 3 minutes. My Python script takes 25 minutes. That
>is enough of a difference that unless I can figure out what I did
>wrong, or find a better way of doing it, I might not be able to use
>Python (since most of what I do is parsing various logs). The main
>reason to try Python is that I had to look at some early scripts I
>wrote in Perl and had no idea what the hell I was thinking or what the
>script even did! After some googling and reading Eric Raymond's essay
>on Python, I jumped in. Here is my script. I am looking for
>constructive comments - please don't bash my newbie code.


If you haven't yet, make sure you upgrade to Python 2.3; there are a lot
of speed enhancements. Also, it allows you to switch to idioms that work
more like Perl's:

for line in f:
    fields = line.split()
    ...

Generally speaking, contrary to what another poster suggested, string
methods will almost always be faster than regexes (assuming that a
string method does what you want directly, of course; using multiple
string methods may or may not be faster than regexes).
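
A quick way to check this on your own data is the timeit module from
2.3 (the sample line below is made up, not from the real log):

import timeit

setup = (
    "import re\n"
    "line = 'Nov 23 10:12:01 client 10.0.0.1#1234 query q.example.com IN A'\n"
    "ws_split = re.compile(r'\\s+').split\n"
)

# Compare the plain string method against a precompiled regex split.
print timeit.Timer("line.split()", setup).repeat(3, 100000)
print timeit.Timer("ws_split(line)", setup).repeat(3, 100000)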
--
Aahz ((E-Mail Removed)) <*> http://www.pythoncraft.com/

Weinberg's Second Law: If builders built buildings the way programmers wrote
programs, then the first woodpecker that came along would destroy civilization.
 
Peter Otten
11-23-2003
Dennis Roberts wrote:

> I have a script to parse a DNS querylog and generate some statistics.
> For a 750MB file, a Perl script using the same methods (splits) can
> parse the file in 3 minutes. My Python script takes 25 minutes. That
> is enough of a difference that unless I can figure out what I did
> wrong, or find a better way of doing it, I might not be able to use
> Python (since most of what I do is parsing various logs). The main
> reason to try Python is that I had to look at some early scripts I
> wrote in Perl and had no idea what the hell I was thinking or what
> the script even did! After some googling and reading Eric Raymond's
> essay on Python, I jumped in. Here is my script. I am looking for
> constructive comments - please don't bash my newbie code.


Below is my version of your script. It tries to use more idiomatic Python
and is about 20%t faster on some bogus data - but nowhere near to close the
performance gap you claim to the perl script.
However, it took 143 seconds to process 10**7 lines generated by

<makesample.py>
import itertools, sys

sample = "%dmonth day time stype source%d#sowhat qtype %dquery ctype record"
thousand = itertools.cycle(range(1000))
hundred = itertools.cycle(range(100))

out = file(sys.argv[1], "w")
try:
    try:
        count = int(sys.argv[2])
    except IndexError:
        count = 10**7
    for i in range(count):
        print >> out, sample % (i, thousand.next(), hundred.next())
finally:
    out.close()
</makesample.py>

with Python 2.3.2 on my 2.6GHz P4. Would that mean Perl would do it in
17 seconds? Anyway, the performance problem is more likely your
computer; Python should be fast enough for the purpose.

Peter

<parselog.py>
#!/usr/bin/python -u
#Warning, not seriously tested
import sys

#import time
#starttime = time.time()

clients = {}
queries = {}
lineNo = -1

threshold = 100
pointmod = 100000

f = file(sys.argv[1])
try:
    print "Each dot is %d lines..." % pointmod
    for lineNo, line in enumerate(f):
        if lineNo % pointmod == 0:
            sys.stdout.write(".")

        try:
            (month, day, timestr, stype, source, qtype, query, ctype,
             record) = line.split()
        except ValueError:
            raise Exception("problem splitting line %d\n%s" % (lineNo, line))

        source = source.split('#', 1)[0]

        clients[source] = clients.get(source, 0) + 1
        queries[query] = queries.get(query, 0) + 1
finally:
    f.close()

print
print lineNo+1, "lines processed"

for numclient, count in clients.iteritems():
    if count > threshold:
        print "%s,%s" % (numclient, count)

for numquery, count in queries.iteritems():
    if count > threshold:
        print "%s,%s" % (numquery, count)

#print "time:", time.time() - starttime
</parselog.py>
 
Peter Otten
11-23-2003
Peter Otten wrote:

> However, it took 143 seconds to process 10**7 lines generated by


I just downloaded psycho, oops, I keep misspelling the name, and it
brings the time down to 92 seconds - almost for free. I must say I'm
impressed; the psycologist(s) did an excellent job.

Peter

#!/usr/bin/python -u
import psyco, sys
psyco.full()

def main():
    clients = {}
    queries = {}
    lineNo = -1

    threshold = 100
    pointmod = 100000

    f = file(sys.argv[1])
    try:
        print "Each dot is %d lines..." % pointmod
        for lineNo, line in enumerate(f):
            if lineNo % pointmod == 0:
                sys.stdout.write(".")

            try:
                (month, day, timestr, stype, source, qtype, query, ctype,
                 record) = line.split()
            except ValueError:
                raise Exception("problem splitting line %d\n%s" % (lineNo, line))

            source = source.split('#', 1)[0]

            clients[source] = clients.get(source, 0) + 1
            queries[query] = queries.get(query, 0) + 1
    finally:
        f.close()

    print
    print lineNo+1, "lines processed"

    for numclient, count in clients.iteritems():
        if count > threshold:
            print "%s,%s" % (numclient, count)

    for numquery, count in queries.iteritems():
        if count > threshold:
            print "%s,%s" % (numquery, count)

import time
starttime = time.time()
main()
print "time:", time.time() - starttime

 