Velocity Reviews > Newsgroups > Programming > Python > help make it faster please

help make it faster please

 
 
pkilambi@gmail.com
 
      11-10-2005
I wrote this function which does the following:
after reading lines from a file, it splits them and counts word occurrences
using a hash table... for some reason this is quite slow.. can someone
help me make it faster...

import string

f = open(filename)
lines = f.readlines()

def create_words(lines):
    cnt = 0
    spl_set = '[",;<>{}_&?!():-[\.=+*\t\n\r]+'
    for content in lines:
        words = content.split()
        countDict = {}
        wordlist = []
        for w in words:
            w = string.lower(w)
            if w[-1] in spl_set: w = w[:-1]
            if w != '':
                if countDict.has_key(w):
                    countDict[w] = countDict[w] + 1
                else:
                    countDict[w] = 1
                wordlist = countDict.keys()
                wordlist.sort()
        cnt += 1
        if countDict != {}:
            for word in wordlist:
                print word + ' ' + str(countDict[word]) + '\n'

 
bonono@gmail.com
 
      11-10-2005
why reload wordlist and sort it after each word processing ? seems that
it can be done after the for loop.
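
In other words, something like this (a minimal sketch of the suggestion; the function name is mine, not code from the thread): count all the words first, then sort the keys once per paragraph instead of once per word.

```python
def count_paragraph(words):
    """Count word occurrences, sorting the keys only once at the end."""
    countDict = {}
    for w in words:
        w = w.lower()
        if w:
            countDict[w] = countDict.get(w, 0) + 1
    # the sort happens once per paragraph, not once per word
    return sorted(countDict.items())
```

e.g. `count_paragraph("the cat the".split())` gives `[('cat', 1), ('the', 2)]`.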

http://www.velocityreviews.com/forums/(E-Mail Removed) wrote:
> I wrote this function which does the following:
> after readling lines from file.It splits and finds the word occurences
> through a hash table...for some reason this is quite slow..can some one
> help me make it faster...
> [quoted code snipped]


 
pkilambi@gmail.com
 
      11-10-2005
Actually I create a separate wordlist for each so-called line. Here by
"line" I mean what will become a paragraph in future... so I will have to
recreate the wordlist on each loop.

 
pkilambi@gmail.com
 
      11-10-2005
Oh sorry, the indentation got messed up here... the
wordlist = countDict.keys()
wordlist.sort()
should be outside the word loop... now:

def create_words(lines):
    cnt = 0
    spl_set = '[",;<>{}_&?!():-[\.=+*\t\n\r]+'
    for content in lines:
        words = content.split()
        countDict = {}
        wordlist = []
        for w in words:
            w = string.lower(w)
            if w[-1] in spl_set: w = w[:-1]
            if w != '':
                if countDict.has_key(w):
                    countDict[w] = countDict[w] + 1
                else:
                    countDict[w] = 1
        wordlist = countDict.keys()
        wordlist.sort()
        cnt += 1
        if countDict != {}:
            for word in wordlist:
                print word + ' ' + str(countDict[word]) + '\n'

OK, now this is the correct question I am asking...

 
bonono@gmail.com
 
      11-10-2005
I don't know your intent, so I have no idea what it is for. However, you are
doing:

wordlist = countDict.keys()
wordlist.sort()

for every word processed, yet you don't use the contents of wordlist in any
way during the loop. Even if you need a fresh snapshot of countDict after
each word, I don't see the need for sorting. Seems like wasted cycles to me.

(E-Mail Removed) wrote:
> Actually I create a seperate wordlist for each so called line.Here line
> I mean would be a paragraph in future...so I will have to recreate the
> wordlist for each loop


 
Lonnie Princehouse
 
      11-10-2005
You're making a new countDict for each line read from the file... is
that what you meant to do? Or are you trying to count word occurrences
across the whole file?

--

In general, any time string manipulation is going slowly, ask yourself,
"Can I use the re module for this?"

# disclaimer: untested code. probably contains typos

import re
word_finder = re.compile('[a-z0-9_]+', re.I)

def count_words(string, word_finder=word_finder):  # avoid global lookups
    countDict = {}
    for match in word_finder.finditer(string):
        word = match.group(0)
        countDict[word] = countDict.get(word, 0) + 1
    return countDict

f = open(filename)
for i, line in enumerate(f.xreadlines()):
    countDict = count_words(line)
    print "Line %s" % i
    for word in sorted(countDict.keys()):
        print "  %s %s" % (word, countDict[word])
f.close()

 
bearophileHUGS@lycos.com
 
      11-10-2005
This can be faster; it avoids doing the same things multiple times:

from string import maketrans, ascii_lowercase, ascii_uppercase

def create_words(afile):
    stripper = """'[",;<>{}_&?!():[]\.=+-*\t\n\r^%0123456789/"""
    mapper = maketrans(stripper + ascii_uppercase,
                       " " * len(stripper) + ascii_lowercase)
    countDict = {}
    for line in afile:
        for w in line.translate(mapper).split():
            if w:
                if w in countDict:
                    countDict[w] += 1
                else:
                    countDict[w] = 1
    word_freq = countDict.items()
    word_freq.sort()
    for word, freq in word_freq:
        print word, freq

create_words(file("test.txt"))


If you can load the whole file in memory then it can be made a little
faster...
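
For instance, a rough sketch of that whole-file idea (my code, not bearophile's; written with the Python 3 spelling `str.maketrans`, where the post above uses Python 2's `string.maketrans`): one read, one translate, one split.

```python
def count_whole_file(text):
    # map punctuation and digits to spaces, then split the whole
    # text and count in a single pass
    stripper = '",;<>{}_&?!():[].=+-*^%0123456789/'
    mapper = str.maketrans(stripper, ' ' * len(stripper))
    counts = {}
    for w in text.translate(mapper).lower().split():
        counts[w] = counts.get(w, 0) + 1
    return counts
```

Called as, say, `count_whole_file(open("test.txt").read())`, so the translate and split run once over the whole text instead of once per line.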

Bear hugs,
bearophile

 
Larry Bates
 
      11-10-2005
(E-Mail Removed) wrote:
> I wrote this function which does the following:
> after readling lines from file.It splits and finds the word occurences
> through a hash table...for some reason this is quite slow..can some one
> help me make it faster...
> [quoted code snipped]

The way this is written you create a new countDict object for every line
of the file; it's not clear that this is what you meant to do.

Also, you are sorting wordlist for every line, not just once for the
entire file, because the sort is inside the loop that processes lines.

There is also some extra work in testing for an empty dictionary:

wordlist = countDict.keys()

then

if countDict != {}:
    for word in wordlist:

If countDict is empty then wordlist will be empty, so testing for it is
unnecessary.

You also increment cnt but never use it.

I don't think spl_set will do what you want, but I haven't modified it.
To split on all of those characters you are going to need to use regular
expressions, not split().
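
One sketch of that regex idea (the pattern below is my guess at the intended character set, not code from this post): put the whole set into a character class and let re.split do the splitting in one step.

```python
import re

# split on any run of punctuation or whitespace; '-' sits last in the
# class so it is taken literally rather than forming a range
splitter = re.compile(r'[",;<>{}_&?!():\[\].=+*\s-]+')

def split_words(line):
    # re.split leaves empty strings at the edges; drop them
    return [w for w in splitter.split(line.lower()) if w]
```

e.g. `split_words('Foo, bar (baz)!')` yields `['foo', 'bar', 'baz']`.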


Modified code:

def create_words(lines):
    spl_set = '[",;<>{}_&?!():-[\.=+*\t\n\r]+'
    countDict = {}
    for content in lines:
        words = content.split()
        for w in words:
            w = w.lower()
            if w[-1] in spl_set: w = w[:-1]
            if w:
                if countDict.has_key(w):
                    countDict[w] = countDict[w] + 1
                else:
                    countDict[w] = 1
    return countDict


import time
filename = r'C:\cygwin\usr\share\vim\vim63\doc\version5.txt'
f = open(filename)
lines = f.readlines()
start_time = time.time()
countDict = create_words(lines)
stop_time = time.time()
elapsed_time = stop_time - start_time
wordlist = countDict.keys()
wordlist.sort()
for word in wordlist:
    print "word=%s count=%i" % (word, countDict[word])

print "Elapsed time in create_words function=%.2f seconds" % elapsed_time

I ran this against a 551K text file and it runs in 0.11 seconds
on my machine (3.0 GHz P4).

Larry Bates
 
pkilambi@gmail.com
 
      11-10-2005
ok this sounds much better.. could you tell me what to do if I want to
keep characters like @ in words? So I would like to consider @ as part
of a word.

 
Lonnie Princehouse
 
      11-10-2005
The word_finder regular expression defines what will be considered a
word.

"[a-z0-9_]" means "match a single character from the set {a through z,
0 through 9, underscore}".
The + means "match as many as you can, minimum of one"

To match @ as well, add it to the set of characters to match:

word_finder = re.compile('[a-z0-9_@]+', re.I)

The re.I flag makes the expression case insensitive.
See the documentation for re for more information.


Also: it looks like I forgot to lowercase matched words. The line
    word = match.group(0)
should read:
    word = match.group(0).lower()
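
Putting both changes together (a quick sketch; `find_words` is my name for it, not from the post):

```python
import re

# '@' added to the character class, and matches lowercased as
# corrected above
word_finder = re.compile('[a-z0-9_@]+', re.I)

def find_words(text):
    return [m.group(0).lower() for m in word_finder.finditer(text)]
```

e.g. `find_words('Mail Me@Example_1 today!')` yields `['mail', 'me@example_1', 'today']`.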

 