![]() |
Re: "Newbie" questions - "unique" sorting ?
Cousin Stanley wrote:
> John .... > > { 1. Good News | 2. Bad News | 3. Good News } .... > > 1. Good News .... > > The last version of word_list.py that I up-loaded > works as expected with your input file producing > an indexed word list with no duplicates ... > > > 2. Bad News .... > > It was S L O W E R than the proverbial turtle > in the tar pit ... > How about the simple approach? total=0 unique={} for line in open(infile,"rb").xreadlines(): for word in line.strip().split(): unique[word]=unique.get(word,0)+1 total+=1 uwords=unique.keys() uwords.sort() open(outfile,"wb").write('\n'.join(uwords)) -- Med Venlig Hilsen / Regards Kim Petersen - Kyborg A/S (Udvikling) IT - Innovationshuset Havneparken 2 7100 Vejle Tlf. +4576408183 || Fax. +4576408188 |
Re: "Newbie" questions - "unique" sorting ?
| How about the simple approach?
| ... Kim ... The approach I used is fairly simple and similar to the one you posted, basically just stuffing words from split lines into a dictionary ... The following script produces an indexed list fairly quickly for relatively small files, but dogged out on the input file John supplied which yielded ... Total Words .... 467381 Unique Words .... 47122 Perhaps skipping the dictionary word count update in the following line might speed things up ... else : dict_words[ this_word ] += 1 -- Cousin Stanley Human Being Phoenix, Arizona ------------------------------------------------------------------- ''' Module ........... word_list.py Usage ............ python word_list.py File_In.txt File_Out.txt NewsGroup ........ comp.lang.python Date ............. 2003-06-18 Posted_By ........ John Fitzsimmons Replies_From ..... [ kpop , Erik Max Francis ] Coded_By ......... Stanley C. Kitching ''' import math import sys import time time_in = time.time() NL = '\n' module_name = sys.argv[ 0 ] print '%s %s ' % ( NL , module_name ) path_in = sys.argv[ 1 ] path_out = sys.argv[ 2 ] file_in = file( path_in , 'r' ) file_out = file( path_out , 'w' ) word_total = 0 dict_words = {} print ' Indexing Words .... ' , for iLine in file_in : if math.fmod( word_total , 1000 ) == 0 : print '.' , list_words = iLine.strip().split() for this_word in list_words : if this_word not in dict_words.keys() : dict_words[ this_word ] = 1 else : dict_words[ this_word ] += 1 word_total += 1 list_words = dict_words.keys() list_words.sort( lambda x , y : cmp( x.lower() , y.lower() ) ) print NL print ' Writing Output File ....' , for this_word in list_words : word_count = dict_words[ this_word ] str_out = '%6d %s %s' % ( word_count , this_word , NL ) file_out.write( str_out ) word_str = '%s Total Words .... %d %s' % ( NL , word_total , NL ) keys_total = len( dict_words.keys() ) keys_str = '%s Unique Words .... %d %s' % ( NL , keys_total , NL ) file_out.write( word_str ) file_out.write( keys_str ) print NL print ' Complete .................' print ' Total Words ....' , word_total print ' Unique Words ....' , keys_total file_in.close() file_out.close() time_out = time.time() time_diff = time_out - time_in print NL print ' Process Time ........ %-6.2f Seconds' % ( time_diff ) |
Re: "Newbie" questions - "unique" sorting ?
> totally skipping the word count should speed it up - and i believe that
> the approach of dict_words[this_word]=dict_words.get(this_word,0) btw. > is a bit faster than doing has_key() or the least of doing: > > if this_word not in dict_words.keys() : > > which aught to be extremely slow on a large dictionary (creating and > dropping lists of thousands + doing a O(n) search over it). And that may > very well be the culprit of the slow run you see.... > > > > > else : > > > > dict_words[ this_word ] += 1 > > > > this is how i check if a key is in a dictionary: if this_word not in dict_words: i believe this doesn't do an O(n) search over it. it's supposed to constant time for every lookup by way of a hash lookup. bryan |
Re: "Newbie" questions - "unique" sorting ?
Bryan wrote:
>>totally skipping the word count should speed it up - and i believe that >>the approach of dict_words[this_word]=dict_words.get(this_word,0) btw. >>is a bit faster than doing has_key() or the least of doing: >> >> if this_word not in dict_words.keys() : >> >>which aught to be extremely slow on a large dictionary (creating and >>dropping lists of thousands + doing a O(n) search over it). And that may >>very well be the culprit of the slow run you see.... > this is how i check if a key is in a dictionary: > > if this_word not in dict_words: > > i believe this doesn't do an O(n) search over it. it's supposed to constant > time for every lookup by way of a hash lookup. Which is probably correct since that would be equivalent to: not dict_words.has_key(this_word) But if you look at the above code mentioned - it doesn't do that - it calls .keys() [generating a list] and then looks if the string is in it... > > bryan > > -- Med Venlig Hilsen / Regards Kim Petersen - Kyborg A/S (Udvikling) IT - Innovationshuset Havneparken 2 7100 Vejle Tlf. +4576408183 || Fax. +4576408188 |
Re: "Newbie" questions - "unique" sorting ?
| Which is probably correct since that would be equivalent to:
| not dict_words.has_key(this_word) | | But if you look at the above code mentioned - it doesn't do that - | | it calls .keys() [generating a list] | and then looks if the string is in it... Kim ... I changed one line in the script .... from .... if this_word not in dict_words.keys() : to ...... if not dict_words.has_key( this_word ) : The results are a little MORE than satisfactory, running in 48 seconds instead of 7 hours !!!! Complete ................. Total Words .... 467381 Unique Words .... 47122 Process Time ........ 48.11 Seconds I failed to consider here that a NEW keys list might be created on each pass through a loop .... Thanks VERY MUCH for the suggestion .... -- Cousin Stanley Human Being Phoenix, Arizona |
Re: "Newbie" questions - "unique" sorting ?
i just did a test comparing:
if key not in mydict: and if not mydict.has_key(key): and the 1st way cameout 28% faster. (on python 2.3) bryan "Erik Max Francis" <max@alcyone.com> wrote in message news:3EFA2521.94B42B9B@alcyone.com... > Cousin Stanley wrote: > > > I changed one line in the script .... > > > > from .... if this_word not in dict_words.keys() : > > > > to ...... if not dict_words.has_key( this_word ) : > ... > > I failed to consider here > > that a NEW keys list might be created > > on each pass through a loop .... > > Not only does it create a new keys list (D.keys()), but it searches > linearly through it (x in L)! Those are two separate O(n) operations, > whereas D.has_key is effectively O(1). > > -- > Erik Max Francis && max@alcyone.com && http://www.alcyone.com/max/ > __ San Jose, CA, USA && 37 20 N 121 53 W && &tSftDotIotE > / \ The multitude of books is making us ignorant. > \__/ Voltaire |
Re: "Newbie" questions - "unique" sorting ?
| Not only does it create a new keys list (D.keys()),
| but it searches linearly through it (x in L)! | | Those are two separate O(n) operations, | whereas D.has_key is effectively O(1). Erik ... Incurring the 7 hour wrath of 2 * O( n ) in this test was certainly informative for me, but not much fun .... I could have interrupted the process, but knowing it ran to completion for smaller inputs, I was curious as to the eventual outcome .... Since n increases as keys are added to the dictionary, 2 * O( n ) for the current pass should also be a bit slower than for the previous one making it even slower and slower as more and more keys are added .... Thanks for the info .... -- Cousin Stanley Human Being Phoenix, Arizona |
| All times are GMT. The time now is 02:08 AM. |
Powered by vBulletin®. Copyright ©2000 - 2013, vBulletin Solutions, Inc.
SEO by vBSEO ©2010, Crawlability, Inc.