Velocity Reviews

Velocity Reviews (http://www.velocityreviews.com/forums/index.php)
-   Python (http://www.velocityreviews.com/forums/f43-python.html)
-   -   Re: "Newbie" questions - "unique" sorting ? (http://www.velocityreviews.com/forums/t318778-re-newbie-questions-unique-sorting.html)

Kim Petersen 06-24-2003 07:52 AM

Re: "Newbie" questions - "unique" sorting ?
 
Cousin Stanley wrote:
> John ....
>
> { 1. Good News | 2. Bad News | 3. Good News } ....
>
> 1. Good News ....
>
> The last version of word_list.py that I up-loaded
> works as expected with your input file producing
> an indexed word list with no duplicates ...
>
>
> 2. Bad News ....
>
> It was S L O W E R than the proverbial turtle
> in the tar pit ...
>

How about the simple approach?

total=0
unique={}
for line in open(infile,"rb").xreadlines():
for word in line.strip().split():
unique[word]=unique.get(word,0)+1
total+=1
uwords=unique.keys()
uwords.sort()
open(outfile,"wb").write('\n'.join(uwords))

--
Med Venlig Hilsen / Regards

Kim Petersen - Kyborg A/S (Udvikling)
IT - Innovationshuset
Havneparken 2
7100 Vejle
Tlf. +4576408183 || Fax. +4576408188


Cousin Stanley 06-24-2003 08:55 AM

Re: "Newbie" questions - "unique" sorting ?
 
| How about the simple approach?
| ...

Kim ...

The approach I used is fairly simple and similar
to the one you posted, basically just
stuffing words from split lines into
a dictionary ...

The following script produces an indexed list
fairly quickly for relatively small files,
but dogged out on the input file John supplied
which yielded ...

Total Words .... 467381
Unique Words .... 47122

Perhaps skipping the dictionary word count update
in the following line might speed things up ...

else :

dict_words[ this_word ] += 1

--
Cousin Stanley
Human Being
Phoenix, Arizona

-------------------------------------------------------------------

'''

Module ........... word_list.py

Usage ............ python word_list.py File_In.txt File_Out.txt

NewsGroup ........ comp.lang.python

Date ............. 2003-06-18

Posted_By ........ John Fitzsimmons

Replies_From ..... [ kpop , Erik Max Francis ]

Coded_By ......... Stanley C. Kitching

'''

import math
import sys
import time

time_in = time.time()

NL = '\n'

module_name = sys.argv[ 0 ]

print '%s %s ' % ( NL , module_name )

path_in = sys.argv[ 1 ]
path_out = sys.argv[ 2 ]

file_in = file( path_in , 'r' )
file_out = file( path_out , 'w' )

word_total = 0


dict_words = {}

print
print ' Indexing Words .... ' ,

for iLine in file_in :

if math.fmod( word_total , 1000 ) == 0 :

print '.' ,

list_words = iLine.strip().split()

for this_word in list_words :

if this_word not in dict_words.keys() :

dict_words[ this_word ] = 1

else :

dict_words[ this_word ] += 1

word_total += 1


list_words = dict_words.keys()

list_words.sort( lambda x , y : cmp( x.lower() , y.lower() ) )


print NL
print ' Writing Output File ....' ,

for this_word in list_words :

word_count = dict_words[ this_word ]

str_out = '%6d %s %s' % ( word_count , this_word , NL )

file_out.write( str_out )


word_str = '%s Total Words .... %d %s' % ( NL , word_total , NL )

keys_total = len( dict_words.keys() )

keys_str = '%s Unique Words .... %d %s' % ( NL , keys_total , NL )


file_out.write( word_str )

file_out.write( keys_str )

print NL
print ' Complete .................'
print
print ' Total Words ....' , word_total
print
print ' Unique Words ....' , keys_total


file_in.close()

file_out.close()


time_out = time.time()

time_diff = time_out - time_in

print NL
print ' Process Time ........ %-6.2f Seconds' % ( time_diff )





Bryan 06-25-2003 02:34 AM

Re: "Newbie" questions - "unique" sorting ?
 
> totally skipping the word count should speed it up - and i believe that
> the approach of dict_words[this_word]=dict_words.get(this_word,0) btw.
> is a bit faster than doing has_key() or the least of doing:
>
> if this_word not in dict_words.keys() :
>
> which aught to be extremely slow on a large dictionary (creating and
> dropping lists of thousands + doing a O(n) search over it). And that may
> very well be the culprit of the slow run you see....
>
> >
> > else :
> >
> > dict_words[ this_word ] += 1
> >

>
>



this is how i check if a key is in a dictionary:

if this_word not in dict_words:

i believe this doesn't do an O(n) search over it. it's supposed to constant
time for every lookup by way of a hash lookup.

bryan



Kim Petersen 06-25-2003 05:58 AM

Re: "Newbie" questions - "unique" sorting ?
 
Bryan wrote:
>>totally skipping the word count should speed it up - and i believe that
>>the approach of dict_words[this_word]=dict_words.get(this_word,0) btw.
>>is a bit faster than doing has_key() or the least of doing:
>>
>> if this_word not in dict_words.keys() :
>>
>>which aught to be extremely slow on a large dictionary (creating and
>>dropping lists of thousands + doing a O(n) search over it). And that may
>>very well be the culprit of the slow run you see....


> this is how i check if a key is in a dictionary:
>
> if this_word not in dict_words:
>
> i believe this doesn't do an O(n) search over it. it's supposed to constant
> time for every lookup by way of a hash lookup.


Which is probably correct since that would be equivalent to:
not dict_words.has_key(this_word)

But if you look at the above code mentioned - it doesn't do that - it
calls .keys() [generating a list] and then looks if the string is in it...

>
> bryan
>
>


--
Med Venlig Hilsen / Regards

Kim Petersen - Kyborg A/S (Udvikling)
IT - Innovationshuset
Havneparken 2
7100 Vejle
Tlf. +4576408183 || Fax. +4576408188


Cousin Stanley 06-25-2003 03:40 PM

Re: "Newbie" questions - "unique" sorting ?
 
| Which is probably correct since that would be equivalent to:
| not dict_words.has_key(this_word)
|
| But if you look at the above code mentioned - it doesn't do that -
|
| it calls .keys() [generating a list]
| and then looks if the string is in it...

Kim ...

I changed one line in the script ....

from .... if this_word not in dict_words.keys() :

to ...... if not dict_words.has_key( this_word ) :

The results are a little MORE than satisfactory,
running in 48 seconds instead of 7 hours !!!!

Complete .................

Total Words .... 467381

Unique Words .... 47122

Process Time ........ 48.11 Seconds


I failed to consider here
that a NEW keys list might be created
on each pass through a loop ....

Thanks VERY MUCH for the suggestion ....

--
Cousin Stanley
Human Being
Phoenix, Arizona



Bryan 06-26-2003 01:54 AM

Re: "Newbie" questions - "unique" sorting ?
 
i just did a test comparing:

if key not in mydict:

and

if not mydict.has_key(key):

and the 1st way cameout 28% faster. (on python 2.3)

bryan

"Erik Max Francis" <max@alcyone.com> wrote in message
news:3EFA2521.94B42B9B@alcyone.com...
> Cousin Stanley wrote:
>
> > I changed one line in the script ....
> >
> > from .... if this_word not in dict_words.keys() :
> >
> > to ...... if not dict_words.has_key( this_word ) :

> ...
> > I failed to consider here
> > that a NEW keys list might be created
> > on each pass through a loop ....

>
> Not only does it create a new keys list (D.keys()), but it searches
> linearly through it (x in L)! Those are two separate O(n) operations,
> whereas D.has_key is effectively O(1).
>
> --
> Erik Max Francis && max@alcyone.com && http://www.alcyone.com/max/
> __ San Jose, CA, USA && 37 20 N 121 53 W && &tSftDotIotE
> / \ The multitude of books is making us ignorant.
> \__/ Voltaire




Cousin Stanley 06-26-2003 05:16 PM

Re: "Newbie" questions - "unique" sorting ?
 
| Not only does it create a new keys list (D.keys()),
| but it searches linearly through it (x in L)!
|
| Those are two separate O(n) operations,
| whereas D.has_key is effectively O(1).

Erik ...

Incurring the 7 hour wrath of 2 * O( n ) in this test
was certainly informative for me, but not much fun ....

I could have interrupted the process,
but knowing it ran to completion for smaller inputs,
I was curious as to the eventual outcome ....

Since n increases as keys are added to the dictionary,
2 * O( n ) for the current pass should also be a bit slower
than for the previous one making it even slower and slower
as more and more keys are added ....

Thanks for the info ....

--
Cousin Stanley
Human Being
Phoenix, Arizona




All times are GMT. The time now is 02:08 AM.

Powered by vBulletin®. Copyright ©2000 - 2013, vBulletin Solutions, Inc.
SEO by vBSEO ©2010, Crawlability, Inc.


1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57