Velocity Reviews - Computer Hardware Reviews

Velocity Reviews > Newsgroups > Programming > Python > Re: "Newbie" questions - "unique" sorting ?

Reply
Thread Tools

Re: "Newbie" questions - "unique" sorting ?

 
 
Kim Petersen
Guest
Posts: n/a
 
      06-24-2003
Cousin Stanley wrote:
> John ....
>
> { 1. Good News | 2. Bad News | 3. Good News } ....
>
> 1. Good News ....
>
> The last version of word_list.py that I up-loaded
> works as expected with your input file producing
> an indexed word list with no duplicates ...
>
>
> 2. Bad News ....
>
> It was S L O W E R than the proverbial turtle
> in the tar pit ...
>

How about the simple approach?

total=0
unique={}
for line in open(infile,"rb").xreadlines():
for word in line.strip().split():
unique[word]=unique.get(word,0)+1
total+=1
uwords=unique.keys()
uwords.sort()
open(outfile,"wb").write('\n'.join(uwords))

--
Med Venlig Hilsen / Regards

Kim Petersen - Kyborg A/S (Udvikling)
IT - Innovationshuset
Havneparken 2
7100 Vejle
Tlf. +4576408183 || Fax. +4576408188

 
Reply With Quote
 
 
 
 
Cousin Stanley
Guest
Posts: n/a
 
      06-24-2003
| How about the simple approach?
| ...

Kim ...

The approach I used is fairly simple and similar
to the one you posted, basically just
stuffing words from split lines into
a dictionary ...

The following script produces an indexed list
fairly quickly for relatively small files,
but dogged out on the input file John supplied
which yielded ...

Total Words .... 467381
Unique Words .... 47122

Perhaps skipping the dictionary word count update
in the following line might speed things up ...

else :

dict_words[ this_word ] += 1

--
Cousin Stanley
Human Being
Phoenix, Arizona

-------------------------------------------------------------------

'''

Module ........... word_list.py

Usage ............ python word_list.py File_In.txt File_Out.txt

NewsGroup ........ comp.lang.python

Date ............. 2003-06-18

Posted_By ........ John Fitzsimmons

Replies_From ..... [ kpop , Erik Max Francis ]

Coded_By ......... Stanley C. Kitching

'''

import math
import sys
import time

time_in = time.time()

NL = '\n'

module_name = sys.argv[ 0 ]

print '%s %s ' % ( NL , module_name )

path_in = sys.argv[ 1 ]
path_out = sys.argv[ 2 ]

file_in = file( path_in , 'r' )
file_out = file( path_out , 'w' )

word_total = 0


dict_words = {}

print
print ' Indexing Words .... ' ,

for iLine in file_in :

if math.fmod( word_total , 1000 ) == 0 :

print '.' ,

list_words = iLine.strip().split()

for this_word in list_words :

if this_word not in dict_words.keys() :

dict_words[ this_word ] = 1

else :

dict_words[ this_word ] += 1

word_total += 1


list_words = dict_words.keys()

list_words.sort( lambda x , y : cmp( x.lower() , y.lower() ) )


print NL
print ' Writing Output File ....' ,

for this_word in list_words :

word_count = dict_words[ this_word ]

str_out = '%6d %s %s' % ( word_count , this_word , NL )

file_out.write( str_out )


word_str = '%s Total Words .... %d %s' % ( NL , word_total , NL )

keys_total = len( dict_words.keys() )

keys_str = '%s Unique Words .... %d %s' % ( NL , keys_total , NL )


file_out.write( word_str )

file_out.write( keys_str )

print NL
print ' Complete .................'
print
print ' Total Words ....' , word_total
print
print ' Unique Words ....' , keys_total


file_in.close()

file_out.close()


time_out = time.time()

time_diff = time_out - time_in

print NL
print ' Process Time ........ %-6.2f Seconds' % ( time_diff )




 
Reply With Quote
 
 
 
 
Bryan
Guest
Posts: n/a
 
      06-25-2003
> totally skipping the word count should speed it up - and i believe that
> the approach of dict_words[this_word]=dict_words.get(this_word,0) btw.
> is a bit faster than doing has_key() or the least of doing:
>
> if this_word not in dict_words.keys() :
>
> which aught to be extremely slow on a large dictionary (creating and
> dropping lists of thousands + doing a O(n) search over it). And that may
> very well be the culprit of the slow run you see....
>
> >
> > else :
> >
> > dict_words[ this_word ] += 1
> >

>
>



this is how i check if a key is in a dictionary:

if this_word not in dict_words:

i believe this doesn't do an O(n) search over it. it's supposed to constant
time for every lookup by way of a hash lookup.

bryan


 
Reply With Quote
 
Kim Petersen
Guest
Posts: n/a
 
      06-25-2003
Bryan wrote:
>>totally skipping the word count should speed it up - and i believe that
>>the approach of dict_words[this_word]=dict_words.get(this_word,0) btw.
>>is a bit faster than doing has_key() or the least of doing:
>>
>> if this_word not in dict_words.keys() :
>>
>>which aught to be extremely slow on a large dictionary (creating and
>>dropping lists of thousands + doing a O(n) search over it). And that may
>>very well be the culprit of the slow run you see....


> this is how i check if a key is in a dictionary:
>
> if this_word not in dict_words:
>
> i believe this doesn't do an O(n) search over it. it's supposed to constant
> time for every lookup by way of a hash lookup.


Which is probably correct since that would be equivalent to:
not dict_words.has_key(this_word)

But if you look at the above code mentioned - it doesn't do that - it
calls .keys() [generating a list] and then looks if the string is in it...

>
> bryan
>
>


--
Med Venlig Hilsen / Regards

Kim Petersen - Kyborg A/S (Udvikling)
IT - Innovationshuset
Havneparken 2
7100 Vejle
Tlf. +4576408183 || Fax. +4576408188

 
Reply With Quote
 
Cousin Stanley
Guest
Posts: n/a
 
      06-25-2003
| Which is probably correct since that would be equivalent to:
| not dict_words.has_key(this_word)
|
| But if you look at the above code mentioned - it doesn't do that -
|
| it calls .keys() [generating a list]
| and then looks if the string is in it...

Kim ...

I changed one line in the script ....

from .... if this_word not in dict_words.keys() :

to ...... if not dict_words.has_key( this_word ) :

The results are a little MORE than satisfactory,
running in 48 seconds instead of 7 hours !!!!

Complete .................

Total Words .... 467381

Unique Words .... 47122

Process Time ........ 48.11 Seconds


I failed to consider here
that a NEW keys list might be created
on each pass through a loop ....

Thanks VERY MUCH for the suggestion ....

--
Cousin Stanley
Human Being
Phoenix, Arizona


 
Reply With Quote
 
Bryan
Guest
Posts: n/a
 
      06-26-2003
i just did a test comparing:

if key not in mydict:

and

if not mydict.has_key(key):

and the 1st way cameout 28% faster. (on python 2.3)

bryan

"Erik Max Francis" <(E-Mail Removed)> wrote in message
news:(E-Mail Removed)...
> Cousin Stanley wrote:
>
> > I changed one line in the script ....
> >
> > from .... if this_word not in dict_words.keys() :
> >
> > to ...... if not dict_words.has_key( this_word ) :

> ...
> > I failed to consider here
> > that a NEW keys list might be created
> > on each pass through a loop ....

>
> Not only does it create a new keys list (D.keys()), but it searches
> linearly through it (x in L)! Those are two separate O(n) operations,
> whereas D.has_key is effectively O(1).
>
> --
> Erik Max Francis && http://www.velocityreviews.com/forums/(E-Mail Removed) && http://www.alcyone.com/max/
> __ San Jose, CA, USA && 37 20 N 121 53 W && &tSftDotIotE
> / \ The multitude of books is making us ignorant.
> \__/ Voltaire



 
Reply With Quote
 
Cousin Stanley
Guest
Posts: n/a
 
      06-26-2003
| Not only does it create a new keys list (D.keys()),
| but it searches linearly through it (x in L)!
|
| Those are two separate O(n) operations,
| whereas D.has_key is effectively O(1).

Erik ...

Incurring the 7 hour wrath of 2 * O( n ) in this test
was certainly informative for me, but not much fun ....

I could have interrupted the process,
but knowing it ran to completion for smaller inputs,
I was curious as to the eventual outcome ....

Since n increases as keys are added to the dictionary,
2 * O( n ) for the current pass should also be a bit slower
than for the previous one making it even slower and slower
as more and more keys are added ....

Thanks for the info ....

--
Cousin Stanley
Human Being
Phoenix, Arizona


 
Reply With Quote
 
 
 
Reply

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off
Trackbacks are On
Pingbacks are On
Refbacks are Off


Similar Threads
Thread Thread Starter Forum Replies Last Post
Sorting list vs sorting vector boltar2003@boltar.world C++ 2 07-06-2010 09:40 AM
fired event Sorting which wasn't handled - sorting and SelectedIndexChanged Jason ASP .Net Web Controls 0 10-04-2006 02:19 PM
Re: Questions....questions....questions Patrick Michael A+ Certification 0 06-16-2004 04:53 PM
sorting by multiple criterias (sub-sorting) Tom Kirchner Perl Misc 3 10-11-2003 05:16 PM
Re: "Newbie" questions - "unique" sorting ? John Fitzsimons Python 8 07-02-2003 01:34 AM



Advertisments