Velocity Reviews


John Fitzsimons 06-25-2003 12:44 PM

Re: "Newbie" questions - "unique" sorting ?
 
On Mon, 23 Jun 2003 20:35:59 -0700, "Cousin Stanley"
<CousinStanley@hotmail.com> wrote:

Hi Cousin Stanley,

>{ 1. Good News | 2. Bad News | 3. Good News } ....


> 1. Good News ....


> The last version of word_list.py that I up-loaded
> works as expected with your input file producing
> an indexed word list with no duplicates ...


< snip >

> That's 6.56 HOURS and un-acceptable performance !!!!


I agree. :-) Very clever of you to have worked out how long it would
take. I hope you didn't wait over 6 hours to find out !!!

> word_list.py works quickly on smaller files,
> but as coded, is an absolute dog for indexing
> larger files ....


Good. I was hoping it wasn't something that I had done wrong. :-)

> 3. Good News ....


> Since I FINALLY figured out that you're mostly interested
> in just the URLs and not a general word list,
> I coded a pre-process script to extract just the URLs
> from the original input file ....


> python url_list.py JF_In.txt JF_URLs.txt


Unless I missed something it does lines starting ftp, http, BUT not
lines that start www . Is that correct ? Or did I give you a file with
no lines starting www ?

< snip >

>Let me know if this output looks closer to what you are after ....


Very very good......and fast. If I can work out what happened to the
www lines, and fix it, then everything will be great. I then hope to
try this exercise using a different method to see if the numbers come
up the same.

Thank you for such excellent programming. :-)


Regards, John.


Cousin Stanley 06-25-2003 04:17 PM

Re: "Newbie" questions - "unique" sorting ?
 
| ...
| I hope you didn't wait over 6 hours to find out !!!
|

John ...

Actually, I did wait ....

Since I'd run the program successfully a number of times,
but on smaller files, I wanted to know just how long
it would take to run to completion ....

The numbers in the output I posted
were an actual copy/paste directly
from the DOS window that I ran it in ...

| Unless I missed something it does lines
| starting ftp, http, BUT not lines that start www .
|

You didn't miss anything ....

The version of url_list.py that you ran
only looks for ....

[ 'http://' ,
'https://' ,
'ftp://' ,
'news://' ,
'res://' ,
'fido://' ]

However, I added a bit of code to url_list.py
to also extract lines starting with www. ...
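
The source of url_list.py isn't shown in this thread, so the following is only a rough sketch of the kind of prefix test described above, written for current Python. The names PREFIXES and extract_urls are made up for illustration, and the case-insensitive comparison is an assumption rather than something the original script is known to do.

```python
# Prefixes taken from the list above, plus the added "www." case.
PREFIXES = ("http://", "https://", "ftp://", "news://",
            "res://", "fido://", "www.")

def extract_urls(lines):
    """Return the stripped lines that start with one of PREFIXES."""
    urls = []
    for line in lines:
        candidate = line.strip()
        # str.startswith accepts a tuple and tests each prefix in turn.
        if candidate.lower().startswith(PREFIXES):
            urls.append(candidate)
    return urls

sample = ["http://example.com/a\n", "just some text\n",
          "www.example.org\n", "FTP://example.net/file\n"]
print(extract_urls(sample))
```

Passing the whole tuple to startswith keeps the test to one line, and adding another scheme later means touching only PREFIXES.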

Download newest versions ....

http://fastq.com/~sckitching/Python/word_list.zip

Run as before ....

python url_list.py JF_In.txt JF_URLs.txt

python word_list.py JF_URLs.txt JF_URLs_Index.txt

| Thank you for such excellent programming.

You're welcome ....

Thanks also to ....

Erik Max Francis for suggesting
the lambda sort for Mixed-Case sorting ....

Kim Petersen for suggesting usage of ....

dict_words.has_key( this_word ) instead of

this_word in dict_words.keys()

which made an incredible difference in processing time ....
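
A small sketch of why that change matters, written for current Python, where has_key() is gone and "this_word in dict_words" is the usual spelling. In the Python of this thread, dict_words.keys() built a fresh list on every call and the "in" test then walked that list, whereas has_key() is a single hash lookup. The data below is invented, and the timings will vary from machine to machine.

```python
import timeit

# Hypothetical data: 20,000 distinct "words" mapped to counts.
dict_words = {"word%d" % i: 1 for i in range(20000)}
key_list = list(dict_words)   # stands in for the list Python 2's keys() returned
probe = "word19999"           # last key, so the list scan has to walk everything

# Both tests give the same answer ...
print(probe in dict_words, probe in key_list)

# ... but the dict test is a hash lookup and the list test is a linear scan.
t_dict = timeit.timeit(lambda: probe in dict_words, number=200)
t_list = timeit.timeit(lambda: probe in key_list, number=200)
print("dict lookup: %.6f s   list scan: %.6f s" % (t_dict, t_list))
```

Since the word-counting loop does one such test per word, a linear scan per word makes the whole run grow roughly with the square of the vocabulary size, which fits the hours-long run reported earlier.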

--
Cousin Stanley
Human Being
Phoenix, Arizona



Anton Vredegoor 06-26-2003 07:24 PM

Re: "Newbie" questions - "unique" sorting ?
 
"Cousin Stanley" <CousinStanley@hotmail.com> wrote:
>| Thank you for such excellent programming.
>
>You're welcome ....
>
>Thanks also to ....
>
> Erik Max Francis for suggesting
> the lambda sort for Mixed-Case sorting ....
>
> Kim Petersen for suggesting usage of ....
>
> dict_words.has_key( this_word ) instead of
>
> this_word in dict_words.keys()
>
> which made an incredible difference in processing time ....


I've been playing a little with the script and managed to double the
speed by using a trick that was posted here some time ago by someone
called "Lulu". The trick was originally from someone else but I lost
the attribution somewhere. This bumps Erik's idea from the list I'm
afraid ..., because it translates all letters into lowercase and
translates the rest into spaces. This speeds up sorting and splitting.

Probably it's possible to shave off a few percents more, but I think
doubling speed once again will cost four times more programmer effort
or maybe twice as much money for computer equipment.

It's all in the script below, I hope I didn't introduce any new
errors. By the way, I don't like an empty line for every other line,
as in your script, and using "\n" is easier than what you did. Other
than that, nice job!

Anton


import sys
import time

time_in = time.time()
module_name = sys.argv[ 0 ]
print '\n %s ' % (module_name )
#to get the file below:
#http://sailor.gutenberg.org/etext97/1donq10.zip
path_in = '1donq10.txt'
path_out = 'words.out'
file_in = file( path_in , 'r' )
file_out = file( path_out , 'w' )
word_total = 0
dict_words = {}
#start lulu magic
i_r = map(chr, range(256))
trans = [' '] * 256
o_a, o_z = ord('a'), (ord('z')+1)
trans[ord('A'):(ord('Z')+1)] = i_r[o_a:o_z]
trans[o_a:o_z] = i_r[o_a:o_z]
trans = ''.join(trans)
#end lulu magic
print
print ' Indexing Words ....\n ' ,
for iLine in file_in :
    if (word_total+1) % 10000 == 0 :
        sys.stdout.write('.')
    #use lulu magic in the line below here:
    list_words = iLine.translate(trans).split()
    for this_word in list_words :
        if not dict_words.has_key(this_word) :
            dict_words[this_word] = 1
        else :
            dict_words[this_word] += 1
        word_total += 1
list_words = dict_words.keys()
#lulu magic turned all words into lowercase, so standard sort is
#possible:
list_words.sort()
print '\n\n Writing Output File ....' ,
for this_word in list_words :
    word_count = dict_words[this_word]
    str_out = '%6d %s\n' % (word_count ,this_word)
    file_out.write(str_out)
word_str = '\n Total Words .... %d\n' % (word_total)
keys_total = len(dict_words.keys())
keys_str = '\n Unique Words .... %d\n' % (keys_total)
file_out.write(word_str)
file_out.write(keys_str)
print '\n Complete .................\n'
print ' Total Words ....' , word_total
print
print ' Unique Words ....' , keys_total
file_in.close()
file_out.close()
time_out = time.time()
time_diff = time_out - time_in
print '\n Process Time ........ %-6.2f Seconds' % (time_diff)
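
For anyone reading this with a current Python, where print is a function and has_key() no longer exists, the core of the script above might be sketched as follows. This is not Anton's code. TRANS, count_words and the sample lines are invented for illustration, and restricting the table to ASCII letters is an assumption made so the result does not depend on locale.

```python
import string
from collections import Counter

# Map ASCII letters to lowercase and every other code point below 256
# to a space, which is the effect of the translation table above.
TRANS = {i: (chr(i).lower() if chr(i) in string.ascii_letters else " ")
         for i in range(256)}

def count_words(lines):
    """Count lowercase words after translating non-letters to spaces."""
    counts = Counter()
    for line in lines:
        counts.update(line.translate(TRANS).split())
    return counts

sample = ["Don Quixote, the don!", "Sancho's don"]
counts = count_words(sample)
for word in sorted(counts):          # plain sort works: everything is lowercase
    print("%6d %s" % (counts[word], word))
print("Total Words ....", sum(counts.values()))
print("Unique Words ....", len(counts))
```

Note that the apostrophe in "Sancho's" becomes a space too, so the stray "s" is counted as a word of its own, just as it would be with the original table.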

Cousin Stanley 06-27-2003 09:25 PM

Re: "Newbie" questions - "unique" sorting ?
 
Anton ....

Thanks for the feedback
and the __LuLu_Magic__ coding sample ....

I'll have to ponder the __LuLu_Magic__ a bit
to try and understand it ...

The reasons that I use NL instead of '\n'
are that ...

o it seems easier for me to type
since it's 2 fewer characters

o string and print statements
seem to read a bit easier for me

o only need to change 1 line of code
if a different End-of-Line character sequence
is needed

The reasons that I use vertically double-spaced code
with a lot of horizontal white-space is that my eyes
are old and tired and my feeble brain is just able
to parse it more easily ....

A familiar example is from several months ago
when you posted a link for your screensaver.py module ....

Edited version of screensaver.py ....

http://fastq.com/~sckitching/Python/scr_av.py

I'm sure you will hate it,
but it's much easier for me to read ....

Thanks for making your screen saver available
and thanks again for your comments and suggestions
regarding the word_list script ...

--
Cousin Stanley
Human Being
Phoenix, Arizona



Anton Vredegoor 06-28-2003 08:44 AM

Re: "Newbie" questions - "unique" sorting ?
 
"Cousin Stanley" <CousinStanley@hotmail.com> wrote:

> Edited version of screensaver.py ....
>
> http://fastq.com/~sckitching/Python/scr_av.py
>
>I'm sure you will hate it,
>but it's much easier for me to read ....


On the contrary, I am very glad someone reads my code and makes changes
to it, for better or worse! The more "eyeball" inspection code gets,
the more chance it has of evolving into something better, even if
sometimes newer versions of the code are worse than earlier versions.
It works like a genetic algorithm improving one's code snippets :-)

The other thing is that while using Python it seems to be common to
read one's previous code and discover that code from only a few months
ago would be done very differently now. For example my screensaver
module imports a "sequencer.py" file that could now be rewritten in
probably a fourth of the number of lines, because someone on c.l.py
here made a comment on a newer version of it that was already half the
number of lines of "sequencer.py". Also the "Transformer" class in the
screensaver needlessly recomputes a lot of things at every call that
could be done during initialization, later versions of this class are
doing this better.

My personal observation is that *everything* I write in Python is a
candidate for improvement in only a few months time because of my
changing perspectives on the matter. For another perspective on the
"lulu" code for example, try this :

trans = [string.lower(chr(i)) for i in range(256)]
for i in range(256):
    if not trans[i] in string.letters: trans[i] = ' '
trans = ''.join(trans)

I think this is both a line or so shorter than the original code and
is probably also a bit clearer. Because IMO everything written in
Python is improved sooner or later -according to how many people look
at it- I expect *this* code fragment to be updated once again soon.

This peculiar aspect of Python (and probably other high level
languages) is probably caused by the fact that Python code comes
closer to one's thoughts than other code, and -at least for me-
thoughts are the most volatile elements in the world.

So better get used to it (if your experience is anything like mine of
course) and don't let yourself be distracted by the code-reusers,
unit-testers, and static typers that are still trying to get a grip on
this elusive aspect of Python coding.

Anton






Fredrik Lundh 06-28-2003 10:01 AM

Re: "Newbie" questions - "unique" sorting ?
 
Anton Vredegoor wrote:

> My personal observation is that *everything* I write in Python is a
> candidate for improvement in only a few months time because of my
> changing perspectives on the matter. For another perspective on the
> "lulu" code for example, try this :
>
> trans = [string.lower(chr(i)) for i in range(256)]
> for i in range(256):
>     if not trans[i] in string.letters: trans[i] = ' '
> trans = ''.join(trans)
>
> I think this is both a line or so shorter than the original code and
> is probably also a bit clearer.


shorter, but not necessarily clearer (code that contains "".join is
never clear, in my experience, and the and/or trick doesn't make
things better):

trans = "".join([chr(x).isalpha() and chr(x).lower() or " " for x in range(256)])

but on the other hand, this gives you room for two lines of
comments, explaining the intent of this piece of code.

you can trade performance for a source code character or two:

trans = "".join([(" ", chr(x).lower())[chr(x).isalpha()] for x in range(256)])

fwiw, I'd probably spell it all out, to make it all obvious:

# map letters to lowercase, and everything else to spaces
trans = range(256)
for i in trans:
    ch = chr(i)
    if ch.isalpha():
        trans[i] = ch.lower()
    else:
        trans[i] = " "
trans = string.join(trans, "")

but that's probably because I have a two-dimensional brain and a
working return key ;-)
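
A footnote on the and/or trick, as a sketch for current Python: "cond and a or b" silently returns b whenever a itself is falsy. That cannot happen in the one-liner above, because chr(x).lower() is never an empty string, but it is the usual reason the idiom gets a warning. Conditional expressions, added to the language in Python 2.5, avoid the problem. The helper names below are invented for illustration, and the table is limited to ASCII letters as an assumption.

```python
def and_or(cond, if_true, if_false):
    # The old idiom: breaks when if_true is itself falsy.
    return cond and if_true or if_false

def conditional(cond, if_true, if_false):
    # Conditional expression: picks if_true whenever cond holds.
    return if_true if cond else if_false

print(repr(and_or(True, "", "fallback")))       # 'fallback'
print(repr(conditional(True, "", "fallback")))  # ''

def is_ascii_letter(ch):
    return "a" <= ch <= "z" or "A" <= ch <= "Z"

# The same 256-entry table, built with a conditional expression.
trans = "".join(chr(x).lower() if is_ascii_letter(chr(x)) else " "
                for x in range(256))
print(repr("Hello, World!".translate(trans)))
```

In current Python, str.translate indexes the table by code point, so a 256-character string still works as a table for text in that range.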

</F>





John Fitzsimons 07-01-2003 02:08 AM

Re: "Newbie" questions - "unique" sorting ?
 
On Wed, 25 Jun 2003 09:17:56 -0700, "Cousin Stanley"
<CousinStanley@hotmail.com> wrote:

< snip >

>The version of url_list.py that you ran
>only looks for ....


> [ 'http://' ,
> 'https://' ,
> 'ftp://' ,
> 'news://' ,
> 'res://' ,
> 'fido://' ]


>However, I added a bit of code to url_list.py
>to also extract lines starting with www. ...


>Download newest versions ....
>
> http://fastq.com/~sckitching/Python/word_list.zip


>Run as before ....
>
> python url_list.py JF_In.txt JF_URLs.txt
>
> python word_list.py JF_URLs.txt JF_URLs_Index.txt


Yes, that works now.

>| Thank you for such excellent programming.


>You're welcome ....


>Thanks also to ....


> Erik Max Francis for suggesting
> the lambda sort for Mixed-Case sorting ....


> Kim Petersen for suggesting usage of ....


> dict_words.has_key( this_word ) instead of
>
> this_word in dict_words.keys()


> which made an incredible difference in processing time ....


Yes, the whole process is very good now.

Thanks to everyone who helped with this. It was very much
appreciated. :-)

By the way Cousin Stanley I assume the numbers in the final result
were how many times that string appeared in the input file ? A very
interesting number to have.

When I get time though I might "rem" (is that # ?) out that/those
line/lines so that I have a second URL sorting python executable that
doesn't include the numbers. Assuming I can work out which one(s) to
disable ! :-)


Regards, John.


Cousin Stanley 07-01-2003 05:17 AM

Re: "Newbie" questions - "unique" sorting ?
 
| ...
| By the way Cousin Stanley I assume the numbers
| in the final result were how many times that string
| appeared in the input file ?
|
| A very interesting number to have.
|
| When I get time though I might "rem" (is that # ?) out
| that/those line/lines so that I have a second URL sorting
| python executable that doesn't include the numbers.
|
| Assuming I can work out which one(s) to disable !

John ...

Getting rid of the word_count
isn't too bad ....

Find the following code block in word_list.py ....

for this_word in list_words :

    word_count = dict_words[ this_word ]

    str_out = '%6d %s %s' % ( word_count , this_word , NL )

    file_out.write( str_out )


Change the above 4 lines to the following 2 ....

for this_word in list_words :

    file_out.write( this_word + NL )

Save the changed file to word_list2.py ....


As an alternative, since the words you are interested in
in this application are actually URLs, you might want
to generate actual clickable HTML links ....

for this_word in list_words :

    file_out.write( '<a href="' + this_word + '">' + NL )

    file_out.write( this_word + NL )

    file_out.write( '</a>' + NL + NL )

Save the changed file to word_list3.py ....

I haven't tested either of the above changes,
but this should provide some ideas ....
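
One thing worth checking with the HTML variant: entries that start with www. have no scheme, so a browser would treat them as relative links, and any & or " inside a URL should be escaped before it goes into the page. A rough sketch for current Python follows. The function name to_link is made up, and defaulting to http:// for scheme-less entries is an assumption.

```python
from html import escape

def to_link(url):
    """Return an HTML anchor for url, adding a scheme if none is present."""
    # Assumption: scheme-less entries such as "www.example.com" are web links.
    href = url if "://" in url else "http://" + url
    return '<a href="%s">%s</a>' % (escape(href, quote=True), escape(url))

for this_word in ["www.example.com", "http://example.com/?x=1&y=2"]:
    print(to_link(this_word))
```

Keeping the link building in one small function means the loop in word_list3.py would shrink to a single file_out.write( to_link( this_word ) + NL ) call.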

--
Cousin Stanley
Human Being
Phoenix, Arizona



John Fitzsimons 07-02-2003 01:34 AM

Re: "Newbie" questions - "unique" sorting ?
 
On Mon, 30 Jun 2003 22:17:43 -0700, "Cousin Stanley"
<CousinStanley@hotmail.com> wrote:

< snip >

>| When I get time though I might "rem" (is that # ?) out
>| that/those line/lines so that I have a second URL sorting
>| python executable that doesn't include the numbers.


>Getting rid of the word_count
>isn't too bad ....


< snip >

Thanks for the additional info. :-)

Regards, John.


