Re: "Newbie" questions - "unique" sorting ?
On Mon, 23 Jun 2003 20:35:59 -0700, "Cousin Stanley"
<CousinStanley@hotmail.com> wrote:

Hi Cousin Stanley,

>{ 1. Good News | 2. Bad News | 3. Good News } ....
>
> 1. Good News ....
>
> The last version of word_list.py that I up-loaded
> works as expected with your input file producing
> an indexed word list with no duplicates ...

< snip >

> That's 6.56 HOURS and un-acceptable performance !!!!

I agree. :-) Very clever of you to have worked out how long it would take. I hope you didn't wait over 6 hours to find out !!!

> word_list.py works quickly on smaller files,
> but as coded, is an absolute dog for indexing
> larger files ....

Good. I was hoping it wasn't something that I had done wrong. :-)

> 3. Good News ....
>
> Since I FINALLY figured out that you're mostly interested
> in just the URLs and not a general word list,
> I coded a pre-process script to extract just the URLs
> from the original input file ....
>
> python url_list.py JF_In.txt JF_URLs.txt

Unless I missed something it does lines starting ftp, http, BUT not lines that start www . Is that correct ? Or did I give you a file with no lines starting www ?

< snip >

>Let me know if this output looks closer to what you are after ....

Very very good......and fast. If I can work out what happened to the www lines, and fix it, then everything will be great. I then hope to try this exercise using a different method to see if the numbers come up the same.

Thank you for such excellent programming. :-)

Regards, John.
Re: "Newbie" questions - "unique" sorting ?
| ...
| I hope you didn't wait over 6 hours to find out !!!

John ...

Actually, I did wait ....

Since I'd run the program successfully a number of times, but on smaller files, I wanted to know just how long it would take to run to completion ....

The numbers in the output I posted were an actual copy/paste directly from the DOS window that I ran it in ...

| Unless I missed something it does lines
| starting ftp, http, BUT not lines that start www .

You didn't miss anything ....

The version of url_list.py that you ran only looks for ....

    [ 'http://' , 'https://' , 'ftp://' , 'news://' , 'res://' , 'fido://' ]

However, I added a bit of code to url_list.py to also extract lines starting with www. ...

Download newest versions ....

    http://fastq.com/~sckitching/Python/word_list.zip

Run as before ....

    python url_list.py JF_In.txt JF_URLs.txt
    python word_list.py JF_URLs.txt JF_URLs_Index.txt

| Thank you for such excellent programming.

You're welcome ....

Thanks also to ....

Erik Max Francis for suggesting the lambda sort for Mixed-Case sorting ....

Kim Petersen for suggesting usage of ....

    dict_words.has_key( this_word )

instead of

    this_word in dict_words.keys()

which made an incredible difference in processing time ....

--
Cousin Stanley
Human Being
Phoenix, Arizona
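For readers wondering why that one change mattered so much: in the Python 2 of this thread, dict_words.keys() built a new list on every call, so each membership test scanned that list from the front, while has_key() was a single hash lookup. The sketch below is a modern Python 3 illustration, not code from word_list.py; the function names are made up for the example, and list(counts) is used to imitate the old list-returning keys().

```python
def count_with_dict_lookup(words):
    """Count words, testing membership against the dict itself (a hash lookup)."""
    counts = {}
    for word in words:
        if word in counts:          # modern spelling of counts.has_key(word)
            counts[word] += 1
        else:
            counts[word] = 1
    return counts


def count_with_key_list(words):
    """Same result, but every test scans a freshly built list of keys.

    list(counts) stands in for Python 2's dict.keys(), which returned a list.
    """
    counts = {}
    for word in words:
        if word in list(counts):    # linear scan, repeated for every word
            counts[word] += 1
        else:
            counts[word] = 1
    return counts


sample = "the quick brown fox jumps over the lazy dog the end".split()
assert count_with_dict_lookup(sample) == count_with_key_list(sample)
```

Both functions return the same counts; only the cost differs. With many distinct words the second version does work proportional to the number of keys for every single word, which is consistent with a run that takes hours on a large file.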
Re: "Newbie" questions - "unique" sorting ?
"Cousin Stanley" <CousinStanley@hotmail.com> wrote:
>| Thank you for such excellent programming.
>
>You're welcome ....
>
>Thanks also to ....
>
> Erik Max Francis for suggesting
> the lambda sort for Mixed-Case sorting ....
>
> Kim Petersen for suggesting usage of ....
>
> dict_words.has_key( this_word ) instead of
>
> this_word in dict_words.keys()
>
> which made an incredible difference in processing time ....

I've been playing a little with the script and managed to double the speed by using a trick that was posted here some time ago by someone called "Lulu". The trick was originally from someone else but I lost the attribution somewhere.

This bumps Erik's idea from the list I'm afraid ..., because it translates all letters into lowercase and translates the rest into spaces. This speeds up sorting and splitting.

Probably it's possible to shave off a few percent more, but I think doubling speed once again will cost four times more programmer effort or maybe twice as much money for computer equipment.

It's all in the script below, I hope I didn't introduce any new errors. By the way, I don't like an empty line for every other line, as in your script, and using "\n" is easier than what you did. Other than that, nice job!
Anton

    import sys
    import time

    time_in = time.time()

    module_name = sys.argv[ 0 ]
    print '\n %s ' % (module_name )

    #to get the file below:
    #http://sailor.gutenberg.org/etext97/1donq10.zip

    path_in = '1donq10.txt'
    path_out = 'words.out'

    file_in = file( path_in , 'r' )
    file_out = file( path_out , 'w' )

    word_total = 0
    dict_words = {}

    #start lulu magic
    i_r = map(chr, range(256))
    trans = [' '] * 256
    o_a, o_z = ord('a'), (ord('z')+1)
    trans[ord('A'):(ord('Z')+1)] = i_r[o_a:o_z]
    trans[o_a:o_z] = i_r[o_a:o_z]
    trans = ''.join(trans)
    #end lulu magic

    print ' Indexing Words ....\n ' ,

    for iLine in file_in :
        if (word_total+1) % 10000 == 0 :
            sys.stdout.write('.')
        #use lulu magic in the line below here:
        list_words = iLine.translate(trans).split()
        for this_word in list_words :
            if not dict_words.has_key(this_word) :
                dict_words[this_word] = 1
            else :
                dict_words[this_word] += 1
            word_total += 1

    list_words = dict_words.keys()
    #lulu magic turned all words into lowercase, so standard sort is
    #possible:
    list_words.sort()

    print '\n\n Writing Output File ....' ,

    for this_word in list_words :
        word_count = dict_words[this_word]
        str_out = '%6d %s\n' % (word_count ,this_word)
        file_out.write(str_out)

    word_str = '\n Total Words .... %d\n' % (word_total)
    keys_total = len(dict_words.keys())
    keys_str = '\n Unique Words .... %d\n' % (keys_total)
    file_out.write(word_str)
    file_out.write(keys_str)

    print '\n Complete .................\n'
    print ' Total Words ....' , word_total
    print ' Unique Words ....' , keys_total

    file_in.close()
    file_out.close()

    time_out = time.time()
    time_diff = time_out - time_in
    print '\n Process Time ........ %-6.2f Seconds' % (time_diff)
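The script above is Python 2 as posted in 2003. For anyone trying the same idea today, here is a rough Python 3 sketch of the core of it: fold letters to lowercase, turn everything else into a space, split, and count. It is an assumption-laden rewrite rather than Anton's code: TRANS and count_words are names invented here, collections.Counter replaces the has_key bookkeeping, and the table only covers character codes below 256, so any character above that range passes through unchanged.

```python
import string
from collections import Counter

# Translation table in the spirit of the "lulu magic":
# A-Z and a-z map to lowercase, every other code below 256 maps to a space.
TRANS = {
    code: (chr(code).lower() if chr(code) in string.ascii_letters else " ")
    for code in range(256)
}


def count_words(lines):
    """Return a Counter of lowercase words found in an iterable of text lines."""
    counts = Counter()
    for line in lines:
        # translate() does the case folding and punctuation removal in one pass
        counts.update(line.translate(TRANS).split())
    return counts


if __name__ == "__main__":
    counts = count_words(["Don Quixote, don!", "It's 1605"])
    for word in sorted(counts):      # plain sort works: everything is lowercase
        print("%6d %s" % (counts[word], word))
```

Because every word is already lowercase by the time it is counted, a plain sorted() is enough and no case-insensitive comparison function is needed, which is the point Anton makes about dropping the lambda sort.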
Re: "Newbie" questions - "unique" sorting ?
Anton ....
Thanks for the feedback and the __LuLu_Magic__ coding sample ....

I'll have to ponder the __LuLu_Magic__ a bit to try and understand it ...

The reasons that I use NL instead of '\n' are that ...

  o it seems easier for me to type since it's 2 fewer characters

  o string and print statements seem to read a bit easier for me

  o only need to change 1 line of code if a different End-of-Line character sequence is needed

The reason that I use vertically double-spaced code with a lot of horizontal white-space is that my eyes are old and tired and my feeble brain is just able to parse it more easily ....

A familiar example is from several months ago when you posted a link for your screensaver.py module ....

Edited version of screensaver.py ....

    http://fastq.com/~sckitching/Python/scr_av.py

I'm sure you will hate it, but it's much easier for me to read ....

Thanks for making your screen saver available and thanks again for your comments and suggestions regarding the word_list script ...

--
Cousin Stanley
Human Being
Phoenix, Arizona
Re: "Newbie" questions - "unique" sorting ?
"Cousin Stanley" <CousinStanley@hotmail.com> wrote:
> Edited version of screensaver.py ....
>
> http://fastq.com/~sckitching/Python/scr_av.py
>
>I'm sure you will hate it,
>but it's much easier for me to read ....

On the contrary, I am very glad someone reads my code and makes changes to it, for better or worse! The more "eyeball" inspection code gets, the more chances it has of evolving into something better, even if sometimes newer versions of the code are worse than earlier versions. It works like a genetic algorithm improving one's code snippets :-)

The other thing is that while using Python it seems to be common to read one's previous code and discover that code from only a few months ago would be done very differently now. For example my screensaver module imports a "sequencer.py" file that could now be rewritten in probably a fourth of the number of lines, because someone on c.l.py here made a comment on a newer version of it that was already half the number of lines of "sequencer.py". Also the "Transformer" class in the screensaver needlessly recomputes a lot of things at every call that could be done during initialization; later versions of this class are doing this better.

My personal observation is that *everything* I write in Python is a candidate for improvement in only a few months time because of my changing perspectives on the matter. For another perspective on the "lulu" code for example, try this :

    import string

    trans = [string.lower(chr(i)) for i in range(256)]
    for i in range(256):
        if not trans[i] in string.letters: trans[i] = ' '
    trans = ''.join(trans)

I think this is both a line or so shorter than the original code and is probably also a bit clearer. Because IMO everything written in Python is improved sooner or later -according to how many people look at it- I expect *this* code fragment to be updated once again soon.
This peculiar aspect of Python (and probably other high level languages) is probably caused by the fact that Python code comes closer to one's thoughts than other code, and -at least for me- thoughts are the most volatile elements in the world. So better get used to it (if your experience is anything like mine of course) and don't let yourself be distracted by the code-reusers, unit-testers, and static typers that are still trying to get a grip on this elusive aspect of Python coding.

Anton
Re: "Newbie" questions - "unique" sorting ?
Anton Vredegoor wrote:
> My personal observation is that *everything* I write in Python is a
> candidate for improvement in only a few months time because of my
> changing perspectives on the matter. For another perspective on the
> "lulu" code for example, try this :
>
> trans = [string.lower(chr(i)) for i in range(256)]
> for i in range(256):
>     if not trans[i] in string.letters: trans[i] = ' '
> trans = ''.join(trans)
>
> I think this is both a line or so shorter than the original code and
> is probably also a bit clearer.

shorter, but not necessarily clearer (code that contains "".join is never clear, in my experience, and the and/or trick doesn't make things better):

    trans = "".join([chr(x).isalpha() and chr(x).lower() or " " for x in range(256)])

but on the other hand, this gives you room for two lines of comments, explaining the intent of this piece of code.

you can trade performance for a source code character or two:

    trans = "".join([(" ", chr(x).lower())[chr(x).isalpha()] for x in range(256)])

fwiw, I'd probably spell it all out, to make it all obvious:

    import string

    # map letters to lowercase, and everything else to spaces
    trans = range(256)
    for i in trans:
        ch = chr(i)
        if ch.isalpha():
            trans[i] = ch.lower()
        else:
            trans[i] = " "
    trans = string.join(trans, "")

but that's probably because I have a two-dimensional brain and a working return key ;-)

</F>
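A note on the and/or trick for later readers: `cond and a or b` only behaves like a conditional when `a` is truthy. It is safe in the one-liner above because chr(x).lower() is never an empty string, but it silently picks `b` when `a` is "", 0, or None. Python 2.5 and later offer the conditional expression `a if cond else b`, which has no such trap. The sketch below is Python 3 and purely illustrative; pick_and_or and pick_conditional are hypothetical helper names, and the table is restricted to ASCII letters on purpose, since str.isalpha() in Python 3 also accepts non-ASCII letters.

```python
import string

# ASCII-only table: letters map to lowercase, everything else to a space.
table = "".join(
    chr(x).lower() if chr(x) in string.ascii_letters else " "
    for x in range(256)
)


def pick_and_or(flag, if_true, if_false):
    """The old idiom: falls through to if_false whenever if_true is falsy."""
    return flag and if_true or if_false


def pick_conditional(flag, if_true, if_false):
    """The conditional expression: depends only on flag."""
    return if_true if flag else if_false


# Both agree while the "true" value is truthy ...
assert pick_and_or(True, "a", " ") == pick_conditional(True, "a", " ")
# ... and disagree as soon as it is empty.
assert pick_and_or(True, "", "x") == "x"
assert pick_conditional(True, "", "x") == ""
```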
Re: "Newbie" questions - "unique" sorting ?
On Wed, 25 Jun 2003 09:17:56 -0700, "Cousin Stanley"
<CousinStanley@hotmail.com> wrote:

< snip >

>The version of url_list.py that you ran
>only looks for ....
> [ 'http://' ,
> 'https://' ,
> 'ftp://' ,
> 'news://' ,
> 'res://' ,
> 'fido://' ]
>However, I added a bit of code to url_list.py
>to also extract lines starting with www. ...
>Download newest versions ....
>
> http://fastq.com/~sckitching/Python/word_list.zip
>Run as before ....
>
> python url_list.py JF_In.txt JF_URLs.txt
>
> python word_list.py JF_URLs.txt JF_URLs_Index.txt

Yes, that works now.

>| Thank you for such excellent programming.
>You're welcome ....
>Thanks also to ....
> Erik Max Francis for suggesting
> the lambda sort for Mixed-Case sorting ....
> Kim Petersen for suggesting usage of ....
> dict_words.has_key( this_word ) instead of
>
> this_word in dict_words.keys()
> which made an incredible difference in processing time ....

Yes, the whole process is very good now. Thanks to everyone who helped with this. It was very much appreciated. :-)

By the way Cousin Stanley I assume the numbers in the final result were how many times that string appeared in the input file ? A very interesting number to have.

When I get time though I might "rem" (is that # ?) out that/those line/lines so that I have a second URL sorting python executable that doesn't include the numbers. Assuming I can work out which one(s) to disable ! :-)

Regards, John.
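The source of url_list.py is not shown anywhere in this thread, so the following is only a guess at what the prefix check quoted at the top of this post might look like, written in modern Python 3. The prefix list is taken from the post, with www. added as described; stripping whitespace and ignoring case are assumptions of this sketch, and URL_PREFIXES and extract_urls are made-up names.

```python
# Prefixes quoted in the post, plus the "www." case that was added later.
URL_PREFIXES = (
    "http://", "https://", "ftp://", "news://", "res://", "fido://", "www.",
)


def extract_urls(lines):
    """Return the stripped lines that start with one of URL_PREFIXES.

    Assumption: matching is case-insensitive and ignores surrounding
    whitespace; the real url_list.py may behave differently.
    """
    urls = []
    for line in lines:
        candidate = line.strip()
        # str.startswith accepts a tuple and tests each prefix in turn
        if candidate.lower().startswith(URL_PREFIXES):
            urls.append(candidate)
    return urls


if __name__ == "__main__":
    sample = ["http://a.b\n", "  www.example.com \n", "hello\n", "FTP://x\n"]
    print(extract_urls(sample))
```

A script built this way would be run the same way as before, reading JF_In.txt line by line and writing the returned list to JF_URLs.txt, one URL per line.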
Re: "Newbie" questions - "unique" sorting ?
| ...
| By the way Cousin Stanley I assume the numbers
| in the final result were how many times that string
| appeared in the input file ?
|
| A very interesting number to have.
|
| When I get time though I might "rem" (is that # ?) out
| that/those line/lines so that I have a second URL sorting
| python executable that doesn't include the numbers.
|
| Assuming I can work out which one(s) to disable !

John ...

Getting rid of the word_count isn't too bad ....

Find the following code block in word_list.py ....

    for this_word in list_words :
        word_count = dict_words[ this_word ]
        str_out = '%6d %s %s' % ( word_count , this_word , NL )
        file_out.write( str_out )

Change the above 4 lines to the following 2 ....

    for this_word in list_words :
        file_out.write( this_word + NL )

Save the changed file to word_list2.py ....

As an alternative, since the words you are interested in in this application are actually URLs, you might want to generate actual clickable HTML links ....

    for this_word in list_words :
        file_out.write( '<a href="' + this_word + '">' + NL )
        file_out.write( this_word + NL )
        file_out.write( '</a>' + NL + NL )

Save the changed file to word_list3.py ....

I haven't tested either of the above changes, but this should provide some ideas ....

--
Cousin Stanley
Human Being
Phoenix, Arizona
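For comparison, here is how the two untested variants above might be written in Python 3 today. This is a sketch under stated assumptions, not a replacement for word_list.py: format_plain and format_links are hypothetical names, html.escape is used so that characters such as & in a URL do not break the markup, and bare www. entries get an http:// prefix in the href, since a browser would otherwise treat them as relative links.

```python
import html


def format_plain(urls):
    """One URL per line, without the occurrence counts."""
    return "".join(url + "\n" for url in urls)


def format_links(urls):
    """One clickable HTML anchor per URL."""
    parts = []
    for url in urls:
        # Assumption: entries without a scheme (e.g. "www.example.com")
        # are meant as http links.
        href = url if "://" in url else "http://" + url
        parts.append(
            '<a href="%s">%s</a>\n'
            % (html.escape(href, quote=True), html.escape(url))
        )
    return "".join(parts)


if __name__ == "__main__":
    urls = ["ftp://ftp.example.org/pub", "www.example.com"]
    print(format_plain(urls), end="")
    print(format_links(urls), end="")
```

Either function takes the sorted list of unique URLs and returns a string that can be written to the output file in one call.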
Re: "Newbie" questions - "unique" sorting ?
On Mon, 30 Jun 2003 22:17:43 -0700, "Cousin Stanley"
<CousinStanley@hotmail.com> wrote:

< snip >

>| When I get time though I might "rem" (is that # ?) out
>| that/those line/lines so that I have a second URL sorting
>| python executable that doesn't include the numbers.

>Getting rid of the word_count
>isn't too bad ....

< snip >

Thanks for the additional info. :-)

Regards, John.