Merlin, a fun little program
I posted to my web site a fun little program called merlin.py today.
Please keep in mind that I am a hobbyist and this is just a little hack. If you look at the code you will see that it is still possible to write spaghetti code, even with Python. I apologize, and I do intend to clean up the code, but it may take a while. For now it works, with some bugs.

It is a composite of a few scripts. The first, based on a script Max M uploaded to this newsgroup a while ago (2 years?), is a web-scraper-based multiple-choice guesser. I re-wrote the web scraper to use Yahoo rather than Google, as Google somehow recognizes it as a script now and so has disabled the ability to use Google, as they say it violates their terms of service. I certainly do not want to violate anyone's terms of service; this is just a fun little script. I also used string functions instead of regexes, and an algorithm of my own. Kudos to David Mertz's Text Processing in Python for helping me figure out how to do this, indirectly. (BTW, I also posted a review of his new book on my web site... and submitted it to Slashdot, but one never knows if they will run it.)

The stand-alone version of the web scraper (askMerlin.py) uses NLQ, a natural query language class found on the web at http://gurno.com/adam/nlq/ to identify possible answers to a user's questions. These are then submitted to the main algorithm, which chooses amongst the possible answers, which I call options. Of course, the program is much more likely to be accurate when you give it a correct "option" to pick out from amongst several incorrect options that you also give it. In fact, because of a bug, the composite program I call Merlin (merlin.py) crashes completely if you do not give it any options; but this can be fixed. askMerlin.py doesn't crash and uses NLQ, but gives poor answers.
However, I have a much better algorithm in mind for this part of the program: instead of giving NLQ the main response page from a query, I will give it the first "link" page from a query, which I reckon to be much more likely to contain keywords that represent good possible answers. Alas, this may have to wait until the next long weekend, unless someone else takes up the task ;-))) In the long run, the program is much more interesting when it uses NLQ to find answers to questions where the user offers no possible answers to choose amongst, or other clues; I think this has potential. For now, please give Merlin options to choose amongst.

Next, I include a slightly improved Decision Analysis script, and two fun variations or specific applications of it. This script has the virtue of being my own creation, although I did receive help from Paul Winkler and others on this list. I also include a script shamelessly stolen off the web that will be instantly recognizable to most of you on this newsgroup, but perhaps not to some newbies. I have more such fun stuff in mind to be added.

Also, I intend to do a full GUI version, with a much better user interface, and then to create executable installers for Windows, Linux, and Mac OS X. For now, though, the command line interface has the advantage of working anywhere one can get a Python command prompt; I have tested it on Windows, Linux, Mac OS X and the Sharp Zaurus PDA. The addition of a GUI and creation of executable files should keep this hobbyist busy for a while ;-))) A GUI version of Decision Analysis, which I wrote using PythonCard, is available already. All of the above can wait until I add more fun stuff to it, make it better, fix bugs, move it from the deprecated regex to the re module, and clean up the code!

OK, so this hack may not be worth all the words I've given it, but, in the spirit of computer programming for everybody, I am pleased that I am producing something.
I think it might be something other newbies might be able to understand and hack on also, since it is so simple. If not, so be it. I am having fun.

All of this is on my web site, right at the top, at http://www.awaretek.com/plf.html

Ron Stephens
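
As a rough illustration of the multiple-choice guessing idea described in the post above, here is a minimal sketch that uses only string methods, no regexes. It is not the actual merlin.py code: the function names `score_options` and `best_option` are hypothetical, the scoring rule (a simple case-insensitive occurrence count) is an assumption, and the page text is passed in as a plain string rather than fetched from a search engine.

```python
def score_options(page_text, options):
    """Count case-insensitive occurrences of each option in page_text.

    Hypothetical helper: the real merlin.py scoring rule is not shown
    in the post, so a plain occurrence count is assumed here.
    """
    text = page_text.lower()
    scores = {}
    for option in options:
        needle = option.strip().lower()
        # str.count returns non-overlapping matches; an empty needle would
        # match at every position, so it is scored as zero instead.
        scores[option] = text.count(needle) if needle else 0
    return scores


def best_option(page_text, options):
    """Return the option with the most hits, or None if no options are given."""
    if not options:
        # Returning None avoids crashing when the user supplies no options.
        return None
    scores = score_options(page_text, options)
    # On a tie, max() keeps the first option in the list.
    return max(options, key=lambda opt: scores[opt])


if __name__ == "__main__":
    sample = ("Python was created by Guido van Rossum. "
              "Guido van Rossum released Python in 1991.")
    choices = ["Larry Wall", "Guido van Rossum", "Yukihiro Matsumoto"]
    print(best_option(sample, choices))
```

In a real run, `page_text` would be the text of a search results page for the user's question, and the options would be the candidate answers the user typed in.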
Re: Merlin, a fun little program
On Mon, Jul 07, 2003 at 01:37:15AM +0000, Ron Stephens wrote:
> based multiple choice guesser. I re-wrote the web scraper to use Yahoo
> rather than Google, as Google somehow recognizes it as a script now and
> so has disabled the ability to use Google, as they say it violates their
> terms of service. I certainly do not want to violate anyone's terms of
> service

You can still use Google for this - just sign up for the Google API.

http://www.google.com/apis/
http://diveintomark.org/projects/pygoogle/

Oren
Re: Merlin, a fun little program
Oren wrote """You can still use Google for this """
Yes indeed. I have played with the Google APIs, registered, and also use pygoogle. They make this kind of thing easier, no doubt about it. The reason I used my hand-rolled web scraper on Yahoo is that if my code uses the Google APIs, other potential users, like those who download from my web site, can't run it unless they download and register also, which might be a pain for them. At any rate, doing my own was fun and informative for me.

A big disadvantage of web scraping, though, is that the code tends to break over time. This happened to me with Max M's original, two-year-old algorithm. Google broke it, and I didn't realize it until I came back and retried the code.

The two links you gave are good ones and I studied both in my efforts; I recommend them. Thanks for the input.

I guess the bigger question is: is there anything wrong with web scraping? I surely never meant any harm in it, and certainly no money is involved. But maybe I should give it up and do other things?

Ron Stephens
Re: Merlin, a fun little program
On Mon, Jul 07, 2003 at 02:05:28PM -0700, Ron Stephens wrote:
> I guess the bigger question is; is there anything wrong with web
> scraping? I surely never meant any harm in it, and certainly no money
> is involved. But maybe I should give it up and do other things?

I don't think there's anything fundamentally wrong with web scraping, but you have to consider the fact that a single script can easily consume resources that cost real money and would otherwise serve thousands of human users. If a provider installs mechanisms to detect scripts and block them, this can quickly become a cat-and-mouse game where the scrapers try to fool these mechanisms; it starts to get ugly and everybody suffers.

I think Google handled this very well - defusing most of the problem by letting people have what they want while keeping things under control. I generally find the way Google handles the issues that come with their dominant market position quite "Pythonic".

Oren