How to make a Perl program do concurrent downloading?
Hi, there:
I wrote a program to download 500,000 HTML files from a website, and I have compiled all the links in a file. My grabber.pl will download all of them...

I have a fast internet connection, so I think it is better to run multiple downloads at the same time, but $INET = new Win32::Internet() only allows one at a time... What can I do?

I also found that the grabber occasionally just hangs somewhere... In that situation I need to bypass $INET->FetchURL($url), write the offending URL to an error file, and continue on to the next iteration... How can I do that?

Best Regards,
Adlene
Re: How to make a Perl program do concurrent downloading?
"Adlene" <Adlene3352@hotmail.com> wrote in message news:<c6vvmn$ck4$1@mawar.singnet.com.sg>...
> Hi, there:
>
> I wrote a program to download 500,000 HTML files from a website, I
> have compiled all the links in a file. my grabber.pl will download all of
> them...

Depending on who owns the site, they may find it rude that you want to download so many files and take as many resources as possible from their web server. Perhaps you should find a different way of retrieving the data, such as contacting the web site administrator and telling them what you want to do. They may give you a tar gzipped file of the site.

> I have a fast internet connection. I think it is better to run multiple
> downloads at

It may be better for you, but that is questionable for everyone else. Here is some information on web robots. You might want to do some more searching on web robots, though.

http://www.phantomsearch.com/usersguide/R04Robot.htm

<from the above URL>

The Four Laws of Web Robotics

Law One: A Web Robot Must Show Identification
Phantom supports this. You can set the "User-Agent" and "From E-Mail" fields in the preferences dialog. Both of these are reported in the HTTP header when Phantom makes requests of remote Web servers.

Law Two: A Web Robot Must Obey Exclusion Standard
Phantom fully supports the exclusion standard.

Law Three: A Web Robot Must Not Hog Resources
Phantom only retrieves files it can index (unless mirroring with binaries option on) and restricts its movement to the path specified by starting points. You can also set the minimum time between hits on the same server. Generally, 60 seconds is considered polite. For busy sites a greater hit rate may be acceptable, but do not assume whether a site is "busy" or not; contact the webmaster first. When crawling your own server, of course, you can set the hit interval to anything you like, including zero.

Law Four: A Web Robot Must Report Errors
Phantom can show you links that are no longer valid. Please contact the Webmaster and pass this information on if broken URLs are found.
> same time, but $INET = new Win32::Internet() only allows one at a
> time...what may I do?
>
> I also found, occassionally the grabber just hang somewhere...In such
> situation I need to bypass $INET->FetchURL($url), write the offending
> URL in an error file and continue on to next iteration...How may I do
> that?
>
> Best Regards,
> Adlene
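For what it is worth, here is a minimal sketch of how the four laws quoted above, together with the "skip the hung URL and log it" behaviour asked about in the question, might look in Perl. It swaps Win32::Internet for LWP::RobotUA from the libwww-perl distribution on CPAN, which is an assumption on my part and not something the original script uses. The agent name, e-mail address, output file naming, and the default error file name are all hypothetical placeholders.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use LWP::RobotUA;    # part of libwww-perl (CPAN); not in core Perl

# Build a robot user agent that identifies itself (Law One), obeys
# robots.txt (Law Two), and waits between hits (Law Three).
# The agent name and e-mail address are placeholders.
sub make_agent {
    my $ua = LWP::RobotUA->new('grabber/0.1', 'you@example.com');
    $ua->delay(1);       # minimum wait between hits on one server, in MINUTES
    $ua->timeout(30);    # abort a request after 30 seconds without activity
    return $ua;
}

# Hypothetical naming scheme for the saved pages.
sub local_name {
    my ($n) = @_;
    return sprintf 'page%06d.html', $n;
}

sub main {
    my ($url_list, $error_log) = @_;
    $error_log = 'errors.txt' unless defined $error_log;

    open my $in,  '<',  $url_list  or die "Cannot read $url_list: $!";
    open my $err, '>>', $error_log or die "Cannot append to $error_log: $!";

    my $ua = make_agent();
    my $n  = 0;
    while (my $url = <$in>) {
        $url =~ s/^\s+|\s+$//g;    # trim whitespace, including a stray CR
        next unless length $url;
        $n++;

        # Save the body straight to disk. Timeouts and robots.txt refusals
        # come back as ordinary failed responses, so they are logged too.
        my $res = $ua->get($url, ':content_file' => local_name($n));
        next if $res->is_success;

        # Law Four: record the offending URL and carry on with the next one.
        print {$err} $url, "\t", $res->status_line, "\n";
    }
    close $in;
    close $err or die "Cannot close $error_log: $!";
}

# Only run when a URL list is given on the command line.
main(@ARGV) if @ARGV;
```

It would be run as something like `perl grabber.pl links.txt errors.txt`. Note that the LWP timeout is an inactivity timeout, not a cap on the total time of a request, so a server that trickles data slowly will not trip it. The sketch is deliberately sequential: with a one-minute delay per hit, 500,000 pages take a very long time, which is a further argument for asking the site owner for an archive.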