How to make a Perl program do concurrent downloading?
Hi, there:
I wrote a program to download 500,000 HTML files from a website; I have
compiled all the links in a file, and my grabber.pl will download them
one by one.
I have a fast Internet connection, so I think it would be better to run
multiple downloads at the same time, but $INET = new Win32::Internet()
only allows one at a time. What may I do?
I also found that occasionally the grabber just hangs somewhere. In such
a situation I need to bypass $INET->FetchURL($url), write the offending
URL to an error log, and continue on to the next iteration. How may I do
that?
Best Regards,
Re: How to make a Perl program do concurrent downloading?
"Adlene" <Adlene3352@hotmail.com> wrote in message news:<firstname.lastname@example.org>...
> Hi, there:
> I wrote a program to download 500,000 HTML files from a website, I
> have compiled all the links in a file. My grabber.pl will download all of
> them one by one.
Depending on who owns the Internet site, they may find it rude that you
want to download so many files and take as many resources as possible
from their web server. Perhaps you should find a different way of
retrieving the data, such as contacting the web site administrator and
telling them what you want to do; they may give you a tar-gzipped
archive of the site.
> I have a fast internet connection. I think it is better to run multiple
> downloads at
It may be better for you, but that is questionable for everyone else.
Here is some information on web robots. You might want to do some more
searching on web robots yourself, though.
<from the above URL>
The Four Laws of Web Robotics
Law One: A Web Robot Must Show Identification
Phantom supports this. You can set the "User-Agent" and "From E-Mail"
fields in the preferences dialog. Both of these are reported in the
HTTP header when Phantom makes requests of remote Web servers.
Law Two: A Web Robot Must Obey Exclusion Standard
Phantom fully supports the exclusion standard.
Law Three: A Web Robot Must Not Hog Resources
Phantom only retrieves files it can index (unless mirroring with the
binaries option on) and restricts its movement to the path specified by
its starting points. You can also set the minimum time between hits on
the same server. Generally, 60 seconds is considered polite.
For busy sites a higher hit rate may be acceptable, but do not assume
whether a site is "busy" or not; contact the webmaster first. When
crawling your own server, of course, you can set the hit interval to
anything you like, including zero.
Law Four: A Web Robot Must Report Errors
Phantom can show you links that are no longer valid. If broken URLs are
found, please contact the Webmaster and pass this information on.
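If you do decide to crawl, the standard LWP::RobotUA module from CPAN
already implements most of these rules for you. A minimal sketch (the
agent string, contact address, and URL below are just placeholders):

```perl
#!/usr/bin/perl
use strict;
use warnings;
use LWP::RobotUA;

# Identify the robot and give a contact address (Law One).
my $ua = LWP::RobotUA->new('grabber/0.1', 'you@example.com');

# LWP::RobotUA fetches and obeys robots.txt automatically (Law Two),
# and delay() enforces a minimum interval between hits on the same
# host (Law Three).  Note: delay() is in MINUTES, so 1 = 60 seconds.
$ua->delay(1);

my $response = $ua->get('http://www.example.com/page.html');
if ($response->is_success) {
    print $response->decoded_content;
}
else {
    # Report what went wrong (Law Four).
    warn "Failed: ", $response->status_line, "\n";
}
```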
> same time, but $INET = new Win32::Internet() only allows one at a
> time. What may I do?
> I also found that occasionally the grabber just hangs somewhere. In such
> a situation I
> need to bypass $INET->FetchURL($url), write the offending URL to an error
> log, and continue on to the next iteration. How may I do that?
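To answer the concurrency and hang questions directly: Win32::Internet
has no built-in timeout or parallelism, but if you can switch to
LWP::UserAgent you get a per-request timeout (so a stuck download fails
instead of hanging forever), and Parallel::ForkManager from CPAN gives
you a pool of worker processes. A rough sketch, assuming your links are
one per line in links.txt (the file names here are placeholders, and the
naive output-file naming will collide if two URLs end the same way):

```perl
#!/usr/bin/perl
use strict;
use warnings;
use LWP::UserAgent;
use Parallel::ForkManager;    # CPAN module

# Run at most 5 downloads at once; lower this if the server complains.
my $pm = Parallel::ForkManager->new(5);

# timeout() makes a stuck request give up after 30 seconds so we can
# log the offending URL and move on to the next iteration.
my $ua = LWP::UserAgent->new(timeout => 30);

open my $urls, '<', 'links.txt'  or die "links.txt: $!";
open my $err,  '>', 'errors.log' or die "errors.log: $!";

while (my $url = <$urls>) {
    chomp $url;
    $pm->start and next;      # parent: spawn a child, go read next URL

    my $response = $ua->get($url);
    if ($response->is_success) {
        my ($file) = $url =~ m{([^/]+)$};    # crude: last path segment
        open my $out, '>', $file or die "$file: $!";
        print {$out} $response->decoded_content;
        close $out;
    }
    else {
        # Children share the inherited error-log handle, so lines from
        # different workers may interleave.
        print {$err} "$url\t", $response->status_line, "\n";
    }

    $pm->finish;              # child exits here
}
$pm->wait_all_children;
```

If you must stay with Win32::Internet, you can at least wrap the
FetchURL call in eval { ... } and log $url when $@ is set, but that
will not rescue you from a socket that simply blocks; the timeout
approach above is the more reliable fix.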
> Best Regards,