Velocity Reviews

Perl Misc (http://www.velocityreviews.com/forums/f67-perl-misc.html)

coolneo 01-29-2007 02:44 PM

Downloading lots and lots and lots of files
 
First, what I am doing is legit... I'm NOT trying to grab someone
else's content. I work for a non-profit organization and we have
something going on with Google where they are providing digitized
versions of our material. They (Google) provided some information on
how to write a script (shell) to download the digitized version using
wget.

There are about 50,000 items, ranging in size from 15MB-600MB. My
script downloads them fine, but it would be much faster if I could
multi-thread(?) it. I'm running the wget using the sys command on a
windows box (I know, I know, but the whole place is windows so I don't
have much of a choice).

Am I on the right track? Or should I be doing this differently?

Thanks!
J


coolneo 01-29-2007 03:25 PM

Re: Downloading lots and lots and lots of files
 


On Jan 29, 10:04 am, Purl Gurl <purlg...@purlgurl.net> wrote:
> coolneo wrote:
> > There are about 50,000 items, ranging in size from 15MB-600MB. My
> > script downloads them fine, but it would be much faster if I could
> > multi-thread(?) it.
>
> You indicate you have already downloaded those files.
>
> Why do you want to download those files again?
>
> Purl Gurl



I managed to download about 21,000 of the 50,000 items over the course
of some time. Initially, Google was processing these items at a slow
rate but lately they have picked it up.

Bandwidth is indeed a concern, and I understand downloading 5TB will
take a long long time, but I think it would be a little shorter if I
could spawn off 4 downloads at a time, or even 2, during our off
business hours and the weekend (I get . The average file size is
125MB. We have a 200mb pipe, so it's not entirely unreasonable (is
it?).
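
A rough back-of-the-envelope check of these numbers, as a small Perl sketch. It assumes "200mb" means 200 megabits per second, that the whole link is available to the downloads, and that the remaining 29,000 items average 125MB each. None of that is guaranteed, and the far end may well be the limit rather than the pipe.

```perl
use strict;
use warnings;

# Estimate transfer time in hours for a number of files of a given average
# size (in megabytes) over a link of a given speed (in megabits per second).
sub transfer_hours {
    my ($files, $avg_mb, $link_mbit) = @_;
    my $total_mb   = $files * $avg_mb;
    my $mb_per_sec = $link_mbit / 8;      # 8 bits per byte
    return $total_mb / $mb_per_sec / 3600;
}

# Assumed figures: 50,000 - 21,000 = 29,000 items left, 125MB average.
my $remaining = 50_000 - 21_000;
printf "%.1f TB left, about %.0f hours at full line rate\n",
    $remaining * 125 / 1_000_000,
    transfer_hours($remaining, 125, 200);
```

With these figures the estimate comes to roughly 40 hours of continuous transfer at the theoretical maximum; actual throughput will be lower.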




Peter Scott 01-29-2007 03:42 PM

Re: Downloading lots and lots and lots of files
 
On Mon, 29 Jan 2007 06:44:02 -0800, coolneo wrote:
> First, what I am doing is legit... I'm NOT trying to grab someone
> else's content. I work for a non-profit organization and we have
> something going on with Google where they are providing digitized
> versions of our material. They (Google) provided some information on
> how to write a script (shell) to download the digitized version using
> wget.
>
> There are about 50,000 items, ranging in size from 15MB-600MB. My
> script downloads them fine, but it would be much faster if I could
> multi-thread(?) it. I'm running the wget using the sys command on a
> windows box (I know, I know, but the whole place is windows so I don't
> have much of a choice).


You could try

http://search.cpan.org/~marclang/Par...WP/Parallel.pm

Looks like you'll need Cygwin.

--
Peter Scott
http://www.perlmedic.com/
http://www.perldebugged.com/


Ted Zlatanov 01-29-2007 05:20 PM

Re: Downloading lots and lots and lots of files
 
On 29 Jan 2007, coolneo@yahoo.com wrote:

> I managed to download about 21,000 of the 50,000 items over the course
> of some time. Initially, Google was processing these items at a slow
> rate but lately they have picked it up.


> Bandwidth is indeed a concern, and I understand downloading 5TB will
> take a long long time, but I think it would be a little shorter if I
> could spawn off 4 downloads at a time, or even 2, during our off
> business hours and the weekend (I get . The average file size is
> 125MB. We have a 200mb pipe, so it's not entirely unreasonable (is
> it?).


You should contact Google and request the data directly. I guarantee
you they will be happy to avoid the load on their network and
servers, since HTTP is not the best way to transfer lots of data.

Ted

xhoster@gmail.com 01-29-2007 05:22 PM

Re: Downloading lots and lots and lots of files
 
Abigail <abigail@abigail.be> wrote:
>
> Of course, it's quite likely that the network is the bottleneck.
> Starting up many simultaneous connections isn't going to help in
> that case.
>
> Finally, I wouldn't use threads. I'd either fork() or use a select()
> loop, depending on the details of the work that needs to be done.
> But then, I'm a Unix person.


I probably wouldn't even use fork. I'd just make 3 (or 4, or 10, whatever)
different to-do lists, and start up 3 (or 4, or 10) completely independent
programs from the command line.
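
A minimal sketch of that approach: deal the URLs round-robin into N separate list files, then start one downloader per file from the command line. The file names (urls.txt, todo_1.txt and so on) are made up for illustration.

```perl
use strict;
use warnings;

# Deal a list of items round-robin into $n smaller lists, so that large and
# small files are spread fairly evenly even if the input is sorted by size.
sub split_list {
    my ($n, @items) = @_;
    my @lists = map { [] } 1 .. $n;
    for my $i (0 .. $#items) {
        push @{ $lists[$i % $n] }, $items[$i];
    }
    return @lists;
}

# Hypothetical usage: read urls.txt (one URL per line) and write
# todo_1.txt .. todo_4.txt. Skipped when urls.txt does not exist.
if (-e 'urls.txt') {
    open my $in, '<', 'urls.txt' or die "urls.txt: $!";
    chomp(my @urls = <$in>);
    close $in;

    my @lists = split_list(4, @urls);
    for my $k (0 .. $#lists) {
        my $name = 'todo_' . ($k + 1) . '.txt';
        open my $out, '>', $name or die "$name: $!";
        print {$out} "$_\n" for @{ $lists[$k] };
        close $out or die "$name: $!";
    }
}
```

Each list can then be run in its own console window, for example with `wget -c -i todo_1.txt`, where `-i` reads the URLs from a file and `-c` continues partially downloaded files.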

Xho


gf 01-29-2007 05:55 PM

Re: Downloading lots and lots and lots of files
 

coolneo wrote:
> [...] They (Google) provided some information on
> how to write a script (shell) to download the digitized version using
> wget.
>
> There are about 50,000 items, ranging in size from 15MB-600MB. My
> script downloads them fine, but it would be much faster if I could
> multi-thread(?) it. I'm running the wget using the sys command on a
> windows box (I know, I know, but the whole place is windows so I don't
> have much of a choice).
>
> Am I on the right track? Or should I be doing this differently?


You didn't say if this is a one-time job or something that'll be
ongoing.

If it's a one-time job, then I'd split that file list into however
many processes I want to run, then start that many shell jobs and just
let 'em run until it's done. It's not elegant, it's brute force, but
sometimes that's plenty good.

If you're going to be doing this regularly, then LWP::Parallel is
pretty sweet. You can have each LWP agent shift an individual URL off
the list and slowly whittle it down.
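
The "shift a URL off the list" idea can be sketched without LWP::Parallel at all. The code below is not that module's API; it is a core-Perl stand-in that keeps a fixed number of child processes busy with fork() and wait(), and it assumes a Unix-like environment such as Cygwin. The function name run_pool and the wget command line in the usage comment are illustrative only.

```perl
use strict;
use warnings;

# Run $job->($item) for every item in @$items in child processes, keeping at
# most $max children alive at once. $job should return true on success.
# Returns the number of jobs that failed.
sub run_pool {
    my ($max, $items, $job) = @_;
    my @todo = @$items;
    my %running;                 # pid => item being handled
    my $failed = 0;

    while (@todo or %running) {
        # Top up the pool while there is work left and a free slot.
        while (@todo and keys(%running) < $max) {
            my $item = shift @todo;
            my $pid  = fork();
            die "fork failed: $!" unless defined $pid;
            if ($pid == 0) {     # child: do one job, report via exit status
                exit($job->($item) ? 0 : 1);
            }
            $running{$pid} = $item;
        }
        # Block until any child finishes, then free its slot.
        my $pid = wait();
        last if $pid == -1;      # no children left to wait for
        $failed++ if $? != 0;
        delete $running{$pid};
    }
    return $failed;
}

# Hypothetical usage, assuming wget is on the PATH and @urls holds the list:
# my $failed = run_pool(4, \@urls, sub { system('wget', '-c', $_[0]) == 0 });
```

Compared with pre-splitting the list, a pool like this keeps every slot busy until the list is empty, so one slot stuck on a run of 600MB files does not leave the others idle.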

The I/O issues mentioned are going to be worse on a single box though.
You can hit a point where the machine is network I/O bound, so you
might want to consider confiscating a couple of PCs and running a
separate job on each PC, as long as you're on a switch and a fast pipe.

I'd also seriously consider a modern sneaker-net, and see about buying
some hard-drives that'll hold the entire set of data, and send them to
Google, have them fill the drives, and then return them overnight air.
That might be a lot faster, and then you could reuse the drives later.


coolneo 01-29-2007 07:04 PM

Re: Downloading lots and lots and lots of files
 


On Jan 29, 12:20 pm, Ted Zlatanov <t...@lifelogs.com> wrote:
> On 29 Jan 2007, cool...@yahoo.com wrote:
>
> > I managed to download about 21,000 of the 50,000 items over the course
> > of some time. Initially, Google was processing these items at a slow
> > rate but lately they have picked it up.
> >
> > Bandwidth is indeed a concern, and I understand downloading 5TB will
> > take a long long time, but I think it would be a little shorter if I
> > could spawn off 4 downloads at a time, or even 2, during our off
> > business hours and the weekend (I get . The average file size is
> > 125MB. We have a 200mb pipe, so it's not entirely unreasonable (is
> > it?).
>
> You should contact Google and request the data directly. I guarantee
> you they will be happy to avoid the load on their network and
> servers, since HTTP is not the best way to transfer lots of data.
>
> Ted


Ted, I left out some additional information that may make you think
differently:

Google is kinda odd sometimes. It took them forever to allow multiple
download streams, and then they provide this web interface to recall
data in text format with wget. I mean, for Google, you figure they
could do better. I think they would prefer to not give us anything at
all. Once we have it there is always the chance we'll give it away or
lose it or have it stolen (by Microsoft!).

Another thing I didn't mention is that this can grow to much larger
than the 50,000, in which case, I'd much rather just auto-download,
than deal with media.


Dr.Ruud 01-29-2007 07:34 PM

Re: Downloading lots and lots and lots of files
 
coolneo wrote:

> recall data in text format with wget.


I assume it is gz-compressed?

--
Affijn, Ruud

"Gewoon is een tijger."

Ted Zlatanov 01-29-2007 08:33 PM

Re: Downloading lots and lots and lots of files
 
On 29 Jan 2007, coolneo@yahoo.com wrote:

> Google is kinda odd sometimes. It took them forever to allow multiple
> download streams, and then they provide this web interface to recall
> data in text format with wget. I mean, for Google, you figure they
> could do better. I think they would prefer to not give us anything at
> all. Once we have it there is always the chance we'll give it away or
> lose it or have it stolen (by Microsoft!).


As a business decision it may make sense; technically it's nonsense :)

At the very least they should give you an rsync interface. It's a
single TCP stream, it's fast, and it can be resumed if the connection
should abort. HTTP is low on my list of transport mechanisms for
large files.

> Another thing I didn't mention is that this can grow to much larger
> than the 50,000, in which case, I'd much rather just auto-download,
> than deal with media.


Sure. I was talking about your initial data load; subsequent loads
can be incremental.

I would also suggest limiting to N downloads per hour, to avoid bugs
or other situations (unmounted disk, for example) where you're
repeatedly requesting all the data you already have. That's a very
nasty situation.
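
One way such a cap could look, as a hedged sketch: a sliding-window check that says how long to wait before starting the next download. The function name seconds_to_wait, the limit of 20 per hour, and the local_name() helper in the usage comment are all made up for illustration.

```perl
use strict;
use warnings;

# Sliding-window limiter: given the start times (epoch seconds) of earlier
# downloads, return how many seconds to wait before starting another one so
# that no more than $max downloads start within any $window seconds.
sub seconds_to_wait {
    my ($now, $max, $window, @starts) = @_;
    my @recent = sort { $a <=> $b } grep { $_ > $now - $window } @starts;
    return 0 if @recent < $max;
    # A slot frees up when the $max-th most recent start leaves the window.
    return $recent[-$max] + $window - $now;
}

# Hypothetical main loop (the download itself is left out):
#   my @starts;
#   for my $url (@urls) {
#       next if -e local_name($url);   # local_name() is a made-up helper
#       sleep seconds_to_wait(time, 20, 3600, @starts);
#       push @starts, time;
#       ... fetch $url ...
#   }
```

The "skip if the file exists" test is exactly the part that fails when the disk is not mounted, which is why the cap is worth having on top of it: it bounds how much can be re-requested before someone notices.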

Ted

coolneo 01-30-2007 02:34 PM

Re: Downloading lots and lots and lots of files
 


Thanks everyone. I'm going to give LWP::Parallel a closer look. That
looks like it will do what I want. Thanks for the advice on queuing
the downloads. That makes perfect sense.


