Velocity Reviews

Velocity Reviews (http://www.velocityreviews.com/forums/index.php)
-   Perl Misc (http://www.velocityreviews.com/forums/f67-perl-misc.html)
-   -   The huge amount response data problem (http://www.velocityreviews.com/forums/t906878-the-huge-amount-response-data-problem.html)

falconzyx@gmail.com 03-25-2008 02:44 AM

The huge amount response data problem
 
I have an issue:
1. I want to open a file and use the data from the file to construct
URLs.
2. After I construct and send each URL, I get the response HTML
data, and some parts of it are what I want to store into files.

It seems like a very easy task; however, the file that I have to open
is huge: I have to construct almost 200,000 URL addresses, send them,
and parse the response data. And the speed is very, very slow.

I have no experience with threads or DB caching, so I would like some help.

Please give me some advice on what I should do to improve the speed.

Thanks very much.

falconzyx@gmail.com 03-25-2008 04:53 AM

Re: The huge amount response data problem
 
On Mar 25, 10:44 am, "falcon...@gmail.com" <falcon...@gmail.com>
wrote:
> I have a issue:
> 1. I want to open a file and use the data from the file to construct
> the url.
> 2. After I constructed the url and sent it, I got the response html
> data and some parts are what I want store inot the files.
>
> It seems like a very easy thing, however, the issue is that the data
> from the file that I have to open are too huge, which I have to
> consturct almost 200000 url address to send and parse response data.
> And the speed is very very slow.
>
> I have no idea with thread or db cache, so I want some help .
>
> Please give me some advices that what I should do to improve the speed
>
> Thanks very much.


This is my code:

use strict;
use warnings;
use threads;
use threads::shared;
use LWP::UserAgent;
use LWP::Simple;
use Data::Dumper;

my $wordsList = get_request();
#print Dumper( $wordsList );

my @words = split("\n", $wordsList);
#print Dumper(@words);

my @url = get_url(@words);
#print Dumper(@url);

my @thr;
foreach my $i ( 1 .. 100000 ) {
    push @thr, threads->new( \&get_html, $url[$i] );
}
foreach (@thr) {
    $_->detach;    # it doesn't work!!!!!!!!!!!!!!!!
}

sub get_html {
    my (@url) = @_;
}

sub get_request {
    ..........
    return $wordsList;
}

sub get_url {
    my (@words) = @_;
    ................
    return @url;
}

Ben Bullock 03-25-2008 07:06 AM

Re: The huge amount response data problem
 
Your code is hopelessly inefficient. 100,000 strings of even twenty
characters is at least two megabytes of memory. Then you've doubled
that number with the creation of the URL, and then you are creating
arrays of all these things, so you've used several megabytes of
memory.

Instead of first creating a huge array of names, then a huge array of
URLs, why don't you just read in one line of the file at a time, then
try to get data from each URL? Read in one line of the first file,
create its URL, get the response data, store it, then go back and get
the next line of the file, etc. A 100,000 line file actually isn't
that big.

But if you are getting all these files from the internet, the biggest
bottleneck is probably the time the code spends waiting for responses
from the web servers it has queried. You'd have to think about making
parallel requests somehow to solve that.
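The line-at-a-time approach suggested above might look like the following sketch. The filename, URL pattern, and storage routine are placeholders, since the original post elides those details:

```perl
use strict;
use warnings;
use LWP::UserAgent;

my $ua = LWP::UserAgent->new( timeout => 10 );

# Nothing is held in memory beyond the current line of the file.
open my $fh, '<', 'words.txt' or die "cannot open words.txt: $!";
while ( my $word = <$fh> ) {
    chomp $word;
    my $url      = "http://example.com/lookup/$word";    # hypothetical URL scheme
    my $response = $ua->get($url);
    if ( $response->is_success ) {
        # store_content( $word, $response->content );    # hypothetical storage step
    }
    else {
        warn "failed to fetch $url: ", $response->status_line, "\n";
    }
}
close $fh;
```

This keeps memory flat regardless of file size, but each fetch still waits for the previous one to finish, which is why parallel requests come up next.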


falconzyx@gmail.com 03-25-2008 08:25 AM

Re: The huge amount response data problem
 
On Mar 25, 3:06 pm, Ben Bullock <benkasminbull...@gmail.com> wrote:
> Your code is hopelessly inefficient. 100,000 strings of even twenty
> characters is at least two megabytes of memory. Then you've doubled
> that number with the creation of the URL, and then you are creating
> arrays of all these things, so you've used several megabytes of
> memory.
>
> Instead of first creating a huge array of names, then a huge array of
> URLs, why don't you just read in one line of the file at a time, then
> try to get data from each URL? Read in one line of the first file,
> create its URL, get the response data, store it, then go back and get
> the next line of the file, etc. A 100,000 line file actually isn't
> that big.
>
> But if you are getting all these files from the internet, the biggest
> bottleneck is probably the time the code spends waiting for a response
> from the web servers it's requested. You'd have to think about making
> parallel requests somehow to solve that.


Thanks Ben,

However, is there a good solution that uses threads? When I use
that approach I run out of memory from time to time, even after I
refactored the code as you suggested.
I tried Thread::Pool and some other thread modules that I found.
Is Perl really not suited to multi-threaded programming?

Thanks again to everyone.

falconzyx@gmail.com 03-25-2008 09:01 AM

Re: The huge amount response data problem
 
On Mar 25, 4:25 pm, "falcon...@gmail.com" <falcon...@gmail.com> wrote:
> On Mar 25, 3:06 pm, Ben Bullock <benkasminbull...@gmail.com> wrote:
>
>
>
> > Your code is hopelessly inefficient. 100,000 strings of even twenty
> > characters is at least two megabytes of memory. Then you've doubled
> > that number with the creation of the URL, and then you are creating
> > arrays of all these things, so you've used several megabytes of
> > memory.

>
> > Instead of first creating a huge array of names, then a huge array of
> > URLs, why don't you just read in one line of the file at a time, then
> > try to get data from each URL? Read in one line of the first file,
> > create its URL, get the response data, store it, then go back and get
> > the next line of the file, etc. A 100,000 line file actually isn't
> > that big.

>
> > But if you are getting all these files from the internet, the biggest
> > bottleneck is probably the time the code spends waiting for a response
> > from the web servers it's requested. You'd have to think about making
> > parallel requests somehow to solve that.

>
> Thanks Ben,
>
> However, is there any good solution that use threads method? I use
> that, and out of memory time by time after I refactor the code as you
> told
> I try thread::Pool and some other thread module that I found.
> Doesn't it really Perl suit for mutil threads programming??
>
> Thanks again for eveyone.


Here is my refactored code:

use strict;
use warnings;
use threads;
use LWP::UserAgent;
use HTTP::Request;
use LWP::Simple;    # for getstore() in save_sound()
use Data::Dumper;

get_request();

sub get_request {
    open my $fh, '<', "..." or die "can not open file: $!";
    while ( my $line = <$fh> ) {    # read one line at a time
        chomp $line;
        my $url = ".../$line";
        my $t = threads->new( \&get_html, $url );
        $t->join();    # waits for each fetch before starting the next
    }
    close $fh;
}

sub get_html {
    my ($url) = @_;
    my $user_agent = LWP::UserAgent->new();
    my $response   = $user_agent->request( HTTP::Request->new( 'GET', $url ) );
    format_html( $response->content );
}

sub format_html {
    my ($content) = @_;
    my $html_data = $content;
    my ( $word, $data );
    while ( $html_data =~ m{...}igs ) {
        $word = $1;
    }
    while ( $html_data =~ m{...}igs ) {
        $data = $1;
        save_data( $word, $data );
    }
    while ( $data =~ m{...}igs ) {
        my $title = $1;
        my $sound = $1 . $2;
        save_sound( $word, $title, $sound ) if defined $sound;
    }
}

sub save_data {
    my ( $word, $data ) = @_;
    open my $out, '>', "..." or die "Can not open: $!";
    print $out $data;
    close $out;
}

sub save_sound {
    my ( $word, $title, $sound ) = @_;
    getstore( "...", "..." ) or warn $!;
}

RedGrittyBrick 03-25-2008 09:49 AM

Re: The huge amount response data problem
 
falconzyx@gmail.com wrote:
> On Mar 25, 3:06 pm, Ben Bullock <benkasminbull...@gmail.com> wrote:
>> Your code is hopelessly inefficient. 100,000 strings of even twenty
>> characters is at least two megabytes of memory. Then you've doubled
>> that number with the creation of the URL, and then you are creating
>> arrays of all these things, so you've used several megabytes of
>> memory.
>>
>> Instead of first creating a huge array of names, then a huge array of
>> URLs, why don't you just read in one line of the file at a time, then
>> try to get data from each URL? Read in one line of the first file,
>> create its URL, get the response data, store it, then go back and get
>> the next line of the file, etc. A 100,000 line file actually isn't
>> that big.
>>
>> But if you are getting all these files from the internet, the biggest
>> bottleneck is probably the time the code spends waiting for a response
>> from the web servers it's requested. You'd have to think about making
>> parallel requests somehow to solve that.

>
> Thanks Ben,
>
> However, is there any good solution that use threads method? I use
> that, and out of memory time by time after I refactor the code as you
> told


That's because, if your file contains 100000 lines, your program tries
to create 100000 simultaneous threads doesn't it?

I would create a pool with a fixed number of threads (say 10). I'd read
the file, adding tasks to a queue of the same size; after filling the
queue I'd pause reading the file until the queue has spare space.
Maybe this could be achieved by sleeping a while (say 100ms) and
re-checking whether the queue is still full. When a thread is created or has
finished a task, it should remove a task from the queue and process it.
If the queue is empty, the thread should sleep for a while (say 200ms)
and try again. You'd need some mechanism to signal threads that all
tasks have been queued (maybe a flag, a special marker task, a signal, or
a certain number of consecutive failed attempts to find work).

I've never tried to program something like this in Perl so I'd imagine
someone (probably several people) has already solved this and added
modules to CPAN to assist in this sort of task.

There are probably some OO design patterns that apply too.
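One CPAN-assisted sketch of such a pool uses the core threads and Thread::Queue modules. Thread::Queue's dequeue() already blocks on an empty queue, so the sleep-and-retry polling isn't needed, and an undef entry can serve as the "special marker task" that ends a worker. The filename, URL scheme, and worker routine below are placeholders:

```perl
use strict;
use warnings;
use threads;
use Thread::Queue;

my $queue = Thread::Queue->new();

# Fixed pool of 10 workers; each blocks on dequeue() until a task arrives.
my @workers = map {
    threads->create( sub {
        while ( defined( my $url = $queue->dequeue() ) ) {
            # fetch_and_store($url);    # hypothetical worker routine
        }
    } );
} 1 .. 10;

# Producer: read the word list one line at a time, queuing a task per line.
open my $fh, '<', 'words.txt' or die "cannot open words.txt: $!";
while ( my $word = <$fh> ) {
    chomp $word;
    $queue->enqueue("http://example.com/lookup/$word");    # hypothetical URL scheme
}
close $fh;

# One undef per worker is the marker task telling it to stop; then wait.
$queue->enqueue(undef) for @workers;
$_->join() for @workers;
```

Note that a plain Thread::Queue is unbounded, so for very long files the producer may want to enqueue in batches rather than all at once.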

> I try thread::Pool and some other thread module that I found.
> Doesn't it really Perl suit for mutil threads programming??


I find it hard to understand what you are saying but I think the answer
is: Yes, Perl is well suited to programming with multiple threads (or
processes).

--
RGB

Jürgen Exner 03-25-2008 01:03 PM

Re: The huge amount response data problem
 
"falconzyx@gmail.com" <falconzyx@gmail.com> wrote:
>consturct almost 200000 url address to send and parse response data.
>And the speed is very very slow.
>
>Please give me some advices that what I should do to improve the speed


Get a T1 line.

jue

xhoster@gmail.com 03-25-2008 05:42 PM

Re: The huge amount response data problem
 
"falconzyx@gmail.com" <falconzyx@gmail.com> wrote:
> I have a issue:
> 1. I want to open a file and use the data from the file to construct
> the url.
> 2. After I constructed the url and sent it, I got the response html
> data and some parts are what I want store inot the files.
>
> It seems like a very easy thing, however, the issue is that the data
> from the file that I have to open are too huge, which I have to
> consturct almost 200000 url address to send and parse response data.
> And the speed is very very slow.


What part is slow, waiting for the response or parsing it?

Do those URLs point to *your* servers? If so, then you should be able
to bypass HTTP and go directly to the source. If not, then do you have
permission from the owners of the servers to launch what could very well
be a denial-of-service attack against them?

Xho

--
-------------------- http://NewsReader.Com/ --------------------
The costs of publication of this article were defrayed in part by the
payment of page charges. This article must therefore be hereby marked
advertisement in accordance with 18 U.S.C. Section 1734 solely to indicate
this fact.

xhoster@gmail.com 03-25-2008 05:50 PM

Re: The huge amount response data problem
 
RedGrittyBrick <RedGrittyBrick@SpamWeary.foo> wrote:
>
> I find it hard to understand what you are saying but I think the answer
> is: Yes, Perl is well suited to programming with multiple threads (or
> processes).


I agree with the "(or processes)" part, provided you are running on a Unix
like platform. But in my experience/opinion Perl threads mostly suck.

--
-------------------- http://NewsReader.Com/ --------------------
The costs of publication of this article were defrayed in part by the
payment of page charges. This article must therefore be hereby marked
advertisement in accordance with 18 U.S.C. Section 1734 solely to indicate
this fact.
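A process-based version of the same job, for a Unix-like platform, might use the CPAN module Parallel::ForkManager to cap the number of concurrent children. This is a sketch only; the filename and URL pattern are placeholders:

```perl
use strict;
use warnings;
use LWP::UserAgent;
use Parallel::ForkManager;    # CPAN module

my $pm = Parallel::ForkManager->new(10);    # at most 10 child processes at once
my $ua = LWP::UserAgent->new( timeout => 10 );

open my $fh, '<', 'words.txt' or die "cannot open words.txt: $!";
while ( my $word = <$fh> ) {
    chomp $word;
    $pm->start and next;    # parent keeps reading; child falls through to fetch
    my $response = $ua->get("http://example.com/lookup/$word");    # hypothetical URL
    # each child should write its results to its own file here
    $pm->finish;            # child exits
}
close $fh;
$pm->wait_all_children;
```

Because each fetch runs in its own process, a crash or leak in one request cannot take down the whole run, which sidesteps the threading problems described above.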

falconzyx@gmail.com 03-27-2008 06:27 AM

Re: The huge amount response data problem
 
On Mar 26, 1:50 am, xhos...@gmail.com wrote:
> RedGrittyBrick <RedGrittyBr...@SpamWeary.foo> wrote:
>
> > I find it hard to understand what you are saying but I think the answer
> > is: Yes, Perl is well suited to programming with multiple threads (or
> > processes).

>
> I agree with the "(or processes)" part, provided you are running on a Unix
> like platform. But in my experience/opinion Perl threads mostly suck.


Here is my refactored code, which still runs very slowly; please
advise me how to improve it. Thanks very much:

use strict;
use warnings;
use threads;
use HTTP::Request;
use LWP::Simple;
use LWP::Parallel::UserAgent;

# display tons of debugging messages. See 'perldoc LWP::Debug'
#use LWP::Debug qw(+);

my $reqs = [
    HTTP::Request->new('GET', "http://www...."),
    HTTP::Request->new('GET', "......"),
    ..............    # about nearly 200000 urls here
];

my $pua = LWP::Parallel::UserAgent->new();
$pua->in_order(1);      # handle requests in order of registration
$pua->duplicates(0);    # ignore duplicates
$pua->timeout(1);       # in seconds
$pua->redirect(1);      # follow redirects

foreach my $req (@$reqs) {
    print "Registering '" . $req->url . "'\n";
    if ( my $res = $pua->register($req) ) {
        print STDERR $res->error_as_HTML;
    }
}
my $entries = $pua->wait();

foreach ( keys %$entries ) {
    my $res = $entries->{$_}->response;
    threads->new( \&format_html, $res->content );
}
foreach my $thr ( threads->list() ) {
    $thr->join();    # I think it does not work......
}

sub format_html {
    my ($html_data) = @_;
    my ( $word, $data );
    while ( $html_data =~ m{...}igs ) {
        $word = $1;
    }
    while ( $html_data =~ m{...}igs ) {
        $data = $1;
        save_data( $word, $data );
    }
    while ( $data =~ m{...}igs ) {
        my $title = $1;
        my $sound = $1 . $2;
        save_sound( $word, $title, $sound ) if defined $sound;
    }
}

sub save_data {
    my ( $word, $data ) = @_;
    open my $out, '>', "..." or die "Can not open: $!";
    print $out $data;
    close $out;
}

sub save_sound {
    my ( $word, $title, $sound ) = @_;
    getstore( "...", "..." ) or warn $!;
}


