Need ideas on how to make this code faster than a speeding turtle

 
 
chadda@lonemerchant.com
      05-15-2008
I'll eventually have the input file filled with 350 million items.
Right now there is only one:

$ more input
3308191

The following program reads in the number from the file named 'input'
and builds a url from this number. I then have lynx dump the data into
a file called 'out' and just grep the entire thing for the Product
Number, Product ID, SKU, UPC, and weight.


m-net% more parse.pl
#!/usr/bin/perl -w

my (@****, $read, $build, @product, @id, @sku, @upc, @weight);
my $temp;

open(IN, '<', 'input') || die "cant open: $!";
$read = <IN>;
chomp($read);
$build = "http://www.doba.com/members/catalog/".$read.".html";
$temp = `lynx -accept_all_cookies -dump $build`;
open(OUTFILE, '>out');
print OUTFILE $temp;
close OUTFILE;

open(OUT, '<', 'out') || die "cant open: $!";
@**** = <OUT>;

@product = grep(/Product ID/, @****);
@id = grep(/Item ID/, @****);
@sku = grep(/SKU/, @****);
# this part doesn't grep UPC correctly. I get some extra data after UPC.
@upc = grep(/UPC/, @****);
@weight = grep(/Weight/, @****);

print @product;
print @id;
print @sku;
print @upc;
print @weight;

% ./parse.pl
Product ID: 3308191
Item ID: 3653992
SKU: 8930
UPC: 896207999816 Condition: refurbished
Weight: 4.7 lbs.
 
Uri Guttman
      05-15-2008
>>>>> "c" == chadda <(E-Mail Removed)> writes:


i have to know if you could write this mess any slower? you are doing
everything possible to slow you down.

c> open(IN, '<', 'input') || die "cant open: $!";
c> $read = <IN>;
c> chomp($read);
c> $build = "http://www.doba.com/members/catalog/".$read.".html";
c> $temp = `lynx -accept_all_cookies -dump $build`;

why are you calling out to a program when perl can load web pages just
fine with LWP? did you even look for web stuff on cpan?

c> open(OUTFILE, '>out');
c> print OUTFILE $temp;
c> close OUTFILE;

c> open(OUT, '<', 'out') || die "cant open: $!";
c> @**** = <OUT>;

why are you writing out the output of lynx JUST TO READ IT BACK IN
AGAIN? this is the most absurd part of this program.

you have the text in $temp. you know how to use backticks but why do you
do the file write and reading back in? if you assigned the backticks to
an array you would get the same thing as in @**** without the wasted
effort.

also calling it @**** is not a good thing.

c> @product = grep(/Product ID/, @****);
c> @id = grep(/Item ID/, @****);
c> @sku = grep(/SKU/, @****);
c> @upc = grep(/UPC/, @****); #this part doesn't grep UPC correctly. I
c> get some extra data after UPC.

that is a problem with the format of the html page. html isn't line
oriented and you are grepping over lines. the proper way to deal with
html is with a parser. or in special very well defined cases with
regexes to actually grab what you want from the text. whole html lines
are almost never what you want.
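
As a rough illustration of the regex route for a page with a fixed layout, the sketch below captures only the value after each label instead of keeping whole lines. The sample text is the output shown earlier in the thread; extract_fields and the hash keys are made-up names, and the patterns assume each label is followed by a colon and its value on the same line.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Sample text copied from the output shown earlier in the thread; in the
# real script this would be the scalar returned by the lynx backticks.
my $text = <<'END';
Product ID: 3308191
Item ID: 3653992
SKU: 8930
UPC: 896207999816 Condition: refurbished
Weight: 4.7 lbs.
END

# extract_fields is a hypothetical helper name, not something from the thread.
sub extract_fields {
    my ($page) = @_;
    my %field;

    # Capture only the value after each label, not the rest of the line.
    # A failed match in list context leaves the field undefined.
    ($field{product_id}) = $page =~ /Product ID:\s*(\d+)/;
    ($field{item_id})    = $page =~ /Item ID:\s*(\d+)/;
    ($field{sku})        = $page =~ /SKU:\s*(\S+)/;
    ($field{upc})        = $page =~ /UPC:\s*(\d+)/;
    ($field{weight})     = $page =~ /Weight:\s*([\d.]+)/;

    return \%field;
}

my $f = extract_fields($text);
print "$_: ", ( defined $f->{$_} ? $f->{$_} : '(not found)' ), "\n"
    for sort keys %$f;
```

Because the UPC pattern stops at the last digit, the trailing "Condition: refurbished" text on that line is not picked up.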

uri

--
Uri Guttman ------ (E-Mail Removed) -------- http://www.sysarch.com --
----- Perl Code Review , Architecture, Development, Training, Support ------
--------- Free Perl Training --- http://perlhunter.com/college.html ---------
--------- Gourmet Hot Cocoa Mix ---- http://bestfriendscocoa.com ---------
 
 
chadda@lonemerchant.com
      05-15-2008
On May 15, 1:37 pm, Uri Guttman <(E-Mail Removed)> wrote:
> >>>>> "c" == chadda <(E-Mail Removed)> writes:

>
> i have to know if you could write this mess any slower? you are doing
> everything possible to slow you down.


I know I shouldn't criticize free help, but you seem to have some anger
management issues.
>
> c> open(IN, '<', 'input') || die "cant open: $!";
> c> $read = <IN>;
> c> chomp($read);
> c> $build = "http://www.doba.com/members/catalog/".$read.".html";
> c> $temp = `lynx -accept_all_cookies -dump $build`;
>
> why are you calling out to a program when perl can load web pages just
> fine with LWP? did you even look for web stuff on cpan?
>

Would using LWP speed up the code? By the way, this code is meant to
run on a server with restricted access, i.e., I can't install stuff from
cpan on that server.

> c> open(OUTFILE, '>out');
> c> print OUTFILE $temp;
> c> close OUTFILE;
>
> c> open(OUT, '<', 'out') || die "cant open: $!";
> c> @**** = <OUT>;
>
> why are you writing out the output of lynx JUST TO READ IT BACK IN
> AGAIN? this is the most absurd part of this program.
>
> you have the text in $temp. you know how to use backticks but why do you
> do the file write and reading back in? if you assigned the backticks to
> an array you would get the same thing as in @**** without the wasted
> effort.
>
> also calling it @**** is not a good thing.
>

Huh? Are you saying I don't need the 'out' file?

> c> @product = grep(/Product ID/, @****);
> c> @id = grep(/Item ID/, @****);
> c> @sku = grep(/SKU/, @****);
> c> @upc = grep(/UPC/, @****); #this part doesn't grep UPC correctly. I
> c> get some extra data after UPC.
>
> that is a problem with the format of the html page. html isn't line
> oriented and you are grepping over lines. the proper way to deal with
> html is with a parser. or in special very well defined cases with
> regexes to actually grab what you want from the text. whole html lines
> are almost never what you want.
>
> uri
>



 
 
chadda@lonemerchant.com
      05-15-2008
On May 15, 2:21 pm, (E-Mail Removed) wrote:
> On May 15, 1:37 pm, Uri Guttman <(E-Mail Removed)> wrote:
>
> > >>>>> "c" == chadda <(E-Mail Removed)> writes:

>
> > i have to know if you could write this mess any slower? you are doing
> > everything possible to slow you down.

>
> I know I shouldn't critize free help, but you seem to have some anger
> management issues.
>
> > c> open(IN, '<', 'input') || die "cant open: $!";
> > c> $read = <IN>;
> > c> chomp($read);
> > c> $build = "http://www.doba.com/members/catalog/".$read.".html";
> > c> $temp = `lynx -accept_all_cookies -dump $build`;

>
> > why are you calling out to a program when perl can load web pages just
> > fine with LWP? did you even look for web stuff on cpan?

>
> Would using LWP speed up the code? By the way, this code is meant to
> run on a server with restricted access. Ie, I can't install stuff from
> cpan on that server.
>
>
>
> > c> open(OUTFILE, '>out');
> > c> print OUTFILE $temp;
> > c> close OUTFILE;

>
> > c> open(OUT, '<', 'out') || die "cant open: $!";
> > c> @**** = <OUT>;

>
> > why are you writing out the output of lynx JUST TO READ IT BACK IN
> > AGAIN? this is the most absurd part of this program.

>
> > you have the text in $temp. you know how to use backticks but why do you
> > do the file write and reading back in? if you assigned the backticks to
> > an array you would get the same thing as in @**** without the wasted
> > effort.

>
> > also calling it @**** is not a good thing.

>
> Huh? Are you saying I don't need the 'out' file?


Maybe something like this?
% more parse.pl
#!/usr/bin/perl -w

my (@****, $read, $build, @product, @id, @sku, @upc, @weight);
my @temp;

open(IN, '<', 'input') || die "cant open: $!";
$read = <IN>;
chomp($read);
$build = "http://www.doba.com/members/catalog/".$read.".html";
@temp = `lynx -accept_all_cookies -dump $build`;

@product = grep(/Product ID/, @temp);
@id = grep(/Item ID/, @temp);
@sku = grep(/SKU/, @temp);
@upc = grep(/UPC/, @temp);
@weight = grep(/Weight/, @temp);

print @product;
print @id;
print @sku;
print @upc;
print @weight;


However, I don't know how to use LWP. Again, would the code run faster
if I used LWP?
 
 
Uri Guttman
      05-15-2008
>>>>> "c" == chadda <(E-Mail Removed)> writes:

c> On May 15, 1:37 pm, Uri Guttman <(E-Mail Removed)> wrote:
>> >>>>> "c" == chadda <(E-Mail Removed)> writes:

>>
>> i have to know if you could write this mess any slower? you are doing
>> everything possible to slow you down.


c> I know I shouldn't critize free help, but you seem to have some anger
c> management issues.

nope. i have bad code anger issues. i deal with this in code reviews all
the time. i just don't get how people come up with wacky and slow ways
to do things. i have seen worse code that read in files, parsed them,
wrote them out (untouched) and read them in again.



>>

c> open(IN, '<', 'input') || die "cant open: $!";
c> $read = <IN>;
c> chomp($read);
c> $build = "http://www.doba.com/members/catalog/".$read.".html";
c> $temp = `lynx -accept_all_cookies -dump $build`;
>>
>> why are you calling out to a program when perl can load web pages just
>> fine with LWP? did you even look for web stuff on cpan?
>>

c> Would using LWP speed up the code? By the way, this code is meant to
c> run on a server with restricted access. Ie, I can't install stuff from
c> cpan on that server.

if you have access to load scripts you can load pure perl modules
too. this is an FAQ.
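
One way this can look in practice, as a sketch: put the pure-Perl module files in a directory you can write to and add that directory to @INC with the lib pragma. The module name MyLocal and the temporary directory are placeholders for illustration only; a real setup would point at wherever the uploaded modules live.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Temp qw(tempdir);
use File::Spec;

# Stand-in for a private module directory (for example one under $HOME).
my $dir = tempdir( CLEANUP => 1 );

# Write a tiny pure-Perl module there. MyLocal is a hypothetical name.
my $path = File::Spec->catfile( $dir, 'MyLocal.pm' );
open my $fh, '>', $path or die "Cannot write '$path': $!";
print {$fh} "package MyLocal;\n",
            "sub hello { return 'hello from a private directory' }\n",
            "1;\n";
close $fh or die "Cannot close '$path': $!";

# Runtime equivalent of `use lib $dir;` (needed here only because the
# directory is created while the script is already running).
require lib;
lib->import($dir);

require MyLocal;
print MyLocal::hello(), "\n";
```

With a directory that already exists when the script starts, a plain `use lib '/path/to/dir';` near the top of the script does the same job.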

c> open(OUTFILE, '>out');
c> print OUTFILE $temp;
c> close OUTFILE;
>>

c> open(OUT, '<', 'out') || die "cant open: $!";
c> @**** = <OUT>;
>>
>> why are you writing out the output of lynx JUST TO READ IT BACK IN
>> AGAIN? this is the most absurd part of this program.
>>
>> you have the text in $temp. you know how to use backticks but why do you
>> do the file write and reading back in? if you assigned the backticks to
>> an array you would get the same thing as in @**** without the wasted
>> effort.
>>
>> also calling it @**** is not a good thing.
>>

c> Huh? Are you saying I don't need the 'out' file?

yes. why do you think you need that file? you call backticks and get the
html page in $temp. why do you think you need a file to process that
data? you already have it inside perl.
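
A small sketch of working on the data in memory, assuming the page text is already in a scalar. The string below is a stand-in for what the lynx backticks would return. Either split it into lines, or open a filehandle directly on the scalar (supported since Perl 5.8) if line-by-line reading is preferred.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Stand-in for the scalar returned by the lynx backticks.
my $temp = "Product ID: 3308191\nItem ID: 3653992\nSKU: 8930\n";

# Option 1: split the string into lines, keeping each trailing newline.
my @lines = split /^/m, $temp;
my @sku   = grep { /SKU/ } @lines;
print @sku;

# Option 2: open a read-only filehandle on the string itself.
open my $fh, '<', \$temp or die "Cannot open in-memory handle: $!";
while ( my $line = <$fh> ) {
    print $line if $line =~ /Item ID/;
}
close $fh;
```

Neither option touches the disk, so the intermediate 'out' file is not needed.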

uri

 
 
Uri Guttman
      05-15-2008
>>>>> "c" == chadda <(E-Mail Removed)> writes:

>> Huh? Are you saying I don't need the 'out' file?


yes.

c> Maybe something like this?
c> % more parse.pl
c> #!/usr/bin/perl -w

c> my (@****, $read, $build, @product, @id, @sku, @upc, @weight);
c> my @temp;

c> open(IN, '<', 'input') || die "cant open: $!";
c> $read = <IN>;
c> chomp($read);
c> $build = "http://www.doba.com/members/catalog/".$read.".html";
c> @temp = `lynx -accept_all_cookies -dump $build`;

c> @product = grep(/Product ID/, @temp);
c> @id = grep(/Item ID/, @temp);
c> @sku = grep(/SKU/, @temp);
c> @upc = grep(/UPC/, @temp);
c> @weight = grep(/Weight/, @temp);

c> print @product;
c> print @id;
c> print @sku;
c> print @upc;
c> print @weight;


c> However, I don't know how to use LWP. Again, would the code run faster
c> if I used LWP?

better but forking off lynx is still slow. LWP should be much faster. if
you want speed (and with the data size you have, you want it), use LWP.

depending on how fast you need it (cpu usage will spike with the greps
you have) you can also change all that to parse out what you want with
regexes. (again, that assumes a known fixed html page layout which you
seem to have).

uri

 
 
Gordon Etly
      05-15-2008
(E-Mail Removed) wrote:
> On May 15, 1:37 pm, Uri Guttman <(E-Mail Removed)> wrote:
> chadda <(E-Mail Removed)> writes:


> > i have to know if you could write this mess any slower? you are
> > doing
> > everything possible to slow you down.


> I know I shouldn't critize free help, but you seem to have some anger
> management issues.


He seems to constantly come across this way. I really wish he could see
things from other points of view.
....


As a simple answer, take a look at LWP::UserAgent
(http://search.cpan.org/~gaas/libwww-...P/UserAgent.pm),
as a good start in the right direction.

--
G.Etly


 
 
A. Sinan Unur
      05-15-2008
"Gordon Etly" <(E-Mail Removed)> wrote in
news:(E-Mail Removed):

> (E-Mail Removed) wrote:
>> On May 15, 1:37 pm, Uri Guttman <(E-Mail Removed)> wrote:
>> chadda <(E-Mail Removed)> writes:

>
>> > i have to know if you could write this mess any slower? you are
>> > doing
>> > everything possible to slow you down.

>
>> I know I shouldn't critize free help, but you seem to have some anger
>> management issues.

>
> He seems to constantly come across this way. I really wish he could
> see things from other points of view.
> ...
>
>
> As a simple answer, take a look at LWP::UserAgent
> (http://search.cpan.org/~gaas/libwww-...P/UserAgent.pm),
> as a good start in the right direction.


All the OP needs is LWP::Simple and HTML::TableExtract.

In fact, I wrote a whole script that took only 0.8 seconds to download
and parse a single page (of course, with more id's in a file, the only
real limit on the speed is the network latency and transfer speed) but I
have decided not to post it as I do not know what his intentions are.

As for you, pick a posting id and stick with it.

PLONKETY PLONK!

Sinan

--
A. Sinan Unur <(E-Mail Removed)>
(remove .invalid and reverse each component for email address)

comp.lang.perl.misc guidelines on the WWW:
http://www.rehabitation.com/clpmisc/
 
 
chadda@lonemerchant.com
      05-15-2008
On May 15, 3:16 pm, "Gordon Etly" <(E-Mail Removed)> wrote:
> (E-Mail Removed) wrote:
> > On May 15, 1:37 pm, Uri Guttman <(E-Mail Removed)> wrote:
> > chadda <(E-Mail Removed)> writes:
> > > i have to know if you could write this mess any slower? you are
> > > doing
> > > everything possible to slow you down.

> > I know I shouldn't critize free help, but you seem to have some anger
> > management issues.

>
> He seems to constantly come across this way. I really wish he could see
> things from other points of view.
> ...
>
> As a simple answer, take a look at LWP::UserAgent
> (http://search.cpan.org/~gaas/libwww-...P/UserAgent.pm),
> as a good start in the right direction.
>
> --
> G.Etly



I just tried LWP, and now I can't get the code to work for the life of
me. Here is what I attempted

#!/usr/bin/perl -w

use LWP::UserAgent;
use HTTP::Request;
use HTTP::Cookies;

my ($read, $build, @product, @id, @sku, @upc, @weight);
my @temp;

open(IN, '<', 'input') || die "cant open: $!";
$read = <IN>;
chomp($read);
$build = 'http://www.doba.com/members/catalog/'.$read.'.html';
#@temp = `lynx -accept_all_cookies -dump $build`;

my $ua = LWP::UserAgent->new;
$ua->agent("OMEGA SPARC DESTROYER/69");

my $request = HTTP::Request->new('GET');
$request->url($build);

my $cookie_jar = HTTP::Cookies->new;
$cookie_jar->add_cookie_header($request);

my $response = $ua->request($request);

my $code = $response->code;
print $code;

@temp = $request->content;

@product = grep(/Product ID/, @temp);
@id = grep(/Item ID/, @temp);
@sku = grep(/SKU/, @temp);
@upc = grep(/UPC/, @temp);
@weight = grep(/Weight/, @temp);

print @product;
print @id;
print @sku;
print @upc;
print @weight;

% ./parse.pl
500%
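
For comparison, a hedged sketch of the same fetch with the response handled differently. In the attempt above, the page body is read from $request rather than $response, the cookie jar is never attached to the user agent, and content returns one scalar rather than a list of lines. Whether these changes clear the 500 is not something the thread confirms. LWP also reports its own client-side failures (a failed connection, for example) as 500, so printing status_line usually says more than the bare code. build_url and fetch_page are hypothetical names.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# build_url and fetch_page are hypothetical helper names.
sub build_url {
    my ($id) = @_;
    return "http://www.doba.com/members/catalog/$id.html";
}

sub fetch_page {
    my ($url) = @_;
    require LWP::UserAgent;    # loaded lazily; only needed when fetching

    # cookie_jar => {} gives the agent an in-memory cookie jar.
    my $ua = LWP::UserAgent->new( cookie_jar => {}, timeout => 30 );
    my $response = $ua->get($url);

    # status_line carries the reason text, not just the numeric code.
    die "GET $url failed: ", $response->status_line, "\n"
        unless $response->is_success;

    # The page body lives on the response, not on the request.
    return $response->decoded_content;
}

# Only go to the network when an ID is given on the command line.
if (@ARGV) {
    my $page = fetch_page( build_url( $ARGV[0] ) );

    # The body is a single scalar; split it into lines before a line-wise grep.
    print grep { /Product ID|Item ID|SKU|UPC|Weight/ } split /^/m, $page;
}
```

Run as `./parse.pl 3308191`. Unlike the lynx dump, the body here is raw HTML, so the matching lines will include markup.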
 
 
A. Sinan Unur
      05-15-2008
(E-Mail Removed) wrote in
news:(E-Mail Removed):

> On May 15, 3:16 pm, "Gordon Etly" <(E-Mail Removed)> wrote:
>> (E-Mail Removed) wrote:
>> > On May 15, 1:37 pm, Uri Guttman <(E-Mail Removed)> wrote:
>> > chadda <(E-Mail Removed)> writes:
>> > > i have to know if you could write this mess any slower? you are
>> > > doing
>> > > everything possible to slow you down.
>> > I know I shouldn't critize free help, but you seem to have some
>> > anger management issues.


....

>> As a simple answer, take a look at LWP::UserAgent
>> (http://search.cpan.org/~gaas/libwww-perl-5.812/lib/LWP/UserAgent.pm),
>> as a good start in the right direction.


....

> I just tried LWP, and now I can't get the code to work for the life of
> me. Here is what I attempted


As I mentioned elsewhere, all you need is LWP::Simple.

So, here is a fish for you:

C:\Temp> cat p.pl
#!/usr/bin/perl

use strict;
use warnings;

use HTML::TokeParser;
use LWP::Simple;


my ($input_file) = @ARGV;
die "No input file specified\n" unless defined $input_file;

open my $INPUT, '<', $input_file
    or die "Cannot open '$input_file': $!";

ID:
while ( my $id = <$INPUT> ) {
    chomp $id;

    my $url = make_url( $id );
    my $html = get $url;

    unless ( defined $html ) {
        warn "Error downloading from '$url'\n";
        next ID;
    }

    my $parser = HTML::TokeParser->new( \$html );

    TABLE:
    while ( my $token = $parser->get_tag('table') ) {
        if ( lc $token->[1]{id} eq 'product_details' ) {
            my $td = $parser->get_tag('td');
            last TABLE unless $td;
            my $cell = $parser->get_text('/td');
            my %data;
            # Only labels followed by an integer value are captured.
            while ( $cell =~ /\s*([^:]+?):\s+(\d+)\s+/g ) {
                $data{$1} = $2;
            }
            use Data::Dumper;
            print Dumper \%data;
        }
    }
}

sub make_url {
    return
        sprintf q{http://www.doba.com/members/catalog/%s.html}, $_[0];
}

__END__

C:\Temp> timethis p list

$VAR1 = {
'Product ID' => '3308191',
'UPC' => '896207999816',
'Item ID' => '3653992',
'SKU' => '8930'
};

TimeThis : Command Line : p list
TimeThis : Start Time : Thu May 15 18:19:28 2008
TimeThis : End Time : Thu May 15 18:19:29 2008
TimeThis : Elapsed Time : 00:00:01.062

Comparing this to the overhead of an empty script:

C:\Temp> cat t.pl
#!/usr/bin/perl

use strict;
use warnings;

C:\Temp> timethis t

TimeThis : Command Line : t
TimeThis : Start Time : Thu May 15 18:20:38 2008
TimeThis : End Time : Thu May 15 18:20:38 2008
TimeThis : Elapsed Time : 00:00:00.218

It took 0.844 seconds to retrieve and parse the required information. Of
course, the time cost would be better amortized if you ran a lot of
these queries.
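
To put the figures in context, a back-of-the-envelope sketch using only numbers from this thread: the 350 million items mentioned in the first post and the two timings above. It assumes every fetch costs about the same as this one and that the fetches run strictly one after another, which is a simplification.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $items            = 350_000_000;      # item count from the first post
my $seconds_per_item = 1.062 - 0.218;    # measured run minus empty-script overhead

my $total_seconds = $items * $seconds_per_item;
my $days          = $total_seconds / 86_400;
my $years         = $days / 365.25;

printf "per item: %.3f s\n", $seconds_per_item;
printf "sequential total: %.0f days (about %.1f years)\n", $days, $years;
```

Under those assumptions a single sequential process would need on the order of nine years, so the network round trip per item matters far more than interpreter startup.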



 