fetching webpage and extracting contents

 
 
alfonsobaldaserra
 
      10-04-2010
hello

i am trying to write a script that fetches bbc's top 40 page and
shows only the artist and track names.

i have written a script

#!/usr/bin/perl

use strict;
use warnings;
use LWP::UserAgent;

my $ua = LWP::UserAgent->new;
$ua->timeout(10);
$ua->env_proxy;

my $res = $ua->get("http://www.bbc.co.uk/radio1/chart/singles");

if ($res->is_success) {
    open my $bbc, ">", "bbc.txt" or die "$!\n";
    print $bbc $res->decoded_content;
    close $bbc;
} else {
    die "could not fetch bbc.co.uk\n";
}

open my $bbc, "<", "bbc.txt" or die "$!\n";
while (<$bbc>) {
    print if m!<span class="artist">(.*)</span>!;
    print if m!<span class="track">(.*)</span>!;
    #next unless $_ =~ m[(<span class="artist">)|(<span class="track">)];
    #my ($foo) =~ m!<span class="artist">(.*)</span>!;
    #my ($bar) =~ m!<span class="track">(.*)</span>!;
    # print "$foo -> $bar\n";
}

__RESULT__
<span class="artist">Tinie Tempah</span>
<span class="track">Written In The Stars</span>
<span class="artist">Bruno Mars</span>
<span class="track">Just The Way You Are (Amazing)</span>
<span class="artist">Labrinth</span>
<span class="track">Let The Sun Shine</span>
<span class="artist">Adele</span>
<span class="track">Make You Feel My Love</span>
<span class="artist">Taio Cruz</span>
<span class="track">Dynamite</span>



but i can't figure out

#1 how to parse $res->decoded_content without writing it to a file
because apparently the whole page is a single string

#2 how to show data in artist - track format, like
Tinie Tempah - Written In The Stars

#3 how to make this work
#next unless $_ =~ m[(<span class="artist">)|(<span class="track">)];
#my ($foo) =~ m!<span class="artist">(.*)</span>!;
#my ($bar) =~ m!<span class="track">(.*)</span>!;
# print "$foo -> $bar\n"

appreciate your time gents.

salute
 
alfonsobaldaserra
 
      10-05-2010
> #1 how to parse $res->decoded_content without writing it to a file
> because apparently the whole page is a single string


got it fixed by opening a fh to $res->decoded_content

> #2 how to show data in artist - track format, like
> Tinie Tempah - Written In The Stars



so the new code is

#!/usr/bin/perl

use strict;
#use warnings;
use LWP::UserAgent;

my $ua = LWP::UserAgent->new;
$ua->timeout(10);
$ua->env_proxy;

my $res = $ua->get("http://www.bbc.co.uk/radio1/chart/singles");

if ($res->is_success) {
    open my $bbc, "<", \$res->decoded_content or die "$!\n";
    while (defined (my $con = <$bbc>)) {
        chomp $con;
        next unless $con =~ m!(<span class="artist">)|(<span class="track">)!;
        my ($artist) = $con =~ m!<span class="artist">(.*?)</span>!;
        my ($track) = $con =~ m!<span class="track">(.*?)</span>!;
        print "$artist - $track\n";
    }

} else {
    die "could not fetch bbc.co.uk\n";
}


but the output is coming as

Tinie Tempah -
- Written In The Stars
Bruno Mars -
- Just The Way You Are (Amazing)
Labrinth -
- Let The Sun Shine
Adele -
- Make You Feel My Love

while it should have been

Tinie Tempah - Written In The Stars
Bruno Mars - Just The Way You Are (Amazing)
Labrinth - Let The Sun Shine
Adele - Make You Feel My Love

i cant figure out why this is happening.

any help guys?

thanku
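
A note on the split output above: the loop reads one line per iteration, and the output suggests that each span sits on its own line of the page. On an artist line the track match fails (so $track is undef), and on a track line the artist match fails. With warnings commented out, the undefined value prints as an empty string. The sketch below uses hypothetical sample lines rather than the live page and carries the artist over to the next iteration; it is one possible fix, not the only one.

```perl
use strict;
use warnings;

# Sample lines (an assumption): one span per line, as the output suggests.
my @lines = (
    '<span class="artist">Tinie Tempah</span>',
    '<span class="track">Written In The Stars</span>',
    '<span class="artist">Bruno Mars</span>',
    '<span class="track">Just The Way You Are (Amazing)</span>',
);

my @out;
my $artist;
for my $con (@lines) {
    if ($con =~ m!<span class="artist">(.*?)</span>!) {
        $artist = $1;    # remember the artist until the track line arrives
    } elsif ($con =~ m!<span class="track">(.*?)</span>! and defined $artist) {
        push @out, "$artist - $1";
        undef $artist;   # do not reuse the artist for a later track
    }
}
print "$_\n" for @out;
```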
 
alfonsobaldaserra
 
      10-05-2010
i got a real bad code working

#!/usr/bin/perl

use strict;
use warnings;
use LWP::UserAgent;

my $ua = LWP::UserAgent->new;
$ua->timeout(10);
$ua->env_proxy;

my $res = $ua->get("http://www.bbc.co.uk/radio1/chart/singles");

if ($res->is_success) {
    open my $bbc, "<", \$res->decoded_content or die "$!\n";
    while (defined (my $con = <$bbc>)) {
        chomp $con;
        next if $con =~ /^\s*$/;
        next unless $con =~ m!(<span class="artist">)|(<span class="track">)!;
        $con =~ s/^\s*|\s*$//g;
        if ($con =~ m!<span class="artist">(.*)</span>!) {
            print $1, " - ";
        } elsif ($con =~ m!<span class="track">(.*)</span>!) {
            print $1, "\n";
        }
    }
}


thank you gents for giving me a chance to do it myself.

though i am still looking for any improvements that you could
suggest
 
Peter Makholm
 
      10-05-2010
alfonsobaldaserra <(E-Mail Removed)> writes:

> i got a real bad code working
>
> #!/usr/bin/perl
>
> use strict;
> use warnings;
> use LWP::UserAgent;
>
> my $ua = LWP::UserAgent->new;
> $ua->timeout(10);
> $ua->env_proxy;
>
> my $res = $ua->get("http://www.bbc.co.uk/radio1/chart/singles");
>
> if ($res->is_success) {
> open my $bbc, "<", \$res->decoded_content or die "$!\n";


Don't do this. While possible, it is kind of obscure and should in my
opinion only be used when an existing interface requires a Perl file
handle.

Just split the content on newlines if you want to iterate over the
lines.
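
A minimal sketch of that suggestion, assuming a small sample string in place of $res->decoded_content (the markup here is illustrative, not copied from the live page):

```perl
use strict;
use warnings;

# Stand-in for $res->decoded_content (sample data, an assumption).
my $content = <<'EOHTML';
<a class="artist-link" href="#">
<span class="artist">Tinie Tempah</span>
<span class="track">Written In The Stars</span>
</a>
EOHTML

# split returns the lines directly, so no in-memory file handle is needed.
my @lines = split /\n/, $content;

for my $line (@lines) {
    print "$line\n" if $line =~ /<span class="(?:artist|track)">/;
}
```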

> while (defined (my $con = <$bbc>)) {
> chomp $con;
> next if $con =~ /^\s*$/;
> next unless $con =~ m!(<span class="artist">)|(<span class="track">)!;
> $con =~ s/^\s*|\s*$//g;
> if ($con =~ m!<span class="artist">(.*)</span>!) {
> print $1, " - ";
> } elsif ($con =~ m!<span class="track">(.*)</span>!) {
> print $1, "\n";
> }


Don't parse HTML by throwing naive regexes at the problem. This would
fail horribly if the BBC decided to remove unneeded newlines from their
content.
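
To illustrate the failure mode, here is a small sketch assuming a hypothetical minified line where both spans share one line (this is made-up input, not the actual BBC markup):

```perl
use strict;
use warnings;

# Hypothetical minified markup: artist and track on a single line.
my $line = '<span class="artist">Adele</span>'
         . '<span class="track">Make You Feel My Love</span>';

# Greedy (.*) runs on to the LAST </span> in the line.
my ($greedy) = $line =~ m!<span class="artist">(.*)</span>!;

# Non-greedy (.*?) stops at the first </span>, but a per-line
# if/elsif would still never look at the track on this line.
my ($lazy) = $line =~ m!<span class="artist">(.*?)</span>!;

print "greedy: $greedy\n";
print "lazy:   $lazy\n";
```

The greedy capture swallows the closing tag and the whole track span, which is the kind of silent breakage a real HTML parser avoids.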

> }
> }


I would rather use one of the existing HTML parsing modules. One
option could be HTML::TreeBuilder. Based on a quick read of the
documentation it would look something like this:

use HTML::TreeBuilder;

my $html = HTML::TreeBuilder->new_from_content( $res->decoded_content );
for my $tag ( $html->find('span') ) {
    # attr() returns undef for spans without a class attribute
    my $class = $tag->attr('class') // '';

    if ( $class eq 'artist' ) {
        ...;
    } elsif ( $class eq 'track' ) {
        ...;
    }
}

This would be a much more robust solution. (But I don't parse HTML in
my day-to-day work, so I might not be up to date on the current set of
HTML parsers.)

//Makholm
 
sln@netherlands.com
 
      10-05-2010
On Tue, 5 Oct 2010 01:13:03 -0700 (PDT), alfonsobaldaserra <(E-Mail Removed)> wrote:

>i got a real bad code working
>
>#!/usr/bin/perl
>
>use strict;
>use warnings;
>use LWP::UserAgent;
>
>my $ua = LWP::UserAgent->new;
>$ua->timeout(10);
>$ua->env_proxy;
>
>my $res = $ua->get("http://www.bbc.co.uk/radio1/chart/singles");
>
>if ($res->is_success) {
> open my $bbc, "<", \$res->decoded_content or die "$!\n";
> while (defined (my $con = <$bbc>)) {
> chomp $con;
> next if $con =~ /^\s*$/;
> next unless $con =~ m!(<span class="artist">)|(<span class="track">)!;
> $con =~ s/^\s*|\s*$//g;
> if ($con =~ m!<span class="artist">(.*)</span>!) {
> print $1, " - ";
> } elsif ($con =~ m!<span class="track">(.*)</span>!) {
> print $1, "\n";
> }
> }
>}
>
>
>thank you gents for giving me a chance to do it myself.
>
>though i am still looking for any improvements that you could
>suggest


Along the lines of what you are doing, something like below.
-sln
-----------
use strict;
use warnings;

my $string =<<EOHTML;
<html>
<span class="artist">
Tinie Tempah
</span>
<span class="track">
Written In The Stars
</span>
<span class="artist"> Bruno Mars </span>
<span class="track">Just The Way You Are (Amazing)</span>
<span class="artist">
Labrinth</span>
<span class="track">Let The Sun Shine
</span>
<span class="track">A song by Labrinth</span>
<span class="artist">Adele </span>
<span class="track">Make You Feel My Love</span>
<span class="artist">Taio Cruz</span>
<span class="track">Dynamite</span>
</html>
EOHTML
my $artist = '';

# Pair each artist with the first track that follows it.
while ( $string =~
    / <span \s+ class \s* = \s* ['"]\s* (artist|track) \s*['"] \s* >
        \s* (.*?) \s*
      <\/span\s*>
    /xsig )
{
    if ($1 eq 'artist') {
        $artist = $2;
    }
    else {
        if (length $artist) {
            print "$artist - $2\n";
        }
        $artist = '';
    }
}
print "\n";

## Alternate -
## collect every track under its artist

$artist = '';
my %tracks;

while ( $string =~
    / <span \s+ class \s* = \s* ['"]\s* (artist|track) \s*['"] \s* >
        \s* (.*?) \s*
      <\/span\s*>
    /xsig )
{
    if ($1 eq 'artist') {
        $artist = $2;
    }
    else {
        push @{ $tracks{$artist} }, $2;
    }
}

for $artist (sort keys %tracks) {
    print "\n$artist\n";
    for my $track ( sort @{ $tracks{$artist} } ) {
        print " - $track\n";
    }
}

 
alfonsobaldaserra
 
      10-06-2010
thank you for such beautiful code, sln.

though i am inclined towards peter's advice to use html parsers.
unfortunately, i couldn't get peter's code to work due to a lack of usage
examples of html::treebuilder online.

does anybody happen to know a good html parser with some good examples
online?
 
Peter Makholm
 
      10-06-2010
alfonsobaldaserra <(E-Mail Removed)> writes:

> though i am inclined towards peter's advice to use html parsers.
> unfortunately, i couldn't get your code to work due to lack of usage
> examples of html::treebuilder online.


Huh?

http://www.perlmonks.org/?node_id=280461
http://search.cpan.org/perldoc?HTML::TreeBuilder
http://groups.google.com/group/comp....2b363f0e9be360

//Makholm
 
alfonsobaldaserra
 
      10-21-2010
> Huh?
>
> http://www.perlmonks.org/?node_id=28...2b363f0e9be360
>
> //Makholm


thank you guys

i finally utilised the perlmonks link, read a little at cpan, and here i am

#!/usr/bin/perl

use strict;
use warnings;
use HTML::Tree;
use LWP::Simple;

my $uri = "http://www.bbc.co.uk/radio1/chart/singles";

my $html = get($uri) or die "could not fetch $uri\n";
my $tree = HTML::Tree->new();
$tree->parse($html);
$tree->eof;    # tell the parser the document is complete

my @artist = $tree->look_down('_tag' , 'span', 'class', 'artist');
my @track = $tree->look_down('_tag' , 'span', 'class', 'track');

foreach my $i (0..$#artist) {
    print $artist[$i]->as_text, " - ", $track[$i]->as_text, "\n";
}


again i am wondering if there is a better way to group these two
arrays together instead of the way i did

foreach my $i (0..$#artist) {
    print $artist[$i]->as_text, " - ", $track[$i]->as_text, "\n";
}

thank you
 
Peter Makholm
 
      10-21-2010
alfonsobaldaserra <(E-Mail Removed)> writes:

> my @artist = $tree->look_down('_tag' , 'span', 'class', 'artist');
> my @track = $tree->look_down('_tag' , 'span', 'class', 'track');
>
> foreach my $i (0..$#artist) {
> print $artist[$i]->as_text, " - ", $track[$i]->as_text, "\n";
> }
>
> again i am wondering if there is a better way to group these two
> arrays together instead of the way i did


It all depends on the HTML. But looking at the URL you posted, it looks
like you're looking for a structure like this:

<a class="artist-link" href="/music/artists/ba7d2626-38ce-4859-8495-bdb5732715c4" id="link-13">
<span class="artist">Taio Cruz</span>
<span class="track">Dynamite</span>
</a>

What you could do is iterate over all the <a class="artist-link">
nodes and then look for the artist and track below each
node. Untested, but something like this:

for my $link ( $tree->look_down(_tag => 'a', class => 'artist-link') ) {
    my $artist = $link->look_down(class => 'artist')->as_text;
    my $track  = $link->look_down(class => 'track' )->as_text;

    print "$artist - $track\n";
}

//Makholm
 
alfonsobaldaserra
 
      10-21-2010
> for my $link ( $tree->look_down(_tag => 'a', class => 'artist-link') ) {
>     my $artist = $link->look_down(class => 'artist')->as_text;
>     my $track  = $link->look_down(class => 'track' )->as_text;
>
>     print "$artist - $track\n";
>
> }
>
> //Makholm


thank you again makholm, your code worked sexily without any
modification
 