Velocity Reviews - Computer Hardware Reviews

Find string in web page

 
 
Kirk Larsen
      07-09-2003
Sounds simple enough. I need to retrieve the source from a web page
and then find a link in that web page that ends with a string which I
have stored in a variable. Can someone please post or direct me to a
sample of how to do this? Thanks!
 
Greg Bacon
      07-09-2003
Kirk Larsen wrote:

: Sounds simple enough. I need to retrieve the source from a web page
: and then find a link in that web page that ends with a string which I
: have stored in a variable. Can someone please post or direct me to a
: sample of how to do this? Thanks!

Try this on for size:

% cat try
#! /usr/local/bin/perl

use strict;
use warnings;

use HTML::Parser;
use LWP::UserAgent;
use URI::URL;
use Data::Dumper;

# Build an HTML::Parser that collects [ link text, href ] pairs.
# Returns the parser and a closure that yields the collected links.
sub make_parser {
    my $inside;
    my %attr;
    my $text;
    my @links;

    # Store the current link if it passes the sanity checks, then reset.
    my $record = sub {
        my $state = Dumper {
            inside => $inside,
            attr   => \%attr,
            text   => $text,
        };

        my @cond = (
            [ sub { $inside },     "not inside" ],
            [ sub { %attr },       "no attr"    ],
            [ sub { $attr{href} }, "no href"    ],
        );

        my $ok = 1;
        for (@cond) {
            my($check,$msg) = @$_;

            unless ($check->()) {
                warn "$0: $msg:\n$state ";
                $ok = 0;
            }
        }

        push @links => [ $text || '<empty>', $attr{href} ] if $ok;

        $inside = 0;
        %attr   = ();
        $text   = '';
    };

    # <a href="..."> opens a link
    my $start_h = sub {
        my $tag = shift;
        return unless $tag eq 'a';

        if ($inside) {
            warn "$0: already inside";
            $record->();
        }

        my $attr = shift;
        return unless $attr->{href};

        %attr   = %$attr;
        $inside = 1;
    };

    # Accumulate the text between <a> and </a>
    my $text_h = sub {
        return unless $inside;

        $text .= shift;
    };

    # </a> closes the link
    my $end_h = sub {
        my $tag = shift;
        return unless $tag eq 'a';

        return unless $inside;

        $record->();
    };

    my $p = HTML::Parser->new(
        api_version => 3,
        start_h     => [ $start_h, "tagname, attr" ],
        text_h      => [ $text_h,  "dtext" ],
        end_h       => [ $end_h,   "tagname" ],
    );

    ($p, sub { @links });
}

sub usage () { "Usage: $0 search-pattern\n" }

## main
die usage unless @ARGV;

my $pat = shift;
my $lookfor = eval { qr/$pat/ };
die "$0: bad pattern: $pat" unless $lookfor;

my $url = "http://www.cpan.org/";
my $ua = LWP::UserAgent->new;

my($p,$links) = make_parser;

# Request document and parse it as it arrives
my $res = $ua->request(
    HTTP::Request->new(GET => $url),
    sub { $p->parse($_[0]) }
);
$p->eof;    # flush any text the parser is still buffering

my $base = $res->base;
for ($links->()) {
    my($text,$href) = @$_;

    # Note: the pattern is applied to the link text, not the href
    next unless $text =~ /$lookfor$/;

    my $url = url($href, $base)->abs;

    $text =~ s/\s+/ /g;
    print "$text:\n $url\n";
}
% ./try 's$'
Perl modules:
http://www.cpan.org/modules/index.html
Perl scripts:
http://www.cpan.org/scripts/index.html
Perl recent arrivals:
http://www.cpan.org/RECENT.html
CPAN sites:
http://www.cpan.org/SITES.html
CPAN sites:
http://mirrors.cpan.org/
CPAN modules, distributions, and authors:
http://search.cpan.org/
CPAN Frequently Asked Questions:
http://www.cpan.org/misc/cpan-faq.html
Perl Mailing Lists:
http://lists.cpan.org/
Perl Bookmarks:
http://bookmarks.cpan.org/
% ./try '('
./try: bad pattern: ( at ./try line 95.
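
The second run above shows the eval guard rejecting an invalid pattern. Here is a minimal sketch of that guard on its own. The helper names compile_pattern and compile_literal are made up for illustration and are not part of the script above. The second helper covers the case where the search string should be taken literally rather than as a regex:

```perl
use strict;
use warnings;

# Hypothetical helper: returns a compiled regex, or undef if $pat is invalid.
sub compile_pattern {
    my ($pat) = @_;
    my $re = eval { qr/$pat/ };    # qr// dies on a malformed pattern
    return $re;
}

# Hypothetical helper: treat $str as a literal string, not as a pattern.
sub compile_literal {
    my ($str) = @_;
    return qr/\Q$str\E/;           # \Q...\E quotes regex metacharacters
}

for my $pat ('s$', '(') {
    my $re = compile_pattern($pat);
    print "${pat}: ", (defined $re ? "ok" : "bad pattern"), "\n";
}

my $lit = compile_literal('(');
print "literal match\n" if 'f(x)' =~ $lit;
```

If the string held in the variable is plain text rather than a pattern, the quoted form is probably the safer choice, since characters such as "." or "(" then match only themselves.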

Hope this helps,
Greg
--
In a system of full capitalism, there should be (but, historically, has not
yet been) a complete separation of state and economics, in the same way and
for the same reasons as the separation of state and church.
-- Ayn Rand
 
Greg Bacon
      07-10-2003
Kirk Larsen wrote:

: Can't seem to get it to work. It just outputs nothing. Am I doing
: something wrong, or is there another way? I did print out my search
: string var and verified that it is in the source I'm searching, so
: that's not the problem. Thanks again!

Out of the box, does the code produce the same output as shown in
my followup?

What are you looking for? It looks like I was forcing the match to
be at the end:

    next unless $text =~ /$lookfor$/;

If you don't want to look at the end, change that to

    next unless $text =~ /$lookfor/;
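
To make the difference concrete, here is a small sketch using made-up link data. The @links entries, the pattern, and the $suffix value are assumptions for illustration, in the same [ text, href ] shape the script collects. The last grep matches against the href instead of the link text, which may be closer to what you want if your string appears in the URL rather than in the visible text:

```perl
use strict;
use warnings;

# Made-up sample data in the [ text, href ] shape used by the script.
my @links = (
    [ 'Perl modules',          'modules/index.html'   ],
    [ 'All modules by author', 'authors/00whois.html' ],
    [ 'CPAN sites',            'SITES.html'           ],
);

my $lookfor = qr/modules/;
my $suffix  = 'index.html';    # assumed literal string to look for

# Anchored: the link text must end with the pattern.
my @text_at_end = grep { $_->[0] =~ /$lookfor$/ } @links;

# Unanchored: the pattern may appear anywhere in the link text.
my @text_anywhere = grep { $_->[0] =~ /$lookfor/ } @links;

# Anchored match on the href, with the string quoted so "." is literal.
my @href_at_end = grep { $_->[1] =~ /\Q$suffix\E$/ } @links;

print "text ends with pattern: $_->[0]\n" for @text_at_end;
print "text contains pattern:  $_->[0]\n" for @text_anywhere;
print "href ends with string:  $_->[1]\n" for @href_at_end;
```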

It would also help if you showed your code, but, as always with
Usenet, cutting-and-pasting megabytes of source code isn't useful.

Greg
--
The greatest dangers to liberty lurk in insidious encroachment by men
of zeal, well-meaning but without understanding.
-- Justice Louis D. Brandeis
 
 
Mina Naguib
      07-11-2003

Kirk Larsen wrote:
> Sounds simple enough. I need to retrieve the source from a web page


use LWP::Simple;

> and then find a link in that web page that ends with a string which I
> have stored in a variable.


There are a few ways to do this. I prefer HTML::TokeParser.

> Can someone please post or direct me to a
> sample of how to do this? Thanks!



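use strict;
use warnings;

use LWP::Simple;
use HTML::TokeParser;

my $url   = 'http://www.freebsd.org';
my $match = 'man.cgi';

my $document = get($url) or die "Failed to retrieve document\n";

my $parser = HTML::TokeParser->new(\$document);

# get_tag("a") returns [ tagname, \%attr, ... ] for each <a> start tag
while (my $token = $parser->get_tag("a")) {
    my $href = $token->[1]{href};
    next unless defined $href;    # skip anchors without an href

    # \Q...\E quotes regex metacharacters such as the "." in man.cgi
    if ($href =~ /\Q$match\E$/) {
        print "I matched $href\n";
    }
}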

For more information, see http://search.cpan.org/dist/HTML-Par.../TokeParser.pm and
http://search.cpan.org/dist/libwww-p.../LWP/Simple.pm.

Note that links are often relative, which means you'll often get a link to "something.html" instead
of "http://domain.com/dir/something.html". It'll be up to you to extrapolate the domain and
directory structure of the original URL (and append to it the link data, as well as possibly take
into account any ../.././ calls) to determine the full URL to call next.
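
One way to avoid doing that by hand, as a rough sketch: the URI module can resolve a relative link against a base URL. The base URL and the sample hrefs below are made-up values for illustration. In practice the base would be the URL the page was fetched from.

```perl
use strict;
use warnings;

use URI;

# Assumed example value; normally the URL the document was fetched from.
my $base = 'http://domain.com/dir/page.html';

# Made-up hrefs: relative, parent-relative, root-relative, already absolute.
my @hrefs = (
    'something.html',
    '../other/index.html',
    '/top.html',
    'http://example.org/elsewhere.html',
);

for my $href (@hrefs) {
    # new_abs resolves $href relative to $base, including ../ segments
    my $abs = URI->new_abs($href, $base);
    print "$href -> $abs\n";
}
```

An href that is already absolute comes back unchanged, so the same call can be applied to every link without checking first.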


 