LWP: Any Easy Way to Use Relative Links?

 
 
Hal Vaughan
      03-22-2005
I'm exploring LWP and trying to write a program that will pull down some web
pages. When I read one page, I use regular expressions to find the links
for other pages I want to download. Sometimes the links are relative
(like /cgi/link.pl or subdir/newfile.html) instead of including a domain
name. I don't see anything in the doc files about any consistency from one
connection to another.

Is there any module out there for keeping track of domains and handling
relative URLs?

I thought about writing a program to look for them, but it seems rather hard
to distinguish if a string is a domain name (I'd look for periods, but
can't be sure it'll include a .com, .gov, or anything else unless I check
all TLDs), and some URLs might not have a slash (if it's a domain name
only, or just a file in the same directory), so I can't think of a way to
be sure a string includes a domain and full path or is a relative URL
(other than trying to load it, and checking the error message).

I would think there's a module or something to help handle this either by
tracking links used OR by easily determining if a link is absolute or
relative.

Thanks!

Hal
 
Gunnar Hjalmarsson
      03-22-2005
Hal Vaughan wrote:
> I'm exploring LWP and trying to write a program that will pull down some web
> pages. When I read one page, I use regular expressions to find the links
> for other pages I want to download. Sometimes the links are relative
> (like /cgi/link.pl or subdir/newfile.html) instead of including a domain
> name. I don't see anything in the doc files about any consistency from one
> connection to another.
>
> Is there any module out there for keeping track of domains and handling
> relative URLs?


Maybe you are looking for URI::WithBase.

--
Gunnar Hjalmarsson
Email: http://www.gunnar.cc/cgi-bin/contact.pl
 
Hal Vaughan
      03-22-2005
Gunnar Hjalmarsson wrote:

> Hal Vaughan wrote:
>> I'm exploring LWP and trying to write a program that will pull down some
>> web
>> pages. When I read one page, I use regular expressions to find the links
>> for other pages I want to download. Sometimes the links are relative
>> (like /cgi/link.pl or subdir/newfile.html) instead of including a domain
>> name. I don't see anything in the doc files about any consistency from
>> one connection to another.
>>
>> Is there any module out there for keeping track of domains and handling
>> relative URLs?

>
> Maybe you are looking for URI::WithBase.
>


Pretty close. I didn't know about that, and your comment led me to it, and
from the docs on CPAN, that led me to URI. After experimenting with
URI::WithBase, I realized I can't always tell if a link is relative or not,
and URI::WithBase seems to expect you to know. URI includes uri->scheme,
which will return http for an http connection, and nothing if it's
relative, which is a major help, and lets me detect if a URL is relative or
not.
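
A minimal sketch of the scheme check described above, assuming the URI module from CPAN is installed. The helper name is_relative and the sample links are made up for illustration; they are not from the thread.

```perl
use strict;
use warnings;
use URI;

# Hypothetical helper: true when the link has no scheme,
# i.e. it is a relative reference rather than a full URL.
sub is_relative {
    my ($link) = @_;
    my $uri = URI->new($link);
    return !defined $uri->scheme;
}

for my $link ('http://example.com/a.html', '/cgi/link.pl', 'subdir/newfile.html') {
    printf "%-28s %s\n", $link, is_relative($link) ? 'relative' : 'absolute';
}
```

One caveat: a string such as www.example.com/page has no scheme either, so this check treats it as a relative path, which is also how a browser would treat it inside an href.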

Hal
 
Jay Tilton
      03-22-2005
Hal Vaughan wrote:

: I'm exploring LWP and trying to write a program that will pull down some web
: pages. When I read one page, I use regular expressions to find the links
: for other pages I want to download.

Regex-parsing HTML? Yuck.

: Sometimes the links are relative
: (like /cgi/link.pl or subdir/newfile.html) instead of including a domain
: name. I don't see anything in the doc files about any consistency from one
: connection to another.
:
: Is there any module out there for keeping track of domains and handling
: relative URLs?

HTML::LinkExtor is your one-stop answer. It can snatch links from HTML
better than your regex can, and it can return all links in a
fully-qualified form if given a base URL.
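
A rough sketch of that approach, assuming HTML::LinkExtor (part of the HTML-Parser distribution) is installed. The sub name extract_hrefs, the sample HTML, and the example.com base URL are placeholders, not anything from the original poster's program.

```perl
use strict;
use warnings;
use HTML::LinkExtor;

# Hypothetical helper: return the absolute href of every <a> element.
sub extract_hrefs {
    my ($html, $base) = @_;

    # No callback is given, so the parser accumulates links internally;
    # passing a base URL makes it absolutize relative links.
    my $parser = HTML::LinkExtor->new(undef, $base);
    $parser->parse($html);
    $parser->eof;

    my @hrefs;
    for my $link ($parser->links) {
        my ($tag, %attr) = @$link;    # each entry is [tag, attr => url, ...]
        push @hrefs, "$attr{href}" if $tag eq 'a' && defined $attr{href};
    }
    return @hrefs;
}

my $html = <<'HTML';
<a href="/cgi/link.pl">Script</a>
<a href="subdir/newfile.html">New file</a>
<img src="logo.png">
HTML

print "$_\n" for extract_hrefs($html, 'http://www.example.com/docs/index.html');
```

Note that LinkExtor reports the link attributes only, not the text between the tags, so it does not help with picking links by their displayed text.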

 
John Bokma
      03-22-2005
Hal Vaughan wrote:

> Pretty close. I didn't know about that, and your comment led me to
> it, and from the docs on CPAN, that led me to URI. After
> experimenting with URI::WithBase, I realized I can't always tell if a
> link is relative or not, and URI::WithBase seems to expect you to
> know. URI includes uri->scheme, which will return http for an http
> connection, and nothing if it's relative, which is a major help, and
> lets me detect if a URL is relative or not.


$uri = URI->new_abs( $str, $base_uri )

This constructs a new absolute URI object. The $str argument can denote a
relative or absolute URI. If relative, then it will be absolutized using
$base_uri as base. The $base_uri must be an absolute URI.

No need for fancy scheme detection.

And the base_uri you know, since you just fetched it.

>perl -e "use URI; print URI->new_abs('../baz', 'http://castleamber.com/foo/bar/')"
http://castleamber.com/foo/baz

>perl -e "use URI; print URI->new_abs('http://johnbokma.com/perl/', 'http://castleamber.com/foo/bar/')"
http://johnbokma.com/perl/
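
The same idea as a small script, run against the two relative forms from the original question. This is only a sketch: the sub name absolutize and the example.com base URL are assumptions for illustration.

```perl
use strict;
use warnings;
use URI;

# Hypothetical base: the URL of the page the links were taken from.
my $base = 'http://www.example.com/docs/index.html';

# Resolve a link against the base; absolute links pass through unchanged.
sub absolutize {
    my ($link, $base_uri) = @_;
    return URI->new_abs($link, $base_uri)->as_string;
}

for my $link ('/cgi/link.pl', 'subdir/newfile.html', 'http://other.example.org/x') {
    print absolutize($link, $base), "\n";
}
# prints:
#   http://www.example.com/cgi/link.pl
#   http://www.example.com/docs/subdir/newfile.html
#   http://other.example.org/x
```

When the page was fetched with LWP, the base method of the HTTP::Response object is one way to obtain the base URL to pass in.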

--
John
Small Perl scripts: http://johnbokma.com/perl/
Perl programmer available: http://castleamber.com/
Happy Customers: http://castleamber.com/testimonials.html

 
Hal Vaughan
      03-22-2005
John Bokma wrote:

> Hal Vaughan wrote:
>
>> Pretty close. I didn't know about that, and your comment led me to
>> it, and from the docs on CPAN, that led me to URI. After
>> experimenting with URI::WithBase, I realized I can't always tell if a
>> link is relative or not, and URI::WithBase seems to expect you to
>> know. URI includes uri->scheme, which will return http for an http
>> connection, and nothing if it's relative, which is a major help, and
>> lets me detect if a URL is relative or not.

>
> $uri = URI->new_abs( $str, $base_uri )
>
> This constructs a new absolute URI object. The $str argument can denote a
> relative or absolute URI. If relative, then it will be absolutized using
> $base_uri as base. The $base_uri must be an absolute URI.
>
> No need for fancy scheme detection.


Great! It works even better. I must not have tested it properly, since I
missed it first time around.

Thanks!

Hal

> And the base_uri you know, since you just fetched it.
>
> >perl -e "use URI; print URI->new_abs('../baz', 'http://castleamber.com/foo/bar/')"
> http://castleamber.com/foo/baz
>
> >perl -e "use URI; print URI->new_abs('http://johnbokma.com/perl/', 'http://castleamber.com/foo/bar/')"
> http://johnbokma.com/perl/
>


 
Hal Vaughan
      03-22-2005
Jay Tilton wrote:

> Hal Vaughan wrote:
>
> : I'm exploring LWP and trying to write a program that will pull down some
> : web
> : pages. When I read one page, I use regular expressions to find the
> : links for other pages I want to download.
>
> Regex-parsing HTML? Yuck.
>
> : Sometimes the links are relative
> : (like /cgi/link.pl or subdir/newfile.html) instead of including a domain
> : name. I don't see anything in the doc files about any consistency from
> : one connection to another.
> :
> : Is there any module out there for keeping track of domains and handling
> : relative URLs?
>
> HTML::LinkExtor is your one-stop answer. It can snatch links from HTML
> better than your regex can, and it can return all links in a
> fully-qualified form if given a base URL.


I had never seen that before. I'll look into it. For this project, though,
I scan each page for specific links with a specific phrase as the displayed
text part of the link. Once I get that link, I pull out the url. From
what I see in HTML::LinkExtor, I'd still have to do it close to what I do.

As of now, I do this:

$page =~ s/\n//g; # remove all newlines so a link cannot span lines
my @links = $page =~ /(<a href.*?<\/a>)/gi; # capture every <a href ...>...</a> element

Then I page through each link and see if it includes the text part I want.
With HTML::LinkExtor, I'd still have to loop through all the links.

It will be useful on the next project I'm doing, so thanks!

Hal
 
John Bokma
      03-22-2005
Hal Vaughan wrote:

> Jay Tilton wrote:
>
>> Hal Vaughan wrote:
>>
>> : I'm exploring LWP and trying to write a program that will pull down
>> : some web
>> : pages. When I read one page, I use regular expressions to find the
>> : links for other pages I want to download.
>>
>> Regex-parsing HTML? Yuck.
>>
>> : Sometimes the links are relative
>> : (like /cgi/link.pl or subdir/newfile.html) instead of including a
>> : domain name. I don't see anything in the doc files about any
>> : consistency from one connection to another.
>> :
>> : Is there any module out there for keeping track of domains and
>> : handling relative URLs?
>>
>> HTML::LinkExtor is your one-stop answer. It can snatch links from
>> HTML better than your regex can, and it can return all links in a
>> fully-qualified form if given a base URL.

>
> I had never seen that before. I'll look into it. For this project,
> though, I scan each page for specific links with a specific phrase as
> the displayed text part of the link.


Haven't used HTML::LinkExtor, but I have used HTML::TreeBuilder a lot,
see also HTML::Element for some specific documentation.

I guess look_down( _tag => 'a', ... ) will make several things a bit
easier, especially if you want only specific links.
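
Something along these lines might do it. This is an untested-in-production sketch that assumes HTML::TreeBuilder and URI are installed; the sub name links_with_text, the sample HTML, and the phrase are invented for the example.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;
use URI;

# Hypothetical helper: absolute URLs of <a> elements whose displayed
# text contains $phrase (case-insensitive).
sub links_with_text {
    my ($html, $base, $phrase) = @_;
    my $tree = HTML::TreeBuilder->new_from_content($html);

    # look_down takes attribute criteria plus an optional code ref
    # that receives each candidate element as its first argument.
    my @anchors = $tree->look_down(
        _tag => 'a',
        sub { defined $_[0]->attr('href') && $_[0]->as_text =~ /\Q$phrase\E/i },
    );
    my @urls = map { URI->new_abs($_->attr('href'), $base)->as_string } @anchors;

    $tree->delete;    # release the parse tree
    return @urls;
}

my $html = <<'HTML';
<p><a href="/cgi/link.pl">Download report</a>
<a href="subdir/newfile.html">Something else</a></p>
HTML

print "$_\n" for links_with_text($html, 'http://www.example.com/docs/', 'download report');
```

This covers both halves of the problem in one pass: the filter picks links by their text, and URI->new_abs turns whatever href it finds into a full URL.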

--
John
Small Perl scripts: http://johnbokma.com/perl/
Perl programmer available: http://castleamber.com/
Happy Customers: http://castleamber.com/testimonials.html

 
Bart Lateur
      03-23-2005
Jay Tilton wrote:

>HTML::LinkExtor is your one-stop answer. It can snatch links from HTML
>better than your regex can, and it can return all links in a
>fully-qualified form if given a base URL.


See also HTML::SimpleLinkExtor for a similar module with a (maybe)
simpler API.

--
Bart.
 