LWP::UserAgent and 404 page not found

 
 
P.R.Brady
      06-22-2005
I'm using LWP::UserAgent (Active Perl v5.6.1.63) in a web site
crawler, but there's a page I just can't read:
http://www.psychology.bangor.ac.uk/ gives '404 not found'. It is
similarly inaccessible to many of the web checkers out there (like
http://validator.w3.org/) but is okay with 'real' browsers like Internet
Explorer and Netscape.
There's a redirection somewhere behind the scenes to index.php
(which can be read), but the same is true of our main web page
http://www.bangor.ac.uk/, and that redirects okay.

I suppose the problem is that I don't understand how redirection takes
place. Is it a server issue? Do the regular browsers 'guess' at filenames
if none are given? Is there some browser/server negotiation which is not
being implemented?

An extract from the code which exhibits the symptoms is below (but note
the folding of the 'my $referer' line!)

I'd appreciate any help you can give - I've drawn blanks elsewhere!

Regards
Phil



use strict;
use warnings;
use LWP::UserAgent;
use HTTP::Response;
use HTML::TokeParser;

#the page which refers to the culprit:
my $referer = 'http://www.bangor.ac.uk/corporate/in...epts.php';

#the inaccessible page
my $url='http://www.psychology.bangor.ac.uk/';

#but these are okay
# $url='http://www.informatics.bangor.ac.uk/';
# $url='http://www.psychology.bangor.ac.uk/index.php';
# $url='http://www.bangor.ac.uk/';

#open the browser

my $browser = LWP::UserAgent->new;
$browser->timeout(30);

#try to get the page

my $response = $browser->get($url, Referer => $referer);
print "Response $response\n";

my $status= $response->status_line;
($status) = split(' ',$status.' ');
print "Status_line $status\n";

exit;
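
The split on the status line above could also be written with the accessors that HTTP::Response already provides. This is only a sketch: the response object is built by hand so that nothing touches the network, whereas in the crawler it would come back from $browser->get($url).

```perl
use strict;
use warnings;
use HTTP::Response;

# Assumption: a hand-built 404 response stands in for the one the
# crawler would receive from $browser->get($url).
my $response = HTTP::Response->new(404, 'Not Found');

my $code = $response->code;         # numeric status code only
my $line = $response->status_line;  # status code followed by the message

print "Code $code\n";
print "Status_line $line\n";
print "Request failed\n" unless $response->is_success;
```

Using code() avoids parsing the status line by hand, and is_success() covers the whole 2xx range rather than a single value.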

 
Brian Wakem
      06-22-2005
P.R.Brady wrote:

> [ ... snipped ...]
>
> my $response = $browser->get($url, Referer => $referer);



They seem to be doing a redirect based upon the language that your browser
declares itself to accept. As you aren't declaring one, you get an error page.


Try:-

my $response = $browser->get($url, Referer => $referer, ACCEPT_LANGUAGE =>
'en');
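
For what it's worth, the ACCEPT_LANGUAGE key should end up on the wire as the Accept-Language header, because HTTP::Headers treats underscores in field names as dashes and ignores case. A small offline sketch of that, with a placeholder referer URL that is an assumption and not the one from the post:

```perl
use strict;
use warnings;
use HTTP::Headers;

# Assumption: example.com is a placeholder; the real referer in this
# thread is truncated and is not reproduced here.
my $headers = HTTP::Headers->new(
    Referer         => 'http://www.example.com/',
    ACCEPT_LANGUAGE => 'en',
);

# Looking the field up by its wire name finds the value that was set
# with the underscore spelling.
my $lang = $headers->header('Accept-Language');
print "Accept-Language: $lang\n";
```

Writing the key as 'Accept-Language' (quoted, because of the dash) should behave the same way.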


--
Brian Wakem


 
P.R.Brady
      06-23-2005
Brian Wakem wrote:
> P.R.Brady wrote:
>
>
>>I'm using LWP::UserAgent (Active Perl v5.6.1.63 in a web site
>>crawler, but there's a page I just can't read -
>>http://www.psychology.bangor.ac.uk/ gives '404 not found' It is
>>similarly inaccessible for many of the web checkers out there (like
>>http://validator.w3.org/) but is okay with 'real' browsers like Internet
>>Explorer and Netscape.
>>There's a redirection there somewhere behind the scenes to index.php
>>(which can be read), but then that is so for our main web page
>>http://www.bangor.ac.uk/ as well and that redirects okay.
>>



[ ... snipped ...]

>
> They seem to be doing a redirect based upon the language that your browser
> declares itself to accept. As you aren't declaring one, you get an error page.
>
> Try:-
>
> my $response = $browser->get($url, Referer => $referer, ACCEPT_LANGUAGE =>
> 'en');
>


Thanks Brian, that certainly works. Much appreciated.

Now do I have to alter my crawler to scan pages twice I wonder, once for
English, once for Welsh?

Phil
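
If the crawler does need both versions, one possible approach is to build one request per language and fetch each in turn. The sketch below is only that: requests_for_languages is a hypothetical helper, 'cy' is the ISO 639-1 code for Welsh, and whether the server actually serves different pages for 'en' and 'cy' is an assumption that would have to be checked against the site.

```perl
use strict;
use warnings;
use HTTP::Request;

# Hypothetical helper: returns one GET request per language tag, each
# carrying its own Accept-Language header.
sub requests_for_languages {
    my ($url, @languages) = @_;
    my @requests;
    for my $language (@languages) {
        my $request = HTTP::Request->new(GET => $url);
        $request->header('Accept-Language' => $language);
        push @requests, $request;
    }
    return @requests;
}

my @requests = requests_for_languages(
    'http://www.psychology.bangor.ac.uk/', 'en', 'cy');

for my $request (@requests) {
    printf "%s %s (Accept-Language: %s)\n",
        $request->method, $request->uri,
        $request->header('Accept-Language');
    # In the crawler itself: my $response = $browser->request($request);
}
```

Nothing is fetched here; the loop only shows what would be sent, so the two responses could later be compared to decide whether a second pass is worth the extra traffic.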

 
Sherm Pendley
      06-24-2005
"P.R.Brady" writes:

> Brian Wakem wrote:
>> P.R.Brady wrote:
>>
>>>I'm using LWP::UserAgent (Active Perl v5.6.1.63 in a web site
>>>crawler, but there's a page I just can't read -

>
> [ ... snip ...]
>
>> Try:-
>> my $response = $browser->get($url, Referer => $referer,
>> ACCEPT_LANGUAGE =>
>> 'en');
>>

>
> Those parameters like Referer and ACCEPT_LANGUAGE are clearly reserved
> words, but to what? The UserAgent? The HTTP protocol?


HTTP. Here's a reference:

<http://www.w3.org/Protocols/rfc2616/rfc2616.html>
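
In RFC 2616 terms these are request header fields (section 14 lists them, including Accept-Language and Referer), and LWP passes the extra key/value pairs given to get() through as headers. If every request from the crawler should carry the same language, a default on the user agent may be tidier than repeating it per call. A minimal sketch, assuming a libwww-perl recent enough to provide default_header (older releases may not have it):

```perl
use strict;
use warnings;
use LWP::UserAgent;

my $browser = LWP::UserAgent->new;
$browser->timeout(30);

# Assumption: this libwww-perl provides default_header(). The header
# is then added to every request made through $browser.
$browser->default_header('Accept-Language' => 'en');

my $default = $browser->default_header('Accept-Language');
print "Default Accept-Language: $default\n";
```

Headers passed directly to get() can still be supplied per request alongside the default.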

sherm--
 
 
P.R.Brady
      06-24-2005
Brian Wakem wrote:
> P.R.Brady wrote:
>
>
>>I'm using LWP::UserAgent (Active Perl v5.6.1.63 in a web site
>>crawler, but there's a page I just can't read -


[ ... snip ...]

>
> Try:-
>
> my $response = $browser->get($url, Referer => $referer, ACCEPT_LANGUAGE =>
> 'en');
>



Those parameters like Referer and ACCEPT_LANGUAGE are clearly reserved
words, but to what? The UserAgent? The HTTP protocol?
Where are they listed and defined, or what are they called generically
so I can google them?

Phil

 