LWP Doesn't Seem To Save Cookies:

 
 
Hal Vaughan
      03-23-2005
I'm trying to write a scraper for a website that uses cookies. The short of
it is that I keep getting their "You have to set your browser to allow
cookies" message. The code for the full scraper is a bit much, so here are
the relevant sections:

use File::Spec::Functions;
use File::Basename;
use File::Copy;
use LWP::UserAgent;
use HTTP::Cookies;
use URI::WithBase;
use DBI;
use strict;

Here's where I set up the variables (not all "my" and "our" statements are
included):

print "Cookie file: $cfile\n";
$ua = LWP::UserAgent->new;
$ua->timeout(5);
$ua->agent("Netscape/7.1");
$cjar = HTTP::Cookies->new(file => $cfile, autosave => 1, ignore_discard => 1);
$ua->cookie_jar($cjar);

Here's where I get the login page (which I always retrieve to make sure the
fields or info haven't changed):


$page = $ua->get($url);
$page = $page->as_string;

And after that, I go through the page, make sure the form input fields
haven't changed (which are "login" and "key" for the username and
password). Then I post the data for the next page, including the form
data:


$parm = "";
foreach (keys %form) {
    print "\tAdding parm. Key: $_, Value: $form{$_}\n";
    $parm = "$parm$_=$form{$_}&";
}
$parm =~ s/&$//;
$req = HTTP::Request->new(POST => $url);
$req->content_type("application/x-www-form-urlencoded");
$req->header('Accept' => 'text/html');
$req->content_type("form-data");
$req->content($parm);
$page = $ua->request($req);

When I'm building up $parm, I'm taking the values from %form. I TRIED to
use the hash to post the values, using "$page = $ua->post($url, \%form);",
but even though it worked on a test web server on my LAN, it wouldn't work
on the system I'm scraping (don't know why -- if you can help here as well,
feel free to chip in).
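One thing that may be worth checking in the snippet above: content_type() is called twice, and the second call ("form-data") replaces the first, so the request goes out with a Content-Type that does not describe a URL-encoded body. The values are also joined without URL-encoding. Whether either detail matters to this particular server is a guess. Here is a minimal sketch of building the same kind of POST with HTTP::Request::Common, which handles both; the URL and field values are made up, and only the field names "login" and "key" come from the post.

```perl
use strict;
use warnings;
use HTTP::Request::Common qw(POST);

# Hypothetical URL and values; only the field names come from the thread.
my $url  = 'http://www.example.com/login';
my @form = (login => 'hal', key => 'p&ss word');

# POST() URL-encodes each name/value pair and sets Content-Type once.
my $req = POST($url, \@form);

print $req->content_type, "\n";   # the media type LWP will send
print $req->content, "\n";        # the encoded request body
```

The resulting object can be passed to $ua->request($req) exactly like the hand-built one. An array reference is used here instead of a hash only to keep the field order predictable.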

The problem comes up when I use the code above to post the form data and get
the next page. The next page is a frameset with two frames. I get the
frame urls from the page and load them:

$req = HTTP::Request->new(GET => $url);
$req->content_type("application/x-www-form-urlencoded");
$req = $ua->request($req);
$page = $req->as_string;

And this is when I always get the "You don't have cookies" message.

I thought that LWP automatically took the cookies out of the page (I also
thought cookies were in the header, the one here is set with
document.cookie="doc cookie" within the document), and stored them in the
cookie jar automatically. That doesn't seem to be happening. I've been
reading the perldocs, but I can't see anything in the response object that
allows me to check the page for cookies, so I can do it myself.
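To make the header/body distinction concrete, here is a small offline sketch. It builds a fake response by hand (the URL, cookie names and values are invented), prints any Set-Cookie headers, and then lets HTTP::Cookies extract cookies the same way it does when attached to a user agent. Only the header cookie should end up in the jar; the document.cookie assignment in the body is just text as far as LWP is concerned.

```perl
use strict;
use warnings;
use HTTP::Cookies;
use HTTP::Request;
use HTTP::Response;

# Hand-built request/response pair standing in for a real fetch (hypothetical data).
my $request  = HTTP::Request->new(GET => 'http://www.example.com/login');
my $response = HTTP::Response->new(200, 'OK');
$response->request($request);
$response->header('Set-Cookie' => 'session=abc123; path=/');
$response->content('<script>document.cookie="doc=cookie";</script>');

# Cookies sent by the server live in the response headers.
my @set_cookie = $response->header('Set-Cookie');
print "Set-Cookie header: $_\n" for @set_cookie;

# extract_cookies() only looks at headers, never at the page body.
my $jar = HTTP::Cookies->new;
$jar->extract_cookies($response);
print $jar->as_string;
```

On a real fetch, the same two checks ($response->header('Set-Cookie') and $ua->cookie_jar->as_string) show what the server actually sent and what was stored.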

So why aren't the cookies being kept, and why can't the pages I retrieve
AFTER the cookie is set see it? Is part of the problem that they are in
frames?

Any help on this is appreciated.

Thanks!

Hal
 
Todd W
      03-23-2005

"Hal Vaughan" <(E-Mail Removed)> wrote in message
news:(E-Mail Removed)...
> I'm trying to write a scraper for a website that uses cookies. The short of
> it is that I keep getting their "You have to set your browser to allow
> cookies" message. The code for the full scraper is a bit much, so here are
> the relevant sections:
>

<snip />

I've had a lot of success using LWP to scrape web pages; for instance, I have
a neat program that shows me all my bank account balances on my web-enabled
cell phone. But I've also had some trouble getting LWP to scrape some pages
that require cookies.

Here's my code:

[trwww@waveright temp]$ perl -MWWW::Mechanize::Shell -e 'shell'
>get https://www.setsivr.odjfs.state.oh.us/welcome.asp

Retrieving https://www.setsivr.odjfs.state.oh.us/welcome.asp(200)
https://www.setsivr.odjfs.state.oh.us/cookieerror.htm>

If the client and the server were doing everything according to
specification, this would work.

I get the same problem with lynx, and another poster on perl.libwww verified
my issue and got the same error using a Python HTTP library.

Here's the archive of my thread:

http://groups-beta.google.com/group/...d09ffd6ff2f4fd

I guess that since it doesn't work with lynx I can say that the server is
doing something that isn't standard, but it sucks because it works fine in any
of the major graphical browsers I've tried.

I suppose that someone who knew HTTP well enough could say why it doesn't
work, but I know it pretty well and I can't figure it out, and I've tried
pretty hard.

Todd W


 
Hal Vaughan
      03-23-2005
Todd W wrote:

> "Hal Vaughan" <(E-Mail Removed)> wrote in message
> news:(E-Mail Removed)...
>> I'm trying to write a scraper for a website that uses cookies. The short of
>> it is that I keep getting their "You have to set your browser to allow
>> cookies" message. The code for the full scraper is a bit much, so here are
>> the relevant sections:
>
> <snip />
>
> I've had a lot of success using LWP to scrape web pages; for instance, I
> have a neat program that shows me all my bank account balances on my
> web-enabled cell phone. But I've also had some trouble getting LWP to
> scrape some pages that require cookies.
>
> Here's my code:
>
> [trwww@waveright temp]$ perl -MWWW::Mechanize::Shell -e 'shell'
> >get https://www.setsivr.odjfs.state.oh.us/welcome.asp
>
> Retrieving https://www.setsivr.odjfs.state.oh.us/welcome.asp(200)
> https://www.setsivr.odjfs.state.oh.us/cookieerror.htm>
>
> If the client and the server were doing everything according to
> specification, this would work.
>
> I get the same problem with lynx, and another poster on perl.libwww
> verified my issue and got the same error using a Python HTTP library.
>
> Here's the archive of my thread:
>
> http://groups-beta.google.com/group/...d09ffd6ff2f4fd

I checked the thread, and I've gone back over the pages I downloaded. I
wasn't clear (I think I mentioned it in my first post) about how cookies
are normally handled, and had not looked closely at the files (since I
figured that was not likely the problem). It turns out that the cookie IS
being set in Javascript, which I suspected, but I didn't realize that was a
problem. I wrote a routine that scans the page, grabs the cookie,
and sets it manually with $cookie_jar->set_cookie(), and it looks like it is
set properly (it includes the domain and path setting, as well). However,
even after setting the cookie manually, I either get "no cookie" messages,
or trying to load any page after the login gives me the login page again
(which I noticed happens in Firefox if I try to paste in a link to a page
after the login page when I'm not logged in). (I also looked at the
cookies in Firefox to see if it looked like the same ones I was getting in
Perl, and they seem the same except for the session ID number.)
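For anyone following along, the page-scanning step described above might look something like the sketch below. The HTML, cookie name and value are invented, since the real page is not shown in the thread, and the regex only handles the simple case where document.cookie is assigned a plain quoted string (no string concatenation or computed values).

```perl
use strict;
use warnings;

# Hypothetical page body; the real site's script is not shown in the thread.
my $html = <<'HTML';
<script type="text/javascript">
document.cookie = "SESSIONID=abc123; path=/";
</script>
HTML

my %found;
while ($html =~ /document\.cookie\s*=\s*(["'])(.*?)\1/g) {
    my $cookie = $2;                              # copy before running another regex
    my ($pair) = split /\s*;\s*/, $cookie;        # first part is name=value, rest are attributes
    my ($name, $value) = split /=/, $pair, 2;     # split on the first "=" only
    $found{$name} = $value;
}

print "$_ = $found{$_}\n" for sort keys %found;
```

The name and value pulled out here are what would then be handed to set_cookie(), along with the domain and path.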

So I've found a way to set the cookie by hand, but the server I'm trying to
read from doesn't seem to see the cookie is set. Is there something I need
to do, other than setting a cookie, to make sure the server I'm connecting
to knows the cookie is set?

This is not an area I'm an expert in, and it's frustrating because I need to
get this done. I'm low on sleep and trying to put together a lot more
pieces than I expected. I didn't know that when I send a page request
to a server, the server can actually read the cookie along with the
request. I thought cookies were only used by client-side Javascript, but the
fact that the server won't send me the right pages without the cookie seems to
say the server can read the cookie. Is that right? If so, how do I make
sure the server gets the cookie?
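On the "how does the server get it" part: cookies travel to the server in a Cookie header on each request, and the jar decides per request whether a stored cookie applies based on its domain and path. One way to check a manually set cookie without touching the network is to ask the jar to decorate a request and then look at the header. This is only a sketch; the host names, cookie name and value are invented.

```perl
use strict;
use warnings;
use HTTP::Cookies;
use HTTP::Request;

my $jar = HTTP::Cookies->new;

# Arguments: version, key, value, path, domain, port, path_spec, secure, maxage, discard
$jar->set_cookie(0, 'SESSIONID', 'abc123', '/', 'www.example.com', undef, 1, 0, 3600, 0);

# A request to the cookie's host should pick the cookie up...
my $req = HTTP::Request->new(GET => 'http://www.example.com/frame1.html');
$jar->add_cookie_header($req);
print "Cookie header: ", $req->header('Cookie') // '(none)', "\n";

# ...while a request to a different host should not.
my $other = HTTP::Request->new(GET => 'http://other.example.org/');
$jar->add_cookie_header($other);
print "Other host:    ", $other->header('Cookie') // '(none)', "\n";
```

If the first line shows "(none)" for a URL on the real site, the domain or path given to set_cookie() probably does not match the URL being requested.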

Thanks for any help on this!

Hal
 
Gunnar Hjalmarsson
      03-23-2005
Hal Vaughan wrote:
> I thought that LWP automatically took the cookies out of the page (I also
> thought cookies were in the header, the one here is set with
> document.cookie="doc cookie" within the document), and stored them in the
> cookie jar automatically. That doesn't seem to be happening. I've been
> reading the perldocs, but I can't see anything in the response object that
> allows me to check the page for cookies, so I can do it myself.


This thread with a similar topic might contain something useful:

http://groups-beta.google.com/group/...f4b9ef0d73a11d

--
Gunnar Hjalmarsson
Email: http://www.gunnar.cc/cgi-bin/contact.pl
 
Hal Vaughan
      03-23-2005
Gunnar Hjalmarsson wrote:

> Hal Vaughan wrote:
>> I thought that LWP automatically took the cookies out of the page (I also
>> thought cookies were in the header, the one here is set with
>> document.cookie="doc cookie" within the document), and stored them in the
>> cookie jar automatically. That doesn't seem to be happening. I've been
>> reading the perldocs, but I can't see anything in the response object
>> that allows me to check the page for cookies, so I can do it myself.

>
> This thread with a similar topic might contain something useful:
>
> http://groups-beta.google.com/group/...f4b9ef0d73a11d


Thanks. I read through it. I already have the ignore_discard set, so that
isn't it.

At this point, I think it's a bigger problem and I could use some
clarification from anyone (I'm trying to find info on Google, but am not
doing too well). It turns out the cookie is set by Javascript, with
"document.cookie=". Since Perl doesn't catch this, I'm pulling the cookie
out with a regex and setting it manually. That doesn't seem to help
though, so I've got some more questions:

1) If I have an HTTP::Response object, and I pull out the Javascript cookie
string, is there a way to add it to the header in the Response object and
re-parse the Response to get the cookie into the jar, or will that make a
difference over me setting the cookie manually?

2) How does the server know what my cookies are? I had no idea that the
server was able to read cookies, but since I get different pages without
the cookie than what I should get, I think the server has a way of
detecting the cookies on my system.

3) If I'm right, and the server can read my cookies (other than reading them
with client-side Javascript, which was what I used to think happened), is
it worth sending the cookie as POST data instead?
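Regarding question 1, a rough sketch of the "add it to the header and re-parse" idea is below. It copies the document.cookie string into a Set-Cookie header on a hand-built response and then calls extract_cookies(), which is the method the jar uses on real responses. The page content and URL are invented. As far as I can tell the result should be equivalent to calling set_cookie() directly, except that the jar fills in the domain from the request URL for you. On questions 2 and 3: the server normally reads cookies from the Cookie request header, so sending the value as POST data is unlikely to be a substitute unless the site's own forms do that.

```perl
use strict;
use warnings;
use HTTP::Cookies;
use HTTP::Request;
use HTTP::Response;

# Hand-built stand-in for the login page response (hypothetical content).
my $req = HTTP::Request->new(GET => 'http://www.example.com/login');
my $res = HTTP::Response->new(200, 'OK', undef,
    '<script>document.cookie="SESSIONID=abc123; path=/";</script>');
$res->request($req);   # extract_cookies() needs the originating request

# Copy the Javascript cookie string into a Set-Cookie header.
if ($res->content =~ /document\.cookie\s*=\s*"([^"]*)"/) {
    $res->push_header('Set-Cookie' => $1);
}

# Let the jar parse the header as if the server had sent it.
my $jar = HTTP::Cookies->new;
$jar->extract_cookies($res);
print $jar->as_string;
```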

If anyone can help me with these, it'll be a huge help.

Thanks!

Hal
 
Gunnar Hjalmarsson
      03-23-2005
Hal Vaughan wrote:
> Gunnar Hjalmarsson wrote:
>> This thread with a similar topic might contain something useful:
>>
>> http://groups-beta.google.com/group/...f4b9ef0d73a11d

>
> Thanks. I read through it. I already have the ignore_discard set, so that
> isn't it.


I knew that you have ignore_discard set; my thought was that other
details in Richard's code might serve as clues.

I have no experience of my own with HTTP::Cookies, but when helping
Richard, I noticed that the module provides quite a few methods, some
of which appear to be relevant to you.

--
Gunnar Hjalmarsson
Email: http://www.gunnar.cc/cgi-bin/contact.pl
 
Ilmari Karonen
      03-28-2005
Hal Vaughan <(E-Mail Removed)> wrote on 2005-03-23:
[snip]
> even after setting the cookie manually, I either get "no cookie" messages,
> or trying to load any page after the login gives me the login page again
> (which I noticed happens in Firefox if I try to paste in a link to a page
> after the login page when I'm not logged in).


It looks like the server might be checking the Referer header. You
may want to try to include one in every request you make, like this:

my $res = $ua->get($url, Referer => $ref);

where $ref is the URL of the page you got $url from. (It might be
enough just to give any URL from the same site, but then again, it
might not.)

A server paranoid enough to do things like that may also be checking
User-Agent headers, so if you're not doing that already, I'd suggest
setting yours to imitate some common browser, like this:

$ua->agent('Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0)');
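To confirm that both headers really end up on the request, something like the following sketch can be run without contacting the server. It uses prepare_request() from LWP::UserAgent to apply the user agent's settings to a request object and then prints it; the URLs are placeholders, not the real site.

```perl
use strict;
use warnings;
use LWP::UserAgent;
use HTTP::Request;

my $ua = LWP::UserAgent->new;
$ua->agent('Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0)');

# Placeholder URLs: $ref is the frameset page, $url is one of its frames.
my $ref = 'http://www.example.com/frameset.html';
my $url = 'http://www.example.com/frame1.html';

# Build the request with an explicit Referer, then let the user agent
# add its own headers (User-Agent, and cookies if a jar is attached).
my $req = HTTP::Request->new(GET => $url, [ Referer => $ref ]);
$ua->prepare_request($req);

print $req->as_string;
```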

--
Ilmari Karonen
To reply by e-mail, please replace ".invalid" with ".net" in address.
 
Joe Smith
      04-05-2005
Hal Vaughan wrote:

> At this point, I think it's a bigger problem and I could use some
> clarification from anyone


Last time I had a problem like this, I told my browser to use an
http proxy, and had the proxy log what was actually being sent to
the server. I used http://www.inwap.com/mybin/miscunix/?tcp-proxy
to do the logging when my proxy did not log everything I needed.
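A similar kind of logging can be done from inside the script, assuming a libwww-perl release that provides add_handler() on the user agent. The sketch below records each request as it is about to be sent. The second handler exists only so the example runs without a network; it returns a canned response and would be left out in real use. The URLs are placeholders.

```perl
use strict;
use warnings;
use LWP::UserAgent;
use HTTP::Response;

my $ua = LWP::UserAgent->new;
my @log;

# Record every outgoing request as LWP is about to send it.
$ua->add_handler(request_send => sub {
    my ($request, $ua, $handler) = @_;
    push @log, $request->as_string;
    return;    # returning nothing lets the request continue
});

# Demo only: answer locally so the sketch needs no network access.
$ua->add_handler(request_send => sub {
    return HTTP::Response->new(200, 'OK', [ 'Content-Type' => 'text/plain' ], "stub\n");
});

$ua->get('http://www.example.com/frame1.html', Referer => 'http://www.example.com/');
print $log[0];
```

Comparing this output with what a browser sends (as captured by a proxy) should show which header or cookie differs.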
-Joe

P.S. I noticed that cookies are mentioned in
http://search.cpan.org/~petdance/WWW...W/Mechanize.pm
 