Velocity Reviews

Jack 02-20-2008 07:15 AM

How to save a webpage contents to a file ( with LWP )
 
Hi there, does anyone skilled in the art of LWP (or other perl module)
and screen scraping know how to do the equivalent of a "file", "save
as" html content ? Some webpages aren't scrapeable, but when you save
their content to a local file it's available. Any ideas would be
great.

Also, if there is a drop down + button to select content BUT in the
HTML source no "submit" entry at all, how does one remote control a
user selection without this post handle ?

Thanks in advance,

Jack

A. Sinan Unur 02-20-2008 01:49 PM

Re: How to save a webpage contents to a file ( with LWP )
 
Jack <jack_posemsky@yahoo.com> wrote in news:412be207-d043-4b9d-bd96-25294294d50e@u72g2000hsf.googlegroups.com:

> Hi there, does anyone skilled in the art of LWP (or other perl module)
> and screen scraping know how to do the equivalent of a "file", "save
> as" html content ?


http://search.cpan.org/~gaas/libwww-.../LWP/Simple.pm

getstore($url, $file)

http://search.cpan.org/~gaas/libwww-...esponse_Object

http://search.cpan.org/~gaas/libwww-...TP/Response.pm

$r->content( $content )

This is used to get/set the raw content

$r->decoded_content( %options )

This will return the content after any Content-Encoding and charsets
have been decoded.
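
A minimal sketch of the two routes listed above, assuming LWP (libwww-perl) is installed. The sub names `save_raw` and `save_decoded` are made up for illustration and are not part of LWP; `getstore` writes the response body to disk as received, while `decoded_content` hands back text you can write out yourself.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use LWP::Simple qw(getstore);
use HTTP::Status qw(is_success);
use LWP::UserAgent;

# Route 1: fetch $url and write the response body straight to $file.
# getstore() returns the HTTP status code of the request.
sub save_raw {
    my ($url, $file) = @_;
    my $status = getstore($url, $file);
    die "GET $url failed with status $status\n" unless is_success($status);
    return $status;
}

# Route 2: fetch with LWP::UserAgent, undo any Content-Encoding and
# charset, then write the resulting text to $file as UTF-8.
sub save_decoded {
    my ($url, $file) = @_;
    my $ua = LWP::UserAgent->new(timeout => 30);
    my $r  = $ua->get($url);
    die "GET $url failed: ", $r->status_line, "\n" unless $r->is_success;

    my $text = $r->decoded_content;
    die "Could not decode the response body\n" unless defined $text;

    open my $out, '>:encoding(UTF-8)', $file
        or die "Cannot write $file: $!\n";
    print {$out} $text;
    close $out or die "Cannot close $file: $!\n";
    return length $text;
}

# Usage: perl save_page.pl URL FILE
save_raw(@ARGV) if @ARGV == 2;
```

Either route saves only the single document at that URL. Unlike a browser's "Save As", it does not fetch images, stylesheets, or the pages loaded into frames.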

> Also, if there is a drop down + button to select content BUT in the
> HTML source no "submit" entry at all, how does one remote control a
> user selection without this post handle ?


If the page uses Javascript to dynamically post form contents, you will
have to figure out what the Javascript does and replicate it.
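
As a rough sketch of what "replicate it" can look like once you know what the browser sends: build the same form request by hand. Everything specific here is an assumption. The action URL, the field name `centerin`, and the value `GGCC` are placeholders, and `build_selection_request` is a made-up helper; the real values have to be read from the page's form and select elements (or from the script that submits them).

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTTP::Request::Common qw(POST);
use LWP::UserAgent;

# Build the request a browser would send when the selection is submitted.
# The fields are sent as application/x-www-form-urlencoded content.
sub build_selection_request {
    my ($action_url, %fields) = @_;
    return POST($action_url, [ %fields ]);
}

my $request = build_selection_request(
    'http://www.example.com/somepage/body.asp',  # hypothetical form action
    centerin => 'GGCC',                          # hypothetical field and option value
);

# Show the request so it can be compared with what the browser sends.
print $request->as_string;

# Sending it needs network access, so it is left commented out:
# my $response = LWP::UserAgent->new->request($request);
# print $response->decoded_content if $response->is_success;
```

If the form uses GET rather than POST, the same fields go into the query string of the URL instead.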

Sinan


--
A. Sinan Unur <1usa@llenroc.ude.invalid>
(remove .invalid and reverse each component for email address)
clpmisc guidelines: <URL:http://www.rehabitation.com/clpmisc.shtml>


Jack 02-20-2008 03:41 PM

Re: How to save a webpage contents to a file ( with LWP )
 
On Feb 20, 5:49 am, "A. Sinan Unur" <1...@llenroc.ude.invalid> wrote:
> Jack <jack_posem...@yahoo.com> wrote in news:412be207-d043-4b9d-bd96-25294294d50e@u72g2000hsf.googlegroups.com:
>
> > Hi there, does anyone skilled in the art of LWP (or other perl module)
> > and screen scraping know how to do the equivalent of a "file", "save
> > as" html content ?
>
> http://search.cpan.org/~gaas/libwww-.../LWP/Simple.pm
>
> getstore($url, $file)
>
> http://search.cpan.org/~gaas/libwww-...pm#The_Respons...
>
> http://search.cpan.org/~gaas/libwww-...TP/Response.pm
>
> $r->content( $content )
>
>     This is used to get/set the raw content
>
> $r->decoded_content( %options )
>
>     This will return the content after any Content-Encoding and charsets
>     has been decoded.
>
> > Also, if there is a drop down + button to select content BUT in the
> > HTML source no "submit" entry at all, how does one remote control a
> > user selection without this post handle ?
>
> If the page uses Javascript to dynamically post form contents, you will
> have to figure out what the Javascript does and replicate it.
>
> Sinan
>
> --
> A. Sinan Unur <1...@llenroc.ude.invalid>
> (remove .invalid and reverse each component for email address)
> clpmisc guidelines: <URL:http://www.rehabitation.com/clpmisc.shtml>


Hi Sinan, the site uses ASP, no JS files. This is all there is in the
html:
<!--<SCRIPT>
//
</SCRIPT>-->
<FRAMESET ROWS="70,*" FRAMESPACING=0>
<FRAME NAME="header" SRC="./header_default.asp?
NoCache=2%2F20%2F2008+7%3A35%3A47+AM" SCROLLING="no" MARGINWIDTH="2"
MARGINHEIGHT="0">

<FRAME NAME="bodyx" SRC=
body.asp?centerin=GGCC
SCROLLING="auto" MARGINWIDTH="2" MARGINHEIGHT="2">


</FRAMESET>

</HTML>

A. Sinan Unur 02-20-2008 03:51 PM

Re: How to save a webpage contents to a file ( with LWP )
 
Jack <jack_posemsky@yahoo.com> wrote in
news:14c9e85d-9e1d-43ca-ae55-423ce6256df2@q78g2000hsh.googlegroups.com:

> On Feb 20, 5:49 am, "A. Sinan Unur" <1...@llenroc.ude.invalid> wrote:
>> Jack <jack_posem...@yahoo.com> wrote in
>> news:412be207-d043-4b9d-bd96-25294294d50e@u72g2000hsf.googlegroups.com:
>>
>> > Hi there, does anyone skilled in the art of LWP (or other perl
>> > module) and screen scraping know how to do the equivalent of a
>> > "file", "save as" html content ?
>>
>> http://search.cpan.org/~gaas/libwww-.../LWP/Simple.pm
>>
>> getstore($url, $file)
>>
>> http://search.cpan.org/~gaas/libwww-perl-5.808/lib/LWP.pm#The_Respons...
>>
>> http://search.cpan.org/~gaas/libwww-...TP/Response.pm
>>
>> $r->content( $content )
>>
>>     This is used to get/set the raw content
>>
>> $r->decoded_content( %options )
>>
>>     This will return the content after any Content-Encoding and
>>     charsets has been decoded.
>>
>> > Also, if there is a drop down + button to select content BUT in the
>> > HTML source no "submit" entry at all, how does one remote control a
>> > user selection without this post handle ?
>>
>> If the page uses Javascript to dynamically post form contents, you
>> will have to figure out what the Javascript does and replicate it.
>>
>> Sinan
>>
>> --
>> A. Sinan Unur <1...@llenroc.ude.invalid>


Do *not* quote sigs.

> Hi Sinan the site uses ASP, no JS files.. this is all there is in the
> html
> <!--<SCRIPT>
> //
> </SCRIPT>-->
> <FRAMESET ROWS="70,*" FRAMESPACING=0>
> <FRAME NAME="header" SRC="./header_default.asp?
> NoCache=2%2F20%2F2008+7%3A35%3A47+AM" SCROLLING="no" MARGINWIDTH="2"
> MARGINHEIGHT="0">
>
> <FRAME NAME="bodyx" SRCbody.asp?centerin=GGCC


I am assuming you retyped the source rather than copying and pasting it.
Please don't retype code.

> SCROLLING="auto" MARGINWIDTH="2" MARGINHEIGHT="2">


Oh, but there is more. How about them frames?

Anyway, this forum is for help with the Perl aspect of things. If you
need to learn html, there is a group for that as well.

Sinan
--
A. Sinan Unur <1usa@llenroc.ude.invalid>
(remove .invalid and reverse each component for email address)
clpmisc guidelines: <URL:http://www.rehabitation.com/clpmisc.shtml>


Gunnar Hjalmarsson 02-20-2008 04:08 PM

Re: How to save a webpage contents to a file ( with LWP )
 
Jack wrote:
> this is all there is in the html
> <!--<SCRIPT>
> //
> </SCRIPT>-->
> <FRAMESET ROWS="70,*" FRAMESPACING=0>
> <FRAME NAME="header" SRC="./header_default.asp?
> NoCache=2%2F20%2F2008+7%3A35%3A47+AM" SCROLLING="no" MARGINWIDTH="2"
> MARGINHEIGHT="0">
>
> <FRAME NAME="bodyx" SRC=
> body.asp?centerin=GGCC
> SCROLLING="auto" MARGINWIDTH="2" MARGINHEIGHT="2">
>
>
> </FRAMESET>
>
> </HTML>


Then get the bodyx frame, not the frameset.

--
Gunnar Hjalmarsson
Email: http://www.gunnar.cc/cgi-bin/contact.pl

Jack 02-20-2008 11:49 PM

Re: How to save a webpage contents to a file ( with LWP )
 
On Feb 20, 8:08 am, Gunnar Hjalmarsson <nore...@gunnar.cc> wrote:
> Jack wrote:
> > this is all there is in the html
> >   <!--<SCRIPT>
> >    //
> >   </SCRIPT>-->
> >   <FRAMESET ROWS="70,*" FRAMESPACING=0>
> >    <FRAME NAME="header" SRC="./header_default.asp?
> > NoCache=2%2F20%2F2008+7%3A35%3A47+AM" SCROLLING="no" MARGINWIDTH="2"
> > MARGINHEIGHT="0">
> >
> >    <FRAME NAME="bodyx" SRC=
> > body.asp?centerin=GGCC
> >    SCROLLING="auto" MARGINWIDTH="2" MARGINHEIGHT="2">
> >
> > </FRAMESET>
> >
> > </HTML>
>
> Then get the bodyx frame, not the frameset.
>
> --
> Gunnar Hjalmarsson
> Email: http://www.gunnar.cc/cgi-bin/contact.pl


How exactly does one get the bodyx frame, and more importantly how do
you auto select from the select box when there is no such mention of
it or a submit button in the html for this ASP application?
Thank you,
Jack

Gunnar Hjalmarsson 02-21-2008 12:50 AM

Re: How to save a webpage contents to a file ( with LWP )
 
Jack wrote:
> On Feb 20, 8:08 am, Gunnar Hjalmarsson <nore...@gunnar.cc> wrote:
>> Jack wrote:
>>> this is all there is in the html
>>> <!--<SCRIPT>
>>> //
>>> </SCRIPT>-->
>>> <FRAMESET ROWS="70,*" FRAMESPACING=0>
>>> <FRAME NAME="header" SRC="./header_default.asp?
>>> NoCache=2%2F20%2F2008+7%3A35%3A47+AM" SCROLLING="no" MARGINWIDTH="2"
>>> MARGINHEIGHT="0">
>>> <FRAME NAME="bodyx" SRC=
>>> body.asp?centerin=GGCC
>>> SCROLLING="auto" MARGINWIDTH="2" MARGINHEIGHT="2">
>>> </FRAMESET>
>>> </HTML>
>>
>> Then get the bodyx frame, not the frameset.
>
> How exactly does one get the bodyx frame,


Assuming the URL of the frameset is
http://www.example.com/somepage/index.asp, you probably use the URL
http://www.example.com/somepage/body.asp?centerin=GGCC
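
The relative-URL step can be sketched with the URI module, which is installed alongside LWP. It reuses the example frameset URL assumed above; neither URL is the real site.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use URI;

# Example frameset URL (an assumption, as above) and the SRC value
# of the bodyx frame as it appears in the frameset source.
my $frameset_url = 'http://www.example.com/somepage/index.asp';
my $frame_src    = 'body.asp?centerin=GGCC';

# Resolve the relative SRC against the frameset URL, the same way a
# browser does before loading the frame.
my $frame_url = URI->new_abs($frame_src, $frameset_url);
print "$frame_url\n";
# prints http://www.example.com/somepage/body.asp?centerin=GGCC

# Saving the frame itself needs network access, so it is commented out:
# use LWP::Simple qw(getstore);
# getstore("$frame_url", 'body.html');
```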

> and more importantly how do
> you auto select from the select box when there is no such mention of
> it or a submit button in html for this ASP application.


As Sinan mentioned, you apparently need to learn some basics about HTML.
Asking questions in a Perl group is not the right way to do so.

Recommended reading: http://www.w3.org/TR/html4/present/frames.html

--
Gunnar Hjalmarsson
Email: http://www.gunnar.cc/cgi-bin/contact.pl

