Velocity Reviews


digz 05-22-2009 01:04 AM

Capturing actual Browser output in perl
 
#!/usr/bin/perl
use LWP;
my $browser = LWP::UserAgent->new;
my $response = $browser->get( "http://lkml.org" );
print( $response->content );

In this program I am trying to get the output as the browser displays
it, not the actual HTML page with all the tags that
$response->content returns.


For example, for the URL above,

what I want to save in a string is how the browser shows it:

Last 100 messages Today's messages Yesterday's messages
Hottest Messages
LKML.ORG

NOT

what the actual HTML content is:

<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
<link href="/css/frontpage.css" rel="stylesheet" type="text/css" />

<title>LKML.ORG - the Linux Kernel Mailing List Archive</title>
<script type="text/javascript" src="/css/multiline-tooltip.js"></script>
</head>
......
Is there any easy way to achieve this?

Thanks

Digz

Jürgen Exner 05-22-2009 03:39 AM

Re: Capturing actual Browser output in perl
 
digz <Digvijoy.C@gmail.com> wrote:
>#!/usr/bin/perl
>use LWP;
>my $browser = LWP::UserAgent->new;
>my $response = $browser->get( "http://lkml.org" );
>print( $response->content );
>
>In this program I am trying to get the output as the browser displays
>it , not the actual HTML page with all the tags .., that
>$response->content returns.


The way you stated your requirements, your best bet is a screen capture
tool, because the output of a browser depends not only on the HTML but
to a large part on user settings and configuration.
Therefore a different rendering tool would have to use the same
configuration as the browser and interpret it the same way.

>For a example , this URL ,
>
>What I want to save in a string is how the browser shows it


But a browser shows a graphic with different fonts, styles, colors,
layouts, tables, ....
You cannot save that as a "text string" (unless you incorporate that
formatting information in the string, of course, but then it is no
longer plain text).

>Last 100 messages Today's messages Yesterday's messages
>Hottest Messages
>LKML.ORG
>
>NOT
>
>what the actual HTML content is:
>.....
>Is there any easy way to achieve this


The easiest way to get an approximation of the textual part of the
display is to use a text-only browser such as Lynx and redirect its
output to a file (Lynx's -dump option writes the rendered text to
standard output).

Another way, probably more customizable (what do you intend to do with
tool tips? Alternate text and captions for graphics? DHTML? How much
JavaScript do you want to run? ...?) is to run the HTML code through an
HTML parser and extract those text pieces you are interested in. There
are several parsers on CPAN.
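
As a rough illustration of the parser approach, here is a minimal sketch using HTML::Parser from CPAN. It only collects text nodes and skips the contents of <script>, <style> and <title>; it does not try to reproduce layout, run JavaScript, or honor CSS. The sub name visible_text, the list of ignored tags, and the sample markup (including its href values) are assumptions made for the example, not anything the modules prescribe.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::Parser;

# Return the text nodes of an HTML document as one line of text,
# skipping anything inside <script>, <style> or <title>.
# (The choice of ignored tags is an assumption for this sketch.)
sub visible_text {
    my ($html) = @_;
    my @chunks;
    my $skip    = 0;                                  # > 0 while inside an ignored element
    my %ignored = map { $_ => 1 } qw(script style title);

    my $parser = HTML::Parser->new(
        api_version => 3,
        start_h => [ sub { $skip++ if $ignored{ $_[0] } }, 'tagname' ],
        end_h   => [ sub { $skip-- if $ignored{ $_[0] } && $skip > 0 }, 'tagname' ],
        text_h  => [ sub { push @chunks, $_[0] unless $skip }, 'dtext' ],
    );
    $parser->unbroken_text(1);    # deliver each text run as a single chunk
    $parser->parse($html);
    $parser->eof;

    my $text = join ' ', @chunks;
    $text =~ s/\s+/ /g;           # collapse runs of whitespace
    $text =~ s/^\s+|\s+$//g;      # trim both ends
    return $text;
}

# Hypothetical sample markup, loosely modeled on the page quoted above.
my $sample = <<'HTML';
<html><head><title>LKML.ORG - the Linux Kernel Mailing List Archive</title>
<script type="text/javascript">var x = "<b>not shown</b>";</script>
</head>
<body><a href="/last100/">Last 100 messages</a>
<a href="/today/">Today's messages</a></body></html>
HTML

print visible_text($sample), "\n";
```

To run it against a live page, the string returned by $response->content in the original program could be passed to visible_text instead of the sample.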



Gunnar Hjalmarsson 05-22-2009 03:55 PM

Re: Capturing actual Browser output in perl
 
digz wrote:
> #!/usr/bin/perl
> use LWP;
> my $browser = LWP::UserAgent->new;
> my $response = $browser->get( "http://lkml.org" );
> print( $response->content );
>
> In this program I am trying to get the output as the browser displays
> it , not the actual HTML page with all the tags .., that
> $response->content returns.


You may want to check out:

http://search.cpan.org/dist/html2text/

http://search.cpan.org/perldoc?HTML:...ext::Html2text

--
Gunnar Hjalmarsson
Email: http://www.gunnar.cc/cgi-bin/contact.pl

Franken Sense 05-23-2009 11:55 PM

Re: Capturing actual Browser output in perl
 
In Dread Ink, the Grave Hand of digz Did Inscribe:

> In this program I am trying to get the output as the browser displays
> it , not the actual HTML page with all the tags .., that
> $response->content returns.


I was attempting much the same thing a while back, and I think this
was the closest I came:

#!/usr/bin/perl
# perl wahab4.pl

use strict;
use warnings;
use LWP::Simple;
use HTML::Parser;
use HTML::FormatText;
my ($html, $ascii);
$html = get("http://www.co-array.com/");
defined $html
or die "Can't fetch HTML from http://www.co-array.com/";
$ascii = HTML::FormatText->new->format(parse_html($html));
print $ascii;


C:\MinGW\source>perl wahab4.pl
Undefined subroutine &main::parse_html called at wahab4.pl line 12.

I'm having trouble using the methods that are on CPAN. I sure wish every
module included a bevy of examples.
--
Frank

No Child Left Behind is the most ironically named act, piece of legislation
since the 1942 Japanese Family Leave Act.
~~ Al Franken, in response to the 2004 SOTU address
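
The "Undefined subroutine &main::parse_html" error in the post above is consistent with the script loading HTML::Parser, which does not export parse_html; that function comes from the older HTML::Parse module. A minimal sketch of one way around it, assuming HTML::TreeBuilder and HTML::FormatText are installed, is to build the tree explicitly and hand it to the formatter. The sub name html_to_text, the margin values, and the sample markup are assumptions for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder;
use HTML::FormatText;

# Convert an HTML string to formatted plain text.
# The margins are arbitrary choices for this sketch.
sub html_to_text {
    my ($html) = @_;
    my $tree      = HTML::TreeBuilder->new_from_content($html);
    my $formatter = HTML::FormatText->new( leftmargin => 0, rightmargin => 72 );
    my $text      = $formatter->format($tree);
    $tree->delete;    # release the parse tree explicitly
    return $text;
}

# Hypothetical sample markup so the example runs without network access.
my $sample = '<html><head><title>Demo</title></head>'
           . '<body><h1>LKML.ORG</h1><p>Hottest Messages</p></body></html>';

print html_to_text($sample);
```

In the original script, the string returned by LWP::Simple's get() would take the place of $sample. The result is still only an approximation of what a graphical browser displays, as noted earlier in the thread.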

