Velocity Reviews - Computer Hardware Reviews



Extracting table in html page

 
 
shankar_perl_rookie
 
      07-21-2010
Hello All,

I have an HTML file from which I am trying to extract a table. The problem
I am facing is that there are a lot of tables in the page, and the table I am
looking to extract appears after a particular string, say $some_text. I
know how to search for the string in the HTML page, but
what I want to do is capture the table that immediately follows
$some_text.

Any suggestions on how to do this?

Thanks,
Shankar
 
 
 
 
 
Jim Gibson
 
      07-21-2010
In article
<233f66ab-b5eb-449d-b3b0->,
shankar_perl_rookie <> wrote:

> Hello All,
>
> I have an html file where I am trying to extract a table. The problem
> I am facing is there are lot of tables in the page and the table I am
> looking to extract appears after a particular string say $some_text. I
> know of a way that I can search for the string in the html page but
> what I want to do is capture a table that immediately follows the
> $some_text.
>
> Any suggestions on how to do this ??


The most reliable way would be to use the HTML::Parser module to parse
the html file, register appropriate handlers for the table elements
(<table>, <tr>, <td>) and one for text elements, look for your string,
and process the next table encountered in a callback (handler
subroutines are called as callbacks by the parsing method).
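
A minimal sketch of that handler approach, assuming HTML::Parser's version 3 API. The function name table_after_text and the sample markup are made up for illustration. The handlers count <table> depth after the text has been seen, so a nested table is kept whole:

```perl
use strict;
use warnings;
use HTML::Parser;

# Return the raw HTML of the first table that starts after $some_text,
# or undef if no complete table follows it.
sub table_after_text {
    my ($html, $some_text) = @_;
    my ($seen_text, $depth, $done, $table) = (0, 0, 0, '');

    my $parser = HTML::Parser->new(
        api_version => 3,
        start_h => [ sub {
            my ($tag, $raw) = @_;
            return if $done;
            $depth++ if $seen_text && $tag eq 'table';
            $table .= $raw if $depth;          # inside the wanted table
        }, 'tagname, text' ],
        end_h => [ sub {
            my ($tag, $raw) = @_;
            return if $done || !$depth;
            $table .= $raw;
            # The wanted table ends when its own </table> closes.
            $done = 1 if $tag eq 'table' && --$depth == 0;
        }, 'tagname, text' ],
        text_h => [ sub {
            my ($decoded, $raw) = @_;
            return if $done;
            $table .= $raw if $depth;
            $seen_text = 1 if !$depth && index($decoded, $some_text) >= 0;
        }, 'dtext, text' ],
    );
    $parser->unbroken_text(1);    # do not split text runs across callbacks
    $parser->parse($html);
    $parser->eof;
    return $done ? $table : undef;
}

# Hypothetical sample page: one table before the text, a nested one after.
my $html = '<table><tr><td>skip</td></tr></table><h2>Quarterly results</h2>'
         . '<table border="1"><tr><td>a<table><tr><td>inner</td></tr></table>'
         . '</td></tr></table><table><tr><td>later</td></tr></table>';
my $found = table_after_text($html, 'Quarterly results');
print "$found\n" if defined $found;
```

The match is on the decoded text of a single text run, so a marker string that is split across inline tags (for example by <b>) would need extra handling.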

Another way would be to use a module to extract tables from HTML. There
are at least two on CPAN: HTML::TableExtract and HTML::TableParser. The
problem using these is to find the table after the specified text. Is
there some other way of identifying the table?
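
If the table can be identified by its column headings instead, a sketch with HTML::TableExtract might look like this. The headings 'Name' and 'Price' and the sample markup are assumptions for illustration only; the original page is not shown in this thread:

```perl
use strict;
use warnings;
use HTML::TableExtract;

# Hypothetical sample page; only the second table has the wanted headings.
my $html = <<'HTML';
<table><tr><td>navigation</td></tr></table>
<p>Price list</p>
<table>
  <tr><th>Name</th><th>Price</th></tr>
  <tr><td>Widget</td><td>9.99</td></tr>
  <tr><td>Gadget</td><td>19.50</td></tr>
</table>
HTML

# Only tables containing both headings are extracted.
my $te = HTML::TableExtract->new( headers => [ 'Name', 'Price' ] );
$te->parse($html);
$te->eof;

my @rows;
for my $ts ( $te->tables ) {
    push @rows, $ts->rows;    # each row is an array reference of cell text
}
print join( ',', @$_ ), "\n" for @rows;
```

This sidesteps the "table after some text" problem entirely, so it only helps when the headings are known and distinctive.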

The quick and dirty way is to use a regular expression (untested):

if ( $html =~ m{ \Q$some_text\E .*? <table\b[^>]*> (.*?) </table> }isx ) {
    # table contents are in $1
}

However, this will not always work. It fails on nested tables, for
example, which are common in some HTML, because the non-greedy match
stops at the first </table>. If you are in a hurry it might work for
you, but it is always better to use a real parser for HTML.
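
To make the nested-table failure concrete, here is a small self-contained demonstration. The sample markup and the marker text 'Totals' are made up for illustration; the pattern quotes $some_text with \Q...\E and allows attributes on the <table> tag:

```perl
use strict;
use warnings;

# Hypothetical page: the wanted table contains a nested table.
my $html = '<p>Totals</p><table><tr><td>outer'
         . '<table><tr><td>inner</td></tr></table>'
         . 'tail</td></tr></table>';
my $some_text = 'Totals';

my ($captured) =
    $html =~ m{ \Q$some_text\E .*? <table\b[^>]*> (.*?) </table> }isx;

# The capture stops at the inner table's </table>, so the outer table
# is cut short and "tail" is lost.
print "$captured\n";    # prints: <tr><td>outer<table><tr><td>inner</td></tr>
```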

--
Jim Gibson
 
 
 
 
 
sln@netherlands.com
 
      07-22-2010
On Wed, 21 Jul 2010 16:08:17 -0700, Jim Gibson <> wrote:

>In article
><233f66ab-b5eb-449d-b3b0->,
>shankar_perl_rookie <> wrote:
>

[snip]

>The quick and dirty way is to use a regular expression (untested):
>
>if( $html =~ m{ $some_text .*? <table> (.*?) </table> }isx ) {
> # table contents in $1
>}
>
>However, this will not always work. It fails if you have nested tables,
>for example, which is a common occurrence in some HTML. However, if you
>are in a hurry it might work for you. It is always better to use a real
>parser for HTML.


It's ALWAYS trivial to parse a markup language's markup,
i.e. to parse out tags (open|close), attributes and content.
Creating an element tree (document) from HTML is another
process altogether. XHTML/XML, not so bad; SGML, er ..

I always laugh when people say a 'real parser for HTML', because they
don't know what they're saying; they're just parroting phrases from
so-called gods and passing them along.
As if a SAX parser does anything more than a realtime parse on a stream,
i.e. a markup parse. Easily done by regular expressions.

Oh, and before anybody starts that "regular language" crap, they had better
be able to explain what the "can't" part means!
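
A rough sketch of that kind of tag-level scan, using one regular expression to find <table> open and close tags and a counter to track nesting. The function name is made up, and this is not a full HTML tokenizer: it assumes no "<table" or "</table" strings inside comments, scripts or attribute values.

```perl
use strict;
use warnings;

# Return the raw HTML of the first table that starts after $some_text,
# nested tables included, or nothing if there is no such table.
sub table_after_text_regex {
    my ($html, $some_text) = @_;
    my $start = index($html, $some_text);
    return if $start < 0;
    pos($html) = $start + length $some_text;    # scan from after the text

    my ($depth, $begin) = (0, undef);
    while ($html =~ m{<(/?)table\b[^>]*>}gi) {
        if ($1) {                               # closing tag
            next unless $depth;                 # stray </table>, ignore
            return substr($html, $begin, $+[0] - $begin)
                if --$depth == 0;
        }
        else {                                  # opening tag
            $begin = $-[0] unless $depth++;     # remember the outermost one
        }
    }
    return;                                     # table never closed
}

# Hypothetical sample with a nested table after the marker text.
my $html = '<table><tr><td>skip</td></tr></table><p>Totals</p>'
         . '<table border="1"><tr><td>outer<table><tr><td>inner</td></tr>'
         . '</table>tail</td></tr></table>';
my $found = table_after_text_regex($html, 'Totals');
print "$found\n" if defined $found;
```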

-sln
 
 
HASM
 
      07-22-2010
Jim Gibson <> writes:

>> I have an html file where I am trying to extract a table. The problem
>> I am facing is there are lot of tables in the page and the table I am
>> looking to extract appears after a particular string say $some_text.


> The most reliable way would be to use the HTML::Parser module to parse
> the html file,


Or HTML::TreeBuilder:

use HTML::TreeBuilder;
use LWP::UserAgent;
my $url = 'http://www.example.com/...';
my $browser = LWP::UserAgent->new;
my $response = $browser->request(HTTP::Request->new(GET => $url));
if ($response->is_success) {
    my $tree = HTML::TreeBuilder->new;
    my $content = $tree->parse_content($response->decoded_content);
    # search for the text with look_down (there are other ways)
    my $text = $content->look_down(...);
    # then for your table
    my $table = $content->look_down('_tag', 'table', ...);
    # etc.
}
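
One possible way to fill in those look_down calls, as a sketch only. It assumes the marker text sits in a single text segment, and it uses objectify_text so that text and elements are visited together in document order. The helper name and the sample markup are made up for illustration:

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Return the first <table> element that starts after $some_text in
# document order, or undef if there is none.
sub table_after_text {
    my ($tree, $some_text) = @_;

    # Temporarily turn text segments into '~text' pseudo-elements so that
    # look_down() sees them in order alongside the real elements.
    $tree->objectify_text;
    my $seen_text = 0;
    my $table = $tree->look_down(sub {
        my $el = shift;
        if ($el->tag eq '~text') {
            $seen_text ||= index($el->attr('text'), $some_text) >= 0;
            return 0;
        }
        return $seen_text && $el->tag eq 'table';
    });
    $tree->deobjectify_text;    # restore plain text so as_text works again
    return $table;
}

# Hypothetical sample page with one table before the text and one after.
my $html = '<table><tr><td>skip</td></tr></table><h2>Quarterly results</h2>'
         . '<table><tr><td>a</td><td>b</td></tr></table>';
my $tree  = HTML::TreeBuilder->new_from_content($html);
my $table = table_after_text($tree, 'Quarterly results');
my @cells = $table ? map { $_->as_text } $table->look_down(_tag => 'td') : ();
print join(',', @cells), "\n";
$tree->delete;
```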

-- HASM
 
 
sopan.shewale@gmail.com
 
      07-22-2010
The best way may be:
split on $some_text and throw away the first part.
my ($junk, $interest_html) = split /\Q$some_text\E/, $html, 2;

Then use HTML::TreeBuilder on $interest_html to parse the tables.
Grab the first table and you are done.

Let me know if you find it difficult to use HTML::TreeBuilder.
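
A minimal sketch of the split-then-parse idea. The sample markup and marker text are made up for illustration; the split uses a limit of 2 so that everything after the first occurrence of the text stays in one piece:

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical sample page with one table before the text and one after.
my $html = '<table><tr><td>skip</td></tr></table><h2>Quarterly results</h2>'
         . '<table><tr><th>x</th><th>y</th></tr>'
         . '<tr><td>1</td><td>2</td></tr></table>';
my $some_text = 'Quarterly results';

# Keep only what follows the first occurrence of the marker text.
my (undef, $interest_html) = split /\Q$some_text\E/, $html, 2;
die "marker text not found\n" unless defined $interest_html;

# The fragment starts mid-document; HTML::TreeBuilder tolerates that.
my $tree  = HTML::TreeBuilder->new_from_content($interest_html);
my $table = $tree->look_down(_tag => 'table');    # first table in the rest

my @rows;
if ($table) {
    for my $tr ($table->look_down(_tag => 'tr')) {
        push @rows, [ map { $_->as_text } $tr->look_down(_tag => qr/^t[dh]$/) ];
    }
}
print join(',', @$_), "\n" for @rows;
$tree->delete;
```

If the marker text also appears inside a tag or attribute earlier in the page, the split would cut there instead, so this works best with a distinctive string.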

--sopan shewale



On Jul 22, 3:21 am, shankar_perl_rookie <mulshan...@gmail.com> wrote:
> Hello All,
>
> I have an html file where I am trying to extract a table. The problem
> I am facing is there are lot of tables in the page and the table I am
> looking to extract appears after a particular string say $some_text. I
> know of a way that I can search for the string in the html page but
> what I want to do is capture a table that immediately follows the
> $some_text.
>
> Any suggestions on how to do this ??
>
> Thanks,
> Shankar


 
 
 
 