Velocity Reviews - Computer Hardware Reviews

How to get table from some html

 
 
dysgraphia
Guest
Posts: n/a
 
      02-05-2007
I am new to Perl and also to the Mechanize module.
So far I have obtained a table, table[4] below, with
useful text I would like to put into a tabular format like:

List Position Patient Name Weight Height Clinic Doctor

but I am unsure as to how to proceed.
I will want to send the data to an Access db later so hopefully this
format will be amenable to this.

Any suggestions or assistance appreciated!

Below is my code followed by the relevant portion of html.
In practice the daily list may vary in length up to about 30 patients.

#!/usr/bin/perl
use strict;
use warnings;
use WWW::Mechanize;
use HTTP::Cookies;

my $mech = WWW::Mechanize->new(
    agent      => 'Mozilla/4.0',
    cookie_jar => {}
);

my $url = 'http://www.somemedicaldata'; # not a real page

$mech->get($url);
unless ($mech->success) {
    die "Cannot get login page $url: ",
        $mech->response->status_line;
}

my $content = $mech->content();

print "Content is: \"$content\"\n";

# get table data
my @table;
my $tmp = $content;
my $tablecount=0;

while (my $result=$tmp=~/(?=\x3CTABLE).*?(?=\x3C\/TABLE\x3E)/igsm)
{
    $tablecount++;
    $table[$tablecount]= $&;
}

print "Number of tables: \"$tablecount\"\n";

# table4 has the useful data
my ($dd1,$dd2) = split('<tr class="texttab" ',$table[4]);
$table[4] = $dd2;

# Save table4 raw to see what is collected
open(my $fh, '>', 'table4raw.txt') or die "Cannot write table4raw.txt: $!";
print $fh $table[4];
close($fh);
# end of code

This is the table4 html:

<table width="741" border="0" cellpadding="2" cellspacing="1">
<tr bgcolor="#CC9966"

class="texttab"> <td><div align="center"><font
color="#663300"><strong>List
Position</strong></font></div></td>
<td><div align="center"><font

color="#663300"><strong>Patient Name</strong></font></div></td>
<td><div
align="center"><font
color="#663300"><strong>Weight</strong></font></div></td>

<td><div align="center"><font
color="#663300"><strong>Height</strong></font></div></td>
<td><div align="center"><font
color="#663300"><strong>Clinic</strong></font></div></td>
<td><div align="center"><font
color="#663300"><strong>Doctor</strong></font></div></td>
</tr> <tr
class="texttab" > <td
align="center">1</td> <td align="center">A Smith
</td>
<td align="center">78.0</td> <td
align="center">185</td>
<td align="center">AM</td> <td align="center">F
Magoo</td> </tr>
<tr class="texttab" bgcolor=#FFFFFF >
<td
align="center">2</td> <td align="center">B
Smith</td> <td
align="center">56.0</td> <td
align="center">165</td> <td
align="center">PM</td> <td align="center">L
Magee</td> </tr>
<tr class="texttab" >
<td align="center">3</td>
<td align="center">C Smith </td>
<td
align="center">66.0</td> <td
align="center">171</td> <td
align="center">RM</td> <td align="center">R
Magaa</td> </tr>

 
 
 
 
 
Brian McCauley
Guest
Posts: n/a
 
      02-05-2007
On Feb 5, 5:48 am, dysgraphia <(E-Mail Removed)> wrote:
> I am new to Perl and also to the Mechanize module.
> So far I have obtained a table, table[4] below, with
> useful text I would like to put into a tabular format like:
>
> List Position Patient Name Weight Height Clinic Doctor
>
> but I am unsure as to how to proceed.
> I will want to send the data to an Access db later so hopefully this
> format will be amenable to this.
>
> Any suggestions or assistance appreciated!


I suggest you parse HTML with an HTML parser. Looking for a module with
"HTML" and "Parser" in its name would be a good start. Since you are
specifically looking to parse tables you may want to see if there's
one with "Table" in its name too.

 
 
 
 
 
Tad McClellan
Guest
Posts: n/a
 
      02-05-2007
dysgraphia <(E-Mail Removed)> wrote:

> So far I have obtained a table, table[4] below, with
> useful text I would like to put into a tabular format like:



> Any suggestions or assistance appreciated!



use HTML::TableExtract;
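
A minimal sketch of how that module might be applied to the table in the original post. It assumes the column headings are unique enough to be matched by single words, and it uses a trimmed inline copy of the HTML in place of $mech->content. The helper names extract_patients and clean are made up for illustration.

```perl
use strict;
use warnings;
use HTML::TableExtract;

# Trimmed stand-in for the page; in the real script this would be
# $mech->content (assumption).
my $html = <<'HTML';
<table><tr><td>Some other table</td></tr></table>
<table width="741" border="0">
<tr bgcolor="#CC9966" class="texttab">
 <td><strong>List
Position</strong></td> <td><strong>Patient Name</strong></td>
 <td><strong>Weight</strong></td> <td><strong>Height</strong></td>
 <td><strong>Clinic</strong></td> <td><strong>Doctor</strong></td>
</tr>
<tr class="texttab"><td align="center">1</td><td align="center">A Smith
</td><td align="center">78.0</td><td align="center">185</td>
<td align="center">AM</td><td align="center">F
Magoo</td></tr>
<tr class="texttab" bgcolor=#FFFFFF><td>2</td><td>B
Smith</td><td>56.0</td><td>165</td><td>PM</td><td>L
Magee</td></tr>
</table>
HTML

# Collapse runs of whitespace (the source wraps lines inside cells).
sub clean {
    my ($text) = @_;
    $text = '' unless defined $text;
    $text =~ s/\s+/ /g;
    $text =~ s/^ | $//g;
    return $text;
}

# Return the data rows of every table whose header row matches.
sub extract_patients {
    my ($html) = @_;

    # Header strings are matched as patterns against the header cells,
    # so single words avoid trouble with the line break in "List Position".
    my $te = HTML::TableExtract->new(
        headers => [qw(Position Name Weight Height Clinic Doctor)],
    );
    $te->parse($html);

    my @rows;
    for my $ts ( $te->tables ) {
        for my $row ( $ts->rows ) {
            push @rows, [ map { clean($_) } @$row ];
        }
    }
    return @rows;
}

my @patients = extract_patients($html);
print join( "\t", @$_ ), "\n" for @patients;
```

Selecting by header text rather than by position (table[4]) should keep working if the page gains or loses unrelated tables.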


--
Tad McClellan SGML consulting
(E-Mail Removed) Perl programming
Fort Worth, Texas
 
 
dysgraphia
Guest
Posts: n/a
 
      02-05-2007
Brian McCauley wrote:
>
> I suggest you parse HTML with an HTML parser. Looking for a module with
> "HTML" and "Parser" in its name would be a good start. Since you are
> specifically looking to parse tables you may want to see if there's
> one with "Table" in its name too.
>


Thanks Brian, I will look through the modules based on your suggestions.
Your help is appreciated!...cheers, Peter
 
 
gf
Guest
Posts: n/a
 
      02-05-2007
I am partial to HTML::TreeBuilder for my parsing.

After a tree has been built from the HTML you use the methods in
HTML::Element to traverse the tree. look_down() is very powerful and
is my go-to routine.

You can easily find the location of your target table in the tree with
look_down(), then loop through the rows and cells, extracting the
contents of the cells using as_text().

Use an array to mimic the table structure. This is untested and
doesn't check for all errors, but I'd loop through the table with
something like:

use warnings;
use strict;
use LWP::Simple;
use HTML::TreeBuilder;

my $html = get('the URL you want to retrieve') or die "Can't get URL.\n";
my $tree = HTML::TreeBuilder->new_from_content($html);

my @table_data;
foreach my $table ( $tree->look_down( '_tag' => 'table' ) )
{
    foreach my $tr ( $table->look_down( '_tag' => 'tr' ) )
    {
        my @row_data;
        foreach my $td ( $table->look_down( '_tag' => 'td' ) )
        {
            push @row_data, $td->as_text();
        }
        push @table_data, [@row_data];
    }
}

foreach my $r (@table_data)
{
    print join( "\t", @$r ), "\n";
}

You might have to flesh out the look_down() calls to narrow your table
selections, but for a single table embedded in a page it should
suffice.

 
 
gf
Guest
Posts: n/a
 
      02-05-2007
> foreach my $td ( $table->look_down( '_tag' => 'td' ) )
> {
> push @row_data, $td->as_text();
> }


OOPS, that should be

foreach my $td ( $tr->look_down( '_tag' => 'td' ) )
{
push @row_data, $td->as_text();
}
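
With that correction in place, the whole loop might look like the sketch below. It runs against a cut-down inline copy of the table rather than a fetched page, assumes the wanted rows all carry class="texttab" as in the posted HTML, and tidies the whitespace left by the wrapped source lines. The sub name table_rows is hypothetical.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Cut-down stand-in for the real page (assumption).
my $html = <<'HTML';
<html><body>
<table>
<tr class="texttab"><td><strong>List Position</strong></td><td><strong>Patient Name</strong></td></tr>
<tr class="texttab"><td align="center">1</td><td align="center">A Smith
</td></tr>
<tr class="texttab"><td align="center">2</td><td align="center">B
Smith</td></tr>
</table>
</body></html>
HTML

# Return one array ref of cell texts per matching row.
sub table_rows {
    my ($html) = @_;
    my $tree = HTML::TreeBuilder->new_from_content($html);

    my @table_data;
    for my $tr ( $tree->look_down( _tag => 'tr', class => 'texttab' ) ) {
        my @row_data;

        # Search under the row, not under the table, so each row
        # only collects its own cells.
        for my $td ( $tr->look_down( _tag => 'td' ) ) {
            my $text = $td->as_text;
            $text =~ s/\s+/ /g;
            $text =~ s/^ | $//g;
            push @row_data, $text;
        }
        push @table_data, \@row_data;
    }

    $tree->delete;    # release the parse tree
    return @table_data;
}

my @rows = table_rows($html);
print join( "\t", @$_ ), "\n" for @rows;
```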

 
 
dysgraphia
Guest
Posts: n/a
 
      02-06-2007
gf wrote:
> I am partial to HTML::TreeBuilder for my parsing.
>
> After a tree has been built from the HTML you use the methods in
> HTML::Element to traverse the tree. look_down() is very powerful and
> is my go-to routine.


Thanks gf!
I have had a look at your suggestion of HTML::TreeBuilder and can see
it is most likely worth me learning. I have installed the module and
given it some trial runs on example code and your code. Comments of mine
below.

> You can easily find the location of your target table in the tree with
> look_down(), then loop through the rows and cells, extracting the
> contents of the cells using as_text().
>
> Use an array to mimic the table structure. This is untested and
> doesn't check for all errors, but I'd loop through the table with
> something like:
>
> use warnings;
> use strict;
> use LWP::Simple;
> use HTML::TreeBuilder;
>
> my $html = get('the URL you want to retrieve') or die "Can't get URL.\n";
> my $tree = HTML::TreeBuilder->new_from_content($html);
>
> my @table_data;
> foreach my $table ( $tree->look_down( '_tag' => 'table' ) )
> {
> foreach my $tr ( $table->look_down( '_tag' => 'tr' ) )
> {
> my @row_data;
> foreach my $td ( $table->look_down( '_tag' => 'td' ) )
> {
> push @row_data, $td->as_text();
> }
> push @table_data, [@row_data];
> }
> }
>
> foreach my $r (@table_data)
> {
> print join( "\t", @$r ), "\n";
> }
>
> You might have to flesh out the look_down() calls to narrow your table
> selections, but for a single table embedded in a page it should
> suffice.
>


I tried your code and it ran perfectly. My project has a
table-within-tables structure. The HTML has a lot of dross that I want
to avoid.
I did a bit of digging and found some articles and links by Sean M.
Burke, e.g.
http://aspn.activestate.com/ASPN/doc.../Scanning.html
and tried to use his suggestion for rejecting certain tables.
He wrote:

$h1 = $tree->look_down('_tag', 'h1');
returns the first element at-or-under $tree whose "_tag" attribute has
the value "h1".......
you could exclude ``h1'' elements that contain the word ``visit'' under
them:

my $real_h1 = $tree->look_down(
'_tag', 'h1',
sub {
$_[0]->as_text !~ m/\bvisit/i
}
);

I adapted and tried this code but could not get the table to be excluded.
In my case the HTML has a large (approx 700 line) table I don't want.
This table has tags like <option>....</option> to identify it but
putting this in the above code did not work.
Any comments or suggestions of yours are welcome...thanks again for your
help so far....cheers, Peter



 
 
dysgraphia
Guest
Posts: n/a
 
      02-06-2007
Tad McClellan wrote:
>>Any suggestions or assistance appreciated!

>
> use HTML::TableExtract;
>

Thanks Tad I will check this module out.
 
 
dysgraphia
Guest
Posts: n/a
 
      02-06-2007
Michele Dondi wrote:
> You will have to parse it. So use some HTML parsing module. One such
> module that gets mentioned frequently here is HTML::TokeParser. There
> are others though, and you may want to check some of them to find the
> best one for you.


Thanks for your input Michele, I will have a look at TokeParser.
>
>>List Position Patient Name Weight Height Clinic Doctor

>
> Do you mean in pure text? Then use some pure text table formatting
> module, like Text::Table or Perl6::Form.
>
> Michele
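
As a rough illustration of the Text::Table suggestion, the sketch below formats already-extracted rows as aligned plain text. The @rows data is hard-coded from the sample HTML in the first post, and the variable names are arbitrary.

```perl
use strict;
use warnings;
use Text::Table;

# Rows as they might come out of the parsing step (sample data only).
my @rows = (
    [ 1, 'A Smith', '78.0', 185, 'AM', 'F Magoo' ],
    [ 2, 'B Smith', '56.0', 165, 'PM', 'L Magee' ],
    [ 3, 'C Smith', '66.0', 171, 'RM', 'R Magaa' ],
);

# One column per heading; widths are worked out from the data.
my $tb = Text::Table->new(
    'List Position', 'Patient Name', 'Weight', 'Height', 'Clinic', 'Doctor',
);
$tb->load(@rows);

# A Text::Table object stringifies to the formatted table.
my $report = "$tb";
print $report;
```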


I am using Perl 5.8 from ActiveState. My initial requirement was to see
the text in either a text editor or spreadsheet format. This was just to
ensure I am getting the data correctly as I will have a need to download
many files on a weekly basis. When the parsing looks OK I will then send
it to a db.
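
One way to get a file that opens in a spreadsheet now, and that a database import should also accept later, is to write the rows as CSV. This is only a sketch: it assumes the Text::CSV module is installed, the file name patients.csv and the sub name write_csv are made up, and the rows are the sample data from the first post.

```perl
use strict;
use warnings;
use Text::CSV;

my @header = ( 'List Position', 'Patient Name', 'Weight', 'Height', 'Clinic', 'Doctor' );
my @rows   = (
    [ 1, 'A Smith', '78.0', 185, 'AM', 'F Magoo' ],
    [ 2, 'B Smith', '56.0', 165, 'PM', 'L Magee' ],
    [ 3, 'C Smith', '66.0', 171, 'RM', 'R Magaa' ],
);

# Write each array ref as one CSV record; returns the number written.
sub write_csv {
    my ( $path, @records ) = @_;

    my $csv = Text::CSV->new( { binary => 1, eol => "\n" } )
        or die "Cannot create Text::CSV object: " . Text::CSV->error_diag;

    open my $fh, '>', $path or die "Cannot write $path: $!";
    $csv->print( $fh, $_ ) for @records;
    close $fh or die "Cannot close $path: $!";

    return scalar @records;
}

my $written = write_csv( 'patients.csv', \@header, @rows );
print "Wrote $written records\n";
```

Letting the module handle quoting avoids broken columns if a name ever contains a comma.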

Again, thanks for your help Michele...appreciated...cheers, Peter
 
 
gf
Guest
Posts: n/a
 
      02-06-2007
On Feb 6, 6:45 am, dysgraphia <(E-Mail Removed)> wrote:

> I tried your code and it ran perfectly.


That occasionally happens.

[...]

> my $real_h1 = $tree->look_down(
> '_tag', 'h1',
> sub {
> $_[0]->as_text !~ m/\bvisit/i
> }
> );
>


You're on the right track, just keep following it. Because you're so
close to the answer I'm just going to say "keep going".

sub {} calls in look_down() are your friends - they're really
powerful. Sometimes I've needed to use multiple embedded subs to chain
together the results of the look_down(). In effect this causes the
test to drill down into the HTML deeper and deeper to determine if the
child nodes contain what you want.

And remember that all the criteria passed to a look_down() are ANDed
together: an element has to match every attribute test and every
embedded sub {} before it is returned.

For an OR test on a single attribute, a qr// regexp pattern with
alternation is a powerful option.

Stylistically I like to use the '=>' operator to separate my argument
pairs in the look_down() parameter list rather than plain commas.

OK, I lied. Here's an (untested) example of drilling in farther.

[...]
foreach my $_tr (
    $tree->look_down(
        '_tag'  => 'tr',
        'class' => qr/row[123]/,
        sub {
            $_[0]->look_down(
                '_tag' => 'td',
                'id'   => qr/^datafield_(?:name|date|age)/,
                sub {
                    $_[0]->as_text() =~ /\bfoo\b/;
                }
            );
        }
    )
  )
{
    ;    # ...do something revolutionary here
}
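
The same drilling-down idea can be turned around to reject a table, as in the question about the large table full of <option> tags. The sketch below is untested against the real page: the inline HTML, the id values and the sub name wanted_tables are invented, and it assumes the wanted rows carry class="texttab".

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Invented stand-in: one unwanted table of <option>s, one wanted table.
my $html = <<'HTML';
<html><body>
<table id="menu"><tr><td><select><option>Clinic A</option><option>Clinic B</option></select></td></tr></table>
<table id="list">
<tr class="texttab"><td>1</td><td>A Smith</td></tr>
<tr class="texttab"><td>2</td><td>B Smith</td></tr>
</table>
</body></html>
HTML

# Tables that hold texttab rows and have no <option> anywhere below them.
sub wanted_tables {
    my ($tree) = @_;
    return $tree->look_down(
        _tag => 'table',

        # Reject the table if any descendant is an <option>.
        sub { !$_[0]->look_down( _tag => 'option' ) },

        # Require at least one data row underneath.
        sub { scalar $_[0]->look_down( _tag => 'tr', class => 'texttab' ) },
    );
}

my $tree   = HTML::TreeBuilder->new_from_content($html);
my @tables = wanted_tables($tree);

for my $table (@tables) {
    for my $tr ( $table->look_down( _tag => 'tr', class => 'texttab' ) ) {
        print join( "\t", map { $_->as_text } $tr->look_down( _tag => 'td' ) ), "\n";
    }
}
```

In a table-within-tables layout, an outer table that wraps both inner tables also has <option> descendants, so it is rejected along with the unwanted one and only the inner data table is returned.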

 