HTML::TokeParser & TableExtract

 
 
Abram
      04-25-2006
I'm fairly new to Perl, so bare with me.

I am trying to extract a table from an HTML file and parse through each
row, then dump the extracted cell data into a csv file. This was
pretty easy to accomplish with HTML::TokeParser, however I have one
problem. Each HTML file I need to parse has three tables with the same
structure. I need to separate these three tables into three csv files.

I can use TableExtract to get the exact tables using the depth and
count matching (depth is always 2 and count is 5-7), but I am not sure
how to then parse only that table and extract the data. I'm sure this
is pretty simple stuff, and I'll kick myself when I see the answer.

Thanks in advance.

--Abram

 
David Squire
      04-25-2006
Abram wrote:
> I'm fairly new to Perl, so bare with me.


What an image! I guess you mean "bear with me".

(Sorry, but it seems to be spelling/idiom correction day here).

DS
 
Dr.Ruud
      04-25-2006
David Squire wrote:

> it seems to be spelling/idiom correction day here


How perfect, on my birthday!

--
Affijn, Ruud (44)

"Gewoon is een tijger."


 
Abram
      04-25-2006
Ha! My brain has become a bit mushy after hours of "learning" Perl,
so I didn't even notice... I'd better put something on!

At least it got some attention. Any suggestions (not on my apparel, but
on the HTML data extraction)?

 
Tad McClellan
      04-26-2006
Abram <(E-Mail Removed)> wrote:

> I can use TableExtract to get the exact tables using the depth and
> count matching (depth is always 2 and count is 5-7), but I am not sure
> how to then parse only that table and extract the data. I'm sure this
> is pretty simple stuff, and I'll kick myself when I see the answer.



From "perldoc HTML::TableExtract":

$te = new HTML::TableExtract( depth => 2, count => 2 );
$te->parse($html_string);
foreach $ts ($te->table_states) {
    print "Table found at ", join(',', $ts->coords), ":\n";
    foreach $row ($ts->rows) {
        print " ", join(',', @$row), "\n";
    }
}


That seems to do it.

Are you having trouble modifying that to produce CSV?
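
For the original question (three tables, one CSV file each), here is a minimal sketch along the same lines. It assumes the tables of interest are the depth-2 tables with counts 5 to 7, as described in the first post. The helper name `tables_by_count` and the `table_N.csv` file names are made up for illustration; only `new`, `parse`, `table_states`, `coords` and `rows` are taken from the perldoc excerpt above. Fields are written unquoted, and the script only does anything when given an HTML file name on the command line.

```perl
use strict;
use warnings;
use HTML::TableExtract;

# Hypothetical helper: return { count => [rows] } for every depth-2 table
# whose count is one of @wanted_counts.
sub tables_by_count {
    my ($html, @wanted_counts) = @_;
    my %wanted = map { $_ => 1 } @wanted_counts;

    # Constrain on depth only, then pick tables by their count afterwards.
    my $te = HTML::TableExtract->new( depth => 2 );
    $te->parse($html);

    my %rows_for;
    for my $ts ($te->table_states) {
        my ($depth, $count) = $ts->coords;
        next unless $wanted{$count};
        $rows_for{$count} = [ $ts->rows ];
    }
    return \%rows_for;
}

# Usage: perl script.pl page.html  (writes table_5.csv, table_6.csv, table_7.csv)
if (my $file = shift @ARGV) {
    open my $in, '<', $file or die "Cannot read $file: $!";
    my $html = do { local $/; <$in> };    # slurp the whole file
    close $in;

    my $rows_for = tables_by_count($html, 5 .. 7);
    for my $count (sort { $a <=> $b } keys %$rows_for) {
        my $name = "table_$count.csv";
        open my $out, '>', $name or die "Cannot write $name: $!";
        for my $row (@{ $rows_for->{$count} }) {
            # Empty cells may be undef; print them as empty fields.
            print {$out} join(',', map { defined $_ ? $_ : '' } @$row), "\n";
        }
        close $out or die "Cannot close $name: $!";
    }
}
```

Filtering on the count after parsing keeps it to one pass over the HTML; building three extractors, one per `depth`/`count` pair, would work as well.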


--
Tad McClellan SGML consulting
(E-Mail Removed) Perl programming
Fort Worth, Texas
 
Abram
      04-26-2006
Thanks Tad,


> Tad McClellan wrote:
> Are you having trouble modifying that to produce CSV?


Actually yes. I have been using the code from perldoc (slightly
modified), but cannot seem to get the proper structure for csv. That
is why I was looking into TokeParser as I could easily parse through
each TD and conditionally extract the data.

Could you provide some help on how to get this done with TableExtract?
My HTML looks something like this:
....
<table>
  <tr>
    <td> Header 1 </td>
    <td> Header 2 </td>
    <td> Header 3 </td>
  </tr>
  <!-- Data Starts Here -->
  <tr id="Data_Row_1">
    <td> data 1_1 </td>
    <td> data 1_2 </td>
    <td> data 1_3 </td>
  </tr>
  <tr id="Data_Row_1_1">
    <td colspan=3> More data for 1 </td>
  </tr>
  <tr id="Data_Row_2">
    <td> data 2_1 </td>
    <td> data 2_2 </td>
    <td> data 2_3 </td>
  </tr>
  <tr id="Data_Row_2_1">
    <td colspan=3> More data for 2 </td>
  </tr>
</table>
(NOTE: The actual HTML doesn't have tr id's; they are used here just to
illustrate associated rows)

To make things even more interesting I need to extract the "More data
for NN" row and append it to the data row.

Any suggestions?

 
A. Sinan Unur
      04-26-2006
"Abram" <(E-Mail Removed)> wrote in news:1146068409.560958.129860
@i39g2000cwa.googlegroups.com:

> Thanks Tad,
>
>
>> Tad McClellan wrote:
>> Are you having trouble modifying that to produce CSV?

>
> Actually yes. I have been using the code from perldoc (slightly
> modified), but cannot seem to get the proper structure for csv. That
> is why I was looking into TokeParser as I could easily parse through
> each TD and conditionally extract the data.


....

> <tr id="Data_Row_1">
> <td> data 1_1 </td>
> <td> data 1_2 </td>
> <td> data 1_3 </td>
> </tr>
> <tr id="Data_Row_1_1">
> <td colspan=3> More data for 1 </td>
> </tr>


....

> (NOTE: Actual html doesn't have tr id's, used just to illustrate
> associated rows)
>
> To make things even more interesting I need to extract the "More data
> for NN" row and append it to the data row.


Which column are you supposed to put the data in "More data for NN"?

Sinan

--
A. Sinan Unur <(E-Mail Removed)>
(remove .invalid and reverse each component for email address)

comp.lang.perl.misc guidelines on the WWW:
http://augustmail.com/~tadmc/clpmisc...uidelines.html

 
Abram
      04-26-2006
Sinan,

> Which column are you supposed to put the data in "More data for NN"?


The last column of the row. So it would look like this in the csv:
data 1_1,data 1_2,data 1_3,More data for 1
data 2_1,data 2_2,data 2_3,More data for 2
data 3_1,data 3_2,data 3_3,More data for 3
data 4_1,data 4_2,data 4_3,More data for 4
....etc...

--Abram

 
Tad McClellan
      04-26-2006
Abram <(E-Mail Removed)> wrote:
>> Tad McClellan wrote:
>> Are you having trouble modifying that to produce CSV?

>
> Actually yes. I have been using the code from perldoc (slightly
> modified), but cannot seem to get the proper structure for csv.



It is _already_ CSV, with extra spaces at the beginning and
no quotes around fields.

Modify the boilerplate code to eliminate the extra spaces, and
to put quotes around fields.
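
A minimal sketch of that modification, assuming every field should be trimmed, always quoted, and have embedded double quotes doubled (the usual CSV convention). `csv_line` is a hypothetical helper name, not part of HTML::TableExtract, and the `@rows` data is a stand-in for what `$ts->rows` would return. For real data, the Text::CSV module from CPAN is probably the safer choice.

```perl
use strict;
use warnings;

# Hypothetical helper: turn one row (a list of cell strings) into a CSV line.
sub csv_line {
    my @fields = @_;
    for my $field (@fields) {
        $field = '' unless defined $field;   # empty cells may come back undef
        $field =~ s/^\s+|\s+$//g;            # drop leading/trailing whitespace
        $field =~ s/"/""/g;                  # double any embedded quotes
        $field = qq{"$field"};               # wrap the field in quotes
    }
    return join ',', @fields;
}

# Stand-in for the rows that $ts->rows would return.
my @rows = (
    [ ' data 1_1 ', ' data 1_2 ', ' data 1_3 ' ],
    [ ' say "hi" ', undef,        ' a,b ' ],
);

print csv_line(@$_), "\n" for @rows;
```

In the perldoc loop this would replace the inner `print` with something like `print csv_line(@$row), "\n";`.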


> Could you provide some help on how to get this done with TableExtract?



Sure.

Post your broken code, and someone will help you fix it.


> To make things even more interesting I need to extract the "More data
> for NN" row and append it to the data row.



How do you identify what is to be joined?

Does it always have the "More data" text in it? (I doubt it)

Are there times when there is NOT a "continuation" row?

Can there be more than one "continuation row"?

etc...


> Any suggestions?



If you need debugging help, you pretty much have to post the
code that you want debugged...


--
Tad McClellan SGML consulting
(E-Mail Removed) Perl programming
Fort Worth, Texas
 
A. Sinan Unur
      04-27-2006
"Abram" <(E-Mail Removed)> wrote in news:1146090054.571782.194790
@i40g2000cwc.googlegroups.com:

> Sinan,
>
>> Which column are you supposed to put the data in "More data for NN"?

>
> The last column of the row. So it would look like this in the csv:
> data 1_1,data 1_2,data 1_3,More data for 1
> data 2_1,data 2_2,data 2_3,More data for 2
> data 3_1,data 3_2,data 3_3,More data for 3
> data 4_1,data 4_2,data 4_3,More data for 4


Each regular row will contain 3 elements. The continuation row will have
only one element. Join that element with the third column of the previous
row.
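
One way the algorithm above might look, as a sketch rather than a tested solution. It assumes a continuation row is any row with exactly one non-blank cell (depending on the module version, the cells covered by the colspan may be missing or undef), and it appends that cell as an extra last column, matching the CSV layout Abram asked for. `merge_continuations` is a hypothetical helper name and `@rows` is a stand-in for `$ts->rows`.

```perl
use strict;
use warnings;

# Hypothetical helper: fold each continuation row into the row before it.
# Assumption: a continuation row has exactly one non-blank cell.
sub merge_continuations {
    my @rows = @_;
    my @merged;
    for my $row (@rows) {
        my @cells = grep { defined($_) && /\S/ } @$row;
        if (@cells == 1 && @merged) {
            push @{ $merged[-1] }, $cells[0];   # append as an extra last column
        }
        else {
            push @merged, [@$row];              # copy so the input stays untouched
        }
    }
    return @merged;
}

# Stand-in for the rows that $ts->rows would return (header row left out).
my @rows = (
    [ 'data 1_1', 'data 1_2', 'data 1_3' ],
    [ 'More data for 1', undef, undef ],
    [ 'data 2_1', 'data 2_2', 'data 2_3' ],
    [ 'More data for 2' ],
);

print join(',', @$_), "\n" for merge_continuations(@rows);
```

A regular row that happens to have only one filled-in cell would be misread as a continuation, so the test needs tightening if that can occur in the real data.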

For more help, post your best attempt to implement the algorithm above. If
it does not work and I don't get a chance to look at it, someone else will
definitely help you fix it.

Sinan
--
A. Sinan Unur <(E-Mail Removed)>
(remove .invalid and reverse each component for email address)

comp.lang.perl.misc guidelines on the WWW:
http://augustmail.com/~tadmc/clpmisc...uidelines.html

 