Parsing HTML with HTML::TableExtract

 
 
Ninja Li
      11-27-2009
Hi,

I am trying to create a comma-delimited file by parsing HTML from the
website "http://www.earnings.com/conferencecall.asp?client=cb"
using the HTML::TableExtract module (thanks to Tad McClellan for the
introduction). However, I got the following warnings when running
my script, which is at the end of the post:
----------------------
Use of uninitialized value in join or string at conference.pl line 25.
Use of uninitialized value in join or string at conference.pl line 25.
Use of uninitialized value in join or string at conference.pl line 25.
Use of uninitialized value in join or string at conference.pl line 25.
Use of uninitialized value in join or string at conference.pl line 25.
Use of uninitialized value in join or string at conference.pl line 25.
Use of uninitialized value in join or string at conference.pl line 25.
Use of uninitialized value in join or string at conference.pl line 25.
Use of uninitialized value in join or string at conference.pl line 25.
Use of uninitialized value in join or string at conference.pl line 25.
Use of uninitialized value in join or string at conference.pl line 25.
HOGGF.PK
*,HOGG ROBINSON GROUP PLC,Half- Year HOGG ROBINSON GROUP PLC
Earnings Conference Call,,,4:00 AM
................
----------------------

Also notice the large gap between the first value "HOGGF.PK" and
the second, "HOGG ROBINSON GROUP PLC". There are only a few spaces after
the first field in the original HTML. From what I can see so far, it
seems the empty values in the fields are not handled correctly. The
source code is at the end of the post.

Please advise the root cause and the fix.

Thanks in advance.

Nick

----------------------------------------------
Source code:

use warnings;
use strict;
use LWP::Simple;
use HTML::TableExtract;

my $html = get 'http://www.earnings.com/conferencecall.asp?client=cb';

my @headers =
(
    'SYMBOL',
    'COMPANY',
    'EVENT TITLE',
    'WEBCAST',
    'TRANSCRIPT',
    'TIME'
);

my $te = HTML::TableExtract->new( headers => \@headers );
$te->parse($html);

foreach my $ts ( $te->tables )
{
    foreach my $row ( $ts->rows )
    {
        my $csv = join ',', @$row;
        print "$csv\n";
    }
}
 
 
sln@netherlands.com
      11-27-2009
On Fri, 27 Nov 2009 14:57:07 -0800 (PST), Ninja Li wrote:

>Hi,
>
> I am trying to create a comma-delimited file by parsing HTML from the
>website "http://www.earnings.com/conferencecall.asp?client=cb"
>using the HTML::TableExtract module (thanks to Tad McClellan for the
>introduction). However, I got the following warnings when running
>my script, which is at the end of the post:
>----------------------
>Use of uninitialized value in join or string at conference.pl line 25.
>Use of uninitialized value in join or string at conference.pl line 25.
>Use of uninitialized value in join or string at conference.pl line 25.
>Use of uninitialized value in join or string at conference.pl line 25.
>Use of uninitialized value in join or string at conference.pl line 25.
>Use of uninitialized value in join or string at conference.pl line 25.
>Use of uninitialized value in join or string at conference.pl line 25.
>Use of uninitialized value in join or string at conference.pl line 25.
>Use of uninitialized value in join or string at conference.pl line 25.
>Use of uninitialized value in join or string at conference.pl line 25.
>Use of uninitialized value in join or string at conference.pl line 25.
>HOGGF.PK
> *,HOGG ROBINSON GROUP PLC,Half- Year HOGG ROBINSON GROUP PLC
>Earnings Conference Call,,,4:00 AM
>...............
>----------------------
>
> Also notice the large gap between the first value "HOGGF.PK" and
>the second, "HOGG ROBINSON GROUP PLC". There are only a few spaces after
>the first field in the original HTML. From what I can see so far, it
>seems the empty values in the fields are not handled correctly. The
>source code is at the end of the post.
>
> Please advise the root cause and the fix.
>
> Thanks in advance.
>
> Nick
>

What have you done to find out what caused this ridiculous
number of warnings? Nothing, judging from your code.
Something is off, WAY off! Something is wrong with your content or
headers. You have to learn the module, which means you have to read the docs
for it. Then plan ahead. Look at the source of the HTML.

This is not rocket science.
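
One way to start investigating is to print each cell of a row so that undefined cells and embedded whitespace become visible. The following is only a sketch: describe_row is a hypothetical helper, not part of HTML::TableExtract, and the sample row is made up to resemble the output quoted above.

```perl
use strict;
use warnings;

# Hypothetical debugging helper (not part of HTML::TableExtract):
# describe every cell of one row so that undef cells and embedded
# newlines are easy to spot.
sub describe_row {
    my ($row) = @_;
    my @out;
    for my $i ( 0 .. $#$row ) {
        my $cell = $row->[$i];
        if ( defined $cell ) {
            ( my $shown = $cell ) =~ s/\n/\\n/g;    # make newlines visible
            push @out, sprintf( "[%d] '%s'", $i, $shown );
        }
        else {
            push @out, sprintf( "[%d] <undef>", $i );
        }
    }
    return join ' ', @out;
}

# Made-up sample row; in the real script this would be a row from $ts->rows.
print describe_row( [ "HOGGF.PK\n   *", undef, '4:00 AM' ] ), "\n";
```

Calling describe_row($row) inside the inner foreach loop of the posted script, in place of the join, would show which columns come back undefined.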

-sln
 
 
Martien Verbruggen
      11-28-2009
On Fri, 27 Nov 2009 14:57:07 -0800 (PST),
Ninja Li wrote:
> Hi,
>
> I am trying to create a comma-delimited file by parsing HTML from the
> website "http://www.earnings.com/conferencecall.asp?client=cb"
> using the HTML::TableExtract module (thanks to Tad McClellan for the
> introduction). However, I got the following warnings when running
> my script, which is at the end of the post:
> ----------------------
> Use of uninitialized value in join or string at conference.pl line 25.
> Use of uninitialized value in join or string at conference.pl line 25.
> Use of uninitialized value in join or string at conference.pl line 25.
> Use of uninitialized value in join or string at conference.pl line 25.
> Use of uninitialized value in join or string at conference.pl line 25.
> Use of uninitialized value in join or string at conference.pl line 25.
> Use of uninitialized value in join or string at conference.pl line 25.
> Use of uninitialized value in join or string at conference.pl line 25.
> Use of uninitialized value in join or string at conference.pl line 25.
> Use of uninitialized value in join or string at conference.pl line 25.
> Use of uninitialized value in join or string at conference.pl line 25.
> HOGGF.PK
> *,HOGG ROBINSON GROUP PLC,Half- Year HOGG ROBINSON GROUP PLC
> Earnings Conference Call,,,4:00 AM
> ...............


That is not the only output. I get more.

> Also notice the large gap between the first value "HOGGF.PK" and
> the second, "HOGG ROBINSON GROUP PLC". There are only a few spaces after
> the first field in the original HTML. From what I can see so far, it


Check the 'original' HTML again. What's currently at that URL has the
spaces that you see. I guess they must have changed it since you last
looked at it.

> seems the empty values in the fields are not handled correctly. The
> source code is at the end of the post.


Define 'correctly'. Or rather, find out what HTML::TableExtract defines
as correctly, and adjust your expectations to that. Cells without text
content seem to be returned as undefined values. It's your job to deal
with that in whichever way you think it should be dealt with.
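
One possible way to deal with it, as a minimal sketch: map undefined cells to empty strings and collapse the whitespace inside each cell before joining. row_to_csv is a hypothetical helper, not part of HTML::TableExtract, and the sample row is made up to resemble the quoted output. The quoting of fields that contain commas or double quotes is an added assumption about what the CSV consumer expects.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of HTML::TableExtract): turn one row,
# as an array reference, into a single CSV line.
sub row_to_csv {
    my ($row) = @_;
    my @fields = map {
        my $cell = defined $_ ? $_ : '';    # empty cell: undef becomes ''
        $cell =~ s/\s+/ /g;                 # collapse whitespace, including newlines
        $cell =~ s/^\s+|\s+$//g;            # trim both ends
        if ( $cell =~ /[",]/ ) {            # quote fields containing , or "
            $cell =~ s/"/""/g;
            $cell = qq{"$cell"};
        }
        $cell;
    } @$row;
    return join ',', @fields;
}

# Made-up sample row; in the real script this would be a row from $ts->rows.
my @row = ( "HOGGF.PK\n      *", 'HOGG ROBINSON GROUP PLC', undef, undef, '4:00 AM' );
print row_to_csv( \@row ), "\n";
```

In the posted script, the join line inside the inner loop could then become my $csv = row_to_csv($row); which avoids the "uninitialized value" warnings because join never sees an undefined element.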

> Please advise the root cause and the fix.


If you want, I can send you a contract and rate card.

Martien
--
Martien Verbruggen | Can't say that it is, 'cause it ain't.
 