Velocity Reviews

worlman385@yahoo.com 03-16-2008 01:08 AM

HTML parsing
 

I need to parse the following HTML page and extract TV listing data
using VC++:

http://tvlistings.zap2it.com/tvlistings/ZCGrid.do

Is there a good way to extract the data?

Is it easy for VC++ to call a Perl script and do some regular expression matching?

Since the HTML page is not well-formed XML, I cannot use an XML parser,
right?

Are there any other good ways to extract data from an HTML page?

Malcolm Dew-Jones 03-16-2008 04:45 AM

Re: HTML parsing
 
worlman385@yahoo.com wrote:

: I need to parse the following HTML page and extract TV listing data
: using VC++:

: http://tvlistings.zap2it.com/tvlistings/ZCGrid.do

: Is there a good way to extract the data?

: Is it easy for VC++ to call a Perl script and do some regular expression matching?

: Since the HTML page is not well-formed XML, I cannot use an XML parser,
: right?

: Are there any other good ways to extract data from an HTML page?

Use Perl with the HTML::Parser module (my spelling is right, but the case may be wrong).

#!perl
use strict;
use HTML::Parser;
# ... Perl code, etc. ...

As an aside, this is also an excellent tool for SAX-like parsing of XML.
It has an XML mode that expects properly balanced tags and so on, and
though it doesn't handle all XML features, HTML::Parser comes with
almost all distributions of Perl, which means that any script that uses it can
work with almost any installation of Perl, even if you can't install
anything additional (a real lifesaver in a controlled environment).
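
The event-driven (SAX-like) style described above can be sketched as follows. This is only an illustration, written in Python with the standard-library html.parser module as a stand-in for HTML::Parser, so that it runs without installing anything. The class name, helper function, and sample markup are all made up for the example; the real listings page will have a different structure, so the tag names to watch for would need adjusting.

```python
from html.parser import HTMLParser


class CellTextCollector(HTMLParser):
    """Collect the text of every <td> cell, one list per <tr> row.

    Hypothetical helper: it assumes the data of interest lives in plain
    table cells, which may not match the real page.
    """

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # cells of the row being read, or None
        self._cell = None   # text pieces of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            # A new row implicitly ends an unclosed previous row.
            self._close_row()
            self._row = []
        elif tag == "td":
            # A new cell implicitly ends an unclosed previous cell.
            self._close_cell()
            if self._row is None:
                self._row = []
            self._cell = []

    def handle_endtag(self, tag):
        if tag == "td":
            self._close_cell()
        elif tag in ("tr", "table"):
            self._close_row()

    def handle_data(self, data):
        # Text outside a cell is ignored.
        if self._cell is not None:
            self._cell.append(data)

    def _close_cell(self):
        if self._cell is not None:
            # Join the pieces and collapse runs of whitespace.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None

    def _close_row(self):
        self._close_cell()
        if self._row:
            self.rows.append(self._row)
        self._row = None


def extract_rows(html):
    """Parse possibly malformed HTML and return the table rows found."""
    parser = CellTextCollector()
    parser.feed(html)
    parser.close()
    parser._close_row()  # flush a row left open at end of input
    return parser.rows


# Invented sample: the first row omits its closing tags, as real pages often do.
SAMPLE = """
<table>
<tr><td>7:00 PM<td>Channel 2<td>Evening News
<tr><td>7:30 PM</td><td>Channel 4</td><td>Quiz &amp; Answer Show</td></tr>
</table>
"""

if __name__ == "__main__":
    for row in extract_rows(SAMPLE):
        print(row)
```

The parser never requires the input to be well formed: it just reports tags and text as it meets them, and the handlers decide what a missing close tag means. That is why this style copes with pages an XML parser would reject.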


Peter Flynn 03-16-2008 12:35 PM

Re: HTML parsing
 
worlman385@yahoo.com wrote:
> I need to parse the following HTML page and extract TV listing data
> using VC++:
>
> http://tvlistings.zap2it.com/tvlistings/ZCGrid.do
>
> Is there a good way to extract the data?
>
> Is it easy for VC++ to call a Perl script and do some regular expression matching?
>
> Since the HTML page is not well-formed XML, I cannot use an XML parser,
> right?
>
> Are there any other good ways to extract data from an HTML page?


Pass the page through HTML Tidy, which produces well-formed XHTML.
Then use XSLT to extract what you need.
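
To illustrate why the Tidy step matters, here is a rough sketch in Python. It does not run Tidy or XSLT; it assumes Tidy has already been run (for example with its -asxhtml option) and uses the standard-library ElementTree path expressions in place of an XSLT stylesheet, since the standard library has no XSLT processor. Both markup samples and the function names are invented for the example.

```python
import xml.etree.ElementTree as ET

# XHTML elements live in this namespace, so path expressions must name it.
XHTML_NS = {"x": "http://www.w3.org/1999/xhtml"}

# Invented tag soup: unclosed <td> and <tr>, as on many real pages.
RAW_HTML = "<table><tr><td>7:00 PM<td>Evening News</table>"

# Invented stand-in for what a Tidy run might produce from such a page.
TIDIED = """<html xmlns="http://www.w3.org/1999/xhtml">
<head><title>Listings</title></head>
<body>
<table>
<tr><td>7:00 PM</td><td>Evening News</td></tr>
<tr><td>7:30 PM</td><td>Quiz Show</td></tr>
</table>
</body>
</html>"""


def is_well_formed(markup):
    """Return True if an XML parser accepts the markup."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


def table_rows(xhtml):
    """Return the text of each <td>, grouped by <tr>, from well-formed XHTML."""
    root = ET.fromstring(xhtml)
    rows = []
    for tr in root.iterfind(".//x:tr", XHTML_NS):
        cells = ["".join(td.itertext()).strip()
                 for td in tr.iterfind("x:td", XHTML_NS)]
        rows.append(cells)
    return rows


if __name__ == "__main__":
    print(is_well_formed(RAW_HTML))  # the raw page is rejected
    print(is_well_formed(TIDIED))    # the tidied page is accepted
    print(table_rows(TIDIED))
```

An XSLT stylesheet would select the same nodes with an XPath such as //x:tr/x:td, with the x prefix bound to the XHTML namespace. Real Tidy output may also carry a DOCTYPE and named entities, which this sketch does not cover.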

///Peter
--
XML FAQ: http://xml.silmaril.ie/


All times are GMT.