XML Parser VS HTML Parser

 
 
ZOCOR
Guest
Posts: n/a
 
      10-03-2004
Hi

Can an XML parser be used to parse an HTML document, even if it is not
well-formed?

If the answer is yes to both, can you recommend a Java XML parser class
(from the standard API)?

Cheers

ZOCOR



---
Outgoing mail is certified Virus Free.
Checked by AVG anti-virus system (http://www.grisoft.com).
Version: 6.0.760 / Virus Database: 509 - Release Date: 10/09/2004


 
 
 
 
 
Sudsy
Guest
Posts: n/a
 
      10-03-2004
ZOCOR wrote:
> Hi
>
> Can an XML parser be used to parse an HTML document, even if it is not
> well-formed?


No; an XML parser will balk at a lot of HTML, because most HTML is not
well-formed XML.

> If the answer is yes to both, can you recommend a Java XML parser class
> (from the standard API)?


Search the archives for alternate approaches.

 
 
 
 
 
[private]
Guest
Posts: n/a
 
      10-03-2004
ZOCOR wrote:
> Can an XML parser be used to parse an HTML document, even if it is not
> well-formed?
>

It can parse it as long as the HTML is well-formed. XML isn't as
relaxed as HTML, so any unclosed element will make the parser throw an
exception (an org.xml.sax.SAXException, most likely its SAXParseException
subclass, but can't verify right now).
 
 
Martin Honnen
Guest
Posts: n/a
 
      10-03-2004


ZOCOR wrote:


> Can an XML parser be used to parse an HTML document, even if it is not
> well-formed?


No, an XML parser can't parse HTML, unless of course it is XHTML. But
HTML 3.2 or HTML 4.01 cannot be parsed with an XML parser.

--

Martin Honnen
http://JavaScript.FAQTs.com/
 
 
Darryl L. Pierce
Guest
Posts: n/a
 
      10-03-2004
ZOCOR wrote:

> Can an XML parser be used to parse an HTML document, even if it is not
> well-formed?


A SAX or DOM parser will throw exceptions on data that's not well-formed.
So, the answer is no, it cannot.
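
A minimal sketch of that behaviour, using only the JAXP classes that ship
with the JDK. The class name WellFormedCheck and the method firstError are
made up for illustration; the sample fragments are assumptions, not taken
from the original poster's documents.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: reports the first well-formedness error, if any.
class WellFormedCheck {

    /** Returns null if the markup parses as XML, otherwise the parser's complaint. */
    static String firstError(String markup) {
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(markup)), new DefaultHandler());
            return null;
        } catch (SAXParseException e) {
            // DefaultHandler rethrows fatal errors, so malformed input ends up here.
            return "line " + e.getLineNumber() + ": " + e.getMessage();
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // XHTML style: the empty element is closed explicitly.
        System.out.println(firstError("<p>one<br/>two</p>"));
        // HTML style: <br> is never closed, so the XML parser gives up.
        System.out.println(firstError("<p>one<br>two</p>"));
    }
}
```

The first call should return null and the second an error message; the
exact wording of the message depends on the parser implementation.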

--
/**
* @author Darryl L. Pierce <(E-Mail Removed)>
* @see The Infobahn Offramp <http://mcpierce.mypage.org>
* @quote "Lobby, lobby, lobby, lobby, lobby, lobby..." - Adrian Monk
*/
 
 
Tor Iver Wilhelmsen
Guest
Posts: n/a
 
      10-03-2004
"[private]" <"[private]"@[private].net> writes:

> It can parse it as long as the HTML is well-formed.


Except for XHTML, HTML cannot be assumed to be well-formed since HTML
does not "end" empty elements properly; they are only empty by
implication, like <br>.

Also, real-world HTML is packed full of implicit begin and end tags a
parser needs to be aware of.
 
 
CarlosRivera
Guest
Posts: n/a
 
      10-03-2004
You could use HTML Tidy or a similar tool to turn the HTML into XHTML and
then use an XML parser.

ZOCOR wrote:
> Hi
>
> Can an XML parser be used to parse an HTML document, even if it is not
> well-formed?
>
> If the answer is yes to both, can you recommend a Java XML parser class
> (from the standard API)?

 
 
ZOCOR
Guest
Posts: n/a
 
      10-04-2004

"Darryl L. Pierce" <(E-Mail Removed)> wrote in message
news:1096821414.TMHnUn2xrpVueIRygtEFdA@teranews...
> ZOCOR wrote:
>
> > Can an XML parser be used to parse an HTML document, even if it is not
> > well-formed?
>
> A SAX or DOM parser will throw exceptions on data that's not well-formed.
> So, the answer is no, it cannot.


Well, can't I catch the exceptions so that processing can continue?

What's the problem?

ZOCOR





 
 
Tor Iver Wilhelmsen
Guest
Posts: n/a
 
      10-04-2004
"ZOCOR" <(E-Mail Removed)> writes:

> What's the problem?


<br> and the like, which are (implicitly) empty elements that a SAX
parser will not report an end element for, since they are start tags
for containing elements as far as the parser knows.

So you need to add a bunch of logic that handles optional start
elements, implicit end elements, and non-terminated empty elements.

But, hey, if you don't consider that a problem...
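
There is a second catch with "just catch the exception": a well-formedness
error is a fatal error in XML, so the parser stops at that point and
nothing after it is ever reported. A rough sketch, assuming the JDK's
built-in SAX parser; the class PartialParse, the method textSeen and the
sample markup are hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo: collects the text a SAX parser reports before it aborts.
class PartialParse {

    static List<String> textSeen(String markup) {
        List<String> seen = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                String text = new String(ch, start, length).trim();
                if (!text.isEmpty()) {
                    seen.add(text);
                }
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(markup)), handler);
        } catch (SAXParseException e) {
            // Catching the exception does not resume parsing; the rest is lost.
            seen.add("[parse aborted: " + e.getMessage() + "]");
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException(e);
        }
        return seen;
    }

    public static void main(String[] args) {
        // The unclosed <li> elements trip the parser at </ul>,
        // so the <h2> that follows is never reached.
        String html = "<body><ul><li>first<li>second</ul><h2>wanted</h2></body>";
        textSeen(html).forEach(System.out::println);
    }
}
```

Here the text inside <h2> is never delivered to the handler, even though
that element itself is perfectly well-formed.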
 
 
ZOCOR
Guest
Posts: n/a
 
      10-04-2004
> > What's the problem?
>
> <br> and the like, which are (implicitly) empty elements that a SAX
> parser will not report an end element for, since they are start tags
> for containing elements as far as the parser knows.
>
> So you need to add a bunch of logic that handles optional start
> elements, implicit end elements, and non-terminated empty elements.
>
> But, hey, if you don't consider that a problem...


Well, I'm only after specific text contained in certain tags, which
fortunately all have end tags. As for the other tags, I couldn't give two
rats about them.
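
For that kind of job, one option that stays inside the standard API is the
lenient HTML parser bundled with Swing (javax.swing.text.html.parser),
which nobody in this thread has mentioned, so treat this as a suggestion
rather than an endorsed answer. A minimal sketch; the class
TagTextExtractor, the method textInside and the sample markup are
hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper: collects the text found inside one kind of HTML tag.
class TagTextExtractor {

    static List<String> textInside(String html, HTML.Tag wanted) {
        List<String> found = new ArrayList<>();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            private int depth = 0; // how many wanted tags are currently open

            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                if (tag.equals(wanted)) {
                    depth++;
                }
            }

            @Override
            public void handleEndTag(HTML.Tag tag, int pos) {
                if (tag.equals(wanted) && depth > 0) {
                    depth--;
                }
            }

            @Override
            public void handleText(char[] data, int pos) {
                if (depth > 0) {
                    found.add(new String(data));
                }
            }
        };
        try {
            // The third argument tells the parser to ignore charset declarations.
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return found;
    }

    public static void main(String[] args) {
        String html = "<body><ul><li>first<li>second</ul><h2>wanted</h2></body>";
        System.out.println(textInside(html, HTML.Tag.H2));
    }
}
```

Unlike an XML parser, this one is built for HTML's optional end tags, so
the unclosed <li> elements do not stop it from reaching the <h2>.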


ZOCOR





 