Velocity Reviews - Computer Hardware Reviews

HTML Processing in Java

 
 
Honza
 
      11-29-2005
Hello,

I would like to process HTML pages in Java. The very first task would
be to ignore unnecessary information like comments (everything in <!--
-->) or images.
What would be the best starting point?
I have found JTidy and HTML Parser on SourceForge, but neither of them
seems able to ignore tags - or did I miss it?

Thank you for any clue
Honza

 
Roedy Green
 
      11-29-2005
On 29 Nov 2005 01:11:37 -0800, "Honza" <(E-Mail Removed)> wrote,
quoted or indirectly quoted someone who said :

>I would like to process html pages in java. The very first task would
>be to ignore unnecessary information like comments (everything in <!--
>-->) or images.
>What would be the best start point?


See http://mindprod.com/products1.html#ENTITIES
to strip out the HTML and optionally convert the &xxx; entities back to
normal characters.
--
Canadian Mind Products, Roedy Green.
http://mindprod.com Java custom programming, consulting and coaching.
 
zero
 
      11-29-2005
"Honza" <(E-Mail Removed)> wrote in news:1133255497.231778.229120
@g14g2000cwa.googlegroups.com:

> Hello,
>
> I would like to process html pages in java. The very first task would
> be to ignore unnecessary information like comments (everything in <!--
> -->) or images.
> What would be the best start point?
> I have found JTidy and HTML Parser in SourceForge, but none of them is
> able of ignoring tags - or did I miss it?
>
> Thank you for any clue
> Honza
>
>


I would be very surprised if either of those actually did anything with the
comments. If they do, why not just remove the code that handles them?

--
Beware the False Authority Syndrome
 
Oliver Wong
 
      11-29-2005

"Honza" <(E-Mail Removed)> wrote in message
news:(E-Mail Removed) oups.com...
> Hello,
>
> I would like to process html pages in java. The very first task would
> be to ignore unnecessary information like comments (everything in <!--
> -->) or images.
> What would be the best start point?
> I have found JTidy and HTML Parser in SourceForge, but none of them is
> able of ignoring tags - or did I miss it?
>
> Thank you for any clue
> Honza


I haven't used the parsers you're talking about, but if you find any
SAX-based parser, you'll just receive a bunch of "events" representing the
discovery of "things" in an HTML document, and you can just ignore the
"comment" events.
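
To illustrate the event idea, here is a minimal sketch using only the SAX parser that ships with the JDK. It assumes the input is well-formed XHTML; real-world HTML would need an HTML-tolerant SAX parser instead. The class name SaxTextExtractor and the method extract are made up for this example. A plain DefaultHandler is never told about comments at all, and the element events for <img> are simply not acted on, so only the text survives.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: collects character data and ignores every other event.
class SaxTextExtractor extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();

    // Called for text between tags; comments never arrive here.
    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }

    // Assumption: the input is well-formed XHTML, not arbitrary HTML.
    static String extract(String xhtml) {
        SaxTextExtractor handler = new SaxTextExtractor();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xhtml)), handler);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Could not parse input", e);
        }
        return handler.text.toString();
    }

    public static void main(String[] args) {
        String page = "<p>Hello<!-- hidden --> <img src=\"a.png\"/>world</p>";
        System.out.println(extract(page)); // prints: Hello world
    }
}
```

If the comments were wanted after all, SAX reports them through the separate org.xml.sax.ext.LexicalHandler interface, which this sketch does not register.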

- Oliver


 
Roedy Green
 
      11-29-2005
On Tue, 29 Nov 2005 11:01:38 GMT, Roedy Green
<(E-Mail Removed) > wrote, quoted or
indirectly quoted someone who said :

>See http://mindprod.com/products1.html#ENTITIES
>to strip the HTML out optionally convert the &xxx; entities back to
>normal characters.


With a simple modification, you could strip just comments, not all
HTML tags.
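
A rough sketch of what comment-only stripping might look like follows. This is not the mindprod code; the class name CommentStripper and the method stripComments are invented for illustration. It assumes comments are plain <!-- ... --> pairs and makes no attempt to recognise comment-like text inside <script> blocks or attribute values.

```java
// Hypothetical helper: removes <!-- ... --> spans and leaves all tags in place.
class CommentStripper {
    static String stripComments(String html) {
        StringBuilder out = new StringBuilder(html.length());
        int pos = 0;
        while (true) {
            int open = html.indexOf("<!--", pos);
            if (open < 0) {
                // No further comments: copy the remainder and stop.
                out.append(html, pos, html.length());
                break;
            }
            // Copy everything before the comment.
            out.append(html, pos, open);
            int close = html.indexOf("-->", open + 4);
            if (close < 0) {
                // Unterminated comment: treat the rest of the input as comment.
                break;
            }
            pos = close + 3;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(stripComments("<p>a<!-- x -->b</p><!-- y -->"));
        // prints: <p>ab</p>
    }
}
```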
--
Canadian Mind Products, Roedy Green.
http://mindprod.com Java custom programming, consulting and coaching.
 
Abhijat Vatsyayan
 
      11-29-2005
Honza wrote:
> Hello,
>
> I would like to process html pages in java. The very first task would
> be to ignore unnecessary information like comments (everything in <!--
> -->) or images.
> What would be the best start point?
> I have found JTidy and HTML Parser in SourceForge, but none of them is
> able of ignoring tags - or did I miss it?
>
> Thank you for any clue
> Honza
>

Take a look at the classes ParserDelegator (in package
javax.swing.text.html.parser) and HTMLEditorKit.ParserCallback (in
package javax.swing.text.html).

You can subclass ParserCallback and pass your subclass to the parse
method of a ParserDelegator object. This is much like using a SAX
parser for XML documents.
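
A minimal sketch of that approach, with made-up names (TextOnlyCallback, extract): only handleText is overridden, so comments (handleComment) and tags such as <img> (handleSimpleTag) fall through to the inherited do-nothing methods. How the parser splits text into chunks is up to the Swing parser, so the chunks are simply joined with spaces here.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical callback: keeps text, ignores comments and all tags.
class TextOnlyCallback extends HTMLEditorKit.ParserCallback {
    private final List<String> chunks = new ArrayList<>();

    // Called for each run of text the parser finds.
    @Override
    public void handleText(char[] data, int pos) {
        chunks.add(new String(data));
    }

    // handleComment, handleStartTag, handleEndTag and handleSimpleTag are
    // deliberately not overridden; the inherited versions do nothing.

    static String extract(String html) {
        TextOnlyCallback callback = new TextOnlyCallback();
        try {
            // Third argument true: ignore any charset declared in the page.
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return String.join(" ", callback.chunks);
    }

    public static void main(String[] args) {
        String page = "<html><body><p>Hello<!-- hidden -->"
                + "<img src=\"a.png\">world</p></body></html>";
        System.out.println(extract(page));
    }
}
```

To act on images instead of dropping them, one could override handleSimpleTag and check for HTML.Tag.IMG.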

Abhijat
 
Honza
 
      11-29-2005
Thank you guys, I will check the possibilities.

I have found another interesting application which could also be
a solution to my problem. It is called Muffin - http://muffin.doit.org/
It is a highly customizable proxy written in Java that lets you filter
HTML content.
I am going to try it out tomorrow.

Thanks a lot
Honza

 
Honza
 
      11-30-2005
Hello Abhijat,

I have tested HTMLEditorKit today. It is really very easy to use and it
would be appropriate for my purpose...

BUT: I've tested it with "real world" HTML pages and I find it not
robust enough. The results are not accurate enough and the number of
errors is too high when parsing "badly written" HTML pages.

I have found a nice page benchmarking "real world" SAX HTML parsers. I
think I will use one of them...

Link: http://www.portletbridge.org/saxbenchmark

Honza

 