Best way to convert html to plain text in java?

 
 
google@lrlart.com
 
      03-19-2006
Hello,

I have a java servlet that processes plain text. I'd like to point to a
specific url and pull over a webpage, then convert it to plain text for
further processing.

I have written some code that simply strips tags from the html, but
this only does an OK job as it fails on poorly written html and
javascript (to name a few). Are there any java APIs that would perform
a better conversion? I've looked into JEditorPane and HTMLEditorKit,
but haven't had any luck in getting these to perform the conversion.
Thanks for any help!

 
 
 
 
 
Marcin Wielgus
 
      03-19-2006
On Sun, 19 Mar 2006 08:20:01 +0100, <(E-Mail Removed)> wrote:

> Hello,
>
> I have a java servlet that processes plain text. I'd like to point to a
> specific url and pull over a webpage, then convert it to plain text for
> further processing.
>
> I have written some code that simply strips tags from the html, but
> this only does an OK job as it fails on poorly written html and
> javascript (to name a few). Are there any java APIs that would perform
> a better conversion? I've looked into JEditorPane and HTMLEditorKit,
> but haven't had any luck in getting these to perform the conversion.
> Thanks for any help!
>


It's a bad solution, but you can always run html2text in a child process.

--
SaSol
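
For what it's worth, the child-process route might look roughly like the sketch below. It pipes the page to an external converter's stdin and reads the plain text back from its stdout. The class and method names are made up for illustration, and the command is passed in as a parameter because whether html2text is installed, and whether your build of it reads from stdin, is an assumption to check on your own machine.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.List;

final class ExternalHtmlToText {

    /**
     * Pipes {@code input} to the given command's stdin and returns its stdout.
     * The command is a parameter because the converter's name and flags depend
     * on what is installed; List.of("html2text") is only an assumption.
     */
    static String pipeThrough(List<String> command, String input) {
        try {
            Process process = new ProcessBuilder(command)
                    .redirectError(ProcessBuilder.Redirect.DISCARD)
                    .start();

            // Feed stdin on a separate thread so a full stdout pipe cannot deadlock us.
            Thread writer = new Thread(() -> {
                try (OutputStream stdin = process.getOutputStream()) {
                    stdin.write(input.getBytes(StandardCharsets.UTF_8));
                } catch (IOException ignored) {
                    // The child closed its stdin early; the exit code check reports failure.
                }
            });
            writer.start();

            String output;
            try (InputStream stdout = process.getInputStream()) {
                output = new String(stdout.readAllBytes(), StandardCharsets.UTF_8);
            }
            writer.join();

            int exitCode = process.waitFor();
            if (exitCode != 0) {
                throw new IOException(command + " exited with code " + exitCode);
            }
            return output;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while waiting for " + command, e);
        }
    }
}
```

A call would then be something like `ExternalHtmlToText.pipeThrough(List.of("html2text"), html)`. Starting a process per request is expensive in a servlet, which is part of why this is a fallback rather than a first choice.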


 
 
 
 
 
Dave Mandelin
 
      03-21-2006
Can you give some examples of how it fails on poorly written HTML? It
may not be that hard to bulletproof the tag-stripping code you wrote.

 
 
Roedy Green
 
      03-21-2006
On 20 Mar 2006 18:32:27 -0800, "Dave Mandelin"
<(E-Mail Removed)> wrote, quoted or indirectly quoted someone
who said :

>Can you give some examples of how it fails on poorly written HTML? It
>may not be that hard to bulletproof the tag-stripping code you wrote.


I wrote a tag stripper, but it presumes valid HTML. I suppose you
could, on hitting a < inside a tag, presume the > was missing and
insert one just before the first space after the last <.

You could look for standard tags.

The other common error is a < or > lying around by itself or next to
an =.

From a practical point of view, it might be easiest to run your HTML
through a validator, fix the errors, and then do your strip. See
http://mindprod.com/jgloss/htmlvalidator.html

Anything else is going to lose some data or insert some junk.
--
Canadian Mind Products, Roedy Green.
http://mindprod.com Java custom programming, consulting and coaching.
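
The two heuristics described in this post (presume a missing > when a new tag starts inside an unclosed one, and treat a lone < or > as literal text) can be sketched in Java as follows. This is an illustration only, not the stripper mentioned above; the class name and the rule that a < opens a tag only when followed by a letter, /, ! or ? are assumptions made for the sketch. Quoted attribute values are deliberately not tracked, since unmatched quotes would make that fragile.

```java
final class TolerantTagStripper {

    /** Returns the text of {@code html} with tags removed, guessing at malformed markup. */
    static String strip(String html) {
        StringBuilder text = new StringBuilder();
        StringBuilder tag = null; // non-null while inside a tag; holds the chars after '<'

        for (int i = 0; i < html.length(); i++) {
            char c = html.charAt(i);
            if (tag == null) {
                if (c == '<' && startsTag(html, i)) {
                    tag = new StringBuilder();
                } else {
                    text.append(c); // ordinary text, or a stray '<' or '>'
                }
            } else if (c == '>') {
                tag = null; // well-formed end of tag
            } else if (c == '<' && startsTag(html, i)) {
                // A new tag began before the old one closed: presume its '>' was missing.
                text.append(textAfterPresumedClose(tag));
                tag = new StringBuilder();
            } else {
                tag.append(c);
            }
        }
        if (tag != null) {
            text.append(textAfterPresumedClose(tag)); // input ended inside a tag
        }
        return text.toString();
    }

    /** Assumption: a '<' opens a tag only when followed by a letter, '/', '!' or '?'. */
    private static boolean startsTag(String html, int i) {
        if (i + 1 >= html.length()) {
            return false;
        }
        char next = html.charAt(i + 1);
        return Character.isLetter(next) || next == '/' || next == '!' || next == '?';
    }

    /** Treats the unclosed tag as ending at its first whitespace; the rest is text. */
    private static String textAfterPresumedClose(StringBuilder tag) {
        for (int i = 0; i < tag.length(); i++) {
            if (Character.isWhitespace(tag.charAt(i))) {
                return tag.substring(i);
            }
        }
        return "";
    }
}
```

With this sketch, `strip("a < b and c > d")` comes back unchanged, and `strip("<b bold <i>text</i>")` yields " bold text". The cost is the one already noted: an unclosed tag that carries attributes will leak them into the output as junk.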
 
 
google@lrlart.com
 
      03-21-2006
One failure I've run into is with the use of JavaScript, for example:

<script>

function CNN_getCookies() {
var hash = new Array;
if ( document.cookie ) {
var cookies = document.cookie.split( '; ' );
for ( var i = 0; i < cookies.length; i++ ) {

.......
Note: Notice the "less than" symbol in the javascript above.

</script>

This is slightly modified source from CNN's site. The point is that a
"<tag>" pattern is easy to recognize, but it is difficult to tell it
apart from a greater-than or less-than sign inside some enclosed
JavaScript code.

But even if I were to write some code that could handle this case
effectively I'd probably be dealing with loads of other special cases
within poorly written html source.
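
For this particular case, one possible workaround is to delete whole script and style elements before stripping any tags, so that a < inside the JavaScript is never looked at. The sketch below does that with java.util.regex; the class and method names are made up for illustration. It only handles blocks that have a closing tag, so an unterminated script block is left as it is.

```java
import java.util.regex.Pattern;

final class ScriptRemover {

    // (?is): case-insensitive, and '.' also matches line breaks.
    // \1 requires the closing tag to match the opening one (script or style).
    private static final Pattern SCRIPT_OR_STYLE = Pattern.compile(
            "(?is)<(script|style)\\b[^>]*>.*?</\\1\\s*>");

    // Naive tag pattern: everything from '<' to the next '>'.
    private static final Pattern TAG = Pattern.compile("<[^>]*>");

    /** Removes script and style elements together with their contents. */
    static String removeScripts(String html) {
        return SCRIPT_OR_STYLE.matcher(html).replaceAll("");
    }

    /** Removes scripts first, then strips the remaining tags. */
    static String toText(String html) {
        return TAG.matcher(removeScripts(html)).replaceAll("");
    }
}
```

Given `<p>before</p><script>for (var i = 0; i < n; i++) {}</script><p>after</p>`, `toText` returns "beforeafter". It does nothing for the other special cases in badly written HTML.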

 
 
Chris Uppal
 
      03-21-2006
(E-Mail Removed) wrote:

> But even if I were to write some code that could handle this case
> effectively I'd probably be dealing with loads of other special cases
> within poorly written html source.


Take it from me: parsing HTML is not trivial. And that's even without
considering all the invalid HTML out there (I don't mean stuff like incorrectly
nested structures, but unmatched double quotes, tags with no closing >, etc.).

JTidy appears to do what you are looking for, so it might help (I've never
tried it myself):
http://jtidy.sourceforge.net/

-- chris


 
 
Dave Mandelin
 
      03-21-2006
Ah, I see. Yeah, that looks pretty rough. JTidy looks like a really
nice program.

 
 
kalyan_iitd
 
      07-04-2006
Hi Dave, can you provide Java code for converting HTML to plain text using JTidy? For me, JTidy is working as an HTML validator only.

Could some expert provide code for HTML to text (using any Java API)?

Thanks in advance.
Kalyan.
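
No JTidy code was posted in this thread. As one possible answer to the "any Java API" part, here is a minimal sketch that uses the HTML parser shipped with the JDK (the parser behind the HTMLEditorKit mentioned in the first post) instead of JTidy. The class name is made up for illustration. The callback collects every run of character data the parser reports and joins the runs with spaces; how well this parser copes with the malformed pages discussed above has not been verified here.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

final class SwingHtmlToText {

    /** Extracts the character data from {@code html} using the JDK's Swing HTML parser. */
    static String toText(String html) {
        StringBuilder text = new StringBuilder();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                // Called for each run of character data; a space keeps the runs apart.
                text.append(data).append(' ');
            }
        };
        try {
            // true = ignore any charset declared in the page; the Reader is already decoded.
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return text.toString().trim();
    }
}
```

In a servlet you would read the page from the URL into a Reader (or a String) and pass it in. No Swing component is created, only the parser.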
 