Velocity Reviews - Computer Hardware Reviews

Convert HTML to plain text

 
 
Marcel Kessler
 
      11-13-2006
Hi there

Does anyone know a good way of converting HTML to plain text, keeping as
much of the formatting as possible?

The HTML will be produced by an editor like FCKEditor, and the
transformation should happen in Java.

So far I've found the following options, none of them really convincing:

# Using w3m or lynx to convert HTML to plain text
(http://www.biglist.com/lists/xsl-lis.../msg00689.html)
+ neat output
- need to call a native (C) program from Java

# Google gdata routine
(http://www.biglist.com/lists/xsl-lis.../msg00689.html)
+ Java source available
- only basic stripping, no tables etc.

# Use XML & XSLT
(http://www-128.ibm.com/developerwork...ary/x-xmlist1/)
+ good result
- complicated approach, cannot use a WYSIWYG editor like FCKEditor

# Use other tools like DocFrac, Detagger, NoteTab etc.
- no better results than with w3m

Thanks and regards
Marcel
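
For the w3m option, "calling C from Java" need not mean JNI: the converter can be run as an external process. A minimal sketch, assuming w3m is installed and on the PATH; the class and method names are hypothetical, and `-dump` and `-T text/html` are the w3m options for writing the rendered page to stdout and forcing the content type:

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

/** Hypothetical helper that shells out to w3m (assumed to be on the PATH). */
final class W3mDump {
    /** Builds the command line: render the given file as HTML and dump it as text. */
    static List<String> command(Path htmlFile) {
        return List.of("w3m", "-dump", "-T", "text/html", htmlFile.toString());
    }

    /** Writes the HTML to a temp file, runs w3m on it, and returns w3m's output. */
    static String convert(String html) throws IOException, InterruptedException {
        Path tmp = Files.createTempFile("w3m-input", ".html");
        try {
            Files.write(tmp, html.getBytes(StandardCharsets.UTF_8));
            Process p = new ProcessBuilder(command(tmp))
                .redirectErrorStream(true) // merge stderr into stdout
                .start();
            // Read everything before waiting so the process cannot block on a full pipe.
            String text = new String(p.getInputStream().readAllBytes(), StandardCharsets.UTF_8);
            if (p.waitFor() != 0) {
                throw new IOException("w3m failed: " + text);
            }
            return text;
        } finally {
            Files.deleteIfExists(tmp);
        }
    }
}
```

Going through a temp file avoids having to feed stdin and drain stdout at the same time. The cost is one process start per conversion and a dependency on a native binary being present on the server.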
 
 
 
 
 
Andy Dingley
 
      11-13-2006

Marcel Kessler wrote:

> Does anyone know a good way of converting HTML to plain text, keeping as
> much of the formatting as possible?


Of course not. "Plain text" doesn't have formatting. If you want to
"keep some formatting", then you first have to know just how much is
preservable. Some people claim "RTF" is "plain text" because it's
editable with a text editor rather than in binary -- how much are you
expecting to preserve?

Converting all HTML block elements to a marker, stripping out
everything except text and markers, normalizing whitespace and markers
and then converting markers to something local is usually a good start.
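
The four steps just described could be sketched in Java roughly as follows. This is a regex-based illustration, not a robust HTML parser: the class name, the list of block tags, and the choice of a private-use character as the marker are all assumptions, and entities such as `&amp;` are left undecoded.

```java
import java.util.regex.Pattern;

/** Sketch of the marker approach; names and tag list are hypothetical. */
final class MarkerHtmlToText {
    // Private-use character used as the block marker (assumed absent from the input).
    private static final String MARK = "\uE000";

    // Assumed set of block-level tags; extend as needed.
    private static final Pattern BLOCK_TAG = Pattern.compile(
        "(?i)</?(p|div|br|li|ul|ol|h[1-6]|tr|table|blockquote|pre)\\b[^>]*>");
    private static final Pattern ANY_TAG = Pattern.compile("<[^>]*>");

    static String convert(String html) {
        // 1. Replace every block-level tag with the marker.
        String s = BLOCK_TAG.matcher(html).replaceAll(MARK);
        // 2. Strip all remaining tags, leaving only text and markers.
        s = ANY_TAG.matcher(s).replaceAll("");
        // 3. Normalize whitespace, then collapse runs of markers into one.
        s = s.replaceAll("\\s+", " ");
        s = s.replaceAll(" ?(" + MARK + " ?)+", MARK);
        // 4. Drop a leading or trailing marker and turn the rest into
        //    the platform's line separator ("something local").
        s = s.replaceAll("^" + MARK + "|" + MARK + "$", "");
        return s.replace(MARK, System.lineSeparator()).trim();
    }
}
```

For example, `MarkerHtmlToText.convert("<p>Hello <b>world</b></p><ul><li>one</li><li>two</li></ul>")` yields three lines: `Hello world`, `one`, and `two`.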

If you're already in a web context, then a DOM walker that returns the
set of text nodes might be easier.
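
Outside a browser, a similar walker can be written against the JDK's built-in DOM API. The sketch below assumes the input is already well-formed XML (XHTML); ordinary tag-soup HTML, or entities like `&nbsp;` without a DTD, will make the parser throw. The class and method names are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

/** Hypothetical DOM walker that collects the non-blank text nodes in document order. */
final class TextNodeWalker {
    static List<String> textNodes(String xhtml) {
        try {
            Node root = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)))
                .getDocumentElement();
            List<String> out = new ArrayList<>();
            collect(root, out);
            return out;
        } catch (Exception e) {
            // Assumption: callers treat malformed input as a programming error.
            throw new IllegalArgumentException("input is not well-formed XML", e);
        }
    }

    /** Depth-first walk; trims each text node and skips whitespace-only ones. */
    private static void collect(Node node, List<String> out) {
        if (node.getNodeType() == Node.TEXT_NODE) {
            String text = node.getNodeValue().trim();
            if (!text.isEmpty()) {
                out.add(text);
            }
        }
        for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
            collect(c, out);
        }
    }
}
```

This returns the text only; any layout (line breaks, bullets) would have to be added by looking at each text node's parent element.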

If the HTML is crap to begin with, pre-process it with Tidy.

 
 
 
 
 
Marcel Kessler
 
      11-14-2006
Andy Dingley wrote:
> Marcel Kessler wrote:
>
>> Does anyone know a good way of converting HTML to plain text, keeping as
>> much of the formatting as possible?

>
> Of course not. "Plain text" doesn't have formatting. If you want to
> "keep some formatting", then you first have to know just how much is
> preservable. Some people claim "RTF" is "plain text" because it's
> editable with a text editor rather than in binary -- how much are you
> expecting to preserve?


Thanks, Andy!
Obviously we can't keep e.g. a header in big letters, but one thing we
do need is sensible list formatting. For a <li> tag, we don't want

* Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Quisque
nec est eu nunc rutrum aliquet. In hac habitasse platea dictumst. Ut
aliquet risus ac velit eleifend scelerisque.

but rather

* Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Quisque
  nec est eu nunc rutrum aliquet. In hac habitasse platea dictumst. Ut
  aliquet risus ac velit eleifend scelerisque.

i.e. something that keeps the indentation...
If there is some Java library out there that does this kind of thing,
that would be great... the HTML itself should already be quite nice.
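
The list-item layout asked for here is a word wrap with a hanging indent. A minimal sketch, assuming a fixed line width and a plain-text bullet such as "* "; the class and method names are hypothetical, and `String.repeat` needs Java 11 or later:

```java
/** Hypothetical helper: word-wraps one list item with a hanging indent. */
final class HangingIndent {
    /**
     * The first line starts with the bullet; continuation lines are padded
     * with spaces to the bullet's width so the text stays aligned.
     * A single word longer than the width is not split.
     */
    static String wrapListItem(String text, String bullet, int width) {
        String pad = " ".repeat(bullet.length());
        StringBuilder out = new StringBuilder(bullet);
        int col = bullet.length();   // current column on the line being built
        boolean lineStart = true;    // true until the line has its first word
        for (String word : text.trim().split("\\s+")) {
            // Break before the word if it would not fit on the current line.
            if (!lineStart && col + 1 + word.length() > width) {
                out.append('\n').append(pad);
                col = pad.length();
                lineStart = true;
            }
            if (!lineStart) {
                out.append(' ');
                col++;
            }
            out.append(word);
            col += word.length();
            lineStart = false;
        }
        return out.toString();
    }
}
```

With a width of 9, `HangingIndent.wrapListItem("aaa bbb ccc ddd", "* ", 9)` produces `* aaa bbb` on the first line and `  ccc ddd` on the second.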
 
 
Karl Uppiano
 
      11-14-2006

Marcel Kessler wrote:
> Andy Dingley wrote:
>> Marcel Kessler wrote:
>>
>>> Does anyone know a good way of converting HTML to plain text, keeping as
>>> much of the formatting as possible?

>>
>> Of course not. "Plain text" doesn't have formatting. If you want to
>> "keep some formatting", then you first have to know just how much is
>> preservable. Some people claim "RTF" is "plain text" because it's
>> editable with a text editor rather than in binary -- how much are you
>> expecting to preserve?

>
> Thanks, Andy!
> Obviously we can't keep e.g. a header in big letters, but one thing we
> do need is sensible list formatting. For a <li> tag, we don't want
>
> * Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Quisque nec
> est eu nunc rutrum aliquet. In hac habitasse platea dictumst. Ut aliquet
> risus ac velit eleifend scelerisque.
>
> but rather
>
> * Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Quisque nec
>   est eu nunc rutrum aliquet. In hac habitasse platea dictumst. Ut
>   aliquet risus ac velit eleifend scelerisque.
>
> i.e. something that keeps the indentation...
> If there is some Java library out there that does this kind of thing, that
> would be great... the HTML itself should already be quite nice.


It sounds like you want an HTML parser with pluggable, customizable
handlers. A SAX parser comes pretty close. If you could first convert the
HTML to well-formed XHTML (with matching open and close tags, for example),
you might be able to get a non-validating SAX parser to work. Just a
thought. My guess is that it would take a fair bit of work to implement.
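
As a rough illustration of the SAX idea, the sketch below uses the JDK's default (non-validating) SAX parser with a handler that emits a bullet for each `<li>` and a line break around `<p>` and `<br>`. It assumes the input is already well-formed XHTML; the class name and the handful of tags handled are hypothetical choices, and a real converter would need many more cases.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical SAX handler that turns a few XHTML tags into plain-text layout. */
final class SaxTextHandler extends DefaultHandler {
    private final StringBuilder out = new StringBuilder();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (qName.equalsIgnoreCase("li")) {
            newLine();
            out.append("* "); // bullet for each list item
        } else if (qName.equalsIgnoreCase("p") || qName.equalsIgnoreCase("br")) {
            newLine();
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (qName.equalsIgnoreCase("p") || qName.equalsIgnoreCase("li")) {
            newLine();
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Collapse whitespace runs within this chunk, as a browser would.
        out.append(new String(ch, start, length).replaceAll("\\s+", " "));
    }

    /** Starts a new line unless the output is empty or already at a line start. */
    private void newLine() {
        if (out.length() > 0 && out.charAt(out.length() - 1) != '\n') {
            out.append('\n');
        }
    }

    static String convert(String xhtml) {
        try {
            SaxTextHandler handler = new SaxTextHandler();
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), handler);
            return handler.out.toString().trim();
        } catch (Exception e) {
            throw new IllegalArgumentException("input is not well-formed XML", e);
        }
    }
}
```

Each tag's behaviour lives in one branch of the handler, which is the "pluggable" part; nested lists, tables, and line wrapping are where the fair bit of work would go.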


 