Velocity Reviews

cawoodm@gmail.com 03-09-2006 03:51 PM

Convert HTML to Text
 
I have written a simple RegEx which strips all tags from an HTML file
and replaces them with spaces.

This was fine until I noticed that some tags should not be replaced
with spaces. For example in the HTML:
<b>H</b>ello World
My program will generate "H ello World" effectively breaking a word
apart.
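
For illustration, the behaviour described above can be reproduced in a few lines. This is only a sketch in Python (the original program is not shown in the thread, so the pattern below is an assumption about what "strips all tags" looks like):

```python
import re

# Assumed pattern: anything between "<" and ">" is treated as a tag.
TAG = re.compile(r"<[^>]*>")

def strip_tags_naive(html):
    """Replace every tag with a single space, whatever kind of tag it is."""
    return TAG.sub(" ", html)

print(strip_tags_naive("<b>H</b>ello World"))  # prints " H ello World"
```

Both the opening and the closing `<b>` become spaces, which is what splits the word.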

Where could I get an "authoritative" list of tags which should result
in a space and which shouldn't? I presume these are mostly block
elements like div, br, hr, table etc...


Dylan Parry 03-09-2006 03:59 PM

Re: Convert HTML to Text
 
Pondering the eternal question of "Hobnobs or Rich Tea?",
cawoodm@gmail.com finally proclaimed:

> Where could I get an "authoritative" list of tags which should result
> in a space and which shouldn't? I presume these are mostly block
> elements like div, br, hr, table etc...


You probably won't find a list that tells you the exact information you
are after, but the HTML DTDs available from W3C[1] will show you which
elements are block level and which are inline. From that you could
assume that the block elements result in a space, and the inline should
not.

____
[1] http://www.w3.org/TR/html4/sgml/dtd.html
--
Dylan Parry
http://webpageworkshop.co.uk -- FREE Web tutorials and references

mbstevens 03-09-2006 04:20 PM

Re: Convert HTML to Text
 
cawoodm@gmail.com wrote:
> I have written a simple RegEx which strips all tags from an HTML file
> and replaces them with spaces.
>
> This was fine until I noticed that some tags should not be replaced
> with spaces. For example in the HTML:
> <b>H</b>ello World
> My program will generate "H ello World" effectively breaking a word
> apart.
>
> Where could I get an "authoritative" list of tags which should result
> in a space and which shouldn't? I presume these are mostly block
> elements like div, br, hr, table etc...
>


I don't have a specific answer to your last paragraph, but:

Have a look at Perl's HTML::Parser and related modules.

In Python, sgmllib will be useful.

Using simple regexes to parse HTML
is more error-prone than using libraries that have been
exercised by many users. Of course, you might have a good reason
to re-invent the wheel for another language, but even then
a look at the source of these modules might be helpful.
--
mbstevens
http://www.mbstevens.com/

Toby Inkster 03-09-2006 07:40 PM

Re: Convert HTML to Text
 
Dylan Parry wrote:

> You probably won't find a list that tells you the exact information you
> are after, but the HTML DTDs available from W3C[1] will show you which
> elements are block level and which are inline. From that you could
> assume that the block elements result in a space, and the inline should
> not.


In fact, you could assume that the block elements should begin and end
with a line break. You could also add a tab between <td> and <th> elements
in a table, add asterisks for unordered lists, add numbers for ordered
lists and so on.
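
A rough sketch of those formatting rules, written in Python with the standard `html.parser` module rather than HTML::Parser. The `BLOCK` set is an assumed, incomplete subset of the block-level elements, and the names `TextFormatter` and `to_text` are made up for this example:

```python
from html.parser import HTMLParser

# Assumed, incomplete subset of block-level elements; the HTML 4 DTD has the full set.
BLOCK = {"p", "div", "h1", "h2", "h3", "ul", "ol", "li", "table", "tr", "hr"}

class TextFormatter(HTMLParser):
    def __init__(self):
        super().__init__()
        self.out = []
        self.lists = []  # one entry per open list: None for <ul>, a counter for <ol>

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK or tag == "br":
            self.out.append("\n")          # block elements begin on a new line
        if tag in ("ul", "ol"):
            self.lists.append(0 if tag == "ol" else None)
        elif tag == "li" and self.lists:
            if self.lists[-1] is None:
                self.out.append("* ")      # asterisk for unordered lists
            else:
                self.lists[-1] += 1
                self.out.append(f"{self.lists[-1]}. ")  # number for ordered lists
        elif tag in ("td", "th"):
            self.out.append("\t")          # a tab in front of every table cell

    def handle_endtag(self, tag):
        if tag in ("ul", "ol") and self.lists:
            self.lists.pop()
        if tag in BLOCK:
            self.out.append("\n")          # block elements end with a line break

    def handle_data(self, data):
        self.out.append(data)

def to_text(html):
    parser = TextFormatter()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)
```

Inline tags such as `<b>` fall through all the branches and add nothing, so `to_text("<b>H</b>ello")` keeps the word in one piece.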

I'll echo Mr Stevens' recommendation to use HTML::Parser for parsing
though -- it will give far better results than a reg exp. For example, a
reg exp won't tell you to add a line break after the word "bar" here,
because the closing tag for a paragraph is optional:

<body>
<p>Foo bar.
</body>
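
One way a parser-based approach might cope with the example above, sketched in Python with `html.parser`. That module reports tags as it sees them and does not infer the missing `</p>` by itself, so the sketch tracks the open paragraph. The `CLOSES_P` set is an assumed subset of the tags that implicitly end a paragraph, and the class and function names are invented for the example:

```python
import re
from html.parser import HTMLParser

class ParagraphBreaks(HTMLParser):
    """Emit a line break at the end of each <p>, even when </p> is omitted."""

    # Assumed subset of tags whose start or end implicitly closes an open <p>.
    CLOSES_P = {"p", "div", "ul", "ol", "table", "body", "html"}

    def __init__(self):
        super().__init__()
        self.out = []
        self.in_p = False

    def _end_paragraph(self):
        if self.in_p:
            self.out.append("\n")
            self.in_p = False

    def handle_starttag(self, tag, attrs):
        if tag in self.CLOSES_P:
            self._end_paragraph()
        if tag == "p":
            self.in_p = True

    def handle_endtag(self, tag):
        if tag in self.CLOSES_P:
            self._end_paragraph()

    def handle_data(self, data):
        # Collapse runs of whitespace (including source line breaks) to one space.
        self.out.append(re.sub(r"\s+", " ", data))

    def close(self):
        super().close()
        self._end_paragraph()  # end of input also ends an open paragraph

def paragraphs_to_text(html):
    parser = ParagraphBreaks()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)
```

Fed the three-line example, the line break is emitted when `</body>` arrives, which is information a tag-by-tag regular expression does not have.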

--
Toby A Inkster BSc (Hons) ARCS
Contact Me ~ http://tobyinkster.co.uk/contact


Jim Higson 03-10-2006 02:48 PM

Re: Convert HTML to Text
 
cawoodm@gmail.com wrote:

> I have written a simple RegEx which strips all tags from an HTML file
> and replaces them with spaces.
>
> This was fine until I noticed that some tags should not be replaced
> with spaces. For example in the HTML:
> <b>H</b>ello World
> My program will generate "H ello World" effectively breaking a word
> apart.
>
> Where could I get an "authoritative" list of tags which should result
> in a space and which shouldn't? I presume these are mostly block
> elements like div, br, hr, table etc...


How about using this?

http://www.mbayer.de/html2text/

--
Jim

cawoodm@gmail.com 03-14-2006 11:29 AM

Re: Convert HTML to Text
 
Thank you all for the helpful feedback.
It is true that RegEx is a bit of a dark art, but I am writing a crawler
in VB.NET and not Perl or Python.
I am not sure if the .NET framework supports HTML parsing in the way I
want, so I've been applying RegEx.
Basically I want to strip all tags and then remove excess whitespace so
that I have "pure" text.
My current strategy is to replace inline tags with an empty string and
then replace all other tags with a space:
HTML = RegEx.Replace(HTML, "</?(b|i|u|strong|etc)\b[^>]*>", "")
HTML = RegEx.Replace(HTML, "</?[^>]*>", " ")
Then I remove excess whitespace:
HTMLText = RegEx.Replace(HTML, "\s+", " ")
It's the authoritative list (b|i|u|strong|...) that I'm looking for, so
I'll take a look at the recommended DTD.
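
The same three-step strategy, sketched in Python for readers who want to try it outside .NET. The short `INLINE` tuple is only a placeholder for the full list, and the `\b` and `[^>]*` in the first pattern are additions for this sketch: they keep `<body>` from being read as `<b>` and let inline tags carry attributes.

```python
import re

# Placeholder list of inline elements; the complete list is not assumed here.
INLINE = ("b", "i", "u", "strong", "em", "span")

INLINE_TAG = re.compile(r"</?(?:%s)\b[^>]*>" % "|".join(INLINE), re.IGNORECASE)
ANY_TAG = re.compile(r"</?[^>]*>")
WHITESPACE = re.compile(r"\s+")

def html_to_text(html):
    text = INLINE_TAG.sub("", html)   # step 1: inline tags vanish
    text = ANY_TAG.sub(" ", text)     # step 2: every other tag becomes a space
    return WHITESPACE.sub(" ", text).strip()  # step 3: collapse whitespace

print(html_to_text("<p><b>H</b>ello <i class='x'>World</i></p><div>Next</div>"))
# prints "Hello World Next"
```

The trailing `.strip()` is an extra not present in the post; drop it if leading and trailing spaces matter.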
Cheers
Jack


cawoodm@gmail.com 03-14-2006 11:33 AM

Re: Convert HTML to Text
 
Aha!
http://www.htmlhelp.com/reference/html40/inline.html
-------------------------
A
ABBR
ACRONYM
B
BASEFONT
BDO
BIG
BR
CITE
CODE
DFN
EM
FONT
I
IMG
INPUT
KBD
LABEL
Q
S
SAMP
SELECT
SMALL
SPAN
STRIKE
STRONG
SUB
SUP
TEXTAREA
TT
U
VAR
-------------------------
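
A sketch of how the list above might be turned into the "delete these tags" pattern, again in Python. Leaving BR out of the pattern is an assumption of this example: it is inline according to the list, but it usually separates words, so deleting it would glue them together. Other elements (IMG, for instance) may deserve the same treatment. The function name is hypothetical.

```python
import re

# The inline elements listed above, lower-cased.
INLINE_ELEMENTS = """a abbr acronym b basefont bdo big br cite code dfn em font i img
input kbd label q s samp select small span strike strong sub sup textarea tt u var""".split()

def build_inline_pattern(names, keep_as_space=("br",)):
    """Compile a regex matching opening and closing tags of the given elements.

    Elements in keep_as_space are excluded so that a later, more general
    pass can turn them into spaces instead of deleting them.
    """
    deletable = [n for n in names if n not in keep_as_space]
    alternation = "|".join(re.escape(n) for n in deletable)
    # \b ensures that "b" does not match the start of "body" or "br".
    return re.compile(r"</?(?:%s)\b[^>]*>" % alternation, re.IGNORECASE)

INLINE_TAG = build_inline_pattern(INLINE_ELEMENTS)
print(INLINE_TAG.sub("", "<B>H</B>ello <a href='#'>World</a>"))  # prints "Hello World"
```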


Jim Higson 03-14-2006 11:52 AM

Re: Convert HTML to Text
 
cawoodm@gmail.com wrote:

> Thank you all for the helpful feedback.
> It is true that RegEx is a bit of a dark art, but I am writing a crawler
> in VB.NET and not Perl or Python.
> I am not sure if the .NET framework supports HTML parsing in the way I
> want, so I've been applying RegEx.
> Basically I want to strip all tags and then remove excess whitespace so
> that I have "pure" text.
> My current strategy is to replace inline tags with an empty string and
> then replace all other tags with a space:
> HTML = RegEx.Replace(HTML, "</?(b|i|u|strong|etc)\b[^>]*>", "")
> HTML = RegEx.Replace(HTML, "</?[^>]*>", " ")
> Then I remove excess whitespace:
> HTMLText = RegEx.Replace(HTML, "\s+", " ")
> It's the authoritative list (b|i|u|strong|...) that I'm looking for, so
> I'll take a look at the recommended DTD.
> Cheers
> Jack


The program I recommended (http://www.mbayer.de/html2text/) is a simple
command-line app. You should be able to call it from just about any
language with one line of code. I don't know how you call commands in .NET,
but it shouldn't be difficult.
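
As an illustration of the idea in Python (not .NET), calling an external converter could look roughly like this. The sketch assumes the program is installed on the PATH under the name `html2text`, takes the input file as its last argument and writes the converted text to standard output; check the tool's own documentation before relying on any of that. The function name is hypothetical.

```python
import subprocess

def html_file_to_text(path, command=("html2text",)):
    """Run an external converter on `path` and return whatever it prints.

    Assumptions: the command is on PATH, takes the input file as its last
    argument and writes the plain text to stdout.
    """
    completed = subprocess.run(
        [*command, path],
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit status
    )
    return completed.stdout
```

The `command` parameter is there so that extra options, or a different converter, can be passed in without changing the function.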

--
Jim

Neredbojias 03-14-2006 05:30 PM

Re: Convert HTML to Text
 
With neither quill nor qualm, cawoodm@gmail.com quothed:

> Aha!
> http://www.htmlhelp.com/reference/html40/inline.html
> -------------------------
> A
> ABBR
> ACRONYM
> B
> BASEFONT
> BDO
> BIG
> BR
> CITE
> CODE
> DFN
> EM
> FONT
> I
> IMG
> INPUT
> KBD
> LABEL
> Q
> S
> SAMP
> SELECT
> SMALL
> SPAN
> STRIKE
> STRONG
> SUB
> SUP
> TEXTAREA
> TT
> U
> VAR
> -------------------------


What happened to DIV?

--
Neredbojias
Contrary to popular belief, it is believable.

Steve Pugh 03-14-2006 05:44 PM

Re: Convert HTML to Text
 
Neredbojias <invalid@neredbojias.com> wrote:
>With neither quill nor qualm, cawoodm@gmail.com quothed:
>
>> Aha!
>> http://www.htmlhelp.com/reference/html40/inline.html
>> -------------------------


[snip list]

>What happened to DIV?


Not an inline element, is it?

Steve
--
"My theories appal you, my heresies outrage you,
I never answer letters and you don't like my tie." - The Doctor

Steve Pugh <steve@pugh.net> <http://steve.pugh.net/>

