FAQ 9.4 How do I remove HTML from a string?

PerlFAQ Server
04-10-2011
This is an excerpt from the latest version of perlfaq9.pod, which
comes with the standard Perl distribution. These postings aim to
reduce the number of repeated questions as well as allow the community
to review and update the answers. The latest version of the complete
perlfaq is at http://faq.perl.org .

--------------------------------------------------------------------

9.4: How do I remove HTML from a string?

The most correct way (albeit not the fastest) is to use "HTML::Parser"
from CPAN. Another mostly correct way is to use "HTML::FormatText" which
not only removes HTML but also attempts to do a little simple formatting
of the resulting plain text.
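
As a rough illustration of the "HTML::Parser" route, here is a minimal
sketch. It assumes the module is installed from CPAN; the helper name
strip_html and the choice to skip script and style elements are made up
for this example and are not part of the FAQ answer.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::Parser ();

# strip_html() is a hypothetical helper name, not part of HTML::Parser.
sub strip_html {
    my ($html) = @_;
    my $text = '';

    my $parser = HTML::Parser->new(
        api_version => 3,
        # 'dtext' hands the callback text with entities already decoded.
        text_h => [ sub { $text .= shift }, 'dtext' ],
    );

    # Assumption: script and style contents are not wanted in the output.
    $parser->ignore_elements(qw(script style));

    $parser->parse($html);
    $parser->eof;    # signal end of input so buffered text is flushed
    return $text;
}

print strip_html(qq{<p>A &lt; B <img src="foo.gif" alt="A > B"> done</p>\n});
```

Because the parser tokenizes the markup instead of pattern-matching it,
the quoted ">" inside the alt attribute should not end the tag early,
and "&lt;" comes back as a literal "<".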

Many folks attempt a simple-minded regular expression approach, like
"s/<.*?>//g", but that fails in many cases because the tags may continue
over line breaks, they may contain quoted angle-brackets, or HTML
comments may be present. Plus, folks forget to convert entities, such as
"&lt;".
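
To make those failures concrete, here is a small sketch using only core
Perl. The helper name naive_strip is made up for this illustration; it
just wraps the substitution quoted above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# naive_strip() is a hypothetical helper wrapping the naive regex.
sub naive_strip {
    my ($html) = @_;
    $html =~ s/<.*?>//g;
    return $html;
}

# A quoted ">" ends the match early and leaves part of the tag behind.
print naive_strip(q{<IMG SRC = "foo.gif" ALT = "A > B">}), "\n";

# Without /s, "." does not match a newline, so a tag that spans
# two lines is never matched and stays in the output untouched.
print naive_strip(qq{<IMG SRC = "foo.gif"\nALT = "A">}), "\n";

# Tags are removed, but the entity is left undecoded.
print naive_strip(q{<b>1 &lt; 2</b>}), "\n";
```

The first call leaves the fragment ` B">` behind, the second returns the
tag unchanged, and the third returns `1 &lt; 2` rather than `1 < 2`.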

Here's one "simple-minded" approach that works for most files:

#!/usr/bin/perl -p0777
s/<(?:[^>'"]*|(['"]).*?\g1)*>//gs

If you want a more complete solution, see the 3-stage striphtml program
in http://www.cpan.org/authors/Tom_Chri...s/striphtml.gz .

Here are some tricky cases that you should think about when picking a
solution:

<IMG SRC = "foo.gif" ALT = "A > B">

<IMG SRC = "foo.gif"
ALT = "A > B">

<!-- <A comment> -->

<script>if (a<b && a>c)</script>

<# Just data #>

<![INCLUDE CDATA [ >>>>>>>>>>>> ]]>

If HTML comments include other tags, those solutions would also break on
text like this:

<!-- This section commented out.
<B>You can't see me!</B>
-->
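
The sketch below shows what that breakage looks like with the
quote-aware regex from earlier, and one possible workaround: removing
comments in a separate pass first. Both helper names are hypothetical,
and the comment pass is an assumption that holds only for simple input;
it does nothing for the script or CDATA cases listed above. The "\g1"
backreference needs Perl 5.10 or later.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Both helper names are hypothetical, used only for this illustration.

# The quote-aware regex shown earlier, applied on its own.
sub strip_tags {
    my ($html) = @_;
    $html =~ s/<(?:[^>'"]*|(['"]).*?\g1)*>//gs;
    return $html;
}

# Assumption: deleting comments first is good enough for the input at hand.
sub strip_comments_then_tags {
    my ($html) = @_;
    $html =~ s/<!--.*?-->//gs;
    return strip_tags($html);
}

my $html = "<!-- This section commented out.\n<B>You can't see me!</B>\n-->";

print "tags only:      [", strip_tags($html), "]\n";
print "comments first: [", strip_comments_then_tags($html), "]\n";
```

With tags only, the match that starts at "<!--" ends at the ">" of
"<B>", so the commented-out text "You can't see me!" and a stray "-->"
end up in the output. With the comment pass first, the result is empty.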



--------------------------------------------------------------------

The perlfaq-workers, a group of volunteers, maintain the perlfaq. They
are not necessarily experts in every domain where Perl might show up,
so please include as much information as possible and relevant in any
corrections. The perlfaq-workers also don't have access to every
operating system or platform, so please include relevant details for
corrections to examples that do not work on particular platforms.
Working code is greatly appreciated.

If you'd like to help maintain the perlfaq, see the details in
perlfaq.pod.
 
