regex for stripping HTML

 
 
Michael Vilain
      10-28-2003
Originally, I was using

$value =~ s/<.*>//g;

to strip HTML tags from a variable. It actually stripped everything
from the first "<" to the last ">" after the ending tag. I found this
regex in this group:

$value =~ s/\<[^\<]+\>//g;

and I'm trying to parse it out and figure out why it works. First off,
some questions:

- why escape the "<"? It's not one of the meta characters that has
special meaning in a regex.

- what's the difference between using ".*" to match any string and "+"
to match a repeat of the character class "[^\<]".

Just trying to deepen my understanding of regex. It's like whitewash --
it gets more opaque with multiple coats.

TIA,

/MeV/

--
DeeDee, don't press that button! DeeDee! NO! Dee...



 
Gunnar Hjalmarsson
      10-28-2003
[not sent to the defunct newsgroup comp.lang.perl]

"Michael Vilain " wrote:
> I found this regex in this group:
>
> $value =~ s/\<[^\<]+\>//g;


Then you had some bad luck.

This makes sense under certain conditions:

$value =~ s/<[^>]*>//g;

But normally you are recommended to use a module instead for parsing
HTML markup.
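
To make the "bad luck" concrete, here is a minimal sketch comparing the regex found in the group with the [^>]* version. The sample string and variable names are made up for illustration; they are not from the thread.

```perl
use strict;
use warnings;

# Hypothetical sample: the text between the tags contains a literal ">".
my $html = '<b>1 > 0</b>';

# Regex found in the group: [^<]+ may run across ">" characters,
# so the first match is "<b>1 >" and part of the text is lost.
(my $found = $html) =~ s/\<[^\<]+\>//g;

# Stopping at the first ">" removes only the tags themselves.
(my $better = $html) =~ s/<[^>]*>//g;

print "found:  '$found'\n";    # ' 0'
print "better: '$better'\n";   # '1 > 0'
```

The second form is still only safe "under certain conditions": it assumes no ">" ever appears inside a tag, for example in an attribute value.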

> - why escape the "<"? It's not one of the meta characters that has
> special meaning in a regex.


You are correct.

> - what's the difference between using ".*" to match any string and
> "+" to match a repeat of the character class "[^\<]".


Please study the Perl documentation for regular expressions, for instance:

http://www.perldoc.com/perl5.8.0/pod/perlretut.html

--
Gunnar Hjalmarsson
Email: http://www.gunnar.cc/cgi-bin/contact.pl

 
Matija Papec
      10-28-2003

"Michael Vilain <(E-Mail Removed)>" wrote:
>Originally, I was using
>
> $value =~ s/<.*>//g;
>
>to strip HTML tags from a variable. It actually stripped everything
>from the first "<" to the last ">" after the ending tag. I found this
>regex in this group:
>
> $value =~ s/\<[^\<]+\>//g;
>
>and I'm trying to parse it out and figure out why it works. First off,
>some questions:
>
>- why escape the "<"? It's not one of the meta characters that has
>special meaning in a regex.
>
>- what's the difference between using ".*" to match any string and "+"
>to match a repeat of the character class "[^\<]".


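/<.*>/g matches everything between the first "<" and the last ">", because
"*" is greedy. Put a "?" after the "*" (/<.*?>/g) to make it non-greedy.

/<[^<]+>/g matches a "<", then one or more characters other than "<", then
a ">". Since [^<] also matches ">", the match runs to the last ">" before
the next "<".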
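
A short sketch of the greedy versus non-greedy behaviour described above. The input string is an assumption chosen so that it starts and ends with a tag.

```perl
use strict;
use warnings;

# Hypothetical one-line input with two pairs of tags.
my $html = '<b>bold</b> and <i>italic</i>';

(my $greedy = $html) =~ s/<.*>//g;    # .* grabs as much as possible
(my $lazy   = $html) =~ s/<.*?>//g;   # .*? stops at the first ">" that works

print "greedy: '$greedy'\n";   # ''
print "lazy:   '$lazy'\n";     # 'bold and italic'
```

Note that "." does not match a newline unless the /s modifier is given, so neither form handles a tag that is split over several lines.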



--
Matija
 
Ben Morrow
      10-28-2003
Gunnar Hjalmarsson <(E-Mail Removed)> wrote:
> But normally you are recommended to use a module instead for parsing
> HTML markup.


Or, say, read perldoc -q HTML .

Ben

--
"The Earth is degenerating these days. Bribery and corruption abound.
Children no longer mind their parents, every man wants to write a book,
and it is evident that the end of the world is fast approaching."
-Assyrian stone tablet, c.2800 BC
 
Koncept
      10-28-2003
In article <(E-Mail Removed)>,
Michael Vilain <(E-Mail Removed)> wrote:

> Originally, I was using
>
> $value =~ s/<.*>//g;
>
> to strip HTML tags from a variable. It actually stripped everything
> from the first "<" to the last ">" after the ending tag. I found this
> regex in this group:
>
> $value =~ s/\<[^\<]+\>//g;
>
> and I'm trying to parse it out and figure out why it works. First off,
> some questions:
>
> - why escape the "<"? It's not one of the meta characters that has
> special meaning in a regex.
>
> - what's the difference between using ".*" to match any string and "+"
> to match a repeat of the character class "[^\<]".
>
> Just trying to deepen my understanding of regex. It's like whitewash --
> it gets more opaque with multiple coats.
>
> TIA,
>
> /MeV/


Hello. This is what the FAQ says when queried from the terminal:

$ perldoc -q html

Quote:
Here's one "simple-minded" approach, that works for most files:

#!/usr/bin/perl -p0777
s/<(?:[^>'"]*|(['"]).*?\1)*>//gs

If you want a more complete solution, see the 3-stage striphtml
program in http://www.cpan.org/authors/Tom_Christiansen/scripts/striphtml.gz .
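
For readers trying to parse the FAQ one-liner: -0777 makes perl read the whole file as a single string and -p applies the substitution and prints the result. Below is a sketch of the same pattern spread out with /x so that each part can be labelled. The $tag name and the sample input are assumptions for illustration, not part of the FAQ.

```perl
use strict;
use warnings;

# Same pattern as in the FAQ, written with /x to allow comments.
my $tag = qr{
    <                      # start of a tag
    (?:
        [^>'"]*            # a run of characters that are neither ">" nor a quote
      |
        (['"]) .*? \1      # or a quoted string, closed by the same kind of quote
    )*
    >                      # end of the tag
}xs;

# Hypothetical input: a ">" inside a quoted attribute, and a tag split
# over two lines.
my $html = qq{<a title="a > b"\n   href='x'>link</a>};

(my $text = $html) =~ s/$tag//g;
print "$text\n";   # link
```

The quoted-string branch is what lets a ">" inside an attribute value pass without ending the tag early.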
--
Koncept <<
"Contrary to popular belief, the most dangerous animal is not the lion or
tiger or even the elephant. The most dangerous animal is a shark riding
on an elephant, just trampling and eating everything they see." - Jack Handey
 
Eric J. Roode
      10-29-2003

"Michael Vilain <(E-Mail Removed)>" wrote in news:vilain-
(E-Mail Removed):

> Originally, I was using
>
> $value =~ s/<.*>//g;
>
> to strip HTML tags from a variable. It actually stripped everything
> from the first "<" to the last ">" after the ending tag. I found this
> regex in this group:
>
> $value =~ s/\<[^\<]+\>//g;
>
> and I'm trying to parse it out and figure out why it works. First off,
> some questions:
>
> - why escape the "<"? It's not one of the meta characters that has
> special meaning in a regex.
>
> - what's the difference between using ".*" to match any string and "+"
> to match a repeat of the character class "[^\<]".
>
> Just trying to deepen my understanding of regex. It's like whitewash --
> it gets more opaque with multiple coats.


Nah, it's not that hard. There's a learning curve, sure, but you'll get
to the top of it in time.

First, you are correct about the "<" -- no need to escape it; whoever did
it wasn't thinking.

Second, it helps to translate the regex sub-expressions into English
(assuming English is your native tongue):

<.*> means: Match a less-than character, followed by as many
characters as possible, followed by a greater-than character.

<[^>]+> means: Match a less-than character, followed by as many non-
greater-than characters as possible, followed by a greater-than
character.

See the difference? . matches ANY character; [^>] matches only non-">"
characters.
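
One way to see the difference is to list what each pattern actually matches instead of deleting it. This is a minimal sketch; the sample string is invented for the purpose.

```perl
use strict;
use warnings;

# Hypothetical input with one pair of tags.
my $html = 'a <b>c</b> d';

# In list context, m//g without capture groups returns every match.
my @greedy = $html =~ /<.*>/g;      # any characters, as many as possible
my @tags   = $html =~ /<[^>]+>/g;   # only non-">" characters

print join('|', @greedy), "\n";   # <b>c</b>
print join('|', @tags), "\n";     # <b>|</b>
```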


Note that it is not possible in general to process HTML via regular
expressions (at least, not simple regexes). Consider the following
snippet of valid HTML:

<img src="foo.jpg" alt='<<<"cool!">>>' />
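
A quick sketch of what happens when the simple substitution from earlier in the thread is applied to that snippet (the variable names are made up):

```perl
use strict;
use warnings;

# The snippet from above; q{} avoids having to escape the quotes.
my $html = q{<img src="foo.jpg" alt='<<<"cool!">>>' />};

# The match ends at the first ">", which sits inside the alt attribute.
(my $text = $html) =~ s/<[^>]*>//g;

print "$text\n";   # >>' />
```

The tail of the tag is left behind as if it were text.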

--
Eric
$_ = reverse sort $ /. r , qw p ekca lre uJ reh
ts p , map $ _. $ " , qw e p h tona e and print

 
DOV LEVENGLICK
      10-30-2003
you have to escape < because it can be used as a search delimiter

"Michael Vilain " wrote:

>Originally, I was using
>
> $value =~ s/<.*>//g;
>
>to strip HTML tags from a variable. It actually stripped everything
>from the first "<" to the last ">" after the ending tag. I found this
>regex in this group:
>
> $value =~ s/\<[^\<]+\>//g;
>
>and I'm trying to parse it out and figure out why it works. First off,
>some questions:
>
>- why escape the "<"? It's not one of the meta characters that has
>special meaning in a regex.
>
>- what's the difference between using ".*" to match any string and "+"
>to match a repeat of the character class "[^\<]".
>
>Just trying to deepen my understanding of regex. It's like whitewash --
>it gets more opaque with multiple coats.
>
>TIA,
>
>/MeV/
>
>
>


--
Regards,
Dov Levenglick



 
Anno Siegel
      10-30-2003
DOV LEVENGLICK <(E-Mail Removed)> wrote in comp.lang.perl.misc:
> "Michael Vilain " wrote:


[DOV's top-posting re-arranged]

> > $value =~ s/\<[^\<]+\>//g;
> >
> >and I'm trying to parse it out and figure out why it works. First off,
> >some questions:
> >
> >- why escape the "<"? It's not one of the meta characters that has
> >special meaning in a regex.

>
> you have to escape < because it can be used as a search delimiter


This is nonsense. What are you talking about? And don't top-post.
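
For the record, a small sketch showing that the backslashes make no difference when "/" is the delimiter, as it is in the regex under discussion. The sample string and variable names are assumptions for illustration.

```perl
use strict;
use warnings;

# Hypothetical input.
my $html = '<p>Hello</p>';

(my $escaped = $html) =~ s/\<[^\<]+\>//g;   # as posted
(my $plain   = $html) =~ s/<[^<]+>//g;      # backslashes removed

print "same result\n" if $escaped eq $plain;   # both are 'Hello'
```

As far as I can tell, escaping would only come into play if "<" and ">" were themselves chosen as the delimiters of the substitution, which is not the case here.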

Anno
 
Alan J. Flavell
      10-30-2003
On Thu, 30 Oct 2003, DOV LEVENGLICK ...

Bogosity alerts:

1:
Content-Type: multipart/alternative;
boundary="------------030500060107020504030609"

2: TOFU-posting

3: cross-posted without further comment to a dead newsgroup
comp.lang.perl

and need I mention the SHOUTED PERSONAL NAME?

> ... wrote:
>
> you have to escape < because it can be used as a search delimiter


Well, Q.E.D.

I suppose it's wasted effort to suggest you might get a grasp on your
material and the conventions of your chosen forum -before- stepping up
to the plate to offer answers to technical questions?

If you had been _asking_ a question, then such behaviour *might*
just be a tad[1] more excusable.

[1] No pun intended.
 