Regex, replacing THIS|THAT

 
 
Jason C
      12-17-2011
Before putting this into production, can you guys confirm whether the logic here is correct?

my $lt = chr(1);
my $gt = chr(2);

$text =~ s/<(\/{0,1})(div|span|table|tr|td|font|img)(.*?)>/$lt$1$2$3$gt/gsi;

What I'm not sure about is whether (div|span...) will work correctly, or whether it's going to read "di, followed by either v or s, followed by pa", and so on.

(FWIW, the next step in the process is to remove all other HTML code, so that only these tags are allowed. Then I go back and change $lt and $gt back to < and >. This concept works well, so my only real question is whether the regex will work as expected.)
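
For readers following along, the three steps described above could be sketched roughly as below. Only the first substitution comes from the post; the sub name keep_allowed_tags, the tag-stripping regex in step 2, and the sample input are assumptions made for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original post): keep only the
# whitelisted tags and drop all other markup.
sub keep_allowed_tags {
    my ($text) = @_;
    my $lt = chr(1);   # placeholder for '<'
    my $gt = chr(2);   # placeholder for '>'

    # Step 1 (from the post): protect the allowed tags by swapping
    # their angle brackets for the placeholder characters.
    $text =~ s/<(\/{0,1})(div|span|table|tr|td|font|img)(.*?)>/$lt$1$2$3$gt/gsi;

    # Step 2 (assumed): remove whatever tags are still left.
    $text =~ s/<[^>]*>//g;

    # Step 3 (from the post): turn the placeholders back into < and >.
    $text =~ s/$lt/</g;
    $text =~ s/$gt/>/g;

    return $text;
}

print keep_allowed_tags('<p>hi <span id="a">there</span></p>'), "\n";
```

Note that this only removes the tags themselves; text between removed tags (the body of a script element, for example) stays in the output.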
 
Jürgen Exner
      12-17-2011
Jason C wrote:
[misguided attempt at using REs to manage HTML snipped]
>(FWIW, the next step in the process is to remove all other HTML code, so that only these tags are allowed. Then, I go back and change $lt and $gt back to < and >. This concept works well, so my only real question is whether the regex will work as expected.)


No, it doesn't work at all. You are aware of 'perldoc -q "remove HTML"'?
The examples given there for why REs are not suitable to parse HTML
apply just as well for your limited scope of only 7 tags.

If you want to parse HTML then use a parser for HTML but don't dawdle
with home-brewed RE approaches. Those can't work, as has been discussed ad
nauseam before.
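
One concrete case of the kind that FAQ entry describes, as a small sketch: a '>' inside a quoted attribute value ends the non-greedy (.*?) early. The sample string and the [LT]/[GT] display markers are made up for illustration; the substitution is the one from the original post.

```perl
use strict;
use warnings;

my ($lt, $gt) = (chr(1), chr(2));

# Illustrative input: the alt attribute legitimately contains a '>'.
my $html = '<img alt="a > b" src="x.png">';

# The substitution from the original post, applied to a copy.
(my $marked = $html) =~ s/<(\/{0,1})(div|span|table|tr|td|font|img)(.*?)>/$lt$1$2$3$gt/gsi;

# Make the placeholder characters visible for printing.
(my $shown = $marked) =~ s/\x01/[LT]/g;
$shown =~ s/\x02/[GT]/g;

print "$shown\n";
# prints: [LT]img alt="a [GT] b" src="x.png">
```

The tag is closed at the '>' inside the attribute, so the rest of the tag is left behind as stray text with an unprotected '>'.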

jue
 
John W. Krahn
      12-17-2011
Jason C wrote:
> Before putting this into production, can you guys confirm if the logic here is correct?
>
> my $lt = chr(1);
> my $gt = chr(2);
>
> $text =~ s/<(\/{0,1})(div|span|table|tr|td|font|img)(.*?)>/$lt$1$2$3$gt/gsi;


You have nothing between $1 and $2 or between $2 and $3 so why not just
use one pair of capturing parentheses:

$text =~ s/<(\/?(?:div|span|table|tr|td|font|img).*?)>/$lt$1$gt/gsi;
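
A quick way to check that the two substitutions behave the same is to run both over a few samples and compare. The sub names and sample strings below are hypothetical; the two regexes are the ones quoted above.

```perl
use strict;
use warnings;

# Original form with three capture groups.
sub three_groups {
    my ($text) = @_;
    my ($lt, $gt) = (chr(1), chr(2));
    $text =~ s/<(\/{0,1})(div|span|table|tr|td|font|img)(.*?)>/$lt$1$2$3$gt/gsi;
    return $text;
}

# Simplified form with one capture group and a non-capturing alternation.
sub one_group {
    my ($text) = @_;
    my ($lt, $gt) = (chr(1), chr(2));
    $text =~ s/<(\/?(?:div|span|table|tr|td|font|img).*?)>/$lt$1$gt/gsi;
    return $text;
}

# Illustrative samples: mixed case, a non-listed tag, a newline inside a tag.
for my $sample ('<DIV id="a">x</div>', '<td>1</td><b>2</b>', "<font\ncolor='red'>hi</font>") {
    print three_groups($sample) eq one_group($sample) ? "same\n" : "different\n";
}
```

All three samples should print "same", since the replacement just concatenates the captures back together in order.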


> What I'm not sure about is if (div|span...) will work correctly,


Yes, that is how alternation works. Each alternative can be any valid
pattern, including strings.


> or if it's going to read "di, followed by either v or s,
> followed by pa", and so on.


No, that would not make sense.
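
A small sketch to confirm this, using made-up test tags: each alternative is tried as a whole string, so "dis" or "dpa" never match. The same test also shows a side effect that is easy to miss: nothing in the pattern requires the tag name to end after the alternative, so a tag such as <track> is accepted via the "tr" alternative.

```perl
use strict;
use warnings;

# Same pattern as in the original post, compiled once for matching.
my $re = qr/<(\/?)(div|span|table|tr|td|font|img)(.*?)>/si;

# Hypothetical test tags chosen for illustration.
for my $tag ('<div>', '<span>', '<dis>', '<dpa>', '<track>') {
    if ($tag =~ $re) {
        print "$tag matched, name captured as '$2'\n";
    } else {
        print "$tag did not match\n";
    }
}
# prints:
# <div> matched, name captured as 'div'
# <span> matched, name captured as 'span'
# <dis> did not match
# <dpa> did not match
# <track> matched, name captured as 'tr'
```

If that last match is unwanted, adding \b right after the alternation group is one possible way to require the name to end there.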



John
--
Any intelligent fool can make things bigger and
more complex... It takes a touch of genius -
and a lot of courage to move in the opposite
direction. -- Albert Einstein
 