Replace text inside html tags?

 
 
squash@peoriadesignweb.com
      01-30-2005
I want to be able to replace text inside html tags. I am using a regex to
extract the text, but after I modify the text how can I re-assemble
the html tag? Here is an example:

<font size=1> HI </font>

I need to replace HI with BYE and re-assemble html tag like below:

<font size=1> BYE </font>
I checked perldoc -q html but could not find the answer there.

Thx!

 
 
 
 
 
A. Sinan Unur
      01-30-2005
(E-Mail Removed) wrote in news:1107118901.149776.208370@z14g2000cwz.googlegroups.com:

> I want to be able to replace text inside html tags. I am using a regex to
> extract the text, but after I modify the text how can I re-assemble
> the html tag? Here is an example:
>
> <font size=1> HI </font>
>
> I need to replace HI with BYE and re-assemble html tag like below:
>
> <font size=1> BYE </font>
> I checked perldoc -q html but could not find the answer there.


The answer to your question is in the FAQ answer itself:

The most correct way (albeit not the fastest) is to use HTML::Parser
from CPAN.

....

Many folks attempt a simple-minded regular expression approach, like
"s/<.*?>//g", but that fails in many cases because the tags may
continue over line breaks, they may contain quoted angle-brackets,
or HTML comment may be present. Plus, folks forget to convert
entities--like "&lt;" for example.

That is, you need to use an HTML parser to parse HTML.

See CPAN for HTML parser modules.
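
To make the FAQ's warning concrete, here is a minimal sketch that runs the naive substitution on three small inputs. The helper name naive_strip and the sample strings are made up for illustration; only core Perl is assumed.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: the naive tag stripper the FAQ warns about.
sub naive_strip {
    my ($html) = @_;
    $html =~ s/<.*?>//g;
    return $html;
}

# Made-up inputs, one per failure mode named in the FAQ text.
my %cases = (
    'tag split over lines' => "<a\nhref='x'>link</a>",
    'quoted angle bracket' => '<img alt="a > b" src="x.png">text',
    'comment with markup'  => '<!-- <b>old</b> --> new',
);

for my $name (sort keys %cases) {
    printf "%-22s => [%s]\n", $name, naive_strip($cases{$name});
}
```

In each case something other than the plain text survives: the multi-line opening tag is left in place because "." does not match a newline, the quoted ">" ends the match early and leaves part of the tag behind, and the commented-out text "old" leaks into the output.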

I had never used HTML::TokeParser::Simple, so I gave that a shot:

#! /usr/bin/perl

use strict;
use warnings;

use HTML::TokeParser::Simple;

my $html = <<HTML;
<font><!--
<font> HI
</font>
-->
HI
</font>
HTML

my $p = HTML::TokeParser::Simple->new(string => $html);

my $in_font_tag;

while (my $token = $p->get_token) {
    if ($token->is_start_tag('font')) {
        print $token->as_is;
        $in_font_tag = 1;
        next;
    }
    if ($token->is_end_tag('font')) {
        print $token->as_is;
        $in_font_tag = 0;
        next;
    }
    if ($in_font_tag and $token->is_text) {
        my $text = $token->as_is;
        $text =~ s/HI/BYE/g;
        print $text;
        next;
    }
    print $token->as_is;
}

__END__

C:\Dload> h
<font><!--
<font> HI
</font>
-->
BYE
</font>

Seems to work.

Sinan.
 
 
 
 
 
Gunnar Hjalmarsson
      01-30-2005
(E-Mail Removed) wrote:
> I want to be able to replace text inside html tags. I am using a regex to
> extract the text, but after I modify the text how can I re-assemble
> the html tag? Here is an example:
>
> <font size=1> HI </font>
>
> I need to replace HI with BYE and re-assemble html tag like below:
>
> <font size=1> BYE </font>


Depending on the complexity of the document, the s/// operator may be
sufficient.

> I checked perldoc -q html but could not find the answer there.


Then you should have seen, for instance,

perldoc -q "remove HTML"

and other entries in perlfaq9 which warn against trying to parse HTML
documents with regular expressions, and recommend using a suitable
module for HTML parsing instead.

--
Gunnar Hjalmarsson
Email: http://www.gunnar.cc/cgi-bin/contact.pl
 
 
Sherm Pendley
      01-30-2005
(E-Mail Removed) wrote:

> I want to be able to replace text inside html tags. I am using a regex to
> extract the text, but after I modify the text how can I re-assemble
> the html tag? Here is an example:
>
> <font size=1> HI </font>
>
> I need to replace HI with BYE and re-assemble html tag like below:
>
> <font size=1> BYE </font>


Others have suggested using a parser module - and they're right. That should
always be your first instinct when working with HTML. However, there are
some scenarios where a regex is good enough, and faster to write than a
parser-based solution: for example, when the task at hand is a very simple
search-and-replace across a number of pages where you know a given pattern
will match, or when you're fixing pages that are broken beyond a parser's
ability to cope with them.

With that in mind, have a look at "perldoc perlretut", paying special
attention to the section titled "Extracting matches". You can use
"backreferences" in your regex to use parts of the matched string in the
replacement, like this:

#!/usr/bin/perl
use strict;
use warnings;

my $html = '<font size=1> HI </font><font size=1> HELLO </font>';

$html =~ s%(<font size=1>)(.*?)(</font>)%$1 BYE $3%g;

print $html, "\n";

Aside from subexpressions and backreferences, another point of note is the
"non-greedy" quantifier "*?". Without it - i.e. written as "*" - the second
expression would be "greedy", meaning it would return the longest possible
string that matches the expression it modifies. In the example above, that
would mean replacing everything between the first '<font size=1>' and the
*second* '</font>'. (Try it!)

That's not what you want - you want the *shortest* string that matches the
expression, not the longest. That's what the "non-greedy" quantifier gives
you.
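
For anyone who does not want to "try it" by hand, here is a small sketch that runs both variants side by side on the same string as above. The variable names $greedy and $lazy are just illustrative.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $html = '<font size=1> HI </font><font size=1> HELLO </font>';

# Copy $html, then substitute in the copy: first with ".*", then with ".*?".
(my $greedy = $html) =~ s%(<font size=1>)(.*)(</font>)%$1 BYE $3%g;
(my $lazy   = $html) =~ s%(<font size=1>)(.*?)(</font>)%$1 BYE $3%g;

print "greedy:     $greedy\n";
print "non-greedy: $lazy\n";
```

The greedy version collapses both elements into a single '<font size=1> BYE </font>', while the non-greedy version leaves two elements, each containing BYE.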

Just to restate it - regexes are generally *not* the best way to parse HTML,
particularly arbitrary HTML that's fetched from a web site that's beyond
your control. But using them *can* be useful if the task at hand is extremely
limited, or if the HTML is broken beyond a parser's ability to handle it.
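
Note that the substitution above overwrites whatever sits between the tags, so HELLO becomes BYE as well. If the goal is to extract the text, modify it, and put the tag back together, one possible sketch uses the /e modifier so the replacement is evaluated as Perl code. The helper edit_text and the looser pattern (any attributes, no nested tags inside the font element) are assumptions made for this example, and the same caveats about regexes and HTML apply.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: change only the text that was extracted.
sub edit_text {
    my ($text) = @_;
    $text =~ s/\bHI\b/BYE/g;
    return $text;
}

my $html = '<font size=1> HI </font><font size=2> HELLO </font>';

# $1 = opening tag with any attributes, $2 = text without nested tags,
# $3 = closing tag. The captures are copied into lexicals before
# edit_text runs its own regex.
$html =~ s{(<font\b[^>]*>)([^<]*)(</font>)}{
    my ($open, $text, $close) = ($1, $2, $3);
    $open . edit_text($text) . $close;
}ge;

print $html, "\n";
```

With this input only the first element changes; the second keeps its HELLO and its size=2 attribute.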

References:

perldoc perlretut
perldoc perlre

sherm--

--
Cocoa programming in Perl: http://camelbones.sourceforge.net
Hire me! My resume: http://www.dot-app.org
 
 
Bart Lateur
      01-31-2005
A. Sinan Unur wrote:

>I had never used HTML::TokeParser::Simple, so I gave that a shot:


>my $p = HTML::TokeParser::Simple->new(string => $html);
>
>my $in_font_tag;
>
>while(my $token = $p->get_token) {
> if($token->is_start_tag('font')) {
> print $token->as_is;
> $in_font_tag = 1;
> next;
> }
> if($token->is_end_tag('font')) {
> print $token->as_is;
> $in_font_tag = 0;
> next;
> }
> if($in_font_tag and $token->is_text) {
> my $text = $token->as_is;
> $text =~ s/HI/BYE/g;
> print $text;
> next;
> }
> print $token->as_is;
>}


I like to use ".." (the range operator, which acts as a flip-flop in
scalar context) in code with this kind of functionality. IMO this shows
an aspect where a tokeparser approach is vastly superior to raw
usage of HTML::Parser.

while (my $token = $p->get_token) {
    if ($token->is_start_tag('font') .. $token->is_end_tag('font')) {
        if ($token->is_text) {
            my $text = $token->as_is;
            $text =~ s/HI/BYE/g;
            print $text;
            next;
        }
    }
    print $token->as_is;
}
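
For readers who have not met ".." in scalar context, here is a core-Perl sketch of the same flip-flop idea that needs no CPAN module. The @tokens list is a made-up stand-in for what a token parser would return, not real HTML::TokeParser::Simple output.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Made-up token stream: text, start tag, text, end tag, text.
my @tokens = ('HI ', '<font>', 'HI', '</font>', ' HI');

my $out = '';
for my $tok (@tokens) {
    my $text = $tok;    # work on a copy, leave @tokens untouched
    # The flip-flop turns true at the start tag and stays true
    # through the end tag, inclusive.
    if ($tok eq '<font>' .. $tok eq '</font>') {
        $text =~ s/HI/BYE/g unless $tok =~ /^</;    # skip the tags themselves
    }
    $out .= $text;
}

print "$out\n";
```

Only the HI between the tags is replaced. One thing to keep in mind is that the operator remembers its state between evaluations, so an unclosed <font> would leave it switched on for everything that follows.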


--
Bart.
 
 
 
 