Velocity Reviews - Computer Hardware Reviews

Removing empty tags

 
 
jwcarlton
 
      02-24-2011
I've just started changing my processing over to HTML::HTML5::Parser,
so please bear with me on this.

I've been using a regex to remove empty tags, but I see one that's not
working so I assume there's either a typo, or an error in the logic.

I'm trying to convert this:

<span class="Apple-style-span" style="font-family: Arial, Verdana,
Helvetica, sans-serif; "><br></span>

To:

<br>

It should also catch <span...></span> (with nothing inside), or
<span...> </span> (with a whitespace inside).

"class" and "style" can be anything (or non-existent), so I'm just
trying to remove <span, followed by anything (or nothing) to the first
>, then the following </span>


Here's what I'm using:

$text =~ s/<span[^>]*>\s*<\/span>/ /gi;
$text =~ s/<span[^>]*>(<br>)*<\/span>/$1/gi;

This doesn't appear to work, though. The string I posted above
actually came through verbatim, so the match must have failed.

Of course, I know that this would fail on nested <span></span> tags,
which is why I'm switching over to HTML::HTML5::Parser. But in the
meantime, why did this one not match?
 
 
 
 
 
jwcarlton
 
      02-24-2011
> It works for me.
>
> ------------------------
> #!/usr/bin/perl
> use warnings;
> use strict;
>
> $_ = '<span class="Apple-style-span" style="font-family: Arial, Verdana,
> Helvetica, sans-serif; "><br></span>';
>
> s/<span[^>]*>(<br>)*<\/span>/$1/gi;
>
> print "$_\n";
> ------------------------
>
> If you can post a short and complete program that we can run that
> duplicates the problem you are having, then we can surely help
> you fix it...



That's really pretty much all there is! I'll paste the whole function
below; the only thing I'm leaving out is the part at the top where it
declares a few variables, logs the user in (which doesn't affect the
$text variable), and then prints the data to MySQL.

The data comes from a contenteditable, and when people paste things it
needs to be manipulated a bit, which is mostly what this function
does. I don't have a sample of raw content (I don't save it before it
runs through the function), but here's a sample of a complete string
that was printed (I left the content because I thought you guys might
get a kick out of it):

<span class="Apple-style-span" style="font-family: Arial, Verdana,
Helvetica, sans-serif; "><b>"We ALL got problems....If you're gonna be
dumb, ya gotta be tough."</b></span><br><br><span
class="Apple-style-span" style="font-family: Arial, Verdana, Helvetica,
sans-serif; "><br></span>


And the function:

sub fixtext {
$text = $_[0];

$text =~ s/&nbsp;/ /gi;

# Convert <em> to <i> and <strong> to <b>, saves a few steps later
$text =~ s/<em>(.*?)<\/em>/<i>$1<\/i>/gsi;
$text =~ s/<strong>(.*?)<\/strong>/<b>$1<\/b>/gsi;

# Strip Javascript
$text =~ s/<script.*?>.*?<\/script>//gsi;
$text =~ s/onmouseover=".*?"//gsi;
$text =~ s/onclick=".*?"//gsi;

### Only Allow Specified Tags
my $lt=chr(1);
my $gt=chr(2);
$text =~ s/<br>/$lt br $gt/gi;

$text =~ s/<(\/{0,1})(div.*?)>/$lt$1$2$gt/gsi;
$text =~ s/<(\/{0,1})(span.*?)>/$lt$1$2$gt/gsi;

$text =~ s/<(\/{0,1})(table.*?)>/$lt$1$2$gt/gsi;
$text =~ s/<(\/{0,1})(tr.*?)>/$lt$1$2$gt/gsi;
$text =~ s/<(\/{0,1})(td.*?)>/$lt$1$2$gt/gsi;

$text =~ s/<(\/{0,1})(b|p)>/$lt$1$2$gt/gsi;
$text =~ s/<(\/{0,1})(u|i)>/$lt$1$2$gt/gsi;

$text =~ s/<(\/{0,1})(font.*?)>/$lt$1$2$gt/gsi;

$text =~ s/<(\/{0,1})(img.*?)>/$lt$1$2$gt/gsi;

# delete all other tags
$text =~ s/<.+?>//gs;

$text =~ s/$lt/</g;
$text =~ s/$gt/>/g;
$text =~ s/< br >/<br>/gi;
###

# Strip Word junk
$text =~ s/Normal 0 false.*?}//gsi;
$text =~ s/Normal 0 MicrosoftInternetExplorer4.*?}//gsi;
$text =~ s/\/\* Style Definitions \*\/.*?}//gsi;
$text =~ s/Normal\.dotm .*? false false//gsi;

$text =~ s/white-space: nowrap;*//gsi;
$text =~ s/style="(\s*)"//gsi;

# Strip empty tags
$text =~ s/<font[^>]*>\s*<\/font>/ /gi;
$text =~ s/<font[^>]*>(<br>)*<\/font>/<br><br>/gi;

$text =~ s/<span[^>]*>\s*<\/span>/ /gi;
$text =~ s/<span[^>]*>(<br>)*<\/span>/$1/gsi;

$text =~ s/<i>(\s*)<\/i>/$1/gi;
$text =~ s/<b>(\s*)<\/b>/$1/gi;
$text =~ s/<u>(\s*)<\/u>/$1/gi;

$text =~ s/<div>\s*<\/div>/<br>/gi;
$text =~ s/<div>(.*?)<\/div>/<br><br>$1/gsi;

# Limit repeating characters
$text =~ s/(.)\1{4,}/$1$1$1$1/g;

# Strip opening, trailing, or repeating whitespace, <br>
$text =~ s/\s+/ /gs;
$text =~ s/^\s+|\s+$//g;

$text =~ s/(<br><br>)+/<br><br>/gi;
$text =~ s/^(<br>)+|(<br>)+$//gi;

return $text;
}
 
 
 
 
 
Wolf Behrenhoff
 
      02-24-2011
On 24.02.2011 06:11, jwcarlton wrote:
>> If you can post a short and complete program that we can run that
>> duplicates the problem you are having, then we can surely help
>> you fix it...

>
>
> That's really pretty much all there is! I'll paste the whole function
> below; the only thing I'm leaving out is the part at the top where it
> declares a few variables, logs the user in (which doesn't affect the
> $text variable), and then prints the data to MySQL.


We are not interested in whole long functions, only in the relevant
parts.

> The data comes from a contenteditable, and when people paste things it
> needs to be manipulated a bit, which is mostly what this function
> does. I don't have a sample of raw content (I don't save it before it
> runs through the function), but here's a sample of a complete string
> that was printed (I left the content because I thought you guys might
> get a kick out of it):


First: try the string you have posted. Your function will remove the
second span part!

And then: why don't you output the string before putting it in your
function? You need to look at the input!
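
One way to capture the input, as a rough sketch: wrap the call so the raw
string is written out before the cleaning routine touches it. The name
clean_with_log and its arguments are hypothetical (not from the original
code), and the cleaner used here is only a stand-in for fixtext.

```perl
use strict;
use warnings;

# Hypothetical wrapper: records the raw input and the result of a cleaning
# routine. $clean is any code ref that takes a string and returns a string
# (fixtext in this thread); $log_fh is any writable filehandle.
sub clean_with_log {
    my ($raw, $clean, $log_fh) = @_;
    print {$log_fh} "RAW: $raw\n";      # what actually arrived
    my $result = $clean->($raw);
    print {$log_fh} "OUT: $result\n";   # what the cleaner produced
    return $result;
}

# Usage with a stand-in cleaner that only removes an empty span.
my $out = clean_with_log(
    '<span> </span>text',
    sub { my $t = shift; $t =~ s/<span[^>]*>\s*<\/span>/ /gi; return $t },
    \*STDERR,
);
```

With the raw string in the log, the failing case can be pasted into a
short test script instead of being reconstructed from the output.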

The solution is probably simple: you are doing a lot of replacements. Assume
the input is "<span><br><b></b></span>". Then you don't remove the span,
because it is not empty yet when the span substitutions run. But later you
remove the b. If you reverse the order, you would also remove the span.
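
The ordering effect can be reproduced with a short script. This is a
minimal sketch that copies only the empty-span and empty-b substitutions
from the posted function; the sub names strip_empty_span and
strip_empty_b are made up for the illustration.

```perl
use strict;
use warnings;

# The two empty-span substitutions, as in the posted code.
sub strip_empty_span {
    my ($text) = @_;
    $text =~ s/<span[^>]*>\s*<\/span>/ /gi;
    $text =~ s/<span[^>]*>(<br>)*<\/span>/$1/gi;
    return $text;
}

# The empty-b substitution, as in the posted code.
sub strip_empty_b {
    my ($text) = @_;
    $text =~ s/<b>(\s*)<\/b>/$1/gi;
    return $text;
}

my $input = '<span><br><b></b></span>';

# Order used in fixtext: span first, then b.
my $span_first = strip_empty_b(strip_empty_span($input));

# Reversed order: b first, then span.
my $b_first = strip_empty_span(strip_empty_b($input));

print "span first: $span_first\n";   # span first: <span><br></span>
print "b first:    $b_first\n";      # b first:    <br>
```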

So you can try running the fixtext function more than once or try to
change the order of your 10000 replacements.
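
Running the function more than once could look roughly like this: keep
applying the cleaner until the text stops changing. The helper name
clean_until_stable and the cap of 10 passes are assumptions for the
sketch, and $clean stands in for fixtext with only three of its
substitutions.

```perl
use strict;
use warnings;

# Hypothetical helper: apply $clean repeatedly until the output equals the
# input of that pass. $max_passes is an assumed safety cap so a cleaner
# that never settles cannot loop forever.
sub clean_until_stable {
    my ($text, $clean, $max_passes) = @_;
    $max_passes //= 10;
    for (1 .. $max_passes) {
        my $next = $clean->($text);
        return $next if $next eq $text;   # nothing changed: done
        $text = $next;
    }
    return $text;
}

# Stand-in for fixtext: empty-span and empty-b steps in the posted order.
my $clean = sub {
    my ($t) = @_;
    $t =~ s/<span[^>]*>\s*<\/span>/ /gi;
    $t =~ s/<span[^>]*>(<br>)*<\/span>/$1/gi;
    $t =~ s/<b>(\s*)<\/b>/$1/gi;
    return $t;
};

print clean_until_stable('<span><br><b></b></span>', $clean), "\n";   # prints <br>
```

This costs one extra pass to confirm that nothing changed, but it does
not depend on finding the right order for every pair of substitutions.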

- Wolf

Next time please try to post a short program that one can run without
changing/adding anything! Often writing such a short program will point
you to the problem so that you can solve it on your own.

 