Regex question; match <br> after opening tag

 
 
jwcarlton
02-16-2011
I'm working on an area where the visitor submits content via
contenteditable, so the submission comes through in Word-style HTML
(meaning, it's somewhat of a mess, and completely dependent on the
user's browser).

I'm trying to remove opening and closing <br> tags. The problem I'm
having is when those tags come after a <font, <div, or <span, or
before a closing </font>, </div>, or </span>; eg:

<div class=whatever><span class=whatever><font
class=whatever><br><br><br>Hello, World!<br><br></font></span></div>

It's worth noting that <div>...</div> may or may not be there,
<span>...</span> may or may not be there, <font>...</font> may or may
not be there, they could be transposed (i.e., <font> before <span>), and
there can be anywhere from 0 to 3 <br> tags.

Here's where I am so far:

$text =~ s/^(<div(.*?)>)(<br>)+/$1/gi;
$text =~ s/^(<span(.*?)>)(<br>)+/$1/gi;
$text =~ s/^(<font(.*?)>)(<br>)+/$1/gi;

$text =~ s/(<br>)+(<\/div>)$/$2/gi;
$text =~ s/(<br>)+(<\/span>)$/$2/gi;
$text =~ s/(<br>)+(<\/font>)$/$2/gi;


I have 3 questions on this:

1. First off, does the code above look technically correct to you?
Meaning, would it work if we assume that the tags are always div,
followed by span, followed by font?

2. Is there a way to get these on 1 line?

3. How can I code it to work regardless of which tag comes first?

TIA,

Jason
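
For question 1, one way to find out is to run the six substitutions against the sample and see which ones fire. The sketch below does that under one assumption: the sample is joined onto a single line (the break after <font in the post is taken to be mail wrapping). The variable names are made up for the demo. On that input only the <div pattern fires, and only because (.*?) is free to run across the <span and <font tags that follow it. None of the closing patterns fire, because the <br> run sits in front of </font>, which is not at the end of the string.

```perl
use strict;
use warnings;

# Assumption: the sample from the post, joined onto one line.
my $sample = '<div class=whatever><span class=whatever><font class=whatever>'
           . '<br><br><br>Hello, World!<br><br></font></span></div>';

my $text = $sample;

# The six substitutions exactly as posted. s/// returns a true value only
# when it replaced something, so each flag records whether a pattern fired.
my $open_div   = ($text =~ s/^(<div(.*?)>)(<br>)+/$1/gi)  ? 1 : 0;
my $open_span  = ($text =~ s/^(<span(.*?)>)(<br>)+/$1/gi) ? 1 : 0;
my $open_font  = ($text =~ s/^(<font(.*?)>)(<br>)+/$1/gi) ? 1 : 0;

my $close_div  = ($text =~ s/(<br>)+(<\/div>)$/$2/gi)  ? 1 : 0;
my $close_span = ($text =~ s/(<br>)+(<\/span>)$/$2/gi) ? 1 : 0;
my $close_font = ($text =~ s/(<br>)+(<\/font>)$/$2/gi) ? 1 : 0;

print "opening: div=$open_div span=$open_span font=$open_font\n";
print "closing: div=$close_div span=$close_span font=$close_font\n";
print "$text\n";
```

So with the tags in div, span, font order the leading <br> run is removed, but the trailing one is left in place.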
 
Jürgen Exner
02-16-2011
jwcarlton <(E-Mail Removed)> wrote:
>I'm working on an area where the visitor submits content via
>contenteditable, so the submission comes through in Word-style HTML
>(meaning, it's somewhat of a mess, and completely dependent on the
>user's browser).


Then why are you trying to use REs to parse this mess?

[typical ill-fated attempt of using the wrong tool for the job deleted]

>I have 3 questions on this:
>
>1. First off, does the code above look technically correct to you?
>Meaning, would it work if we assume that the tags are always div,
>followed by span, followed by font?


Who cares? Nobody in his right mind would use _REGULAR_ expressions to
parse a context-free language.

>2. Is there a way to get these on 1 line?


Sure. Just remove the linebreaks.

>3. How can I code it to work regardless of which tag comes first?


By writing a proper HTML parser. Or much easier by using one of the
readily available HTML parsers from CPAN.

jue
 
jwcarlton
02-16-2011
> >I'm working on an area where the visitor submits content via
> >contenteditable, so the submission comes through in Word-style HTML
> >(meaning, it's somewhat of a mess, and completely dependent on the
> >user's browser).

>
> Then why are you trying to use REs to parse this mess?
>
> [typical ill-fated attempt of using the wrong tool for the job deleted]


I'm guessing that you've never worked with a contenteditable form?
It's not as easy as all that.


> >I have 3 questions on this:

>
> >1. First off, does the code above look technically correct to you?
> >Meaning, would it work if we assume that the tags are always div,
> >followed by span, followed by font?

>
> Who cares? Nobody in his right mind would use _REGULAR_ expressions to
> parse a context-free language.


I care, or I wouldn't have asked. I assume that you care, too, or you
wouldn't have wasted your time on replying.


> >2. Is there a way to get these on 1 line?

>
> Sure. Just remove the linebreaks.


Sigh.
 
jwcarlton
02-16-2011
On Feb 16, 12:11 am, Tad McClellan <(E-Mail Removed)> wrote:
> jwcarlton <(E-Mail Removed)> wrote:
> > I'm trying to remove opening and closing <br> tags.

>
> There is no such thing as a "closing" <br> tag...
>
>     http://www.w3.org/TR/REC-html32#br
>
>     ... This is an empty element so the end tag is forbidden
>
> ><div class=whatever><span class=whatever><font
> > class=whatever><br><br><br>Hello, World!<br><br></font></span></div>

>
> ---------------------------
> #!/usr/bin/perl
> use warnings;
> use strict;
>
> my $text = '<div class=whatever><span class=whatever><font
> class=whatever><br><br><br>Hello, World!<br><br></font></span></div>';
>
> $text =~ s/<br>//g;
>
> print "$text\n";
> ---------------------------
>
> --
> Tad McClellan
> email: perl -le "print scalar reverse qq/moc.liamg\100cm.j.dat/"
> The above message is a Usenet post.
> I don't recall having given anyone permission to use it on a Web site.


Seriously, why even bother replying?
 
Dr.Ruud
02-16-2011
On 2011-02-16 06:18, jwcarlton wrote:
> On Feb 16, 12:11 am, Tad McClellan <(E-Mail Removed)> wrote:
>> jwcarlton <(E-Mail Removed)> wrote:
>>> I'm trying to remove opening and closing <br> tags.
>>
>> There is no such thing as a "closing" <br> tag...
>> [...]
>
> Seriously, why even bother replying?


I guess because all answers to your questions are in the FAQ.
That you shouldn't quote sigs is in another one.

--
Ruud
 
George Mpouras
02-16-2011
my $text = '<div class=whatever><span
class=whatever><font class=whatever><br>help<o><br><br>Hello,
World!<br><br></font></span>
</div>';

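# Print the text found between pairs of <br> tags, with other tags stripped.
# /s lets . match the newlines inside $text (the original /m had no effect).
while ( $text =~ /<br>(.+?)<br>/gs )
{
    (my $chunk = $1) =~ s/<.+?>//gs;   # $chunk rather than $a, which sort() uses
    print "*$chunk*\n";
}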


 
Justin C
02-16-2011
On 2011-02-16, jwcarlton <(E-Mail Removed)> wrote:
>
> Seriously, why even bother replying?


Then show us a sample of the content that you are receiving so we can
better understand the problem. Antagonising those who offer suggestions
is never a good move.

Justin.

--
Justin C, by the sea.
 
jwcarlton
02-16-2011
> > Seriously, why even bother replying?
>
> Then show us a sample of the content that you are receiving so we can
> better understand the problem. Antagonising those who offer suggestions
> is never a good move.


Justin, please understand that Tad was giving a PITA answer, not a
suggestion. I definitely wasn't antagonizing; if you look closely at
his response, you'll see what I mean.

He and I have a history, and in the years that I've been watching, I
don't think he's ever given a REAL answer to anyone.

Anyway, let's not let Tad ruin yet another thread.

I gave a sample of what I get in the OP:

<div class=whatever><span class=whatever><font
class=whatever><br><br><br>Hello, World!<br><br></font></span></div>

I'm trying to write a regex that will remove <br> from both the
beginning and the end of the string, even when those <br> tags are
nested within other tags.

I already use this, which obviously removes the <br> when it's not
nested inside of other tags:

$text =~ s/^(<br>)+|(<br>)+$//gi;

I gave code samples in my OP, too, of what I think will work; the only
problem is that it requires the tags to be in that order: DIV, then
SPAN, then FONT. If the FONT comes before the SPAN, then it doesn't
work, so I'm trying to create a more streamlined method.

Thanks, Justin.
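
For what it's worth, the "any order" requirement can be sketched with two substitutions by matching a run of wrapper tags instead of one pattern per tag. This is only a sketch under stated assumptions: strip_edge_breaks is a made-up name, the wrapper list (div, span, font) comes from the post, and the tags are assumed to sit directly next to each other as in the sample. It is a regex approach, so it does not validate the HTML in any way.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread). The wrapper tag list
# (div, span, font) is taken from the post; extend it as needed.
sub strip_edge_breaks {
    my ($text) = @_;

    my $br    = qr/<br\b[^>]*>/i;                        # <br>, <BR>, <br />
    my $open  = qr/(?:<(?:div|span|font)\b[^>]*>)*/i;    # opening tags, any order
    my $close = qr/(?:<\/(?:div|span|font)\s*>)*/i;      # closing tags, any order

    # Leading side: keep the run of opening tags, drop the <br> run after it.
    $text =~ s/\A($open)$br+/$1/;
    # Trailing side: drop the <br> run, keep the closing tags before the end.
    $text =~ s/$br+($close\s*)\z/$1/;

    return $text;
}

print strip_edge_breaks('<font><span><br><br>Hello, World!<br></span></font>'), "\n";
```

A <br> in the middle of the text is left alone. A <br> that sits between two wrapper tags (say <div><br><span><br>) is only partly handled, since each substitution removes a single run.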
 
jwcarlton
02-16-2011
On Feb 16, 4:03 am, "George Mpouras"
<(E-Mail Removed)> wrote:
> my $text = '<div class=whatever><span
> class=whatever><font class=whatever><br>help<o><br><br>Hello,
> World!<br><br></font></span>
> </div>';
>
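> # Print the text found between pairs of <br> tags, with other tags stripped.
> # /s lets . match the newlines inside $text (the original /m had no effect).
> while ( $text =~ /<br>(.+?)<br>/gs )
> {
>     (my $chunk = $1) =~ s/<.+?>//gs;   # $chunk rather than $a, which sort() uses
>     print "*$chunk*\n";
> }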
>
>


Awesome, George! I really appreciate that.
 
Jürgen Exner
02-16-2011
jwcarlton <(E-Mail Removed)> wrote:
>I gave a sample of what I get in the OP:
>
><div class=whatever><span class=whatever><font
>class=whatever><br><br><br>Hello, World!<br><br></font></span></div>
>
>I'm trying to write a regex that will remove <br> from both the
>beginning and the end of the string, even when those <br> tags are
>nested within other tags.
>
>I already use this, which obviously removes the <br> when it's not
>nested inside of other tags:
>
>$text =~ s/^(<br>)+|(<br>)+$//gi;
>
>I gave code samples in my OP, too, of what I think will work; the only
>problem is that it requires the tags to be in that order: DIV, then
>SPAN, then FONT. If the FONT comes before the SPAN, then it doesn't
>work, so I'm trying to create a more streamlined method.


And these conditions are exactly why using a simple-minded regular
expression is an unsuitable approach, in particular if you have no
control over the format of the incoming data.
Use a parser that actually parses HTML fragments and creates a syntax
tree, and then delete or keep exactly those elements that you want.

Doing it on the textual level is not going to work reliably.

jue
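
The "break it into pieces first, then decide what to delete" idea can be illustrated without installing anything. To be clear, this is not a real HTML parser and it builds no syntax tree: it is a plain tokenizer swapped in so the example is self-contained, and trim_breaks_by_tokens is a made-up name. For input you don't control, a CPAN parser such as HTML::TreeBuilder is the more robust route.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread, and not a real HTML parser):
# split the fragment into tag and text tokens, then drop every <br> token
# that comes before the first visible text or after the last one.
sub trim_breaks_by_tokens {
    my ($html) = @_;

    # A token is a tag (<...>), a run of text, or a stray "<".
    my @tokens = $html =~ /(<[^>]*>|[^<]+|<)/g;

    my $is_tag = qr/\A<[^>]*>\z/;
    my $is_br  = qr/\A<br\b[^>]*>\z/i;

    # Positions of tokens that carry visible (non-whitespace) text.
    my @text_at = grep { $tokens[$_] !~ $is_tag && $tokens[$_] =~ /\S/ }
                  0 .. $#tokens;
    return $html unless @text_at;    # no text at all: leave the input alone

    my ($first, $last) = @text_at[0, -1];

    # Keep everything except <br> tokens outside the first..last text range.
    my @kept = map  { $tokens[$_] }
               grep { !( $tokens[$_] =~ $is_br && ($_ < $first || $_ > $last) ) }
               0 .. $#tokens;

    return join '', @kept;
}

print trim_breaks_by_tokens('<div><br><span><br>Hi<br>there<br></span><br></div>'), "\n";
```

Because the decision is made per token rather than per pattern, the order of the wrapper tags no longer matters, and a <br> that sits between two wrappers is dropped as well. A <br> between two pieces of text is kept.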
 