Velocity Reviews - Computer Hardware Reviews

Regex losing <br> (different from the earlier topic about losing $1)

 
 
Jason C
      06-22-2012
I'm building a profanity filter, and I'm using the following subroutine to replace matched words with XXXX:

while (($original, $converted) = @profanityArr) {
    if (!$converted) {
        $len = length($original);
        $converted = "X" x $len;
    }

    $original = quotemeta($original);

    $text =~ s/(\r|\n|\r\n|<br>|\s)*$original(\r|\n|\r\n|<br>|\s) */$1$converted$2/i;
}


# When I feed:
$original = "daym";

$text = "<br><br>daym<br><br>";
###

I'm getting "<br>XXXX<br>". Meaning, it loses the matched <br> in both $1 and $2.

# When I feed:
$original = "jason";
$converted = "brainfried";

$text = "<br><br>jason<br><br>";
###

I'm getting "<br>brainfried<br>". Again, it loses the matched <br> in both $1 and $2.


# When I feed:
$original = "dammit";
$converted = "XXXXit";

$text = "<br><br>dammit<br><br>";
###

I'm getting "<br>XXXXit<br><br>". Meaning, it loses the matched <br> in $1, but keeps it in $2.

It's the same if I change $1 and $2 to \1 and \2.

Any suggestions on how to correct the sub to keep the matched <br>?
 
Jan Pluntke
      06-22-2012
"Jason C" <> wrote:

> $text =~
> s/(\r|\n|\r\n|<br>|\s)*$original(\r|\n|\r\n|<br>|\s) */$1$converted$2/i;

[...]
> # When I feed:
> $original = "daym";
>
> $text = "<br><br>daym<br><br>";
> ###
>
> I'm getting "<br>XXXX<br>". Meaning, it loses the matched <br> in both $1
> and $2.


You will want to capture the * also, otherwise $1 and $2 will
contain only one (the last) match for that part of the string:

((?:\r|\n|\r\n|<br>|\s)*)

The ?: will make the inner () non-capturing.
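
A minimal sketch of the difference described above, using the sample string from the original post and a shortened alternation (the variable names are made up for illustration). When the quantifier sits outside the capturing group, each repetition overwrites the capture, so only the last <br> is left in $1 and $2:

```perl
use strict;
use warnings;

my $text = "<br><br>daym<br><br>";

# Quantifier outside the capturing group: the group is re-captured on every
# repetition, so each capture holds only the last <br> it matched.
my ($lead_last, $trail_last) = $text =~ /(<br>|\s)*daym(<br>|\s)*/i;

# Quantifier inside an outer capturing group, with (?:...) for the
# alternation: the outer group captures the whole run.
my ($lead_all, $trail_all) = $text =~ /((?:<br>|\s)*)daym((?:<br>|\s)*)/i;

print "repeated group: '$lead_last' / '$trail_last'\n";   # '<br>' / '<br>'
print "wrapped group:  '$lead_all' / '$trail_all'\n";     # '<br><br>' / '<br><br>'

# The same pattern used in a substitution keeps every <br>.
(my $fixed = $text) =~ s/((?:<br>|\s)*)daym((?:<br>|\s)*)/${1}XXXX${2}/i;
print "$fixed\n";                                          # <br><br>XXXX<br><br>
```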

I think (but might be wrong - did not test) that \s contains
\r and \n, so you can remove them:

((?:<br>|\s)*)
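
A quick check of that assumption, as a sketch: it tests each character individually against \s and then runs the shorter alternation over a mixed run of separators. The sample data is made up for illustration.

```perl
use strict;
use warnings;

# Characters that were spelled out next to \s in the original alternation,
# plus tab and a plain space.
my %candidates = (
    'carriage return' => "\r",
    'newline'         => "\n",
    'tab'             => "\t",
    'space'           => " ",
);

my @not_matched = grep { $candidates{$_} !~ /\A\s\z/ } sort keys %candidates;
print @not_matched
    ? "not matched by \\s: @not_matched\n"
    : "all matched by \\s\n";

# The shorter alternation therefore accepts the same separators,
# including a CRLF pair.
my $separators = "<br>\r\n<br>\n ";
my $whole_run  = ($separators =~ /\A((?:<br>|\s)*)\z/ && $1 eq $separators) ? 1 : 0;
print "shorter pattern matches the whole run\n" if $whole_run;
```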

Regards,
Jan

 
Jason C
      06-22-2012
On Friday, June 22, 2012 1:37:51 AM UTC-4, Jan Pluntke wrote:
> You will want to capture the * also, otherwise $1 and $2 will
> contain only one (the last) match for that part of the string:
>
> ((?:\r|\n|\r\n|<br>|\s)*)
>
> The ?: will make the inner () non-capturing.


Excellent! I was not familiar with the ?:, so I'll have to make a note of that for future reference.


> I think (but might be wrong - did not test) that \s contains
> \r and \n, so you can remove them:
>
> ((?:<br>|\s)*)
>
> Regards,
> Jan


Correct again! I thought that \s matched only the space character, and didn't realize that it includes line breaks (and apparently tabs, too). I can modify all of my scripts for that now, and save a little bandwidth.

Thanks for the help!
 
Jason C
      06-23-2012
On Friday, June 22, 2012 5:07:40 AM UTC-4, Ben Morrow wrote:
> Unless $original is supposed to be a regex, you want \Q\E around it.


I originally did this in the function:

$original = quotemeta($original);
$text =~ ...;

Is there a difference between quotemeta() and \Q\E?
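
As far as the resulting pattern goes, the two should be interchangeable: quotemeta() is the function behind the \Q escape. A small sketch with a made-up word containing regex metacharacters (the sample strings are assumptions, not from the filter list):

```perl
use strict;
use warnings;

my $original = 'a.b*c';    # hypothetical word with regex metacharacters

my $via_function = quotemeta($original);
my $via_escape   = "\Q$original\E";

print "$via_function\n";                                          # a\.b\*c
print $via_function eq $via_escape ? "identical\n" : "different\n";

my $text = 'axbbc a.b*c';

# Unquoted, "." and "*" act as metacharacters, so "axbbc" matches first.
(my $unquoted = $text) =~ s/$original/X/;

# Quoted either way, only the literal string matches.
(my $with_escape   = $text) =~ s/\Q$original\E/X/;
(my $with_function = $text) =~ s/$via_function/X/;

print "$unquoted\n";       # X a.b*c
print "$with_escape\n";    # axbbc X
print "$with_function\n";  # axbbc X
```

One practical difference is that \Q\E leaves $original itself untouched, whereas assigning the result of quotemeta() back to it changes the variable for any later use, such as length().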


> You don't really need the final capture, you can just use lookahead.
> Similarly you don't need to capture more than one \s just to put it back
> again:
>
> s/(\s|<br>) \Q$original\E (?= \s|<br>)/$1$converted/ix;
>
> Turning the initial capture into lookbehind is harder, since Perl
> doesn't support variable-length lookbehind and the two branches of the
> alternation are different lengths. However, if you have at least 5.10
> (which you do, I hope), you can use \K like this:
>
> s/ (?:\s|<br>) \K \Q$original\E (?=\s|<br>) /$converted/ix;


I'm afraid that you went just a little over my head on that one. What does the \K do? And what does (?=\s|<br>) do differently from (?:\s|<br>)? Or are they the same?
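
A small sketch that may help with both questions, using one of the sample words from the first post. (?:...) only groups, so whatever it matches is part of the match and gets replaced; (?=...) is a lookahead that checks what follows without consuming it; \K drops everything matched before it from the part being replaced.

```perl
use strict;
use warnings;
use 5.010;    # \K needs Perl 5.10 or later

my $text      = "<br>daym<br>";
my $original  = "daym";
my $converted = "XXXX";

# (?:...) on both sides: both <br> tags are part of the match,
# so they are replaced along with the word.
(my $grouped = $text) =~ s/(?:\s|<br>)\Q$original\E(?:\s|<br>)/$converted/i;

# \K keeps the leading <br> out of the replaced part, and the (?=...)
# lookahead only asserts that \s or <br> follows.
(my $kept = $text) =~ s/(?:\s|<br>)\K\Q$original\E(?=\s|<br>)/$converted/i;

print "$grouped\n";    # XXXX
print "$kept\n";       # <br>XXXX<br>
```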

This is slightly different, but how do I include "or at the beginning of the string" in that regex?

I don't think that this would work, would it?

((?:^|\s|<br>)*)

For this purpose, I'm specifically converting a string of "www.example.com" to "http://www.example.com". A string like

$text = "Go to www.example.com";

matches, but

$text = "www.example.com<br><br>click here";

doesn't.

Further, in this case I don't want it to match when the www is between other characters (so that it doesn't change "http://www" to "http://http://www"), so I think I'll have to use a totally different regex without the trailing part. But I still need to figure out how to make it match if it follows a \s, a <br>, or is at the beginning of the string.
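
One possible approach, sketched under the assumption that Perl 5.10 or later is available: ^ can be one branch of the alternation, and dropping the trailing * makes the prefix mandatory. With the *, zero repetitions are allowed, so the pattern would also match the www in "http://www". The sub name add_scheme is made up for this example.

```perl
use strict;
use warnings;
use 5.010;    # \K needs Perl 5.10 or later

# Prefix "http://" to a "www." that sits at the start of the string or
# directly after whitespace or a <br>.
sub add_scheme {
    my ($text) = @_;
    $text =~ s{ (?: ^ | \s | <br> ) \K www\. }{http://www.}gix;
    return $text;
}

print add_scheme("Go to www.example.com"), "\n";
# Go to http://www.example.com
print add_scheme("www.example.com<br><br>click here"), "\n";
# http://www.example.com<br><br>click here
print add_scheme("http://www.example.com"), "\n";
# http://www.example.com (unchanged, because "/" precedes the www)
```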
 
Morty Abzug
      06-26-2012
In article <s7pdb9->,
"Ben Morrow " <> spake thusly:
>
> s/ (?:\s|<br>) \K \Q$original\E (?=\s|<br>) /$converted/ix;


In addition to Jan and Ben's excellent suggestions, please note that
the patterns don't need to be applied in a loop. You can do something
like this:

my $regex=join "|", map quotemeta, @profanityArr;
$text =~ s{ (?:\s|<br>) \K ($regex) (?=\s|<br>) }{"X" x length $1}ex;

The "|" lets you match alternatives in a single regex, while the /e
flag evaluates the replacement as a Perl expression and substitutes
its result.
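
A fuller sketch of the single-pass idea, with several assumptions that are not in the posts above: the word list is kept in a hash (%replacement_for is a made-up name) that maps each word to its custom replacement or to an empty string for plain X masking, the /g and /i flags are added so every occurrence is caught regardless of case, and ^ and \z are added so words at either end of the string are also filtered.

```perl
use strict;
use warnings;
use 5.010;    # \K needs Perl 5.10 or later

# Hypothetical word list: an empty value means "mask with X's".
my %replacement_for = (
    daym   => '',
    jason  => 'brainfried',
    dammit => 'XXXXit',
);

# Longest words first, so a longer word wins over a shorter one
# that happens to be its prefix.
my $alternation = join '|',
    map  { quotemeta($_) }
    sort { length($b) <=> length($a) }
    keys %replacement_for;

sub filter_text {
    my ($text) = @_;
    $text =~ s{ (?: ^ | \s | <br> ) \K ($alternation) (?= \s | <br> | \z ) }
              { $replacement_for{lc $1} || 'X' x length($1) }gixe;
    return $text;
}

print filter_text("<br><br>daym<br><br>"), "\n";   # <br><br>XXXX<br><br>
print filter_text("Jason said dammit"), "\n";      # brainfried said XXXXit
```

Because quotemeta escapes every non-word character, the interpolated words are not disturbed by the /x flag.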

As I'm sure you know, folks who want to bypass the filters can usually
figure out ways around them.

- Morty
 