Regular Expressions: "Negated Strings" instead of "Negated Character Classes"

 
 
lmeurs@gmail.com
Guest
Posts: n/a
 
      06-07-2007
Dear all,

I'm sure the subject sounds more complicated than the actual matter.
Let me explain my problem. With Perl regular expressions one can
define character classes and negated character classes.

s/[abc]//g     The letters a, b and c will be removed from a string
s/[^abc]//g    Now all characters *but* a, b and c will be removed from a string

One can also do the first with strings, like this:

s/(one|two|three)//g    Substrings 'one', 'two' and 'three' will be removed from a string

But how can I turn this around, just like I did with character
classes? What I'm looking for would look something like this:

s/(^one|two|three)//g or s/!(one|two|three)//g

Why? I am trying to get rid of all HTML-tags *but* break-, paragraph-
and divider-tags.

s/<\/?(br|p|div)( .+?)?>//ig    This would remove the break-, paragraph- and divider-tags from a string

How can I invert this regular expression? Any help would be really
appreciated!

Thanks a lot in advance,

Laurens Meurs
Rotterdam, the Netherlands

 
 
 
 
 
Gunnar Hjalmarsson
Guest
Posts: n/a
 
      06-07-2007
(E-Mail Removed) wrote:
> With Perl regular expressions one can
> define a character and negated character classes.
>
> s/[abc]//g The letters a, b and c will be removed from a
> string
> s/[^abc]//g Now all the letters *but* a, b and c will be
> removed from a string
>
> One can also do the first with strings, like this:
>
> s/(one|two|three)//g Substrings 'one', 'two' and 'three' will be
> removed from a string
>
> But how can I turn this around, just like I did with character
> classes? What I'm looking for would look something like this:
>
> s/(^one|two|three)//g or s/!(one|two|three)//g


This is one approach:

s{(\b\w+\b)}{
    my $match = $1;
    $match =~ /^(?:one|two|three)$/ ? $match : '';
}eg;

--
Gunnar Hjalmarsson
Email: http://www.gunnar.cc/cgi-bin/contact.pl
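
For readers who want to try the approach above, here is a minimal, self-contained sketch run on a made-up sample string. The variable name $text and the sample words are assumptions for illustration only, not part of the original post.

```perl
use strict;
use warnings;

# Hypothetical sample input; only 'one', 'two' and 'three' should survive.
my $text = 'one four two five three';

$text =~ s{(\b\w+\b)}{
    my $match = $1;    # copy $1 first, because the inner match can reset it
    $match =~ /^(?:one|two|three)$/ ? $match : '';
}eg;

# Removed words leave their surrounding spaces behind.
print "$text\n";
```

Note that only the words themselves are deleted, so the whitespace around them stays in the string.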
 
 
 
 
 
lmeurs@gmail.com
Guest
Posts: n/a
 
      06-07-2007
Dear Gunnar,

Thanks for the quick reply, and it worked!

I was looking so hard for a built-in Perl way to do the trick that,
unfortunately, I couldn't think of a brilliant workaround like this
myself...

But I am still curious: is this the easiest way? Doesn't Perl's
regular expression engine provide something like what I suggested?

Thanks again, a lot!

Laurens

 
 
lmeurs@gmail.com
Guest
Posts: n/a
 
      06-07-2007
And to be complete, here is the eventual solution to get rid of all
HTML-tags except for, for example, BR and P:

my $t = " a <br /> b <p style='border: 1px red solid; '> c </p> d <hr> e <br> f <hr /> g";
# $1 is the complete tag, $2 the tag name; only BR and P are kept.
$t =~ s#(</?(\w+)(?: .+?)?>)#
    my $t1 = $1;
    my $t2 = $2;
    $t2 =~ /^(?:br|p)$/i ? $t1 : "";
#eg;

As a result, both the <hr> and <hr /> tags are removed from the
original string. Apart from the whitespace left behind, the new value is:

"a <br /> b <p style='border: 1px red solid; '> c </p> d e <br> f g";

Regards!

 
 
Uri Guttman
Guest
Posts: n/a
 
      06-07-2007
>>>>> "GH" == Gunnar Hjalmarsson <(E-Mail Removed)> writes:


GH> This is one approach:

GH> s{(\b\w+\b)}{
GH> my $match = $1;
GH> $match =~ /^(?:one|two|three)$/ ? $match : '';
GH> }eg;

or use a hash inside for better speed (untested):

my %ignore_tags = map { $_ => 1 } qw( one two three ) ;

s{(\b\w+\b)}{ $ignore_tags{$1} ? $1 : '' }eg;

adding in the <> stuff is left as an exercise to the reader. for that
reason alone, a parser should be used. most html parser modules are easy
to hack so that they filter out tags and rebuild the html text later.

uri

--
Uri Guttman ------ (E-Mail Removed) -------- http://www.stemsystems.com
--Perl Consulting, Stem Development, Systems Architecture, Design and Coding-
Search or Offer Perl Jobs ---------------------------- http://jobs.perl.org
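
One possible way to add "the <> stuff" to the hash lookup above is sketched below. It is only an illustration under assumptions: the names %keep_tags and $html and the sample markup are made up, and the pattern assumes simple tags with no '>' inside attribute values, so it shares the limits of any regex-based HTML filter.

```perl
use strict;
use warnings;

# Tags to keep (lower-case); every other tag is removed.
my %keep_tags = map { $_ => 1 } qw( br p div );

# Hypothetical sample input.
my $html = q{a <br /> b <P class="x"> c </p> d <hr> e <span>f</span>};

# $1 is the whole tag, $2 the tag name; lc() makes the lookup case-insensitive.
$html =~ s{(</?(\w+)[^>]*>)}{ $keep_tags{ lc $2 } ? $1 : '' }eg;

print "$html\n";
```

The hash lookup avoids running a second regex for every tag, which is the speed gain mentioned above, and the list of tags to keep can be changed without touching the pattern.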
 
 
Uri Guttman
Guest
Posts: n/a
 
      06-08-2007
>>>>> "l" == lmeurs <(E-Mail Removed)> writes:

l> And to be complete, the eventual solution to get rid of all HTML-tags,
l> except for for example BR and P:

and to be really complete that will fail in many ways. html can only be
fully parsed by a module and not by regexes. in some cases where you
know or control the html you can mung it with regexes.

uri

--
Uri Guttman ------ (E-Mail Removed) -------- http://www.stemsystems.com
--Perl Consulting, Stem Development, Systems Architecture, Design and Coding-
Search or Offer Perl Jobs ---------------------------- http://jobs.perl.org
 
 
Brian McCauley
Guest
Posts: n/a
 
      06-08-2007
On Jun 7, 10:58 pm, (E-Mail Removed) wrote:

> But still being curious: is this the easiest way? Doesn't Perls
> Regular Expression engine provide something like I suggested?


Yes it does, negative lookahead.

To remove any word but 'one' 'two' or 'three'...

s/\b(?!one|two|three)\w+//g;

Note you have to say \b to constrain it to finding whole words -
otherwise it would be perfectly within its rights to remove the
'hree' from 'three'.
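
A small sketch of the same idea follows, with two changes that are assumptions on top of the post above: a \b inside the lookahead, so that longer words that merely start with 'one', 'two' or 'three' are still removed, and a variant applied to the HTML tags from the original question. The variable names and sample strings are made up, and the tag pattern assumes no '>' inside attribute values.

```perl
use strict;
use warnings;

# Hypothetical sample input.
my $words = 'one threesome two oneness three four';

# The \b inside the lookahead protects only the whole words,
# so 'threesome' and 'oneness' are removed as well.
$words =~ s/\b(?!(?:one|two|three)\b)\w+//g;
print "$words\n";

# Same technique for tags: remove every tag except br, p and div.
my $html = q{a <br /> b <p style="x"> c </p> d <hr> e <pre>f</pre>};
$html =~ s{</?(?!(?:br|p|div)\b)\w+[^>]*>}{}gi;
print "$html\n";
```

Without the inner \b, the lookahead would also protect 'threesome' and the <pre> tag, because both begin with one of the listed alternatives.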

 