Negative lookahead regex clarification needed

 
 
shifty
Guest
Posts: n/a
 
      01-19-2005
Hi,

I'm trying to hack my way through a regex for a chunk of code I'm going
to use. I've been using a Regex Coach to run through this and I think
I have correct syntax.

I am trying to find any one of several 'hacked' variants of the word
"microsoft" (ex: m1cr0s0ft, mir00ft, etc.), but NOT match on the
actual word "microsoft". I need the regex to be case insensitive.

This is my regex - it seems to work, but I don't know if the syntax is
honestly correct and I don't want it to break later:

(?i).*\b(?:(?!microsoft)m+[i1l\\\|!]+[C]+r+[o0]+[s]+[o0]+f+[t\+]+)\b.*

This expression will:
Be case insensitive
Use word boundaries so that only the whole word I'm looking for is found
Allow anything to precede or follow the word
Match on several variants of 'microsoft' as long as the negative lookahead
doesn't find the proper spelling
Not capture the match if one is found
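
The behaviour listed above can be checked with a short Perl sketch. It assumes the pattern originally contained a non-capturing group, (?:(?!microsoft)...), which the forum software appears to have mangled. The helper name is_hacked and the sample strings are made up for illustration.

```perl
use strict;
use warnings;

# Pattern from the post. The (?: ... ) group is an assumed reconstruction.
my $re = qr/(?i).*\b(?:(?!microsoft)m+[i1l\\\|!]+[C]+r+[o0]+[s]+[o0]+f+[t\+]+)\b.*/;

# Hypothetical helper: 1 if the text contains a "hacked" variant, else 0.
sub is_hacked {
    my ($text) = @_;
    return $text =~ $re ? 1 : 0;
}

# Try one obfuscated spelling and two proper spellings.
for my $sample ('buy m1cr0s0ft now', 'buy microsoft now', 'MICROSOFT Office') {
    printf "%-20s => %d\n", $sample, is_hacked($sample);
}
```

Only the first sample should be flagged. The other two are rejected because the lookahead sees the proper spelling, in either case, at the only position where the word could start.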

Is this correct? Any help is appreciated. I'm going to need to knock
out several of these things.

I'm just starting with regex, and I'm totally in love - but it's really
easy to be inefficient and it's also really, really easy to miss
"false positives" caused by overlooking an aspect of your expression.
Reminds me of 'chess vs. chemistry' or something.

 
Alan J. Flavell
Guest
Posts: n/a
 
      01-19-2005
On Wed, 19 Jan 2005, shifty wrote:

> I'm trying to hack my way through a regex for a chunk of code I'm going
> to use. I've been using a Regex Coach to run through this and I think
> I have correct syntax.


I didn't know what "Regex Coach" is (I do now, courtesy of Google),
but I find "pcretest" (part of the PCRE package from Philip Hazel) to be
a valuable aid.

> I am trying to find any one of several 'hacked' variants of the word
> "microsoft" (ex: m1cr0s0ft, mir00ft, etc.), but NOT match on the
> actual word "microsoft". I need the regex to be case insensitive.


Off the top of my head: Perhaps it would be better to do a character
translation on the string, and then compare the result with the
original.

OTOH, if you're in a context where only a regex is acceptable (you're
not by any chance writing recipes for spamassassin?) then I might have
to take that back.
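
A minimal sketch of the translate-and-compare idea, for contexts where code rather than a single regex is allowed. The translation table is a small hand-picked assumption, not something given in the thread, and looks_obfuscated is a hypothetical helper name.

```perl
use strict;
use warnings;

# Hypothetical helper: returns 1 if some word in $text becomes "microsoft"
# only after look-alike characters are translated back to letters.
sub looks_obfuscated {
    my ($text) = @_;
    for my $word (split ' ', lc $text) {
        $word =~ s/[.,;:?]+$//;                      # drop trailing punctuation
        (my $norm = $word) =~ tr/01l!|$+/oiiiist/;   # undo assumed substitutions
        $norm =~ s/(.)\1+/$1/g;                      # collapse repeated characters
        # Flag the word only if translation changed it into the real spelling.
        return 1 if $norm eq 'microsoft' && $word ne 'microsoft';
    }
    return 0;
}

printf "%d\n", looks_obfuscated('Buy M1CR0S0FT today');
printf "%d\n", looks_obfuscated('I work at Microsoft.');
```

Comparing the translated word with the original is what keeps the proper spelling from being flagged. The table would need extending for other look-alike characters.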

 
shifty
Guest
Posts: n/a
 
      01-21-2005

> I didn't know what "Regex Coach" is (I do now, courtesy of Google),
> but I find "pcretest" (part of the PCRE package from Philip Hazel) to be
> a valuable aid.


I'll hafta check that out.


> OTOH, if you're in a context where only a regex is acceptable (you're
> not by any chance writing recipes for spamassassin?) then I might have
> to take that back.


I am writing recipes for spam rejection; you're sharp.

I'm writing something specific to PCRE. I couldn't find any current
regex-specific groups.

 
shifty
Guest
Posts: n/a
 
      01-21-2005

> If the syntax weren't correct it wouldn't compile. What you are asking is
> whether it does what you want it to do, which is about semantics.


For the purpose it's being used, it is not necessary to compile the
regex. It's being accessed from an outside resource (spam filter).


> Is there any reason why you want to use lookahead to exclude unaltered
> strings like "microsoft"? Just skip those strings using an extra regex,
> and concentrate on matching the altered variants.


Yes. I don't want to bounce legitimate emails. Spam emails offering
their software almost always misspell it at some point; I want to
bounce anything I can be 99% certain is spam.

 
shifty
Guest
Posts: n/a
 
      01-21-2005

Jim Gibson wrote:
> In article <(E-Mail Removed). com>,
> shifty <(E-Mail Removed)> wrote:
>
> Yes, it does work, but it could be simplified:


I'm still not sure how, though. Seriously, I've noticed it works for
everything but microsof+ (a non-word character at the end of the
expression! You actually noted this.)

> 1. It is useless to have .* at the beginning and end of the regex.


For the purpose it's being used (spam filter rule), it is necessary.

> 2. It is useless to group with (?: ... ) in this case


You're right ... I was doing this because I didn't want to capture the
match.

> 3. You don't need all of the plus signs unless you expect repeated
> characters.


I do. Spam emails with "hacked" words often use repeat characters to
fool keyword filtering.

> 9. Don't forget $ as a replacement for s; $ needs escaping in
> double-quote context of a regular expression.


Thanks, missed that one. I hadn't even thought about it. I was
running through an ASCII character map to look at similar
characters...dunno how I missed the $ sign.

>
> With all of the above points in mind, I would suggest the following:
>
> my $regex = qr(
> (?:\b|\s)
> (?!microsoft)
> m
> [i1l\\\|!]
> [C]
> r
> [o0]
> [s\$]
> [o0]
> f
> [t+]
> (?:\b|\s)
> )ix;
>


Thanks! I'm going to play with your suggestion for a bit, I think this
should work. I need to make some versions for pharmaceutical spam as
well. Should work perfect!


> Are you looking for other approximations such as 'microsloth' and
> 'microsquash'?


Nah, because spammers don't usually do things like that.

Thanks again for your insight. Couldn't have asked for a more perfect
answer!
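
The microsof+ failure mentioned earlier in this post comes from the trailing \b: there is no word boundary between "+" and a following space or the end of the string. The sketch below contrasts \b with a lookaround that only requires "no word character on either side". The lookaround variant is an assumption of this example, not something proposed in the thread; Jim's (?:\b|\s) covers the case where whitespace follows.

```perl
use strict;
use warnings;

# Core pattern, loosely following the character classes from the thread.
my $core = qr/(?!microsoft)m+[i1l\\|!]+c+r+[o0]+[s\$]+[o0]+f+[t+]+/i;

my $with_boundary   = qr/\b${core}\b/;            # word boundaries, as in the post
my $with_lookaround = qr/(?<!\w)${core}(?!\w)/;   # assumed alternative

# Hypothetical helper: 1 if $text matches $re, else 0.
sub hit {
    my ($re, $text) = @_;
    return $text =~ $re ? 1 : 0;
}

for my $sample ('microsof+', 'win microsof+ keys', 'm1cr0s0ft', 'microsoft') {
    printf "%-20s \\b:%d lookaround:%d\n",
        $sample, hit($with_boundary, $sample), hit($with_lookaround, $sample);
}
```

Both versions still reject the proper spelling. Only the lookaround version accepts variants that end in a non-word character.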

 
Alan J. Flavell
Guest
Posts: n/a
 
      01-21-2005
On Fri, 21 Jan 2005, shifty wrote:

> Jim Gibson wrote:
>
> > 2. It is useless to group with (?: ... ) in this case
>
> You're right ... I was doing this because I didn't want to capture the
> match.


I think Jim means that the negative-lookahead syntax is itself
non-capturing, despite the parentheses - so you didn't need to nullify
the capturing anyway.

If you already realised that - apologies in advance.
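
A small Perl sketch of that point, counting capture groups via the built-in @+ array after a successful match. The helper name group_count and the simplified pattern are assumptions for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: number of capture groups seen by a successful match,
# or -1 if the pattern does not match at all.
sub group_count {
    my ($re, $text) = @_;
    return -1 unless $text =~ $re;
    return scalar(@+) - 1;    # @+ holds the whole match plus one slot per group
}

my $word = 'm1cr0s0ft';
printf "%d\n", group_count(qr/(?!microsoft)m[i1]cr[o0]s[o0]ft/i,     $word);  # lookahead only
printf "%d\n", group_count(qr/(?:(?!microsoft)m[i1]cr[o0]s[o0]ft)/i, $word);  # extra (?: ... )
printf "%d\n", group_count(qr/((?!microsoft)m[i1]cr[o0]s[o0]ft)/i,   $word);  # real capture group
```

The first two patterns report no capture groups, so wrapping the lookahead in (?: ... ) changes nothing about capturing.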

No, I don't know where to raise questions specifically about regexes,
either. But the Perl regulars seem quite a bit more tolerant of
off-topically regex-related questions here, than they are about
off-topically CGI questions here :-}
 
Anno Siegel
Guest
Posts: n/a
 
      01-21-2005
shifty <(E-Mail Removed)> wrote in comp.lang.perl.misc:
>
> > If the syntax weren't correct it wouldn't compile. What you are asking is
> > whether it does what you want it to do, which is about semantics.
>
> For the purpose it's being used, it is not necessary to compile the
> regex. It's being accessed from an outside resource (spam filter).


Something is going to compile it. Every regex engine in existence
does that.
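
To illustrate "compile" in this sense: Perl translates a pattern into an internal form before matching, and a malformed pattern is rejected at that step. A minimal sketch, with a hypothetical helper named compiles and made-up sample patterns:

```perl
use strict;
use warnings;

# Hypothetical helper: try to compile a pattern supplied as a string and
# report whether the regex engine accepted it.
sub compiles {
    my ($pattern) = @_;
    my $re = eval { qr/$pattern/ };   # qr// dies if the pattern is malformed
    return defined $re ? 1 : 0;
}

printf "%d\n", compiles('(?!microsoft)m[i1]cr[o0]s[o0]ft');  # well-formed
printf "%d\n", compiles('(?!microsoft');                     # unmatched parenthesis
printf "%d\n", compiles('microsfot');                        # well-formed, but a typo
```

The last pattern compiles even though it is not what the author meant, which is the syntax-versus-semantics distinction being drawn here.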

My point was the misuse of "syntax" for "correct code". It's becoming a
sore spot.

> > Is there any reason why you want to use lookahead to exclude unaltered
> > strings like "microsoft"? Just skip those strings using an extra regex,
> > and concentrate on matching the altered variants.
>
> Yes. I don't want to bounce legitimate emails. Spam emails offering
> their software almost always misspell it at some point; I want to
> bounce anything I can be 99% certain is spam.


That's inconclusive, but since you didn't say what your spam filter
actually does with the regex, there's no way of telling.

Anno
 
shifty
Guest
Posts: n/a
 
      01-25-2005


> No, I don't know where to raise questions specifically about regexes,
> either. But the Perl regulars seem quite a bit more tolerant of
> off-topically regex-related questions here, than they are about
> off-topically CGI questions here :-}


For that, I'm really thankful. Nothing like getting your ass lit up by
someone when you truly mean well, look twice to make sure you're trying
to do the right thing, then you get flamed to holy hell for trying to
be as cautious and netiquette-oriented as possible.

 
shifty
Guest
Posts: n/a
 
      01-25-2005

> Something is going to compile it. Every regex engine in existence
> does that.


I would guess they're never compiled - regexes are interpreted, eh?
So, in essence, if I am writing a regex for perl in particular (we'll
keep it on-topic), perl is an interpreted language and so is a regex,
so it's processed on the fly instead of compiling it into an object for
future use. Unless I'm misinterpreting your use of "compile". If so,
I have a true interest in understanding if you don't mind explaining.


> My point was the misuse of "syntax" for "correct code". It's becoming a
> sore spot.


My apologies. I think we have conflicting views on what a regex really
is. To me, a regex is a sentence or formula which expresses any number
of meanings. Without the correct character pattern (and/or placement)
within the text (and/or string), you don't have a correct statement.

If you don't produce a correct statement because one or more characters
are misplaced, is it a syntax error or a code error?

> That's inconclusive, but since you didn't say what your spam filter
> actually does with the regex, there's no way of telling.


I use these regex expressions for both SpamAssassin and Vamsoft's Open
Relay Filter EE. Depends on which mailserver I'm dealing with
(personal, co-hosted or business). I primarily do more administration
and hosting type stuff than I do programming - if that's not blatantly
obvious already.
Thanks for your input, looking forward to clarification.

>
> Anno


 
xhoster@gmail.com
Guest
Posts: n/a
 
      01-26-2005
"Alan J. Flavell" <(E-Mail Removed)> wrote:
> On Fri, 21 Jan 2005, shifty wrote:
>
> > Jim Gibson wrote:
>
> > > 2. It is useless to group with (?: ... ) in this case
> >
> > You're right ... I was doing this because I didn't want to capture the
> > match.
>
> I think Jim means that the negative-lookahead syntax is itself
> non-capturing, despite the parentheses - so you didn't need to nullify
> the capturing anyway.
>
> If you already realised that - apologies in advance.
>
> No, I don't know where to raise questions specifically about regexes,
> either. But the Perl regulars seem quite a bit more tolerant of
> off-topically regex-related questions here, than they are about
> off-topically CGI questions here :-}


That's probably because CGI is a complete specification of its own,
independent of Perl; while Perl regexes are not independent of Perl.
People who ask here about the quirks of Java or .net regexes do
get a chilly reception.

Xho

 