[OT] a little about regex

 
 
Fulvio
      10-18-2006
***********************
Your mail has been scanned by InterScan MSS.
***********************


Hello,

I'm trying to get an assertion working that filters addresses from a given
domain, except when the domain is prefixed by '.com'.
Even when I negate the test I can't get the result I want.

The idea, in program terms:

>>> def filter(adr):
...     import re
...     allow = re.compile('.*\.my(>|$)')
...     deny = re.compile('.*\.com\.my(>|$)')
...     cnt = 0
...     if deny.search(adr): cnt += 1
...     if allow.search(adr): cnt += 1
...     return cnt
...
>>> filter('(E-Mail Removed)')
2
>>> filter('(E-Mail Removed)')
1
>>>


It seems I'm missing a better regex that keeps both filters from firing on
the same address. I'm thinking of a lookbehind (negative or positive), but I
haven't managed to make it work yet.
I think either the allow pattern should require that there is no '.com'
before '.my', or the deny pattern should match _only_ when '.com' comes
before '.my'. Sorry, I can't get the syntax right.

Suggestions are welcome.

F


 
Ron Adam
10-18-2006

Instead of using two separate ifs, use an if/elif and be sure to test the
narrower filter first (you already have them in the correct order). That way
it will skip the more general filter and not increment cnt twice.
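
A minimal sketch of that if/elif ordering, assuming the two patterns from the original post (written here as raw strings). The name count_filters and the sample addresses in the note below are made up for illustration.

```python
import re

# Patterns from the original post, written as raw strings.
DENY = re.compile(r'.*\.com\.my(>|$)')   # narrower: only .com.my
ALLOW = re.compile(r'.*\.my(>|$)')       # broader: anything ending in .my

def count_filters(adr):
    """Count at most one matching filter, testing the narrower one first."""
    cnt = 0
    if DENY.search(adr):
        cnt += 1
    elif ALLOW.search(adr):   # only reached when DENY did not match
        cnt += 1
    return cnt
```

With this ordering an address such as 'user@example.com.my' is counted once, by the deny branch, instead of twice.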

It's not exactly clear what output you are seeking. If you want 0 for not
filtered and 1 for filtered, then look at Fred's hint.

Or are you writing a test at the moment, where a 1 means the address matched
only one filter, so you know your filters are working as designed?

Another approach would be to assign values for filtered, accepted, and undefined
and set those accordingly instead of incrementing and decrementing a counter.
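
One way this could look, as a sketch only: the three label values and the function name classify are hypothetical, and the patterns are again the ones from the original post.

```python
import re

DENY = re.compile(r'.*\.com\.my(>|$)')
ALLOW = re.compile(r'.*\.my(>|$)')

# Hypothetical labels; the thread does not define concrete values.
FILTERED, ACCEPTED, UNDEFINED = 'filtered', 'accepted', 'undefined'

def classify(adr):
    """Return a named outcome instead of a counter."""
    if DENY.search(adr):      # narrower filter first
        return FILTERED
    if ALLOW.search(adr):
        return ACCEPTED
    return UNDEFINED          # neither pattern matched
```

The caller can then compare against a name rather than interpret 0, 1 or 2.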

Cheers,
Ron



 
Rob Wolfe
10-18-2006

Fulvio wrote:

> I'm trying to get an assertion working that filters addresses from a given
> domain, except when the domain is prefixed by '.com'.
> Even when I negate the test I can't get the result I want.


[...]

> It seems I'm missing a better regex that keeps both filters from firing on
> the same address. I'm thinking of a lookbehind (negative or positive), but I
> haven't managed to make it work yet.
> I think either the allow pattern should require that there is no '.com'
> before '.my', or the deny pattern should match _only_ when '.com' comes
> before '.my'. Sorry, I can't get the syntax right.
>
> Suggestions are welcome.


Try this:

def filter(adr):  # note that "filter" is a builtin function also
    import re

    allow = re.compile(r'.*(?<!\.com)\.my(>|$)')  # negative lookbehind
    deny = re.compile(r'.*\.com\.my(>|$)')
    cnt = 0
    if deny.search(adr): cnt += 1
    if allow.search(adr): cnt += 1
    return cnt
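
To see the effect of the lookbehind, here is a small check using the two patterns above. The sample addresses are invented (the real ones were removed from the thread), and the names samples and results are only for this illustration.

```python
import re

allow = re.compile(r'.*(?<!\.com)\.my(>|$)')  # .my not directly preceded by .com
deny = re.compile(r'.*\.com\.my(>|$)')        # .com.my only

# Made-up sample addresses, not the ones from the thread.
samples = ['user@example.my', 'user@example.com.my', '<user@example.my>']

# For each address: (does allow match?, does deny match?)
results = {adr: (bool(allow.search(adr)), bool(deny.search(adr)))
           for adr in samples}
```

With the lookbehind in place, no sample address is matched by both patterns, so the counter in the function above can no longer reach 2 for these inputs.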


HTH,
Rob

 
Fulvio
10-18-2006

On Wednesday 18 October 2006 16:43, Rob Wolfe wrote:

> def filter(adr):  # note that "filter" is a builtin function also
>     import re


I didn't know that, but my function name _does_ start with an underscore (a
bit of localization).

>     allow = re.compile(r'.*(?<!\.com)\.my(>|$)')  # negative lookbehind
>     deny = re.compile(r'.*\.com\.my(>|$)')


Great, it works perfectly. I found my errors.
I didn't put r in front of the patterns. I was also close to the 'allow'
pattern, but it gave no match and KregexEditor reported it as wrong, mainly
because of the '<' inside the pattern. I think that is not normal regex
input and is only valid in Python. Am I right?

More details are in the previous thread.

F


 
Fulvio
10-18-2006

On Wednesday 18 October 2006 15:32, Ron Adam wrote:

> Instead of using two separate ifs, use an if/elif and be sure to test


Thank you, Ron, for the input.
I'll look into that approach as well. Meanwhile I faced the total disaster of
deleting all my emails from all the servers ;(
(Luckily I had saved them locally.)

> It's not exactly clear what output you are seeking. If you want 0 for
> not filtered and 1 for filtered, then look at Fred's hint.


Actually, the return value is used like this:

if _filter(hdrs,allow,deny):
    # allow and deny are objects prepared by re.compile(pattern)
    _del(Num_of_Email)

In short, a nonzero result means the message is unwanted and gets deleted.
And now the function is:

def _filter(msg,al,dn):
    """Try to classify a list of header lines against a set of compiled
    patterns."""
    a = 0
    for hdrline in msg:
        # deny has first priority and stops any further searching;
        # it scores 10 times the number of lines
        if dn.search(hdrline): return len(msg) * 10
        if al.search(hdrline): return 0
        a += 1
    return a  # returns a score of rejected matches, or zero if none


The patterns are taken from a configuration file. Entries named Axx = 'pattern'
are allow patterns; the others, named Dxx, block under different criteria.
Here they are:

[Filters]
A01 = ^From:.*\.it\b
A02 = ^(To|Cc):.*frioio@
A03 = ^(To|Cc):.*the_sting@
A04 = ^(To|Cc):.*calm_me_or_die@
A05 = ^(To|Cc):.*further@
A06 = ^From:.*\.za\b
D01 = ^From:.*\.co\.au\b
D02 = ^Subject:.*\*\*\*SPAM\*\*\*

*Slightly faked to preserve some privacy*
I'm using configparser to fetch their values, and they are joined by:

allow = re.compile('|'.join([k[1] for k in ifil if k[0] is 'a']))
deny = re.compile('|'.join([k[1] for k in ifil if k[0] is 'd']))

ifil is the input filter's section.
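
A hedged sketch of how such a join might be made more robust. It assumes ifil is the list of (name, pattern) pairs that configparser returns from items(), which the post does not confirm. Under that assumption the first element of each pair is the whole option name (lowercased by default), so a prefix test fits better than comparing with a single letter. Note also that `is` tests object identity rather than string equality. The abbreviated config text and the name CONFIG_TEXT are for illustration only.

```python
import configparser
import re

# Abbreviated copy of the [Filters] section shown above.
CONFIG_TEXT = r"""
[Filters]
A01 = ^From:.*\.it\b
A02 = ^(To|Cc):.*frioio@
D01 = ^From:.*\.co\.au\b
D02 = ^Subject:.*\*\*\*SPAM\*\*\*
"""

# interpolation=None keeps '%' literal; only '=' separates name and value,
# so the ':' inside the patterns is left alone.
parser = configparser.ConfigParser(interpolation=None, delimiters=('=',))
parser.read_string(CONFIG_TEXT)
ifil = parser.items('Filters')   # [(name, pattern), ...], names lowercased

# Join all allow patterns into one alternation, and all deny patterns into another.
allow = re.compile('|'.join(pat for name, pat in ifil if name.startswith('a')))
deny = re.compile('|'.join(pat for name, pat in ifil if name.startswith('d')))
```

Because alternation has the lowest precedence in a regex, each joined branch keeps its own leading '^'.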

At this point I suppose I have done the right thing. I'm just a bit curious
whether there is a better way, with a single compiled regex for all of the
options.
Basically the program works: it filters as per the config and synchronizes
with the local $HOME/Mail/trash (configurable path). This last option removes
emails from the server when they are in the local trash.
Todo = back up local and remote emails for those filtered as good.
       multithreading, to connect to all servers in parallel
       SSL for POP3 and IMAP4 as well
At the moment I have a problem issuing the command that tells the IMAP server
to flag as "Deleted" the messages that count as spam. I only know the message
details; what the correct command is remains a bit obscure to me.
BTW, who's Fred?

F


 
Ant
10-18-2006
Rob Wolfe wrote:
....
> def filter(adr):  # note that "filter" is a builtin function also
>     import re
>
>     allow = re.compile(r'.*(?<!\.com)\.my(>|$)')  # negative lookbehind
>     deny = re.compile(r'.*\.com\.my(>|$)')
>     cnt = 0
>     if deny.search(adr): cnt += 1
>     if allow.search(adr): cnt += 1
>     return cnt


Which makes the 'deny' code here redundant so in this case the function
could be reduced to:

import re

def allow(adr):
    allow = re.compile(r'.*(?<!\.com)\.my(>|$)')  # negative lookbehind
    if allow.search(adr):
        return True
    return False

Though having the explicit allow and deny expressions may make what's
going on clearer than the fairly esoteric negative lookbehind.

 
Rob Wolfe
10-19-2006

Fulvio wrote:

> Great, it works perfectly. I found my errors.
> I didn't put r in front of the patterns. I was also close to the 'allow'
> pattern, but it gave no match and KregexEditor reported it as wrong, mainly
> because of the '<' inside the pattern. I think that is not normal regex
> input and is only valid in Python. Am I right?


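The sequences of the form "(?...)" are what the Python documentation calls
extension notation. The syntax is not specific to Python, though: it comes
from Perl, and not every regex tool supports all of it.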

Regards,
Rob

 
Fulvio
10-19-2006
On Wednesday 18 October 2006 23:05, Ant wrote:
>     allow = re.compile(r'.*(?<!\.com)\.my(>|$)')  # negative lookbehind
>     if allow.search(adr):
>         return True
>     return False


I'd point out that:

    allow = re.search(r'.*(?<!\.com)\.my(>|$)', adr)

does the same as yours, since the module-level re.search() compiles the
pattern itself, whereas your code does the compilation as a separate step.
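
A small sketch comparing the two forms. The function names allow_compiled and allow_direct and the sample addresses are invented for illustration. The re module caches recently compiled patterns, so the module-level call does not recompile the pattern on every use.

```python
import re

PATTERN = r'.*(?<!\.com)\.my(>|$)'

# Compiled once, explicitly, then reused on every call.
_ALLOW_RX = re.compile(PATTERN)

def allow_compiled(adr):
    return bool(_ALLOW_RX.search(adr))

# Module-level call: re compiles the pattern internally and caches it.
def allow_direct(adr):
    return bool(re.search(PATTERN, adr))

samples = ['user@example.my', 'user@example.com.my', 'user@example.org']
same = all(allow_compiled(a) == allow_direct(a) for a in samples)
```

Both functions give the same answers; the explicit compile mainly helps readability when a pattern is reused in many places.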

> Though having the explicit allow and deny expressions may make what's
> going on clearer than the fairly esoteric negative lookbehind.


This makes me think that your point is quite right.
In my case the policy is meant to be "deny everything except what is
specified", though it may also go the other way round. So I should refine the
way the filtering works. In fact the score, which is ignored for now, is the
basis of the method to be applied.
Here we are mainly talking about email addresses, so my intention is
something like the mailfilter concept: the program may block an entire domain
while allowing some addresses in it, or allow everything from ".my" except
addresses from ".com.my" (mostly annoying emails).

All in all, I'm aiming for a flexible program, since I think it may be
published some day for the benefit of users on any platform, as Python itself
is multiplatform.
With that in mind, I'm curious whether there are sites on the web where small
programs are welcome and people like me can display all their ignorance about
how to use Python. With such ignorance I could compete for the Nobel Prize.

Also, the newsgroup doesn't seem to consider splitting into beginners and
high-level programmers (HLP). Of course the HLP would still be welcome to
join the discussion in such an NG.

F
 
Ron Adam
10-19-2006
Fulvio wrote:
> On Wednesday 18 October 2006 15:32, Ron Adam wrote:
>
>> Instead of using two separate ifs, use an if/elif and be sure to test
>
> Thank you, Ron, for the input.
> I'll look into that approach as well. Meanwhile I faced the total disaster of
> deleting all my emails from all the servers ;(
> (Luckily I had saved them locally.)
>
>> It's not exactly clear what output you are seeking. If you want 0 for
>> not filtered and 1 for filtered, then look at Fred's hint.
>
> Actually, the return value is used like this:
>
> if _filter(hdrs,allow,deny):
>     # allow and deny are objects prepared by re.compile(pattern)
>     _del(Num_of_Email)
>
> In short, a nonzero result means the message is unwanted and gets deleted.
> And now the function is:
>
> def _filter(msg,al,dn):
>     """Try to classify a list of header lines against a set of compiled
>     patterns."""
>     a = 0
>     for hdrline in msg:
>         # deny has first priority and stops any further searching;
>         # it scores 10 times the number of lines
>         if dn.search(hdrline): return len(msg) * 10
>         if al.search(hdrline): return 0
>         a += 1
>     return a  # returns a score of rejected matches, or zero if none


I see, is this a cleanup script to remove the least wanted items?

The allow/deny naming made me think it was more along the lines of a
white/black list, whereas keep/discard would be more suitable terms for
cleaning out items already allowed.

Or is it a bit of both? Why the score?

Just curious, I don't think I have any suggestions that will help in any
specific ways.

I would think the allow(keep?) filters would always have priority over deny filters.


> The patterns are taken from a configuration file. Entries named Axx = 'pattern'
> are allow patterns; the others, named Dxx, block under different criteria.
> Here they are:
>
> [Filters]
> A01 = ^From:.*\.it\b
> A02 = ^(To|Cc):.*frioio@
> A03 = ^(To|Cc):.*the_sting@
> A04 = ^(To|Cc):.*calm_me_or_die@
> A05 = ^(To|Cc):.*further@
> A06 = ^From:.*\.za\b
> D01 = ^From:.*\.co\.au\b
> D02 = ^Subject:.*\*\*\*SPAM\*\*\*
>
> *Slightly faked to preserve some privacy*
> I'm using configparser to fetch their values, and they are joined by:
>
> allow = re.compile('|'.join([k[1] for k in ifil if k[0] is 'a']))
> deny = re.compile('|'.join([k[1] for k in ifil if k[0] is 'd']))
>
> ifil is the input filter's section.
>
> At this point I suppose I have done the right thing. I'm just a bit curious
> whether there is a better way, with a single compiled regex for all of the
> options.


I think keeping the allow filter separate from the deny filter is good.

You might be able to merge the header lines and run the filters across the whole
header at once instead of each line.
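
A rough sketch of that idea, using a few of the patterns quoted above. It assumes the header lines are joined with '\n' and relies on re.MULTILINE so that '^' matches at the start of every line. The function name classify_header and the returned labels are hypothetical.

```python
import re

# A few of the thread's patterns; MULTILINE lets '^' anchor at each line start,
# and '.' does not cross newlines, so each branch stays within one header line.
allow = re.compile(r'^From:.*\.it\b|^(To|Cc):.*frioio@', re.MULTILINE)
deny = re.compile(r'^From:.*\.co\.au\b|^Subject:.*\*\*\*SPAM\*\*\*', re.MULTILINE)

def classify_header(lines):
    """Run each filter once over the whole header instead of line by line."""
    header = '\n'.join(lines)
    if deny.search(header):
        return 'deny'
    if allow.search(header):
        return 'allow'
    return 'unknown'
```

Unlike the per-line loop, the result here does not depend on which header line comes first: a deny match anywhere in the header wins.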

> Basically the program works: it filters as per the config and synchronizes
> with the local $HOME/Mail/trash (configurable path). This last option removes
> emails from the server when they are in the local trash.
> Todo = back up local and remote emails for those filtered as good.
>        multithreading, to connect to all servers in parallel
>        SSL for POP3 and IMAP4 as well
> At the moment I have a problem issuing the command that tells the IMAP server
> to flag as "Deleted" the messages that count as spam. I only know the message
> details; what the correct command is remains a bit obscure to me.


I can't help you here. Sorry.

> BTW, who's Fred?
>
> F


Fredrik see...

news://news.cox.net:119/(E-Mail Removed)


 
Fulvio
10-20-2006
On Friday 20 October 2006 02:40, Ron Adam wrote:
> I see, is this a cleanup script to remove the least wanted items?


Yes. It will probably stay this way for a while.
I'm not prepared to come up with a new algorithm.

> Or is it a bit of both? Why the score?


As explained in another post, there should be a way to define a deny/allow
rule with some particular exception (e.g. deny all ".com" but not
(E-Mail Removed)).

> I would think the allow(keep?) filters would always have priority over deny
> filters.


It's a matter of how finely the filter can discriminate; the previous post
brought this point up. Say I want to allow all ".uk" but not ".info.uk" (all
references are purely meant as examples). A deny regex for ".info.uk" surely
won't match a plain ".uk" address.


> I think keeping the allow filter separate from the deny filter is good.

I agree with you. I was simply assuming that a regex could do negative matching.

> You might be able to merge the header lines and run the filters across the
> whole header at once instead of each line.


I like this idea; I still need to think a bit about how to code it. I need to
work out the right separator between fields, otherwise it may cause problems
with different charsets.

> > Actually I've problem on issuing the command to imap server to flag
> > "Deleted" the message which count as spam. I only know the message
>
> I can't help you here. Sorry.


Found it, by trial and error.
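
The post does not say which command turned out to work. For readers with the same question, one common way with the standard imaplib module is sketched below. It assumes conn is an imaplib.IMAP4 (or IMAP4_SSL) object that is already logged in with a mailbox selected; the function name mark_deleted and its parameters are invented for this example.

```python
import imaplib

def mark_deleted(conn, msg_num, expunge=False):
    """Set the \\Deleted flag on one message of the selected mailbox.

    conn    -- assumed to be an imaplib.IMAP4 instance, logged in, mailbox selected
    msg_num -- message number as a string, e.g. '3'
    """
    status, _ = conn.store(msg_num, '+FLAGS', '\\Deleted')
    if status == 'OK' and expunge:
        conn.expunge()   # permanently removes messages flagged \Deleted
    return status == 'OK'
```

Flagging only marks the message; it stays on the server until the mailbox is expunged.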

> > BTW whose Fred?
>
> news://news.cox.net:119/(E-Mail Removed)


I can't reach newsgroup servers other than the one my ISP gives me. I'm
curious, though, and I'll give it a try.

F

 