Velocity Reviews - Computer Hardware Reviews

compound regex

 
 
spir (Guest)
02-09-2009
Hello,

(new here)

Below is an extension to the standard module re. The point is to allow writing and testing sub-expressions individually, then nesting them into a super-expression. It is more or less like using a parser generator, but keeps the regex grammar and power.
I used the format {sub_expr_name}: since in standard regexes {} is only used to express a repetition count, a pair of curly braces enclosing an identifier should not conflict.
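To illustrate why the {name} format should not clash with repetition counts, here is a small sketch. The format string `fmt` is made up for the illustration; only the placeholder pattern (with the braces escaped) comes from the code in this post.

```python
import re

# Placeholder syntax from the post: an identifier wrapped in curly braces.
placeholder = re.compile(r"\{[a-zA-Z_][a-zA-Z_0-9]*\}")

# Hypothetical format string mixing repetition counts and one placeholder.
fmt = r"#\S{5} at {time}, repeated {2,3} times"

# Repetition counts start with a digit, so they are not valid identifiers
# and the placeholder pattern skips them.
found = placeholder.findall(fmt)
print(found)
```

Only {time} is picked up; {5} and {2,3} are left for the regex engine to interpret as usual.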

The extension is new and barely tested. I would welcome comments, criticism, etc., and I would like to know whether you find such a feature useful. You will probably find the code simple enough.

Denis
------
la vida e estranya

===============
# coding: utf-8

''' super_regex

Define & check sub-patterns individually,
then include them in a global super-pattern.

Uses the format {name} for inclusion:
    sub1 = Regex(...)
    sub2 = Regex(...)
    super_format = "...{sub1}...{sub2}..."
    # final regex object:
    super_regex = superRegex(super_format)
'''

from re import compile as Regex

# sub-pattern inclusion format: {identifier}
sub_pattern = Regex(r"\{[a-zA-Z_][a-zA-Z_0-9]*\}")

# sub-pattern expander
def sub_pattern_expansion(inclusion, dic=None):
    # strip the surrounding braces to get the sub-pattern name
    name = inclusion.group()[1:-1]
    # namespace dict may be specified -- else this module's globals()
    if dic is None:
        dic = globals()
    if name not in dic:
        raise NameError("Cannot find sub-pattern '%s'." % name)
    return dic[name].pattern

# super-pattern generator
def superRegex(format):
    expanded_format = sub_pattern.sub(sub_pattern_expansion, format)
    return Regex(expanded_format)

if __name__ == "__main__":  # purely artificial example use
    # pattern
    time = Regex(r"\d\d:\d\d:\d\d")  # hh:mm:ss
    code = Regex(r"\S{5}")           # non-whitespace x 5
    desc = Regex(r"[\w\s]+$")        # alphanum|space --> EOL
    ref_format = "^ref: {time} #{code} --- {desc}"
    ref_regex = superRegex(ref_format)
    # output
    print('super pattern:\n"%s" ==>\n"%s"\n' % (ref_format, ref_regex.pattern))
    text = "ref: 12:04:59 #%+.?% --- foo 987 bar"
    result = ref_regex.match(text)
    print('text: "%s" ==>\n"%s"' % (text, result.group()))
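
One possible variation, offered only as a sketch and not as part of the module above: pass the namespace explicitly instead of relying on globals(), and wrap each expanded sub-pattern in a non-capturing group so that an alternation inside a sub-pattern, or a quantifier placed after a placeholder, applies to the whole sub-pattern. The names `compose`, `PLACEHOLDER` and `subs` are hypothetical. Note that, as in the original, only the .pattern string is reused, so any flags given to a sub-pattern are not carried over.

```python
import re

# Same placeholder syntax as in the post, with a group around the name.
PLACEHOLDER = re.compile(r"\{([a-zA-Z_][a-zA-Z_0-9]*)\}")

def compose(fmt, namespace):
    """Expand {name} placeholders in fmt using compiled patterns from namespace."""
    def expand(match):
        name = match.group(1)
        if name not in namespace:
            raise NameError("Cannot find sub-pattern '%s'." % name)
        # Non-capturing group keeps the sub-pattern a single unit.
        return "(?:%s)" % namespace[name].pattern
    return re.compile(PLACEHOLDER.sub(expand, fmt))

# Hypothetical sub-patterns, each testable on its own.
subs = {
    "sign": re.compile(r"\+|-"),
    "digits": re.compile(r"\d+"),
}
number = compose(r"^{sign}?{digits}$", subs)
print(number.pattern)
```

Without the grouping, the expansion would read ^\+|-?\d+$, where the alternation splits the whole expression rather than just the sign.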
 