Regular Expression - old regex module vs. re module

 
 
Steve
06-29-2006
Hi All,

I'm having a tough time converting the following regex.compile patterns
into the new re.compile format. There are also differences between
regsub.sub() and re.sub().

Could anyone lend a hand?


import regsub
import regex

import re # << need conversion to this module

.....

"""Convert perl style format symbology to printf tokens.

Take a string and substitute computed printf tokens for perl style
format symbology.

For example:

###.## yields %6.2f
######## yields %8d
<<<<< yields %-5s
"""


exponentPattern = regex.compile('\(^\|[^\\#]\)\(#+\.#+\*\*\*\*\)')
floatPattern = regex.compile('\(^\|[^\\#]\)\(#+\.#+\)')
integerPattern = regex.compile('\(^\|[^\\#]\)\(##+\)')
leftJustifiedStringPattern = regex.compile('\(^\|[^\\<]\)\(<<+\)')
rightJustifiedStringPattern = regex.compile('\(^\|[^\\>]\)\(>>+\)')



while 1: # process all integer fields
    print("Testing Integer")
    if integerPattern.search(s) < 0: break
    print("Integer Match : ", integerPattern.search(s).span() )
    # i1 , i2 = integerPattern.regs[2]
    i1 , i2 = integerPattern.search(s).span()
    width_total = i2 - i1
    f = '%'+`width_total`+'d'
    # s = regsub.sub(integerPattern, '\\1'+f, s)
    s = integerPattern.sub(f, s)
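
For comparison, here is a minimal sketch of how the integer loop above might look with the re module. It assumes the old patterns use the Emacs-style syntax, where \( \) group and \| alternates, so they become plain ( ) and | in re. The function name convert_integers is made up for illustration. As far as I recall, regsub.sub() replaced only the first match, which count=1 mimics here.

```python
import re

# Old-style \( \) and \| become plain ( ) and | in re syntax.
# Group 1 is the guard (start of string, or any char other than '\' or '#');
# group 2 is the run of two or more '#' characters.
integerPattern = re.compile(r'(^|[^\\#])(##+)')

def convert_integers(s):
    """Replace each run of '#' with a %Nd token (illustrative helper)."""
    while True:
        m = integerPattern.search(s)
        if m is None:          # re returns None; the old search() returned -1
            break
        i1, i2 = m.span(2)     # stands in for integerPattern.regs[2]
        f = '%' + str(i2 - i1) + 'd'
        # put the guard character (group 1) back and replace one match only
        s = integerPattern.sub('\\1' + f, s, count=1)
    return s

print(convert_integers("qty ######## id ###"))
```

Group 1 is written back with \1 so the character in front of the field is not swallowed, which is what the commented-out regsub.sub() call did.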



Thanks in advance!

Steve

 
Jim Segrave
06-30-2006


Perhaps not optimal, but this processes things as requested. Note that
all floats have to be done before any integer patterns are replaced.

==========================
#!/usr/local/bin/python

import re

"""Convert perl style format symbology to printf tokens.
Take a string and substitute computed printf tokens for perl style
format symbology.

For example:

###.## yields %6.2f
######## yields %8d
<<<<< yields %-5s
"""


# handle cases where there's no integer or no fractional chars
floatPattern = re.compile(r'(?<!\\)(#+\.(#*)|\.(#+))')
integerPattern = re.compile(r'(?<![\\.])(#+)(?![.#])')
leftJustifiedStringPattern = re.compile(r'(?<!\\)(<+)')
rightJustifiedStringPattern = re.compile(r'(?<!\\)(>+)')

def float_sub(matchobj):
    # fractional part may be in either groups()[1] or groups()[2]
    if matchobj.groups()[1] is not None:
        return "%%%d.%df" % (len(matchobj.groups()[0]),
                             len(matchobj.groups()[1]))
    else:
        return "%%%d.%df" % (len(matchobj.groups()[0]),
                             len(matchobj.groups()[2]))


def unperl_format(s):
    changed_things = 1
    while changed_things:
        # lather, rinse and repeat until nothing new happens
        changed_things = 0

        mat_obj = leftJustifiedStringPattern.search(s)
        if mat_obj:
            s = re.sub(leftJustifiedStringPattern, "%%-%ds" %
                       len(mat_obj.groups()[0]), s, 1)
            changed_things = 1

        mat_obj = rightJustifiedStringPattern.search(s)
        if mat_obj:
            s = re.sub(rightJustifiedStringPattern, "%%%ds" %
                       len(mat_obj.groups()[0]), s, 1)
            changed_things = 1

        # must do all floats before ints
        mat_obj = floatPattern.search(s)
        if mat_obj:
            s = re.sub(floatPattern, float_sub, s, 1)
            changed_things = 1
            # don't fall through to the int code
            continue

        mat_obj = integerPattern.search(s)
        if mat_obj:
            s = re.sub(integerPattern, "%%%dd" % len(mat_obj.groups()[0]),
                       s, 1)
            changed_things = 1
    return s

if __name__ == '__main__':
    testarray = ["integer: ####, integer # integer at end #",
                 "float ####.## no decimals ###. no int .### at end ###.",
                 "Left string <<<<<< short left string <",
                 "right string >>>>>> short right string >",
                 "escaped chars \\#### \\####.## \\<\\<<<< \\>\\><<<"]

    for s in testarray:
        print("Testing: %s" % s)
        print("Result: %s" % unperl_format(s))
        print("")

======================

Running this gives

Testing: integer: ####, integer # integer at end #
Result: integer: %4d, integer %1d integer at end %1d

Testing: float ####.## no decimals ###. no int .### at end ###.
Result: float %7.2f no decimals %4.0f no int %4.3f at end %4.0f

Testing: Left string <<<<<< short left string <
Result: Left string %-6s short left string %-1s

Testing: right string >>>>>> short right string >
Result: right string %6s short right string %1s

Testing: escaped chars \#### \####.## \<\<<<< \>\><<<
Result: escaped chars \#%3d \#%6.2f \<\<%-3s \>\>%-3s



--
Jim Segrave

 
Paul McGuire
06-30-2006


Not an re solution, but pyparsing makes for an easy-to-follow program.
TransformString only needs to scan through the string once - the
"reals-before-ints" testing is factored into the definition of the
formatters variable.

Pyparsing's project wiki is at http://pyparsing.wikispaces.com.

-- Paul

-------------------
from pyparsing import *

"""
read Perl-style formatting placeholders and replace with
proper Python %x string interp formatters

###### -> %6d
##.### -> %6.3f
<<<<< -> %-5s
>>>>> -> %5s


"""

# set up patterns to be matched - Word objects match character groups
# made up of characters in the Word constructor; Combine forces
# elements to be adjacent with no intervening whitespace
# (note use of results name in realFormat, for easy access to
# decimal places substring)
intFormat = Word("#")
realFormat = Combine(Word("#")+"."+
Word("#").setResultsName("decPlaces"))
leftString = Word("<")
rightString = Word(">")

# define parse actions for each - the matched tokens are the third
# arg to parse actions; parse actions will replace the incoming tokens with
# value returned from the parse action
intFormat.setParseAction( lambda s,l,toks: "%%%dd" % len(toks[0]) )
realFormat.setParseAction( lambda s,l,toks: "%%%d.%df" %
(len(toks[0]),len(toks.decPlaces)) )
leftString.setParseAction( lambda s,l,toks: "%%-%ds" % len(toks[0]) )
rightString.setParseAction( lambda s,l,toks: "%%%ds" % len(toks[0]) )

# collect all formatters into a single "grammar"
# - note reals are checked before ints
formatters = rightString | leftString | realFormat | intFormat

# set up our test string, and use transform string to invoke parse actions
# on any matched tokens
testString = """
This is a string with
ints: #### # ###############
floats: #####.# ###.###### #.#
left-justified strings: <<<<<<<< << <
right-justified strings: >>>>>>>>>> >> >
int at end of sentence: ####.
"""
print formatters.transformString( testString )

-------------------
Prints:

This is a string with
ints: %4d %1d %15d
floats: %7.1f %10.6f %3.1f
left-justified strings: %-8s %-2s %-1s
right-justified strings: %10s %2s %1s
int at end of sentence: %4d.



 
Jim Segrave
06-30-2006


It fails for floats specified as ###. or .###: it outputs an integer
format and the decimal point separately. It also ignores \#, which
should prevent the '#' from being included in a format.



--
Jim Segrave

 
Paul McGuire
06-30-2006
Jim Segrave wrote:
> It fails for floats specified as ###. or .###: it outputs an integer
> format and the decimal point separately. It also ignores \#, which
> should prevent the '#' from being included in a format.


True. What is the spec for these formatting strings, anyway? I Googled a
while, and it does not appear that this is really a Perl string formatting
technique, despite the OP's comments to the contrary. And I'm afraid my
limited Regex knowledge leaves the OP's example impenetrable to me. I got
lost among the '\'s and parens.

I actually thought that "###." was *not* intended to be floating point, but
instead represented an integer before a sentence-ending period. You do have
to be careful of making *both* leading and trailing digits optional, or else
simple sentence punctuating periods will get converted to "%1f"!

As for *ignoring* "\#", it would seem to me we would rather convert this to
"#", since "#" shouldn't be escaped in normal string interpolation.
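
To make the idea of turning "\#" into "#" concrete without pyparsing, a small re-based sketch follows. The names ESCAPED and unescape are made up for illustration. This step would have to run after the format fields have been converted, otherwise the freed '#' would itself be read as a field.

```python
import re

# A backslash followed by one of the three special characters.
ESCAPED = re.compile(r'\\([#<>])')

def unescape(s):
    """Drop the backslash in front of an escaped '#', '<' or '>'."""
    return ESCAPED.sub(r'\1', s)

print(unescape(r"I want \##, please."))
```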

The following modified version adds handling for "\#", "\<" and "\>", and
real numbers with no integer part. The resulting program isn't radically
different from the first version. (I've highlighted the changes with "<==="
marks.)

-- Paul

------------------
from pyparsing import Combine,Word,Optional,Regex

"""
read Perl-style formatting placeholders and replace with
proper %x string interp formatters

###### -> %6d
##.### -> %6.3f
<<<<< -> %-5s
>>>>> -> %5s


"""

# set up patterns to be matched
# (note use of results name in realFormat, for easy access to
# decimal places substring)
intFormat = Word("#")
realFormat = Combine(Optional(Word("#"))+"."+ # <===
Word("#").setResultsName("decPlaces"))
leftString = Word("<")
rightString = Word(">")
escapedChar = Regex(r"\\[#<>]") # <===

# define parse actions for each - the matched tokens are the third
# arg to parse actions; parse actions will replace the incoming tokens with
# value returned from the parse action
intFormat.setParseAction( lambda s,l,toks: "%%%dd" % len(toks[0]) )
realFormat.setParseAction( lambda s,l,toks: "%%%d.%df" %
(len(toks[0]),len(toks.decPlaces)) )
leftString.setParseAction( lambda s,l,toks: "%%-%ds" % len(toks[0]) )
rightString.setParseAction( lambda s,l,toks: "%%%ds" % len(toks[0]) )
escapedChar.setParseAction( lambda s,l,toks: toks[0][1] ) # <===

# collect all formatters into a single "grammar"
# - note reals are checked before ints
formatters = rightString | leftString | realFormat | intFormat | escapedChar
# <===

# set up our test string, and use transform string to invoke parse actions
# on any matched tokens
testString = r"""
This is a string with
ints: #### # ###############
floats: #####.# ###.###### #.# .###
left-justified strings: <<<<<<<< << <
right-justified strings: >>>>>>>>>> >> >
int at end of sentence: ####.
I want \##, please.
"""

print testString
print formatters.transformString( testString )

------------------
Prints:

This is a string with
ints: #### # ###############
floats: #####.# ###.###### #.# .###
left-justified strings: <<<<<<<< << <
right-justified strings: >>>>>>>>>> >> >
int at end of sentence: ####.
I want \##, please.


This is a string with
ints: %4d %1d %15d
floats: %7.1f %10.6f %3.1f %4.3f
left-justified strings: %-8s %-2s %-1s
right-justified strings: %10s %2s %1s
int at end of sentence: %4d.
I want #%1d, please.



 
Paul McGuire
06-30-2006
Jim Segrave wrote:
> It fails for floats specified as ###. or .###: it outputs an integer
> format and the decimal point separately. It also ignores \#, which
> should prevent the '#' from being included in a format.

Ah! This may be making some sense to me now. Here are the OP's original
re's for matching.

exponentPattern = regex.compile('\(^\|[^\\#]\)\(#+\.#+\*\*\*\*\)')
floatPattern = regex.compile('\(^\|[^\\#]\)\(#+\.#+\)')
integerPattern = regex.compile('\(^\|[^\\#]\)\(##+\)')
leftJustifiedStringPattern = regex.compile('\(^\|[^\\<]\)\(<<+\)')
rightJustifiedStringPattern = regex.compile('\(^\|[^\\>]\)\(>>+\)')

Each re seems to have two parts to it. The leading parts appear to be
guards against escaped #, <, or > characters, yes? The second part of each
re shows the actual pattern to be matched. If so:

It seems that we *don't* want "###." or ".###" to be recognized as floats:
floatPattern requires at least one "#" character on either side of the ".".
Also note that single #, <, and > characters don't seem to be wanted; two
or more are required for matching. Pyparsing's Word class
accepts an optional min=2 constructor argument if this really is the case.
And it also seems that the pattern is supposed to be enclosed in ()'s. This
seems especially odd to me, since one of the main points of this funky
format seems to be to set up formatting that preserves column alignment of
text, as if creating a tabular output - enclosing ()'s just junks this up.

My example also omitted the exponent pattern. This can be handled with
another expression like realFormat, but with the trailing "****" characters.
Be sure to insert this expression before realFormat in the list of
formatters.
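
As a rough re-based illustration of why the exponent expression has to be tried before realFormat: alternatives in a regular expression are tried left to right, so the longest form goes first. This sketch assumes "#+.#+****" should map to a %e format whose width is the full field length, which nothing in this thread confirms. FIELD, _repl and convert are illustrative names.

```python
import re

# Exponent form first, then plain float, then integer (two or more '#').
# The lookbehind skips fields preceded by a backslash or another '#'.
FIELD = re.compile(r'(?<![\\#])(?:(#+\.(#+)\*\*\*\*)|(#+\.(#+))|(##+))')

def _repl(m):
    if m.group(1):                      # e.g. ###.##****
        return "%%%d.%de" % (len(m.group(1)), len(m.group(2)))
    if m.group(3):                      # e.g. ###.##
        return "%%%d.%df" % (len(m.group(3)), len(m.group(4)))
    return "%%%dd" % len(m.group(5))    # e.g. ####

def convert(s):
    """Replace every field in one pass over the string."""
    return FIELD.sub(_repl, s)

print(convert("e ###.##**** f ###.## i ####"))
```

If the float alternative came first, it would consume "###.##" and leave the "****" behind as literal text.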

I may be completely off in my re interpretation. Perhaps one of the re
experts here can explain better what the OP's re's are all about. Can
anybody locate/cite the actual spec for this formatting, um, format?

-- Paul


 
Paul McGuire
06-30-2006
Jim Segrave wrote:
> It fails for floats specified as ###. or .###: it outputs an integer
> format and the decimal point separately. It also ignores \#, which
> should prevent the '#' from being included in a format.


Here's a little more study on this (all tests are using Python 2.4.1):

If floats are specified as "###.", should we generate "%4.0f" as the result?
In fact, to get 3 leading places and a trailing decimal point when 0
decimal places are desired, the value should be formatted with "%3.0f." -
we have to put in the trailing '.' character explicitly.
>>> print ">%1.0f<" % 10.00001
>10<
>>> print ">%2.0f<" % 10.00001
>10<
>>> print ">%3.0f<" % 10.00001
> 10<
>>> print ">%3.0f.<" % 10.00001
> 10.<

But as we see below, if the precision field is not zero, the initial width
consumes one character for the decimal point. If the precision field *is*
zero, then the entire width is used for the integer part of the value, with
no trailing decimal point.

".###" almost makes no sense. There is no floating point format that
suppresses the leading '0' before the decimal point.
>>> print ">%1.2f<" % 0.00001
>0.00<
>>> print ">%2.2f<" % 0.00001
>0.00<
>>> print ">%3.2f<" % 0.00001
>0.00<
>>> print ">%4.2f<" % 0.00001
>0.00<
>>> print ">%5.2f<" % 0.00001
> 0.00<


Using %f with a nonzero precision field will always output at least the
number of decimal places, plus the decimal point and a leading '0' if the
number is less than 1.
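
Since %f always prints the leading '0', getting Perl-like output for a ".###" field would mean post-processing the formatted text. A minimal sketch, with format_fraction as a made-up helper name and no claim that this matches Perl's exact behaviour:

```python
def format_fraction(value, decimals):
    """Format value with `decimals` places and no leading zero.

    Hypothetical helper, not part of any library.
    """
    text = "%.*f" % (decimals, value)   # '*' takes the precision from the tuple
    if text.startswith("0."):
        return text[1:]                 # '0.500' -> '.500'
    if text.startswith("-0."):
        return "-" + text[2:]           # '-0.25' -> '-.25'
    return text                         # values >= 1 are left alone

print(format_fraction(0.5, 3))
```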

This whole discussion so far has also ignored negative values. Again, we
should really look more into the spec for this formatting scheme, rather
than try to read the OP's mind.

-- Paul


 
Jim Segrave
06-30-2006
>


The poster was excluding escaped characters (escaped with a '\'), but I've
just looked up the Perl format statement and in fact fields always begin
with a '@', and yes, having no digits on one side of the decimal point
is legal. Strings can be left or right justified ('@<<<<', '@>>>>') or
centred ('@||||'); numerics begin with an '@', contain '#' and may contain
a decimal point. Fields beginning with '^' instead of '@' are omitted
if the format is a numeric ('#' with/without decimal). I assumed from
the poster's original patterns that one has to worry about '@', but
that's incorrect: they need to be present for a field to be a format as
opposed to ordinary text, and there appears to be no way to embed a '@'
in a format. It's worth noting that Perl does implicit float to int
coercion, so it treats @### the same for ints and floats (no decimal
printed).

For the grisly details:

http://perl.com/doc/manual/html/pod/perlform.html

--
Jim Segrave

 
Paul McGuire
06-30-2006

Ah, wunderbar! Some further thoughts...

I can see that the OP omitted the concept of "@|||" centering, since the
Python string interpolation forms only support right or left justified
fields, and it seems he is trying to do some form of format->string interp
automation. Adding centering would require not only composing a suitable
string interp format, but also some sort of pad() operation in the arg
passed to the string interp operation. I suspect this also rules out simple
handling of the '^' operator as mentioned in the spec, and likewise for the
trailing ellipsis if a field is not long enough for the formatted value.
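
The pad() step mentioned above could be as simple as str.center() applied to the argument before interpolation. A sketch, with center_field as a hypothetical helper; cutting over-long values down to the field width is an assumption about what the OP would want:

```python
def center_field(value, width):
    """Return str(value) centred in exactly `width` columns."""
    text = str(value)[:width]      # cut values that are too long
    return text.center(width)      # pad both sides with spaces

# The centred text is then passed to an ordinary %s placeholder.
line = "[%s]" % center_field("ab", 6)
print(line)
```

The format string itself stays a plain %s; all the centring happens in the argument.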

The '@' itself seems to be part of the field, so "@<<<<" would be a 5
column, left-justified string. A bare '@' seems to be a single string
placeholder (meaningless to ask whether it is right or left justified),
since this is used in the doc's hack for including a "@" in the output.
(That is, as you said, the original spec provides no mechanism for escaping
a '@' character; it has to get hacked in as a value dropped into a single
character field.)

The Perl docs say that fields that are too long are truncated. This does
not happen in Python string interps for numeric values, but it can be done
with strings (using the precision field).
>>> import string
>>> print "%-10s" % string.ascii_uppercase
ABCDEFGHIJKLMNOPQRSTUVWXYZ
>>> print "%-10.10s" % string.ascii_uppercase
ABCDEFGHIJ

So if we were to focus on support for "@", "@>>>", "@<<<", "@###" and
"@###.##" (with and without leading or trailing digits about the decimal)
style format fields, this shouldn't be overly difficult, and may even meet
the OP's requirements. (The OP seemed to also want some support for
something like "@##.###****" for scientific notation, again, not a
dealbreaker.)
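
A rough sketch of what that reduced set of '@' fields might look like with the re module. Everything here is an assumption layered on the discussion above: the '@' is counted as one column, string fields get a precision equal to their width so over-long values are cut, a '.' with no '#' after it is treated as punctuation, and the names AT_FIELD, _to_printf and convert_at_fields are made up.

```python
import re

# '@' optionally followed by a run of '<', a run of '>', or a numeric picture.
AT_FIELD = re.compile(r'@(?:(<+)|(>+)|(#+(?:\.#+)?|\.#+))?')

def _to_printf(m):
    width = len(m.group(0))              # the '@' counts as one column
    left, number = m.group(1), m.group(3)
    if number is not None:
        if "." in number:
            decimals = len(number.split(".", 1)[1])
            return "%%%d.%df" % (width, decimals)
        return "%%%dd" % width
    if left is not None:
        return "%%-%d.%ds" % (width, width)
    return "%%%d.%ds" % (width, width)   # '@>>>' or a bare '@'

def convert_at_fields(s):
    """Replace each '@' field with a printf-style token."""
    return AT_FIELD.sub(_to_printf, s)

print(convert_at_fields("@<<<< @>>> @### @##.## @"))
```

Centred '@|||' fields, '^' fields and the "****" exponent form are left out of this sketch.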

-- Paul


 
Jim Segrave
06-30-2006


One would need a much clearer spec on what the OP really wants to do - note
that Perl formats have the variable names embedded as part of the
format string, so writing a simple Perl->Python converter isn't going
to work.

I've given him a good start for an re-based solution and you've given one
for a pyparsing-based one; at this point I'd hope the OP can take it
from there or can come back with more specific questions on how to
deal with some of the awfulness of the formats he's working with.




--
Jim Segrave

 