Go Back   Velocity Reviews > Newsgroups > Python
User Name
Password
Register FAQ Members List Calendar Search Today's Posts Mark Forums Read

Reply

Python - New implementation of re module

 
Thread Tools Search this Thread
Old 07-27-2009, 05:34 PM   #1
Default New implementation of re module


Hi all,

I've been working on a new implementation of the re module. The details
are at http://bugs.python.org/issue2636, specifically from
http://bugs.python.org/issue2636#msg90954. I've included a .pyd file for
Python 2.6 on Windows if you want to try it out.

I'm interested in how fast it is generally, compared with the current re
module, but especially when faced with those 'pathological' regular
expressions which seem to take a long time to finish, for example:

re.search(r"^(.+|D)*A$", "x" * 25 + "B")

which on my PC (1.8GHz) takes 18.98secs with the re module but <0.01secs
with this new implementation.

TIA


MRAB
  Reply With Quote
Old 07-27-2009, 08:04 PM   #2
William Dode
 
Posts: n/a
Default Re: New implementation of re module
On 27-07-2009, MRAB wrote:
> Hi all,
>
> I've been working on a new implementation of the re module. The details
> are at http://bugs.python.org/issue2636, specifically from
> http://bugs.python.org/issue2636#msg90954. I've included a .pyd file for
> Python 2.6 on Windows if you want to try it out.
>


Someone can remember me how to compile it (on debian lenny), if possible
with python2.5. I've also python3.1 that i build alone...

I could test it with pytextile, i've a bunch of texts to bench and
compare.

Did you announce it on the unladen-swallow list ? They wanted to hack on
RE also...


--
William Dodé - http://flibuste.net
Informaticien Indépendant


William Dode
  Reply With Quote
Old 07-27-2009, 09:00 PM   #3
MRAB
 
Posts: n/a
Default Re: New implementation of re module
William Dode wrote:
> On 27-07-2009, MRAB wrote:
>> Hi all,
>>
>> I've been working on a new implementation of the re module. The details
>> are at http://bugs.python.org/issue2636, specifically from
>> http://bugs.python.org/issue2636#msg90954. I've included a .pyd file for
>> Python 2.6 on Windows if you want to try it out.
>>

>
> Someone can remember me how to compile it (on debian lenny), if possible
> with python2.5. I've also python3.1 that i build alone...
>
> I could test it with pytextile, i've a bunch of texts to bench and
> compare.
>

All I can do is point you to
http://docs.python.org/extending/extending.html.

For Linux (which I don't have) you'll need to use _regex.h and _regex.c
to compile to _regex.so instead of _regex.pyd.

> Did you announce it on the unladen-swallow list ? They wanted to hack on
> RE also...
>

No. I haven't subscribed to it.


MRAB
  Reply With Quote
Old 07-28-2009, 03:52 AM   #4
Aahz
 
Posts: n/a
Default Re: New implementation of re module
In article <mailman.3787.1248712420.8015.python->,
MRAB <> wrote:
>
>I've been working on a new implementation of the re module. The details
>are at http://bugs.python.org/issue2636, specifically from
>http://bugs.python.org/issue2636#msg90954. I've included a .pyd file for
>Python 2.6 on Windows if you want to try it out.


How does it handle the re module's unit tests?
--
Aahz () <*> http://www.pythoncraft.com/

"Many customs in this life persist because they ease friction and promote
productivity as a result of universal agreement, and whether they are
precisely the optimal choices is much less important." --Henry Spencer


Aahz
  Reply With Quote
Old 07-28-2009, 05:41 AM   #5
OKB (not okblacke)
 
Posts: n/a
Default Re: New implementation of re module
MRAB wrote:

> http://bugs.python.org/issue2636#msg90954


Variable-length lookbehind! My hero!

--
--OKB (not okblacke)
Brendan Barnwell
"Do not follow where the path may lead. Go, instead, where there is
no path, and leave a trail."
--author unknown


OKB (not okblacke)
  Reply With Quote
Old 07-28-2009, 03:59 PM   #6
MRAB
 
Posts: n/a
Default Re: New implementation of re module
Aahz wrote:
> In article <mailman.3787.1248712420.8015.python->,
> MRAB <> wrote:
>> I've been working on a new implementation of the re module. The details
>> are at http://bugs.python.org/issue2636, specifically from
>> http://bugs.python.org/issue2636#msg90954. I've included a .pyd file for
>> Python 2.6 on Windows if you want to try it out.

>
> How does it handle the re module's unit tests?


Basically, it passes all those tests I expect it to pass.

It fails those where the intended behaviour has changed, such as re.sub
treating unmatched groups as empty strings, as requested in
http://bugs.python.org/issue1519638.


MRAB
  Reply With Quote
Old 07-28-2009, 09:57 PM   #7
Aahz
 
Posts: n/a
Default Re: New implementation of re module
In article <mailman.3843.1248793153.8015.python->,
MRAB <> wrote:
>Aahz wrote:
>> In article <mailman.3787.1248712420.8015.python->,
>> MRAB <> wrote:
>>> I've been working on a new implementation of the re module. The details
>>> are at http://bugs.python.org/issue2636, specifically from
>>> http://bugs.python.org/issue2636#msg90954. I've included a .pyd file for
>>> Python 2.6 on Windows if you want to try it out.

>>
>> How does it handle the re module's unit tests?

>
>Basically, it passes all those tests I expect it to pass.
>
>It fails those where the intended behaviour has changed, such as re.sub
>treating unmatched groups as empty strings, as requested in
>http://bugs.python.org/issue1519638.


Then you should definitely publish to PyPI and post a message to
c.l.py.announce to get more users.
--
Aahz () <*> http://www.pythoncraft.com/

"Many customs in this life persist because they ease friction and promote
productivity as a result of universal agreement, and whether they are
precisely the optimal choices is much less important." --Henry Spencer


Aahz
  Reply With Quote
Old 07-29-2009, 04:24 PM   #8
Mike
 
Posts: n/a
Default Re: New implementation of re module
On Jul 27, 11:34*am, MRAB <pyt...@mrabarnett.plus.com> wrote:
> I've been working on a new implementation of the re module.


Fabulous!

If you're extending/changing the interface, there are a couple of sore
points in the current implementation I'd love to see addressed:

- findall/finditer doesn't find overlapping matches. Sometimes you
really *do* want to know all possible matches, even if they overlap.
This comes up in bioinformatics, for example.

- split won't split on empty patterns, e.g. empty lookahead patterns.
This means that it can't be used for a whole class of interesting
cases. This has been discussed previously:

http://bugs.python.org/issue3262
http://bugs.python.org/issue852532
http://bugs.python.org/issue988761

- It'd be nice to have a version of split that generates the parts
(one by one) rather than returning the whole list.

- Repeated subgroup match information is not available. That is, for
a match like this

re.match('(.){3}', 'xyz')

there's no way to discover that the subgroup first matched 'x', then
matched 'y', and finally matched 'z'. Here is one past proposal
(mine), perhaps over-complex, to address this problem:

http://mail.python.org/pipermail/pyt...st/047238.html

Mike



Mike
  Reply With Quote
Old 07-29-2009, 04:45 PM   #9
MRAB
 
Posts: n/a
Default Re: New implementation of re module
Mike wrote:
> On Jul 27, 11:34 am, MRAB <pyt...@mrabarnett.plus.com> wrote:
>> I've been working on a new implementation of the re module.

>
> Fabulous!
>
> If you're extending/changing the interface, there are a couple of sore
> points in the current implementation I'd love to see addressed:
>
> - findall/finditer doesn't find overlapping matches. Sometimes you
> really *do* want to know all possible matches, even if they overlap.
> This comes up in bioinformatics, for example.
>

Perhaps by adding "overlapped=True"?

> - split won't split on empty patterns, e.g. empty lookahead patterns.
> This means that it can't be used for a whole class of interesting
> cases. This has been discussed previously:
>
> http://bugs.python.org/issue3262
> http://bugs.python.org/issue852532
> http://bugs.python.org/issue988761
>

Already addressed (see issue2636 for the full details).

> - It'd be nice to have a version of split that generates the parts
> (one by one) rather than returning the whole list.
>

Hmm, re.splititer() perhaps.

> - Repeated subgroup match information is not available. That is, for
> a match like this
>
> re.match('(.){3}', 'xyz')
>
> there's no way to discover that the subgroup first matched 'x', then
> matched 'y', and finally matched 'z'. Here is one past proposal
> (mine), perhaps over-complex, to address this problem:
>
> http://mail.python.org/pipermail/pyt...st/047238.html
>

Yikes! I think I'll let you code that...


MRAB
  Reply With Quote
Old 07-29-2009, 06:21 PM   #10
Mike
 
Posts: n/a
Default Re: New implementation of re module
On Jul 29, 10:45*am, MRAB <pyt...@mrabarnett.plus.com> wrote:
> Mike wrote:
> > - findall/finditer doesn't find overlapping matches. *Sometimes you
> > really *do* want to know all possible matches, even if they overlap.

>
> Perhaps by adding "overlapped=True"?


Something like that would be great, yes.


> > - split won't split on empty patterns, e.g. empty lookahead patterns.


> Already addressed (see issue2636 for the full details).


Glad to hear it.


> > - Repeated subgroup match information is not available. *That is, for
> > a match like this

>
> > * * re.match('(.){3}', 'xyz')

>
> > there's no way to discover that the subgroup first matched 'x', then
> > matched 'y', and finally matched 'z'. *Here is one past proposal
> > (mine), perhaps over-complex, to address this problem:

>
> > * *http://mail.python.org/pipermail/pyt...st/047238.html

>
> Yikes! I think I'll let you code that...


I agree that that document looks a little scary--maybe I was trying to
bite off too much at once.

My intuition, though, is that the basic idea should be fairly simple
to implement, at least for a depth-first matcher. The repeated match
subgroups are already being discovered, it's just that they're not
being saved, so there's no way to report them out once a complete
match is found. If some trail of breadcrumbs were pushed onto a stack
during the DFS, it could be traced at the end. And the whole thing
might not even been that expensive to do.

The hardest parts about this, in my mind, are figuring out how to
report the repeated matches out in a useful form (hence all that
detail in the proposal), and getting users to understand that using
this feature *could* suck up a lot of memory, if they're not careful.

As always, it's possible that my intuition is totally wrong. Plus I'm
not sure how this would work out in the breadth-first case.

Details aside, I would really, really, really like to have a way to
get at the repeated subgroup matches. I write a lot of code that
would be one-liners if this capability existed. Plus, it just plain
burns me that Python is discovering this information but impudently
refuses to tell me what it's found!




Mike
  Reply With Quote
Reply


Thread Tools Search this Thread
Search this Thread:

Advanced Search

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

vB code is On
Smilies are On
[IMG] code is On
HTML code is Off
Trackbacks are Off
Pingbacks are Off
Refbacks are Off

Similar Threads
Thread Thread Starter Forum Replies Last Post
implementation of collapsible panel control in asp.net bhawneshkumar Software 0 07-17-2009 06:38 AM
ImportError: No module named server manasonnet Software 0 11-20-2008 04:53 AM
WFS:GBIC module Helenzhou Hardware 0 06-19-2008 08:04 AM
I have an Invalid Page Fault in module XVID.AX Wassup Mang! DVD Video 1 10-19-2003 02:40 PM
Re: SmartCertify A+ Core Module Tony Sivori A+ Certification 2 07-17-2003 05:56 AM




SEO by vBSEO 3.3.2 ©2009, Crawlability, Inc.

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46