![]() |
|
|
|||||||
![]() |
Python - New implementation of re module |
|
|
Thread Tools | Search this Thread |
|
|
#1 |
|
Hi all,
I've been working on a new implementation of the re module. The details are at http://bugs.python.org/issue2636, specifically from http://bugs.python.org/issue2636#msg90954. I've included a .pyd file for Python 2.6 on Windows if you want to try it out. I'm interested in how fast it is generally, compared with the current re module, but especially when faced with those 'pathological' regular expressions which seem to take a long time to finish, for example: re.search(r"^(.+|D)*A$", "x" * 25 + "B") which on my PC (1.8GHz) takes 18.98secs with the re module but <0.01secs with this new implementation. TIA MRAB |
|
|
|
|
#2 |
|
Posts: n/a
|
On 27-07-2009, MRAB wrote:
> Hi all, > > I've been working on a new implementation of the re module. The details > are at http://bugs.python.org/issue2636, specifically from > http://bugs.python.org/issue2636#msg90954. I've included a .pyd file for > Python 2.6 on Windows if you want to try it out. > Someone can remember me how to compile it (on debian lenny), if possible with python2.5. I've also python3.1 that i build alone... I could test it with pytextile, i've a bunch of texts to bench and compare. Did you announce it on the unladen-swallow list ? They wanted to hack on RE also... -- William Dodé - http://flibuste.net Informaticien Indépendant William Dode |
|
|
|
#3 |
|
Posts: n/a
|
William Dode wrote:
> On 27-07-2009, MRAB wrote: >> Hi all, >> >> I've been working on a new implementation of the re module. The details >> are at http://bugs.python.org/issue2636, specifically from >> http://bugs.python.org/issue2636#msg90954. I've included a .pyd file for >> Python 2.6 on Windows if you want to try it out. >> > > Someone can remember me how to compile it (on debian lenny), if possible > with python2.5. I've also python3.1 that i build alone... > > I could test it with pytextile, i've a bunch of texts to bench and > compare. > All I can do is point you to http://docs.python.org/extending/extending.html. For Linux (which I don't have) you'll need to use _regex.h and _regex.c to compile to _regex.so instead of _regex.pyd. > Did you announce it on the unladen-swallow list ? They wanted to hack on > RE also... > No. I haven't subscribed to it. MRAB |
|
|
|
#4 |
|
Posts: n/a
|
In article <mailman.3787.1248712420.8015.python->,
MRAB <> wrote: > >I've been working on a new implementation of the re module. The details >are at http://bugs.python.org/issue2636, specifically from >http://bugs.python.org/issue2636#msg90954. I've included a .pyd file for >Python 2.6 on Windows if you want to try it out. How does it handle the re module's unit tests? -- Aahz () <*> http://www.pythoncraft.com/ "Many customs in this life persist because they ease friction and promote productivity as a result of universal agreement, and whether they are precisely the optimal choices is much less important." --Henry Spencer Aahz |
|
|
|
#5 |
|
Posts: n/a
|
MRAB wrote:
> http://bugs.python.org/issue2636#msg90954 Variable-length lookbehind! My hero! -- --OKB (not okblacke) Brendan Barnwell "Do not follow where the path may lead. Go, instead, where there is no path, and leave a trail." --author unknown OKB (not okblacke) |
|
|
|
#6 |
|
Posts: n/a
|
Aahz wrote:
> In article <mailman.3787.1248712420.8015.python->, > MRAB <> wrote: >> I've been working on a new implementation of the re module. The details >> are at http://bugs.python.org/issue2636, specifically from >> http://bugs.python.org/issue2636#msg90954. I've included a .pyd file for >> Python 2.6 on Windows if you want to try it out. > > How does it handle the re module's unit tests? Basically, it passes all those tests I expect it to pass. It fails those where the intended behaviour has changed, such as re.sub treating unmatched groups as empty strings, as requested in http://bugs.python.org/issue1519638. MRAB |
|
|
|
#7 |
|
Posts: n/a
|
In article <mailman.3843.1248793153.8015.python->,
MRAB <> wrote: >Aahz wrote: >> In article <mailman.3787.1248712420.8015.python->, >> MRAB <> wrote: >>> I've been working on a new implementation of the re module. The details >>> are at http://bugs.python.org/issue2636, specifically from >>> http://bugs.python.org/issue2636#msg90954. I've included a .pyd file for >>> Python 2.6 on Windows if you want to try it out. >> >> How does it handle the re module's unit tests? > >Basically, it passes all those tests I expect it to pass. > >It fails those where the intended behaviour has changed, such as re.sub >treating unmatched groups as empty strings, as requested in >http://bugs.python.org/issue1519638. Then you should definitely publish to PyPI and post a message to c.l.py.announce to get more users. -- Aahz () <*> http://www.pythoncraft.com/ "Many customs in this life persist because they ease friction and promote productivity as a result of universal agreement, and whether they are precisely the optimal choices is much less important." --Henry Spencer Aahz |
|
|
|
#8 |
|
Posts: n/a
|
On Jul 27, 11:34*am, MRAB <pyt...@mrabarnett.plus.com> wrote:
> I've been working on a new implementation of the re module. Fabulous! If you're extending/changing the interface, there are a couple of sore points in the current implementation I'd love to see addressed: - findall/finditer doesn't find overlapping matches. Sometimes you really *do* want to know all possible matches, even if they overlap. This comes up in bioinformatics, for example. - split won't split on empty patterns, e.g. empty lookahead patterns. This means that it can't be used for a whole class of interesting cases. This has been discussed previously: http://bugs.python.org/issue3262 http://bugs.python.org/issue852532 http://bugs.python.org/issue988761 - It'd be nice to have a version of split that generates the parts (one by one) rather than returning the whole list. - Repeated subgroup match information is not available. That is, for a match like this re.match('(.){3}', 'xyz') there's no way to discover that the subgroup first matched 'x', then matched 'y', and finally matched 'z'. Here is one past proposal (mine), perhaps over-complex, to address this problem: http://mail.python.org/pipermail/pyt...st/047238.html Mike Mike |
|
|
|
#9 |
|
Posts: n/a
|
Mike wrote:
> On Jul 27, 11:34 am, MRAB <pyt...@mrabarnett.plus.com> wrote: >> I've been working on a new implementation of the re module. > > Fabulous! > > If you're extending/changing the interface, there are a couple of sore > points in the current implementation I'd love to see addressed: > > - findall/finditer doesn't find overlapping matches. Sometimes you > really *do* want to know all possible matches, even if they overlap. > This comes up in bioinformatics, for example. > Perhaps by adding "overlapped=True"? > - split won't split on empty patterns, e.g. empty lookahead patterns. > This means that it can't be used for a whole class of interesting > cases. This has been discussed previously: > > http://bugs.python.org/issue3262 > http://bugs.python.org/issue852532 > http://bugs.python.org/issue988761 > Already addressed (see issue2636 for the full details). > - It'd be nice to have a version of split that generates the parts > (one by one) rather than returning the whole list. > Hmm, re.splititer() perhaps. > - Repeated subgroup match information is not available. That is, for > a match like this > > re.match('(.){3}', 'xyz') > > there's no way to discover that the subgroup first matched 'x', then > matched 'y', and finally matched 'z'. Here is one past proposal > (mine), perhaps over-complex, to address this problem: > > http://mail.python.org/pipermail/pyt...st/047238.html > Yikes! I think I'll let you code that... MRAB |
|
|
|
#10 |
|
Posts: n/a
|
On Jul 29, 10:45*am, MRAB <pyt...@mrabarnett.plus.com> wrote:
> Mike wrote: > > - findall/finditer doesn't find overlapping matches. *Sometimes you > > really *do* want to know all possible matches, even if they overlap. > > Perhaps by adding "overlapped=True"? Something like that would be great, yes. > > - split won't split on empty patterns, e.g. empty lookahead patterns. > Already addressed (see issue2636 for the full details). Glad to hear it. > > - Repeated subgroup match information is not available. *That is, for > > a match like this > > > * * re.match('(.){3}', 'xyz') > > > there's no way to discover that the subgroup first matched 'x', then > > matched 'y', and finally matched 'z'. *Here is one past proposal > > (mine), perhaps over-complex, to address this problem: > > > * *http://mail.python.org/pipermail/pyt...st/047238.html > > Yikes! I think I'll let you code that... I agree that that document looks a little scary--maybe I was trying to bite off too much at once. My intuition, though, is that the basic idea should be fairly simple to implement, at least for a depth-first matcher. The repeated match subgroups are already being discovered, it's just that they're not being saved, so there's no way to report them out once a complete match is found. If some trail of breadcrumbs were pushed onto a stack during the DFS, it could be traced at the end. And the whole thing might not even been that expensive to do. The hardest parts about this, in my mind, are figuring out how to report the repeated matches out in a useful form (hence all that detail in the proposal), and getting users to understand that using this feature *could* suck up a lot of memory, if they're not careful. As always, it's possible that my intuition is totally wrong. Plus I'm not sure how this would work out in the breadth-first case. Details aside, I would really, really, really like to have a way to get at the repeated subgroup matches. I write a lot of code that would be one-liners if this capability existed. Plus, it just plain burns me that Python is discovering this information but impudently refuses to tell me what it's found! Mike |
|
![]() |
| Thread Tools | Search this Thread |
|
|
Similar Threads
|
||||
| Thread | Thread Starter | Forum | Replies | Last Post |
| implementation of collapsible panel control in asp.net | bhawneshkumar | Software | 0 | 07-17-2009 06:38 AM |
| ImportError: No module named server | manasonnet | Software | 0 | 11-20-2008 04:53 AM |
| WFS:GBIC module | Helenzhou | Hardware | 0 | 06-19-2008 08:04 AM |
| I have an Invalid Page Fault in module XVID.AX | Wassup Mang! | DVD Video | 1 | 10-19-2003 02:40 PM |
| Re: SmartCertify A+ Core Module | Tony Sivori | A+ Certification | 2 | 07-17-2003 05:56 AM |