Velocity Reviews - Computer Hardware Reviews

Velocity Reviews > Newsgroups > Programming > Python > Re: regexp non-greedy matching bug?

Thread Tools

Re: regexp non-greedy matching bug?

Tim Peters
Posts: n/a
[John Hazen]
> I want to match one or two instances of a pattern in a string.
> According to the docs for the 're' module
> ( ) the '?' qualifier
> is greedy by default, and adding a '?' after a qualifier makes it
> non-greedy.

>> The "*", "+", and "?" qualifiers are all greedy...
>> Adding "?" after the qualifier makes it perform the match in
>> non-greedy or minimal fashion...

> In the following example, though my re is intended to allow for 1 or 2
> instinces of 'foo', there are 2 in the string I'm matching. So, I would
> expect group(1) and group(3) to both be populated. (When I remove the
> conditional match on the 2nd foo, the grouping is as I expect.)
> $ python2.4
> Python 2.4.1 (#2, Mar 31 2005, 00:05:10)
> [GCC 3.3 20030304 (Apple Computer, Inc. build 1666)] on darwin
> Type "help", "copyright", "credits" or "license" for more information.
> >>> import re
> >>> foofoo = re.compile(r'^(foo)(.*?)(foo)?(.*?)$')
> >>> foofoo.match(s).group(0)

> 'foobarbazfoobar'
> >>> foofoo.match(s).group(1)

> 'foo'
> >>> foofoo.match(s).group(2)

> ''
> >>> foofoo.match(s).group(3)
> >>> foofoo.match(s).group(4)

> 'barbazfoobar'

Your problem isn't that


is not greedy (it is greedy), it's that your first


is not greedy. Remember that regexps also work left to right. When
you coded your first


you're asking (because of the '?') the regexp engine to chew up the
fewest possible number of characters at that point such that the
_rest_ of the regexp _can_ match. By chewing up no characters at all,
the rest of the regexp can in fact match, so that's what the engine
did -- your second


is optional, telling the engine you don't require that `foo` to match.
The engine took you at your word about that

> >>> foofoo = re.compile(r'^(foo)(.*?)(foo)(.*?)$')

In this case your second `foo` is not optional. The behavior wrt the first


really doesn't change: the regexp engine again chews up the fewest
number of characters at that point such that the rest of the regexp
can match. But because your second `foo` isn't optional in this case,
the engine can't get away with matching 0 characters in this case. It
still matches the fewest number it can match there, though (consistent
with the rest of the pattern matching too).

> >>> foofoo.match(s).group(0)

> 'foobarbazfoobar'
> >>> foofoo.match(s).group(1)

> 'foo'
> >>> foofoo.match(s).group(2)

> 'barbaz'
> >>> foofoo.match(s).group(3)

> 'foo'
> >>> foofoo.match(s).group(4)

> 'bar'
> >>>

> So, is this a bug, or just a problem with my understanding?

The behavior is what I expected

> If it's my brain that's broken, what's the proper way to do this with regexps?

Sorry, I'm unclear on (exactly) what it is you're trying to
accomplish. Maybe what you're looking for is



> And, if the above is expected behavior, should I submit a doc bug? It's
> clear that the "?" qualifier (applied to the second foo group) is _not_
> greedy in this situation.

See above: that's not only not clear, it's not true. Consider a
related but much simpler example:

>>> m = re.match(r'a(b)?(b)?c', 'abc')
>>> m.groups()

('b', None)

Both instances of


are "greedy" there, and that the second one didn't match "b" does not
mean that the second one is not greedy. It _couldn't_ match without
violating that the _first_ is greedy, and the first "wins" because
regexps work left to right. It may be harder to see that the same
principle is at work in your example, but it is: your second (foo)?
couldn't match without violating that your first (.*?) asks for a
minimal match. My


above asks the engine to match two instances of P if possible, but to
settle for one if that's all it can find.
Reply With Quote

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off
Trackbacks are On
Pingbacks are On
Refbacks are Off

Similar Threads
Thread Thread Starter Forum Replies Last Post
[regexp] How to convert string "/regexp/i" to /regexp/i - ? Joao Silva Ruby 16 08-21-2009 05:52 PM
Help with Pattern matching. Matching multiple lines from while reading from a file. Bobby Chamness Perl Misc 2 05-03-2007 06:02 PM
java.util.regexp: matching square brackets enrique Java 3 02-08-2005 06:57 PM
Regexp question: Trouble matching with backslash Prabh Java 2 05-12-2004 11:10 PM
Pattern matching : not matching problem Marc Bissonnette Perl Misc 9 01-13-2004 05:52 PM