# Understanding '?' in regular expressions

krishna.k.kishor3@gmail.com

 11-16-2012
Can someone explain the below behavior please?

>>> re1 = re.compile(r'(?:((?:1000|1010|1020))[ ]*?[\,]?[ ]*?){1,3}')
>>> re.findall(re_obj,'1000,1020,1000')
['1000']
>>> re.findall(re_obj,'1000,1020, 1000')
['1020', '1000']

However when I use "[\,]??" instead of "[\,]?" as below, I see a different result
>>> re2 = re.compile(r'(?:((?:1000|1010|1020))[ ]*?[\,]??[ ]*?){1,3}')
>>> re.findall(re_obj,'1000,1020,1000')
['1000', '1020', '1000']

I am not able to understand what's causing the difference in behavior here. I am assuming it's not the 'greediness' of "?".

Thank you,
Kishor

Jussi Piitulainen

 11-16-2012
(E-Mail Removed) writes:

> Can someone explain the below behavior please?
>
> >>> re1 = re.compile(r'(?:((?:1000|1010|1020))[ ]*?[\,]?[ ]*?){1,3}')
> >>> re.findall(re_obj,'1000,1020,1000')
> ['1000']
> >>> re.findall(re_obj,'1000,1020, 1000')
> ['1020', '1000']
>
> However when I use "[\,]??" instead of "[\,]?" as below, I see a
> different result
> >>> re2 = re.compile(r'(?:((?:1000|1010|1020))[ ]*?[\,]??[ ]*?){1,3}')
> >>> re.findall(re_obj,'1000,1020,1000')
> ['1000', '1020', '1000']

Those re_obj should be re1 and re2, respectively. With that
correction, the behaviour appears to be as you say.

> I am not able to understand what's causing the difference in
> behavior here. I am assuming it's not the 'greediness' of "?".

But the greediness seems to be the only difference.

I can't wrap my mind around this (at the moment at least) and I need
to rush away, but may I suggest the removal of all that is not
relevant to the problem at hand. Study these instead:

>>> re.findall(r'(10.0,?){1,3}', '1000,1020,1000')
['1000']
>>> re.findall(r'(10.0,??){1,3}', '1000,1020,1000')
['1000', '1020', '1000']

Ian Kelly

 11-16-2012
On Fri, Nov 16, 2012 at 12:28 AM, <(E-Mail Removed)> wrote:
> Can someone explain the below behavior please?
>
> >>> re1 = re.compile(r'(?:((?:1000|1010|1020))[ ]*?[\,]?[ ]*?){1,3}')
> >>> re.findall(re_obj,'1000,1020,1000')
> ['1000']
> >>> re.findall(re_obj,'1000,1020, 1000')
> ['1020', '1000']

Try removing the grouping parentheses to see the full strings being matched:

>>> re1 = re.compile(r'(?:(?:1000|1010|1020)[ ]*?[\,]?[ ]*?){1,3}')
>>> re.findall(re1,'1000,1020,1000')
['1000,1020,1000']
>>> re.findall(re1,'1000,1020, 1000')
['1000,1020,', '1000']

In the first case, the regular expression is matching the full string.
It could also match shorter expressions, but as only the space
quantifiers are non-greedy and there are no spaces to match anyway, it
does not. With the grouping parentheses in place, only the *last*
value of the group is returned, which is why you only see the last
'1000' instead of all three strings in the group, even though the
group is actually matching three different substrings.
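To make the "last value of the group" point concrete, here is a minimal sketch using re.finditer, which exposes both the whole match and the captured group. It assumes the pattern from the original post is the one written out below (one capturing group around the number, inside a repeated non-capturing group); the variable names are only illustrative.

```python
import re

# Assumed pattern: the inner (...) captures one code, the outer (?:...)
# repeats "code, optional spaces, optional comma" one to three times.
pattern = re.compile(r'(?:((?:1000|1010|1020))[ ]*?[\,]?[ ]*?){1,3}')

# Collect (whole match, captured group) for every match in the string.
results = [(m.group(0), m.group(1)) for m in pattern.finditer('1000,1020,1000')]

for whole, last_capture in results:
    # group(0) spans all repetitions; group(1) keeps only the final one.
    print(repr(whole), repr(last_capture))
```

re.findall returns the group rather than the whole match whenever the pattern contains exactly one capturing group, which is why the original call shows only a single '1000'.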

In the second case, the regular expression first finds '1000,1020,'
and then the '1000' as two separate matches. The reason for this is
the space. Since the quantifier on the space is non-greedy, it first
tries *not* matching the space, finds that it has a valid match, and
so does not backtrack. The '1000' is then identified as a separate
match. As before, with the grouping parentheses in place you see only
the '1020' and the last '1000' because the group only reports the last
substring it matched for that particular match.
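The effect of the lazy space quantifier can be checked by printing match spans. This is a sketch under the same assumption about the original pattern; the second pattern, greedy_spaces, is a hypothetical variant added here purely for contrast and does not appear in the thread.

```python
import re

text = '1000,1020, 1000'

# Assumed original pattern, with lazy [ ]*? on both space quantifiers.
lazy_spaces = re.compile(r'(?:((?:1000|1010|1020))[ ]*?[\,]?[ ]*?){1,3}')
# Hypothetical variant for comparison: same pattern with greedy [ ]*.
greedy_spaces = re.compile(r'(?:((?:1000|1010|1020))[ ]*[\,]?[ ]*){1,3}')

lazy_matches = [(m.span(), m.group(0), m.group(1)) for m in lazy_spaces.finditer(text)]
greedy_matches = [(m.span(), m.group(0), m.group(1)) for m in greedy_spaces.finditer(text)]

print(lazy_matches)    # two matches; the first stops before the space
print(greedy_matches)  # one match covering the whole string
```

With the lazy version the first match ends right after the second comma, so the space is never consumed and the final '1000' has to start a new match.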

> However when I use "[\,]??" instead of "[\,]?" as below, I see a different result
> >>> re2 = re.compile(r'(?:((?:1000|1010|1020))[ ]*?[\,]??[ ]*?){1,3}')
> >>> re.findall(re_obj,'1000,1020,1000')
> ['1000', '1020', '1000']
>
> I am not able to understand what's causing the difference in behavior here. I am assuming it's not the 'greediness' of "?".

The difference is the non-greediness of the comma quantifier. When it
comes time for it to match the comma, because the quantifier is
non-greedy, it first tries *not* matching the comma, whereas before it
first tried to match it. As with the space above, when the comma is
not matched, it finds that it has a valid match anyway if it just
stops matching immediately. So it does not need to backtrack, and in
this case it ends up terminating each match early upon the comma and
returning all three numbers as separate matches.
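A side-by-side comparison of the two comma quantifiers shows where each match ends. As before, this is only a sketch that assumes the two patterns from the original post are as written below; the names greedy_comma and lazy_comma are illustrative.

```python
import re

text = '1000,1020,1000'

# Assumed re1: greedy [\,]? tries to consume the comma first.
greedy_comma = re.compile(r'(?:((?:1000|1010|1020))[ ]*?[\,]?[ ]*?){1,3}')
# Assumed re2: lazy [\,]?? tries to skip the comma first.
lazy_comma = re.compile(r'(?:((?:1000|1010|1020))[ ]*?[\,]??[ ]*?){1,3}')

# Whole matches (group 0), not the captured group.
greedy_whole = [m.group(0) for m in greedy_comma.finditer(text)]
lazy_whole = [m.group(0) for m in lazy_comma.finditer(text)]

print(greedy_whole)  # one match spanning all three numbers
print(lazy_whole)    # three separate matches, each ending before a comma
```

In the lazy case the second repetition cannot start on a comma, so the engine settles for one repetition per match, and the scan resumes after each skipped comma.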

What exactly is it that you're trying to do with this regular
expression? I suspect that the solution is actually a lot simpler
than you're making it.
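The goal is never stated in the thread, so the following is only a guess at one simpler approach: if the aim is just to pull out every occurrence of the three codes regardless of separators, a plain alternation with findall may be enough. The word-boundary anchors are an added assumption so that longer numbers such as 10000 are not partially matched.

```python
import re

# Hypothetical simplification: match each code on its own and let
# findall handle the repetition, instead of a {1,3} quantifier.
codes = re.compile(r'\b(?:1000|1010|1020)\b')

found_plain = codes.findall('1000,1020,1000')
found_spaced = codes.findall('1000,1020, 1000')

print(found_plain)
print(found_spaced)
```

Because there is no capturing group, findall returns the full text of each match, and the separators between codes no longer affect the result.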