Velocity Reviews > Regular Expression: Matching substring

# Regular Expression: Matching substring

Kevin CH
Guest
Posts: n/a

 04-13-2006
Hi,

I'm currently running into a confusion on regex and hopefully you guys
can clear it up for me.

Suppose I have a regular expression (0|(1(01*0)*1))* and two test
strings: 110_1011101_ and _101101_1. (The underscores are not part of
the string. They are added to show that both string has a substring
that matches the pattern.) Applying a match() function on the first
string returns true while false for the second. The difference is the
first one has unmatched chunk in the beginning while the second at the
end. How's the regex rule work here?

Thanks.

Leon
Guest
Posts: n/a

 04-13-2006
Hi Kevin,

You may notice that, for matching the regex (0|(1(01*0)*1))*, the left
most
three characters of a string must not be ``101" while not followed by
an `0'.
After reading the first `1', automata expects `1' or ``00" or ``010"
or ``11",
right?

Kevin CH 寫道：

> Hi,
>
> I'm currently running into a confusion on regex and hopefully you guys
> can clear it up for me.
>
> Suppose I have a regular expression (0|(1(01*0)*1))* and two test
> strings: 110_1011101_ and _101101_1. (The underscores are not part of
> the string. They are added to show that both string has a substring
> that matches the pattern.) Applying a match() function on the first
> string returns true while false for the second. The difference is the
> first one has unmatched chunk in the beginning while the second at the
> end. How's the regex rule work here?
>
> Thanks.

Kevin CH
Guest
Posts: n/a

 04-13-2006

Leon wrote:
> Hi Kevin,
>
> You may notice that, for matching the regex (0|(1(01*0)*1))*, the left
> most
> three characters of a string must not be ``101" while not followed by
> an `0'.
> After reading the first `1', automata expects `1' or ``00" or ``010"
> or ``11",
> right?

Why it must expect "010"? Why not say "0110", since 1* can represent 0
or more repetitions.

>
>
> Kevin CH 寫道：
>
> > Hi,
> >
> > I'm currently running into a confusion on regex and hopefully you guys
> > can clear it up for me.
> >
> > Suppose I have a regular expression (0|(1(01*0)*1))* and two test
> > strings: 110_1011101_ and _101101_1. (The underscores are not part of
> > the string. They are added to show that both string has a substring
> > that matches the pattern.) Applying a match() function on the first
> > string returns true while false for the second. The difference is the
> > first one has unmatched chunk in the beginning while the second at the
> > end. How's the regex rule work here?
> >
> > Thanks.

Leon
Guest
Posts: n/a

 04-13-2006
You are right. In fact the procedure is as follows:
The substr ``101101" is no problem, if stop here, match will
successful.
But the tailing `1' occurs, so we may imagine the working automata move
to a state, which according to the regexp's outer most `)', and ready
to repeat
the whole regexp again. In this case, the answer is ``yes" only when
there exists
at least two ``1", but only one here.

BTW, the first string is matched exactly, according to your notion, it
should be written as: _11_0_1011101

Kevin CH
Guest
Posts: n/a

 04-13-2006
Oh yea, I see what's my confusion now. In the first string, I didn't
consider 11 and 0 matches the pattern without the repetition. Actually
what I did is I entered the pattern and the test strings into kudos (a
python regexp debugger) and got the match groups, which didn't have 11
and 0 as matches, although they do match the pattern "(0|1(01*0)*1)".

Thank you for the help.

Leon wrote:
> You are right. In fact the procedure is as follows:
> The substr ``101101" is no problem, if stop here, match will
> successful.
> But the tailing `1' occurs, so we may imagine the working automata move
> to a state, which according to the regexp's outer most `)', and ready
> to repeat
> the whole regexp again. In this case, the answer is ``yes" only when
> there exists
> at least two ``1", but only one here.
>
> BTW, the first string is matched exactly, according to your notion, it
> should be written as: _11_0_1011101

John Machin
Guest
Posts: n/a

 04-13-2006
On 13/04/2006 12:33 PM, Kevin CH wrote:
> Hi,
>
> I'm currently running into a confusion on regex and hopefully you guys
> can clear it up for me.
>
> Suppose I have a regular expression (0|(1(01*0)*1))* and two test
> strings: 110_1011101_ and _101101_1. (The underscores are not part of
> the string. They are added to show that both string has a substring
> that matches the pattern.) Applying a match() function on the first
> string returns true while false for the second.

Perhaps you are using grep, or you have stumbled on the old deprecated
"regex" module and are using that instead of the "re" module. Perhaps
not as you are using only 2 plain vanilla RE operations which should
work the same way everywhere. Perhaps you are having trouble with
search() versus match() -- if so, read the section on this topic in the
re docs. It's rather hard to tell what you are doing without seeing the
code you are using.

> The difference is the
> first one has unmatched chunk in the beginning

With re's match(), the whole string matches.

> while the second at the
> end.

With re's match(), the part you marked with underscores (at the
*beginning*) matches.

> How's the regex rule work here?

Let's abbreviate your pattern as (0|X)*
This means 0 or more occurrences of strings that match either 0 or X.

Case 1 gives us 11 matching X [it's a 1 followed by zero occurrences of
(01*0) followed by a 1], then a 0, then 1011101 matching X [it's a 1
foll. by 1 occ. of (01110) followed by a 1].

Case 2 gives us 101101 matching X [it's a 1 foll. by 1 occ of (0110)
foll by a 1] -- then there's a 1 that doesn't match anything.

Here's some code and its output:

C:\junk>type kevinch.py
import re

rx = re.compile(r"(0|(1(01*0)*1))*")

def doit(n, s):
print "Case", n
m = rx.match(s)
if m:
print "0123456789"
print s
for k in range(4):
print "span(%d) -> %r" % (k, m.span(k))
else:
print "... no match"

s1 = "110_1011101_".replace('_', '')
s2 = "_101101_1".replace('_', '')
doit(1, s1)
doit(2, s2)

C:\junk>kevinch.py
Case 1
0123456789
1101011101
span(0) -> (0, 10)
span(1) -> (3, 10)
span(2) -> (3, 10)
span(3) -> (4, 9)
Case 2
0123456789
1011011
span(0) -> (0, 6)
span(1) -> (0, 6)
span(2) -> (0, 6)
span(3) -> (1, 5)

HTH,
John

Kevin CH
Guest
Posts: n/a

 04-13-2006
Thank you for your reply.

> Perhaps you are using grep, or you have stumbled on the old deprecated
> "regex" module and are using that instead of the "re" module. Perhaps
> not as you are using only 2 plain vanilla RE operations which should
> work the same way everywhere. Perhaps you are having trouble with
> search() versus match() -- if so, read the section on this topic in the
> re docs. It's rather hard to tell what you are doing without seeing the
> code you are using.

Sorry I should have said it up front. I'm using Kudos (which I'm sure
uses re module) to test these strings on the pattern, and had the match
results as I stated. (search() of course gives me true since the
pattern appears in the substrings of both strings.)

Fredrik Lundh
Guest
Posts: n/a

 04-13-2006
"Kevin CH" wrote:

news:(E-Mail Removed) oups.com...
> Thank you for your reply.
>
> > Perhaps you are using grep, or you have stumbled on the old deprecated
> > "regex" module and are using that instead of the "re" module. Perhaps
> > not as you are using only 2 plain vanilla RE operations which should
> > work the same way everywhere. Perhaps you are having trouble with
> > search() versus match() -- if so, read the section on this topic in the
> > re docs. It's rather hard to tell what you are doing without seeing the
> > code you are using.

>
> Sorry I should have said it up front. I'm using Kudos (which I'm sure
> uses re module) to test these strings on the pattern, and had the match
> results as I stated. (search() of course gives me true since the
> pattern appears in the substrings of both strings.)

Python's "match" function doesn't return "true" or "false"; it returns a match
object if the string matches the pattern, and None if not. since your pattern
can match the empty string, it'll match any target string (all strings start with
an empty string), and will never return false.

looks like the "debugger" does a great job of hiding how things really work...

</F>

 Thread Tools

 Posting Rules You may not post new threads You may not post replies You may not post attachments You may not edit your posts BB code is On Smilies are On [IMG] code is On HTML code is OffTrackbacks are On Pingbacks are On Refbacks are Off Forum Rules

 Similar Threads Thread Thread Starter Forum Replies Last Post Jet Koten Ruby 9 02-27-2010 03:28 PM Eloff Python 2 07-18-2009 01:54 AM m Perl Misc 3 11-19-2007 12:50 PM colinhumber@gmail.com Perl Misc 3 08-03-2005 04:29 PM M.N.A.Smadi Python 1 03-02-2005 09:49 PM

Advertisments