another strange regexp case

 
 
Kristof Bastiaensen
 
      06-29-2004
Hi,

here is another regexp behaviour which surprises me.
There may be some logic behind it, but I fail to see it...

irb(main):004:0> /(theone)?/.match(" theone").to_a
=> ["", nil]

irb(main):003:0> /(theone)?/.match("theone").to_a
=> ["theone", "theone"]

irb(main):005:0> / (theone)?/.match(" theone").to_a
=> [" theone", "theone"]

In the first case, it doesn't match "theone", but in
the second and third it does...

Could anyone explain this?

Kristof
 
 
ts
 
      06-29-2004
>>>>> "K" == Kristof Bastiaensen <> writes:

K> irb(main):004:0> /(theone)?/.match(" theone").to_a
K> => ["", nil]

When the regexp engine tries to match `t' it fails, because the first
character is ` ', and the regexp still succeeds because `theone' was optional

K> irb(main):003:0> /(theone)?/.match("theone").to_a
K> => ["theone", "theone"]

it can match `theone' in its first try
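
A small sketch to make the two cases above concrete. It only uses `MatchData#begin` and indexing; the variable names are made up for illustration. It records where each match starts and what the group captured:

```ruby
# Same pattern, two inputs. The variable names are illustrative only.
m1 = /(theone)?/.match(" theone")
m2 = /(theone)?/.match("theone")

# [start index of the whole match, whole match, group 1]
first_case  = [m1.begin(0), m1[0], m1[1]]   # empty match at index 0, group unset
second_case = [m2.begin(0), m2[0], m2[1]]   # full match at index 0

p first_case
p second_case
```

Both matches start at index 0; the only difference is whether the optional group could consume anything there.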


Guy Decoux





 
 
Ara.T.Howard
 
      06-29-2004
On Tue, 29 Jun 2004, Kristof Bastiaensen wrote:


> irb(main):004:0> /(theone)?/.match(" theone").to_a
> => ["", nil]


? means 'zero or one'

we start at the beginning of ' theone' and instantly find a match: zero of
them.

> irb(main):003:0> /(theone)?/.match("theone").to_a
> => ["theone", "theone"]


same here.

> irb(main):005:0> / (theone)?/.match(" theone").to_a
> => [" theone", "theone"]


same here.


remember regexp engines work (well, some of them) by starting at a position and
consuming chars while the pattern matches. iff all of the pattern was used we
have a positive match, otherwise not. so in all these cases we start like so

' theone'
 ^
 ^
 ^
 ptr

and drive with the regexp asking "does the regexp match starting here? if so
how many chars did it consume?" the consumed chars are returned in $&, and the
captured groups in $1, $2, etc. in all the cases above this explains the matching.
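
the "try here, else move the pointer" idea can be sketched roughly like this. this is not how the real engine is implemented, just a simulation using StringScanner from the standard library, whose `scan` only matches at the current position. `leftmost_match` is a hypothetical helper name, and the byte-based `pos` is assumed to be fine because the string is plain ASCII.

```ruby
require 'strscan'

# Hypothetical helper: try the pattern at each start position in turn,
# left to right, and return [start, matched_text] for the first success.
def leftmost_match(re, str)
  scanner = StringScanner.new(str)
  (0..str.length).each do |start|
    scanner.pos = start            # byte offset; fine for ASCII input
    matched = scanner.scan(re)     # matches only at scanner.pos
    return [start, matched] if matched   # "" is truthy in Ruby
  end
  nil
end

optional = leftmost_match(/(theone)?/, " theone")
required = leftmost_match(/(theone)/,  " theone")

p optional
p required
```

with the optional group the very first position already succeeds (with zero chars consumed), so the pointer never moves on to where 'theone' actually sits. with the group made mandatory, position 0 fails and the match is found at position 1.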

note that some regexp engines work in the reverse sense but the effect is
largely the same...

> In the first case, it doesn't match "theone", but in the second and third it
> does...


so it matched in all cases -- sometimes zero times, sometimes one time. this
is what you asked the regexp to do. i try to follow these rules when
composing regexps:

- always use anchors ^ and $
- never use anything that can match 'zero' things

it's the 'zero' thing that surprised you. your first two regexps match even
the empty string!

obviously this is not always possible but i will maintain this:

if you create a regexp without anchors and with portions that can match zero
things and have not done so out of absolute need - your code has a bug.
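
a minimal sketch of what following those two rules could look like for this case. the patterns are made up for illustration, and note that in ruby ^ and $ anchor at line boundaries, so the sketch uses \A and \z to anchor the whole string.

```ruby
# Illustrative patterns only; neither comes from the original post.
required = /(theone)/          # the group can no longer match zero things
anchored = /\A (theone)\z/     # must cover the whole string: one space, then the word

required_capture = required.match(" theone")[1]
anchored_capture = anchored.match(" theone")[1]
anchored_reject  = anchored.match(" theone and more")   # trailing text, so no match
empty_reject     = required.match("")                    # nothing to find, so no match

p required_capture, anchored_capture, anchored_reject, empty_reject
```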

kind regards.

-a
--
===============================================================================
| EMAIL :: Ara [dot] T [dot] Howard [at] noaa [dot] gov
| PHONE :: 303.497.6469
| A flower falls, even though we love it;
| and a weed grows, even though we do not love it.
| --Dogen
===============================================================================
 
 
Florian Gross
 
      06-29-2004
Kristof Bastiaensen wrote:
> Hi,


Hi!

> here is another regexp behaviour which surprises me.
> There may be some logic behind it, but I fail to see it...
> irb(main):004:0> /(theone)?/.match(" theone").to_a
> => ["", nil]


I think that this is about how greediness in Regexps works:

A Regexp will try to match as much as possible starting at the current
position, but even a "bad" match at the current position will be better
than a "good" match at a later position in the String.

Maybe it would be possible to do a version of .match that finds the
"best" (== longest, if greedy) match in the whole string. I assume that
it would be based on .scan in some kind of way.
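
Here is a rough sketch of that idea, assuming "best" simply means the longest overall match. `longest_match` is a hypothetical helper, not an existing method; it collects the MatchData of every hit that `String#scan` finds and keeps the longest one.

```ruby
# Hypothetical helper: scan the whole string and return the MatchData
# of the longest match, or nil if the pattern never matches.
def longest_match(re, str)
  matches = []
  # Inside the block, Regexp.last_match is the MatchData for the current hit.
  str.scan(re) { matches << Regexp.last_match }
  matches.max_by { |m| m[0].length }
end

best = longest_match(/(theone)?/, " theone")
best_array = best.to_a
best_start = best.begin(0)

p best_array
p best_start
```

For the pattern from the original post this picks the match that starts at index 1 instead of the empty match at index 0. When several matches tie for length, `max_by` returns just one of them.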

Regards,
Florian Gross
 
 
Kristof Bastiaensen
 
      06-29-2004
On Tue, 29 Jun 2004 11:05:43 -0600, Ara.T.Howard wrote:

<snip>
>> irb(main):004:0> /(theone)?/.match(" theone").to_a
>> => ["", nil]

>
> ? means 'zero or one'
>
> we start at the beginning of ' theone' and instantly find a match: zero of
> them.

<snip>
>
> if you create a regexp without anchors and with portions that can match zero
> things and have not done so out of absolute need - your code has a bug.


Thanks for the answer. I expected the pattern to expand greedily,
but I forgot it will return the first match, which is the empty
match. You are right, /(theone)?/ is a silly thing to write;
in the end I just needed another regexp for my problem.

Thanks,
Kristof
 
 
Robert Klemme
 
      06-30-2004

"Kristof Bastiaensen" <> wrote in message
news.. .
> On Tue, 29 Jun 2004 11:05:43 -0600, Ara.T.Howard wrote:
>
> <snip>
> >> irb(main):004:0> /(theone)?/.match(" theone").to_a
> >> => ["", nil]
> >
> > ? means 'zero or one'
> >
> > we start at the beginning of ' theone' and instantly find a match: zero
> > of them.
>
> <snip>
> >
> > if you create a regexp without anchors and with portions that can match
> > zero things and have not done so out of absolute need - your code has a
> > bug.
>
> Thanks for the answer. I expected the pattern to expand greedily,
> but I forgot it will return the first match, which is the empty
> match. You are right, /(theone)?/ is a silly thing to write;
> in the end I just needed another regexp for my problem.

This is a case of the simple general rule "Watch out for regular
expressions that match the empty string". All sorts of problems can arise
when using them and usually you don't want to match an empty string
anyway.
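
One quick way to spot such expressions is to try them against the empty string. This is only a rough check, and `matches_empty?` is a made-up helper name, not something from the standard library:

```ruby
# Hypothetical helper: true if the pattern matches the empty string.
def matches_empty?(re)
  !re.match("").nil?
end

optional_only = matches_empty?(/(theone)?/)    # the pattern from the first post
with_space    = matches_empty?(/ (theone)?/)   # needs at least the space
mandatory     = matches_empty?(/(theone)/)     # needs the whole word

p optional_only, with_space, mandatory
```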

Kind regards

robert

 