Regex /(X*)/

 
 
ulrich_martin@seznam.cz
      02-13-2008
Hello,

I would like to ask you for help. Could anybody explain to me why the regex
"/(X*)/" is not able to catch the Xs in the string "aXXXb"? The quantifier "*" in
this regex is greedy and there is no "^" anchor, so I would
expect $1 to contain XXX. I know (I have read it) that it is
possible to use + instead of *, but I would like to know why the "*"
quantifier doesn't catch it.

I have found this example in perlretut:
Finally,
"aXXXb" =~ /(X*)/; # matches with $1 = ''
because it can match zero copies of 'X' at the beginning of
the string. If you definitely want to match at least one
'X', use "X+", not "X*".

M.
 
Damian Lukowski
      02-13-2008
(E-Mail Removed) wrote:
> I have found this example in perlretut:
> Finally,
> "aXXXb" =~ /(X*)/; # matches with $1 = ''
> because it can match zero copies of 'X' at the beginning of
> the string. If you definitely want to match at least one
> 'X', use "X+", not "X*".


Well, that is the explanation. Perl tries to match as soon as possible.
"Sooner" is more important than "longer".
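To make "sooner beats longer" visible, here is a minimal sketch that reports where the first match starts. The helper name first_match is made up for this illustration; @- and @+ are Perl's built-in arrays holding the start and end offsets of the last successful match.

```perl
use strict;
use warnings;

# Hypothetical helper: return (offset, matched text) for the first match
# of $re in $string, or an empty list if nothing matches.
sub first_match {
    my ($string, $re) = @_;
    return unless $string =~ $re;
    # $-[0] and $+[0] are the start and end offsets of the whole match.
    return ($-[0], substr($string, $-[0], $+[0] - $-[0]));
}

for my $re (qr/X*/, qr/X+/) {
    my ($offset, $text) = first_match("aXXXb", $re);
    print "$re matched '$text' at offset $offset\n";
}
```

With X* the match is the empty string at offset 0; with X+ it is 'XXX' at offset 1, because offset 0 no longer qualifies.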
 
Paul Lalli
      02-13-2008
On Feb 13, 3:25 am, (E-Mail Removed) wrote:
> Hello,
>
> I would like to ask you for help. Could anybody explain me, why regex
> "/(X*)/" is not able to catch X in string "aXXXb". Quantifier "*" in
> this regex is a greedy one and there is not anchor "^", so I would
> expect, that $1 would contain XXX. I know (I have read it), that it is
> possible to use + instead of *, but I would like to know, why the "*"
> quantifier doesn't catch it.
>
> I have found this example in perlretut:
> Finally,
> "aXXXb" =~ /(X*)/; # matches with $1 = ''
> because it can match zero copies of 'X' at the beginning of
> the string. If you definitely want to match at least one
> 'X', use "X+", not "X*".


Because greediness takes second place to position. Perl attempts to
find the FIRST match that it can. Once it's started successfully
matching, only then does the greediness of quantifiers come into play.

Take a look at all the places /(X*)/ could match aXXXb....

while ("aXXXb" =~ /(X*)/g) {
    print "$`<<$&>>$'\n";
}

<<>>aXXXb
a<<XXX>>b
aXXX<<>>b
aXXXb<<>>


The first time through, it matches the empty string right at the
beginning of the string.
The second time through, it matches the XXX.
The third time through, it matches the empty string between the last X
and the b.
The final time through, it matches the empty string after the b, at the
end of the string.
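If you would rather see offsets than $`, $& and $', the loop above can be sketched with @- and @+ instead. The helper name all_matches is only an illustration, not anything from the docs.

```perl
use strict;
use warnings;

# Hypothetical helper: collect [start, end] offsets for every /g match.
sub all_matches {
    my ($string, $re) = @_;
    my @found;
    while ($string =~ /$re/g) {
        push @found, [ $-[0], $+[0] ];   # offsets of the whole match
    }
    return @found;
}

my $subject = "aXXXb";
for my $m (all_matches($subject, qr/X*/)) {
    my ($from, $to) = @$m;
    printf "offset %d..%d: '%s'\n", $from, $to, substr($subject, $from, $to - $from);
}
```

This reports the same four matches: empty at 0, 'XXX' from 1 to 4, empty at 4, and empty at 5.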


Paul Lalli
 
jl_post@hotmail.com
      02-13-2008
On Feb 13, 1:25 am, (E-Mail Removed) wrote:
>
> I would like to ask you for help. Could anybody explain me, why regex
> "/(X*)/" is not able to catch X in string "aXXXb". Quantifier "*" in
> this regex is a greedy one and there is not anchor "^", so I would
> expect, that $1 would contain XXX. I know (I have read it), that it is
> possible to use + instead of *, but I would like to know, why the "*"
> quantifier doesn't catch it.
>
> I have found this example in perlretut:
> Finally,
> "aXXXb" =~ /(X*)/; # matches with $1 = ''
> because it can match zero copies of 'X' at the beginning of
> the string. If you definitely want to match at least one
> 'X', use "X+", not "X*".



Basically, a lot of people mistakenly think that the greedy '*'
quantifier makes m/(X*)/ match the LONGEST string of Xs. But in
reality, m/(X*)/ matches AS SOON AS POSSIBLE, and '*' just makes (X*)
gobble as much as it can once a match is found.

I believe it was "Learning Perl" (the "llama" book) that said
that if a regular expression can match the empty string, then it will
match any string it is given. And the regular
expression m/(X*)/ does match the empty string, as
'' has zero-or-more instances of 'X' inside it. Therefore, even this
match succeeds:

"ab" =~ /(X*)/; # $1 gets set to ''

It succeeds because it found zero-or-more Xs at the very beginning of
the string. Likewise, the match:

"aXXXb" =~ /(X*)/; # $1 gets set to ''

also succeeds by finding zero-or-more Xs at the very beginning of the
string. It stops searching after that because it found a match, and
has no need to continue any further.

If you really wanted a regular expression that would match at least
one X, then you should use the '+' quantifier instead of '*', like
this:

"aXXXb" =~ /(X+)/; # $1 gets set to 'XXX'

but since it still matches as soon as possible, it won't pick up a
longer run of Xs later in the string, as shown here:

"aXXXbXXXXXc" =~ /(X+)/; # $1 still gets set to 'XXX'

If you wanted to match the longest string of Xs, you'd have to loop
through all the strings of Xs and record the longest one. You can do
this with the /g modifier like this:

my $longestString = '';
while ( "aXXXbXXXXXcXd" =~ m/(X+)/g )
{
    $longestString = $1 if length($1) > length($longestString);
}
print "$longestString\n"; # prints 'XXXXX'
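The same idea can also be sketched without an explicit loop: in list context, /g returns every capture at once, and reduce from the core List::Util module can pick the longest. The sub name longest_run is just a name chosen for this example.

```perl
use strict;
use warnings;
use List::Util qw(reduce);

# Hypothetical helper: return the longest run of Xs, or '' if there is none.
sub longest_run {
    my ($string) = @_;
    my @runs = $string =~ /(X+)/g;   # list context: all captures
    return '' unless @runs;
    # Keep the earlier run on ties, like the while loop does.
    return reduce { length($b) > length($a) ? $b : $a } @runs;
}

print longest_run("aXXXbXXXXXcXd"), "\n";   # prints 'XXXXX'
```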

So remember, it is a mistake to think that the '*' and '+'
quantifiers match the longest instance of a string; they just match as
much as they can (or "gobble" up as much as they can) once a match has
been found -- even if the match was found at the very beginning of the
string.

This means that with a pattern like m/(X*)/, which can match an empty
string, the '*' quantifier will capture an empty string unless
what it's quantifying happens to sit at the very beginning of the string.
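A quick way to check this for yourself is a small sketch like the following; star_capture is a name invented for the example.

```perl
use strict;
use warnings;

# Hypothetical helper: return what (X*) captures on the first match.
sub star_capture {
    my ($string) = @_;
    $string =~ /(X*)/ or return;   # X* can match '', so this always succeeds
    return $1;
}

for my $s ("XXab", "aXXb", "ab", "") {
    printf "%-6s => '%s'\n", "'$s'", star_capture($s);
}
```

Only "XXab" yields 'XX'; the other three strings all capture ''.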

I hope this explanation helps.

-- Jean-Luc
 
xhoster@gmail.com
      02-13-2008
(E-Mail Removed) wrote:
> Hello,
>
> I would like to ask you for help. Could anybody explain me, why regex
> "/(X*)/" is not able to catch X in string "aXXXb". Quantifier "*" in
> this regex is a greedy one and there is not anchor "^", so I would
> expect, that $1 would contain XXX.


"Greedy" is a term of art in computer science. It does not have exactly
the same meaning as it does in religion or ethics or Marxism. Alas,
even the term of art is ambiguous in this context, as
there is no objective way of knowing what "locally optimal" means for a
regex. Fortunately, the documentation doesn't rely on you knowing
exactly what it means by greedy; it goes on to explain what the behavior
actually is. So don't get hung up on loaded words. I think the docs should
remove that reference and just stick to describing the behavior explicitly.

> I know (I have read it), that it is
> possible to use + instead of *, but I would like to know, why the "*"
> quantifier doesn't catch it.


Because it doesn't look ahead to see what better thing might happen
later; it makes local decisions. That is what greedy means as a term
of art, but in this case it applies not to the "*" but to the scanning.
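As a rough mental model only (the real engine is optimized and not implemented this way), the scanning can be pictured as trying each start position from left to right and taking the first one where the pattern matches. The sub name leftmost_match is made up for this sketch.

```perl
use strict;
use warnings;

# Conceptual sketch, not Perl's actual implementation: try every start
# position in order and stop at the first one where $re matches there.
sub leftmost_match {
    my ($string, $re) = @_;
    for my $start (0 .. length $string) {
        # \A pins the attempt to the current start position.
        if (substr($string, $start) =~ /\A(?:$re)/) {
            # $+[0] is the match length here, since the match begins at 0.
            return ($start, substr($string, $start, $+[0]));
        }
    }
    return;   # no position worked
}

my ($pos, $text) = leftmost_match("aXXXb", qr/X*/);
print "matched '$text' at $pos\n";   # matched '' at 0
```

The quantifier only gets to be greedy inside one attempt; the loop never compares attempts at different positions.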

Xho

 
ulrich_martin@seznam.cz
      02-14-2008
On Feb 13, 11:59 pm, Abigail <(E-Mail Removed)> wrote:
> _
> (E-Mail Removed) ((E-Mail Removed)) wrote on VCCLXXIX
> September MCMXCIII in <URL:news:(E-Mail Removed)>:
> $$ Hello,
> $$
> $$ I would like to ask you for help. Could anybody explain me, why regex
> $$ "/(X*)/" is not able to catch X in string "aXXXb". Quantifier "*" in
> $$ this regex is a greedy one and there is not anchor "^", so I would
> $$ expect, that $1 would contain XXX. I know (I have read it), that it is
> $$ possible to use + instead of *, but I would like to know, why the "*"
> $$ quantifier doesn't catch it.
>
> Because if a regexp can match in more than one way in the subject string,
> it will match at the left most position.
>
> /(X*)/ matches 0 or more X's. "aXXXb" starts with zero X's. So it matches
> at the beginning of the string. With 0 X's.
>
> $$ I have found this example in perlretut:
> $$ Finally,
> $$ "aXXXb" =~ /(X*)/; # matches with $1 = ''
> $$ because it can match zero copies of 'X' at the beginning of
> $$ the string. If you definitely want to match at least one
> $$ 'X', use "X+", not "X*".
>
> Right.
>
> Abigail
> --
> use lib sub {($\) = split /\./ => pop; print $"};
> eval "use Just" || eval "use another" || eval "use Perl" || eval "use Hacker";


Thank you all very much for the perfect explanations.
 