Velocity Reviews - Computer Hardware Reviews

regex behavior

 
 
Matija Papec
 
      10-01-2003

I went through perldoc but didn't find a similar regex:
print join ',', 'a bb ccc dddd' =~ /(\w)+/g;

The question is: what exactly does it match, and why?


--
Matija
 
Michael P. Broida
 
      10-01-2003
Abigail wrote:
>
> Matija Papec ((E-Mail Removed)) wrote on MMMDCLXXXIII September MCMXCIII
> in <URL:news:(E-Mail Removed) om>:
> --
> -- I went through perldoc but didn't find a similar regex:
> -- print join ',', 'a bb ccc dddd' =~ /(\w)+/g;
> --
> -- The question is: what exactly does it match, and why?
>
> /(\w)+/ matches a set of consecutive word characters, capturing
> the *last* one. //g in list context means, do this as often as
> possible (without overlap), returning a list of each of the submatches.
>
> So, 'a bb ccc dddd' =~ /(\w)+/g; returns for each substring of
> consecutive word characters the last one, resulting in 'a', 'b', 'c' and 'd'.


That tests out as you said, so it's MY thinking that's off.
Hopefully, you can clue me in.

I expected it to result in "a,bb,ccc,dddd". Now I realize that
it's the positioning of the + that causes it to get a single
character from each group. If the + is inside the (), it
prints what I expected.
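
For anyone following along, the two placements of the + can be compared side by side. This is only a small sketch; the variable names are made up for illustration.

```perl
use strict;
use warnings;

my $text = 'a bc def ghij';

# + outside the parens: the group repeats, and only its final
# iteration is still in the capture when each match ends.
my @last_chars = $text =~ /(\w)+/g;

# + inside the parens: one capture covering the whole run.
my @whole_runs = $text =~ /(\w+)/g;

print join(',', @last_chars), "\n";   # a,c,f,j
print join(',', @whole_runs), "\n";   # a,bc,def,ghij
```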

But... What is causing the original /(\w)+/ to get the LAST
character from each group instead of the FIRST character from
each group?

I changed the input string to 'a bc def ghij' and it printed
"a,c,f,j" as you noted. But I don't see why it's the LAST
character per group. At this point, I now expect "a,b,d,g".

Ignoring the () to populate the result list, the \w+ matches a
string of one or more characters. On the second match, it will
grab "bc".

Now why isn't the () part of that getting the FIRST of those
characters?

And what regex would you use to get the FIRST char of each group
since this one doesn't?

Mike
 
Jeff 'japhy' Pinyan
 
      10-01-2003
[posted & mailed]

On Wed, 1 Oct 2003, Michael P. Broida wrote:

> But... What is causing the original /(\w)+/ to get the LAST
> character from each group instead of the FIRST character from
> each group?


The location of the + modifier.

> Ignoring the () to populate the result list, the \w+ matches a
> string of one or more characters. On the second match, it will
> grab "bc".


DON'T ignore the (), they're important here. (\w+) is seen by the regex
as something like this:

OPEN $1
PLUS
ALNUM
CLOSE $1

whereas (\w)+ is seen as

PLUS
OPEN $1
ALNUM
CLOSE $1

> Now why isn't the () part of that getting the FIRST of those
> characters?


It does... but then the + modifier causes $1 to be repopulated with the
NEXT character \w matches, and so on.
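
The repopulating can be imitated by hand. The sketch below is not what the regex engine literally runs; it is a plain loop, with the hypothetical names $slot and @history, that takes one word character per pass the way the + repeats (\w).

```perl
use strict;
use warnings;

my $run = 'def';   # one run of word characters
my $slot;          # stands in for $1 (hypothetical name)
my @history;       # every value the slot held along the way

# \G anchors each attempt where the previous one stopped, so this loop
# takes one word character per pass, much as the + repeats (\w).
while ($run =~ /\G(\w)/g) {
    $slot = $1;            # overwrite, never append
    push @history, $slot;
}

print "slot held: @history\n";      # slot held: d e f
print "slot at the end: $slot\n";   # slot at the end: f
```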

> And what regex would you use to get the FIRST char of each group
> since this one doesn't?


I'd use /(\w)\w*/g, or perhaps /\b\w/g (if there are no parens in a /.../g
regex, you get whatever the regex matches returned).
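
Both suggestions can be checked against the 'a bc def ghij' string used earlier in the thread. A minimal sketch, with illustrative variable names:

```perl
use strict;
use warnings;

my $text = 'a bc def ghij';

# Capture the first word character, then consume the rest of the run
# outside the parens so it is matched but not returned.
my @first_by_capture = $text =~ /(\w)\w*/g;

# No parens at all: //g in list context returns each whole match, and
# \b\w can only match where a run of word characters begins.
my @first_by_boundary = $text =~ /\b\w/g;

print join(',', @first_by_capture), "\n";    # a,b,d,g
print join(',', @first_by_boundary), "\n";   # a,b,d,g
```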

--
Jeff Pinyan RPI Acacia Brother #734 2003 Rush Chairman
"And I vos head of Gestapo for ten | Michael Palin (as Heinrich Bimmler)
years. Ah! Five years! Nein! No! | in: The North Minehead Bye-Election
Oh. Was NOT head of Gestapo AT ALL!" | (Monty Python's Flying Circus)

 
Bill
 
      10-02-2003
> Ignoring the () to populate the result list, the \w+ matches a
> string of one or more characters. On the second match, it will
> grab "bc".
>
> Now why isn't the () part of that getting the FIRST of those
> characters?
>
> And what regex would you use to get the FIRST char of each group
> since this one doesn't?
>
> Mike



From `perldoc perlre`:

By default, a quantified subpattern is "greedy", that is, it will
match as many times as possible (given a particular starting location)
while still allowing the rest of the pattern to match. If you want it
to match the minimum number of times possible, follow the quantifier
with a "?".
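
Note that making the + lazy in this particular pattern does not give the first character of each group. A short sketch of what appears to happen instead (variable names are illustrative): each match stops after one character, so every character comes back as its own match.

```perl
use strict;
use warnings;

my $text = 'a bc def ghij';

# Greedy: each match swallows a whole run, and $1 ends up as its
# last character.
my @greedy = $text =~ /(\w)+/g;

# Lazy: each match stops after a single character, so every character
# becomes its own match and is returned.
my @lazy = $text =~ /(\w)+?/g;

print join(',', @greedy), "\n";   # a,c,f,j
print join(',', @lazy), "\n";     # a,b,c,d,e,f,g,h,i,j
```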
 
Michael P. Broida
 
      10-03-2003
Jeff 'japhy' Pinyan wrote:
>
> On Wed, 1 Oct 2003, Michael P. Broida wrote:
>
> > Now why isn't the () part of that getting the FIRST of those
> > > characters?
>
> It does... but then the + modifier causes $1 to be repopulated with the
> NEXT character \w matches, and so on.


(I e-mailed a different response, then thought about it more.)

Hmm, that explains it pretty well. I guess my only remaining
question would be: why does it actually "repopulate"??

It seems as though, once it matches that single character, it
would/should save it in $1 as the () directs, and the NEXT
matched character would go into $2 instead of being thrown
away, and the next in $3, etc. I mean, the + seems to be
telling it to repeat the entire (\w) operation, and THAT
is saving characters.

Is there an operator precedence kinda thing going on?? Maybe
the + has to "FINISH" before the () can save a value?? That
would make it completely understandable to me. <grin>

Thanks for the answers!
Mike
 
Michael P. Broida
 
      10-03-2003
David Oswald wrote:
>
> > > So, 'a bb ccc dddd' =~ /(\w)+/g; returns for each substring of
> > > consecutive word characters the last one, resulting in 'a', 'b', 'c' and 'd'.
> >
> > That tests out as you said, so it's MY thinking that's off.
> > Hopefully, you can clue me in.
> >
> > I expected it to result in "a,bb,ccc,dddd". Now I realize that
> > it's the positioning of the + that causes it to get a single
> > character from each group. If the + is inside the (), it
> > prints what I expected.
> >
> > But... What is causing the original /(\w)+/ to get the LAST
> > character from each group instead of the FIRST character from
> > > each group?
>
> Because, walking through your string of "a bb ccc dddd" look at what your
> regexp is doing:
> Pass 1, step 1: Find and capture "a". Return "a".
> Pass 2, step 1: Find and capture the first 'b'.
> Pass 2, step 2: Find the 2nd 'b', and replace the first 'b' with the second one.
> Return the 2nd 'b'.
> Pass 3, step 1: Find the first 'c' and capture it.
> Pass 3, step 2: Find the second 'c' and put it where the first 'c' had been captured.
> Pass 3, step 3: Find the third 'c' and put it where the 2nd 'c' had been
> captured. Return the 3rd 'c'.
> Pass 4...: you should get the idea by now.
>
> Think of the capturing parens as your pocket, and it only has room for one
> thing. The regexp puts the first thing it matches into the pocket. When it
> finds (due to the quantifier) that it matches the 2nd thing, take the first
> one out and put the 2nd one in. And so on.
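
One way to look inside the "pocket" after a match is through Perl's @- and @+ arrays, which hold the start and end offsets of the whole match (index 0) and of each capture group (index 1 and up). A small sketch on a single run, with illustrative variable names:

```perl
use strict;
use warnings;

my $run = 'ccc';
$run =~ /(\w)+/ or die "no match";

# Index 0 describes the whole match, index 1 describes what $1 holds.
my ($match_start, $match_end) = ($-[0], $+[0]);
my ($cap_start,   $cap_end)   = ($-[1], $+[1]);

print "whole match spans offsets $match_start to $match_end\n";   # 0 to 3
print "\$1 spans offsets $cap_start to $cap_end\n";               # 2 to 3
```

The whole match covers all three characters, but $1 covers only the last one.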


See my answer to Jeff 'japhy' Pinyan.

Your explanation makes sense, especially since the results are
just what (all of) you are espousing. <grin>

But I guess the part about "replacing" the value doesn't sit well
with me. I don't see any operator telling it to "replace" things.

It looks to me as though the (\w) part should save EACH char
that is matched into a separate $n variable. The + tells the
matching part to continue, but why doesn't the next pass through
(\w) save a NEW character in a NEW $n variable ($1,$2,etc)??

As I said in the other response: if the + operation must FINISH
before the () can save anything (one char), that would make it
all understandable to me. Operator precedence would cover that.

I'm not trying to argue here. It undeniably works as you've
said it does: test results bear that out. But I'm trying to
understand WHY it works that way and not another way that seems
to make as much sense to me.

Mike
 
Jeff 'japhy' Pinyan
 
      10-03-2003
[posted & mailed]

On Fri, 3 Oct 2003, Michael P. Broida wrote:

>> It does... but then the + modifier causes $1 to be repopulated with the
>> NEXT character \w matches, and so on.
>
> It seems as though, once it matches that single character, it
> would/should save it in $1 as the () directs, and the NEXT
> matched character would go into $2 instead of being thrown
> away, and the next in $3, etc. I mean, the + seems to be
> telling it to repeat the entire (\w) operation, and THAT
> is saving characters.


But you're ignoring how a regex is compiled. Watch:

perl -mre=debug -e 'qr/(a+)/'
...
Compiling REx `(a+)'
...
1: OPEN1(3)
3: PLUS(6)
4: EXACT <a>(0)
6: CLOSE1(8)
8: END(0)

versus:

perl -mre=debug -e 'qr/(a)+/'
...
Compiling REx `(a)+'
...
1: CURLYN[1] {1,32767}(11)
3: NOTHING(5)
5: EXACT <a>(0)
9: WHILEM(0)
10: NOTHING(11)
11: END(0)

A regex is compiled into an array of instructions, opcodes. Some opcodes
have additional data stored with them, such as the OPEN and CLOSE opcodes,
which have a number stored telling them WHICH $<DIGIT> variable to store
the matched content to. You can't change that. Each pair of capturing
parentheses refers to a SPECIFIC, SINGLE $<DIGIT>.
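
The practical consequence can be seen without the debug output. In this sketch (variable names are illustrative), one pair of parens yields one value no matter how often the group repeats; to get every character, one option is to capture the whole run once and split it afterwards.

```perl
use strict;
use warnings;

my $run = 'ghij';

# One pair of capturing parens means one returned value per match,
# however many times the + repeats the group.
my @captures = $run =~ /(\w)+/;
print scalar(@captures), "\n";    # 1
print "$captures[0]\n";           # j

# To get every character, capture the whole run once and split it.
my ($whole) = $run =~ /(\w+)/;
my @chars = split //, $whole;
print join(',', @chars), "\n";    # g,h,i,j
```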

--
Jeff Pinyan RPI Acacia Brother #734 2003 Rush Chairman
"And I vos head of Gestapo for ten | Michael Palin (as Heinrich Bimmler)
years. Ah! Five years! Nein! No! | in: The North Minehead Bye-Election
Oh. Was NOT head of Gestapo AT ALL!" | (Monty Python's Flying Circus)

 
Michael P. Broida
 
      10-06-2003
Abigail wrote:
>
> Michael P. Broida (michael.p.broida@boeing_oops.com) wrote on
> MMMDCLXXXIII September MCMXCIII in <URL:news:3F7B532C.7878A3BB@boeing_oops.com>:
> ,, But... What is causing the original /(\w)+/ to get the LAST
> ,, character from each group instead of the FIRST character from
> ,, each group?
>
> Would you expect:
>
> $x = $_ for qw /a b c d/
> print $x;
>
> to print 'a' as well?


It doesn't print anything without a semi-colon on the first line.
<grin>

At first glance, I thought it would print each letter. Then I
looked deeper and realized it's basically assigning and re-assigning
$x (via $_) during the "for" loop, but only printing it when it's all
done. Thus it only prints "d".
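
With the semicolon added, the loop and the regex can be run next to each other. A minimal sketch of the analogy; $last is a hypothetical name for the value left in the capture.

```perl
use strict;
use warnings;

my $x;
$x = $_ for qw/a b c d/;   # semicolon added; each pass overwrites $x
print "$x\n";              # d

# The regex behaves the same way: the capture is reassigned on every
# repetition of (\w), and only the final value is left at the end.
my ($last) = 'abcd' =~ /(\w)+/;
print "$last\n";           # d
```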

But the prior discussion was about a regex, not a "for" loop.
If your point is that the regex processing works similarly to
the "for" loop in your example, then I see what you mean.

If that's NOT what your point was, then you've lost me. <grin>

Mike
 
Michael P. Broida
 
      10-07-2003
Abigail wrote:
>
> Michael P. Broida (michael.p.broida@boeing_oops.com) wrote on
> MMMDCLXXXVIII September MCMXCIII in <URL:news:3F81D592.208F5420@boeing_oops.com>:
> '' > Would you expect:
> '' >
> '' > $x = $_ for qw /a b c d/
> '' > print $x;
> '' >
> '' > to print 'a' as well?
> ''
> '' It doesn't print anything without a semi-colon on the first line.
> '' <grin>
> ''
> '' At first glance, I thought it would print each letter. Then I
> '' looked deeper and realized it's basically assigning and re-assigning
> '' $x (via $_) during the "for" loop, but only printing it when it's all
> '' done. Thus it only prints "d".
> ''
> '' But the prior discussion was about a regex, not a "for" loop.
> '' If your point is that the regex processing works similarly to
> '' the "for" loop in your example, then I see what you mean.
> ''
> '' If that's NOT what your point was, then you've lost me. <grin>
>
> My point is, if you repeatedly assign something to a variable, do you
> expect the variable to retain the first value it was set to, or the
> last value? Because that's happening in both the match, and the for loop.


Ah. No, I wouldn't expect that. But then, I didn't know
that the *regex* was repeatedly assigning to the variable
WITHIN the (\w)+ portion. I -DID- expect it to assign a
new result for each letter group (a, bb, ccc, and dddd)
due to the //g. I did NOT know it was reassigning for
the \w within the () for each letter in a single group.

But now I do know that, thanks to the discussion here.

Thanks everyone!

Mike
 
Anno Siegel
 
      10-08-2003
Michael P. Broida <michael.p.broida@boeing_oops.com> wrote in comp.lang.perl.misc:
> Jeff 'japhy' Pinyan wrote:
> >
> > On Wed, 1 Oct 2003, Michael P. Broida wrote:
> >
> > > Now why isn't the () part of that getting the FIRST of those
> > > characters?
> >
> > It does... but then the + modifier causes $1 to be repopulated with the
> > NEXT character \w matches, and so on.
>
> (I e-mailed a different response, then thought about it more.)
>
> Hmm, that explains it pretty well. I guess my only remaining
> question would be: why does it actually "repopulate"??
>
> It seems as though, once it matches that single character, it
> would/should save it in $1 as the () directs, and the NEXT
> matched character would go into $2 instead of being thrown
> away, and the next in $3, etc. I mean, the + seems to be
> telling it to repeat the entire (\w) operation, and THAT
> is saving characters.


Yes, but it only has *one* $n variable to save to, determined by the number
of the opening parenthesis of the capturing pair. It isn't free to use
more $n variables for additional matches, because those may be occupied
by other capturing pairs.

So there's hardly a choice but to overwrite what's already there.
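
A pattern with a second capturing pair shows why the numbering has to stay fixed. In this sketch (the pattern and variable names are chosen only for illustration), the repeated group is always $1 and the digit group is always $2:

```perl
use strict;
use warnings;

# Groups are numbered by their opening paren: ([a-z]) is always $1 and
# (\d) is always $2, however many times the first group repeats.
my ($letter, $digit) = 'abc1' =~ /([a-z])+(\d)/;

print "$letter\n";   # c
print "$digit\n";    # 1
```

If the repeated letters spilled over into $2 and $3, the (\d) group would have no predictable number.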

Anno
 