Velocity Reviews > Perl > Matching multiple subexpressions in a regular expression

# Matching multiple subexpressions in a regular expression

ShaunJ
Guest
Posts: n/a

 03-11-2008
If more than one subepxression of a regex could have matched the
string, is it possible to find all the subexpressions that could have
matched? Example...

my \$re = qr/([AC]CT)|(AC[CT])/;
'CCT' =~ m/\$re/;
print join ',', @-; print "\n";
'ACC' =~ m/\$re/;
print join ',', @-; print "\n";
'ACT' =~ m/\$re/;
print join ',', @-; print "\n";

Output:
0,0
0,,0
0,0

This shows that for...
CCT: the first subexpression matched
ACC: the second subexpression matched
ACT: the first subexpression matched

However, ACT matched both subexpressions! The ideal result for ACT
would be...
0,0,0
showing that both subexpressions matched. Is this possible without
having to split each subexpression into its own regular expression? My
understanding -- please correct me if I'm wrong -- is that one big
regular expression will run faster than 100 little ones, since the one
big regular expression can be compiled into a single large finite-
state-machine that is more efficient than running 100 little FSM.

Thanks!
Shaun

John W. Krahn
Guest
Posts: n/a

 03-12-2008
ShaunJ wrote:
> If more than one subepxression of a regex could have matched the
> string, is it possible to find all the subexpressions that could have
> matched? Example...
>
> my \$re = qr/([AC]CT)|(AC[CT])/;
> 'CCT' =~ m/\$re/;
> print join ',', @-; print "\n";
> 'ACC' =~ m/\$re/;
> print join ',', @-; print "\n";
> 'ACT' =~ m/\$re/;
> print join ',', @-; print "\n";
>
> Output:
> 0,0
> 0,,0
> 0,0
>
> This shows that for...
> CCT: the first subexpression matched
> ACC: the second subexpression matched
> ACT: the first subexpression matched
>
> However, ACT matched both subexpressions! The ideal result for ACT
> would be...
> 0,0,0
> showing that both subexpressions matched.

Perl's regular expressions can't do that. They always stop after a
successful match so either ([AC]CT) would match or (AC[CT]) would match
but never both.

> Is this possible without
> having to split each subexpression into its own regular expression? My
> understanding -- please correct me if I'm wrong -- is that one big
> regular expression will run faster than 100 little ones,

> since the one
> big regular expression can be compiled into a single large finite-
> state-machine that is more efficient than running 100 little FSM.

That question is answered in perlfaq6:

perldoc -q "How do I efficiently match many regular expressions at once"

John
--
Perl isn't a toolbox, but a small machine shop where you
can special-order certain sorts of tools at low cost and
in short order. -- Larry Wall

ShaunJ
Guest
Posts: n/a

 03-12-2008
On Mar 11, 6:13 pm, "John W. Krahn" <(E-Mail Removed)> wrote:
....
> > since the one
> > big regular expression can be compiled into a single large finite-
> > state-machine that is more efficient than running 100 little FSM.

>
> That question is answered in perlfaq6:
>
> perldoc -q "How do I efficiently match many regular expressions at once"

Hi John,

If I structure my program as in the example, using many small regex
instead of one big regex, Perl 5.8.6 runs out of memory and dies:
vm_allocate failed, Out of memory! I have 400'000 regex of exactly 27
characters each, and the input string is one line 100 kB long. The
machine has 2 GB of memory and free disk space, which should be
enough, so I presume the code is somehow leeking memory. It's only a
dozen or so lines long, so I've posted my code below. Can you see an
obvious leak?

Thanks,
Shaun

my @restrings = <REFILE>;
my @re = map { qr/\$_/x } @restrings;
while (<>) {
my \$i = 0;
foreach my \$re (@re) {
\$i++;
pos = 0;
while (m/\$re/g) {
print \$i, "\t",
\$LAST_MATCH_START[0] + 1, "\t",
\$&, "\n";
pos = \$LAST_MATCH_START[0] + 1;
}
}
}

Ilya Zakharevich
Guest
Posts: n/a

 03-12-2008
[A complimentary Cc of this posting was sent to
John W. Krahn
<(E-Mail Removed)>], who wrote in article <k7GBj.100243\$C61.89884@edtnps89>:
> > my \$re = qr/([AC]CT)|(AC[CT])/;
> > 'ACT' =~ m/\$re/;
> > print join ',', @-; print "\n";
> >
> > Output:
> > 0,0
> >
> > This shows that for...
> > ACT: the first subexpression matched
> >
> > However, ACT matched both subexpressions! The ideal result for ACT
> > would be...
> > 0,0,0
> > showing that both subexpressions matched.

>
> Perl's regular expressions can't do that. They always stop after a
> successful match so either ([AC]CT) would match or (AC[CT]) would match
> but never both.

Perl's regular expressions can do it. You just follow the match by
(?!), and preceed it by (??{code}) which analizes and stores the match
results.

Hope this helps,
Ilya

ShaunJ
Guest
Posts: n/a

 03-13-2008
On Mar 12, 7:40 pm, Frank Seitz <(E-Mail Removed)> wrote:
> ShaunJ wrote:
> > If I structure my program as in the example, using many small regex
> > instead of one big regex, Perl 5.8.6 runs out of memory and dies:
> > vm_allocate failed, Out of memory! I have 400'000 regex of exactly 27
> > characters each, and the input string is one line 100 kB long.

> [...]
> > my @restrings = <REFILE>;
> > my @re = map { qr/\$_/x } @restrings;

>
> 400'000 precompiled regexes are quite a lot!
> Why don't you read and create them in chunks of, say, 1000?

Hi Frank,

400'000 * 27 = 12 MB. I wouldn't say it's that large. It seems
reasonable that the compiled regex would fit in less than one GB of
memory, which is my only requirement.

In any case, the following 9-line code snippet burns through 100MB of
memory a second using Perl 5.8.6! Certainly a memory leak. The only
explanation I can think of is if the m/\$re/g expression were
recompiling the regex every time and the previously compiled regex

Thanks,
Shaun

my @restrings = <REFILE>;
my @re = map { qr/\$_/x } @restrings;
while (<>) {
foreach my \$re (@re) {
while (m/\$re/g) {
print \$LAST_MATCH_START[0], "\n";
}
}
}

ShaunJ
Guest
Posts: n/a

 03-13-2008
On Mar 13, 1:08 pm, Frank Seitz <(E-Mail Removed)> wrote:
....
> > In any case, the following 9-line code snippet burns through 100MB of
> > memory a second using Perl 5.8.6! Certainly a memory leak. The only
> > explanation I can think of is if the m/\$re/g expression were
> > recompiling the regex every time and the previously compiled regex

>
> No, Perl precompiles the patterns (your 400'000 regexes)
> into an internal representation at the moment of qr//,
> that is at the beginning of your program.

That is my intention. A quick experiment with `top` shows that the
400'000 regex use 500 MB of memory once they're compiled, which is
fine by me. It's the following loop that leaks memory like crazy, and
it shouldn't use any additional memory. Any ideas why?

Swapping the loops won't have any effect, as there's only one string
(one line in the input file) for the first while(<>) loop.

Cheers,
Shaun

xhoster@gmail.com
Guest
Posts: n/a

 03-13-2008
ShaunJ <(E-Mail Removed)> wrote:

> Hi John,
>
> If I structure my program as in the example, using many small regex
> instead of one big regex, Perl 5.8.6 runs out of memory and dies:
> vm_allocate failed, Out of memory! I have 400'000 regex of exactly 27
> characters each, and the input string is one line 100 kB long. The
> machine has 2 GB of memory and free disk space, which should be
> enough, so I presume the code is somehow leeking memory. It's only a
> dozen or so lines long, so I've posted my code below. Can you see an
> obvious leak?
>
> Thanks,
> Shaun
>
> my @restrings = <REFILE>;
> my @re = map { qr/\$_/x } @restrings;
> while (<>) {

....

Can you produce a version that we can run? (I.e. that doesn't

The below doesn't leak on v5.8.3, 5.8.7, or 5.8.8.

use strict;
use warnings;
use English;

my @re = map { qr/\$_/x } '000000'..'400000';
push @re, qr/\d/; #just to make sure something matches

foreach
('000000000000000000000000000000'..'00000000000000 0000000001000000') {
print "\$_\n";
my \$i = 0;
foreach my \$re (@re) {
\$i++;
pos = 0;
while (m/\$re/g) {
print \$i, "\t",
\$LAST_MATCH_START[0] + 1, "\t",
\$&, "\n";
pos = \$LAST_MATCH_START[0] + 1;
}
}
}
__END__

--
The costs of publication of this article were defrayed in part by the
this fact.

ShaunJ
Guest
Posts: n/a

 03-13-2008
On Mar 13, 2:28 pm, (E-Mail Removed) wrote:
> ShaunJ <(E-Mail Removed)> wrote:
> > Hi John,

>
> > If I structure my program as in the example, using many small regex
> > instead of one big regex, Perl 5.8.6 runs out of memory and dies:
> > vm_allocate failed, Out of memory! I have 400'000 regex of exactly 27
> > characters each, and the input string is one line 100 kB long. The
> > machine has 2 GB of memory and free disk space, which should be
> > enough, so I presume the code is somehow leeking memory. It's only a
> > dozen or so lines long, so I've posted my code below. Can you see an
> > obvious leak?

>
> > Thanks,
> > Shaun

>
> > my @restrings = <REFILE>;
> > my @re = map { qr/\$_/x } @restrings;
> > while (<>) {

>
> ...
>
> Can you produce a version that we can run? (I.e. that doesn't
> depend on REFILE or STDIN, which we don't have access to?)

Yes, see the recent thread 'm// on very long lines leaks memory',
where I gave a small test case. As it turns out, there is a bug in
Perl 5.8.6 (which is shipped with MacOSX 10.4.11 incidentally). Using
either English or \$& causes the memory leak. This bug is fixed in
5.10.0.

Cheers,
Shaun

Uri Guttman
Guest
Posts: n/a

 03-13-2008
>>>>> "S" == ShaunJ <(E-Mail Removed)> writes:

S> On Mar 13, 2:28 pm, (E-Mail Removed) wrote:
>> ShaunJ <(E-Mail Removed)> wrote:
>> > Hi John,

>>
>> > If I structure my program as in the example, using many small regex
>> > instead of one big regex, Perl 5.8.6 runs out of memory and dies:
>> > vm_allocate failed, Out of memory! I have 400'000 regex of exactly 27
>> > characters each, and the input string is one line 100 kB long. The
>> > machine has 2 GB of memory and free disk space, which should be
>> > enough, so I presume the code is somehow leeking memory. It's only a
>> > dozen or so lines long, so I've posted my code below. Can you see an
>> > obvious leak?

>>
>> > Thanks,
>> > Shaun

>>
>> > my @restrings = <REFILE>;
>> > my @re = map { qr/\$_/x } @restrings;
>> > while (<>) {

>>
>> ...
>>
>> Can you produce a version that we can run? (I.e. that doesn't
>> depend on REFILE or STDIN, which we don't have access to?)

S> Yes, see the recent thread 'm// on very long lines leaks memory',
S> where I gave a small test case. As it turns out, there is a bug in
S> Perl 5.8.6 (which is shipped with MacOSX 10.4.11 incidentally). Using
S> either English or \$& causes the memory leak. This bug is fixed in
S> 5.10.0.

it is not a leak (as someone else proved in another post). so don't go
blabbing that it is a leak. it is a well known ram suck but it doesn't
lose the ram like a true leak does.

uri

--
Uri Guttman ------ http://www.velocityreviews.com/forums/(E-Mail Removed) -------- http://www.sysarch.com --
----- Perl Architecture, Development, Training, Support, Code Review ------
----------- Search or Offer Perl Jobs ----- http://jobs.perl.org ---------
--------- Gourmet Hot Cocoa Mix ---- http://bestfriendscocoa.com ---------

Dr.Ruud
Guest
Posts: n/a

 03-16-2008
ShaunJ schreef:

> As it turns out, there is a bug in
> Perl 5.8.6 (which is shipped with MacOSX 10.4.11 incidentally). Using
> either English or \$& causes the memory leak. This bug is fixed in
> 5.10.0.

It is neither a leak nor a bug. Read perldoc perlre.

--
Affijn, Ruud

"Gewoon is een tijger."