Need more efficient use of the substitution operator

 
 
Niall Macpherson
 
      09-17-2004
I don't use regexp / substitution handling very often, and although I
think I have a basic grasp, I am having problems understanding how
to make multiple substitutions of different characters within a
string. I understand the use of appending a 'g' to the command for
multiple substitutions of the same pattern, but the following code
looks as if it could be improved.

I am trying to find the first occurrence of anything between a '[' and
a ']' and return that string,

i.e. the following code should print 'STRING'. It appears to work but
seems a bit long-winded. Is there a better way of doing it?

use strict;
use warnings;
use diagnostics;

sub GetString
{
    my ($teststring) = @_;

    if ($teststring =~ /\[.*\]/)
    {
        my $match = $&;
        $match =~ s/\[//;
        $match =~ s/\]//;
        return($match);
    }
    else
    {
        return("");
    }
}

my $input = " foo [STRING] bar ";
my $output = GetString($input);
print "Result = '$output'";

Thanks
 
 
 
 
 
Gunnar Hjalmarsson
 
      09-17-2004
Niall Macpherson wrote:
> I don't use regexp / substitution handling very often and although
> I think I have a basic grasp I am having problems with
> understanding how to make multiple substitutions of different
> characters within a string. I understand the use of appending a 'g'
> to the command for multiple substitutions of the same pattern , but
> the following code looks as if it could be improved.
>
> I am trying to find the first occurrence of anything between a '['
> and a ']' and return that string


If you are trying to *find* something, it's not substitution you
should do, but you'd rather use the m// (matching) operator with
capturing parentheses (see "perldoc perlop").

> i.e the following code should print 'STRING'. It appears to work
> but seems a bit long winded. Is there a better way of doing it ?


<code snipped>

Indeed.

my $input = " foo [STRING] bar ";
print "Result = '", $input =~ /\[(.*?)\]/, "'\n";
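
The one-liner above relies on the match being evaluated in list context. A minimal sketch of the difference between list and scalar context (the variable names $captured and $matched are only for illustration):

```perl
use strict;
use warnings;

my $input = " foo [STRING] bar ";

# List context: a successful match returns the list of captured groups.
my ($captured) = $input =~ /\[(.*?)\]/;

# Scalar context: the match returns only true or false.
my $matched = $input =~ /\[(.*?)\]/;

print "captured = '$captured'\n";   # the text between the brackets
print "matched  = '$matched'\n";    # a true value, not the text
```

Forgetting the parentheses around the variable on the left-hand side is an easy way to end up with a true value where the captured text was expected.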

--
Gunnar Hjalmarsson
Email: http://www.gunnar.cc/cgi-bin/contact.pl
 
 
 
 
 
Anno Siegel
 
      09-17-2004
Niall Macpherson wrote in comp.lang.perl.misc:
> I don't use regexp / substitution handling very often and although I
> think I have a basic grasp I am having problems with understanding how
> to make multiple substitutions of different characters within a
> string. I understand the use of appending a 'g' to the command for
> multiple substitutions of the same pattern , but the following code
> looks as if it could be improved.
>
> I am trying to find the first occurrence of anything between a '[' and
> a ']'
> and return that string


That is, you want to match part of a string and return the result.
That is what capturing parentheses are for.

> i.e the following code should print 'STRING'. It appears to work but
> seems a bit long winded. Is there a better way of doing it ?


It doesn't even do exactly what you want. Test it with
" foo [STRING] [A-LING] bar ".

> use strict;
> use warnings;
> use diagnostics;
>
> sub GetString
> {
> my ($teststring) = @_;
>
> if ($teststring =~ /\[.*\]/)


This matches everything from the first opening "[" to the last closing
"]". To catch only the first pair, make the /.*/ non-greedy:

/\[.*?\]/
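
A small sketch of the difference, using the test string suggested above. The extra parentheses around each pattern are added here only so that the whole match can be printed:

```perl
use strict;
use warnings;

my $input = " foo [STRING] [A-LING] bar ";

# Greedy: .* runs to the last "]" in the string.
my ($greedy) = $input =~ /(\[.*\])/;

# Non-greedy: .*? stops at the first "]" that lets the match succeed.
my ($lazy) = $input =~ /(\[.*?\])/;

print "greedy: $greedy\n";   # [STRING] [A-LING]
print "lazy:   $lazy\n";     # [STRING]
```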

> {
> my $match = $&;
> $match =~ s/\[//;
> $match =~ s/\]//;
> return($match);


You could have returned the substring of $match from the second to
the next-to-last character, instead of deleting the brackets:

return substr( $match, 1, -1);

But see below.

> }
> else
> {
> return("");


It would be wiser to return nothing instead of an empty string in
case of failure. An empty string is a legitimate return value
for an empty "[]". Just

return;

> }
> }
>
> my $input = " foo [STRING] bar ";
> my $output = GetString($input);
> print "Result = '$output'";


The use of $& to capture the match is still supported, but there are
better ways. Use capturing parentheses to extract exactly the part
of the match you want. That way, you get the content of the "[...]"
directly:

my ( $match ) = $teststring =~ /\[(.*?)\]/;

That is all. Putting it together:

sub GetString {
    my $teststring = shift;
    my ( $match) = $teststring =~ /\[(.*?)\]/ or return;
    $match;
}

or even

sub GetString { ( shift =~ /\[(.*?)\]/)[ 0] }
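
As a rough illustration of why a bare return is useful, here is a sketch with a hypothetical helper named first_bracketed (not part of the code above) that follows the same pattern. It lets the caller tell an empty "[]" apart from no match at all:

```perl
use strict;
use warnings;

# Hypothetical helper, written in the style of the functions above.
sub first_bracketed {
    my ($text) = @_;
    my ($match) = $text =~ /\[(.*?)\]/
        or return;        # bare return: undef in scalar context, () in list context
    return $match;
}

for my $text (' foo [STRING] bar ', ' foo [] bar ', ' no brackets ') {
    my $result = first_bracketed($text);
    print defined $result ? "found '$result'\n" : "no match\n";
}
```

The list assignment is evaluated in scalar context by "or", so it yields the number of values the match returned: one for a successful match (even with an empty capture), zero for a failed one.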

Anno
 
 
A. Sinan Unur
 
      09-17-2004
Niall Macpherson wrote:

> I am trying to find the first occurrence of anything between a '[' and
> a ']' and return that string


In addition to the useful responses by others, consider reading the faq
entry

perldoc -q match

Also, for simple string matches, keep in mind the index function:

perldoc -f index

> use strict;
> use warnings;
> use diagnostics;
>
> sub GetString
> {
> my ($teststring) = @_;
>
> if ($teststring =~ /\[.*\]/)
> {
> my $match = $&;


Have you read perldoc perlvar?

$& The string matched by the last successful pattern match
....
The use of this variable anywhere in a program imposes a
considerable performance penalty on all regular expression
matches. See "BUGS".
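
If the whole matched text really is needed, one way to get it without mentioning $& is to use the offset arrays @- and @+, which perlvar also documents. A minimal sketch, assuming the same sample input as earlier in the thread:

```perl
use strict;
use warnings;

my $text = " foo [STRING] bar ";
my $whole;

if ($text =~ /\[.*?\]/) {
    # $-[0] is the offset where the match starts, $+[0] the offset where it ends.
    $whole = substr($text, $-[0], $+[0] - $-[0]);
    print "whole match: $whole\n";   # [STRING]
}
```

For this particular problem, capturing parentheses are still simpler, since they return the text without the brackets.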

If you wanted to do what you are doing above in a better way, you could
do this:

#! perl

use strict;
use warnings;

my $s = 'Hello [ insert planet name here ]';

print scalar find_bracketed_string($s), "\n";

sub find_bracketed_string {
    my ($s) = @_;

    my ($l, $r);

    if(($l = 1 + index $s, '[') > $[
        and ($r = index $s, ']', $l) >= $[) {
        my $rs = substr $s, $l, $r - $l;
        return wantarray ? ($rs, $r + 1) : $rs;
    }

    return;
}

Sinan.
 
 
Niall Macpherson
 
      09-17-2004
Gunnar Hjalmarsson wrote:
>
> If you are trying to *find* something, it's not substitution you
> should do, but you'd rather use the m// (matching) operator with
> capturing parentheses (see "perldoc perlop").
>


Thanks, Gunnar. The reason that I was doing the substitution was that
I didn't fully understand the concept of the capturing parentheses in
a regexp.

Therefore all I had to work with was the string [STRING] returned
via the $& variable, which needed the '[' and ']' removed.

In your example you use the return value from the expression. Am I
right in thinking that this value will also be in $1?

And if I have multiple regexps inside my expression then the matches
will be in $1, $2, $3?
 
 
Gunnar Hjalmarsson
 
      09-17-2004
Niall Macpherson wrote:
> Gunnar Hjalmarsson wrote:
>>
>> my $input = " foo [STRING] bar ";
>> print "Result = '", $input =~ /\[(.*?)\]/, "'\n";

>
> In your example you use the return value from the expression. Am I
> right in thinking that this value will also be in $1 ?


If there is a match: yes, otherwise: no. Consequently, if you want to
work with $1, $2 etc., you need to first check if the match succeeded,
and only use those variables if it did.

> And if I have multiple regexps inside my expression then the matches
> will be in $1, $2, $3 ?


No. The dollar-digit variables contain what was captured by the last
successful match.

Or did you mean multiple pairs of capturing parentheses inside the
regex? If you had asked that, the answer would have been yes. (Again
provided that the match succeeded.)
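
A short sketch of the multiple-parentheses case. The sample line and the names $first and $second are made up for illustration:

```perl
use strict;
use warnings;

my $line = "key [alpha] value [beta]";
my ($first, $second);

# Two pairs of capturing parentheses in one regex fill $1 and $2,
# numbered by the position of their opening parenthesis.
if ($line =~ /\[(.*?)\].*?\[(.*?)\]/) {
    ($first, $second) = ($1, $2);
    print "first = $first, second = $second\n";
}
else {
    print "no match\n";
}
```

The variables are only read inside the branch where the match is known to have succeeded.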

--
Gunnar Hjalmarsson
Email: http://www.gunnar.cc/cgi-bin/contact.pl
 
 
Michael Slass
 
      09-17-2004
Gunnar Hjalmarsson writes:

>Niall Macpherson wrote:
>> I am trying to find the first occurrence of anything between a '['
>> and a ']' and return that string

>
>If you are trying to *find* something, it's not substitution you
>should do, but you'd rather use the m// (matching) operator with
>capturing parentheses (see "perldoc perlop").
>
>
>Indeed.
>
> my $input = " foo [STRING] bar ";
> print "Result = '", $input =~ /\[(.*?)\]/, "'\n";
>
>--

Is there a difference in regex efficiency between the non-greedy ".*?" as
used above, and the more specific "[^]]*"? I can't remember the
backtracking rules for NFA non-greedy quantifiers, and my Mastering
Regular Expressions is out on loan.

--
Mike Slass
 
 
Gunnar Hjalmarsson
 
      09-17-2004
Michael Slass wrote:
> Gunnar Hjalmarsson writes:
>>
>> my $input = " foo [STRING] bar ";
>> print "Result = '", $input =~ /\[(.*?)\]/, "'\n";

>
> Is there a difference in regex efficiency between the non-greedy
> ".*?" as used above, and the more specific "[^]]*"?


Not sure, but I believe the latter is more efficient (but two more
characters to type...).

> I can't remember the backtracking rules for NFA non-greedy
> quantifiers, and my Mastering Regular Expressions is out on loan.


Do a benchmark!
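
One possible way to run such a benchmark is with the core Benchmark module. This is only a sketch: the input string, its length, and the sub names lazy_match and negated_match are assumptions, and the numbers will vary with the Perl version and the input.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Assumed test input: a fairly long bracketed section.
my $input = " foo [" . ("x" x 200) . "] bar ";

sub lazy_match    { my ($m) = $input =~ /\[(.*?)\]/;   return $m }
sub negated_match { my ($m) = $input =~ /\[([^]]*)\]/; return $m }

# Run each sub for at least one CPU second and print a comparison table.
cmpthese(-1, {
    'lazy .*?'      => \&lazy_match,
    'negated [^]]*' => \&negated_match,
});
```

It is worth repeating the run with inputs that resemble the real data, since the relative cost can depend on how long the bracketed text is.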

--
Gunnar Hjalmarsson
Email: http://www.gunnar.cc/cgi-bin/contact.pl
 
 
Michael Slass
 
      09-17-2004
Gunnar Hjalmarsson writes:

>Michael Slass wrote:
>> Gunnar Hjalmarsson writes:
>>> my $input = " foo [STRING] bar ";
>>> print "Result = '", $input =~ /\[(.*?)\]/, "'\n";

>> Is there a difference in regex efficiency between the non-greedy
>> ".*?" as used above, and the more specific "[^]]*"?

>
>Do a benchmark!



Yup, that's the true engineer's answer; I'm more interested in the
professor's answer -- *why* the faster one is faster. A rule from
Mastering Regular Expressions, "Say what you mean", seems to come to
mind --- in this case, we mean "anything that's not ]" --- so "[^]]*"
is more exact.
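
Apart from speed, the two patterns do not mean quite the same thing, which is one practical reading of "say what you mean". A small sketch with a made-up input that has a newline inside the brackets:

```perl
use strict;
use warnings;

my $text = "before [two\nlines] after";

# Without the /s modifier, "." does not match a newline, so this fails here.
my ($lazy) = $text =~ /\[(.*?)\]/;

# "[^]]" matches any character except "]", including a newline.
my ($negated) = $text =~ /\[([^]]*)\]/;

print defined $lazy    ? "lazy matched\n"    : "lazy: no match\n";
print defined $negated ? "negated matched\n" : "negated: no match\n";
```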

I'll try to dig up the Dragon book for the regex discussion on NFA
backtracking and *.
--
Mike Slass
 
 
Eric Bohlman
 
      09-18-2004
Gunnar Hjalmarsson wrote in news:2r0guiF144m9kU1@uni-berlin.de:

>> In your example you use the return value from the expression. Am I
>> right in thinking that this value will also be in $1 ?

>
> If there is a match: yes, otherwise: no. Consequently, if you want to
> work with $1, $2 etc., you need to first check if the match succeeded,
> and only use those variables if it did.


Just to amplify on this (I'm sure you know it, but many newbies won't): if
the match failed, the $digit variables will be *untouched*. Not set to ""
or undef or anything like that. In particular, if a regex succeeds once
and then fails on subsequent input, the $digit variables will still have
the values *left over from the successful match*. Failing to take this
into account can lead to extremely puzzling bugs (which often result in
plausible-looking but incorrect output).
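
A minimal sketch of that pitfall, using made-up strings and variable names. Both matches happen in the same scope, so the second, failed match leaves $1 as it was:

```perl
use strict;
use warnings;

my $good = "id=42";
my $bad  = "no id here";

$good =~ /id=(\d+)/;          # succeeds, sets $1 to 42
my $after_success = $1;

$bad =~ /id=(\d+)/;           # fails, leaves $1 alone
my $after_failure = $1;       # still 42, left over from the first match

print "after the failed match, \$1 is still $after_failure\n";

# Safer: take the capture from the match's return value instead.
my ($id) = $bad =~ /id=(\d+)/;
print defined $id ? "id = $id\n" : "no id found\n";
```

Assigning the match result in list context, or testing the match before touching $1, avoids picking up the stale value.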
 
 
 
 