regexp s// too greedy

 
 
bettyann
 
      11-11-2004
hi all,

can anyone help me limit the greediness of my substitution pattern? i
have a CSV file and i want to insert a new column of values after the
6th column. but the new data to be inserted is dependent upon the
value of the 6th column.

example original data:
2,NaN,NaN,NaN,64,hold.bmp,1607444,NaN,NaN,NaN,hold.bmp,NaN,1
1,NaN,NaN,NaN,32,hold.bmp,1607488,NaN,NaN,NaN,hold.bmp,3,1
5,NaN,NaN,4,32,hold.bmp,1607503,NaN,NaN,8,go.bmp,NaN,1
8,NaN,NaN,4,32,NaN,1607564,NaN,NaN,8,hold.bmp,NaN,1

i want to put "0" after the 6th column if the 6th column contains
"hold.bmp".
i want to put "-1" after the 6th column if the 6th column contains
"NaN".

i thought i could do this with two substitutions commands:

s/^((.*?,){5}?(hold.bmp))/$1,0/
s/^((.*?,){5}?(NaN))/$1,-1/

i cannot limit the matching of "hold.bmp" or "NaN". i want this
pattern to match *only* if "hold.bmp" or "NaN" immediately follows the
5th column.

my test code:
#!/usr/local/bin/perl

use strict;
use warnings;

my $input = <<EOF;
2,NaN,NaN,NaN,64,hold.bmp,1607444,NaN,NaN,NaN,hold.bmp,NaN,1
1,NaN,NaN,NaN,32,hold.bmp,1607488,NaN,NaN,NaN,hold.bmp,3,1
5,NaN,NaN,4,32,hold.bmp,1607503,NaN,NaN,8,go.bmp,NaN,1
8,NaN,NaN,4,32,NaN,1607564,NaN,NaN,8,hold.bmp,NaN,1
EOF

my @oData = split( '\n', $input );
my $line;
my $cnt = 0;
foreach $line ( @oData ) {
    printf( "$cnt) $line \n" );
    $cnt++;
}

my $prevCol = 5;
my @txtList = ( "hold.bmp", "NaN" );
my @valList = ( "0", "-1" );
my ( $txt, $cmd, $i );
$i = 0;
foreach $txt ( @txtList ) {
    $cmd = sprintf( '$line =~ s/^((.*?,){%d}?(%s))/$1,%s/;',
                    $prevCol, $txt, $valList[$i] );
    printf( "\ncmd >>$cmd<< \n" );
    foreach $line ( @oData ) {
        printf( "orig line |$line| \n" );
        eval $cmd;
        printf( " new line |$line| \n---------------------\n" );
    }
    $i++;
}

exit;

output:
% test2.pl
0) 2,NaN,NaN,NaN,64,hold.bmp,1607444,NaN,NaN,NaN,hold.bmp,NaN,1
1) 1,NaN,NaN,NaN,32,hold.bmp,1607488,NaN,NaN,NaN,hold.bmp,3,1
2) 5,NaN,NaN,4,32,hold.bmp,1607503,NaN,NaN,8,go.bmp,NaN,1
3) 8,NaN,NaN,4,32,NaN,1607564,NaN,NaN,8,hold.bmp,NaN,1

cmd >>$line =~ s/^((.*?,){5}?(hold.bmp))/$1,0/;<<
orig line |2,NaN,NaN,NaN,64,hold.bmp,1607444,NaN,NaN,NaN,hold.bmp,NaN,1|
new line |2,NaN,NaN,NaN,64,hold.bmp,0,1607444,NaN,NaN,NaN,hold.bmp,NaN,1|
---------------------
orig line |1,NaN,NaN,NaN,32,hold.bmp,1607488,NaN,NaN,NaN,hold.bmp,3,1|
new line |1,NaN,NaN,NaN,32,hold.bmp,0,1607488,NaN,NaN,NaN,hold.bmp,3,1|
---------------------
orig line |5,NaN,NaN,4,32,hold.bmp,1607503,NaN,NaN,8,go.bmp,NaN,1|
new line |5,NaN,NaN,4,32,hold.bmp,0,1607503,NaN,NaN,8,go.bmp,NaN,1|
---------------------
orig line |8,NaN,NaN,4,32,NaN,1607564,NaN,NaN,8,hold.bmp,NaN,1|
new line |8,NaN,NaN,4,32,NaN,1607564,NaN,NaN,8,hold.bmp,0,NaN,1|
---------------------

cmd >>$line =~ s/^((.*?,){5}?(NaN))/$1,-1/;<<
orig line |2,NaN,NaN,NaN,64,hold.bmp,0,1607444,NaN,NaN,NaN,hold.bmp,NaN,1|
new line |2,NaN,NaN,NaN,64,hold.bmp,0,1607444,NaN,-1,NaN,NaN,hold.bmp,NaN,1|
---------------------
orig line |1,NaN,NaN,NaN,32,hold.bmp,0,1607488,NaN,NaN,NaN,hold.bmp,3,1|
new line |1,NaN,NaN,NaN,32,hold.bmp,0,1607488,NaN,-1,NaN,NaN,hold.bmp,3,1|
---------------------
orig line |5,NaN,NaN,4,32,hold.bmp,0,1607503,NaN,NaN,8,go.bmp,NaN,1|
new line |5,NaN,NaN,4,32,hold.bmp,0,1607503,NaN,-1,NaN,8,go.bmp,NaN,1|
---------------------
orig line |8,NaN,NaN,4,32,NaN,1607564,NaN,NaN,8,hold.bmp,0,NaN,1|
new line |8,NaN,NaN,4,32,NaN,-1,1607564,NaN,NaN,8,hold.bmp,0,NaN,1|
---------------------

thanks,
- bettyann
 
 
Stuart Moore
 
      11-11-2004
bettyann wrote:

> hi all,
>
> can anyone help me limit the greediness of my substitution pattern? i
> have a CSV file and i want to insert a new column of values after the
> 6th column. but the new data to be inserted is dependent upon the
> value of the 6th column.
>
> example original data:
> 2,NaN,NaN,NaN,64,hold.bmp,1607444,NaN,NaN,NaN,hold.bmp,NaN,1
> 1,NaN,NaN,NaN,32,hold.bmp,1607488,NaN,NaN,NaN,hold.bmp,3,1
> 5,NaN,NaN,4,32,hold.bmp,1607503,NaN,NaN,8,go.bmp,NaN,1
> 8,NaN,NaN,4,32,NaN,1607564,NaN,NaN,8,hold.bmp,NaN,1
>
> i want to put "0" after the 6th column if the 6th column contains
> "hold.bmp".
> i want to put "-1" after the 6th column if the 6th column contains
> "NaN".
>
> i thought i could do this with two substitutions commands:
>
> s/^((.*?,){5}?(hold.bmp))/$1,0/
> s/^((.*?,){5}?(NaN))/$1,-1/


^ Not sure that you want that ?

I suggest replacing (.*?,) with ([^,]*) assuming there isn't some way of
commas appearing escaped within the data.
 
 
Stuart Moore
 
      11-11-2004
Stuart Moore wrote:

> I suggest replacing (.*?,) with ([^,]*) assuming there isn't some way of
> commas appearing escaped within the data.


That should have been ([^,]*,) of course
 
 
Gunnar Hjalmarsson
 
      11-11-2004
bettyann wrote:
> can anyone help me limit the greediness of my substitution pattern? i
> have a CSV file and i want to insert a new column of values after the
> 6th column. but the new data to be inserted is dependent upon the
> value of the 6th column.
>
> example original data:
> 2,NaN,NaN,NaN,64,hold.bmp,1607444,NaN,NaN,NaN,hold.bmp,NaN,1
> 1,NaN,NaN,NaN,32,hold.bmp,1607488,NaN,NaN,NaN,hold.bmp,3,1
> 5,NaN,NaN,4,32,hold.bmp,1607503,NaN,NaN,8,go.bmp,NaN,1
> 8,NaN,NaN,4,32,NaN,1607564,NaN,NaN,8,hold.bmp,NaN,1
>
> i want to put "0" after the 6th column if the 6th column contains
> "hold.bmp".
> i want to put "-1" after the 6th column if the 6th column contains
> "NaN".
>
> i thought i could do this with two substitutions commands:
>
> s/^((.*?,){5}?(hold.bmp))/$1,0/
> s/^((.*?,){5}?(NaN))/$1,-1/
>
> i cannot limit the matching of "hold.bmp" or "NaN". i want this
> pattern to match *only* if "hold.bmp" or "NaN" immediately follows the
> 5th column.


Limiting to a fixed number of occurrences while using '.*' is
contradictory, irrespective of greediness. Besides a few other things, I
believe that the most important change you should make is to get rid of
that problem by replacing the '.' meta character with the character
class '[^,]'. This might do it, using only one substitution:

s/^((?:[^,]*,){5}(?:(hold\.bmp)|NaN))/"$1,".($2 ? '0' : '-1')/e;
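
To see the difference concretely, here is a minimal sketch that runs both styles of pattern over the last sample row, where column 6 is NaN but hold.bmp appears further along. The variable names are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Last sample row: column 6 is NaN, but hold.bmp shows up later in column 11.
my $row = '8,NaN,NaN,4,32,NaN,1607564,NaN,NaN,8,hold.bmp,NaN,1';

# '.' also matches commas, so the fifth '.*?,' can backtrack across
# several columns until 'hold.bmp' is finally found.
(my $with_dot = $row) =~ s/^((.*?,){5}hold\.bmp)/$1,0/;

# '[^,]*,' cannot cross a comma, so exactly five columns are skipped and
# the match fails unless column 6 itself is hold.bmp.
(my $with_class = $row) =~ s/^((?:[^,]*,){5}hold\.bmp)/$1,0/;

print "dot pattern:   $with_dot\n";
print "class pattern: $with_class\n";
```

With the dot pattern the ",0" lands after the later hold.bmp; with the character class the row is left untouched, which is the behaviour asked for.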

--
Gunnar Hjalmarsson
Email: http://www.gunnar.cc/cgi-bin/contact.pl
 
 
Anno Siegel
 
      11-11-2004
bettyann wrote in comp.lang.perl.misc:
> hi all,
>
> can anyone help me limit the greediness of my substitution pattern? i
> have a CSV file and i want to insert a new column of values after the
> 6th column. but the new data to be inserted is dependent upon the
> value of the 6th column.
>
> example original data:
> 2,NaN,NaN,NaN,64,hold.bmp,1607444,NaN,NaN,NaN,hold.bmp,NaN,1
> 1,NaN,NaN,NaN,32,hold.bmp,1607488,NaN,NaN,NaN,hold.bmp,3,1
> 5,NaN,NaN,4,32,hold.bmp,1607503,NaN,NaN,8,go.bmp,NaN,1
> 8,NaN,NaN,4,32,NaN,1607564,NaN,NaN,8,hold.bmp,NaN,1
>
> i want to put "0" after the 6th column if the 6th column contains
> "hold.bmp".
> i want to put "-1" after the 6th column if the 6th column contains
> "NaN".
>
> i thought i could do this with two substitutions commands:
>
> s/^((.*?,){5}?(hold.bmp))/$1,0/
> s/^((.*?,){5}?(NaN))/$1,-1/
>
> i cannot limit the matching of "hold.bmp" or "NaN". i want this
> pattern to match *only* if "hold.bmp" or "NaN" immediately follows the
> 5th column.


[code appreciated, but snipped]

I'd use split and splice for that, not a regex (except that split also
uses a regex). Then you can comfortably look at the preceding field
and decide what goes after it. For instance:

while ( <DATA> ) {
    my @l = split /,/;
    splice @l, 6, 0, $l[ 5] eq 'hold.bmp' ? 0 : -1;
    print join ',', @l;
}

Anno
 
 
Janek Schleicher
 
      11-11-2004
On Wed, 10 Nov 2004 19:38:35 -0800, bettyann wrote:

> can anyone help me limit the greediness of my substitution pattern? i
> have a CSV file and i want to insert a new column of values after the
> 6th column. but the new data to be inserted is dependent upon the
> value of the 6th column.


Well, when handling CSV files, why not use one of the
numerous modules on CPAN (http://search.cpan.org?query=CSV)?
E.g. with Text::CSV_XS the following snippet works without having to
worry about parsing CSV:

#!/usr/bin/perl

use strict;
use warnings;

use Text::CSV_XS;

my $csv = Text::CSV_XS->new();
while (<DATA>) {
    chomp;
    $csv->parse($_) or die "Couldn't parse '$_' as CSV";
    my @col = $csv->fields;
    $csv->combine(@col[0..5],($col[5] eq 'hold.bmp' ? 0 : -1),@col[6..$#col]);
    print $csv->string,"\n";
}

__DATA__
2,NaN,NaN,NaN,64,hold.bmp,1607444,NaN,NaN,NaN,hold.bmp,NaN,1
1,NaN,NaN,NaN,32,hold.bmp,1607488,NaN,NaN,NaN,hold.bmp,3,1
5,NaN,NaN,4,32,hold.bmp,1607503,NaN,NaN,8,go.bmp,NaN,1
8,NaN,NaN,4,32,NaN,1607564,NaN,NaN,8,hold.bmp,NaN,1


Greetings,
Janek
 
 
bettyann
 
      11-11-2004
thanks to everyone who replied -- all suggestions are good.

stuart and gunnar -- using pattern ([^,]*,) rather than (.*?,) works
as i need. i understand now that i need to use a pattern that
describes the negative of what i want rather than a pattern that
describes what i *do* want. thanks for the suggestion and the new way
of thinking.

len and anno -- i did consider using split/join but since the CSV file
has thousands of lines, i thought maybe regexp might be faster. i'm
not sure, tho, as i haven't done a benchmark.

janek -- Text::CSV_XS looks really nice. i'll certainly investigate
this package more in the future.

one last clarification, i actually have more than two different cases,
ie:

s/^(([^,]*,){5}hold.bmp)/$1,0/;
s/^(([^,]*,){5}go.bmp)/$1,1/;
s/^(([^,]*,){5}slow.bmp)/$1,2/;
s/^(([^,]*,){5}speed.bmp)/$1,3/;
s/^(([^,]*,){5}NaN)/$1,-1/;

so i don't think the "?:" combination would be as straight forward.

thanks for all the help. greatly appreciated.
- bettyann
 
 
Anno Siegel
 
      11-11-2004
bettyann wrote in comp.lang.perl.misc:
> thanks to everyone who replied -- all suggestions are good.


[...]

> len and anno -- i did consider using split/join but since the CSV file
> has thousands of lines, i thought maybe regexp might be faster. i'm
> not sure, tho, as i haven't done a benchmark.


I don't think split will be significantly slower than a regex solution.
While split *implies* the use of a regex for the delimiter, that is
usually a very simple one which will predictably perform well enough.
The rest split does is (in principle, not in detail) what a capturing
regex does too. The performance of a pure-regex solution is much
harder to predict.

If anything, splice may slow it down a bit, but no more than the actual
substitution slows down the "regex" solution. I wouldn't expect a
significant difference between split and regex, but if there is, I'd
expect the regex to be slower.
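
Rather than guessing, the two approaches can be timed with the core Benchmark module. The following is only a sketch: the sub names via_split and via_regex and the two-entry hash are made up for illustration, and the numbers will depend on the machine and the data.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

my %replace = ( 'hold.bmp' => 0, 'NaN' => -1 );
my $row = '2,NaN,NaN,NaN,64,hold.bmp,1607444,NaN,NaN,NaN,hold.bmp,NaN,1';

# split/splice: insert the looked-up value as a new 7th field.
sub via_split {
    my @f = split /,/, $_[0];
    splice @f, 6, 0, $replace{ $f[5] };
    return join ',', @f;
}

# Single substitution anchored on exactly five comma-free fields.
sub via_regex {
    my $line = $_[0];
    $line =~ s/^((?:[^,]*,){5}([^,]*))/$1,$replace{$2}/;
    return $line;
}

# Run each candidate for at least one CPU second and print a comparison table.
cmpthese( -1, {
    'split_splice' => sub { via_split($row) },
    'regex_subst'  => sub { via_regex($row) },
} );
```

Both subs return the same string for this row, so the comparison is like for like.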

> janek -- Text::CSV_XS looks really nice. i'll certainly investigate
> this package more in the future.
>
> one last clarification, i actually have more than two different cases,
> ie:
>
> s/^(([^,]*,){5}hold.bmp)/$1,0/;
> s/^(([^,]*,){5}go.bmp)/$1,1/;
> s/^(([^,]*,){5}slow.bmp)/$1,2/;
> s/^(([^,]*,){5}speed.bmp)/$1,3/;
> s/^(([^,]*,){5}NaN)/$1,-1/;
>
> so i don't think the "?:" combination would be as straight forward.


Now this is something that's going to slow it down a bit, matching n times
for n possibilities. A hash lets you do them all in one go. Quite simple:

my %replace = (
'hold.bmp' => 0,
'go.bmp' => 1,
# ...
NaN => -1,
);

Then the five substitutions could become (untested, probably more
the spirit than the real thing)

s/^(([^,]*,){5}([^,]*))/$1,$replace{ $2}/;

But I can't say I like the regex you're using. Only a short regex
is a good regex, that one is much too long. I still favor the
split solution, if only because it works on the actual data, not
their messy representation. The hash can be used with that too,
in the obvious way.
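
A minimal sketch of the split solution combined with the hash might look like this. The helper name add_column is hypothetical, the values for slow.bmp and speed.bmp are taken from the five substitutions quoted above, and leaving a row unchanged when column 6 has no entry in the hash is an assumption about the desired behaviour.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my %replace = (
    'hold.bmp'  => 0,
    'go.bmp'    => 1,
    'slow.bmp'  => 2,
    'speed.bmp' => 3,
    'NaN'       => -1,
);

# Hypothetical helper: returns the line with the new column inserted
# after column 6, or unchanged if column 6 has no entry in %replace.
sub add_column {
    my ($line) = @_;
    my @f = split /,/, $line, -1;    # -1 keeps trailing empty fields
    return $line unless @f >= 6 && exists $replace{ $f[5] };
    splice @f, 6, 0, $replace{ $f[5] };
    return join ',', @f;
}

my @rows = (
    '5,NaN,NaN,4,32,hold.bmp,1607503,NaN,NaN,8,go.bmp,NaN,1',
    '8,NaN,NaN,4,32,NaN,1607564,NaN,NaN,8,hold.bmp,NaN,1',
);

my @out = map { add_column($_) } @rows;
print "$_\n" for @out;
```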

Anno
 
 
Gunnar Hjalmarsson
 
      11-11-2004
bettyann wrote:
> one last clarification, i actually have more than two different cases,
> ie:
>
> s/^(([^,]*,){5}hold.bmp)/$1,0/;
> s/^(([^,]*,){5}go.bmp)/$1,1/;
> s/^(([^,]*,){5}slow.bmp)/$1,2/;
> s/^(([^,]*,){5}speed.bmp)/$1,3/;
> s/^(([^,]*,){5}NaN)/$1,-1/;
>
> so i don't think the "?:" combination would be as straight forward.


No, but in that case you can use a hash instead. Something like:

my %hash = (
'hold.bmp' => ',0',
'go.bmp' => ',1',
'slow.bmp' => ',2',
'speed.bmp' => ',3',
NaN => ',-1',
);

s/^((?:[^,]*,){5}([^,]+))/$1.($hash{$2} or '')/e;

After all, parsing thousands of lines once should reasonably be faster
than doing it six times.

--
Gunnar Hjalmarsson
Email: http://www.gunnar.cc/cgi-bin/contact.pl
 
 
bettyann
 
      11-14-2004
> Now this is something that's going to slow it down a bit, matching n times
> for n possibilities.


indeed.

> A hash lets you do them all in one go. Quite simple:
>
> my %replace = (
> 'hold.bmp' => 0,
> 'go.bmp' => 1,
> # ...
> NaN => -1,
> );
>
> Then the five substitutions could become (untested, probably more
> the spirit than the real thing)
>
> s/^(([^,]*,){5}([^,]*))/$1,$replace{ $2}/;


thanks! this works well. altho i needed to use the $3 capture as
a key to the hash, ie,

s/^(([^,]*,){5}([^,]*))/$1,$replace{$3}/;

as the key is captured with the 3rd open-parenthesis.
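
A small sketch of how the groups are numbered, using one of the sample rows (the variable names are illustrative only). Groups are counted by their opening parenthesis, and a repeated group keeps only its last repetition; a non-capturing (?:...) group is not counted at all.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $row = '5,NaN,NaN,4,32,hold.bmp,1607503,NaN,NaN,8,go.bmp,NaN,1';

# Three capturing groups, numbered by the position of their opening parenthesis:
#   1st: everything up to and including column 6
#   2nd: the repeated inner group, which keeps only its last repetition
#   3rd: column 6 on its own, i.e. the hash key
my ( $whole, $last_rep, $col6 ) = $row =~ /^(([^,]*,){5}([^,]*))/;
print "1: $whole\n2: $last_rep\n3: $col6\n";

# With (?:...) the repeated group no longer captures, so column 6 moves up
# to the second capture.
my ( $prefix, $key ) = $row =~ /^((?:[^,]*,){5}([^,]*))/;
print "key with non-capturing group: $key\n";
```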

gunnar, thanks, too. altho i found the "e" option in the command
"s//e" gave me this error so i simply removed the "e":

Scalar found where operator expected at (eval 4571) line 1, near
"}${4}"
(Missing operator before ${4}?)

thanks for all the help and ideas. i've incorporated hash tables in a
few other places in my code where they really make the logic cleaner.

thanks!
- bettyann
 