regexp s// too greedy
hi all,
can anyone help me limit the greediness of my substitution pattern? i have a CSV file and i want to insert a new column of values after the 6th column. but the new data to be inserted is dependent upon the value of the 6th column.

example original data:

    2,NaN,NaN,NaN,64,hold.bmp,1607444,NaN,NaN,NaN,hold.bmp,NaN,1
    1,NaN,NaN,NaN,32,hold.bmp,1607488,NaN,NaN,NaN,hold.bmp,3,1
    5,NaN,NaN,4,32,hold.bmp,1607503,NaN,NaN,8,go.bmp,NaN,1
    8,NaN,NaN,4,32,NaN,1607564,NaN,NaN,8,hold.bmp,NaN,1

i want to put "0" after the 6th column if the 6th column contains "hold.bmp".
i want to put "-1" after the 6th column if the 6th column contains "NaN".

i thought i could do this with two substitution commands:

    s/^((.*?,){5}?(hold.bmp))/$1,0/
    s/^((.*?,){5}?(NaN))/$1,-1/

i cannot limit the matching of "hold.bmp" or "NaN". i want this pattern to match *only* if "hold.bmp" or "NaN" immediately follows the 5th column.

my test code:

    #!/usr/local/bin/perl
    use strict;
    use warnings;

    my $input = <<EOF;
    2,NaN,NaN,NaN,64,hold.bmp,1607444,NaN,NaN,NaN,hold.bmp,NaN,1
    1,NaN,NaN,NaN,32,hold.bmp,1607488,NaN,NaN,NaN,hold.bmp,3,1
    5,NaN,NaN,4,32,hold.bmp,1607503,NaN,NaN,8,go.bmp,NaN,1
    8,NaN,NaN,4,32,NaN,1607564,NaN,NaN,8,hold.bmp,NaN,1
    EOF

    my @oData = split( '\n', $input );
    my $line;
    my $cnt = 0;
    foreach $line ( @oData ) {
        printf( "$cnt) $line \n" );
        $cnt++;
    }

    my $prevCol = 5;
    my @txtList = ( "hold.bmp", "NaN" );
    my @valList = ( "0", "-1" );
    my ( $txt, $cmd, $i );
    $i = 0;
    foreach $txt ( @txtList ) {
        $cmd = sprintf( '$line =~ s/^((.*?,){%d}?(%s))/$1,%s/;',
                        $prevCol, $txt, $valList[$i] );
        printf( "\ncmd >>$cmd<< \n" );
        foreach $line ( @oData ) {
            printf( "orig line |$line| \n" );
            eval $cmd;
            printf( " new line |$line| \n---------------------\n" );
        }
        $i++;
    }
    exit;

output:

    % test2.pl
    0) 2,NaN,NaN,NaN,64,hold.bmp,1607444,NaN,NaN,NaN,hold.bmp,NaN,1
    1) 1,NaN,NaN,NaN,32,hold.bmp,1607488,NaN,NaN,NaN,hold.bmp,3,1
    2) 5,NaN,NaN,4,32,hold.bmp,1607503,NaN,NaN,8,go.bmp,NaN,1
    3) 8,NaN,NaN,4,32,NaN,1607564,NaN,NaN,8,hold.bmp,NaN,1

    cmd >>$line =~ s/^((.*?,){5}?(hold.bmp))/$1,0/;<<
    orig line |2,NaN,NaN,NaN,64,hold.bmp,1607444,NaN,NaN,NaN,hold.bmp,NaN,1|
     new line |2,NaN,NaN,NaN,64,hold.bmp,0,1607444,NaN,NaN,NaN,hold.bmp,NaN,1|
    ---------------------
    orig line |1,NaN,NaN,NaN,32,hold.bmp,1607488,NaN,NaN,NaN,hold.bmp,3,1|
     new line |1,NaN,NaN,NaN,32,hold.bmp,0,1607488,NaN,NaN,NaN,hold.bmp,3,1|
    ---------------------
    orig line |5,NaN,NaN,4,32,hold.bmp,1607503,NaN,NaN,8,go.bmp,NaN,1|
     new line |5,NaN,NaN,4,32,hold.bmp,0,1607503,NaN,NaN,8,go.bmp,NaN,1|
    ---------------------
    orig line |8,NaN,NaN,4,32,NaN,1607564,NaN,NaN,8,hold.bmp,NaN,1|
     new line |8,NaN,NaN,4,32,NaN,1607564,NaN,NaN,8,hold.bmp,0,NaN,1|
    ---------------------

    cmd >>$line =~ s/^((.*?,){5}?(NaN))/$1,-1/;<<
    orig line |2,NaN,NaN,NaN,64,hold.bmp,0,1607444,NaN,NaN,NaN,hold.bmp,NaN,1|
     new line |2,NaN,NaN,NaN,64,hold.bmp,0,1607444,NaN,-1,NaN,NaN,hold.bmp,NaN,1|
    ---------------------
    orig line |1,NaN,NaN,NaN,32,hold.bmp,0,1607488,NaN,NaN,NaN,hold.bmp,3,1|
     new line |1,NaN,NaN,NaN,32,hold.bmp,0,1607488,NaN,-1,NaN,NaN,hold.bmp,3,1|
    ---------------------
    orig line |5,NaN,NaN,4,32,hold.bmp,0,1607503,NaN,NaN,8,go.bmp,NaN,1|
     new line |5,NaN,NaN,4,32,hold.bmp,0,1607503,NaN,-1,NaN,8,go.bmp,NaN,1|
    ---------------------
    orig line |8,NaN,NaN,4,32,NaN,1607564,NaN,NaN,8,hold.bmp,0,NaN,1|
     new line |8,NaN,NaN,4,32,NaN,-1,1607564,NaN,NaN,8,hold.bmp,0,NaN,1|
    ---------------------

thanks,
- bettyann
Re: regexp s// too greedy
bettyann wrote:
> hi all,
>
> can anyone help me limit the greediness of my substitution pattern? i
> have a CSV file and i want to insert a new column of values after the
> 6th column. but the new data to be inserted is dependent upon the
> value of the 6th column.
>
> example original data:
> 2,NaN,NaN,NaN,64,hold.bmp,1607444,NaN,NaN,NaN,hold.bmp,NaN,1
> 1,NaN,NaN,NaN,32,hold.bmp,1607488,NaN,NaN,NaN,hold.bmp,3,1
> 5,NaN,NaN,4,32,hold.bmp,1607503,NaN,NaN,8,go.bmp,NaN,1
> 8,NaN,NaN,4,32,NaN,1607564,NaN,NaN,8,hold.bmp,NaN,1
>
> i want to put "0" after the 6th column if the 6th column contains
> "hold.bmp".
> i want to put "-1" after the 6th column if the 6th column contains
> "NaN".
>
> i thought i could do this with two substitutions commands:
>
> s/^((.*?,){5}?(hold.bmp))/$1,0/
> s/^((.*?,){5}?(NaN))/$1,-1/
^ Not sure that you want that?

I suggest replacing (.*?,) with ([^,]*) assuming there isn't some way of commas appearing escaped within the data.
Re: regexp s// too greedy
Stuart Moore wrote:
> I suggest replacing (.*?,) with ([^,]*) assuming there isn't some way of
> commas appearing escaped within the data.

That should have been ([^,]*,) of course.
Re: regexp s// too greedy
bettyann wrote:
> can anyone help me limit the greediness of my substitution pattern? i
> have a CSV file and i want to insert a new column of values after the
> 6th column. but the new data to be inserted is dependent upon the
> value of the 6th column.
>
> example original data:
> 2,NaN,NaN,NaN,64,hold.bmp,1607444,NaN,NaN,NaN,hold.bmp,NaN,1
> 1,NaN,NaN,NaN,32,hold.bmp,1607488,NaN,NaN,NaN,hold.bmp,3,1
> 5,NaN,NaN,4,32,hold.bmp,1607503,NaN,NaN,8,go.bmp,NaN,1
> 8,NaN,NaN,4,32,NaN,1607564,NaN,NaN,8,hold.bmp,NaN,1
>
> i want to put "0" after the 6th column if the 6th column contains
> "hold.bmp".
> i want to put "-1" after the 6th column if the 6th column contains
> "NaN".
>
> i thought i could do this with two substitutions commands:
>
> s/^((.*?,){5}?(hold.bmp))/$1,0/
> s/^((.*?,){5}?(NaN))/$1,-1/
>
> i cannot limit the matching of "hold.bmp" or "NaN". i want this
> pattern to match *only* if "hold.bmp" or "NaN" immediately follows the
> 5th column.

Limiting to a fixed number of occurrences while using '.*' is contradictory, irrespective of greediness. Besides a few other things, I believe that the most important change you should make is to get rid of that problem by replacing the '.' meta character with the character class '[^,]'.

This might do it, using only one substitution:

    s/^((?:[^,]*,){5}(?:(hold\.bmp)|NaN))/"$1,".($2 ? '0' : '-1')/e;

--
Gunnar Hjalmarsson
Email: http://www.gunnar.cc/cgi-bin/contact.pl
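To make the point about '.' versus '[^,]' concrete, here is a minimal sketch that runs both patterns against the fourth sample row from the thread. The variable names ($row, $loose, $strict) are only for illustration. Because '.' can match a comma, the regex engine is free to backtrack and let one "(.*?,)" repetition span several columns, so the match slides forward to the later hold.bmp in column 11. With '[^,]*,' each repetition is exactly one column, so the pattern simply fails on this row.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Fourth sample row from the thread: column 6 is NaN, but hold.bmp
# appears later, in column 11.
my $row = '8,NaN,NaN,4,32,NaN,1607564,NaN,NaN,8,hold.bmp,NaN,1';

# Original pattern: '.' may match commas, so after backtracking a single
# "(.*?,)" repetition can swallow several columns.
( my $loose = $row ) =~ s/^((.*?,){5}?(hold.bmp))/$1,0/;

# Revised pattern: '[^,]*' cannot cross a comma, so five repetitions
# always cover exactly five columns.
( my $strict = $row ) =~ s/^((?:[^,]*,){5}hold\.bmp)/$1,0/;

print "loose:  $loose\n";   # 0 inserted after column 11 (unwanted)
print "strict: $strict\n";  # row unchanged, column 6 is not hold.bmp
```

The "loose" result reproduces the unwanted line shown in the original post's output, while the "strict" result leaves the row alone for the NaN substitution to handle.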
Re: regexp s// too greedy
bettyann <bettyann@campbell.com> wrote in comp.lang.perl.misc:
> hi all,
>
> can anyone help me limit the greediness of my substitution pattern? i
> have a CSV file and i want to insert a new column of values after the
> 6th column. but the new data to be inserted is dependent upon the
> value of the 6th column.
>
> example original data:
> 2,NaN,NaN,NaN,64,hold.bmp,1607444,NaN,NaN,NaN,hold.bmp,NaN,1
> 1,NaN,NaN,NaN,32,hold.bmp,1607488,NaN,NaN,NaN,hold.bmp,3,1
> 5,NaN,NaN,4,32,hold.bmp,1607503,NaN,NaN,8,go.bmp,NaN,1
> 8,NaN,NaN,4,32,NaN,1607564,NaN,NaN,8,hold.bmp,NaN,1
>
> i want to put "0" after the 6th column if the 6th column contains
> "hold.bmp".
> i want to put "-1" after the 6th column if the 6th column contains
> "NaN".
>
> i thought i could do this with two substitutions commands:
>
> s/^((.*?,){5}?(hold.bmp))/$1,0/
> s/^((.*?,){5}?(NaN))/$1,-1/
>
> i cannot limit the matching of "hold.bmp" or "NaN". i want this
> pattern to match *only* if "hold.bmp" or "NaN" immediately follows the
> 5th column.

[code appreciated, but snipped]

I'd use split and splice for that, not a regex (except that split also uses a regex). Then you can comfortably look at the preceding field and decide what goes after it. For instance:

    while ( <DATA> ) {
        my @l = split /,/;
        splice @l, 6, 0, $l[5] eq 'hold.bmp' ? 0 : -1;
        print join ',', @l;
    }

Anno
Re: regexp s// too greedy
On Wed, 10 Nov 2004 19:38:35 -0800, bettyann wrote:
> can anyone help me limit the greediness of my substitution pattern? i
> have a CSV file and i want to insert a new column of values after the
> 6th column. but the new data to be inserted is dependent upon the
> value of the 6th column.

Well, when talking about handling CSV files, why not use one of the numerous modules on CPAN (http://search.cpan.org?query=CSV)?

E.g. with Text::CSV_XS the following snippet works without having to worry about parsing CSV:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use Text::CSV_XS;

    my $csv = Text::CSV_XS->new();
    while (<DATA>) {
        chomp;
        $csv->parse($_) or die "Couldn't parse '$_' as CSV";
        my @col = $csv->fields;
        $csv->combine(@col[0..5],
                      ($col[5] eq 'hold.bmp' ? 0 : -1),
                      @col[6..$#col]);
        print $csv->string, "\n";
    }

    __DATA__
    2,NaN,NaN,NaN,64,hold.bmp,1607444,NaN,NaN,NaN,hold.bmp,NaN,1
    1,NaN,NaN,NaN,32,hold.bmp,1607488,NaN,NaN,NaN,hold.bmp,3,1
    5,NaN,NaN,4,32,hold.bmp,1607503,NaN,NaN,8,go.bmp,NaN,1
    8,NaN,NaN,4,32,NaN,1607564,NaN,NaN,8,hold.bmp,NaN,1

Greetings,
Janek
Re: regexp s// too greedy
thanks to everyone who replied -- all suggestions are good.
stuart and gunnar -- using pattern ([^,]*,) rather than (.*?,) works as i need. i understand now that i need to use a pattern that describes the negative of what i want rather than a pattern that describes what i *do* want. thanks for the suggestion and the new way of thinking.

len and anno -- i did consider using split/join but since the CSV file has thousands of lines, i thought maybe regexp might be faster. i'm not sure, tho, as i haven't done a benchmark.

janek -- Text::CSV_XS looks really nice. i'll certainly investigate this package more in the future.

one last clarification, i actually have more than two different cases, ie:

    s/^(([^,]*,){5}hold.bmp)/$1,0/;
    s/^(([^,]*,){5}go.bmp)/$1,1/;
    s/^(([^,]*,){5}slow.bmp)/$1,2/;
    s/^(([^,]*,){5}speed.bmp)/$1,3/;
    s/^(([^,]*,){5}NaN)/$1,-1/;

so i don't think the "?:" combination would be as straightforward.

thanks for all the help. greatly appreciated.
- bettyann
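Since the speed question was left open, one way to settle it is the core Benchmark module. The sketch below is only a starting point: the sub names with_regex and with_split are made up for illustration, it handles just the two original cases, and timings on one short row may not reflect a real file with thousands of lines.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# First sample row from the thread.
my $row = '2,NaN,NaN,NaN,64,hold.bmp,1607444,NaN,NaN,NaN,hold.bmp,NaN,1';

# Regex approach: one anchored substitution per case.
sub with_regex {
    my ($line) = @_;
    $line =~ s/^((?:[^,]*,){5}hold\.bmp)/$1,0/
        or $line =~ s/^((?:[^,]*,){5}NaN)/$1,-1/;
    return $line;
}

# split/splice approach, following the earlier reply: anything other
# than hold.bmp in column 6 gets -1.
sub with_split {
    my ($line) = @_;
    my @f = split /,/, $line, -1;    # -1 keeps trailing empty fields
    splice @f, 6, 0, $f[5] eq 'hold.bmp' ? 0 : -1;
    return join ',', @f;
}

# Run each candidate for at least one CPU second and print a comparison.
cmpthese( -1, {
    regex => sub { with_regex($row) },
    split => sub { with_split($row) },
} );
```

To get a realistic answer, feed both subs lines read from the actual CSV file instead of a single fixed row.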
Re: regexp s// too greedy
bettyann <bettyann@campbell.com> wrote in comp.lang.perl.misc:
> thanks to everyone who replied -- all suggestions are good.

[...]

> len and anno -- i did consider using split/join but since the CSV file
> has thousands of lines, i thought maybe regexp might be faster. i'm
> not sure, tho, as i haven't done a benchmark.

I don't think split will be significantly slower than a regex solution. While split *implies* the use of a regex for the delimiter, that is usually a very simple one which will predictably perform well enough. The rest of what split does is (in principle, not in detail) what a capturing regex does too. The performance of a pure-regex solution is much harder to predict.

If anything, splice may slow it down a bit, but no more than the actual substitution slows down the "regex" solution. I wouldn't expect a significant difference between split and regex, but if there is, I'd expect the regex to be slower.

> janek -- Text::CSV_XS looks really nice. i'll certainly investigate
> this package more in the future.
>
> one last clarification, i actually have more than two different cases,
> ie:
>
> s/^(([^,]*,){5}hold.bmp)/$1,0/;
> s/^(([^,]*,){5}go.bmp)/$1,1/;
> s/^(([^,]*,){5}slow.bmp)/$1,2/;
> s/^(([^,]*,){5}speed.bmp)/$1,3/;
> s/^(([^,]*,){5}NaN)/$1,-1/;
>
> so i don't think the "?:" combination would be as straight forward.

Now this is something that's going to slow it down a bit, matching n times for n possibilities.

A hash lets you do them all in one go. Quite simple:

    my %replace = (
        'hold.bmp' => 0,
        'go.bmp'   => 1,
        # ...
        NaN        => -1,
    );

Then the five substitutions could become (untested, probably more the spirit than the real thing)

    s/^(([^,]*,){5}([^,]*))/$1,$replace{ $2}/;

But I can't say I like the regex you're using. Only a short regex is a good regex; that one is much too long. I still favor the split solution, if only because it works on the actual data, not their messy representation. The hash can be used with that too, in the obvious way.

Anno
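The "obvious way" of combining the hash with split and splice could look roughly like the sketch below. The helper name add_code is hypothetical, the values for slow.bmp and speed.bmp are taken from the list of five substitutions, and leaving a line untouched when column 6 holds an unlisted value is an assumption the thread never settles.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Lookup table: value of column 6 => code to insert after it.
my %replace = (
    'hold.bmp'  => 0,
    'go.bmp'    => 1,
    'slow.bmp'  => 2,
    'speed.bmp' => 3,
    'NaN'       => -1,
);

# add_code() is a hypothetical helper name, not something from the thread.
sub add_code {
    my ($line) = @_;
    my @fields = split /,/, $line, -1;    # -1 keeps trailing empty fields
    return $line if @fields < 6;          # fewer than 6 columns: no change
    my $code = $replace{ $fields[5] };
    return $line unless defined $code;    # assumption: unknown value, no change
    splice @fields, 6, 0, $code;          # insert the code as a new 7th column
    return join ',', @fields;
}

my @rows = (
    '2,NaN,NaN,NaN,64,hold.bmp,1607444,NaN,NaN,NaN,hold.bmp,NaN,1',
    '8,NaN,NaN,4,32,NaN,1607564,NaN,NaN,8,hold.bmp,NaN,1',
);
print add_code($_), "\n" for @rows;
```

Each line is examined once, regardless of how many cases the hash holds, and adding a new case means adding one hash entry.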
Re: regexp s// too greedy
bettyann wrote:
> one last clarification, i actually have more than two different cases,
> ie:
>
> s/^(([^,]*,){5}hold.bmp)/$1,0/;
> s/^(([^,]*,){5}go.bmp)/$1,1/;
> s/^(([^,]*,){5}slow.bmp)/$1,2/;
> s/^(([^,]*,){5}speed.bmp)/$1,3/;
> s/^(([^,]*,){5}NaN)/$1,-1/;
>
> so i don't think the "?:" combination would be as straight forward.

No, but in that case you can use a hash instead. Something like:

    my %hash = (
        'hold.bmp'  => ',0',
        'go.bmp'    => ',1',
        'slow.bmp'  => ',2',
        'speed.bmp' => ',3',
        NaN         => ',-1',
    );

    s/^((?:[^,]*,){5}([^,]+))/$1.($hash{$2} or '')/e;

After all, parsing thousands of lines once should reasonably be faster than doing it six times.

--
Gunnar Hjalmarsson
Email: http://www.gunnar.cc/cgi-bin/contact.pl
Re: regexp s// too greedy
> Now this is something that's going to slow it down a bit, matching n times
> for n possibilities.

indeed.

> A hash lets you do them all in one go. Quite simple:
>
> my %replace = (
>     'hold.bmp' => 0,
>     'go.bmp' => 1,
>     # ...
>     NaN => -1,
> );
>
> Then the five substitutions could become (untested, probably more
> the spirit than the real thing)
>
> s/^(([^,]*,){5}([^,]*))/$1,$replace{ $2}/;

thanks! this works well. altho i needed to use the $3 capture as a key to the hash, ie,

    s/^(([^,]*,){5}([^,]*))/$1,$replace{$3}/;

as the key is captured with the 3rd open-parenthesis.

gunnar, thanks, too. altho i found the "e" option in the command "s//e" gave me this error so i simply removed the "e":

    Scalar found where operator expected at (eval 4571) line 1, near "}${4}"
    (Missing operator before ${4}?)

thanks for all the help and ideas. i've incorporated hash tables in a few other places in my code where they really make the logic cleaner.

thanks!
- bettyann
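A note on the "e" error: the exact generated command is not shown in the thread, so this is only one plausible reading. With /e the replacement is compiled as a Perl expression rather than an interpolated string, so two variables placed side by side (such as "}${4}") are a syntax error unless joined with the '.' operator. The sketch below, with illustrative variable names, shows the two equivalent forms.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my %replace = ( 'hold.bmp' => 0, 'go.bmp' => 1, 'NaN' => -1 );
my $row     = '5,NaN,NaN,4,32,go.bmp,1607503';

# Without /e the replacement is an interpolated string, so variables can
# sit directly next to literal text such as the comma.
( my $plain = $row ) =~ s/^((?:[^,]*,){5}([^,]*))/$1,$replace{$2}/;

# With /e the replacement is a Perl expression, so the pieces have to be
# joined explicitly with the '.' concatenation operator.
( my $evald = $row ) =~ s/^((?:[^,]*,){5}([^,]*))/$1 . ',' . $replace{$2}/e;

print "$plain\n";   # 5,NaN,NaN,4,32,go.bmp,1,1607503
print "$evald\n";   # same result
```

Dropping the "e" is fine when plain interpolation is enough; /e is only needed when the replacement has to compute something, such as the "or ''" fallback in the earlier reply.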