Velocity Reviews - Computer Hardware Reviews

Velocity Reviews > Newsgroups > Programming > Perl > Perl Misc > Index of first and last non-"\xff" in a long string

Reply
Thread Tools

Index of first and last non-"\xff" in a long string

 
 
w.c.humann@arcor.de
Guest
Posts: n/a
 
      11-12-2007
I'm going through several PGM images, overlaying (i.e. ANDing) them
and would also like to determine the bounding box. For that I need to
find the first and last non-white (i.e. non-"\xff") pixel in every
line. Now I have one line in a string $data (one pixel per character),
possibly several 1000 charcters long. I tried 3 alternatives so far.
All 3 work, but there may be even faster ways to do this:

# slow
my $first = -1;
1 while substr($data, ++$first, 1) eq "\xff" and $first < $width - 1;
my $last = $width;
1 while substr($data, --$last, 1) eq "\xff" and $last > $first;
print STDERR "f: $first, l: $last, ";

# the match for $first2 is fast, but the one for $last2 is really slow
my $first2 = length( ($data =~ /^(\xff+)/)[0] );
my $last2 = $width - 1 - length( ($data =~ /(\xff+)$/)[0] );
print STDERR "f2: $first2, l2: $last2, ";

# best solution so far. "tr" is the slowest part of this.
# Is there a way without the "tr"?
$data =~ tr|\x00-\xfe|\x00|;
my $first3 = index $data, "\x00";
my $last3 = ($first3 > -1) ? rindex $data, "\x00" : -1;
print STDERR "f3: $first3, l3: $last3, ";

print STDERR "\n";

Thanks,
Wolfram

 
Reply With Quote
 
 
 
 
Ben Morrow
Guest
Posts: n/a
 
      11-12-2007

Quoth http://www.velocityreviews.com/forums/(E-Mail Removed):
> I'm going through several PGM images, overlaying (i.e. ANDing) them
> and would also like to determine the bounding box. For that I need to
> find the first and last non-white (i.e. non-"\xff") pixel in every
> line. Now I have one line in a string $data (one pixel per character),
> possibly several 1000 charcters long. I tried 3 alternatives so far.
> All 3 work, but there may be even faster ways to do this:


On my machine, the benchmark below gives

Rate subst sloop chop match reverse index C
subst 403/s -- -90% -94% -96% -97% -97% -100%
sloop 4078/s 912% -- -36% -58% -73% -73% -96%
chop 6340/s 1474% 55% -- -35% -58% -59% -94%
match 9754/s 2322% 139% 54% -- -36% -37% -91%
reverse 15247/s 3686% 274% 140% 56% -- -1% -85%
index 15376/s 3718% 277% 143% 58% 1% -- -85%
C 104065/s 25739% 2452% 1541% 967% 583% 577% --

so index is probably the best you're going to get without using C.

Ben

#!/usr/bin/perl

use Benchmark qw/cmpthese/;

my $str = ("\xff" x 160) . ("f" x 10_000) . ("\xff" x 150);
my $len = length $str;

use Inline C => <<'EOC';

IV
unindex(SV *sv, const char *str)
{
const char *pv;
IV len, i = 0;
const char chr = str[0];

pv = SvPV(sv, len);
while (pv[i] == chr) i++;

return i;
}

IV
unrindex(SV *sv, const char *str)
{
const char *pv;
IV len, i;
const char chr = str[0];

pv = SvPV(sv, len);
i = len - 1;
while (pv[i] == chr) i--;

return i;
}

EOC

cmpthese -3, {
match => sub {
local $_ = $str;
# this is the fastest single match I can come up with
/^((?>\xff*)).*[^\xff]((?>\xff*))$/;
161 == 1 + $+[1] or die "\$+ failed: " . (1 + $+[1]);
10_160 == $-[2] or die "\$- failed: " . $-[2];
},
index => sub {
local $_ = $str;
tr,\x00-\xfe,\x00,;
161 == 1 + index $_, "\x00"
or die "index failed: " . 1 + index $_, "\x00";
10_160 == 1 + rindex $_, "\x00"
or die "rindex failed: " . 1 + rindex $_, "\x00";
},
subst => sub {
local $_ = $str;
/[^\xff]/g;
161 == pos or die "pos failed: " . pos;
s/\xff+$//;
10_160 == length or die "subst failed: " . length;
},
sloop => sub {
local $_ = $str;
/[^\xff]/g;
161 == pos or die "pos failed: " . pos;
1 while s/\xff$//;
10_160 == length or die "sloop failed: " . length;
},
chop => sub {
local $_ = $str;
/[^\xff]/g;
161 == pos or die "pos failed: " . pos;
1 while "\xff" eq chop;
10_160 == 1 + length or die "chop failed: " . (1 + length);
},
reverse => sub {
local $_ = $str;
/[^\xff]/g;
161 == pos or die "pos failed: " . pos;
$_ = reverse;
/[^\xff]/g;
10_160 == (1 + $len - pos)
or die "reverse failed: " . (1 + $len - pos);
},
C => sub {
local $_ = $str;
161 == 1 + unindex $_, "\xff"
or die "unindex failed: " . (1 + unindex $_, "\xff");
10_160 == 1 + unrindex $_, "\xff"
or die "unrindex failed: " . (1 + unrindex $_, "\xff");
},
};
 
Reply With Quote
 
 
 
 
John W. Krahn
Guest
Posts: n/a
 
      11-12-2007
(E-Mail Removed) wrote:
>
> I'm going through several PGM images, overlaying (i.e. ANDing) them
> and would also like to determine the bounding box. For that I need to
> find the first and last non-white (i.e. non-"\xff") pixel in every
> line. Now I have one line in a string $data (one pixel per character),
> possibly several 1000 charcters long. I tried 3 alternatives so far.
> All 3 work, but there may be even faster ways to do this:
>
> # slow
> my $first = -1;
> 1 while substr($data, ++$first, 1) eq "\xff" and $first < $width - 1;
> my $last = $width;
> 1 while substr($data, --$last, 1) eq "\xff" and $last > $first;
> print STDERR "f: $first, l: $last, ";
>
> # the match for $first2 is fast, but the one for $last2 is really slow
> my $first2 = length( ($data =~ /^(\xff+)/)[0] );
> my $last2 = $width - 1 - length( ($data =~ /(\xff+)$/)[0] );
> print STDERR "f2: $first2, l2: $last2, ";
>
> # best solution so far. "tr" is the slowest part of this.
> # Is there a way without the "tr"?
> $data =~ tr|\x00-\xfe|\x00|;
> my $first3 = index $data, "\x00";
> my $last3 = ($first3 > -1) ? rindex $data, "\x00" : -1;
> print STDERR "f3: $first3, l3: $last3, ";
>
> print STDERR "\n";


Try this in your testing:

$data =~ /[^\xff].*[^\xff]/s and my ( $first, $last ) = ( $-[ 0 ], $+[ 0
] - 1 );



John
--
use Perl;
program
fulfillment
 
Reply With Quote
 
w.c.humann@arcor.de
Guest
Posts: n/a
 
      11-12-2007
On Nov 12, 8:57 pm, "John W. Krahn" <(E-Mail Removed)> wrote:
>
> Try this in your testing:
>
> $data =~ /[^\xff].*[^\xff]/s and my ( $first, $last ) = ( $-[ 0 ], $+[ 0
> ] - 1 );


Thanks John,

at least a hundred times faster than my attempt at pattern matching --
but still several times slower than tr/index/rindex.

Wolfram

 
Reply With Quote
 
w.c.humann@arcor.de
Guest
Posts: n/a
 
      11-12-2007
On Nov 12, 8:56 pm, Ben Morrow <(E-Mail Removed)> wrote:
>
> On my machine, the benchmark below gives
>
> Rate subst sloop chop match reverse index C
> subst 403/s -- -90% -94% -96% -97% -97% -100%
> sloop 4078/s 912% -- -36% -58% -73% -73% -96%
> chop 6340/s 1474% 55% -- -35% -58% -59% -94%
> match 9754/s 2322% 139% 54% -- -36% -37% -91%
> reverse 15247/s 3686% 274% 140% 56% -- -1% -85%
> index 15376/s 3718% 277% 143% 58% 1% -- -85%
> C 104065/s 25739% 2452% 1541% 967% 583% 577% --
>
> so index is probably the best you're going to get without using C.


Hey, some great ideas here, thanks Ben.

Glad I had already found the fastest pure-perl solution
(but 'reverse' is so close, the order might change per run...)

'Inline C' is great but less portable, and I'm mainly using this on
win32.

Wolfram

 
Reply With Quote
 
Ben Morrow
Guest
Posts: n/a
 
      11-12-2007

Quoth (E-Mail Removed):
> On Nov 12, 8:57 pm, "John W. Krahn" <(E-Mail Removed)> wrote:
> >
> > Try this in your testing:
> >
> > $data =~ /[^\xff].*[^\xff]/s and my ( $first, $last ) = ( $-[ 0 ], $+[ 0
> > ] - 1 );


D'oh! I knew my match didn't need to do so much backtracking...

> at least a hundred times faster than my attempt at pattern matching --
> but still several times slower than tr/index/rindex.


Interesting... which version of perl? With
This is perl, v5.8.8 built for i386-freebsd-64int

and adding this

innerm => sub {
local $_ = $str;
/[^\xff].*[^\xff]/s;
161 == ($-[0] + 1) or die "innerm \$+ failed: " . ($-[0] + 1);
10_160 == $+[0] or die "innerm \$- failed: " . $+[0];
},

to my previous benchmark, I get

Rate match index innerm C
match 9609/s -- -41% -53% -91%
index 16398/s 71% -- -21% -85%
innerm 20641/s 115% 26% -- -81%
C 109225/s 1037% 566% 429% --

though seriously increasing the number of trailing "\xff"s causes both
'match' and 'innerm' to perform dramatically badly, so maybe this is an
artefact of my test string.

Ben

 
Reply With Quote
 
w.c.humann@arcor.de
Guest
Posts: n/a
 
      11-13-2007
On Nov 12, 10:08 pm, Ben Morrow <(E-Mail Removed)> wrote:
> Interesting... which version of perl? With
> This is perl, v5.8.8 built for i386-freebsd-64int


Mine is:
This is perl, v5.8.7 built for MSWin32-x86-multi-thread

>
> and adding this
>
> innerm => sub {
> local $_ = $str;
> /[^\xff].*[^\xff]/s;
> 161 == ($-[0] + 1) or die "innerm \$+ failed: " . ($-[0] + 1);
> 10_160 == $+[0] or die "innerm \$- failed: " . $+[0];
> },
>
> to my previous benchmark, I get
>
> Rate match index innerm C
> match 9609/s -- -41% -53% -91%
> index 16398/s 71% -- -21% -85%
> innerm 20641/s 115% 26% -- -81%
> C 109225/s 1037% 566% 429% --


Well, with your test-string I get:

Rate subst sloop chop match index reverse
innerm
subst 1306/s -- -91% -94% -94% -94% -95%
-97%
sloop 13917/s 965% -- -33% -41% -41% -52%
-72%
chop 20894/s 1499% 50% -- -11% -11% -28%
-59%
match 23417/s 1693% 68% 12% -- -1% -19%
-54%
index 23561/s 1704% 69% 13% 1% -- -18%
-53%
reverse 28902/s 2112% 108% 38% 23% 23% --
-43%
innerm 50510/s 3767% 263% 142% 116% 114% 75%
--

so indeed 'innerm' wins, but...

> though seriously increasing the number of trailing "\xff"s causes both
> 'match' and 'innerm' to perform dramatically badly, so maybe this is an
> artefact of my test string.


typical files have blank lines at top and bottom and if I modify your
script like this:

# lenght of left, middle and right part of string
#my ($l,$m,$r) = (160, 10_000, 150);
my ($l,$m,$r) = (5000, 2, 5000);

my $str = ("\xff" x $l) . ("f" x $m) . ("\xff" x $r);
my $len = length $str;

cmpthese -3, {
match => sub {
local $_ = $str;
# this is the fastest single match I can come up with
/^((?>\xff*)).*[^\xff]((?>\xff*))$/;
$l+1 == 1 + $+[1] or die "\$+ failed: " . (1 + $+[1]);
$l+$m == $-[2] or die "\$- failed: " . $-[2];
},
etc.

the result is:

Rate subst sloop chop match innerm
reverse index
subst 1.44/s -- -100% -100% -100% -100%
-100% -100%
sloop 462/s 32117% -- -37% -75% -82%
-86% -98%
chop 737/s 51284% 59% -- -60% -71%
-78% -96%
match 1844/s 128383% 299% 150% -- -28%
-45% -91%
innerm 2575/s 179348% 457% 249% 40% --
-23% -87%
reverse 3352/s 233498% 625% 355% 82% 30%
-- -84%
index 20452/s 1424967% 4323% 2673% 1009% 694%
510% --

Stange enough with my ($l,$m,$r) = (5000, 1, 5000); 'innerm' fails!?

An optimization not considered so far: Once I've found a left and
right bound in a line, I only need to check from the edges up these
bounds in all following lines, because my bounding-box can only grow
(and never shrink) while checking further lines. As a faked test I've
used this:

index2 => sub {
my $left = substr($str,0,$l+500);
my $right = substr($str,$l+$m-500,$r+500);
$left =~ tr,\x00-\xfe,\x00,;
$right =~ tr,\x00-\xfe,\x00,;
$l+1 == 1 + index $left, "\x00"
or die "index failed: " . (1 + index $left, "\x00");
500 == 1 + rindex $right, "\x00"
or die "rindex failed: " . (1 + rindex $right, "\x00");
},

The potential savings are big but of course highly dependent on the
actual image:
Rate index reverse innerm index2
index 23924/s -- -17% -53% -82%
reverse 28749/s 20% -- -43% -79%
innerm 50574/s 111% 76% -- -63%
index2 136616/s 471% 375% 170% --


Wolfram

 
Reply With Quote
 
John W. Krahn
Guest
Posts: n/a
 
      11-13-2007
(E-Mail Removed) wrote:
>
> On Nov 12, 10:08 pm, Ben Morrow <(E-Mail Removed)> wrote:
> > Interesting... which version of perl? With
> > This is perl, v5.8.8 built for i386-freebsd-64int
> >
> > and adding this
> >
> > innerm => sub {
> > local $_ = $str;
> > /[^\xff].*[^\xff]/s;


[ SNIP ]

> Stange enough with my ($l,$m,$r) = (5000, 1, 5000); 'innerm' fails!?


Not strange at all. The pattern has to match at least two [^\xff]
characters.


John
--
use Perl;
program
fulfillment
 
Reply With Quote
 
 
 
Reply

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off
Trackbacks are On
Pingbacks are On
Refbacks are Off


Similar Threads
Thread Thread Starter Forum Replies Last Post
Making an array wrap, where last index + 1 = first index Shawn W_ Ruby 5 09-16-2009 02:45 PM
Having compilation error: no match for call to (const __gnu_cxx::hash<long long int>) (const long long int&) veryhotsausage C++ 1 07-04-2008 05:41 PM
sorting index-15, index-9, index-110 "the human way"? Tomasz Chmielewski Perl Misc 4 03-04-2008 05:01 PM
first and last index as in matlab Evan Python 6 12-18-2006 09:29 AM
how common is it to get email spam with your first, first and last and/or home town in it? Lookout Computer Support 16 04-27-2006 07:42 AM



Advertisments