
 Worky Workerson 07-28-2006 06:54 PM

Efficiently convert characters to octal representation

I have a (possibly binary) string like "worky" where I'd like to
convert each byte to its octal representation, resulting in a string
"\167\157\162\153\171". I have two solutions, however I'm looking for
any way that would be faster.

Control:
$content = 'worky';
return $content;

Solution 1 (in place w/regex):
$content = 'worky';
$content =~ s/(.|\n)/sprintf("\\%03o", ord $1)/eg;
return $content;

Solution 2 (index into string):
$content = 'worky';
do {
    use bytes;
    foreach my $idx (0..(length($content)-1)) {
        $ret .= sprintf("\\%03o", ord(substr($content, $idx, 1)));
    }
};
return $ret;

Based on a quick cmpthese benchmark, the control is about 16 times
faster than solution 1 and about 9 times faster than solution 2.

Does anyone know of A) The fastest way to do this or B) some
tips/tricks on how to speedup my methods?

Thanks!
-Worky

 Worky Workerson 07-28-2006 07:35 PM

Re: Efficiently convert characters to octal representation

After tinkering for a while, my best solution is now:
$content = 'worky';
return join('', map {sprintf("\\%03o", $_)} unpack("C*", $content));

Anyone got anything better?

Thanks!
-Worky

 DJ Stunks 07-28-2006 07:48 PM

Re: Efficiently convert characters to octal representation

Worky Workerson wrote:
> I have a (possibly binary) string like "worky" where I'd like to
> convert each byte to its octal representation, resulting in a string
> "\167\157\162\153\171". I have two solutions, however I'm looking for
> any way that would be faster.
>
> Control:
> $content = 'worky';
> return $content;
>
> Solution 1 (in place w/regex):
> $content = 'worky';
> $content =~ s/(.|\n)/sprintf("\\%03o", ord $1)/eg;
> return $content;
>
> Solution 2 (index into string):
> $content = 'worky';
> do {
>     use bytes;
>     foreach my $idx (0..(length($content)-1)) {
>         $ret .= sprintf("\\%03o", ord(substr($content, $idx, 1)));
>     }
> };
> return $ret;
>
> Based on a quick cmpthese benchmark, the control is about 16 times
> faster than solution 1 and about 9 times faster than solution 2.
>
> Does anyone know of A) The fastest way to do this or B) some
> tips/tricks on how to speedup my methods?

if you have the benchmark set up, try this sub:

#!/usr/bin/perl

use strict;
use warnings;

my $string = 'worky';
print convert_to_octal($string);

sub convert_to_octal {
    my ($string) = @_;

    return map { sprintf '\\%03o', ord $_ }
           split //, $string;

}

__END__

-jp

 Ben Morrow 07-28-2006 08:35 PM

Re: Efficiently convert characters to octal representation

Quoth "Worky Workerson" <worky.workerson@gmail.com>:

Guidelines?]

[converting a string into octal escapes]

> After tinkering for a while, my best solution is now:
> $content = 'worky';
> return join('', map {sprintf("\\%03o", $_)} unpack("C*", $content));
>
> Anyone got anything better?

Here's a couple more, in the spirit of TMTOWTDI:

#!/usr/bin/perl

use warnings;
use strict;
use Math::BaseCnv;
use Benchmark qw/cmpthese/;

$\ = "\n";
my $w = 'worky';

my %subs = (
    regex  => sub { (my $x = $w) =~ s/(.)/sprintf '\\%3o', ord $1/egs; $x; },
    substr => sub {
        use bytes;
        my $x;
        for (0..(length $w) - 1) {
            $x .= sprintf '\\%3o', ord substr $w, $_, 1;
        }
        return $x;
    },
    unpack => sub { join '', map sprintf('\\%3o', $_), unpack 'C*', $w },
    split  => sub {
        use bytes;
        join '', map sprintf('\\%3o', ord), split //, $w
    },
    cnv    => sub { '\\' . join '\\', map cnv($_, 10, 8), unpack 'C*', $w; },
);

for (keys %subs) {
    print "$_ => " . $subs{$_}->();
}

cmpthese -3, \%subs;

__END__

This gives (on my machine)

cnv => \167\157\162\153\171
regex => \167\157\162\153\171
unpack => \167\157\162\153\171
substr => \167\157\162\153\171
split => \167\157\162\153\171
          Rate   cnv regex split unpack substr
cnv     5618/s    --  -87%  -87%   -92%   -92%
regex  42639/s  659%    --   -0%   -39%   -39%
split  42712/s  660%    0%    --   -39%   -39%
unpack 69589/s 1139%   63%   63%     --    -1%
substr 70257/s 1151%   65%   64%     1%     --

This ordering usually holds: substr is faster than unpack, which is
faster than split, which is faster than regex. The reason is that Perl
ops are so much slower than C.

However, I am hard-pressed to think of a situation where it's worth
writing anything other than 'regex' above, as clarity is almost always
more important than speed.

Also note that the 'regex' solution would need a 'use bytes' to be
strictly compatible with the others. I'm not sure why you think you need
it: if you've read your data from a binmode :raw filehandle it's binary
anyway; otherwise you want to encode it with Encode into a suitable
encoding.
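
One detail worth checking before reusing the subs above on binary data: they format with '%3o', while the original post used '%03o'. For 'worky' the two agree, because every byte is at least 0100 octal, but smaller byte values get space padding instead of zero padding. A minimal sketch of the difference (the two helper names are made up for this illustration):

```perl
use strict;
use warnings;

# Hypothetical helper names, used only for this illustration.
sub octal_space_padded { join '', map { sprintf '\\%3o',  $_ } unpack 'C*', $_[0] }
sub octal_zero_padded  { join '', map { sprintf '\\%03o', $_ } unpack 'C*', $_[0] }

my $input = "w\n";    # 'w' is 0167, newline is 012

print octal_space_padded($input), "\n";   # \167\ 12
print octal_zero_padded($input),  "\n";   # \167\012
```

If the output has to be parsed back as octal escapes, the zero-padded form is presumably the one wanted.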

Ben


 Worky Workerson 07-28-2006 08:41 PM

Re: Efficiently convert characters to octal representation

> sub convert_to_octal {
>     my ($string) = @_;
>
>     return map { sprintf '\\%03o', ord $_ }
>            split //, $string;
>
> }

It's about 25% slower than my "best" solution listed previously, which
was basically the same thing with unpack instead of split. Also, since
the data might be binary, I'm worried about the split // ... isn't that
a character split (vs a binary split)?
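
For what it's worth, here is a small sketch of the distinction being asked about. split // yields one element per character, so the result depends on whether the string holds decoded characters or raw octets. The helper name is hypothetical, and the assumption is that the wide character arrives as UTF-8 on disk:

```perl
use strict;
use warnings;

# Hypothetical helper: escape whatever units split // yields.
sub octal_via_split { join '', map { sprintf '\\%03o', ord($_) } split //, $_[0] }

# A decoded character string: 'a' followed by U+263A (one character).
my $chars = "a\x{263A}";

# The same text as UTF-8 octets; utf8::encode modifies its argument in place.
my $bytes = $chars;
utf8::encode($bytes);

print length($chars), " ", length($bytes), "\n";   # 2 4
print octal_via_split($chars), "\n";   # \141\23072  (one escape for the wide character)
print octal_via_split($bytes), "\n";   # \141\342\230\272
```

So split // is only a "binary split" when the string is already a byte string, which is the case for data read from a raw filehandle.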

 Worky Workerson 07-28-2006 09:03 PM

Re: Efficiently convert characters to octal representation

Ben Morrow wrote:
> [converting a string into octal escapes]
>
> my %subs = (
>     regex  => sub { (my $x = $w) =~ s/(.)/sprintf '\\%3o', ord $1/egs; $x; },
>     substr => sub {
>         use bytes;
>         my $x;
>         for (0..(length $w) - 1) {
>             $x .= sprintf '\\%3o', ord substr $w, $_, 1;
>         }
>         return $x;
>     },
>     unpack => sub { join '', map sprintf('\\%3o', $_), unpack 'C*', $w },
>     split  => sub {
>         use bytes;
>         join '', map sprintf('\\%3o', ord), split //, $w
>     },
>     cnv    => sub { '\\' . join '\\', map cnv($_, 10, 8), unpack 'C*', $w; },
> );

> However, I am hard-pressed to think of a situation where it's worth
> writing anything other than 'regex' above, as clarity is almost always
> more important than speed.

I'm doing database ETL and transforming 300GB of CSV into something the
database likes to load. According to DProf, this was my biggest
slacker by far, partly because it is called so often. Every little bit
of speed helps :)

> Also note that the 'regex' solution would need a 'use bytes' to be
> strictly compatible with the others. I'm not sure why you think you need
> it: if you've read your data from a binmode :raw filehandle it's binary
> anyway; otherwise you want to encode it with Encode into a suitable
> encoding.

I guess I'm still a little fuzzy on the whole perl/binary thing. I'm
reading in CSV where most of the columns are ASCII but I'm not sure
what sort of data will be stored in one of the columns. I am declaring
binmode on the filehandle ... do I still need the 'use bytes' on the
substr approach?

 Ben Morrow 07-28-2006 09:10 PM

Re: Efficiently convert characters to octal representation

Quoth "Worky Workerson" <worky.workerson@gmail.com>:
> Ben Morrow wrote:
> > [converting a string into octal escapes]

>
> > However, I am hard-pressed to think of a situation where it's worth
> > writing anything other than 'regex' above, as clarity is almost always
> > more important than speed.

>
> I'm doing database ETL and transforming 300GB of CSV into something the
> database likes to load. According to DProf, this was my biggest
> slacker by far, partly because it is called so often. Every little bit
> of speed helps :)

Fair enough :). A lot of people seem to come here saying 'I want to do
<foo> really fast' without thinking whether that's really necessary.

> > Also note that the 'regex' solution would need a 'use bytes' to be
> > strictly compatible with the others. I'm not sure why you think you need
> > it: if you've read your data from a binmode :raw filehandle it's binary
> > anyway; otherwise you want to encode it with Encode into a suitable
> > encoding.

>
> I guess I'm still a little fuzzy on the whole perl/binary thing.

Yeah, it's kinda complicated. It's made harder by the fact that Perl has
to be backwards-compatible, so a lot of the time just fudging things
seems to work...

> I'm reading in CSV where most of the columns are ASCII but I'm not
> sure what sort of data will be stored in one of the columns. I am
> declaring binmode on the filehandle ... do I still need the 'use
> bytes' on the substr approach?

If you are reading from a binary filehandle, then the data is all 8bit
(as opposed to wider than that) anyway, so you don't. You may get a
slight speed benefit by declaring 'use bytes' at the top of the script.
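
A rough way to see this for yourself, sketched with an in-memory filehandle standing in for the real CSV file (the sample bytes and variable names are assumptions for the illustration): data read through a :raw layer has the same length with and without 'use bytes', and the substr loop produces one escape per octet either way.

```perl
use strict;
use warnings;

# Stand-in for a file on disk: three raw bytes (0xE2 0x98 0xBA) plus a newline.
my $file_contents = "\xE2\x98\xBA\n";

open my $fh, '<:raw', \$file_contents or die "open failed: $!";
my $line = <$fh>;
close $fh;
chomp $line;

# Character length and byte length agree for data read from a raw handle.
my $length_plain = length($line);
my $length_bytes = do { use bytes; length($line) };

# The substr approach, without 'use bytes'.
my $octal = '';
$octal .= sprintf('\\%03o', ord(substr($line, $_, 1))) for 0 .. length($line) - 1;

print "$length_plain $length_bytes $octal\n";   # 3 3 \342\230\272
```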


 xhoster@gmail.com 07-28-2006 09:18 PM

Re: Efficiently convert characters to octal representation

"Worky Workerson" <worky.workerson@gmail.com> wrote:
> I have a (possibly binary) string like "worky" where I'd like to
> convert each byte to its octal representation, resulting in a string
> "\167\157\162\153\171". I have two solutions, however I'm looking for
> any way that would be faster.
>

....
> Solution 2 (index into string):
> $content = 'worky';
> do {
>     use bytes;
>     foreach my $idx (0..(length($content)-1)) {
>         $ret .= sprintf("\\%03o", ord(substr($content, $idx, 1)));
>     }
> };
> return $ret;

....
> Does anyone know of A) The fastest way to do this or B) some
> tips/tricks on how to speedup my methods?

This seems like a pretty strange thing to need to optimize. How many
times do you need to do this operation on a 5 character fixed string?
If you don't need to do it on a 5 character fixed string, then your
benchmark should incorporate realistic sizes and with more realistic
methods for obtaining the non-fixed thing you want to operate on.
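
As a sketch of what a more realistic benchmark might look like: the one below times three of the approaches from this thread on a longer field of random bytes and checks that they agree before timing. The 4096-byte field size and the sub names are assumptions for illustration only; real data from the CSV would be a better input.

```perl
use strict;
use warnings;
use Benchmark qw/cmpthese/;

# Assumption: a 4 KB field of random bytes stands in for "realistic" data.
my $field = join '', map { chr(int(rand(256))) } 1 .. 4096;

# Three of the approaches from this thread, wrapped as named subs.
sub via_regex  { (my $x = $_[0]) =~ s/(.)/sprintf('\\%03o', ord($1))/egs; $x }
sub via_unpack { join '', map { sprintf '\\%03o', $_ } unpack 'C*', $_[0] }
sub via_substr {
    my $x = '';
    $x .= sprintf('\\%03o', ord(substr($_[0], $_, 1))) for 0 .. length($_[0]) - 1;
    return $x;
}

# Make sure the candidates agree before comparing their speed.
die "results differ"
    unless via_regex($field) eq via_unpack($field)
       and via_unpack($field) eq via_substr($field);

# Run each candidate for at least one CPU second and print the comparison table.
cmpthese(-1, {
    regex  => sub { via_regex($field) },
    unpack => sub { via_unpack($field) },
    substr => sub { via_substr($field) },
});
```

The relative ordering may well differ from the 5-character case, since per-call overhead matters less as the input grows.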

Anyway, if you really need the speed, this Inline C code is about 3 times
faster than sol2.

              Rate     sol1     sol2     sol3 control2
sol1       70274/s       --     -45%     -82%     -94%
sol2      127219/s      81%       --     -67%     -89%
sol3      385820/s     449%     203%       --     -66%
control2 1122504/s    1497%     782%     191%       --

Benchmark::cmpthese(-5, {
    'control' => sub {control($text)},
    'sol1' => sub {sol1($text)},
    'sol2' => sub {sol2($text)},
    'sol3' => sub {sol3($text)},
});
__END__
__C__
SV* sol3(SV* a) {
    STRLEN len;
    int i;
    unsigned char * s;
    SV* ret;
    s = SvPV(a,len);
    ret = newSV(4*len);
    for (i=0; i<len; i++,s++) {
        sv_catpvf(ret, "\\%03o", *s);
    };
    return ret;
};

Xho


 Sisyphus 07-28-2006 11:00 PM

Re: Efficiently convert characters to octal representation

<xhoster@gmail.com> wrote in message
..
..
>
> Anyway, if you really need the speed, this Inline C code is about 3 times
> faster than sol2.
>

A neat little Inline C routine .... so I saved the code and ran it:

-----------------------------
D:\pscrpt\inline\>cat char2octal.pl
use warnings;
use Inline C => Config =>
BUILD_NOISY => 1;

use Inline C => <<'EOC';

SV* c2o(SV* a) {
STRLEN len;
int i;
unsigned char * s;
SV* ret;
s = SvPV(a,len);
ret = newSV(4*len);
for (i=0; i<len; i++,s++) {
sv_catpvf(ret, "\\%03o", *s);
}
return ret;
}

EOC

print c2o('abcdABCD'), "\n"; #line 22

D:\pscrpt\inline\>perl char2octal.pl
Use of uninitialized value in subroutine entry at char2octal.pl line 22.
\141\142\143\144\101\102\103\104
-----------------------------

I'm sure it's one of those questions that will make me go "Doh!", but I
can't for the life of me see what is causing that "uninitialized" warning.
Any hints ? (I'm running perl 5.8.8 on Win32.)

Cheers,
Rob

 xhoster@gmail.com 07-28-2006 11:55 PM

Re: Efficiently convert characters to octal representation

"Sisyphus" <sisyphus1@nomail.afraid.org> wrote:
> <xhoster@gmail.com> wrote in message
> .
> .
> >
> > Anyway, if you really need the speed, this Inline C code is about 3 times
> > faster than sol2.
> >

>
> A neat little Inline C routine .... so I saved the code and ran it:
>
> -----------------------------
> D:\pscrpt\inline\>cat char2octal.pl
> use warnings;
> use Inline C => Config =>
> BUILD_NOISY => 1;
>
> use Inline C => <<'EOC';
>
> SV* c2o(SV* a) {
> STRLEN len;
> int i;
> unsigned char * s;
> SV* ret;
> s = SvPV(a,len);
> ret = newSV(4*len);
> for (i=0; i<len; i++,s++) {
> sv_catpvf(ret, "\\%03o", *s);
> }
> return ret;
> }
>
> EOC
>
> print c2o('abcdABCD'), "\n"; #line 22
>
> D:\pscrpt\inline\>perl char2octal.pl
> Use of uninitialized value in subroutine entry at char2octal.pl line 22.
> \141\142\143\144\101\102\103\104
> -----------------------------
>
> I'm sure it's one of those questions that will make me go "Doh!", but I
> can't for the life of me see what is causing that "uninitialized"
> warning. Any hints ? (I'm running perl 5.8.8 on Win32.)

Ah, I forgot to turn on warnings and so never saw it.

Apparently sv_catpvf, unlike .= operator, doesn't care for undefined
values. So make that:

ret = newSV(4*len);
sv_setpv(ret, "");
for (i=0; i<len; i++,s++) {

I guess Inline warnings all get reported as being at subroutine entry?

For what it's worth, I've made another uglier one that is about twice again
as fast. This is going to wrap like crazy:

Xho

SV* sol32(SV* a) {
static const char * cache[]={"\\000","\\001","\\002","\\003","\\004",
"\\005","\\006","\\007","\\010","\\011","\\012","\\013","\\014","\\015",
"\\016","\\017","\\020","\\021","\\022","\\023","\\024","\\025","\\026",
"\\027","\\030","\\031","\\032","\\033","\\034","\\035","\\036","\\037",
"\\040","\\041","\\042","\\043","\\044","\\045","\\046","\\047","\\050",
"\\051","\\052","\\053","\\054","\\055","\\056","\\057","\\060","\\061",
"\\062","\\063","\\064","\\065","\\066","\\067","\\070","\\071","\\072",
"\\073","\\074","\\075","\\076","\\077","\\100","\\101","\\102","\\103",
"\\104","\\105","\\106","\\107","\\110","\\111","\\112","\\113","\\114",
"\\115","\\116","\\117","\\120","\\121","\\122","\\123","\\124","\\125",
"\\126","\\127","\\130","\\131","\\132","\\133","\\134","\\135","\\136",
"\\137","\\140","\\141","\\142","\\143","\\144","\\145","\\146","\\147",
"\\150","\\151","\\152","\\153","\\154","\\155","\\156","\\157","\\160",
"\\161","\\162","\\163","\\164","\\165","\\166","\\167","\\170","\\171",
"\\172","\\173","\\174","\\175","\\176","\\177","\\200","\\201","\\202",
"\\203","\\204","\\205","\\206","\\207","\\210","\\211","\\212","\\213",
"\\214","\\215","\\216","\\217","\\220","\\221","\\222","\\223","\\224",
"\\225","\\226","\\227","\\230","\\231","\\232","\\233","\\234","\\235",
"\\236","\\237","\\240","\\241","\\242","\\243","\\244","\\245","\\246",
"\\247","\\250","\\251","\\252","\\253","\\254","\\255","\\256","\\257",
"\\260","\\261","\\262","\\263","\\264","\\265","\\266","\\267","\\270",
"\\271","\\272","\\273","\\274","\\275","\\276","\\277","\\300","\\301",
"\\302","\\303","\\304","\\305","\\306","\\307","\\310","\\311","\\312",
"\\313","\\314","\\315","\\316","\\317","\\320","\\321","\\322","\\323",
"\\324","\\325","\\326","\\327","\\330","\\331","\\332","\\333","\\334",
"\\335","\\336","\\337","\\340","\\341","\\342","\\343","\\344","\\345",
"\\346","\\347","\\350","\\351","\\352","\\353","\\354","\\355","\\356",
"\\357","\\360","\\361","\\362","\\363","\\364","\\365","\\366","\\367",
"\\370","\\371","\\372","\\373","\\374","\\375","\\376","\\377"};

STRLEN len;
int i;
unsigned char * s;
SV* ret;
s = SvPV(a,len);
ret = newSV(4*len);
sv_setpv(ret, "");
for (i=0; i<len; i++,s++) {
sv_catpv(ret, cache[*s]);
};
return ret;
};
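
The same precomputed-table idea can also be sketched in pure Perl, for cases where Inline C is not an option. This is not code from the thread: the table and sub names are hypothetical, and whether it beats the sprintf-per-byte versions would need to be measured on real data.

```perl
use strict;
use warnings;

# Precompute the escape for every possible byte value once.
my @OCTAL = map { sprintf '\\%03o', $_ } 0 .. 255;

# Hypothetical name; assumes $_[0] is a byte string (all characters 0..255).
# unpack 'C*' yields the byte values, and the array slice looks each one up.
sub octal_via_table { join '', @OCTAL[ unpack 'C*', $_[0] ] }

print octal_via_table('worky'), "\n";   # \167\157\162\153\171
```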
