
looking for some size optimization

 
 
Marc Espie
 
      04-15-2007
I'm looking at a script that handles a huge amount of data... basically,
the filenames from +4000 packages in order to recognize conflicts.

Currently, it builds a big hash through a loop that constructs
$all_conflict like this:

my $file= File::Spec->canonpath($self->fullname());
push ${$all_conflict->{$file}}, $pkgname;


I end up with a hash of 250M.

I can't really use Devel::Size with any benefit, since all the data
is one single chunk (all the other data in the program amount to <2~3M)

I expect the $pkgname strings to be shared. In fact, I tried replacing
$pkgname with \$pkgname, with negative size results (+20M).

I've tried splitting the path along `frequent' directories, with inconclusive
gains (most of the files live under /usr/local, /etc, or /var/www):
-20M at most.

I'm looking for bright ideas to try and reduce the size used... without
being too detrimental in terms of speed...
 
 
 
 
 
Mumia W.
 
      04-15-2007
On 04/15/2007 07:22 AM, Marc Espie wrote:
> I'm looking at a script that handles a huge amount of data... basically,
> the filenames from +4000 packages in order to recognize conflicts.
>
> Currently, it builds a big hash through a loop that constructs
> $all_conflict like this:
>
> my $file= File::Spec->canonpath($self->fullname());
> push ${$all_conflict->{$file}}, $pkgname;
>
>
> I end up with a hash of 250M.
> [...]


If swapping is a problem, you could try using a tied database hash such
as DB_File or NDBM_File or SDBM_File or GDBM_File.

Perhaps try all of those to see which one performs best. If nothing
works, buy more RAM.
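A minimal sketch of the tied-hash idea (untested; the filename is
hypothetical). A DBM file stores flat strings, not references, so the
per-file package list would have to be kept as e.g. a NUL-joined string
instead of an array ref (or tied through MLDBM):

    use Fcntl;
    use DB_File;

    tie my %all_conflict, 'DB_File', 'conflicts.db',
        O_CREAT|O_RDWR, 0644, $DB_HASH
        or die "cannot tie conflicts.db: $!";

    # $file and $pkgname as in the original loop:
    $all_conflict{$file} = defined $all_conflict{$file}
        ? "$all_conflict{$file}\0$pkgname"
        : $pkgname;

    # later: my @pkgs = split /\0/, $all_conflict{$file};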


--
Count the YOYOs:
http://home.earthlink.net/~mumia.w.18.spam/games_fever/

 
 
 
 
 
anno4000@radom.zrz.tu-berlin.de
 
      04-15-2007
Marc Espie <(E-Mail Removed)> wrote in comp.lang.perl.misc:
> I'm looking at a script that handles a huge amount of data... basically,
> the filenames from +4000 packages in order to recognize conflicts.
>
> Currently, it builds a big hash through a loop that constructs
> $all_conflict like this:
>
> my $file= File::Spec->canonpath($self->fullname());
> push ${$all_conflict->{$file}}, $pkgname;
>
>
> I end up with a hash of 250M.


No, you end up with a syntax error, unless the second code line is
really

push @{$all_conflict->{$file}}, $pkgname;

I'll assume that, but please don't re-type code, copy/paste it to
make sure such errors don't happen.

> I can't really use Devel::Size with any benefit, since all the data
> is one single chunk (all the other data in the program amount to <2~3M)
>
> I expect the $pkgname strings to be shared. In fact, I tried replacing


What makes you expect that? You're pushing unrelated strings on
arrays.

> $pkgname with \$pkgname, with negative size results (+20M).


Sure. That way, you are storing a reference in addition to the strings.

> I've tried splitting the path along `frequent' directories, with inconclusive
> gains (most of the files live under /usr/local, /etc, or /var/www):
> -20M at most.


That comes at the cost of additional (anonymous, but real) hashes.

> I'm looking for bright ideas to try and reduce the size used... without
> being too detrimental in terms of speed...


The best storage strategy often depends on the nature of the data.
In your case, I'll assume that actual conflicts are rare. That means
the majority of your arrays of package names contain only one element.
That is wasteful. You could use two hashes, one to detect if a file
name has been seen before and another to keep information about actual
conflicts. Here is some code. I assume that pairs of package names
and file names can be read from DATA:

    my ( %seen, %conflicts);
    while ( <DATA> ) {
        my ( $package, $file) = split;
        if ( exists $seen{ $file} ) {
            $conflicts{ $file} ||= [ $seen{ $file}]; # transfer first package
            push @{ $conflicts{ $file} }, $package;  # add another package
        } else {
            $seen{ $file} = $package;                # mark as seen
        }
    }

    print "No conflicts\n" unless %conflicts;
    print "$_ in ", join( ', ' => @{ $conflicts{ $_} }), "\n"
        for keys %conflicts;

This stores the first package name a file is seen with as a plain
string in %seen, not a one-element array. Only when a file is seen
a second time is an array built in %conflicts. If conflicts are
rare, this should save some memory.

Anno
 
 
xhoster@gmail.com
 
      04-15-2007
Marc Espie <(E-Mail Removed)> wrote:
> I'm looking at a script that handles a huge amount of data... basically,
> the filenames from +4000 packages in order to recognize conflicts.


How many more than 4000 do you have? To take up the amount of room you
are talking about, it seems you would need orders of magnitude more
than 4000. But maybe I don't understand the nature of the data. What does
a conflict end up being represented as?

>
> Currently, it builds a big hash through a loop that constructs
> $all_conflict like this:
>
> my $file= File::Spec->canonpath($self->fullname());
> push ${$all_conflict->{$file}}, $pkgname;


Is $pkgname the name of the package declared in $file, or the name
of the package used in $file? In any event, are you pushing the same
$pkgname onto the same file's list over and over? Maybe you should use
a hash of hashes instead of a hash of arrays:

    $all_conflict->{$file}{$pkgname} = undef;
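A short note on what that buys (a sketch, hypothetical usage): the inner
hash's keys act as a duplicate-free set, and the package list can still
be recovered later:

    $all_conflict->{$file}{$pkgname} = undef;      # idempotent: no duplicates
    my @pkgs = keys %{ $all_conflict->{$file} };   # recover the package list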

> I end up with a hash of 250M.
>
> I can't really use Devel::Size with any benefit, since all the data
> is one single chunk (all the other data in the program amount to <2~3M)
>
> I expect the $pkgname strings to be shared.


Hash keys are shared (with substantial overhead) but array values are not.
If your paths are really so long, you might try compressing them, or
digesting them.
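A sketch of the digesting idea (assuming Digest::MD5; the readable path
is lost, so a side map would be needed for the rare files you actually
have to report):

    use Digest::MD5 qw(md5);

    my $key = md5($file);       # fixed 16 raw bytes per key, however
                                # long the path is
    push @{ $all_conflict->{$key} }, $pkgname;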

....
> I'm looking for bright ideas to try and reduce the size used... without
> being too detrimental in terms of speed...


Without knowing more about the exact nature of the data, it is hard to say.
Optimization that involves changing the way a structure is built is often
driven by the way the structure is used, so showing that part may also be
useful.

Xho

--
-------------------- http://NewsReader.Com/ --------------------
Usenet Newsgroup Service $9.95/Month 30GB
 
 
Marc Espie
 
      04-16-2007
In article <(E-Mail Removed)>,
<(E-Mail Removed)-berlin.de> wrote:
>Marc Espie <(E-Mail Removed)> wrote in comp.lang.perl.misc:
>> I'm looking at a script that handles a huge amount of data... basically,
>> the filenames from +4000 packages in order to recognize conflicts.
>>
>> Currently, it builds a big hash through a loop that constructs
>> $all_conflict like this:
>>
>> my $file= File::Spec->canonpath($self->fullname());
>> push ${$all_conflict->{$file}}, $pkgname;
>>
>>
>> I end up with a hash of 250M.

>
>No, you end up with a syntax error, unless the second code line is
>really
>
> push @{$all_conflict->{$file}}, $pkgname;
>
>I'll assume that, but please don't re-type code, copy/paste it to
>make sure such errors don't happen.


Oops, sorry about that. I usually copy&paste code indeed, and I could
have sworn I did.

>> I expect the $pkgname strings to be shared. In fact, I tried replacing

>
>What makes you expect that? You're pushing unrelated strings on
>arrays.


They're not necessarily unrelated.

I guess I'll have to share more code. I grab the full list
for each package, and populate the hash with that string like so:

    $plist->visit('register', $filehash, $dephash, $plist->pkgname());

(not quoting my plist and visitor code, I assume you can figure out that
it will end up calling the above snippet for each item in the packing-list)

so I am certain that I call the registration routine with an identical
string for each item in a given package. I haven't looked too closely at
perl's internals, but I would have assumed it to share the string in such
a case?

>The best storage strategy often depends on the nature of the data.
>In your case, I'll assume that actual conflicts are rare. That means
>the majority of your arrays of package names contain only one element.
>That is wasteful. You could use two hashes, one to detect if a file
>name has been seen before and another to keep information about actual
>conflicts. Here is some code. I assume that pairs of package names
>and file names can be read from DATA:
>
> my ( %seen, %conflicts);
> while ( <DATA> ) {
>     my ( $package, $file) = split;
>     if ( exists $seen{ $file} ) {
>         $conflicts{ $file} ||= [ $seen{ $file}]; # transfer first package
>         push @{ $conflicts{ $file} }, $package;  # add another package
>     } else {
>         $seen{ $file} = $package;                # mark as seen
>     }
> }


This looks like a good idea indeed.
I'm wondering if maybe I should build a cache of pkgname lists,
and try to share as many as I can...

Anyways, I'll try your idea (and the corresponding one) and tell you whether
I see any improvement.

While I'm there, responding to the other posters: sorry, my first message
wasn't clear. I do not want to go off to disk to a dbm or a database.
As it stands, the program works. It's just that 250M is over the default
ulimit of the system in question, which means that people have to remember
to bump the limits before running it. There's also the fact that the current
set of packages is bound to grow, so if I can find a good idea to reduce
memory usage, that would be cool. But using external storage is not the
solution I'm looking for.
 
 
Marc Espie
 
      04-16-2007
In article <evvq4o$5ua$(E-Mail Removed)>,
Marc Espie <(E-Mail Removed)> wrote:
>In article <(E-Mail Removed)>,
> <(E-Mail Removed)-berlin.de> wrote:
>> my ( %seen, %conflicts);
>> while ( <DATA> ) {
>>     my ( $package, $file) = split;
>>     if ( exists $seen{ $file} ) {
>>         $conflicts{ $file} ||= [ $seen{ $file}]; # transfer first package
>>         push @{ $conflicts{ $file} }, $package;  # add another package
>>     } else {
>>         $seen{ $file} = $package;                # mark as seen
>>     }
>> }

>
>This looks like a good idea indeed.
>I'm wondering if maybe I should build a cache of pkgname lists,
>and try to share as many as I can...


Thanks for the insight. With a few lines of additional code, this does shrink
the process footprint from >250M to 190M. Very nice indeed.
It doesn't appear to be slower; the computation is IO-bound anyways (reading
packing-lists from gzip'ed archives).

The code now looks like:

    my $file = File::Spec->canonpath($self->fullname());
    if (exists $all_conflict->{$file}) {
        $list->{$all_conflict->{$file}}->{$pkgname} ||=
            [@{$all_conflict->{$file}}, $pkgname];
        $all_conflict->{$file} = $list->{$all_conflict->{$file}}->{$pkgname};
    } else {
        $list->{$pkgname} ||= [$pkgname];
        $all_conflict->{$file} = $list->{$pkgname};
    }
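A self-contained sketch (hypothetical package and file names) of why
this shares storage: $list memoizes each package-set array under the
stringified ref of the set it extends, so files that accumulate the
same packages in the same order all end up holding one shared array
ref instead of their own copy:

    my ($list, $all_conflict) = ({}, {});
    sub register {
        my ($file, $pkgname) = @_;
        if (exists $all_conflict->{$file}) {
            $list->{$all_conflict->{$file}}{$pkgname} ||=
                [@{$all_conflict->{$file}}, $pkgname];
            $all_conflict->{$file} = $list->{$all_conflict->{$file}}{$pkgname};
        } else {
            $list->{$pkgname} ||= [$pkgname];
            $all_conflict->{$file} = $list->{$pkgname};
        }
    }
    register($_, 'pkgA') for qw(/a /b);
    register($_, 'pkgB') for qw(/a /b);
    # one shared ['pkgA', 'pkgB'] array, two hash entries:
    print "shared\n" if $all_conflict->{'/a'} == $all_conflict->{'/b'};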

I may try to tweak it a bit further, but it's already a very nice improvement.
(If somebody has other ideas, I'm still game.)

Anyways, thanks a lot

(that's the ports/infrastructure/packages/find-all-conflicts script used in
OpenBSD, btw)

I think I'm going to use an extra seen array as well... I need to do some
measurements, but I suspect most of my files don't appear in more than
one package, so scanning the conflicts hash later probably can be sped up
by a large amount.
 
 
Uri Guttman
 
      04-16-2007
>>>>> "ME" == Marc Espie <(E-Mail Removed)> writes:


ME> my $file = File::Spec->canonpath($self->fullname());
ME> if (exists $all_conflict->{$file}) {
ME>     $list->{$all_conflict->{$file}}->{$pkgname} ||=
ME>         [@{$all_conflict->{$file}}, $pkgname];

that is slow as you copy the existing array back into another anon
array. and you have a lot of code redundancy all over this. you always
are assigning an anon array but autovivification will handle that for
you. put this before the if (and i am not even sure you need a
conditional there at all but i haven't followed the logic flow)

    push( @{$all_conflict->{$file}}, $pkgname );

that will replace the first line in both clauses and be much faster as
well (no wasteful extra copying). it might even save space as perl won't
be allocating and freeing as many buffers so less storage could be used.
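A two-line illustration of the autovivification being relied on here:

    my %h;
    push @{ $h{'/some/file'} }, 'pkgA';   # @{ $h{'/some/file'} } springs
                                          # into existence on first use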


ME>     $all_conflict->{$file} = $list->{$all_conflict->{$file}}->{$pkgname};
ME> } else {
ME>     $list->{$pkgname} ||= [$pkgname];
ME>     $all_conflict->{$file} = $list->{$pkgname};
ME> }

as for the conflict hash, i am sure it can be reduced but i don't know
the logic. you change $all_conflict->{$file} after each push (your ||=
code) which makes no sense to me. maybe you should clearly explain the
data structure you want to get out of this. i have yet to see such an
explanation in this thread (or i am not awake yet). i can't see how 4000
entries of maybe a few hundred bytes each will use up 250MB (or even 190).

ME> (that's the ports/infrastructure/packages/find-all-conflicts
ME> script used in OpenBSD, btw)

any url to get that directly? if that is published code then my autoviv
fix will save tons of time for many users. that copy anon arrays to
themselves thing is massively bad code. for more on autovivification see
my article at http://sysarch.com/Perl/autoviv.txt.

knowing that the build code is poorly designed, now i am confident that
the data structure is also poorly designed and can be majorly optimized.

uri

--
Uri Guttman ------ (E-Mail Removed) -------- http://www.stemsystems.com
--Perl Consulting, Stem Development, Systems Architecture, Design and Coding-
Search or Offer Perl Jobs ---------------------------- http://jobs.perl.org
 
 
Marc Espie
 
      04-16-2007
In article <(E-Mail Removed)>,
Uri Guttman <(E-Mail Removed)> wrote:
>>>>>> "ME" == Marc Espie <(E-Mail Removed)> writes:

>
>
> ME> my $file= File::Spec->canonpath($self->fullname());
> ME> if (exists $all_conflict->{$file}) {
> ME> $list->{$all_conflict->{$file}}->{$pkgname} ||=
> ME> [@{$all_conflict->{$file}}, $pkgname ];
>
>that is slow as you copy the existing array back into another anon
>array. and you have a lot of code redundancy all over this. you always
>are assigning an anon array but autovivification will handle that for
>you. put this before the if (and i am not even sure you need a
>conditional there at all but i haven't followed the logic flow)
>
>    push( @{$all_conflict->{$file}}, $pkgname );


But still, the new structure takes less space than the old one.
The trick is that there is one single ref for each pkgname set now.

The simple push line will build one separate array for each file,
and this looks like it is the stuff gobbling memory.

>as for the conflict hash, i am sure it can be reduced but i don't know
>the logic. you change $all_conflict->{$file} after each push (your ||=
>code) which makes no sense to me. maybe you should clearly explain the
>data structure you want to get out of this. i have yet to see such an
>explanation in this thread (or i am not awake yet). i can't see how 4000
>entries of maybe a few hundred bytes each will use up 250MB (or even 190).


We are talking 4000 packages, which contain about 700000 files, total.

> ME> (that's the ports/infrastructure/packages/find-all-conflicts
> ME> script used in OpenBSD, btw)
>
>any url to get that directly? if that is published code then my autoviv
>fix will save tons of time for many users. that copy anon arrays to
>themselves thing is massively bad code. for more on autovivification see
>my article at http://sysarch.com/Perl/autoviv.txt.


>knowing that the build code is poorly designed, now i am confident that
>the data structure is also poorly designed and can be majorly optimized.


Go ahead, knock yourself out...

You can get this through OpenBSD's cvsweb, from openbsd.org.
http://www.openbsd.org/cgi-bin/cvswe...-all-conflicts

the extra pm files live under
http://www.openbsd.org/cgi-bin/cvswe...r.sbin/pkg_add
 
 
Ilya Zakharevich
 
      04-16-2007
[A complimentary Cc of this posting was sent to
Marc Espie
<(E-Mail Removed)>], who wrote in article <f00b1i$dmi$(E-Mail Removed)>:
> > ME> my $file= File::Spec->canonpath($self->fullname());
> > ME> if (exists $all_conflict->{$file}) {
> > ME> $list->{$all_conflict->{$file}}->{$pkgname} ||=
> > ME> [@{$all_conflict->{$file}}, $pkgname ];


> We are talking 4000 packages. Which contain about 700000 files, total.


So you want a hash with 700000 entries? With 100b/entry overhead, it
will take at least about 70M + the size of the entries.

You did not explain the distribution of the lengths of the arrays
which are the values of the hash. E.g., if you have a lot of singletons,
it would save space to store them directly in the hash (the price
being an extra check when adding a new entry).

You may also win by enumerating the packages, and storing the index:

    my $idx = $seen{$pkgname};
    $idx = $seen{$pkgname} = ++$max_pk_idx, $pk[$idx] = $pkgname
        unless defined $idx;
    push @{$all_conflict->{$file}}, 0 + $idx;  # Cvt to "pure-number"

Also, take care not to stringify $all_conflict->{$file}[$n] (using it
as a string would grow the stored scalar to cache a string buffer as
well, losing the savings); copy it to a scratch variable if you ever
need the string value of it...
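An expanded, commented version of the same enumeration idea (variable
names as in the snippet above; a sketch, not tested against the real
script):

    my (%seen, @pk, $max_pk_idx);   # name -> index, index -> name, counter

    my $idx = $seen{$pkgname};
    unless (defined $idx) {
        $idx = $seen{$pkgname} = ++$max_pk_idx;   # first sighting: new index
        $pk[$idx] = $pkgname;                     # keep the name around for
                                                  # reporting conflicts
    }
    push @{$all_conflict->{$file}}, 0 + $idx;     # store a small integer, not
                                                  # a copy of the name string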

Hope this helps,
Ilya
 
 
Uri Guttman
 
      04-16-2007
>>>>> "ME" == Marc Espie <(E-Mail Removed)> writes:

>> that is slow as you copy the existing array back into another anon
>> array. and you have a lot of code redundancy all over this. you always
>> are assigning an anon array but autovivification will handle that for
>> you. put this before the if (and i am not even sure you need a
>> conditional there at all but i haven't followed the logic flow)
>>
>>     push( @{$all_conflict->{$file}}, $pkgname );


ME> But still, the new structure takes less space than the old one.
ME> The trick is that there is one single ref for each pkgname set now.

that may be the case but the intermediate storage will be dramatically
reduced too with my code.

ME> The simple push line will build one separate array for each file,
ME> and this looks like it is the stuff gobbling memory.

i will take a look at that when i can.

>> as for the conflict hash, i am sure it can be reduced but i don't know
>> the logic. you change $all_conflict->{$file} after each push (your ||=
>> code) which makes no sense to me. maybe you should clearly explain the
>> data structure you want to get out of this. i have yet to see such an
>> explanation in this thread (or i am not awake yet). i can't see how 4000
>> entries of maybe a few hundred bytes each will use up 250MB (or even 190).


ME> We are talking 4000 packages, which contain about 700000 files, total.

i didn't see the 700k files part anywhere before. the dataset wasn't
defined for me.

>> knowing that the build code is poorly designed, now i am confident that
>> the data structure is also poorly designed and can be majorly optimized.


ME> Go ahead, knock yourself out...

ME> You can get this through OpenBSD's cvsweb, from openbsd.org.
ME> http://www.openbsd.org/cgi-bin/cvswe...-all-conflicts

is this your program for the bsd distro? or are you trying to improve it
for them?

uri

--
Uri Guttman ------ (E-Mail Removed) -------- http://www.stemsystems.com
--Perl Consulting, Stem Development, Systems Architecture, Design and Coding-
Search or Offer Perl Jobs ---------------------------- http://jobs.perl.org
 