How to speed up this slow part of my program

 
 
Justin C
03-28-2012
We have a database of thousands of clothing items. Some of the items are
almost identical apart from their size. Consequently we use the same
image in our web-shop to advertise items of the same style, design and
colour.

In a program whose job is to get new images from the art guy's
computer, I end up grepping the entire list of items once for every
file found; there must be a better way. The file names are exactly the
same as the style codes, apart from the size suffix being dropped. I'm
using File::Find.

Here's some code:

find(\&set_flag, (keys %{ $stock_groups->{text2code} }));

sub set_flag {
    return unless (-f $_ );

    (my $item_code_part = $_) =~ s/\.jpg//;
    $item_code_part = uc($item_code_part);
    $item_code_part =~ s|_|/|g;

    my @matches = grep(/$item_code_part/, keys %{ $stock_items });
    foreach my $i (@matches) {
        $stock_items->{$i}{got_image} = 1;
    }
}

The 'find' iterates through two top-level directories, 36 second-level
directories under each of those, and about 20k files across the entire
72 second-level directories. It then passes each filename to the sub,
which compares the filename (only a partial stock code, because it may
match several items) against the keys of stock_items, pulling out the
matches. Those matches are then iterated over, and each item with an
available image gets a hash element added and set to 1.

I could probably combine the grep and the iteration over the matches
into map {...} grep {...} keys %{ $stock_items }; but I don't think
that would save much time, and I'm not certain how to do it anyway: I
can see how to create a new hash, but I'm not sure whether changing
the hash while grep iterates over it is a good idea.
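
The combined version I have in mind is something like this (untested):

# same matches, same per-file scan over all the stock_items keys
map  { $stock_items->{$_}{got_image} = 1 }
grep { /$item_code_part/ } keys %{ $stock_items };

It does the same work as the grep + foreach above, just without the
temporary @matches array, so I doubt it changes the running time.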

The bottle-neck, as I see it, is running grep 20k times, once for each
image found. Can anyone suggest a better way?

Justin.

--
Justin C, by the sea.
 
Rainer Weikusat
03-28-2012
Justin C <(E-Mail Removed)> writes:

[...]

> find(\&set_flag, (keys %{ $stock_groups->{text2code} }));
>
> sub set_flag {
>     return unless (-f $_ );
>
>     (my $item_code_part = $_) =~ s/\.jpg//;
>     $item_code_part = uc($item_code_part);
>     $item_code_part =~ s|_|/|g;
>
>     my @matches = grep(/$item_code_part/, keys %{ $stock_items });
>     foreach my $i (@matches) {
>         $stock_items->{$i}{got_image} = 1;
>     }
> }
>
> The 'find' iterates through two top level directories, 36 next level
> directories in each of the top level, and about 20k files across the
> entire 72 level 2 directories. It then passes the filename to the sub
> which compares the filename (which is only a partial stock code because
> it may have several matches) with the hash of stock_items, pulling out
> matches. Those matches are then iterated over and the items with an
> available image get a hash element added and set to 1.


[...]

> The bottle-neck, as I see it, is running grep 20k times, once for each
> image found. Can anyone suggest a better way?


"Doing linear scans over an associative array is like trying to club
someone to death with a loaded Uzi."

An obvious suggestion would be to traverse the stock_items keys once
and build a second hash which maps each item_code_part to an array
reference containing all the corresponding stock_items keys (or even
to references to the corresponding stock_items elements), and use this
second hash to locate the keys which need to be changed (or the hash
locations which need to be changed) in constant time.
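
A minimal sketch of that idea (the derive_code_part rule below, which
simply drops everything after the last '/', is only an assumption
about how the partial codes relate to the full stock codes; the real
rule has to come from the actual naming convention):

use strict;
use warnings;

my $stock_items = {};    # stock code => hashref, as in the original program
my %by_code_part;        # item_code_part => [ matching stock codes ]

# Assumed rule: the size is the last '/'-separated field and is dropped.
sub derive_code_part {
    my ($code) = @_;
    (my $part = $code) =~ s{/[^/]+\z}{};
    return $part;
}

# Build the index once, proportional to the number of stock items.
for my $code (keys %$stock_items) {
    push @{ $by_code_part{ derive_code_part($code) } }, $code;
}

# set_flag then does one hash lookup per file instead of a grep over
# all stock_items keys.
sub set_flag {
    return unless -f $_;

    (my $item_code_part = $_) =~ s/\.jpg\z//;
    $item_code_part = uc $item_code_part;
    $item_code_part =~ tr{_}{/};

    for my $code (@{ $by_code_part{$item_code_part} || [] }) {
        $stock_items->{$code}{got_image} = 1;
    }
}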

NB: Someone should probably repost that, since 'Justin who keeps
coming out of my killfile' has chosen to 'block' himself from seeing
my answers to his questions, for 'clearly indecent behaviour' ...
 
Tim McDaniel
03-28-2012
In article <(E-Mail Removed)>,
Ben Morrow <(E-Mail Removed)> wrote:
>I would probably turn this into a big pattern match. Something like
>this:
>
> use File::Find::Rule;
>
> my ($imgs) = map qr/$_/, join "|", map "\Q\U$_",
>     map { (my ($x) = /(.*)\.jpg/) =~ s!_!/!g; $x }
>     File::Find::Rule->file->in(keys %{...});
>
> while (my ($item, $entry) = each %$stock_items) {
>     $item =~ $imgs and $entry->{got_image} = 1;
> }
>
>If you're using 5.14 you can get rid of the ugly map block using s///r
>and tr///r:
>
> map tr!_!/!r, map s/\.jpg//r,
>
>This assumes the entries in %$stock_items are already hashrefs; if they
>aren't a 'for (keys %$stock_items)' loop will be easier.


What's s///r and tr///r? I looked in "man perlop" and "man perlre"
for a system with 5.14.2, but I didn't see them.
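
For reference, /r is the non-destructive modifier added in 5.14: with
it, s/// and tr/// leave their target untouched and return the
modified copy, which is what allows chaining them inside map. A
minimal illustration (the file name is made up):

my $name = "shirt_red_xl.jpg";
my $code = uc($name =~ s/\.jpg\z//r);   # "SHIRT_RED_XL"; $name is unchanged
my $key  = $code =~ tr{_}{/}r;          # "SHIRT/RED/XL"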

--
Tim McDaniel, http://www.velocityreviews.com/forums/(E-Mail Removed)
 
Rainer Weikusat
03-28-2012
Ben Morrow <(E-Mail Removed)> writes:
> Quoth Justin C <(E-Mail Removed)>:


[...]

>> find(\&set_flag, (keys %{ $stock_groups->{text2code} }));
>>
>> sub set_flag {
>> return unless (-f $_ );
>>
>> (my $item_code_part = $_) =~ s/\.jpg//;
>> $item_code_part = uc($item_code_part);
>> $item_code_part =~ s|_|/|g;
>>
>> my @matches = grep(/$item_code_part/, keys %{ $stock_items });

>
> Careful: you want \Q there, even if you think you're sure the filenames
> are all safe.
>
>> foreach my $i (@matches) {
>> $stock_items->{$i}{got_image} = 1;
>> }
>> }

>
> I would probably turn this into a big pattern match. Something like
> this:
>
> use File::Find::Rule;
>
> my ($imgs) = map qr/$_/, join "|", map "\Q\U$_",
> map { (my ($x) = /(.*)\.jpg/) =~ s!_!/!g; $x }
> File::Find::Rule->file->in(keys %{...});
>
> while (my ($item, $entry) = each %$stock_items) {
> $item =~ $imgs and $entry->{got_image} = 1;
> }


Something I should already have written last time: A hash lookup is
(supposed to be) an O(1) operation. Matching against a set of
alternative patterns is O(n). In this particular case, this means that
the algorithm you suggest still scales as badly as the originally used
one in this respect (run time proportional to the number of patterns
times the number of stock items), except that it is possibly somewhat
faster (although I wouldn't want to bet on that blindly).

The sensible way to do a lot of dictionary lookups is using a
dictionary, especially considering that Perl has native support for
that.
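
One way to measure the difference on toy data rather than argue about
it (the codes below are fabricated, and the regex engine's
optimisations for literal alternations may narrow the gap):

use strict;
use warnings;
use Benchmark qw(cmpthese);

# 20_000 made-up stock codes standing in for the real ones
my @codes  = map { sprintf "STYLE%05d/RED/XL", $_ } 1 .. 20_000;
my %lookup = map { $_ => 1 } @codes;

# one big alternation, compiled once, as in the suggested approach
my $pattern = do {
    my $alt = join '|', map quotemeta, @codes;
    qr/^(?:$alt)$/;
};

my $probe = $codes[12_345];
cmpthese(-1, {
    hash_exists => sub { my $hit = exists $lookup{$probe} },
    regex_match => sub { my $hit = $probe =~ $pattern },
});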


 
Rainer Weikusat
03-28-2012
Rainer Weikusat <(E-Mail Removed)> writes:

> Ben Morrow <(E-Mail Removed)> writes:
>> Quoth Justin C <(E-Mail Removed)>:

>
> [...]
>
>>> find(\&set_flag, (keys %{ $stock_groups->{text2code} }));
>>>
>>> sub set_flag {
>>> return unless (-f $_ );
>>>
>>> (my $item_code_part = $_) =~ s/\.jpg//;
>>> $item_code_part = uc($item_code_part);
>>> $item_code_part =~ s|_|/|g;
>>>
>>> my @matches = grep(/$item_code_part/, keys %{ $stock_items });

>>
>> Careful: you want \Q there, even if you think you're sure the filenames
>> are all safe.
>>
>>> foreach my $i (@matches) {
>>> $stock_items->{$i}{got_image} = 1;
>>> }
>>> }

>>
>> I would probably turn this into a big pattern match. Something like
>> this:
>>
>> use File::Find::Rule;
>>
>> my ($imgs) = map qr/$_/, join "|", map "\Q\U$_",
>> map { (my ($x) = /(.*)\.jpg/) =~ s!_!/!g; $x }
>> File::Find::Rule->file->in(keys %{...});
>>
>> while (my ($item, $entry) = each %$stock_items) {
>> $item =~ $imgs and $entry->{got_image} = 1;
>> }


[...]

> Matching against a set of alternative patterns is O(n). In this
> particular case, this means that the algorithm you suggest still
> scales as badly as the originally used one


OTOH, I have now learnt why 'autothreading' is supposed to be
useful. I can imagine the following set of 'optimizations' which led
to it:

1. Do it as in the original example. The algorithm is
quadratic and doesn't scale, performance sucks.

2. Assemble a giant regexp: The algorithm is still quadratic
and doesn't scale, performance sucks.

3. Turn the giant regexp into a suitable kind of search tree:
Unfortunately, the algorithm is still quadratic and doesn't
scale, performance sucks.

4. Go back to 2, use as many cores as available in order to
try alternatives in parallel: Given that the algorithm
remains quadratic and doesn't scale, performance still sucks,
but now it at least flattens the complete system.

[5. Give up in despair and use a Hadoop cluster. Add more
nodes whenever the fact that quadratic algorithms don't scale,
and that performance sucks because of this, becomes too obvious.]

Steps 1 - 5 also involve a successive increase in code complexity,
starting from 'more complicated than necessary' and ending with
'absolutely insanely more complicated than necessary' ...
 
Rainer Weikusat
03-29-2012
Christian Winter <(E-Mail Removed)> writes:

[...]


> I've had similar problems in my day-to-day work (iterating over
> thousands of files and building relationships with database contents)
> and found the only truly efficient approach is to avoid repetitive
> lookups.


The only 'truly efficient' way to solve such a problem is 'use a
suitable lookup algorithm'. Otherwise, you are always betting on
"runtime proportional to n*n won't matter because n will always be
small".
 
Rainer Weikusat
03-29-2012
Ben Morrow <(E-Mail Removed)> writes:
> Quoth Rainer Weikusat <(E-Mail Removed)>:
>> Justin C <(E-Mail Removed)> writes:
>>
>> > my @matches = grep(/$item_code_part/, keys %{ $stock_items });

>>
>> An obvious suggestion would be to traverse the stock_items keys once
>> and build a second hash which maps each item_code_part to an array
>> reference containing all the corresponding stock_items keys (or even
>> to a reference to the corresponding stock_items element) and use this
>> second hash to locate the keys which need to be changed (or the hash
>> locations which need to be changed) in constant time.

>
> Given the problem as stated, I don't believe that's practical.
> $item_code_part can match the stock_item anywhere along its length, so
> you would need to enter each stock_item under all of its substrings.
> That obviously turns into a combinatorial nightmare very fast.
>
> If, of course, there is some further constraint we don't know about,
> like 'the item code part is the section of the stock item up to the
> third underscore', then the approach you suggest is the right one.


The fact that the posted code behaves in this way doesn't mean that
this behaviour is strictly necessary to solve the problem. If it were
otherwise, coding errors wouldn't exist. And in this particular case,
I would risk a bet on the assumption that the regular expression match
could be replaced with a comparison of the item_code_part and a
substring of each stock_items key whose start and length are the same
for each key and known in advance, IOW, that, given a stock_items key,
its item_code_part can be calculated without knowing the set of all
item_code_parts presently corresponding to image files.
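
A rough sketch of that bet (again purely an assumption: the partial
code is taken to be the stock code with its last '/'-separated field,
i.e. the size, dropped):

use strict;
use warnings;
use File::Find;

my $stock_items = {};   # stock code => hashref, as in the original program
my %image_seen;         # partial codes for which an image file exists

# Pass 1: record every partial code found on disk.
sub record_image {
    return unless -f $_;
    (my $part = $_) =~ s/\.jpg\z//;
    $part = uc $part;
    $part =~ tr{_}{/};
    $image_seen{$part} = 1;
}
# find(\&record_image, keys %{ $stock_groups->{text2code} });  # as in the original

# Pass 2: one O(1) lookup per stock item, no grep at all.
for my $code (keys %$stock_items) {
    (my $part = $code) =~ s{/[^/]+\z}{};    # assumed "drop the size" rule
    $stock_items->{$code}{got_image} = 1 if $image_seen{$part};
}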

Independently of this, another sensible idea would be to turn the
'stock item key' into an attribute of a data structure, say, an array
reference, put these data structures into an array, traverse the array
to locate the first matching item, process it once it has been found,
and remove it from the array afterwards (or use a programming language
designed for people who don't want to be bothered with outlandish
distinctions like that, e.g. PHP).

 
J. Gleixner
03-29-2012
On 03/28/12 10:24, Justin C wrote:
> We have a database of thousands of clothing items. Some of the items are
> almost identical apart from their size. Consequently we use the same
> image in our web-shop to advertise items of the same style, design and
> colour.
>[...]
> The bottle-neck, as I see it, is running grep 20k times, once for each
> image found. Can anyone suggest a better way?


Since you already have the data in a DB, I'd suggest associating these
files with the items in the database.

Maybe store the path to the file, or possibly the image itself as a
BLOB, as a many-to-one relationship.
 
Rainer Weikusat
03-29-2012
Rainer Weikusat <(E-Mail Removed)> writes:

[...]

> 3. Turn the giant regexp into a suitable kind of search tree:
> Unfortunately, the algorithm is still quadratic and doesn't
> scale, performance sucks.


This was an error on my part: the total running time should be
proportional to the number of stock_items times log(number of images),
and while that still grows faster than linearly, it is not 'quadratic'
as the term is commonly used and understood. But for 20k images, the
running time will still be proportional to the number of stock items
times 14, much larger than what can be achieved by 'throwing memory at
the problem' aka 'using a hash table'.
 
Rainer Weikusat
03-29-2012
Ben Morrow <(E-Mail Removed)> writes:
> Quoth Rainer Weikusat <(E-Mail Removed)>:
>>
>> The algorithm is quadratic and doesn't scale, performance sucks.

>
> The first part of that sentence does not imply the second: 'N is usually
> small' and all that.


Assuming that the original problem was stated correctly, 'N' is not
small here. Further, when 'N' is 'small',

> When you're comparing heavily optimized C like the
> regex engine with pretty much *any* logic in Perl,
> 'small' can end up being quite large.


the overhead of 'pretty much any logic in Perl' shouldn't
matter. Also, last time I looked, hashes were a datatype built into
Perl.
 
Similar Threads
Thread Thread Starter Forum Replies Last Post
Re: How include a large array? Edward A. Falk C Programming 1 04-04-2013 08:07 PM
Re: slow slow slow! Expert lino fitter Computer Support 5 12-12-2008 04:00 PM
Re: slow slow slow! Beauregard T. Shagnasty Computer Support 2 12-10-2008 09:03 PM
Re: slow slow slow! Expert lino fitter Computer Support 0 12-10-2008 02:33 PM
speed speed speed a.metselaar Computer Support 14 12-30-2003 03:34 AM


