Velocity Reviews - Computer Hardware Reviews

Velocity Reviews > Newsgroups > Programming > Perl > Perl Misc > Why Is My Hash Assignment Taking So Long?

Reply
Thread Tools

Why Is My Hash Assignment Taking So Long?

 
 
James E Keenan
Guest
Posts: n/a
 
      10-13-2003
I need to build a hash whose elements represent characteristics of each of
9600 files in a single directory. I am trying to figure out why this hash
assignment is taking so long. Here is my code:

my %threads_chars = (
dir => "~/Threads",
pattern => "\.thr\.txt",
);

$threadsref = analyze_dir(\%threads_chars);

sub analyze_dir {
my $hashref = shift;
my $dir = ${$hashref}{'dir'};
my $pattern = ${$hashref}{'pattern'};
my (%filechars, $count);
chdir $dir or die "Couldn't change to $dir: $!";
opendir DIR, $dir or die "Couldn't open $dir: $!";

#1: get files in $dir which match pattern
my @files = grep { m|$pattern$|o } readdir DIR;
print '@files established for ', "$dir\n";

#2: build a hash where each element is keyed by the core of the
file name
# and the value is a ref to an array holding the file name, full
path, atime, mtime.
# Set a counter so I know how fast it's going
foreach (@files) {
$_ =~ m|^(.*)$pattern$|;
my $core = $1;
$filechars{$core} = [ $_, "$dir/$_", (stat($_))[8..9] ];
$count++;
if ($count % 100 == 0) { print 'count: ', "$count\n";}
}
closedir DIR or die "Couldn't close $dir: $!";
print "$dir HAS BEEN ANALYZED\n";
return \%filechars;
}

The counter prints to screen in 100-file increments. The first couple of
hundred files zip right by, but then the process slows to a crawl. This
occurred despite there being no other significant activity happening on this
disk. Why?

jimk


 
Reply With Quote
 
 
 
 
Anno Siegel
Guest
Posts: n/a
 
      10-13-2003
James E Keenan <> wrote in comp.lang.perl.misc:
> I need to build a hash whose elements represent characteristics of each of
> 9600 files in a single directory. I am trying to figure out why this hash
> assignment is taking so long. Here is my code:


That's a lot of files to keep in a single directory. That is the
reason why it's slow.

There are a few obvious inefficiencies in your code, but these have
no measurable effect.

> my %threads_chars = (
> dir => "~/Threads",
> pattern => "\.thr\.txt",


You have a quoting problem here. Backslashing the dots in a *string*
has no effect. After the string is built, the backslashes are gone.
Use qr// to quote patterns:

pattern => qr/\.thr\.txt/,

> );
>
> $threadsref = analyze_dir(\%threads_chars);
>
> sub analyze_dir {
> my $hashref = shift;
> my $dir = ${$hashref}{'dir'};
> my $pattern = ${$hashref}{'pattern'};
> my (%filechars, $count);
> chdir $dir or die "Couldn't change to $dir: $!";
> opendir DIR, $dir or die "Couldn't open $dir: $!";
>
> #1: get files in $dir which match pattern
> my @files = grep { m|$pattern$|o } readdir DIR;


Why do you build a list of all files just to process them once? Walk
through the directory and skip what you don't want. You'll save
the list (space efficiency) and you'll only have to apply the regex
once (time efficiency).

You should also take care to de-select anything that isn't a plain file.

> print '@files established for ', "$dir\n";
>
> #2: build a hash where each element is keyed by the core of the
> file name
> # and the value is a ref to an array holding the file name, full
> path, atime, mtime.
> # Set a counter so I know how fast it's going
> foreach (@files) {
> $_ =~ m|^(.*)$pattern$|;
> my $core = $1;
> $filechars{$core} = [ $_, "$dir/$_", (stat($_))[8..9] ];
> $count++;
> if ($count % 100 == 0) { print 'count: ', "$count\n";}
> }
> closedir DIR or die "Couldn't close $dir: $!";
> print "$dir HAS BEEN ANALYZED\n";
> return \%filechars;
> }
>
> The counter prints to screen in 100-file increments. The first couple of
> hundred files zip right by, but then the process slows to a crawl. This
> occurred despite there being no other significant activity happening on this
> disk. Why?


The stat() operation takes longer for files that are "late" in the
directory, in the sense that readdir() produces them late. Since
your directory is oversized, the difference is significant. Since
you process the files in the order in which they were found, the process
seems to slow down progressively. To test the hypothesis, just change

my @files = grep { m|$pattern$|o } readdir DIR;

in your code to

my @files = reverse grep { m|$pattern$|o } readdir DIR;

and watch it start out slow and grow faster as it goes. To gain speed,
spread the files out over more smaller directories.

Below is a cleaned-up version of analyze_dir(). It isn't any faster.

sub analyze_dir {
my ( $dir, $pattern) = @{ shift()}{ qw( dir pattern)};
opendir my $d, $dir or die "Can't open dir $dir: $!";
chdir $dir or die "Can't cd to $dir: $!";
my (%filechars, $count);
while ( defined( $_ = readdir $d) ) {
die "boo" unless defined $_;
next unless -f;
next unless /($pattern)/;
$filechars{ $1} = [ $_, "$dir/$_", (stat $_)[ 8 .. 9]];
print "count: $count\n" if ++$count % 100 == 0;
}

Anno
 
Reply With Quote
 
 
 
 
Sisyphus
Guest
Posts: n/a
 
      10-13-2003
James E Keenan wrote:

> foreach (@files) {
> $_ =~ m|^(.*)$pattern$|;
> my $core = $1;
> $filechars{$core} = [ $_, "$dir/$_", (stat($_))[8..9] ];
> $count++;
> if ($count % 100 == 0) { print 'count: ', "$count\n";}
> }


Can't really see the answer to your question, but it seems to me that
you could readily play around with that loop in order to find out just
which part of it is killing the show.

I would start by assigning to $filechars{$count} instead of
$filechars{$core}. If that makes no difference comment out 'my $core =
$1;' - then comment out the regex - then remove the stat() results from
the array reference. Hopefully by then you will have found out where the
time is going, and we can start pondering "why?"

Hth.

Cheers,
Rob


--
To reply by email u have to take out the u in kalinaubears.

 
Reply With Quote
 
James E Keenan
Guest
Posts: n/a
 
      10-13-2003

"Anno Siegel" <> wrote in message
news:bmdrkl$gji$...
> James E Keenan <> wrote in

comp.lang.perl.misc:
> > I need to build a hash whose elements represent characteristics of each

of
> > 9600 files in a single directory. I am trying to figure out why this

hash
> > assignment is taking so long. Here is my code:

>
> That's a lot of files to keep in a single directory. That is the
> reason why it's slow.
>
> There are a few obvious inefficiencies in your code, but these have
> no measurable effect.
>
> > my %threads_chars = (
> > dir => "~/Threads",
> > pattern => "\.thr\.txt",

>
> You have a quoting problem here. Backslashing the dots in a *string*
> has no effect. After the string is built, the backslashes are gone.
> Use qr// to quote patterns:
>
> pattern => qr/\.thr\.txt/,
>


Ah! An incentive to learn the 'qr' operator!.

> > );
> >
> > $threadsref = analyze_dir(\%threads_chars);
> >
> > sub analyze_dir {
> > my $hashref = shift;
> > my $dir = ${$hashref}{'dir'};
> > my $pattern = ${$hashref}{'pattern'};
> > my (%filechars, $count);
> > chdir $dir or die "Couldn't change to $dir: $!";
> > opendir DIR, $dir or die "Couldn't open $dir: $!";
> >
> > #1: get files in $dir which match pattern
> > my @files = grep { m|$pattern$|o } readdir DIR;

>
> Why do you build a list of all files just to process them once? Walk
> through the directory and skip what you don't want. You'll save
> the list (space efficiency) and you'll only have to apply the regex
> once (time efficiency).
>

I was trying to think of the way to walk thru the directory, but I was just
too tired to think. After a good night's sleep, I located the answer in the
Llama book, similar to what you have below.

> You should also take care to de-select anything that isn't a plain file.
>

As it happened, there was only one non-matching file in the directory. But
your point is well taken and I've done that in other scripts.

> > print '@files established for ', "$dir\n";
> >
> > #2: build a hash where each element is keyed by the core of the
> > file name
> > # and the value is a ref to an array holding the file name,

full
> > path, atime, mtime.
> > # Set a counter so I know how fast it's going
> > foreach (@files) {
> > $_ =~ m|^(.*)$pattern$|;
> > my $core = $1;
> > $filechars{$core} = [ $_, "$dir/$_", (stat($_))[8..9] ];
> > $count++;
> > if ($count % 100 == 0) { print 'count: ', "$count\n";}
> > }
> > closedir DIR or die "Couldn't close $dir: $!";
> > print "$dir HAS BEEN ANALYZED\n";
> > return \%filechars;
> > }
> >
> > The counter prints to screen in 100-file increments. The first couple

of
> > hundred files zip right by, but then the process slows to a crawl. This
> > occurred despite there being no other significant activity happening on

this
> > disk. Why?

>
> The stat() operation takes longer for files that are "late" in the
> directory, in the sense that readdir() produces them late. Since
> your directory is oversized, the difference is significant. Since
> you process the files in the order in which they were found, the process
> seems to slow down progressively.


Very interesting!

> To test the hypothesis, just change
>
> my @files = grep { m|$pattern$|o } readdir DIR;
>
> in your code to
>
> my @files = reverse grep { m|$pattern$|o } readdir DIR;
>
> and watch it start out slow and grow faster as it goes. To gain speed,
> spread the files out over more smaller directories.
>

I'll test these out! Thanks, Anno.

More generally, is the sort of too-many-files-in-a-directory slowdown I
experienced here true across operating systems? Or just on this one
(Windows98 SE)? (It does seem as if all directory-related operations
calling upon this directory are slow.)

jimk

jimk


 
Reply With Quote
 
John W. Krahn
Guest
Posts: n/a
 
      10-14-2003
James E Keenan wrote:
>
> "Anno Siegel" <> wrote in message
> news:bmdrkl$gji$...
> >
> > To test the hypothesis, just change
> >
> > my @files = grep { m|$pattern$|o } readdir DIR;
> >
> > in your code to
> >
> > my @files = reverse grep { m|$pattern$|o } readdir DIR;
> >
> > and watch it start out slow and grow faster as it goes. To gain speed,
> > spread the files out over more smaller directories.

>
> I'll test these out! Thanks, Anno.
>
> More generally, is the sort of too-many-files-in-a-directory slowdown I
> experienced here true across operating systems? Or just on this one
> (Windows98 SE)? (It does seem as if all directory-related operations
> calling upon this directory are slow.)


The FAT32 file system that Win9x uses is based on a file system
originally (IIRC) designed for floppy disks. Even if you defragmented
your entire disk (including directories) you would not get much of a
speed improvement. And yes, there are operating systems/file systems
that handle this better but I don't have any specifics.


John
--
use Perl;
program
fulfillment
 
Reply With Quote
 
Anno Siegel
Guest
Posts: n/a
 
      10-14-2003
James E Keenan <> wrote in comp.lang.perl.misc:
>
> "Anno Siegel" <> wrote in message
> news:bmdrkl$gji$...
> > James E Keenan <> wrote in

> comp.lang.perl.misc:
> > > I need to build a hash whose elements represent characteristics of each

> of
> > > 9600 files in a single directory. I am trying to figure out why this


[...]

> > The stat() operation takes longer for files that are "late" in the
> > directory, in the sense that readdir() produces them late. Since > > your directory is oversized, the difference is significant. Since
> > you process the files in the order in which they were found, the process
> > seems to slow down progressively.

>
> Very interesting!


[...]

> More generally, is the sort of too-many-files-in-a-directory slowdown I
> experienced here true across operating systems? Or just on this one
> (Windows98 SE)? (It does seem as if all directory-related operations
> calling upon this directory are slow.)


Generally very large directories are slow though there may be differences,
even large ones, between file various systems. Directories were invented
so you don't have all your files in one place, so they are perhaps not
optimized for large numbers. Most people would consider a few hundred
files in a directory quite crowded, especially if the directory is active.
Split them up so that no directory has more than 256 entries, that seems
to be a reasonable number.

When a directory is mostly passive (like, say, the fonts of something big
and bulky), you can store more files per directory, but even then I'd put
a limit at a thousand or so.

Anno
 
Reply With Quote
 
Uri Guttman
Guest
Posts: n/a
 
      10-14-2003
>>>>> "AS" == Anno Siegel <> writes:

AS> When a directory is mostly passive (like, say, the fonts of
AS> something big and bulky), you can store more files per directory,
AS> but even then I'd put a limit at a thousand or so.

the size of a file has little to do with how many files should be in a
directory. classic unix file systems totally isolate the directory
entry from the file contents. directory searches are linear which is why
they get very slow when there are many entries. i ran into this many
years ago and redesigned a subsystem to use multiple subdirs before
finding a particular file. it effectively converted a linear search to a
tree search. i used 2 digits of the log file name for 2 levels keeping
the number of entries in any directory to under 100. it was much much
faster than the old system which had 10k+ entries in one dir. this is
still a common and easy technique. some modern unix/linux filesystems
use trees or hashes for lookup and are much faster than linear. but you
are always guaranteed to have a classic linear filesystem around so this
subdir trick will work anywhere.

and i would limit dir sizes to way less than 1000. 100-300 seems to be a
good value.

and this is way OT as it has no perl content at all.

uri

--
Uri Guttman ------ -------- http://www.stemsystems.com
--Perl Consulting, Stem Development, Systems Architecture, Design and Coding-
Search or Offer Perl Jobs ---------------------------- http://jobs.perl.org
 
Reply With Quote
 
Anno Siegel
Guest
Posts: n/a
 
      10-15-2003
Uri Guttman <> wrote in comp.lang.perl.misc:
> >>>>> "AS" == Anno Siegel <> writes:

>
> AS> When a directory is mostly passive (like, say, the fonts of
> AS> something big and bulky), you can store more files per directory,
> AS> but even then I'd put a limit at a thousand or so.
>
> the size of a file has little to do with how many files should be in a
> directory...


Just for the record, I didn't mean to say it does.

Anno
 
Reply With Quote
 
James E Keenan
Guest
Posts: n/a
 
      10-15-2003

"Uri Guttman" <> wrote in message
news...
> >>>>> "AS" == Anno Siegel <> writes:

> [snip]
> i ran into this many
> years ago and redesigned a subsystem to use multiple subdirs before
> finding a particular file. it effectively converted a linear search to a
> tree search. i used 2 digits of the log file name for 2 levels keeping
> the number of entries in any directory to under 100. it was much much
> faster than the old system which had 10k+ entries in one dir. this is
> still a common and easy technique.


And after I pondered Anno's first response to my OP, I began to design a new
directory structure that works along very much the same principle (viz.,
breaking up this one big directory into smaller directories where the files
sort on a very simple principle such as the 1st letter of the file name).

> [snip]
>
> and i would limit dir sizes to way less than 1000. 100-300 seems to be a
> good value.
>


Agreed.

> and this is way OT as it has no perl content at all.
>

Disagree. See my original posting. I thought the posting might be due to
some problems in my Perl hash assignment. While Anno pointed out some ways
in which my code could be optimized (which I've since implemented), he
explained something about file systems I didn't previously know and
correctly diagnosed the problem as a system problem rather than a Perl
problem.

jimk


 
Reply With Quote
 
Anno Siegel
Guest
Posts: n/a
 
      10-15-2003
James E Keenan <> wrote in comp.lang.perl.misc:
>
> "Uri Guttman" <> wrote in message
> news...
> > >>>>> "AS" == Anno Siegel <> writes:


> > and i would limit dir sizes to way less than 1000. 100-300 seems to be a
> > good value.
> >

>
> Agreed.
>
> > and this is way OT as it has no perl content at all.
> >

> Disagree. See my original posting. I thought the posting might be due to
> some problems in my Perl hash assignment. While Anno pointed out some ways
> in which my code could be optimized (which I've since implemented), he
> explained something about file systems I didn't previously know and
> correctly diagnosed the problem as a system problem rather than a Perl
> problem.


....which means that my reply was off topic already. Let's no prolong it.

Anno
 
Reply With Quote
 
 
 
Reply

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off
Trackbacks are On
Pingbacks are On
Refbacks are Off


Similar Threads
Thread Thread Starter Forum Replies Last Post
Tied hash: Differentiating between assignment of single value andentire hash bernd Perl Misc 0 04-24-2012 02:41 PM
hash of hash of hash of hash in c++ rp C++ 1 11-10-2011 04:45 PM
Hash#select returns an array but Hash#reject returns a hash... Srijayanth Sridhar Ruby 19 07-02-2008 12:49 PM
why why why why why Mr. SweatyFinger ASP .Net 4 12-21-2006 01:15 PM
findcontrol("PlaceHolderPrice") why why why why why why why why why why why Mr. SweatyFinger ASP .Net 2 12-02-2006 03:46 PM



Advertisments
 



1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57