Ram Prasad <> wrote:
> On Sep 29, 8:15 am, Nobody <nob...@nowhere.com> wrote:
> > On Wed, 28 Sep 2011 08:30:56 -0700, Ram Prasad wrote:
> > > I have a system that gets jobs in files which are stored in a
> > > directory tree structure.
> > > To get the current job queue size , I simply have to count all files
> > > in a particular directory ( including sub dirs )
> > > The queue size may be upto 2 million files
> > > I can get the size by using
> >
> > > find /path -type f | wc -l
> >
> > > But this is not fast enough. So I wrote a small directory search
> > > script to just count the number of files , can I optimize this
> > > further.
> >
> > If you only need it to work on Linux, you can usually eliminate the stat()
> > call by using the d_type field in the dirent structure.
> Thanks for the tip that greatly helped the speed.
> Also do I have to sprintf() everytime in the loop. Sorry I am not a
> regular C programmer .. used perl / php all the while
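Note that this will not work on all types of file systems (see
the details in the NOTES section of the man page readdir(3)):
where the file system doesn't supply the type, d_type is set to
DT_UNKNOWN and you still need a stat() call for that entry.
And then there are other possible issues with your program: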
> #define MAXPATH 1000
....
> char fullpath[MAXPATH];
....
> sprintf(fullpath,"%s/%s",path,ep->d_name);
First of all, MAXPATH might be way too short for all possible
paths (there are actually ways to find out what the maximum is,
more about that in a Linux or Unix newsgroup). And if it's too
short you could easily write past the end of the 'fullpath'
buffer, making everything your program outputs afterwards (if
it doesn't crash) dubious at best...
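
Just as a sketch of how the overflow could be avoided (the
function name join_path() is made up by me, it's not from your
program): snprintf() never writes more than 'size' bytes and
returns the length the complete string would have had, so a
truncated path can be detected instead of silently used.

```c
#include <assert.h>
#include <stdio.h>

/* Build "dir/name" in buf (hypothetical helper).
 * Returns 0 on success, -1 if the result would not fit. */
int join_path(char *buf, size_t size, const char *dir, const char *name)
{
    int n;

    assert(buf != NULL && size > 0);

    n = snprintf(buf, size, "%s/%s", dir, name);
    if (n < 0 || (size_t) n >= size)
        return -1;              /* output error or truncated */
    return 0;
}
```

In your loop that would be something like
'if (join_path(fullpath, sizeof fullpath, path, ep->d_name) != 0)'
followed by whatever error handling you prefer (skip the entry,
print a warning, bail out).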
And no, if you use the d_type element you only have to
construct the full name when you already know it's a directory,
not for normal files (since you then don't need to call
stat()). BTW, are you sure you want stat() and not lstat()?
stat() follows symbolic links, lstat() doesn't.
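
Since I haven't seen your complete program here's only a rough
sketch of how such a loop could look, not a drop-in replacement.
The name count_files() is my invention. It assumes Linux/glibc
(for d_type and the DT_* constants), counts only regular files,
doesn't follow symbolic links and falls back to lstat() when the
file system reports DT_UNKNOWN. The full name is only built for
directories and entries of unknown type.

```c
#define _DEFAULT_SOURCE     /* for d_type and DT_* with glibc */
#include <assert.h>
#include <dirent.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>

/* Count regular files below 'path' (hypothetical helper).
 * Directories that can't be opened contribute 0. */
long count_files(const char *path)
{
    DIR *dp;
    struct dirent *ep;
    long count = 0;

    assert(path != NULL);

    if ((dp = opendir(path)) == NULL)
        return 0;

    while ((ep = readdir(dp)) != NULL) {
        size_t len;
        char *full;
        int is_dir;

        if (!strcmp(ep->d_name, ".") || !strcmp(ep->d_name, ".."))
            continue;

        if (ep->d_type == DT_REG) {     /* no path, no stat() needed */
            count++;
            continue;
        }
        if (ep->d_type != DT_DIR && ep->d_type != DT_UNKNOWN)
            continue;                   /* symlink, FIFO, socket, ... */

        /* Only now build the full name, with a buffer sized to fit */
        len = strlen(path) + strlen(ep->d_name) + 2;
        if ((full = malloc(len)) == NULL)
            break;
        snprintf(full, len, "%s/%s", path, ep->d_name);

        is_dir = ep->d_type == DT_DIR;
        if (ep->d_type == DT_UNKNOWN) { /* file system gave no type */
            struct stat sb;

            if (lstat(full, &sb) == 0) {
                is_dir = S_ISDIR(sb.st_mode);
                if (S_ISREG(sb.st_mode))
                    count++;
            }
        }
        if (is_dir)
            count += count_files(full);
        free(full);
    }

    closedir(dp);
    return count;
}
```

Allocating the name on demand sidesteps the MAXPATH question
altogether, at the price of one malloc() per directory.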
Moreover, you could copy 'path' and the slash to the buffer
only once, store the position after the slash, and later on
overwrite only the 'ep->d_name' part.
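
Roughly like this (again only a sketch, set_prefix() and
set_name() are names I made up): the directory part is written
once per directory, and for each entry only the bytes after the
slash get replaced, with a length check instead of sprintf().

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Copy "dir/" into buf once. Returns the offset just past the
 * slash, or 0 if it doesn't fit (hypothetical helper). */
size_t set_prefix(char *buf, size_t size, const char *dir)
{
    size_t len;

    assert(buf != NULL && dir != NULL);

    len = strlen(dir);
    if (len + 2 > size)         /* room for '/' and '\0' */
        return 0;
    memcpy(buf, dir, len);
    buf[len] = '/';
    buf[len + 1] = '\0';
    return len + 1;
}

/* Overwrite only the part after the prefix. Returns 0 on
 * success, -1 if the name doesn't fit (hypothetical helper). */
int set_name(char *buf, size_t size, size_t prefix_len, const char *name)
{
    size_t len = strlen(name);

    if (prefix_len + len + 1 > size)
        return -1;
    memcpy(buf + prefix_len, name, len + 1);    /* copies the '\0' too */
    return 0;
}
```

You would call set_prefix() once after opendir() and set_name()
for each entry returned by readdir() that you need the full name
for.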
But when you compare run times you should be aware that they
can depend a great deal on how much information about the file
system the OS has already cached. With a similar program,
counting about 300 K files took about 2 minutes when run for
the first time and only about 0.4 seconds when run again
immediately afterwards on my machine. I got the same kind of
behaviour from 'find'. So the run time of your program is
likely dominated by caching effects, and all your attempts to
optimize may hardly be noticeable in real life if your program
is run only rarely on the same directory, with lots of time in
between.
Regards, Jens
--
\ Jens Thoms Toerring ___
\__________________________
http://toerring.de