Quickly get count of files on linux

 
 
Ram Prasad
09-28-2011
I have a system that gets jobs as files, which are stored in a
directory tree structure.
To get the current job queue size, I simply have to count all the files
in a particular directory (including subdirectories).
The queue size may be up to 2 million files.
I can get the count by using

find /path -type f | wc -l

but this is not fast enough, so I wrote a small directory-scanning
program to just count the number of files. Can I optimize this
further? Currently the program takes longer than optimal:
0.7 s for a queue size of 300 k.

The program will always run only on Linux, so I don't bother about
compatibility anyway.




#include <stdio.h>
#include <sys/types.h>
#include <dirent.h>
#include <sys/stat.h>
#include <stdlib.h>
#include <string.h>

#if STAT_MACROS_BROKEN
# undef S_ISDIR
#endif

#define MAXPATH 1000

#if !defined S_ISDIR && defined S_IFDIR
# define S_ISDIR(Mode) (((Mode) & S_IFMT) == S_IFDIR)
#endif

/* I think this function is the bottleneck */
int isdir(const char *path)
{
    struct stat stats;
    return stat(path, &stats) == 0 && S_ISDIR(stats.st_mode);
}

/* Recursively count the non-directory entries under 'path'.
   Entries whose names start with '.' are skipped, which covers "."
   and ".." but also hides dot-files. */
int dirnscan(const char *path)
{
    char fullpath[MAXPATH];
    DIR *dp;
    struct dirent *ep;
    int n = 0;

    dp = opendir(path);
    if (dp == NULL)
        return 0;

    while ((ep = readdir(dp))) {
        if (ep->d_name[0] == '.')
            continue;
        sprintf(fullpath, "%s/%s", path, ep->d_name);
        if (isdir(fullpath) == 0)
            ++n;                        /* not a directory: count it   */
        else
            n = n + dirnscan(fullpath); /* directory: recurse into it  */
    }

    closedir(dp);
    return n;
}

int main(int argc, char *argv[])
{
    if (argc < 2) {
        fprintf(stderr, "usage: %s <directory>\n", argv[0]);
        return 1;
    }
    printf("%d\n", dirnscan(argv[1]));
    return 0;
}
 
Keith Thompson
09-28-2011
Ram Prasad <(E-Mail Removed)> writes:
> I have a system that gets jobs as files, which are stored in a
> directory tree structure.
> To get the current job queue size, I simply have to count all the files
> in a particular directory (including subdirectories).
> The queue size may be up to 2 million files.
> I can get the count by using
>
> find /path -type f | wc -l
>
> but this is not fast enough, so I wrote a small directory-scanning
> program to just count the number of files. Can I optimize this
> further? Currently the program takes longer than optimal:
> 0.7 s for a queue size of 300 k.
>
> The program will always run only on Linux, so I don't bother about
> compatibility anyway.

[43 lines deleted]

You'll get better answers on comp.unix.programmer.

(But I doubt that you'll get much improvement; the "find" command
already has to do all the work you're doing. Can your queueing
system just keep track of the number of jobs itself?)

--
Keith Thompson (The_Other_Keith) (E-Mail Removed) <http://www.ghoti.net/~kst>
"We must do something. This is something. Therefore, we must do this."
-- Antony Jay and Jonathan Lynn, "Yes Minister"
 
Ben Bacarisse
09-29-2011
China Blue Corn Chips <(E-Mail Removed)> writes:

> In article <(E-Mail Removed)>, Keith Thompson <(E-Mail Removed)>
> wrote:
>
>> Ram Prasad <(E-Mail Removed)> writes:
>> > I have a system that gets jobs as files, which are stored in a
>> > directory tree structure.
>> > To get the current job queue size, I simply have to count all the
>> > files in a particular directory (including subdirectories).
>> > The queue size may be up to 2 million files.
>> > I can get the count by using
>> >
>> > find /path -type f | wc -l

>
>> (But I doubt that you'll get much improvement; the "find" command
>> already has to do all the work you're doing. Can your queueing
>> system just keep track of the number of jobs itself?)

>
> The problem isn't find but wc. The only relevant output of find is
> the \n characters, but wc has to read every other character to find those.


I don't think that matters all that much:

$ time find /dir/with/long/paths -type f | wc -l
115562

real 0m0.385s
user 0m0.090s
sys 0m0.320s

$ time find /dir/with/long/paths -type f -printf "\n" | wc -l
115562

real 0m0.322s
user 0m0.100s
sys 0m0.210s

It's faster, but not by much (average path length about 100 bytes).

<snip>
--
Ben.
 
Nobody
09-29-2011
On Wed, 28 Sep 2011 08:30:56 -0700, Ram Prasad wrote:

> I have a system that gets jobs as files, which are stored in a
> directory tree structure.
> To get the current job queue size, I simply have to count all the files
> in a particular directory (including subdirectories).
> The queue size may be up to 2 million files.
> I can get the count by using
>
> find /path -type f | wc -l
>
> but this is not fast enough, so I wrote a small directory-scanning
> program to just count the number of files. Can I optimize this
> further?


If you only need it to work on Linux, you can usually eliminate the stat()
call by using the d_type field in the dirent structure.

For more information, see the readdir(3) manual page (note: *not* the
readdir(2) manual page; the readdir() function in libc uses the getdents()
system call, not the readdir() system call).
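
A minimal sketch of that (untested; the function name here is invented),
keeping the shape of the original program but replacing the stat() call
with a d_type check:

#define _DEFAULT_SOURCE              /* for d_type / DT_DIR on glibc */
#include <stdio.h>
#include <dirent.h>

#define MAXPATH 1000

static int count_files(const char *path)
{
    char fullpath[MAXPATH];
    DIR *dp = opendir(path);
    struct dirent *ep;
    int n = 0;

    if (dp == NULL)
        return 0;
    while ((ep = readdir(dp))) {
        if (ep->d_name[0] == '.')
            continue;
        if (ep->d_type == DT_DIR) {
            /* only directories need the full path, for the recursion */
            snprintf(fullpath, sizeof fullpath, "%s/%s", path, ep->d_name);
            n += count_files(fullpath);
        } else {
            ++n;   /* anything else is counted without a stat() call */
        }
    }
    closedir(dp);
    return n;
}

int main(int argc, char *argv[])
{
    if (argc < 2)
        return 1;
    printf("%d\n", count_files(argv[1]));
    return 0;
}

(d_type can be DT_UNKNOWN, i.e. 0, on some filesystems, in which case a
stat() fallback is still needed.)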

Also, if this is something you have to do repeatedly, you can cache the
total for each directory, and only re-scan if the directory's modification
time changes.
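
A rough sketch of that caching idea (untested; all names here are
invented for the example). Keep in mind that a directory's mtime only
changes when entries are created, removed or renamed directly inside
it, so the cache has to be kept per directory and the scan still has
to descend into subdirectories to check their mtimes (and mtime
granularity can hide very rapid changes):

#define _DEFAULT_SOURCE              /* for d_type / DT_DIR on glibc */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <dirent.h>
#include <sys/types.h>
#include <sys/stat.h>

#define MAXPATH 1000

struct dircache {
    char   path[MAXPATH];
    time_t mtime;
    int    nfiles;               /* direct non-directory entries   */
    char **subdirs;              /* names of direct subdirectories */
    int    nsub;
    struct dircache *next;
};

static struct dircache *cache;   /* simple linked list, for illustration */

/* (re)read one directory into its cache entry */
static void rescan(struct dircache *c, const char *path, time_t mtime)
{
    DIR *dp;
    struct dirent *ep;
    int i;

    for (i = 0; i < c->nsub; i++)
        free(c->subdirs[i]);
    free(c->subdirs);
    c->subdirs = NULL;
    c->nsub = c->nfiles = 0;
    c->mtime = mtime;
    snprintf(c->path, MAXPATH, "%s", path);

    if ((dp = opendir(path)) == NULL)
        return;
    while ((ep = readdir(dp))) {
        if (ep->d_name[0] == '.')
            continue;
        if (ep->d_type == DT_DIR) {     /* same DT_UNKNOWN caveat as above */
            c->subdirs = realloc(c->subdirs, (c->nsub + 1) * sizeof *c->subdirs);
            c->subdirs[c->nsub++] = strdup(ep->d_name);
        } else {
            c->nfiles++;
        }
    }
    closedir(dp);
}

/* count all files below 'path', reusing cached per-directory results */
int cached_count(const char *path)
{
    struct stat st;
    struct dircache *c;
    char sub[MAXPATH];
    int i, n;

    if (stat(path, &st) != 0)
        return 0;
    for (c = cache; c; c = c->next)
        if (strcmp(c->path, path) == 0)
            break;
    if (c == NULL) {                        /* first time we see this dir */
        c = calloc(1, sizeof *c);
        c->next = cache;
        cache = c;
        rescan(c, path, st.st_mtime);
    } else if (c->mtime != st.st_mtime) {   /* changed since last scan    */
        rescan(c, path, st.st_mtime);
    }
    n = c->nfiles;                  /* unchanged directories cost one stat() */
    for (i = 0; i < c->nsub; i++) {
        snprintf(sub, sizeof sub, "%s/%s", path, c->subdirs[i]);
        n += cached_count(sub);
    }
    return n;
}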

 
Ram Prasad
09-29-2011
On Sep 29, 8:15 am, Nobody <(E-Mail Removed)> wrote:
> On Wed, 28 Sep 2011 08:30:56 -0700, Ram Prasad wrote:
> > I have a system that gets jobs as files, which are stored in a
> > directory tree structure.
> > To get the current job queue size, I simply have to count all the
> > files in a particular directory (including subdirectories).
> > The queue size may be up to 2 million files.
> > I can get the count by using
> >
> > find /path -type f | wc -l
> >
> > but this is not fast enough, so I wrote a small directory-scanning
> > program to just count the number of files. Can I optimize this
> > further?
>
> If you only need it to work on Linux, you can usually eliminate the stat()
> call by using the d_type field in the dirent structure.


Thanks for the tip, that greatly helped the speed.
Also, do I have to call sprintf() every time in the loop? Sorry, I am not
a regular C programmer; I have used Perl / PHP all along.
 
Eric Sosman
09-29-2011
On 9/29/2011 7:23 AM, Ram Prasad wrote:
> On Sep 29, 8:15 am, Nobody <(E-Mail Removed)> wrote:
>> [...]
>> If you only need it to work on Linux, you can usually eliminate the stat()
>> call by using the d_type field in the dirent structure.
>
> Thanks for the tip, that greatly helped the speed.
> Also, do I have to call sprintf() every time in the loop? Sorry, I am not
> a regular C programmer; I have used Perl / PHP all along.


Please note that the helpful tip had nothing at all to do with
the C programming language, and everything to do with the behavior
of the Linux system on which your program runs. Ponder that, next
time you have a question and are wondering which forum would be
most suitable.

--
Eric Sosman
(E-Mail Removed)
 
Jens Thoms Toerring
09-29-2011
Ram Prasad <(E-Mail Removed)> wrote:
> On Sep 29, 8:15 am, Nobody <(E-Mail Removed)> wrote:
> > On Wed, 28 Sep 2011 08:30:56 -0700, Ram Prasad wrote:
> > > I have a system that gets jobs as files, which are stored in a
> > > directory tree structure.
> > > To get the current job queue size, I simply have to count all the
> > > files in a particular directory (including subdirectories).
> > > The queue size may be up to 2 million files.
> > > I can get the count by using
> > >
> > > find /path -type f | wc -l
> > >
> > > but this is not fast enough, so I wrote a small directory-scanning
> > > program to just count the number of files. Can I optimize this
> > > further?
> >
> > If you only need it to work on Linux, you can usually eliminate the stat()
> > call by using the d_type field in the dirent structure.
>
> Thanks for the tip, that greatly helped the speed.
> Also, do I have to call sprintf() every time in the loop? Sorry, I am not
> a regular C programmer; I have used Perl / PHP all along.


Note that this will not work on all types of file systems (see the
details in the NOTES section of the readdir(3) man page). And then
there are other possible issues with your program:

> #define MAXPATH 1000

....
> char fullpath[MAXPATH];

....
> sprintf(fullpath,"%s/%s",path,ep->d_name);


First of all, MAXPATH might be way too short for all possible
paths (there are actually ways to find out what the maximum is,
more about that in a Linux or Unix newsgroup). And if it's too
short you could easily write past the end of the 'fullpath'
buffer, making everything your program outputs afterwards (if
it doesn't crash) at least dubious...

And no: if you use the d_type element, you only have to construct
the directory name when you already know it's a directory, but not
for normal files (since then you don't need to call stat()). BTW,
are you sure you want stat() and not lstat()?

Moreover, you could copy 'path' and the slash only once into the
buffer, store the position after the slash and then later on only
overwrite the 'ep->d_name' part, as in the sketch below.
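
A minimal sketch of that (untested, and using the d_type check
mentioned earlier in the thread, so the same DT_UNKNOWN caveat
applies):

#define _DEFAULT_SOURCE              /* for d_type / DT_DIR on glibc */
#include <stdio.h>
#include <string.h>
#include <dirent.h>

#define MAXPATH 1000

int dirnscan(const char *path)
{
    char fullpath[MAXPATH];
    size_t base = strlen(path);
    DIR *dp;
    struct dirent *ep;
    int n = 0;

    if (base + 2 > sizeof fullpath)   /* no room for "path/" plus a name */
        return 0;
    memcpy(fullpath, path, base);     /* copy "path" into the buffer once... */
    fullpath[base++] = '/';           /* ...and remember where the name goes */

    if ((dp = opendir(path)) == NULL)
        return 0;
    while ((ep = readdir(dp))) {
        if (ep->d_name[0] == '.')
            continue;
        if (ep->d_type != DT_DIR) {
            ++n;                      /* plain entry: no path needed at all */
        } else {
            /* overwrite only the name part; the prefix stays in place */
            snprintf(fullpath + base, sizeof fullpath - base, "%s", ep->d_name);
            n += dirnscan(fullpath);
        }
    }
    closedir(dp);
    return n;
}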

But when you compare run times you should be aware that they may
depend a great deal on how much information about the file system
the OS has already cached: with a similar program, counting about
300 K files took about 2 minutes when run for the first time and
only about 0.4 seconds when run again immediately afterwards on my
machine. I got the same kind of behaviour from 'find'. So the run
time of your program is most likely dominated by caching effects,
and all your attempts to optimize might hardly be noticeable in
real life if your program is rarely run on the same directory with
lots of time in between.
Regards, Jens
--
\ Jens Thoms Toerring ___ (E-Mail Removed)
\__________________________ http://toerring.de
 
Ram Prasad
09-29-2011
On Sep 29, 4:41 pm, Eric Sosman <(E-Mail Removed)> wrote:
> On 9/29/2011 7:23 AM, Ram Prasad wrote:
>
> > On Sep 29, 8:15 am, Nobody <(E-Mail Removed)> wrote:
> >> [...]
> >> If you only need it to work on Linux, you can usually eliminate the stat()
> >> call by using the d_type field in the dirent structure.

On a reiserfs filesystem, dirent_obj->d_type is always set to 0 ...
I think I will have to ask on a unix forum now.
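
(For reference, 0 is DT_UNKNOWN, meaning the filesystem did not fill
the type in. A common pattern, sketched below and untested, is to
trust d_type when it is set and fall back to a stat() call only for
DT_UNKNOWN entries:)

#define _DEFAULT_SOURCE              /* for d_type / DT_* on glibc */
#include <dirent.h>
#include <sys/stat.h>

static int entry_is_dir(const struct dirent *ep, const char *fullpath)
{
    struct stat st;

    if (ep->d_type != DT_UNKNOWN)    /* the filesystem told us the type */
        return ep->d_type == DT_DIR;
    if (stat(fullpath, &st) != 0)    /* e.g. reiserfs: fall back to stat() */
        return 0;
    return S_ISDIR(st.st_mode);
}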

 
Keith Thompson
09-29-2011
Ram Prasad <(E-Mail Removed)> writes:
> On Sep 29, 4:41 pm, Eric Sosman <(E-Mail Removed)> wrote:
>> On 9/29/2011 7:23 AM, Ram Prasad wrote:
>> > On Sep 29, 8:15 am, Nobody <(E-Mail Removed)> wrote:
>> >> [...]
>> >> If you only need it to work on Linux, you can usually eliminate the stat()
>> >> call by using the d_type field in the dirent structure.

>
> On a reiserfs filesystem, dirent_obj->d_type is always set to 0 ...
> I think I will have to ask on a unix forum now.


As I suggested some time ago: try comp.unix.programmer.

--
Keith Thompson (The_Other_Keith) (E-Mail Removed) <http://www.ghoti.net/~kst>
"We must do something. This is something. Therefore, we must do this."
-- Antony Jay and Jonathan Lynn, "Yes Minister"
 