most efficient way to get number of files in a directory

 
 
guba@vi-anec.de
01-03-2010
Hello,

I am searching for the most efficient way to get the number of files
in a directory (up to 10^6 files). I will use the number as a stop
condition of a generation process, so the method must be applied many
times during this process. Therefore it must be efficient, and opendir
is not the choice.

I am thinking about the bash command "ls | wc -l",
but I don't know how to get the result into a Perl variable.

Thank you very much for any help!
 
 
 
 
 
Jürgen Exner
01-03-2010
"(E-Mail Removed)" <(E-Mail Removed)> wrote:
>I am searching the most efficient way to get the number of files
>in a directory (up to 10^6 files). I will use the nr as a stop
>condition
>of of generation process so the method must be applied during this
>process
>a lot of times. Therefore it must be efficient and opendir is not the
>choice.


opendir() or glob() would have been my first suggestion. But you will
have to run your own benchmark tests; I doubt that anyone has ever
investigated performance in such a scenario before.

>I am thinking about the bash command "ls | wc -l"
>but I don't know how to get this in a perl variable.


Use backticks:

my $captured = `ls | wc -l`;
chomp $captured;   # strip the trailing newline from wc's output

Of course, whether launching two external processes and initiating IPC
is indeed faster than using Perl's built-in functions remains to be
tested.
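
For comparison, here is a minimal, untested sketch of the opendir()
route (the sub name is just illustrative):

use strict;
use warnings;

# Count directory entries with opendir(), skipping '.' and '..';
# grep in scalar context returns the number of matches.
sub count_files {
    my ($dir) = @_;
    opendir(my $dh, $dir) or die "Cannot opendir $dir: $!";
    my $count = grep { $_ ne '.' && $_ ne '..' } readdir $dh;
    closedir $dh;
    return $count;
}

print count_files('.'), "\n";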

jue

 
 
 
 
 
Uri Guttman
01-03-2010
>>>>> "JE" == Jürgen Exner <(E-Mail Removed)> writes:

JE> "(E-Mail Removed)" <(E-Mail Removed)> wrote:
>> I am searching the most efficient way to get the number of files
>> in a directory (up to 10^6 files). I will use the nr as a stop
>> condition
>> of of generation process so the method must be applied during this
>> process
>> a lot of times. Therefore it must be efficient and opendir is not the
>> choice.


JE> opendir() or glob() would have been my first suggestion. But you will
JE> have to run your own benchmark tests, I doubt that anyone has ever
JE> investigated performance in such a scenario before.

how would opendir be slower than any other method (perl, shell, ls, glob
or other)? they ALL must do a system call to opendir underneath as that
is the only normal way to read a dir (you can 'open' a dir as a file but
then you have to parse it out yourself which can be painful).

JE> Of course, if launching two external processes and initiating IPC is
JE> indeed faster than using Perl's buildin functions has to be tested.

i can't see how they would ever be faster unless they can buffer the
dirnames better than perl's opendir can (when assigning to an
array). the fork overhead should easily lose out in this case but i
won't benchmark it with 10k files in a dir!
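
for anyone who wants to try, a rough (untested) Benchmark sketch --
the directory path is an assumption, pass your own:

use strict;
use warnings;
use Benchmark qw(cmpthese);

my $dir = shift || '.';    # directory to test

# compare opendir, glob, and the shell pipeline for at least two CPU
# seconds each; every sub just computes the count and throws it away.
# note 'ls -f' also lists '.' and '..', so its raw count runs two high.
cmpthese( -2, {
    opendir => sub {
        opendir my $dh, $dir or die $!;
        my $n = grep { $_ ne '.' && $_ ne '..' } readdir $dh;
        closedir $dh;
    },
    glob  => sub { my $n = () = glob "$dir/*" },
    shell => sub { my $n = `ls -f $dir | wc -l` },
} );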

uri

--
Uri Guttman ------ (E-Mail Removed) -------- http://www.sysarch.com --
----- Perl Code Review , Architecture, Development, Training, Support ------
--------- Gourmet Hot Cocoa Mix ---- http://bestfriendscocoa.com ---------
 
 
Dr.Ruud
01-03-2010
guba@vi-anec.de wrote:

> I am searching the most efficient way to get the number of files
> in a directory (up to 10^6 files). I will use the nr as a stop
> condition
> of of generation process so the method must be applied during this
> process
> a lot of times. Therefore it must be efficient and opendir is not the
> choice.
>
> I am thinking about the bash command "ls | wc -l"
> but I don't know how to get this in a perl variable.


Why have so many files in a directory? You could create them in
subdirectories named after the first few characters of the filename.
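
A minimal, untested sketch of that (make_path needs File::Path 2.x,
and the sub name is just illustrative):

use strict;
use warnings;
use File::Path qw(make_path);
use File::Spec;

# Place each file in a subdirectory named after its first two
# characters, e.g. 'foobar.txt' -> base/fo/foobar.txt
sub bucketed_path {
    my ($base, $name) = @_;
    my $bucket = lc substr($name, 0, 2);
    my $dir    = File::Spec->catdir($base, $bucket);
    make_path($dir) unless -d $dir;
    return File::Spec->catfile($dir, $name);
}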

Or maybe you are looking for a database solution?

Or add a byte to a metafile each time a new file is created, and check
the size of that file?
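
Something like this untested fragment (the metafile name is arbitrary):

# Append one byte per created file; the metafile's size is then the
# running count of files created so far.
open my $meta, '>>', 'count.meta' or die "count.meta: $!";
print {$meta} 'x';
close $meta;
my $count = -s 'count.meta';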

--
Ruud
 
 
Wanna-Be Sys Admin
01-03-2010
Jürgen Exner wrote:

> opendir() or glob() would have been my first suggestion. But you will
> have to run your own benchmark tests, I doubt that anyone has ever
> investigated performance in such a scenario before.


Hmm, I've not looked, so you might be right, but I'd think someone has
probably benchmarked this before. Then again, maybe not: the number of
files in the directory is ridiculously large, so people may simply have
used a better directory structure instead. I see this daily as a common
issue with clients, who ask why their FTP program doesn't show files
after the 2000th one and whether we can modify FTP to allow listing
10-20K files. That's when the education has to begin for the client.
--
Not really a wanna-be, but I don't know everything.
 
 
John Bokma
01-03-2010
"Dr.Ruud" <(E-Mail Removed)> writes:

> (E-Mail Removed) wrote:
>
>> I am searching the most efficient way to get the number of files
>> in a directory (up to 10^6 files). I will use the nr as a stop
>> condition
>> of of generation process so the method must be applied during this
>> process
>> a lot of times. Therefore it must be efficient and opendir is not the
>> choice.
>>
>> I am thinking about the bash command "ls | wc -l"
>> but I don't know how to get this in a perl variable.

>
> Why have so many files in a directory? You could create them in
> subdirectories named after the first few characters of the filename.


I've used the first few characters of the MD5 hex digest of the
filename; depending on how the files are named, this might distribute
the files more evenly (e.g. if a lot of files start with "the", you
might end up with a lot of files in the "the" directory).
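
An untested sketch of that variant (Digest::MD5 is a core module; the
sub name is just illustrative):

use strict;
use warnings;
use Digest::MD5 qw(md5_hex);

# Bucket by the first two hex chars of the filename's MD5, spreading
# names roughly uniformly over 256 subdirectories.
sub md5_bucket {
    my ($name) = @_;
    return substr(md5_hex($name), 0, 2);    # e.g. 'd4'
}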

--
John Bokma

Read my blog: http://johnbokma.com/
Hire me (Perl/Python): http://castleamber.com/
 
 
Ilya Zakharevich
01-04-2010
On 2010-01-03, Wanna-Be Sys Admin <(E-Mail Removed)> wrote:
> J?rgen Exner wrote:
>
>> opendir() or glob() would have been my first suggestion. But you will
>> have to run your own benchmark tests, I doubt that anyone has ever
>> investigated performance in such a scenario before.

>
> Hmm, I've not looked, so you might be right, but I'd think someone
> probably had benchmarked the results before, but then again, maybe
> you're right, considering the number of files in the directory itself
> is ridiculously large, so someone may have not bothered and used a
> better directory structure for the files instead. Daily, I see this as
> a common issue with clients, asking why their FTP program doesn't show
> files after the 2000th one, and ask if they can have use modify FTP to
> allow the listing of 10-20K files. That's when the education has to
> begin for the client.


???? Just upgrade the server to use some non-brain-damaged
filesystem. 100K files in a directory should not be a big deal...
E.g., AFAIK, with HPFS386 even 1M files would not be much
user-noticeable.

Ilya

P.S. Of course, if one uses some brain-damaged API (like POSIX, which
AFAIK does not allow a "merged" please_do_readdir_and_stat()
call), this may significantly slow things down even with
average-intelligence FSes...
 
 
Wanna-Be Sys Admin
01-04-2010
Ilya Zakharevich wrote:

> On 2010-01-03, Wanna-Be Sys Admin <(E-Mail Removed)> wrote:
>> J?rgen Exner wrote:
>>
>>> opendir() or glob() would have been my first suggestion. But you
>>> will have to run your own benchmark tests, I doubt that anyone has
>>> ever investigated performance in such a scenario before.

>>
>> Hmm, I've not looked, so you might be right, but I'd think someone
>> probably had benchmarked the results before, but then again, maybe
>> you're right, considering the number of files in the directory itself
>> is ridiculously large, so someone may have not bothered and used a
>> better directory structure for the files instead. Daily, I see this
>> as a common issue with clients, asking why their FTP program doesn't
>> show files after the 2000th one, and ask if they can have use modify
>> FTP to
>> allow the listing of 10-20K files. That's when the education has to
>> begin for the client.

>
> ???? Just upgrade the server to use some non-brain-damaged
> filesystem. 100K files in a directory should not be a big deal...
> E.g., AFAIK, with HPFS386 1Mfile would not be much user-noticable.
>



A lot of the systems I have to fix things on are not ones I make the
call for. ext3 is about as good as it gets, which is fine, but...
Anyway, this is also about programs users are limited to by management,
such as pure-ftpd, where it becomes a resource issue if it has to list
20K+ files in each directory. But I do understand what you're getting
at.
--
Not really a wanna-be, but I don't know everything.
 
 
Martijn Lievaart
01-04-2010
On Sun, 03 Jan 2010 14:46:50 -0800, guba@vi-anec.de wrote:

> I am thinking about the bash command "ls | wc -l" but I don't know how
> to get this in a perl variable.


Perl's opendir is better, but if you use ls, you probably want its
unsorted flag.
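
For example (untested; /some/dir is a placeholder):

# 'ls -f' skips sorting but implies -a, so '.' and '..' are counted
my $count = `ls -f /some/dir | wc -l`;
chomp $count;
$count -= 2;    # discount the '.' and '..' entries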

M4
 
 
smallpond
01-04-2010
On Jan 3, 5:46 pm, guba@vi-anec.de wrote:
> Hello,
>
> I am searching for the most efficient way to get the number of files
> in a directory (up to 10^6 files). I will use the number as a stop
> condition of a generation process, so the method must be applied many
> times during this process. Therefore it must be efficient, and opendir
> is not the choice.
>
> I am thinking about the bash command "ls | wc -l",
> but I don't know how to get the result into a Perl variable.
>
> Thank you very much for any help!



What file system and OS?
 