Faster os.walk()

 
 
fuzzylollipop
 
      04-20-2005
I am trying to get the number of bytes used by files in a directory.
I am using a large directory ( lots of stuff checked out of multiple
large cvs repositories ) and there is lots of wasted time doing
multiple os.stat() on dirs and files from different methods.

 
 
 
 
 
Peter Hansen
 
      04-20-2005
Laszlo Zsolt Nagy wrote:
> fuzzylollipop wrote:
>
>> I am trying to get the number of bytes used by files in a directory.
>> I am using a large directory ( lots of stuff checked out of multiple
>> large cvs repositories ) and there is lots of wasted time doing
>> multiple os.stat() on dirs and files from different methods.
>>
>>

> Do you need a precise value, or are you satisfied with approximations too?
> Under which operating system? The 'du' command can be your friend.


How can "du" find the sizes without doing an os.stat() on each
file?
 
 
 
 
 
Laszlo Zsolt Nagy
 
      04-20-2005
fuzzylollipop wrote:

>I am trying to get the number of bytes used by files in a directory.
>I am using a large directory ( lots of stuff checked out of multiple
>large cvs repositories ) and there is lots of wasted time doing
>multiple os.stat() on dirs and files from different methods.
>
>

Do you need a precise value, or are you satisfied with approximations too?
Under which operating system? The 'du' command can be your friend.

man du

Best,

Laci 2.0



--
_________________________________________________________________
Laszlo Nagy web: http://designasign.biz
IT Consultant mail: (E-Mail Removed)

Python forever!


 
 
fuzzylollipop
 
      04-20-2005
du is faster than my code that does the same thing in python, it is
highly optimized at the os level.

that said, I profiled spawning an external process to call du and over
the large number of times I need to do this it is actually slower to
execute du externally than my os.walk() implementation.

du does not return the value I need anyway: I need the byte sizes of the
files only, not the raw blocks consumed, which is what du reports. also I
need to filter out some files and dirs.

after extensive profiling I found out that the way that os.walk() is
implemented it calls os.stat() on the dirs and files multiple times and
that is where all the time is going.

I guess I need something like the statcache module, but that is deprecated
and probably wouldn't fix my problem. I only walk the dir once and then
cache all the byte counts; the cost is the repeated os.stat() calls made
for every entry by isdir() inside os.walk(), plus getsize() and the like.

just wanted to check and see if anyone had already solved this problem.

 
 
Philippe C. Martin
 
      04-20-2005
How about rerouting stdout/stderr and popen-ing something like

/bin/find -name '*' -exec a_script_or_cmd_that_does_what_i_want_with_the_file {} \;

?

Regards,

Philippe




fuzzylollipop wrote:

> du is faster than my code that does the same thing in python, it is
> highly optimized at the os level.
>
> that said, I profiled spawning an external process to call du and over
> the large number of times I need to do this it is actually slower to
> execute du externally than my os.walk() implementation.
>
> du does not return the value I need anyway: I need the byte sizes of the
> files only, not the raw blocks consumed, which is what du reports. also I
> need to filter out some files and dirs.
>
> after extensive profiling I found out that the way that os.walk() is
> implemented it calls os.stat() on the dirs and files multiple times and
> that is where all the time is going.
>
> I guess I need something like the statcache module, but that is deprecated
> and probably wouldn't fix my problem. I only walk the dir once and then
> cache all the byte counts; the cost is the repeated os.stat() calls made
> for every entry by isdir() inside os.walk(), plus getsize() and the like.
>
> just wanted to check and see if anyone had already solved this problem.


 
 
Kent Johnson
 
      04-20-2005
fuzzylollipop wrote:
> after extensive profiling I found out that the way that os.walk() is
> implemented it calls os.stat() on the dirs and files multiple times and
> that is where all the time is going.


os.walk() is pretty simple, you could copy it and make your own version that calls os.stat() just
once for each item. The dirnames and filenames lists it yields could be lists of (name,
os.stat(path)) tuples so you would have the sizes available.

Kent
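
A minimal sketch of the suggestion above, not code posted in the thread. It assumes os.lstat() is acceptable (symlinks are not followed) and that unreadable entries can simply be skipped; the name walk_with_stat is hypothetical.

```python
import os
import stat


def walk_with_stat(top):
    """Yield (dirpath, dirs, files) like os.walk(), top-down.

    dirs and files are lists of (name, stat_result) tuples, and each
    entry is stat'ed exactly once.
    """
    dirs, files = [], []
    try:
        names = os.listdir(top)
    except OSError:
        return  # unreadable directory: skipped (an assumption of this sketch)
    for name in names:
        try:
            st = os.lstat(os.path.join(top, name))  # the only stat call per entry
        except OSError:
            continue  # entry vanished or is unreadable
        if stat.S_ISDIR(st.st_mode):
            dirs.append((name, st))
        else:
            files.append((name, st))
    yield top, dirs, files
    # As with os.walk(), the caller may prune `dirs` in place before we recurse.
    for name, _st in dirs:
        yield from walk_with_stat(os.path.join(top, name))
```

With this, a byte total is sum(st.st_size for _, _, files in walk_with_stat(path) for _, st in files), with no second stat per file.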
 
 
Nick Craig-Wood
 
      04-20-2005
fuzzylollipop <(E-Mail Removed)> wrote:
> I am trying to get the number of bytes used by files in a directory.
> I am using a large directory ( lots of stuff checked out of multiple
> large cvs repositories ) and there is lots of wasted time doing
> multiple os.stat() on dirs and files from different methods.


I presume you are saying that the os.walk() has to stat() each file to
see whether it is a directory or not, and that you are stat()-ing each
file to count its bytes?

If you want to just get away with the one stat() you'll have to
re-implement os.walk yourself.

Another trick for speeding up lots of stats is to chdir() to the
directory you are processing, and then just use the leafnames in
stat(). The OS then doesn't have to spend ages parsing lots of paths.

However even if you implement both the above, I don't reckon you'll
see a lot of improvement given that decent OSes have a very good cache
for stat results, and that parsing file names is very quick too,
compared to python.

--
Nick Craig-Wood <(E-Mail Removed)> -- http://www.craig-wood.com/nick
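
The chdir() trick described above could look roughly like this. It is only a sketch for a single directory; the function name dir_bytes_via_chdir is hypothetical, and, as the post itself cautions, the gain may well be negligible.

```python
import os
import stat


def dir_bytes_via_chdir(directory):
    """Sum the sizes of regular files directly inside `directory`.

    The process changes into the directory first, so every lstat()
    call receives a bare leaf name instead of a long path.
    """
    saved = os.getcwd()
    os.chdir(directory)
    try:
        total = 0
        for name in os.listdir("."):
            st = os.lstat(name)  # leaf name only, resolved relative to the cwd
            if stat.S_ISREG(st.st_mode):
                total += st.st_size
        return total
    finally:
        os.chdir(saved)  # always restore the original working directory
```

Note that the working directory is process-wide, so this approach is unsafe if other threads rely on relative paths at the same time.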
 
 
Lonnie Princehouse
 
      04-20-2005
If you're trying to track changes to files (e.g. by comparing the
current size with a previously recorded size), fam might obviate a lot of
filesystem traversal.

http://python-fam.sourceforge.net/

 
 
fuzzylollipop
 
      04-20-2005
ding, ding, ding, we have a winner.

One of the guys on the team did just this: he re-implemented the
os.walk() logic and embedded the S_IFDIR, S_IFMT and S_IFREG checks
directly into the traversal code.

This is all going to run on unix or linux machines in production so
this is not a big deal.
All in all we went from 64+k function calls for 7070 files/dirs to 1
PER dir/file.

the new code is just a little bit more than twice as fast.

Huge improvement!
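
The team's code was not posted, so the following is only a guess at its general shape: one os.lstat() per entry, with the file type read from st_mode via the S_IFMT mask. The function name total_file_bytes and the skip_names filter (with "CVS" as an example default) are assumptions for illustration.

```python
import os
from stat import S_IFDIR, S_IFMT, S_IFREG


def total_file_bytes(top, skip_names=("CVS",)):
    """Return the summed st_size of regular files under `top`.

    Entries whose leaf name is in `skip_names` (a hypothetical filter)
    are ignored; symlinks are not followed.
    """
    total = 0
    pending = [top]  # explicit stack instead of recursion
    while pending:
        current = pending.pop()
        try:
            names = os.listdir(current)
        except OSError:
            continue  # unreadable directory: skipped in this sketch
        for name in names:
            if name in skip_names:
                continue
            path = os.path.join(current, name)
            try:
                st = os.lstat(path)  # exactly one stat call per entry
            except OSError:
                continue
            kind = S_IFMT(st.st_mode)  # mask out the file-type bits
            if kind == S_IFDIR:
                pending.append(path)
            elif kind == S_IFREG:
                total += st.st_size
    return total
```

Because the same stat result answers both "is it a directory?" and "how big is it?", no entry needs a second call.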

 