Iterating over files of a huge directory

 
 
Oscar Benjamin
 
      12-17-2012
On 17 December 2012 18:40, Evan Driscoll wrote:
> On 12/17/2012 09:52 AM, Oscar Benjamin wrote:
>> https://github.com/benhoyt/betterwalk

>
> This is very useful to know about; thanks.
>
> I actually wrote something very similar on my own (I wanted to get
> information about whether each directory entry was a file, directory,
> symlink, etc. without separate stat() calls).


The initial goal of betterwalk seemed to be the ability to do os.walk
with fewer stat calls. I think the information you want is part of
what betterwalk finds "for free" from the underlying OS iteration
(without the need to call stat()) but I'm not sure.
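For anyone reading this on a later Python: the betterwalk idea ended up in the standard library as os.scandir() (Python 3.5, PEP 471). Below is a minimal sketch of getting the file/directory/symlink distinction from the directory iteration itself. The helper name classify_entries is made up for illustration, and the with-statement form needs Python 3.6 or later. In most cases the is_*() calls are answered from data the OS already returned while listing the directory, though some platforms and filesystems can still fall back to a stat call.

```python
import os

def classify_entries(path="."):
    """Yield (name, kind) for each entry in path, using os.scandir()."""
    with os.scandir(path) as entries:
        for entry in entries:
            # Test for a symlink first: is_dir() and is_file() follow
            # symlinks by default, so a link to a file would report "file".
            if entry.is_symlink():
                kind = "symlink"
            elif entry.is_dir():
                kind = "directory"
            elif entry.is_file():
                kind = "file"
            else:
                kind = "other"
            yield entry.name, kind
```

Usage would be something like `for name, kind in classify_entries(some_directory): ...`.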

> (Also just for the record and anyone looking for other posts, I'd guess
> said discussion was on Python-dev. I don't look at even remotely
> everything on python-list (there's just too much), but I do skim most
> subject lines and I haven't noticed any discussion on it before now.)


Actually, it was python-ideas:
http://thread.gmane.org/gmane.comp.python.ideas/17932
http://thread.gmane.org/gmane.comp.python.ideas/17757
 
 
 
 
 
Evan Driscoll
 
      12-17-2012
On 12/17/2012 01:50 PM, Oscar Benjamin wrote:
> On 17 December 2012 18:40, Evan Driscoll wrote:
>> On 12/17/2012 09:52 AM, Oscar Benjamin wrote:
>>> https://github.com/benhoyt/betterwalk

>>
>> This is very useful to know about; thanks.
>>
>> I actually wrote something very similar on my own (I wanted to get
>> information about whether each directory entry was a file, directory,
>> symlink, etc. without separate stat() calls).

>
> The initial goal of betterwalk seemed to be the ability to do os.walk
> with fewer stat calls. I think the information you want is part of
> what betterwalk finds "for free" from the underlying OS iteration
> (without the need to call stat()) but I'm not sure.


Yes, that's my impression as well.


>> (Also just for the record and anyone looking for other posts, I'd guess
>> said discussion was on Python-dev. I don't look at even remotely
>> everything on python-list (there's just too much), but I do skim most
>> subject lines and I haven't noticed any discussion on it before now.)

>
> Actually, it was python-ideas:
> http://thread.gmane.org/gmane.comp.python.ideas/17932
> http://thread.gmane.org/gmane.comp.python.ideas/17757


Thanks again for the pointers; I'll have to go through that thread. It's
possible I can contribute something; it sounds like at least at one
point the implementation was ctypes-based and is sometimes slower, and I
have both a (now-defunct) C implementation and my current Cython module.
Ironically I haven't actually benchmarked mine.

Evan



 
 
 
 
 
Terry Reedy
 
      12-17-2012
On 12/17/2012 10:28 AM, Gilles Lenfant wrote:
> Hi,
>
> I have googled but did not find an efficient solution to my problem.
> My customer provides a directory with a huuuuge list of files (flat,
> potentially 100000+) and I cannot reasonably use
> os.listdir(this_path) unless creating a big memory footprint.


Is it really big enough to be a real problem? See below.

> So I'm looking for an iterator that yields the file names of a
> directory and does not make a giant list of what's in.
>
> i.e :
>
> for filename in enumerate_files(some_directory): # My cooking...


See http://bugs.python.org/issue11406
As I said there, I personally thought (and still do) that listdir should
have been changed in 3.0 to return an iterator rather than a list.
Developers who count more than me disagree on the basis that no
application has the millions of directory entries needed to make space a
real issue. They also claim that time is a wash either way.
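For later readers: since Python 3.5 the standard library has os.scandir(), which does return a lazy iterator, so the enumerate_files() asked for above can be sketched as a small generator. This is only a rough sketch under those assumptions: the function name comes from the question, the restriction to regular files is my guess at what is wanted, and the with-statement form requires Python 3.6 or later.

```python
import os

def enumerate_files(directory):
    """Lazily yield the names of regular files in directory (not recursive)."""
    with os.scandir(directory) as entries:
        for entry in entries:
            # Skip subdirectories and other non-file entries.
            if entry.is_file():
                yield entry.name

for filename in enumerate_files("."):
    pass  # handle one file name at a time; no full list is ever built
```

Only one directory entry is held at a time on the Python side, so the memory footprint no longer grows with the number of files.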

As for space, 100000 entries x 100 bytes/entry (generous guess at
average) = 10,000,000 bytes, no big deal with gigabyte memories. So the
logic goes. A smaller example from my machine with 3.3.

import os
from sys import getsizeof

def seqsize(seq):
    "Get size of flat sequence and contents"
    return sum((getsizeof(item) for item in seq), getsizeof(seq))

d = os.listdir()
print(seqsize([1,2,3]), len(d), seqsize(d))
# Output:
# 172 45 3128

The size per entry is relatively small (3128 / 45 is about 70 bytes)
because the two-level directory prefix for each path is only about 15
bytes. By using 3.3 rather than 3.0-3.2, the all-ascii-char unicode
paths only take 1 byte per char rather than 2 or 4.
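To check the 100000-entry case without having such a directory to hand, the same measurement can be run on synthetic names. This is only a sketch: the 16-character name pattern is invented for illustration, and real file names may be longer or contain non-ASCII characters, which changes the per-string size.

```python
from sys import getsizeof

def seqsize(seq):
    "Get size of flat sequence and contents"
    return sum((getsizeof(item) for item in seq), getsizeof(seq))

# Hypothetical names such as 'file_0000042.dat', standing in for the
# result of os.listdir() on a flat directory with 100000 files.
names = ["file_%07d.dat" % i for i in range(100000)]

total = seqsize(names)
print(len(names), total, total / len(names))
```

The last number printed is the average cost per entry in bytes, including the list's own slot for each name; on CPython the total should come out in the same ballpark as the 10,000,000-byte estimate above.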

If you disagree with the responses on the issue, after reading them,
post one yourself with real numbers.

--
Terry Jan Reedy

 