Iterating over files of a huge directory

 
 
Gilles Lenfant
 
      12-17-2012
Hi,

I have googled but did not find an efficient solution to my problem. My customer provides a flat directory with a huge number of files (potentially 100,000+), and I cannot reasonably use os.listdir(this_path) without creating a big memory footprint.

So I'm looking for an iterator that yields the file names of a directory without building a giant list of its contents.

i.e.:

for filename in enumerate_files(some_directory):
    ...  # My cooking...

Many thanks in advance.
--
Gilles Lenfant
 
 
Chris Angelico
 
      12-17-2012
On Tue, Dec 18, 2012 at 2:28 AM, Gilles Lenfant wrote:
> Hi,
>
> I have googled but did not find an efficient solution to my problem. My customer provides a directory with a huuuuge list of files (flat, potentially 100000+) and I cannot reasonably use os.listdir(this_path) unless creating a big memory footprint.
>
> So I'm looking for an iterator that yields the file names of a directory and does not make a giant list of what's in.


Sounds like you want os.walk. But... a hundred thousand files? I know
the Zen of Python says that flat is better than nested, but surely
there's some kind of directory structure that would make this
marginally manageable?

http://docs.python.org/3.3/library/os.html#os.walk

ChrisA
 
 
Tim Golden
 
      12-17-2012
On 17/12/2012 15:41, Chris Angelico wrote:
> On Tue, Dec 18, 2012 at 2:28 AM, Gilles Lenfant
> <(E-Mail Removed)> wrote:
>> Hi,
>>
>> I have googled but did not find an efficient solution to my
>> problem. My customer provides a directory with a huuuuge list of
>> files (flat, potentially 100000+) and I cannot reasonably use
>> os.listdir(this_path) unless creating a big memory footprint.
>>
>> So I'm looking for an iterator that yields the file names of a
>> directory and does not make a giant list of what's in.

>
> Sounds like you want os.walk. But... a hundred thousand files? I
> know the Zen of Python says that flat is better than nested, but
> surely there's some kind of directory structure that would make this
> marginally manageable?
>
> http://docs.python.org/3.3/library/os.html#os.walk


Unfortunately all of the built-in functions (os.walk, glob.glob,
os.listdir) rely on the os.listdir functionality which produces a list
first even if (as in glob.iglob) it later iterates over it.

There are external functions to iterate over large directories in both
Windows & Linux. I *think* the OP is on *nix from his previous posts, in
which case someone else will have to produce the Linux-speak for this.
If it's Windows, you can use the FindFilesIterator in the pywin32 package.

TJG
 
 
marduk
 
      12-17-2012


On Mon, Dec 17, 2012, at 10:28 AM, Gilles Lenfant wrote:
> Hi,
>
> I have googled but did not find an efficient solution to my problem. My
> customer provides a directory with a huuuuge list of files (flat,
> potentially 100000+) and I cannot reasonably use os.listdir(this_path)
> unless creating a big memory footprint.
>
> So I'm looking for an iterator that yields the file names of a directory
> and does not make a giant list of what's in.
>
> i.e :
>
> for filename in enumerate_files(some_directory):
> # My cooking...
>



You could try using opendir[1], which is a binding to the POSIX call. I
believe that it returns an iterator (file-like) over the entries in the
directory.

[1] http://pypi.python.org/pypi/opendir/
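
To illustrate what a binding to the POSIX calls looks like, here is a minimal ctypes sketch around opendir(), readdir() and closedir(). It is not the API of the opendir package linked above. The Dirent field layout is an assumption that holds for 64-bit Linux and differs on other platforms, and the names Dirent and iter_dir_names are hypothetical.

```python
import ctypes
import os


class Dirent(ctypes.Structure):
    # Assumed layout of struct dirent on 64-bit Linux; other platforms differ.
    _fields_ = [
        ("d_ino", ctypes.c_uint64),
        ("d_off", ctypes.c_int64),
        ("d_reclen", ctypes.c_ushort),
        ("d_type", ctypes.c_ubyte),
        ("d_name", ctypes.c_char * 256),
    ]


# CDLL(None) exposes the symbols already loaded into the process (libc included).
_libc = ctypes.CDLL(None, use_errno=True)
_libc.opendir.argtypes = [ctypes.c_char_p]
_libc.opendir.restype = ctypes.c_void_p
_libc.readdir.argtypes = [ctypes.c_void_p]
_libc.readdir.restype = ctypes.POINTER(Dirent)
_libc.closedir.argtypes = [ctypes.c_void_p]
_libc.closedir.restype = ctypes.c_int


def iter_dir_names(path):
    """Yield entry names of `path` one at a time (hypothetical helper)."""
    dirp = _libc.opendir(os.fsencode(path))
    if not dirp:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err), path)
    try:
        while True:
            entry = _libc.readdir(dirp)
            if not entry:  # NULL pointer: end of directory
                break
            name = entry.contents.d_name  # bytes up to the first NUL
            if name not in (b".", b".."):
                yield os.fsdecode(name)
    finally:
        # Runs on exhaustion and also when the generator is closed early.
        _libc.closedir(dirp)
```

Only one directory entry is held in Python at a time; the loop in the original question would become `for filename in iter_dir_names(some_directory):`.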
 
 
Oscar Benjamin
 
      12-17-2012
On 17 December 2012 15:28, Gilles Lenfant wrote:
> I have googled but did not find an efficient solution to my problem. My customer provides a directory with a huuuuge list of files (flat, potentially 100000+) and I cannot reasonably use os.listdir(this_path) unless creating a big memory footprint.
>
> So I'm looking for an iterator that yields the file names of a directory and does not make a giant list of what's in.
>
> i.e :
>
> for filename in enumerate_files(some_directory):
> # My cooking...


In the last couple of months there has been a lot of discussion (on
python-list or python-dev - not sure) about creating a library to more
efficiently iterate over the files in a directory. The result so far
is this library on github:
https://github.com/benhoyt/betterwalk

It says there that
"""
Somewhat relatedly, many people have also asked for a version of
os.listdir() that yields filenames as it iterates instead of returning
them as one big list.

So as well as a faster walk(), BetterWalk adds iterdir_stat() and
iterdir(). They're pretty easy to use, but see below for the full API
docs.
"""

Does that code work for you? If so, I imagine the author would be
interested to get some feedback on how well it works.
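
On Python 3.6 or later, the standard library's os.scandir() (added in Python 3.5 through PEP 471) covers the same need without a third-party package. A minimal sketch of the enumerate_files() iterator from the original question, assuming only regular files are wanted:

```python
import os


def enumerate_files(directory):
    """Yield the names of regular files in `directory`, one at a time."""
    # os.scandir() returns a lazy iterator of DirEntry objects, so the
    # full listing is never built as a Python list.
    with os.scandir(directory) as entries:
        for entry in entries:
            if entry.is_file():
                yield entry.name

# Usage, mirroring the loop in the question:
#     for filename in enumerate_files(some_directory):
#         ...  # My cooking...
```

The `with` block closes the underlying directory handle even if the caller stops iterating early.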

Alternatively, perhaps consider calling an external utility.
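
A minimal sketch of the external-utility route, assuming a `find` that supports -mindepth and -maxdepth (GNU and BSD versions do). The helper name iter_names_via_find is hypothetical. Python reads the output one line at a time; how much the utility itself buffers is outside Python's control, and file names containing newlines would be split incorrectly.

```python
import os
import subprocess


def iter_names_via_find(directory):
    """Yield names of regular files directly inside `directory`."""
    # Depth options come before -type so that find treats them as global options.
    cmd = ["find", directory, "-mindepth", "1", "-maxdepth", "1", "-type", "f"]
    with subprocess.Popen(cmd, stdout=subprocess.PIPE) as proc:
        for raw in proc.stdout:  # one line of output per iteration
            yield os.path.basename(os.fsdecode(raw.rstrip(b"\n")))
    # Leaving the with-block waits for find, so returncode is set here.
    if proc.returncode != 0:
        raise subprocess.CalledProcessError(proc.returncode, cmd)
```

This keeps the listing out of the Python process at the cost of spawning a child process, which is usually acceptable for a directory of this size.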


Oscar
 
 
Gilles Lenfant
 
      12-17-2012
On Monday 17 December 2012 16:52:19 UTC+1, Oscar Benjamin wrote:
> On 17 December 2012 15:28, Gilles Lenfant <...> wrote:
>
> In the last couple of months there has been a lot of discussion (on
> python-list or python-dev - not sure) about creating a library to more
> efficiently iterate over the files in a directory. The result so far
> is this library on github:
> https://github.com/benhoyt/betterwalk
>
> It says there that
> """
> Somewhat relatedly, many people have also asked for a version of
> os.listdir() that yields filenames as it iterates instead of returning
> them as one big list.
>
> So as well as a faster walk(), BetterWalk adds iterdir_stat() and
> iterdir(). They're pretty easy to use, but see below for the full API
> docs.
> """
>
> Does that code work for you? If so, I imagine the author would be
> interested to get some feedback on how well it works.
>
> Alternatively, perhaps consider calling an external utility.


Many thanks for this pointer, Oscar.

"betterwalk" is exactly what I was looking for, more particularly iterdir(...) and iterdir_stat(...).
I'll take a deeper look at betterwalk and provide (hopefully successful) feedback.

Cheers
--
Gilles Lenfant
 
 
 
Paul Rudin
 
      12-17-2012
Chris Angelico writes:

> On Tue, Dec 18, 2012 at 2:28 AM, Gilles Lenfant
> <(E-Mail Removed)> wrote:
>> Hi,
>>
>> I have googled but did not find an efficient solution to my
>> problem. My customer provides a directory with a huuuuge list of
>> files (flat, potentially 100000+) and I cannot reasonably use
>> os.listdir(this_path) unless creating a big memory footprint.
>>
>> So I'm looking for an iterator that yields the file names of a
>> directory and does not make a giant list of what's in.

>
> Sounds like you want os.walk.


But doesn't os.walk call listdir(), which creates a list of the
contents of a directory? That is exactly the initial problem.

> But... a hundred thousand files? I know the Zen of Python says that
> flat is better than nested, but surely there's some kind of directory
> structure that would make this marginally manageable?
>


Sometimes you have to deal with things other people have designed, so
the directory structure is not something you can control. I've run up
against exactly the same problem and made something in C that
implemented an iterator.

It would probably be better if listdir() made an iterator rather than a
list.
 
 
MRAB
 
      12-17-2012
On 2012-12-17 17:27, Paul Rudin wrote:
> Chris Angelico <(E-Mail Removed)> writes:
>
>> On Tue, Dec 18, 2012 at 2:28 AM, Gilles Lenfant
>> <(E-Mail Removed)> wrote:
>>> Hi,
>>>
>>> I have googled but did not find an efficient solution to my
>>> problem. My customer provides a directory with a huuuuge list of
>>> files (flat, potentially 100000+) and I cannot reasonably use
>>> os.listdir(this_path) unless creating a big memory footprint.
>>>
>>> So I'm looking for an iterator that yields the file names of a
>>> directory and does not make a giant list of what's in.

>>
>> Sounds like you want os.walk.

>
> But doesn't os.walk call listdir() and that creates a list of the
> contents of a directory, which is exactly the initial problem?
>
>> But... a hundred thousand files? I know the Zen of Python says that
>> flat is better than nested, but surely there's some kind of directory
>> structure that would make this marginally manageable?
>>

>
> Sometimes you have to deal with things other people have designed, so
> the directory structure is not something you can control. I've run up
> against exactly the same problem and made something in C that
> implemented an iterator.
>

<Off topic>
Years ago I had to deal with an in-house application that was written
using a certain database package. The package stored each predefined
query in a separate file in the same directory.

I found that if I packed all the predefined queries into a single file
and then called an external utility to extract the desired query from
the file every time it was needed into a file for the package to use,
not only did it save a significant amount of disk space (hard disks
were a lot smaller then), I also got a significant speed-up!

It wasn't as bad as 100000 in one directory, but it was certainly too
many...
</Off topic>
> It would probably be better if listdir() made an iterator rather than a
> list.
>


 
 
Evan Driscoll
 
      12-17-2012
On 12/17/2012 09:52 AM, Oscar Benjamin wrote:
> In the last couple of months there has been a lot of discussion (on
> python-list or python-dev - not sure) about creating a library to more
> efficiently iterate over the files in a directory. The result so far
> is this library on github:
> https://github.com/benhoyt/betterwalk


This is very useful to know about; thanks.

I actually wrote something very similar on my own (I wanted to get
information about whether each directory entry was a file, directory,
symlink, etc. without separate stat() calls). I'm guessing that the
library you linked is more mature than mine (I only have a Linux
implementation at present, for instance) so I'm happy to see that I
could probably switch to something better... and even happier that it
sounds like it's aiming for inclusion in the standard library.
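
A minimal sketch of reading the entry type during iteration, assuming Python 3.6 or later and the standard library's os.scandir(). Its DirEntry objects usually answer is_dir(), is_file() and is_symlink() from data the directory read already returned, so no separate stat() call is needed in the common case. The helper name classify_entries is hypothetical.

```python
import os


def classify_entries(directory):
    """Yield (name, kind) pairs for each entry in `directory`."""
    with os.scandir(directory) as entries:
        for entry in entries:
            # Check symlinks first so that links are not reported as
            # whatever they point to.
            if entry.is_symlink():
                kind = "symlink"
            elif entry.is_dir(follow_symlinks=False):
                kind = "directory"
            elif entry.is_file(follow_symlinks=False):
                kind = "file"
            else:
                kind = "other"  # e.g. sockets, FIFOs, device nodes
            yield entry.name, kind
```

On filesystems that do not report the entry type, these methods fall back to a stat call transparently, so the code stays the same either way.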


(Also just for the record and anyone looking for other posts, I'd guess
said discussion was on Python-dev. I don't look at even remotely
everything on python-list (there's just too much), but I do skim most
subject lines and I haven't noticed any discussion on it before now.)

Evan





 