Velocity Reviews - Computer Hardware Reviews

iglob performance no better than glob

 
 
Kyp
 
      01-31-2010
I have a dir with a large number of files that I need to perform operations
on, but I only need to access a subset of them, e.g. the first 100 files.

Using glob is very slow, so I ran across iglob, which returns an
iterator. That seemed like just what I wanted: I could iterate over
only the files I needed without reading the entire dir.

So the iglob was faster, but accessing the first file took about the
same time as glob.glob.

Here's some code to compare glob vs. iglob performance; it outputs
the time before/after a glob.iglob('*.*') / files.next() sequence and a
glob.glob('*.*') sequence.

#!/usr/bin/env python

import glob, time

print '\nTest of glob.iglob'
print 'before iglob:', time.asctime()
files = glob.iglob('*.*')
print 'after iglob:', time.asctime()
print files.next()
print 'after files.next():', time.asctime()

print '\nTest of glob.glob'
print 'before glob:', time.asctime()
files = glob.glob('*.*')
print 'after glob:', time.asctime()


Here are the results:

Test of glob.iglob
before iglob: Sun Jan 31 11:09:08 2010
after iglob: Sun Jan 31 11:09:08 2010
foo.bar
after files.next(): Sun Jan 31 11:09:59 2010

Test of glob.glob
before glob: Sun Jan 31 11:09:59 2010
after glob: Sun Jan 31 11:10:51 2010

The results are about the same for the two approaches; both took about
51 seconds. Am I doing something wrong with iglob?

Is there a way to get the first X files from a dir with lots of
files that doesn't take a long time to run?

thanx, mark
 
Skip Montanaro
 
      01-31-2010
> So the iglob was faster, but accessing the first file took about the
> same time as glob.glob.


I'll wager most of the time required to access the first file is due
to filesystem overhead, not any inherent limitation in Python.

Skip Montanaro


 
John Bokma
 
      01-31-2010
Kyp <(E-Mail Removed)> writes:

> Is there a way to get the first X # of files from a dir with lots of
> files, that does not take a long time to run?


Assuming Linux: what does

    time ls thedir | head

give? (with thedir the name of the actual dir)

Also, how many is "many" files?

--
John Bokma j3b

Hacking & Hiking in Mexico - http://johnbokma.com/
http://castleamber.com/ - Perl & Python Development
 
Peter Otten
 
      01-31-2010
Kyp wrote:

> I have a dir with a large # of files that I need to perform operations
> on, but only needing to access a subset of the files, i.e. the first
> 100 files.
>
> Using glob is very slow, so I ran across iglob, which returns an
> iterator, which seemed just like what I wanted. I could iterate over
> the files that I wanted, not having to read the entire dir.
>
> So the iglob was faster, but accessing the first file took about the
> same time as glob.glob.
>
> Here's some code to compare glob vs. iglob performance, it outputs
> the time before/after a glob.iglob('*.*') files.next() sequence and a
> glob.glob('*.*') sequence.
>
> #!/usr/bin/env python
>
> import glob,time
> print '\nTest of glob.iglob'
> print 'before iglob:', time.asctime()
> files = glob.iglob('*.*')
> print 'after iglob:',time.asctime()
> print files.next()
> print 'after files.next():', time.asctime()
>
> print '\nTest of glob.glob'
> print 'before glob:', time.asctime()
> files = glob.glob('*.*')
> print 'after glob:',time.asctime()
>
>
> Here are the results:
>
> Test of glob.iglob
> before iglob: Sun Jan 31 11:09:08 2010
> after iglob: Sun Jan 31 11:09:08 2010
> foo.bar
> after files.next(): Sun Jan 31 11:09:59 2010
>
> Test of glob.glob
> before glob: Sun Jan 31 11:09:59 2010
> after glob: Sun Jan 31 11:10:51 2010
>
> The results are about the same for the 2 approaches, both took about
> 51 seconds. Am I doing something wrong with iglob?


No, but iglob() being lazy is pointless in your case because it uses
os.listdir() and fnmatch.filter() underneath which both read the whole
directory before returning anything.
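(To illustrate the point: wrapping iglob in itertools.islice is the natural way to take only the first N names, but that only makes the *consumption* lazy; the first name still can't come out until os.listdir() has read the whole directory. first_n below is a made-up helper name, not stdlib.)

```python
import glob
from itertools import islice

# islice() takes names lazily, but glob.iglob() is built on
# os.listdir(), which reads the entire directory before the first
# name is yielded -- so the first next() call still pays full price.
def first_n(pattern, n=100):
    return list(islice(glob.iglob(pattern), n))
```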

> Is there a way to get the first X # of files from a dir with lots of
> files, that does not take a long time to run?


Here's my attempt. It turned out to be more work than expected, so I cut a
few corners. It's Linux-only "works on my machine" code, but may give you
some hints on how to proceed.

from ctypes import *
import fnmatch
import glob
import os
import re
from itertools import ifilter, imap

class dirent(Structure):
    "works on my machine"
    _fields_ = [
        ("d_ino", c_long),
        ("d_off", c_long),
        ("d_reclen", c_ushort),
        ("d_type", c_ubyte),
        ("d_name", c_char*256)]


direntp = POINTER(dirent)

LIBC = "libc.so.6"
cdll.LoadLibrary(LIBC)
libc = CDLL(LIBC)
libc.readdir.restype = direntp


def diriter(dir):
    "lazy partial replacement for os.listdir()"
    # errors? what errors?
    dirp = libc.opendir(dir)
    if not dirp:
        return
    try:
        while True:
            ep = libc.readdir(dirp)
            if not ep:
                break
            yield ep.contents.d_name
    finally:
        libc.closedir(dirp)


def filter(names, pattern):
    "lazy partial replacement for fnmatch.filter()"
    import posixpath

    pattern = os.path.normcase(pattern)
    r = fnmatch.translate(pattern)
    r = re.compile(r)

    if os.path is not posixpath:
        names = imap(os.path.normcase, names)

    return ifilter(r.match, names)

def globiter(path):
    "lazy partial replacement for glob.glob()"
    dir, filename = os.path.split(path)
    if glob.has_magic(dir):
        raise ValueError("wildcards in directory not supported")
    return filter(diriter(dir), filename)


if __name__ == "__main__":
    import sys
    [pattern] = sys.argv[1:]
    for name in globiter(pattern):
        print name
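A usage sketch for the globiter() generator above (take() is a made-up helper name, not part of the code): because the names come out lazily, islice() lets the scan stop after the first 100 matches instead of reading the whole directory.

```python
from itertools import islice

# take() works with any lazy iterator of names, e.g. globiter() above.
def take(n, iterable):
    return list(islice(iterable, n))

# e.g. first_100 = take(100, globiter('/some/big/dir/*.*'))
```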

Peter
 
Benjamin Peterson
 
      01-31-2010
Kyp <kyp <at> stsci.edu> writes:

> So the iglob was faster, but accessing the first file took about the
> same time as glob.glob.


That would be because glob is implemented in terms of iglob.
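A simplified sketch of that relationship (not the actual stdlib source): the eager glob() just drains the lazy iglob(), and the laziness doesn't help here because the os.listdir() call inside reads the whole directory before the first name appears.

```python
import fnmatch
import os

# Simplified model of the stdlib pair. The real iglob also handles
# directory prefixes and wildcards in the directory part; this sketch
# yields bare names only.
def iglob_sketch(pattern):
    directory, basename = os.path.split(pattern)
    for name in os.listdir(directory or '.'):  # reads the WHOLE dir
        if fnmatch.fnmatch(name, basename):
            yield name

def glob_sketch(pattern):
    return list(iglob_sketch(pattern))
```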




 
Kyp
 
      02-01-2010
On Jan 31, 1:06 pm, John Bokma <(E-Mail Removed)> wrote:
> Kyp <(E-Mail Removed)> writes:
> > Is there a way to get the first X # of files from a dir with lots of
> > files, that does not take a long time to run?
>
> Assuming Linux: what does time
>
> ls thedir | head
>
> give?
>
> with thedir the name of the actual dir

about 3 seconds.

3.086u 0.201s 0:03.32 98.7% 0+0k 0+0io 0pf+0w

>
> Also how many is many files?

over 100K (I know I should not do that, but it's a temp dir holding
files to be transferred)
thanx, mark



 
Kyp
 
      02-01-2010
On Jan 31, 2:44 pm, Peter Otten <(E-Mail Removed)> wrote:
> Kyp wrote:
> > Is there a way to get the first X # of files from a dir with lots of
> > files, that does not take a long time to run?
>
> Here's my attempt. It turned out to be more work than expected, so I cut a
> few corners. It's Linux-only "works on my machine" code, but may give you
> some hints on how to proceed.
[...code snipped...]

I'll give it a try, thanx for the reply.
mark
 
Cameron Simpson
 
      02-14-2010
On 31Jan2010 16:23, Kyp <(E-Mail Removed)> wrote:
| On Jan 31, 2:44 pm, Peter Otten <(E-Mail Removed)> wrote:
| > Kyp wrote:
| > > I have a dir with a large # of files that I need to perform operations
| > > on, but only needing to access a subset of the files, i.e. the first
| > > 100 files.
[...]
| > > Is there a way to get the first X # of files from a dir with lots of
| > > files, that does not take a long time to run?
| >
| > Here's my attempt. [...open directory and read native format...]

I'd be inclined first to time os.listdir('.') versus glob.glob('*.*').

Glob routines tend to lstat() every matching name to ensure the path
exists. That's very slow. If you just do os.listdir() and choose your
100 names, you only need to stat (or just try to open) them.

So time glob.glob("*.*") versus os.listdir(".") first.

Generally, with a large directory, stat time will change performance
immensely.
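Cameron's suggestion as a sketch (first_plain_names is a made-up name for it): list the directory once, match names as plain strings, and slice off the first 100, with no per-file stat involved.

```python
import fnmatch
import os

# os.listdir() only reads directory entries; nothing is stat()ed.
# Match names as plain strings, then keep just the first n.
def first_plain_names(directory, pattern, n=100):
    names = fnmatch.filter(os.listdir(directory), pattern)
    return sorted(names)[:n]
```

Only the names that survive the slice ever need to be stat()ed or opened.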
--
Cameron Simpson <(E-Mail Removed)> DoD#743
http://www.cskk.ezoshosting.com/cs/

Usenet is essentially a HUGE group of people passing notes in class. --R. Kadel
 