python vs. grep

 
 
Anton Slesarev
05-06-2008
I've read a great paper about generators:
http://www.dabeaz.com/generators/index.html

The author says that it's easy to write analogs of common Linux tools such
as awk, grep, etc. He says that performance could be even better.

But I have a problem writing a grep analog that performs well.


Here is my script:

import re
pat = re.compile("sometext")

f = open("bigfile",'r')

flines = (line for line in f if pat.search(line))
c=0
for x in flines:
    c+=1
print c

and in bash:
grep "sometext" bigfile | wc -l

The Python code is 3-4 times slower on Windows, and as I remember the
situation is the same on Linux...

Setting a buffer size in open even increases the time.

Is it possible to increase file reading performance?
 

Ian Kelly
05-06-2008
On Tue, May 6, 2008 at 1:42 PM, Anton Slesarev wrote:
> Is it possible to increase file reading performance?


Dunno about that, but this part:

> flines = (line for line in f if pat.search(line))
> c=0
> for x in flines:
>     c+=1
> print c


could be rewritten as just:

print sum(1 for line in f if pat.search(line))
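
For readers on Python 3, where print is a function, the same idea might look like the sketch below. The function name count_matching_lines is illustrative and not from the thread; the file is assumed to be text in the platform's default encoding.

```python
import re


def count_matching_lines(path, pattern):
    """Count the lines in `path` that contain a match for the regex `pattern`."""
    search = re.compile(pattern).search
    with open(path, "r") as f:
        # The generator yields 1 per matching line and sum() consumes it
        # lazily, so the file is never held in memory as a whole.
        return sum(1 for line in f if search(line))
```

Usage would be something like print(count_matching_lines("bigfile", "sometext")). This only tidies the counting; it says nothing about whether it runs faster.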
 

Arnaud Delobelle
05-06-2008
Anton Slesarev writes:

> f = open("bigfile",'r')
>
> flines = (line for line in f if pat.search(line))
> c=0
> for x in flines:
>     c+=1
> print c


It would be simpler (and probably faster) not to use a generator expression:

search = re.compile('sometext').search

c = 0
for line in open('bigfile'):
    if search(line):
        c += 1

Perhaps faster (because the number of name lookups is reduced), using
itertools.ifilter:

from itertools import ifilter

c = 0
for line in ifilter(search, open('bigfile')):
    c += 1


If 'sometext' is just text (no regexp wildcards) then even simpler:

....
for line in ...:
    if 'sometext' in line:
        c += 1

I don't believe you'll easily beat grep + wc using Python though.

Perhaps faster?

sum(bool(search(line)) for line in open('bigfile'))
sum(1 for line in ifilter(search, open('bigfile')))

....etc...

All this is untested!
--
Arnaud
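
The variants above are written for Python 2. A rough Python 3 sketch of the same three ideas follows, so they can be checked against each other; the function names are made up for illustration, and no claim is made here about which one is fastest. Note that itertools.ifilter does not exist in Python 3, where the built-in filter() is already lazy.

```python
import re


def count_loop(path, pattern):
    """Plain for loop with the bound search method looked up once."""
    search = re.compile(pattern).search
    c = 0
    with open(path) as f:
        for line in f:
            if search(line):
                c += 1
    return c


def count_filter(path, pattern):
    """Same count using the lazy built-in filter() (Python 3)."""
    search = re.compile(pattern).search
    with open(path) as f:
        return sum(1 for _ in filter(search, f))


def count_substring(path, text):
    """Plain substring test; only equivalent when `text` is not a regex."""
    with open(path) as f:
        return sum(1 for line in f if text in line)
```

All three should return the same number for a literal pattern such as 'sometext'; timing them on a real file (for example with the timeit module) is the only way to know which is quicker on a given machine.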
 

Wojciech Walczak
05-06-2008
2008/5/6, Anton Slesarev:
> But I have a problem writing a grep analog that performs well.

[...]
> The Python code is 3-4 times slower on Windows, and as I remember the
> situation is the same on Linux...
>
> Setting a buffer size in open even increases the time.
>
> Is it possible to increase file reading performance?


The best advice would be not to try to beat grep, but if you really
want to, this is the right place

Here is my code:
$ cat grep.py
import sys

if len(sys.argv) != 3:
    print 'grep.py <pattern> <file>'
    sys.exit(1)

f = open(sys.argv[2],'r')

print ''.join((line for line in f if sys.argv[1] in line)),

$ ls -lh debug.0
-rw-r----- 1 gminick root 4,1M 2008-05-07 00:49 debug.0

---
$ time grep nusia debug.0 |wc -l
26009

real 0m0.042s
user 0m0.020s
sys 0m0.004s
---

---
$ time python grep.py nusia debug.0 |wc -l
26009

real 0m0.077s
user 0m0.044s
sys 0m0.016s
---

---
$ time grep nusia debug.0

real 0m3.163s
user 0m0.016s
sys 0m0.064s
---

---
$ time python grep.py nusia debug.0
[26009 lines here...]
real 0m2.628s
user 0m0.032s
sys 0m0.064s
---

So printing the results takes 2.6 s for Python and 3.1 s for the original grep.
Surprised? The only reason for this is that we have reduced the number
of write calls in the Python example:

$ strace -ooriggrep.log grep nusia debug.0
$ grep write origgrep.log |wc -l
26009


$ strace -opygrep.log python grep.py nusia debug.0
$ grep write pygrep.log |wc -l
12


Wish you luck saving your CPU cycles

--
Regards,
Wojtek Walczak
http://www.stud.umk.pl/~wojtekwa/
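
The effect of joining the output before printing can be illustrated with a small Python 3 sketch. It counts Python-level write() calls on an in-memory stream, which is not the same thing as the system calls that strace reports (those also depend on how stdout is buffered). CountingWriter, emit_per_line and emit_joined are hypothetical names used only for this illustration.

```python
import io


class CountingWriter(io.StringIO):
    """In-memory text stream that records how many times write() is called."""

    def __init__(self):
        super().__init__()
        self.calls = 0

    def write(self, s):
        self.calls += 1
        return super().write(s)


def emit_per_line(lines, needle, out):
    """Write every matching line with its own write() call."""
    for line in lines:
        if needle in line:
            out.write(line)


def emit_joined(lines, needle, out):
    """Join all matching lines first, then hand them over in one write()."""
    out.write("".join(line for line in lines if needle in line))
```

With the same input, both functions produce identical output, but the first makes one call per matching line and the second makes exactly one.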
 

Anton Slesarev
05-07-2008
I'm trying to save my time, not CPU cycles :)

I've got a file which I really need to parse:
-rw-rw-r-- 1 xxx xxx 3381564736 May 7 09:29 bigfile

These are my results:

$ time grep "python" bigfile | wc -l
2470

real 0m4.744s
user 0m2.441s
sys 0m2.307s

And the Python scripts:

import sys

if len(sys.argv) != 3:
    print 'grep.py <pattern> <file>'
    sys.exit(1)

f = open(sys.argv[2],'r')

print ''.join((line for line in f if sys.argv[1] in line)),

$ time python grep.py "python" bigfile | wc -l
2470

real 0m37.225s
user 0m34.215s
sys 0m3.009s

Second script:

import sys

if len(sys.argv) != 3:
    print 'grepwc.py <pattern> <file>'
    sys.exit(1)

f = open(sys.argv[2],'r',100000000)

print sum((1 for line in f if sys.argv[1] in line)),


$ time python grepwc.py "python" bigfile
2470

real 0m39.357s
user 0m34.410s
sys 0m4.491s

40 seconds versus 5. This is really sad...

That was on FreeBSD.

On Windows (Cygwin):

The size of bigfile is ~50 MB.

$ time grep "python" bigfile | wc -l
51

real 0m0.196s
user 0m0.169s
sys 0m0.046s

$ time python grepwc.py "python" bigfile
51

real 0m25.485s
user 0m2.733s
sys 0m0.375s
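
One thing that could be tried against the reading cost is to read the file in large binary chunks instead of iterating over it line by line. The Python 3 sketch below is an assumption about what might help, not something measured in this thread; the function name, the 1 MiB default chunk size and the requirement that the needle is a non-empty bytes string without newlines are all choices made for the example.

```python
def count_lines_chunked(path, needle, chunk_size=1 << 20):
    """Count lines containing the bytes `needle`, reading fixed-size chunks."""
    count = 0
    tail = b""  # incomplete last line carried over from the previous chunk
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            buf = tail + chunk
            # Split off everything up to and including the last newline;
            # the remainder is a partial line that waits for the next chunk.
            cut = buf.rfind(b"\n") + 1
            complete, tail = buf[:cut], buf[cut:]
            count += sum(1 for line in complete.split(b"\n") if needle in line)
    # A final line without a trailing newline is still a line.
    if needle in tail:
        count += 1
    return count
```

A call such as count_lines_chunked("bigfile", b"python") should give the same number as grep "python" bigfile | wc -l for plain-text needles; whether it is actually faster would have to be timed on the file in question.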

 

Ville Vainio
05-07-2008
On May 6, 10:42 pm, Anton Slesarev <slesarev.an...@gmail.com> wrote:

> flines = (line for line in f if pat.search(line))


What about re.findall() / re.finditer() for the whole file contents?
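
A minimal Python 3 sketch of that idea, assuming the file is memory-mapped rather than read into a string so that a multi-gigabyte file does not have to fit in RAM. The function name is hypothetical, the pattern must be a bytes regex, and the sketch assumes the pattern itself never matches a newline.

```python
import mmap
import os
import re


def count_matching_lines_mmap(path, pattern):
    """Count lines containing a match for the bytes regex `pattern`."""
    # Wrap the pattern so each match covers one whole line: with
    # re.MULTILINE, ^ and $ anchor at line boundaries, and '.' does not
    # cross a newline, so a line is counted once however often it matches.
    line_re = re.compile(rb"^.*(?:" + pattern + rb").*$", re.MULTILINE)
    if os.path.getsize(path) == 0:
        return 0  # an empty file cannot be memory-mapped
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            # re accepts bytes-like objects, so the regex scans the mapping
            # directly without splitting it into Python line objects.
            return sum(1 for _ in line_re.finditer(mm))
```

This counts lines the way grep | wc -l does, rather than counting individual matches as a bare re.findall() would. No timing is claimed for it here.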

 

Pop User
05-07-2008
Anton Slesarev wrote:
>
> But I have a problem writing a grep analog that performs well.
>


I don't think you can ever catch grep. Searching is its only purpose in
life and it's very good at it. You may be able to come closer; this
thread is related:

http://groups.google.com/group/comp....476da5d7a9e466

It deals with the speed of re. If you don't need regex, don't use re.
If you do need re, an alternate re library might be useful, but you
aren't going to catch grep.


 

Anton Slesarev
05-07-2008
On May 7, 7:22 pm, Pop User <popu...@christest2.dc.k12us.com> wrote:
> Anton Slesarev wrote:
>
> > But I have a problem writing a grep analog that performs well.

>
> I don't think you can ever catch grep. Searching is its only purpose in
> life and it's very good at it. You may be able to come closer; this
> thread is related:
>
> http://groups.google.com/group/comp....thread/thread/...
>
> It deals with the speed of re. If you don't need regex, don't use re.
> If you do need re, an alternate re library might be useful, but you
> aren't going to catch grep.


In my last test I don't use re. As I understand it, the main problem is in
reading the file.
 

Ricardo Aráoz
05-08-2008
Anton Slesarev wrote:
> [...]



All these examples assume your regular expression will not span multiple
lines, but this can easily be the case. How would you process the file
with regular expressions that span multiple lines?
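
One possible answer, sketched in Python 3: stop iterating line by line and run the regex over the whole file contents, so that a match may cross newlines. The sketch assumes the file fits in memory, which would not be comfortable for the 3.3 GB file discussed earlier; the function name and the BEGIN/END pattern in the usage note are invented for the example.

```python
import re


def find_spanning_matches(path, pattern, flags=re.DOTALL):
    """Return every match of `pattern` over the whole file as a list of strings."""
    with open(path, "r") as f:
        text = f.read()  # assumption: the file is small enough to load whole
    # With re.DOTALL, '.' also matches '\n', so one match can span lines.
    return [m.group(0) for m in re.finditer(pattern, text, flags)]
```

For example, find_spanning_matches("log.txt", r"BEGIN.*?END") would return blocks that start on one line and end on a later one, which a per-line search can never see.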





 

Alan Isaac
05-08-2008
Anton Slesarev wrote:
> I've read a great paper about generators:
> http://www.dabeaz.com/generators/index.html
> The author says that it's easy to write analogs of common Linux tools such
> as awk, grep, etc. He says that performance could be even better.
> But I have a problem writing a grep analog that performs well.



https://svn.enthought.com/svn/sandbox/grin/trunk/

hth,
Alan Isaac
 