Efficient grep using Python?

 
 
Christos TZOTZIOY Georgiou
12-16-2004
On Thu, 16 Dec 2004 14:28:21 +0000, rumours say that (E-Mail Removed)
might have written:

[sf]
>>>>Essentially, want to do efficient grep, i.e. from A remove those lines which
>>>>are also present in file B.


[p@draig]
>>>You could implement elegantly using the new sets feature
>>>For reference here is the unix way to do it:
>>>
>>>sort a b b | uniq -u


[Christos]
>> No, like I just wrote in another post, he wants
>>
>> $ grep -vf B A
>>
>> I think that
>>
>> $ sort A B B | uniq -u
>>
>> can be abbreviated to
>>
>> $ sort -u A B B
>>
>> which is the union rather than the intersection of the files


[P@draig]
>wrong. Notice the -u option to uniq.
>http://marc.free.net.ph/message/2002....1bc24964.html


I see your point. That use of uniq is new to me, since I started using
Unices long before the GNU versions of the tools; but then, I might have
missed the -u option.

$ cat A
aa
ss
dd
ff
gg
hh
jj
kk
$ cat B
aa
dd
gg
jj
pp
$ time sort A B B | uniq -u
ff
hh
kk
ss

real 0m0.004s
user 0m0.004s
sys 0m0.004s
$ time grep -vf B A
ss
ff
hh
kk

real 0m0.003s
user 0m0.000s
sys 0m0.003s

So I stand corrected: your solution does *not* give the union.

>> wastes some time by considering B twice

>
>I challenge you to a benchmark


Well, the numbers I provided above are almost meaningless with such a
small set (and they could easily be reversed; I just kept the
convenient-to-me first run). Do you really believe that sorting three
files and then scanning their merged output counting duplicates is
faster than scanning two files (and doing lookups during the second
scan)?
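
In Python, that second strategy is only a few lines (a sketch, assuming
Python 2.4's built-in set; on 2.3 you would substitute
"from sets import Set", and the /tmp/A and /tmp/B names just follow the
files generated below):

import sys

# First scan: hash every line of B for O(1) membership tests.
b_lines = set(open('/tmp/B'))

# Second scan: emit the lines of A that never appear in B, keeping
# A's original order and any duplicate lines.
for line in open('/tmp/A'):
    if line not in b_lines:
        sys.stdout.write(line)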

$ python
Python 2.3.3 (#1, Aug 31 2004, 13:51:39)
[GCC 3.3.3 (SuSE Linux)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> x=open('/usr/share/dict/words').readlines()
>>> len(x)

45378
>>> import random
>>> random.shuffle(x)
>>> open("/tmp/A", "w").writelines(x)
>>> random.shuffle(x)
>>> open("/tmp/B", "w").writelines(x[:1000])
>>>

$ time sort A B B | uniq -u >/dev/null

real 0m0.311s
user 0m0.315s
sys 0m0.008s
$ time grep -Fvf B A >/dev/null

real 0m0.067s
user 0m0.064s
sys 0m0.003s

(Yes, I cheated by adding the -F flag, for fixed strings instead of
regular expressions.)

>> and finally destroys original line
>> order (should it be important).

>
>true


That's our final agreement.
--
TZOTZIOY, I speak England very best.
"Be strict when sending and tolerant when receiving." (from RFC195
I really should keep that in mind when talking with people, actually...
 
P@draigBrady.com
12-17-2004
Christos TZOTZIOY Georgiou wrote:
> On Thu, 16 Dec 2004 14:28:21 +0000, rumours say that (E-Mail Removed)
>>I challenge you to a benchmark

>
>
> Well, the numbers I provided above are almost meaningless with such a
> small set (and they could easily be reversed; I just kept the
> convenient-to-me first run). Do you really believe that sorting three
> files and then scanning their merged output counting duplicates is
> faster than scanning two files (and doing lookups during the second
> scan)?
>
> $ python
> Python 2.3.3 (#1, Aug 31 2004, 13:51:39)
> [GCC 3.3.3 (SuSE Linux)] on linux2
> Type "help", "copyright", "credits" or "license" for more information.
>
>>>>x=open('/usr/share/dict/words').readlines()
>>>>len(x)

>
> 45378
>
>>>>import random
>>>>random.shuffle(x)
>>>>open("/tmp/A", "w").writelines(x)
>>>>random.shuffle(x)
>>>>open("/tmp/B", "w").writelines(x[:1000])
>>>>

>
> $ time sort A B B | uniq -u >/dev/null
>
> real 0m0.311s
> user 0m0.315s
> sys 0m0.008s
> $ time grep -Fvf B A >/dev/null
>
> real 0m0.067s
> user 0m0.064s
> sys 0m0.003s
>
> (Yes, I cheated by adding the -F flag, for fixed strings instead of
> regular expressions.)


Also, you only have 1000 entries in B!
Try it again with all entries in B as well.
Remember the original poster had 100K entries!

>>>and finally destroys original line
>>>order (should it be important).

>>
>>true

>
> That's our final agreement.


Note the order is trivial to restore with a
"decorate-sort-undecorate" idiom.

--
Pádraig Brady - http://www.pixelbeat.org
--
 
sf
12-17-2004
The point is that when you have 100,000s of records, this grep becomes
really slow.

Any comments?

That's why I turned to Python.


> that would be
>
> grep -vf B A
>
> and it is a rare use of grep, indeed.
> --
> TZOTZIOY, I speak England very best.
> "Be strict when sending and tolerant when receiving." (from RFC195
> I really should keep that in mind when talking with people, actually...



 
Christos TZOTZIOY Georgiou
12-17-2004
On Fri, 17 Dec 2004 12:21:08 +0000, rumours say that (E-Mail Removed)
might have written:

[snip some damn lie aka "benchmark"]

[me]
>> (Yes, I cheated by adding the -F flag, for fixed strings instead of
>> regular expressions.)

>
>Also you only have 1000 entries in B!
>Try it again with all entries in B also
>Remember the original poster had 100K entries!


Well, that's the closest I can get:

$ py
Python 2.4c1 (#3, Nov 26 2004, 23:39:44)
[GCC 3.3.3 (SuSE Linux)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys; sys.ps1='.>>'

.>> alist=[line.strip() for line in open('/usr/share/dict/words')]
.>> words=set()
.>> for word in alist:
...     words.add(word + '\n')
...     words.add(word[::-1] + '\n')
...
.>> len(words)
90525
.>> words=list(words)
.>> open('/tmp/A', 'w').writelines(words)
.>> import random; random.shuffle(words)
.>> open('/tmp/B', 'w').writelines(words[:90000])
.>>
$ time sort A B B | uniq -u >/dev/null

real 0m2.408s
user 0m2.437s
sys 0m0.037s
$ time grep -Fvf B A >/dev/null

real 0m1.208s
user 0m1.161s
sys 0m0.035s

What now?-)

Mind you, I only replied in the first place because you wrote (my
emphasis) "...here is *the* unix way..." and it's the bad days of the
month (not mine, actually, but I suffer along...)

>>>>and finally destroys original line
>>>>order (should it be important).
>>>
>>>true

>>
>> That's our final agreement.

>
>Note the order is trivial to restore with a
>"decorate-sort-undecorate" idiom.


Using python or unix tools (eg 'paste -d', 'sort -k', 'cut -d')?
Because the python way has already been discussed by Fredrik, John and
Tim, and the unix way gets overly complicated (aka non-trivial) if DSU
is involved.

BTW, the following occurred to me:

tzot@tril/tmp
$ cat >A
aa
ss
dd
ff
gg
hh
jj
kk
ll
aa
tzot@tril/tmp
$ cat >B
ss
ff
hh
kk
tzot@tril/tmp
$ sort A B B | uniq -u
dd
gg
jj
ll
tzot@tril/tmp
$ grep -Fvf B A
aa
dd
gg
jj
ll
aa

Note that 'aa' appears twice in the A file (the file to be filtered
using B) and not at all in B, yet sort A B B | uniq -u drops it,
because duplicate lines within A collapse. So our methods do not
produce the same output. Going back to what the OP wrote:

>Essentially, want to do efficient grep, i.e. from A remove those lines which
>are also present in file B.


grep is the unix way to go for both speed and correctness.
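
The same caveat carries over to Python: a plain set difference behaves
like sort|uniq -u here, collapsing A's duplicate lines and losing their
order, while a streaming filter matches grep's semantics. A sketch,
assuming Python 2.4's built-in set:

b = set(open('B'))

# Streaming filter, like grep -Fvf B A: keeps A's order and both
# 'aa' lines.
kept = [line for line in open('A') if line not in b]

# By contrast, set(open('A')) - b would collapse the two 'aa' lines
# into one and discard the original order, much like
# sort A B B | uniq -u.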

I would call this issue a dead horse.
--
TZOTZIOY, I speak England very best.
"Be strict when sending and tolerant when receiving." (from RFC195
I really should keep that in mind when talking with people, actually...
 
P@draigBrady.com
12-17-2004
sf wrote:
> The point is that when you have 100,000s of records, this grep becomes
> really slow.


There are performance bugs in current versions of grep with multibyte
characters that are only now being addressed.
To work around these, do `export LANG=C` first.

In my experience grep is not scalable here, since it's effectively
O(n^2) with n lines in each of A and B. See below (note A and B are
randomized versions of /usr/share/dict/words, and therefore a worst
case for the sort method).

$ wc -l A B
45427 A
45427 B

$ export LANG=C

$ time grep -Fvf B A
real 0m0.437s

$ time sort A B B | uniq -u
real 0m0.262s

$ rpm -q grep coreutils
grep-2.5.1-16.1
coreutils-4.5.3-19

--
Pádraig Brady - http://www.pixelbeat.org
--
 
Christos TZOTZIOY Georgiou
12-17-2004
On Fri, 17 Dec 2004 14:22:34 +0000, rumours say that (E-Mail Removed)
might have written:

sf:

>sf wrote:
>> The point is that when you have 100,000s of records, this grep becomes
>> really slow.

>
>There are performance bugs with current versions of grep
>and multibyte characters that are only getting addressed now.
>To work around these do `export LANG=C` first.


You should also use the -F flag that Pádraig suggests, since you don't
have regular expressions in the B file.

>In my experience grep is not scalable since it's O(n^2).
>See below (note A and B are randomized versions of
>/usr/share/dict/words (and therefore worst case for the
>sort method)).
>
>$ wc -l A B
> 45427 A
> 45427 B
>
>$ export LANG=C
>
>$ time grep -Fvf B A
>real 0m0.437s
>
>$ time sort A B B | uniq -u
>real 0m0.262s
>
>$ rpm -q grep coreutils
>grep-2.5.1-16.1
>coreutils-4.5.3-19


sf, you had better do your own benchmarks on your machine (there is
quick sample code in other posts of mine and Pádraig's), since on my
test machine the numbers are reversed relative to Pádraig's (grep takes
half the time).

package versions (on SuSE 9.1 64-bit):

$ rpm -q grep coreutils
grep-2.5.1-427
coreutils-5.2.1-21

language:
$ echo $LANG
en_US.UTF-8

Caution: the two solutions are interchangeable only as long as you don't
have duplicate lines in the A file. If you do, use the grep version.
--
TZOTZIOY, I speak England very best.
"Be strict when sending and tolerant when receiving." (from RFC195
I really should keep that in mind when talking with people, actually...
 