Velocity Reviews - Computer Hardware Reviews

Velocity Reviews > Newsgroups > Programming > Python > Efficient grep using Python?

Reply
Thread Tools

Efficient grep using Python?

 
 
sf
Guest
Posts: n/a
 
      12-15-2004
Just started thinking about learning python.

Is there any place where I can get some free examples, especially for
following kind of problem ( it must be trivial for those using python)

I have files A, and B each containing say 100,000 lines (each line=one
string without any space)

I want to do

" A - (A intersection B) "

Essentially, want to do efficient grep, i..e from A remove those lines which
are also present in file B.


 
Reply With Quote
 
 
 
 
John Hunter
Guest
Posts: n/a
 
      12-15-2004
>>>>> "sf" == sf <(E-Mail Removed)> writes:

sf> Just started thinking about learning python. Is there any
sf> place where I can get some free examples, especially for
sf> following kind of problem ( it must be trivial for those using
sf> python)

sf> I have files A, and B each containing say 100,000 lines (each
sf> line=one string without any space)

sf> I want to do

sf> " A - (A intersection B) "

sf> Essentially, want to do efficient grep, i..e from A remove
sf> those lines which are also present in file B.

If you're only talking about 100K lines or so, and you have a
reasonably modern computer, you can do this all in memory. If order
doesn't matter (it probably does) you can use a set to get all the
lines in file B that are not in A

from sets import Set
A = Set(file('test1.dat').readlines())
B = Set(file('test2.dat').readlines())
print B-A

To preserve order, you should use a dictionary that maps lines to line
numbers. You can later use these numbers to sort

A = dict([(line, num) for num,line in enumerate(file('test1.dat'))])
B = dict([(line, num) for num,line in enumerate(file('test2.dat'))])

keep = [(num, line) for line,num in B.items() if not A.has_key(line)]
keep.sort()
for num, line in keep:
print line,

Now someone else will come along and tell you all this functionality
is already in the standard library. But it's always fun to hack this
out yourself once because python makes such things so damned easy.

JDH

 
Reply With Quote
 
 
 
 
Fredrik Lundh
Guest
Posts: n/a
 
      12-15-2004
"sf" <(E-Mail Removed)> wrote:

> I have files A, and B each containing say 100,000 lines (each line=one
> string without any space)
>
> I want to do
>
> " A - (A intersection B) "
>
> Essentially, want to do efficient grep, i..e from A remove those lines which
> are also present in file B.


that's an unusual definition of "grep", but the following seems to
do what you want:

afile = "a.txt"
bfile = "b.txt"

bdict = dict.fromkeys(open(bfile).readlines())

for line in open(afile):
if line not in bdict:
print line,

</F>



 
Reply With Quote
 
P@draigBrady.com
Guest
Posts: n/a
 
      12-15-2004
sf wrote:
> Just started thinking about learning python.
>
> Is there any place where I can get some free examples, especially for
> following kind of problem ( it must be trivial for those using python)
>
> I have files A, and B each containing say 100,000 lines (each line=one
> string without any space)
>
> I want to do
>
> " A - (A intersection B) "
>
> Essentially, want to do efficient grep, i..e from A remove those lines which
> are also present in file B.


You could implement elegantly using the new sets feature
For reference here is the unix way to do it:

sort a b b | uniq -u

--
Pádraig Brady - http://www.pixelbeat.org
--
 
Reply With Quote
 
Tim Peters
Guest
Posts: n/a
 
      12-15-2004
["sf" <(E-Mail Removed)>]
>> I have files A, and B each containing say 100,000 lines (each
>> line=one string without any space)
>>
>> I want to do
>>
>> " A - (A intersection B) "
>>
>> Essentially, want to do efficient grep, i..e from A remove those
>> lines which are also present in file B.


[Fredrik Lundh]
> that's an unusual definition of "grep", but the following seems to
> do what you want:
>
> afile = "a.txt"
> bfile = "b.txt"
>
> bdict = dict.fromkeys(open(bfile).readlines())
>
> for line in open(afile):
> if line not in bdict:
> print line,
>
> </F>


Note that an open file is an iterable object, yielding the lines in
the file. The "for" loop exploited that above, but fromkeys() can
also exploit it. That is,

bdict = dict.fromkeys(open(bfile))

is good enough (there's no need for the .readlines()).
 
Reply With Quote
 
Fredrik Lundh
Guest
Posts: n/a
 
      12-15-2004
Tim Peters wrote:

>> bdict = dict.fromkeys(open(bfile).readlines())
>>
>> for line in open(afile):
>> if line not in bdict:
>> print line,
>>
>> </F>

>
> Note that an open file is an iterable object, yielding the lines in
> the file. The "for" loop exploited that above, but fromkeys() can
> also exploit it. That is,
>
> bdict = dict.fromkeys(open(bfile))
>
> is good enough (there's no need for the .readlines()).


(sigh. my brain knows that, but my fingers keep forgetting)

and yes, for this purpose, "dict.fromkeys" can be replaced
with "set".

bdict = set(open(bfile))

(and then you can save a few more bytes by renaming the
variable...)

</F>



 
Reply With Quote
 
Tim Peters
Guest
Posts: n/a
 
      12-15-2004
[Fredrik Lundh]
>>> bdict = dict.fromkeys(open(bfile).readlines())
>>>
>>> for line in open(afile):
>>> if line not in bdict:
>>> print line,
>>>
>>> </F>


[Tim Peters]
>> Note that an open file is an iterable object, yielding the lines in
>> the file. The "for" loop exploited that above, but fromkeys() can
>> also exploit it. That is,
>>
>> bdict = dict.fromkeys(open(bfile))
>>
>> is good enough (there's no need for the .readlines()).


[/F]
> (sigh. my brain knows that, but my fingers keep forgetting)
>
> and yes, for this purpose, "dict.fromkeys" can be replaced
> with "set".
>
> bdict = set(open(bfile))
>
> (and then you can save a few more bytes by renaming the
> variable...)


Except the latter two are just shallow spelling changes. Switching
from fromkeys(open(f).readlines()) to fromkeys(open(f)) is much more
interesting, since it can allow major reduction in memory use. Even
if all the lines in the file are pairwise distinct, not materializing
them into a giant list can be a significant win. I wouldn't have
bothered replying if the only point were that you can save a couple
bytes of typing <wink>.
 
Reply With Quote
 
Christos TZOTZIOY Georgiou
Guest
Posts: n/a
 
      12-16-2004
On Wed, 15 Dec 2004 17:07:37 +0100, rumours say that "Fredrik Lundh"
<(E-Mail Removed)> might have written:

>> Essentially, want to do efficient grep, i..e from A remove those lines which
>> are also present in file B.

>
>that's an unusual definition of "grep"


that would be

grep -vf B A

and it is a rare use of grep, indeed.
--
TZOTZIOY, I speak England very best.
"Be strict when sending and tolerant when receiving." (from RFC195
I really should keep that in mind when talking with people, actually...
 
Reply With Quote
 
Christos TZOTZIOY Georgiou
Guest
Posts: n/a
 
      12-16-2004
On Wed, 15 Dec 2004 16:10:08 +0000, rumours say that http://www.velocityreviews.com/forums/(E-Mail Removed)
might have written:

>> Essentially, want to do efficient grep, i..e from A remove those lines which
>> are also present in file B.

>
>You could implement elegantly using the new sets feature
>For reference here is the unix way to do it:
>
>sort a b b | uniq -u


No, like I just wrote in another post, he wants

$ grep -vf B A

I think that

$ sort A B B | uniq -u

can be abbreviated to

$ sort -u A B B

which is the union rather than the intersection of the files, wastes
some time by considering B twice, and finally destroys original line
order (should it be important).
--
TZOTZIOY, I speak England very best.
"Be strict when sending and tolerant when receiving." (from RFC195
I really should keep that in mind when talking with people, actually...
 
Reply With Quote
 
P@draigBrady.com
Guest
Posts: n/a
 
      12-16-2004
Christos TZOTZIOY Georgiou wrote:
> On Wed, 15 Dec 2004 16:10:08 +0000, rumours say that (E-Mail Removed)
> might have written:
>
>
>>>Essentially, want to do efficient grep, i..e from A remove those lines which
>>>are also present in file B.

>>
>>You could implement elegantly using the new sets feature
>>For reference here is the unix way to do it:
>>
>>sort a b b | uniq -u

>
>
> No, like I just wrote in another post, he wants
>
> $ grep -vf B A
>
> I think that
>
> $ sort A B B | uniq -u
>
> can be abbreviated to
>
> $ sort -u A B B
>
> which is the union rather than the intersection of the files


wrong. Notice the -u option to uniq.
http://marc.free.net.ph/message/2002....1bc24964.html

> wastes some time by considering B twice


I challenge you to a benchmark

> and finally destroys original line
> order (should it be important).


true

--
Pádraig Brady - http://www.pixelbeat.org
--
 
Reply With Quote
 
 
 
Reply

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off
Trackbacks are On
Pingbacks are On
Refbacks are Off


Similar Threads
Thread Thread Starter Forum Replies Last Post
Using grep on subarrays - help! Simon Harrison Ruby 11 04-04-2011 11:41 PM
Using popen & grep Dan King Ruby 2 04-13-2010 06:33 AM
Can you get the 1st occurrence of a value using grep? Mmcolli00 Mom Ruby 3 05-14-2009 09:30 PM
Using array.select with grep Milo Thurston Ruby 15 08-01-2008 07:25 PM
Efficient grep using Python? Jane Austine Python 1 12-16-2004 04:54 AM



Advertisments