![]() |
Efficient grep using Python?
Just started thinking about learning python.
Is there any place where I can get some free examples, especially for following kind of problem ( it must be trivial for those using python) I have files A, and B each containing say 100,000 lines (each line=one string without any space) I want to do " A - (A intersection B) " Essentially, want to do efficient grep, i..e from A remove those lines which are also present in file B. |
Re: Efficient grep using Python?
>>>>> "sf" == sf <sf@sf.sf> writes:
sf> Just started thinking about learning python. Is there any sf> place where I can get some free examples, especially for sf> following kind of problem ( it must be trivial for those using sf> python) sf> I have files A, and B each containing say 100,000 lines (each sf> line=one string without any space) sf> I want to do sf> " A - (A intersection B) " sf> Essentially, want to do efficient grep, i..e from A remove sf> those lines which are also present in file B. If you're only talking about 100K lines or so, and you have a reasonably modern computer, you can do this all in memory. If order doesn't matter (it probably does) you can use a set to get all the lines in file B that are not in A from sets import Set A = Set(file('test1.dat').readlines()) B = Set(file('test2.dat').readlines()) print B-A To preserve order, you should use a dictionary that maps lines to line numbers. You can later use these numbers to sort A = dict([(line, num) for num,line in enumerate(file('test1.dat'))]) B = dict([(line, num) for num,line in enumerate(file('test2.dat'))]) keep = [(num, line) for line,num in B.items() if not A.has_key(line)] keep.sort() for num, line in keep: print line, Now someone else will come along and tell you all this functionality is already in the standard library. But it's always fun to hack this out yourself once because python makes such things so damned easy. JDH |
Re: Efficient grep using Python?
"sf" <sf@sf.sf> wrote:
> I have files A, and B each containing say 100,000 lines (each line=one > string without any space) > > I want to do > > " A - (A intersection B) " > > Essentially, want to do efficient grep, i..e from A remove those lines which > are also present in file B. that's an unusual definition of "grep", but the following seems to do what you want: afile = "a.txt" bfile = "b.txt" bdict = dict.fromkeys(open(bfile).readlines()) for line in open(afile): if line not in bdict: print line, </F> |
Re: Efficient grep using Python?
sf wrote:
> Just started thinking about learning python. > > Is there any place where I can get some free examples, especially for > following kind of problem ( it must be trivial for those using python) > > I have files A, and B each containing say 100,000 lines (each line=one > string without any space) > > I want to do > > " A - (A intersection B) " > > Essentially, want to do efficient grep, i..e from A remove those lines which > are also present in file B. You could implement elegantly using the new sets feature For reference here is the unix way to do it: sort a b b | uniq -u -- Pádraig Brady - http://www.pixelbeat.org -- |
Re: Efficient grep using Python?
["sf" <sf@sf.sf>]
>> I have files A, and B each containing say 100,000 lines (each >> line=one string without any space) >> >> I want to do >> >> " A - (A intersection B) " >> >> Essentially, want to do efficient grep, i..e from A remove those >> lines which are also present in file B. [Fredrik Lundh] > that's an unusual definition of "grep", but the following seems to > do what you want: > > afile = "a.txt" > bfile = "b.txt" > > bdict = dict.fromkeys(open(bfile).readlines()) > > for line in open(afile): > if line not in bdict: > print line, > > </F> Note that an open file is an iterable object, yielding the lines in the file. The "for" loop exploited that above, but fromkeys() can also exploit it. That is, bdict = dict.fromkeys(open(bfile)) is good enough (there's no need for the .readlines()). |
Re: Efficient grep using Python?
Tim Peters wrote:
>> bdict = dict.fromkeys(open(bfile).readlines()) >> >> for line in open(afile): >> if line not in bdict: >> print line, >> >> </F> > > Note that an open file is an iterable object, yielding the lines in > the file. The "for" loop exploited that above, but fromkeys() can > also exploit it. That is, > > bdict = dict.fromkeys(open(bfile)) > > is good enough (there's no need for the .readlines()). (sigh. my brain knows that, but my fingers keep forgetting) and yes, for this purpose, "dict.fromkeys" can be replaced with "set". bdict = set(open(bfile)) (and then you can save a few more bytes by renaming the variable...) </F> |
Re: Efficient grep using Python?
[Fredrik Lundh]
>>> bdict = dict.fromkeys(open(bfile).readlines()) >>> >>> for line in open(afile): >>> if line not in bdict: >>> print line, >>> >>> </F> [Tim Peters] >> Note that an open file is an iterable object, yielding the lines in >> the file. The "for" loop exploited that above, but fromkeys() can >> also exploit it. That is, >> >> bdict = dict.fromkeys(open(bfile)) >> >> is good enough (there's no need for the .readlines()). [/F] > (sigh. my brain knows that, but my fingers keep forgetting) > > and yes, for this purpose, "dict.fromkeys" can be replaced > with "set". > > bdict = set(open(bfile)) > > (and then you can save a few more bytes by renaming the > variable...) Except the latter two are just shallow spelling changes. Switching from fromkeys(open(f).readlines()) to fromkeys(open(f)) is much more interesting, since it can allow major reduction in memory use. Even if all the lines in the file are pairwise distinct, not materializing them into a giant list can be a significant win. I wouldn't have bothered replying if the only point were that you can save a couple bytes of typing <wink>. |
Re: Efficient grep using Python?
On Wed, 15 Dec 2004 17:07:37 +0100, rumours say that "Fredrik Lundh"
<fredrik@pythonware.com> might have written: >> Essentially, want to do efficient grep, i..e from A remove those lines which >> are also present in file B. > >that's an unusual definition of "grep" that would be grep -vf B A and it is a rare use of grep, indeed. -- TZOTZIOY, I speak England very best. "Be strict when sending and tolerant when receiving." (from RFC1958) I really should keep that in mind when talking with people, actually... |
Re: Efficient grep using Python?
On Wed, 15 Dec 2004 16:10:08 +0000, rumours say that P@draigBrady.com
might have written: >> Essentially, want to do efficient grep, i..e from A remove those lines which >> are also present in file B. > >You could implement elegantly using the new sets feature >For reference here is the unix way to do it: > >sort a b b | uniq -u No, like I just wrote in another post, he wants $ grep -vf B A I think that $ sort A B B | uniq -u can be abbreviated to $ sort -u A B B which is the union rather than the intersection of the files, wastes some time by considering B twice, and finally destroys original line order (should it be important). -- TZOTZIOY, I speak England very best. "Be strict when sending and tolerant when receiving." (from RFC1958) I really should keep that in mind when talking with people, actually... |
Re: Efficient grep using Python?
Christos TZOTZIOY Georgiou wrote:
> On Wed, 15 Dec 2004 16:10:08 +0000, rumours say that P@draigBrady.com > might have written: > > >>>Essentially, want to do efficient grep, i..e from A remove those lines which >>>are also present in file B. >> >>You could implement elegantly using the new sets feature >>For reference here is the unix way to do it: >> >>sort a b b | uniq -u > > > No, like I just wrote in another post, he wants > > $ grep -vf B A > > I think that > > $ sort A B B | uniq -u > > can be abbreviated to > > $ sort -u A B B > > which is the union rather than the intersection of the files wrong. Notice the -u option to uniq. http://marc.free.net.ph/message/2002....1bc24964.html > wastes some time by considering B twice I challenge you to a benchmark :-) > and finally destroys original line > order (should it be important). true -- Pádraig Brady - http://www.pixelbeat.org -- |
| All times are GMT. The time now is 08:13 AM. |
Powered by vBulletin®. Copyright ©2000 - 2013, vBulletin Solutions, Inc.
SEO by vBSEO ©2010, Crawlability, Inc.