deduping

 
 
dirknbr
      06-21-2010
Hi

I have 2 files (done and outf), and I want to choose the unique elements
from the 2nd column in outf which are not in done. This code works but
is not efficient. Can you think of a quicker way? The a=1 is just a
redundant statement, obviously. I put it this way around because I think
'in' is quicker than 'not in' - is that true?

done_={}
for line in done:
    done_[line.strip()]=0

print len(done_)

universe={}
for line in outf:
    if line.split(',')[1].strip() in universe.keys():
        a=1
    else:
        if line.split(',')[1].strip() in done_.keys():
            a=1
        else:
            universe[line.split(',')[1].strip()]=0

Dirk
 
Thomas Lehmann
      06-21-2010
> universe={}
> for line in outf:
>     if line.split(',')[1].strip() in universe.keys():
>         a=1
>     else:
>         if line.split(',')[1].strip() in done_.keys():
>             a=1
>         else:
>             universe[line.split(',')[1].strip()]=0
>


I cannot say too much because I can't see what is being processed,
but what I can say is that "line.split(',')[1].strip()" may be
evaluated up to three times per line, so I would compute it only once.
I would write it like this:

for line in outf:
    key = line.split(',')[1].strip()
    if not (key in universe.keys()):
        if not (key in done_.keys()):
            universe[key] = 0

 
Peter Otten
      06-21-2010
dirknbr wrote:

> Hi
>
> I have 2 files (done and outf), and I want to choose unique elements
> from the 2nd column in outf which are not in done. This code works but
> is not efficient, can you think of a quicker way? The a=1 is just a
> redundant task obviously, I put it this way around because I think
> 'in' is quicker than 'not in' - is that true?
>
> done_={}
> for line in done:
>     done_[line.strip()]=0
>
> print len(done_)
>
> universe={}
> for line in outf:
>     if line.split(',')[1].strip() in universe.keys():
>         a=1
>     else:
>         if line.split(',')[1].strip() in done_.keys():
>             a=1
>         else:
>             universe[line.split(',')[1].strip()]=0


Instead of

if key in some_dict.keys():
    #...

which (in Python 2) converts the keys in the dictionary to a list and then
performs an O(N) lookup on that list, you should use

if key in some_dict:
    #...

which doesn't build a list and looks up the key in constant time.
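
A rough way to see the difference for yourself is sketched below. It is only an illustration: in Python 3, keys() returns a view with fast membership tests, so the sketch uses list(some_dict) to stand in for the Python 2 behaviour described above. The key count, the chosen key, and the function names are made up for the example, and the timings will vary by machine.

```python
import timeit

# Hypothetical data: 100,000 string keys mapped to 0, as in the original post.
some_dict = dict.fromkeys((str(i) for i in range(100000)), 0)
key = "99999"  # inserted last, so a list scan has to walk every element


def lookup_via_list():
    # Mimics Python 2's `key in some_dict.keys()`: build a list, then scan it.
    return key in list(some_dict)


def lookup_via_dict():
    # Hash lookup directly on the dict; no intermediate list is built.
    return key in some_dict


if __name__ == "__main__":
    print("list scan:", timeit.timeit(lookup_via_list, number=100))
    print("dict hash:", timeit.timeit(lookup_via_dict, number=100))
```

Both functions return the same answer; only the amount of work per lookup differs.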

Peter

 
python@bdurham.com
      06-21-2010
Use a set instead of a dictionary for done keys?

Malcolm
 
Dave Angel
      06-21-2010
dirknbr wrote:
> Hi
>
> I have 2 files (done and outf), and I want to choose unique elements
> from the 2nd column in outf which are not in done. This code works but
> is not efficient, can you think of a quicker way? The a=1 is just a
> redundant task obviously, I put it this way around because I think
> 'in' is quicker than 'not in' - is that true?
>
> done_={}
> for line in done:
>     done_[line.strip()]=0
>
> print len(done_)
>
> universe={}
> for line in outf:
>     if line.split(',')[1].strip() in universe.keys():
>         a=1
>     else:
>         if line.split(',')[1].strip() in done_.keys():
>             a=1
>         else:
>             universe[line.split(',')[1].strip()]=0
>
> Dirk
>
>

Where you have a=1, one would normally use the "pass" statement. But
you're wrong that 'not in' is less efficient than 'in'. If there's a
difference, it's probably negligible, and almost certainly less than the
extra else clause you're forcing here.

When doing an 'in', do *not* use the keys() method, as you're replacing
a fast lookup with a slow one, not to mention the time it takes to build
the keys() list each time.

In both these cases, you can use a set, rather than a dict. And there's
no need to test whether the item is already in the set, just put it in
again.

Changing all that, you'll wind up with something like (untested)

done_set = set()
universe = set()
for line in done:
    done_set.add(line.strip())
for line in outf:
    item = line.split(',')[1].strip()
    if item not in done_set:
        universe.add(item)
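
For anyone who wants to try the set-based approach without the original files, here is a self-contained sketch. The sample data, the function name new_items, and the choice to skip lines that have no second column are all assumptions made for the example; they are not from the original post.

```python
import io


def new_items(done, outf):
    """Return the set of second-column values in outf that are not in done."""
    done_set = set(line.strip() for line in done)
    universe = set()
    for line in outf:
        fields = line.split(',')
        if len(fields) < 2:
            continue  # assumption: ignore lines without a second column
        item = fields[1].strip()
        if item not in done_set:
            universe.add(item)  # adding a duplicate is harmless in a set
    return universe


# Made-up sample data standing in for the two open files.
done = io.StringIO("apple\nbanana\n")
outf = io.StringIO("1, apple\n2, cherry\n3, cherry\n4, banana\n5, date\n")

result = new_items(done, outf)
print(sorted(result))  # ['cherry', 'date']
```

Any iterable of lines works in place of the StringIO objects, including real file objects opened with open().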


DaveA

 
Paul Rubin
      06-21-2010
dirknbr writes:
> done_={}
> for line in done:
>     done_[line.strip()]=0
> ...


Maybe you mean:

done_ = set(line.strip() for line in done)
outf_ = set(line.split(',')[1].strip() for line in outf)
universe = outf_ - done_ # set difference: items in outf_ that are not in done_
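
The original question asks for items in outf that are not in done, which corresponds to set difference rather than intersection. A small sketch with made-up values shows what each operator returns:

```python
# Hypothetical values standing in for the stripped file contents.
done_ = {"apple", "banana"}
outf_ = {"apple", "cherry", "date"}

in_both = outf_ & done_       # intersection: items present in both sets
only_in_outf = outf_ - done_  # difference: items in outf_ but not in done_

print(sorted(in_both))       # ['apple']
print(sorted(only_in_outf))  # ['cherry', 'date']
```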
 