Velocity Reviews - Computer Hardware Reviews

Velocity Reviews > Newsgroups > Programming > Python > Performance: sets vs dicts.

Reply
Thread Tools

Performance: sets vs dicts.

 
 
John Nagle
Guest
Posts: n/a
 
      08-29-2010
Is the "in" test faster for a dict or a set?
Is "frozenset" faster than "set"? Use case is
for things like applying "in" on a list of 500 or so words
while checking a large body of text.

John Nagle
 
Reply With Quote
 
 
 
 
Arnaud Delobelle
Guest
Posts: n/a
 
      08-29-2010
John Nagle <(E-Mail Removed)> writes:

> Is the "in" test faster for a dict or a set?
> Is "frozenset" faster than "set"? Use case is
> for things like applying "in" on a list of 500 or so words
> while checking a large body of text.
>
> John Nagle


IIRC Frozensets are implemented more or less as sets with a hash
function and immutability so I would expect "in" to perform exactly the
same as for sets. For dicts, I would think that the set implementation
is very close to the dict one.

So I wouldn't expect any significant difference between any of them.

--
Arnaud
 
Reply With Quote
 
 
 
 
Peter Otten
Guest
Posts: n/a
 
      08-29-2010
John Nagle wrote:

> Is the "in" test faster for a dict or a set?
> Is "frozenset" faster than "set"? Use case is
> for things like applying "in" on a list of 500 or so words
> while checking a large body of text.


As Arnaud suspects: no significant difference:

$ python dictperf.py
dict --> 0.210289001465
set --> 0.202902793884
frozenset --> 0.198950052261

$ cat dictperf.py
import random
import timeit

with open("/usr/share/dict/words") as instream:
words = [line.strip() for line in instream]

#random.seed(42)
sample = random.sample(words, 501)

n = sample.pop()
y = random.choice(sample)

d = dict.fromkeys(sample)
s = set(sample)
f = frozenset(sample)


for lookup in d, s, f:
print type(lookup).__name__, "-->", timeit.timeit(
"n in lookup; y in lookup",
"from __main__ import lookup, n, y")

Peter
 
Reply With Quote
 
Seth Rees
Guest
Posts: n/a
 
      08-29-2010
On 08/29/10 14:43, Peter Otten wrote:
> John Nagle wrote:
>
>> Is the "in" test faster for a dict or a set?
>> Is "frozenset" faster than "set"? Use case is
>> for things like applying "in" on a list of 500 or so words
>> while checking a large body of text.

>
> As Arnaud suspects: no significant difference:
>
> $ python dictperf.py
> dict --> 0.210289001465
> set --> 0.202902793884
> frozenset --> 0.198950052261
>
> $ cat dictperf.py
> import random
> import timeit
>
> with open("/usr/share/dict/words") as instream:
> words = [line.strip() for line in instream]
>
> #random.seed(42)
> sample = random.sample(words, 501)
>
> n = sample.pop()
> y = random.choice(sample)
>
> d = dict.fromkeys(sample)
> s = set(sample)
> f = frozenset(sample)
>
>
> for lookup in d, s, f:
> print type(lookup).__name__, "-->", timeit.timeit(
> "n in lookup; y in lookup",
> "from __main__ import lookup, n, y")
>
> Peter

What about lists versus tuples?
 
Reply With Quote
 
Raymond Hettinger
Guest
Posts: n/a
 
      08-30-2010
On Aug 29, 12:12*pm, John Nagle <(E-Mail Removed)> wrote:
> * * Is the "in" test faster for a dict or a set?
> Is "frozenset" faster than "set"? *Use case is
> for things like applying "in" on a list of 500 or so words
> while checking a large body of text.


There is no significant difference.
All three are implemented using substantially the same code.


Raymond

 
Reply With Quote
 
Peter Otten
Guest
Posts: n/a
 
      08-30-2010
Seth Rees wrote:

> On 08/29/10 14:43, Peter Otten wrote:
>> John Nagle wrote:
>>
>>> Is the "in" test faster for a dict or a set?
>>> Is "frozenset" faster than "set"? Use case is
>>> for things like applying "in" on a list of 500 or so words
>>> while checking a large body of text.

>>
>> As Arnaud suspects: no significant difference:
>>
>> $ python dictperf.py
>> dict --> 0.210289001465
>> set --> 0.202902793884
>> frozenset --> 0.198950052261
>>
>> $ cat dictperf.py
>> import random
>> import timeit
>>
>> with open("/usr/share/dict/words") as instream:
>> words = [line.strip() for line in instream]
>>
>> #random.seed(42)
>> sample = random.sample(words, 501)
>>
>> n = sample.pop()
>> y = random.choice(sample)
>>
>> d = dict.fromkeys(sample)
>> s = set(sample)
>> f = frozenset(sample)
>>
>>
>> for lookup in d, s, f:
>> print type(lookup).__name__, "-->", timeit.timeit(
>> "n in lookup; y in lookup",
>> "from __main__ import lookup, n, y")
>>
>> Peter

> What about lists versus tuples?


You trade O(1) for O(N) with both, a bad idea for all but the smallest
lists/tuples.

$ python dictperf.py
dict --> 0.221934080124
set --> 0.220952987671
frozenset --> 0.225043058395
list --> 21.5041530132
tuple --> 20.8655071259

By the way, the script will run on your computer, too

Peter

 
Reply With Quote
 
Aahz
Guest
Posts: n/a
 
      08-30-2010
In article <(E-Mail Removed)>,
Raymond Hettinger <(E-Mail Removed)> wrote:
>On Aug 29, 12:12=A0pm, John Nagle <(E-Mail Removed)> wrote:
>>
>> Is the "in" test faster for a dict or a set? Is "frozenset" faster
>> than "set"? Use case is for things like applying "in" on a list of
>> 500 or so words while checking a large body of text.

>
>There is no significant difference. All three are implemented using
>substantially the same code.


That reminds me: one co-worker (who really should have known better
had the impression that sets were O(N) rather than O(1). Although
writing that off as a brain-fart seems appropriate, it's also the case
that the docs don't really make that clear, it's implied from requiring
elements to be hashable. Do you agree that there should be a comment?
--
Aahz ((E-Mail Removed)) <*> http://www.pythoncraft.com/

"...if I were on life-support, I'd rather have it run by a Gameboy than a
Windows box." --Cliff Wells
 
Reply With Quote
 
Paul Rubin
Guest
Posts: n/a
 
      08-30-2010
http://www.velocityreviews.com/forums/(E-Mail Removed) (Aahz) writes:
> That reminds me: one co-worker (who really should have known better
> had the impression that sets were O(N) rather than O(1). Although
> writing that off as a brain-fart seems appropriate, it's also the case
> that the docs don't really make that clear, it's implied from requiring
> elements to be hashable. Do you agree that there should be a comment?


It's O(1) with reasonable input distributions but can be O(N) for
adverse distributions. The docs should say something like that, and
include this link:

http://www.cs.rice.edu/~scrosby/hash...003/index.html
 
Reply With Quote
 
Aahz
Guest
Posts: n/a
 
      08-30-2010
In article <(E-Mail Removed)>,
Ben Finney <(E-Mail Removed)> wrote:
>(E-Mail Removed) (Aahz) writes:
>>
>> That reminds me: one co-worker (who really should have known better
>> had the impression that sets were O(N) rather than O(1).

>
>For settling exactly this kind of confusion, Python's standard library
>comes with a module, the timeit module. Your co-worker should have
>known better: don't guess about timing performance, measure it.
>
>Or am I missing something here?


Possibly; IMO, people should not need to run timeit to determine basic
algorithmic speed for standard Python datatypes.
--
Aahz ((E-Mail Removed)) <*> http://www.pythoncraft.com/

"...if I were on life-support, I'd rather have it run by a Gameboy than a
Windows box." --Cliff Wells
 
Reply With Quote
 
Paul Rubin
Guest
Posts: n/a
 
      08-31-2010
(E-Mail Removed) (Aahz) writes:
> Possibly; IMO, people should not need to run timeit to determine basic
> algorithmic speed for standard Python datatypes.


Indeed. Alex Stepanov (designer of C++ Standard Template Library) was
emphatic that algorithm complexity assertions should be part of the
interface of any STL function:

http://www.sgi.com/tech/stl/drdobbs-interview.html

He also said it should be part of the "unwritten contract" between the
module and its user, but I don't understand why he said "unwritten",
since in the C++ STL the complexity statements are part of the written
spec.
 
Reply With Quote
 
 
 
Reply

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off
Trackbacks are On
Pingbacks are On
Refbacks are Off


Similar Threads
Thread Thread Starter Forum Replies Last Post
Feature Sets Parastatidis Nikos Cisco 2 07-26-2005 12:58 PM
How to determine IOS Feature Sets on Cisco 2600 series routers? John Keeney Cisco 1 06-14-2004 07:05 PM
Sql Server multiple XML result sets to ADO.NET randall g ASP .Net 1 12-29-2003 09:24 PM
How to identify feature sets? This Old Man Cisco 3 12-05-2003 03:20 AM
Shaping data/multiple result sets - hmmm... Mike Kingscott ASP .Net 0 08-27-2003 12:39 PM



Advertisments