Oddity with large dictionary (several million entries)

 
 
Lothar Werzinger
 
      04-27-2010
Hi,

I am trying to load files into a dictionary for analysis. The size of the
dictionary will grow quite large (several million entries), and as inserting
into a dictionary is roughly O(n) I figured that if I loaded each file into
its own dictionary it would speed things up. However, it did not.

So I decided to write a small test program (attached).

As you can see, I am inserting one million entries at a time into a map. I
ran one test where I put all three million entries into one map and one where
I put one million entries each into its own map.

What I would have expected is that if I insert one million entries into its
own map, the time to do that would be roughly constant for each map.
Interestingly, it is not. It's about the same as if I load everything into
one map.

Oh, and I have 4 GB of RAM and the test consumes about 40% at its peak. I
even ran the test on one of our servers with 64 GB of RAM, so I can rule out
swapping as the issue.

Can anyone explain this oddity? Any insight is highly appreciated.

Here's the output of the test runs:

$ ./dicttest.py -t 0
Inserting into one map
Inserting 1000000 keys lasted 0:00:26 (38019 1/s)
len(map) 1000000
Inserting 1000000 keys lasted 0:01:17 (12831 1/s)
len(map) 2000000
Inserting 1000000 keys lasted 0:02:23 (6972 1/s)
len(map) 3000000
total 3000000

$ ./dicttest.py -t 1
Inserting into three maps
Inserting 1000000 keys lasted 0:00:32 (30726 1/s)
len(map) 1000000
Inserting 1000000 keys lasted 0:01:29 (11181 1/s)
len(map) 1000000
Inserting 1000000 keys lasted 0:02:23 (6957 1/s)
len(map) 1000000
total 3000000


Thanks,
Lothar

,----[ /home/lothar/tmp/dicttest.py ]
| #!/usr/bin/python
| # -*- coding: utf-8 -*-
|
| import datetime
| import optparse
| import sys
| import time
|
|
| def fillDict(map, nop, num, guid):
|     # Insert num (i, guid) keys into map and report the elapsed time.
|     before = time.time()
|
|     for i in xrange(0, num):
|         key = (i, guid)
|         if not nop:
|             map[key] = ([], {})
|
|     after = time.time()
|     elapsed = (after - before)
|     if elapsed <= 0:
|         divide = 1.0
|     else:
|         divide = elapsed
|     elapsedTime = datetime.timedelta(seconds=int(elapsed))
|     print("Inserting %d keys lasted %s (%d 1/s)"
|           % (num, elapsedTime, (num / divide)))
|     print("len(map) %d" % (len(map)))
|
|
| def test0(nop, num):
|     # All entries go into a single dict.
|     print("Inserting into one map")
|     map = {}
|     fillDict(map, nop, num, "0561c83c-9675-4e6f-bedc-86bcb6acfd71")
|     fillDict(map, nop, num, "0561c83c-9675-4e6f-bedc-86bcb6acfd72")
|     fillDict(map, nop, num, "0561c83c-9675-4e6f-bedc-86bcb6acfd73")
|     print("total %d" % (len(map)))
|
|
| def test1(nop, num):
|     # One million entries go into each of three separate dicts.
|     print("Inserting into three maps")
|     map1 = {}
|     map2 = {}
|     map3 = {}
|     fillDict(map1, nop, num, "0561c83c-9675-4e6f-bedc-86bcb6acfd71")
|     fillDict(map2, nop, num, "0561c83c-9675-4e6f-bedc-86bcb6acfd72")
|     fillDict(map3, nop, num, "0561c83c-9675-4e6f-bedc-86bcb6acfd73")
|     total = 0
|     for map in [map1, map2, map3]:
|         total += len(map)
|     print("total %d" % (total))
|
|
| if __name__ == "__main__":
|     usage = "USAGE: %prog [options]"
|     description = "test"
|     version = "%prog 1.0"
|
|     parser = optparse.OptionParser(usage=usage, version=version,
|                                    description=description)
|     parser.add_option(
|         "-t",
|         "--testnum",
|         action="store",
|         dest="testnum",
|         help="the number of the test to execute",
|         type="int",
|         default=1
|     )
|     parser.add_option(
|         "-i",
|         "--iterations",
|         action="store",
|         dest="iterations",
|         help="the number of iterations",
|         type="int",
|         default=1000000
|     )
|     parser.add_option(
|         "-n",
|         "--nop",
|         action="store_true",
|         dest="nop",
|         help="don't store in the dictionary, only load and parse",
|         default=False
|     )
|
|     (options, args) = parser.parse_args()
|
|     testmap = {
|         0: test0,
|         1: test1,
|     }
|
|     test = testmap[options.testnum]
|
|     test(options.nop, options.iterations)
`----
 
 
 
 
 
Peter Otten
 
      04-27-2010
Lothar Werzinger wrote:

> [...]
>
> What I would have expected is that if I insert one million entries into
> its own map, the time to do that would be roughly constant for each map.
> Interestingly, it is not. It's about the same as if I load everything
> into one map.
>
> Can anyone explain this oddity? Any insight is highly appreciated.


When you are creating objects like there is no tomorrow, Python's cyclic
garbage collection often takes a significant amount of time. The first thing
I'd try is therefore switching it off with

import gc
gc.disable()
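
If you only need collection suspended while the dictionaries are filling up,
you can wrap the load in try/finally and switch it back on afterwards. A
rough sketch of how that might look (parse_line here is a stand-in for
whatever per-line parsing you actually do; it is not part of your script):

import gc

def load_all(paths, parse_line):
    """Fill one dict from many files with cyclic GC suspended."""
    entries = {}
    gc.disable()              # no cyclic collection during the bulk inserts
    try:
        for path in paths:
            with open(path) as f:
                for line in f:
                    key, value = parse_line(line)   # caller-supplied parser
                    entries[key] = value
    finally:
        gc.enable()           # restore normal collection once loading is done
        gc.collect()          # and do one full pass to clean up any cycles
    return entries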

Peter


 
 
 
 
 
MRAB
 
      04-27-2010
Lothar Werzinger wrote:
> I am trying to load files into a dictionary for analysis. The size of the
> dictionary will grow quite large (several million entries), and as
> inserting into a dictionary is roughly O(n) I figured that if I loaded
> each file into its own dictionary it would speed things up. However, it
> did not.
>
[snip]
Inserting into a Python dict is O(1).
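
Strictly speaking it is amortized O(1): the underlying hash table has to grow
and rehash every so often, but the resizes get further and further apart, so
the average cost per insert stays constant. A quick sketch (not part of the
original test) that makes the resize points visible:

import sys

d = {}
last = sys.getsizeof(d)
for i in xrange(3000000):
    d[i] = None
    size = sys.getsizeof(d)
    if size != last:        # the dict just reallocated its hash table
        print("resize after %d inserts: %d bytes" % (i + 1, size))
        last = size

The resizes become increasingly rare as the dict grows, which is why the
per-insert cost should not depend on the dict's size; the slowdown you
measured comes from the cyclic garbage collector, not from the dict itself.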
 
 
Lothar Werzinger
 
      04-27-2010
Peter Otten wrote:

> Lothar Werzinger wrote:
>> Can anyone explain this oddity? Any insight is highly appreciated.

>
> When you are creating objects like there is no tomorrow Python's cyclic
> garbage collections often takes a significant amount of time. The first
> thing I'd try is therefore switching it off with
>
> import gc
> gc.disable()
>
> Peter


Wow, that really MAKES a difference! Thanks a lot!

$ ~/tmp/dicttest.py -t 0
Inserting into one map
Inserting 1000000 keys lasted 0:00:01 (960152 1/s)
len(map) 1000000
Inserting 1000000 keys lasted 0:00:01 (846416 1/s)
len(map) 2000000
Inserting 1000000 keys lasted 0:00:04 (235241 1/s)
len(map) 3000000
total 3000000

$ ~/tmp/dicttest.py -t 1
Inserting into three maps
Inserting 1000000 keys lasted 0:00:01 (973344 1/s)
len(map) 1000000
Inserting 1000000 keys lasted 0:00:00 (1011303 1/s)
len(map) 1000000
Inserting 1000000 keys lasted 0:00:00 (1021796 1/s)
len(map) 1000000
total 3000000


Thanks!
--
Lothar
 
 
Benjamin Peterson
 
      04-28-2010
Lothar Werzinger <lothar <at> tradescape.biz> writes:
> Wow, that really MAKES a difference! Thanks a lot!


You should also try Python 2.7 or 3.1. We've recently optimized the garbage
collector for situations where many objects are being created.
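
If you have to stay on an older interpreter, a middle ground between leaving
the collector on and disabling it completely is to raise its thresholds so
full collections run far less often during the load. A sketch, not something
suggested in this thread:

import gc

print(gc.get_threshold())         # defaults are typically (700, 10, 10)
gc.set_threshold(100000, 50, 50)  # collect far less aggressively
# ... bulk load here ...
gc.set_threshold(700, 10, 10)     # restore the defaults afterwards

This keeps reference-cycle collection available while cutting down how often
the collector has to walk the ever-growing set of tracked objects.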




 
 
 
 