Velocity Reviews - Computer Hardware Reviews

Velocity Reviews > Newsgroups > Programming > Python > Re: Populating a dictionary, fast

Reply
Thread Tools

Re: Populating a dictionary, fast

 
 
Michael Bacarella
Guest
Posts: n/a
 
      11-11-2007
> That's an awfully complicated way to iterate over a file. Try this
> instead:


>
> id2name = {}
> for line in open('id2name.txt'):
> id,name = line.strip().split(':')
> id = long(id)
> id2name[id] = name
>
> > This takes about 45 *minutes*
> >

> On my system, it takes about a minute and a half to produce a

dictionary
> with 8191180 entries.


Doing something similar on my system is very fast as well.

$ cat dict-8191180.py

#!/usr/bin/python



v = {}

for i in xrange(8191180):

v[i] = i


$ time ./dict-8191180.py

real 0m5.877s
user 0m4.953s
sys 0m0.924s

But...

> > If I comment out the last line in the loop body it takes only about

30
> > _seconds_ to run. This would seem to implicate the line id2name[id] =
> > name as being excruciatingly slow.

>
> No, dictionary access is one of the most highly-optimized, fastest,

most
> efficient parts of Python. What it indicates to me is that your system

is
> running low on memory, and is struggling to find room for 517MB worth

of
> data.



If only it were so easy.




$ free


total used free shared buffers cached


Mem: 7390244 2103448 5286796 0 38996 1982756


-/+ buffers/cache: 81696 7308548


Swap: 2096472 10280 2086192





Here's your Python implementation running as badly as mine did.



$ wc -l id2name.txt

8191180 id2name.txt



$ cat cache-id2name.py

#!/usr/bin/python



id2name = {}

for line in open('id2name.txt'):

id,name = line.strip().split(':',1)

id = long(id)

id2name[id] = name



$ time ./cache-id2name.py

^C

I let it go 30 minutes before killing it since I had to leave. Here it is in top before I did the deed.


PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND

18802 root 25 0 1301m 1.2g 1148 R 99.9 17.1 36:05.99 cache-id2name.p



36 minutes, 05.99 seconds.



To rule out the file reading/parsing logic as culprit, here's same thing, but with

dictionary insertion removed.



$ cat nocache-id2name.py

#!/usr/bin/python



id2name = {}

for line in open('id2name.txt'):

id,name = line.strip().split(':',1)

id = long(id)



$ time ./nocache-id2name.py



real 0m33.518s

user 0m33.070s

sys 0m0.415s





Here's a Perl implementation running very fast.



$ cat cache-id2name.pl

#!/usr/bin/perl



my %id2name = ();

my $line;

my $id;

my $name;

open(F,"<id2name.txt");



foreach $line (<F>) {

chomp($line);

($id,$name) = split(/:/,$line,1);

$id = int($id);

$id2name{$id} = $name;

}



$ time ./cache-id2name.pl



real 0m46.363s

user 0m43.730s

sys 0m2.611s





So, you think the Python's dict implementation degrades towards O(N)

performance when it's fed millions of 64-bit pseudo-random longs?




 
Reply With Quote
 
 
 
 
Steven D'Aprano
Guest
Posts: n/a
 
      11-11-2007
On Sat, 10 Nov 2007 17:18:37 -0800, Michael Bacarella wrote:


> So, you think the Python's dict implementation degrades towards O(N)
> performance when it's fed millions of 64-bit pseudo-random longs?


No.

Here's my sample file:

$ wc -l id2name.txt
8191180 id2name.txt
$ ls -lh id2name.txt
-rw-rw-r-- 1 steve steve 371M 2007-11-11 14:00 id2name.txt

And the results of reading it into a dict (code follows below):

$ time ./slurp_dict.py
Starting at Sun Nov 11 14:26:51 2007
Line 0
Line 1000000
Line 2000000
Line 3000000
Line 4000000
Line 5000000
Line 6000000
Line 7000000
Line 8000000
Items in dict: 8191180
Completed import at Sun Nov 11 14:29:31 2007
Starting to delete dict...


Traceback (most recent call last):
File "./slurp_dict.py", line 20, in <module>
del id2name
KeyboardInterrupt

real 35m52.334s
user 1m17.663s
sys 0m16.758s


Notice that the dict is completely read into memory in just two and a
half minutes. The script then tries to delete the dict, and 32 minutes
later is still struggling. That's the point I got sick of waiting and
interrupted the script.

Conclusion: it's a memory issue, or maybe a garbage collection issue, not
a problem with dicts.




Here's my code for creating the key:value file in the first place:

#!/usr/bin/python
"""Make a big file of 64-bit integer keys plus random values."""

bits64 = 2**64
import random
template = '%d:This is a bunch of text...\n'
fp = open('id2name.txt', 'w')
for i in xrange(8191180):
fp.write(template % random.randint(0, bits64))
fp.close()

###

And here's my code for slurping it in to a dict:

#!/usr/bin/python
"""Read a big file into a dict."""

import gc
import time
print "Starting at %s" % time.asctime()
flag = gc.isenabled()
gc.disable()
id2name = {}
for n, line in enumerate(open('id2name.txt', 'r')):
if n % 1000000 == 0:
# Give feedback.
print "Line %d" % n
id,name = line.strip().split(':', 1)
id = long(id)
id2name[id] = name
print "Items in dict:", len(id2name)
print "Completed import at %s" % time.asctime()
print "Starting to delete dict..."
del id2name
print "Completed deletion at %s" % time.asctime()
if flag:
gc.enable()
print "Finishing at %s" % time.asctime()

###


So, what can you do? Given that I'm not willing to do any more unpaid
experimentation for you, here are my suggestions, in no particular order:

(1) Presumably you don't want to run your app with the garbage collector
turned off. You might still want to play around with the gc module to see
what you can learn.

(2) More memory will help avoid paging. If you can't get more memory, try
more virtual memory. It will still be slow, but at least the operating
system doesn't have to try moving blocks around as much.

(3) Are you sure you need all eight-million-plus items in the cache all
at once?

(4) There are lots of algorithms out there for dealing with data too big
to fit into main memory. Do some research.

(5) There is a data structure designed for dealing with tens of millions
of records at once. It is called "a database". If you can't find a better
algorithm, and refuse to use an existing RDBMS, I suspect you're going to
end up inventing a primitive, inefficient, buggy database which is no
faster than existing systems out there.



--
Steven.
 
Reply With Quote
 
 
 
 
Yu-Xi Lim
Guest
Posts: n/a
 
      11-11-2007
Steven D'Aprano wrote:
> (2) More memory will help avoid paging. If you can't get more memory, try
> more virtual memory. It will still be slow, but at least the operating
> system doesn't have to try moving blocks around as much.
>


Based on his previous post, it would seem he has 7GB of RAM (with about
5GB free) and 2GB of swap. I don't think RAM is the issue.

Maybe there's something wrong with his specific Python installation.
What version is it and was it compiled by him?
 
Reply With Quote
 
Paul Rubin
Guest
Posts: n/a
 
      11-11-2007
Michael Bacarella <> writes:
> If only it were so easy.


I think I know what's going on, the dictionary updates are sending the
GC into quadratic behavior. Try turning off the GC:

import gc
gc.disable()
 
Reply With Quote
 
Alberto Berti
Guest
Posts: n/a
 
      11-11-2007
>>>>> "Steven" == Steven D'Aprano <> writes:

Steven> $ time ./slurp_dict.py Starting at Sun Nov 11 14:26:51
Steven> 2007 Line 0 Line 1000000 Line 2000000 Line 3000000 Line
Steven> 4000000 Line 5000000 Line 6000000 Line 7000000 Line
Steven> 8000000 Items in dict: 8191180 Completed import at Sun Nov
Steven> 11 14:29:31 2007 Starting to delete dict...


Steven> Traceback (most recent call last): File "./slurp_dict.py",
Steven> line 20, in <module> del id2name KeyboardInterrupt

Steven> real 35m52.334s user 1m17.663s sys 0m16.758s


Steven> Notice that the dict is completely read into memory in
Steven> just two and a half minutes. The script then tries to
Steven> delete the dict, and 32 minutes later is still
Steven> struggling. That's the point I got sick of waiting and
Steven> interrupted the script.

Steven> Conclusion: it's a memory issue, or maybe a garbage
Steven> collection issue, not a problem with dicts.


uh, strange results...

I run your same scripts with and without garbage collection enabled
and those are the results:

with gc enabled:

azazel@lizard:~/wip/zodb_test$ python slurp_dict.py
Starting at Sun Nov 11 16:35:12 2007
Line 0
Line 1000000
Line 2000000
Line 3000000
Line 4000000
Line 5000000
Line 6000000
Line 7000000
Line 8000000
Items in dict: 8191180
Completed import at Sun Nov 11 16:36:03 2007
Starting to delete dict...
Completed deletion at Sun Nov 11 16:36:09 2007
Finishing at Sun Nov 11 16:36:09 2007

and without gc enabled

azazel@lizard:~/wip/zodb_test$ python slurp_dict.py
Starting at Sun Nov 11 16:39:02 2007
Line 0
Line 1000000
Line 2000000
Line 3000000
Line 4000000
Line 5000000
Line 6000000
Line 7000000
Line 8000000
Items in dict: 8191180
Completed import at Sun Nov 11 16:39:49 2007
Starting to delete dict...
Completed deletion at Sun Nov 11 16:39:56 2007
Finishing at Sun Nov 11 16:39:56 2007


all with python2.4 on and i386 Linux

cheers

Alberto

 
Reply With Quote
 
Dennis Lee Bieber
Guest
Posts: n/a
 
      11-11-2007
On Sun, 11 Nov 2007 04:32:44 -0000, Steven D'Aprano
<> declaimed the following in
comp.lang.python:

>
> real 35m52.334s
> user 1m17.663s
> sys 0m16.758s
>
>
> Notice that the dict is completely read into memory in just two and a
> half minutes. The script then tries to delete the dict, and 32 minutes
> later is still struggling. That's the point I got sick of waiting and
> interrupted the script.
>

Sure seems a problem somewhere... Out of insanity I just ran your
exact programs -- on a 32-bit WinXP Pro machine with an ActiveState
Python 2.4.x release. 2GB physical RAM, 1.5GB swap.

>pythonw -u "Dload.py"

Starting at Sun Nov 11 10:25:22 2007
Line 0
Line 1000000
Line 2000000
Line 3000000
Line 4000000
Line 5000000
Line 6000000
Line 7000000
Line 8000000
Items in dict: 8191180
Completed import at Sun Nov 11 10:26:08 2007
Starting to delete dict...
Completed deletion at Sun Nov 11 10:26:12 2007
Finishing at Sun Nov 11 10:26:12 2007
>Exit code: 0


Peak commit charge: 1437724K, which may have just been large enough
to start swapping when one accounts for the OS, Firefox, Eudora, and
Agent (as user applications -- I won't bother to track down the
background processes).
--
Wulfraed Dennis Lee Bieber KD6MOG

HTTP://wlfraed.home.netcom.com/
(Bestiaria Support Staff: web-)
HTTP://www.bestiaria.com/
 
Reply With Quote
 
Hrvoje Niksic
Guest
Posts: n/a
 
      11-12-2007
Paul Rubin <http://> writes:

> Michael Bacarella <> writes:
>> If only it were so easy.

>
> I think I know what's going on, the dictionary updates are sending the
> GC into quadratic behavior. Try turning off the GC:
>
> import gc
> gc.disable()


This is becoming an FAQ on this newsgroup, having come up first with
lists and now with dictionaries. It can also be considered a serious
implementation problem, since code like Michael's is expected to
perform in (close to) linear time and in fact did so in previous
versions of Python, and still does in Perl and other popular scripting
languages. Neither introductory nor intermediate Python learning
materials don't cover the need to disable GC when building large data
structures.

Maybe the default GC strategy should be rethought.
 
Reply With Quote
 
 
 
Reply

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off
Trackbacks are On
Pingbacks are On
Refbacks are Off


Similar Threads
Thread Thread Starter Forum Replies Last Post
Re: Populating a dictionary, fast Michael Bacarella Python 16 11-21-2007 08:17 PM
Re: Populating a dictionary, fast Michael Bacarella Python 5 11-12-2007 11:40 PM
Re: Populating a dictionary, fast Michael Bacarella Python 2 11-12-2007 03:41 PM
Re: Populating a dictionary, fast Michael Bacarella Python 3 11-11-2007 10:23 PM
Populating a dictionary, fast Michael Bacarella Python 4 11-11-2007 04:11 PM



Advertisments
 



1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57