Velocity Reviews > Newsgroups > Programming > Python > Creating Long Lists

Creating Long Lists

 
 
Kelson Zawack
      02-22-2011
I have a large (10 GB) data file for which I want to parse each line into
an object and then append that object to a list for sorting and further
processing. I have noticed, however, that as the length of the list
increases, the rate at which objects are added to it decreases
dramatically. My first thought was that I was nearing the memory
capacity of the machine and the decrease in performance was due to the
OS swapping things in and out of memory. When I looked at the memory
usage this was not the case. My process was the only job running and
was consuming 40 GB of the total 130 GB, and no swapping was
occurring. To make sure there was not some problem with the rest of my
code, or the server's file system, I ran my program again as it was but
without the line that appends items to the list, and it completed
without problem, indicating that the decrease in performance is the
result of some part of the process of appending to the list. Since
other people have observed this problem as well
(http://tek-tips.com/viewthread.cfm?qid=1096178&page=13,
http://stackoverflow.com/questions/2...ively-slower-i)
I did not bother to analyze or benchmark it further. Since the answers
in the above forums do not seem very definitive, I thought I would
inquire here about what the reason for this decrease in performance is,
and whether there is a way, or another data structure, that would avoid
this problem.

 
 
alex23
      02-22-2011
On Feb 22, 12:57 pm, Kelson Zawack <(E-Mail Removed)-star.edu.sg>
wrote:
> I did not bother to further analyze or benchmark it. Since the answers
> in the above forums do not seem very definitive I thought I would
> inquire here about what the reason for this decrease in performance is,
> and if there is a way, or another data structure, that would avoid this
> problem.


The first link is 6 years old and refers to Python 2.4. Unless you're
using 2.4 you should probably ignore it.

The first answer on the stackoverflow link was accepted by the poster
as resolving his issue. Try disabling garbage collection.
 
 
John Bokma
      02-22-2011
alex23 <(E-Mail Removed)> writes:

> On Feb 22, 12:57 pm, Kelson Zawack <(E-Mail Removed)-star.edu.sg>
> wrote:
>> I did not bother to further analyze or benchmark it. Since the answers
>> in the above forums do not seem very definitive I thought I would
>> inquire here about what the reason for this decrease in performance is,
>> and if there is a way, or another data structure, that would avoid this
>> problem.

>
> The first link is 6 years old and refers to Python 2.4. Unless you're
> using 2.4 you should probably ignore it.
>
> The first answer on the stackoverflow link was accepted by the poster
> as resolving his issue. Try disabling garbage collection.


I just read http://bugs.python.org/issue4074 which discusses a patch
that was included two years ago. So a recent Python 2.x shouldn't have
this issue either?

--
John Bokma j3b

Blog: http://johnbokma.com/ Facebook: http://www.facebook.com/j.j.j.bokma
Freelance Perl & Python Development: http://castleamber.com/
 
Kelson Zawack
      02-22-2011
The answer, it turns out, is the garbage collector. When I disable the
garbage collector before the loop that loads the data into the list
and then enable it after the loop, the program runs without issue.
This raises a question, though: can the logic of the garbage collector
be changed so that it is not triggered in instances like this, where you
really do want to put lots and lots of stuff in memory? Turning the
garbage collector off and on is not a big deal, but it would
obviously be nicer not to have to.
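
The workaround described above can be sketched as follows; `parse_line`
and the file path are placeholders for whatever the real program does:

```python
import gc

def load_objects(path, parse_line):
    # parse_line is a stand-in for whatever turns one line into an object.
    items = []
    gc.disable()  # skip cyclic-GC passes, which get slower as the list grows
    try:
        with open(path) as f:
            for line in f:
                items.append(parse_line(line))
    finally:
        gc.enable()  # always re-enable, even if parsing raises
    return items
```

The `try`/`finally` matters: if the loop dies partway through, the
collector still comes back on.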
 
Kelson Zawack
      02-22-2011
I am using Python 2.6.2, so it may no longer be a problem.

I am open to using another data type, but the way I read the
documentation, array.array only supports numeric types, not arbitrary
objects. I also tried playing around with numpy arrays, albeit for
only a short time, and it seems that although they do support
arbitrary objects, they are geared toward numbers as well, and I
found it cumbersome to manipulate objects with them. It could be,
though, that if I understood them better they would work fine. Also,
do numpy arrays support sorting arbitrary objects? I only saw a method
that sorts numbers.
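
For sorting arbitrary objects, a plain list with a sort key is usually
enough; a minimal sketch (the `Record` class is a hypothetical stand-in
for the parsed line objects):

```python
from operator import attrgetter

class Record:
    # Hypothetical stand-in for an object parsed from one input line.
    def __init__(self, key, payload):
        self.key = key
        self.payload = payload

records = [Record(3, "c"), Record(1, "a"), Record(2, "b")]
records.sort(key=attrgetter("key"))   # in-place Timsort; no numpy needed
print([r.payload for r in records])   # prints ['a', 'b', 'c']
```

As far as I know, numpy's sort on an object-dtype array just falls back
to comparing the objects themselves, so a key-based list.sort is
generally the more convenient tool here.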
 
Terry Reedy
      02-22-2011
On 2/22/2011 4:40 AM, Kelson Zawack wrote:
> The answer it turns out is the garbage collector. When I disable the
> garbage collector before the loop that loads the data into the list
> and then enable it after the loop the program runs without issue.
> This raises a question though, can the logic of the garbage collector
> be changed so that it is not triggered in instances like this, where you
> really do want to put lots and lots of stuff in memory? Turning on
> and off the garbage collector is not a big deal, but it would
> obviously be nicer not to have to.


Heuristics, by their very nature, are not correct in all situations.

--
Terry Jan Reedy
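
Those heuristics can at least be tuned rather than switched off
entirely; a sketch using gc.set_threshold (the specific value is
arbitrary):

```python
import gc

old = gc.get_threshold()  # defaults are typically (700, 10, 10)
# Raising threshold0 -- the allocation surplus that triggers a
# generation-0 collection -- makes passes far less frequent during a
# bulk load, without disabling the collector outright.
gc.set_threshold(max(old[0], 100000), old[1], old[2])
```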

 
Jorgen Grahn
      02-23-2011
On Tue, 2011-02-22, Ben Finney wrote:
> Kelson Zawack <(E-Mail Removed)-star.edu.sg> writes:
>
>> I have a large (10gb) data file for which I want to parse each line
>> into an object and then append this object to a list for sorting and
>> further processing.

>
> What is the nature of the further processing?
>
> Does that further processing access the items sequentially? If so, they
> don't all need to be in memory at once, and you can produce them with a
> generator <URL:http://docs.python.org/glossary.html#term-generator>.


He mentioned sorting them -- you need all of them for that.

If that's the *only* such use, I'd experiment with writing them as
sortable text to file, and run GNU sort (the Unix utility) on the file.
It seems to have a clever file-backed sort algorithm.

/Jorgen

--
// Jorgen Grahn <grahn@ Oo o. . .
\X/ snipabacken.se> O o .
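
Jorgen's file-backed idea can also be done in pure Python: sort
fixed-size chunks, spill each to a temp file, then stream a heapq.merge
over the chunk files. A minimal sketch, assuming every input line
already ends with a newline:

```python
import heapq
import os
import tempfile

def external_sort(lines, chunk_size=100000):
    """Yield the input lines in sorted order using sorted temp-file
    chunks merged with heapq.merge (the same idea as GNU sort)."""
    chunk_paths = []
    chunk = []

    def flush():
        if not chunk:
            return
        fd, path = tempfile.mkstemp(text=True)
        with os.fdopen(fd, "w") as f:
            f.writelines(sorted(chunk))  # each chunk is sorted in memory
        chunk_paths.append(path)
        del chunk[:]

    for line in lines:
        chunk.append(line)
        if len(chunk) >= chunk_size:
            flush()
    flush()

    files = [open(p) for p in chunk_paths]
    try:
        for line in heapq.merge(*files):  # k-way merge, one line at a time
            yield line
    finally:
        for f in files:
            f.close()
        for p in chunk_paths:
            os.remove(p)
```

Only chunk_size lines are ever held in memory at once, so the
garbage-collector problem never arises in the first place.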
 
Tim Wintle
      02-23-2011
On Wed, 2011-02-23 at 13:57 +0000, Jorgen Grahn wrote:
> If that's the *only* such use, I'd experiment with writing them as
> sortable text to file, and run GNU sort (the Unix utility) on the file.
> It seems to have a clever file-backed sort algorithm.


+1 -- and experiment with the different flags to sort (compression of
intermediate results, intermediate batch size, etc.).

Tim


 