Velocity Reviews

Velocity Reviews (http://www.velocityreviews.com/forums/index.php)
-   Python (http://www.velocityreviews.com/forums/f43-python.html)
-   -   RE: Python garbage collector/memory manager behaving strangely (http://www.velocityreviews.com/forums/t952284-re-python-garbage-collector-memory-manager-behaving-strangely.html)

Jadhav, Alok 09-17-2012 02:28 AM

RE: Python garbage collector/memory manager behaving strangely
 
Thanks Dave for the clean explanation. I clearly understand what is going
on now. I still need some suggestions from you on this.

There are 2 reasons why I was using self.rawfile.read().split('|\n')
instead of self.rawfile.readlines()

- As you have seen, the line separator is not '\n' but '|\n'.
Sometimes the data itself has '\n' characters in the middle of the line,
and the only way to find the true end of the line is that the previous
character should be a bar '|'. I was not able to specify the end of line
using the readlines() function, but I could do it using the split() function.
(One hack would be to readlines and combine them until I find '|\n'. Is
there a cleaner way to do this?)
- Reading the whole file at once and processing it line by line was much
faster. Though speed is not a very important issue here, I think the
time it took to parse the complete file was reduced to one third of the
original time.

Regards,
Alok


-----Original Message-----
From: Dave Angel [mailto:d@davea.name]
Sent: Monday, September 17, 2012 10:13 AM
To: Jadhav, Alok
Cc: python-list@python.org
Subject: Re: Python garbage collector/memory manager behaving strangely

On 09/16/2012 09:07 PM, Jadhav, Alok wrote:
> Hi Everyone,
>
>
>
> I have a simple program which reads a large file containing few million
> rows, parses each row (`numpy array`) and converts into an array of
> doubles (`python array`) and later writes into an `hdf5 file`. I repeat
> this loop for multiple days. After reading each file, i delete all the
> objects and call garbage collector. When I run the program, First day
> is parsed without any error but on the second day i get `MemoryError`. I
> monitored the memory usage of my program, during first day of parsing,
> memory usage is around **1.5 GB**. When the first day parsing is
> finished, memory usage goes down to **50 MB**. Now when 2nd day starts
> and i try to read the lines from the file I get `MemoryError`. Following
> is the output of the program.
>
>
>
>
>
> source file extracted at C:\rfadump\au\2012.08.07.txt
>
> parsing started
>
> current time: 2012-09-16 22:40:16.829000
>
> 500000 lines parsed
>
> 1000000 lines parsed
>
> 1500000 lines parsed
>
> 2000000 lines parsed
>
> 2500000 lines parsed
>
> 3000000 lines parsed
>
> 3500000 lines parsed
>
> 4000000 lines parsed
>
> 4500000 lines parsed
>
> 5000000 lines parsed
>
> parsing done.
>
> end time is 2012-09-16 23:34:19.931000
>
> total time elapsed 0:54:03.102000
>
> repacking file
>
> done
>
> > s:\users\aaj\projects\pythonhf\rfadumptohdf.py(132)generateFiles()
>
> -> while single_date <= self.end_date:
>
> (Pdb) c
>
> *** 2012-08-08 ***
>
> source file extracted at C:\rfadump\au\2012.08.08.txt
>
> cought an exception while generating file for day 2012-08-08.
>
> Traceback (most recent call last):
>
> File "rfaDumpToHDF.py", line 175, in generateFile
>
> lines = self.rawfile.read().split('|\n')
>
> MemoryError
>
>
>
> I am very sure that windows system task manager shows the memory usage
> as **50 MB** for this process. It looks like the garbage collector or
> memory manager for Python is not calculating the free memory correctly.
> There should be lot of free memory but it thinks there is not enough.
>
>
>
> Any idea?
>
>
>
> Thanks.
>
>
>
>
>
> Alok Jadhav
>
> CREDIT SUISSE AG
>
> GAT IT Hong Kong, KVAG 67
>
> International Commerce Centre | Hong Kong | Hong Kong
>
> Phone +852 2101 6274 | Mobile +852 9169 7172
>
> alok.jadhav@credit-suisse.com | www.credit-suisse.com
> <http://www.credit-suisse.com/>
>
>
>


Don't blame CPython. You're trying to do a read() of a large file,
which will result in a single large string. Then you split it into
lines. Why not just read it in as lines, in which case the large string
isn't necessary? Take a look at the readlines() function. Chances are
that even that is unnecessary, but I can't tell without seeing more of
the code.

lines = self.rawfile.read().split('|\n')

lines = self.rawfile.readlines()

When a single large item is being allocated, it's not enough to have
sufficient free space; the space also has to be contiguous. After a
program runs for a while, its address space naturally gets fragmented
more and more. It's the nature of the C runtime, and CPython is stuck
with it.
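One way to sidestep the contiguity problem is to read the file in fixed-size chunks and split records incrementally, so no single multi-hundred-megabyte string is ever allocated. A minimal sketch (the helper name, chunk size, and in-memory demo file are illustrative, not from the thread):

```python
import io

def iter_records(f, chunk_size=1024 * 1024):
    # Yield records terminated by a bar followed by a newline ('|\n'),
    # reading the file a chunk at a time instead of all at once.
    buf = ''
    while True:
        chunk = f.read(chunk_size)
        if not chunk:
            break
        buf += chunk
        parts = buf.split('|\n')
        buf = parts.pop()            # last piece may be an incomplete record
        for part in parts:
            yield part
    if buf:                          # trailing partial record, if any
        yield buf

# demo with an in-memory file standing in for the real dump
f = io.StringIO('a|\nb\nstill b|\nc|\n')
print(list(iter_records(f, chunk_size=4)))   # -> ['a', 'b\nstill b', 'c']
```

Only one chunk plus at most one partial record is held in memory at a time, so the allocation pattern stays small and uniform.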



--

DaveA


===============================================================================
Please access the attached hyperlink for an important electronic communications disclaimer:
http://www.credit-suisse.com/legal/e..._email_ib.html
===============================================================================


alex23 09-17-2012 03:25 AM

Re: Python garbage collector/memory manager behaving strangely
 
On Sep 17, 12:32 pm, "Jadhav, Alok" <alok.jad...@credit-suisse.com>
wrote:
> - As you have seen, the line separator is not '\n' but '|\n'.
> Sometimes the data itself has '\n' characters in the middle of the line
> and the only way to find the true end of the line is that the previous
> character should be a bar '|'. I was not able to specify the end of line
> using the readlines() function, but I could do it using the split()
> function.
> (One hack would be to readlines and combine them until I find '|\n'. Is
> there a cleaner way to do this?)


You can use a generator to take care of your readlines requirements:

def readlines(f):
    lines = []
    while "f is not empty":
        line = f.readline()
        if not line: break
        if len(line) > 2 and line[-2:] == '|\n':
            lines.append(line)
            yield ''.join(lines)
            lines = []
        else:
            lines.append(line)

> - Reading the whole file at once and processing line by line was much
> faster. Though speed is not a very important issue here, I think the
> time it took to parse the complete file was reduced to one third of the
> original time.


With the readlines generator above, it'll read lines from the file
until it has a complete "line" by your requirement, at which point
it'll yield it. If you don't need the entire file in memory for the
end result, you'll be able to process each "line" one at a time and
perform whatever you need against it before asking for the next.

with open(u'infile.txt','r') as infile:
    for line in readlines(infile):
        ...

Generators are a very efficient way of processing large amounts of
data. You can chain them together very easily:

real_lines = readlines(infile)
marker_lines = (l for l in real_lines if l.startswith('#'))
every_second_marker = (l for i, l in enumerate(marker_lines)
                       if (i + 1) % 2 == 0)
map(some_function, every_second_marker)

The real_lines generator returns your definition of a line. The
marker_lines generator filters out everything that doesn't start with
#, while every_second_marker returns only half of those. (Yes, these
could all be written as a single generator, but this is very useful
for more complex pipelines).

The big advantage of this approach is that nothing is read from the
file into memory until map is called, and given the way they're
chained together, only one of your lines should be in memory at any
given time.
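A self-contained toy version of that pipeline (shown in Python 3, where map is itself lazy and has to be driven by something like list; the records helper and the sample data are made up for illustration):

```python
import io

def records(f):
    # toy stand-in for the readlines generator above: a "line" ends in '|\n'
    buf = []
    for line in f:
        buf.append(line)
        if line.endswith('|\n'):
            yield ''.join(buf)
            buf = []

infile = io.StringIO('#one|\nplain|\n#two|\n#three|\n')
real_lines = records(infile)
marker_lines = (l for l in real_lines if l.startswith('#'))
every_second_marker = (l for i, l in enumerate(marker_lines)
                       if (i + 1) % 2 == 0)
print(list(map(str.strip, every_second_marker)))   # -> ['#two|']
```

Nothing is read from the in-memory file until the final list(map(...)) pulls records through the chain, one at a time.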

88888 Dihedral 09-17-2012 04:39 AM

Re: Python garbage collector/memory manager behaving strangely
 
alex23 wrote on Monday, 17 September 2012 at 11:25:06 UTC+8:
> [snip: alex23's message, quoted in full]

The basic question is whether producing the output items really
requires all lines of the input text file to be buffered.



Dave Angel 09-17-2012 10:46 AM

Re: Python garbage collector/memory manager behaving strangely
 
On 09/16/2012 11:25 PM, alex23 wrote:
> On Sep 17, 12:32 pm, "Jadhav, Alok" <alok.jad...@credit-suisse.com>
> wrote:
>> - As you have seen, the line separator is not '\n' but '|\n'.
>> Sometimes the data itself has '\n' characters in the middle of the line
>> and the only way to find the true end of the line is that the previous
>> character should be a bar '|'. I was not able to specify the end of line
>> using the readlines() function, but I could do it using the split()
>> function.
>> (One hack would be to readlines and combine them until I find '|\n'. Is
>> there a cleaner way to do this?)

> You can use a generator to take care of your readlines requirements:
>
> def readlines(f):
>     lines = []
>     while "f is not empty":
>         line = f.readline()
>         if not line: break
>         if len(line) > 2 and line[-2:] == '|\n':
>             lines.append(line)
>             yield ''.join(lines)
>             lines = []
>         else:
>             lines.append(line)


There are a few changes I'd make:
I'd change the name to something else, so as not to shadow the built-in,
and to make it clear in the caller's code that it's not the built-in one.
I'd replace that compound if statement with
    if line.endswith("|\n"):
I'd add a comment saying that partial lines at the end of file are ignored.
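Folding those three suggestions together, the generator might look like this (the name read_records and the demo data are mine, not from the thread; iterating the file directly stands in for the readline() loop):

```python
import io

def read_records(f):
    # NOTE: a partial line at end of file (no trailing '|\n') is ignored.
    lines = []
    for line in f:
        lines.append(line)
        if line.endswith('|\n'):
            yield ''.join(lines)
            lines = []

f = io.StringIO('a|\nb\nc|\ndangling')
print(list(read_records(f)))   # -> ['a|\n', 'b\nc|\n']  ('dangling' is dropped)
```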

>> - Reading the whole file at once and processing line by line was much
>> faster. Though speed is not a very important issue here, I think the
>> time it took to parse the complete file was reduced to one third of
>> the original time.


You don't say what it was faster than. Chances are you went to the
other extreme, of doing a read() of 1 byte at a time. Try Alex's
approach of a generator which in turn uses the readline() method.

> [snip: rest of alex23's message]



--

DaveA


Jadhav, Alok 09-17-2012 11:00 AM

RE: Python garbage collector/memory manager behaving strangely
 
Thanks for your valuable inputs. This is very helpful.


-----Original Message-----
From: Python-list
[mailto:python-list-bounces+alok.jadhav=credit-suisse.com@python.org] On
Behalf Of Dave Angel
Sent: Monday, September 17, 2012 6:47 PM
To: alex23
Cc: python-list@python.org
Subject: Re: Python garbage collector/memory manager behaving strangely

[snip: Dave Angel's reply, quoted in full]


Steven D'Aprano 09-17-2012 11:47 AM

Re: Python garbage collector/memory manager behaving strangely
 
On Mon, 17 Sep 2012 06:46:55 -0400, Dave Angel wrote:

> On 09/16/2012 11:25 PM, alex23 wrote:
>> def readlines(f):
>>     lines = []
>>     while "f is not empty":
>>         line = f.readline()
>>         if not line: break
>>         if len(line) > 2 and line[-2:] == '|\n':
>>             lines.append(line)
>>             yield ''.join(lines)
>>             lines = []
>>         else:
>>             lines.append(line)

>
> There's a few changes I'd make:
> I'd change the name to something else, so as not to shadow the built-in,


Which built-in are you referring to? There is no readlines built-in.

py> readlines
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NameError: name 'readlines' is not defined


There is a file.readlines method, but that lives in a different namespace
to the function readlines so there should be no confusion. At least not
for a moderately experienced programmer, beginners can be confused by the
littlest things sometimes.
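The separation is easy to demonstrate (a quick sketch; the toy function body is mine):

```python
import io

def readlines(f):
    # module-level function; does NOT touch the file method of the same name
    return ['custom']

buf = io.StringIO('a\nb\n')
print(readlines(buf))      # -> ['custom']        (our function)
print(buf.readlines())     # -> ['a\n', 'b\n']    (the method, unaffected)
```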


> and to make it clear in caller's code that it's not the built-in one.
> I'd replace that compound if statement with
>     if line.endswith("|\n"):
> I'd add a comment saying that partial lines at the end of file are
> ignored.


Or fix the generator so that it doesn't ignore partial lines, or raise
an exception, whichever is more appropriate.
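Raising on a trailing partial line is a small change at the end of the generator; a sketch (the name strict_records is mine):

```python
import io

def strict_records(f):
    lines = []
    for line in f:
        lines.append(line)
        if line.endswith('|\n'):
            yield ''.join(lines)
            lines = []
    if lines:   # data left over after the last '|\n' terminator
        raise ValueError('file ends with a partial record: %r' % ''.join(lines))

print(list(strict_records(io.StringIO('a|\nb|\n'))))   # -> ['a|\n', 'b|\n']
```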



--
Steven

Dave Angel 09-17-2012 12:03 PM

Re: Python garbage collector/memory manager behaving strangely
 
On 09/17/2012 07:47 AM, Steven D'Aprano wrote:
> On Mon, 17 Sep 2012 06:46:55 -0400, Dave Angel wrote:
>
>> On 09/16/2012 11:25 PM, alex23 wrote:
>>> def readlines(f):
>>>     lines = []
>>>     while "f is not empty":
>>>         line = f.readline()
>>>         if not line: break
>>>         if len(line) > 2 and line[-2:] == '|\n':
>>>             lines.append(line)
>>>             yield ''.join(lines)
>>>             lines = []
>>>         else:
>>>             lines.append(line)

>> There's a few changes I'd make:
>> I'd change the name to something else, so as not to shadow the built-in,

> Which built-in are you referring to? There is no readlines built-in.
>
> py> readlines
> Traceback (most recent call last):
> File "<stdin>", line 1, in <module>
> NameError: name 'readlines' is not defined
>
>
> There is a file.readlines method, but that lives in a different namespace
> to the function readlines so there should be no confusion. At least not
> for a moderately experienced programmer, beginners can be confused by the
> littlest things sometimes.


You're right of course, and that's not restricted to beginners. I've
been at this for over 40 years, and I make that kind of mistake once in
a while. Fortunately, when I make such a mistake on this forum, you
usually pop in to keep me honest. When I make it in code, I either get
a runtime error, or no harm is done.

> [snip: rest of Steven's message]



--

DaveA


Aahz 11-14-2012 02:19 PM

Re: Python garbage collector/memory manager behaving strangely
 
In article <50570de3$0$29981$c3e8da3$5496439d@news.astraweb.com>,
Steven D'Aprano <steve+comp.lang.python@pearwood.info> wrote:
>On Mon, 17 Sep 2012 06:46:55 -0400, Dave Angel wrote:
>> On 09/16/2012 11:25 PM, alex23 wrote:
>>>
>>> def readlines(f):
>>>     lines = []
>>>     while "f is not empty":
>>>         line = f.readline()
>>>         if not line: break
>>>         if len(line) > 2 and line[-2:] == '|\n':
>>>             lines.append(line)
>>>             yield ''.join(lines)
>>>             lines = []
>>>         else:
>>>             lines.append(line)

>>
>> There's a few changes I'd make:
>> I'd change the name to something else, so as not to shadow the built-in,

>
>Which built-in are you referring to? There is no readlines built-in.
>
>py> readlines
>Traceback (most recent call last):
> File "<stdin>", line 1, in <module>
>NameError: name 'readlines' is not defined
>
>There is a file.readlines method, but that lives in a different namespace
>to the function readlines so there should be no confusion. At least not
>for a moderately experienced programmer, beginners can be confused by the
>littlest things sometimes.


Actually, as an experienced programmer, I *do* think it is confusing as
evidenced by the mistake Dave made! Segregated namespaces are wonderful
(per Zen), but let's not pollute multiple namespaces with same name,
either.

It may not be literally shadowing the built-in, but it definitely
mentally shadows the built-in.
--
Aahz (aahz@pythoncraft.com) <*> http://www.pythoncraft.com/

"....Normal is what cuts off your sixth finger and your tail..." --Siobhan

Dieter Maurer 11-15-2012 07:31 AM

Re: Python garbage collector/memory manager behaving strangely
 
aahz@pythoncraft.com (Aahz) writes:

> ...
>>>> def readlines(f):
>>>>     lines = []
>>>>     while "f is not empty":
>>>>         line = f.readline()
>>>>         if not line: break
>>>>         if len(line) > 2 and line[-2:] == '|\n':
>>>>             lines.append(line)
>>>>             yield ''.join(lines)
>>>>             lines = []
>>>>         else:
>>>>             lines.append(line)
>>>
>>> There's a few changes I'd make:
>>> I'd change the name to something else, so as not to shadow the built-in,

> ...
> Actually, as an experienced programmer, I *do* think it is confusing as
> evidenced by the mistake Dave made! Segregated namespaces are wonderful
> (per Zen), but let's not pollute multiple namespaces with same name,
> either.
>
> It may not be literally shadowing the built-in, but it definitely
> mentally shadows the built-in.


I disagree with you. Namespaces exist precisely so that, while working
in one namespace, I do not need to worry much about other namespaces.
Therefore, calling a function "readlines" is very much justified (if it
reads lines from a file), even if there were a module around named
"readlines". By the way, the module is named "readline" (not
"readlines").


Thomas Rachel 11-15-2012 11:20 AM

Re: Python garbage collector/memory manager behaving strangely
 
On 17.09.2012 04:28, Jadhav, Alok wrote:
> Thanks Dave for clean explanation. I clearly understand what is going on
> now. I still need some suggestions from you on this.
>
> There are 2 reasons why I was using self.rawfile.read().split('|\n')
> instead of self.rawfile.readlines()
>
> - As you have seen, the line separator is not '\n' but '|\n'.
> Sometimes the data itself has '\n' characters in the middle of the line
> and the only way to find the true end of the line is that the previous
> character should be a bar '|'. I was not able to specify the end of line
> using the readlines() function, but I could do it using the split()
> function.
> (One hack would be to readlines and combine them until I find '|\n'. Is
> there a cleaner way to do this?)
> - Reading the whole file at once and processing line by line was much
> faster. Though speed is not a very important issue here, I think the
> time it took to parse the complete file was reduced to one third of the
> original time.


With

def itersep(f, sep='\0', buffering=1024, keepsep=True):
    if keepsep:
        keepsep = sep
    else:
        keepsep = ''
    data = f.read(buffering)
    next_line = data  # empty? -> end.
    while next_line:  # -> data is empty as well.
        lines = data.split(sep)
        for line in lines[:-1]:
            yield line + keepsep
        next_line = f.read(buffering)
        data = lines[-1] + next_line
    # keepsep: only if we have something.
    if (not keepsep) or data:
        yield data

you can iterate over everything you want without needing too much
memory. Using a larger "buffering" might improve speed a little bit.
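For example (the function is repeated here so the snippet runs on its own; the separator, buffer size, and in-memory file are illustrative):

```python
import io

def itersep(f, sep='\0', buffering=1024, keepsep=True):
    # as above: split a stream on an arbitrary separator, buffer by buffer
    if keepsep:
        keepsep = sep
    else:
        keepsep = ''
    data = f.read(buffering)
    next_line = data
    while next_line:
        lines = data.split(sep)
        for line in lines[:-1]:
            yield line + keepsep
        next_line = f.read(buffering)
        data = lines[-1] + next_line
    if (not keepsep) or data:
        yield data

f = io.StringIO('a|\nb\nstill b|\nc|\n')
print(list(itersep(f, sep='|\n', buffering=8)))
# -> ['a|\n', 'b\nstill b|\n', 'c|\n']
```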


Thomas

