RE: Python garbage collector/memory manager behaving strangely
Thanks, Dave, for the clear explanation. I now understand what is going on, but I still need some suggestions from you.

There are two reasons why I was using self.rawfile.read().split('|\n') instead of self.rawfile.readlines():

- As you have seen, the line separator is not '\n' but '|\n'. Sometimes the data itself has '\n' characters in the middle of a line, and the only way to find the true end of a line is that the previous character is a bar '|'. I was not able to specify the end of line using the readlines() function, but I could do it using split(). (One hack would be to call readlines() and combine the results until I find '|\n'. Is there a cleaner way to do this?)
- Reading the whole file at once and processing it line by line was much faster. Speed is not a very important issue here, but I think the time it took to parse the complete file was reduced to one third of the original time.

Regards,
Alok

-----Original Message-----
From: Dave Angel [mailto:d@davea.name]
Sent: Monday, September 17, 2012 10:13 AM
To: Jadhav, Alok
Cc: python-list@python.org
Subject: Re: Python garbage collector/memory manager behaving strangely

On 09/16/2012 09:07 PM, Jadhav, Alok wrote:
> Hi Everyone,
>
> I have a simple program which reads a large file containing a few million
> rows, parses each row (`numpy array`) and converts it into an array of
> doubles (`python array`) and later writes it into an `hdf5 file`. I repeat
> this loop for multiple days. After reading each file, I delete all the
> objects and call the garbage collector. When I run the program, the first
> day is parsed without any error, but on the second day I get `MemoryError`.
> I monitored the memory usage of my program: during the first day of parsing,
> memory usage is around **1.5 GB**. When the first day's parsing is
> finished, memory usage goes down to **50 MB**. Now when the 2nd day starts
> and I try to read the lines from the file, I get `MemoryError`. Following
> is the output of the program.
>
> source file extracted at C:\rfadump\au\2012.08.07.txt
> parsing started
> current time: 2012-09-16 22:40:16.829000
> 500000 lines parsed
> 1000000 lines parsed
> 1500000 lines parsed
> 2000000 lines parsed
> 2500000 lines parsed
> 3000000 lines parsed
> 3500000 lines parsed
> 4000000 lines parsed
> 4500000 lines parsed
> 5000000 lines parsed
> parsing done.
> end time is 2012-09-16 23:34:19.931000
> total time elapsed 0:54:03.102000
> repacking file
> done
> > s:\users\aaj\projects\pythonhf\rfadumptohdf.py(132)generateFiles()
> -> while single_date <= self.end_date:
> (Pdb) c
> *** 2012-08-08 ***
> source file extracted at C:\rfadump\au\2012.08.08.txt
> cought an exception while generating file for day 2012-08-08.
> Traceback (most recent call last):
>   File "rfaDumpToHDF.py", line 175, in generateFile
>     lines = self.rawfile.read().split('|\n')
> MemoryError
>
> I am very sure that the Windows Task Manager shows the memory usage
> as **50 MB** for this process. It looks like the garbage collector or
> memory manager for Python is not calculating the free memory correctly.
> There should be a lot of free memory, but it thinks there is not enough.
>
> Any idea?
>
> Thanks.
>
> Alok Jadhav
> CREDIT SUISSE AG
> GAT IT Hong Kong, KVAG 67
> International Commerce Centre | Hong Kong | Hong Kong
> Phone +852 2101 6274 | Mobile +852 9169 7172
> alok.jadhav@credit-suisse.com | www.credit-suisse.com
> <http://www.credit-suisse.com/>

Don't blame CPython. You're trying to do a read() of a large file, which will result in a single large string. Then you split it into lines. Why not just read it in as lines, in which case the large string isn't necessary? Take a look at the readlines() function. Chances are that even that is unnecessary, but I can't tell without seeing more of the code.

    lines = self.rawfile.read().split('|\n')

    lines = self.rawfile.readlines()

When a single large item is being allocated, it's not enough to have sufficient free space; the space also has to be contiguous. After a program runs for a while, its space naturally gets more and more fragmented. It's the nature of the C runtime, and CPython is stuck with it.

--
DaveA

===============================================================================
Please access the attached hyperlink for an important electronic communications disclaimer:
http://www.credit-suisse.com/legal/e..._email_ib.html
===============================================================================
Re: Python garbage collector/memory manager behaving strangely
On Sep 17, 12:32 pm, "Jadhav, Alok" <alok.jad...@credit-suisse.com> wrote:
> - As you have seen, the line separator is not '\n' but its '|\n'.
> Sometimes the data itself has '\n' characters in the middle of the line
> and only way to find true end of the line is that previous character
> should be a bar '|'. I was not able specify end of line using
> readlines() function, but I could do it using split() function.
> (One hack would be to readlines and combine them until I find '|\n'. is
> there a cleaner way to do this?)

You can use a generator to take care of your readlines requirements:

    def readlines(f):
        lines = []
        while "f is not empty":
            line = f.readline()
            if not line: break
            if len(line) > 2 and line[-2:] == '|\n':
                lines.append(line)
                yield ''.join(lines)
                lines = []
            else:
                lines.append(line)

> - Reading whole file at once and processing line by line was much
> faster. Though speed is not of very important issue here but I think the
> time it took to parse complete file was reduced to one third of original
> time.

With the readlines generator above, it'll read lines from the file until it has a complete "line" by your requirement, at which point it'll yield it. If you don't need the entire file in memory for the end result, you'll be able to process each "line" one at a time and perform whatever you need against it before asking for the next.

    with open(u'infile.txt','r') as infile:
        for line in readlines(infile):
            ...

Generators are a very efficient way of processing large amounts of data. You can chain them together very easily:

    real_lines = readlines(infile)
    marker_lines = (l for l in real_lines if l.startswith('#'))
    every_second_marker = (l for i,l in enumerate(marker_lines) if (i + 1) % 2 == 0)
    map(some_function, every_second_marker)

The real_lines generator returns your definition of a line. The marker_lines generator filters out everything that doesn't start with #, while every_second_marker returns only half of those. (Yes, these could all be written as a single generator, but this is very useful for more complex pipelines.)

The big advantage of this approach is that nothing is read from the file into memory until map is called, and given the way they're chained together, only one of your lines should be in memory at any given time.
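The laziness described above can be checked with a small sketch. It assumes Python 3 and uses an in-memory io.StringIO as a stand-in for the real file; the sample contents and the names position_before and results are made up for illustration. On Python 3, map() is itself lazy, so nothing is read until something iterates over the final stage.

```python
import io

# In-memory stand-in for the real file; the contents are invented for the demo.
infile = io.StringIO("#a\nx\n#b\n#c\n#d\n")

real_lines = (line for line in infile)
marker_lines = (l for l in real_lines if l.startswith("#"))
every_second_marker = (l for i, l in enumerate(marker_lines) if (i + 1) % 2 == 0)
stripped = map(str.strip, every_second_marker)  # lazy on Python 3

# Building the pipeline has not consumed anything from the file yet.
position_before = infile.tell()

# Iterating the last stage pulls lines through every stage, one at a time.
results = list(stripped)
```

Only the final list() call drives the read, and each stage holds just the item it is currently passing along.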
Re: Python garbage collector/memory manager behaving strangely
On Monday, September 17, 2012 at 11:25:06 AM UTC+8, alex23 wrote:
> The big advantage of this approach is that nothing is read from the
> file into memory until map is called, and given the way they're
> chained together, only one of your lines should be in memory at any
> given time.

The basic question is whether producing the output items really requires all lines of the input text file to be buffered.
Re: Python garbage collector/memory manager behaving strangely
On 09/16/2012 11:25 PM, alex23 wrote:
> On Sep 17, 12:32 pm, "Jadhav, Alok" <alok.jad...@credit-suisse.com>
> wrote:
>> - As you have seen, the line separator is not '\n' but its '|\n'.
>> Sometimes the data itself has '\n' characters in the middle of the line
>> and only way to find true end of the line is that previous character
>> should be a bar '|'. I was not able specify end of line using
>> readlines() function, but I could do it using split() function.
>> (One hack would be to readlines and combine them until I find '|\n'. is
>> there a cleaner way to do this?)
> You can use a generator to take care of your readlines requirements:
>
>     def readlines(f):
>         lines = []
>         while "f is not empty":
>             line = f.readline()
>             if not line: break
>             if len(line) > 2 and line[-2:] == '|\n':
>                 lines.append(line)
>                 yield ''.join(lines)
>                 lines = []
>             else:
>                 lines.append(line)

There are a few changes I'd make:
I'd change the name to something else, so as not to shadow the built-in, and to make it clear in the caller's code that it's not the built-in one.
I'd replace that compound if statement with
    if line.endswith("|\n"):
I'd add a comment saying that partial lines at the end of file are ignored.

>> - Reading whole file at once and processing line by line was much
>> faster. Though speed is not of very important issue here but I think the
>> time it took to parse complete file was reduced to one third of original
>> time.

You don't say what it was faster than. Chances are you went to the other extreme, of doing a read() of 1 byte at a time. Consider using Alex's approach: a generator that in turn calls readline().

--
DaveA
RE: Python garbage collector/memory manager behaving strangely
Thanks for your valuable inputs. This is very helpful.
Re: Python garbage collector/memory manager behaving strangely
On Mon, 17 Sep 2012 06:46:55 -0400, Dave Angel wrote:
> On 09/16/2012 11:25 PM, alex23 wrote:
>> def readlines(f):
>>     lines = []
>>     while "f is not empty":
>>         line = f.readline()
>>         if not line: break
>>         if len(line) > 2 and line[-2:] == '|\n':
>>             lines.append(line)
>>             yield ''.join(lines)
>>             lines = []
>>         else:
>>             lines.append(line)
>
> There's a few changes I'd make:
> I'd change the name to something else, so as not to shadow the built-in,

Which built-in are you referring to? There is no readlines built-in.

py> readlines
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NameError: name 'readlines' is not defined

There is a file.readlines method, but that lives in a different namespace to the function readlines, so there should be no confusion. At least not for a moderately experienced programmer; beginners can be confused by the littlest things sometimes.

> and to make it clear in caller's code that it's not the built-in one.
> I'd replace that compound if statement with
>     if line.endswith("|\n"):
> I'd add a comment saying that partial lines at the end of file are
> ignored.

Or fix the generator so that it doesn't ignore partial lines, or raises an exception, whichever is more appropriate.

--
Steven
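One way the suggestions in this exchange might be combined is sketched below, assuming Python 3. The name iter_records, the terminator and strict parameters, and the sample data are all hypothetical and not from the thread. Using endswith() also means a record consisting of just '|\n' counts as complete, which the original `len(line) > 2` test would not accept.

```python
import io

def iter_records(f, terminator="|\n", strict=False):
    """Yield records that end with `terminator`; embedded newlines are kept.

    A trailing partial record (no terminator before end of file) is
    yielded as-is, or raises ValueError when strict=True.
    """
    parts = []
    for line in f:
        parts.append(line)
        if line.endswith(terminator):
            yield "".join(parts)
            parts = []
    if parts:
        if strict:
            raise ValueError("file ends with an unterminated record")
        yield "".join(parts)

# Made-up sample: the first record contains an embedded newline,
# and the file ends with an unterminated fragment.
sample = io.StringIO("a|b\nc|\nd|\ntail")
records = list(iter_records(sample))
```

Whether to yield or reject the trailing fragment depends on the data, so the sketch leaves that choice to the caller.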
Re: Python garbage collector/memory manager behaving strangely
On 09/17/2012 07:47 AM, Steven D'Aprano wrote:
> On Mon, 17 Sep 2012 06:46:55 -0400, Dave Angel wrote:
>
>> On 09/16/2012 11:25 PM, alex23 wrote:
>>> def readlines(f):
>>>     lines = []
>>>     while "f is not empty":
>>>         line = f.readline()
>>>         if not line: break
>>>         if len(line) > 2 and line[-2:] == '|\n':
>>>             lines.append(line)
>>>             yield ''.join(lines)
>>>             lines = []
>>>         else:
>>>             lines.append(line)
>> There's a few changes I'd make:
>> I'd change the name to something else, so as not to shadow the built-in,
> Which built-in are you referring to? There is no readlines built-in.
>
> py> readlines
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
> NameError: name 'readlines' is not defined
>
> There is a file.readlines method, but that lives in a different namespace
> to the function readlines so there should be no confusion. At least not
> for a moderately experienced programmer, beginners can be confused by the
> littlest things sometimes.

You're right, of course, and that's not restricted to beginners. I've been at this for over 40 years, and I make that kind of mistake once in a while. Fortunately, when I make such a mistake on this forum, you usually pop in to keep me honest. When I make it in code, I either get a runtime error, or no harm is done.

>> and to make it clear in caller's code that it's not the built-in one.
>> I'd replace that compound if statement with
>>     if line.endswith("|\n"):
>> I'd add a comment saying that partial lines at the end of file are
>> ignored.
> Or fix the generator so that it doesn't ignore partial lines, or raises
> an exception, whichever is more appropriate.

--
DaveA
Re: Python garbage collector/memory manager behaving strangely
In article <50570de3$0$29981$c3e8da3$5496439d@news.astraweb.com>,
Steven D'Aprano <steve+comp.lang.python@pearwood.info> wrote:
>On Mon, 17 Sep 2012 06:46:55 -0400, Dave Angel wrote:
>> On 09/16/2012 11:25 PM, alex23 wrote:
>>>
>>> def readlines(f):
>>>     lines = []
>>>     while "f is not empty":
>>>         line = f.readline()
>>>         if not line: break
>>>         if len(line) > 2 and line[-2:] == '|\n':
>>>             lines.append(line)
>>>             yield ''.join(lines)
>>>             lines = []
>>>         else:
>>>             lines.append(line)
>>
>> There's a few changes I'd make:
>> I'd change the name to something else, so as not to shadow the built-in,
>
>Which built-in are you referring to? There is no readlines built-in.
>
>py> readlines
>Traceback (most recent call last):
>  File "<stdin>", line 1, in <module>
>NameError: name 'readlines' is not defined
>
>There is a file.readlines method, but that lives in a different namespace
>to the function readlines so there should be no confusion. At least not
>for a moderately experienced programmer, beginners can be confused by the
>littlest things sometimes.

Actually, as an experienced programmer, I *do* think it is confusing, as evidenced by the mistake Dave made! Segregated namespaces are wonderful (per Zen), but let's not pollute multiple namespaces with the same name, either.

It may not be literally shadowing the built-in, but it definitely mentally shadows the built-in.

--
Aahz (aahz@pythoncraft.com) <*> http://www.pythoncraft.com/
"....Normal is what cuts off your sixth finger and your tail..." --Siobhan
Re: Python garbage collector/memory manager behaving strangely
aahz@pythoncraft.com (Aahz) writes:
> ...
>>>> def readlines(f):
>>>>     lines = []
>>>>     while "f is not empty":
>>>>         line = f.readline()
>>>>         if not line: break
>>>>         if len(line) > 2 and line[-2:] == '|\n':
>>>>             lines.append(line)
>>>>             yield ''.join(lines)
>>>>             lines = []
>>>>         else:
>>>>             lines.append(line)
>>>
>>> There's a few changes I'd make:
>>> I'd change the name to something else, so as not to shadow the built-in,
> ...
> Actually, as an experienced programmer, I *do* think it is confusing as
> evidenced by the mistake Dave made! Segregated namespaces are wonderful
> (per Zen), but let's not pollute multiple namespaces with same name,
> either.
>
> It may not be literally shadowing the built-in, but it definitely
> mentally shadows the built-in.

I disagree with you. Namespaces exist so that, when working in one namespace, I do not need to worry much about other namespaces. Therefore, calling a function "readlines" is very much justified (if it reads lines from a file), even if there were a module around with the name "readlines".

By the way, the module is named "readline" (not "readlines").
Re: Python garbage collector/memory manager behaving strangely
On 17.09.2012 04:28, Jadhav, Alok wrote:
> Thanks Dave for clean explanation. I clearly understand what is going on
> now. I still need some suggestions from you on this.
>
> There are 2 reasons why I was using self.rawfile.read().split('|\n')
> instead of self.rawfile.readlines()
>
> - As you have seen, the line separator is not '\n' but its '|\n'.
> Sometimes the data itself has '\n' characters in the middle of the line
> and only way to find true end of the line is that previous character
> should be a bar '|'. I was not able specify end of line using
> readlines() function, but I could do it using split() function.
> (One hack would be to readlines and combine them until I find '|\n'. is
> there a cleaner way to do this?)
> - Reading whole file at once and processing line by line was much
> faster. Though speed is not of very important issue here but I think the
> time it took to parse complete file was reduced to one third of original
> time.

With

    def itersep(f, sep='\0', buffering=1024, keepsep=True):
        if keepsep:
            keepsep=sep
        else: keepsep=''
        data = f.read(buffering)
        next_line = data # empty? -> end.
        while next_line: # -> data is empty as well.
            lines = data.split(sep)
            for line in lines[:-1]:
                yield line+keepsep
            next_line = f.read(buffering)
            data = lines[-1] + next_line
        # keepsep: only if we have something.
        if (not keepsep) or data:
            yield data

you can iterate over everything you want without needing too much memory. Using a larger "buffering" value might improve speed a little bit.

Thomas
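A short usage sketch for the chunked reader above, assuming Python 3. The function is restated only so the snippet runs on its own, with next_line renamed to next_chunk; the sample data and the tiny buffering=4 are invented. The small buffer makes the first '|\n' separator straddle two reads, which shows that carrying lines[-1] over into the next chunk handles that case.

```python
import io

def itersep(f, sep='\0', buffering=1024, keepsep=True):
    # Same logic as the function posted above, restated so this runs standalone.
    keepsep = sep if keepsep else ''
    data = f.read(buffering)
    next_chunk = data
    while next_chunk:
        lines = data.split(sep)
        for line in lines[:-1]:
            yield line + keepsep
        next_chunk = f.read(buffering)
        # The unfinished tail (possibly half a separator) is carried over.
        data = lines[-1] + next_chunk
    if (not keepsep) or data:
        yield data

# Made-up sample: the first record contains an embedded newline.
sample = io.StringIO("a\nb|\nc|\nd|\n")
records = list(itersep(sample, sep='|\n', buffering=4))
```

For a real file the separator would be passed the same way, with a much larger buffering value.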