Velocity Reviews - Computer Hardware Reviews


Implementing file reading in C/Python

 
 
Sion Arrowsmith
 
      01-12-2009
Grant Edwards <invalid@invalid> wrote:
>On 2009-01-09, Sion Arrowsmith <(E-Mail Removed)> wrote:
>> Grant Edwards <invalid@invalid> wrote:
>>>If I were you, I'd try mmap()ing the file instead of reading it
>>>into string objects one chunk at a time.

>> You've snipped the bit further on in that sentence where the
>> OP says that the file of interest is 2GB. Do you still want to
>> try mmap'ing it?

>Sure. The larger the file, the more you gain from mmap'ing it.
>2GB should easily fit within the process's virtual memory
>space.


Assuming you're in a 64-bit world. Me, I've only got 2GB of address
space available to play in -- mmap'ing all of it is out of the question.

But I suppose that mmap'ing it a chunk at a time instead of reading it
a chunk at a time might be worth considering.

--
\S -- (E-Mail Removed) -- http://www.chaos.org.uk/~sion/
"Frankly I have no feelings towards penguins one way or the other"
-- Arthur C. Clarke
her nu becomeþ se bera eadward ofdun hlæddre heafdes bæce bump bump bump
 
 
 
 
 
sturlamolden
 
      01-12-2009
On Jan 9, 6:41 pm, Sion Arrowsmith <(E-Mail Removed)>
wrote:

> You've snipped the bit further on in that sentence where the OP
> says that the file of interest is 2GB. Do you still want to try
> mmap'ing it?


Python's mmap object does not take an offset parameter. If it did, one
could mmap smaller portions of the file.

 
 
 
 
 
Sion Arrowsmith
 
      01-12-2009
In case the cancel didn't get through:

Sion Arrowsmith <(E-Mail Removed)> wrote:
>Grant Edwards <invalid@invalid> wrote:
>>2GB should easily fit within the process's virtual memory
>>space.

>Assuming you're in a 64bit world. Me, I've only got 2GB of address
>space available to play in -- mmap'ing all of it out of the question.


And today's moral is: try it before posting. Yeah, I can map a 2GB
file no problem, complete with associated 2GB+ allocated VM. The
addressing is clearly not working how I was expecting it to.
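
For reference, mapping a whole file read-only looks roughly like the sketch below. It is written for Python 3's mmap module rather than the interpreter used in this thread, and count_byte_in_file is a made-up helper name, not anything from the OP's code. Passing a length of 0 maps the entire file, so it only succeeds if the process can find enough contiguous address space.

```python
import mmap


def count_byte_in_file(path, byte_value):
    """Count occurrences of one byte value via a read-only mapping of the whole file."""
    needle = bytes([byte_value])
    with open(path, "rb") as f:
        # Length 0 means "map the entire file"; this raises on an empty file.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
            count = 0
            pos = m.find(needle)
            while pos != -1:
                count += 1
                pos = m.find(needle, pos + 1)
            return count
```

The operating system pages the data in on demand, so nothing is copied into Python string objects until you slice the mapping.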

--
\S -- (E-Mail Removed) -- http://www.chaos.org.uk/~sion/
"Frankly I have no feelings towards penguins one way or the other"
-- Arthur C. Clarke
her nu becomeþ se bera eadward ofdun hlæddre heafdes bæce bump bump bump
 
 
sturlamolden
 
      01-12-2009
On Jan 12, 1:52 pm, Sion Arrowsmith <(E-Mail Removed)>
wrote:

> And today's moral is: try it before posting. Yeah, I can map a 2GB
> file no problem, complete with associated 2GB+ allocated VM. The
> addressing is clearly not working how I was expecting it too.


The virtual memory space of a 32 bit process is 4 GB.
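
The figure follows directly from the pointer width: 2**32 bytes is 4 GiB. A quick way to check what your own interpreter was built with is sketched below (Python 3 syntax; address_space_bytes is just an illustrative helper). How much of that space user code can actually map depends on the operating system.

```python
import struct


def address_space_bytes(pointer_bits):
    """Size of a flat address space addressed by pointers of the given width."""
    return 2 ** pointer_bits


# Pointer width of the running interpreter: 4 bytes on a 32-bit build, 8 on 64-bit.
pointer_bits = struct.calcsize("P") * 8
gib = address_space_bytes(pointer_bits) / 2 ** 30
print("%d-bit interpreter, %.0f GiB of virtual address space" % (pointer_bits, gib))
```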

 
 
Hrvoje Niksic
 
      01-12-2009
sturlamolden <(E-Mail Removed)> writes:

> On Jan 9, 6:41 pm, Sion Arrowsmith <(E-Mail Removed)>
> wrote:
>
>> You've snipped the bit further on in that sentence where the OP
>> says that the file of interest is 2GB. Do you still want to try
>> mmap'ing it?

>
> Python's mmap object does not take an offset parameter. If it did, one
> could mmap smaller portions of the file.


As of 2.6 it does, but that might not be of much use if you're using
2.5.x or earlier. If you speak Python/C and really need offset, you
could backport the mmap module from 2.6 and compile it under a
different name for 2.5.
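
With the 2.6-style offset argument, mapping one window at a time might look something like this. It's a sketch in Python 3 syntax; iter_mapped_chunks and its default window size are my own inventions. The one hard constraint is that offset must be a multiple of mmap.ALLOCATIONGRANULARITY, so the window size is rounded down to such a multiple.

```python
import mmap
import os


def iter_mapped_chunks(path, chunk_size=None):
    """Yield successive read-only mmap windows covering the file at `path`."""
    gran = mmap.ALLOCATIONGRANULARITY
    if chunk_size is None:
        chunk_size = 256 * gran  # arbitrary default, chosen for illustration
    # Offsets must be multiples of the granularity, so round the window down.
    chunk_size = max(gran, chunk_size - chunk_size % gran)
    size = os.path.getsize(path)
    with open(path, "rb") as f:
        offset = 0
        while offset < size:
            length = min(chunk_size, size - offset)
            window = mmap.mmap(f.fileno(), length,
                               access=mmap.ACCESS_READ, offset=offset)
            try:
                yield window
            finally:
                window.close()
            offset += length
```

Each window is closed as soon as the loop asks for the next one, so don't hold on to a window past its own iteration. Only one window's worth of address space is in use at any time, which is the point on a cramped 32-bit system.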
 
 
Steve Holden
 
      01-13-2009
sturlamolden wrote:
> On Jan 12, 1:52 pm, Sion Arrowsmith <(E-Mail Removed)>
> wrote:
>
>> And today's moral is: try it before posting. Yeah, I can map a 2GB
>> file no problem, complete with associated 2GB+ allocated VM. The
>> addressing is clearly not working how I was expecting it too.

>
> The virtual memory space of a 32 bit process is 4 GB.
>

I believe, though, that in some environments 2GB of that is mapped onto
the operating system, to allow system calls to access OS memory
structures without any VM remapping being required - see

http://blogs.technet.com/markrussino...7/3155406.aspx.

Things have, however, improved if we are to believe what we read in

http://www.tenouk.com/WinVirtualAddressSpace.html

The very idea of mapping part of a process's virtual address space onto
an area in which "low-level system code resides, so writing to this
region may corrupt the system, with potentially catastrophic
consequences" seems to be asking for trouble to me. It's surprising
things didn't use to go wrong with Windows all the time, really. Oh,
wait a minute, they did, didn't they? Still do for that matter ...

getting-sicker-of-vista-by-the-minute-ly yr's - steve
--
Steve Holden +1 571 484 6266 +1 800 494 3119
Holden Web LLC http://www.holdenweb.com/

 
 
Steve Holden
 
      01-13-2009
sturlamolden wrote:
> On Jan 12, 1:52 pm, Sion Arrowsmith <(E-Mail Removed)>
> wrote:
>
>> And today's moral is: try it before posting. Yeah, I can map a 2GB
>> file no problem, complete with associated 2GB+ allocated VM. The
>> addressing is clearly not working how I was expecting it too.

>
> The virtual memory space of a 32 bit process is 4 GB.
>

After my last post I should also point out

a) That was specific to 32-bit processes, and

b)
http://regions.cmg.org/regions/mcmg/...%20Windows.pdf
describes the situation better, and outlines some steps you can take to
get relief.

regards
Steve
--
Steve Holden +1 571 484 6266 +1 800 494 3119
Holden Web LLC http://www.holdenweb.com/

 
 
Marc 'BlackJack' Rintsch
 
      01-13-2009
On Mon, 12 Jan 2009 21:26:27 -0500, Steve Holden wrote:

> The very idea of mapping part of a process's virtual address space onto
> an area in which "low-level system code resides, so writing to this
> region may corrupt the system, with potentially catastrophic
> consequences" seems to be asking for trouble to me.


That's why those regions are usually "write protected" and "no execution
allowed" from the code in the user area of the virtual address space.

Ciao,
Marc 'BlackJack' Rintsch
 
 
David Bolen
 
      01-14-2009
Johannes Bauer <(E-Mail Removed)> writes:

> Yup, I changed the Python code to behave the same way the C code did -
> however overall it's not much of an improvement: Takes about 15 minutes
> to execute (still factor 23).


Not sure this is completely fair if you're only looking for a pure
Python solution, but to be honest, looping through a gazillion
individual bytes of information sort of begs for trying to offload
that into a library that can execute faster, while maintaining the
convenience of Python outside of the pure number crunching.

I'd assume numeric/numpy might have applicable functions, but I don't
use those libraries much, whereas I've been using OpenCV recently for
a lot of image processing work, and it has matrix/histogram support,
which seems to be a good match for your needs.

For example, assuming the OpenCV library and ctypes-opencv wrapper, add
the following before the file I/O loop:

from opencv import *

# Histogram for each file chunk
hist = cvCreateHist([256], CV_HIST_ARRAY, [(0,256)])

then, replace (using one of your posted methods as a sample):

datamap = {}
for i in data:
    datamap[i] = datamap.get(i, 0) + 1

array = sorted([(b, a) for (a, b) in datamap.items()], reverse=True)
most = ord(array[0][1])

with:

matrix = cvMat(1, len(data), CV_8UC1, data)
cvCalcHist([matrix], hist)
most = cvGetMinMaxHistValue(hist,
                            min_val = False, max_val = False,
                            min_idx = False, max_idx = True)

should give you your results in a fraction of the time. I didn't run
with a full size data file, but for a smaller one using smaller chunks
the OpenCV variant ran in about 1/10 of the time, and that was while
leaving all the other remaining Python code in place.

Note that the results may not be identical to those of some of your other
methods when multiple values have the same count: the OpenCV histogram
min/max call will always pick the lower value in such cases, whereas
some of your code (such as above) will pick the upper value, and your
original code depended on the order of information returned by
dict.items.
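
If you'd rather not depend on the tie-breaking of any particular library, it's easy to pin down yourself. A rough pure-Python sketch (Python 3 bytes, not the OpenCV calls above; most_common_byte is a hypothetical helper) that always picks the lowest byte value on a tie:

```python
def most_common_byte(data):
    """Return the most frequent byte value (0-255) in `data`.

    On a tie the lowest byte value wins; for empty input the result is 0.
    """
    # One C-level pass over the data per possible byte value.
    counts = [data.count(bytes([value])) for value in range(256)]
    # max() returns the first maximal element, i.e. the lowest value on a tie.
    return max(range(256), key=counts.__getitem__)
```

This makes 256 passes over each chunk, but every pass runs inside bytes.count rather than in a Python-level loop over individual bytes.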

This sort of small dedicated high performance choke point is probably
also perfect for something like Pyrex/Cython, although that would
require a compiler to build the extension for the histogram code.

-- David
 
 
mk
 
      01-23-2009
John Machin wrote:
>> The factor of 30 indeed does not seem right -- I have done somewhat
>> similar stuff (calculating Levenshtein distance [edit distance] on words
>> read from very large files), coded the same algorithm in pure Python and
>> C++ (using linked lists in C++) and Python version was 2.5 times slower.


> Levenshtein distance using linked lists? That's novel. Care to
> divulge?


I meant: using linked lists to store words that are compared. I found
using vectors was slow.

Regards,
mk


 