
Velocity Reviews > Newsgroups > Programming > Python > python: ascii read

python: ascii read

 
 
Sebastian Krause
      09-16-2004
Hello,

I tried to read in some large ascii files (200MB-2GB) in Python using
scipy.io.read_array, but it did not work as I expected. The whole idea
was to find a fast Python routine to read in arbitrary ascii files, to
replace Yorick (which I use right now and which is really fast, but not
as general as Python). The problem with scipy.io.read_array was that it
is really slow, returns errors when trying to process large files, and it
also changes (cuts) the files (after scipy.io.read_array processed a 2GB
file its size was only 64MB).

Can someone give me a hint how to use Python to do this job correctly and
fast? (Maybe with another read-in routine.)

Thanks.

Greetings,
Sebastian
 
Alex Martelli
      09-16-2004
Sebastian Krause <(E-Mail Removed)> wrote:

> [snip]
>
> Can someone give me hint how to use Python to do this job correctly and
> fast? (Maybe with another read-in routine.)


If all you need is what you say -- read a huge amount of ASCII data into
memory -- it's hard to beat:

    data = open('thefile.txt').read()

mmap may in fact be preferable for many uses, but it doesn't actually
read (it _maps_ the file into memory instead).
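
For illustration, a minimal sketch of the mmap route (using a small
stand-in file here; slicing the map is what actually faults pages in):

```python
import mmap

# Create a small sample file to map (stand-in for the big data file).
with open('thefile.txt', 'wb') as f:
    f.write(b'1 2 3\n' * 1000)

# Map the file read-only: nothing is copied into memory up front;
# pages are faulted in lazily as they are touched.
with open('thefile.txt', 'rb') as f:
    m = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    first = m[:6]          # only now is the first page actually read
    m.close()
```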


Alex
 
Robert Kern
      09-16-2004
Sebastian Krause wrote:
> [snip]
>
> Can someone give me hint how to use Python to do this job correctly and
> fast? (Maybe with another read-in routine.)


What kind of data is it? What operations do you want to perform on the
data? What platform are you on?

Some of the scipy.io.read_array behaviors that you see look like bugs. We
would greatly appreciate it if you were to send a complete bug report to
the scipy-dev mailing list. Thank you.

--
Robert Kern
(E-Mail Removed)

"In the fields of hell where the grass grows high
Are the graves of dreams allowed to die."
-- Richard Harter
 
Sebastian Krause
      09-16-2004
I did not explicitly mention that the ascii file should be read in as an
array of numbers (either integer or float).
Using open() and read() is very fast, but it only reads in the data as a
string, and it also does not work with large files.

Sebastian

Alex Martelli wrote:
> [snip]

 
Sebastian Krause
      09-16-2004
The input data is a large ascii file of astrophysical parameters
(integer and float) from gas-dynamics calculations. They should be read in
as an array of integer and float numbers, not as a string (as open() and
read() do). Then the array is used to make different plots from the
data and to do some (simple) operations: subtraction and division of
columns. I am using Scipy with Python 2.3.x under Linux (SuSE 9.1).

Sebastian

Robert Kern wrote:
> [snip]
>
> What kind of data is it? What operations do you want to perform on the
> data? What platform are you on?

 
Alex Martelli
      09-16-2004
Sebastian Krause <(E-Mail Removed)> wrote:

> I did not explictly mention that the ascii file should be read in as an
> array of numbers (either integer or float).


Ah, right, you didn't. So I was answering the literal question you
asked rather than the one you had in mind.

> To use open() and read() is very fast, but does only read in the data as
> string and it also does not work with large files.


It works just fine with files as large as you have memory for (and mmap
works for files as large as you have _spare address space_ for, if your
OS is decently good at its job). But if what you want is not the job
that .read() and mmap do, the fact that they _do_ perform that job quite
well on large files is of course of no use to you.

Back to why scipy.io.read_array works so badly for you -- I don't know:
it's rather complicated code, as well as maybe old-ish (it wraps files into
class instances to be able to iterate on their lines) and very general
(lots of options regarding what the separators are, etc, etc). If your
needs are very specific (you know a lot about the format of those huge
files -- e.g. they're column-oriented, or only use whitespace separators
and \n line termination, or other such specifics) you might well be able
to do better -- likely even in Python, worst case in C. I assume you
need Numeric arrays, 2-d, specifically, as the result of reading your
files? Would you know in advance whether you're reading int or float
(it might be faster to have two separate functions)? Could you
pre-dimension the Numeric array and pass it in, or do you need it to
dimension itself dynamically based on file contents? The less
flexibility you need, the simpler and faster the reading can be...
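
A sketch of such a specialized reader, assuming whitespace-separated
floats and \n line termination (pure stdlib; the flat result would then
be reshaped into a 2-d Numeric array; the function name and demo file
are mine, not scipy's):

```python
from array import array

def read_floats(path, ncols):
    """Read a whitespace-separated ascii file of floats into a flat
    array.array('d'); reshape into ncols columns afterwards as needed."""
    data = array('d')
    with open(path) as f:
        for line in f:
            fields = line.split()
            if not fields:
                continue                 # skip blank lines
            assert len(fields) == ncols  # enforce the fixed column count
            data.extend(float(x) for x in fields)
    return data

# Tiny demonstration file (stand-in for the 2GB data file).
with open('demo.txt', 'w') as f:
    f.write('1.0 2.0\n3.0 4.0\n')

flat = read_floats('demo.txt', 2)
```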


Alex
 
Robert Kern
      09-16-2004
Sebastian Krause wrote:
> The input data is a large ascii file of astrophysical parameters
> [snip]


Well, one option is to use the "lines" argument to scipy.io.read_array
to only read in chunks at a time. It probably won't help speed any, but
hopefully it will be correct.
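
The same chunk-at-a-time idea can also be sketched in plain Python
(hypothetical helper name; it parses whitespace-separated numbers a block
of lines at a time, so the whole file never sits in memory at once):

```python
def iter_chunks(path, lines_per_chunk=100000):
    """Yield lists of parsed rows, lines_per_chunk lines at a time."""
    chunk = []
    with open(path) as f:
        for line in f:
            chunk.append([float(x) for x in line.split()])
            if len(chunk) >= lines_per_chunk:
                yield chunk
                chunk = []
    if chunk:                 # emit any trailing partial chunk
        yield chunk

# Tiny demonstration file (stand-in for the 2GB data file).
with open('chunkdemo.txt', 'w') as f:
    f.write('1 2\n3 4\n5 6\n')

chunks = list(iter_chunks('chunkdemo.txt', lines_per_chunk=2))
```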

> Sebastian


--
Robert Kern
(E-Mail Removed)
 
Des Small
      09-16-2004
(E-Mail Removed) (Alex Martelli) writes:

> [snip] The less flexibility you need,
> the simpler and faster the reading can be...


The last time I wanted to be able to read large lumps of numerical
data from an ASCII file, I ended up using (f)lex, for performance
reasons. (Pure C _might_ have been faster still, of course, but it
would _quite certainly_ also have been pure C.)

This has caused minor irritation - the code has been in use through
several upgrades of Python, and it is considered polite to recompile
to match the current C API - but I'd probably do it the same way again
in the same situation.

Des
--
"[T]he structural trend in linguistics which took root with the
International Congresses of the twenties and early thirties [...] had
close and effective connections with phenomenology in its Husserlian
and Hegelian versions." -- Roman Jakobson
 
Brian van den Broek
      09-16-2004
Alex Martelli said unto the world upon 2004-09-16 07:22:
> [snip]
> If all you need is what you say -- read a huge amount of ASCII data into
> memory -- it's hard to beat
> data = open('thefile.txt').read()
>
> mmap may in fact be preferable for many uses, but it doesn't actually
> read (it _maps_ the file into memory instead).
>
>
> Alex


Hi all,

[neophyte question warning]

I'd not been aware of mmap until this post. Looking at the Library
Reference and my trusty copy of Python in a Nutshell, I've gotten some
idea of the differences between using mmap and the .read() method on a
file object -- such as that it returns a mutable object vs. an immutable
string, the constraint on slice assignment that len(oldslice) must equal
len(newslice), etc.

But I don't really feel I've a handle on the significance of saying it
maps the file into memory versus reading the file. The naive thought is
that since the data gets into memory, the file must be read. But this
makes me sure I'm missing a distinction in the terminology. Explanations
and pointers for what to read gratefully received.

And, since mmap behaves differently on different platforms: I'm mostly a
win32 user looking to transition to Linux.

Best to all,

Brian vdB

 
Heiko Wundram
      09-16-2004
On Thursday, 16 September 2004 at 17:56, Brian van den Broek wrote:
> But I don't really feel I've a handle on the significance of saying it
> maps the file into memory versus reading the file. The naive thought is
> that since the data gets into memory, the file must be read. But this
> makes me sure I'm missing a distinction in the terminology. Explanations
> and pointers for what to read gratefully received.


read()ing a file into memory does what it says; it reads the binary data from
the disk all at once, and allocates main memory (as needed) to fit all the
data there. Memory mapping a file (or device or whatever) means that the
virtual memory architecture is involved. What happens here:

mmapping a file creates virtual memory pages (just like virtual memory which
is put into your paging file), which are registered with the MMU of the
processor as being absent initially.

Now, when the program tries to access the memory page (pages have some fixed
small size, like 4k for most Pentium-style computers), a (page) fault is
generated by the MMU, which invokes the operating system's handler for page
faults. Now that the operating system sees that a certain page is accessed
(from the page address it can deduce the offset in the file that you're
trying to access), it loads the corresponding page from disk, puts it
into memory at some position, and alters the page-table entry to mark it
present.

Future accesses to the page will take place immediately (without a page fault
taking place).

Changes in memory are written to disk once the page is flushed (meaning that
it gets removed from main memory because too few pages of real main memory
are available). When a page is forcefully flushed (not due to closing
the mmap), the operating system marks the page-table entry as absent again;
the next time the program tries to access this location, a page fault again
takes place, and the OS can reload the page from disk.

For speed, the operating system allows you to mmap read-only, which means that
once a page is discarded, it does not need to be written back to disk (which
of course is faster). Some MMUs (IIRC not the Pentium-class MMU) set a dirty
bit on the page-table entry once the page has been altered, this can also be
used to control whether the page needs to be written back to disk after
access.

So, basically what you get is load-on-demand file handling, which is similar
to what the paging file (virtual memory file) on win32 does for ordinary
memory. Actually, internally, the architecture for handling mmapped files and
virtual memory is the same, and you could think of the swap file as an
operating-system-mmapped file, from which programs can allocate slices
through some OS calls (well, actually through the normal malloc/calloc
calls).
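
You can see the write-back side of this in miniature with Python's mmap
module (file name is just for the demo; with a writable map, mutations
become dirty pages that flush back into the file):

```python
import mmap

# Create a small file to map.
with open('pagedemo.bin', 'wb') as f:
    f.write(b'hello world')

# Map it writable: slice assignment mutates the mapped page in place.
with open('pagedemo.bin', 'r+b') as f:
    m = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_WRITE)
    m[0:5] = b'HELLO'   # same length required for slice assignment
    m.flush()           # force the dirty page back to disk
    m.close()

# The change is now visible through an ordinary read.
with open('pagedemo.bin', 'rb') as f:
    contents = f.read()
```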

HTH!

Heiko.
 