How to write fast into a file in python?

 
 
lokeshkoppaka@gmail.com, 05-17-2013


I need to write numbers into a file up to 50MB, and it should be fast.
Can anyone help me with how to do that?
I have written the following code:
-----------------------------------------------------------------------
import time

def create_file_numbers_old(filename, size):
    start = time.clock()

    value = 0
    with open(filename, "w") as f:
        while f.tell() < size:
            f.write("{0}\n".format(value))
            value += 1

    end = time.clock()

    print "time taken to write a file of size", size, " is ", (end - start), "seconds \n"
-----------------------------------------------------------------------
It takes about 20 sec; I need it to be 5 to 10 times faster than that.
 
Steven D'Aprano, 05-17-2013
On Thu, 16 May 2013 20:20:26 -0700, lokeshkoppaka wrote:

> I need to write numbers into a file up to 50MB, and it should be fast.
> Can anyone help me with how to do that?
> I have written the following code:
> ----------------------------------------------------------------------
> import time
>
> def create_file_numbers_old(filename, size):
>     start = time.clock()
>
>     value = 0
>     with open(filename, "w") as f:
>         while f.tell() < size:
>             f.write("{0}\n".format(value))
>             value += 1
>
>     end = time.clock()
>
>     print "time taken to write a file of size", size, " is ", (end - start), "seconds \n"
> ----------------------------------------------------------------------
> It takes about 20 sec; I need it to be 5 to 10 times faster than that.



20 seconds to write how many numbers? If you are doing

create_file_numbers_old(filename, 5)

then 20 seconds is really slow. But if you are doing:

create_file_numbers_old(filename, 50000000000000)

then 20 seconds is amazingly fast.


Try this instead, it may be a little faster:


def create_file_numbers_old(filename, size):
    count = value = 0
    with open(filename, 'w') as f:
        while count < size:
            s = '%d\n' % value
            f.write(s)
            count += len(s)
            value += 1
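
It's called like the original; for instance (the filename here is just
an example):

import time

start = time.clock()
create_file_numbers_old('numbers.dat', 50*1024*1024)   # 50MB target
print 'took', time.clock() - start, 'seconds'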


If this is still too slow, you can try three other tactics:

1) pre-calculate the largest integer that will fit in `size` bytes, then
use a for-loop instead of a while loop (see the sketch after this list):

maxn = calculation(...)
with open(filename, 'w') as f:
    for i in xrange(maxn):
        f.write('%d\n' % i)

2) Write an extension module in C that writes to the file.

3) Get a faster hard drive, and avoid writing over a network.
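
For tactic 1, here is a minimal sketch of what `calculation(...)` could
be (an illustration, not Steven's actual code): each d-digit number
costs d+1 bytes, the digits plus the newline.

def max_count(size):
    # How many of 0, 1, 2, ... fit in `size` bytes, one per line?
    total = n = 0
    digits = 1
    while True:
        # Ten 1-digit numbers (0-9); otherwise 9 * 10**(d-1) d-digit ones.
        block = 10 if digits == 1 else 9 * 10 ** (digits - 1)
        cost = block * (digits + 1)   # bytes for this whole block
        if total + cost >= size:
            n += (size - total) // (digits + 1)   # partial block fits
            return n
        total += cost
        n += block
        digits += 1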


--
Steven
 
lokeshkoppaka@gmail.com, 05-17-2013
On Friday, May 17, 2013 8:50:26 AM UTC+5:30, (E-Mail Removed) wrote:
> I need to write numbers into a file up to 50MB, and it should be fast.
> Can anyone help me with how to do that?
> I have written the following code:
> ----------------------------------------------------------------------
> import time
>
> def create_file_numbers_old(filename, size):
>     start = time.clock()
>
>     value = 0
>     with open(filename, "w") as f:
>         while f.tell() < size:
>             f.write("{0}\n".format(value))
>             value += 1
>
>     end = time.clock()
>
>     print "time taken to write a file of size", size, " is ", (end - start), "seconds \n"
> ----------------------------------------------------------------------
> It takes about 20 sec; I need it to be 5 to 10 times faster than that.

size = 50mb
 
 
Dave Angel, 05-17-2013
On 05/17/2013 12:35 AM, (E-Mail Removed) wrote:
> On Friday, May 17, 2013 8:50:26 AM UTC+5:30, (E-Mail Removed) wrote:
>> I need to write numbers into a file up to 50MB, and it should be fast.
>> Can anyone help me with how to do that?
>> I have written the following code:
>> <SNIP>
>> value = 0
>> with open(filename, "w") as f:
>>     while f.tell() < size:
>>         f.write("{0}\n".format(value))
>> <SNIP more double-spaced nonsense from googlegroups>

If you must use googlegroups, at least read this
http://wiki.python.org/moin/GoogleGroupsPython.
>> It takes about 20 sec; I need it to be 5 to 10 times faster than that.

> size = 50mb


Most of the time is spent figuring out whether the file has reached its
limit size. If you want Python to go fast, just specify the data. On
my Linux system, it takes 11 seconds to write the first 6338888 values,
which is just under 50MB. If I write the obvious loop, writing that
many values takes 0.25 seconds.
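
Presumably "the obvious loop" is something like this sketch (the count
6338888 is the figure measured above; the filename is illustrative):

with open('numbers.dat', 'w') as f:
    for value in xrange(6338888):
        f.write('%d\n' % value)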

--
DaveA
 
Carlos Nepomuceno, 05-17-2013
I've got the following results on my desktop PC (Win7/Python2.7.5):

C:\src\Python>python -m timeit -cvn3 -r3 "execfile('fastwrite2.py')"
raw times: 123 126 125
3 loops, best of 3: 41 sec per loop

C:\src\Python>python -m timeit -cvn3 -r3 "execfile('fastwrite5.py')"
raw times: 34 34.3 34
3 loops, best of 3: 11.3 sec per loop

C:\src\Python>python -m timeit -cvn3 -r3 "execfile('fastwrite6.py')"
raw times: 0.4 0.447 0.391
3 loops, best of 3: 130 msec per loop


If you can just copy a preexisting file it will surely increase the speed to the levels you need, but doing the cStringIO operations alone already cuts the time by 72%.

Strangely, I just realised that the time it takes to complete such
scripts is the same no matter which drive I run them on. The results
are the same for an SSD (main drive) and a HDD.

I think it's very strange that it takes 11.3s to write 50MB (4.4MB/s)
sequentially to an SSD which is capable of 140MB/s.

Is that a Python problem? Why does it take the same time on the HDD?


### fastwrite2.py ###   <<< this is your code
size = 50*1024*1024
value = 0
filename = 'fastwrite2.dat'
with open(filename, "w") as f:
    while f.tell() < size:
        f.write("{0}\n".format(value))
        value += 1
    f.close()   # redundant: the with block already closes f


### fastwrite5.py ###
import cStringIO
size = 50*1024*1024
value = 0
filename = 'fastwrite5.dat'
x = 0
b = cStringIO.StringIO()
while x < size:
    line = '{0}\n'.format(value)
    b.write(line)
    value += 1
    x += len(line)+1
f = open(filename, 'w')
f.write(b.getvalue())
f.close()
b.close()


### fastwrite6.py ###
import shutil
src = 'fastwrite.dat'
dst = 'fastwrite6.dat'
shutil.copyfile(src, dst)



 
Steven D'Aprano, 05-17-2013
On Fri, 17 May 2013 18:20:33 +0300, Carlos Nepomuceno wrote:

> I've got the following results on my desktop PC (Win7/Python2.7.5):
>
> C:\src\Python>python -m timeit -cvn3 -r3 "execfile('fastwrite2.py')" raw
> times: 123 126 125
> 3 loops, best of 3: 41 sec per loop


Your times here are increased significantly by using execfile. Using
execfile means that instead of compiling the code once, then executing
many times, it gets compiled over and over and over and over again. In my
experience, using exec, execfile or eval makes your code ten or twenty
times slower:

[steve@ando ~]$ python -m timeit 'x = 100; y = x/3'
1000000 loops, best of 3: 0.175 usec per loop
[steve@ando ~]$ python -m timeit 'exec("x = 100; y = x/3")'
10000 loops, best of 3: 37.8 usec per loop


> Strangely, I just realised that the time it takes to complete such
> scripts is the same no matter which drive I run them on. The results
> are the same for an SSD (main drive) and a HDD.


There's nothing strange here. The time you measure is dominated by three
things, in decreasing order of importance:

* the poor choice of execfile dominates the time taken;

* followed by choice of algorithm;

* followed by the time it actually takes to write to the disk, which is
probably insignificant compared to the other two, regardless of whether
you are using a HDD or SSD.

Until you optimize the code, optimizing the media is a waste of time.


> I think it's very strange that it takes 11.3s to write 50MB (4.4MB/s)
> sequentially to an SSD which is capable of 140MB/s.


It doesn't. It takes 11.3 seconds to open a file, read it into memory,
parse it, compile it into byte-code, and only then execute it. My
prediction is that the call to f.write() and f.close() probably take a
fraction of a second, and nearly all of the rest of the time is taken by
other calculations.
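
One way to test that prediction (a sketch for illustration: build the
string first, then time only the file I/O; the filename is made up):

import time
import cStringIO

b = cStringIO.StringIO()
value = 0
while b.tell() < 50*1024*1024:
    b.write('%d\n' % value)
    value += 1

t0 = time.clock()
f = open('fastwrite_io_only.dat', 'w')
f.write(b.getvalue())
f.close()
print 'write+close took', time.clock() - t0, 'seconds'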



--
Steven
 
Carlos Nepomuceno, 05-17-2013
Thank you Steve! You are totally right!

It takes about 0.2s for the f.write() to return, certainly because it writes to the system file cache (~250MB/s).

Using a little bit different approach I've got:

C:\src\Python>python -m timeit -cvn3 -r3 -s"from fastwrite5r import run" "run()"
raw times: 24 25.1 24.4
3 loops, best of 3: 8 sec per loop

This time it took 8s to complete, down from the previous 11.3s.

Are those extra 3.3s the time for the "open, read, parse, compile" steps
you mentioned?

If so, the execute step really takes 8s, right?

Why does it take so long to build the string to be written? Can it get faster?

Thanks in advance!



### fastwrite5r.py ###
def run():
    import cStringIO
    size = 50*1024*1024
    value = 0
    filename = 'fastwrite5.dat'
    x = 0
    b = cStringIO.StringIO()
    while x < size:
        line = '{0}\n'.format(value)
        b.write(line)
        value += 1
        x += len(line)+1
    f = open(filename, 'w')
    f.write(b.getvalue())
    f.close()
    b.close()

if __name__ == '__main__':
    run()
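
On whether the build step can get faster: one common idiom (a sketch,
not from the thread) is to append the lines to a list and join them
once, avoiding millions of cStringIO write() calls:

def run_join(filename='fastwrite_join.dat', size=50*1024*1024):
    # Collect the lines in a list, then write the joined result in one go.
    chunks = []
    total = value = 0
    while total < size:
        line = '%d\n' % value
        chunks.append(line)
        total += len(line)
        value += 1
    with open(filename, 'w') as f:
        f.write(''.join(chunks))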





 
Steven D'Aprano, 05-17-2013
On Fri, 17 May 2013 18:20:33 +0300, Carlos Nepomuceno wrote:

> ### fastwrite5.py ###
> import cStringIO
> size = 50*1024*1024
> value = 0
> filename = 'fastwrite5.dat'
> x = 0
> b = cStringIO.StringIO()
> while x < size:
>     line = '{0}\n'.format(value)
>     b.write(line)
>     value += 1
>     x += len(line)+1


Oh, I forgot to mention: you have a bug in this function. You're already
including the newline in the len(line), so there is no need to add one.
The result is that you only generate 44MB instead of 50MB.
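
In other words, the counting line would then just be (a sketch of the fix):

x += len(line)   # len(line) already includes the '\n'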

> f = open(filename, 'w')
> f.write(b.getvalue())
> f.close()
> b.close()


Here are the results of profiling the above on my computer. Including the
overhead of the profiler, it takes just over 50 seconds to run your file
on my computer.

[steve@ando ~]$ python -m cProfile fastwrite5.py
17846645 function calls in 53.575 seconds

Ordered by: standard name

ncalls tottime percall cumtime percall filename:lineno(function)
1 30.561 30.561 53.575 53.575 fastwrite5.py:1(<module>)
1 0.000 0.000 0.000 0.000 {cStringIO.StringIO}
5948879 5.582 0.000 5.582 0.000 {len}
1 0.004 0.004 0.004 0.004 {method 'close' of 'cStringIO.StringO' objects}
1 0.000 0.000 0.000 0.000 {method 'close' of 'file' objects}
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
5948879 9.979 0.000 9.979 0.000 {method 'format' of 'str' objects}
1 0.103 0.103 0.103 0.103 {method 'getvalue' of 'cStringIO.StringO' objects}
5948879 7.135 0.000 7.135 0.000 {method 'write' of 'cStringIO.StringO' objects}
1 0.211 0.211 0.211 0.211 {method 'write' of 'file' objects}
1 0.000 0.000 0.000 0.000 {open}


As you can see, the time is dominated by repeatedly calling len(),
str.format() and StringIO.write() methods. Actually writing the data to
the file is quite a small percentage of the cumulative time.

So, here's another version, this time using a pre-calculated limit. I
cheated and just copied the result from the fastwrite5 output.

# fasterwrite.py
filename = 'fasterwrite.dat'
with open(filename, 'w') as f:
    for i in xrange(5948879):  # Actually only 44MB, not 50MB.
        f.write('%d\n' % i)


And the profile results are about twice as fast as fastwrite5 above, with
only 8 seconds in total writing to my HDD.

[steve@ando ~]$ python -m cProfile fasterwrite.py
5948882 function calls in 28.840 seconds

Ordered by: standard name

ncalls tottime percall cumtime percall filename:lineno(function)
1 20.592 20.592 28.840 28.840 fasterwrite.py:1(<module>)
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
5948879 8.229 0.000 8.229 0.000 {method 'write' of 'file' objects}
1 0.019 0.019 0.019 0.019 {open}


Without the overhead of the profiler, it is a little faster:

[steve@ando ~]$ time python fasterwrite.py

real 0m16.187s
user 0m13.553s
sys 0m0.508s


It is still slower than the heavily optimized dd command, but not
unreasonably slow for a high-level language:

[steve@ando ~]$ time dd if=fasterwrite.dat of=copy.dat
90781+1 records in
90781+1 records out
46479922 bytes (46 MB) copied, 0.737009 seconds, 63.1 MB/s

real 0m0.786s
user 0m0.071s
sys 0m0.595s




--
Steven
 
Carlos Nepomuceno, 05-17-2013
You've hit the bullseye!

Thanks a lot!!!

> Oh, I forgot to mention: you have a bug in this function. You're already
> including the newline in the len(line), so there is no need to add one.
> The result is that you only generate 44MB instead of 50MB.


That's because I'm running on Windows.
What's the fastest way to check whether '\n' translates to 2 bytes in the file?
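
One quick check (a sketch: write a single '\n' in text mode and count
the bytes that land on disk; len(os.linesep) gives the same answer
without touching the disk):

import os

with open('_nl_probe.tmp', 'w') as f:
    f.write('\n')
print os.path.getsize('_nl_probe.tmp')   # 2 on Windows text mode, 1 on Unix
os.remove('_nl_probe.tmp')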

> Here are the results of profiling the above on my computer. Including the
> overhead of the profiler, it takes just over 50 seconds to run your file
> on my computer.
>
> [steve@ando ~]$ python -m cProfile fastwrite5.py
> 17846645 function calls in 53.575 seconds
>


I didn't know the cProfile module. Thanks a lot!

> Ordered by: standard name
>
> ncalls tottime percall cumtime percall filename:lineno(function)
> 1 30.561 30.561 53.575 53.575 fastwrite5.py:1(<module>)
> 1 0.000 0.000 0.000 0.000 {cStringIO.StringIO}
> 5948879 5.582 0.000 5.582 0.000 {len}
> 1 0.004 0.004 0.004 0.004 {method 'close' of 'cStringIO.StringO' objects}
> 1 0.000 0.000 0.000 0.000 {method 'close' of 'file' objects}
> 1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
> 5948879 9.979 0.000 9.979 0.000 {method 'format' of 'str' objects}
> 1 0.103 0.103 0.103 0.103 {method 'getvalue' of 'cStringIO.StringO' objects}
> 5948879 7.135 0.000 7.135 0.000 {method 'write' of 'cStringIO.StringO' objects}
> 1 0.211 0.211 0.211 0.211 {method 'write' of 'file' objects}
> 1 0.000 0.000 0.000 0.000 {open}
>
>
> As you can see, the time is dominated by repeatedly calling len(),
> str.format() and StringIO.write() methods. Actually writing the data to
> the file is quite a small percentage of the cumulative time.
>
> So, here's another version, this time using a pre-calculated limit. I
> cheated and just copied the result from the fastwrite5 output
>
> # fasterwrite.py
> filename = 'fasterwrite.dat'
> with open(filename, 'w') as f:
>     for i in xrange(5948879):  # Actually only 44MB, not 50MB.
>         f.write('%d\n' % i)
>


I had the same idea but kept the original method because I didn't want to spend time writing a function to calculate the number of iterations needed to deliver 50MB of data.

> And the profile results are about twice as fast as fastwrite5 above, with
> only 8 seconds in total writing to my HDD.
>
> [steve@ando ~]$ python -m cProfile fasterwrite.py
> 5948882 function calls in 28.840 seconds
>
> Ordered by: standard name
>
> ncalls tottime percall cumtime percall filename:lineno(function)
> 1 20.592 20.592 28.840 28.840 fasterwrite.py:1(<module>)
> 1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
> 5948879 8.229 0.000 8.229 0.000 {method 'write' of 'file' objects}
> 1 0.019 0.019 0.019 0.019 {open}
>


I thought there would be a call to the format method behind "'%d\n' % i".
It seems the % operator is a lot faster than format.
I stopped using it because I read it was going to be deprecated.
Why replace such a great and fast operator with a slow method? I mean,
why is format preferred over %?
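
For what it's worth, the gap is easy to measure on any machine (commands
only; the numbers will vary):

python -m timeit "'%d\n' % 12345"
python -m timeit "'{0}\n'.format(12345)"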

 
Carlos Nepomuceno, 05-17-2013
I think the following update will make the code more portable (it needs
`import os` at the top):

x += len(line) + len(os.linesep) - 1

Not sure if it's the fastest way to achieve that. :/
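
In context, the loop would look something like this (a sketch, assuming
the file is written in text mode, where each '\n' becomes os.linesep on
disk):

import os

extra = len(os.linesep) - 1   # 1 on Windows text mode, 0 on Unix
while x < size:
    line = '{0}\n'.format(value)
    b.write(line)
    value += 1
    x += len(line) + extra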

> On Fri, 17 May 2013 18:20:33 +0300, Carlos Nepomuceno wrote:
>
>> ### fastwrite5.py ###
>> import cStringIO
>> size = 50*1024*1024
>> value = 0
>> filename = 'fastwrite5.dat'
>> x = 0
>> b = cStringIO.StringIO()
>> while x < size:
>>     line = '{0}\n'.format(value)
>>     b.write(line)
>>     value += 1
>>     x += len(line)+1

>
> Oh, I forgot to mention: you have a bug in this function. You're already
> including the newline in the len(line), so there is no need to add one.
> The result is that you only generate 44MB instead of 50MB.
 