Re: Multi thread reading a file

 
 
Gabriel Genellina
07-01-2009
On Tue, 30 Jun 2009 22:52:18 -0300, Mag Gam <(E-Mail Removed)> wrote:

> I am very new to python and I am in the process of loading a very
> large compressed csv file into another format. I was wondering if I
> can do this in a multi thread approach.


Does the format conversion involve a significant processing time? If not,
the total time is dominated by the I/O time (reading and writing the file)
so it's doubtful you gain anything from multiple threads.

> Here is the pseudo code I was thinking about:
>
> Let T = Total number of lines in a file, Example 1000000 (1 million
> lines)
> Let B = Total number of lines in a buffer, for example 10000 lines
>
>
> Create a thread to read until buffer
> Create another thread to read buffer+buffer (so we have 2 threads
> now; but since the file is zipped I have to wait until the first
> thread is completed, unless someone knows of a clever technique).
> Write the content of thread 1 into a numpy array
> Write the content of thread 2 into a numpy array


Can you process each line independently? Is the record order important? If
not (or at least, if some local dis-ordering is acceptable), you may use a
few worker threads (doing the conversion), feed them through a Queue object,
put the converted lines into another Queue, and let another thread write the
results to the destination file.

import Queue, threading, csv

def convert(in_queue, out_queue):
    while True:
        row = in_queue.get()
        if row is None: break
        # ... convert row
        out_queue.put(converted_line)

def write_output(out_queue):
    while True:
        line = out_queue.get()
        if line is None: break
        # ... write line to output file

in_queue = Queue.Queue()
out_queue = Queue.Queue()
tlist = []
for i in range(4):
    t = threading.Thread(target=convert, args=(in_queue, out_queue))
    t.start()
    tlist.append(t)
output_thread = threading.Thread(target=write_output, args=(out_queue,))
output_thread.start()

with open("...") as csvfile:
    reader = csv.reader(csvfile, ...)
    for row in reader:
        in_queue.put(row)

for t in tlist: in_queue.put(None)   # indicate end-of-work
for t in tlist: t.join()             # wait until finished
out_queue.put(None)
output_thread.join()                 # wait until finished

--
Gabriel Genellina

 
Stefan Behnel
07-01-2009
Gabriel Genellina wrote:
> On Tue, 30 Jun 2009 22:52:18 -0300, Mag Gam <(E-Mail Removed)> wrote:
>
>> I am very new to python and I am in the process of loading a very
>> large compressed csv file into another format. I was wondering if I
>> can do this in a multi thread approach.

>
> Does the format conversion involve a significant processing time? If
> not, the total time is dominated by the I/O time (reading and writing
> the file) so it's doubtful you gain anything from multiple threads.


Well, the OP didn't say anything about multiple processors, so multiple
threads may not help with respect to processing time. However, if the file
is large and the OS can schedule the I/O in a way that avoids a seek
disaster (hard to assure with today's hard disk storage density, although
SSDs may benefit), multiple threads reading multiple partial streams may
still reduce the overall runtime due to increased I/O throughput.

That said, the OP mentioned that the data is compressed, so I doubt that
I/O bandwidth is the problem here. As another poster put it: why bother?
Run a few benchmarks first to see where (and if!) things really get slow,
and then check what to do about the real problem.
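
A minimal sketch of the kind of benchmark suggested above, timing the
read/decompress step separately from the conversion step (the gzip format,
file name and convert_row() are assumptions for illustration, not code from
the thread):

import csv, gzip, time

def benchmark(path):
    # Time read + decompress + CSV parsing, with no conversion at all.
    start = time.time()
    f = gzip.open(path, "rb")
    rows = list(csv.reader(f))
    f.close()
    read_time = time.time() - start

    # Time the conversion step alone, on the already-parsed rows.
    start = time.time()
    for row in rows:
        convert_row(row)              # hypothetical per-row conversion
    convert_time = time.time() - start

    print("read+decompress: %.2fs  convert: %.2fs" % (read_time, convert_time))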

Stefan
 
Gabriel Genellina
07-02-2009
On Wed, 01 Jul 2009 12:49:31 -0300, Scott David Daniels
<(E-Mail Removed)> wrote:

> Gabriel Genellina wrote:
>> ...
>> def convert(in_queue, out_queue):
>>     while True:
>>         row = in_queue.get()
>>         if row is None: break
>>         # ... convert row
>>         out_queue.put(converted_line)

>
> These loops work well with the two-argument version of iter,
> which is easy to forget, but quite useful to have in your bag
> of tricks:
>
>     def convert(in_queue, out_queue):
>         for row in iter(in_queue.get, None):
>             # ... convert row
>             out_queue.put(converted_line)


Yep, I always forget about that variant of iter() -- very handy!

--
Gabriel Genellina

 
ryles
07-03-2009
On Jul 2, 6:10 am, "Gabriel Genellina" <(E-Mail Removed)> wrote:
> On Wed, 01 Jul 2009 12:49:31 -0300, Scott David Daniels
> <(E-Mail Removed)> wrote:
> > These loops work well with the two-argument version of iter,
> > which is easy to forget, but quite useful to have in your bag
> > of tricks:
>
> >     def convert(in_queue, out_queue):
> >         for row in iter(in_queue.get, None):
> >             # ... convert row
> >             out_queue.put(converted_line)
>
> Yep, I always forget about that variant of iter() -- very handy!


Yes, at first glance using iter() here seems quite elegant and clever.
You might even pat yourself on the back, or treat yourself to an ice
cream cone, as I once did. There is one subtle distinction, however.
Please allow me to demonstrate.

>>> import Queue
>>>
>>> queue = Queue.Queue()
>>>
>>> queue.put(1)
>>> queue.put("la la la")
>>> queue.put(None)
>>>
>>> list(iter(queue.get, None))
[1, 'la la la']
>>>
>>> # Cool, it really works! I'm going to change all my old code to use this new and *improved*...
>>>
>>> # And then one day your user inevitably does something like this.
>>>
>>> class A(object):
...     def __init__(self, value):
...         self.value = value
...
...     def __eq__(self, other):
...         return self.value == other.value
...
>>> queue.put(A(1))
>>> queue.put(None)
>>>
>>> # And then this happens inside your 'generic' code (which probably even passed your unit tests).
>>>
>>> list(iter(queue.get, None))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 5, in __eq__
AttributeError: 'NoneType' object has no attribute 'value'
>>>
>>> # Oh... yeah. I really *did* want 'is None' and not '== None', which iter() will do. Sorry guys!

Please don't let this happen to you too.
 
Paul Rubin
07-03-2009
ryles <(E-Mail Removed)> writes:
> >>> # Oh... yeah. I really *did* want 'is None' and not '== None',
> >>> # which iter() will do. Sorry guys!
>
> Please don't let this happen to you too.


None is a perfectly good value to put onto a queue. I prefer
using a unique sentinel to mark the end of the stream:

sentinel = object()
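
A minimal sketch of how the worker loop from earlier in the thread might
look with such a sentinel and an explicit identity test (the convert,
in_queue and out_queue names are reused from the earlier example purely for
illustration):

sentinel = object()   # unique end-of-stream marker

def convert(in_queue, out_queue):
    while True:
        row = in_queue.get()
        if row is sentinel:        # identity test: immune to any custom __eq__ on the data
            break
        # ... convert row
        out_queue.put(row)

# The producer then signals completion with one sentinel per worker thread:
# for t in tlist: in_queue.put(sentinel)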
 
ryles
07-03-2009
On Jul 2, 10:20 pm, Paul Rubin <http://(E-Mail Removed)> wrote:
> ryles <(E-Mail Removed)> writes:
> > >>> # Oh... yeah. I really *did* want 'is None' and not '== None',
> > >>> # which iter() will do. Sorry guys!
> >
> > Please don't let this happen to you too.
>
> None is a perfectly good value to put onto a queue. I prefer
> using a unique sentinel to mark the end of the stream:
>
>     sentinel = object()


I agree, this is cleaner than None. We're still in the same boat, though,
regarding iter(). Either it's 'item == None' or 'item == object()', and
depending on the type, __eq__ can introduce some avoidable risk.

FWIW, even object() has its disadvantages. Namely, it doesn't work for
multiprocessing.Queue which pickles and unpickles, thus giving you a
new object. One way to deal with this is to define a "Stopper" class
and type check objects taken from the queue. This is not news to
anyone who's worked with multiprocessing.Queue, though.
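
A rough sketch of that type-checking approach (the Stopper name comes from
the post above; the rest of the scaffolding is assumed for illustration):

import multiprocessing

class Stopper(object):
    """End-of-stream marker; the *type* is checked, so a pickled copy works fine."""
    pass

def worker(in_queue):
    while True:
        item = in_queue.get()
        if isinstance(item, Stopper):   # survives multiprocessing's pickle/unpickle round trip
            break
        # ... process item

if __name__ == "__main__":
    q = multiprocessing.Queue()
    p = multiprocessing.Process(target=worker, args=(q,))
    p.start()
    q.put("some work")
    q.put(Stopper())                    # only the type matters, not object identity
    p.join()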
 
Paul Rubin
07-03-2009
ryles <(E-Mail Removed)> writes:
> >     sentinel = object()
>
> I agree, this is cleaner than None. We're still in the same boat,
> though, regarding iter(). Either it's 'item == None' or 'item == object()'


Use "item is sentinel".
 
Gabriel Genellina
07-03-2009
On Fri, 03 Jul 2009 00:15:40 -0300, <(E-Mail Removed)> wrote:

> ryles <(E-Mail Removed)> writes:
>> >     sentinel = object()
>>
>> I agree, this is cleaner than None. We're still in the same boat,
>> though, regarding iter(). Either it's 'item == None' or 'item == object()'
>
> Use "item is sentinel".


We're talking about the iter() builtin behavior, and that uses ==
internally.

It could have used an identity test, and that would be better for this
specific case. But then iter(somefile.read, '') wouldn't work. A
compromise solution is required; since one can customize the equality test
but not the identity test, the former has a small advantage. (I don't know
if this was the actual reason, or even if this really was a conscious
decision, but that's why *I* would choose == to test against the sentinel
value.)
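
For context, the equality-based file-reading idiom referred to above looks
roughly like this (the file name, chunk size and process() function are
illustrative assumptions, not code from the thread):

# Read a file in fixed-size chunks until read() returns an empty string.
# iter() stops when the returned value *equals* the sentinel ''; an
# identity test on a freshly returned string is not guaranteed to match,
# which is one reason == is the more general choice.
f = open("data.csv", "rb")
for chunk in iter(lambda: f.read(8192), ''):
    process(chunk)    # hypothetical per-chunk handler
f.close()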

--
Gabriel Genellina

 
Paul Rubin
07-03-2009
"Gabriel Genellina" <(E-Mail Removed)> writes:
> We're talking about the iter() builtin behavior, and that uses ==
> internally.


Oh, I see. Drat.

> It could have used an identity test, and that would be better for this
> specific case. But then iter(somefile.read, '') wouldn't work.


Yeah, it should allow supplying a predicate instead of using == on
a value. How about (untested):

    from itertools import *
    ...
    for row in takewhile(lambda x: x is not sentinel,
                         starmap(in_queue.get, repeat(()))):
        ...
 
ryles
07-03-2009
On Jul 2, 11:55 pm, Paul Rubin <http://(E-Mail Removed)> wrote:
> Yeah, it should allow supplying a predicate instead of using == on
> a value. How about (untested):
>
>     from itertools import *
>     ...
>     for row in takewhile(lambda x: x is not sentinel,
>                          starmap(in_queue.get, repeat(()))):
>         ...


Yeah, it's a small recipe I'm sure a lot of others have written as
well. My old version:

def iterwhile(callable_, predicate):
    """ Like iter() but with a predicate instead of a sentinel. """
    return itertools.takewhile(predicate, repeatfunc(callable_))

where repeatfunc is as defined here:

http://docs.python.org/library/itertools.html#recipes
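
For reference, the repeatfunc recipe from that documentation page is
essentially the following:

from itertools import starmap, repeat

def repeatfunc(func, times=None, *args):
    """Repeat calls to func with specified arguments."""
    if times is None:
        return starmap(func, repeat(args))
    return starmap(func, repeat(args, times))

With that in place, the earlier worker loop could be written as, for
example, for row in iterwhile(in_queue.get, lambda x: x is not sentinel).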

I wish all of these little recipes made their way into itertools or a
like module; itertools seems a bit tightly guarded.
 