Parallelization in Python 2.6

 
 
Robert Dailey
 
      08-18-2009
I'm looking for a way to parallelize my python script without using
typical threading primitives. For example, C++ has pthreads and TBB to
break things into "tasks". I would like to see something like this for
python. So, if I have a very linear script:

doStuff1()
doStuff2()


I can parallelize it easily like so:

create_task( doStuff1 )
create_task( doStuff2 )

Both of these functions would be called from new threads, and once
execution ends the threads would die. I realize this is a simple
example and I could create my own classes for this functionality, but
I do not want to bother if a solution already exists.

Thanks in advance.
 
 
 
 
 
Stefan Behnel
 
      08-18-2009
Robert Dailey wrote:
> I'm looking for a way to parallelize my python script without using
> typical threading primitives. For example, C++ has pthreads and TBB to
> break things into "tasks". I would like to see something like this for
> python. So, if I have a very linear script:
>
> doStuff1()
> doStuff2()
>
>
> I can parallelize it easily like so:
>
> create_task( doStuff1 )
> create_task( doStuff2 )
>
> Both of these functions would be called from new threads, and once
> execution ends the threads would die. I realize this is a simple
> example and I could create my own classes for this functionality, but
> I do not want to bother if a solution already exists.


I think the canonical answer is to use the threading module or (preferably)
the multiprocessing module, which is new in Py2.6.

http://docs.python.org/library/threading.html
http://docs.python.org/library/multiprocessing.html

Both share a (mostly) common interface and are simple enough to use. They
are pretty close to the above interface already.

Stefan
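As a minimal sketch of the interface the original post asks for, built on the multiprocessing module recommended above (`doStuff1`/`doStuff2` stand in for the OP's placeholder functions, and `create_task` is a hypothetical helper, not a library API):

```python
from multiprocessing import Process

def doStuff1():
    print("stuff 1")

def doStuff2():
    print("stuff 2")

def create_task(func):
    # Start func in its own process; return the handle so the
    # caller can join on it later.
    p = Process(target=func)
    p.start()
    return p

if __name__ == "__main__":
    tasks = [create_task(doStuff1), create_task(doStuff2)]
    for t in tasks:
        t.join()
```

Swapping `Process` for `threading.Thread` gives the same shape with threads instead of processes, which is the "(mostly) common interface" referred to above.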
 
 
 
 
 
Jonathan Gardner
 
      08-18-2009
On Aug 18, 11:19 am, Robert Dailey <(E-Mail Removed)> wrote:
> I'm looking for a way to parallelize my python script without using
> typical threading primitives. For example, C++ has pthreads and TBB to
> break things into "tasks". I would like to see something like this for
> python. So, if I have a very linear script:
>
> doStuff1()
> doStuff2()
>
> I can parallelize it easily like so:
>
> create_task( doStuff1 )
> create_task( doStuff2 )
>
> Both of these functions would be called from new threads, and once
> execution ends the threads would die. I realize this is a simple
> example and I could create my own classes for this functionality, but
> I do not want to bother if a solution already exists.
>


If you haven't heard of the Python GIL, you'll want to find out sooner
rather than later. Short summary: Python doesn't do threading very
well.

There are quite a few parallelization solutions out there for Python,
however (I can't name them off the top of my head). The way they work
is they have worker processes that can be spread across machines. When
you want to parallelize a task, you send off a function to those
worker processes.

There are some serious caveats and problems, not the least of which is
sharing code between the worker processes and the director, so this
isn't a great solution.

If you're looking for highly parallelized code, Python may not be the
right answer. Try something like Erlang or Haskell.
 
 
Robert Dailey
 
      08-18-2009
On Aug 18, 3:41 pm, Jonathan Gardner <(E-Mail Removed)>
wrote:
> On Aug 18, 11:19 am, Robert Dailey <(E-Mail Removed)> wrote:
>
> > I'm looking for a way to parallelize my python script without using
> > typical threading primitives. For example, C++ has pthreads and TBB to
> > break things into "tasks". I would like to see something like this for
> > python. So, if I have a very linear script:
> >
> > doStuff1()
> > doStuff2()
> >
> > I can parallelize it easily like so:
> >
> > create_task( doStuff1 )
> > create_task( doStuff2 )
> >
> > Both of these functions would be called from new threads, and once
> > execution ends the threads would die. I realize this is a simple
> > example and I could create my own classes for this functionality, but
> > I do not want to bother if a solution already exists.
>
> If you haven't heard of the Python GIL, you'll want to find out sooner
> rather than later. Short summary: Python doesn't do threading very
> well.
>
> There are quite a few parallelization solutions out there for Python,
> however (I can't name them off the top of my head). The way they work
> is they have worker processes that can be spread across machines. When
> you want to parallelize a task, you send off a function to those
> worker processes.
>
> There are some serious caveats and problems, not the least of which is
> sharing code between the worker processes and the director, so this
> isn't a great solution.
>
> If you're looking for highly parallelized code, Python may not be the
> right answer. Try something like Erlang or Haskell.


Really, all I'm trying to do is the most trivial type of
parallelization: take two functions and execute them in parallel. This
type of parallelization is called "embarrassingly parallel", and is
the simplest form. There are no dependencies between the two
functions. They do require read-only access to shared data, though.
And if they are being spawned as sub-processes this could cause
problems, unless the multiprocessing module creates pipelines or other
means to handle this situation.
 
 
Dennis Lee Bieber
 
      08-19-2009
On Tue, 18 Aug 2009 13:45:38 -0700 (PDT), Robert Dailey
<(E-Mail Removed)> declaimed the following in
gmane.comp.python.general:


> Really, all I'm trying to do is the most trivial type of
> parallelization: take two functions and execute them in parallel. This
> type of parallelization is called "embarrassingly parallel", and is
> the simplest form. There are no dependencies between the two
> functions. They do require read-only access to shared data, though.
> And if they are being spawned as sub-processes this could cause
> problems, unless the multiprocessing module creates pipelines or other
> means to handle this situation.


If they are number crunchers (CPU-bound) and don't make use of
binary extension libraries that release the GIL (for the most common
Python implementation), they'll run faster being called in sequence
since you won't have the overhead of task switching.

For I/O bound tasks, which spend most of their time blocked waiting
for an I/O, Python threads work fine with fairly rapid response time.
--
Wulfraed Dennis Lee Bieber KD6MOG
HTTP://wlfraed.home.netcom.com/
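To illustrate the I/O-bound case described above, here is a small sketch using the threading module, with `time.sleep` standing in for a blocking I/O call (the timing comment assumes an otherwise idle machine):

```python
import threading
import time

def fetch(results, key):
    # time.sleep stands in for blocking I/O; the GIL is released
    # while a thread is blocked here, so other threads can run.
    time.sleep(0.2)
    results[key] = "done"

results = {}
threads = [threading.Thread(target=fetch, args=(results, k))
           for k in ("a", "b")]

start = time.time()
for t in threads:
    t.start()
for t in threads:
    t.join()
elapsed = time.time() - start   # roughly 0.2s, not 0.4s: the waits overlap
```

Run in sequence, the two calls would take about 0.4 seconds; run in threads, the blocked waits overlap, which is exactly why Python threads work fine for I/O-bound tasks.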

 
 
Stefan Behnel
 
      08-19-2009
Dennis Lee Bieber wrote:
> On Tue, 18 Aug 2009 13:45:38 -0700 (PDT), Robert Dailey wrote:
>> Really, all I'm trying to do is the most trivial type of
>> parallelization: take two functions and execute them in parallel. This
>> type of parallelization is called "embarrassingly parallel", and is
>> the simplest form. There are no dependencies between the two
>> functions. They do require read-only access to shared data, though.
>> And if they are being spawned as sub-processes this could cause
>> problems, unless the multiprocessing module creates pipelines or other
>> means to handle this situation.


It wouldn't be worth much if it didn't; the multiprocessing module
handles everything else nicely. See its Queue classes.


> If they are number crunchers (CPU-bound) and don't make use of
> binary extension libraries that release the GIL (for the most common
> Python implementation), they'll run faster being called in sequence
> since you won't have the overhead of task switching.


... unless, obviously, the hardware is somewhat up to date (which is not
that uncommon for number crunching environments) and can execute more than
one thing at once.

Stefan
 
 
Hendrik van Rooyen
 
      08-19-2009
On Tuesday 18 August 2009 22:45:38 Robert Dailey wrote:

> Really, all I'm trying to do is the most trivial type of
> parallelization: take two functions and execute them in parallel. This
> type of parallelization is called "embarrassingly parallel", and is
> the simplest form. There are no dependencies between the two
> functions. They do require read-only access to shared data, though.
> And if they are being spawned as sub-processes this could cause
> problems, unless the multiprocessing module creates pipelines or other
> means to handle this situation.


Just use the thread module then, and thread.start_new_thread.
It just works.

- Hendrik
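A minimal sketch of this suggestion (note the module was renamed `_thread` in Python 3; `doStuff1` is a stand-in for the OP's function):

```python
try:
    import thread             # Python 2 name
except ImportError:
    import _thread as thread  # renamed in Python 3

import time

done = []

def doStuff1():
    done.append("one")

# Fire-and-forget: start_new_thread returns no handle to join on.
thread.start_new_thread(doStuff1, ())
time.sleep(0.2)  # crude wait, since there is nothing to join
```

The crude `sleep` at the end hints at why `threading.Thread` (which gives you a `join()` handle) is usually the more convenient interface.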
 
 
Paul Rubin
 
      08-19-2009
Hendrik van Rooyen <(E-Mail Removed)> writes:
> Just use thread then and thread.start_new_thread.
> It just works.


The GIL doesn't apply to threads made like that?!
 
 
sturlamolden
 
      08-19-2009
On 18 Aug, 11:19, Robert Dailey <(E-Mail Removed)> wrote:

> I'm looking for a way to parallelize my python script without using
> typical threading primitives. For example, C++ has pthreads and TBB to
> break things into "tasks".


In C++, parallelization without "typical threading primitives" usually
means one of three things:

- OpenMP pragmas
- the POSIX function fork(), unless you are using Windows
- MPI

In Python, you find the function os.fork and wrappers for MPI, and
they are used as in C++. With os.fork, I like to use a context
manager, putting the calls to fork in __enter__ and the calls to
sys.exit in __exit__. Then I can just write code like this:

with parallel():
    # parallel block here

You can also program in the same style as OpenMP using closures. Just
wrap whatever loop or block you want to execute in parallel in a
closure. It requires minimal editing of the serial code. Instead of

def foobar():
    for i in iterable:
        # whatever

you can add a closure (internal function) and do this:

def foobar():
    def section():                     # add a closure
        for i in scheduled(iterable):  # balance the load
            # whatever
    parallel(section)                  # spawn off threads
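`parallel()` and `scheduled()` above are the poster's hypothetical helpers, not a published API. One possible thread-based sketch of them (the names and the round-robin scheduling policy are assumptions for illustration):

```python
import threading

# Hypothetical helpers: 'scheduled' deals items out round-robin by
# worker id, 'parallel' runs the same closure in several threads.
_NTHREADS = 4
_local = threading.local()   # holds each worker thread's id

def scheduled(iterable):
    # Each worker keeps every _NTHREADS-th item, offset by its id,
    # so the threads split the iterable between them.
    tid = getattr(_local, "tid", 0)
    for i, item in enumerate(iterable):
        if i % _NTHREADS == tid:
            yield item

def parallel(section):
    # Spawn _NTHREADS threads, each running the same closure with
    # its own worker id, then wait for all of them.
    threads = []
    for tid in range(_NTHREADS):
        def run(tid=tid):
            _local.tid = tid
            section()
        t = threading.Thread(target=run)
        t.start()
        threads.append(t)
    for t in threads:
        t.join()
```

With these helpers, the `foobar` example above runs its loop body across four threads; as the rest of the post explains, this only buys real parallelism when the loop body releases the GIL.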

Programs written in C++ are much more difficult to parallelize with
threads because C++ does not have closures. This is why pragma-based
parallelization (OpenMP) was invented:

#pragma omp parallel for private(i)
for (i=0; i<n; i++) {
    // whatever
}

You should know about the GIL. It prevents multiple threads from using
the Python interpreter simultaneously. For parallel computing, this is
a blessing and a curse. Only C extensions can release the GIL; this
includes the I/O routines in Python's standard library. If the GIL is
not released, C library calls are guaranteed to be thread-safe.
However, the Python interpreter will be blocked while waiting for the
library call to return. If the GIL is released, parallelization works
as expected; you can also utilise multi-core CPUs (it is a common
misconception that Python cannot do this).

What the GIL prevents you from doing is writing parallel
compute-bound code in "pure Python" using threads. Most likely, you don't want
to do this. There is a 200x speed penalty from using Python over a C
extension. If you care enough about speed to program for parallel
execution, you should always use some C. If you still want to do this,
you can use processes instead (os.fork, multiprocessing, MPI), as the
GIL only affects threads.

It should be mentioned that compute-bound code is very rare, and
typically involves scientific computing. The only everyday example is
3D graphics. However, this is taken care of by the GPU and libraries
like OpenGL and Direct3D. Most parallel code you will want to write
are I/O bound. You can use the Python standard library and threads for
this, as it releases the GIL whenever a blocking call is made.

I program Python for scientific computing daily (computational
neuroscience). I have yet to experience that the GIL has hindered me
in my work. This is because whenever I run into a computational
bottleneck I cannot solve with NumPy, putting this tiny piece of code
in Fortran, C or Cython involves very little work. 95% is still
written in plain Python. The human brain is bad at detecting
computational bottlenecks though. So it almost always pays off to
write everything in Python first, and use the profiler to locate the
worst offenders.

Regards,
Sturla Molden
 
 
sturlamolden
 
      08-19-2009
On 18 Aug, 13:45, Robert Dailey <(E-Mail Removed)> wrote:

> Really, all I'm trying to do is the most trivial type of
> parallelization. Take two functions, execute them in parallel. This
> type of parallelization is called "embarrassingly parallel", and is
> the simplest form. There are no dependencies between the two
> functions.


If you are using Linux or Mac, just call os.fork for this.

You should also know that your function "create_task" is simply

from threading import Thread

def create_task(task):
    Thread(target=task).start()

If your task releases the GIL, this will work fine.


> They do require read-only access to shared data, though.
> And if they are being spawned as sub-processes this could cause
> problems, unless the multiprocessing module creates pipelines or other
> means to handle this situation.


With forking or multiprocessing, you have to use IPC: usually pipes,
Unix sockets / named pipes, or shared memory. Multiprocessing helps
you with this; it also has a convenient Queue object for serialised
read/write access to a pipe.
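A short sketch of that Queue-based IPC pattern; `worker` and the sample data here are made up for illustration:

```python
from multiprocessing import Process, Queue

def worker(shared, out):
    # 'shared' arrives as a pickled copy, so read-only use is safe;
    # the result travels back over the Queue (a pipe underneath).
    out.put(sum(shared))

if __name__ == "__main__":
    q = Queue()
    p = Process(target=worker, args=([1, 2, 3], q))
    p.start()
    result = q.get()   # blocks until the worker puts its answer
    p.join()
```

This also addresses the OP's read-only-shared-data concern: each subprocess gets its own copy of the arguments, and results come back through the queue.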

You can also create shared memory with mmap.mmap, passing -1 as the
file descriptor to request an anonymous map.
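A minimal sketch of anonymous shared memory via mmap plus os.fork (Unix-only, since fork is POSIX; the 16-byte size is arbitrary):

```python
import mmap
import os

# Anonymous shared map: fileno -1 plus a length, no backing file.
buf = mmap.mmap(-1, 16)

pid = os.fork()           # POSIX only; the map is shared with the child
if pid == 0:              # child: write into the shared page
    buf[:5] = b"hello"
    os._exit(0)           # leave immediately, skipping parent cleanup
os.waitpid(pid, 0)        # parent: wait, then read what the child wrote
```

After `waitpid` returns, the parent sees the child's bytes in `buf`, which is the whole point: unlike pickled arguments, writes to the map are visible across the processes.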









 