Velocity Reviews > Matrix multiplication

# Matrix multiplication

Zdenko
Guest
Posts: n/a

 01-04-2011
Please, can anybody write me a simple recursive matrix multiplication
using multiple threads in Python, or at least show me some guidelines
how to write it myself

Thank You

Ulrich Eckhardt
Guest
Posts: n/a

 01-04-2011
Zdenko wrote:
> Please, can anybody write me a simple recursive matrix multiplication
> using multiple threads in Python, or at least show me some guidelines
> how to write it myself

No problem, I just need your bank account data to withdraw the payment and

Seriously, things don't work like this. If you show effort, you will get
help, but you don't get working solutions without even a bit of effort on

make some experiments on how to distribute tasks to threads and collect
them again.

Uli

Guest
Posts: n/a

 01-04-2011
On 4 jan, 11:15, Ulrich Eckhardt <(E-Mail Removed)>
wrote:
> Zdenko wrote:
> > Please, can anybody write me a simple recursive matrix multiplication
> > using multiple threads in Python, or at least show me some guidelines
> > how to write it myself

>
> No problem, I just need your bank account data to withdraw the payment and
> the address of your teacher whom to send the results. ;^)
>
> Seriously, things don't work like this. If you show effort, you will get
> help, but you don't get working solutions without even a bit of effort on
>
> make some experiments on how to distribute tasks to threads and collect
> them again.
>
> Uli

Hi,

just have a look here http://en.wikipedia.org/wiki/Matrix_multiplication
and find out what parts can be parallelized

Zdenko
Guest
Posts: n/a

 01-04-2011
On 4.1.2011 11:15, Ulrich Eckhardt wrote:
> Zdenko wrote:
>> Please, can anybody write me a simple recursive matrix multiplication
>> using multiple threads in Python, or at least show me some guidelines
>> how to write it myself

>
> No problem, I just need your bank account data to withdraw the payment and
> the address of your teacher whom to send the results. ;^)
>
> Seriously, things don't work like this. If you show effort, you will get
> help, but you don't get working solutions without even a bit of effort on
>
> make some experiments on how to distribute tasks to threads and collect
> them again.
>
> Uli
>

Ok, I expected something like this
Lets try this way:

I wrote these two codes for example:

this one is naive way of matrix multiplication using one thread

from random import *
from time import clock

t0 = clock()
A=[]
B=[]
C=[]
m=300 #size of matrix m x m

for i in range(m):
A.append([])
B.append([])
C.append([])
for j in range(m):
A[i].append(randint(1,9))
B[i].append(randint(1,9))
C[i].append(0)

def run ( self ):
for i in range(m):
for j in range(m):
for k in range(m):
C[i][j]+=A[i][k]*B[k][j]

t.start()
t.join()
print clock() - t0, "seconds"

this one is using two threads, each one is multiplying half of matrix

import time
from random import randint

t0 = time.clock()
def __init__(self, poc, kr):
self.poc=poc
self.kr=kr
def run(self):
for i in range(self.poc,self.kr):
for j in range(m):
for k in range(m):
C[i][j]+=A[i][k]*B[k][j]

A=[]
B=[]
C=[]
m=300 #size of matrix m x m

for i in range(m):
A.append([])
B.append([])
C.append([])
for j in range(m):
A[i].append(randint(1,9))
B[i].append(randint(1,9))
C[i].append(0)
thr1.start()
thr2.start()
thr1.join()
thr2.join()
print time.clock() - t0, "seconds"

why is second one more than twice slower than first when it should be
faster. I suspect that problem is in simultaneous access to matrix C but
i don't know how to solve this.

I used this just for example to see how the threading works so i can
work on recursive and Strassen algorithm.

DevPlayer
Guest
Posts: n/a

 01-04-2011
I would be very suprised if you achieve faster results threading this
problem. There's been much discussed on benefits or lack thereof to
threading in Python (or in general).

Threading is best used in situations where you are doing different
kinds of tasks. For example if you want to do your matrix
multiplication WHILE you were doing other things on your computer
where matrix multiplication was a background process chugging away
when you are not taxing the computer doing other stuff.

operations like reading in lots of big files from a comparable slow
hard drive (compared to how fast a CPU processes data) or waiting on
netword data (super slow compared to CPU processing).

When you are doing mass numeric processing, you want to minimize the
jumping from one function to another which uses overhead, recursion
adds a small amount of uneccessary overhead, you want to minimize the
need for the cpu to switch between threads or processes.

If you still feel the need to use threads for some reason, for numeric
processing I'd recommend using a "lighter" thread object, like a tiny
them now.

Another thing to note is it seems you might be expecting threads to
run on different CPU cores expecting improvment. Be careful with this
assumption. This is not always true. It is up to the CPU and OS to
determine how threads are handled and perhaps the GIL to some extent.
Beaware that Python has a GIL (some distributions). Google it if you
don't know of it.

To encourage better use of multi-core cpus you might consider the
multiprocessing library included in Python 2.7 (I think) and above.

I'm assuming that speed is an issue because you where timing your
code. If you are doing actual serious number crunching there's lots of
advice on this. The python Numpy package as well as Stackless Python
(for microthreads or whatever thier called) comes to mind.

Another thought. Ask yourself if you need a large in-memory or live
set of processed numbers, in your case a fully and processed
multiplied matrix. Usually a large set of in-memory numbers is
something your going to use to simulate a model or to process and
crunch further.

Or is your actual usage going to be picking out a processed number
here or there from the matrix. If this is true look at iterators or
generators. Which would be a snapshot in time of your matrix
multiplication. I like to think of Python generators like integral
calculus (definition at: http://en.wikipedia.org/wiki/Integral_calculus)
where the specific integral of a generator is often just 1.

I'm loving generators a lot. For example there are generator
accelorators which if you think it through means you can make
generator deccelorators, useful for doing interpolation between
elements of your matrix elements for example. I always forget if

Some indicators that generators could help: You're doing lots of for
loops with range().

Also it's been measured that list comprehensions are slightly faster
then while loops are a slightly faster then for loops. You can Google
to confirm, enter something like "python fast iteration".

Also if your numbers in your matix are actually not really numbers but
objects with numbers, __slots__ is used to for large sets of objects
(10s of millions at the very least) to minimize memory usage and
perhaps with speed, if used properly. Just mentioning. I'd stay away
from this though.

Some of my informatation may be inaccurate (and even completely wrong;
like I always get when a thread is best switched during a blocking
verse a non-blocking operation) but there are some things to consider.

DevPlayer
Guest
Posts: n/a

 01-04-2011
BTW
http://www.python.org/dev/peps/pep-0211/

DevPlayer
Guest
Posts: n/a

 01-04-2011
See section titled: "'array' or 'matrix'? Which should I use?"
at
http://www.scipy.org/NumPy_for_Matlab_Users

BTW
http://www.python.org/dev/peps/pep-0211/

Dan Stromberg
Guest
Posts: n/a

 01-04-2011
On Tue, Jan 4, 2011 at 4:22 AM, Zdenko <(E-Mail Removed)> wrote:
> On 4.1.2011 11:15, Ulrich Eckhardt wrote:
>>
>> Zdenko wrote:
>>>
>>> Please, can anybody write me a simple recursive matrix multiplication
>>> using multiple threads in Python, or at least show me some guidelines
>>> how to write it myself

>>
>> No problem, I just need your bank account data to withdraw the payment and
>> the address of your teacher whom to send the results. ;^)
>>
>> Seriously, things don't work like this. If you show effort, you will get
>> help, but you don't get working solutions without even a bit of effort on
>>
>> to
>> make some experiments on how to distribute tasks to threads and collect
>> them again.
>>
>> Uli
>>

>
> Ok, I expected something like this
> Lets try this way:
>
> I wrote these two codes for example:
>
> this one is naive way of matrix multiplication using one thread
>
> from random import *
> from time import clock
>
> t0 = clock()
> A=[]
> B=[]
> C=[]
> m=300 *#size of matrix m x m
>
> for i in range(m):
> * *A.append([])
> * *B.append([])
> * *C.append([])
> * *for j in range(m):
> * * * *A[i].append(randint(1,9))
> * * * *B[i].append(randint(1,9))
> * * * *C[i].append(0)
>
>
> * def run ( self ):
> * * * * * for i in range(m):
> * * * * * * * *for j in range(m):
> * * * * * * * * * *for k in range(m):
> * * * * * * * * * * * *C[i][j]+=A[i][k]*B[k][j]
>
> t.start()
> t.join()
> print clock() - t0, "seconds"
>
>
>
> this one is using two threads, each one is multiplying half of matrix
>
> import time
> from random import randint
>
> t0 = time.clock()
> * *def __init__(self, poc, kr):
> * * * *self.poc=poc
> * * * *self.kr=kr
> * *def run(self):
> * * * * * for i in range(self.poc,self.kr):
> * * * * * * * *for j in range(m):
> * * * * * * * * * *for k in range(m):
> * * * * * * * * * * * *C[i][j]+=A[i][k]*B[k][j]
>
> A=[]
> B=[]
> C=[]
> m=300 #size of matrix m x m
>
> for i in range(m):
> * *A.append([])
> * *B.append([])
> * *C.append([])
> * *for j in range(m):
> * * * *A[i].append(randint(1,9))
> * * * *B[i].append(randint(1,9))
> * * * *C[i].append(0)
> thr1.start()
> thr2.start()
> thr1.join()
> thr2.join()
> print time.clock() - t0, "seconds"
>
> why is second one more than twice slower than first when it should be
> faster. I suspect that problem is in simultaneous access to matrix C but i
> don't know how to solve this.
>
> I used this just for example to see how the threading works so i can work on
> recursive and Strassen algorithm.
> --
> http://mail.python.org/mailman/listinfo/python-list
>

Threads in CPython are decent when you're doing I/O, not so good when
may actually find that your application runs about 1/n times as fast,
where n is the number of CPU's - that is, much slower. I don't mean
that it just doesn't scale well, I mean that each additional CPU makes
things slower for CPU-bound jobs.

If you require java-like threading, you might be better off with
Jython or Iron Python for now.

pypy's likely to have good threading eventually, but last I heard it
wasn't quite there yet.

If you do require CPython, you might check out the greenlets module
and/or PyCSP.

Oh, and there's "stackless python", which amounts to a branch of
CPython that's unlikely to be merged. It's supposed to be able to
thread really well, and ISTR hearing that it's pretty similar to
greenlets. The main force behind stackless is working on pypy now, I
believe.

Finally, with CPython, you can use the multiprocessing module to get
pretty good parallelism via distinct processes. This way you tend to
end up using queues for interprocess communication; these queues use
shared memory. However, you might actually be able to put your matrix
in shared memory - if it's large enough to justify that. I know you
could put a linear array of homogeneous, scalar type in shared memory;
I've not heard of a way of putting an aggregate type in shared memory.
Naturally, you can fake an n dimensional array using a 1 dimensional
array with a little multiplication and addition.

As an example of using CPython with multiprocessing without having
your matrix in shared memory, you could probably have one worker
subprocess for each core on a system, divide up the cells of your
result matrix by their coordinates (perhaps using an iterator and izip
across the queues) and send a message (via a shared memory queue) with
those coordinates to the appropriate subprocess. Then those
subprocess send back the coordinates and the result at said coordinate
via a different result queue. I suspect you'd want one queue for all
the results, and one queue for each subprocess for initiating
computation. Each subprocess would have its own COW copy of the input
matrix.

Steven D'Aprano
Guest
Posts: n/a

 01-04-2011
On Tue, 04 Jan 2011 13:22:33 +0100, Zdenko wrote:

> I wrote these two codes for example:
>
> this one is naive way of matrix multiplication using one thread

[...]
> this one is using two threads, each one is multiplying half of matrix

[...]
> why is second one more than twice slower than first when it should be
> faster. I suspect that problem is in simultaneous access to matrix C but
> i don't know how to solve this.

Can I suggest that your timing code leaves something to be desired?

* You time too much. You're interested in how long it takes to multiply
two matrices, but you time how long it takes to do much more than just
the multiplication: your timing code covers the time it takes to create
the class object (which will be trivial), and build the matrix (non-
trivial), as well as perform the multiplication.

* Your perform the timing test once, which makes it subject to sampling
errors. (Although if the process takes a long time, say, multiple
seconds, the sampling error will *probably* be small relative to the
measured time. But not always.)

* You use time.clock, but without knowing which operating system you are
running, it's impossible to tell whether you should be using time.time

Whether these issues will actually make a practical difference in this
*specific* case, I don't know, but as a general rule, the most accurate
way to perform these sorts of timing tests is with the timeit module.
Something like this:

A = ... # initialise matrix A
B = ... # and B
C = ... # and the result

def mult1(A, B, C):
# Multiply matrices A and B using 1 thread.
t.start()
t.join()

def mult2(A, B, C):
# Multiply matrices A and B using 2 threads.
t1.start()
t2.start()
t1.join() # Block until thread 1 is done.
t2.join() # Now block until thread 2 is done.

setup1 = "from __main__ import A, B, C, mult1"
setup2 = "from __main__ import A, B, C, mult2"

from timeit import Timer
t1 = Timer("mult1(A, B, C)", setup1)
t2 = Timer("mult2(A, B, C)", setup2)

# Now perform the timing tests.
best1 = min(t1.repeat())
best2 = min(t2.repeat())

By default, Timer.repeat will measure the time taken by one million
iterations of the code snippet, and take three measurements. You almost
always only care about the smallest measurement -- any larger times will
represent sampling error.

If it takes too long to time one million iterations, either make the
matrices smaller, or pass keyword arguments number and/or repeat to the
repeat method:

# best of five measurements of one hundred iterations
best1 = min(t1.repeat(number=100, repeat=5))

--
Steven

John Nagle
Guest
Posts: n/a

 01-05-2011
On 1/4/2011 2:15 AM, Ulrich Eckhardt wrote:
> Zdenko wrote:
>> Please, can anybody write me a simple recursive matrix multiplication
>> using multiple threads in Python, or at least show me some guidelines
>> how to write it myself

>
> No problem, I just need your bank account data to withdraw the payment and
> the address of your teacher whom to send the results. ;^)
>
> Seriously, things don't work like this. If you show effort, you will get
> help, but you don't get working solutions without even a bit of effort on
>
> make some experiments on how to distribute tasks to threads and collect
> them again.

If you want to do matrix multiplication from Python, use NumPy.
Anything you write in Python itself and run in CPython will be far,
far slower. CPython without NumPy is about 60x slower than C on
basic array math.

Writing high-speed parallel number-crunching code that outperforms
off the shelf libraries is typically non-trivial. If there's some
easy way to make a simple operation like matrix multiply go faster,

If you're really into this, look into the Unified Parallel C effort
and its libraries. The supercomputer types spend their lives doing this
sort of thing.

John Nagle