Velocity Reviews

Velocity Reviews (http://www.velocityreviews.com/forums/index.php)
-   Java (http://www.velocityreviews.com/forums/f30-java.html)
-   -   Re: Matrix performance question (http://www.velocityreviews.com/forums/t645532-re-matrix-performance-question.html)

Roedy Green 11-21-2008 03:19 PM

Re: Matrix performance question
 
On Thu, 20 Nov 2008 08:05:45 -0800, Patricia Shanahan <pats@acm.org>
wrote, quoted or indirectly quoted someone who said :

>Any other ideas I should be considering? Any packages that would do this
>operation fast?


Other approaches:

Transpose rows and columns and see if Java can optimise better.

Use Jet native compilation.
see http://mindprod.com/jgloss/jet.html

Write that code in C and hook to it with JNI. C has more efficient
ways of storing matrices.

write that code in assembler and hook to it with JNI

mathapi http://mindprod.com/jgloss/mathapi.html
--
Roedy Green Canadian Mind Products
http://mindprod.com
Your old road is
Rapidly agin'.
Please get out of the new one
If you can't lend your hand
For the times they are a-changin'.

Tom Anderson 11-21-2008 05:21 PM

Re: Matrix performance question
 
On Fri, 21 Nov 2008, Eric Sosman wrote:

> Roedy Green wrote:
>> On Thu, 20 Nov 2008 08:05:45 -0800, Patricia Shanahan <pats@acm.org>
>> wrote, quoted or indirectly quoted someone who said :
>>
>>> Any other ideas I should be considering? Any packages that would do this
>>> operation fast?

>>
>> Other approaches:
>>
>> Transpose rows and columns and see if Java can optimise better.

>
> Since she's using both the matrix and its transpose, she's got
> to traverse it in both orientations anyhow. If Java does better
> with one of them, it must do worse with the other.


Maybe yes, maybe no. You have to think about the effect of the layout on
memory and cache access patterns - does one more easily admit efficient
use of the cache? I *think* that with Pat's array-of-arrays
implementation, having the larger axis be contiguous is the right thing to
do; with a packed array - one where you put all the rows (or columns) in
one big array, and then do the index arithmetic yourself, which isn't hard
- it doesn't matter either way, not with one of the dimensions that small.

>> Write that code in C and hook to it with JNI. C has more efficient ways
>> of storing matrices.

>
> Not *that* much more efficient.


I don't believe C's multimensional arrays are any more efficient than
using a packed matrix in java. It boils down to the same data layout and
sequence of calculations, neither of which are complicated, so i doubt C
does it faster than java.

> Besides, why re-invent wheels? If resorting to JNI, use it to get at
> pre-written high-performance BLAS routines or the like. Vendors really
> *do* optimize those things heavily, with knowledge of model-specific
> caches, pipeline depths, pre-fetch strategies, and so on.


s/vendors/Kazushige Goto/!

I'd be interested to see a benchmark of the various leading BLAS
implementations - current versions, with current compilers, on current
hardware. I can't find anything more recent than 1995!

tom

--
A paranoid-schizophrenic is a guy who just found out what's going on. --
William S. Burroughs

Tom Anderson 11-21-2008 10:37 PM

Re: Matrix performance question
 
On Fri, 21 Nov 2008, Eric Sosman wrote:

> Graham Jones wrote:
>>
>> "Patricia Shanahan" <pats@acm.org> wrote in message
>> news:CYSdnfOccfnFsbrUnZ2dnUVZ_v_inZ2d@earthlink.co m...
>>
>> [...]
>>> Thanks. I'm actually doing eight 10,000 element dot products followed by
>>> eight 10,000 element daxpy-like operations. That does not affect the
>>> number of floating point operations, but does change the memory access
>>> pattern.
>>>
>>> [For anyone following this who is not familiar with BLAS, daxpy is
>>> scalar constant times a vector plus a vector]

>>
>> Maybe worth storing U twice, as 8*10000 and 10000*8, and using only dot
>> products, at least when you have several vectors for the same U.

>
> Watch out for the memory penalty, though.


Moreover, the cache penalty. How big is Patricia's L2/3 cache? Probably
0.5 - 4 MB? That's the size where the difference between working on a 640
kB data structure and two which total to 1480 kB (per your calculations)
is going to be very significant.

tom

--
Sorry. Went a bit Atari Teenage Riot there. -- Andrew

Tom Anderson 11-21-2008 10:59 PM

Re: Matrix performance question
 
On Fri, 21 Nov 2008, Patricia Shanahan wrote:

> Patricia Shanahan wrote:
>> Eric Sosman wrote:
>>> Roedy Green wrote:

> ...
>>>> Write that code in C and hook to it with JNI. C has more efficient
>>>> ways of storing matrices.
>>>
>>> Not *that* much more efficient. Besides, why re-invent wheels?
>>> If resorting to JNI, use it to get at pre-written high-performance
>>> BLAS routines or the like. Vendors really *do* optimize those things
>>> heavily, with knowledge of model-specific caches, pipeline depths,
>>> pre-fetch strategies, and so on.

> ...
>> My main concern with the mixed languages approaches is that this work is
>> a lot of calls to a relatively short job. I will get some measurements
>> on that. At least last time I tried it, JNI had a significant overhead
>> in the transition between languages.

>
> In one of my performance test cases, the method is called 122,520 times,
> taking a total of 434 seconds, mean time 3.5 milliseconds per call. What
> do people think about the economics of an inter-language call for a 3.5
> ms function?


I believe the overhead for a JNI call - the call itself - is on the order
of hundreds of instructions, which is hundreds of nanoseconds on a modern
processor. That vanishes in comparison to 3.5 ms of computation.

JNI's reputation for slowness comes, i believe, not from the overhead of
calls, but from the overhead of manipulating java objects from the C side
via the JNI interface. Those operations are also hundreds of instructions,
but you do them hojillions of times per call, hence they really start to
drag.

> Each call allocates one double[10000].


About that. The natural way to represent your arrays is as java arrays.
You can get at arrays from JNI in bulk using GetDoubleArray, but there's a
chance that this might give you a copy rather than a direct pointer. That
would work, but be very slow. I don't know what any production JVM's
actually do. However, there is an alternative - NIO's FloatBuffer. If you
allocate them as direct buffers, then you can use JNI's
GetDirectBufferAddress to get direct access to them. That is pretty much
guaranteed to be the fastest way to get at an array from the C side. I
believe access from the java side is still pretty fast, but i can't put a
number on it.

A related point is the use of arrays of arrays for matrices. This is
convenient, but leads to a layout in memory that isn't native-friendly.
Packing your matrix into a single array, and wrapping it in an object
which does the index calculations for you, will give you better layout
(including better cache locality for operations in java) for minimal
ergonomic loss. You will have to write change things like:

m[i][j] = m[i][0] + m[0][j] ;

to:

m.set(i, j, m.get(i, 0) + m.get(0, j)) ;

Which is a pain, but is hopefully bearable.

On the bright side, once you've got your matrices wrapped, you can change
the implementation between arrays of arrays, packed arrays, buffers, or
whatever else you can think of, with only localised changes to the code,
which makes for convenient benchmarking of the alternatives.

tom

--
Sorry. Went a bit Atari Teenage Riot there. -- Andrew

Tom Anderson 11-22-2008 11:21 PM

Re: Matrix performance question
 
On Sat, 22 Nov 2008, Graham Jones wrote:

> "Tom Anderson" <twic@urchin.earth.li> wrote in message
> news:Pine.LNX.4.64.0811212232540.23313@urchin.eart h.li...
>> On Fri, 21 Nov 2008, Eric Sosman wrote:
>>
>>> Graham Jones wrote:
>>>>
>>>> "Patricia Shanahan" <pats@acm.org> wrote in message
>>>> news:CYSdnfOccfnFsbrUnZ2dnUVZ_v_inZ2d@earthlink.co m...
>>>>
>>>> [...]
>>>>> Thanks. I'm actually doing eight 10,000 element dot products followed by
>>>>> eight 10,000 element daxpy-like operations. That does not affect the
>>>>> number of floating point operations, but does change the memory access
>>>>> pattern.
>>>>>
>>>>> [For anyone following this who is not familiar with BLAS, daxpy is
>>>>> scalar constant times a vector plus a vector]
>>>>
>>>> Maybe worth storing U twice, as 8*10000 and 10000*8, and using only dot
>>>> products, at least when you have several vectors for the same U.
>>>
>>> Watch out for the memory penalty, though.

>>
>> Moreover, the cache penalty. How big is Patricia's L2/3 cache? Probably 0.5
>> - 4 MB? That's the size where the difference between working on a 640 kB
>> data structure and two which total to 1480 kB (per your calculations) is
>> going to be very significant.

>
> I think the dot product calculation is cache-friendly even when the vectors
> are bigger than the cache. Anyway the code below seems to work about 6x
> faster than Patricia reported, and making N ten times bigger and T ten times
> smaller makes little odds.
>
> I'm really a C/C++ programmer who may doing some numerical stuff in Java
> soon. Have I done something silly or clever? ;-)


You've done something very silly - you've contradicted me! For this, you
will pay!!

I've written an implementation of Patricia's function (as i understand
it), with a timing driver and a bunch of matrix implementations:

http://urchin.earth.li/~twic/Matrix_Layout/

For a 10 000 x 8 matrix, doing 100 runs of 100 iterations each, i get the
following results (times are in nanoseconds per iteration):

Matrix Type min median 95% max
RowMajorMatrix 1087790.0 1096540.0 1146610.0 1165810.0
ColumnMajorMatrix 593350.0 598450.0 641440.0 709700.0
ParallelRowMajorMatrix 876650.0 922850.0 959030.0 1142850.0
ParallelColumnMajorMatrix 366350.0 414790.0 428000.0 469190.0
AntiParallelRowMajorMatrix 352050.0 409060.0 615760.0 670430.0
AntiParallelColumnMajorMatrix 830310.0 833460.0 869010.0 1044430.0
DualMatrix 649780.0 726620.0 752110.0 807310.0
PessimalDualMatrix 861620.0 866200.0 914710.0 1771520.0

I'm running this code on a remote machine; i don't know how big its caches
are, and i can't guarantee that it was unloaded, or equally loaded between
tests.

As to what these implementations are, well. The plain row- and
column-major matrices should be obvious. The dual matrix is your idea, the
pessimal dual is one which uses the two copies of the matrix the wrong way
round. The parallel matrices, rather than computing each element of the
output vector one after the other, compute them in parallel in one
direction, to make the memory access pattern more linear. The antiparallel
versions do the same, but to make them *less* linear. Or something. I'm
pretty confused by the whole thing at this point.

So:

Column-major beats row-major, as Patricia anticipated, and one would
expect. Your dual matrix approach is faster than a row-major matrix, but a
bit slower than a column-major matrix. Doing the right kind of
parallelisation speeds up both the plain matrices to roughly the same
speed, which is much faster than the dual matrix. I didn't try
parallelising the dual matrix; i wouldn't have thought it was necessary,
but i could be wrong.

Please do report any errors, omissions, further results, etc. I've had
enough of this for today!

tom

--
Weird is Cool -- Naomi Clarke

Tom Anderson 11-23-2008 03:58 PM

Re: Matrix performance question
 
On Sun, 23 Nov 2008, Graham Jones wrote:

> "Tom Anderson" <twic@urchin.earth.li> wrote in message
> news:Pine.LNX.4.64.0811222149340.2777@urchin.earth .li...
>
>> You've done something very silly - you've contradicted me! For this, you
>> will pay!!
>>
>> I've written an implementation of Patricia's function (as i understand it),
>> with a timing driver and a bunch of matrix implementations:
>>
>> http://urchin.earth.li/~twic/Matrix_Layout/
>>
>> For a 10 000 x 8 matrix, doing 100 runs of 100 iterations each, i get the
>> following results (times are in nanoseconds per iteration):
>>
>> Matrix Type min median 95% max
>> RowMajorMatrix 1087790.0 1096540.0 1146610.0 1165810.0
>> ColumnMajorMatrix 593350.0 598450.0 641440.0 709700.0
>> ParallelRowMajorMatrix 876650.0 922850.0 959030.0 1142850.0
>> ParallelColumnMajorMatrix 366350.0 414790.0 428000.0 469190.0
>> AntiParallelRowMajorMatrix 352050.0 409060.0 615760.0 670430.0
>> AntiParallelColumnMajorMatrix 830310.0 833460.0 869010.0 1044430.0
>> DualMatrix 649780.0 726620.0 752110.0 807310.0
>> PessimalDualMatrix 861620.0 866200.0 914710.0 1771520.0

>
> I shall make sure I annoy you whenever I have an speed-critical routine
> to write.


*shakes fist*

> But the big puzzle is why even the slowest versions take 1ms for
> something Patricia said took 3.5ms.


It's possible my computer's just a lot faster than hers. Or i
misunderstood the calculation she's doing.

Patricia was using arrays-of-arrays, so let's try that:

UnpackedRowMajorMatrix 1408840.0 1456200.0 1763020.0 2343020.0
UnpackedColumnMajorMatrix 519090.0 523810.0 542980.0 624270.0

Interestingly, neither are these as slow as 3.5 ms, the column-major one
is faster than the packed version! Let's try parallelising the cross-grain
multiplication:

ParallelUnpackedColumnMajorMatrix 370980.0 376420.0 437200.0 500890.0

Okay, so that's faster than the packed version too. Well, looks like my
advice to use a packed matrix was pretty bad, then.

Hmm. It seems hard to believe that my machine is an order of magnitude
(and that's denary magnitude!) faster than hers. It's also hard to believe
my code is an order of magnitude faster.

Hang on, was her matrix 10 000 or 100 000 elements long?

tom

--
The best way I know of to win an argument is to start by being in the
right. -- Lord Hailsham

Tom Anderson 11-23-2008 07:11 PM

Re: Matrix performance question
 
On Sun, 23 Nov 2008, Patricia Shanahan wrote:

> Tom Anderson wrote:
> ...
>> Hmm. It seems hard to believe that my machine is an order of magnitude (and
>> that's denary magnitude!) faster than hers. It's also hard to believe my
>> code is an order of magnitude faster.
>>
>> Hang on, was her matrix 10 000 or 100 000 elements long?

> ...
>
> I've done a few checks, and found these issues:
>
> 1. My production arrays are up to 20,101 elements long.


Going up to 20101 elements makes ParallelColumnMajorMatrix take a median
time of 951450 ns.

> 2. It is slightly faster, about 2ms, when run in the production
> environment, -server and without profiling.
>
> 3. I am measuring it in the context of my program, not an isolated
> benchmark. Since these calls are interleaved with other memory intensive
> code, and the input vector is new each time, I may be getting more cache
> misses.


Good point - my code reuses the same input vector. If i modify it to use a
fresh set of arrays each time (from a pre-prepared list, so they should be
cache-cold), it comes out as 965280, about the same. It's possible i'm not
making things cache-cold enough.

> 4. My code was very simple unoptimized code, pending a decision about
> the most promising approaches to improve it.


Fair enough. Please don't read any of what i've written as impugning your
code or coding!

> 5. There is a single post-processing pass over the input and output
> vectors that is done after the main calculation.


Okay, i don't have that. And i'm not even certain that the calculation i'm
doing is the same as yours apart from that.

> I'll do some more measurements to see which of these issues are important.


If you get a chance, could you give my benchmark a run? There are some
shell scripts which automate it, and it's not hard to do by hand.

tom

--
Only the bagel has the correct aspect ratio.

Roedy Green 11-24-2008 08:31 PM

Re: Matrix performance question
 
On Fri, 21 Nov 2008 07:19:49 -0800, Roedy Green
<see_website@mindprod.com.invalid> wrote, quoted or indirectly quoted
someone who said :

>Other approaches:


People love to argue about which approach will work better, however,
the inner working of CPUS, compiler, and operating systems are so
complicated, it is at best an educated guess informed by years of
experience. Gone are the days you could sit there with an adding
machine toting up cycles to compare algorithms.

If you have the time, and it really matters, you can perform all
manner of experiments to see what actually works better.

I once had a plum task, designing high voltage transmission lines.
Back then they cost a million dollars a mile to build, and back then
CPU time was extremely expensive relative to my time. Almost anything
I could think of to improve the code had an enormous payback. So
there was no question about trying out any idea.

Today it is rare to get a task that justifies much in the way of
tuning.

The big problem with the JNI approach is the overhead of the JNI call.
If you hop over to the C++ side, you want to do a substantial amount
of work before you return to justify the cost of the call.

The other thing is Jet and Sun -server hotspot are so good now, there
is a fair chance you won't get any benefit to speak of from coding in
C++ or even assembler. You need to use some highly tuned polished
library code to make it worthwhile.

This is not meant to discount any other advice you have received. Some
of the best people are offering their thoughts.

--
Roedy Green Canadian Mind Products
http://mindprod.com
"Humanity is conducting an unintended, uncontrolled, globally pervasive experiment
whose ultimate consequences could be second only to global nuclear war."
~ Environment Canada (The Canadian equivalent of the EPA on global warming)

Tom Anderson 11-24-2008 11:11 PM

Re: Matrix performance question
 
On Sun, 23 Nov 2008, Graham Jones wrote:

> "Tom Anderson" <twic@urchin.earth.li> wrote in message
> news:Pine.LNX.4.64.0811231848450.11453@urchin.eart h.li...
>
>> If you get a chance, could you give my benchmark a run? There are some
>> shell scripts which automate it, and it's not hard to do by hand.

>
> Shell scripts don't work too good on Vista...


http://win-bash.sourceforge.net/
http://gnuwin32.sourceforge.net/

!

> but I did as you asked of Patricia...
>
> I just changed the reporting to just give median time in ms. Ordinary
> PC, Vista, duo CPU, 2.33 GHz, 4Mb L2 cache. jdk1.6.0_03 (1). I use
> Eclipse in case that matters. Typical command line:
>
> java PatriciaTest PessimalDualMatrix 10000 8 1795725554
>
> (I confess my ignorance of Java is such that I don't know what -server
> might do for me.)


The Sun JVM has, broadly speaking, two modes: client and server. Client
optimises for startup time and minimal memory consumption, and server
optimises for speed. The mode affects the compiler (the server compiler is
a *lot* better), and i think memory management. For tests like this, you
really want server mode. I'm not sure which is the default; it used to be
that client was, but i think it's now the case that it will be server if
your machine is big enough, which i think means >2 cores and >2 GB RAM.
Not sure.

My results again:

ParallelColumnMajorMatrix 414790.0
DualMatrix 726620.0

> ParallelColumnMajorMatrix 0.6838042
> DualMatrix 0.5443394


Interesting!

Patricia's numbers:

> ParallelColumnMajorMatrix 345285.12
> DualMatrix 502855.2


Vaguely similar to mine, although the difference is smaller than on my
machine.

> And replacing 10000 with 100000, which surely overflows my cache:
>
> ColumnMajorMatrix 6.429315
> RowMajorMatrix 12.383814
> ColumnMajorMatrix 6.390228
> ParallelRowMajorMatrix 7.260556
> ParallelColumnMajorMatrix 7.1654706
> AntiParallelRowMajorMatrix 7.239504
> AntiParallelColumnMajorMatrix 8.493671
> DualMatrix 5.873374
> PessimalDualMatrix 13.273498
> UnpackedColumnMajorMatrix 7.186361
> UnpackedRowMajorMatrix 16.7376
> ParallelUnpackedColumnMajorMatrix 8.268656


Also interesting! I note that the parallelisation, which speeds things up
for the column-major matrix at N=10k, slows things down at N=100k.
Possibly because it uses an accumulator array which is bigger than the
cache. Not sure.

> Differences in platform seem more important than differences in code.


Really? There was about a 50% difference in times between your machine and
mine. But there was a twofold or more difference in times between the best
and worst strategies.

What's is particularly interesting, though, is that different strategies
are best on different machines. On mine, the parallel column-major
organisation was a fairly solid winner, especially against the dual
matrix. On yours, the dual matrix convincingly edged it out, at both
sizes.

I guess the take-home message is that (a) you need to do benchmarks and
(b) you need to do them on the target machine. Not surprising, i guess,
but strikingly demonstrated here.

It follows that there's scope for performance-criticial libraries to
include multiple strategies, and then to select an optimal one at runtime
based on the characteristics of the host machine. There are some libraries
which already do this, and are some of the fastest libraries of their
kind:

http://math-atlas.sourceforge.net/
http://www.fftw.org/

tom

--
Tomorrow has made a phone call to today.

Tom Anderson 11-25-2008 07:15 PM

Re: Matrix performance question
 
On Tue, 25 Nov 2008, Graham Jones wrote:

> "Tom Anderson" <twic@urchin.earth.li> wrote in message
> news:Pine.LNX.4.64.0811242249300.28704@urchin.eart h.li...
>
>> The Sun JVM has, broadly speaking, two modes: client and server. Client
>> optimises for startup time and minimal memory consumption, and server
>> optimises for speed. The mode affects the compiler (the server compiler
>> is a *lot* better), and i think memory management. For tests like this,
>> you really want server mode.

>
> Yes indeed!
>
> ColumnMajorMatrix 0.3995627
> RowMajorMatrix 0.44905844
> ColumnMajorMatrix 0.39949283
> ParallelRowMajorMatrix 0.43311155
> ParallelColumnMajorMatrix 0.2588967
> AntiParallelRowMajorMatrix 0.2619739
> AntiParallelColumnMajorMatrix 0.6244125
> DualMatrix 0.31512663
> PessimalDualMatrix 0.53376275
> UnpackedColumnMajorMatrix 0.4364346
> UnpackedRowMajorMatrix 0.5116404
> ParallelUnpackedColumnMajorMatrix 0.25846648


Hurrah! And now my code's fastest again! :D

tom

--
The literature is filled with bizarre occurrances for which we have
no explanation


All times are GMT. The time now is 09:18 PM.

Powered by vBulletin®. Copyright ©2000 - 2013, vBulletin Solutions, Inc.
SEO by vBSEO ©2010, Crawlability, Inc.


1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57