Velocity Reviews > Java > Re: Matrix performance question

# Re: Matrix performance question

John B. Matthews
11-20-2008
In article <(E-Mail Removed)> ,
Patricia Shanahan <(E-Mail Removed)> wrote:

> As indicated in the "Request for profiler recommendation" thread, I have
> a program that is running inconveniently slowly. I applied one of the
> recommendations, and determined that the time is being spent doing a
> particular matrix operation, and in computing a Poisson distribution.
>
> This is a high level description of the matrix operation, not
> necessarily the current code.
>
> I have a matrix U, stored as a double[8][10,000], that represents part
> of the result of a truncated singular value decomposition. U has 8
> columns and about 10,000 rows, but I store it transposed to reduce the
> total number of arrays.
>
> At several points in my code, I take a vector X, one row of 10,000
> elements, and multiply it by U, producing a double[8] intermediate
> result. I then multiply the intermediate result by the transpose of U,
> producing another 10,000 element vector.
>
> I would like to do this faster, and have some ideas. To save time
> trying things out, I would welcome opinions on which of these are most
> likely to have as good performance as possible, as well as additional
> suggestions. I will then implement and measure only the most promising
> one or two ideas:
>
> 1. Throw it to Matlab. U resulted from a Matlab script, so I could keep
> the Matlab session open and continue operating using it. However, I
> would suffer the overhead of communication with Matlab for each vector.
>
> 2. Use Colt. Matlab outperformed Colt on the truncated singular value
> decomposition, but that may be because Colt did not have a truncated
> operation - I had to do the whole decomposition and truncate it myself.
>
> 3. Refactor to group multiple operations together. In some situations, I
> need to do the operation on several vectors. I could group those vectors
> into a matrix, and perhaps open up more optimization options. Does
> anyone know whether the -server JVM restructures nested loops for cache
> use optimization?
>
> Note that pre-computing U * U_transpose does not seem to be practical -
> the result would take 800 MB of memory.
>
> Any other ideas I should be considering? Any packages that would do this
> operation fast?
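For readers following along, the operation described above is easy to sketch in plain Java. This is a minimal illustrative sketch, not Patricia's actual code; the class and variable names are assumptions, with U stored transposed as double[k][n] exactly as she describes:

```java
public class Project {
    // U is stored transposed: u[c] holds column c of U, length n.
    // Computes y = U * (U^T x): a k-element intermediate, then back to n.
    static double[] project(double[][] u, double[] x) {
        int k = u.length, n = u[0].length;
        double[] t = new double[k];
        for (int c = 0; c < k; c++) {      // t = U^T x
            double s = 0.0;
            for (int i = 0; i < n; i++) s += u[c][i] * x[i];
            t[c] = s;
        }
        double[] y = new double[n];        // y = U t
        for (int c = 0; c < k; c++) {
            double tc = t[c];
            for (int i = 0; i < n; i++) y[i] += u[c][i] * tc;
        }
        return y;
    }

    public static void main(String[] args) {
        // 2x2 identity: projecting onto the full space returns x unchanged.
        double[] y = project(new double[][]{{1, 0}, {0, 1}},
                             new double[]{3, 4});
        System.out.println(java.util.Arrays.toString(y)); // [3.0, 4.0]
    }
}
```

Note the cost is only O(k*n) per vector (about 160,000 multiply-adds for k=8, n=10,000), which is why precomputing the full 10,000 x 10,000 product is unnecessary as well as impractical.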

I was pleasantly surprised by the speed of the JScience library,
possibly due to stack-based allocation afforded by Javolution, on which
JScience is based. If staying in Java, it may be worth comparing
JScience to Colt:

<http://jscience.org/>
<http://javolution.org/>

To compare Java to another high-level language, I implemented a
root-finding algorithm in Java & Ada:

From the command line, the Ada version appeared many times faster; but
after factoring out JVM startup and Ada runtime initialization, the
difference was less than a factor of three.

It's possible to mix Java & Ada, but you'd still need Matlab for
truncation:

--
John B. Matthews
trashgod at gmail dot com

Tom Anderson
11-21-2008
On Thu, 20 Nov 2008, Patricia Shanahan wrote:

> John B. Matthews wrote:
>> In article <(E-Mail Removed)> ,
>> Patricia Shanahan <(E-Mail Removed)> wrote:
>>
>>> I have a matrix U, stored as a double[8][10,000], that represents part
>>> of the result of a truncated singular value decomposition. U has 8
>>> columns and about 10,000 rows, but I store it transposed to reduce the
>>> total number of arrays.
>>>
>>> At several points in my code, I take a vector X, one row of 10,000
>>> elements, and multiply it by U, producing a double[8] intermediate
>>> result. I then multiply the intermediate result by the transpose of U,
>>> producing another 10,000 element vector.
>>>
>>> Any other ideas I should be considering? Any packages that would do this
>>> operation fast?

>>
>> To compare Java to another high-level language, I implemented a
>> root-finding algorithm in Java & Ada:
>>
>>
>> From the command line, the Ada version appeared many times faster; but
>> after factoring out JVM startup and Ada runtime initialization, the
>> difference was less than a factor of three.

>
> If I go with Ada for the critical code, it would have to be as a
> combined Java/Ada program. The problem code is part of 25,000 line
> program, most of which is most naturally expressed in terms of objects.

If you're willing to do a big chunk of work in native code, doesn't
Fortran become an appealing option? Or rather, doesn't using BLAS or
LAPACK through a minimal wrapper become attractive? BLAS will give you a
constant-factor speedup, i would guess; if LAPACK has routines that are
more specific to your needs, it might do even better.

As for algorithmic improvements to your app - pass! I'd try a numerics
newsgroup rather than cljp for that.
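That said, Patricia's option 3 (grouping several vectors into one operation) is at least easy to sketch. This is an illustrative sketch only, with assumed names and the same transposed double[k][n] storage described upthread; batching means each row of u is streamed through the cache once per batch instead of once per vector:

```java
public class BatchProject {
    // u is U stored transposed (u[c] = column c of U, length n);
    // xs holds m input vectors. Each u[c] is read once per batch
    // across all m vectors, improving cache reuse over one-at-a-time.
    static double[][] projectAll(double[][] u, double[][] xs) {
        int k = u.length, n = u[0].length, m = xs.length;
        double[][] t = new double[m][k];       // t[v] = U^T xs[v]
        for (int c = 0; c < k; c++) {
            double[] uc = u[c];
            for (int v = 0; v < m; v++) {
                double s = 0.0;
                double[] x = xs[v];
                for (int i = 0; i < n; i++) s += uc[i] * x[i];
                t[v][c] = s;
            }
        }
        double[][] ys = new double[m][n];      // ys[v] = U t[v]
        for (int c = 0; c < k; c++) {
            double[] uc = u[c];
            for (int v = 0; v < m; v++) {
                double tc = t[v][c];
                double[] y = ys[v];
                for (int i = 0; i < n; i++) y[i] += uc[i] * tc;
            }
        }
        return ys;
    }

    public static void main(String[] args) {
        double[][] u = {{1, 0}, {0, 1}};       // 2x2 identity
        double[][] ys = projectAll(u, new double[][]{{3, 4}, {5, 6}});
        System.out.println(java.util.Arrays.deepToString(ys));
    }
}
```

Whether the JIT restructures these loops itself I can't say; hoisting u[c] into a local, as above, at least removes the repeated outer-array dereference by hand.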

tom

--
a moratorium on the future

Tom Anderson
11-21-2008
On Thu, 20 Nov 2008, John B. Matthews wrote:

> I was pleasantly surprised by the speed of the JScience library,
> possibly due to stack-based allocation afforded by Javolution, on which
> JScience is based. If staying in Java, it may be worth comparing
> JScience to Colt:
>
> <http://jscience.org/>
> <http://javolution.org/>

I'd never come across javolution before - it's a pretty impressive
library. If it's as good as it claims, i'm surprised it's not more
well-known.

tom

--
a moratorium on the future

John B. Matthews
11-21-2008
In article <(E-Mail Removed) >,
Tom Anderson <(E-Mail Removed)> wrote:

> On Thu, 20 Nov 2008, John B. Matthews wrote:
>
> > I was pleasantly surprised by the speed of the JScience library,
> > possibly due to stack-based allocation afforded by Javolution, on which
> > JScience is based. If staying in Java, it may be worth comparing
> > JScience to Colt:
> >
> > <http://jscience.org/>
> > <http://javolution.org/>

>
> I'd never come across javolution before - it's a pretty impressive
> library. If it's as good as it claims, i'm surprised it's not more
> well-known.

Difficult question. I suppose there is the hope that one or more
Javolution features will be obviated in a future release of the
standard platform, e.g.:

"It should be noted that future versions of the JVM may provide some
limited support for stack allocation through escape analysis. Users can
always turn-off stack allocation to revert to standard heap allocation."

<http://javolution.org/api/javolution/context/StackContext.html>

--
John B. Matthews
trashgod at gmail dot com

John B. Matthews
11-21-2008
In article <(E-Mail Removed) >,
Tom Anderson <(E-Mail Removed)> wrote:

> On Thu, 20 Nov 2008, Patricia Shanahan wrote:
>
> > John B. Matthews wrote:
> >> In article <(E-Mail Removed)> ,
> >> Patricia Shanahan <(E-Mail Removed)> wrote:
> >>
> >>> I have a matrix U, stored as a double[8][10,000], that represents part
> >>> of the result of a truncated singular value decomposition. U has 8
> >>> columns and about 10,000 rows, but I store it transposed to reduce the
> >>> total number of arrays.
> >>>
> >>> At several points in my code, I take a vector X, one row of 10,000
> >>> elements, and multiply it by U, producing a double[8] intermediate
> >>> result. I then multiply the intermediate result by the transpose of U,
> >>> producing another 10,000 element vector.
> >>>
> >>> Any other ideas I should be considering? Any packages that would do this
> >>> operation fast?
> >>
> >> To compare Java to another high-level language, I implemented a
> >> root-finding algorithm in Java & Ada:
> >>
> >>
> >> From the command line, the Ada version appeared many times faster; but
> >> after factoring out JVM startup and Ada runtime initialization, the
> >> difference was less than a factor of three.

> >
> > If I go with Ada for the critical code, it would have to be as a
> > combined Java/Ada program. The problem code is part of 25,000 line
> > program, most of which is most naturally expressed in terms of objects.

>
> If you're willing to do a big chunk of work in native code, doesn't
> Fortran become an appealing option? Or rather, doesn't using BLAS or
> LAPACK through a minimal wrapper become attractive? BLAS will give you a
> constant-factor speedup, i would guess; if LAPACK has routines that are
> more specific to your needs, it might do even better.

This makes sense. Indeed, one widely available Ada compiler (GNAT)
typically implements the numeric parts of the language's standard
library using BLAS. I guess in this use, Ada would be a kind of

--
John B. Matthews
trashgod at gmail dot com

Stefan Ram
11-21-2008
"John B. Matthews" <(E-Mail Removed)> writes:
>I was pleasantly surprised by the speed of the JScience library,
>possibly due to stack-based allocation afforded by Javolution, on which
>JScience is based. If staying in Java, it may be worth comparing
>JScience to Colt:
><http://jscience.org/>
><http://javolution.org/>

»Javolution is a pure Java Solution«

http://javolution.org/

Does this mean they implement stack-based allocation
with Java source code only?

If so, how do they do this?

Tom Anderson
11-21-2008
On Fri, 21 Nov 2008, Stefan Ram wrote:

> "John B. Matthews" <(E-Mail Removed)> writes:
>> I was pleasantly surprised by the speed of the JScience library,
>> possibly due to stack-based allocation afforded by Javolution, on which
>> JScience is based. If staying in Java, it may be worth comparing
>> JScience to Colt:
>> <http://jscience.org/>
>> <http://javolution.org/>

>
> »Javolution is a pure Java Solution«
>
> http://javolution.org/
>
> Does this mean they implement stack-based allocation
> with Java source code only?
>
> If so, how do they do this?

"using thread-local pools or RTSJ ScopedMemory" [1]

Although:

"the default implementation [...] uses thread-local pools. RTSJ
alternative implementations could use ScopedMemory for their stack
allocations. "

So it sounds like it's done with 'thread-local pools'. I have a hard time
believing this would actually be faster than normal allocation in a modern
VM. I'd be interested to see a performance comparison, and to read the
source for this class (something i haven't got time to do right now).
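For the curious, the general technique of a thread-local pool (not Javolution's actual code, which I haven't read either) might look something like this in plain Java: recycle per-thread buffers instead of allocating fresh arrays each time.

```java
import java.util.ArrayDeque;
import java.util.Arrays;

// Illustrative sketch of a thread-local buffer pool: each thread reuses
// its own stack of double[] buffers rather than allocating new arrays.
// This shows the general idea only, not Javolution's implementation.
public class BufferPool {
    private static final int SIZE = 10_000;
    private static final ThreadLocal<ArrayDeque<double[]>> POOL =
            ThreadLocal.withInitial(ArrayDeque::new);

    static double[] acquire() {
        double[] b = POOL.get().poll();        // reuse if one is pooled
        return (b != null) ? b : new double[SIZE];
    }

    static void release(double[] b) {
        Arrays.fill(b, 0.0);                   // clear before recycling
        POOL.get().push(b);
    }

    public static void main(String[] args) {
        double[] a = acquire();
        release(a);
        System.out.println(acquire() == a);    // same buffer comes back
    }
}
```

Whether this beats the JVM's bump-pointer allocation of short-lived arrays is exactly the question: pooling mainly pays off when buffers are large or when allocation pressure is causing GC pauses, not for small short-lived objects.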

tom

[1] RTSJ = Real-Time Specification for Java, a fairly scary add-on
package, with classes for doing all sorts of hairy and unusual things.

--
A paranoid-schizophrenic is a guy who just found out what's going on. --
William S. Burroughs

Stefan Ram
11-21-2008
Tom Anderson <(E-Mail Removed)> writes:
>[1] RTSJ = Real-Time Specification for Java, a fairly scary add-on
>package, with classes for doing all sorts of hairy and unusual things.

If RTSJ is written in pure Java source code,
then my question of how it is done pertains to RTSJ.

If RTSJ is /not/ written in pure Java source code,
then I can understand how it is done (by some magic
in native methods or JVM extensions). But if Javolution
depends on such magic, which is not part of J2SE,
I would not call it »pure Java«.

Tom Anderson
11-21-2008
On Fri, 21 Nov 2008, John B. Matthews wrote:

> In article <(E-Mail Removed) >,
> Tom Anderson <(E-Mail Removed)> wrote:
>
>> On Thu, 20 Nov 2008, John B. Matthews wrote:
>>
>>> I was pleasantly surprised by the speed of the JScience library,
>>> possibly due to stack-based allocation afforded by Javolution, on which
>>> JScience is based. If staying in Java, it may be worth comparing
>>> JScience to Colt:
>>>
>>> <http://jscience.org/>
>>> <http://javolution.org/>

>>
>> I'd never come across javolution before - it's a pretty impressive
>> library. If it's as good as it claims, i'm surprised it's not more
>> well-known.

>
> Difficult question. I suppose there is the hope that one or more
> Javolution features will be obviated in a future release of the
> standard platform, e.g.:
>
> "It should be noted that future versions of the JVM may provide some
> limited support for stack allocation through escape analysis. Users can
> always turn-off stack allocation to revert to standard heap allocation."
>
> <http://javolution.org/api/javolution/context/StackContext.html>

Sun's JVM, as of 5 or 6, certainly does some stack allocation via escape
analysis.

I was more thinking of Javolution's various realtime-friendly classes, and
fast XML serialisation, which seem pretty cool.

tom

--
A paranoid-schizophrenic is a guy who just found out what's going on. --
William S. Burroughs

John B. Matthews
11-21-2008
In article <(E-Mail Removed)-berlin.de>,
(E-Mail Removed)-berlin.de (Stefan Ram) wrote:

> "John B. Matthews" <(E-Mail Removed)> writes:
> >I was pleasantly surprised by the speed of the JScience library,
> >possibly due to stack-based allocation afforded by Javolution, on which
> >JScience is based. If staying in Java, it may be worth comparing
> >JScience to Colt:
> ><http://jscience.org/>
> ><http://javolution.org/>

>
> »Javolution is a pure Java Solution«
>
> http://javolution.org/
>
> Does this mean they implement stack-based allocation
> with Java source code only?
>
> If so, how do they do this?

Penetrating question. IIUC, they don't. Apparently, they use a
configurable stack context, implemented using thread-local pools:

<http://javolution.org/api/javolution/context/StackContext.html>

As Patricia noted, her problem has low allocation overhead. In my
tinkering with Polynomial<Complex> and Rational<LargeInteger>, there may
be more benefit, although I have no definitive results.

--
John B. Matthews
trashgod at gmail dot com