Re: Matrix performance question

 
 
John B. Matthews
11-20-2008
In article <(E-Mail Removed)>,
Patricia Shanahan <(E-Mail Removed)> wrote:

> As indicated in the "Request for profiler recommendation" thread, I have
> a program that is running inconveniently slowly. I applied one of the
> recommendations, and determined that the time is being spent doing a
> particular matrix operation, and in computing a Poisson distribution
> probability. This article is about the matrix part of the problem.
>
> This is a high level description of the matrix operation, not
> necessarily the current code.
>
> I have a matrix U, stored as a double[8][10,000], that represents part
> of the result of a truncated singular value decomposition. U has 8
> columns and about 10,000 rows, but I store it transposed to reduce the
> total number of arrays.
>
> At several points in my code, I take a vector X, one row of 10,000
> elements, and multiply it by U, producing a double[8] intermediate
> result. I then multiply the intermediate result by the transpose of U,
> producing another 10,000 element vector.
>
> I would like to do this faster, and have some ideas. To save time
> trying things out, I would welcome opinions on which of these are most
> likely to have as good performance as possible, as well as additional
> suggestions. I will then implement and measure only the most promising
> one or two ideas:
>
> 1. Throw it to Matlab. U resulted from a Matlab script, so I could keep
> the Matlab session open and continue operating using it. However, I
> would suffer the overhead of communication with Matlab for each vector.
>
> 2. Use Colt. Matlab outperformed Colt on the truncated singular value
> decomposition, but that may be because Colt did not have a truncated
> operation - I had to do the whole decomposition and truncate it myself.
>
> 3. Refactor to group multiple operations together. In some situations, I
> need to do the operation on several vectors. I could group those vectors
> into a matrix, and perhaps open up more optimization options. Does
> anyone know whether the -server JVM restructures nested loops for cache
> use optimization?
>
> Note that pre-computing U * U_transpose does not seem to be practical -
> the result would take 800 MB of memory.
>
> Any other ideas I should be considering? Any packages that would do this
> operation fast?
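
For concreteness, the whole operation is y = U * (U^T * x): two passes
over the same 8 x 10,000 array. A minimal sketch in plain Java - the
names are my own choosing, not Patricia's actual code:

// U is 10,000 x 8 but stored transposed: ut[k][i] is row i, column k.
// Computes y = U * (U^T * x) in two passes over ut.
static double[] project(double[][] ut, double[] x) {
    int k = ut.length;      // 8
    int n = ut[0].length;   // ~10,000
    double[] mid = new double[k];
    for (int j = 0; j < k; j++) {       // mid = U^T * x
        double[] row = ut[j];
        double sum = 0.0;
        for (int i = 0; i < n; i++) {
            sum += row[i] * x[i];
        }
        mid[j] = sum;
    }
    double[] y = new double[n];
    for (int j = 0; j < k; j++) {       // y = U * mid
        double[] row = ut[j];
        double c = mid[j];
        for (int i = 0; i < n; i++) {
            y[i] += c * row[i];
        }
    }
    return y;
}

Note that this touches each element of ut exactly twice, so the loop is
memory-bound; any speedup will come from better locality or batching,
not from fewer flops.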


I was pleasantly surprised by the speed of the JScience library,
possibly due to stack-based allocation afforded by Javolution, on which
JScience is based. If staying in Java, it may be worth comparing
JScience to Colt:

<http://jscience.org/>
<http://javolution.org/>

To compare Java to another high-level language, I implemented a
root-finding algorithm in Java & Ada:

<http://sites.google.com/site/drjohnbmatthews/polyroots>
<http://home.roadrunner.com/~jbmatthews/misc/groots.html>

From the command line, the Ada version appeared many times faster; but
after factoring out JVM startup and Ada runtime initialization, the
difference was less than a factor of three.

It's possible to mix Java & Ada, but you'd still need Matlab for
truncation:

<http://code.google.com/p/javada/>
<http://www.adacore.com/2008/06/17/ada-java_interfacing_suite/>

--
John B. Matthews
trashgod at gmail dot com
http://home.roadrunner.com/~jbmatthews/
 
Tom Anderson
11-21-2008
On Thu, 20 Nov 2008, Patricia Shanahan wrote:

> John B. Matthews wrote:
>> In article <(E-Mail Removed)>,
>> Patricia Shanahan <(E-Mail Removed)> wrote:
>>
>>> I have a matrix U, stored as a double[8][10,000], that represents part
>>> of the result of a truncated singular value decomposition. U has 8
>>> columns and about 10,000 rows, but I store it transposed to reduce the
>>> total number of arrays.
>>>
>>> At several points in my code, I take a vector X, one row of 10,000
>>> elements, and multiply it by U, producing a double[8] intermediate
>>> result. I then multiply the intermediate result by the transpose of U,
>>> producing another 10,000 element vector.
>>>
>>> Any other ideas I should be considering? Any packages that would do this
>>> operation fast?

>>
>> To compare Java to another high-level language, I implemented a
>> root-finding algorithm in Java & Ada:
>>
>> <http://sites.google.com/site/drjohnbmatthews/polyroots>
>> <http://home.roadrunner.com/~jbmatthews/misc/groots.html>
>>
> From the command line, the Ada version appeared many times faster; but
> after factoring out JVM startup and Ada runtime initialization, the
> difference was less than a factor of three.

>
> If I go with Ada for the critical code, it would have to be as a
> combined Java/Ada program. The problem code is part of a 25,000-line
> program, most of which is most naturally expressed in terms of objects.


If you're willing to do a big chunk of work in native code, doesn't
Fortran become an appealing option? Or rather, doesn't using BLAS or
LAPACK through a minimal wrapper become attractive? BLAS will give you a
constant-factor speedup, I would guess; if LAPACK has routines that are more
specific to your needs, it might do even better.
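
To make the shapes concrete: batching several input vectors (Patricia's
option 3) turns the two matrix-vector products into two small
matrix-matrix products, which is exactly the access pattern a DGEMM
call is tuned for. A rough sketch of the batched loops in plain Java,
with hypothetical names:

// Batched form: for m row vectors xs[m][n], computes ys = (X * U) * U^T.
// ut is U stored transposed, ut[8][n]. Illustrative only.
static double[][] projectAll(double[][] ut, double[][] xs) {
    int k = ut.length;     // 8
    int n = ut[0].length;  // ~10,000
    int m = xs.length;     // batch size
    double[][] mid = new double[m][k];   // X * U, m x 8
    for (int r = 0; r < m; r++) {
        for (int j = 0; j < k; j++) {
            double sum = 0.0;
            for (int i = 0; i < n; i++) {
                sum += xs[r][i] * ut[j][i];
            }
            mid[r][j] = sum;
        }
    }
    double[][] ys = new double[m][n];    // (X * U) * U^T, m x n
    for (int r = 0; r < m; r++) {
        for (int j = 0; j < k; j++) {
            double c = mid[r][j];
            double[] row = ut[j];
            for (int i = 0; i < n; i++) {
                ys[r][i] += c * row[i];
            }
        }
    }
    return ys;
}

Each row of ut is now reused m times per pass, which is where the cache
win comes from, whether the loops run in Java or in a native BLAS.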

As for algorithmic improvements to your app - pass! I'd try a numerics
newsgroup rather than cljp (comp.lang.java.programmer) for that.

tom

--
a moratorium on the future
 
Tom Anderson
11-21-2008
On Thu, 20 Nov 2008, John B. Matthews wrote:

> I was pleasantly surprised by the speed of the JScience library,
> possibly due to stack-based allocation afforded by Javolution, on which
> JScience is based. If staying in Java, it may be worth comparing
> JScience to Colt:
>
> <http://jscience.org/>
> <http://javolution.org/>


I'd never come across Javolution before - it's a pretty impressive
library. If it's as good as it claims, I'm surprised it's not better
known.

tom

--
a moratorium on the future
 
John B. Matthews
11-21-2008
In article <(E-Mail Removed)>,
Tom Anderson <(E-Mail Removed)> wrote:

> On Thu, 20 Nov 2008, John B. Matthews wrote:
>
> > I was pleasantly surprised by the speed of the JScience library,
> > possibly due to stack-based allocation afforded by Javolution, on which
> > JScience is based. If staying in Java, it may be worth comparing
> > JScience to Colt:
> >
> > <http://jscience.org/>
> > <http://javolution.org/>

>
> I'd never come across Javolution before - it's a pretty impressive
> library. If it's as good as it claims, I'm surprised it's not better
> known.


Difficult question. I suppose there is the hope that one or more
Javolution features will be obviated in a future release of the
standard platform, e.g.:

"It should be noted that future versions of the JVM may provide some
limited support for stack allocation through escape analysis. Users can
always turn-off stack allocation to revert to standard heap allocation."

<http://javolution.org/api/javolution/context/StackContext.html>

--
John B. Matthews
trashgod at gmail dot com
http://home.roadrunner.com/~jbmatthews/
 
John B. Matthews
11-21-2008
In article <(E-Mail Removed)>,
Tom Anderson <(E-Mail Removed)> wrote:

> On Thu, 20 Nov 2008, Patricia Shanahan wrote:
>
> > John B. Matthews wrote:
> >> In article <(E-Mail Removed)>,
> >> Patricia Shanahan <(E-Mail Removed)> wrote:
> >>
> >>> I have a matrix U, stored as a double[8][10,000], that represents part
> >>> of the result of a truncated singular value decomposition. U has 8
> >>> columns and about 10,000 rows, but I store it transposed to reduce the
> >>> total number of arrays.
> >>>
> >>> At several points in my code, I take a vector X, one row of 10,000
> >>> elements, and multiply it by U, producing a double[8] intermediate
> >>> result. I then multiply the intermediate result by the transpose of U,
> >>> producing another 10,000 element vector.
> >>>
> >>> Any other ideas I should be considering? Any packages that would do this
> >>> operation fast?
> >>
> >> To compare Java to another high-level language, I implemented a
> >> root-finding algorithm in Java & Ada:
> >>
> >> <http://sites.google.com/site/drjohnbmatthews/polyroots>
> >> <http://home.roadrunner.com/~jbmatthews/misc/groots.html>
> >>
> >> From the command line, the Ada version appeared many times faster; but
> >> after factoring out JVM startup and Ada runtime initialization, the
> >> difference was less than a factor of three.

> >
> > If I go with Ada for the critical code, it would have to be as a
> > combined Java/Ada program. The problem code is part of a 25,000-line
> > program, most of which is most naturally expressed in terms of objects.

>
> If you're willing to do a big chunk of work in native code, doesn't
> Fortran become an appealing option? Or rather, doesn't using BLAS or
> LAPACK through a minimal wrapper become attractive? BLAS will give you a
> constant-factor speedup, I would guess; if LAPACK has routines that are more
> specific to your needs, it might do even better.


This makes sense. Indeed, one widely available Ada compiler (GNAT)
typically implements the numeric parts of the language's standard
library using BLAS. I guess in this use, Ada would be a kind of
wrapper, albeit one with nice operator overloading.

--
John B. Matthews
trashgod at gmail dot com
http://home.roadrunner.com/~jbmatthews/
 
Stefan Ram
11-21-2008
"John B. Matthews" <(E-Mail Removed)> writes:
>I was pleasantly surprised by the speed of the JScience library,
>possibly due to stack-based allocation afforded by Javolution, on which
>JScience is based. If staying in Java, it may be worth comparing
>JScience to Colt:
><http://jscience.org/>
><http://javolution.org/>


»Javolution is a pure Java Solution«

http://javolution.org/

Does this mean they implement stack-based allocation
with Java source code only?

If so, how do they do this?

 
Tom Anderson
11-21-2008
On Fri, 21 Nov 2008, Stefan Ram wrote:

> "John B. Matthews" <(E-Mail Removed)> writes:
>> I was pleasantly surprised by the speed of the JScience library,
>> possibly due to stack-based allocation afforded by Javolution, on which
>> JScience is based. If staying in Java, it may be worth comparing
>> JScience to Colt:
>> <http://jscience.org/>
>> <http://javolution.org/>

>
> »Javolution is a pure Java Solution«
>
> http://javolution.org/
>
> Does this mean they implement stack-based allocation
> with Java source code only?
>
> If so, how do they do this?


"using thread-local pools or RTSJ ScopedMemory" [1]

Although:

"the default implementation [...] uses thread-local pools. RTSJ
alternative implementations could use ScopedMemory for their stack
allocations. "

So it sounds like it's done with 'thread-local pools'. I have a hard time
believing this would actually be faster than normal allocation in a modern
VM. I'd be interested to see a performance comparison, and to read the
source for this class (something I haven't got time to do right now).
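
For reference, the general shape of a thread-local pool is simple
enough - a toy sketch of the technique only, emphatically not
Javolution's StackContext code:

import java.util.ArrayDeque;

// Each thread keeps its own free list of scratch buffers, so acquire
// and release need no synchronization; the point is to recycle large,
// short-lived arrays instead of making the GC reclaim them.
final class BufferPool {
    private static final int SIZE = 10000;
    private static final ThreadLocal<ArrayDeque<double[]>> POOL =
        new ThreadLocal<ArrayDeque<double[]>>() {
            @Override protected ArrayDeque<double[]> initialValue() {
                return new ArrayDeque<double[]>();
            }
        };

    // Borrow a buffer, allocating only when this thread's pool is empty.
    static double[] acquire() {
        double[] buf = POOL.get().pollFirst();
        return (buf != null) ? buf : new double[SIZE];
    }

    // Hand the buffer back for reuse by the same thread.
    static void release(double[] buf) {
        POOL.get().addFirst(buf);
    }
}

Whether that beats a modern generational collector is exactly the
question - for small objects I'd bet on the GC; for 10,000-element
double arrays, recycling might still pay.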

tom

[1] RTSJ = Real-Time Specification for Java, a fairly scary add-on
package, with classes for doing all sorts of hairy and unusual things.

--
A paranoid-schizophrenic is a guy who just found out what's going on. --
William S. Burroughs
 
Stefan Ram
11-21-2008
Tom Anderson <(E-Mail Removed)> writes:
>[1] RTSJ = Real-Time Specification for Java, a fairly scary add-on
>package, with classes for doing all sorts of hairy and unusual things.


If RTSJ is written in pure Java source code,
then my question about how it is done pertains to RTSJ.

If RTSJ is /not/ written in pure Java source code,
then I can understand how it is done (by some magic
in native methods or JVM extensions). But if Javolution
depends on such magic, which is not part of J2SE,
I would not call it »pure Java«.

 
Tom Anderson
11-21-2008
On Fri, 21 Nov 2008, John B. Matthews wrote:

> In article <(E-Mail Removed)>,
> Tom Anderson <(E-Mail Removed)> wrote:
>
>> On Thu, 20 Nov 2008, John B. Matthews wrote:
>>
>>> I was pleasantly surprised by the speed of the JScience library,
>>> possibly due to stack-based allocation afforded by Javolution, on which
>>> JScience is based. If staying in Java, it may be worth comparing
>>> JScience to Colt:
>>>
>>> <http://jscience.org/>
>>> <http://javolution.org/>

>>
>> I'd never come across Javolution before - it's a pretty impressive
>> library. If it's as good as it claims, I'm surprised it's not better
>> known.

>
> Difficult question. I suppose there is the hope that one or more
> Javolution features will be obviated in a future release of the
> standard platform, e.g.:
>
> "It should be noted that future versions of the JVM may provide some
> limited support for stack allocation through escape analysis. Users can
> always turn-off stack allocation to revert to standard heap allocation."
>
> <http://javolution.org/api/javolution/context/StackContext.html>


Sun's JVM, as of 5 or 6, certainly does some stack allocation via escape
analysis.

I was thinking more of Javolution's various realtime-friendly classes and
fast XML serialisation, which seem pretty cool.

tom

--
A paranoid-schizophrenic is a guy who just found out what's going on. --
William S. Burroughs
 
John B. Matthews
11-21-2008
In article <(E-Mail Removed)-berlin.de>,
(E-Mail Removed)-berlin.de (Stefan Ram) wrote:

> "John B. Matthews" <(E-Mail Removed)> writes:
> >I was pleasantly surprised by the speed of the JScience library,
> >possibly due to stack-based allocation afforded by Javolution, on which
> >JScience is based. If staying in Java, it may be worth comparing
> >JScience to Colt:
> ><http://jscience.org/>
> ><http://javolution.org/>

>
> »Javolution is a pure Java Solution«
>
> http://javolution.org/
>
> Does this mean they implement stack-based allocation
> with Java source code only?
>
> If so, how do they do this?


Penetrating question. IIUC, they don't. Apparently, they use a
configurable stack context, implemented using thread-local pools:

<http://javolution.org/api/javolution/context/StackContext.html>

As Patricia has noted, her problem has low allocation overhead. In my
tinkering with Polynomial<Complex> and Rational<LargeInteger>, there may
be more benefit, although I have no definitive results.

--
John B. Matthews
trashgod at gmail dot com
http://home.roadrunner.com/~jbmatthews/
 