Velocity Reviews > numpy/scipy: correlation

numpy/scipy: correlation

robert
Guest
Posts: n/a

 11-11-2006
Is there a ready made function in numpy/scipy to compute the correlation y=mx+o of an X and Y fast:
m, m-err, o, o-err, r-coef,r-coef-err ?

Or a formula to to compute the 3 error ranges?

-robert

PS:

numpy.corrcoef computes only the bare coeff:
>>> numpy.corrcoef((0,1,2,3.0),(2,5,6,7.0),)

array([[ 1. , 0.95618289],
[ 0.95618289, 1. ]])

with ints it goes computes wrong:
>>> numpy.corrcoef((0,1,2,3),(2,5,6,7),)

array([[ 1. , 0.94491118],
[ 0.94491118, 1. ]])

Robert Kern
Guest
Posts: n/a

 11-11-2006
robert wrote:
> Is there a ready made function in numpy/scipy to compute the correlation y=mx+o of an X and Y fast:
> m, m-err, o, o-err, r-coef,r-coef-err ?

numpy and scipy questions are best asked on their lists, not here. There are a
number of people who know the capabilities of numpy and scipy through and
through, but most of them don't hang out on comp.lang.python.

http://www.scipy.org/Mailing_Lists

scipy.optimize.leastsq() can be told to return the covariance matrix of the
estimated parameters (m and o in your example; I have no idea what you think
r-coeff is).

--
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless enigma
an underlying truth."
-- Umberto Eco

Robert Kern
Guest
Posts: n/a

 11-11-2006
Robert Kern wrote:
> robert wrote:
>> Is there a ready made function in numpy/scipy to compute the correlation y=mx+o of an X and Y fast:
>> m, m-err, o, o-err, r-coef,r-coef-err ?

> scipy.optimize.leastsq() can be told to return the covariance matrix of the
> estimated parameters (m and o in your example; I have no idea what you think
> r-coeff is).

Ah, the correlation coefficient itself. Since correlation coefficients are weird
beasts constrained to [-1, 1], standard gaussian errors like you are expecting
for m-err and o-err don't apply. No, there's currently no function in numpy or
scipy that will do something sophisticated enough to be reliable. Here's an option:

http://www.pubmedcentral.nih.gov/art...i?artid=155684

--
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless enigma
an underlying truth."
-- Umberto Eco

Robert Kern
Guest
Posts: n/a

 11-11-2006
robert wrote:
> Is there a ready made function in numpy/scipy to compute the correlation y=mx+o of an X and Y fast:
> m, m-err, o, o-err, r-coef,r-coef-err ?

And of course, those three parameters are not particularly meaningful together.
If your model is truly "y is a linear response given x with normal noise" then
"y=m*x+o" is correct, and all of the information that you can get from the data
will be found in the estimates of m and o and the covariance matrix of the
estimates.

On the other hand, if your model is that "(x, y) is distributed as a bivariate
normal distribution" then "y=m*x+o" is not a particularly good representation of
the model. You should instead estimate the mean vector and covariance matrix of
(x, y). Your correlation coefficient will be the off-diagonal term after
dividing out the marginal standard deviations.

The difference between the two models is that the first places no restrictions
on the distribution of x. The second does; both the x and y marginal
distributions need to be normal. Under the first model, the correlation
coefficient has no meaning.

--
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless enigma
an underlying truth."
-- Umberto Eco

robert
Guest
Posts: n/a

 11-12-2006
Robert Kern wrote:
> robert wrote:
>> Is there a ready made function in numpy/scipy to compute the correlation y=mx+o of an X and Y fast:
>> m, m-err, o, o-err, r-coef,r-coef-err ?

>
> And of course, those three parameters are not particularly meaningful together.
> If your model is truly "y is a linear response given x with normal noise" then
> "y=m*x+o" is correct, and all of the information that you can get from the data
> will be found in the estimates of m and o and the covariance matrix of the
> estimates.
>
> On the other hand, if your model is that "(x, y) is distributed as a bivariate
> normal distribution" then "y=m*x+o" is not a particularly good representation of
> the model. You should instead estimate the mean vector and covariance matrix of
> (x, y). Your correlation coefficient will be the off-diagonal term after
> dividing out the marginal standard deviations.
>
> The difference between the two models is that the first places no restrictions
> on the distribution of x. The second does; both the x and y marginal
> distributions need to be normal. Under the first model, the correlation
> coefficient has no meaning.

Think the difference is little in practice - when you head for usable diagonals.
Looking at the bivar. coef first before going on to any models, seems to be a more stable approach for the first step in data mining. ( before you proceed to a model or to class-learning .. )

Basically the first need is to analyse lots of x,y data and check for linear dependencies. No real model so far. I'd need a quality measure (coef**2) and to know how much I can rely on it (coef-err). coef alone is not enough. You get a perfect 1.0 with 2 ( or 3 - see below ) points.
With big coef's and lots of distributed data the coef is very good by itself - its error range err(N) only approx ~ 1/sqrt(N)

One would expect the error range to drop simply with # of points. Yet it depends more complexly on the mean value of the coef and on the distribution at all.
More interesting realworld cases: For example I see a lower correlation on lots of points - maybe coef=0.05 . Got it - or not? Thus lower coefs require naturally a coef-err to be useful in practice.

Now think of adding 'boring data':

>>> X=[1.,2,3,4]
>>> Y=[1.,2,3,5]
>>> sd.correlation((X,Y)) # my old func

(1.3, -0.5, 0.982707629824) # m,o,coef
>>> numpy.corrcoef((X,Y))

array([[ 1. , 0.98270763],
[ 0.98270763, 1. ]])
>>> XX=[1.,1,1,1,1,2,3,4]
>>> YY=[1.,1,1,1,1,2,3,5]
>>> sd.correlation((XX,YY))

(1.23684210526, -0.289473684211, 0.988433774639)
>>>

I'd expect: the little increase of r is ok. But this 'boring data' should not make the error to go down simply ~1/sqrt(N) ...

I remember once I saw somewhere a formula for an error range of the corrcoef. but cannot find it anymore.

http://en.wikipedia.org/wiki/Pearson...ficient#Trivia
says:
In MATLAB, corr(X) calculates Pearsons correlation coefficient along with p-value.

Does anybody know how this prob.-value is computed/motivated? Such thing would be very helpful for numpy/scipy too.

tells:

probable error of r = 0.6745*(1-r**2)/sqrt(N)

A simple function of r and N - quite what I expected above roughly for the N-only dep.. But thus it is not sensitive to above considerations about 'boring' data. With above example it would spit a decrease of this probable coef-err from

0.0115628571429 to 0.00548453410954 !

And the absolute size of this error measure seems to be too low for just 4 points of data!

The other formula which I remember seeing once was much more sophisticated and used things like sum_xxy etc...

Robert

PS:

my old func is simply hands-on based on
n,sum_x,sum_y,sum_xy,sum_xx,sum_yy=len(vx),vx.sum( ),vy.sum(),(vx*vy).sum(),(vx*vx).sum(),(vy*vy).sum ()
Guess its already fast for large data?

Note: numpy.corrcoef strikes on 2 points:
>>> numpy.corrcoef(([1,2],[1,2]))

array([[ -1.#IND, -1.#IND],
[ -1.#IND, -1.#IND]])
>>> sd.correlation(([1,2],[1,2]))

(1, 0, 1.0)
>>>
>>> numpy.corrcoef(([1,2,3],[1,2,3]))

array([[ 1., 1.],
[ 1., 1.]])
>>> sd.correlation(([1,2,3],[1,2,3]))

(1, 0, 1.0)

PPS:

A compatible scipy binary (0.5.2?) for numpy 1.0 was announced some weeks back. Think currently many users suffer when trying to get started with incompatible most-recent libs of scipy and numpy.

sturlamolden
Guest
Posts: n/a

 11-12-2006

Robert Kern wrote:

> The difference between the two models is that the first places no restrictions
> on the distribution of x. The second does; both the x and y marginal
> distributions need to be normal. Under the first model, the correlation
> coefficient has no meaning.

That is not correct. The correlation coefficient is meaningful in both
models, but must be interpreted differently. However, in both cases a
correlation coefficient of 1 or -1 indicates an exact linear
relationship between x and y.

Under the first model ("linear regression"), the squared correlation
coefficient is the "explained variance", i.e. the the proportion of y's
variance accounted for by the model y = m*x + o.

Under the second model ("multivariate normal distribution"), the
correlation coefficient is the covariance of y and x divided by the
product of the standard deviations, cov(x,y)/(std(x)*std(y)).

robert
Guest
Posts: n/a

 11-12-2006
robert wrote:
> Robert Kern wrote:
>
> tells:
> probable error of r = 0.6745*(1-r**2)/sqrt(N)
>
> A simple function of r and N - quite what I expected above roughly for
> the N-only dep.. But thus it is not sensitive to above considerations
> about 'boring' data. With above example it would spit a decrease of this
> probable coef-err from
> 0.0115628571429 to 0.00548453410954 !

This 1929 formula for estimating the error of correlation coefficient seems to make some sense for r=0 .
I do monte carlo on correlating random series:

>>> X=numpy.random.random(10000)
>>> l=[]
>>> for i in range(200):

.... Z=numpy.random.random(10000)
.... l.append( sd.correlation((X,Z))[2] ) #collect coef's
....
>>> mean(l)

0.000327657082234
>>> std(l)

0.0109120766158 # thats how the coef jitters
>>> std(l)/sqrt(len(l))

0.000771600337185
>>> len(l)

200

# now
# 0.6745*(1-r**2)/sqrt(N) = 0.0067440015079
# vs M.C. 0.0109120766158 ± 0.000771600337185

but the fancy factor of 0.6745 is significantly just fancy for r=0.

then for a higher (0.5) correlation:

>>> l=[]
>>> for i in range(200):

.... Z=numpy.random.random(10000)+array(range(10000))/10000.0
.... l.append( sd.correlation((X+array(range(10000))/10000.0,Z))[2] )
....
>>> mean(l)

0.498905642552
>>> std(l)

0.00546979583163
>>> std(l)/sqrt(len(l))

0.000386772972425

#now:
# 0.6745*(1-r**2)/sqrt(N) = 0.00512173224849)
# vs M.C. 0.00546979583163 ± 0.000386772972425

=> there the 0.6745 factor and (1-r**2) seem to get the main effect ! There is something in it.

--

>>> boring=ones(10001)*0.5
>>> X=numpy.random.random(10000)
>>> l=[]
>>> for i in range(200):

.... Z=concatenate((numpy.random.random(10000)+array(ra nge(10000))/10000.0,boring))
.... l.append( sd.correlation((concatenate((X+array(range(10000))/10000.0,boring)),Z))[2] )
....
>>> mean(l)

0.712753628489 # r
>>> std(l)

0.00316163649888 # r_err
>>> std(l)/sqrt(len(l))

0.0002235614608

# now:
# 0.6745*(1-r**2)/sqrt(N) = 0.00234459971461 #N=20000
# vs M.C. streuung 0.00316163649888 ± 0.0002235614608

=> the boring data has an effect on coef-err which is significantly not reflected by the formula 0.6745*(1-r**2)/sqrt(N)

=> I'll use this formula to get a downside error estimate for the correlation coefficient:

------------------------------------------
| r_err_down ~= 1.0 * (1-r**2)/sqrt(N) |
------------------------------------------

(until I find a better one respecting the actual distribution of data)

Would be interesting what MATLAB & Octave say ...

-robert

sturlamolden
Guest
Posts: n/a

 11-12-2006

First, are you talking about rounding error (due to floating point
arithmetics) or statistical sampling error?

If you are talking about the latter, I suggest you look it up in a
statistics text book. E.g. if x and y are normally distributed, then

t = r * sqrt( (n-2)/(1-r**2) )

has a Student t-distribution with n-2 degrees of freedom. And if you
don't know how to get the p-value from that, you should not be messing
with statistics anyway.

robert
Guest
Posts: n/a

 11-12-2006
sturlamolden wrote:
> First, are you talking about rounding error (due to floating point
> arithmetics) or statistical sampling error?

About measured data. rounding and sampling errors with special distrutions are neglegible. Thus by default assuming gaussian noise in x and y.
(This may explain that factor of ~0.7 in the rectangle M.C. test)
The (x,y) points may not distribute "nicely" along the assumed regression diagonale.

> If you are talking about the latter, I suggest you look it up in a
> statistics text book. E.g. if x and y are normally distributed, then
>
> t = r * sqrt( (n-2)/(1-r**2) )
>
> has a Student t-distribution with n-2 degrees of freedom. And if you
> don't know how to get the p-value from that, you should not be messing
> with statistics anyway.

yet too lazy/practical for digging these things from there. You obviously got it - out of that, what would be a final estimate for an error range of r (n big) ?
that same "const. * (1-r**2)/sqrt(n)" which I found in that other document ?

The const. ~1 is less the problem.

My main concern is, how to respect the fact, that the (x,y) points may not distribute well along the regression line. E.g. due to the nature of the experiment more points are around (0,0) but only few distribute along the interesting part of the diagonale and thus few points have great effect on m & r. above formulas will possibly not respect that.
I could try a weighting technique, but maybe there is a (commonly used) speedy formula for r/r_err respecting that directly?

Robert

Ramon Diaz-Uriarte
Guest
Posts: n/a

 11-12-2006
On 11/12/06, robert <(E-Mail Removed)> wrote:
> Robert Kern wrote:
> > robert wrote:

(...)
> One would expect the error range to drop simply with # of points. Yet it depends more complexly on the mean value of the coef and on the distribution at all.
> More interesting realworld cases: For example I see a lower correlation on lots of points - maybe coef=0.05 . Got it - or not? Thus lower coefs require naturally a coef-err to be useful in practice.
>

(...)

> I remember once I saw somewhere a formula for an error range of the corrcoef. but cannot find it anymore.
>

(...)
>
> Does anybody know how this prob.-value is computed/motivated? Such thing would be very helpful for numpy/scipy too.
>
>

There is a transformation of the correlation coefficient that is
distributed as a t-statistic under the null. This was derived a long
way back, and is the usual, standard, way to test for significance of
(attach a p-value to) correlation coefficients. I do not recall the
formula from top of my head, but I am sure any google search will find
it. And of course, any decent intro stats textbook will also provide
it. You can look in your nearby library for popular textbooks such as
Snedecor & Cochran, or Sokal & Rohlf, or the wonderful "Beyond Anova"
by Miller, which do have the expression.

Since this is such a common procedure all stats programs do provide
for tests of significance of correlation coefficients. A whole bunch
of them are free, and one is widely used: R (check
http://cran.r-project.org). This is an easy way for you to test the
output of your stats code in Python.

Now, since you can do a significance test, you can invert the
procedure and obtain a confidence interval (of whichever width you
want) for your correlation coefficients. This CI will (almost always)
be assymmetric. (And recall that CI do have a not-obvious
interpretation). CI of correlation coefficients are not as common as
p-values, but this is doable too.

The expression (and "improved versions" of it, I think Sokal & Rohlf
show several) is easy to compute. And I bet obtaining the density of a
t-distribution with k d.f. is also available already for Python. So
this part is definitely doable.

Best,

R.

(...)
> Now think of adding 'boring data':
>
> >>> X=[1.,2,3,4]
> >>> Y=[1.,2,3,5]
> >>> sd.correlation((X,Y)) # my old func

> (1.3, -0.5, 0.982707629824) # m,o,coef
> >>> numpy.corrcoef((X,Y))

> array([[ 1. , 0.98270763],
> [ 0.98270763, 1. ]])
> >>> XX=[1.,1,1,1,1,2,3,4]
> >>> YY=[1.,1,1,1,1,2,3,5]
> >>> sd.correlation((XX,YY))

> (1.23684210526, -0.289473684211, 0.988433774639)
> >>>

>
> I'd expect: the little increase of r is ok. But this 'boring data' should not make the error to go down simply ~1/sqrt(N) ...
>
> http://en.wikipedia.org/wiki/Pearson...ficient#Trivia
> says:
> In MATLAB, corr(X) calculates Pearsons correlation coefficient along with p-value.

> tells:
>
> probable error of r = 0.6745*(1-r**2)/sqrt(N)
>
> A simple function of r and N - quite what I expected above roughly for the N-only dep.. But thus it is not sensitive to above considerations about 'boring' data. With above example it would spit a decrease of this probable coef-err from
>
> 0.0115628571429 to 0.00548453410954 !
>
> And the absolute size of this error measure seems to be too low for just 4 points of data!
>
> The other formula which I remember seeing once was much more sophisticated and used things like sum_xxy etc...
>
>
> Robert
>
>
>
> PS:
>
> my old func is simply hands-on based on
> n,sum_x,sum_y,sum_xy,sum_xx,sum_yy=len(vx),vx.sum( ),vy.sum(),(vx*vy).sum(),(vx*vx).sum(),(vy*vy).sum ()
> Guess its already fast for large data?
>
>
> Note: numpy.corrcoef strikes on 2 points:
> >>> numpy.corrcoef(([1,2],[1,2]))

> array([[ -1.#IND, -1.#IND],
> [ -1.#IND, -1.#IND]])
> >>> sd.correlation(([1,2],[1,2]))

> (1, 0, 1.0)
> >>>
> >>> numpy.corrcoef(([1,2,3],[1,2,3]))

> array([[ 1., 1.],
> [ 1., 1.]])
> >>> sd.correlation(([1,2,3],[1,2,3]))

> (1, 0, 1.0)
>
>
> PPS:
>
> A compatible scipy binary (0.5.2?) for numpy 1.0 was announced some weeks back. Think currently many users suffer when trying to get started with incompatible most-recent libs of scipy and numpy.
> --
> http://mail.python.org/mailman/listinfo/python-list
>

--
Ramon Diaz-Uriarte
Statistical Computing Team
Structural Biology and Biocomputing Programme
Spanish National Cancer Centre (CNIO)
http://ligarto.org/rdiaz